DeNSe parser in Dependency Parsing as Head Selection (EACL 2017) https://arxiv.org/abs/1606.01280
Conventional graph-based dependency parsers guarantee a tree structure both during training and inference. Instead, we formalize dependency parsing as the problem of independently selecting the head of each word in a sentence. Our model which we call DeNSe (as shorthand for Dependency Neural Selection) produces a distribution over possible heads for each word using features obtained from a bidirectional recurrent neural network. Without enforcing structural constraints during training, DeNSe generates (at inference time) trees for the overwhelming majority of sentences, while non-tree outputs can be adjusted with a maximum spanning tree algorithm. We evaluate DeNSe on four languages (English, Chinese, Czech, and German) with varying degrees of non-projectivity. Despite the simplicity of the approach, our parsers are on par with the state of the art.READ FULL TEXT VIEW PDF
We reduce phrase-representation parsing to dependency parsing. Our reduc...
Chinese word segmentation and dependency parsing are two fundamental tas...
Modern graph-based syntactic dependency parsers operate by predicting, f...
In this work, we address the problem to model all the nodes (words or
The connection between the maximum spanning tree in a directed graph and...
The connection between dependency trees and spanning trees is exploited ...
Higher-order methods for dependency parsing can partially but not fully
DeNSe parser in Dependency Parsing as Head Selection (EACL 2017) https://arxiv.org/abs/1606.01280
Dependency parsing plays an important role in many natural language applications, such as relation extraction [Fundel et al.2007], machine translation [Carreras and Collins2009], language modeling [Chelba et al.1997, Zhang et al.2016] and ontology construction [Snow et al.2005]. Dependency parsers represent syntactic information as a set of head-dependent relational arcs, typically constrained to form a tree. Practically all models proposed for dependency parsing in recent years can be described as graph-based [McDonald et al.2005a] or transition-based [Yamada and Matsumoto2003, Nivre et al.2006b].
Graph-based dependency parsers are typically arc-factored, where the score of a tree is defined as the sum of the scores of all its arcs. An arc is scored with a set of local features and a linear model, the parameters of which can be effectively learned with online algorithms [Crammer and Singer2001, Crammer and Singer2003, Freund and Schapire1999, Collins2002]. In order to efficiently find the best scoring tree during training and decoding, various maximization algorithms have been developed [Eisner1996, Eisner2000, McDonald et al.2005b]
. In general, graph-based methods are optimized globally, using features of single arcs in order to make the learning and inference tractable. Transition-based algorithms factorize a tree into a set of parsing actions. At each transition state, the parser scores a candidate action conditioned on the state of the transition system and the parsing history, and greedily selects the highest-scoring action to execute. This score is typically obtained with a classifier based on non-local features defined over a rich history of parsing decisions[Yamada and Matsumoto2003, Zhang and Nivre2011].
, rely on extensive feature engineering. Feature templates are typically manually designed and aim at capturing head-dependent relationships which are notoriously sparse and difficult to estimate. More recently, a few approaches[Chen and Manning2014, Lei et al.2014a, Kiperwasser and Goldberg2016] apply neural networks for learning dense feature representations. The learned features are subsequently used in a conventional graph- or transition-based parser, or better designed variants [Dyer et al.2015].
In this work, we propose a simple neural network-based model which learns to select the head for each word in a sentence without enforcing tree structured output. Our model which we call DeNSe (as shorthand for Dependency Neural Selection) employs bidirectional recurrent neural networks to learn feature representations for words in a sentence. These features are subsequently used to predict the head of each word. Although there is nothing inherent in the model to enforce tree-structured output, when tested on an English dataset, it is able to generate trees for 95% of the sentences, 87% of which are projective. The remaining non-tree (or non-projective) outputs are post-processed with the Chu-Liu-Edmond (or Eisner) algorithm. DeNSe uses the head selection procedure to estimate arc weights during training. During testing, it essentially reduces to a standard graph-based parser when it fails to produce tree (or projective) output.
We evaluate our model on benchmark dependency parsing corpora, representing four languages (English, Chinese, Czech, and German) with varying degrees of non-projectivity. Despite the simplicity of our approach, experiments show that the resulting parsers are on par with the state of the art.
Graph-based dependency parsers employ a model for scoring possible dependency graphs for a given sentence. The graphs are typically factored into their component arcs and the score of a tree is defined as the sum of its arcs. This factorization enables tractable search for the highest scoring graph structure which is commonly formulated as the search for the maximum spanning tree (MST). The Chu-Liu-Edmonds algorithm [Chu and Liu1965, Edmonds1967, McDonald et al.2005b] is often used to extract the MST in the case of non-projective trees, and the Eisner algorithm [Eisner1996, Eisner2000] in the case of projective trees. During training, weight parameters of the scoring function can be learned with margin-based algorithms [Crammer and Singer2001, Crammer and Singer2003]
or the structured perceptron[Freund and Schapire1999, Collins2002]. Beyond basic first-order models, the literature offers a few examples of higher-order models involving sibling and grand parent relations [Carreras2007, Koo et al.2010, Zhang and McDonald2012]. Although more expressive, such models render both training and inference more challenging.
As the term implies, transition-based parsers conceptualize the process of transforming a sentence into a dependency tree as a sequence of transitions. A transition system typically includes a stack for storing partially processed tokens, a buffer containing the remaining input, and a set of arcs containing all dependencies between tokens that have been added so far [Nivre2003, Nivre et al.2006b]. A dependency tree is constructed by manipulating the stack and buffer, and appending arcs with predetermined operations. Most popular parsers employ an arc-standard [Yamada and Matsumoto2003, Nivre2004] or arc-eager transition system [Nivre2008]. Extensions of the latter include the use of non-local training methods to avoid greedy error propagation [Zhang and Clark2008, Huang and Sagae2010, Zhang and Nivre2011, Goldberg and Nivre2012].
. Recent work uses neural networks in lieu of the linear classifiers typically employed in conventional transition- or graph-based dependency parsers. For example, chen2014fast use a feed forward neural network to learn features for a transition-based parser, whereas pei2015effective do the same for a graph-based parser. lei2014low apply tensor decomposition to obtain word embeddings in their syntactic roles, which they subsequently use in a graph-based parser. dyer2015transition redesign components of a transition-based system where the buffer, stack, and action sequences are modeled separately with stack long short-term memory networks. The hidden states of these LSTMs are concatenated and used as features to a final transition classifier. DBLP:journals/corr/KiperwasserG16a use bidirectional LSTMs to extract features for a transition- and graph-based parser, whereas cross-huang:2016 build a greedy arc-standard parser using similar features.
In our work, we formalize dependency parsing as the task of finding for each word in a sentence its most probable head. Both head selection and the features it is based on are learned using neural networks. The idea of modeling child-parent relations independently dates back to hall:2007 who use an edge-factored model to generate-best parse trees which are subsequently reranked using a model based on rich global features. Later smith2010efficient show that a head selection variant of their loopy belief propagation parser performs worse than a model which incorporates tree structure constraints. Our parser is conceptually simpler: we rely on head selection to do most of the work and decode the best tree directly without using a reranker. In common with recent neural network-based dependency parsers, we aim to alleviate the need for hand-crafting feature combinations. Beyond feature learning, we further show that it is possible to simplify the training of a graph-based dependency parser in the context of bidirectional recurrent neural networks.
In this section we present our parsing model, DeNSe, which tries to predict the head of each word in a sentence. Specifically, the model takes as input a sentence of length and outputs head, dependent arcs. We describe the model focusing on unlabeled dependencies and then discuss how it can be straightforwardly extended to the labeled setting. We begin by explaining how words are represented in our model and then give details on how DeNSe makes predictions based on these learned representations. Since there is no guarantee that the outputs of DeNSe are trees (although they mostly are), we also discuss how to extend DeNSe
in order to enforce projective and non-projective tree outputs. Throughout this paper, lowercase boldface letters denote vectors (e.g.,or ), uppercase boldface letters denote matrices (e.g., or ), and lowercase letters denote scalars (e.g., or ).
Let denote a sentence of length ; following common practice in the dependency parsing literature [Kübler et al.2009], we add an artificial root token represented by . Analogously, let denote the representation of sentence , with representing word . Besides encoding information about each in isolation (e.g., its lexical meaning or POS tag), must also encode ’s positional information within the sentence. Such information has been shown to be important in dependency parsing [McDonald et al.2005a]. For example, in the following sentence:
[theme = simple, arc angle=50] [column sep=0.03em] root & a & dog & is & chasing & a & cat
32 53 54 15 76 57
the head of the first a is dog, whereas the head of the second a is cat. Without considering positional information, a model cannot easily decide which a (nearer or farther) to assign to dog.
Long short-term memory networks (Hochreiter and Schmidhuber, 1997; LSTMs), a type of recurrent neural network with a more complex computational unit, have proven effective at capturing long-term dependencies. In our case LSTMs allow to represent each word on its own and within a sequence leveraging long-range contextual information. As shown in Figure 1, we first use a forward LSTM () to read the sentence from left to right and then a backward LSTM () to read the sentence from right to left, so that the entire sentence serves as context for each word:222For more detail on LSTM networks, see e.g., Graves:2012 or Goldberg:2015.
where is the feature vector of word , and are the hidden states and memory cells for the th word in and is the hidden unit size. is also the representation for ( and its left neighboring words) and is an internal state maintained by . and are the hidden states and memory cells for the backward . Each token is represented by , the concatenation of two vectors corresponding to ’s lexical and POS tag embeddings:
where and are one-hot vector representations of token and its POS tag ; and are the word and POS tag embedding matrices, where is the vocabulary size, is the word embedding size, is the POS tag set size, and the tag embedding size. The hidden states of the forward and backward LSTMs are concatenated to obtain, the final representation of :
Note that bidirectional LSTMs are one of many possible ways of representing word . Alternative representations include embeddings obtained from feed-forward neural networks [Chen and Manning2014, Lei et al.2014a], character-based embeddings [Ballesteros et al.2015], and more conventional features such as those introduced in mcdonald2005online.
We now move on to discuss our formalization of dependency parsing as head selection. We begin with unlabeled dependencies and then explain how the model can be extended to predict labeled ones.
In a dependency tree, a head can have multiple dependents, whereas a dependent can have only one head. Based on this fact, dependency parsing can be formalized as follows. Given a sentence , we aim to find for each word the most probable head . For example, in Figure 1, to find the head for the token love, we calculate probabilities , , and , and select the highest. More formally, we estimate the probability of token being the head of token in sentence as:
where and are vector-based representations of and , respectively (described in Section 3.1); is a neural network with a single hidden layer that computes the associative score between representations and :
where , , and are weight matrices of . Note that the candidate head can be the root, while the dependent cannot. Equations (5) and (6) compute the probability of adding an arc between two words, in a fashion similar to the neural attention mechanism in sequence-to-sequence models [Bahdanau et al.2015].
We train our model by minimizing the negative log likelihood of the gold standard head, dependent arcs in all training sentences:
where is the training set, is ’s gold standard head333Note that can be root. within sentence , and the number of words in (excluding root). During inference, for each word in , we greedily choose the most likely head :
Note that the prediction for each word is made independently of the other words in the sentence.
Given our greedy inference method, there is no guarantee that predicted head, dependent arcs form a tree (maybe there are cycles). However, we empirically observed that most outputs during inference are indeed trees. For instance, on an English dataset, 95% of the arcs predicted on the development set are trees, and 87% of them are projective, whereas on a Chinese dataset, 87% of the arcs form trees, 73% of which are projective. This indicates that although the model does not explicitly model tree structure during training, it is able to figure out from the data (which consists of trees) that it should predict them.
So far we have focused on unlabeled dependencies, however it is relatively straightforward to extend DeNSe to produce labeled dependencies. We basically train an additional classifier to predict labels for the arcs which have been already identified. The classifier takes as input features representing properties of the arc . These consist of and , the LSTM-based representations for and (see Equation (4)), and their word and part-of-speech embeddings, and (see Equation (3)). Specifically, we use a trained DeNSe model to go through the training corpus and generate features and corresponding dependency labels as training data. We employ a two-layer rectifier network [Glorot et al.2011] for the classification task.
As mentioned earlier, greedy inference may not produce well-formed trees. In this case, the output of DeNSe can be adjusted with a maximum spanning tree algorithm. We use the Chu-Liu-Edmonds algorithm [Chu and Liu1965, Edmonds1967] for building non-projective trees and the Eisner eisner1996three algorithm for projective ones.
Following mcdonald2005non, we view a sentence as a graph with the sentence words and the dummy root symbol as vertices and a directed edge between every pair of distinct words and from the root symbol to every word. The directed graph is defined as:
where is the weight of edge and is known. The problem of dependency parsing now boils down to finding the tree with the highest score which is equivalent to finding a MST in [McDonald et al.2005b].
To build a non-projective parser, we solve the MST problem with the Chu-Liu-Edmonds algorithm [Chu and Liu1965, Edmonds1967]. The algorithm selects for each vertex (excluding root) the in-coming edge with the highest weight. If a tree results, it must be the maximum spanning tree and the algorithm terminates. Otherwise, there must be a cycle which the algorithm identifies, contracts into a single vertex and recalculates edge weights going into and out of the cycle. The greedy inference strategy described in Equation (8)) is essentially a sub-procedure in the Chu-Liu-Edmonds algorithm with the algorithm terminating after the first iteration. In implementation, we only run the Chu-Liu-Edmonds algorithm through graphs with cycles, i.e., non-tree outputs.
For projective parsing, we solve the MST problem with the Eisner eisner1996three algorithm. The time complexity of the Eisner algorithm is , while checking if a tree is projective can be done reasonably faster, with a algorithm. Therefore, we only apply the Eisner algorithm to the non-projective output of our greedy inference strategy. Finally, it should be noted that the training of our model does not rely on the Chu-Liu-Edmonds or Eisner algorithm, or any other graph-based algorithm. MST algorithms are only used at test time to correct non-tree outputs which are a minority; DeNSe acquires underlying tree structure constraints from the data without an explicit learning algorithm.
We evaluated our parser in a projective and non-projective setting. In the following, we describe the datasets we used and provide training details for our models. We also present comparisons against multiple previous systems and analyze the parser’s output.
In the projective setting, we assessed the performance of our parser on the English Penn Treebank (PTB) and the Chinese Treebank 5.1 (CTB). Our experimental setup closely follows chen2014fast and dyer2015transition.
For English, we adopted the Stanford basic dependencies (SD) representation [De Marneffe et al.2006].444We obtained SD representations using the Stanford parser v.3.3.0. We follow the standard splits of PTB, sections 2–21 were used for training, section 22 for development, and section 23 for testing. POS tags were assigned using the Stanford tagger [Toutanova et al.2003] with an accuracy of 97.3%. For Chinese, we follow the same split of CTB5 introduced in zhang2008tale. In particular, we used sections 001–815, 1001–1136 for training, sections 886–931, 1148–1151 for development, and sections 816–885, 1137–1147 for testing. The original constituency trees in CTB were converted to dependency trees with the Penn2Malt tool.555 http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html We used gold segmentation and gold POS tags as in chen2014fast and dyer2015transition.
In the non-projective setting, we assessed the performance of our parser on Czech and German, the largest non-projective datasets released as part of the CoNLL 2006 multilingual dependency parsing shared task. Since there is no official development set in either dataset, we used the last 374/367 sentences in the Czech/German training set as development data.666We make the number of sentences in the development and test sets comparable. Projective statistics of the four datasets are summarized in Table 1.
|Dataset||# Sentences||(%) Projective|
We trained our models on an Nvidia GPU card; training takes one to two hours. Model parameters were uniformly initialized to . We used Adam [Kingma and Ba2014] to optimize our models with hyper-parameters recommended by the authors (i.e., learning rate 0.001, first momentum coefficient 0.9, and second momentum coefficient 0.999). To alleviate the gradient exploding problem, we rescaled the gradient when its norm exceeded 5 [Pascanu et al.2013]. Dropout [Srivastava et al.2014] was applied to our model with the strategy recommended in the literature [Zaremba et al.2014, Semeniuta et al.2016]. On all datasets, we used two-layer LSTMs and set , where is the hidden unit size and is the word embedding size.
As in previous neural dependency parsing work [Chen and Manning2014, Dyer et al.2015], we used pre-trained word vectors to initialize our word embedding matrix . For the PTB experiments, we used 300 dimensional pre-trained GloVe777http://nlp.stanford.edu/projects/glove/ vectors [Pennington et al.2014]. For the CTB experiments, we trained 300 dimensional GloVe vectors on the Chinese Gigaword corpus which we segmented with the Stanford Chinese Segmenter [Tseng et al.2005]. For Czech and German, we did not use pre-trained word vectors. The POS tag embedding size was set to in the English experiments, in the Chinese experiments and in both Czech and German experiments.
For both English and Chinese experiments, we report unlabeled (UAS) and labeled attachment scores (LAS) on the development and test sets; following chen2014fast punctuation is excluded from the evaluation.
Experimental results on PTB are shown in Table 2. We compared our model with several recent papers following the same evaluation protocol and experimental settings. The first block in the table contains mostly graph-based parsers which do not use neural networks: Bohnet10 [Bohnet2010], Martins13 [Martins et al.2013], and Z&M14 [Zhang and McDonald2014]. Z&N11 [Zhang and Nivre2011] is a transition-based parser with non-local features. Accuracy results for all four parsers are reported in weiss:2015.
The second block in Table 2 presents results obtained from neural network-based parsers. C&M14 [Chen and Manning2014] is a transition-based parser using features learned with a feed forward neural network. Although very fast, its performance is inferior compared to graph-based parsers or strong non-neural transition based parsers (e.g., Z&N11). Dyer15 [Dyer et al.2015] uses (stack) LSTMs to model the states of the buffer, the stack, and the action sequence of a transition system. Weiss15 [Weiss et al.2015] is another transition-based parser, with a more elaborate training procedure. Features are learned with a neural network model similar to C&M14, but much larger with two layers. The hidden states of the neural network are then used to train a structured perceptron for better beam search decoding. Andor16 [Andor et al.2016] is similar to Weiss15, but uses a globally normalized training algorithm instead.
Unlike all models above, DeNSe does not use any kind of transition- or graph-based algorithm during training and inference. Nonetheless, it obtains a UAS of 94.02%. Around 95% of the model’s outputs after inference are trees, 87% of which are projective. When we post-process the remaining 13% of non-projective outputs with the Eisner algorithm (DeNSe+E), we obtain a slight improvement on UAS (94.10%).
DBLP:journals/corr/KiperwasserG16a extract features from bidirectional LSTMs and feed them to a graph- (K&G16 graph) and transition-based parser (K&G16 trans). Their LSTMs are jointly trained with the parser objective. DeNSe yields very similar performance to their transition-based parser while it outperforms K&G16 graph. A key difference between DeNSe and K&G16 lies in the training objective. The objective of DeNSe is log-likelihood based without tree structure constraints (the model is trained to produce a distribution over possible heads for each word, where each head selection is independent), while K&G16 employ a max-margin objective with tree structure constraints. Although our probabilistic objective is non-structured, it is perhaps easier to train compared to a margin-based one.
We also assessed the importance of the bidirectional LSTM on its own by replacing our LSTM-based features with those obtained from a feed-forward network. Specifically, we used the 1-order-atomic features introduced in pei2015effective which represent POS-tags, modifiers, heads, and their relative positions. As can be seen in Table 2 (DeNSe-Pei), these features are less effective compared to LSTM-based ones and the contribution of the MST algorithm (Eisner) during decoding is more pronounced (DeNSe-Pei+E). We observe similar trends in the Chinese, German, and Czech datasets (see Tables 3 and 5).
Results on CTB follow a similar pattern. As shown in Table 3, DeNSe outperforms all previous neural models (see the test set columns) on UAS and LAS. DeNSe performs competitively with Z&M14, a non-neural model with a complex high order decoding algorithm involving cube pruning and strategies for encouraging diversity. Post-processing the output of the parser with the Eisner algorithm generally improves performance (by 0.21%; see last row in Table 3). Again we observe that 1-order-atomic features [Lei et al.2014a] are inferior compared to the LSTM. Table 4 reports unlabeled sentence level exact match (UEM) in Table 4 for English and Chinese. Interestingly, even when using the greedy inference strategy, DeNSe yields a UEM comparable to Dyer15 on PTB. Finally, in Figure 2 we analyze the performance of our parser on sentences of different length. On both PTB and CTB, DeNSe has an advantage on long sentences compared to C&M14 and Dyer15.
For Czech and German, we closely follow the evaluation setup of CoNLL 2006. We report both UAS and LAS, although most previous work has focused on UAS. Our results are summarized in Table 5. We compare DeNSe against three non-projective graph-based dependency parsers: the MST parser [McDonald et al.2005b], the Turbo parser [Martins et al.2013], and the RBG parser [Lei et al.2014b]. We show the performance of these parsers in the first order setting (e.g., MST-1st) and in higher order settings (e.g., Turbo-3rd). The results of MST-1st, MST-2nd, RBG-1st and RBG-3rd are reported in lei2014low and the results of Turbo-1st and Turbo-3rd are reported in martins:2013. We show results for our parser with greedy inference (see DeNSe in the table) and when we use the Chu-Liu-Edmonds algorithm to post-process non-tree outputs (DeNSe+CLE).
As can been seen, DeNSe outperforms all other first (and second) order parsers on both German and Czech. As in the projective experiments, we observe slight a improvement (on both UAS and LAS) when using a MST algorithm. On German, DeNSe is comparable with the best third-order parser (Turbo-3rd), while on Czech it lags behind Turbo-3rd and RBG-3rd. This is not surprising considering that DeNSe is a first-order parser and only uses words and POS tags as features. Comparison systems use a plethora of hand-crafted features and more sophisticated high-order decoding algorithms. Finally, note that a version of DeNSe with features in [Lei et al.2014a] is consistently worse (see the second block in Table 5).
|Before MST||After MST|
Our experimental results demonstrate that using a MST algorithm during inference can slightly improve the model’s performance. We further examined the extent to which the MST algorithm is necessary for producing dependency trees. Table 6 shows the percentage of trees before and after the application of the MST algorithm across the four languages. In the majority of cases DeNSe outputs trees (ranging from 87.0% to 96.7%) and a significant proportion of them are projective (ranging from 65.5% to 86.6%). Therefore, only a small proportion of outputs (14.0% on average) need to be post-processed with the Eisner or Chu-Liu-Edmonds algorithm.
In this work we presented DeNSe, a neural dependency parser which we train without a transition system or graph-based algorithm. Experimental results show that DeNSe achieves competitive performance across four different languages and can seamlessly transfer from a projective to a non-projective parser simply by changing the post-processing MST algorithm during inference. In the future, we plan to increase the coverage of our parser by using tri-training techniques [Li et al.2014] and multi-task learning [Luong et al.2015].
We would like to thank Adam Lopez and Frank Keller for their valuable feedback. We acknowledge the financial support of the European Research Council (ERC; award number 681760).
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349–359, Lisbon, Portugal.
Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms.In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8.
Journal of Machine Learning Research, 2:265–292.
International Conference on Artificial Intelligence and Statistics, pages 315–323, Cadiz, Spain.
Labeled pseudo-projective dependency parsing with support vector machines.In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 221–225, New York City.