Effective Subtree Encoding for Easy-First Dependency Parsing

11/08/2018 ∙ by Jiaxun Cai, et al. ∙ Shanghai Jiao Tong University

Easy-first parsing relies on subtree re-ranking to build the complete parse tree. Since the intermediate states of the parsing process are represented by various subtrees, whose internal structural information is the key clue for later parsing action decisions, we explore a better representation for such subtrees. In detail, this work introduces a bottom-up subtree encoder based on the child-sum tree-LSTM. Starting from an easy-first dependency parser without other handcrafted features, we show that the proposed subtree encoder effectively promotes the parsing process, and enables a greedy-search easy-first parser to achieve promising results on benchmark treebanks compared to state-of-the-art baselines.







1 Introduction

Transition-based and graph-based parsers are the two typical model families used in dependency parsing. The former Nivre (2003) can adopt rich features in the parsing process but is subject to a limited search space, while the latter Eisner (1996); McDonald et al. (2005) searches the entire tree space but is limited to local features and has higher computational costs. Besides, other variants have been proposed to overcome the shortcomings of both graph- and transition-based approaches. The easy-first parsing approach Goldberg and Elhadad (2010) adopts ideas from both models and is expected to benefit from the nature of both. Several ensemble methods Nivre and Mcdonald (2008); Zhang and Clark (2008); Kuncoro et al. (2016) have also been proposed, which employ the parsing result of one parser to guide another during parsing.

In this paper, we focus on the incremental easy-first parsing process. The easy-first dependency parser formalizes parsing as a sequence of attachments that build the dependency tree bottom-up. Inspired by the fact that humans always parse a natural language sentence starting from the easy and local attachment decisions and proceeding to the harder parts, instead of working in a fixed left-to-right order, the easy-first parser learns its own notion of easy and hard, and defers the attachment decisions it considers harder until sufficient information is available. In the primitive easy-first parsing process, each attachment simply deletes the child node and leaves the parent node unmodified. However, as the partially built dependency structures carry rich information to guide the parsing process, effectively encoding those structures at each attachment should improve the performance of the parser.

Figure 1: A fully built dependency tree for "The test may come today.", including part-of-speech (POS) tags and the root token.

Some works have been devoted to encoding the tree structures arising in different natural language processing (NLP) tasks using either recurrent or recursive neural networks Goller and Kuchler (1996); Socher et al. (2010). However, most such models require the encoded tree to have a fixed maximum branching factor, and are thus unsuitable for encoding dependency trees, where each node can have an arbitrary number of children. Other attempts allow arbitrary branching factors, and have succeeded in particular NLP tasks.

Tai et al. (2015) introduce a child-sum tree-structured Long Short-Term Memory (LSTM) network that encodes a completed dependency tree without limitation on the branching factor, and show that the proposed tree-LSTM is effective on the semantic relatedness and sentiment classification tasks. Zhu et al. (2015) propose a recursive convolutional neural network (RCNN) architecture to capture syntactic and compositional-semantic representations of phrases and words in a dependency tree, and then use it to re-rank the k-best list of candidate dependency trees. Kiperwasser and Goldberg (2016a) employ two vanilla LSTMs to encode a partially built dependency tree during parsing: one encodes the sequence of left-modifiers from the head outwards, and the other encodes the sequence of right-modifiers in the same manner.

In this paper, we look into the bottom-up building process of the easy-first parser and introduce a subtree encoder based on the child-sum tree-LSTM to promote the parsing process (our code is attached with this submission and will be made publicly available upon publication). Unlike the work of Kiperwasser and Goldberg (2016a), which uses two standard LSTMs to encode the dependency subtree in a sequentialized manner (referred to as HT-LSTM later in this paper), we employ a structural model that provides the flexibility to incorporate or drop an individual child node of the subtree. Further, we introduce a multilayer perceptron between depths of the subtree to encode other underlying structural information, such as the relation and distance between nodes.

Evaluation on the benchmark treebanks shows that the proposed model gives results substantially better than the baseline parser and outperforms the neural easy-first parser of Kiperwasser and Goldberg (2016a). Besides, our greedy bottom-up parser achieves performance comparable to parsers that use beam search or re-ranking methods Zhu et al. (2015); Andor et al. (2016).

2 Easy-First Parsing Algorithm

Easy-first parsing can be considered a variation of the transition-based parsing method that builds the dependency tree from easy to hard instead of working in a fixed left-to-right order. The parsing process starts by making easy attachment decisions to build several dependency structures, and then proceeds to harder and harder ones until a well-formed dependency tree is built. During training, the parser learns its own notion of easy and hard, and learns to defer specific kinds of decisions until more information is available Goldberg and Elhadad (2010).

The main data structure in the easy-first parser is a list of unattached nodes called the pending. The parsing algorithm successively picks actions from the allowed action set and applies them to elements of the pending list. The parsing process stops when the pending contains only the root node of the dependency tree.

At each step, the parser chooses a specific action $act$ on position $i$ using a scoring function $score(\cdot)$, which assigns a score to each possible action at each location based on the current state of the parser. Given an intermediate state of the parsing process with pending $p_1, \dots, p_N$, the attachment action is determined as follows:

$$(\hat{act}, \hat{i}) = \operatorname*{argmax}_{act \in \mathcal{A},\ 1 \le i < N} score\big(act(i)\big)$$

where $\mathcal{A}$ denotes the set of allowed actions and $i$ is the index of a node in the pending. Besides distinguishing the correct attachments from the incorrect ones, the scoring function is supposed to assign the "easiest" attachment the highest score, which in fact determines the parsing order of an input sentence. Goldberg and Elhadad (2010) employ a linear model for the scorer:

$$score\big(act(i)\big) = \vec{w} \cdot \phi\big(act(i)\big)$$

where $\phi(act(i))$ is the feature vector of attachment $act(i)$, and $\vec{w}$ is a parameter vector that can be learned jointly with other components in the model.

There are exactly two types of actions in the allowed action set: ATTACHLEFT($i$) and ATTACHRIGHT($i$). Figure 2 shows examples of the two different types of attachments. Let $p_i$ refer to the $i$-th element in the pending; then the allowed actions can be formally defined as follows:

  • ATTACHLEFT($i$): attaching $p_i$ to $p_{i+1}$, which results in an arc ($p_{i+1}$, $p_i$) headed by $p_{i+1}$, and removing $p_i$ from the pending.

  • ATTACHRIGHT($i$): attaching $p_{i+1}$ to $p_i$, which results in an arc ($p_i$, $p_{i+1}$) headed by $p_i$, and removing $p_{i+1}$ from the pending.
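As a concrete illustration, the loop below sketches the easy-first algorithm in Python. It is not the authors' implementation: `score_fn` stands in for the learned scorer (here a deliberately trivial toy function), and all names are assumptions.

```python
# A toy sketch of the easy-first loop with the two attachment actions
# defined above. `score_fn` stands in for the learned scorer.

ATTACH_LEFT, ATTACH_RIGHT = "attach_left", "attach_right"

def easy_first_parse(words, score_fn):
    """Greedily build arcs until only the root remains in `pending`."""
    pending = list(words)   # unattached nodes / subtree roots
    arcs = []               # collected (head, modifier) pairs
    while len(pending) > 1:
        # Score every allowed action on every adjacent pair, pick the best.
        candidates = [(score_fn(act, pending, i), act, i)
                      for i in range(len(pending) - 1)
                      for act in (ATTACH_LEFT, ATTACH_RIGHT)]
        _, act, i = max(candidates)
        if act == ATTACH_LEFT:   # pending[i] becomes a child of pending[i+1]
            arcs.append((pending[i + 1], pending[i]))
            del pending[i]
        else:                    # pending[i+1] becomes a child of pending[i]
            arcs.append((pending[i], pending[i + 1]))
            del pending[i + 1]
    return pending[0], arcs

# Toy scorer: always prefer the rightmost pair, attaching leftward,
# which makes the last word the root.
root, arcs = easy_first_parse(["the", "test", "comes"],
                              lambda act, p, i: (i, act == ATTACH_LEFT))
```

Note that a sentence of length n is fully parsed after exactly n-1 attachments, since every action removes one node from the pending.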

Figure 2: Illustration of the pending states before and after the two types of attachment actions

3 Parsing With Subtree Encoding

3.1 Dependency Subtree

We know that the easy-first parser builds up a dependency tree incrementally, so at an intermediate state the pending list of the parser may contain two kinds of nodes:

  • subtree root: the root of a partially built dependency tree;

  • unprocessed node: a node that has not yet been attached to a parent or assigned a child.

Note that every processed node either becomes a subtree root (attached as a parent) or is removed from the pending (attached as a child). A subtree root in the pending actually stands for a dependency structure whose internal nodes are all processed except the root itself; it is therefore more informative than the unprocessed nodes for guiding later attachment decisions. For simplicity and clarity, we define the notion of a dependency subtree as follows:

Definition 3.1 (Dependency Subtree)

A dependency subtree is a self-contained structure that has exactly one incoming edge to its root and no outgoing edges. Namely, the sole incoming edge is a cut edge of the underlying undirected graph of the completed tree.

In the easy-first parsing process, each pending node is attached to its parent only after all of its children have been collected. Thus, any structure produced during parsing is guaranteed to be a dependency subtree consistent with the above definition.

3.2 Recursive Subtree Encoding

In the primitive easy-first parsing process, a node that has been removed no longer affects the parsing process, so a subtree structure in the pending is simply represented by its root node. However, motivated by the success of properly encoding tree structures for other NLP tasks Tai et al. (2015); Kiperwasser and Goldberg (2016a); Kuncoro et al. (2017), we employ the child-sum tree-LSTM to encode the dependency subtree in the hope of further improving parsing performance.

Child-Sum Tree-LSTM

The child-sum tree-LSTM is an extension of the standard LSTM proposed by Tai et al. (2015) (hereafter referred to as tree-LSTM). Like the standard LSTM unit Hochreiter and Schmidhuber (1997), each tree-LSTM unit contains an input gate, an output gate, a memory cell, and a hidden state. The major difference from the standard unit is that the memory cell updating and the calculation of the gating vectors depend on multiple child units. As shown in Figure 3, a tree-LSTM unit can be connected to an arbitrary number of child units and contains one forget gate for each child. This gives the tree-LSTM the flexibility to incorporate or drop the information from each child unit.

Figure 3: Tree-LSTM neural network with an arbitrary number of child nodes

Given a dependency tree, let $C(j)$ denote the children set of node $j$ and $x_j$ denote the input of node $j$. The tree-LSTM can be formulated as follows Tai et al. (2015):

$$\tilde{h}_j = \sum_{k \in C(j)} h_k \tag{1}$$
$$f_{jk} = \sigma\big(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\big) \tag{2}$$
$$i_j = \sigma\big(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\big)$$
$$o_j = \sigma\big(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\big)$$
$$u_j = \tanh\big(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\big)$$
$$c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k$$
$$h_j = o_j \odot \tanh(c_j)$$

where $k \in C(j)$, $h_k$ is the hidden state of the $k$-th child node, $c_j$ is the memory cell of the head node $j$, and $h_j$ is the hidden state of node $j$. Note that in Eq. (2), a single forget gate $f_{jk}$ is computed for each hidden state $h_k$.
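The update above can be sketched in plain numpy; the dimensions, initialization, and class names here are illustrative assumptions, not the paper's DyNet implementation.

```python
import numpy as np

# A plain-numpy sketch of the child-sum tree-LSTM unit described above.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChildSumTreeLSTMCell:
    """Child-sum tree-LSTM unit with one forget gate per child."""
    def __init__(self, in_dim, mem_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.mem_dim = mem_dim
        init = lambda *shape: rng.uniform(-0.1, 0.1, shape)
        # One (W, U, b) triple per gate: input, forget, output, update.
        self.p = {g: (init(mem_dim, in_dim), init(mem_dim, mem_dim),
                      np.zeros(mem_dim)) for g in "ifou"}

    def _gate(self, name, x, h):
        W, U, b = self.p[name]
        return W @ x + U @ h + b

    def __call__(self, x, child_h, child_c):
        # h_tilde is the sum of the children's hidden states (Eq. 1).
        h_tilde = sum(child_h, np.zeros(self.mem_dim))
        i = sigmoid(self._gate("i", x, h_tilde))
        o = sigmoid(self._gate("o", x, h_tilde))
        u = np.tanh(self._gate("u", x, h_tilde))
        # A separate forget gate for every child (Eq. 2).
        f = [sigmoid(self._gate("f", x, h_k)) for h_k in child_h]
        c = i * u + sum((f_k * c_k for f_k, c_k in zip(f, child_c)),
                        np.zeros(self.mem_dim))
        h = o * np.tanh(c)
        return h, c

cell = ChildSumTreeLSTMCell(in_dim=4, mem_dim=3)
h1, c1 = cell(np.ones(4), [], [])             # a leaf has no children
h2, c2 = cell(np.ones(4), [], [])
h, c = cell(np.zeros(4), [h1, h2], [c1, c2])  # parent combines two leaves
```

Because each child gets its own forget gate, the unit can down-weight one child's state without affecting the others, which is the property exploited later in Section 5.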

Our subtree encoder uses the tree-LSTM as its basic building block. To fully capture the tree-structural information, the encoder also incorporates distance and relation label features.

Incorporating Distance and Relation Features

Distance embedding is a usual way to encode distance information. In our model, we use a vector $v^{d}$ to represent the relative distance of a head word $h$ and its $k$-th modifier $m_k$:

$$v^{d} = E^{d}\big[index(h) - index(m_k)\big]$$

where $index(\cdot)$ is the index of the word in the original input sentence, and $E^{d}$ represents the distance embeddings lookup table.

Similarly, the relation label of a head-modifier pair is encoded as a vector $v^{r}$ according to the relation embeddings lookup table $E^{r}$. Both lookup tables are randomly initialized and learned jointly with the other parameters of the neural network.

To incorporate the two features, our subtree encoder introduces an additional feature encoder between every pair of connected tree-LSTM units.

Specifically, the two feature embeddings are first concatenated with the hidden state of the corresponding child node:

$$z_k = [\,h_k;\ v^{d};\ v^{r}\,]$$

Then we apply an affine transformation to the resulting vector $z_k$ and pass the result through a $\tanh$ activation: $g_k = \tanh(W^{(g)} z_k + b^{(g)})$, where $W^{(g)}$ and $b^{(g)}$ are learnable parameters. After getting $g_k$, it is fed into the next tree-LSTM unit; that is, the hidden state $h_k$ of the child node in Eq. (1) and (2) is replaced by $g_k$.
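A minimal sketch of this feature encoder follows. All table names and the (deliberately small) sizes are illustrative assumptions; the paper's actual embedding dimensions are listed in Table 1.

```python
import numpy as np

# Sketch of the feature encoder between tree-LSTM units: the child's hidden
# state is concatenated with its distance and relation embeddings, then
# passed through an affine transform and tanh.

rng = np.random.default_rng(0)
HID, DIST_DIM, REL_DIM, MAX_DIST = 8, 4, 4, 10
E_dist = rng.normal(size=(2 * MAX_DIST + 1, DIST_DIM))  # distance lookup
E_rel = {"nsubj": rng.normal(size=REL_DIM),             # relation lookup
         "dobj": rng.normal(size=REL_DIM)}
W_g = rng.normal(size=(HID, HID + DIST_DIM + REL_DIM))
b_g = np.zeros(HID)

def encode_child(h_child, head_idx, mod_idx, relation):
    """The vector fed to the next tree-LSTM unit in place of h_child."""
    dist = int(np.clip(head_idx - mod_idx, -MAX_DIST, MAX_DIST))
    z = np.concatenate([h_child, E_dist[dist + MAX_DIST], E_rel[relation]])
    return np.tanh(W_g @ z + b_g)

g = encode_child(np.ones(HID), head_idx=5, mod_idx=2, relation="nsubj")
```

Clipping the relative distance to a fixed range is a common assumption for keeping the lookup table finite; the paper does not state how out-of-range distances are handled.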

3.3 The Bottom-Up Constructing Process

In our model, a dependency subtree is encoded by performing the tree-LSTM transformation on its root node and computing the vector representations of its children recursively until reaching the leaf nodes. More formally, consider a partially built dependency tree rooted at node $h$ with children (modifiers) $m_1, m_2, \dots, m_K$, each of which may be the root of some smaller subtree. The tree can then be encoded as:

$$t(h) = T\big(e(h),\ g(t(m_1)), \dots, g(t(m_K))\big) \tag{3}$$

where $T$ is the tree-LSTM transformation, $g$ is the above-mentioned feature encoder, $t(\cdot)$ refers to the vector representation of the subtree rooted at a given node, and $e(h)$ denotes the embedding of the root node word $h$. In practice, $e(h)$ is a combination of the word embedding and the POS-tag embedding, or the output of a bidirectional LSTM. We can see clearly that the representation of a fully parsed tree can be computed via this recursive process.

When encountering a leaf node, the parser regards it as a subtree without any children and thus sets its initial hidden state and memory cell to zero vectors:

$$t(\text{leaf}) = T\big(e(\text{leaf})\big), \qquad \tilde{h} = \mathbf{0},\ c = \mathbf{0} \tag{4}$$
In the easy-first parsing process, each dependency structure in the pending is built incrementally; namely, the parser builds several dependency subtrees separately and then combines them into larger subtrees. So, when the parser builds a subtree rooted at $h$, all of its children have already been processed in previous steps. The subtree encoding process can thus be naturally incorporated into the easy-first parsing process in a bottom-up manner using a dynamic programming technique.

Specifically, in the initial step, each node of the input sentence is treated as a subtree without any children, and the parser initializes the pending with the tree representations of those input nodes using Eq. (4). For each node in the pending, the parser maintains an additional children set to hold its processed children. Each time the parser performs an attachment, the selected modifier is removed from the pending and added to the children set of the selected head, and the vector representation of the subtree rooted at the selected head is recomputed using Eq. (3). The number of updates the easy-first parser performs on the subtree representations therefore equals the number of actions required to build a dependency tree, namely $n-1$, where $n$ is the input sentence length.
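The bookkeeping just described can be sketched as follows. `encode_subtree` is a stand-in for the tree-LSTM transformation of Eq. (3); all names here are hypothetical, not the paper's implementation.

```python
# Sketch of the incremental bottom-up update: each attachment moves the
# modifier from `pending` into the head's children set and recomputes the
# head's subtree representation.

def encode_subtree(node, children_reprs):
    # Stand-in: a real implementation would run the tree-LSTM here.
    return (node, tuple(children_reprs))

class EasyFirstState:
    def __init__(self, words):
        self.pending = list(words)
        self.children = {w: [] for w in words}   # processed children per node
        # Initially every node is a childless subtree, as in Eq. (4).
        self.repr = {w: encode_subtree(w, []) for w in words}

    def attach(self, head, mod):
        """Attach `mod` under `head` and refresh the head's representation."""
        self.pending.remove(mod)
        self.children[head].append(mod)
        self.repr[head] = encode_subtree(
            head, [self.repr[m] for m in self.children[head]])

state = EasyFirstState(["a", "b", "c"])
state.attach("b", "a")   # one recomputation per attachment: n - 1 in total
state.attach("b", "c")
```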

3.4 Incorporating HT-LSTM and RCNN

Both HT-LSTM and RCNN can be incorporated into our framework. However, since the RCNN model employs POS-tag-dependent parameters, its primitive form is incompatible with the incremental easy-first parser; we leave a detailed discussion to Section 5. To address this problem, we simplify and reformulate the RCNN model by replacing the POS-tag-dependent parameters with a global one. Specifically, for each head-modifier pair $(h, m_k)$, we first use a convolutional hidden layer to compute the combination representation:

$$z_k = \tanh\big(W\,[\,e(h);\ t(m_k);\ v^{d}\,]\big), \qquad k = 1, \dots, K$$

where $K$ is the size of the children set of node $h$, $W$ is the global composition matrix, and $t(m_k)$ is the subtree representation of the child node $m_k$, which can be recursively computed using the RCNN transformation. After the convolution, we stack all the $z_k$ into a matrix $Z$. Then, to get the subtree representation of $h$, we apply a max pooling over $Z$ on rows:

$$t(h)_j = \max_{1 \le k \le K} Z_{jk}$$
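Under the simplification above, the convolution-plus-pooling step might look like the numpy sketch below; the sizes and the exact contents of the per-child input vector are illustrative assumptions.

```python
import numpy as np

# Sketch of the simplified RCNN: one global composition matrix produces a
# combined vector z_k per child, the z_k are stacked column-wise into Z,
# and a row-wise max pooling gives the subtree representation.

rng = np.random.default_rng(0)
DIM = 6
W_global = rng.normal(size=(DIM, 2 * DIM))   # the single global matrix

def rcnn_subtree(head_vec, child_vecs):
    # Convolution step: combine the head with each child separately.
    Z = np.stack([np.tanh(W_global @ np.concatenate([head_vec, t_k]))
                  for t_k in child_vecs], axis=1)   # shape (DIM, K)
    return Z.max(axis=1)                            # row-wise max pooling

t = rcnn_subtree(np.ones(DIM), [rng.normal(size=DIM) for _ in range(3)])
```

The max pooling makes the output invariant to the number and order of children, which is what lets a single global matrix replace the POS-tag-dependent ones.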

4 Experiments and Results

We evaluate our parsing model on the English Penn Treebank (PTB) and the Chinese Penn Treebank (CTB), using unlabeled attachment score (UAS) and labeled attachment score (LAS) as the metrics. Punctuation is ignored as in previous work Kiperwasser and Goldberg (2016a); Dozat and Manning (2017).

4.1 Data Set

For English, we use the Stanford Dependencies (SD 3.3.0) De Marneffe and Manning (2008) conversion of the Penn Treebank Marcus et al. (1993), and follow the standard split for PTB, using sections 2-21 for training, section 22 as the development set, and section 23 as the test set. The Stanford POS tagger Toutanova et al. (2003) is used to generate predicted POS tags for the dataset.

For Chinese, we adopt the splitting convention for CTB described in Zhang and Clark (2008); Dyer et al. (2015). The dependencies are converted with the Penn2Malt converter. Gold segmentation and POS tags are used as in previous work Dyer et al. (2015); Chen and Manning (2014).

4.2 Training Detail

Our implementation uses the DyNet library (https://github.com/clab/dynet) for building the dynamic computation graph of the network. Before training, we preprocess each input sentence by inserting an extra "ROOT" node at index 0. Non-projective sentences in the training data set are dropped, and the training sentences are shuffled at the beginning of each epoch. We use the default parameter initialization, step sizes, and regularization values provided by the DyNet toolkit. The hyper-parameters of the final networks used for all reported experiments are detailed in Table 1.


Hyper-parameters Value
Word embedding dimensions 100
POS-tag embedding dimensions 100
Relation embedding dimensions 50
Distance embedding dimensions 50
BiLSTM layers 2
BiLSTM dimensions 200 + 200
Tree-LSTM dimensions 200
Layers dropout rate 0.25
Table 1: Hyper-parameters used in our experiments

We use GloVe embeddings Pennington et al. (2014) trained on Wikipedia and Gigaword as external embeddings for English parsing.

Instead of simply dropping rare words, we employ the word dropout approach used in Kiperwasser and Goldberg (2016a). Formally, a word $w$ that appears $\#(w)$ times in the training corpus is dropped with probability:

$$p_{drop}(w) = \frac{\alpha}{\alpha + \#(w)}$$

where $\alpha$ is a hyper-parameter.
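This word dropout can be computed as below; note that the $\alpha$ value used here is only an illustrative assumption, as the paper's setting is not recoverable from this copy.

```python
# Frequency-dependent word dropout: a word seen `count` times in training
# is dropped with probability alpha / (alpha + count). The alpha value
# below is an illustrative assumption, not the paper's setting.

def word_drop_prob(count, alpha):
    """Probability of dropping a word seen `count` times in training."""
    return alpha / (alpha + count)

p_rare = word_drop_prob(1, alpha=0.25)      # rare words: dropped often
p_freq = word_drop_prob(1000, alpha=0.25)   # frequent words: almost never
```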

For PTB, we apply a dropout rate of 33% on the POS-tag embeddings and replace a dropped POS-tag embedding with the corresponding word embedding; we do not perform POS-tag dropout for CTB. Our parser employs the hinge loss described in Kiperwasser and Goldberg (2016a) as its loss function. An Adam optimizer Kingma and Ba (2014) with both $\beta_1$ and $\beta_2$ set to 0.9 is used to update the parameters.

4.3 Results

Improvement over Baseline Model

To explore the effectiveness of the proposed subtree encoding model, we implement a baseline easy-first parser without any additional subtree encoder and conduct experiments on PTB. The baseline model contains a BiLSTM encoder and uses pretrained word embeddings; we refer to it as the BiLSTM parser. We also re-implement both HT-LSTM and RCNN and incorporate them into our framework for subtree encoding. All four models share the same hyper-parameter settings.

The results show that our proposed tree-LSTM encoder model outperforms the BiLSTM parser by a margin of 1.48% in UAS and 1.66% in LAS on the test set. Though the RCNN model remains simple, using only a single global composition matrix, it draws with the HT-LSTM model in UAS on both the development set and the test set, and only slightly underperforms the latter in LAS. Note that the HT-LSTM is more complicated, containing two LSTMs. These results demonstrate that simply sequentializing the subtree fails to effectively incorporate the structural information. A further error analysis of the three models is given in Section 4.4. Besides, we also run our model using the same hyper-parameters as Kiperwasser and Goldberg (2016a) (it is worth noting that their weaker baseline parser does not use a Bi-LSTM or pretrained embeddings), and report the results in Table 3.

                 Dev (%)          Test (%)
                 LAS     UAS      LAS     UAS
BiLSTM parser    90.73   92.87    90.67   92.83
RCNN             91.05   93.25    91.01   93.21
HT-LSTM          91.23   93.23    91.36   93.27
tree-LSTM        92.32   94.27    92.33   94.31
Table 2: Comparison with the baseline easy-first parser (all with the same hyper-parameter settings).
                  Dev (%)          Test (%)
                  LAS     UAS      LAS     UAS
baseline parser   78.83   82.97    78.43   82.55
+tree-LSTM        91.20   93.03    91.18   92.97
  +Bi-LSTM        91.73   93.51    91.67   93.49
    +pretrain     92.17   94.09    92.19   94.13
baseline parser*  79.0    83.3     78.6    82.7
+HT-LSTM*         90.1    92.4     89.8    92.0
  +Bi-LSTM*       90.5    93.0     90.2    92.6
    +pretrain*    90.8    93.3     90.9    93.0
Table 3: Results under the same hyper-parameter settings as Kiperwasser and Goldberg (2016a). The "+" symbol denotes a specific extension over the previous line. Results marked with * are those reported in Kiperwasser and Goldberg (2016a).

Comparison with Previous Parsers

We now compare our model with other recently proposed parsers; the results are shown in Table 4. It is worth noting that although the work of Kuncoro et al. (2017) reaches an accuracy of 95.8% (UAS) on PTB, it is not included in the table, since its parsing results are converted from phrase-structure parses while this work focuses on native dependency parsing methods.

The work of Kiperwasser and Goldberg (2016a) (HT-LSTM) is similar to ours and achieves the best result among recently proposed easy-first parsers (here we directly refer to the original results reported in Kiperwasser and Goldberg (2016a)). Our subtree encoding parser outperforms their model on both PTB and CTB. Besides, the proposed model also outperforms the RCNN-based re-ranking model of Zhu et al. (2015), which introduces an RCNN to encode the dependency tree and re-ranks the k-best trees produced by the base model. Note that although our model is based on the greedy easy-first parsing algorithm, it is also competitive with the search-based parser of Andor et al. (2016). The model of Dozat and Manning (2017) outperforms ours; however, their parser is graph-based and thus enjoys the benefits of global optimization.

System Method PTB LAS(%) PTB UAS(%) CTB LAS(%) CTB UAS(%)
Dyer et al. (2015) Transition (g) 90.9 93.1 85.5 87.1
Kiperwasser and Goldberg (2016b) Transition (g) 91.9 93.9 86.1 87.6
Andor et al. (2016) Transition (b) 92.79 94.61 - -
Zhu et al. (2015) Transition (re) - 94.16 - 87.43
Zhang and McDonald (2014) Graph (3rd) 90.64 93.01 86.34 87.96
Wang and Chang (2016) Graph (1st) 91.82 94.08 86.23 87.55
Kiperwasser and Goldberg (2016b) Graph (1st) 90.9 93.0 84.9 86.5
Dozat and Manning (2017) Graph (1st) 94.08 95.74 88.23 89.30
Kiperwasser and Goldberg (2016a) EasyFirst (g) 90.9 93.0 85.5 87.1
This work EasyFirst (g) 92.33 94.31 86.37 88.65
Table 4: Comparison of results on the test sets. Acronyms used: (g) – greedy, (b) – beam search, (re) – re-ranking, (3rd) – 3rd-order, (1st) – 1st-order.

4.4 Error Analysis

To characterize the errors made by the parsers and the performance enhancement brought by the subtree encoder, we present some analysis of the error rate with respect to sentence length and POS tags. All analyses are conducted on the unlabeled attachment results from the PTB development set.

Error Distribution over Sentence Length

It is well known that dependency parsers do not cope well with long sentences and long-distance dependencies. Figure 4 shows the error rate of the different subtree encoding methods with respect to sentence length.

The error rate curves of the three models share the same tendency: as the sentence length grows, the error rate increases. In most cases, the curve of our model lies below the other two curves, except for one interval of sentence lengths where the proposed model underperforms the other two by a margin smaller than 1%. The curves of HT-LSTM and RCNN cross each other at several points, which is not surprising since the overall results of the two models are very close. The curves further show that the tree-LSTM is more suitable for incorporating the structural information carried by the subtrees produced in the easy-first parsing process.

Figure 4: Line chart of error rate against sentence length

Error Distribution over POS tags

McDonald and Nivre (2007) distinguish the noun, verb, pronoun, adjective, adverb, and conjunction POS categories to perform an analysis of linguistic factors. Following their work, we map the PTB POS tags onto these categories, skipping tags that cannot be mapped onto one of the six, and then evaluate the error rate with respect to the mapped POS tags, comparing the performance of the three parsers in Figure 5.

At first sight the results seem to contradict the previous ones, since the HT-LSTM model underperforms the RCNN model in most cases. This interesting result is caused by the overwhelming number of nouns: according to the corpus statistics, the number of nouns is roughly equal to the total number of verbs, adverbs, and conjunctions.

Typically, verbs, conjunctions, and adverbs tend to be closer to the root in a parse tree, which leads to longer-distance dependencies and makes them more difficult to parse. The figure shows that our model copes better with these kinds of words than the other two models. Interestingly, the simple RCNN model outperforms the HT-LSTM model on all three of these categories, which suggests that the HT-LSTM hardly succeeds in capturing long-distance dependencies through the tree structure. This might be caused by the sequentialization of the subtree in the HT-LSTM model.

The other three categories of words are usually attached lower in a parse tree and should theoretically be easier to parse. Indeed, the three models perform similarly on adjectives and pronouns. However, the RCNN model performs worse than the other two models on nouns, which can be attributed to the overly simple RCNN model being unable to cover different lengths of dependencies.

Figure 5: Error rate with respect to POS tags

5 Related Work

Easy-first parsing has a special position among dependency parsing systems. As mentioned above, it is to some extent a hybrid model that shares features with both transition- and graph-based models, though many researchers still consider it transition-based, as it builds the parse tree step by step. Since the easy-first parser was first proposed by Goldberg and Elhadad (2010), the most notable progress on this type of parser has been made by Kiperwasser and Goldberg (2016a), who incorporated neural networks for the first time.

Recursive neural networks (RNNs) Goller and Kuchler (1996); Socher et al. (2010) have been widely used for encoding trees. However, most RNNs are limited to a fixed maximum branching factor Socher (2014). To lift this limitation, Zhu et al. (2015) augment the RNN with a convolutional layer, resulting in a recursive convolutional neural network (RCNN). The RCNN is able to encode a tree structure with an arbitrary number of factors, and is used in a re-ranking model for dependency parsing. The primitive RCNN employs POS-tag-dependent parameters, which prevents it from being conveniently incorporated into an incremental parser. Specifically, for each head-modifier pair $(h, m_k)$, the model uses a convolutional hidden layer to compute the combination representation $z_k$:

$$z_k = \tanh\big(W_{(h, m_k)}\,[\,e(h);\ t(m_k);\ v^{d}\,]\big)$$

where $W_{(h, m_k)}$ is the composition matrix that depends on the POS tags of $h$ and $m_k$. In easy-first parsing, the parser evaluates each adjacent pair of nodes in the pending and distinguishes the correct attachments from the incorrect ones, which means that a parser with the RCNN needs an individual composition matrix for each possible combination of POS-tag pairs. Some of those combinations may seldom occur in the training data, causing a data-sparsity problem; the great number of parameters and the imbalance of the training data would make the model hard to train.

The child-sum tree-LSTM Tai et al. (2015) is a variant of the standard LSTM that removes the arity restriction, and has been shown effective on the semantic relatedness and sentiment classification tasks. We adopt the child-sum tree-LSTM in our incremental easy-first parser to promote parsing.

The work of Kiperwasser and Goldberg (2016a) is similar to ours; it sequentializes the dependency subtree and then encodes the modifiers from the head outward. As the HT-LSTM models the modifiers of a head as an ordered sequence, each time the subtree grows with a newly added modifier, the new state heavily depends on the previous one. If the parser makes an incorrect attachment on the $k$-th right modifier, the error will propagate to all subsequent right modifiers through the vanilla LSTM. By contrast, in our subtree encoder, since each modifier is assigned its own forget gate, the encoder is able to drop an individual attachment, which effectively alleviates the problem of error propagation.

6 Conclusion and Future Work

To enhance the easy-first dependency parsing, this paper proposes a tree-LSTM encoder for a better representation of partially built dependency subtrees. Experiments on PTB and CTB verify the effectiveness of the proposed model.

Since the easy-first parser builds the dependency tree bottom-up and stops when only one node (the "ROOT") remains in the pending, our subtree encoder yields a representation of the fully built dependency tree once the parsing process is done. This final tree representation can be directly integrated into downstream applications. One of our future works is to verify the effectiveness of our subtree encoder by applying it to other tasks such as textual inference, or by using the subtree representation as an additional feature for neural machine translation.


  • Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of ACL, pages 2442–2452.
  • Chen and Manning (2014) Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP, pages 740–750.
  • De Marneffe and Manning (2008) Marie-Catherine De Marneffe and Christopher D Manning. 2008. Stanford typed dependencies manual. Technical report, Technical report, Stanford University.
  • Dozat and Manning (2017) Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of ICLR.
  • Dyer et al. (2015) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL, pages 334–343.
  • Eisner (1996) Jason Eisner. 1996. Efficient normal-form parsing for combinatory categorial grammar. In Proceedings of ACL, pages 79–86.
  • Goldberg and Elhadad (2010) Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Proceedings of HLT: NAACL, pages 742–750.
  • Goller and Kuchler (1996) Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks, pages 347–352.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Proceedings of ICLR.
  • Kiperwasser and Goldberg (2016a) Eliyahu Kiperwasser and Yoav Goldberg. 2016a. Easy-first dependency parsing with hierarchical tree LSTMs. TACL, pages 445–461.
  • Kiperwasser and Goldberg (2016b) Eliyahu Kiperwasser and Yoav Goldberg. 2016b. Simple and accurate dependency parsing using bidirectional LSTM feature representations. TACL, pages 313–327.
  • Kuncoro et al. (2017) Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of EACL.
  • Kuncoro et al. (2016) Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Distilling an ensemble of greedy dependency parsers into one mst parser. In Proceedings of EMNLP, pages 1744–1753.
  • Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The penn treebank. Computational linguistics, pages 313–330.
  • McDonald et al. (2005) Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of ACL, pages 91–98.
  • McDonald and Nivre (2007) Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of EMNLP-CoNLL.
  • Nivre (2003) Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of IWPT.
  • Nivre and Mcdonald (2008) Joakim Nivre and Ryan T. Mcdonald. 2008. Integrating graph-based and transition-based dependency parsers. In Proceedings of ACL, pages 950–958.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.
  • Socher (2014) Richard Socher. 2014. Recursive deep learning for natural language processing and computer vision. Ph.D. thesis, Stanford University.
  • Socher et al. (2010) Richard Socher, Christopher D Manning, and Andrew Y Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS, pages 1–9.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL-IJCNLP, pages 1556–1566.
  • Toutanova et al. (2003) Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL, pages 173–180.
  • Wang and Chang (2016) Wenhui Wang and Baobao Chang. 2016. Graph-based dependency parsing with bidirectional LSTM. In Proceedings of ACL, pages 2306–2315.
  • Zhang and McDonald (2014) Hao Zhang and Ryan McDonald. 2014. Enforcing structural diversity in cube-pruned dependency parsing. In Proceedings of ACL, pages 656–661.
  • Zhang and Clark (2008) Yue Zhang and Stephen Clark. 2008. A tale of two parsers: investigating and combining graph-based and transition-based dependency parsing using beam-search. In Proceedings of EMNLP, pages 562–571.
  • Zhu et al. (2015) Chenxi Zhu, Xipeng Qiu, Xinchi Chen, and Xuanjing Huang. 2015. A re-ranking model for dependency parser with recursive convolutional neural network. In Proceedings of ACL-IJCNLP.