Deep neural models for core NLP tasks (Pytorch version)
We introduce a novel architecture for dependency parsing: stack-pointer networks (StackPtr). Combining pointer networks (Vinyals et al., 2015) with an internal stack, the proposed model first reads and encodes the whole sentence, then builds the dependency tree top-down (from root to leaf) in a depth-first fashion. The stack tracks the status of the depth-first search, and the pointer network selects one child for the word at the top of the stack at each step. The StackPtr parser benefits from the information of the whole sentence and all previously derived subtree structures, and removes the left-to-right restriction of classical transition-based parsers. Yet the number of steps for building any (including non-projective) parse tree is linear in the length of the sentence, just as in other transition-based parsers, yielding an efficient decoding algorithm with $O(n^2)$ time complexity. We evaluate our model on 29 treebanks spanning 20 languages and different dependency annotation schemas, and achieve state-of-the-art performance on 21 of them.
Dependency parsing, which predicts the existence and type of linguistic dependency relations between words, is a first step towards deep language understanding. Its importance is widely recognized in the natural language processing (NLP) community, and it benefits a wide range of NLP applications, such as coreference resolution (Ng, 2010; Durrett and Klein, 2013; Ma et al., 2016), machine translation (Bastings et al., 2017), information extraction (Nguyen et al., 2009; Angeli et al., 2015; Peng et al., 2017), word sense disambiguation (Fauceglia et al., 2015), and low-resource language processing (McDonald et al., 2013; Ma and Xia, 2014). There are two dominant approaches to dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007): local and greedy transition-based algorithms (Yamada and Matsumoto, 2003; Nivre and Scholz, 2004; Zhang and Nivre, 2011; Chen and Manning, 2014), and globally optimized graph-based algorithms (Eisner, 1996; McDonald et al., 2005a, b; Koo and Collins, 2010).
Transition-based dependency parsers read words sequentially (commonly from left to right) and build dependency trees incrementally by making a series of multiple-choice decisions. The advantage of this formalism is that the number of operations required to build any projective parse tree is linear with respect to the length of the sentence. The challenge, however, is that the decision made at each step is based on local information, leading to error propagation and worse performance compared to graph-based parsers on root and long dependencies (McDonald and Nivre, 2011). Previous studies have explored solutions to address this challenge. Stack LSTMs (Dyer et al., 2015; Ballesteros et al., 2015, 2016) are capable of learning representations of the parser state that are sensitive to the complete contents of the parser's state. Andor et al. (2016) proposed a globally normalized transition model to replace the locally normalized classifier. However, the parsing accuracy is still behind state-of-the-art graph-based parsers (Dozat and Manning, 2017).
Graph-based dependency parsers, on the other hand, learn scoring functions for parse trees and perform exhaustive search over all possible trees for a sentence to find the globally highest scoring tree. Combining this global search algorithm with distributed representations learned from neural networks, neural graph-based parsers (Kiperwasser and Goldberg, 2016; Wang and Chang, 2016; Kuncoro et al., 2016; Dozat and Manning, 2017) have achieved state-of-the-art accuracies on a number of treebanks in different languages. Nevertheless, these models, while accurate, are usually slow (e.g. decoding has $O(n^3)$ time complexity for first-order models (McDonald et al., 2005a, b) and higher polynomials for higher-order models (McDonald and Pereira, 2006; Koo and Collins, 2010; Ma and Zhao, 2012b, a)).
In this paper, we propose a novel neural network architecture for dependency parsing, stack-pointer networks (StackPtr). StackPtr is a transition-based architecture, with the corresponding asymptotic efficiency, but still maintains a global view of the sentence that proves essential for achieving competitive accuracy. Our StackPtr parser has a pointer network (Vinyals et al., 2015) as its backbone, and is equipped with an internal stack to maintain the order of head words in tree structures. The StackPtr parser performs parsing in an incremental, top-down, depth-first fashion; at each step, it generates an arc by assigning a child for the head word at the top of the internal stack. This architecture makes it possible to capture information from the whole sentence and all the previously derived subtrees, while maintaining a number of parsing steps linear in the sentence length.
We evaluate our parser on 29 treebanks across 20 languages and different dependency annotation schemas, and achieve state-of-the-art performance on 21 of them. The contributions of this work are summarized as follows:
We propose a neural network architecture for dependency parsing that is simple, effective, and efficient.
Empirical evaluations on benchmark datasets over 20 languages show that our method achieves state-of-the-art performance on 21 different treebanks. (Source code is publicly available at https://github.com/XuezheMax/NeuroNLP2.)
Comprehensive error analysis is conducted to compare the proposed method to a strong graph-based baseline using biaffine attention (Dozat and Manning, 2017).
We first briefly describe the task of dependency parsing, set up the notation, and review Pointer Networks (Vinyals et al., 2015).
Figure 1: Neural architecture of the StackPtr network, together with the decoding procedure of an example sentence. The BiRNN of the encoder is elided for brevity. For the inputs of the decoder at each time step, vectors in red and blue boxes indicate the sibling and grandparent.
Dependency trees represent syntactic relationships between words in the sentences through labeled directed edges between head words and their dependents. Figure 1 (a) shows a dependency tree for the sentence, “But there were no buyers”.
In this paper, we will use the following notation:
Input: $\mathbf{x} = \{w_1, \ldots, w_n\}$ represents a generic sentence, where $w_i$ is the $i$th word.
Output: $\mathbf{y} = \{p_1, p_2, \ldots, p_k\}$ represents a generic (possibly non-projective) dependency tree, where each path $p_i = \$, w_{i,1}, w_{i,2}, \ldots, w_{i,l_i}$ is a sequence of words from the root to a leaf. “$” is a universal virtual root that is added to each tree.
Stack: $\sigma$ denotes a stack configuration, which is a sequence of words. We use $\sigma|w$ to represent a stack configuration that pushes word $w$ onto the stack $\sigma$.
Children: $\mathrm{ch}(w_i)$ denotes the list of all the children (modifiers) of word $w_i$.
Pointer Networks (Ptr-Net) (Vinyals et al., 2015) are a variety of neural network capable of learning the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence. This model cannot be trivially expressed by standard sequence-to-sequence networks (Sutskever et al., 2014) due to the variable number of input positions in each sentence. Ptr-Net solves the problem by using attention (Bahdanau et al., 2015; Luong et al., 2015) as a pointer to select a member of the input sequence as the output.
Formally, the words of the sentence are fed one by one into the encoder (a multi-layer bi-directional RNN), producing a sequence of encoder hidden states $s_i$. At each time step $t$, the decoder (a uni-directional RNN) receives the input from the last step and outputs the decoder hidden state $h_t$. The attention vector $a^t$ is calculated as follows:
$$e^t_i = \mathrm{score}(h_t, s_i), \qquad a^t = \mathrm{softmax}(e^t) \tag{1}$$
where $\mathrm{score}(\cdot, \cdot)$ is the attention scoring function, which has several variations such as dot-product, concatenation, and biaffine (Luong et al., 2015). Ptr-Net regards the attention vector $a^t$ as a probability distribution over the source words, i.e. it uses $a^t_i$ as pointers to select the input elements.
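The pointer mechanism described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dot-product scorer is one of the variations mentioned in the text, and the names `pointer_attention`, `encoder_states` and `decoder_state`, as well as the toy vectors, are hypothetical and not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot-product attention score between two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def pointer_attention(decoder_state, encoder_states, score=dot):
    """Return the attention vector: one probability per input position."""
    scores = [score(decoder_state, s) for s in encoder_states]
    return softmax(scores)

# Toy values (assumed): three input positions with 2-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 1.0]

attention = pointer_attention(decoder_state, encoder_states)
# The "pointer" is simply the position with the highest attention weight.
pointed = max(range(len(attention)), key=attention.__getitem__)
```

Because the output is a distribution over input positions, its size follows the sentence length automatically, which is what lets the network handle a variable number of input positions.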
Similarly to Ptr-Net, StackPtr first reads the whole sentence and encodes each word into the encoder hidden state $s_i$. The internal stack $\sigma$ is always initialized with the root symbol “$”. At each time step $t$, the decoder receives the input vector corresponding to the top element of the stack $\sigma$ (the head word $w_h$, where $h$ is the word index), generates the hidden state $h_t$, and computes the attention vector $a^t$ using Eq. (1). The parser chooses a specific position $c$ according to the attention scores in $a^t$ to generate a new dependency arc $(w_h, w_c)$ by selecting $w_c$ as a child of $w_h$. Then the parser pushes $w_c$ onto the stack, i.e. $\sigma \rightarrow \sigma|w_c$, and goes to the next step. If at some step the parser points $w_h$ to itself, i.e. $c = h$, this indicates that all children of the head word $w_h$ have already been selected. The parser then goes to the next step by popping $w_h$ out of $\sigma$.
At test time, in order to guarantee a valid dependency tree containing all the words in the input sentences exactly once, the decoder maintains a list of “available” words. At each decoding step, the parser selects a child for the current head word, and removes the child from the list of available words to make sure that it cannot be selected as a child of other head words.
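A minimal sketch of this constrained decoding loop, assuming greedy selection (beam size 1). The function `greedy_decode` and the scorer `toy_score` are hypothetical stand-ins for the trained attention scores, and the rule that the root may only stop once every word is attached is an assumption of this sketch rather than a detail stated in the text.

```python
def greedy_decode(n_words, score):
    """Greedy stack-pointer style decoding.

    Positions 1..n_words are words; position 0 is the virtual root.
    `score(head, candidate)` stands in for the attention score; pointing a
    head at itself means "no more children", which pops the stack.
    Returns the predicted head of every word, in word order.
    """
    heads = [None] * (n_words + 1)
    available = set(range(1, n_words + 1))  # words not yet attached
    stack = [0]
    while stack:
        head = stack[-1]
        candidates = sorted(available)
        # Assumption: the root cannot stop while words remain unattached.
        if head != 0 or not available:
            candidates.append(head)
        choice = max(candidates, key=lambda c: score(head, c))
        if choice == head:
            stack.pop()
        else:
            heads[choice] = head
            available.remove(choice)  # each word gets exactly one head
            stack.append(choice)
    return heads[1:]

# Toy scorer (assumed): prefers the arcs of a fixed tree for
# "But there were no buyers" (1=But, 2=there, 3=were, 4=no, 5=buyers).
GOLD = [None, 3, 3, 0, 5, 3]

def toy_score(head, candidate):
    if candidate == head:
        return 0.5  # self-pointing beats any wrong arc
    return 1.0 if GOLD[candidate] == head else 0.0

predicted = greedy_decode(5, toy_score)
```

Removing a word from `available` as soon as it is attached is what guarantees that every word appears in the output tree exactly once.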
For head words with multiple children, it is possible that there is more than one valid selection at each time step. In order to define a deterministic decoding process in which there is only one ground-truth choice at each step (which is necessary for simple maximum likelihood estimation), a predefined order for each $\mathrm{ch}(w_i)$ needs to be introduced. The predefined order of children can have different alternatives, such as left-to-right or inside-out (children are ordered by their distance to the head word, first on the left side, then on the right side). In this paper, we adopt the inside-out order (we also tried the left-to-right order, which obtained worse parsing accuracy) since it enables us to utilize second-order sibling information, which has been proven beneficial for parsing performance (McDonald and Pereira, 2006; Koo and Collins, 2010) (see § 3.4 for details). Figure 1 (b) depicts the architecture of StackPtr and the decoding procedure for the example sentence in Figure 1 (a).
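To make the ground-truth sequence concrete, the following sketch derives the pointing targets for a gold tree using the inside-out order. The helper names `inside_out` and `oracle_steps` and the head-index encoding (0 for the virtual root) are assumptions made for illustration.

```python
def inside_out(head, children):
    """Order children by distance to the head: left side first, then right."""
    left = sorted((c for c in children if c < head), reverse=True)
    right = sorted(c for c in children if c > head)
    return left + right

def oracle_steps(heads):
    """heads[i-1] is the head of word i (0 = virtual root).

    Returns (head, pointed) pairs; pointed == head means "pop the head".
    """
    n = len(heads)
    children = {h: [] for h in range(n + 1)}
    for child, head in enumerate(heads, start=1):
        children[head].append(child)
    todo = {h: inside_out(h, cs) for h, cs in children.items()}
    stack, steps = [0], []
    while stack:
        head = stack[-1]
        if todo[head]:
            child = todo[head].pop(0)   # next child in inside-out order
            steps.append((head, child))
            stack.append(child)
        else:
            steps.append((head, head))  # all children done: self-point
            stack.pop()
    return steps

# "But there were no buyers": 1=But, 2=there, 3=were, 4=no, 5=buyers.
steps = oracle_steps([3, 3, 0, 5, 3])
```

For this five-word sentence the sketch emits 11 steps (five attachments and six pops), i.e. $2m - 1$ for $m = 6$ tokens when the virtual root is counted as a token. The head "were" receives its children in the order "there", "But", "buyers".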
The encoder of our parsing model is based on the bi-directional LSTM-CNN architecture (BLSTM-CNNs) (Chiu and Nichols, 2016; Ma and Hovy, 2016), where CNNs encode character-level information of a word into its character-level representation and the BLSTM models context information of each word. Formally, for each word, the CNN, with character embeddings as inputs, encodes the character-level representation. Then the character-level representation vector is concatenated with the word embedding vector and fed into the BLSTM network. To enrich word-level information, we also use POS embeddings. Finally, the encoder outputs a sequence of hidden states $s_i$.
The decoder of our parser is a uni-directional LSTM. Different from previous work (Bahdanau et al., 2015; Vinyals et al., 2015), which uses word embeddings of the previous word as the input to the decoder, our decoder receives the encoder hidden state vector ($s_i$) of the top element in the stack (see Figure 1 (b)). Compared to word embeddings, the encoder hidden states contain more contextual information, benefiting both the training and decoding procedures. The decoder produces a sequence of decoder hidden states $h_t$, one for each decoding step.
As mentioned before, our parser is capable of utilizing higher-order information. In this paper, we incorporate two kinds of higher-order structures: grandparent and sibling. A sibling structure is a head word with two successive modifiers, and a grandparent structure is a pair of dependencies connected head-to-tail.
To utilize higher-order information, the decoder's input at each step is the sum of the encoder hidden states of three words:
$$\beta_t = s_h + s_g + s_s$$
where $\beta_t$ is the input vector of the decoder at time $t$ and $h$, $g$, $s$ are the indices of the head word and its grandparent and sibling, respectively. Figure 1 (b) illustrates the details. Here we use the element-wise sum operation instead of concatenation because it does not increase the dimension of the input vector $\beta_t$, thus introducing no additional model parameters.
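As a small illustration of the element-wise sum, the sketch below builds the decoder input from the encoder states of the head, grandparent and sibling. The function name `decoder_input` is hypothetical, and treating a missing grandparent or sibling as contributing nothing is an assumption of this sketch; the text does not say how absent relatives are handled.

```python
def decoder_input(states, head, grandparent=None, sibling=None):
    """Element-wise sum of the encoder states of head, grandparent, sibling.

    Assumption: a missing grandparent or sibling simply adds nothing.
    """
    total = list(states[head])
    for idx in (grandparent, sibling):
        if idx is not None:
            total = [a + b for a, b in zip(total, states[idx])]
    return total

# Toy encoder states (assumed values), one 2-dimensional vector per position.
states = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
full_input = decoder_input(states, head=0, grandparent=1, sibling=2)
head_only = decoder_input(states, head=0)
```

The result has the same dimension as a single encoder state regardless of how many relatives are included, which is why the sum adds no parameters to the decoder.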
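For the attention score function in Eq. (1), the parser uses biaffine attention between the decoder hidden state $h_t$ and the encoder hidden state $s_i$:
$$e^t_i = h_t^{\top} W s_i + U^{\top} h_t + V^{\top} s_i + b$$
where $W$, $U$, $V$, $b$ are parameters, denoting the weight matrix of the bi-linear term, the two weight vectors of the linear terms, and the bias, respectively.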
As discussed in Dozat and Manning (2017), applying a multilayer perceptron (MLP) to the output vectors of the BLSTM before the score function can reduce both the dimensionality and the overfitting of the model. We follow this work by applying a one-layer perceptron to $s_i$ and $h_t$, with ELU (Clevert et al., 2015) as its activation function.
Similarly, the dependency label classifier also uses a biaffine function to score each label, given the head word vector $h_t$ and child vector $s_i$ as inputs. Again, we use MLPs to transform $h_t$ and $s_i$ before feeding them into the classifier.
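A biaffine score combines one bi-linear term, two linear terms and a bias. The sketch below computes such a score in plain Python for one decoder state and one encoder state; the function name `biaffine_score`, the parameter names and all numeric values are illustrative assumptions, not the trained parameters of the parser.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def biaffine_score(h, s, W, U, V, b):
    """Bi-linear term h^T W s, plus linear terms U^T h and V^T s, plus bias b."""
    bilinear = sum(
        h[i] * W[i][j] * s[j]
        for i in range(len(h))
        for j in range(len(s))
    )
    return bilinear + dot(U, h) + dot(V, s) + b

# Toy values (assumed): a 2-dimensional decoder state and a
# 3-dimensional encoder state, so W has shape 2 x 3.
h = [1.0, 2.0]
s = [1.0, 0.0, -1.0]
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
U = [0.5, 0.5]
V = [1.0, 1.0, 1.0]
b = 0.25

score = biaffine_score(h, s, W, U, V, b)
```

Since `W` is rectangular, the two inputs need not share a dimension, which is convenient when each side is first transformed by its own MLP.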
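The StackPtr parser is trained to optimize the probability $P_\theta(\mathbf{y} \mid \mathbf{x})$ of the dependency tree $\mathbf{y} = \{p_1, \ldots, p_k\}$ given the sentence $\mathbf{x}$, which can be factorized as:
$$P_\theta(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{k} P_\theta(p_i \mid p_{<i}, \mathbf{x}) = \prod_{i=1}^{k} \prod_{j=1}^{l_i} P_\theta(c_{i,j} \mid c_{i,<j}, p_{<i}, \mathbf{x}) \tag{2}$$
where $\theta$ represents model parameters, $p_{<i}$ denotes the preceding paths that have already been generated, $c_{i,j}$ represents the $j$th word in path $p_i$ (of length $l_i$), and $c_{i,<j}$ denotes all the preceding words on the path $p_i$. Thus, the StackPtr parser is an autoregressive model, like sequence-to-sequence models, but it factors the distribution according to a top-down tree structure as opposed to a left-to-right chain. We define $P_\theta(c_{i,j} \mid c_{i,<j}, p_{<i}, \mathbf{x}) = a^t$, where the attention vector $a^t$ (of dimension $n$) is used as the distribution over the indices of words in a sentence.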
Our parser is trained by optimizing the conditional likelihood in Eq. (2), which is implemented as the cross-entropy loss.
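The link between the factorized likelihood and the cross-entropy loss can be illustrated as follows: the negative log-likelihood of a tree is the sum, over decoding steps, of the negative log-probability that the attention vector assigns to the gold position. The function name `pointer_nll` and the toy attention vectors are assumptions for illustration only.

```python
import math

def pointer_nll(attention_vectors, gold_positions):
    """Sum over steps of -log(probability assigned to the gold position)."""
    return -sum(
        math.log(a[g]) for a, g in zip(attention_vectors, gold_positions)
    )

# Toy example (assumed values): two decoding steps over three positions.
attention_vectors = [
    [0.5, 0.25, 0.25],
    [0.1, 0.8, 0.1],
]
gold_positions = [0, 1]

loss = pointer_nll(attention_vectors, gold_positions)
likelihood = math.exp(-loss)  # product of the per-step gold probabilities
```

Minimizing this sum is equivalent to maximizing the product of per-step probabilities, which is the form the factorization takes once each step is treated as a classification over input positions.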
We train a separate multi-class classifier in parallel to predict the dependency labels. Following Dozat and Manning (2017), the classifier takes the information of the head word and its child as features. The label classifier is trained simultaneously with the parser by optimizing the sum of their objectives.
The number of decoding steps to build a parse tree for a sentence of length $n$ is $2n-1$, linear in $n$. Together with the attention mechanism (at each step, we need to compute the attention vector $a^t$, whose runtime is $O(n)$), the time complexity of the decoding algorithm is $O(n^2)$, which is more efficient than graph-based parsers that have $O(n^3)$ or worse complexity when using dynamic programming or maximum spanning tree (MST) decoding algorithms.
When humans comprehend a natural language sentence, they arguably do it in an incremental, left-to-right manner. However, when humans consciously annotate a sentence with syntactic structure, they rarely process it in a fixed left-to-right order. Rather, they start by reading the whole sentence, then seek the main predicates, jumping back and forth over the sentence and recursively proceeding to the sub-tree structures governed by certain head words. Our parser follows a similar kind of annotation process: it starts by reading the whole sentence, then proceeds in a top-down manner, finding the main predicates first and only then searching for the sub-trees governed by them. When making later decisions, the parser has access to the entire structure built in earlier steps.
Parameter optimization is performed with the Adam optimizer (Kingma and Ba, 2014). We choose an initial learning rate of 0.001 (see Appendix A). The learning rate is annealed by multiplying it by a fixed decay rate when parsing performance stops increasing on validation sets. To reduce the effects of exploding gradients, we use gradient clipping (Pascanu et al., 2013).
To mitigate overfitting, we apply dropout (Srivastava et al., 2014; Ma et al., 2017). For the BLSTM, we use recurrent dropout (Gal and Ghahramani, 2016) with a drop rate of 0.33 between hidden states and 0.33 between layers. Following Dozat and Manning (2017), we also use embedding dropout with a rate of 0.33 on all word, character, and POS embeddings.
Some parameters are chosen from those reported in Dozat and Manning (2017). We use the same hyper-parameters across the models on different treebanks and languages, due to time constraints. The details of the chosen hyper-parameters for all experiments are summarized in Appendix A.
We evaluate our StackPtr parser mainly on three treebanks: the English Penn Treebank (PTB version 3.0) (Marcus et al., 1993), the Penn Chinese Treebank (CTB version 5.1) (Xue et al., 2002), and the German CoNLL 2009 corpus (Hajič et al., 2009). We use the same experimental settings as Kuncoro et al. (2016).
To make a thorough empirical comparison with previous studies, we also evaluate our system on treebanks from the CoNLL shared tasks and the Universal Dependency (UD) Treebanks (http://universaldependencies.org/). For the CoNLL Treebanks, we use the English treebank from the CoNLL-2008 shared task (Surdeanu et al., 2008) and all 13 treebanks from the CoNLL-2006 shared task (Buchholz and Marsi, 2006). The experimental settings are the same as Ma and Hovy (2015). For UD Treebanks, we select 12 languages. The details of the treebanks and experimental settings are in § 4.5 and Appendix B.
Parsing performance is measured with five metrics: unlabeled attachment score (UAS), labeled attachment score (LAS), unlabeled complete match (UCM), labeled complete match (LCM), and root accuracy (RA). Following previous work (Kuncoro et al., 2016; Dozat and Manning, 2017), we report results excluding punctuation for Chinese and English. For each experiment, we report the mean values with corresponding standard deviations over 5 repetitions.
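The attachment and complete-match metrics can be sketched as below, using their standard definitions: UAS and LAS are per-word accuracies of the head and of the (head, label) pair, while UCM and LCM are the fractions of sentences that are entirely correct. The function name `evaluate` and the toy trees are hypothetical; punctuation filtering and root accuracy are omitted from this sketch.

```python
def evaluate(gold_trees, pred_trees):
    """Each tree is a list of (head, label) pairs, one per word (0 = root).

    Returns UAS and LAS over words, UCM and LCM over sentences.
    Punctuation filtering and root accuracy are left out of this sketch.
    """
    words = ua = la = ucm = lcm = 0
    for gold, pred in zip(gold_trees, pred_trees):
        u = sum(g[0] == p[0] for g, p in zip(gold, pred))  # correct heads
        l = sum(g == p for g, p in zip(gold, pred))        # head and label
        words += len(gold)
        ua += u
        la += l
        ucm += int(u == len(gold))
        lcm += int(l == len(gold))
    n = len(gold_trees)
    return {"UAS": ua / words, "LAS": la / words,
            "UCM": ucm / n, "LCM": lcm / n}

# Toy data (assumed): two sentences, one label error in the first.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")],
        [(0, "root"), (1, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "iobj")],
        [(0, "root"), (1, "obj")]]

metrics = evaluate(gold, pred)
```

A single wrong label lowers LAS only slightly but removes the whole sentence from LCM, which is why the complete-match metrics are a stricter view of sentence-level quality.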
For fair comparison of the parsing performance, we re-implemented the graph-based Deep Biaffine (BiAF) parser (Dozat and Manning, 2017), which achieved state-of-the-art results on a wide range of languages. Our re-implementation adds character-level information using the same LSTM-CNN encoder as our model (§ 3.2) to the original BiAF model, which boosts its performance on all languages.
We first conduct experiments to demonstrate the effectiveness of our neural architecture by comparing with the strong baseline BiAF. We compare the performance of four variations of our model with different decoder inputs (Org, +gpar, +sib and Full), where the Org model utilizes only the encoder hidden states of head words, while the +gpar and +sib models augment the original one with grandparent and sibling information, respectively. The Full model includes all three kinds of information as inputs.
Figure 2 illustrates the performance (five metrics) of different variations of our StackPtr parser together with the results of the baseline BiAF re-implemented by us, on the test sets of the three languages. On UAS and LAS, the Full variation of StackPtr with decoding beam size 10 outperforms BiAF on Chinese, and obtains competitive performance on English and German. An interesting observation is that the Full model achieves the best accuracy on English and Chinese, but performs slightly worse than +sib on German. This shows that the importance of higher-order information varies across languages. On LCM and UCM, StackPtr significantly outperforms BiAF on all languages, showing the superiority of our parser on complete sentence parsing. The results of our parser on RA are slightly worse than those of BiAF. More detailed results are provided in Appendix C.
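|System||Type (T: transition-based, G: graph-based)||English UAS||English LAS||Chinese UAS||Chinese LAS||German UAS||German LAS|
|Chen and Manning (2014)||T||91.8||89.6||83.9||82.4||–||–|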
|Ballesteros et al. (2015)||T||91.63||89.44||85.30||83.72||88.83||86.10|
|Dyer et al. (2015)||T||93.1||90.9||87.2||85.7||–||–|
|Bohnet and Nivre (2012)||T||93.33||91.22||87.3||85.9||91.4||89.4|
|Ballesteros et al. (2016)||T||93.56||91.42||87.65||86.21||–||–|
|Kiperwasser and Goldberg (2016)||T||93.9||91.9||87.6||86.1||–||–|
|Weiss et al. (2015)||T||94.26||92.41||–||–||–||–|
|Andor et al. (2016)||T||94.61||92.79||–||–||90.91||89.15|
|Kiperwasser and Goldberg (2016)||G||93.1||91.0||86.6||85.1||–||–|
|Wang and Chang (2016)||G||94.08||91.82||87.55||86.23||–||–|
|Cheng et al. (2016)||G||94.10||91.49||88.1||85.7||–||–|
|Kuncoro et al. (2016)||G||94.26||92.06||88.87||87.30||91.60||89.24|
|Ma and Hovy (2017)||G||94.88||92.98||89.05||87.74||92.58||90.54|
|BiAF: Dozat and Manning (2017)||G||95.74||94.08||89.30||88.23||93.46||91.44|
Table 1 illustrates the UAS and LAS of the four versions of our model (with decoding beam size 10) on the three treebanks, together with previous top-performing systems for comparison. Note that the results of StackPtr and our re-implementation of BiAF are the average of 5 repetitions instead of a single run. Our Full model significantly outperforms all the transition-based parsers on all three languages, and achieves better results than most graph-based parsers. Our re-implementation of BiAF obtains better performance than the original one in Dozat and Manning (2017), demonstrating the effectiveness of the character-level information. Our model achieves state-of-the-art performance on both UAS and LAS on Chinese, and best UAS on English. On German, the performance is competitive with BiAF, and significantly better than other models.
|Language||Bi-Att UAS [LAS]||NeuroMST UAS [LAS]||BiAF UAS [LAS]||StackPtr UAS [LAS]||Best Published UAS||Best Published LAS|
|ar||80.34 [68.58]||80.80 [69.40]||82.15±0.34 [71.32±0.36]||83.04±0.29 [72.94±0.31]||81.12||–|
|bg||93.96 [89.55]||94.28 [90.60]||94.62±0.14 [91.56±0.24]||94.66±0.10 [91.40±0.08]||94.02||–|
|zh||–||93.40 [90.10]||94.05±0.27 [90.89±0.22]||93.88±0.24 [90.81±0.55]||93.04||–|
|cs||91.16 [85.14]||91.18 [85.92]||92.24±0.22 [87.85±0.21]||92.83±0.13 [88.75±0.16]||91.16||85.14|
|da||91.56 [85.53]||91.86 [87.07]||92.80±0.26 [88.36±0.18]||92.08±0.15 [87.29±0.21]||92.00||–|
|nl||87.15 [82.41]||87.85 [84.82]||90.07±0.18 [87.24±0.17]||90.10±0.27 [87.05±0.26]||87.39||–|
|en||–||94.66 [92.52]||95.19±0.05 [93.14±0.05]||93.25±0.05 [93.17±0.05]||93.25||–|
|de||92.71 [89.80]||93.62 [91.90]||94.52±0.11 [93.06±0.11]||94.77±0.05 [93.21±0.10]||92.71||89.80|
|ja||93.44 [90.67]||94.02 [92.60]||93.95±0.06 [92.46±0.07]||93.38±0.08 [91.92±0.16]||93.80||–|
|pt||92.77 [88.44]||92.71 [88.92]||93.41±0.08 [89.96±0.24]||93.57±0.12 [90.07±0.20]||93.03||–|
|sl||86.01 [75.90]||86.73 [77.56]||87.55±0.17 [78.52±0.35]||87.59±0.36 [78.85±0.53]||87.06||–|
|es||88.74 [84.03]||89.20 [85.77]||90.43±0.13 [87.08±0.14]||90.87±0.26 [87.80±0.31]||88.75||84.03|
|sv||90.50 [84.05]||91.22 [86.92]||92.22±0.15 [88.44±0.17]||92.49±0.21 [89.01±0.22]||91.85||85.26|
|tr||78.43 [66.16]||77.71 [65.81]||79.84±0.23 [68.63±0.29]||79.56±0.22 [68.03±0.15]||78.43||66.16|
In this section, we characterize the errors made by BiAF and StackPtr by presenting a number of experiments that relate parsing errors to a set of linguistic and structural properties. For simplicity, we follow McDonald and Nivre (2011) and report labeled parsing metrics (either accuracy, precision, or recall) for all experiments.
Following McDonald and Nivre (2011), we analyze parsing errors related to structural factors.
Figure 3 (b) measures the precision and recall relative to dependency lengths. While the graph-based BiAF parser still performs better for longer dependency arcs and the transition-based StackPtr parser does better for shorter ones, the gap between the two systems is marginal, much smaller than that shown in McDonald and Nivre (2011). One possible reason is that, unlike traditional transition-based parsers that scan the sentence from left to right, StackPtr processes it in a top-down manner, thus sometimes unnecessarily creating shorter dependency arcs first.
Figure 3 (c) plots the precision and recall of each system for arcs of varying distance to the root. Different from the observation in McDonald and Nivre (2011), StackPtr does not show an obvious advantage on the precision for arcs further away from the root. Furthermore, the StackPtr parser does not have the tendency to over-predict root modifiers reported in McDonald and Nivre (2011). This behavior can be explained using the same reasoning as above: the fact that arcs further away from the root are usually constructed early in the parsing algorithm of traditional transition-based parsers is not true for the StackPtr parser.
The only prerequisite information that our parsing model relies on is POS tags. With the goal of achieving an end-to-end parser, we explore the effect of POS tags on parsing performance. We run experiments on PTB using our StackPtr parser with gold-standard POS tags, with predicted POS tags, and without tags. StackPtr in these experiments is the Full model with beam size 10.
Table 2 gives results of the parsers with different versions of POS tags on the test data of PTB. The parser with gold-standard POS tags significantly outperforms the other two parsers, showing that dependency parsers can still benefit from accurate POS information. The parser with predicted (imperfect) POS tags, however, performs even slightly worse than the parser without POS tags. This illustrates that an end-to-end parser that does not rely on POS information can obtain competitive (or even better) performance compared with parsers using imperfect predicted POS tags, even if the POS tagger has relatively high accuracy on PTB.
Table 3 summarizes the parsing results of our model on the test sets of 14 treebanks from the CoNLL shared task, along with the state-of-the-art baselines. Along with BiAF, we also list the performance of the bi-directional attention based Parser (Bi-Att) (Cheng et al., 2016) and the neural MST parser (NeuroMST) (Ma and Hovy, 2017) for comparison. Our parser achieves state-of-the-art performance on both UAS and LAS on eight languages — Arabic, Czech, English, German, Portuguese, Slovene, Spanish, and Swedish. On Bulgarian and Dutch, our parser obtains the best UAS. On other languages, the performance of our parser is competitive with BiAF, and significantly better than others. The only exception is Japanese, on which NeuroMST obtains the best scores.
For UD Treebanks, we select 12 languages — Bulgarian, Catalan, Czech, Dutch, English, French, German, Italian, Norwegian, Romanian, Russian and Spanish. For all the languages, we adopt the standard training/dev/test splits, and use the universal POS tags (Petrov et al., 2012) provided in each treebank. The statistics of these corpora are provided in Appendix B.
Table 4 summarizes the results of the StackPtr parser, along with BiAF for comparison, on both the development and test datasets for each language. First, both BiAF and StackPtr achieve relatively high parsing accuracies on all 12 languages, all with UAS higher than 90%. On nine languages (Catalan, Czech, Dutch, English, French, German, Norwegian, Russian and Spanish), StackPtr outperforms BiAF for both UAS and LAS. On Bulgarian, StackPtr achieves slightly better UAS while its LAS is slightly worse than that of BiAF. On Italian and Romanian, BiAF obtains marginally better parsing performance than StackPtr.
In this paper, we proposed StackPtr, a transition-based neural network architecture for dependency parsing. Combining pointer networks with an internal stack to track the status of the top-down, depth-first search in the decoding procedure, the StackPtr parser is able to capture information from the whole sentence and all the previously derived subtrees, removing the left-to-right restriction of classical transition-based parsers, while keeping the number of parsing steps linear in the sentence length. Experimental results on 29 treebanks show the effectiveness of our parser across 20 languages, with state-of-the-art performance on 21 corpora.
There are several potential directions for future work. First, we intend to consider how to conduct experiments to improve the analysis of parsing errors qualitatively and quantitatively. Another interesting direction is to further improve our model by exploring reinforcement learning approaches to learn an optimal order for the children of head words, instead of using a predefined fixed order.
The authors thank Chunting Zhou, Di Wang and Zhengzhong Liu for their helpful discussions. This research was supported in part by DARPA grant FA8750-18-2-0018 funded under the AIDA program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL-2015 (Volume 1: Long Papers). Beijing, China, pages 334–343.
A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of CoNLL-2009: Shared Task. pages 1–18.
Low-rank tensors for scoring dependency structures. In Proceedings of ACL-2014 (Volume 1: Long Papers). Baltimore, Maryland, pages 1381–1391.
The Journal of Machine Learning Research 15(1):1929–1958.
Statistical dependency analysis with support vector machines. In Proceedings of IWPT. Nancy, France, volume 3, pages 195–206.
Table 5 summarizes the chosen hyper-parameters used for all the experiments in this paper. Some parameters are chosen directly or similarly from those reported in Dozat and Manning (2017). We use the same hyper-parameters across the models on different treebanks and languages, due to time constraints.
|Layer||Hyper-parameter||Value|
|CNN||number of filters||50|
|MLP||arc MLP size||512|
|MLP||label MLP size||128|
|Dropout||LSTM hidden states||0.33|
|Learning||initial learning rate||0.001|
Table 6 shows the corpora statistics of the treebanks for 12 languages. For evaluation, we report results excluding punctuation, i.e. any token with the POS tag “PUNCT” or “SYM”.
|Language||Corpora||Split||#Sent||#Token (w/o punct)|
|Czech||PDT, CAC||Training||102,993||1,806,230 (1,542,805)|
Table 7 gives the details of the experimental results. For each StackPtr parsing model, we ran experiments with decoding beam sizes of 1, 5, and 10. For each experiment, we report the mean values with corresponding standard deviations over 5 runs.