1 Introduction
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

Non-local features have been shown crucial for statistical parsing [Huang2008a, Zhang and Nivre2011]. For dependency parsing, high-order dynamic programming [Koo and Collins2010], integer linear programming [Martins et al.2010] and dual decomposition [Koo et al.2010] techniques have been exploited by graph-based parsers to integrate non-local features. Transition-based parsers [Nivre2003, Nivre2008, Zhang and Nivre2011, Bohnet2010, Huang et al.2012] are also known for leveraging non-local features to achieve high accuracies. For most state-of-the-art statistical parsers, a global training objective over the entire parse tree is defined to avoid label bias [Lafferty et al.2001].

For neural parsing, on the other hand, local models have been shown to give highly competitive accuracies [Cross and Huang2016b, Stern et al.2017] compared with those that employ long-range features [Watanabe and Sumita2015, Zhou et al.2015, Andor et al.2016, Durrett and Klein2015]. Highly local features have been used in recent state-of-the-art models [Stern et al.2017, Dozat and Manning2016, Shi et al.2017].
In particular, Dozat and Manning (2016) show that a locally trained arc-factored model can give the best reported accuracies on dependency parsing. This surprising result has been largely attributed to the representation power of long short-term memory (LSTM) encoders [Kiperwasser and Goldberg2016b].

An interesting research question is to what extent the encoding power can be leveraged for constituent parsing. We investigate this problem by building a chart-based model that is local to unlabeled constituent spans [Abney1991] and CFG rules, which have been explored by early PCFG models [Collins2003, Klein and Manning2003]. In particular, our models first predict unlabeled CFG trees leveraging biaffine modelling [Dozat and Manning2016]. Then, constituent labels are assigned on the unlabeled trees using a tree-LSTM to encode the syntactic structure, and an LSTM decoder to yield the label sequence, which can include unary rules, on each node.
Experiments show that our conceptually simple models give highly competitive performance compared with the state of the art. Our best models give labeled bracketing F1 scores of 92.4% on the PTB test set and 87.3% on the CTB 5.1 test set, without reranking, ensembling or external parses. We release our code at https://github.com/zeeeyang/twolocalneuralconparsers.
2 Model
Our models consist of an unlabeled binarized tree parser and a label generator. Figure 1 shows a running example of our parsing model. The unlabeled parser (Figure 1(a), 1(b)) learns an unlabeled parse tree using simple BiLSTM encoders [Hochreiter and Schmidhuber1997]; the label generator (Figure 1(c), 1(d)) predicts constituent labels for each span in the unlabeled tree using tree-LSTM models. Figure 1(a) shows constituent span classifiers making 0/1 decisions for all possible spans. Based on the local classification probabilities, we obtain an unlabeled binarized parse tree (Figure 1(b)) using a binary CKY parsing algorithm. We then hierarchically generate labels for each span (Figure 1(c)) using encoder-decoder models. Figure 1(d) shows the final output parse tree after debinarization.
In particular, we design two different classification models for unlabeled parsing: the span model (Section 2.1) and the rule model (Section 2.2). The span model estimates the probability of an arbitrary span [i, j] being a constituent span. Ideally, our model assigns a high probability to a span that belongs to the correct parse tree (Figure 1(d)), and labels a span that is not a valid constituent span with 0. Different from the span model, the rule model considers the probability of the production rule by which the span [i, j] is composed of two child spans [i, k] and [k+1, j], where i ≤ k < j. For example, in Figure 1(a), the rule model assigns a high probability to the correct split of a span rather than to its competing splits. Given the local probabilities, we use the CKY algorithm to find the unlabeled binarized parses.
The label generator encodes a binarized unlabeled tree to predict constituent labels for every span. The encoder is a binary tree-LSTM [Tai et al.2015, Zhu et al.2015], which recursively composes the representation vectors for tree nodes bottom-up. Based on the representation vector of a constituent span, an LSTM decoder [Cho et al.2014, Sutskever et al.2014] generates a chain of constituent labels, which can represent unary rules. For example, the decoder outputs “VP S </L>” for the span [4, 4] and “NP </L>” for the span [0, 2] in Figure 1(c), where </L> is a stopping symbol.

2.1 Span Model
Given an unlabeled binarized tree Y for the sentence X = w_0 w_1 … w_{n−1}, the span model trains a neural network classifier p(c | X, i, j; θ) to distinguish constituent spans from non-constituent spans, where 0 ≤ i < j ≤ n−1 and c ∈ {0, 1}. Here c = 1 indicates that the span [i, j] is a constituent span ([i, j] ∈ Y), c = 0 otherwise, and θ are model parameters. We do not model spans of length 1, since such a span always belongs to Y.

Network Structure. Figure 2(a) shows the neural network structure of the binary classification model. At the bottom, a bidirectional LSTM layer encodes the input sentence to extract non-local features. In particular, we append a starting symbol ⟨s⟩ and an ending symbol ⟨/s⟩ to the left-to-right LSTM and the right-to-left LSTM, respectively. We denote the output hidden vectors of the left-to-right and right-to-left LSTMs for the word w_i as f_i and b_i, respectively. We obtain the representation vector of the span [i, j] by simply concatenating the bidirectional output vectors at the input words w_i and w_j:
r_{i,j} = [f_i; b_i; f_j; b_j].    (1)
r_{i,j} is then passed through a non-linear transformation layer, and the probability distribution p(· | X, i, j) is given by

h_{i,j} = g(W_1 r_{i,j} + b_1),
p(· | X, i, j) = softmax(W_2 h_{i,j} + b_2),    (2)

where W_1, W_2, b_1 and b_2 are model parameters and g is a non-linear activation function.
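As a concrete illustration, the span classifier above can be sketched in NumPy. This is a minimal sketch under assumed toy dimensions; the variable names (`fwd`, `bwd`, `W1`, …) and the `tanh` activation are illustrative assumptions, not the paper's implementation details.

```python
import numpy as np

rng = np.random.default_rng(0)

def span_probability(fwd, bwd, i, j, W1, b1, W2, b2):
    """p(c | X, i, j): concatenate the BiLSTM output vectors at words
    w_i and w_j (Eq 1), apply one non-linear layer and a softmax
    over c in {0, 1} (Eq 2)."""
    r = np.concatenate([fwd[i], bwd[i], fwd[j], bwd[j]])
    hidden = np.tanh(W1 @ r + b1)
    logits = W2 @ hidden + b2
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

# Toy setting: 5 words, BiLSTM hidden size 4, classifier hidden size 8.
n, h, d = 5, 4, 8
fwd, bwd = rng.normal(size=(n, h)), rng.normal(size=(n, h))
W1, b1 = rng.normal(size=(d, 4 * h)), np.zeros(d)
W2, b2 = rng.normal(size=(2, d)), np.zeros(2)
p = span_probability(fwd, bwd, 1, 3, W1, b1, W2, b2)  # p(c | X, 1, 3)
```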
Input Representation. Words and part-of-speech (POS) tags are combined to obtain the input representation vectors. Given a word w, its corresponding characters and its POS tag t, we first obtain the word embedding e(w), the character embeddings and the POS tag embedding e(t) using lookup operations. Then a bidirectional LSTM is used to extract character-level features. Let the last output vectors of the left-to-right and right-to-left character LSTMs be c^f and c^b, respectively. The final input vector is given by

x = g(W_x [e(w); c^f; c^b; e(t)] + b_x),    (3)

where W_x and b_x are model parameters and g is a non-linear activation function.
Training Objective. The training objective is to maximize the probabilities p(1 | X, i, j) for spans [i, j] ∈ Y and, at the same time, maximize the probabilities p(0 | X, i, j) for spans [i, j] ∉ Y. Formally, the training loss for binary span classification is given by

L = − Σ_{[i,j]∈Y} log p(1 | X, i, j) − Σ_{[i,j]∉Y} log p(0 | X, i, j).    (4)

For a sentence of length n, there are n(n−1)/2 terms in total in Eq 4.
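The loss and its term count can be verified with a short sketch; the uniform toy probabilities and the function name `span_loss` are assumptions for illustration, not the paper's implementation.

```python
import math

def span_loss(probs, gold_spans, n):
    """Binary span loss (Eq 4): maximize p(1|i,j) for gold constituent
    spans and p(0|i,j) = 1 - p(1|i,j) for all other spans [i, j], j > i."""
    loss, terms = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            p1 = probs[(i, j)]
            loss -= math.log(p1 if (i, j) in gold_spans else 1.0 - p1)
            terms += 1
    return loss, terms

n = 5
# Toy model that assigns probability 0.5 to every span.
probs = {(i, j): 0.5 for i in range(n) for j in range(i + 1, n)}
loss, terms = span_loss(probs, {(0, 4), (1, 2)}, n)  # terms == n*(n-1)//2
```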
Neural CKY Algorithm. The unlabeled production probability for the rule [i, j] → [i, k] [k+1, j] given by the binary classification model is

p([i, j] → [i, k] [k+1, j]) = p(1 | X, i, k) · p(1 | X, k+1, j).

During decoding, we find the optimal parse tree using the CKY algorithm. Note that our CKY algorithm differs from the standard CKY algorithm mainly in that no explicit phrase rule probabilities are involved. Hence our model can be regarded as a zero-order constituent tree model, which is the most local. All structural relations in a constituent tree must be implicitly captured by the BiLSTM encoder over the sentence alone.
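The decoding step can be sketched as a small CKY routine over the local span probabilities. This is an illustrative reconstruction, not the released code: `span_p` maps each span [i, j] to p(1 | X, i, j), and the best tree maximizes the sum of log-probabilities of its spans.

```python
import math

def cky_unlabeled(span_p, n):
    """Zero-order CKY: the best binarized tree maximizes the sum of
    log span_p[(i, j)] over its spans of length >= 2."""
    best, split = {}, {}
    for i in range(n):
        best[(i, i)] = 0.0                     # length-1 spans are free
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            k, s = max(((k, best[(i, k)] + best[(k + 1, j)])
                        for k in range(i, j)), key=lambda t: t[1])
            best[(i, j)] = math.log(span_p[(i, j)]) + s
            split[(i, j)] = k

    def brackets(i, j):                        # read off the best tree
        if i == j:
            return []
        k = split[(i, j)]
        return [(i, j)] + brackets(i, k) + brackets(k + 1, j)

    return brackets(0, n - 1)

# Toy 3-word sentence where span [1, 2] is far more likely than [0, 1].
p = {(0, 1): 0.1, (1, 2): 0.9, (0, 2): 1.0}
tree = cky_unlabeled(p, 3)  # right-branching: [(0, 2), (1, 2)]
```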
Multi-class Span Classification Model. The previous model performs binary classification to identify constituent spans. As a result, the classification model only captures the existence of constituent labels, but does not leverage constituent label type information. In order to incorporate syntactic label information into the span model, we use a multi-class classification model to describe the probability that ℓ is a constituent label of the span [i, j]. The network structure is the same as that of the binary span classification model except for the last layer. For the last layer, given the hidden vector h_{i,j} in Eq 2, p(ℓ | X, i, j) is calculated by

p(ℓ | X, i, j) = softmax_ℓ(W_m h_{i,j} + b_m),    (5)

where W_m and b_m are model parameters. The subscript ℓ picks the probability of the label ℓ. The training loss is
L = − Σ_{[i,j]∈Y} Σ_{ℓ∈L(i,j)} log p(ℓ | X, i, j) − Σ_{[i,j]∉Y} log p(∅ | X, i, j).    (6)

Note that, different from the first term in Eq 4, there is an additional sum over labels inside the first term of Eq 6, where L(i, j) denotes the label set of the span [i, j]. This is to say that we treat all constituent labels of a unary chain equally. For example, suppose there is a unary chain S–VP in the span [4, 4]. For this span, we hypothesize that both labels are plausible answers and pay equal attention to VP and S during training. For the second term in Eq 6, we maximize the probability of the ending label ∅ for non-constituent spans.
For decoding, we transform the multi-class probability distribution into a binary probability distribution using

p(1 | X, i, j) = Σ_{ℓ≠∅} p(ℓ | X, i, j),  p(0 | X, i, j) = p(∅ | X, i, j).

In this way, the probability of a span being a constituent span takes all possible syntactic labels into consideration.
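The conversion can be sketched as follows, assuming the distribution is stored in a dictionary and using a hypothetical ending label `</L>`:

```python
def to_binary(label_probs, empty_label="</L>"):
    """Collapse a multi-class span distribution into a binary one:
    p(1) sums the probabilities of all real constituent labels,
    p(0) is the probability of the ending/empty label."""
    p0 = label_probs[empty_label]
    p1 = sum(p for label, p in label_probs.items() if label != empty_label)
    return p1, p0

p1, p0 = to_binary({"NP": 0.5, "VP": 0.2, "S": 0.1, "</L>": 0.2})
```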
2.2 Rule Model
The rule model directly calculates the probabilities of all possible splitting points k (i ≤ k < j) for the span [i, j]. Suppose the partition score of splitting point k is s_k. The unlabeled production probability for the rule [i, j] → [i, k] [k+1, j] is given by a softmax distribution over splitting points:

p([i, j] → [i, k] [k+1, j]) = exp(s_k) / Σ_{k′=i}^{j−1} exp(s_{k′}).

The training objective is to minimize the negative log-likelihood of all unlabeled production rules.
The decoding algorithm is the standard CKY algorithm, which we omit here. The rule model can be regarded as a first-order constituent model, with the probability of each phrase rule being modeled. However, unlike structured learning algorithms [Finkel et al.2008, Carreras et al.2008], which use a global score for each tree, our model learns each production rule probability individually. Such local learning has traditionally been found subject to label bias [Lafferty et al.2001]. Our model relies solely on input representations for resolving this issue.
Span Representation. Figure 2(b) shows one possible network architecture for the rule model, taking a partition point k for the span [i, j] as an example. The BiLSTM encoder layer at the bottom is the same as that of the span classification model. We obtain the span representation vectors using difference vectors [Wang and Chang2016, Cross and Huang2016b]. Formally, the span representation vector is given by

r⁰_{i,j} = [f_j − f_{i−1}; b_i − b_{j+1}],
r_{i,j} = [r⁰_{i,j}; f_{i−1}; b_{j+1}].    (7)
We first combine the difference vectors f_j − f_{i−1} and b_i − b_{j+1} to obtain a simple span representation vector r⁰_{i,j}. In order to include more contextual information, such as f_{i−1} and b_{j+1}, we concatenate r⁰_{i,j}, f_{i−1} and b_{j+1} to produce the final span representation vector r_{i,j}. We then transform r_{i,j} to an output vector o^m_{i,j} using an activation function g:

o^m_{i,j} = g(W^m r_{i,j} + b^m),    (8)

where W^m and b^m are model parameters, and m is a parameter set index. We use separate parameters for the non-linear transformation layer: m ∈ {p, l, r} for the parent span [i, j], the left child span [i, k] and the right child span [k+1, j], respectively.
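A sketch of the difference-vector representation (Eq 7) and the role-specific transformation (Eq 8), under assumed toy dimensions; `fwd`/`bwd` stand for padded BiLSTM outputs, and ReLU stands in for the ELU activation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def span_repr(fwd, bwd, i, j):
    """Difference-vector span representation (Eq 7): the span core
    r0 = [fwd[j] - fwd[i-1]; bwd[i] - bwd[j+1]], concatenated with
    the boundary context vectors fwd[i-1] and bwd[j+1]."""
    r0 = np.concatenate([fwd[j] - fwd[i - 1], bwd[i] - bwd[j + 1]])
    return np.concatenate([r0, fwd[i - 1], bwd[j + 1]])

def role_transform(r, W, b):
    """Eq 8: one non-linear layer with role-specific parameters
    (separate W, b for parent, left child and right child).
    ReLU here is a stand-in for the ELU used in the paper."""
    return np.maximum(W @ r + b, 0.0)

# Padded BiLSTM outputs for a 5-word sentence (index 0 and n+1 hold
# the start and end symbols), hidden size 4, output size 8.
n, h = 5, 4
fwd, bwd = rng.normal(size=(n + 2, h)), rng.normal(size=(n + 2, h))
W, b = rng.normal(size=(8, 4 * h)), np.zeros(8)
o = role_transform(span_repr(fwd, bwd, 2, 4), W, b)
```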
After obtaining the span representation vectors, we use these vectors to calculate the partition score s_k. In particular, we investigate two scoring methods.
Linear Model. In the linear model, the partition score is calculated by a linear transformation. For the splitting point k,

s_k = v_l · o^l_{i,k} + v_r · o^r_{k+1,j} + b,

where v_l and v_r are two parameter vectors and b is a scalar parameter.
Biaffine Model. Since the number of possible splitting points varies with the length of the span, we also try a biaffine scoring model (shown in Figure 2(b)), which is good at handling variable-sized classification problems [Dozat and Manning2016, Ma and Hovy2017]. The biaffine model produces the score between the parent span and the left child span using a biaffine scorer:

s^l_k = (o^p_{i,j} ⊕ 1)^T W^l (o^l_{i,k} ⊕ 1),    (9)

where W^l is a parameter matrix and ⊕ denotes vector concatenation. Similarly, we calculate the score s^r_k between the parent span and the right child span using W^r, o^p_{i,j} and o^r_{k+1,j} as parameters. The overall partition score is therefore given by

s_k = s^l_k + s^r_k.
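The biaffine partition scoring can be sketched as follows; augmenting each vector with a constant 1 (to fold linear and bias terms into one matrix) and the toy dimensions are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def biaffine(x, W, y):
    """Biaffine score between two span vectors: [x; 1]^T W [y; 1]."""
    return np.append(x, 1.0) @ W @ np.append(y, 1.0)

def partition_distribution(o_parent, o_left, o_right, Wl, Wr):
    """Softmax over all splitting points k of a span: the score of k
    adds the parent-left and parent-right biaffine scores."""
    scores = np.array([
        biaffine(o_parent, Wl, o_left[k]) + biaffine(o_parent, Wr, o_right[k])
        for k in range(len(o_left))])
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    return exp / exp.sum()

d, n_splits = 6, 3
o_p = rng.normal(size=d)                     # parent span vector
o_l = rng.normal(size=(n_splits, d))         # left-child vectors o[i, k]
o_r = rng.normal(size=(n_splits, d))         # right-child vectors o[k+1, j]
Wl, Wr = rng.normal(size=(d + 1, d + 1)), rng.normal(size=(d + 1, d + 1))
p = partition_distribution(o_p, o_l, o_r, Wl, Wr)
```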
2.3 Label Generator
Lexicalized Tree-LSTM Encoder. As shown in Figure 1(c), we use a lexicalized tree-LSTM [Teng and Zhang2016] for encoding, which shows good representation abilities for unlabeled trees. The encoder first propagates lexical information from the two child spans to their parent using a lexical gate, then produces the representation vector of the parent span by composing the vectors of the child spans using a binarized tree-LSTM [Tai et al.2015, Zhu et al.2015]. Formally, the lexical vector x_{i,j} for the span [i, j] with the partition point at k is defined by

g_k = σ(W_g [x_{i,k}; x_{k+1,j}] + b_g),
x_{i,j} = g_k ⊙ x_{i,k} + (1 − g_k) ⊙ x_{k+1,j},

where W_g and b_g are model parameters, ⊙ is element-wise multiplication and σ is the logistic function. The lexical vector of a leaf node is the concatenation of the output vectors of the BiLSTM encoder and the input representation (Eq 3), as shown in Figure 1(c).
The output state vector h_p of the span [i, j] given by the binary tree-LSTM encoder is

(h_p, c_p) = TreeLSTM(x_p, (h_l, c_l), (h_r, c_r)).

Here the subscripts p, l and r denote the parent span [i, j], the left child span [i, k] and the right child span [k+1, j], respectively.
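A single binary tree-LSTM composition step in the style of Tai et al. (2015) can be sketched as below. The gate packing order, the single weight matrix and the toy dimensions are assumptions of this sketch, and the lexical gate described above is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_tree_lstm(x, hl, cl, hr, cr, W, b):
    """One binary tree-LSTM composition: input x for the parent span,
    (h, c) states of the left and right children. W maps [x; hl; hr]
    to five stacked gate pre-activations (i, f_l, f_r, o, u)."""
    z = W @ np.concatenate([x, hl, hr]) + b
    d = len(hl)
    i, fl, fr, o = (sigmoid(z[k * d:(k + 1) * d]) for k in range(4))
    u = np.tanh(z[4 * d:5 * d])
    c = i * u + fl * cl + fr * cr          # cell: gated input + child cells
    return np.tanh(c) * o, c               # hidden state, cell state

d = 4
x, hl, cl, hr, cr = (rng.normal(size=d) for _ in range(5))
W, b = rng.normal(size=(5 * d, 3 * d)), np.zeros(5 * d)
h, c = binary_tree_lstm(x, hl, cl, hr, cr, W, b)
```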
Label Decoder. Suppose that the constituent label chain for the span [i, j] is ℓ_1, ℓ_2, …, ℓ_T, followed by the ending symbol ∅. The decoder for the span learns a conditional language model over label chains, conditioned on the output vector h_p from the tree-LSTM encoder. Formally, the probability distribution for generating the label at time step t is given by

p(ℓ_t | ℓ_{<t}, h_p) = softmax(W_o s_t + b_o),  s_t = LSTM(s_{t−1}, [e(ℓ_{t−1}); h_p]),

where ℓ_{<t} is the decoding prefix, s_t is the state vector of the decoder LSTM and e(ℓ_{t−1}) is the embedding of the previous output label.
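Greedy label-chain generation can be sketched with the decoder LSTM replaced by a toy step-indexed distribution (an assumption purely for illustration); decoding stops at the ending symbol or after the maximum number of steps.

```python
def generate_labels(step_probs, stop="</L>", max_steps=4):
    """Greedy label-chain generation for one span: at each step pick
    the most probable label from step_probs(t, prefix) — a stand-in
    for the LSTM decoder — stopping at the ending symbol."""
    chain = []
    for t in range(max_steps):
        dist = step_probs(t, chain)
        label = max(dist, key=dist.get)
        if label == stop:
            break
        chain.append(label)
    return chain

# Toy distribution for a span whose gold unary chain is VP then S.
def toy(t, prefix):
    if t == 0:
        return {"VP": 0.7, "S": 0.2, "</L>": 0.1}
    if t == 1:
        return {"S": 0.6, "</L>": 0.4}
    return {"</L>": 1.0}

chain = generate_labels(toy)  # ["VP", "S"]
```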
The training objective is to minimize the negative log-likelihood of the label generation distribution:

L = − Σ_{[i,j]∈Y} Σ_t log p(ℓ_t | ℓ_{<t}, h_p).
2.4 Joint Training
In summary, each model contains an unlabeled structure predictor and a label generator; the latter is the same for all models. All the span models perform binary classification: the difference is that BinarySpan does not consider label information for unlabeled tree prediction, while MultiSpan guides unlabeled tree prediction with such information while simulating binary classification. The unlabeled parser and the label generator share parts of the network components, such as the word embeddings, character embeddings, POS embeddings and the BiLSTM encoding layer. We jointly train the unlabeled parser and the label generator for each model by minimizing the overall loss

L = L_parser + λ L_label,

where λ is a regularization hyperparameter, set separately for the binary span classification model, the multi-class model and the rule model.
3 Experiments
3.1 Experimental Settings
Data. We perform experiments on both English and Chinese. Following standard conventions, our English data are obtained from the Wall Street Journal (WSJ) portion of the Penn Treebank (PTB) [Marcus et al.1993]. Sections 02–21 are used for training, section 22 for development and section 23 for testing. Our Chinese data are version 5.1 of the Penn Chinese Treebank (CTB) [Xue et al.2005]. The training set consists of articles 001–270 and 440–1151, the development set contains articles 301–325 and the test set includes articles 271–300. We use automatically reassigned POS tags in the same way as Cross and Huang (2016b) for English and Dyer et al. (2016) for Chinese.
We use ZPar [Zhang and Clark2011] (https://github.com/SUTDNLP/ZPar) to binarize both the English and Chinese data with the head rules of Collins (2003). The head directions of the binarization results are ignored during training. After binarization, there are 52 English and 56 Chinese constituent span label types. The maximum number of greedy decoding steps for generating consecutive constituent labels is limited to 4 for both English and Chinese. We evaluate parsing performance in terms of both unlabeled and labeled bracketing metrics: unlabeled F1 (UF; excluding the sentence span [0, n−1] and all spans of length 1), labeled precision (LP), labeled recall (LR) and labeled bracketing F1 (LF) after debinarization, using EVALB (http://nlp.cs.nyu.edu/evalb).
Unknown words.
For English, we combine the methods of Dyer et al. (2016), Kiperwasser and Goldberg (2016b) and Cross and Huang (2016b) to handle unknown words. In particular, we first map all words (not just singleton words) in the training corpus into unknown word classes using the same rules as Dyer et al. (2016). During each training epoch, every word w in the training corpus is stochastically mapped into its corresponding unknown word class with probability z / (z + c(w)), where c(w) is the frequency count of w and z is a control parameter. Intuitively, the more times a word appears, the less likely it is to be mapped into its unknown word class. There are 54 unknown word types for English. The value of z follows Cross and Huang (2016b). For Chinese, we simply use one unknown word type, replacing singleton words dynamically with a probability of 0.5.
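The stochastic replacement can be sketched as below; the exact z / (z + c) form and the single `<unk>` placeholder are assumptions reconstructed from the description above.

```python
import random

def maybe_unk(word, counts, z=0.5, rng=random.Random(0)):
    """Replace a training word by its unknown-word class with
    probability z / (z + c), where c is the word's frequency count,
    so frequent words are replaced less often."""
    c = counts[word]
    return "<unk>" if rng.random() < z / (z + c) else word

counts = {"the": 1000, "parsimony": 1}
p_rare = 0.5 / (0.5 + counts["parsimony"])  # 1/3 for a singleton
p_freq = 0.5 / (0.5 + counts["the"])        # ~0.0005 for a frequent word
```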

Hyperparameter  Value  Hyperparameter  Value
Word embeddings  English: 100, Chinese: 80  Word LSTM layers  2
Word LSTM hidden units  200  Character embeddings  20
Character LSTM layers  1  Character LSTM hidden units  25
Tree-LSTM hidden units  200  POS tag embeddings  32
Constituent label embeddings  32  Label LSTM layers  1
Label LSTM hidden units  200  Last output layer hidden units  128
Maximum training epochs  50  Dropout  English: 0.5, Chinese: 0.3
Trainer  SGD  Initial learning rate  0.1
Per-epoch decay  0.05  Activation function  ELU [Clevert et al.2015]
Hyperparameters. Table 1 shows all hyperparameters. These values are tuned on the corresponding development sets. We optimize our models with stochastic gradient descent (SGD); the initial learning rate is 0.1. Our models are initialized with pretrained word embeddings for both English and Chinese, the same as those used by Dyer et al. (2016). The other parameters are initialized according to the default settings of DyNet [Neubig et al.2017]. We apply dropout [Srivastava et al.2014] to the inputs of every LSTM layer, including the word LSTM layers, the character LSTM layers, the tree-structured LSTM layers and the constituent label LSTM layers. For Chinese, we find that 0.3 is a good choice for the dropout probability. The number of training epochs is decided by the evaluation performance on the development set. In particular, we evaluate on the development set after every 10,000 examples; training stops when none of 20 consecutive evaluations improves on the previous best result.

3.2 Development Results
We study the two span representation methods, namely the simple concatenation representation (Eq 1) and the combination of difference vectors (Eq 7), on two representative models, i.e., the binary span classification model (BinarySpan) and the biaffine rule model (BiaffineRule), to investigate appropriate representations for the different models on the English dev set. Table 2 shows the effects of the different span representation methods: Eq 1 is better for BinarySpan, while Eq 7 is better for BiaffineRule. When using Eq 7 for BinarySpan, the performance drops greatly; a similar observation holds when replacing Eq 7 with Eq 1 for BiaffineRule. Therefore, we use Eq 1 for the span models and Eq 7 for the rule models in later experiments.
Table 3 shows the main results on the English and Chinese dev sets. For English, BinarySpan achieves a 92.17 LF score. The multi-class span classifier (MultiSpan) is much better than BinarySpan due to its awareness of label information; a similar phenomenon can be observed on the Chinese dataset. We also test the linear rule (LinearRule) method. For English, LinearRule obtains a 92.03 LF score, which is much worse than BiaffineRule. In general, the performances of BiaffineRule and MultiSpan are quite close for both English and Chinese.
For MultiSpan, both the first stage (unlabeled tree prediction) and the second stage (label generation) exploit constituent types. We design three development experiments to investigate what the accuracy would be if the predicted labels of the first stage were directly used in the second stage. The first setting does not feed the label probabilities of the first stage to the second stage. In the second setting, we directly use the model trained under the first setting for decoding, summing the label classification probabilities of the first stage and the label generation probabilities of the second stage in order to make label decisions. In the third setting, we sum the label probabilities for the second stage both during training and decoding. These settings give LF scores of 92.44, 92.49 and 92.44, respectively, which are very similar. We choose the first one due to its simplicity.
3.3 Main Results
English. Table 4 summarizes the performance of various constituent parsers on the PTB test set. BinarySpan achieves a 92.1 LF score, outperforming the neural CKY parsing model [Durrett and Klein2015] and the top-down neural parser [Stern et al.2017]. MultiSpan and BiaffineRule obtain similar performance; both are better than BinarySpan. MultiSpan obtains a 92.4 LF score, which is very close to the state-of-the-art result when no external parses are included. An interesting observation is that the model of Stern et al. (2017) shows a higher LP score than our models (93.2 vs. 92.6), while our models give better LR scores (92.2 vs. 90.4). This potentially suggests that global constraints, such as the structured label loss used by Stern et al. (2017), help make careful decisions, while our local models are likely to strike a better balance between bold guesses and accurate scoring of constituent spans. Table 6 shows the unlabeled parsing accuracies on the PTB test set. MultiSpan performs the best, with a 92.50 UF score. When the unlabeled parse is 100% correct, BiaffineRule is better than the other two, producing an oracle LF score of 97.12%, which shows the robustness of our label generator. The decoding speeds of BinarySpan and MultiSpan are similar, reaching about 21 sentences per second. BiaffineRule is much slower than the span models.
Parser  LR  LP  LF
zhu2013acl (S)  91.1  91.5  91.3
mcclosky2006rerank (S)  92.1  92.5  92.3
choe2016parsinglm (S, R, E)  –  –  93.8
durrett2015crf (S)  –  –  91.1
vinyals2015grammar (S, E)  –  –  92.8
charniak2005rerank (S, R)  91.2  91.8  91.5
huang2008forest (R)  –  –  91.7
huang2009selftraining (ST)  91.1  91.6  91.3
huang2010product (ST)  92.7  92.2  92.5
shindo2012refined (E)  –  –  92.4
socher2013parsing (R)  –  –  90.4
dyer2016rnng (R)  –  –  93.3
kuncoro2017rnng (R)  –  –  93.6
liu2017inorder (R)  –  –  94.2
friedsternklein:2017:Short (E, S)  –  –  94.7

Parser  LR  LP  LF
charniak2000maximum  89.5  89.9  89.5
collins2003head  88.1  88.3  88.2
sagae2006combine  87.8  88.1  87.9
petrov2007unlex  90.1  90.2  90.1
carreras2008tag  90.7  91.4  91.1
zhu2013acl  90.2  90.7  90.4
watanabe2015transition  –  –  90.7
fernandezgonzalezmartins:2015:ACLIJCNLP  89.9  90.4  90.2
james2016span  90.5  92.1  91.3
kuncoro2017rnng  –  –  91.2
liu2016lookahead  91.3  92.1  91.7
sternandreasklein:2017:Long topdown  90.4  93.2  91.8
BinarySpan  91.9  92.2  92.1
MultiSpan  92.2  92.5  92.4
BiaffineRule  92.0  92.6  92.3
Chinese. Table 5 shows the parsing performance on the CTB 5.1 test set. Under the same settings, all three of our models outperform the state-of-the-art neural models [Dyer et al.2016, Liu and Zhang2017a]. Compared with the in-order transition-based parser, our best model improves the labeled F1 score by 1.2 points. In addition, MultiSpan and BiaffineRule achieve better performance than the reranking system using recurrent neural network grammars [Dyer et al.2016] and methods that perform joint POS tagging and parsing [Wang and Xue2014, Wang et al.2015].
Parser  LR  LP  LF
charniak2005rerank (R)  80.8  83.8  82.3
zhu2013acl (S)  84.4  86.8  85.6
wang2015feature (S)  –  –  86.6
huang2009selftraining (ST)  –  –  85.2
dyer2016rnng (R)  –  –  86.9

Parser  LR  LP  LF
petrov2007unlex  81.9  84.8  83.3
zhang2009tran  78.6  78.0  78.3
watanabe2015transition  –  –  84.3
dyer2016rnng  –  –  84.6
liu2016lookahead  85.2  85.9  85.5
liu2017inorder  –  –  86.1
BinarySpan  85.9  87.1  86.5
MultiSpan  86.6  88.0  87.3
BiaffineRule  87.1  87.5  87.3
4 Analysis
Constituent Labels. Table 7 shows the LF scores for eight major constituent labels on the PTB test set. BinarySpan consistently underperforms the other two models. The error distributions of MultiSpan and BiaffineRule are different. For constituent labels including SBAR, WHNP and QP, BiaffineRule is the winner. This is likely because the partition point distributions of these labels are less trivial than those of other labels. For NP, PP, ADVP and ADJP, MultiSpan obtains better scores than BiaffineRule, showing the importance of explicit type information for correctly identifying these labels. In addition, the three models give similar performance on VP and S, indicating that simple local classifiers might be sufficient for these two labels.
LF vs. Length. Figure 3 and Figure 4 show the LF score distributions against sentence length and span length on the PTB test set, respectively. We also include the output of the previous state-of-the-art top-down neural parser [Stern et al.2017] and the reranking results of the transition-based neural generative parser (RNNG) [Dyer et al.2016], which represent models that can access more global information. For sentence length, the overall trends of the five models are similar: the LF score decreases as the length increases, but there is no salient difference in the rate of decline (this also holds for span length), demonstrating that our local models can alleviate the label bias problem. BiaffineRule outperforms the other three models (except RNNG) when the sentence length is less than 30 or the span length is less than 4. This suggests that when the length is short, the rule model can easily recognize the partition point. When the sentence length is greater than 30 or the span length is greater than 10, MultiSpan becomes the best option (except RNNG), showing that for long spans, the constituent label information is useful.
5 Related Work
Globally trained discriminative models have given highly competitive accuracies for graph-based constituent parsing. The key is to explicitly consider connections between output substructures in order to avoid label bias. State-of-the-art statistical methods use a single model to score a feature representation for all phrase-structure rules in a parse tree [Taskar et al.2004, Finkel et al.2008, Carreras et al.2008]. More sophisticated features that span over more than one rule have been used for reranking [Huang2008b]. Durrett and Klein (2015) used neural networks to augment manual indicator features for CRF parsing. Structured learning has also been used for transition-based constituent parsing [Sagae and Lavie2005, Zhang and Clark2009, Zhang and Clark2011, Zhu et al.2013], and neural network models have been used to substitute indicator features for transition-based parsing [Watanabe and Sumita2015, Dyer et al.2016, Goldberg et al.2014, Kiperwasser and Goldberg2016b, Cross and Huang2016a, Coavoux and Crabbé2016, Shi et al.2017].
Compared with the above methods, our method does not use global structured learning, but instead learns local constituent patterns, relying on a bidirectional LSTM encoder to capture non-local structural relations in the input. Our work is inspired by the biaffine dependency parser of Dozat and Manning (2016). Similar to our work, Stern et al. (2017) show that a model that bipartitions spans locally can give high accuracies in a highly supervised setting. Compared with their model, we build direct local span classification and CFG rule classification models instead of using span labeling and splitting features to learn a margin-based objective. Our results are better although our models are simpler. In addition, they collapse unary chains into fixed patterns, while we handle them with an encoder-decoder model.
6 Conclusion
We investigated two locally trained span-level constituent parsers using BiLSTM encoders, demonstrating empirically the strength of local models in learning syntactic structures. In standard evaluations, our models give the best results among existing neural constituent parsers that use no external parses.
Acknowledgement
Yue Zhang is the corresponding author. We thank all the anonymous reviewers for their thoughtful comments and suggestions.
References
 [Abney1991] Steven P. Abney. 1991. Parsing by chunks. In Principle-based Parsing, pages 257–278. Springer.
 [Andor et al.2016] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany, August. Association for Computational Linguistics.
 [Bohnet2010] Bernd Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 89–97, Beijing, China, August. Coling 2010 Organizing Committee.

 [Carreras et al.2008] Xavier Carreras, Michael Collins, and Terry Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 9–16. Association for Computational Linguistics.
 [Charniak and Johnson2005] Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 173–180, Ann Arbor, Michigan, June. Association for Computational Linguistics.
 [Charniak2000] Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 132–139. Association for Computational Linguistics.
 [Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

 [Choe and Charniak2016] Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas, November. Association for Computational Linguistics.
 [Clevert et al.2015] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289.
 [Coavoux and Crabbé2016] Maximin Coavoux and Benoit Crabbé. 2016. Neural greedy constituent parsing with dynamic oracles. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 172–182, Berlin, Germany, August. Association for Computational Linguistics.
 [Collins2003] Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
 [Cross and Huang2016a] James Cross and Liang Huang. 2016a. Incremental parsing with minimal features using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 32–37, Berlin, Germany, August. Association for Computational Linguistics.
 [Cross and Huang2016b] James Cross and Liang Huang. 2016b. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1–11, Austin, Texas, November. Association for Computational Linguistics.
 [Dozat and Manning2016] Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR, abs/1611.01734.
 [Durrett and Klein2015] Greg Durrett and Dan Klein. 2015. Neural crf parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 302–312, Beijing, China, July. Association for Computational Linguistics.
 [Dyer et al.2016] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June. Association for Computational Linguistics.
 [Fernández-González and Martins2015] Daniel Fernández-González and André F. T. Martins. 2015. Parsing as reduction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1523–1533, Beijing, China, July. Association for Computational Linguistics.
 [Finkel et al.2008] Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, featurebased, conditional random field parsing. In Proceedings of ACL08: HLT, pages 959–967, Columbus, Ohio, June. Association for Computational Linguistics.
 [Fried et al.2017] Daniel Fried, Mitchell Stern, and Dan Klein. 2017. Improving neural parsing by disentangling model combination and reranking effects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 161–166, Vancouver, Canada, July. Association for Computational Linguistics.
 [Goldberg et al.2014] Yoav Goldberg, Francesco Sartorio, and Giorgio Satta. 2014. A tabular method for dynamic oracles in transition-based parsing. Transactions of the Association for Computational Linguistics, 2:119–130.
 [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
 [Huang and Harper2009] Zhongqiang Huang and Mary Harper. 2009. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841, Singapore, August. Association for Computational Linguistics.
 [Huang et al.2010] Zhongqiang Huang, Mary Harper, and Slav Petrov. 2010. Self-training with products of latent variable grammars. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 12–22, Cambridge, MA, October. Association for Computational Linguistics.
 [Huang et al.2012] Liang Huang, Suphan Fayong, and Yang Guo. 2012. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151, Montréal, Canada, June. Association for Computational Linguistics.
 [Huang2008a] Liang Huang. 2008a. Forest reranking: Discriminative parsing with non-local features. In Proceedings of ACL-08: HLT, pages 586–594, Columbus, Ohio, June. Association for Computational Linguistics.
 [Kiperwasser and Goldberg2016a] Eliyahu Kiperwasser and Yoav Goldberg. 2016a. Easy-first dependency parsing with hierarchical tree LSTMs. Transactions of the Association for Computational Linguistics, 4:445–461.
 [Kiperwasser and Goldberg2016b] Eliyahu Kiperwasser and Yoav Goldberg. 2016b. Simple and accurate dependency parsing using bidirectional LSTM feature representations. CoRR, abs/1603.04351.
 [Klein and Manning2003] Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430, Sapporo, Japan, July. Association for Computational Linguistics.
 [Koo and Collins2010] Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden, July. Association for Computational Linguistics.
 [Koo et al.2010] Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1288–1298, Cambridge, MA, October. Association for Computational Linguistics.
 [Kuncoro et al.2017] Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1249–1258, Valencia, Spain, April. Association for Computational Linguistics.
 [Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), volume 1, pages 282–289.
 [Liu and Zhang2017a] Jiangming Liu and Yue Zhang. 2017a. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424.
 [Liu and Zhang2017b] Jiangming Liu and Yue Zhang. 2017b. Shift-reduce constituent parsing with neural lookahead features. Transactions of the Association for Computational Linguistics, 5:45–58.
 [Ma and Hovy2017] Xuezhe Ma and Eduard Hovy. 2017. Neural probabilistic model for non-projective MST parsing. arXiv preprint arXiv:1701.00874.
 [Marcus et al.1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
 [Martins et al.2010] Andre Martins, Noah Smith, Eric Xing, Pedro Aguiar, and Mario Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 34–44, Cambridge, MA, October. Association for Computational Linguistics.
 [McClosky et al.2006] David McClosky, Eugene Charniak, and Mark Johnson. 2006. Reranking and self-training for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 337–344, Sydney, Australia, July. Association for Computational Linguistics.
 [Neubig et al.2017] Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.
 [Nivre2003] Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.
 [Nivre2008] Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Comput. Linguist., 34(4):513–553, December.
 [Petrov and Klein2007] Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411, Rochester, New York, April. Association for Computational Linguistics.
 [Sagae and Lavie2005] Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132. Association for Computational Linguistics.
 [Sagae and Lavie2006] Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 129–132, New York City, USA, June. Association for Computational Linguistics.
 [Shi et al.2017] Tianze Shi, Liang Huang, and Lillian Lee. 2017. Fast(er) exact decoding and global training for transition-based dependency parsing via a minimal feature set. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 12–23, Copenhagen, Denmark, September. Association for Computational Linguistics.
 [Shindo et al.2012] Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata. 2012. Bayesian symbol-refined tree substitution grammars for syntactic parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–448, Jeju Island, Korea, July. Association for Computational Linguistics.
 [Socher et al.2013] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465, Sofia, Bulgaria, August. Association for Computational Linguistics.
 [Srivastava et al.2014] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
 [Stern et al.2017] Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 818–827, Vancouver, Canada, July. Association for Computational Linguistics.
 [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
 [Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1556–1566, Beijing, China, July. Association for Computational Linguistics.
 [Taskar et al.2004] Ben Taskar, Dan Klein, Mike Collins, Daphne Koller, and Christopher Manning. 2004. Max-margin parsing. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 1–8, Barcelona, Spain, July. Association for Computational Linguistics.
 [Teng and Zhang2016] Zhiyang Teng and Yue Zhang. 2016. Bidirectional tree-structured LSTM with head lexicalization. CoRR, abs/1611.06788.
 [Vinyals et al.2015] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781.
 [Wang and Chang2016] Wenhui Wang and Baobao Chang. 2016. Graph-based dependency parsing with bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2306–2315, Berlin, Germany, August. Association for Computational Linguistics.
 [Wang and Xue2014] Zhiguo Wang and Nianwen Xue. 2014. Joint POS tagging and transition-based constituent parsing in Chinese with non-local features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 733–742, Baltimore, Maryland, June. Association for Computational Linguistics.
 [Wang et al.2015] Zhiguo Wang, Haitao Mi, and Nianwen Xue. 2015. Feature optimization for constituent parsing via neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1138–1147, Beijing, China, July. Association for Computational Linguistics.
 [Watanabe and Sumita2015] Taro Watanabe and Eiichiro Sumita. 2015. Transition-based neural constituent parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1169–1179, Beijing, China, July. Association for Computational Linguistics.
 [Xue et al.2005] Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.
 [Zhang and Clark2009] Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09), pages 162–171, Paris, France, October. Association for Computational Linguistics.
 [Zhang and Clark2011] Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.
 [Zhang and Nivre2011] Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, Oregon, USA, June. Association for Computational Linguistics.
 [Zhou et al.2015] Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1213–1222, Beijing, China, July. Association for Computational Linguistics.
 [Zhu et al.2013] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 434–443, Sofia, Bulgaria, August. Association for Computational Linguistics.
 [Zhu et al.2015] Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In Proceedings of the 32nd International Conference on Machine Learning, pages 1604–1612, Lille, France, July.