1 Introduction
Error propagation is a common problem for many NLP tasks [Song et al. 2012, Quirk and Corston-Oliver 2006, Han et al. 2013, Gildea and Palmer 2002, Yang and Cardie 2013]. It can occur when NLP tools applied early on in a pipeline make mistakes that have a negative impact on higher-level tasks further down the pipeline. It can also occur within the application of a specific task, when sequential decisions are taken and errors made early in the process affect decisions made later on.
When reinforcement learning is applied, a system actively tries out different sequences of actions. Most of these sequences will contain some errors. We hypothesize that a system trained in this manner will be more robust and less susceptible to error propagation.
We test our hypothesis by applying reinforcement learning to greedy transition-based parsers [Yamada and Matsumoto 2003, Nivre 2004], which have been popular because of their superior efficiency and accuracy nearing the state of the art. They are also known to suffer from error propagation. Because they work by carrying out a sequence of actions without reconsideration, an erroneous action can exert a negative effect on all subsequent decisions. By rendering correct parses unreachable or promoting incorrect features, the first error induces the second error and so on. McDonald and Nivre (2007) argue that the observed negative correlation between parsing accuracy and sentence length indicates that error propagation is at work.
We compare reinforcement learning to supervised learning on the parser of Chen and Manning (2014). This high-performance parser is available as open source. It does not make use of alternative strategies for tackling error propagation and thus provides a clean experimental setup to test our hypothesis. Reinforcement learning increased both unlabeled and labeled accuracy on the Penn Treebank and the German part of SPMRL [Seddah et al. 2014]. This outcome shows that reinforcement learning has a positive effect, but does not yet prove that this is indeed the result of reduced error propagation. We therefore designed an experiment which identified which errors are the result of error propagation. We found that around 50% of avoided errors were cases of error propagation in our best arc-standard system. Considering that 27% of the original errors were caused by error propagation, this result confirms our hypothesis.
This paper provides the following contributions:

We introduce Approximate Policy Gradient (APG), a new algorithm that is suited for dependency parsing and other structured prediction problems.

We show that this algorithm improves the accuracy of a high-performance greedy parser.

We design an experiment for analyzing error propagation in parsing.

We confirm our hypothesis that reinforcement learning reduces error propagation.
To our knowledge, this paper is the first to experimentally show that reinforcement learning can reduce error propagation in NLP.
The rest of this paper is structured as follows. We discuss related work in Section 2. This is followed by a description of the parsers used in our experiments in Section 3. Section 4 outlines our experimental setup and presents our results. The error propagation experiment and its outcome are described in Section 5. Finally, we conclude and discuss future research in Section 6.
2 Related Work
In this section, we address related work on dependency parsing, including alternative approaches for reducing error propagation, and reinforcement learning.
2.1 Dependency Parsing
We use the parser of Chen and Manning (2014) as a basis for our experiments. Their parser is open-source and has served as a reference point for many recent publications [Dyer et al. 2015, Weiss et al. 2015, Alberti et al. 2015, Honnibal and Johnson 2015, among others]. They provide an efficient neural network that learns dense vector representations of words, PoS tags and dependency labels. This small set of features makes their parser significantly more efficient than other popular parsers, such as the Malt [Nivre et al. 2007] or MST [McDonald et al. 2005] parser, while obtaining higher accuracy. They acknowledge the error propagation problem of greedy parsers, but leave addressing this through (e.g.) beam search for future work.
Dyer et al. (2015) introduce an approach that uses Long Short-Term Memory (LSTM). Their parser still works incrementally and the number of required operations grows linearly with the length of the sentence, but it uses the complete buffer, stack and history of parsing decisions, giving the model access to global information. Weiss et al. (2015) introduce several improvements on Chen and Manning's parser. Most importantly, they use a globally trained perceptron layer instead of a softmax output layer. Their model uses smaller embeddings, a rectified linear instead of a cubic activation function, and two hidden layers instead of one. They furthermore apply an averaged stochastic gradient descent (ASGD) learning scheme. In addition, they apply beam search and increase training data by using unlabeled data through the tri-training approach introduced by Li et al. (2014), which leads to further improvements.
Kiperwasser and Goldberg (2016) introduce a new way to represent features using a bidirectional LSTM and improve the results of a greedy parser. Andor et al. (2016) present a mathematical proof that globally normalized models are more expressive than locally normalized counterparts and propose to use global normalization with beam search at both training and testing.
Our approach differs from all of the work mentioned above in that it manages to improve the results of Chen and Manning (2014) without changing either the architecture of the model or the input representation. The only substantial difference lies in the way the model is trained. In this respect, our research is most similar to training approaches using dynamic oracles [Goldberg and Nivre 2012]. Traditional static oracles can generate only one sequence of actions per sentence. A dynamic oracle gives all trajectories leading to the best possible result from every valid parse configuration. Dynamic oracles can therefore be used to generate more training sequences, including those containing errors. A drawback of this approach is that dynamic oracles have to be developed specifically for individual transition systems (e.g. arc-standard, arc-eager). Therefore, a large number of dynamic oracles have been developed in recent years [Goldberg and Nivre 2012, Goldberg and Nivre 2013, Goldberg et al. 2014, Gomez-Rodriguez et al. 2014, Björkelund and Nivre 2015]. In contrast, the reinforcement learning approach proposed in this paper is more general and can be applied to a variety of systems.
Zhang and Chan (2009) present the only study we are aware of that also uses reinforcement learning for dependency parsing. They compare their results to Nivre et al. (2006) using the same features, but they also change the model and apply beam search. It is thus unclear to what extent their improvements are due to reinforcement learning.
Even though most approaches mentioned above improve the results reported by Chen and Manning (2014) and even more impressive results on dependency parsing have been achieved since (notably, Andor et al. 2016), Chen and Manning's parser provides a better baseline for our purposes. We aim at investigating the influence of reinforcement learning on error propagation and want to test this in a clean environment, where reinforcement learning does not interfere with other methods that address the same problem.
2.2 Reinforcement Learning
Reinforcement learning has been applied to several NLP tasks with success, e.g. agenda-based parsing [Jiang et al. 2012], semantic parsing [Berant and Liang 2015] and simultaneous machine translation [Grissom II et al. 2014]. To our knowledge, however, none of these studies investigated the influence of reinforcement learning on error propagation.
Learning to Search (L2S) is probably the most prominent line of research that applies reinforcement learning (more precisely, imitation learning) to NLP. Various algorithms, e.g. SEARN [Daumé III et al. 2009] and DAgger [Ross et al. 2011], have been developed sharing common high-level steps: a roll-in policy is executed to generate training states, from which a roll-out policy is used to estimate the loss of certain actions. The concrete instantiation differs from one algorithm to another, with choices including a reference policy (static or dynamic oracle), a learned policy, or a mixture of the two. Early work in L2S focused on reducing reinforcement learning to binary classification [Daumé III et al. 2009], but newer systems favored regressors for efficiency [Chang et al. 2015, Supplementary material, Section B]. Our algorithm APG is simpler than L2S in that it uses only one policy (pretrained with standard supervised learning) and applies the existing classifier directly without reduction (the only requirement is that it is probabilistic). Nevertheless, our results demonstrate its effectiveness.
APG belongs to the family of policy gradient algorithms [Sutton et al. 1999], i.e. it maximizes the expected reward directly by following its gradient w.r.t. the parameters. The advantage of using a policy gradient algorithm in NLP is that gradient-based optimization is already widely used. REINFORCE [Williams 1992, Ranzato et al. 2016] is a widely used policy gradient algorithm, but it is also well known for suffering from high variance [Sutton et al. 1999].
We directly compare our approach to REINFORCE, whereas we leave a direct comparison to L2S for future work. Our experiments show that our algorithm results in lower variance and achieves better performance than REINFORCE.
Recent work addresses the approximation of the reinforcement learning gradient in the context of machine translation. The algorithm of Shen et al. (2016) is roughly equivalent to the combination of an oracle and random sampling. Their approach differs from ours because it does not retain memory across iterations as our best-performing model does (see Section 3.4).
2.3 Reinforcement and error propagation
As mentioned above, previous work that applied reinforcement learning to NLP has, to our knowledge, not shown that it improved results by reducing error propagation.
Work on identifying the impact of error propagation in parsing is rare, Ng and Curran (2015) being a notable exception. They provide a detailed error analysis for parsing and classify which kinds of parsing errors are involved in error propagation. There are four main differences between their approach and ours. First, Ng and Curran correct arcs in the tree, whereas our algorithm corrects decisions of the parsing algorithm. Second, our approach distinguishes between cases where one erroneous action deterministically leads to multiple erroneous arcs and cases where an erroneous action leads to conditions that indirectly result in further errors (see Section 5.1 for a detailed explanation). Third, Ng and Curran's algorithm corrects all erroneous arcs that belong to the same type of parsing error, and they point out that they cannot examine the interaction between multiple errors of the same type in a sentence. Our algorithm corrects errors incrementally and therefore avoids this issue. Finally, the classification and analysis presented in Ng and Curran (2015) are more extensive and detailed than ours. Our algorithm can, however, easily be extended to perform a similar analysis. Overall, Ng and Curran's approach to error analysis and ours are complementary. Combining them and applying them to various systems would form an interesting direction for future work.
Step  Transition  Stack  Buffer  Arcs
0  -  root  waves hit … Big Board  A = ∅
1  shift  root waves  hit stocks … Big Board  A
2  shift  root waves hit  stocks themselves … Big Board  A
3  left_{nsubj}  root hit  stocks themselves … Big Board  A = {hit → waves}
4  shift  root hit stocks  themselves on the Big Board  A
5  shift  root hit stocks themselves  on the Big Board  A
6  right_{dep}  root hit stocks  on the Big Board  A = A ∪ {stocks → themselves}
7  right_{dobj}  root hit  on the Big Board  A = A ∪ {hit → stocks}
3 A Reinforced Greedy Parser
This section describes the systems used in our experiments. We first describe the arc-standard algorithm, because familiarity with it helps to understand our error propagation analysis. Next, we briefly point out the main differences between the arc-standard algorithm and the alternative algorithms we experimented with (arc-eager and swap-standard). We then outline the traditional and our novel machine learning approaches. The features we used are identical to those described in Chen and Manning (2014). We are not aware of research identifying the best features for a neural parser with arc-eager or swap-standard, so we use the same features for all transition systems.
3.1 Transition-Based Dependency Parsing
In an arc-standard system [Nivre 2004], a parsing configuration consists of a triple $(\Sigma, \beta, A)$, where $\Sigma$ is a stack, $\beta$ is a buffer containing the remaining input tokens and $A$ are the dependency arcs that are created during the parsing process. At initiation, the stack contains only the root symbol ($\Sigma$ = [ROOT]), the buffer contains the tokens of the sentence ($\beta = [w_1, \ldots, w_n]$) and the set of arcs is empty ($A = \emptyset$).
The arc-standard system supports three transitions. Naturally, the transitions LEFT and RIGHT can only take place if the stack contains at least two elements, and SHIFT can only occur when there is at least one element on the buffer. When $s_1$ is the top element and $s_2$ the second element on the stack, and $b_1$ the first element of the buffer:
 LEFT adds an arc $s_1 \to s_2$ to $A$ and removes $s_2$ from the stack.
 RIGHT adds an arc $s_2 \to s_1$ to $A$ and removes $s_1$ from the stack.
 SHIFT moves $b_1$ to the stack.
When the buffer is empty, the stack contains only the root symbol and $A$ contains a parse tree, the configuration is completed. For a sentence of $n$ tokens, a full parse takes $2n + 1$ transitions to complete (including the initiation). Figure 1 provides the gold parse tree for a (simplified) example from the Penn Treebank. The steps taken to create the dependencies between the sentence's head word hit and its subject and direct object are provided in Table 1.
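The three transitions can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the `Configuration` class, the function names, the three-token sentence and the `root` label are all assumptions made for the example, and tokens are plain strings, so repeated words in a sentence would need indices instead.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    """Arc-standard parser state: stack, buffer and the set of labeled arcs."""
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)

    def is_terminal(self):
        # Done when the buffer is empty and only the root remains on the stack.
        return not self.buffer and self.stack == ["ROOT"]


def shift(c):
    assert c.buffer, "SHIFT needs at least one token on the buffer"
    c.stack.append(c.buffer.pop(0))


def left(c, label):
    assert len(c.stack) >= 2, "LEFT needs two stack elements"
    s1, s2 = c.stack[-1], c.stack[-2]
    c.arcs.add((s1, label, s2))  # arc from the top to the second element
    del c.stack[-2]              # the new dependent leaves the stack


def right(c, label):
    assert len(c.stack) >= 2, "RIGHT needs two stack elements"
    s1, s2 = c.stack[-1], c.stack[-2]
    c.arcs.add((s2, label, s1))  # arc from the second to the top element
    c.stack.pop()                # the new dependent leaves the stack


def parse(tokens, transitions):
    """Apply a list of (name, label) transitions and return the final state."""
    c = Configuration(stack=["ROOT"], buffer=list(tokens))
    for name, label in transitions:
        if name == "shift":
            shift(c)
        elif name == "left":
            left(c, label)
        else:
            right(c, label)
    return c


# Hypothetical shortened sentence, loosely based on the running example.
tokens = ["waves", "hit", "stocks"]
transitions = [("shift", None), ("shift", None), ("left", "nsubj"),
               ("shift", None), ("right", "dobj"), ("right", "root")]
final = parse(tokens, transitions)
```

For this three-token input the derivation uses six transitions, i.e. two per token, which together with the initial configuration matches the count given above.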
To demonstrate that reinforcement learning can train different systems, we also carried out experiments with arc-eager [Nivre 2003] and swap-standard [Nivre 2009]. Arc-eager is designed for incremental parsing and is included in the popular MaltParser [Nivre et al. 2006a]. Swap-standard is a simple and effective solution for non-projective dependency trees. Because arc-eager does not guarantee complete parse trees, we used a variation that employs an action called UNSHIFT to resume processing of tokens that would otherwise not be attached to a head [Nivre and Fernández-González 2014].
3.2 Training with a Static Oracle
In transition-based dependency parsing, it is common to convert a dependency treebank $D$ into a collection of input features $x$ and corresponding gold-standard actions $a$ for training, using a static oracle $O$. In Chen and Manning (2014), a neural network works as a function mapping input features to probabilities of actions: $f_{NN}: X \to [0, 1]^{|\mathcal{A}|}$. The neural network is trained to minimize the negative log-likelihood loss on the converted treebank:
$$L = \sum_{(x, a) \in O(D)} -\log f_{NN}(x, a) \qquad (1)$$
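As a small illustration of the supervised objective, the following sketch sums the negative log-probabilities of the gold actions over oracle-derived training pairs. The `toy_predict` function is a hypothetical stand-in for the neural network and simply returns a fixed distribution; the feature values `x1` and `x2` are placeholders.

```python
import math


def negative_log_likelihood(examples, predict):
    """Sum of -log p(gold action | features) over (features, action) pairs."""
    return -sum(math.log(predict(features)[action])
                for features, action in examples)


def toy_predict(features):
    # Hypothetical stand-in for the network: a fixed distribution over actions.
    return {"shift": 0.5, "left": 0.25, "right": 0.25}


examples = [("x1", "shift"), ("x2", "left")]
loss = negative_log_likelihood(examples, toy_predict)
```

With these toy probabilities the loss is -(log 0.5 + log 0.25), i.e. log 8.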
3.3 Reinforcement Learning
Following Maes et al. (2009), we view transition-based dependency parsing as a deterministic Markov Decision Process. The problem is summarized by a tuple $(S, \mathcal{A}, T, r)$ where $S$ is the set of all possible states, $\mathcal{A}$ contains all possible actions, $T$ is a mapping $S \times \mathcal{A} \to S$ called the transition function and $r$ is a reward function.
A state corresponds to a configuration and is summarized into input features. Possible actions are defined for each transition system described in Section 3.1. We keep the training approach simple by using only one reward at the end of each parse.
Given this framework, a stochastic policy $\pi$ guides our parser by mapping each state to a probability distribution over actions. During training, we use the function $f_{NN}$ described in Section 3.2 as a stochastic policy. At test time, actions are chosen in a greedy fashion, following existing literature. We aim at finding the policy that maximizes the expected reward (or, equivalently, minimizes the expected loss) on the training dataset:
$$\max_{\pi} \; \eta(\pi) = \mathbb{E}_{a_{1:m} \sim \pi}\left[ r(s_{m+1}) \right] = \sum_{a_{1:m}} \left( \prod_{i=1}^{m} \pi(s_i, a_i) \right) r(s_{m+1}) \qquad (2)$$
where $a_{1:m}$ is a sequence of actions obtained by following policy $\pi$ until termination and $s_{1:m+1}$ are the corresponding states (with $s_{m+1}$ being the termination state).
3.4 Approximate Policy Gradient
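Gradient ascent can be used to maximize the expected reward in Equation 2. The gradient of the expected reward w.r.t. the parameters $\theta$ is (note that $\nabla_\theta \pi = \pi \nabla_\theta \log \pi$):
$$\nabla_\theta \eta = \sum_{a_{1:m}} \left( \prod_{i=1}^{m} \pi(s_i, a_i) \right) \left( \sum_{i=1}^{m} \nabla_\theta \log \pi(s_i, a_i) \right) r(s_{m+1}) \qquad (3)$$
Because of the exponential number of possible trajectories, calculating the gradient exactly is not possible. We propose to replace it by an approximation (hence the name Approximate Policy Gradient) obtained by summing over a small subset $S_\pi$ of trajectories. Following common practice, we also use a baseline $b(y)$ that only depends on the correct dependency tree $y$. The parameter $\theta$ is then updated by following the approximate gradient:
$$\Delta\theta \propto \sum_{a_{1:m} \in S_\pi} \left( \prod_{i=1}^{m} \pi(s_i, a_i) \right) \left( \sum_{i=1}^{m} \nabla_\theta \log \pi(s_i, a_i) \right) \left( r(s_{m+1}) - b(y) \right) \qquad (4)$$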
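To make the approximation concrete, the sketch below computes such a gradient for a toy tabular softmax policy. It rests on several assumptions that are not taken from the paper: the policy is a table of logits per state rather than a neural network, each trajectory is weighted by its probability under the current policy times its reward minus the baseline, and all names (`approximate_gradient`, `logits`, the state `s0`) are hypothetical.

```python
import math
from collections import defaultdict

ACTIONS = ["shift", "left", "right"]


def softmax(logits):
    m = max(logits.values())
    exps = {a: math.exp(v - m) for a, v in logits.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def approximate_gradient(logits, trajectories, baseline):
    """Policy gradient summed over a small set of trajectories.

    logits: {state: {action: score}} for a tabular softmax policy (toy setup).
    trajectories: list of (steps, reward) with steps = [(state, action), ...].
    """
    grad = defaultdict(lambda: defaultdict(float))
    for steps, reward in trajectories:
        probs = [softmax(logits[state]) for state, _ in steps]
        # Probability of the whole trajectory under the current policy.
        traj_prob = math.prod(p[a] for p, (_, a) in zip(probs, steps))
        weight = traj_prob * (reward - baseline)
        for p, (state, action) in zip(probs, steps):
            for other in ACTIONS:
                # d log pi(s, a) / d logit(s, other) = 1[a == other] - pi(s, other)
                indicator = 1.0 if action == other else 0.0
                grad[state][other] += weight * (indicator - p[other])
    return grad


# Toy example: one state, uniform policy, two one-step trajectories.
logits = {"s0": {a: 0.0 for a in ACTIONS}}
trajectories = [([("s0", "shift")], 2.0), ([("s0", "left")], 0.0)]
grad = approximate_gradient(logits, trajectories, baseline=1.0)
```

In this toy case the gradient pushes probability towards the action whose trajectory scored above the baseline and away from the one that scored below it, while the unsampled action is left unchanged.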
Unlike REINFORCE, which samples one trajectory at a time, Equation 4 sums over multiple trajectories, which could lead to more stable training and higher performance. To achieve that goal, the choice of the trajectory subset $S_\pi$ is critical. We empirically evaluate three strategies:
 RL-Oracle: only includes the oracle transition sequence.
 RL-Random: randomly samples $k$ distinct trajectories at each iteration. Every action is sampled according to $\pi$, i.e. preferring trajectories to which the current policy assigns higher probability.
 RL-Memory: samples randomly as the previous method but retains the $k$ trajectories with the highest rewards across iterations in a separate memory. Trajectories are "forgotten" (removed) randomly with probability $\rho$ before each iteration. (We assign a random number, drawn uniformly from $[0, 1]$, to each trajectory in memory and remove those assigned a number less than $\rho$.)
Intuitively, trajectories that are more likely and produce higher rewards are better training examples. It follows from Equation 3 that they also bear bigger weight on the true gradient. This is the rationale behind RL-Random and RL-Oracle. For a suboptimal parser, however, these objectives sometimes work against each other. RL-Memory was designed to find the right balance between them. It is furthermore important that the parser is pretrained to ensure good samples. Algorithm 1 illustrates the procedure of training a dependency parser using the proposed algorithms.
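The memory-based strategy can be illustrated with a short sketch of how the set of retained trajectories might be maintained between iterations. The function name `update_memory`, the dictionary representation and the choice to merge new samples into the memory before truncating are assumptions made for illustration; the text above does not specify these details.

```python
import random


def update_memory(memory, new_samples, k, forget_prob, rng):
    """Keep the k highest-reward distinct trajectories across iterations.

    memory, new_samples: dicts mapping a trajectory (tuple of actions)
    to its reward.
    """
    # Forget: each stored trajectory draws a uniform number in [0, 1) and is
    # dropped when that number is below forget_prob.
    kept = {t: r for t, r in memory.items() if rng.random() >= forget_prob}
    # Merge freshly sampled trajectories (duplicates collapse by key).
    kept.update(new_samples)
    # Retain only the k trajectories with the highest rewards.
    best = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(best)


rng = random.Random(0)
memory = {("shift", "left"): 3.0, ("shift", "right"): 1.0}
new_samples = {("left", "shift"): 2.0}
no_forgetting = update_memory(memory, new_samples, k=2, forget_prob=0.0, rng=rng)
full_forgetting = update_memory(memory, new_samples, k=2, forget_prob=1.0, rng=rng)
```

With a forgetting probability of 0 the two best trajectories overall survive; with a probability of 1 the memory is emptied and only the newly sampled trajectory remains.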
4 Reinforcement Learning Experiments
We first train a parser using a supervised learning procedure and then improve its performance using APG. We verified empirically that training a second time with supervised learning has little to no effect on performance.
4.1 Experimental Setup
We use Penn Treebank 3 with the standard split (training, development and test set) for our experiments with arc-standard and arc-eager. Because the swap-standard parser is mainly suited for non-projective structures, which are rare in the Penn Treebank, we evaluate this parser on the German section of the SPMRL dataset. For the Penn Treebank, we follow Chen and Manning's preprocessing steps. We also use their pretrained model (PTB_Stanford_params.txt.gz, downloaded from http://nlp.stanford.edu/software/nndep.shtml on December 30, 2015) for arc-standard and train our own models in similar settings for the other transition systems.
For reinforcement learning, we use AdaGrad for optimization. We do not use dropout because we observed that it destabilized the training process. The reward is the number of correct labeled arcs (i.e. LAS multiplied by the number of tokens); punctuation is not taken into account, following Chen and Manning (2014). The baseline is fixed to half the number of tokens (corresponding to a 0.5 LAS score). As training takes a lot of time, we tried only a few values of the hyperparameters (the number of sampled trajectories and the forgetting probability) on the development set. 1,000 updates were performed (except for REINFORCE, which was trained for 8,000 updates), with each training batch containing 512 randomly selected sentences. The Stanford dependency scorer (downloaded from http://nlp.stanford.edu/software/lexparser.shtml) was used for evaluation.
4.2 Effectiveness of Reinforcement Learning
System  Arc-standard UAS  Arc-standard LAS  Arc-eager UAS  Arc-eager LAS  Swap-standard UAS  Swap-standard LAS
SL  91.3  89.4  88.3  85.8  84.3  81.3
RE  91.9  90.2  89.7  87.2  87.5  84.4
RLO  91.8  90.2  88.9  86.5  86.8  83.9
RLR  92.2  90.6  89.4  87.0  87.5  84.5
RLM  92.2  90.6  89.8  87.4  87.6  84.6
Table 2 displays the performance of different approaches to training dependency parsers. Although we used the pretrained model of Chen and Manning (2014) and Stanford open-source software, the results of our baseline are slightly worse than what is reported in their paper. This could be due to minor differences in settings and does not affect our conclusions.
Across transition systems and two languages, APG outperforms supervised learning, verifying our hypothesis. Moreover, this is not simply because the learners are exposed to more examples than their supervised counterparts. RL-Oracle is trained on exactly the same examples as the standard supervised learning system (SL), yet it is consistently superior. This can only be explained by the superiority of the reinforcement learning objective function compared to negative log-likelihood.
The results support our hypothesis that APG is better than REINFORCE (abbreviated as RE in Table 2), as RL-Memory always outperforms the classical algorithm and the other two heuristics do in two out of three cases. The usefulness of training examples that contain errors is evident through the better performance of RL-Random and RL-Memory in comparison to RL-Oracle.
Table 3 shows the importance of samples for RL-Random. The algorithm hurts performance when only one sample is used, whereas training with two or more samples improves the results. The difference cannot be explained by the total number of observed samples, because one-sample training is still worse after 8,000 iterations compared to a sample size of 8 after 1,000 iterations. The benefit of added samples is twofold: increased performance and decreased variance. Because these benefits saturate quickly, we did not test sample sizes beyond 32.
System / samples  Dev UAS  Dev LAS  Test UAS  Test LAS  Test std. UAS  Test std. LAS
SL  91.5  89.6  91.3  89.4  -  -
RE  92.1  90.4  91.9  90.2  0.04  0.05
1  91.2  89.1  91.0  88.9  0.12  0.15
2  91.8  90.0  91.6  89.9  0.09  0.09
4  92.2  90.5  92.0  90.4  0.09  0.08
8  92.4  90.8  92.2  90.6  0.03  0.05
16  92.4  90.8  92.2  90.6  -  -
32  92.4  90.8  92.3  90.6  -  -
5 Error Propagation Experiment
We hypothesized that reinforcement learning avoids error propagation. In this section, we describe our algorithm and the experiment that identifies error propagation in the arc-standard parsers.
5.1 Error Propagation
Step  Transition  Stack  Buffer  Arcs
4  shift  root hit stocks  themselves on the Big Board  A
5’  right_{dobj}  root hit  themselves on the Big Board  A = A ∪ {hit → stocks}
6’  shift  root hit themselves  on the Big Board  A
7’  shift  root hit themselves on  the Big Board  A
…
10’  shift  root hit themselves on the Big Board  -  A
11’  left_{nn}  root hit themselves on the Board  -  A = A ∪ {Board → Big}
12’  left_{det}  root hit themselves on Board  -  A = A ∪ {Board → the}
13’  right_{pobj}  root hit themselves on  -  A = A ∪ {on → Board}
14’  right_{dep}  root hit themselves  -  A = A ∪ {themselves → on}
Section 3.1 explained that a transition-based parser goes through the sentence incrementally and must select a transition from [SHIFT, LEFT, RIGHT] at each step. We use the term arc error to refer to an erroneous arc in the resulting tree. The term decision error refers to a transition that leads to a loss in parsing accuracy. Decision errors in the parsing process lead to one or more arc errors in the resulting tree. There are two ways in which a single decision error may lead to multiple arc errors. First, the decision can deterministically lead to more than one arc error, because (e.g.) an erroneously formed arc also blocks other correct arcs. Second, an erroneous parse decision changes some of the features that the model uses for future decisions, and these changes can lead to further (decision) errors down the road.
We illustrate both cases using two incorrect derivations presented in Figure 2. The original gold tree is repeated in (A). The dependency graph in Figure 2 (B) contains three erroneous dependency arcs (indicated by dashed arrows). The first error must have occurred when the parser executed RIGHT_{amod}, creating the arc Big → Board. After this error, it is impossible to create the correct relations on → Board and Board → the. The wrong arcs Big → the and on → Big are thus all the result of a single decision error.
Figure 2 (C) represents the dependency graph that is actually produced by our parser. (The example is a fragment of a more complex sentence consisting of 33 tokens. The parser does provide the correct output when it analyzes this sequence in isolation.) It contains two erroneous arcs: hit → themselves and themselves → on. Table 4 provides a possible sequence of steps that led to this derivation, starting from the moment stocks was added to the stack (Step 4). The first error is introduced in Step 5’, where hit combines with stocks before stocks has picked up its dependent themselves. At that point, themselves can no longer be combined with the right head. The preposition on, on the other hand, can still be combined with the correct head. The second error is introduced in Step 7’, where the parser moves "on" to the stack rather than creating an arc from hit to themselves. (Note that technically, on can still become a dependent of hit, but this can only happen if on becomes the head of themselves, which would also be an error.) There are thus two decision errors that lead to the arc errors in Figure 2 (C). The second decision error can, however, be caused indirectly by the first error. If a decision error causes additional decision errors later in the parsing process, we talk of error propagation. This cannot be known just by looking at the derivation.
5.2 Examining the impact of decision errors
We examine the impact of individual decision errors on the overall parse results in our test set by combining a dynamic oracle and a recursive function. We use a dynamic oracle based on Goldberg et al. (2014), which gives us the overall loss at any point during the derivation. The loss is equal to the minimal number of arc errors that will have been made once the parse is complete. We can thus deduce how many arc errors are deterministically caused by a given decision error.
The propagation of decision errors cannot be determined by simply examining the increase in loss during the parsing process. We use a recursive function to identify whether a particular parse suffered from this. While parsing the sentence, we register which decisions lead to an increase in loss. We then recursively reparse the sentence, correcting one additional decision error during each run until the parser produces the gold parse. If each erroneous decision has to be corrected in order to arrive at the gold parse, we assume the decision errors are independent of each other. If, on the other hand, the correction of a specific decision also fixes other decisions down the road, the original parse suffers from error propagation.
Metric  SL  RLO  RLR  RLM
Total Loss  7069  6227  6042  6144
Dec. Errors  5177  4410  4345  4476
Err. Prop.  1399  1124  992  1035
New errors  411  432  403  400
Loss/error  1.37  1.41  1.39  1.37
Err. Prop. (%)  27.0  25.5  22.8  23.1
The results are presented in Table 5. Total Loss indicates the number of arc errors in the corpus, Dec. Errors the number of decision errors and Err. Prop. the number of decision errors that were the result of error propagation. This number was obtained by comparing the number of decision errors in the original parse to the number of decision errors that needed to be fixed to obtain the gold parse. If fewer errors had to be fixed than were originally present, we counted the difference as error propagation. Note that fixing errors sometimes leads to new decision errors during the derivation. We also counted the cases where more decision errors needed to be fixed than were originally present and report them in Table 5. (We ran an alternative analysis where we counted all cases where fixing one decision error in the derivation reduced the overall number of decision errors in the parse by more than one. Under this alternative analysis, similar reductions in the proportion of error propagation were observed for reinforcement learning.)
On average, decision errors deterministically lead to more than one arc error in the resulting parse tree. This remains stable across systems (around 1.4 arc errors per decision error). We furthermore observe that the proportion of decision errors that are the result of error propagation has indeed been reduced for all reinforcement learning models. Among the errors avoided by APG, 35.9% were propagated errors for RL-Oracle, 48.9% for RL-Random, and 51.9% for RL-Memory. These percentages are all higher than the proportion of propagated errors occurring in the corpus parsed by SL (27%). This outcome confirms our hypothesis that reinforcement learning is indeed more robust for making decisions in imperfect environments and therefore reduces error propagation.
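The percentages above can be recomputed from the counts in Table 5. The sketch below assumes that "avoided errors" are the differences between the SL row and each reinforcement learning row; this reading is our assumption, although it does reproduce the reported figures. The function names are hypothetical.

```python
# Counts copied from Table 5 (decision errors and propagated errors per system).
decision_errors = {"SL": 5177, "RL-Oracle": 4410, "RL-Random": 4345, "RL-Memory": 4476}
propagated = {"SL": 1399, "RL-Oracle": 1124, "RL-Random": 992, "RL-Memory": 1035}


def propagated_share(system):
    """Percentage of a system's decision errors caused by error propagation."""
    return 100.0 * propagated[system] / decision_errors[system]


def avoided_share(system, base="SL"):
    """Percentage of the errors avoided relative to `base` that were propagated."""
    avoided = decision_errors[base] - decision_errors[system]
    avoided_propagated = propagated[base] - propagated[system]
    return 100.0 * avoided_propagated / avoided


for name in decision_errors:
    print(name, "propagated share:", round(propagated_share(name), 1))
for name in ("RL-Oracle", "RL-Random", "RL-Memory"):
    print(name, "share among avoided errors:", round(avoided_share(name), 1))
```

For example, RL-Oracle makes 767 fewer decision errors than SL, of which 275 are propagated errors, which gives the 35.9% quoted above.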
6 Conclusion
This paper introduced Approximate Policy Gradient (APG), an efficient reinforcement learning algorithm for NLP, and applied it to a high-performance greedy dependency parser. We hypothesized that reinforcement learning would be more robust against error propagation and would hence improve parsing accuracy.
To verify our hypothesis, we ran experiments applying APG to three transition systems and two languages. We furthermore introduced an experiment to investigate which portion of errors were the result of error propagation and compared the output of standard supervised machine learning to reinforcement learning. Our results showed that: (a) reinforcement learning indeed improved parsing accuracy and (b) propagated errors were overrepresented in the set of avoided errors, confirming our hypothesis.
To our knowledge, this paper is the first to show experimentally that reinforcement learning can reduce error propagation in an NLP task. This result was obtained by a straightforward implementation of reinforcement learning. Furthermore, we only applied reinforcement learning in the training phase, leaving the original efficiency of the model intact. Overall, we see the outcome of our experiments as an important first step in exploring the possibilities of reinforcement learning for tackling error propagation.
Research on parsing has seen impressive improvement during the last year, achieving UAS around 94% [Andor et al. 2016]. This improvement is partially due to other approaches that, at least in theory, address error propagation, such as beam search. Both the reinforcement learning algorithm and the error propagation study we developed can be applied to other parsing approaches. There are two (related) main questions to be addressed in future work in the domain of parsing. The first is whether our method is complementary to alternative approaches and could also improve the current state of the art. The second concerns the impact of various approaches on error propagation and the kinds of errors they manage to avoid (following Ng and Curran 2015).
APG is general enough for other structured prediction problems. We therefore plan to investigate whether we can apply our approach to other NLP tasks such as coreference resolution or semantic role labeling and investigate if it can also reduce error propagation for these tasks.
The source code of all experiments is publicly available at https://bitbucket.org/cltl/redepjava.
Acknowledgments
The research for this paper was supported by the Netherlands Organisation for Scientific Research (NWO) via the Spinoza-prize Vossen projects (SPI 30673, 2014-2019) and the VENI project Reading between the lines (VENI 27589029). Experiments were carried out on the Dutch national e-infrastructure with the support of SURF Cooperative. We would like to thank our friends and colleagues Piek Vossen, Roser Morante, Tommaso Caselli, Emiel van Miltenburg, and Ngoc Do for many useful comments and discussions. We would like to extend our thanks to the anonymous reviewers for their feedback, which helped improve this paper. All remaining errors are our own.
References
 [Alberti et al.2015] Chris Alberti, David Weiss, Greg Coppola, and Slav Petrov. 2015. Improved Transition-Based Parsing and Tagging with Neural Networks. In EMNLP 2015, pages 1354–1359. ACL.
 [Andor et al.2016] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally Normalized Transition-Based Neural Networks. arXiv.org, cs.CL.
 [Berant and Liang2015] Jonathan Berant and Percy Liang. 2015. Imitation Learning of Agenda-based Semantic Parsers. TACL, 3:545–558.
 [Björkelund and Nivre2015] Anders Björkelund and Joakim Nivre. 2015. Non-Deterministic Oracles for Unrestricted Non-Projective Transition-Based Dependency Parsing. In IWPT 2015, pages 76–86. ACL.
 [Chang et al.2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In ICML 2015.
 [Chen and Manning2014] Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In EMNLP 2014, pages 740–750. ACL.
 [Daumé III et al.2009] Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based Structured Prediction. Machine Learning, 75(3):297–325, 6.
 [Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A Smith. 2015. Transition-Based Dependency Parsing with Stack Long Short-Term Memory. In ACL 2015, pages 334–343.
 [Gildea and Palmer2002] Daniel Gildea and Martha Palmer. 2002. The Necessity of Parsing for Predicate Argument Recognition. In ACL 2002, pages 239–246. ACL.
 [Goldberg and Nivre2012] Yoav Goldberg and Joakim Nivre. 2012. A Dynamic Oracle for Arc-Eager Dependency Parsing. In COLING 2012, pages 959–976.
 [Goldberg and Nivre2013] Yoav Goldberg and Joakim Nivre. 2013. Training Deterministic Parsers with Non-Deterministic Oracles. In TACL 2013, volume 1, pages 403–414.
 [Goldberg et al.2014] Yoav Goldberg, Francesco Sartorio, and Giorgio Satta. 2014. A tabular method for dynamic oracles in transition-based parsing. In TACL 2014, volume 2, pages 119–130.
 [GomezRodriguez et al.2014] Carlos Gomez-Rodriguez, Francesco Sartorio, and Giorgio Satta. 2014. A Polynomial-Time Dynamic Oracle for Non-Projective Dependency Parsing. In EMNLP 2014, pages 917–927. ACL.
 [Grissom II et al.2014] Alvin C. Grissom II, Jordan Boyd-Graber, He He, John Morgan, and Hal Daumé III. 2014. Don’t Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation. In EMNLP 2014, pages 1342–1352.
 [Han et al.2013] Dan Han, Pascual Martínez-Gómez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata. 2013. Effects of parsing errors on pre-reordering performance for Chinese-to-Japanese SMT. PACLIC 27, pages 267–276.
 [Honnibal and Johnson2015] Matthew Honnibal and Mark Johnson. 2015. An Improved Non-monotonic Transition System for Dependency Parsing. In EMNLP 2015, pages 1373–1378. ACL.
 [Jiang et al.2012] Jiarong Jiang, Adam Teichert, Hal Daumé III, and Jason Eisner. 2012. Learned Prioritization for Trading Off Accuracy and Speed. ICML workshop on Inferning: Interactions between Inference and Learning, (0964681):1–9.
 [Kiperwasser and Goldberg2016] Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. CoRR, abs/1603.0.
 [Li et al.2014] Zhenghua Li, Min Zhang, and Wenliang Chen. 2014. Ambiguity-aware Ensemble Training for Semi-supervised Dependency Parsing. In ACL 2014, pages 457–467.
 [Maes et al.2009] Francis Maes, Ludovic Denoyer, and Patrick Gallinari. 2009. Structured prediction with reinforcement learning. Machine Learning, (77):271–301.
 [McDonald and Nivre2007] Ryan McDonald and Joakim Nivre. 2007. Characterizing the Errors of Data-Driven Dependency Parsing Models. In EMNLP-CoNLL 2007.
 [McDonald et al.2005] Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In HLT-EMNLP 2005, pages 523–530. Association for Computational Linguistics.
 [Ng and Curran2015] Dominick Ng and James R Curran. 2015. Identifying Cascading Errors using Constraints in Dependency Parsing. In ACL-IJCNLP, pages 1148–1158, Beijing. ACL.
 [Nivre and FernándezGonzález2014] Joakim Nivre and Daniel Fernández-González. 2014. Arc-eager Parsing with the Tree Constraint. Computational Linguistics, 40(2):259–267, 6.
 [Nivre et al.2006a] Joakim Nivre, Johan Hall, and Jens Nilsson. 2006a. MaltParser: A data-driven parser-generator for dependency parsing. In LREC 2006, volume 6, pages 2216–2219.
 [Nivre et al.2006b] Joakim Nivre, Johan Hall, Jens Nilsson, Gülşen Eryiğit, and Svetoslav Marinov. 2006b. Labeled pseudo-projective dependency parsing with support vector machines. In CoNLL 2006, pages 221–225. ACL.
 [Nivre et al.2007] Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02):95–135.
 [Nivre2003] Joakim Nivre. 2003. An Efficient Algorithm for Projective Dependency Parsing. In IWPT 2003, pages 149–160.
 [Nivre2004] Joakim Nivre. 2004. Incrementality in Deterministic Dependency Parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together.
 [Nivre2009] Joakim Nivre. 2009. Non-projective Dependency Parsing in Expected Linear Time. In ACL-IJCNLP 2009, pages 351–359, Stroudsburg, PA, USA. ACL.
 [Quirk and CorstonOliver2006] Chris Quirk and Simon Corston-Oliver. 2006. The impact of parse quality on syntactically-informed statistical machine translation. In EMNLP 2006, pages 62–69, Sydney, Australia. ACL.
 [Ranzato et al.2016] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence Level Training with Recurrent Neural Networks. ICLR, pages 1–15.
 [Ross et al.2011] Stephane Ross, Geoffrey J Gordon, and J Andrew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS, 15:627–635.
 [Seddah et al.2014] Djamé Seddah, Sandra Kübler, and Reut Tsarfaty. 2014. Introducing the SPMRL 2014 Shared Task on Parsing Morphologically-Rich Languages. In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, pages 103–109.
 [Shen et al.2016] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum Risk Training for Neural Machine Translation. In ACL 2016, pages 1683–1692, Berlin, Germany. ACL.
 [Song et al.2012] Hyun-Je Song, Jeong-Woo Son, Tae-Gil Noh, Seong-Bae Park, and Sang-Jo Lee. 2012. A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors. In ACL 2012, pages 1025–1034. ACL.
 [Sutton et al.1999] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS 1999, pages 1057–1063.
 [Weiss et al.2015] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured Training for Neural Network Transition-Based Parsing. In ACL-IJCNLP 2015, pages 323–333. ACL.
 [Williams1992] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
 [Yamada and Matsumoto2003] Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical Dependency Analysis with Support Vector Machines. In Proceedings of IWPT, pages 195–206.
 [Yang and Cardie2013] Bishan Yang and Claire Cardie. 2013. Joint Inference for Fine-grained Opinion Extraction. In ACL 2013, pages 1640–1649. ACL.
 [Zhang and Chan2009] Lidan Zhang and Kwok Ping Chan. 2009. Dependency Parsing with Energy-based Reinforcement Learning. In IWPT 2009, pages 234–237. ACL.