Yara Parser: A Fast and Accurate Dependency Parser

by Mohammad Sadegh Rasooli, et al.
Columbia University

Dependency parsers are among the most crucial tools in natural language processing as they have many important applications in downstream tasks such as information retrieval, machine translation and knowledge acquisition. We introduce the Yara Parser, a fast and accurate open-source dependency parser based on the arc-eager algorithm and beam search. It achieves an unlabeled accuracy of 93.32 on the standard WSJ test set which ranks it among the top dependency parsers. At its fastest, Yara can parse about 4000 sentences per second when in greedy mode (1 beam). When optimizing for accuracy (using 64 beams and Brown cluster features), Yara can parse 45 sentences per second. The parser can be trained on any syntactic dependency treebank and different options are provided in order to make it more flexible and tunable for specific tasks. It is released with the Apache version 2.0 license and can be used for both commercial and academic purposes. The parser can be found at https://github.com/yahoo/YaraParser.




1 Introduction

Dependency trees are one of the main representations used in the syntactic analysis of sentences. They show explicit syntactic dependencies among words in the sentence [Kübler et al., 2009]. Many dependency parsers have been released in the past decade, and graph-based and transition-based parsing are the two main approaches. In graph-based models, the parser aims to find the most likely tree from all possible trees by using maximum spanning tree algorithms, often in conjunction with dynamic programming. In transition-based models, a tree is converted to a sequence of incremental actions and the parser decides which action to take depending on the current configuration of the partial tree. Graph-based parsers can achieve state-of-the-art performance with the guarantee of recovering the best possible parse, but usually at the expense of speed. Transition-based parsers, in contrast, are fast because the parser can greedily choose an action in each configuration, and they can use arbitrary non-local features to compensate for the lack of optimality. Also, it is easy to augment the set of actions to extend the functionality of the parser to such tasks as disfluency detection [Rasooli and Tetreault, 2013, Rasooli and Tetreault, 2014, Honnibal and Johnson, 2014] and punctuation prediction [Zhang et al., 2013a]. Transition-based parsers are mostly used in supervised tasks, but in rare cases they are also used in unsupervised tasks either with little manual linguistic knowledge or with no prior knowledge [Daumé III, 2009, Rasooli and Faili, 2012].

In this report, we provide a brief introduction to our newly released dependency parser. We show that it can achieve a very high accuracy on the standard English WSJ test set and show that it is very fast even in its slowest mode while getting results very close to state-of-the-art. The structure of this report is as follows: in §2 we provide some details about using Yara both in command line and as an API. We provide technical details about it in §3 and experiments are conducted in §4. Finally we conclude in §5.

2 Using Yara in Practice

In this section, we give a brief overview of training the parser, using it from the command-line and also as an API. Finally we introduce a simple NLP pipeline that can parse text files. All technical details for the parser are provided in §3. The default settings for Yara are expected to be the best in practice for accuracy (except the number of training iterations which is dependent on the data and feature settings).

2.1 Data format

Yara uses the CoNLL 2006 dependency format (http://ilk.uvt.nl/conll/#dataformat) for training as well as testing. The CoNLL format is a tabular one in which each word (and its information) in a sentence occupies one line and sentences are separated by a blank line. Each line is organized into the following tab-delimited columns: 1) word number (starting at one), 2) word form, 3) word lemma, 4) coarse-grained POS (part-of-speech) tag, 5) fine-grained POS tag, 6) unordered set of syntactic and/or morphological features, separated by a vertical bar (|), or an underscore if not available, 7) head of current token (an integer showing the head number where 0 indicates root token), 8) dependency label, 9) projective head (underscore if not available) and 10) projective dependency label (underscore if not available). Blank fields are represented by an underscore. Yara only uses the first, second, fourth, seventh and eighth columns.
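As an illustration of the column layout described above, the following minimal Java sketch reads the five columns that Yara uses from tab-delimited CoNLL lines and groups tokens into sentences at blank lines. The class and method names (ConllSketch, Token, parseLine, parseSentences) are hypothetical and are not part of Yara's code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative reader for the five CoNLL 2006 columns that Yara uses (not Yara's own code). */
final class ConllSketch {
    /** Columns 1, 2, 4, 7 and 8 of one CoNLL line. */
    static final class Token {
        final int id;        // column 1: word number, starting at one
        final String form;   // column 2: word form
        final String pos;    // column 4: coarse-grained POS tag
        final int head;      // column 7: head index, 0 for the root token
        final String label;  // column 8: dependency label

        Token(int id, String form, String pos, int head, String label) {
            this.id = id;
            this.form = form;
            this.pos = pos;
            this.head = head;
            this.label = label;
        }
    }

    /** Parses one tab-delimited CoNLL line into a Token. */
    static Token parseLine(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 8) {
            throw new IllegalArgumentException("expected at least 8 columns: " + line);
        }
        return new Token(Integer.parseInt(cols[0]), cols[1], cols[3],
                         Integer.parseInt(cols[6]), cols[7]);
    }

    /** Splits a block of CoNLL text into sentences at blank lines. */
    static List<List<Token>> parseSentences(String text) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\n", -1)) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(parseLine(line));
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

The sketch ignores the lemma, fine-grained tag, feature and projective columns, mirroring the statement that Yara does not use them.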

2.2 Training and Model Selection

The jar file in the package can be directly used to train a model with the following command line (run from the root directory of the project):

>> java -jar jar/YaraParser.jar train -train-file [train-file] -dev [dev-file] -model [model-file] -punc [punc-file]

where [train-file] and [dev-file] are CoNLL files for training and development data and [model-file] is the output path for the trained model file. [punc-file] contains a list of POS tags for punctuations in the treebank (see §2.2.1). The model for each iteration will be saved with the pattern [model-file]_iter[iter#]; e.g. model_iter2. In this way, the user can track the best performing model and delete all others. For cases where there is no development data, the user can remove the -dev option from the command line and use any of the saved model files as the final model based on his/her prior knowledge (15 is a reasonable number).

The other options are as follows:

  • -cluster [cluster-file] Brown cluster file: at most 4096 clusters are supported by Yara (default: empty). The format should be the same as https://github.com/percyliang/brown-cluster/blob/master/output.txt

  • beam:[beam-width]; e.g. beam:16 (default is 64).

  • iter:[training-iterations]; e.g. iter:10 (default is 20).

  • unlabeled (default: labeled parsing, unless explicitly put ‘unlabeled’)

  • lowercase (default: case-sensitive words, unless explicitly put ‘lowercase’)

  • basic (default: use extended feature set, unless explicitly put ‘basic’)

  • static (default: use dynamic oracles, unless explicitly put ‘static’ for static oracles)

  • early (default: use max violation update, unless explicitly put ‘early’ for early update)

  • random (default: choose maximum scoring oracle, unless explicitly put ‘random’ for randomly choosing an oracle)

  • nt:#threads; e.g. nt:4 (default is 8).

  • root_first (default: put ROOT in the last position, unless explicitly put ‘root_first’)

2.2.1 Punctuation Files

In most dependency evaluations, punctuation symbols and their incoming arcs are ignored. Most parsers do this by using hard-coded rules for punctuation attachment. Yara instead allows the user to specify which punctuation POS tags are important to their task by providing a path for a punctuation file ([punc-file]) with the -punc option (e.g. -punc punc_files/wsj.puncs). If no file is provided, Yara uses the WSJ punctuation tags. The punctuation file contains a list of punctuation POS tags, one per line. The Yara git repository provides punctuation files for WSJ data and Google universal POS tags [Petrov et al., 2011].
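To make the effect of the punctuation list concrete, here is a minimal Java sketch of unlabeled and labeled attachment scores computed only over tokens whose POS tag is not in the punctuation set. It is an assumption-laden illustration: the class AttachmentScoreSketch and its method are hypothetical, and Yara's own evaluation code may differ in details such as how an empty sentence is handled.

```java
import java.util.Set;

/** Illustrative UAS/LAS computation that skips punctuation tokens (not Yara's own code). */
final class AttachmentScoreSketch {
    /** Returns {UAS, LAS} in percent over tokens whose POS tag is not in puncTags. */
    static double[] score(String[] pos, int[] goldHeads, String[] goldLabels,
                          int[] predHeads, String[] predLabels, Set<String> puncTags) {
        int total = 0;
        int unlabeled = 0;
        int labeled = 0;
        for (int i = 0; i < pos.length; i++) {
            if (puncTags.contains(pos[i])) {
                continue; // punctuation tokens and their incoming arcs are ignored
            }
            total++;
            if (goldHeads[i] == predHeads[i]) {
                unlabeled++;
                if (goldLabels[i].equals(predLabels[i])) {
                    labeled++; // a labeled hit needs both the head and the label to match
                }
            }
        }
        if (total == 0) {
            return new double[] {0.0, 0.0}; // assumption: report zero when nothing is scored
        }
        return new double[] {100.0 * unlabeled / total, 100.0 * labeled / total};
    }
}
```

In this sketch a wrongly attached punctuation token does not lower either score, which is the behaviour the punctuation file is meant to control.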

2.2.2 Some Examples

Here we provide examples for training Yara with different settings. We picked the examples that we think would be most useful in practice.

Training with Brown clusters

This can be done via the -cluster option.

>> java -jar jar/YaraParser.jar train -train-file [train-file] -dev [dev-file] -model [model-file] -punc [punc-file] -cluster [cluster-file]

Training with the fastest mode

This can be done via the basic and beam:1 options.

>> java -jar jar/YaraParser.jar train -train-file [train-file] -dev [dev-file] -model [model-file] -punc [punc-file] beam:1 basic

Changing the number of iterations

This can be done via the iter option. In the following example, we selected 10 iterations.

>> java -jar jar/YaraParser.jar train -train-file [train-file] -dev [dev-file] -model [model-file] -punc [punc-file] iter:10

Extending memory consumption

The default Java memory setting may be lower than what is needed for some data sets. In those cases, we can increase the maximum heap size with the JVM -Xmx option. In the following example, memory is extended to ten gigabytes.

>> java -Xmx10g -jar jar/YaraParser.jar train -train-file [train-file] -dev [dev-file] -model [model-file] -punc [punc-file]

Using very specific options

The following example shows a specific case where Yara trains a model on the training data (data/train.conll), evaluates it on the development data (data/dev.conll), saves the model for each iteration with the model file prefix model/train.model (model/train.model_iter1, model/train.model_iter2, model/train.model_iter3, etc.), uses its specific punctuation list (punc_files/my_lang.puncs), uses its specific Brown cluster data (data/cluster.path), and trains the model in 10 iterations, with 16 beams and 4 threads, using static oracles and early update. The command also selects unlabeled parsing (unlabeled), lowercases all words (lowercase) and puts the root token in the first position (root_first).

>> java -Xmx10g -jar jar/YaraParser.jar train -train-file data/train.conll -dev data/dev.conll -model model/train.model -punc punc_files/my_lang.puncs -cluster data/cluster.path beam:16 iter:10 unlabeled lowercase static early nt:4 root_first

2.3 Test and Evaluation

The test file can be either a CoNLL file or a POS tagged file. The output will be a file in CoNLL format.

Parsing a CoNLL file

>> java -jar jar/YaraParser.jar parse_conll -input [test-file] -out [output-file] -model [model-file]

Parsing a tagged file

The tagged file is a simple file where words and tags are separated by a delimiter (default is underscore). The user can use the option -delim [delimiter] (e.g. -delim /) to change the delimiter. The output will be in CoNLL format.

>> java -jar jar/YaraParser.jar parse_tagged -input [test-file] -out [output-file] -model [model-file]
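The tagged input format can be pictured with the following Java sketch, which splits a line of word and tag pairs at a configurable delimiter. It rests on assumptions the text does not state: one sentence per line, whitespace between tokens, and the tag taken after the last occurrence of the delimiter. The class TaggedLineSketch is hypothetical and is not Yara's reader.

```java
/** Illustrative splitter for lines such as "I_PRP want_VBP" (not Yara's own code). */
final class TaggedLineSketch {
    /** Returns {words, tags}, cutting each token at its last delimiter. */
    static String[][] split(String line, String delim) {
        String[] tokens = line.trim().split("\\s+");
        String[] words = new String[tokens.length];
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            int cut = tokens[i].lastIndexOf(delim);
            if (cut <= 0) {
                throw new IllegalArgumentException("no tag in token: " + tokens[i]);
            }
            words[i] = tokens[i].substring(0, cut);
            tags[i] = tokens[i].substring(cut + delim.length());
        }
        return new String[][] {words, tags};
    }
}
```

Cutting at the last delimiter keeps words that themselves contain the delimiter intact, which is one reason a user might prefer a different delimiter such as / for some data.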


Evaluation

Both [gold-file] and [parsed-file] should be in CoNLL format.

>> java -jar YaraParser.jar eval -gold [gold-file] -parse [parsed-file] -punc [punc-file]

A more descriptive end-to-end example that uses a small amount of German training data (https://github.com/yahoo/YaraParser/tree/master/sample_data) is shown in Yara’s GitHub repository at https://github.com/yahoo/YaraParser#example-usage.

2.4 Parsing a Partial Tree

Yara can parse partial trees where some gold dependencies are provided and it is expected to return a dependency tree consistent with the partial dependencies. Unknown dependencies are represented with “-1” as the head in the CoNLL format. Figure 1 shows an example of a partial parse tree before and after constrained parsing.

>> java -jar YaraParser.jar parse_partial -input [test-file] -out [output-file] -model [model-file]

[Figure: two dependency diagrams over the sentence "I want to parse a sentence ."; the partial tree on the left contains the arcs aux, dobj and punct, and the tree on the right additionally contains the arcs root, nsubj, xcomp and det.]

Figure 1: A sample partial dependency tree on the left side and its filled tree on the right. As shown in this figure, the added arcs are completely consistent with the partial tree arcs.
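The consistency requirement can be stated as a small check. The Java sketch below is a hypothetical helper (PartialTreeSketch is not part of Yara) that compares a partial head array, in which -1 marks an unknown head, with a complete head array for the same sentence.

```java
/** Illustrative check that a full tree preserves the known heads of a partial tree. */
final class PartialTreeSketch {
    /** True if every known head (any value other than -1) is kept in the full tree. */
    static boolean isConsistent(int[] partialHeads, int[] fullHeads) {
        if (partialHeads.length != fullHeads.length) {
            return false;
        }
        for (int i = 0; i < partialHeads.length; i++) {
            if (partialHeads[i] != -1 && partialHeads[i] != fullHeads[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For the sentence in Figure 1, with heads listed for words 1 to 7 and 0 standing for the root as in the CoNLL format, the partial input corresponds to the heads {-1, -1, 4, -1, -1, 4, 2} and the filled tree to {2, 0, 4, 2, 6, 4, 2}, so the check succeeds.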

2.5 Yara Pipeline

We also provide an easy pipeline to use Yara in real applications. The pipeline benefits from the OpenNLP (http://opennlp.apache.org/index.html) tokenizer and sentence delimiter and our own POS tagger (https://github.com/rasoolims/SemiSupervisedPosTagger). Thus the user has to download a specific sentence boundary detection and word tokenizer model from the OpenNLP website depending on the target language. It is also possible to train a new sentence boundary detection and word tokenizer model with OpenNLP (for more information, see the OpenNLP manual at https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html).

The number of threads can be changed via the option nt:[#nt] (e.g. nt:10). The pipeline can be downloaded from https://github.com/rasoolims/YaraPipeline.

>> java -jar jar/YaraPipeline.jar -input [input file] -output [output file] -parse_model [parse model file] -pos_model [pos model] -tokenizer_model [tokenizer model] -sentence_model [sentence detector model]

2.6 Pipeline API usage

It is possible to use the Yara API directly (https://github.com/yahoo/YaraParser/blob/master/src/YaraParser/Parser/API_UsageExample.java), but the pipeline gives an easier way to do it with different levels of information. The user can set the number of threads for parsing: numberOfThreads.

2.6.1 Importing libraries

The user should first import libraries into the code as in Listing 1. Class YaraPipeline.java contains static methods for parsing a sentence, and ParseResult contains information about words, POS tags, dependency labels and heads, and normalized tagging score and parsing score. Info contains all information about parsing setting and models for the parser, POS tagger, tokenizer and sentence boundary detector.

import edu.columbia.cs.rasooli.YaraPipeline.Structs.Info;
import edu.columbia.cs.rasooli.YaraPipeline.Structs.ParseResult;
import edu.columbia.cs.rasooli.YaraPipeline.YaraPipeline;
Listing 1: Code for importing necessary libraries.

2.6.2 Parsing Raw Text File

In this case, we need to have all models for parsing, tagging, tokenization and sentence boundary detection. Listing 2 shows such a case where the parser puts the results in CoNLL format into the [output_file].

// replace the bracketed placeholders (e.g. [parse_model]) with real file paths
Info info1 = new Info("[parse_model]", "[pos_model]", "[tokenizer_model]", "[sentence_model]", numberOfThreads);
Listing 2: Code for parsing raw text file

2.6.3 Parsing Raw Text

Similar to parsing a file, we can parse raw texts. It is shown in Listing 3.

// replace the bracketed placeholders (e.g. [parse_model]) with real file paths
Info info2 = new Info("[parse_model]", "[pos_model]", "[tokenizer_model]", "[sentence_model]", numberOfThreads);
String someText = "some text….";
String conllOutputText2 = YaraPipeline.parseText(someText, info2);
Listing 3: Code for parsing raw text

2.6.4 Parsing a Sentence

For the cases where the user uses his own sentence delimiter, it is possible to parse sentences as shown in Listing 4.

// replace the bracketed placeholders (e.g. [parse_model]) with real file paths
Info info3 = new Info("[parse_model]", "[pos_model]", "[tokenizer_model]", numberOfThreads);
String someSentence = "some sentence.";
ParseResult parseResult3 = YaraPipeline.parseSentence(someSentence, info3);
String conllOutputText3 = parseResult3.getConllOutput();
Listing 4: Code for parsing a sentence

2.6.5 Parsing a Tokenized Sentence

Listing 5 shows an example for the cases where the user only wants to use the parser and POS tagger to parse a pre-tokenized sentence.

// replace the bracketed placeholders (e.g. [parse_model]) with real file paths
Info info4 = new Info("[parse_model]", "[pos_model]", numberOfThreads);
String[] someWords4 = {"some", "words", "…"};
ParseResult parseResult4 = YaraPipeline.parseTokenizedSentence(someWords4, info4);
String conllOutputText4 = parseResult4.getConllOutput();
Listing 5: Code for parsing a tokenized sentence

2.6.6 Parsing a Tagged Sentence

Listing 6 shows an example for the cases where the user only wants to use Yara to parse pre-tagged sentence.

// replace the bracketed placeholders (e.g. [parse_model]) with real file paths
Info info5 = new Info("[parse_model]", numberOfThreads);
String[] someWords5 = {"some", "words", "…"};
String[] someTags5 = {"tag1", "tag2", "tag3"};
ParseResult parseResult5 = YaraPipeline.parseTaggedSentence(someWords5, someTags5, info5);
String conllOutputText5 = parseResult5.getConllOutput();
Listing 6: Code for parsing a tagged sentence

3 Yara Technical Details

Yara is a transition-based dependency parser based on the arc-eager algorithm [Nivre, 2004]. It uses beam search training and decoding [Zhang and Clark, 2008] in order to avoid local errors in parser decisions. The features of the parser are roughly the same as [Zhang and Nivre, 2011] with additional Brown clustering [Brown et al., 1992] features (the idea of using Brown clustering features is inspired by [Koo et al., 2008, Honnibal and Johnson, 2014]). Yara also includes several flexible parameters and options to allow users to easily tune it depending on the language and task. Generally speaking, there are 128 possible combinations of the settings in addition to tuning the number of iterations, Brown clustering features and beam width (the best performing setting is the default setting for Yara).

3.1 Arc-Eager Algorithm

As in the arc-eager algorithm, Yara has the following actions:

  • Left-arc (LA): The first word in the buffer becomes the head of the top word in the stack. The top word is popped after this action.

  • Right-arc (RA): The top word in the stack becomes the head of the first word in the buffer.

  • Reduce (R): The top word in the stack is popped.

  • Shift (SH): The first word in the buffer is pushed to the stack.

Depending on position of the root, the constraints for initialization and actions differ. Figure 2 shows the transitions used to parse the sentence "I want to parse a sentence .".
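The four actions can be sketched in Java as operations on a stack, a buffer and a head array. This is a simplified illustration rather than Yara's implementation: the class ArcEagerSketch is hypothetical, the preconditions of the actions are not checked, and the root token is assumed to sit at the end of the buffer. As in the standard arc-eager system, and as the action sequence in Figure 2 indicates, Right-arc also moves the new dependent onto the stack.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Illustrative arc-eager configuration with the four basic actions (not Yara's own code). */
final class ArcEagerSketch {
    final Deque<Integer> stack = new ArrayDeque<>();
    final Deque<Integer> buffer = new ArrayDeque<>();
    final int[] heads;      // heads[d] = h, or -1 while word d is unattached
    final String[] labels;  // labels[d] = label of the arc into word d

    /** Words are numbered 1..n; the ROOT token n+1 is placed at the end of the buffer. */
    ArcEagerSketch(int n) {
        heads = new int[n + 2];
        labels = new String[n + 2];
        Arrays.fill(heads, -1);
        for (int i = 1; i <= n + 1; i++) {
            buffer.addLast(i);
        }
    }

    /** Shift: the first word in the buffer is pushed to the stack. */
    void shift() {
        stack.push(buffer.removeFirst());
    }

    /** Reduce: the top word in the stack is popped. */
    void reduce() {
        stack.pop();
    }

    /** Left-arc: the first buffer word becomes the head of the stack top, which is popped. */
    void leftArc(String label) {
        int dep = stack.pop();
        heads[dep] = buffer.peekFirst();
        labels[dep] = label;
    }

    /** Right-arc: the stack top becomes the head of the first buffer word, which is then pushed. */
    void rightArc(String label) {
        int dep = buffer.removeFirst();
        heads[dep] = stack.peek();
        labels[dep] = label;
        stack.push(dep);
    }

    /** Replays the action sequence of Figure 2 for "I want to parse a sentence ." */
    static int[] figure2Heads() {
        ArcEagerSketch c = new ArcEagerSketch(7);
        c.shift();
        c.leftArc("nsubj");
        c.shift();
        c.shift();
        c.leftArc("aux");
        c.rightArc("xcomp");
        c.shift();
        c.leftArc("det");
        c.rightArc("dobj");
        c.reduce();
        c.reduce();
        c.rightArc("punct");
        c.reduce();
        c.leftArc("root");
        return c.heads;
    }
}
```

Replaying the sequence leaves the stack empty and only the root token in the buffer, and the resulting head array matches the arcs listed in the last column of Figure 2.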

Unshift Action

The original algorithm is not guaranteed to output a tree, and thus on some occasions when the root is positioned at the beginning of the sentence, the parser has to connect all remaining words in the stack to the root token. In [Nivre and Fernández-González, 2014], a new action and an empty flag are introduced to compensate for this problem and preserve the tree constraint. The action is called unshift; it pops the top word of the stack and returns it to the start position of the buffer. We also added the “unshift” action for the cases where the root token is in the initial position of the sentence. This makes the parser more robust and gives a slight boost in performance (this problem happens less often with beam search and more often with greedy parsing).

| Act. | Stack | Buffer | Arc(h,d) |
|---|---|---|---|
| Shift | [] | [I, want, to, parse, a, sentence, ., ROOT] | |
| Left-arc(nsubj) | [I] | [want, to, parse, a, sentence, ., ROOT] | nsubj(2,1) |
| Shift | [] | [want, to, parse, a, sentence, ., ROOT] | |
| Shift | [want] | [to, parse, a, sentence, ., ROOT] | |
| Left-arc(aux) | [want, to] | [parse, a, sentence, ., ROOT] | aux(4,3) |
| Right-arc(xcomp) | [want] | [parse, a, sentence, ., ROOT] | xcomp(2,4) |
| Shift | [want, parse] | [a, sentence, ., ROOT] | |
| Left-arc(det) | [want, parse, a] | [sentence, ., ROOT] | det(6,5) |
| Right-arc(dobj) | [want, parse] | [sentence, ., ROOT] | dobj(4,6) |
| Reduce | [want, parse, sentence] | [., ROOT] | |
| Reduce | [want, parse] | [., ROOT] | |
| Right-arc(punct) | [want] | [., ROOT] | punct(2,7) |
| Reduce | [want, .] | [ROOT] | |
| Left-arc(root) | [want] | [ROOT] | root(8,2) |
| DONE! | [] | [ROOT] | |
Figure 2: A sample action sequence with arc-eager actions for the dependency tree in Figure 1.

3.2 Online Learning

Most current supervised parsers use online learning algorithms. Online learners are fast, efficient and very accurate. We use averaged structured perceptron [Collins, 2002] which is also used in previous similar parsers [Zhang and Clark, 2008, Zhang and Nivre, 2011, Choi and Palmer, 2011]. We use different engineering methods to speed up the parser, such as the averaging trick introduced by [Daumé III, 2006, Figure 2.3]. Furthermore, all the features except label set-lexical pair features are converted to long integer values to prevent frequent hash collisions and decrease memory consumption. Semi-sparse weight vectors are used for additional speed up, though it comes with an increase in memory consumption. The details of this implementation are out of the scope of this report.
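For readers unfamiliar with the averaging trick, the following Java sketch shows one common formulation of it: alongside each weight, a second value accumulates the update multiplied by the current example counter, so the averaged weight can be recovered at the end without summing the weight vector after every example. The class AveragedWeightsSketch, its string feature keys and its method names are assumptions made for illustration; Yara's internal representation (long integer features, semi-sparse vectors) is different.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative averaged-perceptron weights using the lazy averaging trick (not Yara's own code). */
final class AveragedWeightsSketch {
    private final Map<String, Double> weights = new HashMap<>();
    private final Map<String, Double> cached = new HashMap<>();
    private int counter = 1; // number of the example currently being processed

    /** Applies an update of size delta to one feature. */
    void update(String feature, double delta) {
        weights.merge(feature, delta, Double::sum);
        cached.merge(feature, counter * delta, Double::sum);
    }

    /** Must be called once after each training example, whether or not an update happened. */
    void endExample() {
        counter++;
    }

    /** Averaged weight: the current weight minus the cached sum divided by the counter. */
    double averaged(String feature) {
        return weights.getOrDefault(feature, 0.0) - cached.getOrDefault(feature, 0.0) / counter;
    }
}
```

With this convention the result equals the mean of the weight's value before the first example and after each processed example. For instance, an update of 1.0 during the first of two examples gives the values 0, 1 and 1, whose mean is 2/3.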

3.3 Beam Search and Update Methods

Early transition-based parsers such as the Malt parser [Nivre et al., 2006] were greedy and trained in batch mode. This was done by converting each tree to a set of independent actions. This has been shown to be less effective than a global search. Given our feature setting, it is impossible to use dynamic programming to get the exact result. We instead use beam search as an approximation (greedy search can be viewed as beam search with a beam size of one). Therefore, unlike batch learning, the same procedure is used for training and decoding the parser. Yara supports beam search and its default beam size is 64.
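A single step of beam search can be sketched as follows: every item in the beam is extended with every candidate action, the extensions are scored, and only the best few are kept. The Java code below is a generic illustration under simplifying assumptions; the class BeamStepSketch, the representation of an item as a list of action names and the scoring callback are hypothetical, and no check is made that an action is valid in a given configuration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Illustrative single beam-search step over action sequences (not Yara's own code). */
final class BeamStepSketch {
    /** A partial action sequence with its accumulated score. */
    static final class Item {
        final List<String> actions;
        final double score;

        Item(List<String> actions, double score) {
            this.actions = actions;
            this.score = score;
        }
    }

    /** Extends every beam item with every candidate action and keeps the best `width` results. */
    static List<Item> expand(List<Item> beam, List<String> candidateActions,
                             ToDoubleBiFunction<List<String>, String> actionScore, int width) {
        List<Item> next = new ArrayList<>();
        for (Item item : beam) {
            for (String action : candidateActions) {
                List<String> extended = new ArrayList<>(item.actions);
                extended.add(action);
                double gain = actionScore.applyAsDouble(item.actions, action);
                next.add(new Item(extended, item.score + gain));
            }
        }
        // Highest accumulated score first.
        next.sort(Comparator.comparingDouble((Item it) -> it.score).reversed());
        return new ArrayList<>(next.subList(0, Math.min(width, next.size())));
    }
}
```

Calling expand with a width of one keeps only the single best extension at each step, which corresponds to greedy search.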

There are several ways to update the classifier weights with beam learning. A very trivial way is to get the best scoring result from beam search as the prediction and update the weights compared to the gold. This is known as “late update” but it does not lead to a good performance [Huang et al., 2012]. A more appealing way is to keep searching until the gold prediction goes out of the beam or the search reaches the end state. This is known as “early update” [Collins and Roark, 2004] and studies have shown a boost in performance relative to late update [Collins and Roark, 2004, Zhang and Clark, 2008]. The main problem with early update is that it does not update the weights according to the maximally violated prediction. A “max-violation” is a state in the beam where the gold standard is out of the beam and the gap in the score of the gold prediction and best scoring beam item is maximum. With max-violation update [Huang et al., 2012], the learner updates the weights according to the max-violation state. In other words, max-violation is the worst mistake that the classifier makes in the beam compared to the gold action. Yara supports both early and max-violation update while Zpar [Zhang and Nivre, 2011] only supports early update and RedShift [Honnibal and Johnson, 2014] only supports max-violation. Its default value for the update model is max-violation.
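The difference between the two update points can be illustrated with a small Java sketch. It assumes that, for every search step, we have recorded the score of the gold prefix, the score of the best item in the beam, and whether the gold prefix is still in the beam. The class UpdatePointSketch and both method names are hypothetical.

```java
/** Illustrative selection of the update step for early and max-violation update. */
final class UpdatePointSketch {
    /** Early update: the first step at which the gold item is out of the beam, or -1 if none. */
    static int earlyUpdateStep(boolean[] goldInBeam) {
        for (int step = 0; step < goldInBeam.length; step++) {
            if (!goldInBeam[step]) {
                return step;
            }
        }
        return -1;
    }

    /**
     * Max-violation: among the steps where the gold item is out of the beam, the step with the
     * largest gap between the best beam score and the gold score, or -1 if there is none.
     */
    static int maxViolationStep(double[] goldScore, double[] bestBeamScore, boolean[] goldInBeam) {
        int best = -1;
        double bestGap = Double.NEGATIVE_INFINITY;
        for (int step = 0; step < goldInBeam.length; step++) {
            if (goldInBeam[step]) {
                continue;
            }
            double gap = bestBeamScore[step] - goldScore[step];
            if (gap > bestGap) {
                bestGap = gap;
                best = step;
            }
        }
        return best;
    }
}
```

If the gold prefix leaves the beam at step 2 with a gap of 0.5 and the gap grows to 2.5 at step 3, early update would use step 2 while max-violation would use step 3.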

3.4 Dynamic and Static Oracles

With the standard transition-based parsing algorithms, it is possible to have a parse tree with different action sequences. In other words, different search paths may lead to the same parse tree. Most of the off-the-shelf parsers such as Zpar [Zhang and Nivre, 2011] define some manual rules for recovering a gold oracle to give it to the learner. This is known as a static oracle. The other way is to allow the oracle to be dynamic and let the learner choose from the oracles [Goldberg and Nivre, 2013]. Yara supports both static and dynamic oracles. In the case of dynamic oracles, only zero-cost explorations are allowed. In [Goldberg and Nivre, 2013], the gold oracle can be chosen randomly but we also provided another option to choose the best scoring oracle as the selected oracle. The latter way is known as latent structured Perceptron [Sun et al., 2013] by supposing the gold tree as the structure and each oracle as a latent path for reaching the final structure. Our experiments show that using the highest scoring oracle gives slightly better results and thus we let it be the default option in the parser training.

3.5 Other Properties

Root Position

In [Ballesteros and Nivre, 2013], it is shown that the position of the root token has a significant effect on the parser performance. We allow the root to be either in the initial or final position in the sentence. The final position is the default option for Yara parser.


Features

We use roughly the same feature set as [Zhang and Nivre, 2011]. The extended feature set is the default, but the user can use the basic option to switch to the basic set of local features, which improves speed with a loss in accuracy. We also add extra features from Brown word clusters [Brown et al., 1992], as used in [Koo et al., 2008]: for the first word in the buffer and the top word in the stack, we use the prefixes of length 4 and 6 of the cluster bit string in the place of part-of-speech tags and the full bit string of the cluster in the place of words. When using all the features, we get a boost in performance but at the expense of speed.
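The three cluster-based values mentioned above can be derived from a cluster bit string as in the following Java sketch. The class BrownPrefixSketch is hypothetical, and the handling of bit strings shorter than the requested prefix (the whole string is returned) is an assumption rather than documented Yara behaviour.

```java
/** Illustrative extraction of Brown cluster prefix features (not Yara's own code). */
final class BrownPrefixSketch {
    /** Returns the first `length` bits, or the whole string if it is shorter. */
    static String prefix(String bits, int length) {
        return bits.length() <= length ? bits : bits.substring(0, length);
    }

    /** Returns {4-bit prefix, 6-bit prefix, full bit string} for one cluster. */
    static String[] features(String bits) {
        return new String[] {prefix(bits, 4), prefix(bits, 6), bits};
    }
}
```

Short prefixes group many words into coarse classes, which is why they can stand in for part-of-speech tags, while the full bit string identifies a single cluster and stands in for the word itself.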

Unlabeled Parsing

Although the parser is designed for labeled parsing, unlabeled parsing is also available through command line options. This is useful for the cases where the user simply needs a very fast parser and does not care about the loss in performance or the lack of label information.

Partial Parsing

There are some occasions especially in semi-supervised parsing, where we have partial information about the tree, for example, we know that the third word is the subject of the first word. With partial parsing, we let the user benefit from dynamic oracles to parse partial trees such that known arcs are preserved unless the tree constraints cannot be satisfied.


Multithreading

Given that current systems have multiple processing cores, many of which support hyper-threading, we added support for multithreading. When dealing with a file, the parser does multithreaded parsing at the sentence level (i.e. parsing sentences in parallel but outputting them in the same order given in the input). When using the API, it is possible to use multithreading at the beam level. Beam-level multithreading is slower than sentence-level multithreading. We also use beam-level multithreading for training the parser and this significantly speeds up the training phase. Yara’s default is set to 8 threads but the user can easily change it.
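Sentence-level parallelism with order-preserving output can be sketched with a fixed thread pool: each sentence is submitted as a task and the results are collected in submission order. The Java code below is a generic illustration; the class OrderedParallelSketch and the parseOne callback are hypothetical placeholders for whatever function parses a single sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustrative parallel map that returns results in input order (not Yara's own code). */
final class OrderedParallelSketch {
    static <I, O> List<O> mapInOrder(List<I> inputs, Function<I, O> parseOne, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> parseOne.apply(input);
                futures.add(pool.submit(task));
            }
            // Reading the futures in submission order keeps the output order stable,
            // no matter which task finishes first.
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as mapInOrder(sentences, parser, 8) would mirror the default of eight threads, with parser standing for any single-sentence parsing function.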

Model Selection

Unlike most current parsers, Yara saves the model file for all training iterations and lets the user choose the best performing model based on the performance on the development data. It also reports the performance on the development data to make it easier for the users to select the best model.

Tree Scoring

Yara also has the option to output the parse score to a text file. The score is the perceptron score divided by the sentence length.


Lowercasing

In some cases, such as spoken language parsing, no casing is provided and it is better to train on lowercased text. Yara has this option with the argument lowercase in training.

4 Experiments

In this section we show how Yara performs on two different data sets and compare its performance to other leading parsers. We also graphically depict the tradeoff between beam width and accuracy and number of iterations. For all experiments we use version 0.2 of Yara. We use a multi-core 2.00GHz Intel Xeon machine. The machine has twenty cores but we only use 8 threads (parser’s default) in all experiments.

4.1 Parsing WSJ Data

We use the traditional WSJ train-dev-test split for our experiment. As in [Zhang and Nivre, 2011], we first converted the WSJ data [Marcus et al., 1993] with Penn2Malt (stp.lingfil.uu.se/~nivre/research/Penn2Malt.html). Next, automatic POS tags are generated for the whole dataset with version 0.2 of our POS tagger (https://github.com/rasoolims/SemiSupervisedPosTagger) by doing 10-way jack-knifing on the training data. The tagger is a 20-beam third-order tagger trained with the maximum violation strategy with the same settings as in [Collins, 2002], along with additional Brown clustering features [Liang, 2005] (we use the pre-built Brown cluster features in http://metaoptimize.com/projects/wordreprs/ with 1000 word classes). It achieved a POS tagging accuracy of 97.14, 97.18 and 97.37 on the train, development and test files respectively.

Table 1 shows the results on WSJ data by varying beam size and the use of Brown clusters. A comparison with prior art is made in Table 2. All unlabeled accuracy scores (UAS) and labeled accuracy scores (LAS) are calculated with punctuations ignored. As seen in Table 1, Yara’s accuracy is very close to the state-of-the-art [Bohnet and Nivre, 2012].

| Parser | Beam | Features | Iter# | Dev UAS | Test UAS | Test LAS | Sent/Sec |
|---|---|---|---|---|---|---|---|
| Yara | 1 | ZN (basic+unlabeled) | 5 | 89.29 | 88.73 | - | 3929 |
| Yara | 1 | ZN (basic) | 6 | 89.54 | 89.34 | 88.02 | 3921 |
| Yara | 1 | ZN + BC | 13 | 89.98 | 89.74 | 88.52 | 1300 |
| Yara | 64 | ZN | 13 | 93.31 | 92.97 | 91.93 | 133 |
| Yara | 64 | ZN + BC | 13 | 93.42 | 93.32 | 92.32 | 45 |

Table 1: Parsing accuracies of Yara parser on WSJ data. BC stands for Brown cluster features, UAS for unlabeled attachment score, LAS for labeled attachment score and ZN for [Zhang and Nivre, 2011]. Sent/sec refers to the speed in sentences per second.

| Parser | UAS | LAS |
|---|---|---|
| Graph-based parsers | | |
| [McDonald et al., 2005] | 90.9 | - |
| [McDonald and Pereira, 2006] | 91.5 | - |
| [Sagae and Lavie, 2006] | 92.7 | - |
| [Koo and Collins, 2010] | 93.04 | - |
| [Zhang and McDonald, 2012] | 93.06 | - |
| [Martins et al., 2013] | 93.07 | - |
| [Qian and Liu, 2013] | 93.17 | - |
| [Ma and Zhao, 2012] | 93.4 | - |
| [Zhang et al., 2013b] | 93.50 | 92.41 |
| [Zhang and McDonald, 2014] | 93.82 | 92.74 |
| Transition-based parsers | | |
| [Nivre et al., 2006] | 88.1 | 86.3 |
| [Zhang and Clark, 2008] | 92.1 | - |
| [Huang and Sagae, 2010] | 92.1 | - |
| [Zhang and Nivre, 2011] | 92.9 | 91.8 |
| [Bohnet and Nivre, 2012] | 93.38 | 92.44 |
| [Choi and McCallum, 2013] | 92.96 | 91.93 |
| Yara | 93.32 | 92.32 |

Table 2: Parsing accuracies on WSJ data. We only report results which use the standard train-dev-test splits and do not make use of additional training data (as in self-training). The first block of rows are the graph-based parsers and the second block are the transition-based parsers (including Yara).

Effect of Beam Size

Choosing a reasonable beam size is essential in certain NLP applications as there is always a trade-off between speed and performance. As shown in Figure 3, increasing the beam size beyond eight changes the accuracy much less than, for example, moving from a beam of size one to a beam of size two. This is useful because changing the beam size from 64 to 8 may speed up parsing by a factor of three (as shown in Table 3) with a small relative loss in performance.

Figure 3: The influence of beam size at each training iteration for Yara parser. Yara is trained with Brown clusters in all of the experiments in this figure.

| Beam Size | 1 (ub) | 1 (b) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|---|---|
| Dev UAS | 89.29 | 89.54 | 89.98 | 91.95 | 92.80 | 93.03 | 93.27 | 93.22 | 93.42 |
| Speed (sen/sec) | 3929 | 3921 | 1300 | 370 | 280 | 167 | 110 | 105 | 45 |

Table 3: Speed vs. performance trade-off when using Brown clustering features and parsing CoNLL files with eight threads (except 1 (ub) and 1 (b) which are unlabeled and labeled parsing with basic features). The numbers are averaged over 20 training iterations and parsing development set after each iteration.

| Model | Unlabeled accuracy | Labeled accuracy |
|---|---|---|
| Mate v3.6.1 | 91.32 | 87.68 |
| Yara (without Brown clusters) | 89.52 | 85.77 |
| Yara (with Brown clusters) | 89.97 | 86.32 |

Table 4: Parsing results on the Persian treebank excluding punctuations

4.2 Parsing Non-Projective Languages: Persian

As mentioned before, Yara can only be trained on projective trees and thus there will be some loss in accuracy for non-projective languages. We use version 1.1 of the Persian dependency treebank (PerDT) [Rasooli et al., 2013] (http://www.dadegan.ir/catalog/perdt) and tag it with the same setting as the WSJ data. We tokenize the Mizan corpus (http://www.dadegan.ir/catalog/mizan) and add it to our training data to create 1000 Brown clusters. (The definition of Brown cluster in this data is loose because there are multi-word verbs in the treebank while Brown clusters are acquired from training on single words. Therefore multi-word verbs in the treebank will not get any Brown cluster assignment and thus we will have a slight loss in performance.) The training data contains 22% non-projective trees. We use Mate parser (v3.6.1) [Bohnet, 2010] as a highly accurate non-projective parsing tool to compare with Yara. Table 4 shows the performance for the two parsers. There is a 1.35% gap in unlabeled accuracy but given that 22% of the trees (2.5% of the arcs) are non-projective, this gap is reasonable.

5 Conclusion and Future Work

We presented an introduction to our open-source dependency parser and showed that it is both fast and accurate. The parser can also be used for languages with non-projective structures at a modest cost in accuracy (a 1.35% gap in unlabeled accuracy relative to a non-projective parser on Persian). We believe that the parser can be useful in a range of downstream tasks given its performance and flexible license. Our future plans include extending the parser to handle non-projectivity and using continuous-valued representation features, such as word embeddings, to improve its accuracy.


We would like to thank the Yahoo Labs open-sourcing team for allowing us to release the parser with a very flexible license, especially Dev Glass for setting up the initial release of the code. We thank Amanda Stent, Kapil Thadani, Idan Szpektor, Yuval Pinter and other colleagues at Yahoo Labs for their support and fruitful ideas. Finally, we thank Michael Collins and Matthew Honnibal for their feedback.


  • [Ballesteros and Nivre, 2013] Ballesteros, M. and Nivre, J. (2013). Going to the roots of dependency parsing. Computational Linguistics, 39(1):5–13.
  • [Bohnet, 2010] Bohnet, B. (2010). Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 89–97, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Bohnet and Nivre, 2012] Bohnet, B. and Nivre, J. (2012). A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465, Jeju Island, Korea. Association for Computational Linguistics.
  • [Brown et al., 1992] Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.
  • [Choi and McCallum, 2013] Choi, J. D. and McCallum, A. (2013). Transition-based dependency parsing with selectional branching. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1052–1062, Sofia, Bulgaria. Association for Computational Linguistics.
  • [Choi and Palmer, 2011] Choi, J. D. and Palmer, M. (2011). Getting the most out of transition-based dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 687–692, Portland, Oregon, USA. Association for Computational Linguistics.
  • [Collins, 2002] Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8. Association for Computational Linguistics.
  • [Collins and Roark, 2004] Collins, M. and Roark, B. (2004). Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 111–118, Barcelona, Spain.
  • [Daumé III, 2006] Daumé III, H. (2006). Practical Structured Learning Techniques for Natural Language Processing. PhD thesis, University of Southern California, Los Angeles, CA.
  • [Daumé III, 2009] Daumé III, H. (2009). Unsupervised search-based structured prediction. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 209–216. ACM.
  • [Goldberg and Nivre, 2013] Goldberg, Y. and Nivre, J. (2013). Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.
  • [Honnibal and Johnson, 2014] Honnibal, M. and Johnson, M. (2014). Joint incremental disfluency detection and dependency parsing. Transactions of the Association for Computational Linguistics, 2:131–142.
  • [Huang et al., 2012] Huang, L., Fayong, S., and Guo, Y. (2012). Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151, Montréal, Canada. Association for Computational Linguistics.
  • [Huang and Sagae, 2010] Huang, L. and Sagae, K. (2010). Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1077–1086, Uppsala, Sweden. Association for Computational Linguistics.
  • [Koo et al., 2008] Koo, T., Carreras, X., and Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, pages 595–603, Columbus, Ohio. Association for Computational Linguistics.
  • [Koo and Collins, 2010] Koo, T. and Collins, M. (2010). Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden. Association for Computational Linguistics.
  • [Kübler et al., 2009] Kübler, S., McDonald, R., and Nivre, J. (2009). Dependency parsing. Synthesis Lectures on Human Language Technologies, 1(1):1–127.
  • [Liang, 2005] Liang, P. (2005). Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology.
  • [Ma and Zhao, 2012] Ma, X. and Zhao, H. (2012). Fourth-order dependency parsing. In Proceedings of COLING 2012: Posters, pages 785–796, Mumbai, India. The COLING 2012 Organizing Committee.
  • [Marcus et al., 1993] Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • [Martins et al., 2013] Martins, A., Almeida, M., and Smith, N. A. (2013). Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 617–622, Sofia, Bulgaria. Association for Computational Linguistics.
  • [McDonald et al., 2005] McDonald, R., Pereira, F., Ribarov, K., and Hajič, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 523–530, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [McDonald and Pereira, 2006] McDonald, R. T. and Pereira, F. C. (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.
  • [Nivre, 2004] Nivre, J. (2004). Incrementality in deterministic dependency parsing. In Keller, F., Clark, S., Crocker, M., and Steedman, M., editors, Proceedings of the ACL Workshop Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain. Association for Computational Linguistics.
  • [Nivre and Fernández-González, 2014] Nivre, J. and Fernández-González, D. (2014). Arc-eager parsing with the tree constraint. Computational Linguistics, 40(2):259–267.
  • [Nivre et al., 2006] Nivre, J., Hall, J., and Nilsson, J. (2006). MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC, volume 6, pages 2216–2219.
  • [Petrov et al., 2011] Petrov, S., Das, D., and McDonald, R. (2011). A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.
  • [Qian and Liu, 2013] Qian, X. and Liu, Y. (2013). Branch and bound algorithm for dependency parsing with non-local features. Transactions of the Association for Computational Linguistics, 1:37–48.
  • [Rasooli and Faili, 2012] Rasooli, M. S. and Faili, H. (2012). Fast unsupervised dependency parsing with arc-standard transitions. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 1–9, Avignon, France. Association for Computational Linguistics.
  • [Rasooli et al., 2013] Rasooli, M. S., Kouhestani, M., and Moloodi, A. (2013). Development of a Persian syntactic dependency treebank. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 306–314, Atlanta, Georgia. Association for Computational Linguistics.
  • [Rasooli and Tetreault, 2013] Rasooli, M. S. and Tetreault, J. (2013). Joint parsing and disfluency detection in linear time. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 124–129, Seattle, Washington, USA. Association for Computational Linguistics.
  • [Rasooli and Tetreault, 2014] Rasooli, M. S. and Tetreault, J. (2014). Non-monotonic parsing of fluent umm I mean disfluent sentences. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, pages 48–53, Gothenburg, Sweden. Association for Computational Linguistics.
  • [Sagae and Lavie, 2006] Sagae, K. and Lavie, A. (2006). Parser combination by reparsing. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, pages 129–132, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Sun et al., 2013] Sun, X., Matsuzaki, T., and Li, W. (2013). Latent structured perceptrons for large-scale learning with hidden information. IEEE Transactions on Knowledge and Data Engineering, 25(9):2063–2075.
  • [Zhang et al., 2013a] Zhang, D., Wu, S., Yang, N., and Li, M. (2013a). Punctuation prediction with transition-based parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 752–760, Sofia, Bulgaria. Association for Computational Linguistics.
  • [Zhang et al., 2013b] Zhang, H., Huang, L., Zhao, K., and McDonald, R. (2013b). Online learning for inexact hypergraph search. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 908–913, Seattle, Washington, USA. Association for Computational Linguistics.
  • [Zhang and McDonald, 2012] Zhang, H. and McDonald, R. (2012). Generalized higher-order dependency parsing with cube pruning. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 320–331, Jeju Island, Korea. Association for Computational Linguistics.
  • [Zhang and McDonald, 2014] Zhang, H. and McDonald, R. (2014). Enforcing structural diversity in cube-pruned dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 656–661, Baltimore, Maryland. Association for Computational Linguistics.
  • [Zhang and Clark, 2008] Zhang, Y. and Clark, S. (2008). A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571, Honolulu, Hawaii. Association for Computational Linguistics.
  • [Zhang and Nivre, 2011] Zhang, Y. and Nivre, J. (2011). Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, Oregon, USA. Association for Computational Linguistics.