1 Introduction
Self-attention models, such as the Transformer
vaswani2017attention, have been hugely successful in a wide range of natural language processing (NLP) tasks, especially when combined with language-model pretraining, such as BERT
devlin2018bert. These architectures contain a stack of self-attention layers which can capture long-range dependencies over the input sequence, while still representing its sequential order using absolute position encodings. Alternatively, shawetal2018self propose to represent sequential order with relative position encodings, which are input to the self-attention functions. Recently, mohammadshahi2019graphtograph extended this input method from sequences to arbitrary graph relations, input via the self-attention mechanism, and combined it with an attention-like function for graph relation prediction, resulting in their proposed Graph-to-Graph Transformer architecture (G2GTr). They demonstrated the effectiveness of G2GTr for transition-based dependency parsing and its compatibility with pretrained BERT devlin2018bert. This parsing model predicts one edge of the parse graph at a time, conditioning on the graph of previous edges, so it is autoregressive in nature.
In this paper, we propose to take advantage of the graph-to-graph functionality of G2GTr to define a model that predicts all edges of the graph in parallel, and is therefore non-autoregressive, but still captures between-edge dependencies, like an autoregressive model. This is done by recursively applying a G2GTr model to correct the errors in a previous prediction of the output graph. At the point when no more corrections are made, all edges are predicted conditioned on all other edges in the output graph.
Our proposed Recursive Non-autoregressive Graph-to-Graph Transformer (RNGTr) architecture first computes an initial graph with any given parsing model, even a trivial one. It then iteratively refines this graph by combining the G2GTr edge prediction step with a decoding step which finds the best graph given these predictions. The output of the model is the graph produced by the final refinement step.
The RNG Transformer architecture can be applied to any task with a sequence or graph as input and a graph over the same set of nodes as output. We evaluate RNGTr on syntactic dependency parsing because it is a difficult structured prediction task, state-of-the-art initial parsers are extremely competitive, and there is little previous evidence that non-autoregressive models (as in graph-based dependency parsers) are not sufficient for this task. We aim to show that capturing correlations between dependencies with recursive non-autoregressive refinement results in improvements over state-of-the-art dependency parsers.
We propose the RNG Transformer model of graph-based dependency parsing. At each step of iterative refinement, the RNGTr model uses a G2GTr model to compute vector representations of the nodes, and then predicts the scores for every possible dependency edge. These scores are then used to find a complete dependency graph, which is in turn input to the next step of iterative refinement. The model stops when no changes are made to the graph, or when a maximum number of iterations is reached.
The evaluation demonstrates improvements with several initial parsers, including previous state-of-the-art dependency parsers and the empty parse. We also introduce a strong Transformer-based dependency parser, called Dependency BERT (DepBERT), which is initialised with a pretrained BERT model devlin2018bert; we use it both as our initial parser and as the basis of our refinement model. Results on 13 languages from the Universal Dependencies Treebanks 11234/12895 and the English and Chinese Penn Treebanks marcusetal1993building; xueetal2002building show significant improvements over all initial parsers and over the state-of-the-art.
In this paper, we make the following contributions:

We propose a novel architecture for the iterative refinement of arbitrary graphs, called Recursive Non-autoregressive Graph-to-Graph Transformer (RNGTr), which combines non-autoregressive edge prediction with conditioning on the complete graph.

We propose a Transformer model for graph-based dependency parsing (DepBERT) with BERT pretraining.

We propose the RNGTr model of dependency parsing, using initial parses from a variety of previous models and our DepBERT model.

We demonstrate significant improvements over the previous state-of-the-art results on the Universal Dependency Treebanks and Penn Treebanks.
2 Graph-based Dependency Parsing
Syntactic dependency parsing is a critical component in a variety of natural language understanding tasks, such as semantic role labeling marcheggianititov2017encoding, machine translation Chen_2017, relation extraction zhangetal2018graph, and natural language interfaces pang2019improving. There are two main approaches to computing the dependency tree: transition-based yamadamatsumoto2003statistical; nivrescholz2004deterministic; titovhenderson2007latent; zhangnivre2011transition and graph-based eisner1996three; mcdonaldetal2005online; koo2010efficient models. Transition-based parsers predict the dependency graph one edge at a time through a sequence of parsing actions. Graph-based models compute scores for every possible dependency edge and then apply a decoding algorithm, such as the maximum spanning tree (MST) algorithm chi1999statistical; edmonds1967optimum, to find the highest-scoring total tree. Typically these models consist of two components: an encoder learns context-dependent vector representations for the nodes of the dependency graph, and a decoder computes the dependency scores for each pair of nodes, after which a decoding algorithm finds the highest-scoring dependency tree.
Graph-based models should have trouble capturing correlations between dependency edges, since the score for an edge must be chosen without knowing which other edges MST will choose; MST itself only imposes the discrete tree constraint between edges. And yet, graph-based models are the state-of-the-art in dependency parsing. In this paper, we show that it is possible to improve over the state-of-the-art by recursively conditioning each edge score prediction on a previous prediction of the complete dependency graph.
3 RNG Transformer
The RNG Transformer architecture is illustrated in Figure 1, here applied to dependency parsing. The input to the RNGTr model is an initial graph G^0 (e.g. a parse tree) over the input nodes (e.g. a sentence), and the output is a final graph G^T over the same set of nodes. Each iteration t takes the previous graph G^{t-1} as input and predicts a new graph G^t.
The RNGTr model predicts G^t with a novel version of a Graph-to-Graph Transformer mohammadshahi2019graphtograph. Unlike in the work of mohammadshahi2019graphtograph, this G2GTr model predicts every edge of the graph in a single non-autoregressive step. As previously, the G2GTr first encodes the input graph in a set of contextualised vector representations Z^t, with one vector for each node of the graph. The decoder component then predicts the output graph G^t by first computing scores for each possible edge between each pair of nodes and then applying a decoding algorithm to output the highest-scoring complete graph.
We can formalise the RNGTr model as:

Z^t = Enc(W, P, G^{t-1}),  G^t = Dec(Z^t)   for t = 1, …, T    (1)

where Enc is the encoder component, Dec is the decoder component, W is the sequence of input words and symbols, P is the part-of-speech tags of the associated words and symbols, and T is the number of recursion steps. The predicted dependency graph at iteration t is specified as:
G^t = { (i, h_i^t, l_i^t) | 1 ≤ i ≤ N }    (2)

where each word w_i has one head (parent) w_{h_i^t} with dependency label l_i^t from the label set L, and the parent may be the ROOT symbol (see Section 3.1.1).
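As a sketch, the recursion in Eq. (1) together with the stopping criterion of Section 3.3 amounts to the following loop, where `encode` and `decode` stand in for the trained G2GTr encoder and decoder components (both names are placeholders, not the paper's code):

```python
def rng_refine(words, pos_tags, initial_graph, encode, decode, max_iters=3):
    """Recursive non-autoregressive refinement (Eq. 1).

    encode(words, pos_tags, graph) -> node representations Z
    decode(Z) -> a complete dependency graph over the same nodes
    Both are placeholders for the trained G2GTr components.
    """
    graph = initial_graph
    for _ in range(max_iters):
        z = encode(words, pos_tags, graph)   # condition on the previous graph
        new_graph = decode(z)                # predict all edges in parallel
        if new_graph == graph:               # stopping criterion (Sec. 3.3)
            break
        graph = new_graph
    return graph
```

Note that each iteration conditions on the whole previous graph, so after convergence every edge has been predicted conditioned on all other edges of the output graph.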
In the following sections, we will describe each element of our RNGTr dependency parsing model in detail.
3.1 Encoder
To compute the embeddings for the nodes of the graph, we use the Graph-to-Graph Transformer architecture proposed by mohammadshahi2019graphtograph, including a similar mechanism to input the previously predicted dependency graph to the attention mechanism. This graph input allows the node embeddings to include both token-level and dependency-level information.
3.1.1 Input Embeddings
The RNGTr model receives a sequence of input tokens W = (w_1, …, w_N) with their associated Part-of-Speech tags P = (p_1, …, p_N) and builds a sequence of input embeddings X. The sequence of input tokens starts with the CLS symbol and ends with the SEP symbol (the same as BERT devlin2018bert's input token representation). It also adds the ROOT symbol to the front of the sentence to represent the root of the dependency tree.
To build the representation of each input token, we sum several vectors. For the input words and symbols, we sum the pretrained word embeddings of BERT and learned representations of their POS tags. To keep the order information of the initial sequence, we add the pretrained position embeddings of BERT to these token embeddings. The final input representations are the sum of the position embeddings and the token embeddings:

x_i = Emb_BERT(w_i) + Emb_POS(p_i) + PosEmb(i)    (3)
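As an illustration, the embedding sum of Eq. (3) can be sketched as follows, with `word_emb`, `pos_emb`, and `position_emb` standing in for the pretrained BERT word embeddings, the learned POS-tag embeddings, and the pretrained position embeddings (the lookup-table names are hypothetical placeholders):

```python
import numpy as np

def input_embeddings(tokens, pos_tags, word_emb, pos_emb, position_emb):
    """Eq. (3): sum of pretrained word embedding, learned POS embedding,
    and pretrained position embedding for each input token.
    word_emb / pos_emb are lookup tables (dicts of vectors);
    position_emb is indexed by position."""
    return [word_emb[w] + pos_emb[p] + position_emb[i]
            for i, (w, p) in enumerate(zip(tokens, pos_tags))]
```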
3.1.2 SelfAttention Mechanism
Conditioning on the previously predicted output graph is made possible by inputting relation embeddings to the self-attention mechanism. This edge input method was originally proposed by shawetal2018self for relative position encoding, and extended to unlabelled dependency graphs in the Graph-to-Graph Transformer architecture of mohammadshahi2019graphtograph. We use it to input labelled dependency graphs. A relation embedding is added both to the value function and to the attention weight function wherever the two related tokens are involved.
Transformers have multiple layers of self-attention, each with multiple heads. The RNGTr architecture uses the same architecture as BERT devlin2018bert but changes the functions used by each attention head. Given the token embeddings x_1, …, x_n at the previous layer and the input graph G^{t-1}, the values z_i computed by an attention head are:

z_i = Σ_{j=1}^{n} α_{ij} ( x_j W^V + p_{ij} W^{L1} )    (4)

where p_{ij} is a one-hot vector that represents the labelled dependency relationship between i and j in the graph G^{t-1}. These relationships include both the label and the direction (head-dependent versus dependent-head), or specify none. W^{L1} ∈ R^{(2L+1)×d} are learned relation embeddings (where L is the number of dependency labels). The attention weights α_{ij} are a softmax applied to the attention function:
α_{ij} = softmax_j ( (x_i W^Q) LN( x_j W^K + p_{ij} W^{L2} )^T / √d )    (5)

where W^{L2} ∈ R^{(2L+1)×d} are learned relation embeddings, and LN(·) is the layer normalisation function, used for better convergence.
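A minimal sketch of one such graph-conditioned attention head (Eqs. 4 and 5, omitting layer normalisation and multi-head plumbing; all matrix names here are illustrative):

```python
import numpy as np

def graph_attention_head(x, rel_onehot, Wq, Wk, Wv, rel_k, rel_v):
    """One attention head with graph-relation inputs.

    x:          (n, d) token embeddings from the previous layer
    rel_onehot: (n, n, r) one-hot labelled relation between tokens i and j
    rel_k, rel_v: (r, d) learned relation embeddings added to the
                  attention-weight and value computations respectively.
    """
    n, d = x.shape
    q = x @ Wq
    k = x @ Wk
    # add the relation embedding on the key side of the attention function
    k_rel = k[None, :, :] + rel_onehot @ rel_k           # (n, n, d)
    scores = np.einsum('id,ijd->ij', q, k_rel) / np.sqrt(d)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)            # softmax over j
    # add the relation embedding to the value function as well
    v_rel = (x @ Wv)[None, :, :] + rel_onehot @ rel_v    # (n, n, d)
    return np.einsum('ij,ijd->id', alpha, v_rel)
```

With all relation inputs set to zero, this reduces to a standard scaled dot-product attention head, which is the sense in which the graph input is an additive extension.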
3.2 Decoder
The decoder uses the token embeddings Z^t produced by the encoder to predict the new dependency graph G^t. It consists of two components, a scoring function and a decoding algorithm. The dependency graph found by the decoding algorithm is the output graph of the decoder.
There are differences between the segmentation of the input string used by BERT and the segmentation used by the dependency treebanks. To compensate for this, the encoder uses the segmentation of BERT, but only a subset of the resulting token embeddings are considered by the decoder. For each word in the dependency annotation, only the first segment of that word from the BERT segmentation is used in the decoder. See Section 5.3 for more details.
3.2.1 Scoring Function
We first produce four distinct vectors for each token embedding z_i from the encoder by passing it through four feed-forward layers:

h_i^{arc-dep} = FF^{arc-dep}(z_i)    h_i^{arc-head} = FF^{arc-head}(z_i)
h_i^{rel-dep} = FF^{rel-dep}(z_i)    h_i^{rel-head} = FF^{rel-head}(z_i)    (6)

where the FF's are all one-layer feed-forward networks with LeakyReLU activation functions.
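The biaffine arc scoring described below, following dozat2016deep, can be sketched as follows (the weight names `U` and `b` are illustrative, not the paper's notation):

```python
import numpy as np

def arc_scores(h_arc_dep, h_arc_head, U, b):
    """Biaffine scores s[i, j] for token i choosing token j as its head.

    h_arc_dep:  (n, d) dependent-role vectors
    h_arc_head: (n, d) head-role vectors
    U: (d, d) biaffine weight; b: (d,) bias on the head side.
    A softmax over each row then gives the head distribution for token i.
    """
    return h_arc_dep @ U @ h_arc_head.T + h_arc_head @ b  # (n, n)
```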
These token embeddings are used to compute probabilities for every possible dependency relation, both unlabelled and labelled, similarly to dozat2016deep. The distribution over the unlabelled dependency graph is estimated using, for each token i, a biaffine classifier over possible heads j applied to h_i^{arc-dep} and h_j^{arc-head}. Then, for each pair (i, j), the distribution over labels given an unlabelled dependency relation is estimated using a biaffine classifier applied to h_i^{rel-dep} and h_j^{rel-head}.

3.2.2 Decoding Algorithms
The scoring function estimates a distribution over graphs, but the RNGTr architecture requires the decoder to output a single graph G^t. Choosing this graph is complicated by the fact that the scoring function is non-autoregressive: the estimate consists of multiple independent components, so there is no guarantee that every graph in this distribution is a valid dependency graph.
We take two approaches to this problem, one for the intermediate parses and one for the final dependency parse G^T. To speed up each refinement iteration, we ignore this problem for intermediate dependency graphs. We build these graphs by simply applying argmax independently to find the head of each node. This may result in graphs with loops, which are not trees, but this does not seem to cause problems for later refinement iterations. (We will investigate different decoding strategies to keep both the speed and the well-formedness of the intermediate predicted graphs in future work.) For the final output dependency tree, we use the maximum spanning tree algorithm, specifically the Chu-Liu/Edmonds algorithm chi1999statistical; edmonds1967optimum, to find the highest-scoring valid dependency tree. This is necessary to avoid problems when running the evaluation scripts.
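A sketch of the fast intermediate decoding step, which picks the highest-scoring head for each token independently; the final iteration instead uses the Chu-Liu/Edmonds MST algorithm, which is not shown here:

```python
def argmax_decode(scores):
    """Intermediate decoding: choose the best head per token independently.

    scores[i][j] = score of token j being the head of token i; index 0 is
    ROOT, and tokens are 1..n (ROOT itself has no head). The result may
    contain cycles, which is why the final iteration uses MST instead.
    """
    heads = [None]  # ROOT
    for i in range(1, len(scores)):
        row = scores[i]
        best = max(range(len(row)),
                   key=lambda j: row[j] if j != i else float('-inf'))
        heads.append(best)
    return heads
```

For example, two tokens can each select the other as head, producing a two-node cycle that a spanning-tree decoder would forbid.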
3.3 Training
The RNG Transformer model is trained separately on each refinement iteration. Standard gradient descent techniques are used, with a cross-entropy loss for each edge prediction. Error is not backpropagated across iterations of refinement, because no continuous values are passed from one iteration to another, only a discrete dependency tree.
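The per-edge loss can be sketched as a sum of independent cross-entropy terms over the predicted head distributions (a simplified illustration that ignores the label classifier):

```python
import math

def edge_cross_entropy(head_probs, gold_heads):
    """Sum of per-token cross-entropy between the predicted head
    distribution and the gold head. Each edge prediction is treated
    independently, i.e. non-autoregressively within one iteration.

    head_probs[i][j] = predicted probability that j is the head of token i.
    """
    return -sum(math.log(head_probs[i][gold_heads[i]])
                for i in range(len(gold_heads)))
```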
Stopping Criterion
In the RNG Transformer architecture, the refinement of the predicted graph can be done an arbitrary number of times, since the same encoder and decoder parameters are used at each iteration. In the experiments below, we place a limit on the maximum number of iterations. But sometimes the model converges to an output graph before this limit is reached, simply copying this graph during later iterations. To avoid multiple iterations where the model is trained to simply copy the input graph, during training the refinement iterations are stopped if the new predicted dependency graph is the same as the input graph. At test time we also stop computation in this case, but the output of the model is obviously not affected.
4 Initial Parsers
The RNGTr architecture assumes that there is an initial graph for the RNGTr model to refine. We consider several initial parsers to produce this graph. In our experiments, we find that the initial graph has little impact on the quality of the graph output by the RNGTr model.
To leverage previous work on dependency parsing, we use parsing models from the recent literature as initial parsers. To evaluate the importance of the initial parse, we also consider a setting where the initial parse is empty, so the first complete dependency tree is predicted by the RNGTr model itself. Finally, the success of our RNGTr dependency parsing model leads us to propose an initial parsing model with the same design, so that we can control for the parser design in measuring the importance of the RNG Transformer’s iterative refinement.
DepBERT model
We call this initial parser the Dependency BERT (DepBERT) model. It is the same as one iteration of the RNGTr model shown in Figure 1 and defined in Section 3, except that there is no graph input to the encoder. Analogously to (1), the parse G is computed as:

G = Dec(Enc(W, P))    (7)

where Enc and Dec are the DepBERT encoder and decoder, respectively. For the encoder, we use the Transformer architecture of BERT devlin2018bert and initialise it with BERT's pretrained parameters. The token embeddings of the final layer are used as Z. For the decoder, we use the same segmentation strategy and scoring function as described in Section 3.2, and apply the Chu-Liu/Edmonds decoding algorithm chi1999statistical; edmonds1967optimum to find the highest-scoring tree. The DepBERT parsing model is very similar to the UDify parsing model proposed by Kondratyuk_2019, but there are significant differences in the way token segmentation is handled, which result in significant differences in performance, as shown in Section 6.2.
5 Experimental Setup
5.1 Datasets
To evaluate our models, we apply them to two kinds of datasets, the Universal Dependency (UD) Treebanks 11234/12895 and the Penn Treebanks. For evaluation, following Kulmizev_2019; 11234/12895, we keep punctuation for the UD Treebanks and remove it for the Penn Treebanks nilssonnivre2008malteval.
Universal Dependency Treebanks:
We evaluate our models on the Universal Dependency Treebanks (UD v2.3) 11234/12895. We select languages based on the criteria proposed in Lhoneux2017OldSV, as adapted by smithetal2018investigation. This set contains several languages with different language families, scripts, character set sizes, morphological complexity, and training sizes and domains. A description of the selected Treebanks is in Appendix A.
Penn Treebanks:
We also evaluate our models on the English and Chinese Penn Treebanks marcusetal1993building; xueetal2002building. For English, we use sections 2-21 for training, section 22 for development, and section 23 for testing. We add section 24 to our development set to mitigate overfitting. We use the Stanford PoS tagger toutanovaetal2003feature to produce PoS tags. We convert constituency trees to Stanford dependencies using version 3.3.0 of the converter de2006generating. For Chinese, we apply the same setup as described in chenmanning2014fast, and use gold PoS tags during training and evaluation.
5.2 Baseline Models
For the UD Treebanks, we consider several parsers both as baselines and to produce initial parses for the RNGTr model. We use the monolingual parser proposed by Kulmizev_2019, which uses and extends UUParser (https://github.com/UppsalaNLP/uuparser) delhoneuxetal2017raw; smithetal201882, and applies BERT devlin2018bert and ELMo petersetal2018deep embeddings as additional input features. In addition, we compare our models with the multilingual models proposed by Kondratyuk_2019 and straka2018udpipe. UDify Kondratyuk_2019 is a multilingual multi-task model for predicting universal part-of-speech tags, morphological features, lemmas, and dependency graphs at the same time for all UD Treebanks. UDPipe straka2018udpipe is one of the winners of the CoNLL 2018 Shared Task zemanetal2018conll; it is a multi-task model that performs sentence segmentation, tokenization, POS tagging, lemmatization, and dependency parsing. For a fair comparison, we use the scores reported by Kondratyuk_2019 for the UDPipe model, which they retrained using gold segmentation. UDify outperforms the UDPipe model on average, so we use both the UDify and DepBERT models as initial parsers to integrate with the RNG Transformer model. We also train the RNGTr model without any initial dependency graph, called Empty+RNGTr, to further analyse the impact of the initial graph.
For the Penn Treebanks, we compare our models with previous state-of-the-art transition-based and graph-based models. The Biaffine parser dozat2016deep includes the same decoder as our model, with an LSTM-based encoder. jietal2019graph also integrate graph neural network models with the Biaffine parser, to find a better representation for the nodes of the graph. For these datasets, we use the Biaffine and DepBERT parsers as the initial parsers for our RNG Transformer model.
5.3 Implementation Details
All hyperparameters are provided in Appendix B.
For the self-attention model, we use the pretrained "bert-multilingual-cased" model (https://github.com/google-research/bert) with 12 self-attention layers. (For Chinese and Japanese, we use the pretrained "bert-base-chinese" and "bert-base-japanese" models Wolf2019HuggingFacesTS, respectively.) For tokenization, we apply the wordpiece tokenizer of BERT wu2016google. Since dependency relations are between the tokens of a dependency corpus, we apply the BERT tokenizer to each corpus token and run the encoder on all the resulting subwords. Each dependency between two words is input as a relation between their first subwords. We also input a new relationship with each non-first subword as the dependent and the associated first subword as its head. In the decoder, we only consider candidate dependencies between the first subwords of each word. (In preliminary experiments, we found that using the dependency predictions of the first subwords achieves better or similar results compared to using the last subword or all subwords of each word.) Finally, we map the predicted heads and dependents to their original positions in the corpus for proper evaluation.
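A sketch of this first-subword bookkeeping, with `tokenize` standing in for the wordpiece tokenizer (the helper name and its exact behaviour are illustrative):

```python
def first_subword_indices(words, tokenize):
    """Map each corpus word to the index of its first subword in the
    flattened BERT input. Returns the full subword sequence (with CLS
    and SEP added) and, for each word, the position of its first piece;
    dependencies are input and decoded between these positions only.
    """
    subwords, firsts = ["[CLS]"], []
    for w in words:
        pieces = tokenize(w)
        firsts.append(len(subwords))  # index of the word's first piece
        subwords.extend(pieces)
    subwords.append("[SEP]")
    return subwords, firsts
```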
6 Results and Discussion
After some initial experiments to determine the number of refinement iterations, we report the performance of the RNG Transformer model on UD treebanks and Penn treebanks. The RNGTr models perform substantially better than models without refinement on almost every dataset. We also perform various analyses of the models to better understand these results.
6.1 The Number of Refinement Iterations
To select the best number of refinement iterations allowed by the RNG Transformer model, we evaluate different variations of our model on the Turkish Treebank (Table 1). (We choose the Turkish Treebank because it is a low-resource Treebank, so there are more errors in the initial parse for RNGTr to correct.) We use both DepBERT and UDify as initial parsers. The DepBERT model significantly outperforms the UDify model, so adding the RNGTr model to the UDify model results in more relative error reduction in LAS than its integration with DepBERT (17.65% vs. 2.67%). In both cases, using three refinement iterations achieves the best result, and excluding the stopping strategy described in Section 3.3 decreases performance. In subsequent experiments, we use three refinement iterations with the stopping strategy, unless mentioned otherwise.
6.2 UD Treebanks Results
Results for the UD Treebanks are reported in Table 2. We compare our models with previous state-of-the-art results (both trained monolingually and multilingually), based on labelled attachment score (LAS). (Unlabelled attachment scores are provided in Appendix C. All results are computed with the official CoNLL 2018 shared task evaluation script, https://universaldependencies.org/conll18/evaluation.html.) The UDify+RNGTr model achieves significantly better performance than the UDify model, which demonstrates the effectiveness of the RNGTr model at refining an initial dependency graph. The DepBERT model significantly outperforms previous state-of-the-art models on all UD Treebanks. Despite this good performance, the DepBERT+RNGTr model achieves further improvement over DepBERT in almost all languages and on average. As expected, we get more improvement when combining the RNGTr model with UDify, because UDify's initial dependency graph contains more incorrect dependency relations for RNGTr to correct.
Although generally better, there is surprisingly little difference between the performance after refinement of the UDify+RNGTr and DepBERT+RNGTr models. To investigate the power of the RNGTr architecture to correct any initial parse, we also show results for a model with an empty initial parse, Empty+RNGTr. For this model, we run four iterations of refinement (T=4), so that the amount of computation is the same as for UDify+RNGTr and DepBERT+RNGTr. The Empty+RNGTr model achieves results competitive with the UDify+RNGTr model (i.e. above the previous state-of-the-art), indicating that the RNGTr architecture is a very powerful method for graph refinement. We discuss this conclusion further in Section 6.4.
6.3 Penn Treebanks Results
UAS and LAS results for the Penn Treebanks are reported in Table 3. We compare to the results of previous state-of-the-art models and DepBERT, and we use the RNGTr model to refine both the Biaffine parser dozat2016deep and DepBERT, on the English and Chinese Penn Treebanks. (Results are calculated with the official evaluation script, https://depparse.uvt.nl/.)
Again, the DepBERT model significantly outperforms previous state-of-the-art models, with a 5.78% and 9.15% LAS relative error reduction in English and Chinese, respectively. Despite this level of accuracy, adding RNGTr refinement improves accuracy further under both measures in both languages, although in English the differences are not significant. For the Chinese Treebank, RNGTr refinement achieves a 4.7% LAS relative error reduction. When RNGTr refinement is applied to the output of the Biaffine parser dozat2016deep, it achieves a LAS relative error reduction of 10.64% for the English Treebank and 16.5% for the Chinese Treebank. These improvements, even over such strong initial parsers, again demonstrate the effectiveness of the RNGTr architecture for graph refinement.
6.4 Error Analysis
To better understand the distribution of errors for our models, we follow mcdonaldnivre2011analyzing and plot labelled attachment scores as a function of dependency length, sentence length, and distance to root. (We use the MaltEval tool nilssonnivre2008malteval for calculating accuracies in all cases.) We compare the distributions of errors made by UDify Kondratyuk_2019, DepBERT, and the refined models (UDify+RNGTr, DepBERT+RNGTr, and Empty+RNGTr). Figure 2 shows the accuracies of the different models on the concatenation of all development sets of the UD Treebanks. (Tables for the error analysis section, and graphs for each language, are provided in Appendix D.) The results show that applying RNGTr refinement to the UDify model yields a large improvement in accuracy across all cases. They also show little difference in the error profile between the better performing models.
Dependency Length:
The leftmost plot compares the accuracy of models on different dependency lengths. Adding RNGTr refinement to UDify results in better performance both on short and long dependencies, with particular gains for the longer and more difficult cases.
Distance to Root:
The middle plot illustrates the accuracy of models as a function of the distance to the root of the dependency tree, which is calculated as the number of dependency relations from the dependent to the root. Again, when we add RNGTr refinement to the UDify parser we get significant improvement for all distances, with particular gains for the difficult middle distances, where the dependent is neither near the root nor a leaf of the tree.
Sentence Length:
The rightmost plot shows the accuracy of models on different sentence lengths. Again, adding RNGTr refinement to UDify achieves significantly better results on all sentence lengths. But in this case, the larger improvements are for the shorter, presumably easier, sentences.
6.5 Refinement Analysis
To better understand how the RNG Transformer model performs refinement, we carry out several analyses of the trained UDify+RNGTr model. (We choose UDify as the initial parser because the RNGTr model makes more changes to the parses of UDify than to those of DepBERT, so we can more easily analyse these changes.) An example of this refinement is shown in Figure 3, where the UDify model predicts an incorrect dependency graph, but the RNGTr model modifies it to build the gold dependency tree.
Refinements by Iteration:
To measure the accuracy gained from refinement at different iterations, we define the following metric:

ER^t = 100 × (Acc^t − Acc^{t−1}) / (100 − Acc^{t−1})    (8)

where ER^t is the relative error reduction, t is the refinement iteration, and Acc^t is the labelled attachment score (in percent) after iteration t; Acc^0 is the accuracy of the initial parser, UDify in this case. To illustrate the refinement procedure for different dataset types, we split the UD Treebanks based on their training size into "low-resource" and "high-resource" datasets. (We consider languages that have more than 10k training sentences as high-resource.)
Model Type  ER^1  ER^2  ER^3
Low-Resource  +13.62%  +17.74%  +0.16%
High-Resource  +29.38%  +0.81%  +0.41%
Table 4 shows this refinement metric (ER^t) after each refinement iteration of the UDify+RNGTr model on the UD Treebanks. (For these results we apply MST decoding after every iteration, to allow proper evaluation of the intermediate graphs.) Every refinement step achieves an increase in accuracy, on both low- and high-resource languages, but the amount of improvement decreases at higher refinement iterations. Interestingly, for languages with less training data, the model cannot learn to make all corrections in a single step, but it can learn to make the remaining corrections in a second step, resulting in approximately the same total percentage of errors corrected as for high-resource languages.
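The metric of Eq. (8) is straightforward to compute from the accuracies before and after an iteration, with accuracies given as percentages:

```python
def relative_error_reduction(acc_prev, acc_new):
    """Eq. (8): percentage of the remaining errors corrected at one
    refinement iteration, where acc_prev and acc_new are LAS percentages
    before and after the iteration."""
    return 100.0 * (acc_new - acc_prev) / (100.0 - acc_prev)
```

For instance, improving LAS from 80% to 90% corrects half of the remaining errors, i.e. an ER of +50%.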
Dependency Type Refinement:
Table 5 shows the accuracy on a selection of dependency types for the UDify model as the initial parser and for the refined model (+RNGTr). (The accuracy of UDify, DepBERT, and their integration with RNGTr on all dependency types is provided in Appendix E.) On average, we achieve a 29.78% LAS error reduction compared to UDify by adding the RNGTr model. Significant improvements are achieved for all dependency types, especially in hard cases such as copula (cop), the relation of a function word used to link a subject to a non-verbal predicate.
NonProjective Trees:
Predicting non-projective trees is a challenging issue for dependency parsers. Figure 4 shows precision and recall for the UDify and UDify+RNGTr models on non-projective trees of the UD Treebanks. Adding the RNGTr model to the initial parser results in a significant improvement in both precision and recall, which again demonstrates the effectiveness of the RNGTr model on hard cases.
6.6 Time complexity
In this section, we compute the time complexity of both proposed models, DepBERT and RNGTr.
DepBERT:
The time complexity of the original Transformer vaswani2017attention is O(n²), where n is the sequence length, so the time complexity of the encoder is O(n²). The time complexity of the decoder is determined by the Chu-Liu/Edmonds algorithm Chu1965OnTS; edmonds1967optimum, which is O(n²). So the total time complexity of the DepBERT model is O(n²), the same as other graph-based models.
RNGTr:
In each refinement step, the time complexity of the Graph-to-Graph Transformer mohammadshahi2019graphtograph is O(n²). Since we use the argmax function in the intermediate steps, the decoding time complexity is determined by the final decoding step, which is O(n²). So, with a fixed number of refinement iterations, the total time complexity is O(n²).
7 Conclusion
We proposed a novel architecture for structured prediction, the Recursive Non-autoregressive Graph-to-Graph Transformer (RNGTr), which iteratively refines arbitrary graphs. Given an initial graph, the RNG Transformer learns to predict a corrected graph over the same set of nodes. Each iteration of refinement predicts the edges of the graph in a non-autoregressive fashion, but conditions these predictions on the entire graph from the previous iteration. This graph conditioning and prediction are done with the Graph-to-Graph Transformer architecture mohammadshahi2019graphtograph, which makes the model capable of capturing complex patterns of interdependencies between graph edges. The Graph-to-Graph Transformer also benefits from initialisation with a pretrained BERT devlin2018bert model. We also propose a graph-based dependency parser called DepBERT, which is the same as our refinement model but without graph inputs.
We evaluate the RNG Transformer architecture on syntactic dependency parsing. We run experiments with a variety of initial parsers, including DepBERT, on 13 languages of the Universal Dependencies Treebanks, and on the English and Chinese Penn Treebanks. Our DepBERT model already significantly outperforms previous state-of-the-art models on both types of Treebanks. Even with this very strong initial parser, RNGTr refinement almost always improves accuracies, setting new state-of-the-art accuracies for all treebanks. Regardless of the initial parser (e.g. UDify Kondratyuk_2019 on the UD Treebanks, and the Biaffine parser dozat2016deep on the Penn Treebanks), RNGTr reaches around the same level of accuracy, even when it is given an empty initial parse, demonstrating the power of this iterative refinement method. Finally, we provide error analyses of the proposed model to illustrate its advantages and to understand how refinements are made across iterations.
The RNG Transformer architecture is a very general and powerful method for structured prediction, which could easily be applied to other NLP tasks. It would especially benefit tasks that require capturing complex interdependencies between graph edges, while keeping the computational benefits of a non-autoregressive model.
8 Acknowledgement
We are grateful to the Swiss NSF, grant CRSII5_180320, for funding this work. We also thank Lesly Miculicich and other members of the Idiap NLU group for helpful discussions.
References
Appendix A UD Treebanks Details
Appendix B Implementation Details
For better convergence, we use two separate optimizers for the pretrained parameters and the randomly initialised parameters. We apply bucketed batching, grouping sentences of similar lengths into the same batch, to speed up training. The hyperparameters of the RNG Transformer model are listed below:
Component  Specification 
Optimiser  BertAdam Wolf2019HuggingFacesTS
Base Learning rate  2e-3
BERT Learning rate  1e-5
Adam Betas (β1, β2)  (0.9, 0.999)
Adam Epsilon  1e-5
Weight Decay  0.01
Max Grad Norm  1
Warmup  0.01
SelfAttention  
No. Layers  12 
No. Heads  12 
Embedding size  768 
Max Position Embedding  512 
FeedForward layers (arc)  
No. Layers  2 
Hidden size  500 
Dropout  0.33 
Activation  LeakyReLU 
Negative Slope  0.1 
FeedForward layers (rel)  
No. Layers  2 
Hidden size  100 
Dropout  0.33 
Activation  LeakyReLU 
Negative Slope  0.1 
Epochs  200
Patience  100
