Self-attention models, such as the Transformer vaswani2017attention, have been hugely successful in a wide range of natural language processing (NLP) tasks, especially when combined with language-model pre-training, such as BERT devlin2018bert. These architectures contain a stack of self-attention layers which can capture long-range dependencies over the input sequence, while still representing its sequential order using absolute position encodings. Alternatively, shaw-etal-2018-self propose to represent sequential order with relative position encodings, which are input to the self-attention functions.
Recently, mohammadshahi2019graphtograph extended this sequence input method to the input of arbitrary graph relations via the self-attention mechanism, and combined it with an attention-like function for graph relation prediction, resulting in their proposed Graph-to-Graph Transformer architecture (G2G-Tr). They demonstrated the effectiveness of G2G-Tr for transition-based dependency parsing and its compatibility with pre-trained BERT devlin2018bert. This parsing model predicts one edge of the parse graph at a time, conditioning on the graph of previous edges, so it is autoregressive in nature.
In this paper, we propose to take advantage of the graph-to-graph functionality of G2G-Tr to define a model that both predicts all edges of the graph in parallel, and is therefore non-autoregressive, and still captures between-edge dependencies, like an autoregressive model. This is done by recursively applying a G2G-Tr model to correct the errors in a previous prediction of the output graph. At the point when no more corrections are made, all edges are predicted conditioned on all other edges in the output graph.
Our proposed Recursive Non-autoregressive Graph2graph Transformer (RNG-Tr) architecture first computes an initial graph with any given parsing model, even a trivial one. It then iteratively refines this graph by combining the G2G-Tr edge prediction step with a decoding step which finds the best graph given these predictions. The prediction of the model is the graph output by the final refinement step.
The RNG Transformer architecture can be applied to any task with a sequence or graph as input and a graph over the same set of nodes as output. We evaluate RNG-Tr on syntactic dependency parsing because it is a difficult structured prediction task, state-of-the-art initial parsers are extremely competitive, and there is little previous evidence that non-autoregressive models (as in graph-based dependency parsers) are not sufficient for this task. We aim to show that capturing correlations between dependencies with recursive non-autoregressive refinement results in improvements over state-of-the-art dependency parsers.
We propose the RNG Transformer model of graph-based dependency parsing. At each step of iterative refinement, the RNG-Tr model uses a G2G-Tr model to compute vector representations of the nodes and then predict the scores for every possible dependency edge. These scores are then used to find a complete dependency graph, which is input to the next step of the iterative refinement. The model stops when no changes are made to the graph, or when a maximum number of iterations is reached.
The evaluation demonstrates improvements with several initial parsers, including previous state-of-the-art dependency parsers, and the empty parse. We also introduce a strong Transformer-based dependency parser which is initialised with a pre-trained BERT model devlin2018bert, called Dependency BERT (DepBERT), using it both for our initial parser and as the basis of our refinement model. Results on 13 languages from the Universal Dependencies Treebanks 11234/1-2895 and the English and Chinese Penn Treebanks marcus-etal-1993-building; xue-etal-2002-building show significant improvements over all initial parsers and over the state-of-the-art.
In this paper, we make the following contributions:
We propose a novel architecture for the iterative refinement of arbitrary graphs, called Recursive Non-autoregressive Graph-to-graph Transformer (RNG-Tr), which combines non-autoregressive edge prediction with conditioning on the complete graph.
We propose a Transformer model for graph-based dependency parsing (DepBERT) with BERT pre-training.
We propose the RNG-Tr model of dependency parsing, using initial parses from a variety of previous models and our DepBERT model.
We demonstrate significant improvements over the previous state-of-the-art results on Universal Dependency Treebanks and Penn Treebanks.
2 Graph-based Dependency Parsing
Syntactic dependency parsing is a critical component in a variety of natural language understanding tasks, such as semantic role labeling marcheggiani-titov-2017-encoding, machine translation Chen_2017, relation extraction zhang-etal-2018-graph, and natural language interfaces pang2019improving. There are two main approaches to computing the dependency tree: transition-based models yamada-matsumoto-2003-statistical; nivre-scholz-2004-deterministic; titov-henderson-2007-latent; zhang-nivre-2011-transition and graph-based models eisner-1996-three; mcdonald-etal-2005-online; koo2010efficient. Transition-based parsers predict the dependency graph one edge at a time through a sequence of parsing actions. Graph-based models compute scores for every possible dependency edge and then apply a decoding algorithm, such as the maximum spanning tree (MST) or greedy algorithm chi-1999-statistical; edmonds1967optimum, to find the highest-scoring tree. Typically these models consist of two components: an encoder learns context-dependent vector representations for the nodes of the dependency graph, and a decoder computes the dependency scores for each pair of nodes and then applies a decoding algorithm to find the highest-scoring dependency tree.
Graph-based models should have trouble capturing correlations between dependency edges since the score for an edge must be chosen without being sure what other edges MST will choose. MST itself only imposes the discrete tree constraint between edges. And yet, graph-based models are the state-of-the-art in dependency parsing. In this paper, we show that it is possible to improve over the state-of-the-art by recursively conditioning each edge score prediction on a previous prediction of the complete dependency graph.
3 RNG Transformer
The RNG Transformer architecture is illustrated in Figure 1, in this case applied to dependency parsing. The input to the RNG-Tr model is an initial graph $G^0$ (e.g. a parse tree) over the input nodes $W$ (e.g. a sentence), and the output is a final graph $G^T$ over the same set of nodes. Each iteration takes the previous graph $G^{t-1}$ as input and predicts a new graph $G^t$.
The RNG-Tr model predicts $G^t$ with a novel version of a Graph-to-Graph Transformer mohammadshahi2019graphtograph. Unlike in the work of mohammadshahi2019graphtograph, this G2G-Tr model predicts every edge of the graph in a single non-autoregressive step. As previously, the G2G-Tr first encodes the input graph $G^{t-1}$ in a set of contextualised vector representations $Z^t$, with one vector for each node of the graph. The decoder component then predicts the output graph $G^t$ by first computing scores for each possible edge between each pair of nodes and then applying a decoding algorithm to output the highest-scoring complete graph.
We can formalise the RNG-Tr model as:

$$
\begin{aligned}
Z^t &= E^{\mathrm{RNG}}(W, P, G^{t-1}) \\
G^t &= D^{\mathrm{RNG}}(Z^t)
\end{aligned}
\qquad t = 1, \ldots, T \tag{1}
$$

where $E^{\mathrm{RNG}}$ is the encoder component, $D^{\mathrm{RNG}}$ is the decoder component, $P = (p_1, p_2, \ldots, p_N)$ is the part-of-speech tags of the associated words and symbols $W = (w_1, w_2, \ldots, w_N)$, and $T$ is the number of recursion steps. The predicted dependency graph at iteration $t$ is specified as:

$$
G^t = \{(i, j, l)\}, \qquad l \in L
$$

Each word $w_j$ has one head (parent) $w_i$ with dependency label $l$ from the label set $L$, where the parent may be the ROOT symbol (see Section 3.1.1).
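The recursion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `refine`, `predict_graph`, `toy_predictor`, `GOLD` and the list-of-(head, label) graph encoding are hypothetical names introduced here, and `predict_graph` stands in for one full encoder and decoder pass. The early exit mirrors the rule that refinement stops once the predicted graph no longer changes.

```python
from typing import Callable, List, Tuple

# One (head_index, label) pair per word; head 0 denotes ROOT.
# This encoding is an illustrative assumption.
Graph = List[Tuple[int, str]]


def refine(tokens: List[str],
           pos_tags: List[str],
           initial_graph: Graph,
           predict_graph: Callable[[List[str], List[str], Graph], Graph],
           max_iterations: int = 3) -> Tuple[Graph, int]:
    """Apply predict_graph repeatedly, stopping early at a fixed point.

    Returns the final graph and the number of prediction passes that were run.
    """
    graph = initial_graph
    for step in range(1, max_iterations + 1):
        new_graph = predict_graph(tokens, pos_tags, graph)
        if new_graph == graph:  # no corrections were made: stop
            return graph, step
        graph = new_graph
    return graph, max_iterations


# Toy stand-in for the model: it corrects one wrong edge per pass.
GOLD: Graph = [(2, "nsubj"), (0, "root"), (2, "obj")]


def toy_predictor(tokens: List[str], pos_tags: List[str], graph: Graph) -> Graph:
    fixed = list(graph)
    for i, (edge, gold_edge) in enumerate(zip(graph, GOLD)):
        if edge != gold_edge:
            fixed[i] = gold_edge
            break
    return fixed


final_graph, passes = refine(
    ["She", "reads", "books"], ["PRON", "VERB", "NOUN"],
    [(2, "nsubj"), (0, "root"), (1, "obj")], toy_predictor)
```

With one wrong edge in the initial graph, the toy predictor repairs it in the first pass and the second pass returns the same graph, so the loop exits before reaching the iteration limit.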
In the following sections, we will describe each element of our RNG-Tr dependency parsing model in detail.
3.1 Encoder
To compute the embeddings $Z^t$ for the nodes of the graph, we use the Graph-to-Graph Transformer architecture proposed by mohammadshahi2019graphtograph, including a similar mechanism to input the previously predicted dependency graph $G^{t-1}$ to the attention mechanism. This graph input allows the node embeddings to include both token-level and dependency-level information.
3.1.1 Input Embeddings
The RNG-Tr model receives a sequence of input tokens ($W$) with their associated Part-of-Speech tags ($P$) and builds a sequence of input embeddings ($X$). The sequence of input tokens starts with the CLS symbol and ends with the SEP symbol (the same as the input token representation of BERT devlin2018bert). It also adds the ROOT symbol to the front of the sentence to represent the root of the dependency tree.
To build the token representation for a sequence of input tokens, we sum several vectors. For each input word or symbol $w_i$, we sum the pre-trained word embedding of BERT, $\mathrm{Emb}(w_i)$, and a learned representation of its POS tag, $\mathrm{Emb}(p_i)$. To keep the order information of the initial sequence, we add the pre-trained position embeddings of BERT, $f_i$, to our token embeddings. The final input representations are the sum of the position embeddings and the token embeddings:

$$
x_i = f_i + \mathrm{Emb}(w_i) + \mathrm{Emb}(p_i), \qquad i = 1, \ldots, N
$$
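The summation of word, POS-tag and position embeddings can be illustrated with a small sketch. Everything here is an assumption made for illustration: the lookup tables are random stand-ins for BERT's pre-trained embeddings and the learned tag embeddings, `DIM`, `make_table` and `input_embeddings` are hypothetical names, and giving the special symbols their own tag entries is one possible convention.

```python
import random
from typing import Dict, Hashable, List, Sequence

DIM = 4  # toy embedding size chosen for illustration


def make_table(keys: Sequence[Hashable], dim: int,
               rng: random.Random) -> Dict[Hashable, List[float]]:
    """Random lookup table standing in for pre-trained or learned embeddings."""
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for key in keys}


def input_embeddings(tokens, pos_tags, word_emb, tag_emb, position_emb):
    """Element-wise sum of word, POS-tag and position vectors per token."""
    embeddings = []
    for i, (token, tag) in enumerate(zip(tokens, pos_tags)):
        embeddings.append([w + p + f for w, p, f in
                           zip(word_emb[token], tag_emb[tag], position_emb[i])])
    return embeddings


rng = random.Random(0)
# CLS first, ROOT in front of the sentence, SEP last.
tokens = ["[CLS]", "[ROOT]", "she", "reads", "[SEP]"]
pos_tags = ["[CLS]", "[ROOT]", "PRON", "VERB", "[SEP]"]  # assumed tag scheme
word_emb = make_table(tokens, DIM, rng)
tag_emb = make_table(pos_tags, DIM, rng)
position_emb = make_table(range(len(tokens)), DIM, rng)

x = input_embeddings(tokens, pos_tags, word_emb, tag_emb, position_emb)
```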
3.1.2 Self-Attention Mechanism
Conditioning on the previously predicted output graph is made possible by inputting relation embeddings to the self-attention mechanism. This edge input method was originally proposed by shaw-etal-2018-self for relative position encoding, and extended to unlabelled dependency graphs in the Graph-to-Graph Transformer architecture of mohammadshahi2019graphtograph. We use it to input labelled dependency graphs. A relation embedding is added both to the value function and to the attention weight function wherever the two related tokens are involved.
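Transformers have multiple layers of self-attention, each with multiple heads. The RNG-Tr architecture uses the same architecture as BERT devlin2018bert but changes the functions used by each attention head. Given the token embeddings $X$ at the previous layer and the input graph $G^{t-1}$, the values $A = (a_1, \ldots, a_N)$ computed by an attention head are:

$$
a_i = \sum_j \alpha_{ij} \left( x_j W^V + r^{t-1}_{ij} W^L_2 \right)
$$

where $r^{t-1}_{ij}$ is a one-hot vector that represents the labelled dependency relationship between $i$ and $j$ in the graph $G^{t-1}$. These relationships include both the label and the direction (head → dependent versus dependent → head), or specify none. $W^L_2 \in \mathbb{R}^{(2|L|+1) \times d}$ are learned relation embeddings (where $|L|$ is the number of dependency labels and $d$ is the dimensionality of the attention head). The attention weights $\alpha_{ij}$ are a softmax applied to the attention function:

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
e_{ij} = \frac{1}{\sqrt{d}} \left[ x_i W^Q (x_j W^K)^\top + x_i W^Q \, \mathrm{LN}(r^{t-1}_{ij} W^L_1)^\top \right]
$$

where $W^L_1 \in \mathbb{R}^{(2|L|+1) \times d}$ are learned relation embeddings. LN() is the layer normalisation function, for better convergence.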
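To make the role of the relation embeddings concrete, here is a pure-Python sketch of a single attention head in which a relation embedding is added to the key when computing the attention score and to the value when computing the output, in the style of shaw-etal-2018-self. The exact placement of layer normalisation, the omission of learned gain and bias, and all names (`relation_attention_head`, `rel_ids`, `rel_key_emb`, `rel_value_emb`) are assumptions for illustration. Relation index 0 is taken to mean "no relation", and selecting a row of an embedding table plays the role of multiplying a one-hot vector by that table.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def vec_mat(x: Vector, w: Matrix) -> Vector:
    """Row vector times matrix: (d_in,) x (d_in, d_out) -> (d_out,)."""
    return [sum(xi * row[k] for xi, row in zip(x, w)) for k in range(len(w[0]))]


def layer_norm(x: Vector, eps: float = 1e-12) -> Vector:
    """Layer normalisation without learned gain and bias (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_attention_head(x: Matrix, rel_ids: List[List[int]],
                            w_q: Matrix, w_k: Matrix, w_v: Matrix,
                            rel_key_emb: Matrix, rel_value_emb: Matrix) -> Matrix:
    """One attention head with relation embeddings in the scores and values."""
    d = len(w_q[0])
    queries = [vec_mat(xi, w_q) for xi in x]
    keys = [vec_mat(xj, w_k) for xj in x]
    values = [vec_mat(xj, w_v) for xj in x]
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            rel_key = layer_norm(rel_key_emb[rel_ids[i][j]])
            scores.append(sum(a * (b + c) for a, b, c in zip(q, k, rel_key))
                          / math.sqrt(d))
        alpha = softmax(scores)
        out = [0.0] * d
        for j, weight in enumerate(alpha):
            rel_val = rel_value_emb[rel_ids[i][j]]
            for t in range(d):
                out[t] += weight * (values[j][t] + rel_val[t])
        outputs.append(out)
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
rel_ids = [[0, 1], [2, 0]]  # 0 = none, 1 and 2 = two directed relations
rel_key_emb = [[0.0, 0.0], [2.0, -2.0], [0.0, 0.0]]
rel_value_emb = [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0]]
out = relation_attention_head(x, rel_ids, identity, identity, identity,
                              rel_key_emb, rel_value_emb)
```

In this toy setting the relation between tokens 0 and 1 raises the attention score of token 1 for query 0 and also shifts the value that token 1 contributes, which is the two-fold effect the text describes.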
3.2 Decoder
The decoder uses the token embeddings $Z^t$ produced by the encoder to predict the new dependency graph $G^t$. It consists of two components: a scoring function and a decoding algorithm. The dependency graph found by the decoding algorithm is the output graph $G^t$ of the decoder.
There are differences between the segmentation of the input string used by BERT and the segmentation used by the dependency treebanks. To compensate for this, the encoder uses the segmentation of BERT, but only a subset of the resulting token embeddings are considered by the decoder. For each word in the dependency annotation, only the first segment of that word from the BERT segmentation is used in the decoder. See Section 5.3 for more details.
3.2.1 Scoring Function
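We first produce four distinct vectors for each token embedding $z_i$ from the encoder by passing it through four feed-forward layers:

$$
\begin{aligned}
z^{\text{arc-dep}}_i &= \mathrm{MLP}^{\text{arc-dep}}(z_i), & z^{\text{arc-head}}_i &= \mathrm{MLP}^{\text{arc-head}}(z_i) \\
z^{\text{rel-dep}}_i &= \mathrm{MLP}^{\text{rel-dep}}(z_i), & z^{\text{rel-head}}_i &= \mathrm{MLP}^{\text{rel-head}}(z_i)
\end{aligned}
$$

where the MLPs are all one-layer feed-forward networks with LeakyReLU activation functions.
These token embeddings are used to compute probabilities for every possible dependency relation, both unlabelled and labelled, similarly to dozat2016deep. The distribution of the unlabelled dependency graph is estimated using, for each token $i$, a biaffine classifier over possible heads $j$ applied to $z^{\text{arc-dep}}_i$ and $z^{\text{arc-head}}_j$. Then for each pair $i, j$, the distribution over labels given an unlabelled dependency relation is estimated using a biaffine classifier applied to $z^{\text{rel-dep}}_i$ and $z^{\text{rel-head}}_j$.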
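The scoring step can be sketched as follows. This is a simplified biaffine form (a bilinear term plus a bias on the head vector) written in plain Python, and it should be read as an illustration rather than the exact parameterisation used by dozat2016deep or by this model; the function and parameter names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def feed_forward(z: Vector, w: Matrix, b: Vector, slope: float = 0.01) -> Vector:
    """One-layer feed-forward network with a LeakyReLU activation."""
    pre = [sum(zi * row[k] for zi, row in zip(z, w)) + b[k] for k in range(len(b))]
    return [v if v > 0 else slope * v for v in pre]


def biaffine(dep: Vector, head: Vector, u: Matrix, head_bias: Vector) -> float:
    """Simplified biaffine score: dep^T U head + head_bias . head."""
    bilinear = sum(dep[a] * u[a][b] * head[b]
                   for a in range(len(dep)) for b in range(len(head)))
    return bilinear + sum(hb * h for hb, h in zip(head_bias, head))


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_distributions(arc_dep: Matrix, arc_head: Matrix,
                       u: Matrix, head_bias: Vector) -> Matrix:
    """Row i is the distribution over candidate heads j for token i."""
    return [softmax([biaffine(d, h, u, head_bias) for h in arc_head])
            for d in arc_dep]


def label_distribution(rel_dep_i: Vector, rel_head_j: Vector,
                       label_us: List[Matrix], label_biases: List[Vector]) -> Vector:
    """Distribution over labels for one (dependent i, head j) pair,
    using one biaffine form per label."""
    return softmax([biaffine(rel_dep_i, rel_head_j, u, b)
                    for u, b in zip(label_us, label_biases)])


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
arc_dep = [[1.0, 0.0], [0.0, 1.0]]
arc_head = [[0.0, 1.0], [1.0, 0.0]]
probs = head_distributions(arc_dep, arc_head, identity, [0.0, 0.0])
labels = label_distribution([1.0, 0.0], [1.0, 0.0],
                            [identity, zeros], [[0.0, 0.0], [0.0, 0.0]])
```

Because each row of `probs` is computed independently of the others, nothing in this step forces the most probable heads to form a tree, which is why a separate decoding algorithm is needed.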
3.2.2 Decoding Algorithms
The scoring function estimates a distribution over graphs, but the RNG-Tr architecture requires the decoder to output a single graph $G^t$. Choosing this graph is complicated by the fact that the scoring function is non-autoregressive: the estimate consists of multiple independent components, so there is no guarantee that every graph in this distribution is a valid dependency graph.
We take two approaches to this problem, one for intermediate parses $G^t$ with $t < T$, and one for the final dependency parse $G^T$. To speed up each refinement iteration, we ignore this problem for intermediate dependency graphs. We build these graphs by simply applying argmax independently to find the head of each node. This may result in graphs with loops, which are not trees, but this does not seem to cause problems for later refinement iterations (we will investigate different decoding strategies that keep both the speed and the well-formedness of the intermediate predicted graphs in future work). For the final output dependency tree, we use the maximum spanning tree algorithm, specifically the Chu-Liu/Edmonds algorithm chi-1999-statistical; edmonds1967optimum, to find the highest scoring valid dependency tree. This is necessary to avoid problems when running the evaluation scripts.
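The following sketch shows why independent argmax decoding can produce loops. It covers only the intermediate decoding step and a well-formedness check; it does not implement the Chu-Liu/Edmonds algorithm. The names `argmax_heads` and `has_cycle`, and the convention that index 0 is ROOT (whose own row is ignored), are assumptions for illustration.

```python
from typing import List


def argmax_heads(head_probs: List[List[float]]) -> List[int]:
    """Pick the most probable head for each token independently.

    head_probs[i][j] is the probability that token j is the head of token i.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in head_probs]


def has_cycle(heads: List[int], root: int = 0) -> bool:
    """True if following head pointers from some token never reaches root."""
    for start in range(len(heads)):
        seen = set()
        node = start
        while node != root:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node]
    return False


# Tokens 1 and 2 each prefer the other as head, so argmax yields a loop.
loopy_probs = [[1.0, 0.0, 0.0],
               [0.1, 0.2, 0.7],
               [0.2, 0.6, 0.2]]
# Token 1 attaches to ROOT and token 2 attaches to token 1: a valid tree.
tree_probs = [[1.0, 0.0, 0.0],
              [0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2]]

loopy_heads = argmax_heads(loopy_probs)
tree_heads = argmax_heads(tree_probs)
```

A graph such as `loopy_heads` can still be fed to the next refinement iteration as input; only the final output needs a decoder that guarantees a tree.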
3.3 Training
The RNG Transformer model is trained separately on each refinement iteration. Standard gradient descent techniques are used, with cross-entropy loss for each edge prediction. Error is not backpropagated across iterations of refinement, because there are no continuous values being passed from one iteration to another, only a discrete dependency tree.
In the RNG Transformer architecture, the refinement of the predicted graph can be done an arbitrary number of times, since the same encoder and decoder parameters are used at each iteration. In the experiments below, we place a limit on the maximum number of iterations. But sometimes the model converges to an output graph before this limit is reached, simply copying this graph during later iterations. To avoid multiple iterations where the model is trained to simply copy the input graph, during training the refinement iterations are stopped if the new predicted dependency graph is the same as the input graph. At test time we also stop computation in this case, but the output of the model is obviously not affected.
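The per-edge cross-entropy objective can be written down directly. The sketch below assumes that the model outputs, for each token, a distribution over candidate heads and a distribution over labels given the gold head; `parse_loss` and its argument layout are hypothetical names, and the loss is computed for a single refinement iteration, since no gradient flows between iterations.

```python
import math
from typing import List


def parse_loss(head_probs: List[List[float]], gold_heads: List[int],
               label_probs: List[List[float]], gold_labels: List[int]) -> float:
    """Sum of negative log-probabilities of the gold head and gold label
    of every token, for one refinement iteration."""
    head_loss = -sum(math.log(row[g]) for row, g in zip(head_probs, gold_heads))
    label_loss = -sum(math.log(row[g]) for row, g in zip(label_probs, gold_labels))
    return head_loss + label_loss


loss = parse_loss(
    head_probs=[[0.5, 0.5], [0.25, 0.75]], gold_heads=[0, 1],
    label_probs=[[1.0, 0.0], [0.5, 0.5]], gold_labels=[0, 1])
```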
4 Initial Parsers
The RNG-Tr architecture assumes that there is an initial graph $G^0$ for the RNG-Tr model to refine. We consider several initial parsers to produce this graph. In our experiments, we find that the initial graph has little impact on the quality of the graph output by the RNG-Tr model.
To leverage previous work on dependency parsing, we use parsing models from the recent literature as initial parsers. To evaluate the importance of the initial parse, we also consider a setting where the initial parse is empty, so the first complete dependency tree is predicted by the RNG-Tr model itself. Finally, the success of our RNG-Tr dependency parsing model leads us to propose an initial parsing model with the same design, so that we can control for the parser design in measuring the importance of the RNG Transformer’s iterative refinement.
We call this initial parser the Dependency BERT (DepBERT) model. It is the same as one iteration of the RNG-Tr model shown in Figure 1 and defined in Section 3, except that there is no graph input to the encoder. Analogously to (1), $G^0$ is computed as:

$$
Z^0 = E^{\mathrm{DBERT}}(W, P), \qquad G^0 = D^{\mathrm{DBERT}}(Z^0)
$$

where $E^{\mathrm{DBERT}}$ and $D^{\mathrm{DBERT}}$ are the DepBERT encoder and decoder respectively. For the encoder, we use the Transformer architecture of BERT devlin2018bert and initialise with BERT’s pre-trained parameters. The token embeddings of the final layer are used for $Z^0$. For the decoder, we use the same segmentation strategy and scoring function as described in Section 3.2, and apply the Chu-Liu/Edmonds decoding algorithm chi-1999-statistical; edmonds1967optimum to find the highest scoring tree. The DepBERT parsing model is very similar to the UDify parsing model proposed by Kondratyuk_2019, but there are significant differences in the way token segmentation is handled, which result in significant differences in performance, as shown in Section 6.2.
5 Experimental Setup
5.1 Datasets
To evaluate our models, we apply them to two kinds of datasets: Universal Dependency (UD) Treebanks 11234/1-2895 and Penn Treebanks. For evaluation, following Kulmizev_2019; 11234/1-2895, we keep punctuation for UD Treebanks and remove it for Penn Treebanks nilsson-nivre-2008-malteval.
Universal Dependency Treebanks:
We evaluate our models on Universal Dependency Treebanks (UD v2.3) 11234/1-2895. We select languages based on the criteria proposed in Lhoneux2017OldSV, and adapted by smith-etal-2018-investigation. This set contains several languages with different language families, scripts, character set sizes, morphological complexity, and training sizes and domains. The description of the selected Treebanks is in Appendix A.
Penn Treebanks:
We also evaluate our models on English and Chinese Penn Treebanks marcus-etal-1993-building; xue-etal-2002-building. For English, we use sections 2-21 for training, section 22 for development, and section 23 for testing. We add section 24 to our development set to mitigate over-fitting. We use the Stanford PoS tagger toutanova-etal-2003-feature to produce PoS tags. We convert constituency trees to Stanford dependencies using version 3.3.0 of the converter de2006generating. For Chinese, we apply the same set-up as described in chen-manning-2014-fast, and use gold PoS tags during training and evaluation.
5.2 Baseline Models
For UD Treebanks, we consider several parsers both as baselines and to produce initial parses for the RNG-Tr model. We use the monolingual parser proposed by Kulmizev_2019, which uses and extends the UUParser (https://github.com/UppsalaNLP/uuparser) de-lhoneux-etal-2017-raw; smith-etal-2018-82, and applies BERT devlin2018bert and ELMo peters-etal-2018-deep embeddings as additional input features. In addition, we compare our models with the multilingual models proposed by Kondratyuk_2019 and straka-2018-udpipe. UDify Kondratyuk_2019 is a multilingual multi-task model for predicting universal part-of-speech, morphological features, lemmas, and dependency graphs at the same time for all UD Treebanks. UDPipe straka-2018-udpipe is one of the winners of the CoNLL 2018 Shared Task zeman-etal-2018-conll. It is a multi-task model that performs sentence segmentation, tokenization, POS tagging, lemmatization, and dependency parsing. For a fair comparison, we use the scores reported by Kondratyuk_2019 for the UDPipe model, which they retrained using gold segmentation. UDify outperforms the UDPipe model on average, so we use both UDify and DepBERT models as our initial parsers to integrate with the RNG Transformer model. We also train the RNG-Tr model without any initial dependency graph, called Empty+RNG-Tr, to further analyse the impact of the initial graph.
For Penn Treebanks, we compare our models with previous state-of-the-art transition-based and graph-based models. The Biaffine parser dozat2016deep includes the same decoder as our model, with an LSTM-based encoder. ji-etal-2019-graph also integrate graph neural network models with the Biaffine parser, to find a better representation for the nodes of the graph. For these datasets, we use the Biaffine and DepBERT parsers as the initial parsers for our RNG Transformer model.
5.3 Implementation Details
All hyper-parameters are provided in Appendix B.
For the self-attention model, we use the pre-trained "bert-multilingual-cased" model (https://github.com/google-research/bert) with 12 self-attention layers. For Chinese and Japanese, we use the pre-trained "bert-base-chinese" and "bert-base-japanese" models Wolf2019HuggingFacesTS respectively. For tokenization, we apply the wordpiece tokenizer of BERT wu2016google. Since dependency relations are between the tokens of a dependency corpus, we apply the BERT tokenizer to each corpus token and run the encoder on all the resulting sub-words. Each dependency between two words is input as a relation between their first sub-words. We also input a new relationship with each non-first sub-word as the dependent and the associated first sub-word as its head. In the decoder, we only consider candidate dependencies between the first sub-words for each word. (In preliminary experiments, we found that using the dependency predictions of the first sub-words achieves better or similar results compared to using the last sub-word or all sub-words of each word.) Finally, we map the predicted heads and dependents to their original positions in the corpus for proper evaluation.
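The sub-word handling described above can be sketched with a stand-in tokenizer. Nothing here calls the real BERT wordpiece tokenizer: `toy_wordpiece` is a hypothetical splitter that cuts words into fixed-size chunks, and `align`, `subword_relations` and the relation label `"subword"` are names introduced only for this illustration. Heads are word indices, with -1 marking the root word.

```python
from typing import Callable, List, Tuple


def toy_wordpiece(word: str, max_len: int = 4) -> List[str]:
    """Stand-in tokenizer: chunks of max_len characters, '##' on non-first pieces."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align(words: List[str],
          tokenize: Callable[[str], List[str]]) -> Tuple[List[str], List[int]]:
    """Return all sub-words and the position of each word's first sub-word."""
    subwords: List[str] = []
    first: List[int] = []
    for word in words:
        first.append(len(subwords))
        subwords.extend(tokenize(word))
    return subwords, first


def subword_relations(heads: List[int], labels: List[str], first: List[int],
                      n_subwords: int) -> List[Tuple[int, int, str]]:
    """Build (head_subword, dependent_subword, label) triples.

    Word-level dependencies are mapped to first sub-words, and every
    non-first sub-word is attached to the first sub-word of its word.
    """
    relations = []
    for dep, (head, label) in enumerate(zip(heads, labels)):
        if head >= 0:
            relations.append((first[head], first[dep], label))
    first_positions = set(first)
    current = 0
    for s in range(n_subwords):
        if s in first_positions:
            current = s
        else:
            relations.append((current, s, "subword"))
    return relations


words = ["She", "reads", "dictionaries"]
subwords, first = align(words, toy_wordpiece)
relations = subword_relations([1, -1, 1], ["nsubj", "root", "obj"],
                              first, len(subwords))
```

Mapping predictions back to the corpus then amounts to inverting `first`: a predicted head at sub-word position `first[k]` corresponds to word `k`.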
6 Results and Discussion
After some initial experiments to determine the number of refinement iterations, we report the performance of the RNG Transformer model on UD treebanks and Penn treebanks. The RNG-Tr models perform substantially better than models without refinement on almost every dataset. We also perform various analyses of the models to better understand these results.
6.1 The Number of Refinement Iterations
To select the best number of refinement iterations allowed by the RNG Transformer model, we evaluate different variations of our model on the Turkish Treebank (Table 1). We choose the Turkish Treebank because it is a low-resource Treebank and there are more errors in the initial parse for RNG-Tr to correct. We use both DepBERT and UDify as initial parsers. The DepBERT model significantly outperforms the UDify model, so adding the RNG-Tr model to the UDify model results in more relative error reduction in LAS compared to its integration with DepBERT (17.65% vs. 2.67%). In both cases, using three refinement iterations achieves the best result, and excluding the stopping strategy described in Section 3.3 decreases the performance. In subsequent experiments, we use three refinement iterations with the stopping strategy, unless mentioned otherwise.
6.2 UD Treebanks Results
Results for the UD treebanks are reported in Table 2. We compare our models with previous state-of-the-art results (both trained mono-lingually and multi-lingually), based on labelled attachment score (unlabelled attachment scores are provided in Appendix C; all results are computed with the official CoNLL 2018 shared task evaluation script, https://universaldependencies.org/conll18/evaluation.html). The UDify+RNG-Tr model achieves significantly better performance than the UDify model, which demonstrates the effectiveness of the RNG-Tr model at refining an initial dependency graph. The DepBERT model significantly outperforms previous state-of-the-art models on all UD Treebanks. But despite this good performance, the DepBERT+RNG-Tr model achieves further improvement over DepBERT in almost all languages and on average. As expected, we get more improvement when combining the RNG-Tr model with UDify, because UDify’s initial dependency graph contains more incorrect dependency relations for RNG-Tr to correct.
Although generally better, there is surprisingly little difference between the performance after refinement of the UDify+RNG-Tr and DepBERT+RNG-Tr models. To investigate the power of the RNG-Tr architecture to correct any initial parse, we also show results for a model with an empty initial parse, Empty+RNG-Tr. For this model, we run four iterations of refinement (T=4), so that the amount of computation is the same as for UDify+RNG-Tr and DepBERT+RNG-Tr. The Empty+RNG-Tr model achieves competitive results with the UDify+RNG-Tr model (i.e. above the previous state-of-the-art), indicating that the RNG-Tr architecture is a very powerful method for graph refinement. We will discuss this conclusion further in Section 6.4.
6.3 Penn Treebanks Results
UAS and LAS results for the Penn Treebanks are reported in Table 3. We compare to the results of previous state-of-the-art models and DepBERT, and we use the RNG-Tr model to refine both the Biaffine parser dozat2016deep and DepBERT, on the English and Chinese Penn Treebanks (results are calculated with the official evaluation script, https://depparse.uvt.nl/).
Again, the DepBERT model significantly outperforms previous state-of-the-art models, with a 5.78% and 9.15% LAS relative error reduction in English and Chinese, respectively. Despite this level of accuracy, adding RNG-Tr refinement improves accuracy further under both measures in both languages, although in English the differences are not significant. For the Chinese Treebank, RNG-Tr refinement achieves a 4.7% LAS relative error reduction. When RNG-Tr refinement is applied to the output of the Biaffine parser dozat2016deep, it achieves LAS relative error reduction of 10.64% for the English Treebank and 16.5% for the Chinese Treebank. These improvements, even over such strong initial parsers, again demonstrate the effectiveness of the RNG-Tr architecture for graph refinement.
6.4 Error Analysis
To better understand the distribution of errors for our models, we follow mcdonald-nivre-2011-analyzing and plot labelled attachment scores as a function of dependency length, sentence length and distance to root (we use the MaltEval tool nilsson-nivre-2008-malteval for calculating accuracies in all cases). We compare the distributions of errors made by UDify Kondratyuk_2019, DepBERT, and refined models (UDify+RNG-Tr, DepBERT+RNG-Tr, and Empty+RNG-Tr). Figure 2 shows the accuracies of the different models on the concatenation of all development sets of UD Treebanks (tables for the error analysis section, and graphs for each language, are provided in Appendix D). Results show that applying RNG-Tr refinement to the UDify model results in a great improvement in accuracy across all cases. They also show little difference in the error profile between the better performing models.
Dependency Length:
The leftmost plot compares the accuracy of models on different dependency lengths. Adding RNG-Tr refinement to UDify results in better performance both on short and long dependencies, with particular gains for the longer and more difficult cases.
Distance to Root:
The middle plot illustrates the accuracy of models as a function of the distance to the root of the dependency tree, which is calculated as the number of dependency relations from the dependent to the root. Again, when we add RNG-Tr refinement to the UDify parser we get significant improvement for all distances, with particular gains for the difficult middle distances, where the dependent is neither near the root nor a leaf of the tree.
Sentence Length:
The rightmost plot shows the accuracy of models on different sentence lengths. Again, adding RNG-Tr refinement to UDify achieves significantly better results on all sentence lengths. But in this case, the larger improvements are for the shorter, presumably easier, sentences.
6.5 Refinement Analysis
To better understand how the RNG Transformer model is doing refinement, we perform several analyses of the trained UDify+RNG-Tr model. We choose UDify as the initial parser because the RNG-Tr model makes more changes to the parses of UDify than to those of DepBERT, so we can more easily analyse these changes. An example of this refinement is shown in Figure 3, where the UDify model predicts an incorrect dependency graph, but the RNG-Tr model modifies it to build the gold dependency tree.
Refinements by Iteration:
To measure the accuracy gained from refinement at different iterations, we define the following metric:
$$
\mathrm{REL}^t = \mathrm{RER}(\mathrm{LAS}^{t-1}, \mathrm{LAS}^{t})
$$

where RER is relative error reduction, and $t$ is the refinement iteration. $\mathrm{LAS}^0$ is the accuracy of the initial parser, UDify in this case. To illustrate the refinement procedure for different dataset types, we split UD Treebanks based on their training size into "Low-Resource" and "High-Resource" datasets (we consider languages that have more than 10k training sentences as "High-Resource").
Table 4 shows this refinement metric after each refinement iteration of the UDify+RNG-Tr model on the UD Treebanks (for these results we apply MST decoding after every iteration, to allow proper evaluation of the intermediate graphs). Every refinement step achieves an increase in accuracy, on both low and high resource languages. But the amount of improvement decreases for higher refinement iterations. Interestingly, for languages with less training data, the model cannot learn to make all corrections in a single step but can learn to make the remaining corrections in a second step, resulting in approximately the same total percentage of errors corrected as for high resource languages.
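Relative error reduction between consecutive iterations is simple arithmetic, and a short sketch makes the definition explicit. The accuracies in the example are made-up numbers chosen for easy arithmetic, not results from the paper, and the function names are hypothetical.

```python
from typing import List


def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Percentage of the remaining errors that were removed.

    Both accuracies are given in percent (0 to 100).
    """
    error_before = 100.0 - acc_before
    if error_before == 0:
        raise ValueError("no errors left to reduce")
    return 100.0 * (acc_after - acc_before) / error_before


def per_iteration_rer(accuracies: List[float]) -> List[float]:
    """accuracies[0] is the initial parser; entry t is the accuracy
    after refinement iteration t."""
    return [relative_error_reduction(before, after)
            for before, after in zip(accuracies, accuracies[1:])]


# Hypothetical accuracies: initial parser, then two refinement iterations.
rer_by_iteration = per_iteration_rer([80.0, 85.0, 86.0])
```

Going from 80 to 85 removes 5 of the 20 remaining error points (25%), while going from 85 to 86 removes 1 of 15 (about 6.7%), which illustrates how the metric can shrink at later iterations even when accuracy keeps rising.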
Dependency Type Refinement:
Table 5 shows the accuracy of a selection of different dependency types for the UDify model as the initial parser, and the refined model (+RNG-Tr); the accuracy of UDify, DepBERT, and their integration with RNG-Tr on all dependency types is provided in Appendix E. On average, we achieve 29.78% LAS error reduction compared to UDify by adding the RNG-Tr model to UDify. Significant improvements are achieved for all dependency types, especially in hard cases, such as copula (cop), the relation of a function word used to link a subject to a nonverbal predicate.
Non-Projective Trees:
Predicting non-projective trees is a challenging issue for dependency parsers. Figure 4 shows precision and recall for the UDify and UDify+RNG-Tr models on non-projective trees of the UD Treebanks. Adding the RNG-Tr model to the initial parser results in a significant improvement in both precision and recall, which again demonstrates the effectiveness of the RNG-Tr model on hard cases.
6.6 Time complexity
In this section, we compute the time complexity of both proposed models, DepBERT and RNG-Tr.
The time complexity of the original Transformer vaswani2017attention is $O(n^2)$, where $n$ is the sequence length, so the time complexity of the encoder is $O(n^2)$. The time complexity of the decoder is determined by the Chu-Liu/Edmonds algorithm Chu1965OnTS; edmonds1967optimum, which is $O(n^2)$. So, the total time complexity of the DepBERT model is $O(n^2)$, the same as other graph-based models.
In each refinement step, the time complexity of the Graph-to-Graph Transformer mohammadshahi2019graphtograph is $O(n^2)$. Since we use the argmax function in the intermediate steps, the decoding time complexity is determined by the last decoding step, which is $O(n^2)$. So, the total time complexity is $O(n^2)$.
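As a back-of-the-envelope check, assume that one encoder pass and the final Chu-Liu/Edmonds decoding are both quadratic in the sequence length $n$, and that each intermediate argmax decoding scans an $n \times n$ score matrix. Writing $T$ for the limit on the number of refinement iterations, the total cost under these assumptions is:

```latex
\underbrace{T \cdot O(n^2)}_{\text{encoder passes}}
\;+\; \underbrace{(T-1) \cdot O(n^2)}_{\text{argmax decoding}}
\;+\; \underbrace{O(n^2)}_{\text{final MST decoding}}
\;=\; O(T\, n^2)
```

This stays quadratic in $n$ whenever $T$ is a small fixed constant, such as the three iterations used in the experiments, although the constant factor grows roughly linearly with $T$.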
7 Conclusion
We proposed a novel architecture for structured prediction, Recursive Non-autoregressive Graph-to-graph Transformer (RNG-Tr), to iteratively refine arbitrary graphs. Given an initial graph, RNG Transformer learns to predict a corrected graph over the same set of nodes. Each iteration of refinement predicts the edges of the graph in a non-autoregressive fashion, but conditions these predictions on the entire graph from the previous iteration. This graph conditioning and prediction are done with the Graph-to-Graph Transformer architecture mohammadshahi2019graphtograph, which makes it capable of capturing complex patterns of interdependencies between graph edges. Graph-to-Graph Transformer also benefits from initialisation with a pre-trained BERT devlin2018bert model. We also proposed a graph-based dependency parser called DepBERT, which is the same as our refinement model but without graph inputs.
We evaluated the RNG Transformer architecture on syntactic dependency parsing. We ran experiments with a variety of initial parsers, including DepBERT, on 13 languages of the Universal Dependencies Treebanks, and on the English and Chinese Penn Treebanks. Our DepBERT model already significantly outperformed previous state-of-the-art models on both types of Treebanks. Even with this very strong initial parser, RNG-Tr refinement almost always improves accuracies, setting new state-of-the-art accuracies for all treebanks. Regardless of the initial parser (e.g. UDify Kondratyuk_2019 on UD Treebanks, and the Biaffine parser dozat2016deep on Penn Treebanks), RNG-Tr reaches around the same level of accuracy, even when it is given an empty initial parse, demonstrating the power of this iterative refinement method. Finally, we provided error analyses of the proposed model to illustrate its advantages and understand how refinements are made across iterations.
The RNG Transformer architecture is a very general and powerful method for structured prediction, which could easily be applied to other NLP tasks. It would especially benefit tasks that require capturing complex structured interdependencies between graph edges, even with the computational benefits of a non-autoregressive model.
Acknowledgements
We are grateful to the Swiss NSF, grant CRSII5_180320, for funding this work. We also thank Lesly Miculicich and other members of the Idiap NLU group for helpful discussions.
Appendix A UD Treebanks Details
Appendix B Implementation Details
For better convergence, we use two separate optimizers for pre-trained parameters and randomly initialized parameters. We apply bucketed batching, grouping sentences by their lengths into the same batch to speed up the training. Here is the list of hyper-parameters for RNG Transformer model:
| Hyper-parameter | Value |
| Base Learning rate | 2e-3 |
| BERT Learning rate | 1e-5 |
| Max Position Embedding | 512 |
| Feed-Forward layers (arc) | |
| Feed-Forward layers (rel) | |