
Graph-to-Graph Transformer for Transition-based Dependency Parsing

Transition-based dependency parsing is a challenging task for conditioning on and predicting structures. We demonstrate state-of-the-art results on this benchmark with the Graph2Graph Transformer architecture. This novel architecture supports both the input and output of arbitrary graphs via its attention mechanism. It can also be integrated both with previous neural network structured prediction techniques and with existing Transformer pre-trained models. Both with and without BERT pretraining, adding dependency graph inputs via the attention mechanism results in significant improvements over previously proposed mechanisms for encoding the partial parse tree, resulting in accuracies which improve the state-of-the-art in transition-based dependency parsing, achieving 95.64% UAS and 93.81% LAS on Stanford WSJ dependencies. Graph2Graph Transformers are not restricted to tree structures and can be easily applied to a wide range of NLP tasks.





1 Introduction

In recent years, there has been a large amount of research on applying self-attention models to many NLP tasks. Transformer Vaswani et al. (2017) is the most common architecture, which can capture long-range dependencies by using a self-attention mechanism over a set of vectors. To encode the sequential structure of sentences, typically absolute position embeddings are input to each vector in the set, but recently a mechanism has been proposed for inputting relative positions Shaw et al. (2018). For each pair of vectors, an embedding for their relative position is input to the self-attention function. This mechanism can be generalised to input arbitrary graphs of relations. We propose a version of the Transformer architecture which combines this mechanism for conditioning on graphs with an attention-like mechanism for predicting graphs and demonstrate its effectiveness on syntactic dependency parsing. We call this architecture Graph2Graph Transformer.

Our proposed Graph2Graph Transformer parser is a transition-based dependency parser. At each step, the model predicts the next parsing decision by conditioning on the sequence of previous parsing decisions and the partial parse structure which those decisions specify. We input the previously specified dependency relations into our Transformer model via the self-attention mechanism. We then predict the next dependency relation by conditioning on the two vectors for the words involved in the relation, analogously to the attention mechanism. To better model the sequence of parser decisions, we also combine the Transformer model with an LSTM model of the parse history and a composition model of the partial parse. Even without the proposed graph inputs, this novel Transformer model of transition-based dependency parsing achieves good performance, but we still get substantial improvements by adding the graph inputs.

We also demonstrate that, despite the modified input mechanisms, this Graph2Graph Transformer architecture can be effectively initialised with standard pre-trained Transformer models. Initialising the parser with pre-trained BERT Devlin et al. (2018) parameters leads to large improvements for all models, and an even larger increase in performance when we add graph inputs. The resulting model significantly improves over the state of the art in transition-based dependency parsing.

This success demonstrates the effectiveness of Graph2Graph Transformers for conditioning on and predicting graph edges. This architecture can be easily applied to other NLP tasks that have any graph as the input and need to predict a graph over the same set of nodes as output.

Our contributions are:

  • We propose a Graph2Graph Transformer architecture for conditioning on and predicting structures.

  • We propose a novel Transformer model of transition-based dependency parsing.

  • We successfully integrate the proposed model with a pre-trained BERT initialisation, achieving state-of-the-art results for transition-based dependency parsing.

2 Transition-based Dependency Parsing

A dependency parser analyses the grammatical structure of a sentence, establishing relationships between “head” words and their syntactic “dependents”. Dependency parses are used in a wide range of natural language processing (NLP) applications, such as machine translation Currey and Heafield (2019); Vashishth et al. (2018); Ding and Tao (2019); Bastings et al. (2017), information extraction Nguyen et al. (2009); Angeli et al. (2015); Peng et al. (2017), sentiment analysis Tai et al. (2015) and low-resource language processing McDonald et al. (2013); Ma and Xia (2014).

Dependency parsing has been dominated by two approaches, transition-based models and graph-based models McDonald and Nivre (2007, 2011). As with structured prediction in general, the challenge is to model the constraints and correlations between different decisions about an arbitrarily large structure without suffering from the exponential size of the structured output space. Graph-based models assume that if we can model correlations between the input sequence and individual decisions about the output structure really well, and if we can model exactly the discrete constraints between these decisions, then we can assume that these output decisions are otherwise statistically independent. This assumption allows exact dynamic programming solutions to the decoding problem (finding the best structure), given estimates for the individual decisions.

In contrast, transition-based models Nivre (2008) allow arbitrary correlations between output decisions to be modelled by making the decisions one at a time, each conditioned on all the previous decisions. Instead of dynamic programming, beam search is usually used to search the space of output structures, often even with a one-best search. This makes transition-based parsers, in general, faster than graph-based parsers, but they tend to suffer from both search errors and the difficulty of modelling the correlations between decisions.

Because we are investigating an architecture for both conditioning on and predicting structures, in this work we focus on transition-based models. At each step of the parse, the model takes as input both the input sentence and the sequence of previous output decisions, and then predicts the next decision about the output structure. While it is possible to model both these inputs as sequences and apply a sequence-to-sequence model, this does not work as well as explicitly modelling the partial structure specified by the previous decisions. We follow previous work Yamada and Matsumoto (2003); Nivre (2003, 2004) in representing the input sentence and the previous parse in the state of an incremental parser, including a buffer of words waiting to be processed, and a stack of words which are partially processed. For our proposed model, we add a third part to the parser state which is a list of deleted words which have finished being processed. In addition, parsing models usually assume that the parser state includes an explicit representation of the sequence of previous decisions, called parser actions, and of the graph of dependency relations which these decisions specify.

The challenge in transition-based dependency parsing is finding an encoding of this parser state which can be used in the transition classifier to predict the next parser action. This challenge has been addressed with several alternative approaches, such as feature engineering Zhang and Nivre (2011); Ballesteros and Nivre (2016); Chen et al. (2014); Ballesteros and Bohnet (2014), and inducing representations with neural networks Henderson (2003); Titov and Henderson (2010); Dyer et al. (2015); Weiss et al. (2015); Andor et al. (2016).

Our transition-based parser uses ‘arc-standard’ parsing sequences Nivre (2004), which make parsing decisions in bottom-up order. The main data structures for representing the state of an arc-standard parser are a buffer of words and a stack of partially constructed syntactic sub-trees. For input to the Transformer, we represent the parser state as a partition of the words of the sentence into a buffer $B$, a stack $S$ and a delete list $D$, plus a directed graph $G$ of labelled dependency relations between these words. The graph $G$ includes all the dependency relations which have been specified by the previous parser decisions, so it represents the partial parse structure constructed so far. The deleted words are those which have been removed from the stack, after having both their children and parents specified in $G$. We will describe how these components are used in Section 3.2.

At the initialisation step, the stack contains just the ROOT symbol, the buffer includes the words of the input sentence in order, and the delete list is empty. Parsing is finished when the buffer becomes empty, the stack contains only the ROOT symbol, and the delete list contains all tokens. At each step, a parser action is chosen and the parser state is modified. For an arc-standard parser, the parser actions are as follows, where $w_i$ and $w_j$ are the top two elements on the stack (with $w_j$ on top), and $w_k$ is the front element of the buffer:

  • LEFT-ARC($l$): Add an arc $w_j \xrightarrow{l} w_i$ to $G$, remove $w_i$ from the stack, and insert it in the delete list. Precondition: stack length must be greater than 2.

  • RIGHT-ARC($l$): Add an arc $w_i \xrightarrow{l} w_j$ to $G$, remove $w_j$ from the stack, and insert it in the delete list. Precondition: stack length must be greater than 2.

  • SHIFT: Move $w_k$ from the buffer to the top of the stack. Precondition: buffer length must be greater than 1.
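The transition system above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class and function names are hypothetical, words are represented by their sentence indices, and the preconditions are written under the assumption that lengths count only ROOT and real words (so RIGHT-ARC is allowed when the stack holds ROOT plus one word, and SHIFT whenever the buffer is non-empty), which is what makes the stated termination condition reachable in this sketch and may differ from how the text counts lengths.

```python
from dataclasses import dataclass, field

ROOT = 0  # index of the ROOT symbol; words are numbered 1..n


@dataclass
class ParserState:
    stack: list                                   # bottom ... top
    buffer: list                                  # front ... back
    deleted: list = field(default_factory=list)   # words removed from the stack
    arcs: list = field(default_factory=list)      # (head, label, dependent) triples


def initial_state(n_words):
    """Stack holds only ROOT, buffer holds all words in order, delete list is empty."""
    return ParserState(stack=[ROOT], buffer=list(range(1, n_words + 1)))


def legal_actions(state):
    """Assumed preconditions: lengths count only ROOT and words."""
    actions = []
    if len(state.buffer) >= 1:
        actions.append("SHIFT")
    if len(state.stack) >= 2:
        actions.append("RIGHT-ARC")
    if len(state.stack) >= 3:  # the dependent must not be ROOT
        actions.append("LEFT-ARC")
    return actions


def apply_action(state, action, label=None):
    if action not in legal_actions(state):
        raise ValueError(f"{action} is not legal in this state")
    if action == "SHIFT":
        state.stack.append(state.buffer.pop(0))
    elif action == "LEFT-ARC":   # top is the head, second-from-top is the dependent
        head = state.stack[-1]
        dependent = state.stack.pop(-2)
        state.arcs.append((head, label, dependent))
        state.deleted.append(dependent)
    else:                        # RIGHT-ARC: second-from-top is the head, top is the dependent
        dependent = state.stack.pop()
        head = state.stack[-1]
        state.arcs.append((head, label, dependent))
        state.deleted.append(dependent)
    return state


def is_final(state):
    return not state.buffer and state.stack == [ROOT]


# Toy sentence "She eats fish": 1 = She, 2 = eats, 3 = fish.
state = initial_state(3)
for action, label in [("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "nsubj"),
                      ("SHIFT", None), ("RIGHT-ARC", "dobj"), ("RIGHT-ARC", "root")]:
    apply_action(state, action, label)
```

After the six actions the buffer is empty, the stack holds only ROOT, and every word has moved to the delete list, matching the termination condition described above.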

3 The Parsing Model

Figure 1: The Graph-to-Graph Transformer parsing model.

The proposed parsing model learns to embed the sequence of previous parser actions and the resulting parser state, and then to predict the next parser action from that embedding. It is illustrated in Figure 1.

Because our proposed deep learning architecture is based on Transformer, we start by proposing a novel Transformer model of transition-based dependency parsing. This model incorporates components which have proved successful in previous work, but uses a Transformer to compute context-dependent representations of words, which we will refer to as token embeddings.

To support a general mechanism for graph output, the token embeddings of only the top two words on the stack are used by the transition classifier to predict the next parser action. It is these two words which may have a dependency relation specified by the parser action. By making the graph prediction output a function of the embeddings of the two words involved in the relation, this graph output method is similar to an attention mechanism, where attention weights are also a function of the embeddings of the two tokens involved in the attention relationship. We hypothesise that this synergy between the representations expected by the attention mechanism and those expected by the output mechanism will improve performance.

To support a general mechanism for graph input, we add input embeddings specifying the graph of previously chosen dependency relations to the self-attention mechanism. For every pair of words, the attention functions receive embeddings which specify the dependency relation, if any, between these two words. As with the graph outputs, we hypothesise that inputting graph relation embeddings into the mechanism for finding attention relationships will improve performance.

Finally, we show that these modifications to the input of the Transformer architecture do not prevent the effective use of initialisation with BERT, which has been pre-trained without them. In the rest of this section we describe the architecture and parsing model in more detail.

3.1 Input Embeddings

The Transformer architecture takes a sequence of input tokens, converts them into a sequence of input embedding vectors, and then produces a new sequence of context-dependent token embeddings. For our model, the sequence of input tokens represents the current parser state, as illustrated in Figure 1.

The input tokens include the words of the sentence with their associated part-of-speech (PoS) tags. Each of these words appears either in the stack or buffer of the parser state, or otherwise in a list of words which have been deleted from the stack because all their dependency relations have been specified. In addition, there is the ROOT symbol, for the root of the dependency tree, which is always on the bottom of the stack. Inspired by the input representation of BERT Devlin et al. (2018), we also use two special symbols, START and SEP, which indicate the different parts of the parser state.

The sequence of input tokens is illustrated at the top of Figure 2. It starts with the START symbol, then includes the tokens on the stack from bottom to top. Then it has a SEP symbol, followed by the tokens on the buffer from front to back, so that they are in the same order in which they appeared in the sentence. If the model does not use graph inputs to the attention mechanism, then this is a sufficient representation of the parser state. Otherwise, the input sequence includes another SEP symbol followed by the tokens in the delete list, ordered according to their order in the sentence. Adding the deleted words allows the input of the dependency relations which involve those words.
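As a minimal sketch of this layout (the function name and the `(sentence_index, word)` token representation are illustrative assumptions, not part of the paper), the input sequence can be assembled as follows:

```python
def build_input_sequence(stack, buffer, deleted, use_graph_inputs=True):
    """Lay out the parser state as a flat token sequence.

    stack:   (sentence_index, word) pairs, bottom to top
    buffer:  (sentence_index, word) pairs, front to back
    deleted: (sentence_index, word) pairs, in any order
    """
    sequence = ["START"] + [word for _, word in stack]
    sequence += ["SEP"] + [word for _, word in buffer]
    if use_graph_inputs:
        # Deleted words are listed in their original sentence order.
        sequence += ["SEP"] + [word for _, word in sorted(deleted)]
    return sequence


# State after SHIFT, SHIFT, LEFT-ARC on "She eats fish".
stack = [(0, "ROOT"), (2, "eats")]
buffer = [(3, "fish")]
deleted = [(1, "She")]
with_graph = build_input_sequence(stack, buffer, deleted)
without_graph = build_input_sequence(stack, buffer, deleted, use_graph_inputs=False)
```

Here `with_graph` is `["START", "ROOT", "eats", "SEP", "fish", "SEP", "She"]`, while `without_graph` stops after the buffer segment.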

As part of the input sequence, we also include a specification of the dependency relation, if any, specified in the immediately preceding parser action, which always has the top of the stack as its head. This will be discussed when we discuss the composition model in Section 3.1.2. In addition, we include input information specifying the label of the parent dependency relation for each token, which is only known for the words in the delete list.

Figure 2: Input embeddings of self-attention model in G2G Transformer at a specific time step. The input embeddings are the summation of output embeddings of composition model (described in section 3.1.2), segment embeddings, position embeddings, and graph label embeddings

Given this input sequence, the model computes a sequence of vectors which are input to the Transformer network. As depicted in the lower section of Figure 2, each of these vectors is the sum of several embeddings, which are defined in the remainder of this subsection.

3.1.1 Input Token Embeddings

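The words and PoS tags of the sentence each have associated embedding vectors:

$$E_{w_i} = \mathrm{Emb}(w_i), \quad E_{\alpha_i} = \mathrm{Emb}(\alpha_i), \quad i = 1, 2, \ldots, k \tag{1}$$

where $w_i$ and $\alpha_i$ are the $i$-th word and its PoS tag in a sentence of $k$ words, and $\mathrm{Emb}$ is the embedding mapping from the set of training words ($\mathcal{W}$) plus the set of PoS tags ($\mathcal{P}$) to the embedding space ($\mathbb{R}^m$, where $m$ is the dimension of the embedding space). For the word embeddings, we use pre-trained word vectors from the BERT model Devlin et al. (2018). The PoS embeddings are trained parameters. These word and PoS embeddings are summed to get the embedding of each token:

$$T_{w_i} = E_{w_i} + E_{\alpha_i}, \quad i = 1, 2, \ldots, k \tag{2}$$

where $T_{w_i}$ is the token embedding of word $w_i$ in the sentence.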

3.1.2 Composition Model

Previous work has shown that recursive neural networks are capable of inducing a representation for complex phrases by recursively embedding sub-phrases Socher et al. (2011, 2014, 2013); Hermann and Blunsom (2013). Dyer et al. (2015) showed that this is an effective technique for embedding the partial parse subtrees specified by the parse history in transition-based dependency parsing. Since a word in a dependency tree can have a variable number of dependents, they combined the dependency relations incrementally as they are specified by the parser. They used a non-linear mapping function ($\tanh$) to compute the new embeddings.

We extend this idea by using a feed-forward neural network with a non-linear activation function and skip connections. For every token in position $i$ on the stack, after making decision $t$, the composition model computes a vector $C^{t+1}_i$ which is added to the input embedding for that token:

$$C^{t+1}_i = \mathrm{Comp}(\psi^t_i, \omega^t_i, l^t_i) + \psi^t_i \tag{3}$$

where the $\mathrm{Comp}$ function is a one-layer feed-forward neural network, and $(\psi^t_i, \omega^t_i, l^t_i)$ represents any new dependency relation with head $\psi^t_i$ specified by the decision at step $t$. In arc-standard parsing, the only word which might have received a new dependent by the previous decision is the word on the top of the stack, $i = 1$. This gives us the following definition of $(\psi^t_i, \omega^t_i, l^t_i)$:

$$(\psi^t_i, \omega^t_i, l^t_i) = \begin{cases} (C^t_2,\ C^t_1,\ L_l) & \text{if } i = 1 \text{ and the decision is RIGHT-ARC}(l) \\ (C^t_1,\ C^t_2,\ L_l) & \text{if } i = 1 \text{ and the decision is LEFT-ARC}(l) \\ (T_{w_k},\ L_{[\mathrm{NULL}]},\ L_{[\mathrm{L\text{-}NULL}]}) & \text{if } i = 1 \text{ and the decision is SHIFT} \\ (C^t_i,\ L_{[\mathrm{NULL}]},\ L_{[\mathrm{L\text{-}NULL}]}) & \text{if } i \neq 1 \end{cases} \tag{4}$$

where $C^t_1$ and $C^t_2$ are the embeddings of the top two elements of the stack at time step $t$, and $T_{w_k}$ is the initial token embedding of the word on the front of the buffer at time $t$. $L_l$ is the label embedding of the specified relation, including its direction. In the last case, $C^t_i$ denotes the vector which the same token had at step $t$. For all words on the stack which have not received a new dependent, the composition is computed anyway, but with a [NULL] dependent and [L-NULL] label. (Preliminary experiments indicated that not updating the composition embedding for these cases resulted in worse performance.)

At $t = 0$, the composition vector of every token $w_i$ is set to its initial token embedding $T_{w_i}$. The model then computes Equation 3 iteratively at each step $t$ for each token on the stack at that step. For all tokens not on the stack, their vector $C^{t+1}$ is left unchanged from the previous step’s vector $C^t$, regardless of what position that token occupied in the previous parser state. This means that all tokens on the buffer retain their initial vector of $T_{w_i}$, and all tokens in the delete list retain the composition vector they had when they were popped from the stack.

There is a skip connection in Equation 3 to address the vanishing gradient problem. Also, preliminary experiments showed that without this skip connection to bias the composition model towards the initial token embeddings $T_{w_i}$, integrating pre-trained BERT Devlin et al. (2018) parameters into the model (discussed in Section 3.5) did not work.
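The composition update described in this subsection might be sketched as below. Everything concrete here is an assumption made for illustration: the three inputs are concatenated before the single feed-forward layer, `tanh` is used as the activation, the weights are random, and zero vectors stand in for the [NULL] and [L-NULL] embeddings. Only the overall shape (one feed-forward layer plus a skip connection to the head vector, applied to every token on the stack) follows the text.

```python
import numpy as np

m = 4                                        # embedding size (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3 * m, m))   # hypothetical weights of the Comp layer
b = np.zeros(m)
NULL_DEP = np.zeros(m)                       # stand-in for the [NULL] dependent embedding
NULL_LABEL = np.zeros(m)                     # stand-in for the [L-NULL] label embedding


def comp(psi, omega, label):
    """One feed-forward layer over the (head, dependent, label) triple (assumed: concat + tanh)."""
    return np.tanh(np.concatenate([psi, omega, label]) @ W + b)


def compose(psi, omega=None, label=None):
    """Composition with a skip connection to the head vector psi."""
    omega = NULL_DEP if omega is None else omega
    label = NULL_LABEL if label is None else label
    return comp(psi, omega, label) + psi


def update_stack_vectors(stack_vecs, action, label_vec=None, buffer_front=None):
    """stack_vecs: vectors of the stack tokens (bottom ... top) at step t.
    Returns the vectors of the tokens on the stack at step t + 1."""
    if action == "SHIFT":
        # Old stack tokens get a NULL update; the shifted word starts from its token embedding.
        return [compose(v) for v in stack_vecs] + [compose(buffer_front)]
    if action == "LEFT-ARC":      # top is the head, second-from-top is the dependent
        head, dep = stack_vecs[-1], stack_vecs[-2]
    else:                         # RIGHT-ARC: second-from-top is the head, top is the dependent
        head, dep = stack_vecs[-2], stack_vecs[-1]
    rest = [compose(v) for v in stack_vecs[:-2]]
    return rest + [compose(head, dep, label_vec)]


stack_vecs = [rng.normal(size=m) for _ in range(3)]
after_shift = update_stack_vectors(stack_vecs, "SHIFT", buffer_front=rng.normal(size=m))
after_arc = update_stack_vectors(stack_vecs, "LEFT-ARC", label_vec=rng.normal(size=m))
```

Because the layer output is bounded by the `tanh`, each updated vector stays within one unit (per dimension) of the head vector it was built from, which is one way to see how the skip connection keeps the representation close to the initial token embeddings.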

3.1.3 Parser State Structure Embeddings

To distinguish the different positions and roles of words in the parser state, we add position embeddings and segment embeddings to the above token embeddings. These embeddings are not included in the input to the composition model (which uses the token embeddings $T_{w_i}$), but they are included in the input to the Transformer’s self-attention layers.

Position Embedding: We initialise the position embeddings with the pre-trained position embeddings of BERT Devlin et al. (2018), because the buffer and delete parts of the parser state have the same word order as the input sentence. The position embeddings have the same dimension ($m$) as the output of the composition model, so we can sum them together. We further fine-tune the position embedding parameters during training.

Segment Embedding: Since the input sequence contains stack, buffer and deleted parts (if we have graph input), the model should make a distinction between tokens which occur in these different parts. To make this distinction, there are embeddings for each of these segments of the parser state, namely $\mathrm{Seg}_S$, $\mathrm{Seg}_B$ and $\mathrm{Seg}_D$, respectively. The dimension of these segment embeddings is the same as that of the position embeddings.

3.1.4 Total Input Embeddings

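Finally, for the token at input position $i$, we sum the output of the composition model ($C^t_i$), its segment embedding ($\mathrm{Seg}_i$) and its position embedding ($\mathrm{Pos}_i$), and use the result as the input embedding $x^t_i$ at step $t$ for the self-attention layers of the Transformer model:

$$x^t_i = C^t_i + \mathrm{Seg}_i + \mathrm{Pos}_i \tag{5}$$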

3.2 Graph2Graph Transformer

Our proposed model for mapping the input sequence of embeddings described in Section 3.1 to a vector which can be used by the transition classifier described below in Section 3.4 is a form of Transformer Vaswani et al. (2017). Transformers are multi-layer self-attention-based models for encoding and generating sequences. They have been very successful in a wide variety of NLP tasks. Here we propose a version of Transformer which is designed for both conditioning on graphs and predicting graphs, which we call Graph2Graph Transformer, and show how it can be applied to transition-based dependency parsing.

Inspired by the relative position embeddings of Shaw et al. (2018), we use the attention mechanism of Transformer to input arbitrary binary graph relations. By inputting the embedding for a relation into the attention computations for the related words, the model can more easily learn to pass information between graph-local words, which gives the model an appropriate linguistic bias, without imposing hard constraints.

Given that the attention function is being used to input graph relations, it is natural to assume that graph relations can also be predicted with an attention-like function. We do not go so far as to restrict the form of the prediction function, but we do restrict the vectors used to predict graph relations to only the two for the words involved in the relation.

3.2.1 Baseline Transformer

Transformer Vaswani et al. (2017) is a sequence-to-sequence model, of which we only use the encoder component. A Transformer encoder computes an output embedding for each token in the input sequence through stacked layers of a self-attention mechanism. Each layer contains two sub-layers: a multi-head self-attention layer, and a position-wise feed-forward layer. In the self-attention layer, the value vectors output by multiple attention heads are concatenated together and projected to build the output vector of this layer.

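Each attention head has its own parameters and computes its own value vectors. Given $X = (x_1, \ldots, x_n)$ as the input sequence, the attention mechanism finds a sequence of value vectors $Z = (z_1, \ldots, z_n)$. Each output element $z_i$ is a linear transformation of input elements:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V) \tag{6}$$

where $W^V \in \mathbb{R}^{d_x \times d}$ is called the value matrix and is learned during training, $d_x$ is the dimension of the input elements, and $d$ is the attention head size. $x_j$ is the input element at position $j$, and $z_i$ is the output element at position $i$. $\alpha_{ij}$ is the attention weight, which is calculated by a Softmax function:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \tag{7}$$

where $e_{ij}$ is calculated by:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^{\top}}{\sqrt{d}} \tag{8}$$

where $W^Q, W^K \in \mathbb{R}^{d_x \times d}$ are the query and key matrices respectively and contain learned parameters.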
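For concreteness, a single attention head of this baseline can be written in a few lines of NumPy. This is a generic sketch of standard scaled dot-product self-attention with random weights and illustrative sizes, not code from the paper.

```python
import numpy as np


def softmax(e):
    """Row-wise softmax, shifted for numerical stability."""
    e = e - e.max(axis=-1, keepdims=True)
    w = np.exp(e)
    return w / w.sum(axis=-1, keepdims=True)


def attention_head(X, Wq, Wk, Wv):
    """X: (n, d_x) input elements; Wq, Wk, Wv: (d_x, d) projection matrices."""
    d = Wq.shape[1]
    e = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)   # scaled dot-product scores e_ij
    alpha = softmax(e)                       # attention weights, rows sum to 1
    return alpha @ (X @ Wv), alpha           # z_i = sum_j alpha_ij (x_j W^V)


rng = np.random.default_rng(0)
n, d_x, d = 5, 16, 8                         # illustrative sizes
X = rng.normal(size=(n, d_x))
Wq, Wk, Wv = (rng.normal(size=(d_x, d)) for _ in range(3))
Z, alpha = attention_head(X, Wq, Wk, Wv)
```

In a full layer, the outputs `Z` of all heads would be concatenated and projected, as described at the start of this subsection.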

3.2.2 Graph Inputs

Inspired by Shaw et al. (2018), we extend the architecture of the Transformer to accept the dependency tree as an additional input, using the same formulas as they use for inputting relative position embeddings. In G2G Transformer, we encode the information of a dependency pair ($x_i$, $x_j$) by modifying Equation 8:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + p_{ij} W^L_1)^{\top}}{\sqrt{d}} \tag{9}$$

where $p_{ij} \in \{0, 1\}^3$ is a one-hot vector which specifies the type of dependency relation between $x_i$ and $x_j$ (see Table 1). $W^L_1 \in \mathbb{R}^{3 \times d}$ is a matrix of learned parameters. We also modify Equation 6 to transmit information of the partially constructed graph to the output of the attention layer:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V + p_{ij} W^L_2) \tag{10}$$

where $W^L_2 \in \mathbb{R}^{3 \times d}$ is a parameter matrix.

Relation                 Assigned dimension
None                     0
(head → dependent)       1
(dependent ← head)       2
Table 1: Index of the type of dependency relation. In this work we only input the unlabelled directionality of the relation, if any.

As shown in Table 1, we have chosen to input unlabelled dependency relations, explicitly representing only the direction of the dependency. This choice was made mostly to simplify our extension of Transformer, as well as to limit the computational cost of this extension. In order to input the labels, we add dependency label embeddings to the token embeddings of the dependent word, as discussed in Section 3.1. Since all words which have their heads specified are in the deleted part of the input sequence, this label embedding is added to all and only the tokens in the delete list.
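The graph-input attention can be illustrated with a small NumPy sketch in the style of Shaw et al. (2018), where a relation embedding is added to the key and to the value of each pair. The function names, the random weights, and the reading of Table 1 used here (index 1 when the first word of the pair is the head of the second, index 2 for the reverse) are assumptions made for this example only.

```python
import numpy as np

NONE, HEAD_TO_DEP, DEP_TO_HEAD = 0, 1, 2     # relation indices, as read from Table 1


def relation_matrix(arcs, n):
    """arcs: (head, dependent) position pairs; returns an (n, n) matrix of relation indices."""
    rel = np.full((n, n), NONE, dtype=int)
    for head, dep in arcs:
        rel[head, dep] = HEAD_TO_DEP         # the row word is the head of the column word
        rel[dep, head] = DEP_TO_HEAD         # the row word is a dependent of the column word
    return rel


def softmax(e):
    e = e - e.max(axis=-1, keepdims=True)
    w = np.exp(e)
    return w / w.sum(axis=-1, keepdims=True)


def graph_attention_head(X, rel, Wq, Wk, Wv, W1, W2):
    """Attention head whose keys and values are shifted by relation embeddings.

    X: (n, d_x); rel: (n, n) relation indices; W1, W2: (3, d) relation embedding matrices.
    """
    d = Wq.shape[1]
    P = np.eye(3)[rel]                                   # one-hot p_ij, shape (n, n, 3)
    K = (X @ Wk)[None, :, :] + P @ W1                    # x_j W^K + p_ij W1, shape (n, n, d)
    V = (X @ Wv)[None, :, :] + P @ W2                    # x_j W^V + p_ij W2, shape (n, n, d)
    e = np.einsum("id,ijd->ij", X @ Wq, K) / np.sqrt(d)  # relation-aware scores
    alpha = softmax(e)
    return np.einsum("ij,ijd->id", alpha, V), alpha


rng = np.random.default_rng(0)
n, d_x, d = 4, 16, 8                                     # illustrative sizes
X = rng.normal(size=(n, d_x))
Wq, Wk, Wv = (rng.normal(size=(d_x, d)) for _ in range(3))
W1, W2 = rng.normal(size=(3, d)), rng.normal(size=(3, d))
rel = relation_matrix([(2, 1)], n)                       # the word at position 2 heads position 1
Z, alpha = graph_attention_head(X, rel, Wq, Wk, Wv, W1, W2)
```

If the rows of `W1` and `W2` for the "None" relation are zero and no arcs are given, this reduces to the baseline attention head, which is a quick sanity check on the formulas.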

3.2.3 Graph Outputs

The output of our Graph2Graph Transformer model is the concatenation of the final layer representations of the top two elements on the stack:

$$g^t = [g^t_{s_1}; g^t_{s_2}] \tag{11}$$

where $g^t_{s_1}$ and $g^t_{s_2}$ are the final-layer token embeddings at step $t$ of $s_1$ and $s_2$, the top two elements on the stack. These two vectors are input directly to the transition classifier, described below in Section 3.4.

3.3 History Model

In addition to the composition model, we input a history embedding of the sequence of previous actions at the current parser state. We use the same LSTM history model as Dyer et al. (2015) (referred to there as a StackLSTM). The sequence input to the LSTM is simply the sequence of parser action types, including LEFT-ARC, RIGHT-ARC and SHIFT, but not including the dependency labels or the identities of the words involved.

The output of the history model after step $t$ is the final output vector $h^t$ of the LSTM after inputting the parser action at step $t$. This history vector is passed directly to the transition classifier.

3.4 Transition Classifier

The Transformer’s output vector, described in Section 3.2.3, is concatenated with the history vector, described in Section 3.3, to form the input to the transition classifier. The transition classifier takes this representation of the parser state and parse history and predicts the next parser action, choosing from the legal next actions as defined in Section 2.

At each step $t$, the model first chooses the type of action, namely whether a dependency should be specified and in what direction, which we call the Exist classifier. The Exist classifier outputs scores for three alternatives:

  • No Relation: Do SHIFT

  • Right Relation: Do RIGHT-ARC

  • Left Relation: Do LEFT-ARC

In the latter two cases, a dependency relation is specified between the top two tokens on the stack, in the specified direction. For these cases a second classifier, the Relation classifier, predicts the label of the relation, conditioned on the direction. Given the parser state vector $g^t$ and the parse history vector $h^t$, these classifiers compute:

$$z^t_a = \mathrm{Exist}([g^t; h^t]), \qquad z^t_l = \mathrm{Relation}([g^t; h^t] \mid z^t_a) \tag{12}$$

where $z^t_a \in \mathbb{R}^3$ and $z^t_l \in \mathbb{R}^{n_l}$ ($n_l$ is the number of dependency labels times the number of directions). Both classifiers are multi-layer perceptron classifiers with one hidden layer and a non-linear activation function.
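A rough sketch of this two-stage prediction is given below. It is not the authors' implementation: the ReLU hidden activation, the random weights, the label set, the masking of illegal actions with negative infinity, and the way the Relation scores are split into one row per direction are all assumptions chosen to make the example concrete.

```python
import numpy as np

ACTIONS = ["SHIFT", "RIGHT-ARC", "LEFT-ARC"]   # No Relation / Right Relation / Left Relation


def mlp(x, W_in, b_in, W_out, b_out):
    """One hidden layer; ReLU is an assumed choice of activation."""
    return np.maximum(0.0, x @ W_in + b_in) @ W_out + b_out


def predict_action(g, h, exist_params, rel_params, labels, legal):
    """g: parser state vector, h: history vector, legal: set of currently legal action types."""
    x = np.concatenate([g, h])
    exist_scores = mlp(x, *exist_params)                       # 3 scores
    mask = np.array([a in legal for a in ACTIONS])
    exist_scores = np.where(mask, exist_scores, -np.inf)       # never pick an illegal action
    action = ACTIONS[int(np.argmax(exist_scores))]
    if action == "SHIFT":
        return action, None
    # Relation scores cover labels x directions; keep only the row of the chosen direction.
    rel_scores = mlp(x, *rel_params).reshape(2, len(labels))   # row 0: RIGHT, row 1: LEFT
    row = 0 if action == "RIGHT-ARC" else 1
    return action, labels[int(np.argmax(rel_scores[row]))]


rng = np.random.default_rng(0)
d_g, d_h, hidden = 8, 4, 16                                    # illustrative sizes
LABELS = ["nsubj", "dobj", "root"]                             # toy label set


def init_mlp(n_in, n_hidden, n_out):
    return (rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(size=(n_hidden, n_out)), np.zeros(n_out))


exist_params = init_mlp(d_g + d_h, hidden, 3)
rel_params = init_mlp(d_g + d_h, hidden, 2 * len(LABELS))
g, h = rng.normal(size=d_g), rng.normal(size=d_h)
action, label = predict_action(g, h, exist_params, rel_params, LABELS,
                               legal={"SHIFT", "LEFT-ARC"})
```

Restricting the Relation scores to the row of the chosen direction is one simple way to realise "conditioned on the direction"; other factorizations would fit the description equally well.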

3.5 Pre-Training with BERT

BERT Devlin et al. (2018) provides deep contextual representations based on a series of Transformers trained on a huge amount of un-annotated data with a language-modelling objective. BERT is trained by the Cloze task Taylor (1953) which enables the model to encode information from both directions. In addition, BERT is trained on the next sentence classification objective. BERT employs a subword vocabulary with WordPiece Wu et al. (2016) which splits a word into subwords.

Initialising a Transformer model with the pre-trained parameters of BERT, and then fine-tuning on the target task, has demonstrated large improvements in many tasks. But unlike the previous work we are aware of which has used BERT pre-training, our version of Transformer has novel inputs which were not present when BERT was trained. These novel inputs are the graph inputs to the attention mechanism, and the composition embeddings. Also, the input sequence has a novel structure, which is only partially similar to the input sentences which BERT was trained on. So it is not clear that BERT pre-training will even work with this novel architecture.

To evaluate whether BERT pre-training works for our proposed architecture, we initialise the weights of the Graph2Graph Transformer model with the first $n$ layers of BERT, where $n$ is the number of self-attention layers in our model.

4 Implementation Details

4.1 Dataset

We train our models on a dependency version of the English Wall Street Journal (WSJ) corpus, which is a part of the Penn Treebank Marcus et al. (1993). We follow the standard split and use sections 2-21 for training, section 22 for evaluation, and section 23 for testing. We also add section 24 to our development set to mitigate over-fitting on section 22. We convert constituency trees in the corpus to Stanford dependencies De Marneffe et al. (2006) applying version 3.3.0 of the converter. For POS tags, we use Stanford POS tagger Toutanova et al. (2003), which has an accuracy of 97.44%. As in previous work, we exclude punctuation from evaluation.

4.2 Baselines

We compare our models with several baselines and reduced versions of the model, based on unlabelled/labelled attachment scores (UAS/LAS). As strong baselines from previous work, we compare to previous transition-based models Dyer et al. (2015); Weiss et al. (2015); Andor et al. (2016); Ballesteros et al. (2016); Chen et al. (2014).

To demonstrate the usefulness of each part of the proposed G2G Transformer model, we compare the full model (G2G Tr) with four different reduced versions of the model. We define the Dependency Transformer (DepTr) model as the same model described in Section 3 but without the composition and history models, and without graph inputs to attention (i.e. using the attention of Equation 8). This model does, however, include the same graph output mechanism, conditioning on the two tokens on the top of the stack. Then we add the history model (DepTr+H), and both composition and history models (DepTr+CH), to the Dependency Transformer baseline. Finally, we also consider a version of the full model with the graph output mechanism removed (G2CLS Tr), where we predict the next parser action from the START symbol’s token embedding (referred to as CLS in BERT). All five of these models are evaluated both with and without initialisation with the first $n$ layers of a pre-trained BERT model, where $n$ is the number of self-attention layers in the model.

4.3 Hyper-parameters and details of implementation

All hyper-parameters are given in Appendix A. The same hyper-parameter optimisation strategy was used for all models. For all models we use 6 self-attention layers, except where specified otherwise.

All our models use one-best (deterministic) decoding, meaning that at each step only the highest scoring parser action is considered for continuation. This was done for simplicity. Beam search could also be used with these models.
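One-best decoding amounts to the simple loop sketched below. The decoder is written against four caller-supplied functions, and the toy transition system used to exercise it (a state that only tracks stack and buffer sizes, with a single REDUCE action standing in for the arc actions and fixed scores) is purely hypothetical; in the parser the scores would come from the transition classifier.

```python
def greedy_decode(state, legal_actions, score_actions, apply_action, is_final):
    """Repeatedly apply the highest-scoring legal action until a final state is reached."""
    history = []
    while not is_final(state):
        scores = score_actions(state)                        # dict: action -> score
        best = max(legal_actions(state), key=lambda a: scores[a])
        state = apply_action(state, best)
        history.append(best)
    return state, history


# Toy system: state = (stack_size, buffer_size); the stack always keeps ROOT.
def toy_legal(state):
    stack_size, buffer_size = state
    actions = []
    if buffer_size > 0:
        actions.append("SHIFT")
    if stack_size > 1:
        actions.append("REDUCE")
    return actions


def toy_scores(state):
    return {"SHIFT": 0.0, "REDUCE": 1.0}                     # always prefers REDUCE when legal


def toy_apply(state, action):
    stack_size, buffer_size = state
    if action == "SHIFT":
        return (stack_size + 1, buffer_size - 1)
    return (stack_size - 1, buffer_size)


def toy_final(state):
    return state == (1, 0)


final_state, actions = greedy_decode((1, 2), toy_legal, toy_scores, toy_apply, toy_final)
```

Replacing the `max` over legal actions with a set of the k best partial sequences would turn the same loop into beam search.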

We use pre-trained base “cased” BERT with 12 layers of attention and 12 attention heads. We extract the weights of the first $n$ layers of BERT (where $n$ is the number of attention layers in our models) and use them to initialise our BERT models. For tokenisation, we average the embeddings of the subword tokens which the native BERT tokeniser produces for each word, so that the model has the desired one token per word.
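The subword averaging step can be sketched as follows. The `word_ids` alignment (one word index per subword) is a hypothetical input assumed to come from the tokeniser; no particular tokeniser API is implied, and the sketch assumes every word index from 0 to the maximum occurs at least once.

```python
import numpy as np


def average_subwords(subword_vectors, word_ids):
    """Average subword embeddings into one vector per word.

    subword_vectors: (n_subwords, dim) array
    word_ids:        length n_subwords, index of the word each subword belongs to
                     (every index 0..max must occur at least once)
    """
    subword_vectors = np.asarray(subword_vectors, dtype=float)
    word_ids = np.asarray(word_ids)
    n_words = int(word_ids.max()) + 1
    sums = np.zeros((n_words, subword_vectors.shape[1]))
    np.add.at(sums, word_ids, subword_vectors)               # accumulate per word
    counts = np.bincount(word_ids, minlength=n_words)[:, None]
    return sums / counts


# Two subwords for word 0, one subword for word 1.
vectors = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
word_vectors = average_subwords(vectors, [0, 0, 1])
```

Here `word_vectors` is `[[2, 2], [5, 5]]`: the first word is the mean of its two subword vectors and the second word keeps its single vector.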

In the graph input to the attention function, we do not train the row of the two graph embedding matrices introduced in Section 3.2.2 which corresponds to the case of “No Relation” between the two tokens, leaving these rows frozen at their random initialisation, for reasons of training efficiency.

5 Results and Discussion

5.1 UAS/LAS Results

Model (test set)             UAS     LAS
Chen et al. (2014)           91.80   89.60
Dyer et al. (2015)           93.10   90.90
Ballesteros et al. (2016)    93.56   92.41
Weiss et al. (2015)          94.26   91.42
Andor et al. (2016)          94.61   92.79
DepTr                        88.40   84.23
DepTr+H                      90.44   86.91
DepTr+CH                     92.35   89.51
G2CLS Tr                     92.73   90.65
G2G Tr                       93.21   91.06
BERT DepTr                   91.61   87.81
BERT DepTr+H                 92.90   89.42
BERT DepTr+CH                94.86   92.28
BERT G2CLS Tr                94.27   92.29
BERT G2G Tr                  95.30   93.44
BERT G2G Tr 7-layer          95.64   93.81
Table 2: Results on English WSJ Treebank Stanford dependencies. All models have 6 layers of self-attention, except the last line which has 7 layers.

In Table 2, we compare all variations of our model and previous transition-based models. Compared to the previous state-of-the-art in transition-based dependency parsing, our complete model (BERT G2G Tr) performs significantly better, at 95.30% UAS and 93.44% LAS. This performance continues to improve, to 95.64% UAS and 93.81% LAS, when we increase the depth of the model from 6 self-attention layers to 7, a full percentage point improvement over previous results.

Comparing the different versions of the proposed parser, the same pattern appears both with and without BERT initialisation. Simply applying Transformer to encode the parser state sequence (DepTr) does not perform well. Adding an embedding of the history sequence of parser actions (DepTr+H) results in a big improvement (17%/13% LAS relative error reduction without/with BERT). Adding explicit modelling of the composition of sub-phrases (DepTr+CH) results in a further big improvement (20%/27% LAS relative error reduction without/with BERT), which, with BERT, reaches accuracies competitive with the state-of-the-art.
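The relative error reductions quoted in this section can be recomputed from the LAS column of Table 2. The helper below is a small illustrative function (its name is not from the paper) that treats 100 minus the accuracy as the error.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Percentage of the baseline's error (100 - accuracy) that the new model removes."""
    return 100.0 * (new_acc - baseline_acc) / (100.0 - baseline_acc)


# LAS values taken from Table 2.
history_gain = relative_error_reduction(84.23, 86.91)           # DepTr -> DepTr+H
history_gain_bert = relative_error_reduction(87.81, 89.42)      # BERT DepTr -> BERT DepTr+H
composition_gain = relative_error_reduction(86.91, 89.51)       # DepTr+H -> DepTr+CH
composition_gain_bert = relative_error_reduction(89.42, 92.28)  # BERT DepTr+H -> BERT DepTr+CH

print(round(history_gain), round(history_gain_bert))            # 17 13
print(round(composition_gain), round(composition_gain_bert))    # 20 27
```

For example, adding the history model without BERT reduces the LAS error from 15.77 to 13.09 points, which is about 17% of the original error.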

Even from this very strong starting point, adding graph inputs to the attention mechanism results in a further large improvement of 1.6%/1.2% absolute LAS and 15%/15% relative LAS error reduction without/with BERT. This improvement makes the accuracy of our BERT G2G Tr model significantly better than the previous state-of-the-art in transition-based dependency parsing.

All these models use the same graph output mechanism, conditioning on the embeddings of the two tokens on the top of the stack. We motivated this choice because it is similar to the way the attention mechanism finds relationships between tokens, and it is these two tokens whose relationship we need to decide. But with transition-based parsing, the next parser action could equally well be predicted from the START token embedding, which in BERT (there called CLS) is used to classify the input as a whole. Using our proposed graph output mechanism (G2G Tr) instead of predicting from the START token embedding (G2CLS Tr), there is again an improvement, particularly with BERT (4%/15% relative LAS error reduction without/with BERT).

Although in general it is not surprising that models pretrained with BERT outperform equivalent models which use no resources other than the parsed training corpus, in this case it is surprising because the input to the Transformer is different from that of BERT, as discussed in Section 3.5. In particular, here we add inputs to the Transformer from the composition model and graph inputs to the attention mechanism, neither of which BERT was trained with. In fact, the LAS relative error reduction from adding BERT initialisation is the highest in the full model (27%), followed closely by the DepTr+CH model with composition inputs (26%). The model which is closest to BERT (DepTr) has a lower LAS relative error reduction from adding BERT initialisation (23%), and the reduction for adding just the history model outside of the Transformer (DepTr+H) is even lower (19%). This surprising result, that BERT initialisation helps a Transformer with graph inputs at least as much as one without them, supports the naturalness of inputting graph relations into the attention mechanism. Removing the attention-like graph output (G2CLS Tr) makes BERT initialisation the least helpful (18%). The fact that BERT initialisation helps a Transformer with attention-like graph outputs more than one with CLS outputs supports the naturalness of this output mechanism. These claims are further supported by recent work which shows that the syntactic tree of the sentence is implicitly embedded in the BERT model Hewitt and Manning (2019); Coenen et al. (2019); Goldberg (2019); Kondratyuk and Straka (2019).

All of the above experiments were run with 6 layers of self-attention. We trained an additional model with 7 layers of self-attention (BERT G2G Tr 7-layer) as an indication of whether the model will continue to improve as it is made deeper. This deeper model does perform better, with a 6% LAS relative error reduction, motivating future work on larger Graph2Graph Transformer models.

5.2 Error Analysis

To analyse the errors made by our BERT G2G and BERT DepTr models, we measure their accuracy as a function of dependency length, distance to root and sentence length. (Tables of results and frequencies for the error analysis in Figures 3 to 5 are in Appendix B.) These results demonstrate that most of the improvement of the G2G model over the DepTr models derives from the hard cases which require a more global view of the sentence.

Figure 3 shows labelled F-scores on dependencies binned by dependency length. The length of a dependency relation between two words at positions $i$ and $j$ is measured by the absolute difference of their positions, $|i - j|$. The composition model is crucial to get good accuracies on long dependencies, and the G2G model results in further improvement. The relative stability of results for the BERT G2G model across dependency length demonstrates the benefit of adding the partial dependency tree to the self-attention model, which provides a global view of the sentence when the model considers long dependencies. It also shows a larger increase in absolute performance on the harder cases.
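The binning by dependency length can be sketched as follows. This assumes 1-based token positions, a head index of 0 for attachment to the root, and that the final bin collects all lengths above 7; the function name and the toy sentence are hypothetical.

```python
def dependency_length_bin(dependent: int, head: int) -> str:
    """Map one arc to a length bin: 'ROOT', '1' to '7', or '>7'.

    Positions are 1-based; head == 0 marks attachment to the root.
    """
    if head == 0:
        return "ROOT"
    length = abs(dependent - head)
    return str(length) if length <= 7 else ">7"

# heads[k] is the head of token k + 1 in a hypothetical 6-token sentence.
heads = [2, 0, 2, 6, 6, 2]
bins = [dependency_length_bin(i + 1, h) for i, h in enumerate(heads)]
```

F-scores per bin would then be computed over the gold and predicted arcs that fall into each bin.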

Figure 3: Comparing F-score vs dependency relation length for BERT models.

Figure 4 shows the labelled F-score for dependencies binned by the distance to the root, computed as the number of dependencies in the path from the dependent to the root node. The BERT G2G model outperforms the other models on nodes which are higher in the dependency tree, again illustrating the benefits of a better global view of the sentence through dependencies input to the self-attention mechanism. Other models recover some of the difference for nodes which are farther from the root, which tend to be leaf nodes and thus require information from a narrower context.
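The distance to the root, as defined above, can be computed by following head pointers upwards. The sketch assumes a head array with 1-based token indices and 0 for the root node; the function name and example are illustrative only.

```python
from typing import List

def distance_to_root(heads: List[int], token: int) -> int:
    """Count the arcs on the path from `token` up to the root node.

    heads[k] is the head of token k + 1; head 0 is the root node.
    Raises ValueError if the head array contains a cycle.
    """
    steps = 0
    current = token
    while current != 0:
        current = heads[current - 1]
        steps += 1
        if steps > len(heads):
            raise ValueError("cycle detected in head array")
    return steps

# Same hypothetical 6-token sentence: token 2 is attached to the root.
heads = [2, 0, 2, 6, 6, 2]
distances = [distance_to_root(heads, t) for t in range(1, len(heads) + 1)]
```

Under this definition a token attached directly to the root has distance 1.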

Figure 4: Comparing F-score vs distance to root for BERT models.

Figure 5 shows labelled attachment scores (LAS) for sentences of different lengths. The BERT G2G model outperforms the other models on both short and long sentences, except on very long sentences, where it performs on par with DepTr+CH. The improvement tends to increase as the sentence length increases, again illustrating better performance on the harder cases.
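A per-bin LAS of this kind can be sketched as a token-level average within each sentence-length bin. The bin boundaries follow the sentence-length table in Appendix B, with the last bin assumed to cover 50 tokens or more; the function names and toy counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def length_bin(n_tokens: int) -> str:
    """Bin a sentence length as 1-9, 10-19, ..., 40-49, or >=50."""
    if n_tokens >= 50:
        return ">=50"
    if n_tokens < 10:
        return "1-9"
    low = (n_tokens // 10) * 10
    return f"{low}-{low + 9}"

def las_by_length(sentences: List[Tuple[int, int]]) -> Dict[str, float]:
    """sentences holds (n_tokens, n_correct_labelled_arcs) per sentence."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for n_tokens, n_correct in sentences:
        b = length_bin(n_tokens)
        correct[b] += n_correct
        total[b] += n_tokens
    # Token-level LAS: correct arcs over all tokens in the bin.
    return {b: 100.0 * correct[b] / total[b] for b in total}

# Toy data: two short sentences and two medium-length ones.
scores = las_by_length([(8, 6), (12, 9), (4, 3), (18, 18)])
```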

Figure 5: Comparing LAS vs sentence length for BERT models.

6 Related Work

Recent work on parsing has used deep contextualised word representations, which are derived by training deep learning models on large amounts of unannotated data with language model objectives, such as BERT Devlin et al. (2018) and ELMo Peters et al. (2018). Kulmizev et al. (2019) use the deep contextualised word representations of BERT and ELMo as input to their parsing models. They show that contextualised word embeddings give information about global sentence structure, and that transition-based models benefit from this more than graph-based models.

Kondratyuk and Straka (2019) propose a multilingual multi-task architecture to predict universal part-of-speech tags, morphological features, lemmas, and dependency trees for Universal Dependencies treebanks by applying pre-trained multilingual BERT as the shared encoder of the sequence. They use the graph-based biaffine attention parser of Dozat and Manning (2016); Dozat et al. (2017) to find the predicted dependency tree.

Ma et al. (2018) propose a dependency parsing model named Stack Pointer Network, which first encodes the whole sentence, then finds the dependency relation for the element on the top of the stack with an attention-based mechanism. We exclude their results from the transition-based models in Table 2 since the time complexity of decoding in transition-based models must be linear in the length of a sentence, whereas their decoding algorithm is quadratic in the sentence length.

7 Conclusion

We propose a graph-to-graph deep learning architecture which uses its self-attention mechanism to input embeddings of graph relations and an attention-like mechanism to predict new graph relations, and we demonstrate the effectiveness of this architecture on the transition-based dependency parsing task. This proposed Graph2Graph Transformer model can accept arbitrary graphs as input, and can predict arbitrary graphs over the same set of vertices. For transition-based dependency parsing, the input graph is the partial dependency tree specified by the previous parser decisions, and the output graph is predicted one dependency at a time with each parser decision.

The proposed model of transition-based dependency parsing is novel in several respects. We first introduce the use of the Transformer architecture to encode the stack, buffer and delete list of the parser state, and to predict dependency relations between words from the resulting token embeddings of those words. We then add mechanisms for encoding the history of parser actions and compositional embeddings of the constructed phrases. Finally, we add the input of dependency relation embeddings into the self-attention mechanism of Transformer, to get our proposed Graph2Graph Transformer model of transition-based dependency parsing. Despite the competitive performance of the extended Transformer model, adding these graph inputs to self-attention results in significant improvement. Similarly, removing the attention-like graph outputs and predicting parser actions from the START token results in a significant decrease in accuracy.

Despite the differences in input representation between our versions of the Transformer model and the version used to train the BERT model, we find that initialising our models with pretrained BERT parameters greatly improves parsing performance. Our full model with BERT initialisation and 7 self-attention layers reaches state-of-the-art accuracies (95.64% UAS and 93.81% LAS) on WSJ Penn Treebank Stanford dependencies, and significantly outperforms previous transition-based models. Further analysis shows the benefits of the Graph2Graph Transformer model on long-range dependencies, dependencies higher in the tree and longer sentences, illustrating how inputting structural information to the self-attention mechanism improves decisions which require a larger, more global view of the sentence.

Finally, we believe that our Graph2Graph Transformer model can be easily applied to other NLP tasks which can be formulated as mappings between graphs, such as semantic parsing tasks, which we hope to demonstrate in future work.

8 Acknowledgement

We are grateful to the Swiss NSF, grant CRSII5_180320, for funding this work.


Appendix A Hyper-parameters

Component Specification
Optimizer BertAdam
Learning rate 1e-5
Adam Betas ($\beta_1$, $\beta_2$) (0.9, 0.999)
Adam Epsilon 1e-6
Weight Decay 0.01
Max-Grad-Norm 1
Warm-up 0.0052
No. Layers ($n$) 6
No. Heads 12
Embedding size 768
Max Position Embedding 512
Classifiers MLP
No. Layers 2
Hidden size 200
Drop-out 0.05
History Model LSTM
No. Layers 2
Hidden Size 100
Comp. Model MLP
No. Layers 2
Hidden size 768
Epochs 12
Replace-with-unk 5%

Table 1: Hyper-parameters for training DepTr and G2G models

Appendix B Error-Analysis

B.1 Dependency Length

Model ROOT 1 2 3 4 5 6 7 >7
BERT G2G 97.19 94.99 94.56 93.05 90.32 86.80 85.61 85.10 86.95
BERT DepTr+CH 95.86 94.78 93.87 91.44 87.56 83.21 81.87 80.83 83.15
BERT DepTr+H 91.18 94.13 91.58 88.29 82.38 75.98 74.25 71.94 74.84
BERT DepTr 87.38 93.33 90.15 86.50 80.13 74.48 71.54 68.10 71.20
Table 2: F-Score vs dependency relation length
Model ROOT 1 2 3 4 5 6 7 >7
BERT G2G 2416 24217 11395 5930 3138 1862 1288 946 5492
BERT DepTr+CH 2416 24202 11356 5889 3140 1855 1264 942 5620
BERT DepTr+H 2416 24234 11369 5871 3089 1857 1283 925 5640
BERT DepTr 2416 24192 11340 5851 3112 1841 1302 960 5670
Total Gold 2416 24152 11352 5922 3153 1873 1284 946 5586
Table 3: Size of each bin based on dependency length

B.2 Distance to Root

Model 1 2 3 4 5 6 7 >7
BERT G2G 97.19 93.97 92.98 91.94 91.82 92.84 93.27 93.65
BERT DepTr+CH 95.86 91.66 91.26 90.96 91.68 92.36 91.92 93.74
BERT DepTr+H 91.18 87.56 88.28 88.91 89.34 90.11 90.35 90.73
BERT DepTr 87.38 85.24 87.57 87.31 87.57 88.25 89.42 89.33
Table 4: F-Score vs distance to root
Model 1 2 3 4 5 6 7 >7
BERT G2G 2416 12885 12352 9991 7196 4940 3074 3830
BERT DepTr+CH 2416 12889 12423 10114 7163 4928 2996 3755
BERT DepTr+H 2416 12787 12469 10162 7108 4932 3043 3767
BERT DepTr 2416 12704 12631 10247 7103 4894 3020 3669
Total Gold 2416 12941 12378 10081 7238 4893 3022 3715

Table 5: Size of each bin based on distance to root

B.3 Sentence Length

Model 1-9 10-19 20-29 30-39 40-49 ≥50
BERT G2G 95.07 94.25 93.14 93.07 92.15 89.43
BERT DepTr+CH 94.48 93.19 91.94 91.38 90.84 89.39
BERT DepTr+H 93.38 90.95 88.96 88.29 88.30 82.80
BERT DepTr 92.42 89.06 87.59 86.41 86.17 82.43
Table 6: LAS vs. sentence length
Bin 1-9 10-19 20-29 30-39 40-49 ≥50
Total 1359 10873 19314 15719 7006 2413
Table 7: Size of each bin based on sentence length