1 Introduction
In recent years, there has been a large amount of research on applying self-attention models to many NLP tasks. Transformer
Vaswani et al. (2017) is the most common architecture, which can capture long-range dependencies by using a self-attention mechanism over a set of vectors. To encode the sequential structure of sentences, typically absolute position embeddings are input to each vector in the set, but recently a mechanism has been proposed for inputting relative positions
Shaw et al. (2018). For each pair of vectors, an embedding for their relative position is input to the self-attention function. This mechanism can be generalised to input arbitrary graphs of relations. We propose a version of the Transformer architecture which combines this mechanism for conditioning on graphs with an attention-like mechanism for predicting graphs and demonstrate its effectiveness on syntactic dependency parsing. We call this architecture Graph2Graph Transformer.
Our proposed Graph2Graph Transformer parser is a transition-based dependency parser. At each step, the model predicts the next parsing decision by conditioning on the sequence of previous parsing decisions and the partial parse structure which those decisions specify. We input the previously specified dependency relations into our Transformer model via the self-attention mechanism. We then predict the next dependency relation by conditioning on the two vectors for the words involved in the relation, analogously to the attention mechanism. To better model the sequence of parser decisions, we also combine the Transformer model with an LSTM model of the parse history and a composition model of the partial parse. Even without the proposed graph inputs, this novel Transformer model of transition-based dependency parsing achieves good performance, but we still get substantial improvements by adding the graph inputs.
We also demonstrate that, despite the modified input mechanisms, this Graph2Graph Transformer architecture can be effectively initialised with standard pre-trained Transformer models. Initialising the parser with pre-trained BERT Devlin et al. (2018) parameters leads to large improvements for all models, and an even larger increase in performance when we add graph inputs. The resulting model significantly improves over the state of the art in transition-based dependency parsing.
This success demonstrates the effectiveness of Graph2Graph Transformers for conditioning on and predicting graph edges. This architecture can be easily applied to other NLP tasks that have any graph as the input and need to predict a graph over the same set of nodes as output.
Our contributions are:
-
We propose a Graph2Graph Transformer architecture for conditioning on and predicting structures.
-
We propose a novel Transformer model of transition-based dependency parsing.
-
We successfully integrate the proposed model with a pre-trained BERT initialisation, achieving state-of-the-art results for transition-based dependency parsing.
2 Transition-based Dependency Parsing
A dependency parser analyses the grammatical structure of a sentence, establishing relationships between “head” words and their syntactic “dependents”. Dependency parses are used in a wide range of natural language processing (NLP) applications, such as machine translation
Currey and Heafield (2019); Vashishth et al. (2018); Ding and Tao (2019); Bastings et al. (2017), information extraction Nguyen et al. (2009); Angeli et al. (2015); Peng et al. (2017); Tai et al. (2015) and low-resource language processing McDonald et al. (2013); Ma and Xia (2014).
Dependency parsing has been dominated by two approaches, transition-based models and graph-based models McDonald and Nivre (2007, 2011).
As with structured prediction in general, the challenge is to model the constraints and correlations between different decisions about an arbitrarily large structure without suffering from the exponential size of the structured output space. Graph-based models assume that if we can model correlations between the input sequence and individual decisions about the output structure really well, and if we can model exactly the discrete constraints between these decisions, then we can assume that these output decisions are otherwise statistically independent. This assumption allows exact dynamic programming solutions to the decoding problem (finding the best structure), given estimates for the individual decisions.
In contrast, transition-based models Nivre (2008) allow arbitrary correlations between output decisions to be modelled by making the decisions one at a time, each conditioned on all the previous decisions. Instead of dynamic programming, beam search is usually used to search the space of output structures, often even with a one-best search. This makes transition-based parsers, in general, faster than graph-based parsers, but they tend to suffer from both search errors and the difficulty of modelling the correlations between decisions.
Because we are investigating an architecture for both conditioning on and predicting structures, in this work we focus on transition-based models. At each step of the parse, the model takes as input both the input sentence and the sequence of previous output decisions, and then predicts the next decision about the output structure. While it is possible to model both these inputs as sequences and apply a sequence-to-sequence model, this does not work as well as explicitly modelling the partial structure specified by the previous decisions. We follow previous work Yamada and Matsumoto (2003); Nivre (2003, 2004) in representing the input sentence and the previous parse in the state of an incremental parser, including a buffer of words waiting to be processed, and a stack of words which are partially processed. For our proposed model, we add a third part to the parser state which is a list of deleted words which have finished being processed. In addition, parsing models usually assume that the parser state includes an explicit representation of the sequence of previous decisions, called parser actions, and of the graph of dependency relations which these decisions specify.
The challenge in transition-based dependency parsing is finding an encoding of this parser state which can be used in the transition classifier to predict the next parser action. This challenge has been addressed with several alternative approaches, such as feature engineering
Zhang and Nivre (2011); Ballesteros and Nivre (2016); Chen et al. (2014); Ballesteros and Bohnet (2014), and inducing representations with neural networks Henderson (2003); Titov and Henderson (2010); Dyer et al. (2015); Weiss et al. (2015); Andor et al. (2016).
Our transition-based parser uses 'arc-standard' parsing sequences Nivre (2004), which make parsing decisions in bottom-up order. The main data structures for representing the state of an arc-standard parser are a buffer of words and a stack of partially constructed syntactic sub-trees. For input to the Transformer, we represent the parser state as a partition of the words of the sentence into a buffer $B$, a stack $S$ and a delete list $D$, plus a directed graph $G$ of labelled dependency relations between these words. The graph includes all the dependency relations which have been specified by the previous parser decisions, so it represents the partial parse structure constructed so far. The deleted words are those which have been removed from the stack, after having both their children and parents specified in $G$. We will describe how these components are used in Section 3.2.
At the initialisation step, the stack contains just the ROOT symbol, the buffer includes the words of the input sentence in order, and the delete list is empty. Parsing is finished when the buffer becomes empty, the stack contains only the ROOT symbol, and the delete list contains all tokens. At each step, a parser action is chosen and the parser state is modified. For an arc-standard parser, the parser actions are as follows, where $s_1$ and $s_2$ are the top and second elements on the stack, respectively, and $b_1$ is the front element of the buffer (a short sketch of these transitions is given after the list):
- LEFT-ARC($l$): Add an arc $s_1 \rightarrow s_2$ with label $l$ to $G$, remove $s_2$ from the stack, and insert it in the delete list. Precondition: stack length must be greater than 2.
- RIGHT-ARC($l$): Add an arc $s_2 \rightarrow s_1$ with label $l$ to $G$, remove $s_1$ from the stack, and insert it in the delete list. Precondition: stack length must be greater than 2.
- SHIFT: Move $b_1$ from the buffer to the top of the stack. Precondition: buffer length must be greater than 1.
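The following is a minimal sketch of this transition system over a simple list-based parser state; the class and method names (ParserState, apply_action) are illustrative, not the authors' implementation.

```python
# Minimal sketch of the arc-standard transition system described above.
# Names and data structures are illustrative; preconditions follow the text.

class ParserState:
    def __init__(self, words):
        self.stack = ["ROOT"]        # bottom-to-top
        self.buffer = list(words)    # buffer[0] is the front element b1
        self.deleted = []            # words whose relations are all specified
        self.arcs = []               # (head, label, dependent) triples

    def apply_action(self, action, label=None):
        if action == "SHIFT":
            assert len(self.buffer) > 1          # precondition as stated above
            self.stack.append(self.buffer.pop(0))
        elif action == "LEFT-ARC":               # s1 becomes the head of s2
            assert len(self.stack) > 2
            s1, s2 = self.stack[-1], self.stack[-2]
            self.arcs.append((s1, label, s2))
            self.deleted.append(self.stack.pop(-2))
        elif action == "RIGHT-ARC":              # s2 becomes the head of s1
            assert len(self.stack) > 2
            s1, s2 = self.stack[-1], self.stack[-2]
            self.arcs.append((s2, label, s1))
            self.deleted.append(self.stack.pop(-1))

    def is_terminal(self):
        return len(self.buffer) == 0 and self.stack == ["ROOT"]
```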
3 The Parsing Model

The proposed parsing model learns to embed the sequence of previous parser actions and the resulting parser state, and then to predict the next parser action from that embedding. It is illustrated in Figure 1.
Because our proposed deep learning architecture is based on Transformer, we start by proposing a novel Transformer model of transition-based dependency parsing. This model incorporates components which have proved successful in previous work, but uses a Transformer to compute context-dependent representations of words, which we will refer to as token embeddings.
To support a general mechanism for graph output, the token embeddings of only the top two words on the stack are used by the transition classifier to predict the next parser action. It is these two words which may have a dependency relation specified by the parser action. By making the graph prediction output a function of the embeddings of the two words involved in the relation, this graph output method is similar to an attention mechanism, where attention weights are also a function of the embeddings of the two tokens involved in the attention relationship. We hypothesise that this synergy between the representations expected by the attention mechanism and those expected by the output mechanism will improve performance.
To support a general mechanism for graph input, we add input embeddings specifying the graph of previously chosen dependency relations to the self-attention mechanism. For every pair of words, the attention functions receive embeddings which specify the dependency relation, if any, between these two words. As with the graph outputs, we hypothesise that inputting graph relation embeddings into the mechanism for finding attention relationships will improve performance.
Finally, we show that these modifications to the input of the Transformer architecture do not prevent the effective use of initialisation with BERT, which has been pre-trained without them. In the rest of this section we describe the architecture and parsing model in more detail.
3.1 Input Embeddings
The Transformer architecture takes a sequence of input tokens, converts them into a sequence of input embedding vectors, and then produces a new sequence of context-dependent token embeddings. For our model, the sequence of input tokens represents the current parser state, as illustrated in Figure 1.
The input tokens include the words of the sentence with their associated part-of-speech (PoS) tags. Each of these words can appear in the stack or buffer of the parser state, or otherwise is included in a list of words which have been deleted from the stack because all their dependency relations have been specified. In addition, there is the ROOT symbol, for the root of the dependency tree, which is always on the bottom of the stack. Inspired by the input representation of BERT Devlin et al. (2018), we also use two special symbols, START and SEP, which indicate the different parts of the parser state.
The sequence of input tokens is illustrated at the top of Figure 2. It starts with the START symbol, then includes the tokens on the stack from bottom to top. Then it has a SEP symbol, followed by the tokens on the buffer from front to back, so that they are in the same order in which they appeared in the sentence. If the model does not use graph inputs to the attention mechanism, then this is a sufficient representation of the parser state. Otherwise, the input sequence includes another SEP symbol followed by the tokens in the delete list, ordered according to their order in the sentence. Adding the deleted words allows the input of the dependency relations which involve those words.
As part of the input sequence, we also include a specification of the dependency relation, if any, specified in the immediately preceding parser action, which always has the top of the stack as its head. This will be discussed when we discuss the composition model in Section 3.1.2. In addition, we include input information specifying the label of the parent dependency relation for each token, which is only known for the words in the delete list.
Given this input sequence, the model computes a sequence of vectors which are input to the Transformer network. As depicted in the lower section of Figure 2, each of these vectors is the sum of several embeddings, which are defined in the remainder of this subsection.
3.1.1 Input Token Embeddings
The words and PoS tags of the sentence each have associated embedding vectors:
(1) $\text{Emb}(\cdot) : (W \cup T) \rightarrow \mathbb{R}^{d}$
where $\text{Emb}$ is the embedding mapping from the set of training words ($W$) plus the set of PoS tags ($T$) to the embedding space ($\mathbb{R}^{d}$, where $d$ is the dimension of the embedding space). For the word embeddings, we use pre-trained word vectors from the BERT model Devlin et al. (2018). The PoS embeddings are trained parameters. These word and PoS embeddings are summed to get the embedding of each token:
(2) $x_i = \text{Emb}(w_i) + \text{Emb}(t_i)$
where $x_i$ is the token embedding of the $i$-th word $w_i$, with PoS tag $t_i$, in the sentence.
3.1.2 Composition Model
Previous work has shown that recursive neural networks are capable of inducing a representation for complex phrases by recursively embedding sub-phrases Socher et al. (2011, 2014, 2013); Hermann and Blunsom (2013). Dyer et al. (2015) showed that this is an effective technique for embedding the partial parse subtrees specified by the parse history in transition-based dependency parsing. Since a word in a dependency tree can have a variable number of dependents, they combined the dependency relations incrementally as they are specified by the parser, using a non-linear mapping function to compute the new embeddings.
We extend this idea by using a feed-forward neural network with a non-linear activation function and skip connections. For every token in position $i$ on the stack, after making decision $a^t$, the composition model computes a vector $c^t_i$ which is added to the input embedding for that token:
(3) $c^t_i = \text{FF}(c^{t-1}_i, \delta^t(i)) + c^{t-1}_i$
where the function $\text{FF}$ is a one-layer feed-forward neural network, and $\delta^t(i)$ represents any new dependency relation with head $i$ specified by the decision $a^t$ at step $t$. In arc-standard parsing, the only word which might have received a new dependent by the previous decision is the word on the top of the stack, $s_1$. This gives us the following definition of $\delta^t(i)$:
(4) $\delta^t(i) = \begin{cases} (s^t_2, r_l) & \text{if } a^t = \text{LEFT-ARC}(l) \\ (s^t_1, r_l) & \text{if } a^t = \text{RIGHT-ARC}(l) \\ ([\text{NULL}], [\text{L-NULL}]) & \text{otherwise} \end{cases}$
where $s^t_1$ and $s^t_2$ are the embeddings of the top two elements of the stack at time step $t$, and $b^t_1$ is the initial token embedding of the word on the front of the buffer at time $t$. $r_l$ is the label embedding of the specified relation, including its direction. For all words on the stack which have not received a new dependent, the composition is computed anyway, but with a [NULL] dependent and [L-NULL] label. (Preliminary experiments indicated that not updating the composition embedding for these cases resulted in worse performance.)
At $t=0$, for all tokens $i$, $c^0_i$ is set to the initial token embedding $x_i$. The model then computes Equation 3 iteratively at each step $t$ for each token on the stack at that step. For all tokens not on the stack, their vector $c^t_i$ is left unchanged from the previous step’s vector $c^{t-1}_i$, regardless of what position that token occupied in the previous parser state. This means that all tokens on the buffer retain their initial vector $x_i$, and all tokens in the delete list retain the composition vector they had when they were popped from the stack.
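A minimal sketch of the composition update of Equations 3 and 4 is shown below, assuming a one-layer feed-forward network over the concatenation of the head vector, the dependent vector and the label embedding; the ReLU activation and all names (Composition, update) are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

# Sketch of the composition update in Equations 3-4 (illustrative, not the
# authors' exact implementation; activation choice is an assumption).
class Composition(nn.Module):
    def __init__(self, d, n_labels):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels + 1, d)   # index 0 = [L-NULL]
        self.null_dep = nn.Parameter(torch.zeros(d))     # [NULL] dependent vector
        self.ff = nn.Linear(3 * d, d)                    # one-layer feed-forward

    def update(self, c_prev_head, c_prev_dep=None, label_id=None):
        """Compute c_i^t for a stack token, given its possible new dependent."""
        dep = c_prev_dep if c_prev_dep is not None else self.null_dep
        lab = self.label_emb(torch.tensor(label_id if label_id is not None else 0))
        h = torch.relu(self.ff(torch.cat([c_prev_head, dep, lab], dim=-1)))
        return h + c_prev_head                           # skip connection (Eq. 3)
```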
There is a skip connection in Equation 3 to address the vanishing gradient problem. Also, preliminary experiments showed that without this skip connection to bias the composition model towards the initial token embeddings $x_i$, integrating pre-trained BERT Devlin et al. (2018) parameters into the model (discussed in Section 3.5) did not work.
3.1.3 Parser State Structure Embeddings
To distinguish the different positions and roles of words in the parser state, we add position embeddings and segment embeddings to the above token embeddings. These embeddings are not included in the input to the composition model (which uses the token embeddings $x_i$), but they are included in the input to the Transformer’s self-attention layers.
Position Embedding: We initialise with the pre-trained position embeddings of BERT Devlin et al. (2018), because the buffer and delete parts of the parser state have the same word order as the input sentence. The position embeddings have the same dimension ($d$) as the output of the composition model, so we can sum them together. We further fine-tune the position embedding parameters during training.
Segment Embedding: Since the input sequence contains stack, buffer and deleted parts (if we have graph input), the model should make a distinction between tokens which occur in these different parts. To make this distinction, there is one embedding for each of these segments of the parser state. The dimension of these segment embeddings is the same as that of the position embeddings.
3.1.4 Total Input Embeddings
Finally, we sum the outputs of the composition model, the segment embeddings and the positional embeddings, and consider them as the input embeddings at step $t$ for the self-attention layers of the Transformer model:
(5) $x^t_i = c^t_i + P_i + S_i$
where $P_i$ and $S_i$ are the position and segment embeddings of token $i$.
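A minimal sketch of Equation 5 follows, assuming segment ids 0/1/2 for the stack, buffer and delete segments; the embedding sizes follow Appendix A, and the function name is illustrative.

```python
import torch
import torch.nn as nn

# Sketch of Equation 5: the Transformer input for each token is the sum of its
# composition vector, a position embedding and a segment embedding.
d, max_pos, n_segments = 768, 512, 3
pos_emb = nn.Embedding(max_pos, d)       # initialised from BERT in the paper
seg_emb = nn.Embedding(n_segments, d)    # 0 = stack, 1 = buffer, 2 = delete

def input_embedding(c_t, positions, segment_ids):
    # c_t: (seq_len, d) composition outputs; positions, segment_ids: (seq_len,)
    return c_t + pos_emb(positions) + seg_emb(segment_ids)

x = input_embedding(torch.zeros(5, d), torch.arange(5), torch.tensor([0, 0, 1, 1, 2]))
```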
3.2 Graph2Graph Transformer
Our proposed model for mapping the input sequence of embeddings described in Section 3.1 to a vector which can be used by the transition classifier described below in Section 3.4 is a form of Transformer Vaswani et al. (2017). Transformers are multi-layer self-attention-based models for encoding and generating sequences. They have been very successful in a wide variety of NLP tasks. Here we propose a version of Transformer which is designed for both conditioning on graphs and predicting graphs, which we call Graph2Graph Transformer, and show how it can be applied to transition-based dependency parsing.
Inspired by the relative position embeddings of Shaw et al. (2018), we use the attention mechanism of Transformer to input arbitrary binary graph relations. By inputting the embedding for a relation into the attention computations for the related words, the model can more easily learn to pass information between graph-local words, which gives the model an appropriate linguistic bias, without imposing hard constraints.
Given that the attention function is being used to input graph relations, it is natural to assume that graph relations can also be predicted with an attention-like function. We do not go so far as to restrict the form of the prediction function, but we do restrict the vectors used to predict graph relations to only the two for the words involved in the relation.
3.2.1 Baseline Transformer
Transformer Vaswani et al. (2017) is a sequence-to-sequence model, of which we only use the encoder component. A Transformer encoder computes an output embedding for each token in the input sequence through stacked layers of a self-attention mechanism. Each layer contains two sub-layers: a multi-head self-attention layer, and a position-wise feed-forward layer. In the self-attention layer, the value vectors output by multiple attention heads are concatenated together and projected to build the output vector of this layer.
Each attention head has its own parameters and computes its own value vectors. Given $(x_1, \dots, x_n)$ as the input sequence, the attention mechanism finds a sequence of value vectors $(z_1, \dots, z_n)$. Each output element $z_i$ is a linear transformation of the input elements:
(6) $z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V)$
where $W^V$ is called the value matrix and is learned during training, and $d_z$ is the attention head size. $x_j$ is the input element, and $z_i$ is the output element at position $i$. $\alpha_{ij}$ is the attention weight, which is calculated by a softmax function:
(7) $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$
where $e_{ij}$ is calculated by:
(8) $e_{ij} = \frac{(x_i W^Q)(x_j W^K)^{T}}{\sqrt{d_z}}$
where $W^Q$ and $W^K$ are the query and key matrices respectively and contain learned parameters.
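As a concrete reference for Equations 6-8, the following sketch computes one self-attention head with plain tensor operations; W_q, W_k and W_v stand for the query, key and value matrices of a single head.

```python
import torch

# Sketch of one attention head (Equations 6-8).
# x: (n, d_model); W_q, W_k, W_v: (d_model, d_z) learned projections.
def attention_head(x, W_q, W_k, W_v):
    d_z = W_k.size(1)
    q, k, v = x @ W_q, x @ W_k, x @ W_v          # (n, d_z) each
    e = q @ k.transpose(0, 1) / d_z ** 0.5       # Eq. 8: scaled dot products
    alpha = torch.softmax(e, dim=-1)             # Eq. 7: attention weights
    return alpha @ v                             # Eq. 6: weighted sum of values
```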
3.2.2 Graph Inputs
Inspired by Shaw et al. (2018), we extend the architecture of the Transformer to accept the dependency tree as an additional input, using the same formulas as they use for inputting relative position embeddings. In G2G Transformer, we encode the information of a dependency pair $(i, j)$ by modifying Equation 8:
(9) $e_{ij} = \frac{(x_i W^Q)(x_j W^K + p_{ij} W^{L_1})^{T}}{\sqrt{d_z}}$
where $p_{ij}$ is a one-hot vector which specifies the type of dependency relation between $i$ and $j$ (see Table 1), and $W^{L_1}$ is a matrix of learned parameters. We also modify Equation 6 to transmit information of the partially constructed graph to the output of the attention layer:
(10) $z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V + p_{ij} W^{L_2})$
where $W^{L_2}$ is a parameter matrix.
Relation | Assigned dimension
None | 0
$i \rightarrow j$ ($i$ is the head of $j$) | 1
$i \leftarrow j$ ($j$ is the head of $i$) | 2
Table 1: Encoding of the dependency relation between tokens $i$ and $j$ in the one-hot vector $p_{ij}$.
As shown in Table 1, we have chosen to input unlabelled dependency relations, explicitly representing only the direction of the dependency. This choice was made mostly to simplify our extension of Transformer, as well as to limit the computational cost of this extension. In order to input the labels, we add dependency label embeddings to the token embeddings of the dependent word, as discussed in Section 3.1. Since all words which have their heads specified are in the deleted part of the input sequence, this label embedding is added to all and only the tokens in the delete list.
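A minimal sketch of the graph-conditioned attention head of Equations 9 and 10 follows, assuming a relation matrix rel that holds the Table 1 codes for every token pair; the tensor names (W_l1, W_l2) are illustrative stand-ins for the learned graph-embedding matrices.

```python
import torch

# Sketch of the graph-input attention head (Equations 9-10).
# rel: (n, n) LongTensor with values in {0, 1, 2} as in Table 1.
# W_l1, W_l2: (3, d_z) tensors; indexing them by rel is equivalent to
# multiplying the one-hot relation vectors p_ij by a parameter matrix.
def graph_attention_head(x, rel, W_q, W_k, W_v, W_l1, W_l2):
    d_z = W_k.size(1)
    q, k, v = x @ W_q, x @ W_k, x @ W_v                  # (n, d_z) each
    r1, r2 = W_l1[rel], W_l2[rel]                        # (n, n, d_z) relation terms
    e = (q @ k.transpose(0, 1)
         + torch.einsum('id,ijd->ij', q, r1)) / d_z ** 0.5   # Eq. 9
    alpha = torch.softmax(e, dim=-1)
    return alpha @ v + torch.einsum('ij,ijd->id', alpha, r2)  # Eq. 10
```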
3.2.3 Graph Outputs
The output of our Graph2Graph Transformer model is the concatenation of the final layer representations of the top two elements on the stack:
(11) $g^t = [z^t_{s_1} ; z^t_{s_2}]$
where $s_1$ and $s_2$ are the top two elements on the stack, and $z^t_{s_1}$ and $z^t_{s_2}$ are their final-layer representations at step $t$. These two vectors are input directly to the transition classifier, described below in Section 3.4.
3.3 History Model
In addition to the composition model, we input a history embedding of the sequence of previous actions at the current parser state. We use the same LSTM history model as Dyer et al. (2015) (referred to there as a StackLSTM). The sequence input to the LSTM is simply the sequence of parser action types, including LEFT-ARC, RIGHT-ARC and SHIFT, but not including the dependency labels or the identities of the words involved.
The output of the history model after step $t$ is the final output vector of the LSTM after inputting the parser action at $t$. This history vector $h^t$ is passed directly to the transition classifier.
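A minimal sketch of this history model as a plain LSTM over action-type ids is given below, with sizes taken from Appendix A; the StackLSTM variant of Dyer et al. (2015) differs in detail, so this is only an approximation for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the action-history model: an LSTM over parser action types only
# (SHIFT / LEFT-ARC / RIGHT-ARC); sizes follow Appendix A, names are illustrative.
action_emb = nn.Embedding(3, 100)
history_lstm = nn.LSTM(input_size=100, hidden_size=100, num_layers=2, batch_first=True)

def history_vector(action_ids):                  # action_ids: (t,) LongTensor
    out, _ = history_lstm(action_emb(action_ids).unsqueeze(0))
    return out[0, -1]                            # h^t: output after action a^t
```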
3.4 Transition Classifier
The Transformer’s output vector, described in Section 3.2.3, is concatenated with the history vector, described in Section 3.3, to form the input to the transition classifier. The transition classifier takes this representation of the parser state and parse history and predicts the next parser action, choosing from the legal next actions as defined in Section 2.
At each step , the model first chooses the type of action, namely whether a dependency should be specified and in what direction, which we call the Exist classifier. The Exist classifier outputs scores for three alternatives:
- No Relation: Do SHIFT
- Right Relation: Do RIGHT-ARC
- Left Relation: Do LEFT-ARC
In the latter two cases, a dependency relation is specified between the top two tokens on the stack, in the specified direction. For these cases a second classifier, the Relation classifier, predicts the label of the relation, conditioned on the direction. Given the parser state vector $g^t$ and the parse history vector $h^t$, these classifiers compute:
(12) $a^t = \text{Exist}([g^t ; h^t]), \quad l^t = \text{Relation}([g^t ; h^t])$
where $a^t \in \mathbb{R}^{3}$ and $l^t \in \mathbb{R}^{2 n_l}$ ($2 n_l$ is the number of dependency labels times the number of directions). Both classifiers are multi-layer perceptron classifiers with one hidden layer and a non-linear activation function.
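A minimal sketch of the Exist and Relation classifiers of Equation 12 is shown below, assuming a ReLU hidden layer and an illustrative label count; only the hidden size (200) is taken from Appendix A.

```python
import torch
import torch.nn as nn

# Sketch of the two transition classifiers (Equation 12).
def mlp(d_in, d_out, d_hidden=200):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))

d, n_labels = 768, 45                            # n_labels is illustrative
exist_clf = mlp(2 * d + 100, 3)                  # input: [g^t ; h^t], output: 3 action types
relation_clf = mlp(2 * d + 100, 2 * n_labels)    # labels x two directions

def predict(g_t, h_t):
    state = torch.cat([g_t, h_t], dim=-1)
    return exist_clf(state), relation_clf(state)  # scores for a^t and l^t
```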
3.5 Pre-Training with BERT
BERT Devlin et al. (2018) provides deep contextual representations based on a series of Transformers trained on a huge amount of un-annotated data with a language-modelling objective. BERT is trained by the Cloze task Taylor (1953) which enables the model to encode information from both directions. In addition, BERT is trained on the next sentence classification objective. BERT employs a subword vocabulary with WordPiece Wu et al. (2016) which splits a word into subwords.
Initialising a Transformer model with the pre-trained parameters of BERT, and then fine-tuning on the target task, has demonstrated large improvements in many tasks. But unlike the previous work we are aware of which has used BERT pre-training, our version of Transformer has novel inputs which were not present when BERT was trained. These novel inputs are the graph inputs to the attention mechanism, and the composition embeddings. Also, the input sequence has a novel structure, which is only partially similar to the input sentences which BERT was trained on. So it is not clear that BERT pre-training will even work with this novel architecture.
To evaluate whether BERT pre-training works for our proposed architecture, we initialise the weights of the Graph2Graph Transformer model with the first $n$ layers of BERT, where $n$ is the number of layers in our model.
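A minimal sketch of this initialisation using the HuggingFace transformers library is given below; the library choice is an assumption for illustration (the experiments used the original Google BERT checkpoint), and the sketch only shows which pre-trained parameters have counterparts in our model.

```python
# Sketch of initialising the first n self-attention layers from pre-trained BERT.
from transformers import BertModel

n = 6
bert = BertModel.from_pretrained("bert-base-cased")
pretrained_layers = bert.encoder.layer[:n]            # first n Transformer layers
word_embeddings = bert.embeddings.word_embeddings     # pre-trained word vectors
position_embeddings = bert.embeddings.position_embeddings
# These weights initialise the corresponding parameters of the Graph2Graph
# Transformer; the graph-input matrices, segment embeddings and the composition
# model have no BERT counterpart and are trained from scratch.
```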
4 Implementation Details
4.1 Dataset
We train our models on a dependency version of the English Wall Street Journal (WSJ) corpus, which is a part of the Penn Treebank Marcus et al. (1993). We follow the standard split and use sections 2-21 for training, section 22 for evaluation, and section 23 for testing. We also add section 24 to our development set to mitigate over-fitting on section 22. We convert the constituency trees in the corpus to Stanford dependencies De Marneffe et al. (2006) using version 3.3.0 of the converter. For POS tags, we use the Stanford POS tagger Toutanova et al. (2003), which has an accuracy of 97.44%. As in previous work, we exclude punctuation from evaluation.
4.2 Baselines
We compare our models with several baselines and reduced versions of the model, based on unlabelled/labelled attachment scores (UAS/LAS). As strong baselines from previous work, we compare to previous transition-based models Dyer et al. (2015); Weiss et al. (2015); Andor et al. (2016); Ballesteros et al. (2016); Chen et al. (2014).
To demonstrate the usefulness of each part of the proposed G2G Transformer model, we compare the full model (G2G Tr) with four different reduced versions of the model. We define the Dependency Transformer (DepTr) model as the same model described in Section 3 but without the composition and history models, and without graph inputs to attention (i.e. using the baseline attention of Equation 8). This model does, however, include the same graph output mechanism, conditioning on the two tokens on the top of the stack. Then we add the history model (DepTr+H), and both composition and history models (DepTr+CH), to the Dependency Transformer baseline. Finally, we also consider a version of the full model with the graph output mechanism removed (G2CLS Tr), where we predict the next parser action from the START symbol’s token embedding (referred to as CLS in BERT). All five of these models are evaluated both with and without initialisation with the first $n$ layers of a pre-trained BERT model.
4.3 Hyper-parameters and details of implementation
All hyper-parameters are given in Appendix A. The same hyper-parameter optimisation strategy was used for all models. For all models we use 6 self-attention layers, except where specified otherwise.
All our models use one-best (deterministic) decoding, meaning that at each step only the highest scoring parser action is considered for continuation. This was done for simplicity. Beam search could also be used with these models.
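A minimal sketch of this one-best decoding loop is shown below; score_actions and legal_actions are illustrative stand-ins for the trained classifiers of Section 3.4 and the preconditions of Section 2, and state follows the illustrative ParserState sketch given there.

```python
# Sketch of one-best (greedy) decoding: at each step the highest-scoring legal
# action is applied until the terminal parser state is reached.
def greedy_parse(state, score_actions, legal_actions):
    while not state.is_terminal():
        scores = score_actions(state)            # dict: (action, label) -> score
        legal = legal_actions(state)             # actions satisfying the preconditions
        best = max(legal, key=lambda a: scores[a])
        state.apply_action(*best)                # e.g. ("LEFT-ARC", "nsubj")
    return state.arcs
```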
We use the pre-trained base “cased” BERT model, with 12 layers of attention and 12 attention heads (https://github.com/google-research/bert). We extract the weights of the first $n$ layers of BERT (where $n$ is the number of attention layers in our models) and use them to initialise our BERT models. For tokenisation, we average the embeddings of the subword tokens produced by the native BERT tokeniser, so that the model has the desired one token per word.
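A minimal sketch of this subword-to-word averaging follows; word_to_subtokens is an illustrative mapping from each input word to the positions of its WordPiece sub-tokens.

```python
import torch

# Sketch of averaging BERT WordPiece sub-token embeddings back to one vector
# per word, as described above.
def average_subwords(subword_embs, word_to_subtokens):
    # subword_embs: (n_subtokens, d); returns (n_words, d)
    return torch.stack([subword_embs[idx].mean(dim=0) for idx in word_to_subtokens])

# e.g. "playing" split as ["play", "##ing"] at positions [3, 4]:
# word_to_subtokens = [[0], [1, 2], [3, 4], ...]
```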
In the graph input to the attention function, we do not train the rows of the graph embedding matrices ($W^{L_1}$ and $W^{L_2}$) for the case of "No Relation" between the two tokens, leaving them frozen at their random initialisation, for reasons of training efficiency.
5 Results and Discussion
5.1 UAS/LAS Results
In Table 2, we compare all variations of our model and previous transition-based models. Compared to the previous state-of-the-art in transition-based dependency parsing, our complete model (BERT G2G Tr) performs significantly better, at 95.30% UAS and 93.44% LAS. This performance continues to improve, to 95.64% UAS and 93.81% LAS, when we increase the depth of the model from 6 self-attention layers to 7, a full percentage point improvement over previous results.
Comparing the different versions of the proposed parser, the same pattern appears both with and without BERT initialisation. Simply applying Transformer to encode the parser state sequence (DepTr) does not perform well. Adding an embedding of the history sequence of parser actions (DepTr+H) results in a big improvement (17%/13% LAS relative error reduction without/with BERT). Adding explicit modelling of the composition of sub-phrases (DepTr+CH) results in a further big improvement (20%/27% LAS relative error reduction without/with BERT), which, with BERT, reaches accuracies competitive with the state-of-the-art.
Even from this very strong starting point, adding graph inputs to the attention mechanism results in a further large improvement of 1.6%/1.2% absolute and 15%/15% relative LAS error reduction without/with BERT. This improvement makes our BERT G2G Tr model significantly more accurate than the previous state-of-the-art in transition-based dependency parsing.
All these models use the same graph output mechanism, conditioning on the embeddings of the two tokens on the top of the stack. We motivated this choice because it is similar to the way the attention mechanism finds relationships between tokens, and it is these two tokens whose relationship we need to decide. But with transition-based parsing, the next parser action could equally well be predicted from the START token embedding, which in BERT (there called CLS) is used to classify the input as a whole. Using our proposed graph output mechanism (G2G Tr) instead of predicting from the START token embedding (G2CLS Tr), there is again an improvement, particularly with BERT (4%/15% relative LAS error reduction without/with BERT).
Although in general it is not surprising that models pretrained with BERT outperform equivalent models which use no resources other than the parsed training corpus, in this case it is surprising because the input to the Transformer is different from that of BERT, as discussed in Section 3.5. In particular, here we add inputs to the Transformer from the composition model and graph inputs to the attention mechanism, neither of which BERT was trained with. In fact, the LAS relative error reduction from adding BERT initialisation is the highest in the full model (27%), followed closely by the DepTr+CH model with composition inputs (26%). The model which is closest to BERT (DepTr), has a lower LAS relative error reduction from adding BERT initialisation (23%), and adding just the history model outside of the Transformer (DepTr+H) is even lower (19%). This surprising result that BERT initialisation helps a Transformer with graph inputs at least as much as one without it supports the naturalness of inputting graph relations into the attention mechanism. Removing the attention-like graph output (G2CLS Tr) makes BERT initialisation the least helpful (18%). The fact that BERT initialisation helps a Transformer with attention-like graph outputs more than one with CLS outputs supports the naturalness of this output mechanism. These claims are further supported by recent work which shows that the syntactic tree of the sentence is implicitly embedded in the BERT model Hewitt and Manning (2019); Coenen et al. (2019); Goldberg (2019); Kondratyuk and Straka (2019).
All of the above experiments were run with 6 layers of self-attention. We trained an additional model with 7 layers of self-attention (BERT G2G Tr 7-layer) as an indication of whether the model will continue to improve as it is made deeper. This deeper model does perform better, with a 6% LAS relative error reduction, motivating future work on larger Graph2Graph Transformer models.
5.2 Error Analysis
To analyse the errors made by our BERT G2G and BERT DepTr models, we measure their accuracy as a function of dependency length, distance to root and sentence length. (Tables of results and frequencies for the error analysis in Figures 3–5 are in Appendix B.) These results demonstrate that most of the improvement of the G2G model over the DepTr models derives from the hard cases which require a more global view of the sentence.
Figure 3 shows labelled F-scores on dependencies binned by dependency length. The length of a dependency relation $(i, j)$ is measured by the absolute difference of the positions $i$ and $j$. The composition model is crucial to get good accuracies on long dependencies, and the G2G model results in further improvement. The relative stability of results for the BERT G2G model across dependency lengths demonstrates the benefit of adding the partial dependency tree to the self-attention model, which provides a global view of the sentence when the model considers long dependencies. It also shows a larger increase in absolute performance on the harder cases.
Figure 4 shows the labelled F-score for dependencies binned by the distance to the root, computed as the number of dependencies in the path from the dependent to the root node. The BERT G2G model outperforms the other models on nodes which are higher in the dependency tree, again illustrating the benefits of a better global view of the sentence through dependencies input to the self-attention mechanism. Other models recover some of the difference for nodes which are farther from the root, which tend to be leaf nodes and thus require information from a narrower context.

Figure 5 shows labelled attachment scores (LAS) for sentences with different lengths. The BERT G2G model consistently outperforms the other models on both short and long sentences, or performs equally to DepTr+CH on very long sentences. The improvement tends to increase as the sentence length increases, again illustrating better performance on the harder cases.

6 Related Work
Recent work on parsing has used deep contextualised word representations, which are derived by training deep learning models on a large amount of unannotated data with language model objectives, such as BERT Devlin et al. (2018) and ELMo Peters et al. (2018). Kulmizev et al. (2019) use the deep contextualised word representations of BERT and ELMo as input to their parsing models. They show that contextualised word embeddings give information about global sentence structure, and that transition-based models benefit from this more than graph-based models.
Kondratyuk and Straka (2019) propose a multilingual multi-task architecture to predict universal part-of-speech, morphological features, lemmas, and dependency trees for Universal Dependencies treebanks by applying pre-trained multilingual BERT as the shared encoder of the sequence. They use the graph-based biaffine attention parser Dozat and Manning (2016); Dozat et al. (2017) to find the predicted dependency tree.
Ma et al. (2018) propose a dependency parsing model named Stack-Pointer Network, which first encodes the whole sentence and then finds the dependency relation for the element on the top of the stack with an attention-based mechanism. We exclude their results from the transition-based models in Table 2, since the time complexity of decoding in transition-based models must be linear in the length of a sentence, whereas their decoding algorithm is quadratic in the sentence length.
7 Conclusion
We proposed a graph-to-graph deep learning architecture which uses its self-attention mechanism to input embeddings of graph relations and an attention-like mechanism to predict new graph relations, and we demonstrate the effectiveness of this architecture on the transition-based dependency parsing task. This proposed Graph2Graph Transformer model can accept arbitrary graphs as input, and can predict arbitrary graphs over the same set of vertices. For transition-based dependency parsing, the input graph is the partial dependency tree specified by the previous parser decisions, and the output graph is predicted one dependency at a time with each parser decision.
The proposed model of transition-based dependency parsing is novel in several respects. We first introduce the use of the Transformer architecture to encode the stack, buffer and delete list of the parser state, and to predict dependency relations between words from the resulting token embeddings of those words. We then add mechanisms for encoding the history of parser actions and compositional embeddings of the constructed phrases. Finally, we add the input of dependency relation embeddings into the self-attention mechanism of Transformer, to get our proposed Graph2Graph Transformer model of transition-based dependency parsing. Despite the competitive performance of the extended Transformer model, adding these graph inputs to self-attention results in significant improvement. Similarly, removing the attention-like graph outputs and predicting parser actions from the START token results in a significant decrease in accuracy.
Despite the differences in input representation for our versions of the Transformer model and the version used to train the BERT model, we find that initialising our models with pretrained BERT parameters greatly improves parsing performance. Our full model with BERT initialisation and 7 self-attention layers reached state-of-the-art accuracies (95.64% UAS and 93.81% LAS) on WSJ Penn Treebank Stanford dependencies, and significantly outperforms previous transition-based models. Further analysis shows the benefits of the Graph2Graph Transformer model on long-range dependencies, dependencies higher in the tree and longer sentences, illustrating how inputting structural information to the self-attention mechanism improves decisions which require a larger, more global view of the sentence.
Finally, we believe that our Graph2Graph Transformer model can be easily applied to other NLP tasks which can be formulated as mappings between graphs, such as semantic parsing tasks, which we hope to demonstrate in future work.
8 Acknowledgement
We are grateful to the Swiss NSF, grant CRSII5_180320, for funding this work.
References
- Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany. Association for Computational Linguistics.
- Angeli et al. (2015) Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354.
- Ballesteros and Bohnet (2014) Miguel Ballesteros and Bernd Bohnet. 2014. Automatic feature selection for agenda-based dependency parsing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 794–805.
- Ballesteros et al. (2016) Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A Smith. 2016. Training with exploration improves a greedy stack-lstm parser. arXiv preprint arXiv:1603.03793.
- Ballesteros and Nivre (2016) Miguel Ballesteros and Joakim Nivre. 2016. Maltoptimizer: Fast and effective parser optimization. Natural Language Engineering, 22(2):187–213.
- Bastings et al. (2017) Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. 2017. Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675.
- Chen et al. (2014) Wenliang Chen, Yue Zhang, and Min Zhang. 2014. Feature embedding for dependency parsing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 816–826.
- Coenen et al. (2019) Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin Wattenberg. 2019. Visualizing and measuring the geometry of bert. arXiv preprint arXiv:1906.02715.
- Currey and Heafield (2019) Anna Currey and Kenneth Heafield. 2019. Incorporating source syntax into transformer-based neural machine translation. In Proceedings of the Fourth Conference on Machine Translation. Association for Computational Linguistics.
- De Marneffe et al. (2006) Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In LREC, volume 6, pages 449–454.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Ding and Tao (2019) Liang Ding and Dacheng Tao. 2019. Recurrent graph syntax encoder for neural machine translation.
- Dozat and Manning (2016) Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
- Dozat et al. (2017) Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford’s graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30, Vancouver, Canada. Association for Computational Linguistics.
- Dyer et al. (2015) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.
- Goldberg (2019) Yoav Goldberg. 2019. Assessing bert’s syntactic abilities.
- Henderson (2003) James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 103–110.
- Hermann and Blunsom (2013) Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 894–904, Sofia, Bulgaria. Association for Computational Linguistics.
- Hewitt and Manning (2019) John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.
- Kondratyuk and Straka (2019) Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally.
- Kulmizev et al. (2019) Artur Kulmizev, Miryam de Lhoneux, Johannes Gontrum, Elena Fano, and Joakim Nivre. 2019. Deep contextualized word embeddings in transition-based and graph-based dependency parsing – a tale of two parsers revisited.
- Ma et al. (2018) Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Ma and Xia (2014) Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1337–1348.
- Marcus et al. (1993) Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- McDonald and Nivre (2007) Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
- McDonald and Nivre (2011) Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1):197–230.
- McDonald et al. (2013) Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 92–97, Sofia, Bulgaria. Association for Computational Linguistics.
- Nguyen et al. (2009) Truc-Vien T Nguyen, Alessandro Moschitti, and Giuseppe Riccardi. 2009. Convolution kernels on constituent, dependency and sequential structures for relation extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages 1378–1387. Association for Computational Linguistics.
- Nivre (2003) Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the Eighth International Conference on Parsing Technologies, pages 149–160, Nancy, France.
- Nivre (2004) Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain. Association for Computational Linguistics.
- Nivre (2008) Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Comput. Linguist., 34(4):513–553.
- Peng et al. (2017) Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics, 5:101–115.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.
- Socher et al. (2011) Richard Socher, Eric H Huang, Jeffrey Pennin, Christopher D Manning, and Andrew Y Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.
- Socher et al. (2014) Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
- Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
- Taylor (1953) Wilson L Taylor. 1953. “cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
- Titov and Henderson (2010) Ivan Titov and James Henderson. 2010. A latent variable model for generative dependency parsing. In Trends in Parsing Technology, pages 35–55. Springer.
- Toutanova et al. (2003) Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 252–259.
- Vashishth et al. (2018) Shikhar Vashishth, Manik Bhandari, Prateek Yadav, Piyush Rai, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. Incorporating syntactic and semantic information in word embeddings using graph convolutional networks. arXiv preprint arXiv:1809.04283.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.
- Weiss et al. (2015) David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China. Association for Computational Linguistics.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation.
- Yamada and Matsumoto (2003) Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the Eighth International Conference on Parsing Technologies.
- Zhang and Nivre (2011) Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 188–193. Association for Computational Linguistics.
Appendix A Hyper-parameters
Component | Specification
Optimizer | BertAdam
Learning rate | 1e-5
Adam Betas ($\beta_1$, $\beta_2$) | (0.9, 0.999)
Adam Epsilon | 1e-6
Weight Decay | 0.01
Max-Grad-Norm | 1
Warm-up | 0.005 (0.01 for BERT models)
Self-Attention |
No. Layers ($n$) | 6
No. Heads | 12
Embedding size | 768
Max Position Embedding | 512
Classifiers | MLP
No. Layers | 2
Hidden size | 200
Drop-out | 0.05
History Model | LSTM
No. Layers | 2
Hidden Size | 100
Comp. Model | MLP
No. Layers | 2
Hidden size | 768
Epochs | 12
Replace-with-unk | 5% (we sort the training words by their number of occurrences in the training set and replace the last 5% with the [UNK] symbol)
Appendix B Error Analysis
B.1 Dependency Length
Model | ROOT | 1 | 2 | 3 | 4 | 5 | 6 | 7 | >7
BERT G2G | 97.19 | 94.99 | 94.56 | 93.05 | 90.32 | 86.80 | 85.61 | 85.10 | 86.95 |
BERT DepTr+CH | 95.86 | 94.78 | 93.87 | 91.44 | 87.56 | 83.21 | 81.87 | 80.83 | 83.15 |
BERT DepTr+H | 91.18 | 94.13 | 91.58 | 88.29 | 82.38 | 75.98 | 74.25 | 71.94 | 74.84 |
BERT DepTr | 87.38 | 93.33 | 90.15 | 86.50 | 80.13 | 74.48 | 71.54 | 68.10 | 71.20 |
Model | ROOT | 1 | 2 | 3 | 4 | 5 | 6 | 7 | >7
BERT G2G | 2416 | 24217 | 11395 | 5930 | 3138 | 1862 | 1288 | 946 | 5492 |
BERT DepTr+CH | 2416 | 24202 | 11356 | 5889 | 3140 | 1855 | 1264 | 942 | 5620
BERT DepTr+H | 2416 | 24234 | 11369 | 5871 | 3089 | 1857 | 1283 | 925 | 5640 |
BERT DepTr | 2416 | 24192 | 11340 | 5851 | 3112 | 1841 | 1302 | 960 | 5670 |
Total Gold | 2416 | 24152 | 11352 | 5922 | 3153 | 1873 | 1284 | 946 | 5586 |
B.2 Distance to Root
Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | >7
BERT G2G | 97.19 | 93.97 | 92.98 | 91.94 | 91.82 | 92.84 | 93.27 | 93.65 |
BERT DepTr+CH | 95.86 | 91.66 | 91.26 | 90.96 | 91.68 | 92.36 | 91.92 | 93.74 |
BERT DepTr+H | 91.18 | 87.56 | 88.28 | 88.91 | 89.34 | 90.11 | 90.35 | 90.73 |
BERT DepTr | 87.38 | 85.24 | 87.57 | 87.31 | 87.57 | 88.25 | 89.42 | 89.33 |
Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | >7
BERT G2G | 2416 | 12885 | 12352 | 9991 | 7196 | 4940 | 3074 | 3830 |
BERT DepTr+CH | 2416 | 12889 | 12423 | 10114 | 7163 | 4928 | 2996 | 3755
BERT DepTr+H | 2416 | 12787 | 12469 | 10162 | 7108 | 4932 | 3043 | 3767 |
BERT DepTr | 2416 | 12704 | 12631 | 10247 | 7103 | 4894 | 3020 | 3669 |
Total Gold | 2416 | 12941 | 12378 | 10081 | 7238 | 4893 | 3022 | 3715 |
B.3 Sentence Length
Model | 1-9 | 10-19 | 20-29 | 30-39 | 40-49 | ≥50
BERT G2G | 95.07 | 94.25 | 93.14 | 93.07 | 92.15 | 89.43 |
BERT DepTr+CH | 94.48 | 93.19 | 91.94 | 91.38 | 90.84 | 89.39 |
BERT DepTr+H | 93.38 | 90.95 | 88.96 | 88.29 | 88.30 | 82.80 |
BERT DepTr | 92.42 | 89.06 | 87.59 | 86.41 | 86.17 | 82.43 |
 | 1-9 | 10-19 | 20-29 | 30-39 | 40-49 | ≥50
Total | 1359 | 10873 | 19314 | 15719 | 7006 | 2413 |