Dependency Parsing (DP), which aims to extract the structural information beneath sentences, is fundamental to understanding natural languages. It benefits a wide range of Natural Language Processing (NLP) applications, such as machine translation bugliarello2020mt, question answering teney2017qa, and information retrieval chandurkar2017ir. As shown in Figure 1, dependency parsing predicts, for each word, the existence of a dependency and its relation with other words according to a pre-defined schema. Such dependency structures are represented as trees or directed acyclic graphs, which can be converted into a flattened sequence, as presented in this paper.
The field of dependency parsing has developed three main paradigms: graph-based methods dozat2017biaffine, transition-based methods ma2018stackptr, and sequence-based methods li2018seq2seq. Alongside these methods, dependency parsing now shows three trends. 1) New Schema. Recent works extend dependency parsing from syntactic DP (SyDP) to semantic DP (SeDP) with many new schemata stephan2015sdp15; che2016semeval. 2) Cross-Domain. Corpora from different domains facilitate research on cross-domain dependency parsing peng2019nlpcc; li2019codt. 3) PLM. With the development of pre-trained language models (PLMs), researchers have enabled PLMs on the dependency parsing task and achieved new state-of-the-art (SOTA) results fernandez2020transition; gan21mrc. However, there are still two main issues.
Lacking Universality. Although there are many successful parsers, most of them are schema-specific and have limitations, e.g., sequence-based parsers vacareanu2020pat are only suitable for SyDP. Thus, these methods require re-training before being adapted to another schema.
Relying on Extra Decoder. Previous parsers usually produce parsing results with an extra decoding module, such as a biaffine network for score calculation dozat2017biaffine or a neural transducer for decision making zhang2019broad. These modules cannot be pre-trained and learn dependency relations only from the training corpora. Thus, only part of these models generalizes to sentences from different domains.
To address these issues, we propose schema-free Dependency Parsing via Sequence Generation (DPSG). The core idea is to find a unified unambiguous serialized representation for both syntactic and semantic dependency structures. Then an encoder-decoder PLM is learned to generate the parsing results following the serialized representation, without the need for an additional decoder. That is, our parser can achieve its function using one original PLM (without any modification), and thus is entirely pre-trained. Furthermore, by adding a prefix to the serialized representation, DPSG provides a principled way to pack different schemata into a single model.
In particular, DPSG consists of three key components. The Serializer is responsible for converting between the dependency structure and the serialized representation. The Positional Prompt pattern provides supplementary word position information in the input sentence to facilitate sequence generation. The encoder-decoder PLM with added special tokens performs the parsing task via sequence generation. The main advantages of DPSG compared with previous paradigms are summarized in Table 1. DPSG accomplishes DP for different schemata, unifies multiple schemata without training multiple models, and transfers the overall model to different domains.
We conduct experiments on popular DP benchmarks: PTB, CODT, SDP15, and SemEval16. DPSG performs generally well across different DP schemata. It significantly outperforms the baselines on the cross-domain (CODT) and Chinese SeDP (SemEval16) corpora, and achieves comparable results on the other two benchmarks, which further shows that DPSG has the potential to become a new paradigm for dependency parsing.
We formally introduce the dependency parsing task and the encoder-decoder PLM, and the corresponding notations. This paper uses bold lower case letters, blackboard letters, and bold upper case letters to denote sequences, sets, and functions, respectively. Elements in the sequence and the sets are enclosed in parentheses and braces, respectively.
2.1 Dependency Parsing
A pre-defined dependency schema is a set of relations R. Dependency parsing takes a sentence s = (w_1, ..., w_n) as input, where w_i is the i-th word in the sentence. It outputs the set of dependency pairs D = {d_1, ..., d_n}, where d_i = (h_i, r_i) denotes the dependency pair of the word w_i. We use h_i and r_i to denote the head word of w_i and their relation, respectively. p(w_i) denotes the position of the word w_i in the input sentence.
Syntactic Dependency Parsing (SyDP) analyses the grammatical dependency relations. The parsing result of SyDP is a tree structure called the syntactic parsing tree. In SyDP, each non-root word has exactly one head word, i.e., w_i has exactly one head h_i if w_i is not the root word.
Semantic Dependency Parsing (SeDP) focuses on representing the deep-semantic relation between words. Each word in SeDP is allowed to have multiple (even no) head words. This leads to the result of SeDP being a directed acyclic graph called Semantic Dependency Graph. Figure 1 shows the difference between SyDP and SeDP, where SyDP produces a tree while SeDP produces a graph.
2.2 Pre-trained Language Model
PLMs are usually stacks of Transformer attention blocks vaswani2017attention. PLMs that consist of encoder blocks only (e.g., BERT devlin2019bert) are not capable of sequence generation. This paper focuses on PLMs having both encoder and decoder blocks, such as T5 colin2020t5 and BART lewis2020bart.
An encoder-decoder PLM takes a sequence as input and outputs a sequence. Each PLM has an associated vocabulary V, a set of tokens that can be directly accepted and embedded by the PLM. The PLM first splits the input sequence into tokens in the vocabulary with a subword tokenization algorithm, such as SentencePiece kudo2018sentencepiece. Then, the tokens are mapped into vectors by looking up the embedding table. The attention blocks digest the embedded sequence and generate the output sequence.
DPSG leverages a PLM to parse the dependency relation of a sentence by sequence generation. Therefore, the Serializer converts the dependency structure into a serialized representation that meets the output format of the PLM (Section 3.1). The Positional Prompt injects word position information into the input sentence so as to avoid numerical reasoning (Section 3.2). The PLM is modified by adding special tokens introduced by the Serializer and the Positional Prompt (Section 3.3). Figure 2 illustrates the overall framework.
3.1 Serializer for Dependency Structure
The Serializer is a function that maps a sentence and its corresponding dependency pairs into a serialized representation, which serves as the target output to fine-tune the language model. The Inverse Serializer converts the output of the PLM back into dependency pairs to meet the output requirement of the DP task.
Specifically, the Serializer decomposes the dependency pairs into smaller dependency units by scattering the dependent word into each of its head words, which forms a set of (dependent word, relation, head word) triplets. Then, it replaces each relation with a special token (brackets indicate special tokens out of the vocabulary V), drawn from a set of special tokens covering all different relations. The head word is substituted by its position in the input sentence. The target serialized representation concatenates all the dependency units with a split token.
The Inverse Serializer restores the dependency structure from the serialized representation by substituting each special token with the original relation and resolving each head word by its position in the input sentence.
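As a concrete illustration, the serialization above can be sketched in a few lines of Python. The unit format and the token names (`<sep>`, `<iso>`, and the per-relation tokens such as `<nsubj>`) are illustrative placeholders for the paper's special tokens, not the exact strings used by the authors.

```python
def serialize(words, pairs):
    """Flatten dependency pairs into the target output sequence.

    words: list of words; positions are 1-indexed.
    pairs: dict mapping dependent position -> list of (head_position, relation).
           SeDP allows several heads per word; an empty list marks an
           isolated word, which receives a placeholder unit.
    Token names (<sep>, <iso>, <nsubj>, ...) are illustrative placeholders.
    """
    units = []
    # Units follow the order of the dependent words in the sentence.
    for i, word in enumerate(words, start=1):
        heads = pairs.get(i, [])
        if not heads:  # isolated word: virtual head at position 0
            units.append(f"{word} <iso> 0")
        for head_pos, rel in heads:
            units.append(f"{word} <{rel}> {head_pos}")
    return " <sep> ".join(units)
```

For the sentence "I love dogs" with "love" as the root (its own head, as in the paper), this yields `I <nsubj> 2 <sep> love <root> 2 <sep> dogs <obj> 2`.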
There are two issues in designing the Serializer:
Word Ambiguity. Words, especially function words, often appear multiple times in one sentence; a large number of sentences in the Penn Treebank marcus1993ptb3 contain repeated words. We take two measures for word disambiguation in a dependency unit: (1) to disambiguate the head word, the Serializer represents it by its position rather than by the word itself; (2) to disambiguate the dependent word, the Serializer arranges the dependency units by the order of the dependent words in the input sentence, rather than by a topological ordering or a depth/breadth-first-search ordering of the dependency graph. The Inverse Serializer scans the input and output simultaneously to match each dependency unit with its corresponding dependent word.
Isolated Words. Some dependency schemata allow isolated words, which have neither head words nor dependency relations with other words, e.g., the period mark in the SeDP result shown in Figure 1. Note that isolated words are different from the root word, as the root word is the head word of itself. One direct solution is to remove the isolated words from the serialized representation. However, this would cause inconsistencies between the input and output sequences, which complicates word disambiguation. Thus, we use a special token to denote such an isolation relation, together with a virtual head word position.
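Under these conventions, a minimal Inverse Serializer for the case of one unit per word can rely purely on order and positions, as the disambiguation measures above intend. The unit format and the `<sep>`/`<iso>` token names are hypothetical stand-ins for the paper's special tokens.

```python
def deserialize(serialized):
    """Recover (dependent_position, relation, head_position) triples.

    Because units appear in the order of the dependent words, the k-th
    unit belongs to the k-th word even when the same word repeats, and
    heads are stored as positions, so no word lookup is needed.
    Token and format conventions here are illustrative assumptions.
    """
    triples = []
    for k, unit in enumerate(serialized.split(" <sep> "), start=1):
        word, rel, head = unit.rsplit(" ", 2)
        if rel == "<iso>":  # isolated word: drop its virtual-head unit
            continue
        triples.append((k, rel.strip("<>"), int(head)))
    return triples
```

Note that the isolated-word placeholder keeps the k-th unit aligned with the k-th word, which is exactly why simply dropping isolated words would complicate disambiguation.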
3.2 Positional Prompt for Input Sentence
As Section 3.1 mentions, representing head words by their positions is an important scheme for head word disambiguation. However, PLMs are less skilled at numerical reasoning geva2020injecting. We also empirically find it difficult for the PLM to learn the positional information of each word from scratch. Thus, we inject a Positional Prompt (PP) for each word, which turns position prediction into copying the position number from the input, rather than counting words.
In particular, given the input sentence, the positional prompt is the position number of each word wrapped with two special tokens. The first token marks the beginning of the position number and prevents the tokenization algorithm from falsely treating the position prompt as part of the previous word. The second token separates the position number from the next word. They also provide word segmentation information for some languages, such as Chinese. After the conversion, we have an input sequence in which every word is followed by its wrapped position number.
For brevity, we denote the above process as a function that maps the input sentence into a sequence with positional prompts.
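This conversion can be sketched as follows; the wrapping token names `<p>` and `</p>` are illustrative placeholders, since the paper's exact special tokens are not reproduced here.

```python
def add_positional_prompt(words):
    """Append each word's 1-indexed position number, wrapped in two
    special tokens, so the PLM can copy positions instead of counting
    words. Token names <p> and </p> are illustrative placeholders."""
    return " ".join(f"{w} <p> {i} </p>" for i, w in enumerate(words, start=1))
```

For example, "I love dogs" becomes `I <p> 1 </p> love <p> 2 </p> dogs <p> 3 </p>`, so every head position the model must generate already appears verbatim in its input.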
3.3 PLM for Sequence Generation
Both the Serializer and the Positional Prompt introduce special tokens that are out of the original vocabulary, including the relation tokens, the separation tokens, and the special isolation token. Before training, these tokens are added to the vocabulary, and their embeddings are randomly initialized from the same distribution as the other tokens. Note that these special tokens are expected to carry distinct semantics; the PLM thus treats them as trainable variables and learns their semantics during training.
With all three components of DPSG, the input sentence is first converted into a sequence with positional prompts. This sequence is then fed into the PLM, which generates the output sequence with the maximum probability. The final predicted dependency structure is recovered via the Inverse Serializer.
The training objective is to maximize the likelihood of the ground-truth dependency structure. To do so, we take the serialized dependency structure as the target and minimize the auto-regressive language model loss. We can further enhance the unsupervised cross-domain capacity of DPSG with intermediate fine-tuning (IFT) pruksachatkun2020intermediate; chang2021rethinking: before training on dependency parsing in the source domain, intermediate fine-tuning continues to train the PLM on the unlabeled sentences of the target domain.
4.1 Evaluation Setups
We evaluate DPSG on the following widely used benchmarks for both SyDP and SeDP. We show more details about datasets in Appendix A.
Penn Treebank (PTB) marcus1993ptb3 is the most widely used benchmark for SyDP.
Chinese Open Dependency Treebank (CODT) li2019codt aims to evaluate the cross-domain SyDP capacity of a parser. It includes a balanced corpus (BC) for training, and three other corpora gathered from different domains for testing: product blogs (PB), the popular novel "Zhu Xian" (ZX), and product comments (PC).
Broad-Coverage Semantic Dependency Parsing dataset (SDP15) stephan2015sdp15 annotates English SeDP sentences with three different schemata, named DM, PAS, and PSD. It provides both in-domain (ID) and out-of-domain (OOD) evaluation sets. The schema of SDP15 allows for isolated words.
Chinese Semantic Dependency Parsing dataset (SemEval16) che2016semeval is a Chinese SeDP benchmark. The sentences are gathered from news (NEWS) and textbooks (TEXT). The schema of SemEval16 allows for multiple head words but does not have isolated words.
4.1.2 Evaluation Metrics
Following convention, we use the unlabeled attachment score (UAS) and labeled attachment score (LAS) for SyDP. We use the labeled attachment F1 score (LF) for SeDP on SDP15. For SeDP on SemEval16, we use the unlabeled attachment F1 (UF) and labeled attachment F1 (LF). All results are presented in percentages (%).
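For clarity, a minimal computation of the attachment scores might look like the following sketch; the function name and the per-word (head, relation) data layout are illustrative assumptions, not part of any standard evaluation script.

```python
def attachment_scores(gold, pred):
    """UAS/LAS over aligned per-word (head, relation) annotations.

    gold, pred: lists of (head_position, relation), one entry per word.
    UAS counts words whose head is correct; LAS additionally requires
    the correct relation label. Both are returned as percentages.
    """
    assert len(gold) == len(pred)
    n = len(gold)
    uas = 100.0 * sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = 100.0 * sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las
```

So a parse with every head correct but one relation label wrong scores full UAS and reduced LAS, which is why LAS is the stricter of the two metrics.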
We use T5-base colin2020t5 and mT5-base xue2021mt5 as the backbone PLM for English dependency parsing and Chinese dependency parsing, respectively. In particular, we use their V1.1 checkpoints, which are pre-trained only on unlabeled sentences, to keep the PLM unbiased. To focus on the parsing capability of the PLM itself, we do not use additional information such as part-of-speech (pos) tagging or character embeddings wang2020second; gan21mrc.
The PLM is implemented with Huggingface Transformers wolf2020transformers. The learning rate is , weight decay is . The optimizer is AdamW loshchilov2017adamw. We conduct all the experiments on Tesla V100.
We divide the baselines into three main categories based on their domain of expertise. Note that almost all baselines use additional lexical-level features (including pos tagging, character-level embeddings, and other pre-trained word embeddings), which DPSG does not. We supplement more details about the baselines in Appendix B.
In-domain SyDP. Biaffine dozat2017biaffine, StackPTR ma2018stackptr, and CRF2O li2020crf introduce specially designed parsing modules without PLM. CVT clark2018cvt, MP2O wang2020second, and MRC gan21mrc are recently proposed PLM-based dependency parsers. SeqNMT li2018seq2seq, SeqViable strzyz2019viable, and PaT vacareanu2020pat cast dependency parsing as a sequence labeling task, which is closely related to our sequence generation method.
Unsupervised Cross-domain SyDP. peng2019nlpcc and li2019codt modify the Biaffine for the unsupervised cross-domain DP. SSADP lin2021unsupervised relies on extra domain adaptation steps. In the PLM era, li2019codt propose ELMo-Biaffine with IFT on unlabeled target domain data.
SeDP. dozat2018sedp modify Biaffine for SeDP. BS-IT wang2019sedp is a transition-based semantic dependency parser with an incremental Tree-LSTM. HIT-SCIR che2019pipeline solves SeDP with a BERT-based pipeline. BERT+Flair he2020bert augments the Biaffine model with BERT and Flair akbik2018flair embeddings (they use different pre-processing scripts on SDP15 and are thus not comparable with DPSG and the other baselines on SDP15). Pointer fernandez2020transition combines a transition-based parser with a Pointer Network. It is also augmented with a Convolutional Neural Network (CNN) encoder for character-level features.
4.3 Main Results
4.3.1 DPSG is Schema-Free
The schema-free characteristics of DPSG are reflected from the following two perspectives.
Towards a Specific Schema. DPSG obtains SOTA performance on both CODT in Table 5 and SemEval16 in Table 3, and achieves first-tier results even among methods using additional lexical-level features on PTB in Table 2 and SDP15 in Table 4. For in-domain SyDP in Table 2, DPSG outperforms all previous sequence-based methods, and performs slightly lower (by 0.45% LAS) than MRC, which uses contextual interactive pos tagging.
For SeDP in Table 3, DPSG outperforms BERT+Flair by a large margin on SemEval16, achieving a 3.55% performance gain on NEWS and a 1.95% performance gain on TEXT with regard to LF. DPSG also outperforms the PLM-based pipeline HIT-SCIR on SDP15 (Table 4), but is slightly lower than Pointer, which applies an additional CNN to encode character-level embeddings. We also observe that DPSG and Pointer have the largest gap on the PSD schema of SDP15. This is because PSD has many more relation labels than the other schemata peng2017multi, which increases the search space of our generation model.
Towards Multi-Schemata. Furthermore, we design a multi-schemata experiment. We mix PTB and SDP15, concatenating a prefix to the input text to distinguish the different schemata. To prevent data leakage, we filter out sentences from the training set of PTB that also appear in the test set of SDP15. As DPSG (Multi) uses less training data for PTB, it performs worse than DPSG in Table 2. DPSG (Multi) in Table 4 outperforms Pointer by 1.49% in the ID evaluation of the PAS schema and 0.05% in the ID evaluation of the DM schema, and achieves almost the same performance as Pointer in the ID evaluation of the PSD schema. The improvement over the schema-specific model is most obvious on PAS. This could be because the PAS schema is more similar to the syntax schema peng2017multi, and thus benefits more from PTB. The multi-schemata approach also provides a new way to explore the inner connection between SyDP and SeDP.
4.3.2 Unsupervised Cross-domain
Table 5 demonstrates the outstanding transferability of DPSG. We implement DPSG with and without IFT on the target domain. DPSG with IFT achieves the new SOTA in terms of LAS on PB, ZX, and PC, compared to ELMo-Biaffine with IFT. DPSG is trained in its entirety during IFT, while the additional biaffine module of ELMo-Biaffine cannot benefit from the unlabeled sentences of the target domain.
This section studies whether there is a better implementation of DPSG. We are particularly interested in: 1) the design of the Serializer, 2) the effect of the introduced special tokens, and 3) the choice of PLM. We use PTB as the benchmark and compare DPSG as introduced in Section 3 with several other possible choices. The results of these exploratory experiments are shown in Table 6.
5.1 Serializer Design
The tree, as a well-studied data structure for syntactic dependency parsing, has several other serialization methods. We explore the serializer design of the tree structure in DPSG with two other widely used serialized representations—the Prufer sequence and the bracket tree, shown in Figure 3. Note that both face the same word ambiguity issue; we therefore associate each word with a unique position number as well.
Prufer Sequence is a unique sequence associated with a labeled tree in combinatorial mathematics. The algorithm that converts a labeled tree into a Prufer sequence does not preserve the root node, while in dependency parsing the root is a unique word. To bridge this inconsistency, we introduce an additional virtual node into the dependency tree to mark the root word.
Bracket Tree is one of the most commonly used serialization methods for tree structures strzyz2019viable. By recursively putting the sub-tree nodes in a pair of brackets from left to right, it builds a bijection between parsing trees and bracket trees. More details about how to construct the Prufer sequence and the bracket tree are given in Appendix C.
We denote the experimental results of the Prufer sequence and the bracket tree as Prufer and Bracket, respectively, in Table 6. Both undermine the performance of DPSG by a large margin, which indicates that our proposed Serializer provides a better serialized representation for the PLM to generate. This is because our Serializer guarantees that the dependency units in the output follow the same order as the words in the input sentence, while the Prufer sequence and the bracket tree do not preserve this order. Thus, our proposed DPSG expands the input sentence to generate the output sequence, whereas Prufer- and bracket-based DPSG must reconstruct the syntactic dependency structure. As the expansion strategy has a smaller generation space than reconstruction, the serialized representation proposed in Section 3.1 reduces the learning complexity for the PLM and thus brings better performance.
5.2 Special Token Design
We further investigate whether the additionally introduced special tokens are useful.
Relation Tokens. There are two ways to represent dependency relations in the serialized representation: adding a special token for each dependency relation, or mapping each dependency relation to the token in the original vocabulary with the closest meaning, e.g., conj to conjunct. The experimental result using word mapping is reported as the corresponding DPSG variant in Table 6. This variant is inferior to DPSG, which indicates that the special tokens for relations are important. The reason is that tokens from the original vocabulary interfere with their original meanings as words; special tokens disentangle the dependency relations from the words that could appear in the sentence.
Positional Prompt. We are also interested in the effectiveness of the positional prompts. We conduct an experiment in which the positional prompt is removed and the original input sentence is fed to the PLM; the result is reported as the corresponding DPSG variant in Table 6. Removing the positional prompt undermines the performance of DPSG because it requires the PLM to perform numerical reasoning, that is, to count the position of each head word.
5.3 Model Choosing
There are two kinds of legality in DPSG. Formation Legality concerns whether the output sequence has the correct format (see Section 3.1), and Structural Legality concerns the legality of the corresponding parsing structure. Statistics on PTB show that both the formation legality and the structural legality of DPSG are high, which is acceptable in practical usage.
6 Related Work
6.1 Syntactic Dependency Parsing
In-domain SyDP. Transition-based and graph-based methods are widely used in SyDP. dozat2017biaffine introduce biaffine attention into graph-based methods. ma2018stackptr adopt a pointer network to alleviate the locality drawback of transition-based methods. li2020crf improve the CRF to capture second-order information.
There is also research using sequence-to-sequence methods for SyDP. li2018seq2seq use a BiLSTM to predict the labeling of positions and relations for dependency parsing. strzyz2019viable improve li2018seq2seq's method and explore more representations of the predicted labeling sequences. vacareanu2020pat use BERT to augment sequence labeling methods.
Unsupervised Cross-domain SyDP. Annotating parsing data requires a wealth of linguistic knowledge, and this limitation motivates research on unsupervised cross-domain DP. yu2015self introduce pseudo-labeling for unsupervised cross-domain SyDP via self-training. li2019codt propose a cross-domain dataset, CODT, for SyDP and build baselines for unsupervised cross-domain SyDP. lin2021unsupervised introduce a feature-based domain adaptation method in this field.
6.2 Semantic Dependency Parsing
jan2017 build the first transition-based parser for Minimal Recursion Semantics (MRS). zhang2016 present two novel transition systems to generate arbitrary directed graphs in an incremental manner. dozat2018sedp modify Biaffine dozat2017biaffine for SeDP. However, because words in SeDP may have multiple heads, there is no sequence-based method for SeDP so far.
6.3 Probing in Language Model
Research exploring whether PLMs learn linguistic features, especially syntactic knowledge, during pre-training has attracted attention. hewitt2019structural map the distances between word embeddings in a PLM into distances in the syntax tree and construct a syntax tree without relation labels. clark2019bert design a structural probe to detect the ability of attention heads to express the dobj (direct object) dependency relation. Their results show that syntactic knowledge can also be found in the attention maps.
This paper proposes DPSG—a schema-free dependency parsing method. By serializing the parsing structure into a flattened sequence, a PLM can directly generate the parsing results in the serialized representation. DPSG not only achieves good results on each schema, but also performs surprisingly well on unsupervised cross-domain DP. The multi-schemata experiments also suggest that DPSG is capable of investigating the inner connection between dependency parsing under different schemata. The exploratory experiments and analyses demonstrate the rationality of the design of DPSG. Considering the unity, directness, and effectiveness of DPSG, we believe it has the potential to become a new paradigm for dependency parsing.
Appendix A Dataset Statistics
[Dataset statistics tables with columns: (1) Domain, Train Set, Dev Set, Test Set, Unlabeled Set; (2) Schema, Train Set, ID Test Set, OOD Test Set; (3) Domain, Train Set, Dev Set, Test Set. The counts did not survive extraction.]
Appendix B More Details on Baseline
Baselines for in-domain SyDP.
(* means model without PLM)
Biaffine: dozat2017biaffine adopt biaffine attention mechanism into the graph-based method of dependency parsing.
StackPTR: ma2018stackptr introduce the pointer network into the transition-based methods of dependency parsing.
CRF2O: li2020crf improve the CRF to capture more high-order information in dependency parsing.
(a second marker means sequence-based methods)
SeqNMT: li2018seq2seq use an encoder-decoder architecture to achieve Seq2Seq dependency parsing by sequence tagging. BPE segmentation from Neural Machine Translation (NMT) and character embeddings from AllenNLP gardner2018allennlp are applied to augment their model.
SeqViable: strzyz2019viable explore four encodings of dependency trees and improve the performance comparing with li2018seq2seq.
PaT: vacareanu2020pat use a simple tagging structure over BERT-base for sequence-labeling dependency parsing.
555+ means model utilizing PLM
CVT: clark2018cvt propose another pre-training method, named cross-view training, which can be used in many sequence prediction tasks including SyDP. The best results of CVT are achieved by multi-task pre-training of SyDP and part-of-speech tagging.
MP2O: wang2020second use a message-passing GNN based on BERT to capture second-order information in SyDP.
MRC: gan21mrc use a span-based method to construct edges at the subtree level. Machine Reading Comprehension (MRC) is applied to link the different spans. RoBERTa-large liu2019roberta is applied to enhance the representations of the parser.
Baselines for cross-domain SyDP.
Biaffine: peng2019nlpcc; li2019codt train Biaffine on the source domain and test on the target domain as the baseline for unsupervised cross-domain SyDP.
SSADP: lin2021unsupervised use both semantic and structural features to achieve domain adaptation for unsupervised cross-domain parsing.
ELMo: li2019codt use ELMo with intermediate fine-tuning on unlabeled text of the target domain to achieve the SOTA on unsupervised cross-domain SyDP.
Baselines for SeDP.
Biaffine: dozat2018sedp transfer the Biaffine model from SyDP to SeDP.
BS-IT: wang2019sedp use a transition-based method with an incremental Tree-LSTM for SeDP.
HIT-SCIR: che2019pipeline propose a BERT-based pipeline model for SeDP.
BERT+Flair: he2020bert use BERT and Flair embeddings akbik2018flair to augment their modified Biaffine.
Appendix C Construction of Prufer Sequence
C.1 Prufer Sequence
The principle of construction is to delete the leaf node with the minimum index and add the index of its parent node to the Prufer sequence. This process is repeated until only two nodes are left in the tree.
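The procedure above can be sketched as follows (a quadratic-time version, which is sufficient for illustration); it assumes the tree's nodes are labeled 1..n, and the function name is ours.

```python
def prufer_encode(n, edges):
    """Prufer sequence of a labeled tree with nodes 1..n (n >= 2):
    repeatedly remove the leaf with the smallest label and record
    the label of its parent (its only remaining neighbor)."""
    neighbors = {u: set() for u in range(1, n + 1)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    sequence = []
    for _ in range(n - 2):  # exactly n - 2 removals leave two nodes
        leaf = min(u for u in neighbors if len(neighbors[u]) == 1)
        parent = neighbors[leaf].pop()
        neighbors[parent].discard(leaf)
        del neighbors[leaf]
        sequence.append(parent)
    return sequence
```

For instance, the path 1-2-3-4 encodes to [2, 3], and the star with center 4 encodes to [4, 4]; the resulting sequence always has length n - 2.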
C.2 Prufer for Parsing Tree
The arcs in a parsing tree are directed, so the parsing tree is a rooted tree. Once all of its child nodes with smaller indices have been deleted, the root node would be treated as a leaf node and deleted in the next step. To address this problem, we add a virtual node with the maximum index and build an arc from the virtual node to the real root. This virtual node allows the real root to be treated as an ordinary leaf node while keeping it identifiable throughout the construction of the Prufer sequence. The overall construction process is shown in Figure 4 (a)-(f).
Appendix D Construction of Bracket Tree
The bracket tree uses brackets to indicate the levels of nodes. All nodes belonging to the same level are wrapped in the same pair of brackets. The construction process is shown in Figure 5.
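A recursive serialization in this spirit can be sketched as follows; the exact bracket placement and spacing are illustrative assumptions rather than the paper's precise format.

```python
def to_bracket(children, node):
    """Serialize a rooted tree as nested brackets, visiting the
    children of each node from left to right; leaves are emitted
    bare, internal nodes open a bracket for their subtree."""
    kids = children.get(node, [])
    if not kids:
        return str(node)
    inner = " ".join(to_bracket(children, k) for k in kids)
    return f"( {node} {inner} )"
```

For the tree rooted at node 2 with children {2: [1, 3], 3: [4]}, this yields `( 2 1 ( 3 4 ) )`, where each bracket pair delimits one subtree.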
Appendix E Comparison between T5 and BART
The figure shows the UAS comparison on the dev set of PTB between T5 and BART over the first 30 epochs. After the first two epochs, the performance of T5 rises rapidly and is better maintained in the later stages of training. Although BART achieves better performance in the first two epochs, it leaves little room for further improvement. To make matters worse, after achieving its best performance, BART is clearly unstable, and even significant performance drops occur.