The discourse structure of a document describes discourse relationships between its elements as a graph or a tree. Discourse parsing is largely dominated by greedy parsers (braud2016multi; ji2014representation; yu2018transition; SogaardBC17). Global parsing is rarer (joty2015codra; li2016discourse) because the dependency between a node's label and its internal split point can make joint prediction computationally prohibitive.
In this work, we propose a CKY-based global parser with tractable inference, using a new independence assumption that loosens the coupling between the identification of the best split point and label prediction. Doing so gives us the advantage that we can search for the best tree in a larger space. Greedy discourse parsers (braud2016multi; ji2014representation; yu2018transition; SogaardBC17) have to use complex models to ensure each step is correct because the search space is limited. For example, ji2014representation manually crafted features and feature transformations to encode elementary discourse units (EDUs); yu2018transition and braud2016multi used multi-task learning for better EDU representations. Instead, in this work, we use a simple recurrent span representation to build a parser that outperforms previous global parsers.
Our contributions are: (i) We propose an independence assumption that allows global inference for discourse parsing. (ii) Without any manually engineered features, our simple global parser outperforms previous global methods for the task. (iii) Experiments reveal that our parser outperforms greedy approaches that use the same representations, and is comparable to greedy models that rely on hand-crafted features or more data.
2 RST Tree Structure
The Rhetorical Structure Theory (RST) of mann1988rhetorical is an influential theory of discourse. In this work, we focus on discourse parsing with the RST Discourse Treebank (carlson2001building). An RST tree assigns relation and nuclearity labels to adjacent nodes. Leaves, called elementary discourse units (EDUs), are clauses (not words) that serve as building blocks for RST trees. Figure 1 shows an example RST tree.
RST trees have important structural differences from constituency parse trees. In a constituency tree, node labels describe their syntactic role in a sentence, and are independent of the splitting point between their children, thus driving methods such as that of DBLP:conf/acl/SternAK17. However, in an RST tree, the label of a node describes the relationship between its sub-trees; the assignment of labels depends on the split point that separates its children.
3 Chart-based Parsing
In this section, we will first describe chart parsing, and then look at our independence assumption that reduces inference time. Finally, we will look at the loss function for training the parser.
3.1 Chart Parsing System
An RST tree structure can be represented as a set of labeled spans:
$$T = \{(i_t, j_t, r_t, n_t) : t = 1, \ldots, |T|\} \tag{1}$$
where, for a span $(i, j)$, the relation label is $r$ and the nuclearity, which determines the direction of the relation, is $n$. The score of a tree is the sum of its span, relation and nuclearity scores:
$$s(T) = \sum_{(i, j, r, n) \in T} \left[ s_{\text{span}}(i, j) + s_{\text{rel}}(i, j, r) + s_{\text{nuc}}(i, j, n) \right]$$
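As a concrete illustration, the tree-scoring decomposition can be sketched as follows; the three scorers are stood in by hypothetical dictionaries of precomputed scores rather than the learned scoring functions used in the paper:

```python
def tree_score(tree, s_span, s_rel, s_nuc):
    """Score a tree given as a set of labeled spans (i, j, relation, nuclearity).

    s_span, s_rel, and s_nuc are hypothetical stand-ins (plain dictionaries)
    for the learned scoring functions; the tree score is the sum of the span,
    relation, and nuclearity scores of its labeled spans.
    """
    total = 0.0
    for (i, j, rel, nuc) in tree:
        total += s_span[(i, j)] + s_rel[(i, j, rel)] + s_nuc[(i, j, nuc)]
    return total
```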
To find the best tree, we can use a chart to store the scores of possible spans and labels. For each cell $(i, j)$ in the table, we need to decide the splitting point $k$, and the nuclearity and relation labels.
As we saw in §2, the label and split decisions are not independent (unlike DBLP:conf/acl/SternAK17). The joint score for a cell is the sum of all three scores, and also the scores of its best subtrees:
$$s(i, j) = \max_{k, r, n} \left[ s_{\text{span}}(i, j) + s_{\text{rel}}(i, k, j, r) + s_{\text{nuc}}(i, k, j, n) + s(i, k) + s(k, j) \right] \tag{2}$$
The base case for a leaf node does not account for the split point and subtrees:
$$s(i, i + 1) = s_{\text{span}}(i, i + 1) \tag{3}$$
The CKY algorithm can be used for decoding the recursive definition in (2). The running time is $O(G \cdot n^3)$, where $n$ is the number of EDUs in a document, and $G$ is the grammar constant, which depends on the number of relation and nuclearity labels.
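A minimal sketch of this exact joint decoding, assuming the scorers are given as plain functions (hypothetical stand-ins for the learned scorers), shows where the grammar constant enters: every split point is considered jointly with every label pair.

```python
def cky_joint(n, s_span, s_rel, s_nuc, relations, nuclearities):
    """Exact CKY over n EDUs, jointly maximizing over the split point k and
    the labels (r, nuc) for each cell, as in the joint recursion. Runs in
    O(|R| * |N| * n^3). The scorers are hypothetical stand-ins given as
    plain Python functions."""
    best = {}   # (i, j) -> best subtree score
    back = {}   # (i, j) -> (k, r, nuc) backpointers
    for i in range(n):                      # base case: single-EDU leaves
        best[(i, i + 1)] = s_span(i, i + 1)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Jointly search over split points and both label inventories.
            score, choice = max(
                ((s_span(i, j) + s_rel(i, k, j, r) + s_nuc(i, k, j, nu)
                  + best[(i, k)] + best[(k, j)], (k, r, nu))
                 for k in range(i + 1, j)
                 for r in relations
                 for nu in nuclearities),
                key=lambda x: x[0])
            best[(i, j)], back[(i, j)] = score, choice
    return best[(0, n)], back
```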
3.2 Independence Assumption
Although we have framed the parsing process as a chart parsing system, the large grammar constant makes inference expensive. To resolve this, we assume that we can identify the splitting point of a node without knowing its label. After this decision, we use this split point to inform the label predictors instead of searching for the best split point jointly. The scoring function becomes:
$$k^* = \arg\max_{k} \left[ s(i, k) + s(k, j) \right]$$
$$s(i, j) = s_{\text{span}}(i, j) + \max_{r} s_{\text{rel}}(i, k^*, j, r) + \max_{n} s_{\text{nuc}}(i, k^*, j, n) + s(i, k^*) + s(k^*, j) \tag{4}$$
Unlike the parser of li2016discourse, which completely disentangles labels and splitting points, we retain a one-sided dependency. The joint score is still used in the recursion. Because the two decisions are not completely independent, we call our assumption the partial independence assumption. When we use the CKY algorithm to resolve equation 4, the running time becomes $O(n^3 + G \cdot n^2)$: each of the $O(n^2)$ cells requires $O(n)$ work to choose the split point and $O(G)$ work to label it. While we still have a cubic dependency on the number of EDUs, removing the grammar constant from the cubic term makes our approach practically feasible.
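The two-step decoding under the partial independence assumption can be sketched analogously, again with hypothetical stand-in scorers; note that label search no longer multiplies the split search:

```python
def cky_partial(n, s_span, s_rel, s_nuc, relations, nuclearities):
    """CKY under the partial independence assumption: the split point k* is
    chosen from subtree scores alone, and labels are then predicted
    conditioned on (i, k*, j). Runs in O(n^3 + |R| * |N| * n^2). The scorers
    are hypothetical stand-ins given as plain Python functions."""
    best, back = {}, {}
    for i in range(n):                      # base case: single-EDU leaves
        best[(i, i + 1)] = s_span(i, i + 1)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Step 1: pick the split point without looking at labels.
            k_star = max(range(i + 1, j),
                         key=lambda k: best[(i, k)] + best[(k, j)])
            # Step 2: predict labels conditioned on the chosen split.
            r_star = max(relations, key=lambda r: s_rel(i, k_star, j, r))
            n_star = max(nuclearities, key=lambda nu: s_nuc(i, k_star, j, nu))
            best[(i, j)] = (s_span(i, j) + s_rel(i, k_star, j, r_star)
                            + s_nuc(i, k_star, j, n_star)
                            + best[(i, k_star)] + best[(k_star, j)])
            back[(i, j)] = (k_star, r_star, n_star)
    return best[(0, n)], back
```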
3.3 Loss Function
Since inference is feasible, we can train the model with inference in the inner step. Specifically, we use a max-margin loss that is the neural analogue of a structured SVM (taskar2005learning). Recall that given all our scoring functions, we can predict the best tree using CKY as
$$\hat{T} = \arg\max_{T} s(T)$$
For training, we can use the gold tree $T^*$ of a document to define the structured loss as:
$$L = \max_{T} \left[ s(T) + \Delta(T, T^*) \right] - s(T^*)$$
where $\Delta(T, T^*)$ is the Hamming distance between a tree $T$ and the reference $T^*$. The above loss can be computed with loss-augmented decoding, as is standard for a structured SVM, giving us a sub-differentiable function of the model parameters.
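For intuition, the loss computation can be sketched as follows; `decode_fn` stands in for loss-augmented CKY decoding, and all interfaces here are hypothetical, not the paper's exact API:

```python
def margin_loss(score_fn, decode_fn, gold_tree, hamming):
    """Structured max-margin loss via loss-augmented decoding (a sketch).

    score_fn(tree) -> model score of a tree; hamming(t, gold) -> Hamming
    distance between two trees; decode_fn(objective) -> the tree maximizing
    the given objective over the search space. All are hypothetical
    interfaces standing in for the model's scorers and CKY decoder.
    """
    # Loss-augmented decoding: find the tree maximizing s(T) + Delta(T, T*).
    augmented = decode_fn(lambda t: score_fn(t) + hamming(t, gold_tree))
    margin = (score_fn(augmented) + hamming(augmented, gold_tree)
              - score_fn(gold_tree))
    # With exact search the margin is non-negative; clamp for safety under
    # approximate inference.
    return max(0.0, margin)
```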
4 Neural Model for Global Parser
In this section, we describe our neural model that defines the scoring functions using an EDU representation. The network first maps a document, a sequence of words $w_1, \ldots, w_m$, to a vector representation for each EDU in the document. Those EDU representations serve as inputs to the three predictors: $s_{\text{span}}$, $s_{\text{rel}}$ and $s_{\text{nuc}}$.
Since the relation and nuclearity of a span depend on its context, recurrent neural networks are a natural way of modeling the sequence, as they have been shown to successfully capture word/span context for many NLP applications (DBLP:conf/acl/SternAK17; DBLP:journals/corr/BahdanauCB14).
Each word is embedded as the concatenation of its GloVe (pennington2014glove) and ELMo (peters2018deep) embeddings, and the embedding of its POS tag. These serve as inputs to a bi-LSTM network. The POS tag embeddings are initialized uniformly at random and updated during training, while the other two embeddings are not updated. The softmax-normalized weights and scale parameters of ELMo are fine-tuned during training.
Suppose for a word $w_t$, the forward and backward encodings from the bi-LSTM are $f_t$ and $b_t$ respectively. The representation of an EDU with span $(i, j)$, denoted as $e_{ij}$, is the concatenation of its encoded first and last words:
$$e_{ij} = [f_i; b_i; f_j; b_j]$$
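In code, this span representation is just a concatenation of the bi-LSTM outputs at the span boundaries; here the encoder outputs are stood in by plain lists of floats:

```python
def edu_representation(f, b, i, j):
    """Build the EDU span representation by concatenating the forward and
    backward bi-LSTM encodings of the span's first word (index i) and last
    word (index j). f[t] and b[t] are the forward/backward encodings of
    word t, given here as plain lists standing in for encoder outputs."""
    return f[i] + b[i] + f[j] + b[j]
```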
The parameters of this EDU representation include three parts: (i) POS tag embeddings; (ii) softmax-normalized weights and scale parameter for ELMo; (iii) weights of the bi-LSTM.
Using this representation, our scoring functions $s_{\text{span}}$, $s_{\text{rel}}$ and $s_{\text{nuc}}$ are each implemented as a two-layer feedforward neural network that takes EDU representations and scores its respective decision. The EDU representation parameters and the scoring functions are jointly learned.
5 Experiments

The primary goal of our experiments is to compare the partial independence assumption against the complete independence assumption of li2016discourse. In addition, we also compare the global models against a shift-reduce parser (as in ji2014representation) that uses the same representation.
We evaluate our parsers on the RST Discourse Treebank (carlson2001building). It consists of 385 documents in total, with 347 training and 38 testing examples. We further created a development set of randomly chosen documents from the training set to tune hyperparameters. The supplementary material lists all the hyperparameters.
Following previous studies (carlson2001building), the original relation types are partitioned into 18 classes. All experiments are conducted on manually segmented EDUs. The POS tag of each word in the EDUs is obtained from spaCy (https://spacy.io/). We train our parser on the training split and use the best-performing model on the development set as the final model. We optimize the max-margin loss using Adam (kingma2014adam).
We use the standard evaluation method (marcu2000theory) to test model performance using three metrics: Span, Nuclearity and Relation (Full). We follow morey2017much and report both macro-averaged and micro-averaged F1 scores.
Table 1 shows the final performance of our parsers using macro-averaged F1 scores. Our partial independence assumption outperforms the complete independence assumption by a large margin. Among all parsers compared, our partial independence parser achieves the best results. Table 2 shows the performance of our parsers using micro-averaged F1 scores. Under this metric, the partial independence assumption still outperforms the complete independence assumption and the baseline. Again, ours is among the best-performing parsers, though the best method (yu2018transition) is a shift-reduce parser augmented with multi-task learning. The latter's better performance, as shown in the ablation study of the original work, is due to its use of external resources (a bi-affine parser) for a better representation.
To better understand the difference between the complete and partial independence assumptions, we count how many trees found by the inference algorithm have a lower score than the corresponding gold tree during training. Since neither assumption permits exact search, the returned tree may score lower than the gold tree, even though the gold tree would win under exact inference. We call this situation a missing prediction. Figure 2 shows the results. The complete independence assumption produces more missing predictions. This is because, under complete independence, the tree structure is decided only by its span scores; a tree can have high span scores but low label scores, resulting in a low total score.
Table 1 excerpt (macro-averaged F1). Our System, Complete Independence: Span 85.7, Nuclearity 72.2, Full 56.7.
Table 2 excerpt (micro-averaged F1). Our System, Complete Independence: Span 83.0, Nuclearity 67.7, Full 51.8.
6 Analysis and Related Work
Some prior work explores global parsing for RST structures. li2016discourse used the CKY algorithm for inference, ignoring the dependency between the splitting point and the label assignment. joty2015codra applied a two-stage parsing strategy: each sentence is first parsed, and then the document is parsed. In this process, all cross-sentence spans are ignored.
Greedy parsing can only explore a small part of the output space, thus necessitating high-quality representations and models to ensure each step is as correct as possible. This is why many earlier studies involve rich manually engineered features (joty2015codra; feng2014linear), external resources (yu2018transition; braud2016multi) or heavily engineered models (li2016discourse; ji2014representation). Table 3 summarizes the different components used by various parsers. In contrast, using global inference, our parser needs only a recurrent input representation to achieve comparable performance, without any of the components listed in Table 3.
7 Conclusion

In this work, we propose a new independence assumption for global inference in discourse parsing, which makes globally optimal inference feasible for RST trees. Using global inference, we develop a simple neural discourse parser. Our experiments show that this simple parser achieves performance comparable to state-of-the-art parsers using only learned span representations.
Acknowledgments

This research was supported by the U.S.-Israel Binational Science Foundation grant 2016257, its associated NSF grant 1737230, and the Yandex Initiative for Machine Learning.
Appendix A Hyper-parameters for Experiments
Table 4 shows the hyper-parameters for our experiments.
- biLSTM hidden size
- Feedforward hidden size
- GloVe word embedding size
- ELMo word embedding size
- POS tag embedding size