A Simple Global Neural Discourse Parser

09/02/2020 ∙ by Yichu Zhou, et al. ∙ The University of Utah ∙ Tel Aviv University

Discourse parsing is largely dominated by greedy parsers with manually-designed features, while global parsing is rare due to its computational expense. In this paper, we propose a simple chart-based neural discourse parser that does not require any manually-crafted features and is based on learned span representations only. To overcome the computational challenge, we propose an independence assumption between the label assigned to a node in the tree and the splitting point that separates its children, which results in tractable decoding. We empirically demonstrate that our model achieves the best performance among global parsers, and comparable performance to state-of-the-art greedy parsers, using only learned span representations.

1 Introduction

The discourse structure of a document describes discourse relationships between its elements as a graph or a tree. Discourse parsing is largely dominated by greedy parsers (braud2016multi; ji2014representation; yu2018transition; SogaardBC17). Global parsing is rarer (joty2015codra; li2016discourse) because the dependency between a node's label and its internal split point can make prediction computationally prohibitive.

In this work, we propose a CKY-based global parser with tractable inference using a new independence assumption that loosens the coupling between the identification of the best split point and label prediction. Doing so gives us the advantage that we can search for the best tree in a larger space. Greedy discourse parsers braud2016multi; ji2014representation; yu2018transition; SogaardBC17 have to use complex models to ensure each step is correct because the search space is limited. For example, ji2014representation manually crafted features and feature transformations to encode elementary discourse units (EDUs); yu2018transition and braud2016multi used multi-task learning for a better EDU representation. Instead, in this work, we use a simple recurrent span representation to build a parser that outperforms previous global parsers.

Our contributions are: (i) We propose an independence assumption that allows global inference for discourse parsing. (ii) Without any manually engineered features, our simple global parser outperforms previous global methods for the task. (iii) Experiments reveal that our parser outperforms greedy approaches that use the same representations, and is comparable to greedy models that rely on hand-crafted features or more data.

2 RST Tree Structure

Rhetorical Structure Theory (RST) of mann1988rhetorical is an influential theory of discourse. In this work, we focus on discourse parsing with the RST Discourse Treebank carlson2001building. An RST tree assigns relation and nuclearity labels to adjacent nodes. Leaves, called elementary discourse units (EDUs), are clauses (not words) that serve as building blocks for RST trees. Figure 1 shows an example RST tree.

RST trees have important structural differences from constituency parse trees. In a constituency tree, node labels describe their syntactic role in a sentence, and are independent of the splitting point between their children, thus driving methods such as that of DBLP:conf/acl/SternAK17. However, in an RST tree, the label of a node describes the relationship between its sub-trees; the assignment of labels depends on the split point that separates its children.

Figure 1: An example of an RST tree, in which the leaves are EDUs and Elaboration is a discourse relation label. Nucleus and Satellite are nuclearity labels.

3 Chart-based Parsing

In this section, we first describe chart parsing, then present our independence assumption that reduces inference time, and finally give the loss function for training the parser.

3.1 Chart Parsing System

An RST tree structure $T$ can be represented as a set of labeled spans:

$T := \{((i_t, j_t), r_t, n_t) : t = 1, \dots, |T|\}$ (1)

where, for a span $(i_t, j_t)$, the relation label is $r_t$ and the nuclearity, which determines the direction of the relation, is $n_t$. The score of a tree is the sum of its span, relation and nuclearity scores.
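As a concrete illustration of this representation, the following Python sketch stores a tree as a set of labeled spans and sums the three scores over it. The `LabeledSpan` type, the fencepost indexing and the toy scoring functions are assumptions made for the example; they are not the learned scorers of the paper.

```python
from typing import Callable, NamedTuple, Set


class LabeledSpan(NamedTuple):
    start: int       # left EDU boundary (fencepost index, assumed convention)
    end: int         # right EDU boundary, exclusive
    relation: str
    nuclearity: str


def tree_score(
    tree: Set[LabeledSpan],
    s_span: Callable[[int, int], float],
    s_rel: Callable[[int, int, str], float],
    s_nuc: Callable[[int, int, str], float],
) -> float:
    """Sum span, relation and nuclearity scores over every labeled span."""
    return sum(
        s_span(s.start, s.end)
        + s_rel(s.start, s.end, s.relation)
        + s_nuc(s.start, s.end, s.nuclearity)
        for s in tree
    )


# Hypothetical toy scorers: longer spans, Elaboration and Nucleus score higher.
def toy_span(i, j):
    return float(j - i)


def toy_rel(i, j, relation):
    return 1.0 if relation == "Elaboration" else 0.0


def toy_nuc(i, j, nuclearity):
    return 0.5 if nuclearity == "Nucleus" else 0.0


toy_tree = {
    LabeledSpan(0, 2, "Elaboration", "Nucleus"),
    LabeledSpan(0, 1, "Attribution", "Nucleus"),
    LabeledSpan(1, 2, "Elaboration", "Satellite"),
}
total = tree_score(toy_tree, toy_span, toy_rel, toy_nuc)
print(total)
```

Because the tree score decomposes over spans, the best tree can be assembled bottom-up from the best subtrees, which is what makes chart decoding applicable.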

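To find the best tree, we can use a chart to store the scores of possible spans and labels. For each cell $(i, j)$ in the table, we need to decide the splitting point $k$, the nuclearity label $n$ and the relation label $r$.

As we saw in §2, the label and split decisions are not independent (unlike in DBLP:conf/acl/SternAK17). The joint score for a cell is the sum of all three scores, and also the scores of its best subtrees:

$s_{\mathrm{best}}(i, j) = \max_{k, r, n} \big[ s_{\mathrm{span}}(i, j) + s_{\mathrm{rel}}(i, j, k, r) + s_{\mathrm{nuc}}(i, j, k, n) + s_{\mathrm{best}}(i, k) + s_{\mathrm{best}}(k, j) \big]$ (2)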

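The base case for a leaf node $(i, i+1)$ does not account for the split point and subtrees:

$s_{\mathrm{best}}(i, i+1) = \max_{r, n} \big[ s_{\mathrm{span}}(i, i+1) + s_{\mathrm{rel}}(i, i+1, r) + s_{\mathrm{nuc}}(i, i+1, n) \big]$ (3)

The CKY algorithm can be used for decoding the recursive definition in (2). The running time is $O(G N^3)$, where $N$ is the number of EDUs in a document, and $G$ is the grammar constant, which depends on the number of labels.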
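The joint recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: spans use fencepost indices, a leaf is marked by a split of `None`, and the scoring functions `s_span`, `s_rel` and `s_nuc` are hypothetical hand-written stand-ins for the learned scorers.

```python
from functools import lru_cache
from itertools import product


def joint_cky(n_edus, relations, nuclearities, s_span, s_rel, s_nuc):
    """Chart decoding that maximises over split, relation and nuclearity jointly.

    Returns (best score, frozenset of (i, j, relation, nuclearity) tuples).
    """
    labels = list(product(relations, nuclearities))

    @lru_cache(maxsize=None)
    def best(i, j):
        # A leaf has no split point; represent that case with k = None.
        splits = [None] if j - i == 1 else range(i + 1, j)
        best_score, best_tree = float("-inf"), frozenset()
        for k in splits:
            if k is None:
                sub_score, sub_tree = 0.0, frozenset()
            else:
                (left_score, left_tree), (right_score, right_tree) = best(i, k), best(k, j)
                sub_score, sub_tree = left_score + right_score, left_tree | right_tree
            # This inner loop over labels is where the grammar constant enters.
            for r, n in labels:
                score = s_span(i, j) + s_rel(i, j, k, r) + s_nuc(i, j, k, n) + sub_score
                if score > best_score:
                    best_score, best_tree = score, sub_tree | {(i, j, r, n)}
        return best_score, best_tree

    return best(0, n_edus)


# Hypothetical toy scores for a three-EDU document.
def s_span(i, j):
    return 0.25 if (i, j) == (1, 3) else 0.0


def s_rel(i, j, k, r):
    return 1.0 if (i, j, k, r) == (0, 3, 2, "Elaboration") else 0.0


def s_nuc(i, j, k, n):
    return 0.5 if n == "Nucleus" else 0.0


joint_score, joint_tree = joint_cky(
    3, ("Elaboration", "Attribution"), ("Nucleus", "Satellite"), s_span, s_rel, s_nuc
)
print(joint_score, sorted(joint_tree))
```

Every cell loops over all split points and, inside that loop, over all label combinations, so the label count multiplies the cubic cost of CKY.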

3.2 Independence Assumption

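Although we have framed the parsing process as a chart parsing system, the large grammar constant makes inference expensive. To resolve this, we assume that we can identify the splitting point of a node without knowing its label. After this decision, we use this split point to inform the label predictors instead of searching for the best split point jointly. The scoring function becomes:

$\hat{k} = \arg\max_{k} \big[ s_{\mathrm{best}}(i, k) + s_{\mathrm{best}}(k, j) \big]$
$s_{\mathrm{best}}(i, j) = s_{\mathrm{span}}(i, j) + \max_{r} s_{\mathrm{rel}}(i, j, \hat{k}, r) + \max_{n} s_{\mathrm{nuc}}(i, j, \hat{k}, n) + s_{\mathrm{best}}(i, \hat{k}) + s_{\mathrm{best}}(\hat{k}, j)$ (4)

Unlike the parser of li2016discourse, which completely disentangles labels and splitting points, we retain a one-sided dependency: the split is chosen without the node's label, but the label is chosen given the split. The joint score is still used in the recursion. Because the two decisions are not completely independent, we call our assumption the partial independence assumption. When we use the CKY algorithm to solve Equation 4, each of the $O(N^2)$ cells needs one pass over its split points followed by one pass over the labels, so the running time becomes $O(N^3 + G N^2)$, where $N$ is the number of EDUs and $G$ is the grammar constant. While we still have a cubic dependency on the number of EDUs, the reduced impact of the constant makes our approach practically feasible.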
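The two-step decision can be sketched as follows. This is an illustrative reading of the partial independence assumption rather than the authors' code: the split is picked from the joint scores of the two subtrees, and the labels are then picked for that fixed split. The fencepost indexing, the `None` split for leaves and the toy scoring functions are assumptions of the example.

```python
from functools import lru_cache


def partial_cky(n_edus, relations, nuclearities, s_span, s_rel, s_nuc):
    """Chart decoding that fixes the split first and labels the node afterwards.

    Returns (score, frozenset of (i, j, relation, nuclearity) tuples).
    """

    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:
            # Leaf: no split point and no subtrees.
            k, sub_score, sub_tree = None, 0.0, frozenset()
        else:
            # Step 1: choose the split from the subtree scores only.
            k = max(range(i + 1, j), key=lambda m: best(i, m)[0] + best(m, j)[0])
            (left_score, left_tree), (right_score, right_tree) = best(i, k), best(k, j)
            sub_score, sub_tree = left_score + right_score, left_tree | right_tree
        # Step 2: choose each label given the fixed split.
        r = max(relations, key=lambda label: s_rel(i, j, k, label))
        n = max(nuclearities, key=lambda label: s_nuc(i, j, k, label))
        score = s_span(i, j) + s_rel(i, j, k, r) + s_nuc(i, j, k, n) + sub_score
        return score, sub_tree | {(i, j, r, n)}

    return best(0, n_edus)


# Hypothetical toy scores for a three-EDU document.
def s_span(i, j):
    return 0.25 if (i, j) == (1, 3) else 0.0


def s_rel(i, j, k, r):
    return 1.0 if (i, j, k, r) == (0, 3, 2, "Elaboration") else 0.0


def s_nuc(i, j, k, n):
    return 0.5 if n == "Nucleus" else 0.0


partial_score, partial_tree = partial_cky(
    3, ("Elaboration", "Attribution"), ("Nucleus", "Satellite"), s_span, s_rel, s_nuc
)
print(partial_score, sorted(partial_tree))
```

On these toy scores the decoder splits the root at position 1 and reaches a score of 2.75. An exhaustive search over split and labels would split at position 2 and reach 3.5, because only that split unlocks the high relation score at the root. The example is constructed to show the price of the approximation: the label loop no longer sits inside the split loop, but the search is no longer exact.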

3.3 Loss Function

Since inference is feasible, we can train the model with inference in the inner step. Specifically, we use a max-margin loss that is the neural analogue of a structured SVM taskar2005learning. Recall that, given all our scoring functions, we can predict the best tree using CKY as

$\hat{T} = \arg\max_{T} s(T)$ (5)

For training, we can use the gold tree $T^*$ of a document to define the structured loss as:

$L(T^*) = \max\big(0, \max_{T} [\, s(T) + \Delta(T, T^*) \,] - s(T^*)\big)$ (6)

Here $s(T)$ is the score of a tree $T$ and $\Delta(T, T^*)$ is the Hamming distance between a tree $T$ and the reference $T^*$. The above loss can be computed with loss-augmented decoding, as is standard for a structured SVM, thus giving us a sub-differentiable function in the model parameters.
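A small numeric sketch may help make the margin loss concrete. It assumes that trees are sets of labeled spans, that the Hamming distance counts the labeled spans of a tree that are missing from the gold tree, and that the candidate trees can be enumerated explicitly. In a real parser the inner maximisation would be done by loss-augmented chart decoding, not by enumeration; all names and scores below are hypothetical.

```python
def hamming(tree, gold):
    """Count labeled spans of `tree` that are absent from `gold` (assumed definition)."""
    return len(tree - gold)


def max_margin_loss(candidates, gold, score):
    """Structured hinge loss: max(0, max_T [s(T) + Delta(T, gold)] - s(gold))."""
    augmented = max(score(tree) + hamming(tree, gold) for tree in candidates)
    return max(0.0, augmented - score(gold))


# Two hypothetical candidate trees over three EDUs, as sets of (i, j, relation).
gold_tree = frozenset({(0, 3, "Elaboration"), (0, 2, "Attribution")})
other_tree = frozenset({(0, 3, "Elaboration"), (1, 3, "Attribution")})
candidates = [gold_tree, other_tree]

# The gold tree wins, but by less than the required margin of one span.
narrow_scores = {gold_tree: 2.0, other_tree: 1.5}
narrow_loss = max_margin_loss(candidates, gold_tree, narrow_scores.get)

# The gold tree wins by more than the margin, so the loss vanishes.
wide_scores = {gold_tree: 3.0, other_tree: 1.5}
wide_loss = max_margin_loss(candidates, gold_tree, wide_scores.get)

print(narrow_loss, wide_loss)
```

The loss is zero only when the gold tree outscores every other tree by at least that tree's distance from the gold tree, so trees that differ more must be separated by a larger margin.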

4 Neural Model for Global Parser

In this section, we describe our neural model that defines the scoring functions using an EDU representation. The network first maps a document, given as a sequence of words, to a vector representation for each EDU in the document. Those EDU representations serve as inputs to the three predictors, which score spans, relations and nuclearity.

Since the relation and nuclearity of a span depend on its context, recurrent neural networks are a natural way of modeling the sequence, as they have been shown to successfully capture word/span context for many NLP applications DBLP:conf/acl/SternAK17; DBLP:journals/corr/BahdanauCB14.

Each word is embedded as the concatenation of its GloVe embedding pennington2014glove, its ELMo embedding peters2018deep, and an embedding of its POS tag. These serve as inputs to a bi-LSTM network. The POS tag embeddings are initialized from a uniform distribution and updated during training, while the other two embeddings are not updated. The softmax-normalized weights and scalar parameter of ELMo are fine-tuned during training.

Suppose for a word $w_t$, the forward and backward encodings from the bi-LSTM are $f_t$ and $b_t$ respectively. The representation of an EDU whose words span positions $p$ to $q$, denoted as $e_{p,q}$, is the concatenation of its encoded first and last words:

$e_{p,q} = [f_p; b_p; f_q; b_q]$ (7)

The parameters of this EDU representation include three parts: (i) POS tag embeddings; (ii) Softmax-normalized weights and scalar parameter for ELMo; (iii) Weights of the bi-LSTM.
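The EDU representation can be illustrated with plain Python lists standing in for the bi-LSTM outputs. This sketch assumes one literal reading of the description, namely that the forward and backward encodings of the first and last word are concatenated in that order. The function name, the inclusive word-index convention and the one-dimensional toy encodings are hypothetical.

```python
def edu_representations(forward, backward, edu_spans):
    """Concatenate the encodings of the first and last word of every EDU.

    forward, backward: one vector (list of floats) per word of the document.
    edu_spans: (first_word, last_word) index pairs, both inclusive.
    """
    return [
        forward[first] + backward[first] + forward[last] + backward[last]
        for first, last in edu_spans
    ]


# A four-word document with one-dimensional toy encodings, split into two EDUs.
forward = [[0.1], [0.2], [0.3], [0.4]]
backward = [[-0.1], [-0.2], [-0.3], [-0.4]]
edu_spans = [(0, 1), (2, 3)]

edus = edu_representations(forward, backward, edu_spans)
print(edus)
```

Each EDU vector has four times the size of a single directional encoding, regardless of how many words the EDU contains.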

Using this representation, our scoring functions for span, relation and nuclearity are each implemented as a two-layer feedforward neural network that takes EDU representations as input and scores the respective decision. The EDU representation parameters and the scoring functions are jointly learned.
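A two-layer feedforward scorer can be written out directly. The sketch below is only meant to show the shape of the computation: the ReLU activation, the weight values and the choice of one output score per label are assumptions, since the text does not specify them.

```python
def affine(weights, bias, vector):
    """Compute weights @ vector + bias with plain Python lists."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def relu(vector):
    return [max(0.0, x) for x in vector]


def feedforward_scores(x, w1, b1, w2, b2):
    """Two layers: a hidden layer with ReLU (assumed), then one score per label."""
    return affine(w2, b2, relu(affine(w1, b1, x)))


# Hypothetical parameters: 2-dimensional input, 2 hidden units, 3 labels.
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
b2 = [0.0, 0.5, 0.0]

label_scores = feedforward_scores([1.0, -1.0], w1, b1, w2, b2)
print(label_scores)
```

The three scorers would share the EDU encoder but keep separate feedforward parameters, which is how the encoder and the scorers can be trained jointly from a single loss.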

5 Experiments

The primary goal of our experiments is to compare the partial independence assumption against the full independence assumption of li2016discourse. In addition, we also compare the global models against a shift-reduce parser (as in ji2014representation) that uses the same representation.

We evaluate our parsers on the RST Discourse Treebank carlson2001building. It consists of 385 documents in total, with 347 training and 38 testing examples. We further created a development set by choosing random documents from the training set, which we use for development and to fine-tune hyperparameters. The supplementary material lists all the hyperparameters.

Following previous studies carlson2001building, the original relation types are partitioned into coarser classes. All experiments are conducted on manually segmented EDUs. The POS tag of each word in the EDUs is obtained from spaCy (https://spacy.io/). We train our parser on the training split and use the best-performing model on the development set as the final model. We optimized the max-margin loss using Adam kingma2014adam.

We use the standard evaluation method marcu2000theory to test model performances using three metrics: Span, Nuclearity and Relation (Full). We follow morey2017much to report both macro-averaged and micro-averaged F1 scores.
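The difference between the two averaging schemes can be shown with a short sketch. It assumes that each document is summarised by three counts (matched, predicted and gold labeled spans); macro-averaging then means averaging per-document F1 scores, while micro-averaging pools the counts over all documents first. The counts below are invented for illustration.

```python
def f1(matched, predicted, gold):
    """F1 from span counts; returns 0.0 when it is undefined."""
    if matched == 0 or predicted == 0 or gold == 0:
        return 0.0
    precision, recall = matched / predicted, matched / gold
    return 2 * precision * recall / (precision + recall)


def macro_f1(documents):
    """Average of per-document F1 scores: every document counts equally."""
    return sum(f1(*counts) for counts in documents) / len(documents)


def micro_f1(documents):
    """F1 over pooled counts: every span counts equally."""
    matched = sum(d[0] for d in documents)
    predicted = sum(d[1] for d in documents)
    gold = sum(d[2] for d in documents)
    return f1(matched, predicted, gold)


# Hypothetical (matched, predicted, gold) counts for a short and a long document.
documents = [(2, 4, 4), (8, 8, 8)]
macro = macro_f1(documents)
micro = micro_f1(documents)
print(macro, micro)
```

The two schemes disagree whenever documents differ in length: here the short, poorly parsed document pulls the macro average down more than the micro average.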

5.1 Results

Table 1 shows the final performance of our parsers using macro-averaged F1 scores. Our partial independence assumption outperforms the complete independence assumption by a large margin. Compared with all other parsers in the table, our partial independence parser matches or exceeds the best result on all three metrics. Table 2 shows the performance of our parsers using micro-averaged F1 scores. Under this metric, the partial independence assumption still outperforms the complete independence assumption and the baseline. Again, we are among the best-performing parsers, though the best method, yu2018transition, is a shift-reduce parser augmented by multi-task learning. According to the ablation study in the original work, the latter's better performance is due to the use of external resources (Bi-Affine Parser) for a better representation.

To better understand the difference between the complete and partial independence assumptions, we count how many trees found by the inference algorithm have a lower score than the corresponding gold tree during training. Since neither assumption performs exact search, the inference algorithm can return a tree whose score is lower than that of the gold tree. We call this situation a missing prediction. Figure 2 shows the results. The complete independence assumption produces more missing prediction trees. This is because, under complete independence, the tree structure is decided only by its span scores. A tree can have high span scores but lower label scores, resulting in a low total score.
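The count described here amounts to a single comparison per training document. The sketch below assumes that the model scores of the decoded tree and of the gold tree are available for each document; the function name and the numbers are hypothetical.

```python
def count_missing_predictions(decoded_scores, gold_scores):
    """Count documents whose decoded tree scores strictly below the gold tree."""
    return sum(
        1 for decoded, gold in zip(decoded_scores, gold_scores) if decoded < gold
    )


# Hypothetical model scores for three training documents in one epoch.
decoded_scores = [3.5, 2.75, 4.0]
gold_scores = [3.5, 3.5, 3.0]

missing = count_missing_predictions(decoded_scores, gold_scores)
print(missing)
```

With exact search this count would always be zero, because the decoder would return the gold tree or something scoring at least as high; a non-zero count therefore measures search error rather than model error.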

Categories  Parsing System         S     N     R
Global      joty2015codra          85.7  73.0  60.2
            li2016discourse        85.4  70.8  57.6
Greedy      SogaardBC17            85.1  73.1  61.4
            feng2014linear         87.0  74.1  60.5
            surdeanu2015two        85.1  71.1  59.1
            hayashi2016empirical   85.9  72.1  59.4
            braud2016multi         83.6  69.8  55.1
            ji2014representation   85.0  71.6  61.9
            Baseline               86.6  73.8  61.6
Our System  Complete Independence  85.7  72.2  56.7
            Partial Independence   87.2  74.9  61.9
Human                              89.6  78.3  66.7

Table 1: Macro-averaged F1 comparison for different parsers (S: Span, N: Nuclearity, R: Relation). The results of other models are from morey2017much. Baseline is a shift-reduce parser that uses the same representation as our system.

Categories  Parsing System         S     N     R
Global      joty2015codra          82.6  68.3  55.4
            li2016discourse        82.2  66.5  50.6
Greedy      SogaardBC17            81.3  68.1  56.0
            feng2014linear         84.3  69.4  56.2
            surdeanu2015two        82.6  67.1  54.9
            hayashi2016empirical   82.6  66.6  54.3
            braud2016multi         79.7  63.6  47.5
            ji2014representation   82.0  68.2  57.6
            yu2018transition       85.5  73.1  59.9
            Baseline               83.3  70.4  56.7
Our System  Complete Independence  83.0  67.7  51.8
            Partial Independence   84.5  71.1  57.5
Human                              88.3  77.3  65.4

Table 2: Micro-averaged F1 comparison for different parsers (S: Span, N: Nuclearity, R: Relation). The results of other models are from morey2017much. Baseline is a shift-reduce parser that uses the same representation as our system.

Figure 2: The number of missing prediction trees for different independence assumptions. The x-axis is the training epoch and the y-axis is the number of missing prediction trees in that epoch.

6 Analysis and Related Work

Some prior work explores global parsing for RST structures. li2016discourse used the CKY algorithm for inference, ignoring the dependency between splitting point and label assignment. joty2015codra applied a two-stage parsing strategy: each sentence is parsed first, and then the document is parsed. In this process, all the cross-sentence spans are ignored.

Greedy parsing can only explore a small part of the output space, thus necessitating high-quality representations and models to ensure each step is as correct as possible. This is why many earlier studies involve rich manually engineered features joty2015codra; feng2014linear, external resources yu2018transition; braud2016multi or heavily designed models li2016discourse; ji2014representation. Table 3 summarizes the different components used by various parsers. In contrast, using global inference, our parser only needs a recurrent input representation to achieve comparable performance without any of the components listed in Table 3.

Parsing System        Manual Features  Two Stages  Multi-task  Feature Transform
joty2015codra
li2016discourse
SogaardBC17
feng2014linear
surdeanu2015two
hayashi2016empirical
braud2016multi
ji2014representation
yu2018transition

Table 3: Components in different parsing models in the literature. By manual features, we mean human-designed features other than POS tags. In comparison, our global parser uses none of these components.

7 Conclusion

In this work, we propose a new independence assumption for global inference in discourse parsing, which makes global inference tractable for RST trees. Using global inference, we develop a simple neural discourse parser. Our experiments show that the simple parser can achieve comparable performance to state-of-the-art parsers using only learned span representations.

Acknowledgements

This research was supported by the U.S.-Israel Binational Science Foundation grant 2016257, its associated NSF grant 1737230 and the Yandex Initiative for Machine Learning.

References

Appendix A Hyper-parameters for Experiments

Table 4 shows the hyper-parameters for our experiments.

Hyper-parameters           Setting
Max Epoch
biLSTM Hidden Size
Feedforward Hidden Size
GloVe Word Embedding Size
ELMo Word Embedding Size
POS Tag Embedding Size
Dropout Probability
Learning Rate

Table 4: Hyper-parameters in all experiments.