1 Introduction
The discourse structure of a document describes discourse relationships between its elements as a graph or a tree. Discourse parsing is largely dominated by greedy parsers (braud2016multi; ji2014representation; yu2018transition; SogaardBC17). Global parsing is rarer (joty2015codra; li2016discourse) because the dependency between a node's label and its internal split point can make prediction computationally prohibitive.
In this work, we propose a CKY-based global parser with tractable inference, using a new independence assumption that loosens the coupling between the identification of the best split point and label prediction. Doing so gives us the advantage that we can search for the best tree in a larger space. Greedy discourse parsers (braud2016multi; ji2014representation; yu2018transition; SogaardBC17) have to use complex models to ensure each step is correct because their search space is limited. For example, ji2014representation manually crafted features and feature transformations to encode elementary discourse units (EDUs); yu2018transition and braud2016multi used multi-task learning for a better EDU representation. Instead, in this work, we use a simple recurrent span representation to build a parser that outperforms previous global parsers.
Our contributions are: (i) We propose an independence assumption that allows global inference for discourse parsing. (ii) Without any manually engineered features, our simple global parser outperforms previous global methods for the task. (iii) Experiments reveal that our parser outperforms greedy approaches that use the same representations, and is comparable to greedy models that rely on handcrafted features or more data.
2 RST Tree Structure
The Rhetorical Structure Theory (RST) of mann1988rhetorical is an influential theory of discourse. In this work, we focus on discourse parsing with the RST Discourse Treebank (carlson2001building). An RST tree assigns relation and nuclearity labels to adjacent nodes. Leaves, called elementary discourse units (EDUs), are clauses (not words) that serve as the building blocks of RST trees. Figure 1 shows an example RST tree.
RST trees have important structural differences from constituency parse trees. In a constituency tree, node labels describe their syntactic role in a sentence, and are independent of the splitting point between their children, thus driving methods such as that of DBLP:conf/acl/SternAK17. However, in an RST tree, the label of a node describes the relationship between its subtrees; the assignment of labels depends on the split point that separates its children.
3 Chart-based Parsing
In this section, we first describe chart parsing, then present our independence assumption that reduces inference time, and finally describe the loss function for training the parser.
3.1 Chart Parsing System
An RST tree structure can be represented as a set of labeled spans:

(1)  $T = \{ (i, j, r, n) \}$

where, for a span $(i, j)$, the relation label is $r$ and the nuclearity, which determines the direction of the relation, is $n$. The score of a tree is the sum of its span, relation and nuclearity scores.
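As a minimal illustration, the score of a tree in this representation is just a sum over its labeled spans. The sketch below uses arbitrary score tables as stand-ins for the learned scoring functions, and gives leaves dummy labels purely for uniformity; none of the names or values come from the paper:

```python
# Sketch: scoring an RST tree represented as a set of labeled spans.
# Score tables are arbitrary stand-ins for learned scoring functions.

def tree_score(spans, s_span, s_rel, s_nuc):
    """spans: set of (i, j, relation, nuclearity) tuples.
    The tree score is the sum of span, relation and nuclearity scores."""
    total = 0.0
    for (i, j, r, n) in spans:
        total += s_span[(i, j)] + s_rel[(i, j, r)] + s_nuc[(i, j, n)]
    return total

# Toy example: a two-EDU document joined by a single Elaboration relation.
# Leaves get dummy "leaf"/"-" labels here just to keep the tuples uniform.
s_span = {(0, 2): 1.0, (0, 1): 0.5, (1, 2): 0.5}
s_rel = {(0, 2, "Elaboration"): 2.0, (0, 1, "leaf"): 0.0, (1, 2, "leaf"): 0.0}
s_nuc = {(0, 2, "NS"): 1.0, (0, 1, "-"): 0.0, (1, 2, "-"): 0.0}
spans = {(0, 2, "Elaboration", "NS"), (0, 1, "leaf", "-"), (1, 2, "leaf", "-")}
print(tree_score(spans, s_span, s_rel, s_nuc))  # 5.0
```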
To find the best tree, we can use a chart to store the scores of possible spans and labels. For each cell in the chart, we need to decide the splitting point $k$, and the nuclearity and relation labels.
As we saw in §2, the label and split decisions are not independent (unlike DBLP:conf/acl/SternAK17). The joint score for a cell is the sum of all three scores, and also the scores of its best subtrees:

(2)  $s_{best}(i, j) = s_{span}(i, j) + \max_{k, r, n} \big[ s_{rel}(i, k, j, r) + s_{nuc}(i, k, j, n) + s_{best}(i, k) + s_{best}(k, j) \big]$
The base case for a leaf node does not account for the split point and subtrees:

(3)  $s_{best}(i, i+1) = s_{span}(i, i+1)$
The CKY algorithm can be used for decoding the recursive definition in (2). The running time is $O(G n^3)$, where $n$ is the number of EDUs in a document, and $G$ is the grammar constant, which depends on the number of labels.
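The joint recursion can be sketched as a CKY-style chart. The label sets and scoring functions below are toy deterministic stand-ins (not the paper's learned predictors and not the real RST label inventory); the point is only that each cell maximizes over the split point and the labels together, which is where the grammar constant multiplies the cubic term:

```python
from itertools import product

# Hypothetical label sets; the real RST inventory is much larger.
RELS = ["Elaboration", "Contrast"]
NUCS = ["NS", "SN", "NN"]

# Toy deterministic scorers standing in for the learned networks.
def s_span(i, j): return ((i * 31 + j * 17) % 10) / 10.0
def s_rel(i, k, j, r): return ((i + 2 * k + 3 * j + len(r)) % 10) / 10.0
def s_nuc(i, k, j, nu): return ((i + k + j + len(nu)) % 10) / 10.0

def joint_cky(n):
    """Chart over EDU spans.  Each cell maximizes jointly over the split
    point k AND the labels, giving O(|RELS| * |NUCS| * n^3) in total."""
    best, back = {}, {}
    for i in range(n):                      # base case: single-EDU leaves
        best[(i, i + 1)] = s_span(i, i + 1)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score, k, r, nu = max(
                (s_span(i, j) + s_rel(i, k, j, r) + s_nuc(i, k, j, nu)
                 + best[(i, k)] + best[(k, j)], k, r, nu)
                for k, r, nu in product(range(i + 1, j), RELS, NUCS))
            best[(i, j)], back[(i, j)] = score, (k, r, nu)
    return best, back

best, back = joint_cky(4)
print(best[(0, 4)], back[(0, 4)])
```

The `back` table records the argmax split and labels per cell, from which the tree can be read off top-down as in standard CKY decoding.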
3.2 Independence Assumption
Although we have framed the parsing process as a chart parsing system, the large grammar constant makes inference expensive. To resolve this, we assume that we can identify the splitting point of a node without knowing its label. After this decision, we use this split point to inform the label predictors instead of searching for the best split point and labels jointly. The scoring function becomes:
(4)  $\hat{k} = \arg\max_{k} \big[ s_{best}(i, k) + s_{best}(k, j) \big]$, and
$s_{best}(i, j) = s_{span}(i, j) + \max_{r} s_{rel}(i, \hat{k}, j, r) + \max_{n} s_{nuc}(i, \hat{k}, j, n) + s_{best}(i, \hat{k}) + s_{best}(\hat{k}, j)$
Unlike the parser of li2016discourse, which completely disentangles labels and splitting points, we retain a one-sided dependency: the joint score is still used in the recursion. Because the two decisions are not completely independent, we call our assumption the partial independence assumption. When we use the CKY algorithm as the inference algorithm to resolve equation (4), the running time becomes $O(n^3 + G n^2)$. While we still have a cubic dependency on the number of EDUs, removing the grammar constant from the cubic term makes our approach practically feasible.
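A sketch of the same chart under the partial independence assumption. The label sets and scorers are again toy stand-ins; what matters is that the label maximization moves outside the search over split points, so each cell costs $O(n + G)$ rather than $O(nG)$:

```python
# Hypothetical label sets and toy deterministic scorers; these stand in
# for the paper's learned predictors.
RELS = ["Elaboration", "Contrast"]
NUCS = ["NS", "SN", "NN"]

def s_span(i, j): return ((i * 31 + j * 17) % 10) / 10.0
def s_rel(i, k, j, r): return ((i + 2 * k + 3 * j + len(r)) % 10) / 10.0
def s_nuc(i, k, j, nu): return ((i + k + j + len(nu)) % 10) / 10.0

def partial_independence_cky(n):
    """O(n^3 + G n^2): label maximization sits outside the split search."""
    best = {}
    for i in range(n):                      # base case: single-EDU leaves
        best[(i, i + 1)] = s_span(i, i + 1)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Step 1: choose the split point from subtree scores alone.
            k = max(range(i + 1, j),
                    key=lambda k: best[(i, k)] + best[(k, j)])
            # Step 2: choose labels conditioned on the chosen split point.
            r = max(RELS, key=lambda r: s_rel(i, k, j, r))
            nu = max(NUCS, key=lambda nu: s_nuc(i, k, j, nu))
            best[(i, j)] = (s_span(i, j) + s_rel(i, k, j, r)
                            + s_nuc(i, k, j, nu)
                            + best[(i, k)] + best[(k, j)])
    return best

best = partial_independence_cky(4)
print(best[(0, 4)])
```

Note the one-sided dependency: the subtree scores used in step 1 already include the labels chosen inside those subtrees, so structure and labels are not fully decoupled.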
3.3 Loss Function
Since inference is feasible, we can train the model with inference in the inner step. Specifically, we use a max-margin loss that is the neural analogue of a structured SVM (taskar2005learning). Given our scoring functions, we can predict the best tree using CKY as
(5)  $\hat{T} = \arg\max_{T} s(T)$
For training, we can use the gold tree $T^*$ of a document to define the structured loss as:

(6)  $L = \max_{T} \big[ s(T) + \Delta(T, T^*) \big] - s(T^*)$
Here, $\Delta(T, T^*)$ is the Hamming distance between a tree $T$ and the reference $T^*$. The above loss can be computed with loss-augmented decoding, as is standard for a structured SVM, thus giving us a subdifferentiable function of the model parameters.
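The objective can be sketched abstractly as follows. To keep the sketch runnable, "trees" are flattened into tuples of labels and the inner argmax is brute force over an explicit candidate set; in the actual parser that same argmax is computed by CKY with the Hamming cost folded into the chart (loss-augmented decoding). All names and values here are illustrative:

```python
# Sketch of the max-margin objective with loss-augmented decoding.

def hamming(t1, t2):
    """Hamming distance between two label sequences."""
    return sum(a != b for a, b in zip(t1, t2))

def structured_hinge(score, candidates, gold):
    """max_T [s(T) + Delta(T, T_gold)] - s(T_gold)."""
    augmented = max(score(t) + hamming(t, gold) for t in candidates)
    return augmented - score(gold)

# Toy example with three candidate "trees" and fixed scores.
gold = ("NS", "Elab", "NN")
cands = [gold, ("NS", "Elab", "NS"), ("SN", "Cont", "NN")]
scores = {cands[0]: 2.0, cands[1]: 1.5, cands[2]: 3.0}
loss = structured_hinge(scores.get, cands, gold)
print(loss)  # 3.0: the last candidate scores 3.0 and differs in 2 positions
```

Because the gold tree is itself a member of the search space, the loss is non-negative by construction, and it is zero only when the gold tree beats every other tree by a margin of at least its Hamming distance.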
4 Neural Model for Global Parser
In this section, we describe our neural model that defines the scoring functions using an EDU representation. The network first maps a document, a sequence of words, to a vector representation for each EDU in the document. Those EDU representations serve as inputs to the three predictors $s_{span}$, $s_{rel}$ and $s_{nuc}$. Since the relation and nuclearity of a span depend on its context, recurrent neural networks are a natural way of modeling the sequence, as they have been shown to successfully capture word/span context for many NLP applications (DBLP:conf/acl/SternAK17; DBLP:journals/corr/BahdanauCB14). Each word is embedded by the concatenation of its GloVe (pennington2014glove) and ELMo (peters2018deep) embeddings, and the embedding of its POS tag. These serve as inputs to a biLSTM network. The POS tag embeddings are initialized uniformly at random and updated during the training process, while the other two embeddings are not updated. The softmax-normalized weights and scale parameters of ELMo are fine-tuned during the training process.
Suppose for a word $w_t$, the forward and backward encodings from the biLSTM are $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ respectively. The representation of an EDU with span $(i, j)$, denoted as $\mathbf{e}_{ij}$, is the concatenation of its encoded first and last words:

(7)  $\mathbf{e}_{ij} = [\overrightarrow{h}_i ; \overleftarrow{h}_i ; \overrightarrow{h}_j ; \overleftarrow{h}_j]$
The parameters of this EDU representation include three parts: (i) the POS tag embeddings; (ii) the softmax-normalized weights and scale parameter for ELMo; (iii) the weights of the biLSTM.
Using this representation, our scoring functions $s_{span}$, $s_{rel}$ and $s_{nuc}$ are each implemented as a two-layer feedforward neural network that takes EDU representations and scores the respective decision. The EDU representation parameters and the scoring functions are jointly learned.
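The shape of this pipeline can be sketched as below. Random matrices stand in for the trained biLSTM states, and the sizes and the three-way label set are toy choices, not the paper's; only the wiring (concatenate first/last word encodings, then a two-layer feedforward scorer) reflects the description above:

```python
import numpy as np

# Sketch of the EDU span representation feeding a two-layer scorer.

rng = np.random.default_rng(0)
T, H = 12, 8                        # words in the document, encoder size
fwd = rng.normal(size=(T, H))       # stand-in forward biLSTM states
bwd = rng.normal(size=(T, H))       # stand-in backward biLSTM states

def edu_repr(i, j):
    """Concatenate the encodings of the EDU's first and last words."""
    return np.concatenate([fwd[i], bwd[i], fwd[j], bwd[j]])

class Scorer:
    """Two-layer feedforward network: one score per candidate label."""
    def __init__(self, d_in, n_labels, hidden=16):
        self.W1 = rng.normal(scale=0.1, size=(d_in, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, n_labels))
    def __call__(self, x):
        return np.maximum(0.0, x @ self.W1) @ self.W2  # ReLU, then linear

s_nuc = Scorer(d_in=4 * H, n_labels=3)  # toy nuclearity set {NS, SN, NN}
print(s_nuc(edu_repr(0, 5)).shape)      # (3,)
```

During decoding, the chart queries such a scorer once per cell; during training, gradients flow through the scorers back into the span representations, so both are learned jointly.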
5 Experiments
The primary goal of our experiments is to compare the partial independence assumption against the full independence assumption of li2016discourse. In addition, we compare the global models against a shift-reduce parser (as in ji2014representation) that uses the same representation.
We evaluate our parsers on the RST Discourse Treebank (carlson2001building), using its standard training and testing splits. We further created a development set by choosing random documents from the training set to fine-tune hyperparameters. The supplementary material lists all the hyperparameters.
Following previous studies (carlson2001building), the original relation types are partitioned into coarser classes. All experiments are conducted on manually segmented EDUs. The POS tag of each word in the EDUs is obtained from spaCy (https://spacy.io/). We train our parser on the training split and use the best-performing model on the development set as the final model. We optimize the max-margin loss using Adam (kingma2014adam).
We use the standard evaluation method (marcu2000theory) to test model performance using three metrics: Span, Nuclearity and Relation (Full). We follow morey2017much in reporting both macro-averaged and micro-averaged F1 scores.
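The distinction between the two averages can be sketched concretely: macro averaging computes F1 per document and averages the per-document scores, while micro averaging pools span counts over the whole corpus before computing a single F1. The counts below are made up for illustration:

```python
# Sketch: macro- vs micro-averaged F1 over a corpus of documents.

def f1(match, pred, gold):
    """F1 from matched, predicted and gold span counts."""
    p = match / pred if pred else 0.0
    r = match / gold if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro(docs):
    """docs: list of (matched, predicted, gold) counts, one per document."""
    macro = sum(f1(m, p, g) for m, p, g in docs) / len(docs)
    micro = f1(sum(m for m, _, _ in docs),
               sum(p for _, p, _ in docs),
               sum(g for _, _, g in docs))
    return macro, micro

docs = [(8, 10, 10), (1, 5, 5)]   # one easy document, one hard document
print(macro_micro(docs))
```

Because micro averaging weights documents by their number of spans while macro averaging weights each document equally, the two metrics can rank systems differently, which is why both are reported.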
5.1 Results
Table 1 shows the final performance of our parsers using macro-averaged F1 scores. Our partial independence assumption outperforms the complete independence assumption by a large margin. Among all other parsers, our partial independence parser achieves the best results. Table 2 shows the performance of our parsers using micro-averaged F1 scores. Under this metric, the partial independence assumption still outperforms the complete independence assumption and the baseline. Again, we are among the best-performing parsers, though the best method (yu2018transition) is a shift-reduce parser augmented with multi-task learning. The latter's better performance, according to the ablation study in the original work, is due to the use of external resources (a BiAffine parser) for a better representation.
To better understand the difference between the complete independence and partial independence assumptions, we count, during training, how many trees found by the inference algorithm have a lower score than the corresponding gold tree. Since neither assumption permits exact search, inference may return a tree whose score is lower than that of the gold tree; we call this situation a missing prediction. Figure 2 shows the results. The complete independence assumption produces more missing predictions. This is because, under complete independence, the tree structure is decided only by its span scores: a tree can have high span scores but low label scores, resulting in a low total score.
Table 1: Macro-averaged F1 scores (S = Span, N = Nuclearity, R = Relation).

Categories    Parsing System         S     N     R
Global        joty2015codra          85.7  73.0  60.2
              li2016discourse        85.4  70.8  57.6
Greedy        SogaardBC17            85.1  73.1  61.4
              feng2014linear         87.0  74.1  60.5
              surdeanu2015two        85.1  71.1  59.1
              hayashi2016empirical   85.9  72.1  59.4
              braud2016multi         83.6  69.8  55.1
              ji2014representation   85.0  71.6  61.9
              Baseline               86.6  73.8  61.6
Our System    Complete Independence  85.7  72.2  56.7
              Partial Independence   87.2  74.9  61.9
              Human                  89.6  78.3  66.7
Table 2: Micro-averaged F1 scores (S = Span, N = Nuclearity, R = Relation).

Categories    Parsing System         S     N     R
Global        joty2015codra          82.6  68.3  55.4
              li2016discourse        82.2  66.5  50.6
Greedy        SogaardBC17            81.3  68.1  56.0
              feng2014linear         84.3  69.4  56.2
              surdeanu2015two        82.6  67.1  54.9
              hayashi2016empirical   82.6  66.6  54.3
              braud2016multi         79.7  63.6  47.5
              ji2014representation   82.0  68.2  57.6
              yu2018transition       85.5  73.1  59.9
              Baseline               83.3  70.4  56.7
Our System    Complete Independence  83.0  67.7  51.8
              Partial Independence   84.5  71.1  57.5
              Human                  88.3  77.3  65.4
6 Analysis and Related Work
Some prior work explores global parsing for RST structures. li2016discourse used the CKY algorithm for inference, ignoring the dependency between the splitting point and the label assignment. joty2015codra applied a two-stage parsing strategy: each sentence is first parsed, and then the document is parsed. In this process, all cross-sentence spans are ignored.
Greedy parsing can only explore a small part of the output space, thus necessitating high-quality representations and models to ensure each step is as correct as possible. This is why many earlier studies involve rich manually engineered features (joty2015codra; feng2014linear), external resources (yu2018transition; braud2016multi) or heavily designed models (li2016discourse; ji2014representation). Table 3 summarizes the different components used by various parsers. In contrast, using global inference, our parser only needs a recurrent input representation to achieve comparable performance, without any of the components mentioned in Table 3.



Table 3: Components (e.g., manually engineered features, external resources, multi-task learning) used by each parser: joty2015codra, li2016discourse, SogaardBC17, feng2014linear, surdeanu2015two, hayashi2016empirical, braud2016multi, ji2014representation, yu2018transition.
7 Conclusion
In this work, we propose a new independence assumption for discourse parsing, which makes globally optimal inference feasible for RST trees. Using global inference, we develop a simple neural discourse parser. Our experiments show that this simple parser achieves performance comparable to state-of-the-art parsers using only learned span representations.
Acknowledgements
This research was supported by the U.S.-Israel Binational Science Foundation (grant 2016257), its associated NSF grant 1737230, and the Yandex Initiative for Machine Learning.
References
Appendix A Hyperparameters for Experiments
Table 4 shows the hyperparameters for our experiments.
Hyperparameters: Max Epoch, biLSTM Hidden Size, Feedforward Hidden Size, GloVe Word Embedding Size, ELMo Word Embedding Size, POS Tag Embedding Size, Dropout Probability, Learning Rate.