Natural language generation and document classification are widely performed with neural sequence models based on the encoder–decoder architecture. The underlying technique relies on producing a context vector as the document representation, which is used to estimate both tokens in natural language generation and labels in classification tasks. By combining recurrent neural networks with attention Bahdanau et al. (2015), the model is able to learn contextualized representations of words at the sentence level. However, higher-level concepts, such as discourse structure beyond the sentence, are hard for an RNN to learn, especially for longer documents. We hypothesize that NLP tasks such as summarization and document classification can be improved through the incorporation of discourse information.
In this paper, we propose to incorporate latent representations of discourse units into neural training. A discourse parser can provide information about the document structure as well as the relationships between discourse units. In a summarization scenario, for example, this information may help to remove redundant information or discourse disfluencies. In the case of document classification, the structure of the text can provide valuable hints about the document category. For instance, a scientific paper follows a particular discourse narrative pattern, different from a short story. Similarly, we may be able to predict the societal influence of a document such as a petition document, in part, from its discourse structure and coherence.
Specifically, discourse analysis aims to identify the organization of a text by segmenting sentences into units with relations. One popular representation is Rhetorical Structure Theory (RST), proposed by Mann and Thompson (1988), in which the document is parsed into a hierarchical tree whose leaf nodes are the segmented units, known as Elementary Discourse Units (EDUs), and whose non-terminal nodes define the relations.
As an example, in Figure 1 the two-sentence text has been annotated with discourse structure based on RST, in the form of 4 EDUs connected with discourse labels attr and elab. Arrows in the tree capture the nuclearity of relations, wherein a “satellite” points to its “nucleus”. The Nucleus unit is considered more prominent than the Satellite, indicating that the Satellite is a supporting sentence for the Nucleus. Nuclearity relationships between two EDUs can take the following three forms: Nucleus–Satellite, Satellite–Nucleus, and Nucleus–Nucleus. In this work, we use our reimplementation of the state-of-the-art neural RST parser of Yu et al. (2018), which is based on eighteen relations: purp, cont, attr, evid, comp, list, back, same, topic, mann, summ, cond, temp, eval, text, cause, prob, elab. (The details of each relation can be found on the RST website: http://www.sfu.ca/rst/index.html.)
This research investigates the impact of discourse representations obtained from an RST parser on natural language generation and document classification. We primarily experiment with an abstractive summarization model in the form of a pointer–generator network See et al. (2017), focusing on two factors: (1) whether summarization benefits from discourse parsing; and (2) how a pointer–generator network guides the summarization model when discourse information is provided. For document classification, we investigate the content-based popularity prediction of online petitions with a deep regression model Subramanian et al. (2018). We argue that document structure is a key predictor of the societal influence (as measured by signatures to the petition) of a document such as a petition.
Our primary contributions are as follows: (1) we are the first to incorporate a neural discourse parser in sequence training; (2) we empirically demonstrate that a latent representation of discourse structure enhances the summaries generated by an abstractive summarizer; and (3) we show that discourse structure is an essential factor in modelling the popularity of online petitions.
2 Related Work
Discourse parsing, especially in the form of RST parsing, has been the target of research over a long period of time, including pre-neural feature engineering approaches Hernault et al. (2010); Feng and Hirst (2012); Ji and Eisenstein (2014). Two approaches have been proposed to construct discourse parses: (1) bottom-up construction, where EDU merge operations are applied to single units; and (2) transition parser approaches, where the discourse tree is constructed as a sequence of parser actions. Neural sequence models have also been proposed. In early work, Li et al. (2016b) applied attention in an encoder–decoder framework and slightly improved on a classical feature-engineering approach. The current state of the art is a neural transition-based discourse parser Yu et al. (2018) which incorporates implicit syntax features obtained from a bi-affine dependency parser Dozat and Manning (2017). In this work, we employ this discourse parser to generate discourse representations.
2.1 Discourse and Summarization
Research has shown that discourse parsing is valuable for summarization. Via the RST tree, the salience of a given text can be determined from the nuclearity structure. In extractive summarization, Ono et al. (1994), O’Donnell (1997), and Marcu (1997) suggest introducing penalty scores for each EDU based on the nucleus–satellite structure. In recent work, Schrimpf (2018) utilizes the topic relation to divide documents into sentences with similar topics. Every chunk of sentences is then summarized in extractive fashion, resulting in a concise summary that covers all of the topics discussed in the passage.
Although the idea of using discourse information in summarization is not new, most work to date has focused on extractive summarization, whereas our focus is abstractive summarization. Gerani et al. (2014) used the parser of Joty et al. (2013) to RST-parse product reviews. By extracting graph-based features, important aspects are identified in the review and included in the summary based on a template-based generation framework. Although the experiment shows that RST can be beneficial for content selection, the proposed feature is rule-based and highly tailored to review documents. Instead, in this work, we extract a latent representation of the discourse directly from the Yu et al. (2018) parser, and incorporate this into the abstractive summarizer.
2.2 Discourse Analysis for Document Classification
Bhatia et al. (2015) show that discourse analyses produced by an RST parser can improve document-level sentiment analysis. Based on DPLP (Discourse Parsing from Linear Projection), an RST parser by Ji and Eisenstein (2014), they recursively propagate sentiment scores up to the root via a neural network.
A similar idea was proposed by Lee et al. (2018), where a recursive neural network is used to learn a discourse-aware representation. Here, DPLP is utilized to obtain discourse structures, and a recursive neural network is applied to the doc2vec Le and Mikolov (2014) representations for each EDU. The proposed approach is evaluated over sentiment analysis and sarcasm detection tasks, but found to not be competitive with benchmark methods.
Our work is different in that we use the latent representation (as distinct from the decoded discrete predictions) obtained from a neural RST parser. It is most closely related to the work of Bhatia et al. (2015) and Lee et al. (2018), but intuitively, our discourse representations contain richer information, and we evaluate over more tasks such as popularity prediction of online petitions.
3 Discourse Feature Extraction
To incorporate discourse information into our models (for summarization or document regression), we use the RST parser developed by Yu et al. (2018) to extract shallow and latent discourse features. The parser is competitive with other traditional parsers that use heuristic features Feng and Hirst (2012); Li and Hovy (2014); Ji and Eisenstein (2014) and other neural network-based parsers Li et al. (2016a).
3.1 Shallow Discourse Features
Given a discourse tree produced by the RST parser Yu et al. (2018), we compute several shallow features for an EDU: (1) the nuclearity score; (2) the relation score for each relation; and (3) the node type and that of its sibling.
Intuitively, the nuclearity score measures how informative an EDU is, by calculating the (relative) number of ancestor nodes that are nuclei (the ancestor nodes of an EDU are all the nodes traversed in its path to the root). In this score, $e$ is an EDU; $h(x)$ gives the height of node $x$, computed from the leaves, so that the height of the root node is equivalent to the depth of a leaf node; and $\mathbb{1}_{\text{nucleus}}(x)$ is an indicator function, i.e. it returns 1 when node $x$ is a nucleus and 0 otherwise.
The relation score measures the importance of a discourse relation to an EDU, by computing the (relative) number of ancestor nodes that participate in the relation, where $r$ is a discourse relation (one of 18 in total).
Note that we weigh each ancestor node here by its height; our rationale is that ancestor nodes that are closer to the root are more important. The formulation of these shallow features (nuclearity and relation scores) is inspired by Ono et al. (1994), who propose a number of ways to score an EDU based on the RST tree structure.
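As a sketch, the two height-weighted scores described above could be written as follows. The set $\mathrm{anc}(e)$ (ancestors of EDU $e$), the indicator $\mathbb{1}_{r}(x)$ (node $x$ participates in relation $r$), and the choice of normalizer are assumptions made for illustration; they are not taken from the original formulation, which may differ in detail.

```latex
\begin{align}
\text{nuc-score}(e) &= \frac{\sum_{x \in \mathrm{anc}(e)} h(x)\,\mathbb{1}_{\text{nucleus}}(x)}{\sum_{x \in \mathrm{anc}(e)} h(x)} \\
\text{rel-score}_{r}(e) &= \frac{\sum_{x \in \mathrm{anc}(e)} h(x)\,\mathbb{1}_{r}(x)}{\sum_{x \in \mathrm{anc}(e)} h(x)}
\end{align}
```

Under this reading both scores lie in $[0, 1]$, and an ancestor near the root (large $h(x)$) moves the score more than one just above the EDU.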
Lastly, we have 2 more features for the node type (nucleus or satellite) of the EDU and its sibling. In sum, our shallow feature representation for an EDU has 21 dimensions: 1 nuclearity score, 18 relation scores, and 2 node types.
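To make the 21-dimensional layout concrete, the following is a minimal sketch of the shallow feature computation on a toy tree. The `Node` class, the convention that each non-root node stores its own nuclearity role and relation label, and the normalization by the summed ancestor heights are all assumptions for illustration, not the parser's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# The 18 relation labels listed in the introduction.
RELATIONS = ["purp", "cont", "attr", "evid", "comp", "list", "back", "same", "topic",
             "mann", "summ", "cond", "temp", "eval", "text", "cause", "prob", "elab"]


@dataclass(eq=False)
class Node:
    """Hypothetical RST tree node; role is "N", "S", or None for the root."""
    role: Optional[str] = None
    relation: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def height(node: Node) -> int:
    """Height counted from the leaves: an EDU (leaf) has height 0."""
    return 0 if not node.children else 1 + max(height(c) for c in node.children)


def ancestors(edu: Node) -> List[Node]:
    """All nodes on the path from the EDU's parent up to the root."""
    path, node = [], edu.parent
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def weighted_fraction(edu: Node, keep: Callable[[Node], bool]) -> float:
    """Height-weighted share of ancestors for which keep(x) is true."""
    anc = ancestors(edu)
    total = sum(height(x) for x in anc)
    if total == 0:
        return 0.0
    return sum(height(x) for x in anc if keep(x)) / total


def shallow_features(edu: Node) -> List[float]:
    """21 dimensions: 1 nuclearity score, 18 relation scores, 2 node types."""
    nuc = weighted_fraction(edu, lambda x: x.role == "N")
    rels = [weighted_fraction(edu, lambda x, r=r: x.relation == r) for r in RELATIONS]
    sibling = None
    if edu.parent is not None:
        sibling = next((c for c in edu.parent.children if c is not edu), None)
    is_nucleus = 1.0 if edu.role == "N" else 0.0
    sibling_is_nucleus = 1.0 if sibling is not None and sibling.role == "N" else 0.0
    return [nuc] + rels + [is_nucleus, sibling_is_nucleus]


# Toy tree with 4 EDUs, loosely modelled on the attr/elab example.
root = Node()
left = root.add(Node(role="N", relation="elab"))
right = root.add(Node(role="S", relation="elab"))
e1 = left.add(Node(role="S", relation="attr"))
e2 = left.add(Node(role="N", relation="attr"))
e3 = right.add(Node(role="N", relation="attr"))
e4 = right.add(Node(role="S", relation="attr"))

features = shallow_features(e1)
print(len(features), features[0])
```

In this toy tree the root carries no role, so it only contributes to the denominator; a different convention for the root would change the scores but not the feature layout.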
3.2 Latent Discourse Features
In addition to the shallow features, we also extract latent features from the RST parser.
In the RST parser, each word and POS tag of an EDU span is first mapped to an embedding and concatenated to form an input sequence of length $m$, where $m$ is the number of words in the EDU. Yu et al. (2018) also use syntax features from the bi-affine dependency parser Dozat and Manning (2017). The syntax features are the output of the multi-layer perceptron layer (see Dozat and Manning (2017) for full details).
The two sequences are then fed to two (separate) bi-directional LSTMs, and average pooling is applied to each output sequence; the two pooled vectors are concatenated ($\oplus$ denotes the concatenation operation) to form the latent representation for an EDU.
Lastly, Yu et al. (2018) apply another bi-directional LSTM over the EDUs to learn a contextualized representation.
We extract the contextualized EDU representations (the outputs of this final Bi-LSTM) and use them as latent discourse features.
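The encoder described above can be summarized in equations as a sketch. All symbol names here ($x^w_i$ for word/POS embeddings, $x^s_i$ for syntax features, $a^w_i$ and $a^s_i$ for Bi-LSTM outputs, $x^e_j$ for the pooled EDU vector, $f_j$ for the contextualized EDU representation, $n$ for the number of EDUs) are assumptions chosen for illustration and may not match the notation of Yu et al. (2018).

```latex
\begin{align}
\{a^w_1, \dots, a^w_m\} &= \text{Bi-LSTM}_1(\{x^w_1, \dots, x^w_m\}) \\
\{a^s_1, \dots, a^s_m\} &= \text{Bi-LSTM}_2(\{x^s_1, \dots, x^s_m\}) \\
x^e &= \text{AvgPool}(\{a^w_1, \dots, a^w_m\}) \oplus \text{AvgPool}(\{a^s_1, \dots, a^s_m\}) \\
\{f_1, \dots, f_n\} &= \text{Bi-LSTM}_3(\{x^e_1, \dots, x^e_n\})
\end{align}
```

Under this reading, the latent discourse feature of the $j$-th EDU is $f_j$, which depends on the neighbouring EDUs through the third Bi-LSTM.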
3.3 Feature Extraction Pipeline
In Figure 2, we present the feature extraction pipeline. Given an input document, we use Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) to tokenize words and sentences, and obtain the POS tags. We then parse the processed input with the bi-affine parser Dozat and Manning (2017) to get the syntax features.
The RST parser Yu et al. (2018) requires EDU span information as input. Previous studies have generally assumed the input text has been pre-processed to obtain EDUs, as state-of-the-art EDU segmentation models are very close to human performance Hernault et al. (2010); Ji and Eisenstein (2014). For our experiments, we use the pre-trained EDU segmentation model of Ji and Eisenstein (2014) to segment the input text to produce the EDUs.
Given the syntax features (from the bi-affine parser), POS tags, EDU spans, and tokenized text, we feed them to the RST parser to extract the shallow and latent discourse features.
We re-implemented the RST parser in PyTorch and were able to reproduce the results reported in the original paper. We train the parser on the same data (385 documents from the Wall Street Journal), based on the configuration recommended in the paper.
4 Abstractive Summarization
Abstractive summarization is the task of creating a concise version of a document that encapsulates its core content. Unlike extractive summarization, abstractive summarization has the ability to create new sentences that are not in the original document; it is closer to how humans summarize, in that it generates paraphrases and blends multiple sentences in a coherent manner.
Current sequence-to-sequence models for abstractive summarization work like neural machine translation models, in largely eschewing symbolic analysis and learning purely from training data. Pioneering work such as Rush et al. (2015), for instance, assumes the neural architecture is able to learn main sentence identification, discourse structure analysis, and paraphrasing all in one model. Studies such as Gehrmann et al. (2018) and Hsu et al. (2018) attempt to incorporate additional supervision (e.g. content selection) to improve summarization. Although there are proposals that extend sequence-to-sequence models based on discourse structure (e.g. Cohan et al. (2018) include an additional attention layer for document sections), direct incorporation of discourse information is rarely explored.
Hare and Borchardt (1984) observe four core activities involved in creating a summary: (1) topic sentence identification; (2) deletion of unnecessary details; (3) paragraph collapsing; and (4) paraphrasing and insertion of connecting words. Current approaches Nallapati et al. (2016); See et al. (2017) capture topic sentence identification by leveraging the pointer network to do content selection, but the model is left to largely figure out the rest by providing it with a large training set, in the form of document–summary pairs. Our study attempts to complement the black-box model by providing an additional supervision signal related to the discourse structure of a document.
4.1 Summarization Model
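Our summarization model is the pointer–generator network of See et al. (2017). In what follows, $h_i$ are the encoder hidden states, $w_i$ are the embedded encoder input words, $s_t$ is the decoder hidden state, and $x_t$ is the embedded decoder input word.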
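For reference, the attention and generation-probability computations of the pointer–generator network are sketched below in the notation of See et al. (2017). The weight names ($v$, $W_h$, $W_s$, $b_{\text{attn}}$, $V$, $V'$, $b$, $b'$, $w_{h^*}$, $w_s$, $w_x$, $b_{\text{ptr}}$) follow that paper and are assumed here; the encoder states $h_i$ come from a bi-directional LSTM over the embedded input words $w_i$. The equation numbering of this section is not reproduced.

```latex
\begin{align}
e^t_i &= v^\top \tanh(W_h h_i + W_s s_t + b_{\text{attn}}) \\
a^t &= \text{softmax}(e^t) \\
h^*_t &= \sum_i a^t_i h_i \\
P_{\text{vocab}} &= \text{softmax}(V'(V[s_t, h^*_t] + b) + b') \\
p_{\text{gen}} &= \sigma(w_{h^*}^\top h^*_t + w_s^\top s_t + w_x^\top x_t + b_{\text{ptr}})
\end{align}
```

Here $a^t$ is the attention distribution over source positions at decoding step $t$, and $h^*_t$ is the resulting context vector.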
The pointer–generator network allows the model to either draw a word from its vocabulary (generator mode), or select a word from the input document (pointer mode).
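$p_{\text{gen}}$ is a scalar denoting the probability of triggering the generator mode, and $P_{\text{vocab}}$ gives us the generator mode’s vocabulary probability distribution. To get the final probability distribution over all words, we sum up the attention weights $a^t_i$ and $P_{\text{vocab}}$, weighted by $p_{\text{gen}}$:
$$P(w) = p_{\text{gen}}\, P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i:\, w_i = w} a^t_i$$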
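The mixture of generator and pointer modes can be illustrated with a small sketch in plain Python. The function name `final_distribution` and the dictionary-based representation are hypothetical choices for readability; an actual implementation would operate on batched tensors.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(p_gen: float,
                       vocab_dist: Dict[str, float],
                       attention: List[float],
                       source_tokens: List[str]) -> Dict[str, float]:
    """Mix the generator distribution with the copy (attention) distribution.

    Words that appear several times in the source accumulate the attention
    mass of all their positions; source words outside the vocabulary can
    still receive probability through the pointer term.
    """
    final = defaultdict(float)
    for word, prob in vocab_dist.items():
        final[word] += p_gen * prob
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Toy example: "zyx" is out of vocabulary but can still be copied.
dist = final_distribution(
    p_gen=0.6,
    vocab_dist={"the": 0.5, "cat": 0.3, "sat": 0.2},
    attention=[0.2, 0.5, 0.3],
    source_tokens=["cat", "zyx", "cat"],
)
print(dist)
```

Because both input distributions sum to one, the mixture is itself a valid probability distribution over the union of the vocabulary and the source words.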
To discourage repetitive summaries, See et al. (2017) propose adding coverage loss in addition to the cross-entropy loss:
$$c^t = \sum_{t'=0}^{t-1} a^{t'}$$
$$e^t_i = v^\top \tanh(W_h h_i + W_s s_t + w_c c^t_i + b_{\text{attn}})$$
$$\text{covloss}_t = \sum_i \min(a^t_i, c^t_i)$$
Intuitively, the coverage loss works by first summing the attention weights over all words from previous decoding steps ($c^t$), using that information as part of the attention computation ($e^t_i$), and then penalising the model if previously attended words receive attention again ($\text{covloss}_t$). See et al. (2017) train the model for an additional 3K steps with the coverage penalty after it is trained with cross-entropy loss.
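The coverage penalty can be traced step by step with a short sketch. It takes attention distributions that are already computed, so it covers only the coverage accumulation and the per-step penalty, not the modified attention computation; the function name `coverage_losses` is hypothetical.

```python
from typing import List


def coverage_losses(attention_steps: List[List[float]]) -> List[float]:
    """Per-step coverage loss: sum_i min(a_i^t, c_i^t).

    The coverage vector c^t is the sum of the attention distributions of
    all previous decoding steps, so it is all zeros at the first step.
    """
    if not attention_steps:
        return []
    coverage = [0.0] * len(attention_steps[0])
    losses = []
    for attn in attention_steps:
        losses.append(sum(min(a, c) for a, c in zip(attn, coverage)))
        coverage = [c + a for c, a in zip(coverage, attn)]
    return losses


# Step 2 attends to the first source word again and is penalised for it.
steps = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
losses = coverage_losses(steps)
print(losses)
```

The first step is never penalised, and a step that spreads its attention over positions not yet covered incurs a smaller penalty than one that revisits the same positions.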
4.2 Incorporating the Discourse Features
We experiment with several simple methods to incorporate the discourse features into our summarization model. Recall that the discourse features (shallow or latent) are generated for each EDU, but the summarization model operates at the word level. To incorporate the features, we assume each word within an EDU span receives the same discourse feature representation. Henceforth we distinguish between the shallow and the latent discourse features, either of which can be used with each of the methods below.
Method-1 (M1): Incorporate the discourse features in the Bi-LSTM layer (Equation (1)) by concatenating them with the word embeddings:
Method-2 (M2): Incorporate the discourse features by adding another Bi-LSTM:
Method-3 (M3): Incorporate the discourse features in the attention layer (Equation (2)):
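The word-level broadcast described above, together with the concatenation used in Method-1, can be sketched as follows. The function names, the half-open `(start, end)` span convention, and the zero vector for words not covered by any EDU are assumptions for illustration; Method-2 and Method-3 are not covered by this sketch.

```python
from typing import List, Tuple


def broadcast_edu_features(edu_spans: List[Tuple[int, int]],
                           edu_features: List[List[float]],
                           num_words: int) -> List[List[float]]:
    """Give every word inside an EDU span that EDU's feature vector.

    Spans are half-open word index ranges [start, end). Words outside
    every span keep an all-zero vector (an assumption of this sketch).
    """
    dim = len(edu_features[0])
    word_features = [[0.0] * dim for _ in range(num_words)]
    for (start, end), feat in zip(edu_spans, edu_features):
        for i in range(start, min(end, num_words)):
            word_features[i] = list(feat)
    return word_features


def concat_inputs(word_embeddings: List[List[float]],
                  word_features: List[List[float]]) -> List[List[float]]:
    """Method-1 style input: word embedding concatenated with its feature."""
    return [w + f for w, f in zip(word_embeddings, word_features)]


# Five words split into two EDUs, with 2-dimensional toy discourse features.
spans = [(0, 2), (2, 5)]
feats = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[0.1, 0.2, 0.3] for _ in range(5)]
encoder_inputs = concat_inputs(embeddings, broadcast_edu_features(spans, feats, 5))
print(len(encoder_inputs), len(encoder_inputs[0]))
```

With the shallow features, the per-word input would grow by 21 dimensions; with the latent features, by the size of the parser's EDU representation.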
4.3 Data and Result
We conduct our summarization experiments using the anonymized CNN/DailyMail corpus Nallapati et al. (2016). We follow the data preprocessing steps in See et al. (2017) to obtain 287K training examples, 13K validation examples, and 11K test examples.
All of our experiments use the default hyper-parameter configuration of See et al. (2017). Each document and its summary are truncated to 400 and 100 tokens, respectively (shorter texts are padded accordingly). The model has 256-dimensional hidden states and 128-dimensional word embeddings, and the vocabulary is limited to the most frequent 50K tokens. During test inference, we similarly limit the length of the input document to 400 words and the length of the generated summary to 35–100 words for beam search.
Our experiment has two pointer–generator network baselines: (1) one without the coverage mechanism (“PG”); and (2) one with the coverage mechanism (“PGCov”; Section 4.1). For each baseline, we incorporate the latent and shallow discourse features separately in 3 ways (Section 4.2), giving us 6 additional results.
We train the models for approximately 240,000–270,000 iterations (13 epochs). When we include the coverage mechanism (second baseline), we train for an additional 3,000–3,500 iterations using the coverage penalty, following See et al. (2017).
We use ROUGE Lin (2004) as our evaluation metric, which is a standard measure based on overlapping n-grams between the generated summary and the reference summary. We assess unigram (R-1), bigram (R-2), and longest-common-subsequence (R-L) overlap, and present F1, recall and precision scores in Table 1.
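To illustrate what these scores measure, the following is a simplified sketch of n-gram and longest-common-subsequence overlap for a single candidate–reference pair. It is not the official ROUGE package: it omits stemming, stopword handling, and the sentence-level treatment that the official R-L uses for multi-sentence summaries, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def prf(overlap: int, cand_total: int, ref_total: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from an overlap count."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rouge_n(candidate: List[str], reference: List[str], n: int = 1):
    """Clipped n-gram overlap between candidate and reference tokens."""
    def ngrams(tokens: List[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: List[str], reference: List[str]):
    """Longest common subsequence over the two full token sequences."""
    prev = [0] * (len(reference) + 1)
    for c in candidate:
        curr = [0]
        for j, r in enumerate(reference, start=1):
            curr.append(prev[j - 1] + 1 if c == r else max(prev[j], curr[j - 1]))
        prev = curr
    return prf(prev[-1], len(candidate), len(reference))


candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
rl = rouge_l(candidate, reference)
print(r1, r2, rl)
```

A longer candidate can raise recall by covering more reference n-grams while lowering precision, which is the trade-off discussed in the results.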
For the first baseline (PG), we see that incorporating discourse features consistently improves recall and F1. This observation is consistent irrespective of how (e.g. M1 or M2) and what (e.g. shallow or latent features) we add. These improvements do come at the expense of precision, with the exception of M2-latent (which produces small improvements in precision). Overall, however, the latent features are in general a little better, with M2-latent producing the best results based on F1.
We make similar observations for the second baseline (PGCov): recall is generally improved at the expense of precision. In terms of F1, the gap between the baseline and our models is a little smaller, and M1-latent and M2-latent are the two best performers.
4.4 Analysis and Discussion
We saw previously that our models generally improve recall. To better understand this, we present 2 examples of generated summaries, one by the baseline (“PGCov”) and another by our model (“M1-latent”), in Figure 4. The highlighted words are overlapping words in the reference. In the first example, we notice that the summary generated by our model is closer to the reference, while the baseline has other unimportant details (e.g. he told new in chess magazine : ‘ why should -lsb- men and women -rsb- function in the same way ?). In the second example, although there are more overlapping words in our model’s summary, it is a little repetitive (e.g. first and third paragraph) and less concise.
Observing that our model generally has better recall (Table 1) and its summaries tend to be more verbose (e.g. second example in Figure 4), we calculated the average length of generated summaries for PGCov and M1-latent, and found that they are of length 55.2 and 64.4 words respectively. This suggests that although discourse information helps the summarization model overall (based on consistent improvement in F1), the negative side effect is that the summaries tend to be longer and potentially more repetitive.
5 Petition Popularity Prediction
Online petitions are open letters to policy-makers or governments requesting change or action, based on the support of members of society at large. Understanding the factors that determine the popularity of a petition, i.e. the number of supporting signatures it will receive, provides valuable information for institutions or independent groups to communicate their goals Proskurnia et al. (2017).
Subramanian et al. (2018) attempt to model petition popularity by utilizing the petition text. One novel contribution is that they incorporate an auxiliary ordinal regression objective that predicts the scale of signatures (e.g. 10K vs. 100K). Their results demonstrate that the incorporation of the auxiliary loss and hand-engineered features boosts performance over the baseline.
In terms of evaluation metrics, Subramanian et al. (2018) use: (1) mean absolute error (MAE); and (2) mean absolute percentage error (MAPE), calculated as $\frac{100}{n}\sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{y_i}$, where $n$ is the number of examples and $\hat{y}_i$ ($y_i$) the predicted (true) value. Note that in both cases, lower numbers are better.
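Both metrics are straightforward to compute; the sketch below uses the standard definitions of MAE and MAPE, with hypothetical function names. It assumes all true values are non-zero, since MAPE divides by them, and it is agnostic as to the scale (raw or log signature counts) on which the values are expressed.

```python
from typing import Sequence


def mae(predicted: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)


def mape(predicted: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; true values must be non-zero."""
    return 100.0 / len(true) * sum(abs(p - t) / abs(t) for p, t in zip(predicted, true))


predicted = [110.0, 90.0]
true = [100.0, 100.0]
print(mae(predicted, true), mape(predicted, true))
```

MAE weights every example equally in absolute terms, whereas MAPE is relative to the true value, so the two can rank models differently.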
Similar to the abstractive summarization task, we experiment with incorporating the discourse features of the petition text into the petition regression model, under the hypothesis that discourse structure should benefit the model.
5.1 Deep Regression Model
As before, our model is based on the model of Subramanian et al. (2018). The input is a concatenation of the petition’s title and content words, and the output is the log number of signatures. The input sequence is mapped to GloVe vectors Pennington et al. (2014) and processed by several convolution filters with max-pooling to create a fixed-width hidden representation, which is then fed to fully connected layers and ultimately activated by an exponential linear unit to predict the output. The model is optimized with mean squared error (MSE). In addition to the MSE loss, the authors include an auxiliary ordinal regression objective that predicts the scale of signatures (e.g. 10K vs. 100K), and found that it improves performance. Our model is based on the best model that utilizes both the MSE and ordinal regression loss.
5.2 Incorporating the Discourse Features
We once again use the methods of incorporation presented in Section 4.2. As the classification model uses convolution networks, only Method-1 is directly applicable.
We also explore replacing the convolution networks with a bidirectional LSTM (“Bi-LSTM w/ GloVe”), based on the idea that recurrent networks are better at capturing long-range dependencies between words and EDUs. For this model, we test both Method-1 and Method-2 to incorporate the discourse features. (Our LSTM has 200 hidden units, and uses a dropout rate of 0.3 and L2 regularization.)
Lastly, unlike the summarization model, which needs word-level input (as the pointer network requires words to attend to in the source document), the regression model does not. We therefore experiment with replacing the input words with EDUs, and embed the EDUs with either the latent (“Bi-LSTM w/ latent”) or the shallow (“Bi-LSTM w/ shallow”) features.
5.3 Data, Result, and Discussion
We use the US Petition dataset from Subramanian et al. (2018). (The data is collected from https://petitions.whitehouse.gov.) In total we have 1K petitions with over 12M signatures after removing petitions that have fewer than 150 signatures. We use the same train/dev/test split of 80/10/10 as Subramanian et al. (2018).
We present the test results in Table 2. We tune the models based on the development set using MAE, and find that most converge after 8K–10K iterations of training. We are able to reproduce the performance of the baseline model (“CNN w/ GloVe”), and find that once again, adding the shallow discourse features improves results.
Next we look at the LSTM model (“Bi-LSTM w/ GloVe”). Interestingly, we find that replacing the CNN with an LSTM results in improved MAE, but worse MAPE. Adding discourse features to this model generally yields marginal improvements in all cases.
When we replace the word sequence with EDUs (“Bi-LSTM w/ latent” and “Bi-LSTM w/ shallow”), we see that the latent features outperform the shallow features. This is perhaps unsurprising, given that the shallow discourse features have no information about the actual content, and are unlikely to be effective when used in isolation without the word features.
6 Conclusion and Future Work
In this paper, we explore incorporating discourse information into two tasks: abstractive summarization and popularity prediction of online petitions. We experiment with both hand-engineered shallow features and latent features extracted from a neural discourse parser, and find that adding them generally benefits both tasks. The caveat, however, is that the best method of incorporation and feature type (shallow or latent) appears to be task-dependent, and so it remains to be seen whether we can find a robust universal approach for incorporating discourse information into NLP tasks.
References
- Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
- Bhatia et al. (2015). Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2212–2218.
- Dozat and Manning (2017). Deep biaffine attention for neural dependency parsing. In Proceedings of the 2016 International Conference on Learning Representations, pp. 1–8.
- Feng and Hirst (2012). Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 60–68.
- Gerani et al. (2014). Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1602–1613.
- Hernault et al. (2010). HILDA: a discourse parser using support vector machine classification. Dialogue and Discourse 1 (3), pp. 1–33.
- Ji and Eisenstein (2014). Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 13–24.
- Joty et al. (2013). Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 486–496.
- Le and Mikolov (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pp. 1188–1196.
- Lee et al. (2018). A discourse-aware neural network-based text model for document-level text classification. Journal of Information Science 44.
- Li and Hovy (2014). Recursive deep models for discourse parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2061–2069.
- Li et al. (2016a). Discourse parsing with attention-based hierarchical neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 362–371.
- Li et al. (2016b). Discourse parsing with attention-based hierarchical neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 362–371.
- Lin (2004). ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81.
- Marcu (1997). From discourse structures to text summaries. In Proceedings of ACL Workshop on Intelligent Scalable Text Summarisation, pp. 82–88.
- Nallapati et al. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290.
- O’Donnell (1997). Variable-length on-line document generation. In Proceedings of the 6th European Workshop on Natural Language Generation, pp. 82–91.
- Ono et al. (1994). Abstract generation based on rhetorical structure extraction. In Proceedings of the 15th Conference on Computational Linguistics, pp. 344–348.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Proskurnia et al. (2017). Predicting the success of online petitions leveraging multidimensional time-series. In WWW ’17 Proceedings of the 26th International Conference on World Wide Web, pp. 755–764.
- Schrimpf (2018). Using rhetorical topics for automatic summarization. Proceedings of the Society for Computation in Linguistics 1 (1), pp. 125–135.
- See et al. (2017). Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1073–1083.
- Subramanian et al. (2018). Content-based popularity prediction of online petitions using a deep regression model. In ACL 2018: 56th Annual Meeting of the Association for Computational Linguistics, Vol. 2, pp. 182–188.
- Yu et al. (2018). Transition-based neural RST parsing with implicit syntax features. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 559–570.