When composing multiple sentences into a paragraph, as in novels or academic papers, we often make design decisions in advance byrne1979teaching, such as how to introduce the topic and how to order the content, to ensure better coherence of the text. For instance, mckeown1985discourse; gopen1990science proposed effective patterns for scientific writing: a hypothesis first, followed by supporting sentences that validate the hypothesis, and lastly a concluding sentence. We call such a logical connection between sentences in a written paragraph a flow. A coherent flow between sentences requires an understanding of various factors, including tense, coreference, plans appelt1982planning; hovy1991approaches, scripts tomkins1978script, and several others. We focus on the paragraph-level plan between sentences.
In text planning, underlying relations in text are broadly categorized into two forms: an explicit human-defined relation (e.g., a discourse tree) reiter2000building or an implicitly learned latent relation yang2016hierarchical. While the former is defined and manually annotated based on linguistic theories, the latter is simply determined from how people in fact put sentences together. In this work, we provide an empirical comparison between a linguistically informed and a latent form of relations in the context of paragraph generation.
We compare the effectiveness of the two forms of relations using language modeling for paragraph generation. Due to the different characteristics of the two forms, we employ comparable but different components in addition to the base language model. For linguistic relations (e.g., discourse), we cast the problem into multi-task learning of supervised language modeling and discourse relation prediction. On the other hand, for latent relations, we learn an unsupervised hierarchical language model that is hierarchically conditioned by RNNs over linear operations between sentences.
We evaluate our models on a partial paragraph generation task: producing the rest of the text in a paragraph given some context. We observe that linguistically annotated discourse relations help produce more coherent text than the latent relations, followed by other baselines.
2 Related Work
A variety of NLG systems incorporate additional information between sentences appelt1982planning; reiter2000building; gatt2018survey, which can be broadly categorized into two forms: linguistic and latent.
Linguistic relations are explicitly represented as external labels in the form of predefined rules or plans, formats, knowledge bases, discourse parses, and more. Hovy1985IntegratingTP; hovy1990pragmatics; dalianis1996aggregation integrated text planning in generation, where the plans are considered in knowledge, formatted rules, and so forth. However, they are limited to small scale (i.e., few examples) and hand-written rules. kang2017detecting; gardent2017creating; kang18acl; Wang2018DescribingAK used an external knowledge base for micro-planning to generate the corresponding text, while our work focuses on comparing two forms of relations from the text itself.
moore1993planning; young1994dpocl utilized discourse structures such as rhetorical structure theory (RST) mann1988rhetorical for parsing a document. A script tomkins1978script is another structured representation that describes a typical sequence of events in a particular context. zhang2016variational; ji2014representation proposed better discourse parsers using neural networks. These prior works, however, used discourse representations to describe the structure of a paragraph, while we focus on the applicability of discourse relations to language generation.
Latent relations use implicit information in a document, such as the hierarchical structure of the document: lin2015hierarchical; chung2016hierarchical used a hierarchical RNN for modeling a document. Similarly, the hierarchical model can be extended to other variants such as attention yang2016hierarchical, the encoder-decoder framework serban2017hierarchical; sordoni2015hierarchical, auto-encoding li2015hierarchical, and multiscale chung2016hierarchical. However, the hierarchical recurrence over sentences, which is dependent on topics, is less likely to model the flow of a document.
We further summarize the fundamental differences between the two forms of relations in Appendix B.
3 FlowNet: Language Modeling with Inter-sentential Relations
We propose language models that incorporate each relation to capture a high-level flow of text.
3.1 Discourse-driven FlowNet
As a linguistic relation, we employ RST mann1988rhetorical trees to represent discourse connections in the text. For simplicity, we limit our use of the discourse trees to relations between adjacent phrases (the full discourse tree can be incorporated using other types of language model such as tai2015improved): relations are inserted between adjacent phrases and represented as a flattened sequence of phrases and relations. If two consecutive RST relations are given, the deeper level of relation is chosen. If the central elementary discourse unit (EDU) or phrase is after its dependent, the relation is excluded. We consider each sequence of the flattened discourse relations as a writing flow. For example, people often write a text by elaborating basic information (Elaboration) and then describing a following statement attributed to the information (Attribution).
We view discourse relations as additional labels to predict at the same time as we predict the next words in language modeling. Specifically, we propose to jointly train a model that predicts a sequence of words and a sequence of RST labels by taking advantage of shared representations, following previous sequence labeling problems such as named entity recognition collobert2011natural and part-of-speech tagging huang2015bidirectional. Note that the RST relations are only used during training to obtain better representations for the two tasks, but not at test time.
Figure 3(a) shows our FlowNet using discourse relations. Let a paragraph be a sequence of sentences $D = \{s_1, s_2, \ldots, s_M\}$. This model treats adjacent sentences as pairs for learning the standard seq2seq model. The first objective is to maximize the likelihood of the current sentence given the previous sentence. Hence, we maximize the following:
$$\mathcal{L}_{s2s} = \sum_{j=1}^{T_i} \log P(w_{i,j} \mid w_{i,<j}, s_{i-1})$$
where $s_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,T_i}\}$, and $T_i$ is the number of tokens of $s_i$.
To better guide the model with discourse context, we use the shared representations to predict RST relations at the same time. For each paragraph, we run the pre-trained RST parser ji2014representation and flatten the parse tree to obtain RST relations for each sentence $p_i = \{p_{i,1}, \ldots, p_{i,K_i}\}$, where $K_i$ is the number of discourse relations in $s_i$. We then make a label sequence over the tokens in the sentence by placing each $p_{i,k}$ at the first word of its EDU and filling up the rest with a null relation $o$: $y_i = (p_{i,1}, o, \ldots, o, p_{i,2}, o, \ldots, p_{i,K_i}, o, \ldots, o)$. We incorporate a sequence labeling objective by employing a conditional random field Lafferty2001ConditionalRF to find the label sequence that maximizes the score function for each sentence $s_i$:
$$S(s_i, y_i) = \sum_{j=1}^{T_i - 1} \left( W_{y_{i,j},\, y_{i,j+1}}^{\top} h_j + b_{y_{i,j},\, y_{i,j+1}} \right)$$
where $h_j$, $W$, and $b$ are the hidden representation of $w_{i,j}$, the weight matrix, and the bias vector corresponding to the pair of labels $(y_{i,j}, y_{i,j+1})$, respectively. For training, we maximize the conditional likelihood:
$$\mathcal{L}_{CRF} = S(s_i, y_i) - \log \sum_{y \in \mathcal{Y}} \exp S(s_i, y)$$
where $\mathcal{Y}$ represents all possible discourse label sequences. Decoding is done by greedily predicting the output sequence with maximum score. Both training and decoding can be computed using dynamic programming. The final objective is represented as the sum of two objective functions:
$$\mathcal{L}_{disc} = \mathcal{L}_{s2s} + \alpha \, \mathcal{L}_{CRF}$$
where $\alpha$ is a scaling parameter that controls the impact of the CRF objective. Its value is chosen empirically by searching on the validation set.
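The token-level labeling and the CRF objective described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `"o"` null symbol, the `(length, relation)` EDU format, and the toy dimensions are all hypothetical, and the score is assumed to be a sum of pairwise terms (a label-pair weight vector dotted with the hidden state, plus a label-pair bias).

```python
import math
from itertools import product

NULL = "o"  # assumed symbol for the null relation


def token_labels(edus, null=NULL):
    """Build a per-token label sequence from (edu_length, relation) pairs.

    The relation sits on the first token of its EDU; every other token
    (and every EDU without a relation) gets the null label.
    """
    labels = []
    for length, rel in edus:
        first = rel if rel is not None else null
        labels.extend([first] + [null] * (length - 1))
    return labels


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def pair_potential(h_j, W, b, a, c):
    """Score of moving from label a to label c at a token with hidden state h_j."""
    return sum(w * x for w, x in zip(W[a][c], h_j)) + b[a][c]


def sequence_score(H, W, b, y):
    """Sum of pairwise potentials over the T-1 adjacent label pairs."""
    return sum(pair_potential(H[j], W, b, y[j], y[j + 1])
               for j in range(len(y) - 1))


def log_partition(H, W, b, n_labels):
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    alpha = [0.0] * n_labels
    for j in range(len(H) - 1):
        alpha = [logsumexp([alpha[a] + pair_potential(H[j], W, b, a, c)
                            for a in range(n_labels)])
                 for c in range(n_labels)]
    return logsumexp(alpha)


def crf_log_likelihood(H, W, b, y, n_labels):
    """Conditional log-likelihood of label sequence y (integer label ids)."""
    return sequence_score(H, W, b, y) - log_partition(H, W, b, n_labels)


def joint_objective(l_s2s, l_crf, alpha):
    """Language-modeling objective plus the scaled CRF objective."""
    return l_s2s + alpha * l_crf
```

The forward recursion is what makes training tractable: it visits each adjacent label pair once per token instead of enumerating every label sequence, which is the dynamic programming the text refers to.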
3.2 Delta-driven FlowNet
In this model, we aim to utilize latent representations to characterize the flow between sentences. Specifically, we define delta, the subtraction of the hidden representations of adjacent sentences, as such latent information. Figure 3(b) shows how we hierarchically model different levels of information: words, sentences, and deltas.
Each word is encoded using an RNN encoder $g_{word}$. We take the last hidden representation of the words as the sentence embeddings $s_1, \ldots, s_M$. Similar to the hierarchical RNN lin2015hierarchical, each sentence representation is encoded using another RNN encoder $g_{sent}$. While the discourse flow provides explicit relation symbols, the delta flow calculates a latent relation by subtracting the previous representation $s_{i-1}$ from the current representation $s_i$ (our experiment includes a comparison among other types of linear operations between sentences, such as addition or a learnable function):
$$d_{i-1} = s_i - s_{i-1}$$
Given a sequence of $M-1$ delta relations $d_1, \ldots, d_{M-1}$ for a paragraph of $M$ sentences, we again encode them using another RNN encoder $g_{delta}$. The model takes the word, sentence, and delta information altogether to predict the next ($t$-th) word in the $m$-th sentence:
$$h = f(w_{m,t-1}, s_{m-1}, d_{m-2}), \qquad P(w_{m,t}) = \mathrm{softmax}(h)$$
where $w$ is a word representation, $s$ is a sentence representation, and $d$ is the delta information. Note that the sentence representation is from the previous sentence, and the delta information is calculated from the two previous sentences. If there is no previous information given, the parameters are randomly initialized.
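As a rough illustration of the delta flow, the sketch below computes deltas between adjacent sentence vectors and assembles the conditioning input for one decoding step. It rests on assumptions not stated in the text: vectors are plain Python lists, the three sources are combined by concatenation, and a caller-supplied `fallback` vector stands in for the randomly initialized parameters used when no previous sentence or delta exists. All names are hypothetical.

```python
def subtract(cur, prev):
    """Delta used in the main model: current minus previous sentence vector."""
    return [c - p for c, p in zip(cur, prev)]


def add(cur, prev):
    """Alternative linear operation compared in the experiments."""
    return [c + p for c, p in zip(cur, prev)]


def delta_sequence(sentence_vecs, op=subtract):
    """Return the M-1 deltas between adjacent sentence vectors."""
    return [op(sentence_vecs[i], sentence_vecs[i - 1])
            for i in range(1, len(sentence_vecs))]


def decoder_input(word_vec, sentence_vecs, deltas, m, fallback):
    """Conditioning input for a word in sentence m (0-based index).

    Uses the previous sentence vector and the delta computed from the two
    previous sentences; `fallback` is used when either is unavailable.
    Concatenation is an assumption made for this sketch.
    """
    prev_sentence = sentence_vecs[m - 1] if m >= 1 else fallback
    prev_delta = deltas[m - 2] if m >= 2 else fallback
    return list(word_vec) + list(prev_sentence) + list(prev_delta)
```

For a three-sentence paragraph, the third sentence is the first one that sees both a previous sentence vector and a real delta; earlier sentences fall back to the placeholder.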
4 Experiments
Due to the absence of a goal-oriented language generation task, we collect paragraph data and define a new task of generating the partial text of a paragraph given some context.
4.1 Data
We collect paragraphs from three different domains: Papers are paragraphs extracted from academic manuscripts in the computer science domain from PeerRead kang2018peerread, and Fantasy and SciFi are paragraphs of two frequent categories extracted from BookCorpus moviebook, where paragraphs are extracted using the line breaks in the dataset.
We only use paragraphs whose lengths are from 4 to 7, in order to measure how performance changes with paragraph length. The dataset is randomly split into training, validation, and test sets with a 0.9/0.05/0.05 ratio. Table 1 shows the number of paragraphs for each domain. All paragraphs are parsed into RST trees using the state-of-the-art discourse parser by ji2014representation.
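The length filter and the random split can be sketched as follows. This is a minimal illustration, assuming that a paragraph is a list of sentences and that length is counted in sentences; the function names and the fixed seed are hypothetical choices for reproducibility, not details from the paper.

```python
import random


def filter_by_length(paragraphs, lo=4, hi=7):
    """Keep paragraphs whose length (assumed: number of sentences) is in [lo, hi]."""
    return [p for p in paragraphs if lo <= len(p) <= hi]


def split_dataset(paragraphs, ratios=(0.9, 0.05, 0.05), seed=0):
    """Shuffle and split into train/valid/test lists by the given ratios."""
    items = list(paragraphs)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test
```

Giving the remainder to the test set avoids dropping paragraphs when the ratios do not divide the dataset size evenly.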
4.2 Bridging: Partial Paragraph Generation
We evaluate our models on a partial text generation task: given partial information (e.g., some sentences), produce the rest of the text.
| Inside the club we moved straight for the bar.  Devlin ordered a beer for himself and a glass of my favorite wine for me.  I love that I didn’t have to tell him what I wanted.  He knew me well and always thought about what I wanted or needed, in and out of bed.|
Figure 4 shows our bridging task. It requires generating the masked sentences in the middle of a paragraph given the first and the last sentences. If only the first sentence is given, the generation can be too divergent. The presence of the last sentence makes the generation more coherent and makes it converge to some point.
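A bridging instance can be derived from any paragraph by masking its middle sentences. The sketch below assumes a paragraph is a list of sentence strings; the function name and the dictionary layout are hypothetical, and the space-joined encoder input is just one simple way to concatenate the first and last sentences.

```python
def make_bridging_example(sentences):
    """Split a paragraph into context (first and last) and masked middle sentences."""
    if len(sentences) < 3:
        raise ValueError("bridging needs at least one sentence between first and last")
    first, last = sentences[0], sentences[-1]
    return {
        "context": (first, last),
        "encoder_input": first + " " + last,  # concatenated context
        "targets": sentences[1:-1],           # sentences the model must generate
    }
```

A paragraph of length 6 thus yields four target sentences, matching the four masked slots shown in the example table.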
We evaluate with one hard and one soft automatic metric, METEOR (M) banerjee2005meteor and VectorExtrema (VE) liu2016not, the latter calculated as the cosine similarity of averaged word embeddings pennington2014glove, as well as with human performance.
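An embedding-based soft metric of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: it uses a toy embedding dictionary rather than GloVe, and it offers two pooling functions, the averaging described above and an extrema pooling (keeping, per dimension, the value with the largest magnitude) that is commonly associated with the vector extrema name. Which pooling the reported scores use is not confirmed here, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_vector(tokens, emb):
    """Mean of the embeddings of known tokens, or None if no token is known."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]


def extrema_vector(tokens, emb):
    """Per dimension, keep the value with the largest absolute magnitude."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [max((v[d] for v in vecs), key=abs) for d in range(len(vecs[0]))]


def embedding_similarity(hyp_tokens, ref_tokens, emb, pool=average_vector):
    """Cosine similarity between pooled hypothesis and reference vectors."""
    h, r = pool(hyp_tokens, emb), pool(ref_tokens, emb)
    if h is None or r is None:
        return 0.0
    return cosine(h, r)
```

Unlike METEOR, a metric of this form gives credit to a generated sentence that uses different words from the reference but lies close to it in embedding space.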
4.3 Models and Setup
We compare various baseline seq2seq models that encode the context (the concatenated first and last sentences) and decode the intermediate words. S2S is an attentional seq2seq model bahdanau2014neural, and HS2S is a hierarchical version of S2S that combines two baselines: HRNN lin2015hierarchical, which hierarchically models sequences of words and sentences, and HRED serban2017hierarchical; sordoni2015hierarchical, which encodes the given context and decodes the words. FlowNet (delta/disc.) is our proposed language model with delta and discourse relations, respectively.
We find the best hyper-parameters on validation set using grid search. Here are the final parameters used: for batch size, for maximum sentence length, for word embedding size initialized by GloVe pennington2014glove, LSTM layer hochreiter1997long with size, clipping by , learning rate and decay rate with Adagrad duchi2011adaptive optimizer, and for the vocabulary size. The total number of distinct discourse relations is .
In Table 2, both discourse- and delta-driven FlowNet outperform the baseline models across most of the metrics, except for VectorExtrema on SciFi. In particular, as the training size increases (Papers < SciFi < Fantasy), the improvements gained from FlowNet become larger. This is probably because the model learns more information about the (discourse or latent) relations from the larger data.
|First: Satyrs never wear armor, including helmets, Newel began, using his hands expressively.|
|Last: Anyhow, as we actors were laying siege, a big chunk of the battlement dislodged from atop the tower.|
|Ref: [M1] ”But years ago I was in a play, and the helm was part of my costume. [M2] During the big battle scene, a few of us were assailing a castle. [M3] We had quite a set. [M4] The main tower must have been fifteen feet tall, fashioned from real stone.|
|Human: [M1] Actually he needed to wear any protectors to prevent him from a big accident. [M2] We planned to make a prank cam to make him wear those always. [M3] ”I have a good idea,” Newel kept talking continuously. [M4] ”Let’s play a role like we are under the attack.|
|S2S: [M1] he’s a good man [M2] the UNK, the one who’s a man who’s a man and the other [M3] and the other, the one who ’s a good friend [M4] he’s a good man|
|HS2S: [M1] i’m not sure that,” he said [M2] i’m not sure that i’m not sure [M3] i’m not sure that i’m not a fool [M4] ”i’m not sure that,” he said|
|FlowNet (delta): [M1] he’s a good man [M2] i’m not sure what to do [M3] i’m not sure that i’m not going to be a vampire [M4] he’s a good man|
|FlowNet (disc.): [M1] perhaps they were not quite good, but he was not a master, and they were the most powerful [M2] the only way to do not like a little, but i’ d been in the world [M3] ”you’re right,” he said ”i am not a fool you’re here [M4] you’re going to be a bit more than the other|
Table 6 shows a performance comparison among different delta operations: subtract, add, and mlp, which is a multi-layer perceptron network. All scores are macro-averaged across datasets. While add shows good performance on METEOR, subtract does on the soft metric (i.e., VecExt), indicating that subtraction can help the model capture better semantics than the other functions. Figure 6 shows how performance changes on Fantasy as the paragraph length increases. Both FlowNet models achieve larger improvements when generating longer paragraphs. In particular, discourse relations achieve the best performance at lengths 6 and 7.
We conduct a comparison with human performance (see Figure 9). We randomly choose 100 samples per dataset and per paragraph length and ask an annotator to perform the bridging task on the final 1,000 samples. Humans outperform the models by large margins. FlowNet with discourse relations outperforms FlowNet with latent relations and the other baselines by a large margin. As the paragraph length increases or the model is trained on more data, discourse relations become more useful.
Table 3 shows an example paragraph with text produced by the models as well as the reference and the human annotation. Given only the partial context (i.e., the first and last sentences), the bridging task is very challenging even for humans. The reference sentences and the human annotations are indeed semantically very different. Among the latent models, FlowNet (delta) produces a more coherent flow of text than S2S and HS2S. Surprisingly, FlowNet (discourse) generates more diverse sentences with some coherence, because each sentence is generated based on the representation conditioned on the predicted RST discourse relation.
5 Conclusion and Discussion
We explore two forms of inter-sentential relations: linguistic relations such as discourse relations, and latent representations learned from the text. The proposed models for both relations achieve significant improvements over the baselines on the partial paragraph generation task.
Despite the empirical effectiveness and difference between the linguistic and latent relations, they are not directly aligned for comparison. A potential direction for future study is to directly couple them together and see whether one form contains the other, or vice versa. Another direction is to check their effectiveness on top of the recent pre-trained language models.
We also thank Jason Weston, Dan Jurafsky, and anonymous reviewers for their helpful comments.
Appendix A Details on data processing
For each dataset, we preprocess each paragraph as follows:
Sentences whose length is less than 5 or greater than 25 are filtered out to remove sentences that are too short or too long.
Due to too much noise in the News and Papers corpora, we apply somewhat aggressive filters. Paragraphs whose last sentence ends with all-capital words are removed to filter out articles with the reporter’s name or other meta information (e.g., location, press name). Paragraphs whose last sentence does not end with a sentence-ending mark (e.g., “.”, “!”, “?”) are also filtered out.
If any adjacent sentences in a paragraph are identical, we exclude the paragraph. All duplicate paragraphs are also removed.
We ignore paragraphs that fail to be parsed by our discourse (i.e., RST) parser. The details of the parsing are described in the next section. During RST parsing, some Stanford dependency parses contain an UNK token that is mismatched with our tokenizer (i.e., nltk’s word tokenizer). We also ignore such cases (only 1.5% of the entire dataset).
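The text-only filters above (everything except the parser-dependent step) can be sketched as a small pipeline. This is an illustrative approximation with several assumptions: whitespace splitting stands in for the nltk tokenizer, sentence length is counted in tokens, "ends with all-capital words" is read as the last word of the last sentence being fully uppercase, and the order in which the filters run is a guess. All function names are hypothetical.

```python
END_MARKS = (".", "!", "?")
PUNCT = ".,!?;:\"'()"


def keep_sentence(sentence, min_len=5, max_len=25):
    """Keep sentences with between min_len and max_len whitespace tokens."""
    return min_len <= len(sentence.split()) <= max_len


def ends_with_capital_word(sentence):
    """True if the last alphabetic word is fully uppercase (e.g., a press name)."""
    for word in reversed(sentence.split()):
        core = word.strip(PUNCT)
        if core:
            return len(core) > 1 and core.isalpha() and core.isupper()
    return False


def keep_paragraph(sentences):
    """Apply the paragraph-level filters to an already sentence-filtered paragraph."""
    if not sentences:
        return False
    last = sentences[-1].strip()
    if not last.endswith(END_MARKS):
        return False
    if ends_with_capital_word(last):
        return False
    if any(a == b for a, b in zip(sentences, sentences[1:])):
        return False
    return True


def preprocess(paragraphs):
    """Filter sentences, then paragraphs, then drop duplicate paragraphs."""
    seen, kept = set(), []
    for para in paragraphs:
        sentences = [s for s in para if keep_sentence(s)]
        key = tuple(sentences)
        if keep_paragraph(sentences) and key not in seen:
            seen.add(key)
            kept.append(sentences)
    return kept
```

Because sentence filtering runs first in this sketch, two paragraphs that differ only in a dropped short sentence are treated as duplicates; a different filter order would change that behavior.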
Appendix B Theoretical difference between linguistic and latent relations
We briefly summarize the fundamental differences of the two relation forms:
While labels in linguistic relations are interpretable, accuracy of the labels highly depends on the performance of the discourse parser. On the other hand, latent representations with delta operation do not suffer from out-of-domain or accuracy problems that external parsers may bring in.
Linguistic relations can hold over long distances with many things in between (e.g., Solutionhood), while the latent ones are always immediately adjacent.
Linguistic ones are fairly coarse-grained and non-continuous, often making them inapplicable to other continuous models (e.g., a neural network), while latent ones are by definition continuous, always making them applicable.
Linguistic relations are often ambiguous or unclear, while latent ones can easily hybridize and represent two or more relations at the same time, at the cost of being indefinable.