A bird's-eye view on coherence, and a worm's-eye view on cohesion

11/01/2018 ∙ by Woon Sang Cho, et al. ∙ Princeton University ∙ Microsoft

Generating coherent and cohesive long-form texts is a challenging problem in natural language generation. Previous works relied on a large amount of human-generated texts to train language models; however, few attempted to explicitly model the desired linguistic properties of natural language text, such as coherence and cohesion. In this work, we train two expert discriminators for coherence and cohesion, respectively, to provide hierarchical feedback for text generation. We also propose a simple variant of policy gradient, called 'negative-critical sequence training', using margin rewards, in which the 'baseline' is constructed from randomly generated negative samples. We demonstrate the effectiveness of our approach through empirical studies, showing significant improvements over the strong baseline -- an attention-based bidirectional MLE-trained neural language model -- in a number of automated metrics. The proposed discriminators can serve as baseline architectures to promote further research on better extracting and encoding essential linguistic qualities, such as coherence and cohesion.




1 Introduction

The terms coherence and cohesion in linguistics are commonly defined as follows (Williams and Colomb, 1995).

Cohesion: sentence pairs fitting together the way two pieces of a jigsaw puzzle do.
Coherence: what all the sentences in a piece of writing add up to, the way all the pieces in a puzzle add up to the picture on the box.

In layman’s terms, cohesion indicates that two consecutive sentences are locally well-connected, and coherence indicates that multiple sentences globally hold together.

Generating cohesive and coherent natural language texts that span multiple sentences is a challenging task mainly due to two reasons. First, there is no principled way of modeling cross-sentence linguistic properties, such as cohesion and coherence of a text. Second, there is no widely accepted metric of evaluating the quality of the generated text in terms of cohesion and coherence.

Most state-of-the-art approaches to natural language generation (NLG) rely on a large amount of human-generated texts to train neural language models (Cho et al., 2014; Graves, 2013; Sutskever et al., 2014). Although these models can generate sentences that, if judged individually, are similar to human-generated ones, they often fail to capture the local and global dependencies among sentences, resulting in text that is neither coherent nor cohesive. For example, neural language models based on Recurrent Neural Networks (RNNs) are widely applied to response generation for dialogue (Vinyals and Le, 2015; Shang et al., 2015; Sordoni et al., 2015; Li et al., 2015). Although the responses look reasonable by themselves, they are either bland, such as “I don’t know”, or incoherent with the whole dialogue session. See Gao et al. (2018) for a comprehensive survey.

In this paper, we strive to address the challenge in a principled manner. We propose a pair of discriminators to score whether and to what extent a text is coherent or cohesive, respectively. The coherence discriminator measures the compatibility among all sentences in a generated text using sentence-level features, thus providing a bird’s-eye view on the text. The cohesion discriminator, on the other hand, measures the compatibility of each pair of consecutive sentences using only word-level features, thus providing a worm’s-eye view on the text. Given a conditional input text and multiple candidate output texts, each model learns to score the candidates with respect to its criterion by optimizing a pairwise ranking loss. These scores are then used as reward signals to train an RNN-based language model to generate (more) coherent and cohesive texts.

Our main contributions are three-fold: (1) we propose two linguistic discriminators for measuring coherence and cohesion of a text, respectively; (2) we present a simple yet efficient training mechanism to encode these linguistic properties; and (3) we propose negative-critical sequence training, a variant of policy gradient method, which uses negative samples to construct its reward baseline.

To the best of our knowledge, this paper is the first attempt to explicitly capture cross-sentence linguistic properties, i.e., coherence and cohesion, for long text generation. Despite the encouraging initial results, we only scratched the surface of the problem. The proposed method is yet to be significantly improved to meet the ultimate goal of generating meaningful and logical long-form texts. We cast the text generation as an RL problem and review recent work in Section 2, and detail our approach in Section 3.

Source sentences:
the hotel inglaterra delivered as promised . the staff was welcoming and spoke good english . the cleaning staff did a very good job every day . the rooms were spotless and very modern . the bathroom was large and had a very nice shower , and there were two generously sized bath towels that were twice the size of normal towels .

MLE-trained baseline model:
the breakfast in the morning was delicious and very good . it was the only hotel where i slept very well . the staff was very helpful in late afternoon or late times . the breakfast was adequate , with a decent range of cereals , fruit , and fruits . there is also free use of the coffee in the reception area .

Discriminator-guided model:
the breakfast was plentiful including fresh breads and cooked to order . the location was fantastic . it is in the north of the marina and in a very short distance . the marina has a small swimming pool with sitting area and a small gym . they are very popular and guests have an evening reception which is very nice .

Table 1: Sample generations from our MLE-trained baseline model and our discriminator-guided model. The baseline output exhibits a common problem: it contains a repetition and, as a review, an inconsistent opinion. In contrast, our discriminator-guided model is able to generate a more interesting and sentiment-consistent continuation.

2 Related work

A word sequence generation task can be framed as a reinforcement learning (RL) problem, in which the generator $G$ acts as a policy $\pi_\theta$ with parameters $\theta$, and each generated word at time $t$, $w_t$, can be viewed as an action chosen by the policy from a large discrete space, or vocabulary, conditioned on state $s_{t-1}$, which encodes the previously generated text sequence.

Let $r_t$ be the reward for a partially generated text sequence $w_1, \dots, w_t$. We define the long-term expected reward $J(\theta) = \mathbb{E}_{s_0 \sim q,\, \pi_\theta}\left[\sum_{t \ge 1} \gamma^{t-1} r_t\right]$, where $\gamma$ is the discount factor and $q$ is the initial distribution of conditional input texts. Following Sutton et al. (1999), the gradient of $J$ with respect to $\theta$ is

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta(\cdot \mid s)}\left[ Q^{\pi}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],$$

where $\rho^{\pi}$ is the stationary distribution and $Q^{\pi}(s, a)$ is the expected return from state $s$ and taking action $a$, both following policy $\pi_\theta$. For brevity, we omit the derivation. In our work, we formulate text generation as an episodic RL problem with episode length $L$, rewards $r_L$ being available only at the end of the episode, and $\gamma = 1$.

There are many works on training neural language models using reward signals, such as Ranzato et al. (2015) and Paulus et al. (2017). These works directly optimize for specific metrics, such as BLEU (Papineni et al., 2002) or ROUGE (Lin and Hovy, 2003), using REINFORCE (Williams, 1992; Sutton et al., 1999). However, it is well known that these metrics do not give a complete picture of the quality of the generated text. Only recently have there been efforts to provide more relevant quality objectives for which to optimize, targeting qualities of interest such as consistency and repetition of text (Li et al., 2015, 2016a; Holtzman et al., 2018). But these works use the objective function to re-rank candidate outputs, not to reward or penalize outputs when they are generated in the first place. Li et al. (2016b) constructed a set of reward models, such as information flow and semantic coherence, to tune the generator, yet they do not provide an ablation study to elaborate the relative contribution of each reward model.

Another line of research is to use Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) to incorporate feedback signals for text generation (Yu et al., 2017; Lin et al., 2017; Zhang et al., 2017c; Guo et al., 2017; Fedus et al., 2018; Zhang et al., 2018). However, the discriminators in these works are trained to distinguish real texts from generated ones, operating as black boxes rather than providing fine-grained feedback on particular linguistic aspects of the texts. Yang et al. (2018) have partially addressed this issue by using a trained language model as the discriminator. Although that discriminator provides fine-grained feedback at the word level, it does not critique many important linguistic properties of generated texts, such as cohesion and coherence.

These text generators, when facing a long-form text generation task that spans multiple sentences, are by no means perfect and often exhibit critical errors, such as a breakdown of local connections between consecutive sentences (cohesion), let alone globally solid intent (coherence). As a result, readers can easily pick up on these cues and distinguish such generated texts from real texts. In this paper we argue that the primary reason is the lack of an effective mechanism for measuring and controlling text quality in the generation process. The method we propose in the next section is intended to address the problem.

3 Model

We assume that global coherence of a text depends to a large degree upon how its individual sentences with different meanings are organized. So we focus our evaluation of coherence solely on the sentence-level. If the meanings of sentences are not organized properly, we have difficulty in picking up the intent of the paragraph as a whole, regardless of seamless local connectivity between consecutive sentences.

This is not to say that local connections between any two sentences should be overlooked. One can easily distinguish a model-generated sentence from a real one, simply by looking at whether the sentence is followed by another sentence logically, regardless of their grammar.

We instill these two different yet important concepts in two discriminators, operating on the sentence level and word level, respectively. Our models closely resemble successful models for computer vision, such as StackGAN (Zhang et al., 2017a, b) and PatchGAN (Isola et al., 2017), in that they all provide hierarchical signals to their corresponding generators, where the signals are derived from raw low-level data. We call the sentence-level discriminator the coherence discriminator $D_{\text{coherence}}$, and the word-level discriminator the cohesion discriminator $D_{\text{cohesion}}$.

Figure 1: Illustration of coherence and cohesion discriminators. The coherence discriminator $D_{\text{coherence}}$ takes in bag-of-words sentence embeddings as inputs, and the cohesion discriminator $D_{\text{cohesion}}$ takes in the raw word embeddings of consecutive sentences as inputs.

3.1 Coherence discriminator: $D_{\text{coherence}}$

This discriminator measures how likely two text chunks form a coherent paragraph. Let $S$ be the source text chunk that consists of $n$ sentences, $T$ be the real target text chunk that consists of $m$ sentences, and $\tilde{T}$ be the model-generated target text chunk that consists of $\tilde{m}$ sentences. $D_{\text{coherence}}$ is designed to distinguish a real pair $(S, T)$ from a synthetic pair $(S, \tilde{T})$ by assigning them different scores, i.e., $D_{\text{coherence}}(S, T) > D_{\text{coherence}}(S, \tilde{T})$.

Design. Our design of the coherence discriminator $D_{\text{coherence}}$ is inspired by the Deep Semantic Similarity Model (DSSM) (Huang et al., 2013; Gao et al., 2014; Xu et al., 2017). Given source text chunk $S$ and target text chunk $T$, $D_{\text{coherence}}$ computes their coherence score in two steps, as illustrated in Figure 1. First, a pair of convolutional networks (CNNs) [1] are applied to encode both $S$ and $T$ into two low-dimensional continuous vectors, respectively. Second, the coherence score is computed as the cosine similarity of the two vectors. The score is a real value between $-1$ and $1$, where $1$ indicates the maximal coherence score, and $-1$ the minimal coherence score.

[1] We explored deeper networks. However, the performance difference was marginal. For simplicity, we decided to use a 1-layer convolutional network architecture (Kim, 2014; Collobert et al., 2011).

The coherence discriminator $D_{\text{coherence}}$ measures how likely a source chunk $S$ and a target chunk $T$ add up to a single coherent passage. The score depends on the parameters of the two CNNs, or in other words, how $S$ and $T$ are encoded. Since we focus solely on the sentence-level features, we view a text chunk as a sequence of sentences, and view each sentence as a bag-of-words. Therefore, we represent each word using its pre-trained word embedding vector (Pennington et al., 2014) and represent each sentence using a vector which takes the average of its word embedding vectors. A text chunk is then represented as a sequence of sentence vectors which are fed to the CNN (either the source network or the target network in Figure 1). The parameters of the two CNNs are optimized in such a way that a real pair scores higher than a synthetic pair: $D_{\text{coherence}}(S, T) > D_{\text{coherence}}(S, \tilde{T})$.
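The two-step scoring described above (bag-of-words sentence vectors, then a cosine similarity between the encoded chunks) can be illustrated with a minimal sketch. This is not the paper's implementation: the CNN encoders are replaced here by a plain mean over sentence vectors, and `toy_embeddings`, `encode_chunk`, and `coherence_score` are hypothetical names introduced only for illustration.

```python
import math

def sentence_vector(sentence, embeddings, dim):
    """Bag-of-words sentence vector: the average of its word embedding vectors."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def encode_chunk(chunk, embeddings, dim):
    """Assumed stand-in encoder: mean over sentence vectors (the paper uses a CNN)."""
    sent_vecs = [sentence_vector(s, embeddings, dim) for s in chunk]
    return [sum(v[i] for v in sent_vecs) / len(sent_vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity, a real value in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def coherence_score(source, target, embeddings, dim):
    """Score a (source chunk, target chunk) pair; chunks are lists of sentences."""
    return cosine(encode_chunk(source, embeddings, dim),
                  encode_chunk(target, embeddings, dim))

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "room": [1.0, 0.0], "clean": [0.8, 0.2],
    "beer": [0.0, 1.0], "stale": [0.1, 0.9],
}
related = coherence_score(["room clean"], ["clean room"], toy_embeddings, 2)
unrelated = coherence_score(["room clean"], ["beer stale"], toy_embeddings, 2)
```

With these toy vectors the related pair scores higher than the unrelated one; in the trained model that ordering comes from the learned CNN parameters rather than from the raw embeddings.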

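Formally, the task of optimizing the coherence discriminator can be stated as follows. Given a set of training samples of the form $(S, T, \tilde{T})$, we optimize the parameters $\theta$ by minimizing the pairwise rank loss on training data defined as

$$L(\theta) = \sum_{(S, T, \tilde{T})} \ell_\theta(S, T, \tilde{T}),$$

where $\ell_\theta$ is a loss function, differentiable w.r.t. $\theta$.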
In the following, we describe in turn how we construct the pairwise training samples and the form of the loss function. Since the coherence discriminator is used as a pairwise ranker, we employ the metrics commonly used in information retrieval for evaluation, such as recall at K (R@K), which is defined as the fraction of queries for which the correct item is identified in the top-K retrieved list (Baeza-Yates and Ribeiro-Neto, 1999). We present the retrieval results on test data in Table 2.
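As a concrete reading of recall at K, the sketch below counts a query as a hit when its positive candidate ranks within the top K by score. The convention that index 0 of each score list holds the positive candidate, and the function name `recall_at_k`, are assumptions made for illustration.

```python
def recall_at_k(score_lists, k):
    """Fraction of queries whose positive candidate is ranked in the top k.

    score_lists: one list of scores per query; by assumption, index 0 holds
    the score of the positive candidate and the rest are negatives.
    """
    hits = 0
    for scores in score_lists:
        positive = scores[0]
        # Zero-based rank: number of negatives scoring strictly higher.
        rank = sum(1 for s in scores[1:] if s > positive)
        if rank < k:
            hits += 1
    return hits / len(score_lists)

# Three hypothetical queries, each with one positive and two negatives.
example_scores = [[0.9, 0.1, 0.2], [0.3, 0.5, 0.1], [0.1, 0.5, 0.6]]
```

Ties are resolved in favor of the positive candidate here; a stricter evaluation could count tied negatives as ranked above it.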

                  TripAdvisor Target Sentences Retrieval    Yelp Target Sentences Retrieval
Discriminators    R@1     R@5     R@10                      R@1     R@5     R@10
Coherence         0.18    0.43    0.60                      0.33    0.61    0.74
Cohesion          0.12    0.28    0.43                      0.14    0.33    0.47

Table 2: Retrieval ratios for coherence and cohesion discriminators from a collection of 100 negative candidates. The reported numbers are averages over 20 evaluations.

Training mechanism. How do we construct this list of candidate target sentences, given the source sentences $S$? We assume that the target $T$ that follows a source $S$ in the data is a positive target sample, or the correct item to retrieve. Negative samples are constructed using three different methods within a batch while iterating through the training data, motivated by Wieting et al. (2016):

  • The first method is to simply rotate the targets $T$ with the sources $S$ fixed in a batch. For a single $S$, this method yields $B - 1$ negative samples, where $B$ is the batch size.

  • The second method is to shuffle the sentence order within each $T$ once into an order different from the original, known as a derangement, to break coherence. This yields one negative sample.

  • Lastly, we combine the previous two methods: rotate $T$ across a batch and shuffle sentences within each rotated $T$, yielding $B - 1$ negative samples.
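The three construction methods listed above can be sketched as follows. This is a minimal illustration under stated assumptions: a batch is a list of target chunks (each a list of sentences), a derangement is taken to mean that no sentence keeps its original position, and `derangement` and `build_negatives` are hypothetical helper names.

```python
import random

def derangement(items, rng):
    """Return a reordered copy in which no item keeps its original position."""
    if len(items) < 2:
        raise ValueError("a derangement needs at least two items")
    indices = list(range(len(items)))
    while True:
        perm = indices[:]
        rng.shuffle(perm)
        if all(i != p for i, p in zip(indices, perm)):
            return [items[p] for p in perm]

def build_negatives(targets, index, rng):
    """Negative targets for the source at position `index` in the batch."""
    batch_size = len(targets)
    # Method 1: rotate targets across the batch -> B - 1 negatives.
    rotated = [targets[(index + shift) % batch_size]
               for shift in range(1, batch_size)]
    # Method 2: derange the sentences of the true target -> 1 negative.
    shuffled_own = [derangement(targets[index], rng)]
    # Method 3: rotate and derange -> B - 1 negatives.
    rotated_shuffled = [derangement(t, rng) for t in rotated]
    return rotated + shuffled_own + rotated_shuffled

batch_targets = [
    ["a1", "a2", "a3"], ["b1", "b2", "b3"],
    ["c1", "c2", "c3"], ["d1", "d2", "d3"],
]
negatives = build_negatives(batch_targets, 0, random.Random(0))
```

For a batch of size B this produces 2B - 1 negatives per source, so a batch of four yields seven candidates to rank against the single positive.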

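These $2B - 1$ negative samples and a single positive sample, in sum, pose a significant challenge in learning. To fit this training task into a ranking framework, we optimize over

$$\ell_\theta = \max\left(0,\ \delta - D_{\text{coherence}}(S, T) + \text{AVG}^{\lambda}\left(\{ D_{\text{coherence}}(S, \tilde{T}_i) \}_{i=1}^{2B-1}\right)\right),$$

where $\delta$ is the margin and $\text{AVG}^{\lambda}$ is the weighted arithmetic mean parametrized by $\lambda$: $\text{AVG}^{\lambda}(\{x_i\}) = \sum_i w_i x_i$ and $w_i = e^{\lambda x_i} / \sum_j e^{\lambda x_j}$.

In our experiments, we fix $\lambda$ at a positive value, which assigns more weight to a more challenging negative sample [2]. Notice that $\text{AVG}^{\lambda}$ is the mean if $\lambda = 0$, and approaches the max as $\lambda \to \infty$. Empirically, training the models using the weighted mean resulted in faster convergence, as opposed to using the single most challenging negative sample score ($\lambda \to \infty$) or the mean over all negative samples in the batch.

[2] We performed a coarse grid search over the values of $\lambda$, and the chosen setting resulted in fast convergence to high recall scores on the dev dataset.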
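To make the behavior of the weighted mean concrete, the sketch below assumes softmax weights (which give the plain mean at zero and approach the max as the parameter grows, as stated above) and a hinge-style margin ranking loss. The names `weighted_mean` and `ranking_loss`, and the `margin` values used, are illustrative assumptions rather than the paper's settings.

```python
import math

def weighted_mean(scores, lam):
    """Softmax-weighted arithmetic mean of negative-sample scores.

    lam = 0 gives the plain mean; a large lam approaches max(scores).
    """
    peak = max(scores)
    # Subtracting the peak keeps the exponentials numerically stable.
    weights = [math.exp(lam * (s - peak)) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

def ranking_loss(positive_score, negative_scores, lam, margin):
    """Hinge loss: zero once the positive beats the weighted mean by `margin`."""
    return max(0.0, margin - positive_score + weighted_mean(negative_scores, lam))

negative_scores = [0.1, 0.5, 0.9]
plain_mean = weighted_mean(negative_scores, 0.0)
near_max = weighted_mean(negative_scores, 1000.0)
```

Intermediate values of the parameter interpolate between the two extremes, so harder negatives (those scoring close to the positive) contribute more to the loss without a single outlier dominating it.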

3.2 Cohesion discriminator: $D_{\text{cohesion}}$

Our second discriminator pays attention only to low-level features, such as the grammar of each sentence and the logical flow between any two consecutive sentences. These two aspects combined, along with other linguistic aspects that we do not mention, heavily influence readability.

For simplicity, the cohesion discriminator $D_{\text{cohesion}}$ is similar to the coherence discriminator $D_{\text{coherence}}$, except that its architecture, input, and negative sample construction are modified to encode cohesion between any pair of sentences on the word level. A single input sample to $D_{\text{cohesion}}$ is a pair of two consecutive sentences, $s_k = [w_k^1, w_k^2, \dots]$ and $s_{k+1} = [w_{k+1}^1, w_{k+1}^2, \dots]$, where $w_k^i$ denotes the $i$-th word in sentence $s_k$. We construct the negative samples using the same three methods as for training $D_{\text{coherence}}$, except that shuffling occurs on the word level within each sentence, rather than on the sentence level across multiple sentences.

3.3 Generator: $G$

The two pre-trained discriminators, $D_{\text{coherence}}$ and $D_{\text{cohesion}}$, are used to modify the text generation behavior of the generator $G$. $G$ is an attention-based bidirectional sequence-to-sequence model. It is initially pre-trained by maximizing the word-level likelihood given the training data, and we denote this model as $G_{\text{MLE}}$.

However, the texts generated by $G_{\text{MLE}}$ often do not meet the standards of the two discriminators. Therefore, we need to change the text generation behavior of $G$ with respect to the criteria. To this end, the scores from the discriminators are used as direct reward or penalty signals to modify the parameters of $G$. Given these signals, we use our proposed variant of the policy gradient, negative-critical sequence training, to update its parameters and generate (more) coherent and cohesive texts. We discuss the details in the next section.

4 Negative-Critical Sequence Training

Actor-critic methods (Barto et al., 1983; Witten, 1977) parameterized by neural networks typically require learning a separate critic network to estimate the expected future reward as a baseline, which in many cases is a difficult task by itself. In NLP, similar practices and challenges have been reported by Ranzato et al. (2015), Bahdanau et al. (2016), and Nguyen et al. (2017). However, Rennie et al. (2017) recently proposed an effective self-critical sequence training (SCST) mechanism that avoids learning a separate critic network. Similarly, our method does not require learning a separate critic network; instead, we directly use the scores of negative samples assigned by the discriminators as the baseline.

For an arbitrary pair of a source $S$ and $T_{\text{gen}}$, which is the generator’s output conditioned on $S$, we compute the coherence and cohesion scores by calling $D_{\text{coherence}}(S, T_{\text{gen}})$ and $D_{\text{cohesion}}(T_{\text{gen}})$, respectively. Since each review consists of multiple sentences, the overall cohesion score $D_{\text{cohesion}}(T_{\text{gen}})$ is computed as the average of the scores of all consecutive sentence pairs. These scalar scores, however, have no direct interpretation, since the discriminators are trained by optimizing a margin ranking loss. Instead, the differences between positive sample scores and the maximal or average negative sample scores provide insight into how well the models can distinguish between the positives and the negatives. Therefore, these margins can be considered as rewards with baselines, and thus we define the reward functions as:

$$R_{\text{coherence}}(S, T_{\text{gen}}) = D_{\text{coherence}}(S, T_{\text{gen}}) - \mathbb{E}_{\tilde{T}}\left[ D_{\text{coherence}}(S, \tilde{T}) \right],$$
$$R_{\text{cohesion}}(T_{\text{gen}}) = D_{\text{cohesion}}(T_{\text{gen}}) - \mathbb{E}_{\tilde{T}}\left[ D_{\text{cohesion}}(\tilde{T}) \right],$$

where $\tilde{T}$ denotes a negative sample for a given source condition, and the two expectations are computed by averaging over an ensemble of negative samples. Notice that this reward resembles the ranking loss we use to train our discriminators, except that our baseline is an average score (instead of the weighted arithmetic mean) over negative samples. The rationale for this difference is that the maximal or the weighted arithmetic mean score is too noisy a baseline to be used for rewards, because the best randomly constructed negative sample may happen to be a formidably good sample. To alleviate such noise, we use the average discriminator score of negative samples as the baseline, and this turns out to be an empirically better alternative.
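The margin reward described above can be sketched in a few lines: the discriminator score of the generated text minus the plain average score of the negative samples, with the cohesion score itself averaged over consecutive sentence pairs. The function names and the example numbers below are hypothetical and serve only to illustrate the arithmetic.

```python
def average(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def margin_reward(generated_score, negative_scores):
    """Reward = score of the generated text minus the mean negative-sample score."""
    return generated_score - average(negative_scores)

def overall_cohesion(pair_scores):
    """Overall cohesion: average score over all consecutive sentence pairs."""
    return average(pair_scores)

def combined_reward(coherence_reward, cohesion_reward, weight=0.5):
    """Weighted combination of the two rewards; 0.5 weighs them equally."""
    return weight * coherence_reward + (1.0 - weight) * cohesion_reward

# Hypothetical discriminator scores, for illustration only.
coherence_r = margin_reward(0.4, [0.1, 0.2, 0.3])
cohesion_r = margin_reward(overall_cohesion([0.2, 0.4]), [0.1, 0.1])
total_r = combined_reward(coherence_r, cohesion_r)
```

A generation that scores no better than a typical negative sample receives a reward near zero, and one that scores worse is penalized, which is the role a learned critic baseline would otherwise play.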

Finally, we use policy gradient (Williams, 1992; Sutton et al., 1999) to maximize a weighted combination of the coherence and cohesion rewards. For illustrative purposes, we weigh them equally for updating our policy, i.e., the generator $G$.

5 Experiments

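In this section, we show results of training both the coherence discriminator $D_{\text{coherence}}$ and the cohesion discriminator $D_{\text{cohesion}}$, and compare our three RL-tuned generators (tuned with the coherence reward, the cohesion reward, and both) with the baseline model $G_{\text{MLE}}$. We argue that through the use of feedback from our simple discriminators to the generator $G$, the quality of text generations improves significantly. See Table 1 for a comparison.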

Dataset. We use the publicly available TripAdvisor hotel reviews dataset collected by Wang et al. (2010) and the Yelp review dataset (https://www.yelp.com/dataset). We consider only subsets of the two review datasets satisfying the following two conditions: (1) a review must have at least 10 sentences, and (2) each sentence must have more than 5 and fewer than 30 words. This yields roughly 60,000 TripAdvisor reviews and 220,000 Yelp reviews, split by ratio into train/dev/test sets. We merge the source and target vocabularies, and limit the result to the 50,000 most frequent words, excluding special tokens. For each of these reviews, as in Holtzman et al. (2018), we consider the first five sentences as the source input to the generator $G$, and the following five sentences as the target output from $G$.

Evaluation metrics. It is widely known that there is no accurate metric to evaluate the generator. Nevertheless, we report scores of standard metrics, such as negative log-likelihood (NLL), perplexity (PPL), BLEU, and the proportion of unique $n$-grams within a single generation (intra-unique-$n$) and across generations (inter-unique-$n$), as in Gu et al. (2018). Results are shown in Table 3.
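The two diversity metrics can be illustrated with a short sketch. It assumes whitespace tokenization and defines the proportion as distinct n-grams divided by total n-grams, averaged per generation for the intra variant and pooled over all generations for the inter variant; the exact definition in Gu et al. (2018) may differ in detail, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unique_ratio(grams):
    """Distinct n-grams divided by total n-grams (0.0 if there are none)."""
    return len(set(grams)) / len(grams) if grams else 0.0

def intra_unique(generations, n):
    """Average proportion of unique n-grams within each single generation."""
    ratios = [unique_ratio(ngrams(text.split(), n)) for text in generations]
    return sum(ratios) / len(ratios)

def inter_unique(generations, n):
    """Proportion of unique n-grams pooled across all generations."""
    pooled = [g for text in generations for g in ngrams(text.split(), n)]
    return unique_ratio(pooled)

# Two identical hypothetical generations: diverse within, repetitive across.
samples = ["the room was clean", "the room was clean"]
```

The example shows why both views are reported: each sample has no internal repetition, yet the pooled ratio is halved because the generator produced the same text twice.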

TripAdvisor
Model         NLL    PPL    BLEU-3  BLEU-4  BLEU-5  intra-unique-n  inter-unique-n  length ratio
(baseline)    0.86   2.36   0.38    0.19    0.08    0.66   0.93     0.40   0.72     1.08
(RL-tuned)    0.77   2.18   0.46    0.27    0.14    0.64   0.94     0.38   0.71     0.97
(RL-tuned)    0.80   2.24   0.44    0.25    0.12    0.64   0.94     0.39   0.72     1.06
(RL-tuned)    0.80   2.25   0.44    0.24    0.12    0.65   0.94     0.40   0.72     1.02

Yelp
Model         NLL    PPL    BLEU-3  BLEU-4  BLEU-5  intra-unique-n  inter-unique-n  length ratio
(baseline)    1.32   3.84   0.37    0.17    0.07    0.68   0.95     0.54   0.86     1.07
(RL-tuned)    1.26   3.65   0.44    0.23    0.11    0.69   0.95     0.55   0.86     1.05
(RL-tuned)    1.24   3.56   0.44    0.23    0.11    0.69   0.95     0.55   0.87     1.00
(RL-tuned)    1.25   3.59   0.43    0.22    0.11    0.69   0.95     0.56   0.88     1.05

Table 3: An ablation study with automated evaluation metric scores: NLL, PPL, BLEU-$n$, intra/inter-unique-$n$, along with the length ratio with the length of corresponding true target sentences as 1. Results show that our proposed discriminators helped improve notably in BLEU scores, NLL and PPL, with marginal difference in diversity. We used equally weighted rewards, and the best numbers are highlighted in bold before rounding.

5.1 Implementation details

The generator $G$ takes individual words as inputs and embeds them into pre-trained 300-dimensional word vectors from GloVe (Pennington et al., 2014). This embedding layer is fixed throughout training.

$G$ uses a gated recurrent unit with two layers and a hidden size of 1024 for both the bidirectional encoder and the attention-based decoder. During optimization using Adam (Kingma and Ba, 2014), we set the learning rate to and clip the gradient’s L2-norm to 1.0. We initially train $G$ by maximizing the word-level likelihood (MLE) on data that consist of positive samples for 60 epochs on the TripAdvisor data and 30 epochs on the Yelp dataset, separately. These are our baseline models against which to empirically demonstrate the value of our hierarchical discriminators.

The coherence discriminator $D_{\text{coherence}}$ also uses the pre-trained GloVe word vectors [4], which are fixed. The source processing network and the target processing network have the same structure, but different parameters. The convolutional layer has filters of sizes 2, 3, 4, and 5, each with 512 filters. Each convolution filter is followed by an activation function. Then we max-pool in time over the features and append a fully connected layer into a feature embedding of dimension 512, followed by a batch normalization layer and an activation function. We use an Adam optimizer with a learning rate of . The cohesion discriminator $D_{\text{cohesion}}$ is the same as $D_{\text{coherence}}$, except it has convolutional filters of sizes 3, 4, 5, and 6. We train both discriminators for 50 epochs and choose models with the best R@1 validation scores.

[4] The vector dimension can be different from that of the generator $G$. The differences were marginal for sizes 50, 100, and 300. For results shown in this paper, we used the same dimension of size 300.

In the tuning stage, we use the negative-critical sequence training as explained in Section 4 for up to 5 epochs, with a learning rate of . We also continue supervised learning of the generator $G$ to limit the policy search to a grammatically correct space, similar to Paulus et al. (2017); Wu et al. (2016); Lewis et al. (2017). In practice, sequence-level rewards are only available upon a completed generation, so they are sparse signals for the generator. Typically, sparse end-of-sequence rewards entail noisy training, yet we would want the learning to generalize to the test data. We observed that, for our particular task, most of the noise was caused by exploration, and the learning generalized to the test data. Thus, reward shaping was unnecessary, unlike previous works (Li et al., 2017; Yang et al., 2018) that further provided signals for partially generated sequences.

5.2 Sanity check on $D_{\text{coherence}}$ and $D_{\text{cohesion}}$

Most of the reviews written by hotel guests are considered coherent. Suppose we randomly select a negative sample from a pool of other continuations in the data. Even a layman in linguistics can effortlessly discern whether the review repeats similar ideas, albeit in different wordings, and whether the continuation supports or contradicts the source and can be considered a natural continuation of it. To show that our coherence discriminator does these jobs, and likewise the cohesion discriminator, we show some randomly selected positive and negative samples and their assigned margin scores in Table 4.

source cohesion coherence
this hotel was unbelievably overpriced . 0.0002
we were looking for something cheaper but thought we would at least
be staying in a decent hotel having paid that much when booking .
it wasn t clear when booking that we would have to share a bathroom . 0.0084
there was one shower for the whole floor which was tiny and unclean . 0.0054
the room was old and lacking in facilities .
the beds were very uncomfortable and the linen was very old . 0.0768
breakfast was ok , but the staff were incompetent . 0.0591
on our last day they were too lazy to clean our table and never bothered taking our order . -0.0097
we had to leave having had no breakfast , as we ran out of time . 0.0457
they saw us get up and leave and didn t even apologise for the appalling lack of service . +0.3735
negative target
the staff recommended great restaurants with very reasonable prices within walking distance . 0.0514
the paris hop on bus stops nearby . 0.0798
the gare l est is within 3 blocks . -0.0156
we paid 75 euro per nite excluding breakfast but paid for breakfast one day and found it very
good and reasonably priced .
the rooms are clean and bathrooms ensuite . -0.2001
more examples of cohesion
once you get there you are greeted by the staff .
they explain everything to you , and in english , not the best , but good enough . 0.1004
the coffee was even good for a coffee snob like myself .
the hotel is much smaller than i thought and only has six floors . -0.1103
the only negative was the curtain in the bathroom .
it was very shear and we felt that people in the building across the street could look
right in at night .
the beer at the lobby bar was stale .
there are many friendly cats on the grounds . -0.0830
Table 4: Coherence and cohesion margin scores on test data. The cohesion score at the end of each line is computed with its next sentence. This is a common example of contradiction and inconsistent sentiment, implying incoherence. We append more examples with extreme cohesion margin scores.

6 Discussion

We first comment on the text-to-text retrieval results in Table 2. Compared to image-to-textual caption retrieval tasks, the numbers are lower. One plausible reason is that an image is rich in semantics: an image is worth a thousand words. In contrast, a few sentences are limited in their capacity to convey a message, in addition to grammar and readability constraints. For this reason, the discriminator models face a more difficult challenge in identifying the correct query (source sentences) - item (target sentences) pairs.

Furthermore, we note that our methods to construct negative samples, in spite of their simplicity, are not thorough. For example, a randomly selected next sentence, given an arbitrary sentence, may actually be a valid continuation. In this work, given an unlabelled dataset, we ignore this problem, which may not be negligible and may explain why we see a lower performance for the cohesion discriminator compared to that of the coherence discriminator. Despite potential drawbacks of our methodology, we have shown significant improvements with imperfect discriminators. After experimenting with different architectures and hyper-parameters, we conclude that the numbers in Table 2 are reasonable for the task.

We do note that results will get better with more data - our discriminators, as well as the generator, will be well-trained by seeing more data samples. We consider the dataset to be rather small because each pre-processing condition is quite restrictive. However, our goal is to demonstrate the efficacy of our discriminator models, rather than to show good results arising from a large amount of data.

6.1 What do $D_{\text{coherence}}$ and $D_{\text{cohesion}}$ capture?

Ideally, given how the coherence discriminator $D_{\text{coherence}}$ is constructed (it independently processes the source and target sentences), we would want it to detect whether the target sentences support the message delivered in the source sentences. Some cues that signal a lack of such support are repetitive, contradicting, irrelevant, or sentiment-inconsistent statements without appropriate transition phrases. Although $D_{\text{coherence}}$ only computes scores and does not reason with specific cues, we observed that it learns to pick up these ideas that govern the coherence of a multi-sentence paragraph. Despite some randomness from the scores of randomly constructed negative samples, its margin scores are fairly consistent, based on a collection of margin scores on hotel reviews. Examples of more detailed incoherent aspects that $D_{\text{coherence}}$ learns to distinguish include parts of a review commenting on different countries, cities or locations, and listing prices in different currency denominations. This is a positive result of training on randomly selected negative targets in the same batch.

The role of the cohesion discriminator $D_{\text{cohesion}}$ is similar to that of $D_{\text{coherence}}$, except that it operates between sentences. We observed that $D_{\text{cohesion}}$ captures some low-level logical connections. For example, it favorably scores artifacts that are commonly mentioned together in positive samples, and this is again a result of our methods to construct negative samples through shuffling and rotating target sentences in the same batch. As a consequence, artifacts that do not appear together as often in consecutive sentences yield low cohesion scores.

However, we note that low cohesion scores among consecutive sentences do not necessarily imply bad writing. Although one needs to avoid consistently low-cohesion writing, a writer may simply enumerate seemingly disparate aspects in separate sentences, which likely implies a weak connection between immediately neighboring sentences. Therefore, we do not solely optimize for the cohesion criterion.

6.2 Potential improvements in our approach

We would also like to comment on our imperfect discriminators. When we tuned the generator $G$ for up to as many epochs as in the pre-training stage, we noticed that $G$ learns to find a policy that maximizes the discriminators’ margin scores, yet diverges from what we consider good writing. Although this problem can be partially overcome by training on a larger dataset, we determined that these two discriminators do not give a comprehensive critique, and there exist other equally, if not more, important linguistic properties that we did not address. We hope to extend our hands-free model framework to encode these features and provide richer signals for $G$ to improve upon.

While reinforcing linguistic properties such as coherence and cohesion is a first attempt and an important research direction, we consider our results to be preliminary, and many of our experimental figures, such as the recall scores, indicate plenty of room for further improvement. We admit that the structure of the coherence and cohesion discriminators can be argued either way: whether to process the source and target sentence(s) independently with different parameters, or together. Nevertheless, we are convinced that our architectures for both discriminators generalize well to unseen texts. Due to limited space in this paper, we plan to provide online a collection of examples with corresponding margin rewards, and release all of our resources to promote this line of research.

7 Conclusion

In this paper we propose to encode essential linguistic properties, coherence and cohesion, with a simple neural network architecture, and quantify them using negative-critical margin scores. The coherence discriminator provides a bird’s-eye view on coherence. It assesses how likely two text chunks form a coherent paragraph, using sentence-level features. On the other hand, the cohesion discriminator provides a worm’s-eye view on cohesion. It assesses how cohesive two consecutive sentences are using word-level features.

The scores computed by these discriminators are used as reward signals for training neural language models via policy gradient. Empirical results on two long-form text generation tasks show that our method outperforms the strong baseline, an attention-based bidirectional MLE-trained sequence-to-sequence model in a number of automatic metrics.

Future work will focus on casting the long-form text generation task using the GANs framework. In this framework, the coherence and cohesion discriminators are trained against model-generated texts and, in turn, provide signals to learn neural language models. This work is an extension of GANs in that we use multiple discriminators, similar to Durugkar et al. (2016), but each discriminator reinforces a distinct linguistic behavior in the generator $G$.