Rethinking Self-Supervision Objectives for Generalizable Coherence Modeling

Although large-scale pre-trained neural models have shown impressive performance on a variety of tasks, their ability to generate coherent text that appropriately models discourse phenomena is harder to evaluate and less understood. Given the claims of improved text generation quality across various systems, we consider the coherence evaluation of machine-generated text to be one of the principal applications of coherence models that needs to be investigated. We explore training data and self-supervision objectives that result in a model that generalizes well across tasks and can be used off-the-shelf to perform such evaluations. Prior work in neural coherence modeling has primarily focused on devising new architectures, training the model to distinguish coherent from incoherent text through pairwise self-supervision on the permuted document task. We instead use a basic model architecture and show significant improvements over the state of the art within the same training regime. We then design a harder self-supervision objective by increasing the ratio of negative samples within a contrastive learning setup, and enhance the model further through automatic hard negative mining coupled with a large global negative queue encoded by a momentum encoder. We show empirically that increasing the density of negative samples improves the basic model, and that using a global negative queue further improves and stabilizes the model while training with hard negative samples. We evaluate the coherence model on task-independent test sets that resemble real-world use cases and show significant improvements in coherence evaluations of downstream applications.






1 Introduction

Coherence is a property of a well-written text that distinguishes it from a random set of sentences: sentences in a coherent text are connected in systematic ways, such that each sentence follows naturally from the previous ones and leads into the following ones (Halliday76; Grosz1986AttentionIA). Coherence models (Barzilay:2005) that can distinguish a coherent text from incoherent ones have a wide range of applications in language generation, summarization, and coherence assessment tasks such as essay scoring and sentence ordering. With the advancement of neural methods in recent years, claims of fluency in summarization (Liu2017GenerativeAN; celikyilmaz-etal-2018-deep), language modeling (GPT2-Blog; GPT-3), and response generation (zhang2019dialogpt; hosseiniasl2020simple), and of human parity in machine translation (Hassan2018AchievingHP), have led to calls for finer-grained discourse-level evaluations (Lubli2018HasMT; sharma2019entity; CUBBITT), since traditional metrics such as BLEU and ROUGE are unable to measure text quality and readability (Paulus2018ADR; reiter2018structured). Coherence models that can evaluate machine-generated text have become the need of the hour. Most proposed coherence models optimize their learning objectives on the permuted document task, which uses the Penn Treebank (WSJ) corpus. The current paradigm of coherence modeling, which uses permuted documents to train pairwise ranking models, was originally proposed by Barzilay:2005; Barzilay2008ModelingLC to emulate entity-based incoherence, which has its origins in Centering Theory (Grosz1995CenteringAF). An original article is considered a ‘positive’ sample of a coherent document, while a permutation of its sentences is considered a ‘negative’ or incoherent sample (see Appendix A.1 for an example). Models are usually trained in a pairwise ranking fashion to distinguish the two.
The basic entity-grid model proposed by Barzilay:2005; Barzilay2008ModelingLC was extended to incorporate entity-specific features (Elsner:2011), multiple ranks (Feng:2012), and coherence relations (Lin:2011; Feng:2014). Neural extensions have also been proposed (dat-joty:2017; joty-etal-2018-coherence). More recent state-of-the-art models like the Transferable Neural model (xu-etal-2019-cross) consider coherence at a local level by training a forward and a backward model only on adjacent sentences, in addition to generative pre-training of the sentence encoders. The Unified Coherence model (unifiedcoherence) uses a bi-linear layer and lightweight convolution-pooling in a Siamese framework to capture discourse relations and topic structures, along with an explicit language model loss to capture syntactic patterns. rethinkingEACL recently tested these state-of-the-art models by conducting coherence evaluations on the WSJ permuted document task, machine translation, summarization and next utterance ranking tasks. They found that while the models performed well on the permuted document task, when tested off-the-shelf they generalized poorly to downstream evaluation tasks, and they call for more comprehensive evaluations of coherence models. Pishdad2020HowCA reached a similar conclusion: they retrained several neural coherence models on tasks analogous to coherence modeling, such as detecting connective substitution and topic switching, and found that performance on the permuted document task is only partially indicative of a model’s coherence modeling capabilities. In light of these recent findings, our aim in this work is to propose a coherence model that generalizes well to other tasks and can be used off-the-shelf for coherence evaluations of downstream applications such as machine-generated text.
We train our model purely through self-supervision, without tailoring the model architecture to the permuted document task or relying on any other form of supervision. Our main hypothesis is that large-scale pre-trained models like XLNet (XLNet) are expressive enough to capture coherence information given the right self-supervision. li-jurafsky:2017 point out that coherence models are exposed to a limited number of incoherent samples in the pairwise setup, since only a small sample of all possible incoherent permutations of a document is used to train models. Learning with more negatives can better maximize the mutual information between representations (Oord2018RepresentationLW). By using a contrastive learning (pmlr-v9-gutmann10a) setup, where each ‘positive’ document is compared with multiple ‘negative’ documents, we increase the proportion of negative samples that the model is exposed to, and find that the coherence model improves significantly. Wu2020OnMI recently showed that the difficulty of the negative samples used for contrastive training can strongly influence model success in visual representation learning. Guided by this principle, we train the model with automatically mined hard negative samples, coupled with a large global negative queue encoded by a momentum encoder (he2019moco). We evaluate our model on various independent test sets that demonstrate its applicability in downstream applications: machine-generated summaries, language model outputs and commonsense reasoning, in addition to coherence-specific test sets. In summary, our contributions are:


  • A neural coherence model trained purely through well-designed self-supervision tasks that generalizes well to downstream applications and can be used off-the-shelf for coherence evaluation.

  • Evaluation on multiple independent test sets that are more indicative of real-world performance of the coherence model.

  • Empirical results demonstrating that an increase in the density and quality of negative samples leads to better generalization for coherence models.

2 Datasets

In order to ensure that our coherence model is useful for evaluation in downstream applications, we use a selection of task-independent test sets that cover a variety of domains and genres, including machine generated text from summarization systems and language models. Following Pishdad2020HowCA, we also evaluate the models on a commonsense reasoning narrative dataset. Since our objective is to find the best training paradigm that can be used off-the-shelf for coherence evaluation, we train (and validate) the coherence models on standard WSJ data, while using the rest as “independent” test sets to indicate the generalizability of the trained models. All evaluations on the independent test sets are conducted in a pairwise setting to enable a fair comparison.

2.1 Training Data


The Wall Street Journal (WSJ) corpus consists of news articles, divided into 1,240 documents for training, 138 documents for development and 1,053 documents for testing in the standard setup. We exclude documents with fewer than 4 sentences and truncate them to a maximum length of 600 tokens. In order to maximally utilize documents that would otherwise be truncated due to GPU memory constraints, we partition documents with 20+ sentences into blocks of 10 sentences and consider each block as a separate positive document. This increases the number of coherent ‘documents’ that we can use to generate a much larger training set. unifiedcoherence use up to 20 permutations of a document to train their model; since their training setup is pairwise, the original positive document is repeated 20 times. We regenerate the permuted documents similarly, sampling a larger set of permutations for our contrastive learning setup.¹ This gives us 46,522 instances of positive and their corresponding negative documents for training and 4,522 instances for development. We use the original pairwise test set of unifiedcoherence with 20,411 instances for testing.

¹We ensure that the generated permuted documents are not repeated. For example, our contrastive learning setup requires 5 negative samples per instance; because each positive document appears 20 times in the original dataset, 100 unique permutations would be generated and divided accordingly.
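The unique-permutation negative sampling described above can be sketched as follows (a minimal illustration; the function name is ours, and only the deduplication requirement and the 5-negative setting come from the text):

```python
import random

def sample_unique_permutations(sentences, num_negatives, seed=0):
    """Sample distinct sentence-order permutations of a document to serve as
    incoherent 'negative' samples; the original (coherent) order is excluded."""
    rng = random.Random(seed)
    seen = {tuple(sentences)}  # rules out the original order and duplicates
    negatives = []
    while len(negatives) < num_negatives:
        perm = sentences[:]
        rng.shuffle(perm)
        if tuple(perm) not in seen:
            seen.add(tuple(perm))
            negatives.append(perm)
    return negatives

doc = ["s1", "s2", "s3", "s4", "s5"]
negatives = sample_unique_permutations(doc, num_negatives=5)
```

Each permutation keeps exactly the same sentences in a different order, so the only signal separating positive from negative is sentence arrangement.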

2.2 Machine Generated Texts


summeval conduct a manual coherence evaluation of the summaries generated by different summarization systems for 100 source articles from the CNN/DailyMail (Hermann2015TeachingMT) dataset. Likert-style coherence ratings from expert annotators are available for each summarized text. We adapt this to the pairwise setting by creating pairs of summaries from every system for each unique source article. The summary with the higher average coherence rating is designated as the positive document, while the summary with the lower rating is the negative document for that pair. This yields a set of summary pairs for evaluation.
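The pairing procedure might be sketched as follows (the ratings are hypothetical, `make_pairs` is our own name, and skipping tied ratings is our assumption rather than a detail stated above):

```python
from itertools import combinations

def make_pairs(rated_summaries):
    """Build (positive, negative) system pairs for one source article from
    average coherence ratings; equally rated pairs carry no preference signal
    and are skipped here."""
    pairs = []
    for (sys_a, r_a), (sys_b, r_b) in combinations(rated_summaries.items(), 2):
        if r_a == r_b:
            continue  # tie: neither summary is preferred
        pos, neg = (sys_a, sys_b) if r_a > r_b else (sys_b, sys_a)
        pairs.append((pos, neg))
    return pairs
```

Running it on hypothetical ratings `{"sysA": 4.2, "sysB": 3.1, "sysC": 4.2}` produces the two untied pairs, each ordered as (higher-rated, lower-rated).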


To cover a wider variety of machine-generated text, we generated texts from various language models using prompts taken from the validation and test sets of the WritingPrompts dataset (WritingPrompts). Four language models were chosen for this purpose: GPT2-Small, GPT2-XL, CTRL and GPT3. The continuations produced by these models for each prompt were truncated to an approximately equal number of tokens and paired together. Using these texts, we conducted a user study on Amazon Mechanical Turk. Workers were instructed about the concept of coherence and shown examples of coherent and incoherent texts. Given the prompt, they were asked to choose the more coherent of two language model outputs; they were also given the option to choose neither in case the texts were equally coherent/incoherent (see Appendix A.3 for more details, such as the study interface). After removing samples with low agreement and ties, a set of pairs with judgments from multiple annotators each was collected. The Krippendorff’s alpha coefficient (Krippendorff2011ComputingKA) between the annotators was 0.84. We calculate the agreement of the coherence model’s ranking with these judgments, designated LMvLM.

2.3 Curated Test Sets


ailishen2021 propose a sentence intrusion detection task to test the coherence modeling capabilities of pre-trained language models. Incoherent documents are created by substituting a sentence in a document with a sentence from a different document, ensuring that the replacement sentence is similar to the original document to make the task sufficiently hard. We adapt their task to the pairwise setting by pairing the original coherent document with the corrupted incoherent one, giving us 7,168 instances from their CNN test set (INSteD-CNN) and 3,666 instances from their Wikipedia test set (INSteD-Wiki) for evaluation. ailishen2021 also create a hand-crafted linguistic probe test set, where incoherence is manually inserted based on a range of linguistic phenomena; we use this test set for analysis (Section 4).


The StoryCloze dataset (created from RocStories (StoryCloze)) consists of short narrative-style texts with two possible endings, one of which is implausible. Since the test set labels are not public, we use the validation set. We designate the text with the correct ending as the positive document and the text with the incorrect ending as the negative document, resulting in a set of pairs for evaluation.

3 Methodology

3.1 Model Architecture

Previous work on coherence modeling proposed elaborate architectures to capture various aspects of coherence (see Section 1). However, our key hypothesis is that large-scale pre-trained models already capture much of this information, and are expressive enough to model coherence given the right self-supervision. Effective bi-directional encoding through large Transformer networks (VaswaniNIPS2017) can consider longer language context, while language modeling objectives enforce syntactic and local coherence patterns in the model. In our work, we adopt XLNet (XLNet) as the backbone model. It is trained using a permuted language modeling objective, in which the expected log-likelihood of a sequence with respect to all permutations of the factorization order is maximized. This allows the modeling of bi-directional context, while maintaining the auto-regressive property and avoiding the pretrain-finetune discrepancy. In addition, XLNet also incorporates segment recurrence (or memory) and the relative encoding scheme of Transformer-XL (Dai2019TransformerXLAL), which makes it effective in modeling longer text sequences. This makes it suitable for our purpose of coherence modeling. Given a document $D$ with $n$ sentences as input, our model uses the representations obtained through XLNet (parameterized by $\theta$ in Figure 1) to assign a coherence score to the document. Specifically, for each sentence with tokens $(w_1, \ldots, w_m)$, XLNet maps each token $w_t$ to its vector representation $h_t \in \mathbb{R}^d$, where $d$ is the dimension of the embedding. In addition, the complete input is also mapped to a document representation $h_D$ (i.e., the representation of the [cls] token). We simply add a linear layer over the document representation to obtain the final coherence score: $f_\theta(D) = W h_D + b$, where $W$ and $b$ are the weight and bias of the linear layer, with $\theta$ being the entire parameter set of the model (see the upper part of Figure 1).
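Numerically, the scoring head is just an affine map from the document representation to a scalar; a toy sketch with made-up dimensions (in practice the document vector is the [cls] representation produced by XLNet):

```python
def coherence_score(doc_vec, weight, bias):
    """Affine scoring head: dot(weight, doc_vec) + bias -> scalar coherence score."""
    return sum(w * h for w, h in zip(weight, doc_vec)) + bias

# Illustrative 2-dimensional 'document representation' and head parameters.
score = coherence_score([1.0, 2.0], weight=[0.5, -0.5], bias=0.1)
```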

3.2 Margin-based Pairwise Ranking


Traditionally, coherence model training has been done in a pairwise ranking setup. In this setup, the model is trained to score the coherent (positive) document higher than the incoherent (negative) document, using a pairwise ranking loss (collobert2011natural) defined as follows:

$\mathcal{L}_{\text{pair}} = \max\{0,\ \tau - f_\theta(D^+) + f_\theta(D^-)\}$   (1)

where $f_\theta(D^+)$ is the coherence score of the positive document, $f_\theta(D^-)$ is the coherence score of the negative document, and $\tau$ is the margin.
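A minimal sketch of this hinge-style objective (the margin value below is illustrative, not the paper's setting):

```python
def pairwise_ranking_loss(pos_score, neg_score, margin=5.0):
    """max(0, margin - pos + neg): the loss reaches zero once the positive
    document outscores the negative one by at least the margin."""
    return max(0.0, margin - pos_score + neg_score)
```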


Results from evaluation of existing coherence models by both Pishdad2020HowCA and rethinkingEACL indicate that the Unified Coherence model, or UNC (unifiedcoherence), is overall the best-performing model. We retrain their model with our training data for comparison.²

²Code taken from

The results for the baseline models are given in Table 1 (see the first two rows). Despite a relatively high performance on the WSJ test set (94.11%), UNC's performance on the independent test sets is quite poor, often failing to do better than the random baseline of 50%. Its performance on the INSteD-CNN dataset, which is in the same domain (news) as the training data, is relatively better at 67.21%. Our basic model, XLNet-Pairwise, not only outperforms the SOTA UNC model on the standard WSJ permuted document task, but also significantly outperforms it on the independent test sets, showing an absolute improvement of 15-20% on the SummEval, INSteD-CNN, INSteD-Wiki and StoryCloze datasets. On LMvLM, the UNC model performs better; we suspect that its explicit conditional language modeling loss provides an additional advantage for this particular task. Overall, our results are consistent with the observations of rethinkingEACL on the poor generalizability of the previous SOTA model.

Model WSJ SummEval LMvLM INSteD-CNN INSteD-Wiki StoryCloze
Our - Pairwise
Our - Contrastive
Our - Full Model
Table 1: Results on the WSJ permuted document test set and the various independent test sets of the previous SOTA UNC model and our XLNet based models. Except for the LMvLM results which are reported in terms of Krippendorff’s alpha agreement with human annotators, all other results are reported in terms of accuracy of the models in scoring the positive document higher than the negative document. All results are averaged over 5 runs with different seeds.

3.3 Contrastive Learning


In the pairwise ranking setup, each positive sample is only compared to one negative sample at a time. Contrastive learning (pmlr-v9-gutmann10a) generalizes this setup: a single positive sample can be compared to multiple negative samples, which is particularly useful in the permuted document task, where the number of possible incoherent samples per coherent document can be very large. The number of negatives considered and their quality can affect model performance (pmlr-v97-saunshi19a). Wu2020OnMI show that the contrastive loss maximizes a lower bound on the mutual information between representations; a larger number of negatives tightens the bound, so learning with more negatives can better maximize the mutual information. We train our model with a margin-based contrastive loss defined as:

$\mathcal{L}_{\text{con}} = \sum_{i=1}^{m} \max\{0,\ \tau - f_\theta(D^+) + f_\theta(D_i^-)\}$   (2)

where $f_\theta(D^+)$ is the coherence score of the positive document, the $f_\theta(D_i^-)$ are the scores of the $m$ negative documents, and $\tau$ is the margin.
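With multiple negatives, each negative contributes its own hinge term against the same positive; a sketch (summing over negatives, with an illustrative margin):

```python
def contrastive_margin_loss(pos_score, neg_scores, margin=5.0):
    """One hinge term per negative: negatives scoring close to (or above) the
    positive dominate the sum, while easy negatives contribute zero."""
    return sum(max(0.0, margin - pos_score + neg) for neg in neg_scores)
```

Note how the hard negative (9.0) contributes most of the loss below, while the easy one (2.0) contributes nothing.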


We use the same training data as the baseline models to train our contrastive model; the positive documents remain the same, while we use 5 negative documents per instance (instead of only 1 in the pairwise setup). Effectively, the model sees the same number of positive or coherent documents, but five times as many negative samples during training compared to the pairwise setting. See Appendix A.4 for the full set of our hyperparameters.


From the results in Table 1, we see that the contrastive model (row 3) further improves the results across all the independent test sets; the results on the LMvLM dataset also improve, now surpassing the UNC model. Although the improvement on the WSJ permuted document task is small, the improvement in the generalizability of the model is more significant.

3.4 Momentum Encoder with Hard Negative Mining

While increasing the number of negative samples per instance has been shown to be effective for contrastive learning, resource constraints can limit the number of negatives that can be considered per instance. One solution is to treat other positive instances in the same training batch as negatives (Karpukhin2020DensePR; Chen2020ASF). However, this method is not suitable for the permuted document task, since the negatives are instance-specific: while a permuted document is still independently incoherent, training with permuted versions of other documents will not provide the same cues for coherence modeling as the original self-supervision. Another solution is to maintain a large global queue of negative samples that are independent of the current training instance. During training, negative samples (more specifically, their representations) from the latest batch are enqueued to build a queue of up to some size $K$. As training continues, the negative samples from the oldest batch are dequeued to accommodate newer samples. However, the representations of the documents evolve as the model parameters get updated; this makes the negative samples in the queue inconsistent with each other and with the training instances in the current batch. Moreover, the issue of mismatched self-supervision with negatives that are permuted versions of other documents still remains.

Figure 1: Our coherence model with the auxiliary momentum encoder. $f_\theta$ is our base encoder, as in the setup of Section 3.3, while $g_{\theta_m}$ is our momentum encoder. $f_\theta(D^+)$ and $f_\theta(D^-)$ are the coherence scores of the positive and negative documents respectively. Note that only the parameters of $f_\theta$ and the linear layer are updated through backpropagation.

Momentum Encoder.

To address these issues, we add an auxiliary momentum encoder (he2019moco), which is also XLNet (XLNet). Figure 1 shows the overall architecture. Keeping the base contrastive setup the same (the upper part), we add an additional contrastive objective based on representations from the momentum encoder. Specifically, we re-encode the positive and negative samples through the momentum encoder; the negative samples thus encoded are used to build the queue. We train the model to promote the similarity between the positive representations from the momentum encoder and the positive representations from our base encoder over the similarity with the negative samples from the queue. Specifically, we define a momentum loss as:

$\mathcal{L}_{\text{mom}} = \sum_{j=1}^{K} \max\{0,\ \tau - \mathrm{sim}(v^+, \tilde{v}^+) + \mathrm{sim}(v^+, \tilde{v}_j^-)\}$   (3)

where $v^+$ and $\tilde{v}^+$ are the positive representations from the base encoder ($f_\theta$) and the momentum encoder ($g_{\theta_m}$) respectively, the $\tilde{v}_j^-$ indexed by $j$ are the negative representations from $g_{\theta_m}$ in the queue, and $\tau$ is the margin. The momentum encoder is updated based on the base encoder as:

$\theta_m \leftarrow \eta\,\theta_m + (1 - \eta)\,\theta$

where $\eta$ is the momentum coefficient; only $\theta$ is updated through backpropagation. Our full model is trained with a combination of the original contrastive learning objective (Eq. 2) and the momentum-encoded contrastive similarity objective (Eq. 3):

$\mathcal{L} = \mathcal{L}_{\text{con}} + \lambda\,\mathcal{L}_{\text{mom}}$   (4)

where $\lambda$ is a weighting hyperparameter.
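The queue maintenance and the momentum update can be sketched as follows (a simplification with plain Python lists; in the model the queue entries are encoder representations and the parameters are tensors, and `NegativeQueue` is our own name):

```python
from collections import deque

def momentum_update(base_params, mom_params, eta=0.999):
    """EMA update theta_m <- eta * theta_m + (1 - eta) * theta: the momentum
    encoder tracks the base encoder slowly, keeping the queued representations
    consistent with one another."""
    return [eta * m + (1.0 - eta) * b for b, m in zip(base_params, mom_params)]

class NegativeQueue:
    """FIFO queue of momentum-encoded negatives; enqueueing past capacity
    silently drops the oldest entries."""
    def __init__(self, max_size):
        self.buf = deque(maxlen=max_size)

    def enqueue(self, reps):
        self.buf.extend(reps)

    def negatives(self):
        return list(self.buf)

queue = NegativeQueue(max_size=3)
queue.enqueue([1, 2])
queue.enqueue([3, 4])  # capacity reached: the oldest entry (1) is dropped
```

A small momentum coefficient on the base parameters (here `1 - eta = 0.001`) is what makes the queue's contents mutually consistent across many batches.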

Length Invariance Training.

In the permuted document task, both the positive and the negative samples have the same number of sentences. This is not necessarily the case in real-world applications. In order to incorporate length invariance into our model, we encode a random contiguous slice of the positive document through the momentum encoder $g_{\theta_m}$.³ (³The minimum slice length is 4 sentences and the maximum is the full document length.)
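The slicing step might look like this (our own sketch; only the minimum length of 4 sentences comes from the footnote above):

```python
import random

def random_contiguous_slice(sentences, min_len=4, seed=None):
    """Return a random contiguous block of at least min_len sentences (up to
    the full document), so that positive samples vary in length."""
    rng = random.Random(seed)
    n = len(sentences)
    length = rng.randint(min(min_len, n), n)  # slice length in [min_len, n]
    start = rng.randint(0, n - length)        # valid starting offset
    return sentences[start:start + length]
```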

Hard Negative Mining.

It has been shown that the difficulty of the negative samples used for contrastive training can strongly influence model success (Wu2020OnMI). We therefore automatically mine hard negative samples during training. For the permuted document task, we can take advantage of the fact that the negative sample space can be huge: for a document with $n$ sentences, the candidate pool of permutations contains $n! - 1$ incoherent documents from which we can mine hard negatives. For the problem of dense text retrieval, Xiong2021ApproximateNN find global hard negatives by computing document encodings with a recent checkpoint to build an asynchronous index of the entire corpus, and sampling negative documents from the index. However, the huge candidate pool for permuted documents makes it infeasible to mine global negatives in our case. Instead, we perform local negative sample ranking. For each positive instance in the training data, we sample a larger number of permuted documents ($k$) per instance than we need for training (i.e., $k > 5$). We score these negative documents using the model updated thus far and use the highest-ranking negative documents for training. Specifically, the model is first trained on $t$ instances ($t$ is a hyperparameter) of data, using 5 negative samples randomly chosen out of the $k$. The updated model is then used to score all $k$ negative samples each for another set of $t$ instances from the training data. The scores of the negative samples are ranked, and the top-scoring 5 negative samples for each instance are used to train the model for the next $t$ gradient steps. This process is repeated throughout training; the model therefore iteratively mines harder and harder negative samples as it improves. See Algorithm 1 in Appendix A.2 for the pseudocode. We use hard negative training in combination with the momentum encoder, since we find that using hard negative samples directly leads to instability in model training (see Section 4).
The global negative queue is thus also constructed from the mined hard negative samples used for training. Our model is therefore trained to rely not only on comparative coherence cues from the traditional permuted document setup, but also to recognize more independent cues for coherence through the global queue, which is additionally enhanced by incorporating length invariance and automatically mined hard negative samples.
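The local ranking step reduces to scoring a sampled pool with the current model and keeping the top few; a sketch (`score_fn` stands in for the partially trained coherence scorer, and the names are ours):

```python
def mine_hard_negatives(score_fn, candidate_negatives, top_k=5):
    """Score every candidate permutation with the current model and keep the
    top_k highest-scoring ones: the negatives the model most confuses with
    coherent text provide the hardest training signal."""
    ranked = sorted(candidate_negatives, key=score_fn, reverse=True)
    return ranked[:top_k]
```

Re-running this periodically with an updated `score_fn` is what yields progressively harder negatives as the model improves.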


We train the model with the same training data, this time sampling 50 negatives⁴ per instance for hard negative ranking, and setting the number of training steps (or instances) $t$, the queue size $K$, the momentum coefficient $\eta$ and the loss weighting parameter $\lambda$ as given in Appendix A.4. Due to GPU memory constraints (24GB, Quadro RTX 6000), we train our model with a batch size of 1. See Appendix A.4 for the full set of hyperparameters.

⁴As previously described in Section 2.1, we ensure the sampled negative documents are unique even when the positive documents are repeated, so that a much larger sample of the overall candidate pool is considered during training. Since we sample and rank 50 negative documents per positive instance, accounting for 20 repetitions of the positive documents, 1,000 unique negative documents are considered for hard negative mining. This is 10 times larger than the contrastive setup (100 unique negatives) and 50 times larger than the pairwise setup (only 20 unique negatives).


The results in Table 1 (last row) show that our momentum encoder model with hard negative mining outperforms all previous models across the independent test sets. This improvement comes despite a very similar performance on the WSJ test set; we believe that our model genuinely improves in generalizability without overfitting to the permuted document task. The improvements on the out-of-domain test sets, particularly on LMvLM and StoryCloze, support this conclusion.

4 Analysis

Figure 2: (a) Development accuracy during training of our contrastive model with and without hard negative mining, and of our complete model with hard negative mining; accuracies are evaluated after every 1000 gradient steps. (b) Results on the various test sets for our model trained with hard negative mining, sampling different numbers of negatives ($k$) for ranking. (c) Results on the various test sets for our complete model trained with different momentum coefficient ($\eta$) values. (d) Results on the various test sets for our model trained with different global queue sizes. Note that the agreement values for the LMvLM test set have been scaled by a factor of 100 to facilitate visualization in (b), (c) and (d).

4.1 Hard Negative Training with Momentum Model

We only train our complete model (i.e., base contrastive plus momentum model) with mined hard negative samples (Section 3.4), because we find that training the base contrastive model directly with hard negatives leads to instability. Figure 2(a) plots development set accuracies of our base model trained with and without hard negative mining, and of our complete model trained with hard negative mining (evaluated every 1000 steps). As seen in the figure, the contrastive model displays significant volatility when trained with hard negatives, while the complete model remains quite stable.

4.2 Effects of Hyperparameters

Number of Ranked Negatives.

Figure 2(b) shows the results across the test sets for different numbers of negative samples considered for ranking ($k$) during hard negative mining. We see that increasing the number of negatives considered improves results across the board, with the out-of-domain test sets LMvLM and StoryCloze showing particular improvement.

Momentum Coefficient.

Figure 2(c) shows the variation in model performance across the test sets for different values of the momentum coefficient $\eta$. We see that, apart from a slight drop on the INSteD-Wiki dataset at one setting, increasing $\eta$ overall leads to better generalization on the independent test sets, presumably due to a more consistent global negative queue.

Queue Size.

Figure 2(d) shows the variation in model performance across the test sets for various sizes of the global negative queue. We see that while increasing the queue size generally leads to an improvement in scores, at high queue sizes the improvement is limited to test sets from the same domain (WSJ, SummEval and INSteD-CNN), and the model's generalizability suffers.

4.3 Effects of Varying Task & Dataset

So far, we have reported the results of training our model on the permuted document task using documents from the WSJ corpus as was done by most prior work (Elsner:2011; unifiedcoherence). We now test the effectiveness of other datasets, both by varying the task itself and by using a different dataset for the permuted document task.

Sentence Intrusion.

As described in Section 2.3, ailishen2021 propose a sentence intrusion task to test the coherence modeling capabilities of pre-trained language models. We adapt their dataset to the pairwise setting by pairing the original coherent document (positive) with the corrupted (negative) document; setting aside 10% of the data for development gives us 25,852 positive-negative training pairs for INSteD-CNN and 41,135 pairs for INSteD-Wiki. We train our pairwise model (Section 3.2) on this task. From the results in Table 2 (first two rows), we see that performance on the same domain/task as the training data and on the LMvLM dataset is high, but models trained on this task generalize poorly to the other independent test sets.

Train Dataset Neg. Type Model WSJ SummEval LMvLM INSteD-CNN INSteD-Wiki StoryCloze
INSteD-Wiki Intrusion Pairwise
INSteD-CNN Intrusion Pairwise
INSteD-CNN Permuted Pairwise
INSteD-CNN Permuted Contrastive
Table 2: Results on the WSJ permuted document test set and other independent test sets on the pairwise and contrastive models trained on different datasets. All results are averaged over 5 runs with different seeds.

Permuted Document Task with INSteD-CNN.

We now train our model on the permuted document task using the INSteD-CNN dataset.⁵ (⁵We chose INSteD-CNN because the model trained on it for the sentence intrusion task generalized better to the independent test sets than the model trained on INSteD-Wiki, despite the dataset being smaller.) We generate 52,607 positive-negative pairs by sampling permutations, similar to our training data (see Section 2.1), and train both our pairwise and contrastive models on this data. The results in Table 2 show that the contrastive model improves over the pairwise model across several test sets, confirming our hypothesis on a different dataset. Specifically for machine-generated texts, sentence intrusion training does better on the LMvLM dataset, while permuted document training does better on SummEval. This could be because the documents in SummEval are summaries of the same source article and therefore similar in content (detecting incoherence through permutations might help here), while the texts generated by language models, even for the same prompt, tend to differ more significantly in content (detecting intruder sentences might help here).

4.4 Linguistic Probe Analysis

| Linguistic Probe | UNC | Ours | Example |
| Pronoun Animacy Downgrade | 76.0 | 100.0 | She → It was the mother of twins Lakshmana and Shatrughna. |
| Pronoun Animacy Upgrade | 63.0 | 100.0 | It → She has been collected in two tankōbon volumes. |
| Pronoun Gender Flip | 55.0 | 100.0 | She → He is also well known for her → his role as Mary, the mother of Jesus. |
| Past to Future Flip | 86.0 | 96.0 | The Danes finished → will finish first in the 2014 World Junior Hockey Championship. |
| Single Determiner Flip | 62.1 | 83.2 | In 1969, he was again sold, this → these time to the Milwaukee Bucks. |
| Number | 58.0 | 80.0 | He had a career record of 67 → 6.7 wins and 62 → -6.2 losses. |
| Conjunction Flip | 55.0 | 78.0 | The school was founded in 1908, and → but has been a non-profit organization since 1956. |
| Negation | 60.0 | 78.0 | He was [not] named as the Australian squad captain and was [not] captain of the Wallabies. |
Table 3: Accuracies of the best-performing UNC model and our full model on the hand-crafted linguistic probe datasets constructed by ailishen2021. Examples (abridged for brevity) show the manual changes made to render the text incoherent: original words appear before each arrow and their replacements after it, and inserted words are shown in brackets.

ailishen2021 create eight hand-crafted linguistic probe test sets by manually modifying words in coherent texts based on various linguistic phenomena, ensuring that the resulting incoherent text remains syntactically correct. Except for the words targeted by the probe, the text is identical. Each test set has 100 samples (except the determiner-flipping test set, which has 95). We evaluate the best-performing UNC model and our full model on these test sets. The results are shown in Table 3 along with examples from the dataset. The UNC model has the most success with the tense agreement test set and mixed success on the pronoun test sets. Our model has perfect accuracy on all pronoun-related test sets and near-perfect accuracy on the tense agreement test set, showing that it is indeed capturing the discourse-level phenomena that constitute coherence. Where our model falters is in cases that may require commonsense knowledge, such as identifying that 6.7 wins is not possible. Overall, our model is quite successful in detecting several kinds of incoherence.
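Evaluation on these probes reduces to a pairwise comparison: the model is credited whenever the coherent text outscores its perturbed variant. A minimal sketch, with `score` standing in for any coherence scorer:

```python
def probe_accuracy(score, pairs):
    """Accuracy on a linguistic probe set: the model is credited when it
    scores the coherent text strictly higher than its perturbed variant.
    `score` is any callable mapping a text to a coherence score, and
    `pairs` is a list of (coherent, perturbed) text pairs."""
    correct = sum(score(pos) > score(neg) for pos, neg in pairs)
    return correct / len(pairs)
```

Because only the probed words differ between the two texts, any score gap can be attributed directly to the targeted linguistic phenomenon.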

5 Conclusion

With the goal of making our coherence model generalizable and useful for off-the-shelf evaluations, we have explored self-supervision objectives that improve coherence models without adapting the model architecture to a specific training task, as previous work has done. We upgrade the self-supervision objective from the existing pairwise ranking paradigm to a contrastive learning setup. We further enhance this model with a momentum encoder that maintains a large global queue of negative samples, and perform hard negative mining to refine the quality of the negatives. We show empirically that increasing the ratio and quality of negative samples improves the generalizability of the coherence model. We also test our model on a wide-ranging collection of independent test sets that resemble real-world applications, including machine generated text, on which it significantly outperforms the previous SOTA model. Our work thus sets a new evaluation standard for future research in coherence modeling. We will open-source our code base to encourage research in this new paradigm of coherence modeling.


Appendix A

A.1 WSJ Permuted Document Task

Examples for the permuted document task on the WSJ data are shown in Table 4.

Original Document
(S1) Judy and I were in our back yard when the lawn started rolling like ocean waves.
(S2) We ran into the house to get Mame, but the next tremor threw me in the air and bounced me as I tried to get to my feet.
(S3) We are all fine here, although Mame was extremely freaked.
(S4) Books and tapes all over my room.
(S5) Not one thing in the house is where it is supposed to be, but the structure is fine.
Permuted Document
(S4) Books and tapes all over my room.
(S3) We are all fine here, although Mame was extremely freaked.
(S2) We ran into the house to get Mame, but the next tremor threw me in the air and bounced me as I tried to get to my feet.
(S5) Not one thing in the house is where it is supposed to be, but the structure is fine.
(S1) Judy and I were in our back yard when the lawn started rolling like ocean waves.
Table 4: Examples showing the original coherent document and the incoherent document created by permuting the sentences of the original. Text taken from WSJ-1778.

A.2 Hard Negative Ranking Pseudocode

The pseudocode for our hard negative mining through local sample ranking is given in Algorithm 1.

Algorithm 1: Local Negative Sample Ranking
Input: training data D, where each instance d consists of a positive document and a set of negative documents N_d; model M; number of hard negatives m per instance
Initialize an empty hard negative set H_d for each instance d
procedure HardNegativeRanking(D, M)
    Partition D into consecutive sets of instances D_1, ..., D_k
    for i = 1 to k do
        if i == 1 then                      ▷ No hard negatives for the first iteration
            for each instance d in D_1 do
                Randomly sample m negatives from N_d and store them in H_d
        Train M with (D_i, {H_d : d in D_i})
        for each instance d in D_{i+1} do
            Score all the negative documents in N_d with M
            Sort N_d in descending order of scores
            Store the top-m scoring documents in H_d    ▷ Hard negatives for the next iteration
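A runnable Python sketch of one plausible reading of this ranking loop; the data layout (dicts with a "negatives" list), the consecutive-chunk partitioning, and the `train_fn`/`score_fn` callables are all illustrative assumptions, not the released implementation:

```python
import random

def hard_negative_ranking(instances, partitions, m, train_fn, score_fn, seed=0):
    """One pass of local negative sample ranking.
    Each instance is a dict with a 'negatives' list of candidate documents;
    `train_fn` consumes a batch of (instance, hard_negatives) pairs, and
    `score_fn` assigns a coherence score to a document (higher = scored as
    more coherent, so high-scoring negatives are the hard ones)."""
    rng = random.Random(seed)
    hard = {}  # instance id -> current hard negatives
    # Partition the dataset into consecutive chunks.
    chunk = max(1, len(instances) // partitions)
    parts = [instances[i:i + chunk] for i in range(0, len(instances), chunk)]
    for i, part in enumerate(parts):
        if i == 0:
            # No trained scorer yet: fall back to random negatives.
            for inst in part:
                hard[id(inst)] = rng.sample(
                    inst["negatives"], min(m, len(inst["negatives"])))
        # Train on this partition with its current hard negatives.
        train_fn([(inst, hard[id(inst)]) for inst in part])
        if i + 1 < len(parts):
            # Mine hard negatives for the next partition with the
            # freshly trained scorer.
            for inst in parts[i + 1]:
                ranked = sorted(inst["negatives"], key=score_fn, reverse=True)
                hard[id(inst)] = ranked[:m]
    return hard
```

With a toy identity scorer, the negatives mined for later partitions are simply the highest-valued candidates, confirming the "hardest negatives score highest" selection rule.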

A.3 LMvLM User Study

The instructions and the interface provided to the workers in the user study comparing pairs of language model outputs are shown in Figure 3. Workers were restricted to the native English-speaking regions of Canada, the United Kingdom, and the United States, and could only participate in our task if their previously completed HITs met a minimum acceptance rate. Each task was estimated to take 2 minutes, and workers were paid the equivalent of 16 USD per hour.

Figure 3: Instructions and study interface for the user study conducted on language model outputs.

A.4 Hyperparameters

The hyperparameters used in our experiments are given in Table 5.

Parameters Values
Margin-based Pairwise Ranking
- margin 0.1
- optimizer AdamW
- scheduler SWALR
- learning rate 5e-6
- annealed to 1e-6
- anneal rate 5000 steps
- batch-size 1
- XLNet model base
- dimension size 768
Contrastive Learning
- margin 0.1
- optimizer AdamW
- scheduler SWALR
- learning rate 5e-6
- annealed to 1e-6
- anneal rate 5000 steps
- batch-size 1
- XLNet model base
- dimension size 768
Momentum Encoder with Hard Negative Mining
- margin 0.1
- optimizer AdamW
- scheduler SWALR
- learning rate 5e-6
- annealed to 1e-6
- anneal rate 1000 steps
- batch-size 1
- XLNet model base
- dimension size 768
Table 5: Configuration parameters for training
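A sketch of how the AdamW + SWALR settings in Table 5 might be wired up in PyTorch; `model` and `loader` are hypothetical stand-ins for the XLNet-base scorer and its data loader, and treating `anneal_epochs` as optimizer steps with linear annealing is our assumption about how the "annealed to 1e-6 over 5000 steps" schedule was realized:

```python
import torch
from torch.optim.swa_utils import SWALR

# Hypothetical: `model` is the XLNet-base coherence scorer (768-dim).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
# SWALR anneals the learning rate from the optimizer's 5e-6 down to
# swa_lr=1e-6; calling scheduler.step() once per optimizer step makes
# `anneal_epochs` count training steps rather than epochs.
scheduler = SWALR(optimizer, swa_lr=1e-6, anneal_epochs=5000,
                  anneal_strategy="linear")

for batch in loader:          # batch-size 1, per Table 5
    loss = model(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```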

A.5 Comparison of Existing State-of-the-Art Coherence Models

We report the results obtained by rethinkingEACL and Pishdad2020HowCA on their evaluation tasks for SOTA neural coherence models in Table 6.

As reported by rethinkingEACL:
| Task | Dataset | UNC | xu-etal-2019-cross |
| Permuted Document | WSJ | 93.19 | 91.77 |
| Abstractive Summarization (Agr.) | CNN | 0.68 | 0.55 |
| Extractive Summarization (Agr.) | DUC | 0.35 | 0.38 |
| Machine Translation (Agr.) | WMT | 0.77 | 0.78 |
| (Trained) Machine Translation (Agr.) | WMT | 0.83 | 0.75 |

As reported by Pishdad2020HowCA:
| Task | Dataset | UNC | mesgar-strube-2018-neural |
| Permuted Document | Visual Storytelling | 88.42 | 82.25 |
| Permuted Document | ROCStories | 94.80 | 89.55 |
| Permuted Document | Dialogue | 97.21 | 90.79 |
| Permuted Document | HellaSwag | 83.92 | 69.38 |
| Permuted Document | PDTB | 92.85 | 61.96 |
| Connective Substitution | PDTB | 96.46 | 84.99 |
| Topic Switching | Visual Storytelling | 92.10 | 64.81 |
| Topic Switching | ROCStories | 94.62 | 67.85 |
| Topic Switching | Dialogue | 71.74 | 68.41 |
| Topic Switching | PDTB | 70.89 | 52.33 |
Table 6: Results reported by rethinkingEACL and Pishdad2020HowCA on various tasks and datasets, comparing the UNC model to two other SOTA neural coherence models proposed by xu-etal-2019-cross and mesgar-strube-2018-neural. Except for tasks marked (Agr.), which report agreement with humans, all tasks report accuracies. We only include tasks that directly test discourse coherence phenomena.