N-best Response-based Analysis of Contradiction-awareness in Neural Response Generation Models

by Shiki Sato et al.
Tohoku University

Avoiding the generation of responses that contradict the preceding context is a significant challenge in dialogue response generation. One feasible method is post-processing, such as filtering out contradicting responses from a resulting n-best response list. In this scenario, the quality of the n-best list considerably affects the occurrence of contradictions because the final response is chosen from this n-best list. This study quantitatively analyzes the contextual contradiction-awareness of neural response generation models using the consistency of the n-best lists. Particularly, we used polar questions as stimulus inputs for concise and quantitative analyses. Our tests illustrate the contradiction-awareness of recent neural response generation models and methodologies, followed by a discussion of their properties and limitations.




1 Introduction

Recent advanced response generation models zhang:acl2020demo:dialogpt; adiwardana:arxiv2020:meena; roller:eacl2021:blenderbot can generate relevant and meaningful responses, which can resolve dull response problems vinyals:icml2015ws:neuralconv; sordoni:naacl2015:gen-context-sensitive; serban:aaai2016:HRED. This advancement reveals additional flaws in the quality of neural model responses, such as contradiction. Contradiction is a critical error in dialogue because a single contradictory response can disrupt the flow of the dialogue higashinaka:sigdial2015:taxonomy.

A generation model outputs a response by selecting the candidate with the highest likelihood (the 1-best) from an n-best candidate list. Prior work has demonstrated that generating n-best lists with noncontradictory 1-bests is an open challenge nie:acl2020:i-like-fish; kim:emnlp2020:will-i-sound-like-me; li:acl2021:addressing. Thus, one practical technique for avoiding contradiction is to have an accurate contradiction detector that eliminates all contradictory candidates from the n-best list nie:acl2020:i-like-fish. In this scenario, the consistency of all candidates in the n-best list, not just the 1-best, substantially impacts whether the final output is contradictory because the final response is chosen from the n-best list. Nonetheless, earlier quantitative investigations of contradiction relied solely on the 1-bests of models li:acl2021:addressing.

Figure 1: Overview of our analysis framework. The framework analyzes n-best lists by (i) synthesizing stimulus inputs that induce contradictions, (ii) automatically determining whether responses in the n-best lists are contradictory, and (iii) computing Certainty and Variety.

In this study, we analyze the n-best lists generated by the models to explore methods for enhancing neural response generation to avoid contradiction. Specifically, we first consider how the analysis of an n-best list should be approached. Then, we propose a method for statistically analyzing n-best lists (Figure 1). Since it is impractical to study all conceivable contradictions in dialogue, we focus on contradictions in responses to polar questions. (Code and the test set are available at https://github.com/shiki-sato/nbest-contradiction-analysis.)
We use our method to highlight the contradiction-awareness of recent high-performance neural response generation models and techniques. Our results show that beam search has limitations in avoiding contradiction and that newer techniques, such as unlikelihood training welleck:iclr2020:unlikelihood, can help overcome them.

NLI data                                             Dialogue context for our test
Entailment     Premise: yeah i'm in North Carolina   EntQ  History: Yeah I'm in North Carolina.
               Hypothesis: I'm in North Carolina.          Message: Are you in North Carolina?
Contradiction  Premise: yeah i'm in North Carolina   CntQ  History: Yeah I'm in North Carolina.
               Hypothesis: I'm in South Carolina.          Message: Aren't you in South Carolina?
Table 1: Acquiring dialogue context by transforming Natural Language Inference (NLI) data.

2 Analysis perspectives

First, n-best lists must be generated so that contradiction can be prevented, assuming filters can remove contradictory responses. An ideal model produces output that is noncontradictory and also performs well on many other criteria, such as relevance or informativeness. A model must generate at least one noncontradictory candidate to deliver a noncontradictory output. Furthermore, even noncontradictory candidates may be eliminated on other criteria (e.g., relevance, informativeness). Therefore, we hypothesize that having more noncontradictory responses in an n-best list enhances the final output quality across various criteria. Taking the above into account, we examine n-best lists based on the certainty of the existence of noncontradictory responses (Certainty) and the variety of noncontradictory responses (Variety):

  • Certainty: The proportion of n-best lists that have at least one noncontradictory response.

  • Variety: The proportion of noncontradictory responses in each n-best list, computed only over the n-best lists that contain at least one noncontradictory response.

Given a set of inputs X, we calculate them as follows:

  Certainty = |{x ∈ X : g(f(x)) ≥ 1}| / |X|

  Variety = (1 / |X′|) Σ_{x ∈ X′} g(f(x)) / n,  where X′ = {x ∈ X : g(f(x)) ≥ 1},

where f is an n-best list generation function and g is a function that returns the number of noncontradictory responses in a given n-best list. For example, a model that always generates n-best lists containing a mixture of noncontradictory and contradictory responses has a high Certainty but a low Variety. Conversely, a model that always generates n-best lists consisting entirely of noncontradictory or entirely of contradictory responses has a high Variety but a low Certainty. Ideally, n-best lists should reliably include noncontradictory responses (high Certainty), and those responses should make up a large share of each list (high Variety).
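Given per-response contradiction labels, both metrics reduce to a few lines of code. The following is a minimal sketch (function and variable names are ours, not from the released code), representing each n-best list as a sequence of booleans where True marks a noncontradictory response:

```python
def certainty(nbest_lists):
    """Fraction of n-best lists containing at least one noncontradictory response."""
    return sum(any(lst) for lst in nbest_lists) / len(nbest_lists)

def variety(nbest_lists):
    """Mean proportion of noncontradictory responses, computed only over
    the n-best lists that contain at least one noncontradictory response."""
    kept = [lst for lst in nbest_lists if any(lst)]
    return sum(sum(lst) / len(lst) for lst in kept) / len(kept)

# A model mixing both kinds in every list: high Certainty, lower Variety.
mixed = [[True, False, True, False]] * 4
# A model producing all-or-nothing lists: lower Certainty, perfect Variety.
polarized = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]

assert certainty(mixed) == 1.0 and variety(mixed) == 0.5
assert certainty(polarized) == 0.5 and variety(polarized) == 1.0
```

The two toy models mirror the contrast described in the text: mixing noncontradictory and contradictory candidates maximizes Certainty at the cost of Variety, and vice versa.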

3 Analytical inputs and evaluation

To analyze a model from the aforementioned viewpoints, we consider how to prepare the analytical inputs and evaluate the generated responses in this section.

3.1 Inputs for highlighting contradictions

Polar echo question.

An echo question noh:lp1998:echo confirms or clarifies contextual information by repeating another speaker's utterance. It is commonly used when the speaker did not hear or understand what was said, or wishes to express incredulity. Following li:acl2021:addressing's finding that contradictions emerge mostly when speakers refer to information communicated earlier in the dialogue, we use echo questions as stimulus inputs to elicit contradictory responses. We use polar-typed echo questions to make our analysis more succinct and quantitative. Since polar questions essentially admit only two responses, yes or no, we can clearly determine whether a generated response is contradictory. Furthermore, treating the produced responses as a yes/no binary classification problem allows quantitative discussion of the experimental outcomes relative to the chance level.

Input preparation.

We use a dataset from the natural language inference (NLI) task to efficiently obtain the analytical inputs described in the preceding paragraph. An NLI dataset specifies the logical relationship (i.e., entailment, neutral, or contradiction) between a premise and an associated hypothesis. We transform the NLI data into dialogue data using a set of basic rewriting rules. (The details are described in Appendix A.)

Our test involves two types of inputs, which can be classified as follows:

  • EntQ: inputs transformed from entailment pairs, eliciting a confirmation response.

  • CntQ: inputs transformed from contradiction pairs, eliciting a refutation response.

Table 1 displays input samples and how they are transformed from the original NLI data. Each input consists of two utterances: a history and a message. In our analysis, the model generates responses to the given message, assuming the model itself produced the history in the preceding turn.

3.2 Contradiction detection for output

To compute the Certainty and Variety, we must first determine whether each generated response in the n-best lists is contradictory with respect to its input. The simplest method for detecting contradictions is to check whether the response begins with yes or no. However, for indirect expressions (e.g., Why not?), this method fails. Therefore, we use an automated yes-no classifier to categorize the n-best responses to EntQ/CntQ inputs. We train the classifier by fine-tuning RoBERTa liu:arxiv2019:roberta on the Circa dataset louis:emnlp2020:circa, which comprises pairs of polar questions and indirect answers annotated with the answer's interpretation, to categorize utterances as affirmations or refutations. (The details are described in Appendix B.)
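To illustrate why the surface check is insufficient (the actual detector is the fine-tuned RoBERTa classifier, which we do not reproduce here), a naive prefix-based detector might look like the following sketch; the example responses are illustrative:

```python
def naive_yes_no(response: str):
    """Classify a polar-question response by its first word only.

    Returns "yes", "no", or None when neither prefix is present.
    """
    first = response.strip().lower().split()[0].rstrip(",.!?")
    if first == "yes":
        return "yes"
    if first == "no":
        return "no"
    return None  # indirect expressions fall through undetected

assert naive_yes_no("Yes, I am.") == "yes"
assert naive_yes_no("No, I'm in South Carolina.") == "no"
# Indirect refutations such as "Why not?" are missed entirely,
# which is why a trained classifier is needed.
assert naive_yes_no("Why not?") is None
```

The last case shows the failure mode described above: the indirect answer carries a clear yes/no meaning that the prefix check cannot recover.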

4 Experiments

We demonstrate how our framework reveals properties of n-best lists that strongly influence contradiction avoidance. We do so by comparing the n-best lists generated by conventional beam search (BS) with those generated by recently proposed techniques.

4.1 Experimental settings

Inputs preparation.

We used the Multi-Genre NLI Corpus williams:naacl2018:challengecorpus, a large-scale and consistently high-quality NLI dataset, to obtain analytical inputs. We created EntQ/CntQ inputs by extracting samples labeled entailment or contradiction. (We used the samples in the Telephone domain, which covers open-domain conversations.)

Response generation models.

We used the following two recently developed high-performance models: DialoGPT zhang:acl2020demo:dialogpt and Blender roller:eacl2021:blenderbot. (The details of the settings are described in Appendix C.)

4.2 Analysis of n-best lists using beam search

Let b denote the beam size during generation. Beam search with b = 10 has been empirically found to yield high-quality responses and is a frequently used setting zhang:acl2020demo:dialogpt; roller:eacl2021:blenderbot. Table 2 displays the Certainty and Variety of n-best lists generated with this setting. Figure 2 depicts the Certainty and Variety of n-best lists generated using different beam sizes.

Certainty Variety
Model EntQ CntQ EntQ CntQ
Blender 400M 0.806 0.747 0.780 0.775
Blender 1B 0.832 0.752 0.832 0.753
Blender 3B 0.856 0.768 0.824 0.737
DialoGPT 345M 0.938 0.917 0.750 0.669
DialoGPT 762M 0.883 0.918 0.671 0.713
Table 2: Certainty and Variety of n-best lists using beam search with beam size b = 10.
Figure 2: Certainty and Variety of n-best lists using beam search with various beam sizes.


Table 2 illustrates that for approximately 8% of CntQ-type inputs, even the highest-scoring model generates n-best lists consisting entirely of contradictory responses. Even with a perfect response filter, the models cannot provide noncontradictory answers to these questions. It should be emphasized that this error rate is not low, given that the inputs are polar questions with highly restricted viable responses. Expanding the beam size can increase the number of n-best lists with at least one noncontradictory response; indeed, increasing the beam size enhances the Certainty ((a) and (b) in Figure 2), and with a sufficiently large beam size, the Certainty of DialoGPT approaches its maximum for both EntQ- and CntQ-type inputs.


With b = 10, all the models' Variety values exceed 0.50 (chance rate) (Table 2). Therefore, rather than behaving fully randomly, the models generate n-best lists with a degree of directionality toward avoiding contradictions. However, increasing the beam size reduces the Variety ((c) and (d) in Figure 2), lowering the expected output quality. For example, the Variety of DialoGPT for CntQ-type inputs drops sharply at the largest beam sizes, even as its Certainty approaches the maximum.


In terms of avoiding contradiction, our analytical framework exposed the characteristics of the n-best lists produced by beam search. The Certainty did not reach 1.00 in the commonly used configuration (b = 10). When the beam size is increased, the Certainty approaches 1.00, whereas the Variety drops dramatically. These results show a trade-off between Certainty and Variety as a function of beam size; in this setting, we found it difficult to obtain high Certainty and high Variety simultaneously with beam search. Furthermore, the Certainty obtained using DialoGPT is greater than that obtained using Blender, whereas the opposite holds for Variety, suggesting that different models behave differently along the two axes. This underscores the importance of examining the Certainty and Variety of each model.

4.3 Analysis of n-best lists using various techniques

How to achieve high Certainty and Variety?

One method to increase Certainty is to generate n-best lists with a wider range of responses, so that each list is likely to contain some noncontradictory responses. The diverse beam search (DBS) vijayakumar:aaai2018:diverse and nucleus sampling (NS) holtzman:iclr2020:nucleus methods construct such n-best lists. Furthermore, li:acl2020:dontsaythat recently proposed models that use unlikelihood (UL) training to assign low probabilities to contradictory responses. Using these models to generate n-best lists should enhance both Certainty and Variety. We assess the n-best lists generated using these three strategies (n-best lists generated using DBS and NS, and n-best lists generated using beam search together with UL training) to see how much they enhance Certainty and Variety. Appendix C describes the settings used for this analysis.
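The UL objective penalizes tokens of a negative (here, contradictory) response by adding terms of the form -log(1 - p) to the usual likelihood loss, weighted by a coefficient α welleck:iclr2020:unlikelihood. The following toy, token-level sketch (not the actual sequence-level training code; variable names are ours) shows how α = 0 reduces to ordinary likelihood training:

```python
import math

def mle_loss(gold_token_probs):
    """Negative log-likelihood of the gold-response tokens."""
    return -sum(math.log(p) for p in gold_token_probs)

def unlikelihood_loss(neg_token_probs):
    """Penalty that grows as the model assigns probability to
    tokens of the negative (contradictory) response."""
    return -sum(math.log(1.0 - p) for p in neg_token_probs)

def combined_loss(gold_probs, neg_probs, alpha):
    """alpha weights the UL term; alpha = 0 is plain likelihood training."""
    return mle_loss(gold_probs) + alpha * unlikelihood_loss(neg_probs)

gold = [0.9, 0.8]  # model probabilities of gold ("No, ...") tokens
neg = [0.7, 0.6]   # model probabilities of contradictory tokens

# With alpha = 0 the contradictory candidate is not penalized at all;
# larger alpha pushes its probability mass down during fine-tuning.
assert combined_loss(gold, neg, 0.0) == mle_loss(gold)
assert combined_loss(gold, neg, 1.0) > combined_loss(gold, neg, 0.0)
```

Minimizing the UL term drives the probability of contradictory continuations toward zero, which is why it can raise the Variety of the resulting n-best lists.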

Certainty Variety
Technique EntQ CntQ EntQ CntQ
BS 0.856 0.768 0.824 0.737
DBS 0.999 0.981 0.758 0.478
NS 1.000 0.994 0.755 0.462
UL () 1.000 0.996 0.406 0.759
UL () 0.943 0.900 0.920 0.938
UL () 0.910 0.937 0.969 0.968
Table 3: Certainty and Variety of n-best lists using various techniques with Blender 3B.


Table 3 displays the Certainty and Variety of the n-best lists generated using BS, DBS, NS, and UL. (For BS, DBS, and UL, we obtained the n-best lists with beam size 10; for NS, we obtained them by performing nucleus sampling ten times.) The values of α indicate the weight of the UL loss during fine-tuning. UL with α = 0 denotes the response generation model fine-tuned with maximum likelihood under the same training settings as the UL models; comparing UL with α = 0 against UL with α > 0 thus gives a fair comparison between likelihood and unlikelihood training. The results reveal the properties of the n-best lists obtained with the three techniques, as well as the extent to which each increases Certainty and Variety. The Certainty obtained using DBS and NS approaches 1.00 with far smaller search sizes than BS requires to attain comparable Certainty; however, their Variety for CntQ-type inputs falls below 0.50 (chance rate). Thus, DBS and NS improve Certainty more efficiently than beam search, but they do not attain high Certainty and high Variety simultaneously. In contrast, the Certainty obtained using UL with α > 0 is greater than that obtained using BS, while maintaining a higher Variety than both BS and UL with α = 0 (likelihood training). These findings show that generation models are advancing toward high Certainty and Variety, particularly with the recently proposed UL loss. Nevertheless, despite the highly restricted viable responses, i.e., yes or no, the Certainty obtained using UL with α > 0 does not reach 1.00. Thus, we conclude that there is still room for improvement in n-best list generation in terms of avoiding contradiction.

5 Conclusion

Based on the recent development of contradiction detectors, removing contradictory candidates from models' n-best lists is a practical method for avoiding contradiction. In this method, the consistency of all candidates in the n-best lists substantially affects whether the final outputs are contradictory.

We quantitatively examined the properties of n-best lists in terms of avoiding contradiction, using polar-typed questions as analytical inputs. We demonstrated that the proposed framework characterizes n-best lists via Certainty and Variety: Certainty measures whether an n-best list has at least one noncontradictory response, whereas Variety measures how many noncontradictory responses each n-best list contains. In particular, the results demonstrated the present limitations on achieving high Certainty and Variety with the well-established beam search method. In addition, our method highlights the improvements in Certainty and Variety achieved by recently proposed response generation techniques.

Our approach, which analyzes models' n-best lists based on Certainty and Variety, can be applied to any response generation problem, not just polar-typed response generation; we leave this extension to future work.


Acknowledgments

We would like to thank all anonymous reviewers for their insightful comments. We also thank Ana Brassard and Yosuke Kishinami for their valuable feedback and support. This work was partly supported by JSPS KAKENHI Grant Numbers JP21J22383, JP22K17943, JST Moonshot R&D Grant Number JPMJMS2011, and a Bilateral Joint Research Program between RIKEN AIP Center and Tohoku University.


Appendix A Details of transforming NLI data

As described in Section 3.1, we obtain an analytical input from the NLI dataset. Specifically, we convert the hypothesis sentence of an NLI sample into a yes-no question. We describe the procedure as follows:

  1. Detect the first verb of a sentence.

  2. Move the verb to the beginning of the sentence, or put one of {Do, Does, Did} at the front of the sentence, changing the verb back to its base form (e.g., made → make).

  3. Change first-person pronouns to second-person pronouns and second-person pronouns to first-person pronouns (e.g., my → your).

  4. Change the punctuation mark at the end of the sentence to a question mark.

We used spaCy (en_core_web_sm) spacy2 to detect the verbs of hypothesis sentences. To avoid obtaining ungrammatical inputs, we did not use NLI samples with syntactically complex hypothesis sentences, such as those containing coordinating conjunctions. (Further details are provided in our source code: https://github.com/shiki-sato/nbest-contradiction-analysis)
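The four steps above can be sketched without a full parser. The following deliberately simplified version (our illustration, not the released spaCy-based implementation) handles only hypotheses whose first verb is a form of "be", contracted or spelled out:

```python
# Mid-sentence pronoun flips for step 3; the subject is handled separately
# so that "be" agreement can be adjusted at the same time.
PRONOUN_SWAP = {"my": "your", "your": "my", "i": "you", "you": "I"}

def hypothesis_to_polar_question(hypothesis: str) -> str:
    """Simplified steps 1-4: works only for copular sentences such as
    "I'm in North Carolina."; a real implementation needs a parser."""
    words = hypothesis.rstrip(".").replace("I'm", "I am").split()
    subj, verb, rest = words[0].lower(), words[1].lower(), words[2:]
    # Step 3: flip first/second person, adjusting "be" agreement.
    if subj == "i":
        subj, verb = "you", {"am": "are"}.get(verb, verb)
    elif subj == "you":
        subj, verb = "I", {"are": "am"}.get(verb, verb)
    rest = [PRONOUN_SWAP.get(w.lower(), w) for w in rest]
    # Steps 2 and 4: front the verb and end with a question mark.
    return f"{verb.capitalize()} {subj} {' '.join(rest)}?"

assert hypothesis_to_polar_question("I'm in North Carolina.") == \
    "Are you in North Carolina?"
```

This reproduces the Table 1 transformation for the running example; the released code additionally handles non-copular verbs via do-support (step 2).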

Appendix B Details of yes-no classifier

Training settings.

On the Circa dataset, we fine-tuned the pretrained RoBERTa (roberta-large) implemented by Hugging Face wolf:emnlp2020:transformers. We randomly divided the dataset into train and validation sets. The other training parameters were identical to those used by louis:emnlp2020:circa.

Performance of classifier.

To investigate the performance of the classifier, we measured its classification accuracy. First, we manually labeled the top responses in the n-best lists generated in the analysis of Section 4.2 with one of two labels: Contradictory or Noncontradictory. We then evaluated how accurately the automated classifier categorized the labeled responses. For this evaluation, we selected responses (generated by each of the generation models) for EntQ/CntQ inputs drawn at random from our test set. The classifier assigned an unambiguous yes/no label to most of these responses (see Appendix C), and we computed its accuracy on them. Some examples of the classification are shown in Table 4. The classifier correctly detected the contradiction in a model response using an indirect expression in Example (a). However, in Example (b), the classifier failed to detect the contradiction of a model response containing both a noncontradictory direct expression ("No") and a contradictory indirect expression (the remainder of the response after "No"). We found that the classifier tended to misclassify model responses that contradict themselves, as in Example (b).

History: and we didn’t ever call it uh Cokes and such you know we call it soda.
Message: Don’t you always call it Coke?
Model Response: We call it coke.
Human Label: Contradictory
Predicted Label: Contradictory
(a) Example
History: The buying a house was the last thing that i wanted to do.
Message: Weren’t you desperate to buy a house?
Model Response: No, I just wanted to buy a house.
Human Label: Contradictory
Predicted Label: Non-contradictory
(b) Example
Table 4: Examples of response classification results by the yes-no classifier. The model responses were generated by Blender using beam search with beam size 10.
Model          EntQ         CntQ
Blender 400M   1331 / 2000  1270 / 2000
Blender 1B     1413 / 2000  1316 / 2000
Blender 3B     1566 / 2000  1403 / 2000
DialoGPT 345M  1126 / 2000   924 / 2000
DialoGPT 762M  1044 / 2000   956 / 2000
Table 5: Number of stimulus inputs analyzed to calculate the Certainty and Variety reported in Table 2.
Technique  EntQ         CntQ
BS         1566 / 2000  1403 / 2000
DBS         991 / 2000   882 / 2000
NS          818 / 2000   684 / 2000
UL ()      1914 / 2000  1871 / 2000
UL ()      1806 / 2000  1887 / 2000
UL ()      1654 / 2000  1811 / 2000
Table 6: Number of stimulus inputs analyzed to calculate the Certainty and Variety reported in Table 3.

Appendix C Details of experiments

Number of analyzed stimulus inputs.

To simplify the analysis, we omitted from Section 4 and Appendix B the analytical inputs whose n-best lists contained one or more ambiguous responses. We defined ambiguous responses as those that the classifier identified as neither affirmations nor refutations. (The Circa dataset has seven labels, such as "Yes" and "Probably / sometimes yes"; we regard responses classified as "In the middle" or "I am not sure" as ambiguous.) Tables 5 and 6 display the number of analytical inputs, out of the 2,000 EntQ/CntQ inputs each, used for the two analyses in Section 4.
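This filtering amounts to a three-way grouping of the classifier's Circa-style labels. A sketch of the grouping (the label strings below are paraphrased from the Circa annotation scheme and may differ slightly from the exact strings in the dataset release):

```python
# Grouping of classifier labels into affirmation / refutation / ambiguous.
AFFIRM = {"Yes", "Probably yes / sometimes yes", "Yes, subject to some conditions"}
REFUTE = {"No", "Probably no"}
AMBIGUOUS = {"In the middle, neither yes nor no", "I am not sure"}

def keep_input(nbest_labels):
    """An analytical input is kept only if no response in its n-best list
    received an ambiguous label."""
    return not any(label in AMBIGUOUS for label in nbest_labels)

assert keep_input(["Yes", "No", "Probably no"])
assert not keep_input(["Yes", "I am not sure", "No"])
```

Applying `keep_input` over all 2,000 EntQ/CntQ inputs yields the counts reported in Tables 5 and 6.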

Generation model settings.

For the Section 4 experiments, we used DialoGPT zhang:acl2020demo:dialogpt and Blender roller:eacl2021:blenderbot as response generation models. We generated responses using the ParlAI code miller:emnlp2017demo:ParlAI with its default settings, except for beam_length_penalty.

Unlikelihood training settings.

We used unlikelihood training with Blender 3B for the study in Section 4.3. The unlikelihood training proposed by li:acl2020:dontsaythat requires training data with three elements: an input (here, a history and a message), a gold response, and a negative response. We created these training samples by transforming NLI data with entailing and contradicting hypotheses. (Note that we did not use the same NLI samples used to synthesize EntQ/CntQ.) Table 7 displays the original NLI data and the transformed training samples. One NLI sample yields four types of questions: positive and negative questions for both the entailing and contradicting hypotheses. We synthesized training samples from the NLI data and randomly divided them into training and validation sets. We tuned the learning rate and the number of warmup updates for each value of α. The remaining training parameters were identical to those used by roller:eacl2021:blenderbot. Note that we trained only the models marked UL in Section 4.3 on these transformed data.

Premise: yeah i’m in North Carolina
Hypothesis – Entailment: I’m in North Carolina.
Hypothesis – Contradict: I’m in South Carolina.
(a) Original NLI data
 History: Yeah I’m in North Carolina.
 Message: Are you in North Carolina?
 Gold: Yes, I’m in North Carolina.
 Negative: No, I’m in South Carolina.
 History: Yeah I’m in North Carolina.
 Message: Are you in South Carolina?
 Gold: No, I’m in North Carolina.
 Negative: Yes, I’m in South Carolina.
 History: Yeah I’m in North Carolina.
 Message: Aren’t you in North Carolina?
 Gold: Yes, I’m in North Carolina.
 Negative: No, I’m in South Carolina.
 History: Yeah I’m in North Carolina.
 Message: Aren’t you in South Carolina?
 Gold: No, I’m in North Carolina.
 Negative: Yes, I’m in South Carolina.
(b) Training samples for UL
Table 7: Example of transforming (a) original NLI data into (b) training samples for UL. We synthesize four questions, i.e., positive and negative questions for both the entailing and contradicting hypotheses, from each NLI sample.
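The Table 7 transformation can be sketched as follows, assuming the entailing and contradicting hypotheses have already been converted to positive polar questions (the function and helper names are ours):

```python
def synthesize_ul_samples(history, ent_q, cnt_q, ent_stmt, cnt_stmt):
    """Build the four (history, message, gold, negative) tuples of Table 7
    from one NLI sample. ent_q/cnt_q are positive polar questions for the
    entailing and contradicting hypotheses; ent_stmt/cnt_stmt are the
    corresponding statements."""
    def negate(question):  # "Are you ..." -> "Aren't you ..."
        aux, rest = question.split(" ", 1)
        return f"{aux}n't {rest}"

    samples = []
    for message in (ent_q, negate(ent_q)):   # entailing questions: gold is "Yes"
        samples.append((history, message, f"Yes, {ent_stmt}", f"No, {cnt_stmt}"))
    for message in (cnt_q, negate(cnt_q)):   # contradicting questions: gold is "No"
        samples.append((history, message, f"No, {ent_stmt}", f"Yes, {cnt_stmt}"))
    return samples

rows = synthesize_ul_samples(
    "Yeah I'm in North Carolina.",
    "Are you in North Carolina?", "Are you in South Carolina?",
    "I'm in North Carolina.", "I'm in South Carolina.",
)
assert len(rows) == 4
assert rows[3] == ("Yeah I'm in North Carolina.", "Aren't you in South Carolina?",
                   "No, I'm in North Carolina.", "Yes, I'm in South Carolina.")
```

The final assertion reproduces the last block of Table 7(b); the simple auxiliary-contraction negation covers the "Are/Aren't"-style questions produced by our transformation rules.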