Dialogue Natural Language Inference

by   Sean Welleck, et al.
NYU college

Consistency is a long-standing issue faced by dialogue models. In this paper, we frame the consistency of dialogue agents as natural language inference (NLI) and create a new natural language inference dataset called Dialogue NLI. We propose a method which demonstrates that a model trained on Dialogue NLI can be used to improve the consistency of a dialogue model, and we evaluate the method with human evaluation and with automatic metrics on a suite of evaluation sets designed to measure a dialogue model's consistency.





1 Introduction

A long-standing issue faced by dialogue models is consistency [10, 18, 21]. An example from [18] shows a two-round dialogue in which their neural sequence model first responds to "what is your job?" with "i'm a lawyer", then responds to "what do you do?" with "i'm a doctor". Even when inconsistencies are relatively rare and semantically plausible, they are jarring, and because semantic plausibility is not enough to root them out, preventing them is challenging.

One approach to increasing the consistency of a chit-chat dialogue model was proposed in [21], where the dialogue agent is given a set of personal facts describing its character (a persona) and produces utterances that reflect the persona. The intended outcome is that the agent produces utterances consistent with its given persona. However, these models still face the consistency issue, as shown in Figure 1.

Separately, the framework of Natural Language Inference (NLI) [2] involves learning a mapping between a sentence pair and an entailment category. It is hypothesized that the NLI task is a proxy for general goals in natural language processing, such as language understanding [2, 20]. Thus, the NLI task has been used for learning general sentence representations [4] and for evaluating NLP models [15, 19], with the expectation that such models will be useful in downstream tasks.

Despite this expectation, leveraging an NLI model for a downstream task remains an under-explored research direction. An NLI model may improve downstream task performance if properly used, while downstream tasks may yield new datasets or identify issues with existing NLI models, thus expanding the NLI research domain.

In this paper, we reduce the problem of consistency in dialogue to natural language inference. We first create a dataset, Dialogue NLI, which contains sentence pairs labeled as entailment, neutral, or contradiction. (The dataset will be available through the ParlAI [13] framework, http://parl.ai/.)

Then, we demonstrate that NLI can be used to improve the consistency of dialogue models using a simple method where utterances are re-ranked using an NLI model trained on Dialogue NLI. The method results in fewer persona contradictions on three evaluation sets. The evaluation sets can be used independently to automatically evaluate a dialogue model's persona consistency, reducing the need for human evaluation. We discuss several future research directions involving this approach.

Figure 1: Persona-based dialogue with a Key-Value Memory Network trained on Persona-Chat [21].

Figure 2: Relating triples, persona sentences, and utterances to derive annotated sentence pairs. Shown here is a "relation swap" contradiction.

2 Dialogue Consistency and Natural Language Inference

First, we review the dialogue generation and natural language inference problems, and the notions of consistency used throughout.

Dialogue Generation

Dialogue generation can be framed as next utterance prediction, in which an utterance $u_{t+1}$ (a sequence of tokens representing a sentence) is predicted given a conversation prefix $u_{\leq t}$. A sequence of utterances is interpreted as a dialogue between agents. For instance, an alternating two-agent dialogue which starts with agent $A$ and ends with agent $B$ is written as $u_1^A, u_2^B, u_3^A, u_4^B, \ldots, u_T^B$.

Persona-Based Dialogue

In persona-based dialogue, each agent is associated with a persona, $P_A$ and $P_B$. An utterance is now predicted using the conversation prefix $u_{\leq t}$ and the agent's own persona, e.g. $P_A$ for agent $A$. It is assumed that an agent's utterances are conditionally dependent on its persona, which can be interpreted as the utterances being representative of, or reflecting, the persona.

A typical approach for representing the persona is to use a set of utterances $P = \{p_1, \ldots, p_m\}$.


A consistency error, or contradiction, occurs when an agent produces an utterance that contradicts one of their previous utterances. Similarly, a persona consistency error, or persona contradiction, occurs when an agent produces an utterance that contradicts a subset of its persona.

A contradiction may be a clear logical contradiction, e.g. I have a dog vs. I do not have a dog, but in general is less clearly defined.

As a result, in addition to logical contradictions, we interpret a consistency error as being two utterances not likely to be said by the same persona. For instance, “i’m looking forward to going to the basketball game this weekend!” vs. “i don’t like attending sporting events”, as well as “i’m a lawyer” vs. “i’m a doctor” would be viewed here as contradictions, although they are not strict logical inconsistencies.

Similarly, a persona consistency error is interpreted here as an utterance which is not likely to be said given a persona described by a given set of persona sentences, in addition to logical contradictions.

Natural Language Inference

Natural Language Inference (NLI) assumes a dataset $D = \{((s_1, s_2)_i, y_i)\}_{i=1}^{N}$ which associates an input pair $(s_1, s_2)$ to one of three classes $y \in \{\text{entailment}, \text{neutral}, \text{contradiction}\}$. Each input item $s_j$ comes from an input space $S_j$, which in typical NLI tasks is the space of natural language sentences, i.e. $s_j$ is a sequence of words $(w_1, \ldots, w_K)$ where each word $w_k$ is from a vocabulary $V$.

The inputs $(s_1, s_2)$ are referred to as the premise and hypothesis, respectively, and each label is interpreted as meaning the premise entails the hypothesis, the premise is neutral with respect to the hypothesis, or the premise contradicts the hypothesis. The problem is to learn a function $f_{\text{NLI}}(s_1, s_2) \rightarrow \{E, N, C\}$ which generalizes to new input pairs.

Reducing Dialogue Consistency to NLI

Identifying utterances which contradict previous utterances or an agent’s persona can be reduced to natural language inference by assuming that contradictions are contained in a sentence pair.

That is, given a persona $P_A = \{p_1^A, \ldots, p_m^A\}$ for agent $A$ and a length-$T$ dialogue $u_1^A, u_2^B, \ldots, u_{T-1}^A, u_T^B$, it is assumed that a dialogue contradiction for agent $A$ is contained in an utterance pair $(u_i^A, u_j^A)$, and a persona contradiction is contained in a pair $(u_i^A, p_k^A)$. Similarly, we assume that entailments and neutral interactions, defined in Section 3, are contained in sentence pairs. Note that these assumptions discard interactions which require more than two sentences to express.

A natural language inference model can then identify entailing, neutral, or contradicting utterances. Section 3 proposes a dialogue-derived dataset for training $f_{\text{NLI}}$, and Section 4 proposes a method which incorporates $f_{\text{NLI}}$ with a dialogue model for next utterance prediction.
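The reduction above can be illustrated with a minimal sketch: every pair of an agent's own utterances, and every (utterance, persona sentence) pair, is passed to an NLI function. The names `find_contradictions` and `toy_nli` are hypothetical, and `toy_nli` is only a hand-written stand-in for a trained model, not the model used in the paper.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Label = str  # one of "E", "N", "C"


def find_contradictions(
    utterances: List[str],
    persona: List[str],
    nli: Callable[[str, str], Label],
) -> List[Tuple[str, str]]:
    """Return the sentence pairs that the NLI function labels as contradiction."""
    flagged = []
    # Dialogue contradictions: each earlier utterance against each later one.
    for earlier, later in combinations(utterances, 2):
        if nli(earlier, later) == "C":
            flagged.append((earlier, later))
    # Persona contradictions: each utterance against each persona sentence.
    for utterance in utterances:
        for sentence in persona:
            if nli(utterance, sentence) == "C":
                flagged.append((utterance, sentence))
    return flagged


def toy_nli(premise: str, hypothesis: str) -> Label:
    """Hypothetical stand-in for a trained model: knows two clashes only."""
    clashes = {
        frozenset({"i'm a lawyer", "i'm a doctor"}),
        frozenset({"i have a dog", "i do not have a dog"}),
    }
    return "C" if frozenset({premise, hypothesis}) in clashes else "N"


agent_utterances = ["i'm a lawyer", "how are you today?", "i'm a doctor"]
agent_persona = ["i have a dog"]
flagged_pairs = find_contradictions(agent_utterances, agent_persona, toy_nli)
print(flagged_pairs)
```

Because only sentence pairs are inspected, a contradiction that needs three or more sentences to express would not be found by this procedure, which mirrors the limitation noted above.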

3 Dialogue NLI Dataset

The Dialogue NLI dataset consists of sentence pairs labeled as entailment (E), neutral (N), or contradiction (C).


Sentences originate from a two-agent persona-based dialogue dataset. In this setting, a dialogue between agents $A$ and $B$ consists of a sequence of utterances $u_1^A, u_2^B, u_3^A, \ldots, u_T^B$, and each agent has a persona represented by a set of persona sentences $\{p_1^A, \ldots, p_{m_A}^A\}$ and $\{p_1^B, \ldots, p_{m_B}^B\}$. The Dialogue NLI dataset consists of $(u_i, p_j)$ and $(p_i, p_j)$ pairs from the Persona-Chat [21] dataset, labeled as follows. (We also release additional $(u_i, u_j)$ pairs, but experiments in this paper are not based on them. The dataset collection process is applicable to other persona-based dialogue datasets, e.g. [12].)

Labels. In order to determine NLI labels for our dataset, we require human annotation of the Persona-Chat dataset utterances, as the dataset does not contain this information. We perform such annotation by first associating a human-labeled triple with each persona sentence and with a subset of all utterances, as detailed in Section 3.1. The triple contains the main fact conveyed by a persona sentence, such as (i, have_pet, dog) for the persona sentence "I have a pet dog", or a fact mentioned in an utterance, such as "No, but my dog sometimes does."

Persona sentences and utterances are grouped by their triple (e.g. see Figure 2), and pairs $(u, p)$ and $(p, p)$ are defined as entailment, neutral, or contradiction based on their triple as follows. Refer to Table 1 for examples and Table 2 for a summary.

| Premise triple | Premise | Hypothesis | Hypothesis triple | Label |
|---|---|---|---|---|
| (i, like_activity, chess) | i listen to a bit of everything . it helps me focus for my chess tournaments . | i like to play chess . | (i, like_activity, chess) | E |
| - | how are you today? | i drink espresso . | (i, like_drink, espresso) | N |
| (i, like_goto, spain) | i love spain so much , i been there 6 times . | i think i will retire in a few years . | (i, want_do, retire) | N |
| (i, have_vehicle, car) | my vehicle is older model car . | i have pets . | (i, have_pet, pets) | N |
| (i, dislike, cooking) | i really do not enjoy preparing food for myself . | i like to cook with food i grow in my garden . | (i, like_activity, cooking) | C |
| (i, physical_attribute, short) | height is missing from my stature . | i am 7 foot tall . | (i, physical_attribute, tall) | C |
| (i, have_family, 3 sister) | i have a brother and 3 sisters . | i have a brother and four sisters . | (i, have_family, 4 sister) | C |

Table 1: Examples from the validation set.
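| Data Type | Label | Train (u, p) | Train (p, p) | Valid (u, p) | Valid (p, p) | Test (u, p) | Test (p, p) |
|---|---|---|---|---|---|---|---|
| Matching Triple | E | 43,000 | 57,000 | 5,000 | 500 | 4,500 | 900 |
| Misc. Utterance | N | 50,000 | - | 3,350 | - | 3,000 | - |
| Persona Pairing | N | 20,000 | 10,000 | 2,000 | - | 2,000 | - |
| Relation Swap | N | 20,000 | - | 150 | - | 400 | - |
| Relation Swap | C | 19,116 | 2,600 | 85 | 14 | 422 | 50 |
| Entity Swap | C | 47,194 | 31,200 | 4,069 | 832 | 3,400 | 828 |
| Numerics | C | 10,000 | - | 500 | - | 1,000 | - |
| Dialogue NLI Overall | | 310,110 (total) | | 16,500 (total) | | 16,500 (total) | |

Table 2: Dialogue NLI Dataset Properties. (u, p) and (p, p) refer to (utterance, persona sentence) and (persona sentence, persona sentence) pairs, respectively. Numerics consist of (u, u), (u, p), and (p, p) pairs.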
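As a quick arithmetic check on Table 2, the training counts can be summed per split and per label. This is only a sketch: the dictionary below transcribes the Train columns, with `None` standing for a "-" cell, and the helper names are hypothetical.

```python
# Train-split counts transcribed from Table 2 as (first column, second column).
# None marks a "-" cell.
train_counts = {
    "Matching Triple (E)": (43_000, 57_000),
    "Misc. Utterance (N)": (50_000, None),
    "Persona Pairing (N)": (20_000, 10_000),
    "Relation Swap (N)": (20_000, None),
    "Relation Swap (C)": (19_116, 2_600),
    "Entity Swap (C)": (47_194, 31_200),
    "Numerics (C)": (10_000, None),
}


def split_total(counts):
    """Sum every non-empty cell of a split."""
    return sum(v for pair in counts.values() for v in pair if v is not None)


def label_totals(counts):
    """Sum the cells per label, reading the letter inside the trailing parentheses."""
    totals = {}
    for name, pair in counts.items():
        label = name[-2]
        totals[label] = totals.get(label, 0) + sum(v for v in pair if v is not None)
    return totals


train_total = split_total(train_counts)
train_by_label = label_totals(train_counts)
print(train_total, train_by_label)
```

The per-split sum agrees with the "Dialogue NLI Overall" entry for the training set, and the per-label sums show how the three classes are distributed in that split.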


Each unique pair of sentences that share the same triple is labeled as entailment.
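A minimal sketch of this grouping step, assuming each sentence has already been annotated with a triple. The function name `entailment_pairs` is hypothetical, and the sketch treats pairs as unordered, which is a simplification.

```python
from collections import defaultdict
from itertools import combinations


def entailment_pairs(annotated_sentences):
    """Group sentences by triple and label each unique within-group pair as E."""
    by_triple = defaultdict(list)
    for sentence, triple in annotated_sentences:
        if sentence not in by_triple[triple]:  # skip verbatim duplicates
            by_triple[triple].append(sentence)
    return [
        (first, second, "E")
        for sentences in by_triple.values()
        for first, second in combinations(sentences, 2)
    ]


# Sentences and triples taken from Table 1.
annotated = [
    ("i listen to a bit of everything . it helps me focus for my chess tournaments .",
     ("i", "like_activity", "chess")),
    ("i like to play chess .", ("i", "like_activity", "chess")),
    ("i drink espresso .", ("i", "like_drink", "espresso")),
]
pairs = entailment_pairs(annotated)
print(pairs)
```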


Neutral pairs are obtained with three different methods.

First, a miscellaneous utterance is a $(u, p)$ pair where $u$ is not associated with any triple. This includes greetings ("how are you today?") and sentences unrelated to a persona sentence ("the weather is ok today"), so such utterances are assumed to be neutral with respect to persona sentences.

The second method, persona pairing, takes advantage of the fact that each ground-truth persona is typically not redundant or contradictory. A persona sentence pair $(p, p')$ is first selected from a persona if $p$ and $p'$ do not share the same triple. Then each sentence associated with the same triple as $p$ is paired with each sentence associated with the same triple as $p'$.

Lastly, we specify relation swaps for certain relations (see Appendix A.2) whose triples are assumed to represent independent facts, such as have_vehicle and have_pet. For a specified relation pair $(r_i, r_j)$, a sentence pair whose first sentence is associated with a triple $(\cdot, r_i, \cdot)$ and whose second sentence has triple $(\cdot, r_j, \cdot)$ is labeled as neutral. See Table 1 for an example.
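The persona pairing method described above can be sketched as follows. The function name `persona_pairing` and the sentence "i drive a car to work ." are hypothetical illustrations; the other sentences come from Table 1.

```python
from itertools import combinations, product


def persona_pairing(persona, sentences_by_triple):
    """Label cross-triple sentence pairs as neutral.

    persona: list of (persona sentence, triple) for one ground-truth persona.
    sentences_by_triple: maps a triple to every sentence annotated with it.
    """
    pairs = []
    for (_, triple_a), (_, triple_b) in combinations(persona, 2):
        if triple_a == triple_b:
            continue  # only persona sentences with different triples are used
        group_a = sentences_by_triple.get(triple_a, [])
        group_b = sentences_by_triple.get(triple_b, [])
        pairs.extend((a, b, "N") for a, b in product(group_a, group_b))
    return pairs


persona = [
    ("my vehicle is older model car .", ("i", "have_vehicle", "car")),
    ("i drink espresso .", ("i", "like_drink", "espresso")),
]
sentences_by_triple = {
    ("i", "have_vehicle", "car"): [
        "my vehicle is older model car .",
        "i drive a car to work .",  # hypothetical sentence sharing the triple
    ],
    ("i", "like_drink", "espresso"): ["i drink espresso ."],
}
neutral_pairs = persona_pairing(persona, sentences_by_triple)
print(neutral_pairs)
```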


Contradictions are obtained with three methods. See Figure 2 for an example.

First, the relation swap method is used by specifying contradicting relation pairs $(r_i, r_j)$ (see Appendix A.2), such as (like_activity, dislike), then pairing each sentence associated with the triple $(e_1, r_i, e_2)$ with each sentence associated with $(e_1, r_j, e_2)$.

Similarly, an entity swap consists of specifying relations, e.g. physical_attribute, that would yield a contradiction when the value of $e_2$ is changed to a different value $e_2'$, e.g. short to tall (see Appendix A.3). Sentences associated with $(e_1, r, e_2)$ are then paired with sentences associated with $(e_1, r, e_2')$.

Finally, a numeric contradiction is obtained by first selecting a sentence whose triple contains a number, where the number occurs in the sentence (e.g. see Table 1). A contradicting sentence is generated by replacing the sentence’s numeric surface form with a different randomly sampled integer in number or text form.
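The three contradiction methods can be sketched together. This is an illustration under stated assumptions rather than the authors' pipeline: `swap_pairs` and `numeric_contradiction` are hypothetical names, the caller supplies the two triples to swap, only digit-form numbers are replaced, and replacement integers are drawn from 0 to 10.

```python
import random
import re
from itertools import product

# Assumed range of replacement values; the sampling range is not given in the text.
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]


def swap_pairs(sentences_by_triple, triple_a, triple_b):
    """Relation or entity swap: pair every sentence of triple_a with every
    sentence of triple_b and label the pair as a contradiction."""
    group_a = sentences_by_triple.get(triple_a, [])
    group_b = sentences_by_triple.get(triple_b, [])
    return [(a, b, "C") for a, b in product(group_a, group_b)]


def numeric_contradiction(sentence, rng):
    """Replace the first digit-form number with a different integer,
    written either as digits or as a word. Returns None if no number occurs."""
    match = re.search(r"\d+", sentence)
    if match is None:
        return None
    original = int(match.group())
    choices = [n for n in range(len(NUMBER_WORDS)) if n != original]
    replacement = rng.choice(choices)
    surface = rng.choice([str(replacement), NUMBER_WORDS[replacement]])
    return sentence[:match.start()] + surface + sentence[match.end():]


# Relation swap example from Table 1: (dislike, like_activity) on the same entity.
sentences_by_triple = {
    ("i", "dislike", "cooking"): ["i really do not enjoy preparing food for myself ."],
    ("i", "like_activity", "cooking"): ["i like to cook with food i grow in my garden ."],
}
relation_swap = swap_pairs(
    sentences_by_triple, ("i", "dislike", "cooking"), ("i", "like_activity", "cooking")
)

rng = random.Random(0)
premise = "i have a brother and 3 sisters ."
hypothesis = numeric_contradiction(premise, rng)
print(relation_swap, (premise, hypothesis, "C"))
```

An entity swap uses the same `swap_pairs` call with two triples that share a relation but differ in the second entity, for example (i, physical_attribute, short) and (i, physical_attribute, tall).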

3.1 Triples Annotation

Each persona sentence is annotated with a triple $(e_1, r, e_2)$ through a Mechanical Turk task as follows. We first define a schema consisting of ⟨category⟩⟨relation⟩⟨category⟩ rules, such as ⟨person⟩⟨have_pet⟩⟨animal⟩, where the relation comes from a fixed set of relation types $R$, listed in Appendix A.1. Given a sentence, the annotator selects a relation $r$ from a drop-down populated with the values in $R$. The annotator then selects the categories and values of the entities $e_1$ and $e_2$ using drop-downs that are populated based on the schema rules. An optional drop-down contains numeric values for annotating entity quantities (e.g. 3 brothers). If selected, the numeric value is concatenated to the front of the entity value. The annotator can alternatively input an out-of-schema entity value in a text-box.

Using this method, each of the 10,832 persona sentences is annotated with a triple $(e_1, r, e_2)$, where $r \in R$, $e_1 \in E_1$, and $e_2 \in E_2$. Here $E_1$ is the set of all annotated $e_1$ from the drop-downs or the text-box, and $E_2$ is similarly defined.
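A small sketch of how an annotated triple might be represented, including the optional quantity that is concatenated to the front of the entity value. The `Triple` type, the `make_triple` helper, and the reduced relation set are hypothetical; the full relation list is in Appendix A.1.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    e1: str
    relation: str
    e2: str


# A few relation types from Appendix A.1, used here only for validation.
RELATION_TYPES = {"have_pet", "have_family", "like_activity", "dislike", "physical_attribute"}


def make_triple(e1: str, relation: str, e2: str, quantity: Optional[int] = None) -> Triple:
    """Build a triple, prefixing the entity value with a quantity if one was selected."""
    if relation not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {relation}")
    value = f"{quantity} {e2}" if quantity is not None else e2
    return Triple(e1, relation, value)


pet = make_triple("i", "have_pet", "dog")
family = make_triple("i", "have_family", "sister", quantity=3)
print(pet, family)
```

The second call reproduces the (i, have_family, 3 sister) form that appears in Table 1.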

Finally, utterances are associated with a triple as follows. Let $p$ be a persona sentence with triple $(e_1, r, e_2)$. We start with all utterances, $U$, from agents that have $p$ in their persona. An utterance $u \in U$ is then associated with the triple $(e_1, r, e_2)$ and persona sentence $p$ when $e_2$ is a sub-string of $u$, or when the word similarity between $u$ and $p$ is suitably large. (We use cosine similarity between the means of tf-idf weighted GloVe [14] word vectors.)
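The association rule can be sketched as a sub-string test followed by a similarity test. To keep the example self-contained, the sketch below swaps in cosine similarity over raw word counts in place of tf-idf weighted GloVe vectors, and the `threshold` value of 0.8 is a hypothetical placeholder rather than a value from the paper.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_triple(utterance, persona_sentence, triple, threshold=0.8):
    """Associate an utterance with a persona sentence's triple when the second
    entity occurs in the utterance, or when the two sentences are similar enough."""
    _, _, e2 = triple
    if e2 in utterance:  # sub-string test on the entity value
        return True
    similarity = cosine(Counter(utterance.split()), Counter(persona_sentence.split()))
    return similarity >= threshold


triple = ("i", "have_pet", "dog")
persona_sentence = "i have a pet dog ."
print(matches_triple("no , but my dog sometimes does .", persona_sentence, triple))
print(matches_triple("how are you today ?", persona_sentence, triple))
```

A plain sub-string test also fires on longer words that contain the entity value, so a production version might prefer token-level matching.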

3.2 Dataset Properties

Table 2 summarizes the dataset and its underlying data types. The label, triple, and data type are supplied as annotations for each sentence pair. All sentences were generated by humans during the crowdsourced dialogue collection process of the Persona-Chat dataset [21]. The resulting sentence pairs are thus drawn from a natural dialogue domain that differs from existing NLI datasets, which are either drawn from different domains such as image captions or use synthetic templating [2, 7, 11, 16, 19, 20].

4 Consistent Dialogue Agents via Natural Language Inference

We now present a method which demonstrates that natural language inference can be used to improve the consistency of dialogue agents, a downstream task. Candidate utterances are re-ranked based on whether the candidate is predicted to contradict a persona sentence. If the NLI model predicts that a candidate contradicts a persona sentence, the candidate's score is penalized, with the penalty weighted by the NLI model's confidence (in our experiments, the softmax output corresponding to the contradiction class from Dialogue NLI) and a scaling term.

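Specifically, assume a dialogue model $f_{\text{dialogue}}$ and a Dialogue NLI model $f_{\text{NLI}}$.

Given a persona $P = \{p_1, \ldots, p_m\}$, previous utterances $u_{\leq t}$, and a set of candidate next-utterances $U$, the dialogue model outputs a ranked list of scores $s_1, s_2, \ldots, s_{|U|}$ corresponding to next-utterance candidates $u_1, u_2, \ldots, u_{|U|}$.

The NLI model is then run on each $(u_i, p_j)$ pair, predicting a label $y_{i,j} \in \{E, N, C\}$ with confidence $c_{i,j}$. A contradiction score is computed for each candidate as:

$$s_i^{\text{contradict}} = \begin{cases} 0 & \text{if } y_{i,j} \neq C \text{ for all } p_j \in P, \\ \max_{j \,:\, y_{i,j} = C} c_{i,j} & \text{otherwise.} \end{cases}$$

That is, if the candidate $u_i$ does not contradict any persona sentence $p_j$ according to the NLI model, $s_i^{\text{contradict}}$ is zero. If $u_i$ contradicts one or more persona sentences, $s_i^{\text{contradict}}$ is the highest confidence, $c_{i,j}$, out of the contradicting $(u_i, p_j)$ pairs. (Future work could consider filtering previous-utterance contradictions as well.)

New candidate scores are then computed as

$$s_i^{\text{re-rank}} = s_i - \lambda \, (s_1 - s_k) \, s_i^{\text{contradict}},$$

and the candidates are sorted according to $s^{\text{re-rank}}$. Hyper-parameters $\lambda$ and $k$ control the degree of re-ranking. For example, if the top candidate has a contradiction score of $1.0$, then with $\lambda = 1$, it will be moved to the $k$'th position in the ranking. $\lambda = 0$ corresponds to no re-ranking.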
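The re-ranking step can be sketched in a few lines. The sketch assumes the penalty is linear, i.e. each score is reduced by λ times the gap between the top score and the k'th score times the contradiction score, which is consistent with the behaviour described above (a top candidate with contradiction score 1.0 and λ = 1 lands on the k'th score). The function names and the toy NLI function are hypothetical.

```python
def contradiction_score(candidate, persona, nli):
    """Highest contradiction confidence over persona sentences, or 0.0 if the
    NLI function predicts no contradiction. nli returns (label, confidence)."""
    confidences = [
        confidence
        for label, confidence in (nli(candidate, sentence) for sentence in persona)
        if label == "C"
    ]
    return max(confidences, default=0.0)


def rerank(candidates, scores, persona, nli, lam=1.0, k=10):
    """Penalize contradicting candidates and return (candidate, new score)
    pairs sorted from best to worst."""
    if not candidates:
        return []
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    s_top = scores[order[0]]
    s_k = scores[order[min(k, len(order)) - 1]]  # k'th best score, clipped to the list
    rescored = []
    for i in order:
        penalty = lam * (s_top - s_k) * contradiction_score(candidates[i], persona, nli)
        rescored.append((candidates[i], scores[i] - penalty))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_nli(candidate, persona_sentence):
    """Hypothetical stand-in for a trained Dialogue NLI model."""
    if (candidate, persona_sentence) == ("i do not have a dog", "i have a dog"):
        return ("C", 1.0)
    return ("N", 0.5)


candidates = ["i do not have a dog", "my dog is great", "hello"]
scores = [4.0, 2.0, 1.0]
persona = ["i have a dog"]
ranked = rerank(candidates, scores, persona, toy_nli, lam=1.0, k=3)
unchanged = rerank(candidates, scores, persona, toy_nli, lam=0.0, k=3)
print(ranked)
print(unchanged)
```

In this toy run the contradicting top candidate receives exactly the third-best score, so it no longer outranks the consistent candidate, while `lam=0.0` leaves the original order intact.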

5 Experiments

5.1 Experiment 1: NLI

| Model | Valid | Test |
|---|---|---|
| ESIM | 86.31 | 88.20 |
| InferSent | 85.82 | 85.68 |
| InferSent SNLI | 47.86 | 46.36 |
| InferSent Hypothesis-Only | 55.98 | 57.19 |
| Most Common Class | 33.33 | 34.54 |
| ESIM Ground-Truth Triples | 99.53 | 99.49 |

Table 3: Dialogue NLI Results (accuracy, %).


Many recently proposed NLI models can be separated into sentence encoding methods of the form $f_{\text{MLP}}(g_{\text{enc}}(s_1), g_{\text{enc}}(s_2))$, and attention-based methods of the form $f_{\text{MLP}}(g_{\text{attn}}(s_1, s_2))$ [9].

We train representative models of each type which have achieved competitive performance on existing NLI benchmark datasets. For the sentence encoding method, we use InferSent [4], which encodes a sentence using a bidirectional LSTM followed by max-pooling over the output states. As the representative attention-based method we use the Enhanced Sequential Inference Model (ESIM) [3], which computes an attention score for each word pair.

| Metric | Haves (Orig.) | Haves (Rerank) | Likes (Orig.) | Likes (Rerank) | Attributes (Orig.) | Attributes (Rerank) |
|---|---|---|---|---|---|---|
| Hits@1 | 30.2 | 37.3 | 16.9 | 18.7 | 35.2 | 36.4 |
| Contradict@1 | 32.5 | 8.96 | 17.6 | 4.1 | 8.0 | 5.7 |
| Entail@1 | 55.2 | 74.6 | 77.9 | 90.6 | 87.5 | 88.6 |

Table 4: Effect of NLI re-ranking on persona consistency in dialogue. The reported metrics are percentages computed over each validation set.

Additionally, we report results for a model trained and evaluated using the hypothesis sentence only (InferSent Hypothesis-Only)[5, 17], a model trained on the existing SNLI dataset [2] but evaluated on Dialogue NLI (InferSent SNLI), and a model which returns the most common class from the Dialogue NLI training set (Most Common Class).


Table 3 shows the performance of the two NLI models and three baselines on the Dialogue NLI validation and test sets.

The test performance for ESIM (88.2%) and InferSent (85.68%) is similar to performance reported on the existing SNLI dataset (88.0% [3] and 85.5% [4] respectively), suggesting that our task is comparably challenging for these models.

However, as seen in Table 3, an InferSent model trained on SNLI performs poorly when evaluated on Dialogue NLI (46.36%). This is likely due to a mismatch in sentence distributions between SNLI, which is derived from image captions, and Dialogue NLI, whose sentences more closely resemble downstream dialogue applications.

The hypothesis-only performance (57.19%) is lower than the hypothesis-only baseline for SNLI (69.00% [17]), and shows that using information from both the utterance and persona sentence is necessary to achieve good performance on Dialogue NLI.

ESIM's reasonably strong performance on Dialogue NLI suggests that the model may be useful for downstream tasks, a claim which we evaluate in Experiment 5.2. However, there is also room for improvement. In particular, we report performance for a model which takes the ground-truth triples as input instead of sentences. As seen in Table 3, each sentence's underlying triple contains sufficient information to achieve high performance (99.49%). This suggests that developing NLI models with a component that identifies relevant triples in a sentence may be valuable.

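| Model | Overall Score (Raw) | Overall Score (Calibrated) | % Consistent (Raw) | % Consistent (Calibrated) | % Contradiction (Raw) | % Contradiction (Calibrated) |
|---|---|---|---|---|---|---|
| KV-Mem | 2.11 ± 1.12 | 2.21 ± 0.26 | 0.24 | 0.27 ± 0.07 | 0.23 | 0.25 ± 0.08 |
| KV-Mem + NLI | 2.34 ± 1.21 | 2.38 ± 0.26 | 0.28 | 0.35 ± 0.08 | 0.19 | 0.16 ± 0.06 |

Table 5: Human evaluation results (mean ± standard deviation).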

5.2 Experiment 2: Consistency in Dialogue

This experiment evaluates the effect of the re-ranking method from Section 4 on a dialogue model’s persona consistency.

Experiment Setup

The re-ranking method of Section 4 uses a dialogue next utterance prediction model and the Dialogue NLI model.

For the dialogue model we train the Key-Value Memory Network of [21] on the Persona-Chat dataset, which uses persona sentences and the conversation prefix as context. This model achieved the best performance on Persona-Chat in [21].

For the NLI model we use the ESIM model trained on Dialogue NLI, based on the results of Experiment 5.1.

To study the effect of re-ranking on persona consistency, we form evaluation sets which contain next-utterances which are likely to yield persona contradictions or entailments, as follows.

Evaluation Sets

Each example is formed by first finding a next-utterance $u_{t+1}$ in the Persona-Chat validation set which has an associated triple of interest $(e_1, r, e_2)$. If a sentence in the agent's profile $P$ has the same triple, we form the validation example $(P, u_{\leq t}, u_{t+1})$. Figure 3 shows an example.

Each example is associated with candidates $U$, consisting of the ground-truth utterance $u_{t+1}$, 10 entailment candidates with the same triple as $u_{t+1}$, 10 contradicting candidates with a different triple than that of $u_{t+1}$, and 10 random candidates. The dialogue model must avoid ranking a contradicting candidate highly.

Specifically, suppose the ground-truth next-utterance $u_{t+1}$ is associated with triple $(e_1, r, e_2)$. Entailment candidates are utterances $u$ from the validation or training sets such that $u$ is associated with triple $(e_1, r, e_2)$. Since by construction a sentence in the profile also has triple $(e_1, r, e_2)$, these candidates entail a profile sentence. A contradicting candidate is an utterance associated with a specified contradicting triple $(e_1', r', e_2')$.
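Assembling the candidate set for one evaluation example can be sketched as below. The function name `build_candidates`, the pool contents, and the decision to shuffle are hypothetical; the sketch only fixes the composition described above (one ground-truth utterance plus 10 candidates of each other kind).

```python
import random
from collections import Counter


def build_candidates(gold, entail_pool, contradict_pool, random_pool, rng, n=10):
    """Return a shuffled list of (utterance, kind) pairs for one example.
    Each pool must contain at least n utterances."""
    candidates = [(gold, "gold")]
    candidates += [(u, "entail") for u in rng.sample(entail_pool, n)]
    candidates += [(u, "contradict") for u in rng.sample(contradict_pool, n)]
    candidates += [(u, "random") for u in rng.sample(random_pool, n)]
    rng.shuffle(candidates)
    return candidates


# Placeholder pools standing in for utterances grouped by triple.
entail_pool = [f"entailing utterance {i}" for i in range(12)]
contradict_pool = [f"contradicting utterance {i}" for i in range(12)]
random_pool = [f"random utterance {i}" for i in range(12)]

example_candidates = build_candidates(
    "ground-truth utterance", entail_pool, contradict_pool, random_pool, random.Random(0)
)
kind_counts = Counter(kind for _, kind in example_candidates)
print(kind_counts)
```

Keeping the kind of each candidate alongside its text is what makes the automatic consistency metrics possible, since the evaluator knows which candidates contradict or entail the persona.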

Three evaluation sets, Haves, Likes, and Attributes are formed using this process.


The construction above allows for automatic evaluation metrics for consistency, since candidates that contradict or entail a persona are known. We introduce variants of the ranking metric Hits@k, called Contradict@k and Entail@k.

Contradict@k measures the proportion of top-k candidates returned by the model which are contradicting candidates, averaged over examples. This measures the propensity of a model to highly rank contradictions. Contradict@1 is the proportion of consistency errors made by the model. Hence, for this metric lower values are better, in contrast to Hits@k.

Entail@k measures the proportion of top-k candidates returned by the model which are entailment candidates, averaged over examples. Entailment candidates share the same underlying triple as the ground-truth next utterance, so this metric rewards highly ranked candidates that convey similar meaning and logic to the ground-truth utterance. Thus it can be interpreted as a more permissive version of Hits@k.
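Both metrics reduce to the same computation over ranked candidate kinds, sketched below. The function names are hypothetical, each example is represented as the list of candidate kinds in ranked order, and whether the ground-truth utterance should also count toward Entail@k is left to the caller through the `wanted` set, since the text does not settle it.

```python
def proportion_at_k(ranked_kinds, wanted, k):
    """Fraction of the top-k candidates whose kind is in `wanted`, for one example."""
    return sum(1 for kind in ranked_kinds[:k] if kind in wanted) / k


def metric_at_k(examples, wanted, k=1):
    """Average the per-example proportion over all examples."""
    return sum(proportion_at_k(ranked, wanted, k) for ranked in examples) / len(examples)


# Two toy examples, each listing candidate kinds from rank 1 downwards.
examples = [
    ["contradict", "gold", "entail", "random"],
    ["entail", "contradict", "gold", "random"],
]
contradict_at_1 = metric_at_k(examples, {"contradict"}, k=1)
entail_at_1 = metric_at_k(examples, {"entail"}, k=1)
contradict_at_2 = metric_at_k(examples, {"contradict"}, k=2)
print(contradict_at_1, entail_at_1, contradict_at_2)
```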


Table 4 shows re-ranking results on the three evaluation sets. The NLI re-ranking improves all metrics on all evaluation sets. Overall dialogue performance improves, as measured by Hits@1. The NLI re-ranking substantially reduces the number of contradicting utterances predicted by the model, and increases the number of utterances which entail a profile sentence, as seen in the Contradict@1 and Entail@1 scores.

Figure 3 shows an example dialogue with candidates, contradictions predicted by the NLI model, and the corresponding re-ranked candidates.

5.3 Experiment 3: Human Evaluation

This experiment evaluates the effect of the proposed NLI re-ranking method on a dialogue model’s consistency, where consistency is judged by human evaluators in an interactive persona-based dialogue setting.

Experiment Setup

We use ParlAI [13], which integrates with Amazon Mechanical Turk for human evaluation. A human annotator is paired with a model, and each is randomly assigned a persona from 1155 persona sets. The human and model are then asked to hold a conversation of at least either five or six turns (randomly decided). After the conversation, the annotator assigns three scores to the conversation, described below. Each annotator is allowed to participate in at most ten conversations per model, and we collect 100 conversations per model. Two models are evaluated: the same Key-Value Memory Network used in Experiment 5.2 without re-ranking (KV-Mem), and with re-ranking (KV-Mem + NLI).

Scoring and Calibration

Following a conversation, an annotator is shown the conversation and the model’s persona, and assigns three scores: an overall score of how well the model represented its persona ({1,2,3,4,5}), a marking of each model utterance that was consistent with the model’s persona ({0,1}), and a marking of each model utterance that contradicted a previous utterance or the model’s persona ({0,1}).

To adjust for annotator bias, we calibrate scores by assuming a model with observed scores and latent variables for the unobserved score of each model and for the bias of each annotator. We then estimate the posterior mean and variance for the unobserved scores given the observed scores. See Appendix C for details.


Table 5 shows the human evaluation results. The natural language inference re-ranking improves all metrics, notably the calibrated fine-grained consistency score (0.27 vs. 0.35) and contradiction score (0.25 vs. 0.16). The results are consistent with the conclusions from the automatic evaluation in Experiment 5.2.

6 Conclusion

In this paper, we demonstrated that natural language inference can be used to improve performance on a downstream dialogue task. To do so, we created a new dialogue-derived dataset called Dialogue NLI, a re-ranking method for incorporating a Dialogue NLI model into a dialogue task, and an evaluation set which measures a model’s persona consistency. The dataset offers a new domain for NLI models, and suggests avenues such as developing models which identify relevant triples when determining an entailment category, or devising alternative methods for using natural language inference components in downstream tasks.

Figure 3: Example from the Likes Evaluation Set, showing dialogue model candidates, NLI model predictions, and reranked candidates using the method proposed in Section 4.


  • [1] Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep Universal Probabilistic Programming. arXiv preprint arXiv:1810.09538, 2018.
  • [2] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics, 2015.
  • [3] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for Natural Language Inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics, 2017.
  • [4] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark, 2017. Association for Computational Linguistics.
  • [5] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. Annotation Artifacts in Natural Language Inference Data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana, 2018. Association for Computational Linguistics.
  • [6] Matthew D. Hoffman and Andrew Gelman. The no-u-turn sampler: Adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014.
  • [7] Tushar Khot, Ashish Sabharwal, and Peter Clark. SCITAIL: A Textual Entailment Dataset from Science Question Answering. In AAAI, 2018.
  • [8] Diederik P Kingma and Jimmy Lei Ba. Adam: A Method For Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [9] Wuwei Lan and Wei Xu. Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3890–3902, Santa Fe, New Mexico, USA, 2018. Association for Computational Linguistics.
  • [10] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A Persona-Based Neural Conversation Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany, 2016. Association for Computational Linguistics.
  • [11] M Marelli, S Menini, M Baroni, L Bentivogli, R Bernardi, and R Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland, 2014. European Language Resources Association (ELRA).
  • [12] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. Training Millions of Personalized Dialogue Agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018. Association for Computational Linguistics.
  • [13] Alexander H Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. ParlAI: A Dialog Research Software Platform. arXiv preprint arXiv:1705.06476, 2017.
  • [14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [15] Adam Poliak, Yonatan Belinkov, James Glass, and Benjamin Van Durme. On the Evaluation of Semantic Phenomena in Neural Machine Translation Using Natural Language Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 513–523, New Orleans, Louisiana, 2018. Association for Computational Linguistics.
  • [16] Adam Poliak, Aparajita Haldar, Rachel Rudinger, J Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81. Association for Computational Linguistics, 2018.
  • [17] Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis Only Baselines in Natural Language Inference. In The Seventh Joint Conference on Lexical and Computational Semantics (*SEM), 2018.
  • [18] Oriol Vinyals and Quoc V Le. A Neural Conversational Model. In ICML Deep Learning Workshop, 2015.
  • [19] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv preprint arXiv:1804.07461, 2018.
  • [20] Adina Williams, Nikita Nangia, and Samuel R Bowman. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana, 2018. Association for Computational Linguistics.
  • [21] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing Dialogue Agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia, 2018. Association for Computational Linguistics.

Appendix A Dataset Details

A.1 Schema

Relation Types

place_origin, live_in_citystatecountry, live_in_general, nationality, employed_by_company, employed_by_general, has_profession, previous_profession, job_status, teach, school_status, has_degree, attend_school, like_general, like_food, like_drink, like_animal, like_movie, like_music, like_read, like_sports, like_watching, like_activity, like_goto, dislike, has_hobby, has_ability, member_of, want_do, want_job, want, favorite_food, favorite_color, favorite_book, favorite_movie, favorite_music, favorite_music_artist, favorite_activity, favorite_drink, favorite_show, favorite_place, favorite_hobby, favorite_season, favorite_animal, favorite_sport, favorite, own, have, have_pet, have_sibling, have_children, have_family, have_vehicle, physical_attribute, misc_attribute, has_age, marital_status, gender, other.

Additional triples with a not_have relation were extracted using a dependency tree pattern.

Entity Categories

ability, activity, animal, color, citystate, country, company, cuisine, degree_type, drink, family, food, gender, general_location, job_status, language, marital, media_genres, media_other, movie_title, music_artist, music_genre, music_instrument, noun, number, organization, person, person_attribute, person_label, personality_trait, profession, read_author, read_genre, read_title, read_other, school_name, school_status, school_type, season, sport_type, subject, time, vehicle, location, other.

A.2 Relation Swaps

Relation swaps for contradictions include (have_*, not_have), (own, not_have), (has_hobby, not_have), (like_*, dislike), and (favorite_*, dislike).

Neutral relation swaps include (have_x, have_y), e.g. (have_pet, have_sibling). Additional (have_* A, not_have B) swaps were defined for entities A that are a super-type of B, namely the (A, B) pairs ({pet, animal}, {dog, cat}), ({sibling}, {brother, sister}), ({child, kid}, {son, daughter}), ({vehicle}, {car, truck}); this includes sentence pairs such as “i have a sibling”, “i do not have a sister”. Similarly, (not_have B, have_* A) swaps were defined using the (A, B) pairs above.
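The swap rules above can be sketched as a small lookup. This is a minimal illustration, not the authors' implementation: the names `CONTRADICTION_SWAPS`, `SUPER_TYPES`, `contradiction_swaps`, and `is_supertype_pair` are hypothetical, and treating `*` as a shell-style wildcard over relation names is an assumption.

```python
from fnmatch import fnmatchcase

# Contradiction swap patterns from A.2; '*' is assumed to match any suffix.
CONTRADICTION_SWAPS = [
    ("have_*", "not_have"),
    ("own", "not_have"),
    ("has_hobby", "not_have"),
    ("like_*", "dislike"),
    ("favorite_*", "dislike"),
]

# (A, B) pairs from A.2, where each entity in A is a super-type of each entity in B.
SUPER_TYPES = [
    ({"pet", "animal"}, {"dog", "cat"}),
    ({"sibling"}, {"brother", "sister"}),
    ({"child", "kid"}, {"son", "daughter"}),
    ({"vehicle"}, {"car", "truck"}),
]


def contradiction_swaps(relation):
    """Return the relations that `relation` can be swapped with to form a contradiction."""
    return [target for pattern, target in CONTRADICTION_SWAPS
            if fnmatchcase(relation, pattern)]


def is_supertype_pair(entity_a, entity_b):
    """True if (entity_a, entity_b) is one of the listed super-type pairs."""
    return any(entity_a in supers and entity_b in subs
               for supers, subs in SUPER_TYPES)
```

For instance, `contradiction_swaps("have_pet")` yields `["not_have"]`, while `is_supertype_pair("sibling", "sister")` identifies the pair behind “i have a sibling”, “i do not have a sister”.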

A.3 Entity Swaps

Swapping entities was assumed to yield a contradiction for the following relation types:

attend_school, employed_by_company, employed_by_general, favorite_animal, favorite_book, favorite_color, favorite_drink, favorite_food, favorite_hobby, favorite_movie, favorite_music, favorite_music_artist, favorite_place, favorite_season, favorite_show, favorite_sport, gender, has_profession, job_status, live_in_citystatecountry, marital_status, nationality, place_origin, previous_profession, school_status, want_job.

Additionally, for physical_attribute, misc_attribute, or other relations, an entity swap was done using all WordNet antonym pairs in the personality_trait and person_attribute entity categories, as well as the swaps ({blonde}, {brunette}), ({large}, {tiny}), ({carnivore, omnivore}, {vegan, vegetarian}), ({depressed}, {happy, cheerful}), ({clean}, {dirty}) where each entity in the left set is swapped with each entity in the right set.
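The set-to-set swaps described above can be expanded into explicit entity pairs as follows. This is a sketch under assumptions: the helper names `expand_swaps` and `swap_entity` are hypothetical, the triple layout `(e1, relation, e2)` is assumed for illustration, and the WordNet antonym lookup is omitted.

```python
from itertools import product

# Hand-specified attribute swaps from A.3: each entity in the left set
# is swapped with each entity in the right set.
ATTRIBUTE_SWAPS = [
    ({"blonde"}, {"brunette"}),
    ({"large"}, {"tiny"}),
    ({"carnivore", "omnivore"}, {"vegan", "vegetarian"}),
    ({"depressed"}, {"happy", "cheerful"}),
    ({"clean"}, {"dirty"}),
]


def expand_swaps(swap_sets):
    """Expand (left set, right set) entries into all (left, right) entity pairs."""
    pairs = set()
    for left, right in swap_sets:
        pairs.update(product(left, right))
    return pairs


def swap_entity(triple, new_entity):
    """Return a copy of an assumed (e1, relation, e2) triple with e2 replaced."""
    e1, relation, _ = triple
    return (e1, relation, new_entity)


ENTITY_PAIRS = expand_swaps(ATTRIBUTE_SWAPS)
```

Given a triple such as `("i", "physical_attribute", "blonde")`, pairing it with `swap_entity(triple, "brunette")` would produce a candidate contradiction under the rules above.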

Appendix B Experiment Details

Experiment 1

The InferSent model used the Adam [8] optimizer with learning rate 0.001, and otherwise used the hyper-parameters from the open source implementation (https://github.com/facebookresearch/InferSent). The ESIM model used a 1-layer bidirectional LSTM with hidden dimension 1024 and Adam optimizer with learning rate 0.0001, with the remaining hyper-parameters set to those used by the InferSent model.

Experiment 2

The dialogue model was trained using ParlAI [13] on the personachat:self_original task, using the hyper-parameters given for the KVMemnnAgent in the ConvAI2 competition. The NLI model was the same ESIM model from Experiment 1.

Appendix C Score Calibration

1-5 star rating

Let $M_i$ be the unobserved, underlying quality of the $i$-th approach. Let $B_j$ be the unobserved annotator bias, indicating whether the $j$-th annotator is more or less generous. We observe a score $S_{ij}$ given by the $j$-th annotator to the $i$-th approach, and this score follows a normal distribution with its mean given by the sum of the underlying model score and annotator bias, i.e., $S_{ij} \sim \mathcal{N}(M_i + B_j, \sigma^2)$ for a variance $\sigma^2$. We observe some of these scores, and given these scores, the goal is to infer the posterior mean and variance of $M_i$ for all $i$.
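To make the observation model concrete, the sketch below evaluates the log-likelihood of a set of star ratings given candidate values for the latent quality and annotator bias. It is illustrative only: the names `quality`, `bias`, and `std` are placeholders, the unit standard deviation is an assumption, and the priors over the latent variables are left out.

```python
import math


def normal_logpdf(x, mean, std):
    """Log-density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def rating_log_likelihood(scores, quality, bias, std=1.0):
    """Sum of log N(score; quality[i] + bias[j], std^2) over observed ratings.

    scores: iterable of (approach index i, annotator index j, observed score).
    std: assumed observation standard deviation (hypothetical default).
    """
    return sum(normal_logpdf(s, quality[i] + bias[j], std) for i, j, s in scores)


# Two approaches rated by a generous annotator (0) and a strict annotator (1).
scores = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 3.0), (1, 1, 2.0)]
with_bias = rating_log_likelihood(scores, quality=[3.5, 2.5], bias=[0.5, -0.5])
without_bias = rating_log_likelihood(scores, quality=[3.5, 2.5], bias=[0.0, 0.0])
```

Here `with_bias` exceeds `without_bias`, since accounting for each annotator's generosity explains the ratings exactly. A posterior sampler explores this likelihood together with the priors.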

Utterance-pair selection

Each annotator is asked to label each utterance-pair as consistent and/or contradictory with respect to the personas. In this case, the unobserved, underlying model score $M_i$ is modelled as a pre-sigmoid normal variable, and the annotator bias $B_j$ as a usual normal variable, similarly to the 1-5 star rating case above. We however also introduce a turn bias $T_k$ for the $k$-th turn to incorporate the potential degradation of a neural dialogue model as the conversation lengthens. An observed score $S_{ijk}$ for each utterance pair then follows a Bernoulli distribution with its mean given as the sigmoid of the sum of these three latent variables, i.e., $S_{ijk} \sim \mathrm{Bernoulli}(\mathrm{sigmoid}(M_i + B_j + T_k))$. The goal of inference is to compute the posterior mean and variance of the underlying model score.
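The Bernoulli observation model for utterance-pair labels can be sketched in the same way. The function and argument names (`pair_log_likelihood`, `quality`, `annotator_bias`, `turn_bias`) are hypothetical, and priors are again omitted; only the likelihood described above is shown.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def pair_log_likelihood(labels, quality, annotator_bias, turn_bias):
    """Bernoulli log-likelihood of binary utterance-pair labels.

    labels: iterable of (approach i, annotator j, turn k, label in {0, 1}).
    The success probability is sigmoid(quality[i] + annotator_bias[j] + turn_bias[k]).
    """
    total = 0.0
    for i, j, k, y in labels:
        p = sigmoid(quality[i] + annotator_bias[j] + turn_bias[k])
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# One approach, one annotator, two turns; the later turn carries a negative bias.
labels = [(0, 0, 0, 1), (0, 0, 1, 0)]
flat = pair_log_likelihood(labels, [0.0], [0.0], [0.0, 0.0])
degrading = pair_log_likelihood(labels, [0.0], [0.0], [0.0, -1.0])
```

A negative turn bias lowers the probability of a positive label at later turns, which is how this model can absorb degradation as the conversation lengthens.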


We use Pyro [1] and the No-U-Turn Sampler (NUTS) [6] for posterior inference.