Generating Persona Consistent Dialogues by Exploiting Natural Language Inference

11/14/2019 ∙ by Haoyu Song, et al. ∙ Harbin Institute of Technology

Consistency is one of the major challenges faced by dialogue agents. A human-like dialogue agent should not only respond naturally, but also maintain a consistent persona. In this paper, we exploit natural language inference (NLI) techniques to address the problem of generating persona-consistent dialogues. Unlike existing work that re-ranks retrieved responses with an NLI model, we cast the task as a reinforcement learning problem and use the NLI signals from response-persona pairs as rewards for dialogue generation. Specifically, our generator employs an attention-based encoder-decoder to generate persona-based responses. Our evaluator consists of two components: an adversarially trained naturalness module and an NLI-based consistency module. Moreover, we use another strong NLI model to evaluate persona-consistency. Experimental results on both human and automatic metrics, including the model-based consistency evaluation, demonstrate that the proposed approach outperforms strong generative baselines, especially in the persona-consistency of generated responses.





Despite the recent success of open-domain dialogue generation trained on large volumes of human-to-human interaction data [12, 11, 23, 24], conversing with a dialogue agent is still in its infancy, and one major issue for these data-driven models is the lack of a consistent persona [16, 8, 22, 13]. Figure 1 shows how consistency affects the quality of dialogues.

One practical approach to increase the consistency of a dialogue agent is to explicitly define a set of personal facts describing the characters (the personas) of the agent and learn to generate responses that reflect the predefined personas [22]. However, due to the lack of consistency modeling and the maximum-likelihood estimation (MLE) objective function, these persona-based models still face the inconsistency issue.


Figure 1: Naturalness is an important attribute of dialogue responses. In persona-based dialogue generation, the consistency with persona is another essential factor to consider. An ideal response should be not only natural but also consistent with the persona.

Natural Language Inference (NLI) learns a mapping between a sentence pair and an entailment category. Taking advantage of NLI techniques in natural language understanding [1], the detection of persona-consistency can be modeled as an NLI task [17], which assigns a label of entailment, neutral, or contradiction to an “(utterance, persona)” pair. Meanwhile, existing persona-based dialogue models are limited by their loss functions: for these deep generative models, it is difficult to design a differentiable training objective that exploits the NLI-based consistency detection method. Instead of designing a differentiable training objective, reinforcement learning (RL) offers another solution to this problem, using reward signals to guide the generator.

In this paper, rather than re-ranking retrieved responses [17], we take advantage of NLI techniques to guide the generation of persona-consistent dialogues. Specifically, we propose a system trained using reinforcement learning. Our model has one generator and one evaluator with two modules: a naturalness module and a consistency module. The naturalness module is trained adversarially for higher accuracy. For the consistency module, we use an NLI-style classifier to detect the consistency between responses and personas. We further employ two different NLI classifiers in our experiments to investigate the role of NLI signals. The generator, which is a persona-based attentive Seq2Seq model [22], takes message and persona texts as input and generates responses that reflect the persona texts. Note that more advanced generative models such as MASS [14] can also be used as our generator.

We summarize the contributions as follows:

  • We propose an RL framework for persona-consistent dialogue generation, which removes the requirement that the training objective of persona-based dialogue models be differentiable.

  • To the best of our knowledge, this is the first work that exploits NLI techniques to enhance the generation of persona consistent dialogues.

  • Evaluations are carried out both quantitatively and qualitatively, and experimental results show that our model outperforms strong baselines, especially in terms of persona-consistency.

Related Work

Persona-based Dialogue Generation

In open-domain dialogue generation, Zhang et al. [22] initiate a new line of research (persona-based dialogue) by introducing the Persona-Chat dataset, which provides explicit persona information in each dialogue session. They further propose two generative models, persona-Seq2Seq and the Generative Profile Memory Network, to incorporate persona texts into responses. In the persona-based scenario, a model is associated with a persona, which is composed of several persona texts (see the top two sentences in Figure 1). A response is then generated using both the input message and the assigned persona. Following this line, the DeepCopy model [19] has been applied to persona-based dialogue generation. These works have laid a solid foundation for this area. Through attention or copy mechanisms, generated responses can reflect the predefined persona. However, the loss functions in these models do not take consistency into account, and inconsistency remains a problem in the existing approaches [17].

Natural Language Inference

The task of Natural Language Inference (NLI) is to learn a function $f_{\mathrm{NLI}}(p, h) \rightarrow \{E, N, C\}$, where $p$ and $h$ denote the premise and the hypothesis respectively. The outputs E, N and C represent entailment, neutral and contradiction between the premise and the hypothesis. Since the release of the large-scale SNLI corpus [1], deep neural network methods have made promising progress [2, 6, 7]. Welleck et al. [17] model the detection of dialogue consistency as an NLI task and propose the DNLI (Dialogue NLI) dataset, which is similar to SNLI but in the domain of persona-based dialogue. Further, they verify the effectiveness of using an NLI model to re-rank candidate responses in a retrieval-based dialogue model. Compared with retrieval-based models, the responses from generative models are not limited to the given dataset. Moreover, exploiting consistency detection in deep generative dialogue models has not been explored yet.

Reinforcement Learning

In recent years, deep reinforcement learning has been widely applied in natural language processing, for example in machine translation [18], visual question generation [5], paraphrase generation [10], and anaphora resolution [20]. The advantage of reinforcement learning is that it does not need a differentiable objective function. In open-domain dialogue generation, Li et al. (2016) manually define three rewards and use reinforcement learning to train the dialogue agent. Further, Li et al. [9] apply an adversarial learning method [21] to dialogue generation and propose the REGS model. This model shows its strength in the naturalness of generated responses. However, natural responses can also be inconsistent, especially in the persona-based scenario (as shown in Figure 1).

Figure 2: The overall framework of our model, which mainly consists of a generator and an evaluator. The dashed connection only appears in the generation process. Two marker symbols in the figure denote that the generation of a response is encouraged or discouraged, respectively, by the reward signals.

Proposed Approach

Problem Definition

Our goal is to learn a generative model that delivers persona-consistent dialogues. Formally, given an input message $X$ and a set of persona texts $P = \{P_1, P_2, \ldots, P_n\}$, the model generates a response $Y$ based on both the input message and the persona text set, i.e., it models $p(Y \mid X, P)$. Moreover, $Y$ should be consistent with the persona text set $P$, which means the NLI category between $Y$ and any $P_i$ should be entailment or neutral, rather than contradiction, i.e., $\mathrm{NLI}(Y, P_i) \in \{E, N\}$, $\forall P_i \in P$, where E and N denote entailment and neutral respectively.


The proposed reinforcement learning framework consists of two components: an evaluator and a sequence generator, as illustrated in Figure 2.

As mentioned above, an ideal response should be not only natural but also consistent with the personas. Therefore, we consider these two attributes of responses while training the generator. More concretely, we consider whether a response is as natural as one from a human (natural or unnatural) and whether it is consistent with the predefined personas (entailment, neutral or contradiction). These two attributes are independent of each other, and an ideal response should satisfy both at once, i.e., it should be natural and consistent with the persona (Formula (1)).


The key idea is to encourage the generator to generate responses that satisfy Formula (1). We use the policy gradient method in reinforcement learning to train the generator. We will discuss this in detail later.

Notice that our evaluator consists of two modules, rather than one jointly trained module, because the two attributes differ in nature. The naturalness module can benefit from the adversarial training scheme [21, 9], as naturalness is reflected in the training data, and as a submodule of the evaluator it achieves higher accuracy with adversarial training. In contrast, no labels are available in the training process to improve the performance of natural language inference.

Naturalness Module

The naturalness module is proposed to distinguish between human responses and model-generated ones. As the generator is updated during the training process, new examples from the model are generated. Therefore, the naturalness module must be updated as well.

It is safe to assume that responses from humans are always more natural than the ones from models. Based on this observation, we take responses from the training data as positive examples and responses from the generator as negative examples. The naturalness module guides the sequence generator to predict responses closer to the examples from the training data, which are more natural.

In more detail, the naturalness module is a binary classifier that takes a response, either from a human or from the generator, as input and produces a softmax distribution over two classes, indicating whether the response is from a human (natural) or from a model (unnatural). The input is encoded into a vector representation using a bidirectional GRU encoder, which is then fed to a highway architecture, followed by a fully-connected layer with a two-class softmax function. (Footnote 1: In our experiments, we found that the alternative way of forming the input to the naturalness module did not bring significant improvements in its accuracy, so we chose the more straightforward way.)

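The objective of the naturalness module is to minimize the cross-entropy between the ground-truth label and the predicted probability. The reward from the naturalness module is

$$R_1 = p(\text{human} \mid \hat{Y}) \qquad (2)$$

where $p(\text{human} \mid \hat{Y})$ is the output probability that the response $\hat{Y}$ is from a human.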
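As a minimal sketch of how such a reward could be read off the classifier, the snippet below applies a two-class softmax to a pair of logits and returns the probability of the human class. The function name `naturalness_reward`, the logit inputs, and the convention that index 0 is the human class are assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def naturalness_reward(logits: List[float], human_index: int = 0) -> float:
    """Return the classifier's probability that a response is human-written.

    Assumption: `logits` holds the two class scores of a binary classifier
    and `human_index` marks the human (natural) class.
    """
    return softmax(logits)[human_index]


# An undecided classifier gives a reward of 0.5; a classifier that leans
# toward "human" gives a reward above 0.5.
undecided = naturalness_reward([0.0, 0.0])
leaning_human = naturalness_reward([2.0, 0.0])
```

Because the reward is a probability, it stays in [0, 1], which keeps it on a comparable scale with other bounded reward terms.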

Consistency Module

The consistency module is an NLI classifier. We introduce this module to detect consistency in dialogues by distinguishing {entailment, neutral, contradiction} between generated responses and the persona texts. Recent NLI models [3, 2, 6, 7] are usually trained on large-scale datasets like SNLI [1]. The domain adaptation problem could lead to a performance gap. Therefore, a better dataset for our task is the recently released DNLI [17], which is in the persona-based dialogue domain.

The consistency module is not updated in the adversarial training process of the naturalness module. Under the assumption that responses from humans are natural, the naturalness module can always get positive examples (the human responses from the training data) and negative examples (the generated responses from the generator) during adversarial training. However, as exemplified in Figure 1, a natural response does not necessarily entail the persona texts, and vice versa. Due to this difference, the consistency module cannot be iteratively updated like the naturalness module.

In addition to exploiting NLI models in dialogue generation, another issue worth exploring is how the performance of different NLI signals affects the quality of dialogue generation. Thus in our experiments, we apply two NLI classifiers with performance differences:

  • Base Model: We use a GRU to learn the sentence representations and then feed them into a multilayer perceptron (MLP) classifier. In our experiments, the MLP has a hidden layer with tanh activation and a softmax output layer. For training, we use a multi-class cross-entropy loss. In the following sections, we refer to this model as the base model.

  • Finetuned BERT: With multilayer bidirectional Transformers [15], BERT [4] has achieved state-of-the-art results on various natural language understanding tasks, including NLI. We finetune the BERT model on the DNLI dataset and achieve the best results among several reported results on this dataset. In the following sections, we refer to this model as the finetuned BERT.

Finally, we obtain the three-class confidences from the output layer of the consistency module. The reward from the consistency module can be formulated as:


where E is the confidence for entailment and C is the confidence for contradiction, and the index on each confidence identifies the persona text against which it is calculated for the generated response. This reward is designed to encourage entailment and discourage contradiction between the response and the persona texts.
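To make the idea of an entailment-minus-contradiction reward concrete, here is a minimal sketch under an explicit assumption: the per-persona confidences are aggregated by taking the strongest entailment and the strongest contradiction. This aggregation, the function name `consistency_reward`, and the dictionary layout are illustrative choices, not the paper's stated formula.

```python
from typing import Dict, List


def consistency_reward(nli_confidences: List[Dict[str, float]]) -> float:
    """Reward = strongest entailment confidence - strongest contradiction confidence.

    `nli_confidences` holds one dict per persona text with keys "E", "N", "C"
    (the three-class output of an NLI classifier for that response-persona pair).
    The max-based aggregation is an assumption made for this sketch.
    """
    if not nli_confidences:
        return 0.0
    best_entailment = max(conf["E"] for conf in nli_confidences)
    worst_contradiction = max(conf["C"] for conf in nli_confidences)
    return best_entailment - worst_contradiction


# One persona text is entailed, another is contradicted: the two effects
# partly cancel, so the reward is only slightly positive.
reward = consistency_reward([
    {"E": 0.7, "N": 0.2, "C": 0.1},
    {"E": 0.1, "N": 0.3, "C": 0.6},
])
```

With this shape, a response that entails some persona text and contradicts none approaches a reward of 1, while a response that only contradicts approaches -1.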

Reinforcement Learning

We formalize persona-consistent dialogue generation as a reinforcement learning task. That is, we train a generator $G_\theta$ to produce a response $\hat{Y}_{1:T} = (y_1, \ldots, y_t, \ldots, y_T)$, where $y_t$ represents a word in the vocabulary. At each timestep $t$, the state is the sequence of words produced so far, $\hat{Y}_{1:t-1}$, and the action is the next selected word $y_t$. The policy model $G_\theta(y_t \mid \hat{Y}_{1:t-1})$ defines the probability of selecting the $t$-th word given the previously generated words, which form the current state.

Sequence Generator

Our generator takes a form similar to a Seq2Seq model with an attention mechanism. The only difference is that we prepend the persona texts to the input sequence, i.e., $X' = P_1 \oplus P_2 \oplus \cdots \oplus P_n \oplus X$, where $\oplus$ denotes concatenation, $P_1, \ldots, P_n$ are the persona texts, and $X$ is the input message. The same strategy is also applied in the generative model of Zhang et al. [22].
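The input construction described above can be sketched in a few lines. The helper name `build_generator_input`, the whitespace separator, and the second persona sentence in the example are assumptions for illustration only.

```python
from typing import List


def build_generator_input(persona_texts: List[str], message: str, sep: str = " ") -> str:
    """Prepend all persona texts to the input message by simple concatenation."""
    return sep.join(list(persona_texts) + [message])


# Illustrative persona and message (lowercased, pre-tokenized text).
example = build_generator_input(
    ["i only listen to country music", "i have two dogs"],
    "what sort of music do you listen to ?",
)
```

The encoder then reads the concatenated sequence as a single input, so attention can reach both the message and every persona text.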

Reward Estimation

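In reinforcement learning, the training objective is to maximize the accumulated future rewards. We encourage the generator to produce responses that are close to human responses and consistent with the predefined persona. Based on the rewards from the naturalness module and the consistency module, the final reward function is:

$$R_{D_\phi}(\hat{Y}) = \lambda R_1 + (1 - \lambda) R_2 \qquad (4)$$

where $D_\phi$ is our evaluator with parameter $\phi$, $R_1$ and $R_2$ are the rewards from the naturalness module and the consistency module, and $\lambda$ is a weight that balances them. We train the generator $G_\theta$ to generate a response from the initial state $s_0$ so as to maximize its expected final reward:

$$J(\theta) = \sum_{y_1} G_\theta(y_1 \mid s_0) \, Q^{G_\theta}_{D_\phi}(s_0, y_1) \qquad (5)$$

where $Q^{G_\theta}_{D_\phi}(s, a)$ is the action-value function at timestep $t$. When there is a finished response $\hat{Y}_{1:T}$, the evaluator can provide a reward by Eq. (4) for the action-value function:

$$Q^{G_\theta}_{D_\phi}(s = \hat{Y}_{1:T-1}, a = y_T) = R_{D_\phi}(\hat{Y}_{1:T}) \qquad (6)$$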


Rollout Policy

Our evaluator is trained to score complete sequences. Thus the reward from Eq. (6) is only available at the final state of a response (the generation of the response must be finished), which hurts the effectiveness of training the generator.

To evaluate the action-value at an intermediate state ($t < T$), a common strategy is to apply a rollout policy, such as Monte Carlo search, to sample the last $T - t$ words for the partially decoded response [21, 9, 5]. When applying a rollout policy, the model keeps sampling words from the generative model until the decoding is finished. This process is repeated $N$ times, and the average reward of the $N$ sampled responses by Eq. (4) is used as the action-value for the state $\hat{Y}_{1:t-1}$:

$$Q^{G_\theta}_{D_\phi}(s = \hat{Y}_{1:t-1}, a = y_t) = \frac{1}{N} \sum_{n=1}^{N} R_{D_\phi}(\hat{Y}^{n}_{1:T}) \qquad (7)$$

where $\hat{Y}^{n}_{1:T}$ is the $n$-th completed response sampled from the partial response $\hat{Y}_{1:t}$ with the generator $G_\theta$, and $R_{D_\phi}$ is the final reward of Eq. (4).


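With the rollout policy, the gradient of Eq. (5) can be computed by the policy gradient method:

$$\nabla_\theta J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{\hat{Y}_{1:t-1} \sim G_\theta} \Big[ \sum_{y_t} \nabla_\theta G_\theta(y_t \mid \hat{Y}_{1:t-1}) \, Q^{G_\theta}_{D_\phi}(\hat{Y}_{1:t-1}, y_t) \Big] \qquad (8)$$

where $G_\theta$ is the generator and $Q^{G_\theta}_{D_\phi}$ is the action-value function. The expectation can be approximated by sampling methods.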
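The Monte Carlo estimate of an intermediate action-value can be sketched as follows. This is a toy illustration, not the authors' implementation: `sample_next` stands in for sampling one word from the generator, `reward_fn` stands in for the evaluator's reward on a complete response, and all names, the end-of-sequence token, and the default of five rollouts are assumptions.

```python
import random
from typing import Callable, List, Optional, Sequence


def mc_action_value(
    prefix: Sequence[str],
    sample_next: Callable[[List[str], random.Random], str],
    reward_fn: Callable[[List[str]], float],
    max_len: int,
    num_rollouts: int = 5,
    eos: str = "</s>",
    rng: Optional[random.Random] = None,
) -> float:
    """Average the reward of `num_rollouts` completions of a partial response.

    Each rollout keeps sampling words until the end-of-sequence token is
    produced or `max_len` words are reached, then scores the full sequence.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_rollouts):
        seq = list(prefix)
        while len(seq) < max_len and (not seq or seq[-1] != eos):
            seq.append(sample_next(seq, rng))
        total += reward_fn(seq)
    return total / num_rollouts


# Toy usage: a "generator" that always emits the same word and a reward
# equal to the sequence length, so the estimate is easy to check by hand.
value = mc_action_value(
    prefix=["x"],
    sample_next=lambda seq, rng: "a",
    reward_fn=lambda seq: float(len(seq)),
    max_len=4,
)
```

The cost grows with the number of rollouts times the remaining length at every timestep, which is why a small number of rollouts is attractive in practice.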

When the number of rollouts $N$ is large enough, MC search leads to a reasonable estimate of the sentence rewards. However, this comes at a high computational cost. When $N$ is decreased to save computation time, the diversity of the sampled responses is affected, which could lead to a poor estimate. Therefore, we propose a different rollout policy:

1. At step $t$, the model first generates a subsequence of $t$ words with beam search.
2. At step $t+1$, the model keeps the $N$ different words with the top $N$ probabilities.
3. After step $t+1$, the model continues to generate words with a sampling-based method for the $N$ partially decoded sequences of $t+1$ words.

We apply this rollout policy to balance computation time and sample diversity. In this way, we can get diverse samples even when $N$ is relatively small.
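The branching idea can be sketched as below, assuming the beam-searched prefix is already available. `next_word_probs` and `sample_next` are hypothetical stand-ins for the generator's next-word distribution and its sampler; the function name and signature are illustrative, not the paper's code.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def branched_rollouts(
    prefix: Sequence[str],
    next_word_probs: Callable[[List[str]], Dict[str, float]],
    sample_next: Callable[[List[str], random.Random], str],
    max_len: int,
    num_branches: int,
    eos: str = "</s>",
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Branch on the top-`num_branches` next words, then sample each branch to the end.

    Forcing a different word at the branching step guarantees that the
    rollouts differ from each other, even when `num_branches` is small.
    """
    rng = rng or random.Random(0)
    probs = next_word_probs(list(prefix))
    top_words = sorted(probs, key=probs.get, reverse=True)[:num_branches]
    rollouts = []
    for word in top_words:
        seq = list(prefix) + [word]
        while len(seq) < max_len and seq[-1] != eos:
            seq.append(sample_next(seq, rng))
        rollouts.append(seq)
    return rollouts


# Toy usage: a fixed next-word distribution and a sampler that ends the
# response immediately, so the two branches are easy to inspect.
samples = branched_rollouts(
    prefix=["i", "like"],
    next_word_probs=lambda seq: {"country": 0.5, "rap": 0.3, "jazz": 0.2},
    sample_next=lambda seq, rng: "</s>",
    max_len=5,
    num_branches=2,
)
```

Averaging the evaluator's reward over the returned sequences then gives the action-value estimate for the prefix.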

Adversarial Training

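Requires: generator, evaluator (naturalness module and consistency module),
      dialogue corpus, NLI dataset.

Randomly initialize the generator, the naturalness module, and the consistency module.
Pretrain the generator using MLE on the dialogue corpus.
Pretrain the naturalness module using negative samples from the generator by minimizing the cross-entropy loss.
Pretrain the consistency module on the NLI dataset accordingly.
for number of training iterations do
    for G-steps do
        Sample a response from the generator
        for t in 1 : T do
            Compute the action-value by Eq. (7)
        Update the generator via policy gradient by Eq. (8)
    for teacherforce-steps do
        Update the generator via MLE
    for naturalness-module steps do
        Sample generated responses from the new generator and human responses from the dialogue corpus
        Update the naturalness module via cross-entropy loss
Algorithm 1 Sketch of the training procedure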

As mentioned above, the naturalness module needs adversarial training to achieve higher accuracy. Algorithm 1 shows the overall training process of the proposed approach, including the adversarial training of the naturalness module. In Eq. (8), the ground-truth responses are not directly exposed to the generator during training. In practice, updating the generator using only the gradients from Eq. (8) leads to unstable training. The same issue is also reported by Li et al. [9]. To alleviate this issue, we follow Sutskever et al. (2014) and use the teacher forcing strategy to train the generator via the MLE loss together with the rewards from the evaluator.




We perform persona-based dialogue generation experiments on the Persona-Chat dataset [22]. The conversations were obtained from crowdworkers who were randomly paired and asked to act the part of a given persona. The personas were created by another set of crowdworkers. This dataset contains 164,356 utterances in 10,981 dialogues and has a set of 1,155 personas, each consisting of four or five persona texts. The testing set contains 1,000 dialogues (15,119 utterances) and 200 previously unseen personas. We set aside 968 dialogues (15,705 utterances), together with their personas, from the training set for validation. The final data has 10,000/968/1,000 dialogues for train/validate/test. (Footnote 2: Note that the test set in ConvAI2 is different from the test set in Zhang et al. [22] and is not publicly available.)

As reported by Zhang et al. [22], pretraining on larger datasets yields better results. Thus we use another two million input-response pairs from OpenSubtitles to pretrain all models in our experiments, and we report the results obtained with this pretraining.


The recently released Dialogue Natural Language Inference dataset [17] offers a new domain for NLI models. DNLI mainly consists of utterance-persona pairs, which are labeled as entailment (E), neutral (N), or contradiction (C). This dataset has 310,110/16,500/16,500 pairs for train/validate/test. Due to the length limit, we show other statistics of the DNLI dataset in the appendix.


In the persona-based dialogue generation area, to the best of our knowledge, no previous work has explicitly modeled the consistency issue. To evaluate our model, we compare the proposed approach with the following strong models:

  • S2SA: Seq2Seq is a generative dialogue model with the context attention mechanism [12]. This is the only model without persona information.

  • Transformer: Transformer is one of the state-of-the-art sequence transduction models [15]. We concatenate persona texts to the message as its input.

  • REGS: Reward for Every Generation Step is an adversarially trained model with Monte Carlo search for response generation [9]. We regard persona texts as dialogue context while training this model.

  • Per-S2S: This is a Seq2Seq model that prepends all persona texts to the input message [22].

  • GPMN: Generative Profile Memory Network is a generative model that encodes persona as individual memory representations in a memory network [22].

  • DeepCopy: DeepCopy is a hierarchical pointer network, which extends the pointer-generator network to copy tokens from relevant persona texts [19].

To make the following sections more concise, we abbreviate the proposed Reinforcement Learning based Consistent Dialogue Generation approach as RCDG. Since we have two implementations of the consistency module (the base model and the finetuned BERT), we use RCDG_base and RCDG_bert to denote RCDG implemented with the base model and with the finetuned BERT, respectively.

Experimental Settings

For the generator, both the encoder and the decoder are two-layer GRUs with a hidden size of 500. Embeddings of size 300 are randomly initialized and updated during training. The vocabulary size is 18,300, and all other tokens are replaced with the UNK token. The encoder and decoder share the same vocabulary and embeddings. The model parameters are optimized using Adam with an initial learning rate of 0.0003. The learning rate decay is 0.98. The training minibatch size is 32. We set the reward weight $\lambda$ to 0.4 and the number of rollouts $N$ to 5. We implement the model in OpenNMT-py.

Evaluation Metrics

Consistency Evaluation

First, we evaluate the persona-consistency of different models. Prior work (nli_eval) has shown that entailment techniques can be used as a surrogate for human judgment in evaluating dialogue consistency. Following this work, we employ an NLI model to automatically evaluate the persona-consistency of the generated responses. For a generated response $\hat{Y}$ and a set of persona texts $P$, an NLI model can assign an entailment category to each $(\hat{Y}, P_i)$ pair, where $P_i \in P$. Then we simulate the human evaluator in deciding the entailment category between $\hat{Y}$ and $P$ by:


where .
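A minimal sketch of one possible aggregation rule is given below. The priority order (any contradiction wins, then any entailment, otherwise neutral) is assumed for illustration and is not taken from the paper; the function name `aggregate_nli` is likewise hypothetical.

```python
from typing import Iterable


def aggregate_nli(pair_labels: Iterable[str]) -> str:
    """Collapse per-persona NLI labels ("E", "N", "C") into one label.

    Assumed rule for this sketch: a single contradiction makes the response
    contradictory; otherwise a single entailment makes it entailed;
    otherwise it is neutral.
    """
    labels = set(pair_labels)
    if "C" in labels:
        return "C"
    if "E" in labels:
        return "E"
    return "N"


# A response that entails one persona text and is neutral to the others.
label = aggregate_nli(["N", "E", "N", "N"])
```

Under this rule the entailment and contradiction percentages over a test set are simply the shares of responses labeled "E" and "C".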

Since we have used BERT as a consistency evaluator in the training process, it would not be fair to use the same model again for evaluation. Thus we introduce another strong NLI model, DIIN [6], as a third party to evaluate all the dialogue models.

Dialogue Quality Evaluation

Second, the quality of generated dialogues is also an essential factor to consider. We evaluate the dialogue quality of different models with the following metrics:

  • Perplexity: Following Zhang et al. [22], we use perplexity (ppl) to measure the fluency of responses. Lower perplexity means better fluency.

  • Embedding metrics: Following Serban et al. (2016), we use Embedding Average (Ave.), Embedding Greedy (Grd.), and Embedding Extrema (Ext.) as evaluation metrics. These metrics are based on word embeddings, and they measure the relevance of a response with respect to a target response. We use GoogleNews 300D word vectors.

  • Distinct: Following Li et al. (2015), we calculate the token ratio of distinct bigrams (Distinct-2, abbreviated as Dst. for convenience). We use this metric to measure how diverse the responses are.
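As an illustration of the last metric, the sketch below counts distinct n-grams over a set of tokenized responses and divides by the total number of generated tokens, which is the commonly used definition; some implementations divide by the number of n-grams instead. The function name `distinct_n` and the toy responses are assumptions for this example.

```python
from typing import List, Sequence


def distinct_n(responses: Sequence[List[str]], n: int = 2) -> float:
    """Number of distinct n-grams divided by the total number of tokens."""
    unique_ngrams = set()
    total_tokens = 0
    for tokens in responses:
        total_tokens += len(tokens)
        # zip over n shifted views of the token list yields its n-grams.
        unique_ngrams.update(zip(*(tokens[i:] for i in range(n))))
    return len(unique_ngrams) / total_tokens if total_tokens else 0.0


# Two responses with 10 tokens in total and 6 distinct bigrams.
score = distinct_n([
    ["i", "am", "12", "years", "old"],
    ["i", "am", "30", "years", "old"],
])
```

Normalizing by token count keeps the metric from rewarding models simply for producing longer responses.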

Source                 Model            Dev     Test
Welleck et al. 2019    InferSent        85.82   85.68
Welleck et al. 2019    ESIM             86.31   88.20
This work              DIIN             86.72   88.84
This work              Base model       80.48   81.26
This work              Finetuned BERT   87.67   89.14
Table 1: Accuracy of different models on the DNLI dataset.
Model         Entail.(%)       Contr.(%)
Human         48.00            1.16
S2SA          8.37             12.94
GPMN          12.98            11.53
Per-S2S       13.27            12.19
REGS          14.08            10.83
Transformer   14.20            9.00
DeepCopy      14.62            12.17
RCDG_base     18.71 (28.0%)    5.93 (34.1%)
RCDG_bert     19.07 (30.4%)    5.56 (38.2%)
Table 2: NLI model-based persona-consistency evaluation results. Entail. denotes entailment (the higher the better). Contr. denotes contradiction (the lower the better). The percentages in parentheses are improvements over the baselines' best results. We show some contradiction examples of Human in the appendix.
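The percentages in parentheses can be reproduced from the table itself: the best baseline is DeepCopy for entailment (14.62) and Transformer for contradiction (9.00). The helper name `relative_improvement` is an assumption for this short check.

```python
def relative_improvement(ours: float, best_baseline: float, higher_is_better: bool = True) -> float:
    """Relative improvement over the best baseline, in percent."""
    delta = ours - best_baseline if higher_is_better else best_baseline - ours
    return 100.0 * delta / best_baseline


# Entailment: higher is better, best baseline is 14.62.
entail_gains = [round(relative_improvement(x, 14.62), 1) for x in (18.71, 19.07)]
# Contradiction: lower is better, best baseline is 9.00.
contra_gains = [round(relative_improvement(x, 9.00, higher_is_better=False), 1) for x in (5.93, 5.56)]
```

Both lists match the values reported in the last two rows of the table.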

Human Evaluations

In addition to the automatic evaluations, we also recruit five well-educated human judges to evaluate the generated responses.

Quantitatively evaluating the persona-consistency of generative models is a non-trivial task for humans. One major challenge is that the majority of the responses are neutral with respect to the persona texts. As we can see in the first row of Table 2, even in the test set of Persona-Chat (from Human), half of the responses are neutral with respect to the personas. This is plausible because many real-world utterances, such as greetings and questions, are not grounded in personas. With the limited sample size, we did not get statistically significant results in human evaluation when quantitatively evaluating the persona-consistency: the human judges labeled most of the sampled responses neutral.

Instead, we use human evaluations to verify the effectiveness of the model-based evaluation. Responses from all models are divided into three categories, and we randomly sample 150 response-persona pairs from each category. The judges are instructed to give each pair a score on a 5-point scale: 0: definitely contradiction; 1: potential contradiction; 2: definitely neutral; 3: potential entailment; 4: definitely entailment. Note that the judges evaluate samples from each category (predicted by the DIIN), rather than from each model.

For dialogue quality, the evaluation follows the usual practice. We sample 100 responses from each model and randomly shuffle them for the judges. The five judges rate each response on a 3-point scale: 0: persona contradiction, irrelevant to the input, or grammatically broken; 1: the response replies to the message, but is not informative; 2: the response is relevant and informative.

Figure 3: Boxplot of the human scores for consistency versus the model prediction categories. The three categories of model predictions are on the horizontal axis. Area I is the score interval that is likely to be Entailment, as at least two judges out of five agree on that. Similarly, area II is likely to be Contradiction. This figure shows the correlation between human scoring and model prediction.

Results of Consistency

Table 1 shows the performance of different models on the DNLI dataset. The first two rows of results are reprinted from Welleck et al. [17]. We implement the other three models. DIIN is the model used for persona-consistency evaluation. The last two models (the base model and the finetuned BERT) are two different implementations of the consistency module.

Automatic Results

We report the model-based persona-consistency evaluation results in Table 2. With the explicit modeling of persona-consistency and reinforcement learning, our approach achieves the highest entailment percentage and a much lower contradiction percentage, compared with all other baselines.

The last two rows in Table 2 are the results of our approach with the two implementations of the consistency module. Both outperform the other baselines significantly. The BERT-based variant (RCDG_bert) gets better results than the base-model variant (RCDG_base), but at a higher computational cost. These results suggest that the NLI signals work well regardless of the NLI model structure.

Human Validation

The human evaluation scores of each category are depicted in Figure 3. For the entailment category, more than half of the samples get an average score in interval I, which means at least two judges out of five agree that the sample is likely to be entailment (rating 4 or 3). The contradiction category behaves similarly. For the neutral category, most samples get an average score of 2, and there are only a few outliers, which makes the boxplot quartile lines overlap. Figure 3 shows that the model-based evaluation is in relatively good agreement with human evaluation. This is a preliminary experiment on the evaluation of consistency; a full study is beyond the scope of this paper.

Results of Dialogue Quality

We first report the automatic evaluation results of dialogue quality in Table 3. Our methods are the best on the three embedding metrics, which indicates that our generated responses are the most relevant to the ground truth. As our model is designed to address both naturalness and consistency, these results are within expectation. We notice that Transformer gets the best results in perplexity and Distinct-2. This suggests that Transformer is a better language model than the RNN-based models, and it motivates us to use more advanced sequence models as our generator in future work. Apart from the Transformer, our methods perform best among the RNN-based models. A plausible explanation is that our evaluator provides informative reward signals in the training process.

We report the human evaluation results in Table 4. Our model has the highest ratio of score 2, which means our generated responses are of higher quality. The Transformer also performs well in human evaluation, but it gets many scores of 1. One reason could be that this model generates more questions than declarative sentences, which human judges find less informative.

Model         ppl    Ave.   Grd.   Ext.   Dst.
DeepCopy      42.8   62.1   43.2   45.1   863
Per-S2S       36.3   61.5   45.1   42.5   719
S2SA          34.8   59.8   41.9   43.5   473
GPMN          34.3   65.3   45.7   43.2   741
REGS          33.6   64.3   44.2   44.8   1009
Transformer   28.1   63.4   43.9   43.6   1505
RCDG_base     30.2   66.7   46.9   46.4   1289
RCDG_bert     29.9   66.9   47.2   46.8   1275
Table 3: Automatic evaluation results. Improvements are tested for statistical significance over all baselines (t-test, p-value < 0.01). Dst. is scaled by a power of 10.
Model         0       1       2       Avg     κ
S2SA          0.378   0.406   0.216   0.838   0.54
GPMN          0.250   0.446   0.304   1.054   0.46
Per-S2S       0.224   0.482   0.294   1.068   0.45
REGS          0.242   0.440   0.318   1.076   0.42
DeepCopy      0.224   0.450   0.326   1.102   0.48
Transformer   0.212   0.458   0.330   1.118   0.43
RCDG_base     0.182   0.440   0.378   1.196   0.50
RCDG_bert     0.180   0.436   0.384   1.204   0.47
Table 4: The results of human evaluation for response quality, together with the Fleiss' kappa (κ). A coefficient between 0.41 and 0.6 means moderate agreement.
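The Avg column is the expected score under each model's rating distribution, i.e., 0 times the ratio of score 0 plus 1 times the ratio of score 1 plus 2 times the ratio of score 2. A short check, with the helper name `mean_rating` assumed for illustration:

```python
from typing import Sequence


def mean_rating(ratios: Sequence[float]) -> float:
    """Expected score when ratios[k] is the share of responses rated k."""
    return sum(score * ratio for score, ratio in enumerate(ratios))


# Rating distributions copied from the S2SA row and the last row of the table.
s2sa_avg = mean_rating([0.378, 0.406, 0.216])
last_row_avg = mean_rating([0.180, 0.436, 0.384])
```

Both values agree with the Avg column (0.838 and 1.204) up to floating-point rounding.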

Ablation Study

As the proposed model achieves better performance than previous approaches, we conduct an analysis to gain further insight on how the integration of different modules helps the response generation. We report the results in Table 5.

As we can see, the performance of the vanilla generator is not outstanding. With the help of the adversarially trained naturalness module, the dialogue quality is improved. Meanwhile, if we directly apply reinforcement learning without the naturalness module, the consistency of the generated responses improves but the quality decreases significantly, as shown in the two rows that add only a consistency module. When we integrate the naturalness module and the consistency module, the model achieves the best performance.

Finally, we show some generated examples in Table 6.

Conclusion and Future Work

In this paper, we consider modeling the persona-consistency in open-domain dialogue generation by exploiting natural language inference. To this end, we cast the task as a reinforcement learning problem and leverage natural language inference signals in the deep generative model. We demonstrate the effectiveness of our approach in comparison with several baselines by experiments on the Persona-Chat dataset. In the future, we plan to apply our model to larger scale datasets. Furthermore, we plan to use more advanced generators in our approach to achieve higher performance.

Model                        ppl    Ave.   Dst.   E.(%)   C.(%)
Generator                    34.5   63.8   823    13.23   12.08
+ naturalness                32.3   64.8   1078   14.01   11.32
+ base NLI                   40.9   60.3   639    16.98   7.52
+ BERT NLI                   41.3   59.9   517    17.13   7.19
+ naturalness + base NLI     30.2   66.7   1289   18.71   5.93
+ naturalness + BERT NLI     29.9   66.9   1275   19.07   5.56
Table 5: Ablation test of different modules. Dst. is scaled by a power of 10. E. denotes entailment and C. denotes contradiction. "naturalness" is the adversarially trained naturalness module. "base NLI" and "BERT NLI" are the two implementations of the consistency module.
Persona I’m a 12 years old girl
Input Hi there! I am age 24, how about you?
Transformer 30. I want to be a baseball announcer.
REGS I’m 30 years old.
GPMN I am 32 and older than you.
DeepCopy I’m 12 years old.
Ours I am 12 years old.
Persona I only listen to country music
Input What sort of music do you listen to?
Transformer What kind of music do you like?
REGS I love country. Especially tyler farr swoon.
GPMN I love rap, Tupac is my favorite. You?
DeepCopy My sort of music pays well.
Ours My favorite is country, what about you?
Table 6: Sampled dialogues from different models.


  • [1] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Cited by: Introduction, Natural Language Inference, Consistency Module.
  • [2] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657–1668. Cited by: Natural Language Inference, Consistency Module.
  • [3] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680. Cited by: Consistency Module.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 2nd item.
  • [5] Z. Fan, Z. Wei, S. Wang, Y. Liu, and X. Huang (2018) A reinforcement learning framework for natural question generation using bi-discriminators. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1763–1774. Cited by: Reinforcement Learning, Rollout Policy.
  • [6] Y. Gong, H. Luo, and J. Zhang (2018) Natural language inference over interaction space. In International Conference on Learning Representations, Cited by: Natural Language Inference, Consistency Module, Consistency Evaluation.
  • [7] S. Kim, J. Hong, I. Kang, and N. Kwak (2019) Semantic sentence matching with densely-connected recurrent and co-attentive information. In AAAI, Cited by: Natural Language Inference, Consistency Module.
  • [8] J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and B. Dolan (2016) A persona-based neural conversation model. arXiv preprint arXiv:1603.06155. Cited by: Introduction.
  • [9] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2157–2169. Cited by: Evaluator, Rollout Policy, 3rd item.
  • [10] Z. Li, X. Jiang, L. Shang, and H. Li (2018) Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3865–3878. Cited by: Reinforcement Learning.
  • [11] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models.. In AAAI, Vol. 16, pp. 3776–3784. Cited by: Introduction.
  • [12] L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364. Cited by: Introduction, 1st item.
  • [13] H. Song, W. Zhang, Y. Cui, D. Wang, and T. Liu (2019-07) Exploiting persona information for diverse generation of conversational responses. In

    Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19

    pp. 5190–5196. External Links: Document, Link Cited by: Introduction.
  • [14] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019-09–15 Jun) MASS: masked sequence to sequence pre-training for language generation. In

    Proceedings of the 36th International Conference on Machine Learning

    , K. Chaudhuri and R. Salakhutdinov (Eds.),
    Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5926–5936. Cited by: Introduction.
  • [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: 2nd item, 2nd item.
  • [16] O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: Introduction.
  • [17] S. Welleck, J. Weston, A. Szlam, and K. Cho (2019-07) Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3731–3741. External Links: Link Cited by: Introduction, Introduction, Introduction, Persona-based Dialogue Generation, Consistency Module, DNLI.
  • [18] L. Wu, F. Tian, T. Qin, J. Lai, and T. Liu (2018-October-November)

    A study of reinforcement learning for neural machine translation

    In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3612–3621. Cited by: Reinforcement Learning.
  • [19] S. Yavuz, A. Rastogi, G. Chao, and D. Hakkani-Tür (2018) DEEPCOPY: grounded response generation with hierarchical pointer networks. Advances in neural information processing systems. Cited by: 6th item.
  • [20] Q. Yin, Y. Zhang, W. Zhang, T. Liu, and W. Y. Wang (2018) Deep reinforcement learning for chinese zero pronoun resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 569–578. Cited by: Reinforcement Learning.
  • [21] L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) Seqgan: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: Reinforcement Learning, Evaluator, Rollout Policy.
  • [22] S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213. Cited by: Introduction, Introduction, Introduction, 4th item, 5th item, Persona-Chat.
  • [23] T. Zhao, R. Zhao, and M. Eskenazi (2017)

    Learning discourse-level diversity for neural dialog models using conditional variational autoencoders

    In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 654–664. Cited by: Introduction.
  • [24] Q. Zhu, L. Cui, W. Zhang, F. Wei, and T. Liu (2019) Retrieval-enhanced adversarial training for neural response generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3763–3773. Cited by: Introduction.