Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation

04/16/2020 ∙ by Haoyu Song, et al. ∙ Harbin Institute of Technology ∙ Tencent

Maintaining a consistent personality in conversations is quite natural for human beings, but is still a non-trivial task for machines. The persona-based dialogue generation task is thus introduced to tackle the personality-inconsistent problem by incorporating explicit persona text into dialogue generation models. Despite the success of existing persona-based models on generating human-like responses, their one-stage decoding framework can hardly avoid the generation of inconsistent persona words. In this work, we introduce a three-stage framework that employs a generate-delete-rewrite mechanism to delete inconsistent words from a generated response prototype and further rewrite it to a personality-consistent one. We carry out evaluations by both human and automatic metrics. Experiments on the Persona-Chat dataset show that our approach achieves good performance.




1 Introduction

In an open-domain conversation scenario, two speakers conduct open-ended chit-chat from the initial greetings and usually come to focus on their characteristics, such as hobbies, pets, and occupations, in the course of the conversation. Humans can easily carry out conversations according to their personalities Song et al. (2019), but fulfilling this task is still a challenge for recent neural dialogue models Welleck et al. (2019).

Figure 1: A common problem for persona-based dialogue models is that they can hardly avoid the generation of inconsistent persona words. Although the model generates a response which looks good, it is an inconsistent one. With further rewriting, the model can focus more on improving persona consistency.

One main issue is that these models are typically trained over millions of dialogues from different speakers, and neural dialogue models have a propensity to mimic the response with the maximum likelihood in the training corpus Li et al. (2016), which results in frequent inconsistency in responses Zhang et al. (2018). Another issue is the user-sparsity problem Qian et al. (2017) in conventional dialogue corpora Serban et al. (2015). Some users have very little dialogue data, which makes it difficult for neural models to learn meaningful user representations Li et al. (2016).

To alleviate the above issues, Zhang et al. (2018) introduced the Persona-Chat dataset to build more consistent dialogue models. Different from conventional dialogue corpora, this dataset endows dialogue models with predefined personas, which take the form of textually described profiles (as shown in the first line of Figure 1). The persona-based dialogue models also adopt an encoder-decoder architecture and are enhanced with persona encoding components, such as a memory network Sukhbaatar et al. (2015) or a latent variable Kingma and Welling (2013). These models turn out to produce more consistent responses than the persona-free ones Zhang et al. (2018); Song et al. (2019).

Despite the successful application of the encoder-decoder framework in persona-based dialogue models, one concern is that they pay no extra attention to the key persona information. The model learns to minimize the overall loss over every decoded word, but this may lead to the neglect of the key personas: changing one persona-related word may not significantly affect the overall loss, but could turn a good response into a totally inconsistent one. As shown in Stage 1 of Figure 1, a single improper word, “Colorado”, makes the response inconsistent.

A desirable solution should be able to capture personas and automatically learn to avoid and refine inconsistent words before the final response is produced. In this paper, we present a Generate-Delete-Rewrite framework, GDR, to mitigate the generation of inconsistent personas. We design three stages specifically for the goal of generating persona consistent dialogues: The first Generate stage adopts a transformer-based generator to produce a persona-based response prototype. The second Delete stage employs a consistency matching model to identify inconsistencies and delete (by masking) the inconsistent words from the prototype. Finally, in the Rewrite stage, a rewriter polishes the masked prototype to be more persona consistent. To examine the effectiveness of our GDR model, we carried out experiments on the publicly available Persona-Chat dataset Zhang et al. (2018).

We summarize the main contributions as follows:

  • A three-stage end-to-end generative framework, GDR, was proposed for the generation of persona consistent dialogues.

  • A matching model was integrated into the generation framework to detect and delete inconsistent words in response prototype.

  • Experimental results show the proposed approach outperforms competitive baselines on both human and automatic metrics.

2 Related Work

End-to-end dialogue generation approaches are a class of models for building open-domain dialogue systems, which have seen growing interest in recent years Vinyals and Le (2015); Shang et al. (2015); Serban et al. (2016); Li et al. (2016); Zhao et al. (2017); Li et al. (2017). These dialogue models adopted recurrent units in a sequence to sequence (seq2seq) fashion Sutskever et al. (2014). Since the transformer has been shown to be on par with or superior to the recurrent units Vaswani et al. (2017), some dialogue models began to take advantage of this architecture for better dialogue modeling Dinan et al. (2018); Su et al. (2019).

Besides the advancements in dialogue models, the emergence of new dialogue corpora has also contributed to the research field. Zhang et al. (2018) introduced the Persona-Chat dataset, with explicit persona texts attached to each dialogue. Based on the seq2seq model and memory network, they further proposed a model named Generative Profile Memory Network for this dataset. Following this line, Yavuz et al. (2019) designed the DeepCopy model, which leverages a copy mechanism to incorporate persona texts. Song et al. (2019) integrated persona texts into the Per-CVAE model for generating diverse responses. However, the persona-based models still face the inconsistency issue Welleck et al. (2019). To model the persona consistency, Welleck et al. (2019) annotated the Persona-Chat dataset and introduced the Dialogue Natural Language Inference (DNLI) dataset. This dataset converts the detection of dialogue consistency into a natural language inference task Bowman et al. (2015).

There is a growing interest in personalized dialogue generation task Li et al. (2016); Qian et al. (2017); Zhang et al. (2018); Zheng et al. (2019, 2020). In parallel with this work, Song et al. (2020) leveraged adversarial training to enhance the quality of personalized responses. Liu et al. (2020) incorporated mutual persona perception to build a more explainable Liu et al. (2019) dialogue agent.

Other relevant work lies in the area of multi-stage dialogue models Lei et al. (2020). Some retrieval-guided dialogue models Weston et al. (2018); Wu et al. (2019); Cai et al. (2019a, b) also adopted a multi-stage framework, but the difference from our work is obvious: we generate the prototype rather than retrieve one. A high-quality retrieved response is not always available, especially under the persona-based setting.

3 Model

3.1 Overview

In this work, we consider learning a generative dialogue model to ground the response with explicit persona. We focus on the persona consistency of single-turn responses, and we leave the modeling of multi-turn persona consistency as future work.

Figure 2: The overall architecture of our three-stage GDR model, including a prototype generator (Generate stage), a consistency matching model (Delete stage), and a masked prototype rewriter (Rewrite stage). The italics denote the inputs of each stage, and the boldfaces denote the outputs. All the attentions (attn) here refer to the multi-head attention. For the sake of brevity, we omitted some layers of the transformer in this figure.

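Formally, we use uppercase letters to represent sentences and lowercase letters to represent words. Let $Q = q_1, q_2, \ldots, q_n$ denote the input query with $n$ words, and let $P = \{P^{(1)}, P^{(2)}, \ldots, P^{(k)}\}$ be the $k$ different persona texts, where $P^{(i)} = p^{(i)}_1, p^{(i)}_2, \ldots, p^{(i)}_{m_i}$ is the $i$-th persona text with $m_i$ words. Our goal is to learn a dialogue model $M$ to generate a response $\hat{Y}$, which is consistent with the persona, based on both the query $Q$ and the persona $P$. In abbreviation, $\hat{Y} = M(Q, P)$.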

More concretely, as shown in Figure 2, the proposed model consists of three parts:

1) Prototype generator G. This component takes persona texts and query as input and generates a response prototype for further editing. It adopts an encoder-decoder architecture Sutskever et al. (2014), with the transformer Vaswani et al. (2017) applied in both the encoder and the decoder.

2) Consistency matching model D. This model is designed to detect and delete those words in the prototype that could lead to inconsistency. We train this model in a natural language inference fashion on the DNLI Welleck et al. (2019) dataset.

3) Masked prototype rewriter R. The rewriter learns to rewrite the response prototype to a more consistent one. It is also a transformer decoder, which adopts a similar architecture as the decoder of G. The difference lies in that it takes the masked prototype, rather than the query, as input.

We discuss the details in the following sections.
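The three components above compose into a simple pipeline. The following is a minimal sketch of that control flow only, not the authors' implementation: the three stages are passed in as plain callables, and the `[MASK]` token string and the toy stand-in functions are hypothetical, chosen purely for illustration.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # hypothetical token string; the paper only says "a special mask token"

Tokens = List[str]


def generate_delete_rewrite(
    query: str,
    personas: Sequence[str],
    generate: Callable[[str, Sequence[str]], Tokens],
    delete: Callable[[Tokens, Sequence[str]], Tokens],
    rewrite: Callable[[Tokens, Sequence[str]], Tokens],
) -> Tokens:
    """Run the three stages in order: G -> D -> R."""
    prototype = generate(query, personas)  # Stage 1: response prototype
    masked = delete(prototype, personas)   # Stage 2: mask inconsistent words
    return rewrite(masked, personas)       # Stage 3: fill in a consistent response


# Toy stand-ins for the three neural models (illustrative assumptions only).
def toy_generate(query: str, personas: Sequence[str]) -> Tokens:
    return "i live in colorado".split()


def toy_delete(prototype: Tokens, personas: Sequence[str]) -> Tokens:
    persona_words = set(" ".join(personas).lower().split())
    # Pretend the matcher flags the last word when it is absent from the persona.
    last = prototype[-1]
    return prototype[:-1] + [last if last in persona_words else MASK]


def toy_rewrite(masked: Tokens, personas: Sequence[str]) -> Tokens:
    fill = personas[0].split()[-1]
    return [fill if tok == MASK else tok for tok in masked]


final = generate_delete_rewrite(
    "where are you from ?", ["i live in texas"], toy_generate, toy_delete, toy_rewrite
)
```

Passing the stages as callables mirrors the fact that D is trained separately from G and R, so any of the three can be swapped out independently.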

3.2 Generate: Prototype Generator

We apply the encoder-decoder structure to build our prototype generator G. For the encoder, we use the self-attentive encoder in the transformer. For the decoder, built upon the transformer decoder, we propose a tuple-interaction mechanism to model the relations among persona, query, and response.

Self-Attentive Encoder

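As the persona $P$ is composed of several sentences, we unfold all words in $P$ into a single sequence $p^{(1)}_1, p^{(1)}_2, \ldots, p^{(k)}_{m_k}$. The input embedding for word $w_i$ is the sum of its word embedding and position embedding.

Then we use the self-attentive encoder Vaswani et al. (2017) to compute the representations of the persona texts and the query separately. The multi-head attention is defined as $\mathrm{MultiHead}(Q, K, V)$, where $Q$, $K$, $V$ are query, key, and value, respectively. The encoder is composed of a stack of $N_G$ identical layers. Take the first-layer encoding of $P$ for example:

$$V_p^{(1)} = \mathrm{MultiHead}(I(P), I(P), I(P)),$$
$$O_p^{(1)} = \mathrm{FFN}(V_p^{(1)}),$$

where $V_p^{(1)}$ is the first layer result of the multi-head self-attention and $I(\cdot)$ is the embedding function of the input. $O_p^{(1)}$ denotes the output of the first layer feed-forward network. For other layers:

$$V_p^{(n)} = \mathrm{MultiHead}(O_p^{(n-1)}, O_p^{(n-1)}, O_p^{(n-1)}),$$
$$O_p^{(n)} = \mathrm{FFN}(V_p^{(n)}),$$

where $n = 2, \ldots, N_G$. We applied layer normalization to each sublayer by $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$. $Q$ is encoded in the same way. After $N_G$ identical layers, we get the final representations $O_p^{(N_G)}$ and $O_q^{(N_G)}$, which are the encoded persona and the encoded query, respectively.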
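To make the layer stacking concrete, here is a minimal NumPy sketch of a self-attentive encoder. It is a simplification under stated assumptions: a single attention head with no learned query/key/value projections (the paper uses 8-head attention), an unparameterized layer normalization, and random toy weights. It only illustrates the "self-attention, then feed-forward, each wrapped in a residual connection and layer normalization" pattern.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each position's feature vector (no learned gain/bias here)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention (a simplification of MultiHead)."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def encoder_layer(x, w1, b1, w2, b2):
    """One layer: self-attention sublayer, then position-wise feed-forward sublayer."""
    v = layer_norm(x + attention(x, x, x))            # LayerNorm(x + Sublayer(x))
    ffn = np.maximum(0.0, v @ w1 + b1) @ w2 + b2      # two linear maps with ReLU
    return layer_norm(v + ffn)


def encode(embedded: np.ndarray, layers) -> np.ndarray:
    """Apply the stack of identical layers to the embedded input sequence."""
    out = embedded
    for w1, b1, w2, b2 in layers:
        out = encoder_layer(out, w1, b1, w2, b2)
    return out


rng = np.random.default_rng(0)
hidden, inner, num_layers, seq_len = 8, 16, 3, 5  # toy sizes, not the paper's 512/2048
layers = [
    (rng.normal(size=(hidden, inner)), np.zeros(inner),
     rng.normal(size=(inner, hidden)), np.zeros(hidden))
    for _ in range(num_layers)
]
encoded = encode(rng.normal(size=(seq_len, hidden)), layers)
```

The same `encode` function would be applied separately to the unfolded persona words and to the query, yielding one encoded matrix for each.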

Tuple-Interaction Decoder

In the decoding phase, there are three types of information, persona $P$, query $Q$, and response $Y$, which make up a tuple $(P, Q, Y)$. Accordingly, three inter-sentence relations need to be considered: (1) The alignment between $Q$ and $Y$ is beneficial to yield better results Bahdanau et al. (2014). (2) As the persona is composed of several sentences and describes different aspects, we need to find the most relevant persona information according to the relations between $P$ and $Y$. (3) We also want to know whether the query needs to be answered with the given persona. Thus we should take the relations between $P$ and $Q$ into account.

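Considering the above factors, we design a two-layer tuple-interaction mechanism in the decoder, as shown in the first part of Figure 2. There are three attentions in two layers: query attention (Q-Attn) and persona attention (P-Attn) in the first layer, and persona-query attention (PQ-Attn) in the second layer. $N_G$ such identical layers compose the decoder. For the first layer:

$$V_y^{(1)} = \mathrm{MultiHead}(I(Y), I(Y), I(Y)),$$
$$E_p^{(1)} = \mathrm{MultiHead}(V_y^{(1)}, O_p^{(N_G)}, O_p^{(N_G)}),$$
$$E_q^{(1)} = \mathrm{MultiHead}(V_y^{(1)}, O_q^{(N_G)}, O_q^{(N_G)}),$$
$$T^{(1)} = \mathrm{MultiHead}(E_p^{(1)}, E_q^{(1)}, E_q^{(1)}),$$
$$O_{dec}^{(1)} = \mathrm{FFN}(\mathrm{mean}(E_p^{(1)}, E_q^{(1)}, T^{(1)})),$$

where $E_p^{(1)}$ and $E_q^{(1)}$ are the results of the first layer P-Attn and Q-Attn. $T^{(1)}$ is the result of the first layer PQ-Attn. $O_{dec}^{(1)}$ denotes the first layer output. Note that the $Y$ here is masked to ensure depending only on the known words Vaswani et al. (2017). Repeatedly, for other layers:

$$V_y^{(n)} = \mathrm{MultiHead}(O_{dec}^{(n-1)}, O_{dec}^{(n-1)}, O_{dec}^{(n-1)}),$$
$$O_{dec}^{(n)} = \mathrm{FFN}(\mathrm{mean}(E_p^{(n)}, E_q^{(n)}, T^{(n)})),$$

where $n = 2, \ldots, N_G$. After $N_G$ layers, the decoder output $O_{dec}^{(N_G)}$ is projected from hidden size to vocabulary size, followed by a softmax function to get the words’ probabilities:

$$Y^{(1)} = \mathrm{Softmax}(O_{dec}^{(N_G)} W_o + b_o),$$

where $W_o$ is a weight matrix and $b_o$ is the bias term with vocabulary size dimension. $Y^{(1)}$ denotes the output distribution of the first stage. Now we can get the response prototype $\hat{Y}^{(1)}$ from $Y^{(1)}$.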
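The tuple-interaction layer described in this subsection can be sketched in NumPy as follows. This is one plausible reading, not the authors' code: attention is single-head without learned projections, the feed-forward sublayer and layer normalization are omitted, and both the wiring of PQ-Attn (P-Attn result attending over the Q-Attn result, under a causal mask) and the averaging of the three attention results are assumptions.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention; mask[i, j] = True means i may see j."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v


def tuple_interaction_layer(y, enc_p, enc_q):
    """One decoder layer relating the response y to the persona and the query.

    y:     (response_len, hidden) embedded (or previous-layer) response states
    enc_p: (persona_len, hidden)  encoded persona
    enc_q: (query_len, hidden)    encoded query
    """
    n = y.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))  # depend only on known words
    v_y = attention(y, y, y, mask=causal)          # masked self-attention
    e_p = attention(v_y, enc_p, enc_p)             # P-Attn: response -> persona
    e_q = attention(v_y, enc_q, enc_q)             # Q-Attn: response -> query
    t = attention(e_p, e_q, e_q, mask=causal)      # PQ-Attn (assumed wiring)
    return (e_p + e_q + t) / 3.0                   # assumed merge by averaging


rng = np.random.default_rng(1)
hidden = 8
y = rng.normal(size=(4, hidden))
enc_p = rng.normal(size=(6, hidden))
enc_q = rng.normal(size=(5, hidden))
out = tuple_interaction_layer(y, enc_p, enc_q)

# Changing the last response token must not affect earlier positions.
y_changed = y.copy()
y_changed[-1] += 1.0
out_changed = tuple_interaction_layer(y_changed, enc_p, enc_q)
```

The causal mask is what lets the layer be trained with teacher forcing: position $i$ never reads information from positions after $i$.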

Figure 3: The architecture of our consistency matching model. “$\odot$” and “$-$” denote element-wise product and difference. The dotted line shows the inference process, including consistency matching and word deleting.

3.3 Delete: Consistency Matching Model

The goal of the consistency matching model D is to reveal word-level consistency between the persona texts and the response prototype, so that the inappropriate words can be deleted from the prototype.

This model is trained to estimate the sentence-level entailment category Bowman et al. (2015) of a response for the given persona texts, which includes entailment, neutral and contradiction. The key is that if the category is not entailment, we can delete the most contributing words by replacing them with a special mask token, thus giving the model a chance to rephrase. The attention weights can measure each word’s contribution.

The architecture of our consistency matching model is shown in Figure 3. From bottom to top are the self-attentive encoding layer, cross attention layer, and consistency matching layer.

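As described in section 3.2, the self-attentive encoder ($\mathrm{SAE}(\cdot)$) performs self-attention over the input to get sentence representations. Because the task of consistency matching is quite different from dialogue generation, we did not share the encoders between the generator G and the matching model D:

$$\bar{A} = \mathrm{SAE}_D(P), \qquad \bar{B} = \mathrm{SAE}_D(\hat{Y}^{(1)}),$$

where $\bar{A}$ is a $d \times u$ matrix, $\bar{A} = [\bar{a}_1, \ldots, \bar{a}_u]$ and $\bar{B} = [\bar{b}_1, \ldots, \bar{b}_v]$. $u$ and $v$ are the numbers of words in the persona $P$ and the response prototype $\hat{Y}^{(1)}$. Here we applied the average pooling strategy Liu et al. (2016); Chen et al. (2017) to get the summary representations:

$$\bar{a}_0 = \frac{1}{u} \sum_{i=1}^{u} \bar{a}_i, \qquad \bar{b}_0 = \frac{1}{v} \sum_{j=1}^{v} \bar{b}_j,$$

and we can get the response attention weights and attentive response representations by:

$$W_b = \bar{a}_0^{\top} \bar{B}, \qquad \tilde{B} = W_b \bar{B}^{\top},$$

where $W_b$ is the attention weights and $\tilde{B}$ is the response representations. Similarly, we can get $W_a$ and $\tilde{A}$.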

Once the attentive persona and response representations $\tilde{A}$ and $\tilde{B}$ are generated, three matching methods Chen et al. (2017) are applied to extract relations: concatenation, element-wise product, and element-wise difference. The results of these matching methods are concatenated and fed into a multi-layer perceptron, which has three layers with tanh activations in between. The output is followed by a softmax function to produce probabilities.
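The matching step can be illustrated with a short NumPy sketch. It assumes the persona and the response have each already been summarized into one vector; the layer sizes and random weights are toy values, and the function names are hypothetical.

```python
import numpy as np


def matching_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Concatenation, element-wise product, and element-wise difference, joined."""
    return np.concatenate([a, b, a * b, a - b])


def mlp_classify(features: np.ndarray, weights, biases) -> np.ndarray:
    """Multi-layer perceptron with tanh between layers and a softmax on top."""
    h = features
    for i, (w, bias) in enumerate(zip(weights, biases)):
        h = h @ w + bias
        if i < len(weights) - 1:  # no activation after the last layer
            h = np.tanh(h)
    e = np.exp(h - h.max())
    return e / e.sum()


rng = np.random.default_rng(2)
d = 4                                   # toy hidden size
persona_vec = rng.normal(size=d)        # attentive persona representation
response_vec = rng.normal(size=d)       # attentive response representation

features = matching_features(persona_vec, response_vec)   # length 4 * d
sizes = [4 * d, 8, 8, 3]                # three layers; 3 output classes
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

# Probabilities over (entailment, neutral, contradiction); class order is an assumption.
probs = mlp_classify(features, weights, biases)
```

Including the product and the difference alongside the raw vectors gives the classifier explicit similarity and contrast signals, which is the usual motivation for this feature set in NLI models.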

In the inference process, as shown in Figure 3, the response attention weights are leveraged to identify the inconsistent words, which will be deleted (in this paper, “deleting” a word means replacing it with a special mask token). In practice, we use a simple heuristic rule for deleting words: as long as the category is not entailment, we delete the 10% of the words (at least one word) with the highest attention weights in the prototype $\hat{Y}^{(1)}$ (in our experiments, we found that deleting more words made it difficult for the rewriter R to learn). In this way, we get the masked prototype $\hat{Y}^{(2)}$.
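The deleting heuristic is simple enough to sketch directly. In the snippet below, the `[MASK]` string, the function name, and the toy attention weights are hypothetical; rounding the 10% down (with a floor of one word) is also an assumption, since the text does not say how fractional counts are handled.

```python
from typing import List, Sequence

MASK = "[MASK]"  # hypothetical token string for the special mask token


def delete_words(
    prototype: Sequence[str],
    attention_weights: Sequence[float],
    label: str,
    ratio: float = 0.1,
) -> List[str]:
    """Mask the highest-weighted words unless the prototype is entailed by the persona."""
    if len(prototype) != len(attention_weights):
        raise ValueError("need exactly one attention weight per word")
    if label == "entailment" or not prototype:
        return list(prototype)                      # consistent: nothing is deleted
    k = max(1, int(len(prototype) * ratio))         # 10% of the words, at least one
    ranked = sorted(range(len(prototype)),
                    key=lambda i: attention_weights[i], reverse=True)
    to_mask = set(ranked[:k])
    return [MASK if i in to_mask else tok for i, tok in enumerate(prototype)]


tokens = "i wish i could be a nurse .".split()
weights = [0.05, 0.10, 0.05, 0.10, 0.10, 0.05, 0.50, 0.05]  # toy values

masked = delete_words(tokens, weights, label="neutral")
unchanged = delete_words(tokens, weights, label="entailment")
```

With eight tokens, 10% rounds down to zero, so the at-least-one rule applies and only the single highest-weighted word is masked.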

3.4 Rewrite: Masked Prototype Rewriter

The rewriter R takes the masked prototype and persona texts as input and outputs the final response.

R is also a transformer decoder, similar to the decoder of G in section 3.2, with a minor difference: the masked prototype is close to the target response, so direct attention between the prototype and the target response is unnecessary. The architecture of R can be seen in the third part of Figure 2. Its inputs are the encoded masked prototype, computed by the self-attentive encoder of G, and the encoded persona. After the stack of identical layers, the same generation process as in G is applied to the rewriter output, and we get the final response $\hat{Y}^{(3)}$.

3.5 Training

The consistency matching model D is trained separately from the prototype generator G and the rewriter R. As mentioned above, the matching model D is trained in a natural language inference fashion on the DNLI dataset Welleck et al. (2019), a setting that has been well defined by previous studies Bowman et al. (2015); Chen et al. (2017); Gong et al. (2018). We minimize the cross-entropy loss between the outputs of D and the ground truth labels.

G and R share the same training targets. We trained them by standard maximum likelihood estimation. Notice that there are two different deleting strategies in training: (1) In the warm-up phase, because G can hardly generate high-quality prototypes at this stage, we randomly delete each word from the prototype with a 10% probability. (2) After that, the trained consistency matching model D is leveraged to delete words.
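The two deleting strategies can be sketched as a small switch. The names `random_delete` and `choose_masking`, the `[MASK]` string, and the `warmup_steps` parameter are hypothetical; the length of the warm-up phase for this switch is not specified in the text above, so it is left as an argument.

```python
import random
from typing import Callable, List, Optional, Sequence

MASK = "[MASK]"  # hypothetical token string for the special mask token


def random_delete(
    prototype: Sequence[str], prob: float = 0.1, rng: Optional[random.Random] = None
) -> List[str]:
    """Warm-up strategy: mask each word independently with probability `prob`."""
    rng = rng or random.Random()
    return [MASK if rng.random() < prob else tok for tok in prototype]


def choose_masking(
    step: int,
    warmup_steps: int,
    prototype: Sequence[str],
    matcher_delete: Callable[[Sequence[str]], List[str]],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Random deletion during warm-up, matching-model deletion afterwards."""
    if step < warmup_steps:
        return random_delete(prototype, rng=rng)
    return matcher_delete(prototype)


tokens = "that's awesome . i also love to ski .".split()
seeded = random.Random(0)
warmup_masked = choose_masking(5, 10, tokens, lambda p: list(p), rng=seeded)
later_masked = choose_masking(
    10, 10, tokens, lambda p: [MASK if t == "ski" else t for t in p]
)
```

Random deletion gives the rewriter useful denoising practice before the generator's prototypes are good enough for the matching model's decisions to be meaningful.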

Data            Train     Valid   Test
Persona Texts   74,522    5,843   4,483
Q-R Pairs       121,880   9,558   7,801
Table 1: Some statistics of the Persona-Chat dataset. Valid denotes Validate and Q-R denotes Query-Response.

Label           Train     Valid   Test
Entailment      100,000   5,500   5,400
Neutral         100,000   5,500   5,400
Contradiction   110,110   5,500   5,700
Table 2: Key statistics of the DNLI dataset.

4 Experiments

4.1 Datasets

We carried out the persona-based dialogue generation experiments on the publicly available Persona-Chat dataset Zhang et al. (2018). Furthermore, we trained the consistency matching model on the recently released Dialogue Natural Language Inference (DNLI) dataset Welleck et al. (2019).

We show the statistics of the Persona-Chat dataset in Table 1. The DNLI dataset Welleck et al. (2019) is an enhancement to the Persona-Chat. It is composed of persona-utterance pairs from the Persona-Chat, and these pairs are further labeled as entailment, neutral, and contradiction. Some statistics of this dataset are given in Table 2.

4.2 Compared Models

To the best of our knowledge, this is an early work in modeling explicit persona consistency. To show the effectiveness of our model, we mainly compare it with the following persona-based dialogue models:

  • S2SA: an RNN-based attentive seq2seq model Bahdanau et al. (2014). It only takes the query as input.

  • Per-S2SA: a seq2seq model that prepends all persona texts to the query as input Zhang et al. (2018).

  • GPMN: Generative Profile Memory Network, an RNN-based model that encodes persona texts as individual memory representations in a memory network Zhang et al. (2018).

  • DeepCopy: an RNN-based hierarchical pointer network, which leverages a copy mechanism to integrate persona Yavuz et al. (2019).

  • Per-CVAE: a memory augmented CVAE model that exploits persona texts for diverse response generation Song et al. (2019).

  • Transformer: different from the RNN-based models, the transformer is a self-attention based sequence transduction model Vaswani et al. (2017). The persona texts are concatenated to the query to serve as its input.

4.3 Experimental Settings

All the RNN-based baseline models are implemented as two-layer LSTM networks with a hidden size of 512. For the Transformer, the hidden size is also set to 512, and both the encoder and the decoder have 3 layers. The number of heads in multi-head attention is 8, and the inner-layer size of the feedforward network is 2048. The word embeddings are randomly initialized, and the embedding dimension of all models is set to 512.

Our model applies the same parameter settings as the transformer. The number of layers is 3. G and R share the word embeddings, but the matching model D uses independent embeddings. We use token-level batching with a size of 4096. Adam is used for optimization, and the warm-up steps are set to 10,000. We implemented the model in OpenNMT-py Klein et al. (2017).

4.4 Evaluation Metrics

In the evaluation, there are two essential factors to consider: persona consistency and response quality. We apply both human evaluations and automatic metrics on these two aspects to compare different models.

Model         Const.   Fluc.   Relv.   Info.   PPL    Dist-1   Dist-2   Ent_diin   Ent_bert
S2SA          15.9%    3.17    2.84    2.63    34.8   1.92     4.86     9.80%      1.83%
GPMN          34.8%    3.78    3.57    3.76    34.1   1.89     7.53     14.5%      7.36%
Per-S2S       35.3%    3.43    3.22    3.32    36.1   2.01     7.31     13.5%      6.15%
DeepCopy      36.0%    3.26    3.08    2.87    41.2   2.35     8.93     16.7%      8.81%
Transformer   38.8%    3.46    3.65    3.54    27.9   3.12     15.8     14.2%      9.52%
Per-CVAE      42.7%    3.53    2.97    3.66    -      3.83     20.9     17.2%      7.36%
GDR (ours)    49.2%    3.86    3.68    3.77    16.7   3.66     22.7     21.5%      13.0%
Table 3: Results of human evaluations (on the left: Const., Fluc., Relv., Info.) and automatic metrics (on the right: PPL, Dist-1, Dist-2, Ent_diin, Ent_bert). The Dist-1 and Dist-2 values are scaled. Significance tests (t-test) are performed, and our method is significantly better than all methods on most metrics (p-value < 0.05), with a few exceptions. We also present two model-based ratios, Ent_diin and Ent_bert, as an additional reference for persona consistency assessments. Note that the automatic metrics are calculated on the whole test set. * The sampling process in CVAE leads to very unstable PPL.

Human Evaluation

We recruit five professional annotators from a third-party company. These annotators have high-level language skills but know nothing about the models. We sampled 200 persona-query-response tuples per model for evaluation. Duplicated queries (such as greetings which appear more than once) will not be sampled twice.

First, we evaluate the persona consistency of a response. The annotators are asked to decide whether the response is consistent with the given persona. 0 indicates irrelevant or contradictory and 1 indicates consistent (Const.).

Second, we evaluate the quality of a response on three conventional criteria: fluency (Fluc.), relevance (Relv.), and informativeness (Info.). Each aspect is rated on a five-point scale, where 1, 3, and 5 indicate unacceptable, moderate, and excellent performance, respectively. 2 and 4 are used when the annotator is unsure.

Automatic Metrics

Dziri et al. (2019) have shown that a natural language inference based entailment ratio can be used as an indicator of dialogue consistency. Here we trained two well-performing NLI models, DIIN Gong et al. (2018) and BERT Devlin et al. (2019), to automatically predict the category of persona-response pairs, and we calculated the ratio of entailment as an additional reference for persona consistency. In our experiments, DIIN and BERT achieved 88.78% and 89.19% accuracy on the DNLI test set, respectively, compared with the previous best result of 88.20%. The trained models are then leveraged for calculating entailment ratios. The two model-based entailment ratios are abbreviated as Ent_diin and Ent_bert.
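Given the categories predicted by an NLI model for a set of persona-response pairs, the entailment ratio is just the fraction labeled as entailment. A minimal sketch, in which the function name and the label strings are assumptions:

```python
from typing import Iterable


def entailment_ratio(predicted_labels: Iterable[str]) -> float:
    """Fraction of persona-response pairs that the NLI model labels as entailment."""
    labels = list(predicted_labels)
    if not labels:
        return 0.0
    return sum(label == "entailment" for label in labels) / len(labels)


# Toy predictions for four persona-response pairs.
toy_predictions = ["entailment", "neutral", "contradiction", "entailment"]
ratio = entailment_ratio(toy_predictions)
```

Because the ratio depends entirely on the NLI model's predictions, reporting it for two different classifiers (as done here with DIIN and BERT) guards against the idiosyncrasies of either one.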

For dialogue quality, we follow Zhang et al. (2018) to use perplexity (PPL) to measure the fluency of responses. Lower perplexity means better fluency. Besides, we also use Dist-1 / Dist-2 Li et al. (2016) to examine the model’s ability to generate diverse responses, which is the ratio of distinct uni-grams / bi-grams.
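Both automatic quality metrics are short to compute. The sketch below assumes Dist-n is computed at corpus level (distinct n-grams divided by the total number of n-grams over all generated responses) and that perplexity is taken over per-token natural-log probabilities from the model; the function names and toy inputs are illustrative.

```python
import math
from typing import List, Sequence


def distinct_n(responses: Sequence[Sequence[str]], n: int) -> float:
    """Ratio of distinct n-grams to all n-grams across the generated responses."""
    seen = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean log-probability (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


toy_responses: List[List[str]] = [
    ["i", "am", "a", "nurse"],
    ["i", "am", "a", "teacher"],
]
dist1 = distinct_n(toy_responses, 1)   # 5 distinct unigrams out of 8
dist2 = distinct_n(toy_responses, 2)   # 4 distinct bigrams out of 6
ppl = perplexity([math.log(0.5)] * 4)  # every token predicted with probability 0.5
```

A model that repeats generic phrases scores low on Dist-n even when each response is fluent, which is why the two metrics are reported together.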

GDR vs        Win(%)   Tie(%)   Lose(%)
S2SA          48.0     38.2     13.8
Per-CVAE      46.1     29.8     24.1
DeepCopy      43.8     35.5     20.7
Per-S2S       41.3     36.1     22.6
GPMN          35.0     31.0     34.0
Transformer   34.7     32.1     33.2
Table 4: GDR response quality gains over other baseline methods on a pairwise human judgment.

4.5 Main Results

We report the main evaluation results in Table 3. Compared with baseline methods, our GDR model obtains the highest consistency score of 49.2% in human evaluation, which is significantly better than other methods. The target responses in the sampled data are also annotated, and 65.4% of them expressed persona information. Moreover, the two model-based entailment ratios, which are calculated on the whole test set, also prove the effectiveness of our GDR model. Although the two NLI models differ in results, our GDR model ranks first under the evaluation of both DIIN and BERT.

For dialogue quality, our proposed model has a remarkably lower perplexity (16.7) than all the baseline methods. An analysis can be seen in Section 4.6. Besides, our Dist-2 score is even significantly better than that of the Per-CVAE model, which is designed to generate diverse responses.

Additionally, we carried out pairwise response comparison to see the dialogue quality gains. We report the results in Table 4. While the GDR model remarkably improves persona consistency, it can still generate high-quality responses like the transformer and GPMN.

4.6 More Analysis

As the proposed model achieves better performance than baseline methods, we turn to ablation tests to further quantify the contributions made by different components. We ablated our model through several different approaches:

  • GR It removes the matching model D, i.e., generates a prototype and rewrites it directly.

  • GRdR This approach replaces the matching model D with 10% random deleting (Rd), thus to see if the masked prototype, extracted by our matching model D, is beneficial.

  • G Our model’s generator, without further consistency matching and rewriting.

  • T It is a transformer generator but removes the tuple-interaction in section 3.2 and directly concatenates persona texts to the query. This model is equivalent to the vanilla transformer.

Model   Const.   Fluc.   Relv.   Info.   PPL
GDR     49.2%    3.86    3.68    3.77    16.7
GR      42.4%    3.72    3.40    3.66    18.0
GRdR    40.0%    3.60    3.29    3.56    20.6
G       40.1%    3.69    3.35    3.55    26.3
T       38.8%    3.46    3.65    3.54    27.9
Table 5: Results of the ablation study. GDR is significantly better than the ablated approaches, with only one exception.

GDR vs   Win(%)   Tie(%)   Lose(%)
GR       39.9     40.9     19.2
G        38.1     35.8     26.1
Table 6: Pairwise human judgment on response quality.

We report the results in Table 5. First, we look into which components contribute to the consistency. As seen, from T, G, GR to GDR, every step has an observable improvement in Const., indicating the effectiveness of our model’s design. Both the tuple-interaction in G and the rewriting process in R contribute to the improvements of persona consistency. The GRdR approach, with nothing different from GDR but a random deleting strategy, serves as a foil to our GDR model, which indicates a well-learned consistency matching model is of great benefit to our three-stage generation framework to generate persona consistent dialogues.

Second, we investigated the source of the remarkably low perplexity. As we can see, the one-stage transformer approaches G and T have a perplexity higher than 26. In contrast, after we add the rewriter R, the perplexity of all approaches declines significantly, whether or not there is a matching model D. One reason could be that the rewriter works like a denoising autoencoder Vincent et al. (2008), and it can focus more on the reconstruction of the missing information of the sequence itself, rather than learning to map a sequence to an entirely different one.

We observed that the relevance scores of GR, GRdR, and G are slightly inferior to that of T. Even the GDR model is not significantly better than T on the relevance score. One plausible explanation is that all these models are specially designed for integrating persona information; although they obtain much better consistency scores, this may come at the cost of relevance.

Moreover, we compared the GDR’s response quality with two ablated models and reported it in Table 6. As we can see, the deleting and rewriting, which are designed for improving consistency, also have a positive effect on the dialogue quality.

Finally, we present some generated examples in Table 7. In the first case, although the generated prototype is neutral regarding the persona, the word “nurse” is still masked according to our strategy, and after the rewriting stage the final response expresses the persona. In the second case, the prototype is potentially contradictory to the persona, and the keyword is successfully deleted by the matching model D. In the third case, the prototype is consistent with the persona, and no word is deleted. In these cases, both consistency and quality are improved after the final rewriting.

Persona    i.  My mother is a dentist
           ii. I’m currently looking for a job
Query      I want to become a physical therapist.
Gen&Del    I wish I could be a nurse.
Rewrite    I want to be a dentist like my mom.
Transfmr   I hear you. I am a nurse and a mom.
Per-S2S    What kind of business?
GPMN       I am a professional painter, what do you do?
DeepCopy   What do you do for a living?
Per-CVAE   I guess I would buy you one

Persona    i.  Gymnastics is my favorite sport
           ii. I’m a gold medalist olympian
Query      I am a mini van driver and I drive around.
Gen&Del    That’s awesome. I also love to ski.
Rewrite    That’s cool. I am a champion gymnast.
Transfmr   Oh. I am still in high school .
Per-S2S    I drive a big van, but around the house.
GPMN       I drive a Fiat. I love it.
DeepCopy   What do you do for a living?
Per-CVAE   That is why you like dogs? Tell me.

Persona    i.  I like to grill outdoors
           ii. I have been retired for years
Query      Old enough to work in construction. you?
Gen&Del    I’m retired. I spend time outdoors.
Rewrite    I’m retired. I spend most of my time outdoors.
Transfmr   I am a boy and work in the construction. you?
Per-S2S    No, I am a student.
GPMN       I am a teacher, I love to be a teacher.
DeepCopy   I work in the construction industry.
Per-CVAE   I am a retired officer I love my bike ride.
Table 7: Example responses from different models, with a visualization of the consistency matching weights. Strikethrough words are the words masked in the Delete stage, and underlined words in the Rewrite stage express the persona. Transfmr is short for Transformer.

5 Conclusion and Future Work

In this paper, we presented a three-stage framework, Generate-Delete-Rewrite, for persona consistent dialogue generation. Our method adopts the transformer architecture and integrates a matching model to delete the inconsistent words. Experiments are carried out on publicly available datasets. Both human evaluations and automatic metrics show that our method achieves remarkably good performance. In the future, we plan to extend our approach to improve the consistency of multi-turn dialogues.


Acknowledgments

This paper is supported by the National Natural Science Foundation of China under Grant No.61772153 and No.61936010. Besides, we want to acknowledge the Heilongjiang Province Art Planning Project 2019C027 and the Heilongjiang Province Social Science Research Project 18TQB100. We also would like to thank all the anonymous reviewers for their helpful comments and suggestions.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2, 1st item.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In

    Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

    pp. 632–642. Cited by: §2, §3.3, §3.5.
  • D. Cai, Y. Wang, W. Bi, Z. Tu, X. Liu, W. Lam, and S. Shi (2019a) Skeleton-to-response: dialogue generation guided by retrieval memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1219–1228. Cited by: §2.
  • D. Cai, Y. Wang, W. Bi, Z. Tu, X. Liu, and S. Shi (2019b) Retrieval-guided dialogue response generation via a matching-to-generation framework. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1866–1875. Cited by: §2.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657–1668. Cited by: §3.3, §3.3, §3.5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §4.4.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of Wikipedia: knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241. Cited by: §2.
  • N. Dziri, E. Kamalloo, K. Mathewson, and O. R. Zaiane (2019) Evaluating coherence in dialogue systems using entailment. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3806–3812. Cited by: §4.4.
  • Y. Gong, H. Luo, and J. Zhang (2018) Natural language inference over interaction space. In International Conference on Learning Representations. Cited by: §3.5, §4.4.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, pp. 67–72. Cited by: §4.3.
  • W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M. Kan, and T. Chua (2020) Estimation-action-reflection: towards deep interaction between conversational and recommender systems. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 304–312. Cited by: §2.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §4.4.
  • J. Li, M. Galley, C. Brockett, G. Spithourakis, J. Gao, and B. Dolan (2016) A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 994–1003. Cited by: §1, §2.
  • J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016) Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas. Cited by: §2.
  • J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2157–2169. Cited by: §2.
  • H. Liu, Q. Yin, and W. Y. Wang (2019) Towards explainable NLP: a generative explanation framework for text classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5570–5581. Cited by: §2.
  • Q. Liu, Y. Chen, B. Chen, J. Lou, Z. Chen, B. Zhou, and D. Zhang (2020) You impress me: dialogue generation via mutual persona perception. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Cited by: §2.
  • Y. Liu, C. Sun, L. Lin, and X. Wang (2016) Learning natural language inference using bidirectional LSTM model and inner-attention. arXiv preprint arXiv:1605.09090. Cited by: §3.3.
  • Q. Qian, M. Huang, H. Zhao, J. Xu, and X. Zhu (2017) Assigning personality/identity to a chatting machine for coherent conversation generation. arXiv preprint arXiv:1706.02861. Cited by: §1, §2.
  • I. V. Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau (2015) A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742. Cited by: §1.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784. Cited by: §2.
  • L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364. Cited by: §2.
  • H. Song, W. Zhang, Y. Cui, D. Wang, and T. Liu (2019) Exploiting persona information for diverse generation of conversational responses. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 5190–5196. Cited by: §1, §1, §2, 5th item.
  • H. Song, W. Zhang, J. Hu, and T. Liu (2020) Generating persona consistent dialogues by exploiting natural language inference. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Cited by: §2.
  • H. Su, X. Shen, R. Zhang, F. Sun, P. Hu, C. Niu, and J. Zhou (2019) Improving multi-turn dialogue modelling with utterance ReWriter. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 22–31. Cited by: §2.
  • S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in neural information processing systems, pp. 2440–2448. Cited by: §1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2, §3.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.1, §3.2, §3.2, 6th item.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. Cited by: §4.6.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §2.
  • S. Welleck, J. Weston, A. Szlam, and K. Cho (2019) Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3731–3741. Cited by: §1, §2, §3.1, §3.5, §4.1, §4.1.
  • J. Weston, E. Dinan, and A. Miller (2018) Retrieve and refine: improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pp. 87–92. Cited by: §2.
  • Y. Wu, F. Wei, S. Huang, Y. Wang, Z. Li, and M. Zhou (2019) Response generation by context-aware prototype editing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7281–7288. Cited by: §2.
  • S. Yavuz, A. Rastogi, G. Chao, and D. Hakkani-Tur (2019) Deepcopy: grounded response generation with hierarchical pointer networks. arXiv preprint arXiv:1908.10731. Cited by: §2, 4th item.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: I have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204–2213. Cited by: §1, §1, §1, §2, §2, 2nd item, 3rd item, §4.1, §4.4.
  • T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 654–664. Cited by: §2.
  • Y. Zheng, G. Chen, M. Huang, S. Liu, and X. Zhu (2019) Personalized dialogue generation with diversified traits. arXiv preprint arXiv:1901.09672. Cited by: §2.
  • Y. Zheng, R. Zhang, X. Mao, and M. Huang (2020) A pre-training based personalized dialogue generation model with persona-sparse data. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Cited by: §2.