
Persona-Knowledge Dialogue Multi-Context Retrieval and Enhanced Decoding Methods

by Min Sik Oh, et al.

Persona and Knowledge dual-context open-domain chat is a novel dialogue generation task introduced recently. While Persona and Knowledge are each interesting contexts of open-domain dialogue, their combination has not been well studied. We tackle Persona-Knowledge identification and response generation tasks in this paper. We design an informed data augmentation strategy that is compatible with neural Q&A retrieval models. With the augmented data, we perform permutative Persona-Knowledge evaluation and successive Persona search fine-tuning. Furthermore, we perform dialogue generation with various decoding techniques and illustrate crucial elements. We achieve SOTA across official metrics with 93.99 grounding accuracy and a 23.62 SacreBLEU score.




1 Introduction

Call For Customized Conversation Jang et al. (2021) is an open-domain chat dataset that grounds human dialogue in both Persona and Knowledge. The dataset provides Knowledge-grounded multi-turn dialogues that are aligned with the user's Persona. In particular, this dataset explores how the variety of people's individual preferences affects the Knowledge required to generate an answer while travelling around the world (history, design, structure, tourism, etc.). Thus, the dataset is composed of dialogues annotated with individual landmark-associated Wiki passages and simple sentences inferring the user's preferences. This results in a more realistic dialogue environment for the evaluation of open-domain dialogue agents.

One important aspect of this configuration is that Persona and Knowledge pairs should be retrieved from a given dialogue. Following the grounding prediction tasks in Jang et al. (2021), we define Persona and Knowledge Dual Context Identification as the task of identifying Persona and Knowledge jointly for a given dialogue. We hypothesize that there are specific interactions between Persona, Knowledge, and Dialogue; thus they cannot be predicted separately from partial contexts. We utilize neural retrieval tools such as Sentence-BERT Reimers and Gurevych (2019) to jointly predict Persona and Knowledge. To the best of our knowledge, this is the first paper to outline joint retrieval techniques for multi-context grounded dialogue.

In addition, decoding techniques are crucial since conversation models share the same encoder-decoder architecture utilized for other text generation tasks.

Roller et al. (2020) introduce recipes for retrieval and generation models, emphasizing decoding choices for grounded open-domain dialogue. Wu et al. (2016) propose a variety of normalization techniques for machine translation in a production system (Google Translate). Meister et al. (2020) investigate the importance of beam configurations in reaching optimal performance. Following these studies, we aim to tackle the known brevity problem, in which generative models favor shorter, less informative text than is optimal. We extensively experiment with various decoding strategies, length constraints, and normalization techniques.

Our contributions are as follows:

1. A Persona-Knowledge dual context retrieval methodology which utilizes neural retrieval tools to jointly retrieve Persona and Knowledge given Dialogue. We achieve SOTA performance for both Persona and Knowledge retrieval. Notably, no model fine-tuning is required for our top-1 Knowledge retrieval method.

2. Enhanced decoding strategies that target optimal performance, with specific emphasis on mitigating brevity. Notably, our approach obtains a significant performance gain without additional data or training.

2 Related Works

Integrating Persona with dialogue agents has been actively studied. Various datasets and systems exist for this purpose, including Persona Chat Zhang et al. (2018) and many others Majumder et al. (2020); Joshi et al. (2017); Shuster et al. (2018); Xu et al. (2020); Rashkin et al. (2019). Access to Persona assists the dialogue agent in responding appropriately to the user; however, the lack of Knowledge context prevents the agent from elaborating with specific, detailed information.

On the other hand, integrating knowledge bases with dialogue is another engaging topic of dialogue studies. Datasets for this purpose include Dinan et al. (2018) and Zhou et al. (2018). Knowledge relevant to the dialogue is retrieved from the knowledge base and utilized in response generation. The shortcoming of this Knowledge-only approach is that the relevant Knowledge itself might depend on the Persona of the user. We specifically address this shortcoming in our method by studying interactions between all components of dialogue.

In dialogue generation, Wu et al. (2016) propose a variety of beam normalization techniques for machine translation. Roller et al. (2020) emphasize decoding strategies for open-domain chatbots, including beam size, beam length, and sampling methods. Meister et al. (2020) introduce regularization strategies for beam search.

3 Methodology

3.1 Knowledge Retrieval

We introduce a novel formulation of Persona, Knowledge and Dialogue as Q&A input (Figure 1). This form is specifically selected to infer relations between all inputs of the grounded dialogue during answer likelihood calculation, and to replicate the short-question, descriptive-answer pairs often found in the Q&A setting. Concretely, each Q&A candidate pairs a question, formed from a Persona sentence and the corresponding Dialogue, with an answer formed from a Knowledge passage; the retrieval model scores each such pair.


Question: "{I want to visit Seven Wonders of the Ancient World.} {Wow, what is this?}"
Answer: "{The Great Pyramid of Giza … of the Seven Wonders of the Ancient World, …}"
Figure 1: Q&A formulation of a Persona & Knowledge pair (eq. 1). The question form is "{Persona} {Dialogue}" while the answer is "{Knowledge}".
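As a concrete illustration, the pair construction can be sketched as follows (a minimal sketch; the helper name `make_qa_pair` is ours, and the paper's exact preprocessing may differ):

```python
def make_qa_pair(persona: str, dialogue: str, knowledge: str) -> tuple[str, str]:
    """Format one Persona-Knowledge candidate as a Q&A pair (Figure 1).

    The question concatenates a Persona sentence with the Dialogue turn,
    and the answer is the Knowledge passage, mirroring the short-question /
    descriptive-answer shape of Q&A retrieval data.
    """
    question = f"{persona} {dialogue}"
    return question, knowledge

# Example in the spirit of Figure 1.
q, a = make_qa_pair(
    "I want to visit Seven Wonders of the Ancient World.",
    "Wow, what is this?",
    "The Great Pyramid of Giza is one of the Seven Wonders of the Ancient World.",
)
```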

We then perform permutative Persona-Knowledge evaluation (Figure 2) on all pairs of augmented Persona and Knowledge candidates. We find the best Knowledge by scoring every pair and recording the Knowledge of the most aligned pair; this ensures we find the Knowledge that best aligns with both the Dialogue and the Persona of the human. The Q&A retrieval model returns a relevancy score for each pair, and the Knowledge index of the highest-scoring pair is taken as the predicted true Knowledge.
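The permutative evaluation can be sketched as follows, with a generic `score_fn(question, answer)` standing in for the Q&A relevancy model (the function name `best_knowledge` is ours, not from the paper):

```python
from itertools import product

def best_knowledge(personas, knowledges, dialogue, score_fn):
    """Permutative Persona-Knowledge evaluation (Figure 2).

    Every (Persona, Knowledge) candidate pair is scored as a Q&A
    instance -- question "{Persona} {Dialogue}", answer "{Knowledge}" --
    and the index of the Knowledge in the highest-scoring pair is
    returned as the predicted true Knowledge.
    """
    best_score, best_k = float("-inf"), -1
    for i, j in product(range(len(personas)), range(len(knowledges))):
        question = f"{personas[i]} {dialogue}"
        score = score_fn(question, knowledges[j])
        if score > best_score:
            best_score, best_k = score, j
    return best_k
```

In practice, `score_fn` could wrap a Sentence-BERT cross-encoder, e.g. `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2").predict([(q, a)])[0]`.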


3.2 Persona Retrieval

Continuing from Section 3.1, we fine-tune the Q&A retrieval model using augmented Persona and predicted-true-Knowledge pairs only, without incorrect Knowledge pairs. This fine-tuning step increases the performance of the model and yields correctly normalized scores for Persona; otherwise we would obtain inflated scores due to the alignment of the Dialogue with the Knowledge in the Q&A configuration. The fine-tuning input to the Q&A model is formed in the same manner as eq. 1, the only difference being that the true Knowledge is fixed; we note it separately because it comes from a separate training set with labeled true Knowledge.


Finally, we infer the data pairs with the fine-tuned model to obtain a Persona likelihood score. We utilize a threshold to avoid retrieving unrelated Persona; since certain Dialogues have no Persona assigned to them, the threshold lets us replicate this case.
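Under the same assumptions as above (a generic `score_fn` standing in for the fine-tuned cross-encoder, and a function name of our own), thresholded Persona retrieval might look like:

```python
def retrieve_personas(personas, true_knowledge, dialogue, score_fn, threshold=0.5):
    """Threshold-based Persona retrieval (Section 3.2).

    Each Persona candidate is scored against the predicted true
    Knowledge; candidates scoring below the threshold are dropped, so a
    Dialogue may legitimately ground zero Personas.
    """
    selected = []
    for i, persona in enumerate(personas):
        question = f"{persona} {dialogue}"
        if score_fn(question, true_knowledge) >= threshold:
            selected.append(i)
    return selected
```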


The retrieved Persona and Knowledge for a given Dialogue together constitute the grounding prediction used in response generation.


Figure 2: Persona-Knowledge permutations (P1–P5 × K1–K5) computed in search for the best Persona & Knowledge. The best Persona & Knowledge pair is P2 and K3. Candidate pairs for Persona search (eq. 3) are marked in red.

3.3 Decoding Techniques

We describe grounded conversational response generation as a downstream task of Persona-Knowledge retrieval.


y = G(d, p, k; l_min, l_max, α, b)

where y indicates a response, d represents a dialogue, p denotes the persona, k indicates the knowledge, l_min represents the minimum response length, l_max denotes the maximum response length, α indicates the coefficient of the length normalization, b denotes the beam size, and G represents the dialogue generation model. Note that we utilize our own implementation of beam search instead of the nucleus sampling Holtzman et al. (2019) baseline from Jang et al. (2021).
For the length normalization technique, we apply the following formula proposed by Wu et al. (2016) to our decoder with various alpha values. We report experimental results in Appendix A.

lp(Y) = (5 + |Y|)^α / (5 + 1)^α

where |Y| denotes the current target length and α indicates the length normalization coefficient.
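The length penalty of Wu et al. (2016) is straightforward to implement; beam hypothesis log-probabilities are divided by this quantity before ranking:

```python
def length_penalty(length: int, alpha: float) -> float:
    """Length penalty lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha from
    Wu et al. (2016). Dividing beam scores by lp(|Y|) counteracts the
    brevity bias: alpha = 0 disables normalization, while larger alpha
    favors longer hypotheses.
    """
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)
```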

4 Experiment Setup

We utilize the Call For Customized Conversation Jang et al. (2021) dataset for evaluation and fine-tuning, which has 10 Knowledge and 5 Persona candidates respectively for each dialogue. We integrate the neural Question Answering retrieval model from Sentence-BERT Reimers and Gurevych (2019) as our starting model. Specifically, we utilize a 12-layer MiniLM Wang et al. (2020) (33M params) based cross-encoder trained on MS MARCO (MRR@10 on the MS MARCO Dev Set: 39.02) Nguyen et al. (2016). This model fits our formulation very well since its purpose is semantic search, evaluating short questions and long passages together. For Persona search (eq. 4, 6), we fine-tune for 2 epochs and apply a likelihood threshold in our best configuration (see Table 3).
In addition, to evaluate the generation task, we extensively experiment with the baseline generation model trained via the configuration in Jang et al. (2021), combined with several decoding hyperparameters. We train the baseline model for 5 epochs and use default decoding settings of minimum length 1, maximum length 20, and nucleus sampling. Our method is trained for an additional 25 epochs and uses minimum length 5, maximum length 80, beam size 10, and alpha 1.0. Exact hyperparameters are attached in Table 9.

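For concreteness, these decoding settings map roughly onto Hugging Face `generate()` keyword arguments as sketched below (the paper uses its own beam search implementation, and the baseline's `top_p` value is an assumption, as it is not stated):

```python
# Decoding settings from Section 4 / Table 9 expressed as keyword sets for a
# Hugging Face seq2seq `model.generate()` call (e.g. BART-base).
BASELINE_DECODE = {
    "do_sample": True,   # nucleus sampling
    "top_p": 0.9,        # ASSUMED value; the paper does not state it
    "min_length": 1,
    "max_length": 20,
}
OURS_DECODE = {
    "do_sample": False,      # beam search instead of sampling
    "num_beams": 10,
    "min_length": 5,
    "max_length": 80,
    "length_penalty": 1.0,   # alpha in the Wu et al. (2016) normalization
}
```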

5 Results

5.1 Knowledge Retrieval

We experiment with various ablations of Dialogue / Persona / Knowledge interactions and find that permutative evaluation of the eq. 1 form yields the best performance for selecting top-1 Knowledge. The 15-point increase confirms that considering all components of dialogue is important. We report the results on the test set.

Model Type Accuracy
D & K 79.26
P & K (pairwise) 84.62
P + D & K (pairwise) 94.69 (+15.41)
Table 1: Knowledge retrieval results.

5.2 Persona Retrieval

For Persona retrieval experiments, we start with the grounding Knowledge selected in Section 5.1. Then, we perform ablations of Dialogue augmentation and fine-tuning. Fine-tuning the model yields an 8-point performance increase.

Model Type Accuracy
P & pred. K 86.75
P + D & pred. K 83.83
P + D & pred. K (fine-tuned) 91.57 (+7.74)
Table 2: Persona retrieval results with threshold 0.5; "pred. K" denotes the predicted true Knowledge from Section 5.1.

We observe low performance for the Dialogue-augmented input in comparison to Persona alone. We suspect that this is due to a lack of score normalization, in that the Q&A relationship of the Dialogue to the true Knowledge may inflate the likelihood score. We argue that fine-tuning the model normalizes the score in addition to providing a raw performance increase. We perform threshold ablations as shown in Table 3 to verify our hypothesis.

Model Type 0.0 0.5 0.6 0.7
P + D & pred. K 79.30 83.83 84.02 84.26
Fine-tuned 86.81 91.57 92.16 91.87
Table 3: Persona retrieval threshold ablations.

We find that the fine-tuned model has increased performance across all thresholds, including threshold 0.0, where the output has top-1 characteristics. We also find that the score increases in tandem with the Persona threshold in the non-fine-tuned case.

5.3 Generation Results

We experiment with various decoding methods and perform ablations. In these experiments, we use the ours (5 epoch) model described in Table 9. We report the results on the dev set.

  • Q1. What is the optimal performance we can reach with decoding method improvements?

  • Q2. How does the decoding strategy affect performance?

  • Q3. How do length constraints affect performance?

Q1. What is the optimal performance we can reach with decoding method improvements?
We obtain 10- and 11-point increases in BLEU and ROUGE-L respectively, as described in Table 4.

Model Rouge-L BLEU
Baseline 30.79 11.16
Ours (5 epoch) 38.50 (+7.71) 19.31 (+8.15)
Ours (30 epoch) 41.54 (+10.75) 21.42 (+10.26)
Table 4: Performance report of our method.

Q2. How does the decoding strategy affect performance?
We select a beam size of 10, informed by Meister et al. (2020). Table 5 demonstrates the effectiveness of beam search compared to the baseline nucleus sampling.

Beam Size N/A (nucleus) 10
BLEU 13.76 19.31 (+5.55)
Table 5: BLEU score of ours (5 epoch) model with different sampling strategies.

Q3. How do length constraints affect performance?
Table 6 demonstrates that the longer the maximum response length, the higher the performance gain. We also experiment with minimum length constraints in Appendix A.

Max. Length 20 40 60 80
BLEU 12.86 16.37 17.20 19.31
Table 6: BLEU score of ours (5 epoch) model with different maximum lengths.

6 Conclusion

We introduce a Persona-Knowledge dual context retrieval method in this paper. We achieve SOTA grounding retrieval performance through Q&A-informed data augmentations and the application of novel fine-tuning techniques. We achieve SOTA dialogue generation performance by utilizing beam search and brevity-informed constraints. We perform minimal fine-tuning for both high-performing methods. We place first across all metrics (Persona / Knowledge accuracy, SacreBLEU, chrF++, ROUGE-L) on the official leaderboard, achieving significant increases over the baseline on every metric for both the Grounding and Generation tasks.


  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of wikipedia: knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241. Cited by: §2.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv. External Links: Document, Link Cited by: §3.3.
  • Y. Jang, J. Lim, Y. Hur, D. Oh, S. Son, Y. Lee, D. Shin, S. Kim, and H. Lim (2021) Call for customized conversation: customized conversation grounding persona and knowledge. AAAI-22. External Links: Document, Link Cited by: Persona-Knowledge Dialogue Multi-Context Retrieval and Enhanced Decoding Methods, §1, §1, §3.3, §4.
  • C. K. Joshi, F. Mi, and B. Faltings (2017) Personalization in goal-oriented dialog. arXiv. External Links: Document, Link Cited by: §2.
  • B. P. Majumder, H. Jhamtani, T. Berg-Kirkpatrick, and J. McAuley (2020) Like hiking? you probably enjoy nature: persona-grounded dialog with commonsense expansions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9194–9206. External Links: Link, Document Cited by: §2.
  • C. Meister, T. Vieira, and R. Cotterell (2020) If beam search is the answer, what was the question?. arXiv preprint arXiv:2010.02650. Cited by: §1, §2, §5.3.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS marco: a human generated machine reading comprehension dataset. External Links: Link Cited by: §4.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2019) Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5370–5381. External Links: Link, Document Cited by: §2.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. EMNLP 2019. External Links: Document, Link Cited by: §1, §4.
  • S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, et al. (2020) Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. Cited by: Appendix A, §1, §2.
  • K. Shuster, S. Humeau, A. Bordes, and J. Weston (2018) Image chat: engaging grounded conversations. arXiv. External Links: Document, Link Cited by: §2.
  • W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou (2020) MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. External Links: 2002.10957 Cited by: §4.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: Appendix A, §1, §2, §3.3.
  • M. Xu, P. Li, H. Yang, P. Ren, Z. Ren, Z. Chen, and J. Ma (2020) A neural topical expansion framework for unstructured persona-oriented dialogue generation. arXiv. External Links: Document, Link Cited by: §2.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: I have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204–2213. External Links: Link, Document Cited by: §2.
  • K. Zhou, S. Prabhumoye, and A. W. Black (2018) A dataset for document grounded conversations. arXiv preprint arXiv:1809.07358. Cited by: §2.

Appendix A Appendix

While we achieve a strong performance increase without any training of the generative model, we find that our experimental results do not fully agree with existing methods introduced in Roller et al. (2020) and Wu et al. (2016). A robust decoding method applicable across multiple open-domain dialogue domains remains to be found; we leave this question to future studies.

A.1 The effect of the minimum length

Performance with different decoding minimum lengths. Other parameters are the same as ours (5 epoch).

min. length 5 10 20 40
BLEU 16.18 16.17 15.82 13.86
Table 7: Different minimum lengths result.

A.2 The effect of the length normalization

Performance with different alpha coefficients of the length normalization. Other parameters are the same as ours (5 epoch).

alpha 0.2 0.4 0.6 0.8 1.0
BLEU 16.37 16.54 16.59 16.54 16.02
Table 8: Different alpha values result.

A.3 Hyperparameters

Parameter Baseline Ours
model BART-base BART-base
training epochs 5 5 or 30
learning rate 6.25e-5 6.25e-5
training batch size 2 2
alpha (length norm) 0.0 1.0
beam size N/A (nucleus) 10
minimum length 1 5
maximum length 20 80
Table 9: Hyperparameters of the baseline model and ours. Parameters with bold represent decoding hyperparameters.