
GKS: Graph-based Knowledge Selector for Task-oriented Dialog System

In previous research, knowledge selection tasks mostly rely on language-model-based methods or knowledge ranking. However, approaches that rely solely on a language model take all knowledge as sequential input, even though knowledge carries no sequential information in most circumstances. Knowledge-ranking methods, on the other hand, leverage the dialog history and each given knowledge snippet, but not the relations between pieces of knowledge. In the 10th Dialog System Technology Challenges (DSTC10), we participated in the second track, Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations. To address the problems above, we modified the training methods of SOTA models for the first and third sub-tasks, and for the knowledge selection sub-task we proposed the Graph-Knowledge Selector (GKS), a graph-attention-based model incorporated with a language model. GKS makes knowledge selection decisions by simultaneously considering all knowledge embeddings generated by the language model, without sequential features, and takes the relations across knowledge snippets into account as part of the selection process. GKS outperforms several SOTA models on knowledge selection on the dataset from the 9th Dialog System Technology Challenges (DSTC9).



1 Introduction

Task-oriented dialog systems are widely used nowadays, providing specific services in different industries. Building dialog systems able to deal with heterogeneous data is essential for robust, general frameworks that serve a wide range of application scenarios. The problem is amplified when the input is spoken language, which carries even more variance. In this paper, we propose our approach to constructing the dialog system for each sub-task so as to alleviate dataset differentiation. We mainly focus on the knowledge selection task, aiming to solve potential obstacles that past knowledge selection approaches might encounter.

In DSTC10 track 2, the goal is to build a dialog system for spoken language, which is not straightforward to handle with a pre-trained language model. For task 1, we design our knowledge-seeking turn detection model with a denoising language model. The model is pre-trained with text data, aiming to transfer spoken language into text-style data.

Past knowledge selection approaches mostly utilized language models, often with minor or no changes to the model structure beyond the training objective and techniques such as data augmentation. We argue that the information within knowledge for dialog should not be consumed by a language model in a purely sequential style (see Table 1). To better capture knowledge information, we build our knowledge selection model for task 2 with a graph-based design. The proposed Graph-Knowledge Selector (GKS) takes knowledge embeddings as input for the selection task. Rather than simply concatenating knowledge and question together as input to a language model and outputting a prediction, GKS leverages the information between each piece of knowledge, and between knowledge and question, with a graph-attention model.

To keep the focus on our proposed knowledge selection framework, for the remaining tasks we adopt a model proposed in DSTC9: we use Bao et al. (2020), with minor modifications, as our generation model, which takes the prediction of the knowledge selection task as input.

The main contribution of this paper is the new knowledge selection model for dialog systems. However, since our approach does not fully solve the spoken-language problem, we further test our framework on the DSTC9 dataset, which consists of text data. The experiment results show that our proposed framework outperforms the models proposed in last year's challenge. Our implementation will be released upon acceptance.

Detected turn Does the hotel offer accessible parking?
Knowledge0 Does the hotel offer accessible parking?
Knowledge1 Is there on-site private parking at the Bridge Guest House?
Knowledge2 Do I have to pay for parking?
Knowledge3 Is there a cost for parking?
Knowledge4 Can I make a reservation for parking?
Knowledge5 Do they have help for disabled parking?
Knowledge6 Do you provide room service daily?
Knowledge7 Are there any fitness center or gym?
Table 1: Example of a dialog turn and its knowledge candidates. The desired knowledge for the user is Knowledge0. However, there is no sequential relation between the knowledge snippets, yet many approaches simply concatenate them together as input. Moreover, if we train the model by simply pairing the detected turn with each knowledge snippet and sampling training pairs, the overlapping information between snippets may be wasted. In this example, Knowledge0-5 all refer to parking issues; we let the model distinguish Knowledge0 from all the others according to the detected turn, rather than through sampling approaches.

2 Related Work

The recent development of task-oriented dialogue systems benefits from pre-trained language models like GPT-2 (Radford et al., 2018; Budzianowski and Vulic, 2019; Ham et al., 2020). Furthermore, BERT-like (Devlin et al., 2019) architectures, such as RoBERTa (Liu et al., 2019), achieve state-of-the-art performance on natural language understanding tasks like the GLUE benchmark (Wang et al., 2018). In subtasks 1 and 3, we leverage this architecture to attain better performance.

Knowledge-Seeking Turn Detection

Knowledge-seeking turn detection was first introduced in DSTC9 track 1 by Kim et al. (2020), who proposed a binary classifier to solve the task. Tang et al. (2021) used a supervised method with an ELECTRA-based model (Clark et al., 2020) followed by a fully connected layer to determine whether to trigger knowledge. Similarly, Mi et al. (2021) employed an ensemble of four pre-trained models, including RoBERTa and UniLM (Dong et al., 2019), to solve it as a classification task. Bao et al. (2020) proposed considering APIs and external knowledge through a schema description method (Shah et al., 2019; Eric et al., 2019; Rastogi et al., 2020).

Knowledge Selection

The knowledge selection task is to retrieve candidate snippets from the knowledge base for response generation. Traditionally, the TF-IDF technique (Ramos, 2003) and language models have been applied to similar tasks. As mentioned above, the limitations of previous works lie in the model structure. We are instead inspired by the Kernel Graph Attention Network proposed by Liu et al. (2021), which performs fine-grained fact verification with kernel-based attention. We believe such a graph-based model can better capture information and select a more plausible set of knowledge snippets.

Knowledge Grounded Generation

The third component of our system generates responses given the selected knowledge snippets. Pre-trained language models such as BERT have propelled progress in natural language generation, but also show limitations when directly fine-tuned on small conversation datasets (Rashkin et al., 2019; Wolf et al., 2019). PLATO (Bao et al., 2020) addresses this problem with uni- and bi-directional processing and further pre-training on large-scale Reddit and Twitter conversations; it also introduces a discrete latent variable to capture the one-to-many relationship between conversational utterances. In DSTC9 track 1, Tan et al. (2020) further incorporated a knowledge-copy method that computes the response probability by combining the generation distribution with a knowledge-attention distribution. This provides an efficient way to generate sentences under the given knowledge, reduces the pressure on the decoder, and makes it easier for models to generalize to unseen knowledge.

3 Methodology

In this section, we first define our problem by dividing it into three separate sub-tasks, then discuss how we address each part with different models. Figure 1 shows our overall framework.

3.1 Knowledge-Seeking Turn Detection

In the first phase of our system, we deploy a binary classifier to decide whether to trigger the knowledge access branch for a given utterance and dialogue history.

Data Representations

To capture whether the current dialog turn needs to trigger knowledge, we concatenate the current turn with the dialog history, so the model can consider richer information than a single dialog turn. Besides, to let the model identify the speaker of each turn, we add a speaker token ([User] or [Sys]) before every utterance, marking whether the turn is spoken by the user or the system. We believe the different speakers provide implicit information in the dialog history that helps our system perform better. The input is represented as:


x = [CLS] [User] U_1 [Sys] S_1 ... [User] U_t

where U_i denotes the i-th user turn and S_i the i-th system turn.
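As a concrete illustration, the serialization above can be sketched as follows. The helper name and the plain-string handling of special tokens are our own illustrative assumptions; in practice the speaker tokens would be added to the tokenizer's vocabulary.

```python
# Illustrative sketch of the detection input: each turn is prefixed with a
# [User] or [Sys] speaker token, and the whole sequence starts with [CLS].
def build_detection_input(history, current_turn):
    """history: list of (speaker, utterance) pairs, speaker in {'user', 'sys'}."""
    parts = ["[CLS]"]
    for speaker, utterance in history + [("user", current_turn)]:
        token = "[User]" if speaker == "user" else "[Sys]"
        parts.append(f"{token} {utterance}")
    return " ".join(parts)

example = build_detection_input(
    [("user", "I need a hotel."), ("sys", "The Bridge Guest House is available.")],
    "Does the hotel offer accessible parking?",
)
```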

Binary Classification Model

We define knowledge-seeking turn detection as a binary classification task. To extract informative features from the dialog context, we use RoBERTa as our encoder, since it outperforms most current pre-trained language models. Besides, we apply a new dialogue-turn embedding, which encodes the turn number within the whole dialogue, in our training procedure, so that the model can treat the turn number as an informative signal. After fine-tuning the RoBERTa model, the probability of triggering the knowledge branch is calculated as:


p = softmax(W h_[CLS] + b)

where h_[CLS] is the output hidden state of the [CLS] token, W and b are trainable parameters, and p is the probability that the input dialog needs to trigger the knowledge access branch.
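A minimal numeric sketch of this classification head follows. The encoder is replaced by a random vector, and the hidden size and variable names are illustrative assumptions, not RoBERTa's actual configuration.

```python
import math
import random

random.seed(0)
HIDDEN = 8  # stand-in for RoBERTa's hidden size (768 in the real model)

h_cls = [random.gauss(0, 1) for _ in range(HIDDEN)]  # mock [CLS] hidden state
W = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(2)]  # 2 classes
b = [0.0, 0.0]

# logits = W h + b, then a softmax over the two classes
logits = [sum(w * h for w, h in zip(row, h_cls)) + bias for row, bias in zip(W, b)]
m = max(logits)
exps = [math.exp(l - m) for l in logits]
probs = [e / sum(exps) for e in exps]  # probs[1]: trigger the knowledge branch
```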

3.2 Knowledge Selection

This section describes how we develop our graph-based knowledge selection model GKS.

Figure 1: The overall framework for knowledge selection and knowledge-grounded generation. The left part describes the workflow of knowledge selection, generating a probability for each node. The generation model then takes the knowledge snippet whose node has the highest probability as input for generation.

Knowledge Embedding

To construct node embeddings for GKS, we first build knowledge embeddings with a BERT model. To pre-train it, we concatenate each detected knowledge-needed question (i.e., the detected knowledge-seeking turn) with each knowledge snippet using a [SEP] token, and prepend [CLS] to every question-knowledge pair. We train the BERT model with a linear layer on top as a binary classifier over the labels "Related" and "Unrelated". We then use this pre-trained BERT model to produce the node embeddings:


x^i = [CLS] q [SEP] k^i,    H^i = BERT(x^i)

where k^i is the knowledge snippet connected with the question q by [SEP], drawn from the knowledge set K = {k^1, ..., k^n} consisting of the n knowledge pieces of the current entity, and H^i is the hidden state of the question-knowledge pair. To elaborate, H^i_0 represents [CLS], and q = (q_1, ..., q_m) and k^i = (k^i_1, ..., k^i_l) represent the question, containing m words (tokens), and the knowledge, containing l words (tokens).
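The pairwise inputs for this Related/Unrelated pre-training can be sketched as below; the helper name and plain-string token handling are illustrative assumptions.

```python
def build_pair_inputs(question, knowledge_set):
    """One '[CLS] q [SEP] k_i' string per knowledge snippet of the entity."""
    return [f"[CLS] {question} [SEP] {k}" for k in knowledge_set]

pairs = build_pair_inputs(
    "Does the hotel offer accessible parking?",
    ["Do they have help for disabled parking?", "Do you provide room service daily?"],
)
# each pair is encoded by BERT and classified as Related / Unrelated;
# the resulting hidden states later serve as the GKS node embeddings
```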

Graph-Attention Knowledge Selection

Inspired by Liu et al. (2021), who successfully capture text information with a graph attention model for factual verification, we develop our selection model as a graph-based model. The prediction is made per node, and the node with the highest probability indicates the predicted knowledge. We follow Zhou et al. (2019) in utilizing node kernels to determine the relevance between the dialog turn and each knowledge snippet with a "readout" function. First, Graph-Knowledge Selector (GKS) computes a translation matrix M^i between the knowledge hidden state H^i and the question hidden state H^q, where q is the question of the user:

M^i_{jl} = cos(H^i_j, H^q_l)

GKS then applies kernel match features on the translation matrix to construct the node representation for knowledge selection:

K_d(M^i_j) = log Σ_l exp( -(M^i_{jl} - μ_d)^2 / (2 σ_d^2) ),    K(M^i_j) = [K_1(M^i_j), ..., K_D(M^i_j)]

A linear layer over the mean-pooled kernel features functions as the readout to calculate the knowledge selection probability p(k^i | q):

p(k^i | q) = softmax_i( W_r (1/|k^i|) Σ_j K(M^i_j) + b_r )

The whole model is trained end-to-end by minimizing the cross-entropy loss:

L = - Σ log p(k* | q)

where k* is the golden knowledge for the detected knowledge-seeking turn.
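To make the kernel computation concrete, here is a small self-contained sketch of the cosine translation matrix and kernel match features. Everything below is an illustrative assumption: the hidden states are toy 2-d vectors, the learned linear readout is replaced by an unweighted sum, and we use log(1 + x) rather than log(x) inside the kernel so that the unweighted sum stays numerically sensible.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def kernel_features(row, mus, sigma=0.1):
    # K_d(M_j) = log(1 + sum_l exp(-(M_jl - mu_d)^2 / (2 sigma^2)))
    return [math.log1p(sum(math.exp(-(m - mu) ** 2 / (2 * sigma ** 2)) for m in row))
            for mu in mus]

def score_knowledge(H_k, H_q, mus):
    # translation matrix: one row per knowledge token, one column per question token
    M = [[cosine(hk, hq) for hq in H_q] for hk in H_k]
    feats = [kernel_features(row, mus) for row in M]
    # mean-pool the kernel features over knowledge tokens, then sum them
    # (the real readout is a learned linear layer, not a plain sum)
    pooled = [sum(col) / len(feats) for col in zip(*feats)]
    return sum(pooled)

# toy hidden states: the first knowledge snippet matches the question closely
H_q = [[1.0, 0.0], [0.8, 0.2]]
H_match = [[0.9, 0.1], [1.0, 0.05]]
H_off = [[0.0, 1.0], [0.1, 0.9]]
mus = [-0.9, -0.3, 0.3, 0.9]  # kernel centers spread over the cosine range

scores = [score_knowledge(k, H_q, mus) for k in (H_match, H_off)]
top = max(scores)
exps = [math.exp(s - top) for s in scores]
p = [e / sum(exps) for e in exps]  # per-node selection probability
```

In this toy setting the snippet whose token states align with the question receives the higher selection probability, which is the behavior the readout is trained to produce.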

3.3 Knowledge Grounded Generation

Given the candidate knowledge snippets, we select the one with the highest probability as input for knowledge-grounded generation. Inspired by Bao et al. (2020) and Tan et al. (2020), we leverage a RoBERTa-based architecture with a latent variable. Following Tan et al. (2020), we concatenate the knowledge snippet and the dialogue history with special tokens as input; unlike Tan et al. (2020), we consider both the question and the answer part of the knowledge snippet:


x = [CLS] k^q [SEP] k^a [SEP] [S1] d_1 [S2] d_2 ... [SEP] r

where k^q and k^a represent the question and answer parts of the selected knowledge snippet, [S1] and [S2] represent the two speakers in the dialogue, d_t denotes the t-th turn in the dialog history D, and r represents the response. Following Tan et al. (2020), we encode the response into a matrix Z, each row of which represents a specific z corresponding to a given example. To select a specific z as our latent variable, we estimate the posterior probability p(z | r, S), where S denotes the dialogue history. The rest of the architecture and calculation process, such as the knowledge-copy mechanism, segmented response generation, and modified beam search, is essentially identical to Tan et al. (2020). We illustrate it together with the subtask 2 model in Figure 1.
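The input concatenation can be sketched as follows; the helper name and plain-string token handling are illustrative assumptions, with only the overall layout following the description above.

```python
def build_generation_input(snippet_q, snippet_a, history, response):
    """Knowledge question + answer, speaker-tagged history, then the response."""
    parts = [f"[CLS] {snippet_q} [SEP] {snippet_a} [SEP]"]
    speakers = ["[S1]", "[S2]"]  # the two speakers alternate through the history
    for i, turn in enumerate(history):
        parts.append(f"{speakers[i % 2]} {turn}")
    parts.append(f"[SEP] {response}")
    return " ".join(parts)

x = build_generation_input(
    "Does the hotel offer accessible parking?",
    "Yes, accessible parking is available on site.",
    ["I need a hotel.", "The Bridge Guest House is available."],
    "Yes, the hotel offers accessible parking.",
)
```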

4 Experiments

This section demonstrates our experiment results. For the baseline and the chosen comparison models, we report the numbers presented in their DSTC9 papers from last year.

Knowledge Seeking-Turn Detection

Table 2 shows our results on the DSTC9 dataset. As the baseline model of DSTC9 already performs very well, the other proposed models differ only slightly in their results. According to our experiments, our model outperforms several baselines on F1 score, which indicates that training the model with [User] and [Sys] tokens gives the language model more ability to learn the patterns of user utterances. ROBERTA-WD (DA) (Mi et al., 2021) performs best among the selected models; we assume their training strategy with data augmentation is the key to this gain, given that most language models already perform very well on the original dataset.

Model Recall Precision F1
Baseline 0.9992 0.9719 0.9853
ROBERTA-HS 0.9981 0.9963 0.9972
ROBERTA-WD (DA) 0.9996 0.9985 0.9990
He et al. (2021) 0.99102 0.9969 0.9939
Tang et al. (2021) 0.9817 0.9465 0.9638
Ours 0.9918 0.9921 0.9920
Table 2: Our result of Knowledge-Seeking Turn Detection. Where ROBERTA-HS and ROBERTA-WD (DA) are from (Mi et al., 2021).
Model Acc@5 Acc@1 MRR@5
Baseline 0.8772 0.6201 0.7263
ROBERTA-WD 0.9745 0.9145 0.9428
ROBERTA-WD (IS) 0.9741 0.9456 0.9589
ROBERTA-WD-listwise 0.9752 0.9394 0.9566
He et al. (2021) 0.9892 0.9465 0.9665
Tang et al. (2021) 0.9665 0.9117 0.9372
GKS 0.9899 0.95435 -
Table 3: Result of our proposed knowledge selection model GKS. ROBERTA-WD, ROBERTA-WD (IS), and ROBERTA-WD-listwise are proposed in (Mi et al., 2021).

Knowledge Selection

Since our motivation is to develop a better solution for knowledge selection, without the noise of spoken-language translation and avoiding the potential defects of previous approaches discussed in earlier sections, we further test our proposed model on the DSTC9 dataset. Table 3 shows the performance of the GKS model on the DSTC9 track 1 dataset. We selected several models proposed in last year's competition, including SOTA models, as baselines. ROBERTA-WD (IS) (Mi et al., 2021) used a sampling technique and k-fold cross-validation during training. He et al. (2021) acquired multi-scale negatives to replace random sampling, which might lead to coarse-grained class separation. Tang et al. (2021) is an ELECTRA-based model with a proposed aggregated loss that captures the correlation between domains, entity names, knowledge snippets, and dialog contexts. The results show that our model, which applies a graph-based model in the selection process, outperforms past approaches that rely only on language models, even without data augmentation.

Baseline 0.3601 0.2202 0.1389 0.0956 0.3600 0.3939 0.1749 0.3501
Mi et al. (2021) 0.4330 0.3061 0.2133 0.1616 0.4535 0.4795 0.2520 0.4304
Tang et al. (2021) 0.3684 0.2374 0.1531 0.1030 0.3719 0.4113 0.1938 0.3692
He et al. (2021) 0.4267 0.2789 0.1858 0.1357 0.4324 0.4587 0.2249 0.4093
Ours 0.4356 0.2978 0.1993 0.1378 0.4400 0.4711 0.2415 0.4262
Table 4: Automatic metric of our generation model against other baselines in subtask3.

Knowledge Grounded Generation

The generation results are shown in Table 4, compared with the other systems from DSTC9. Following Tan et al. (2020), our RoBERTa-based model uses the same hyperparameters as the baseline model in Kim et al. (2021): the batch size is 4 and the gradient accumulation steps are 32, with the learning rate likewise taken from the baseline. The number of latent variables z is set to 5. The model is trained for 20 epochs, and we use the copy mechanism followed by vanilla beam search to obtain the final generated results.
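As a sketch of the decoding step, here is a toy vanilla beam search over a fixed three-token vocabulary. The step-probability interface is an assumption for illustration; the real model scores continuations with the fine-tuned language model and adds the copy mechanism on top.

```python
import math

def beam_search(step_logprobs, beam_size=4, length=3):
    """step_logprobs(prefix) -> {token: logprob}; returns the best sequence."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_size]
    return beams[0][0]

# toy distribution: token probabilities independent of the prefix
def toy_step(prefix):
    return {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

best = beam_search(toy_step)  # ["a", "a", "a"] for this distribution
```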

5 Conclusion

In this paper, we proposed a framework for DSTC10 and DSTC9. Our main goal is to develop a better solution for knowledge selection, which in the past relied only on language models. The results show that our graph-based knowledge selection model performs better than the models proposed last year. As future work, we are interested in replacing the knowledge-turn question embedding, constructed from text sentence embeddings in our original setting, with wave embeddings; we assume this could better capture spoken features without hurting the overall system.


  • S. Bao, H. He, F. Wang, H. Wu, and H. Wang (2020) PLATO: pre-trained dialogue generation model with discrete latent variable. External Links: 1910.07931 Cited by: 1 Introduction, Knowledge-Seeking Turn Detection, Knowledge Grounded Generation, 3.3 Knowledge Grounded Generation.
  • P. Budzianowski and I. Vulic (2019) Hello, it’s GPT-2 - how can I help you? towards the use of pretrained language models for task-oriented dialogue systems. CoRR abs/1907.05774. External Links: Link, 1907.05774 Cited by: 2 Related Work.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. External Links: 2003.10555 Cited by: Knowledge-Seeking Turn Detection.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: 2 Related Work.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. External Links: 1905.03197 Cited by: Knowledge-Seeking Turn Detection.
  • M. Eric, R. Goel, S. Paul, A. Kumar, A. Sethi, P. Ku, A. K. Goyal, S. Agarwal, S. Gao, and D. Hakkani-Tur (2019) MultiWOZ 2.1: a consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. External Links: 1907.01669 Cited by: Knowledge-Seeking Turn Detection.
  • D. Ham, J. Lee, Y. Jang, and K. Kim (2020) End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 583–592. External Links: Link, Document Cited by: 2 Related Work.
  • H. He, H. Lu, S. Bao, F. Wang, H. Wu, Z. Niu, and H. Wang (2021) Learning to select external knowledge with multi-scale negative sampling. External Links: 2102.02096 Cited by: Knowledge Selection, Table 2, Table 3, Table 4.
  • S. Kim, M. Eric, K. Gopalakrishnan, B. Hedayatnia, Y. Liu, and D. Hakkani-Tur (2020) Beyond domain apis: task-oriented conversational modeling with unstructured knowledge access. External Links: 2006.03533 Cited by: Knowledge-Seeking Turn Detection.
  • S. Kim, M. Eric, B. Hedayatnia, K. Gopalakrishnan, Y. Liu, C. Huang, and D. Hakkani-Tur (2021) Beyond domain apis: task-oriented conversational modeling with unstructured knowledge access track in dstc9. External Links: 2101.09276 Cited by: Knowledge Grounded Generation.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: 2 Related Work.
  • Z. Liu, C. Xiong, M. Sun, and Z. Liu (2021) Fine-grained fact verification with kernel graph attention network. External Links: 1910.09796 Cited by: Knowledge Selection, Graph-Attention Knowledge Selection.
  • H. Mi, Q. Ren, Y. Dai, Y. He, J. Sun, Y. Li, J. Zheng, and P. Xu (2021) Towards generalized models for beyond domain api task-oriented dialogue. DSTC 9. External Links: Link Cited by: Knowledge-Seeking Turn Detection, Knowledge Selection, Table 2, Table 3, Table 4.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2018) Language models are unsupervised multitask learners. External Links: Link Cited by: 2 Related Work.
  • J. E. Ramos (2003) Using tf-idf to determine word relevance in document queries. Cited by: Knowledge Selection.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2019) Towards empathetic open-domain conversation models: a new benchmark and dataset. External Links: 1811.00207 Cited by: Knowledge Grounded Generation.
  • A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2020) Schema-guided dialogue state tracking task at dstc8. External Links: 2002.01359 Cited by: Knowledge-Seeking Turn Detection.
  • D. J. Shah, R. Gupta, A. A. Fayazi, and D. Hakkani-Tur (2019) Robust zero-shot cross-domain slot filling with example values. External Links: 1906.06870 Cited by: Knowledge-Seeking Turn Detection.
  • C. Tan, X. Yang, Z. Zheng, T. Li, Y. Feng, J. Gu, Q. Liu, D. Liu, Z. Ling, and X. Zhu (2020) Learning to retrieve entity-aware knowledge and generate responses with copy mechanism for task-oriented dialogue systems. External Links: 2012.11937 Cited by: Knowledge Grounded Generation, 3.3 Knowledge Grounded Generation, Knowledge Grounded Generation.
  • L. Tang, Q. Shang, K. Lv, Z. Fu, S. Zhang, C. Huang, and Z. Zhang (2021) RADGE: relevance learning and generation evaluating method for task-oriented conversational systems. DSTC 9. External Links: Link Cited by: Knowledge-Seeking Turn Detection, Knowledge Selection, Table 2, Table 3, Table 4.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR abs/1804.07461. External Links: Link, 1804.07461 Cited by: 2 Related Work.
  • T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019) TransferTransfo: a transfer learning approach for neural network based conversational agents. External Links: 1901.08149 Cited by: Knowledge Grounded Generation.
  • J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2019) GEAR: graph-based evidence aggregating and reasoning for fact verification. External Links: 1908.01843 Cited by: Graph-Attention Knowledge Selection.