Filtering before Iteratively Referring for Knowledge-Grounded Response Selection in Retrieval-Based Chatbots

04/30/2020 ∙ by Jia-Chen Gu, et al. ∙ Queen's University USTC Anhui USTC iFLYTEK Co

The challenges of building knowledge-grounded retrieval-based chatbots lie in how to ground a conversation on the background knowledge and how to match both with the response. This paper proposes a method named Filtering before Iteratively REferring (FIRE) for presenting the background knowledge of dialogue agents in retrieval-based chatbots. We first propose a pre-filter, which is composed of a context filter and a knowledge filter. This pre-filter grounds the conversation on the knowledge and comprehends the knowledge according to the conversation by collecting the matching information between them bidirectionally and then recognizing the important information in them accordingly. After that, iterative referring is performed between the context and the response, as well as between the knowledge and the response, in order to collect deep and wide matching information. Experimental results show that the FIRE model outperforms previous methods by margins larger than 2.8% on original personas and 4.1% on revised personas on the PERSONA-CHAT dataset, as well as 3.1% on the CMU_DoG dataset in terms of top-1 accuracy.


1 Introduction

Building a conversational agent with intelligence has received significant attention with the emergence of personal assistants such as Apple Siri, Google Now and Microsoft Cortana. One approach is to build retrieval-based chatbots, which aim to select a potential response from a set of candidates given only the conversation context Lowe et al. (2015); Wu et al. (2017); Zhou et al. (2018b); Gu et al. (2019a); Tao et al. (2019).

Figure 1: An example from CMU_DoG dataset. Words in the same color refer to each other. Some irrelevant utterances do not refer to the knowledge but play the role of connecting the preceding and following utterances. Some knowledge entries such as Year, Director and Critical Response are not mentioned in this conversation.

However, real human conversations are often grounded on external knowledge. People may associate relevant background knowledge with the current conversation, and then reply in a way that makes it more engaging. Recently, researchers have sought to simulate this behaviour by grounding dialogue agents with background knowledge Zhang et al. (2018); Zhou et al. (2018a); Mazaré et al. (2018); Zhao et al. (2019); Gu et al. (2019b). In this paper, we study the problem of knowledge-grounded response selection and specify the knowledge as unstructured entries that are common sources in practice. An example is shown in Figure 1. In this way, agents can respond according to not only the semantic relevance with the given context, but also the relevant knowledge outside the scope of the given conversation.

The current state-of-the-art model on this task, DIM Gu et al. (2019b), proposed a dual matching framework that performs the context-response and knowledge-response matching separately. This model has shown strong performance in selecting a response that matches the context and the knowledge simultaneously. However, the context has no interaction with the knowledge in DIM, as it neglects the step of grounding the conversation on the knowledge, which is essential for this task. Meanwhile, the matching in DIM is too shallow to capture deep matching information, as the response refers to the context and the knowledge only once. We therefore argue that DIM has three drawbacks: (1) it does not ground the conversation on the knowledge, even though not all utterances are relevant to the knowledge (greetings, for example); (2) it does not comprehend the knowledge according to the conversation, even though entries in the knowledge are redundant (the entries Year, Director and Critical Response in Figure 1 are never mentioned); and (3) the matching in DIM is performed in a shallow and limited way.

In this paper, we propose a method named Filtering before Iteratively REferring (FIRE) for presenting the background knowledge of dialogue agents in retrieval-based chatbots. We first propose a pre-filter which lets the context and the knowledge refer to each other bidirectionally, in order to recognize the important utterances in the context and the important entries in the knowledge. Specifically, this pre-filter is composed of a context filter and a knowledge filter. The context filter lets the context refer to the knowledge to derive the knowledge-aware context. We utilize the soft attention mechanism to discriminate between relevant and irrelevant utterances. Typically, a peaky distribution of attention weights is obtained for relevant utterances as a result of semantically relevant words, while a flat one is obtained for irrelevant ones, as shown in Appendix A.2. The relevant utterances can be enhanced by the semantically relevant words in the knowledge, while approximately averaged knowledge does not bring much useful information to irrelevant utterances. However, we still keep the irrelevant utterances instead of directly filtering them out, since they connect the preceding and following utterances. On the other hand, we design a knowledge filter to derive the context-aware knowledge and directly filter out the irrelevant knowledge entries. The entries are independent of each other, so filtering out the irrelevant ones does not affect the rest.

Given the filtered context and knowledge, how to match them with the response is another key issue for this task. Motivated by the attention-over-attention (AoA) Cui et al. (2017) and interaction-over-interaction (IoI) Tao et al. (2019) neural networks, we design an iteratively referring network. This network follows the dual matching framework Gu et al. (2019b) by letting the response refer to the context and the knowledge simultaneously. Different from the shallow and limited matching in DIM, we perform the referring operation iteratively. The outputs of each iteration are utilized as the inputs of the next iteration. Each iteration is capable of capturing additional matching information based on previous iterations. We accumulate the outputs of each iteration and then aggregate them into a set of matching features for ranking responses. Accumulating these outputs helps to derive deep and wide matching information.

We test our proposed method on the PERSONA-CHAT Zhang et al. (2018) and CMU_DoG datasets Zhou et al. (2018a). Results show that the FIRE model outperforms previous methods by margins larger than 2.8% on original personas and 4.1% on revised personas on the PERSONA-CHAT dataset, as well as 3.1% on the CMU_DoG dataset in terms of top-1 accuracy, achieving a new state-of-the-art performance for knowledge-grounded response selection in retrieval-based chatbots.

In summary, the contributions of this paper are two-fold. (1) A Filtering before Iteratively REferring (FIRE) method is proposed which aims to filter the context and the knowledge first, and then iteratively refer to the response. (2) Experimental results on the PERSONA-CHAT and CMU_DoG datasets demonstrate that our proposed model outperforms the state-of-the-art model by large margins on the accuracy of response selection.

2 Related Work

2.1 Response Selection

Response selection is an important problem in building retrieval-based chatbots. Existing work on response selection can be categorized into single-turn Wang et al. (2013) and multi-turn dialogues Lowe et al. (2015); Wu et al. (2017); Zhou et al. (2018b); Gu et al. (2019a); Tao et al. (2019). Early studies focused on single-turn dialogues, considering only the last utterance of a context for response matching. More recently, the research focus has shifted to multi-turn conversations, a more practical setup for real applications. Wu et al. (2017) proposed the sequential matching network (SMN), which first matched the response with each context utterance and then accumulated the matching information with a recurrent neural network. Zhou et al. (2018b) proposed the deep attention matching network (DAM) to construct representations at different granularities with stacked self-attention. Gu et al. (2019a) proposed the interactive matching network (IMN) to perform bidirectional and global interactions between the context and the response in order to derive the matching feature vector. Tao et al. (2019) proposed the interaction-over-interaction (IoI) model, which performed matching by stacking multiple interaction blocks.

2.2 Knowledge-Grounded Chatbots

Chit-chat models suffer from a lack of explicit long-term memory as they are typically trained to produce an utterance given only a very recent dialogue history. Recently, some studies show that chit-chat models can be more diverse and engaging by conditioning them on the background knowledge. Zhang et al. (2018) released the PERSONA-CHAT dataset which employs the speakers’ profile information as the background knowledge. Zhou et al. (2018a) released the CMU_DoG dataset which employs the Wikipedia articles about popular movies as the background knowledge. Mazaré et al. (2018) proposed the fine-tuned persona-chat (FT-PC) model which first pre-trained a model using a large-scale corpus with external knowledge and then fine-tuned it on the PERSONA-CHAT dataset. Zhao et al. (2019) proposed the document-grounded matching network (DGMN) which fused information in the context and the knowledge into representations of each other. Gu et al. (2019b) proposed a dually interactive matching network (DIM) which performed the interactive matching between responses and contexts and between responses and knowledge respectively.

In this paper, we make two improvements to the state-of-the-art DIM model Gu et al. (2019b) on this task. Specifically, (1) a pre-filter is designed for the context and the knowledge before their matching with the response, and (2) the context-response and knowledge-response matching are deeper and wider than those in DIM.

3 Task Definition

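Given a dialogue dataset $\mathcal{D}$, an example of the dataset can be represented as $(c, k, r, y)$. Specifically, $c = \{u_1, u_2, \ldots, u_{n_c}\}$ represents a context with $\{u_m\}_{m=1}^{n_c}$ as its utterances and $n_c$ as the utterance number. $k = \{e_1, e_2, \ldots, e_{n_k}\}$ represents a knowledge description with $\{e_n\}_{n=1}^{n_k}$ as its entries and $n_k$ as the entry number. $r$ represents a response candidate. $y \in \{0, 1\}$ denotes a label. $y = 1$ indicates that $r$ is a proper response for $(c, k)$; otherwise, $y = 0$. Our goal is to learn a matching model $g(c, k, r)$ from $\mathcal{D}$. For any context-knowledge-response triple $(c, k, r)$, $g(c, k, r)$ measures the matching degree between $(c, k)$ and $r$.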

4 FIRE Model

Figure 2: An overview of our proposed FIRE model.

Figure 2 shows an overview of the model architecture. In general, the context utterances, responses and knowledge entries are first encoded by a sentence encoder. Then the context and the knowledge are co-filtered by referring to each other. Next, the response refers to the filtered context and knowledge simultaneously and iteratively. The outputs of each iteration are aggregated into a matching feature, and utilized as the inputs of next iteration at the same time. Finally, the matching features of each iteration are accumulated for prediction. Details are provided in the following subsections.

4.1 Word Representation

We follow the setting used in DIM Gu et al. (2019b), which constructs word representations by combining general pre-trained word embeddings, embeddings estimated on the task-specific training set, and character-level embeddings, in order to deal with the out-of-vocabulary issue.

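Formally, embeddings of the m-th utterance in a context, the n-th entry in a knowledge description and a response candidate are denoted as $\mathbf{U}_m = \{\mathbf{u}_{m,i}\}_{i=1}^{l_{u_m}}$, $\mathbf{E}_n = \{\mathbf{e}_{n,j}\}_{j=1}^{l_{e_n}}$ and $\mathbf{R} = \{\mathbf{r}_k\}_{k=1}^{l_r}$ respectively, where $l_{u_m}$, $l_{e_n}$ and $l_r$ are the numbers of words in $\mathbf{U}_m$, $\mathbf{E}_n$ and $\mathbf{R}$ respectively. Each $\mathbf{u}_{m,i}$, $\mathbf{e}_{n,j}$ or $\mathbf{r}_k$ is an embedding vector.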

4.2 Sentence Encoder

Note that the encoder can be any existing encoding model. In this paper, the context utterances, knowledge entries and response candidate are encoded by bidirectional long short-term memories (BiLSTMs) Hochreiter and Schmidhuber (1997). Detailed calculations are omitted due to limited space. After that, we obtain the encoded representations for utterances, entries and response, denoted as $\bar{\mathbf{U}}_m = \{\bar{\mathbf{u}}_{m,i}\}_{i=1}^{l_{u_m}}$, $\bar{\mathbf{E}}_n = \{\bar{\mathbf{e}}_{n,j}\}_{j=1}^{l_{e_n}}$ and $\bar{\mathbf{R}} = \{\bar{\mathbf{r}}_k\}_{k=1}^{l_r}$ respectively. Each $\bar{\mathbf{u}}_{m,i}$, $\bar{\mathbf{e}}_{n,j}$ or $\bar{\mathbf{r}}_k$ is an embedding vector of $d$ dimensions. The parameters of these three BiLSTMs are shared in our implementation.

4.3 Pre-Filter

As illustrated in Figure 1, not every utterance refers to the knowledge, and not every entry is mentioned in the conversation. In order to ground the conversation on the knowledge and comprehend the knowledge according to the conversation, we propose a pre-filter. It lets the context and the knowledge refer to each other bidirectionally to derive the filtered context and knowledge representations, which are then matched with the response. This pre-filter is composed of a context filter and a knowledge filter, which are introduced below.

Context Filter

The context refers to the knowledge in order to derive the knowledge-aware context representation by collecting the matching information between them and dynamically determining the importance of each utterance. We still keep those irrelevant utterances softly instead of directly filtering them out, since they still play the role of connecting the preceding and following utterances.

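First, given the set of utterance representations $\{\bar{\mathbf{U}}_m\}_{m=1}^{n_c}$ encoded by the sentence encoder, we concatenate them to form the context representation $\bar{\mathbf{C}} = \{\bar{\mathbf{c}}_i\}_{i=1}^{l_c}$, where $l_c$ is the total number of words in the context. The knowledge representation $\bar{\mathbf{K}} = \{\bar{\mathbf{k}}_j\}_{j=1}^{l_k}$, where $l_k$ is the total number of words in the knowledge, is formed similarly by concatenating $\{\bar{\mathbf{E}}_n\}_{n=1}^{n_k}$. Then, a soft alignment is performed by computing the attention weight between each tuple $\{\bar{\mathbf{c}}_i, \bar{\mathbf{k}}_j\}$ as

$$e_{ij} = \bar{\mathbf{c}}_i^{\top} \cdot \bar{\mathbf{k}}_j. \tag{1}$$

After that, local inference is determined by the attention weights computed above to obtain the local relevance between the context and the knowledge. For a word in the context, its relevant representation carried by the knowledge is identified and composed using $e_{ij}$ as

$$\tilde{\mathbf{c}}_i = \sum_{j=1}^{l_k} \frac{\exp(e_{ij})}{\sum_{z=1}^{l_k} \exp(e_{iz})} \, \bar{\mathbf{k}}_j, \tag{2}$$

where the contents in $\{\bar{\mathbf{k}}_j\}_{j=1}^{l_k}$ that are relevant to $\bar{\mathbf{c}}_i$ are selected to form $\tilde{\mathbf{c}}_i$, and we define $\tilde{\mathbf{C}} = \{\tilde{\mathbf{c}}_i\}_{i=1}^{l_c}$.

To further enhance the collected information, the element-wise difference and multiplication between $\{\bar{\mathbf{C}}, \tilde{\mathbf{C}}\}$ are computed, and are then concatenated with the original vectors to obtain the enhanced context representations as follows,

$$\hat{\mathbf{c}}_i = [\bar{\mathbf{c}}_i ; \tilde{\mathbf{c}}_i ; \bar{\mathbf{c}}_i - \tilde{\mathbf{c}}_i ; \bar{\mathbf{c}}_i \odot \tilde{\mathbf{c}}_i], \tag{3}$$

where $\hat{\mathbf{C}} = \{\hat{\mathbf{c}}_i\}_{i=1}^{l_c}$ and $\hat{\mathbf{c}}_i \in \mathbb{R}^{4d}$. So far we have collected the relevant information from the knowledge for the context. Finally, we compress the collected and original information, in order to obtain the knowledge-aware context representation $\mathbf{C}^0 = \{\mathbf{c}^0_i\}_{i=1}^{l_c}$ as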

(4)

where , and are parameters updated during training.

Generally speaking, the operation of context referring to knowledge mentioned above can be considered as employing as query, and as key and value. For brevity, we define this referring operation as

(5)

The filtered context representation is then utilized to match with the response.
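To make the referring operation concrete, the following is a minimal pure-Python sketch of the four steps described above: soft alignment, attention-weighted composition, enhancement, and compression. It is an illustration rather than the released implementation. The dot-product score, the ReLU activation, and the names `refer`, `weight` and `bias` are assumptions introduced here; the exact forms are those of Eqs. (1) to (5).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refer(query, key_value, weight, bias):
    """Let every vector in `query` refer to the vectors in `key_value`.

    query, key_value: lists of d-dimensional vectors (lists of floats).
    weight: hypothetical 4d x d compression matrix (list of rows).
    bias: hypothetical d-dimensional bias vector.
    """
    d = len(query[0])
    output = []
    for q in query:
        # Soft alignment: one score per key vector (dot product is assumed).
        attn = softmax([dot(q, k) for k in key_value])
        # Composition: attention-weighted sum of the key/value vectors.
        summary = [sum(a * k[t] for a, k in zip(attn, key_value)) for t in range(d)]
        # Enhancement: original, summary, difference and element-wise product.
        enhanced = (q + summary
                    + [x - y for x, y in zip(q, summary)]
                    + [x * y for x, y in zip(q, summary)])
        # Compression back to d dimensions (ReLU is an assumption).
        output.append([max(0.0, dot(enhanced, [row[t] for row in weight]) + bias[t])
                       for t in range(d)])
    return output
```

With the context as `query` and the knowledge as `key_value`, this corresponds to the context filter; swapping the two arguments gives the first step of the knowledge filter.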

Knowledge Filter

Similarly, the knowledge refers to the context in order to derive the context-aware knowledge representation. However, different from the context filter, we adopt a selection strategy to directly filter out irrelevant knowledge entries, as the entries are independent of each other and filtering out irrelevant ones does not affect the rest.

First, the knowledge refers to the context by performing the same referring operation in order to collect the matching information from the context as follows,

(6)

where .

Furthermore, we need to compute the relevance between each entry and the whole conversation in order to determine whether to filter out this entry. We first perform the last-hidden-state pooling over the representations of the utterances and entries out of the sentence encoder. The utterance embedding and the entry embedding are obtained. Next, we compute the relevance score for each utterance-entry pair as follows,

(7)

where is the matching similarity updated during training.

In order to obtain the overall relevance score between each entry and the whole conversation, an additional aggregation operation is required. Here, we assume that an entry is mentioned only once in the conversation. Thus, for a given entry, its relevance score with the conversation is defined as the maximum relevance score between it and all utterances. Mathematically, we have

(8)

A threshold is introduced here. Entries whose scores fall below this threshold are considered uninformative for this conversation and are directly filtered out. Mathematically, we have

(9)
(10)

where $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication. The result is the final filtered knowledge representation.
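The selection strategy can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the authors' code: the bilinear form of the relevance score, the way the gate is applied to the entry representation, and the names `bilinear` and `filter_entries` are hypothetical, since the text only specifies a trainable matching similarity, a maximum over utterances, a sigmoid, a threshold, and an element-wise multiplication.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear(u, matrix, e):
    """Relevance u^T M e between an utterance embedding and an entry embedding."""
    return sum(u[i] * matrix[i][j] * e[j]
               for i in range(len(u)) for j in range(len(e)))


def filter_entries(utt_embs, entry_embs, entry_reprs, matrix, threshold):
    """Gate each knowledge entry by its best relevance to any utterance.

    utt_embs, entry_embs: one pooled embedding per utterance / entry.
    entry_reprs: one list of word vectors per entry (the representation to gate).
    matrix: hypothetical trainable matching matrix.
    """
    filtered = []
    for e_emb, e_repr in zip(entry_embs, entry_reprs):
        # An entry is assumed to be mentioned at most once, so take the max
        # over utterances as its relevance to the whole conversation.
        score = max(bilinear(u, matrix, e_emb) for u in utt_embs)
        gate = sigmoid(score)
        if gate < threshold:
            gate = 0.0  # entry judged uninformative: zero it out
        filtered.append([[gate * x for x in word] for word in e_repr])
    return filtered
```

Setting `threshold` to 0 keeps every entry, because a sigmoid output is always positive; larger values remove more entries.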

4.4 Iteratively Referring

Gu et al. (2019b) show that the interactions between the context and the response and those between the knowledge and the response can both provide useful matching information for deciding the matching degree between them. However, the matching information collected there is shallow and limited, as the response refers to the context and the knowledge only once. In this paper, we design an iteratively referring network which lets the response refer to the filtered context and knowledge iteratively. Each iteration is capable of capturing additional matching information based on previous iterations. Accumulating these iterations helps to derive deep and wide matching features.

Take the context-response matching as an example. The matching strategy adopted here considers the global and bidirectional interactions between two sequences. Let and be the outputs of the l-th iteration, and the inputs of the l+1-th iteration, where and is the number of iterations. We define .

First, the context refers to the response by performing the same referring operation as follows,

(11)

After that, we can obtain the response-aware context representation .

Bidirectionally, the response refers to the context as follows,

(12)

After that, we can obtain the context-aware response representation .

After one iteration, we derive $\mathbf{C}^{1}$ and $\mathbf{R}^{1}$, which are used as the inputs of the next iteration. After $L$ iterations, we obtain $\{\mathbf{C}^{l}\}_{l=1}^{L}$ and $\{\mathbf{R}^{l}\}_{l=1}^{L}$.

On the other hand, the knowledge-response matching is conducted identically to the context-response matching as introduced above. The representations of response-aware knowledge and knowledge-aware response are used as follows,

(13)
(14)

where . Similarly, we can obtain and after L times of knowledge-response referring.
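The control flow of iteratively referring can be sketched as a short loop. The sketch below assumes that both directions read the outputs of the previous iteration and that every iteration's outputs are kept for later aggregation. `refer_fn` is a hypothetical stand-in for the referring operation, and `iteratively_refer` is a name introduced here for illustration.

```python
def iteratively_refer(sequence, response, refer_fn, num_iterations):
    """Run bidirectional referring for `num_iterations` rounds.

    sequence: the filtered context (or knowledge) representation.
    response: the response representation.
    refer_fn: hypothetical callable refer_fn(query, key_value) -> new query.
    Returns a list with one (sequence, response) pair per iteration.
    """
    outputs = []
    seq, resp = sequence, response
    for _ in range(num_iterations):
        # Both directions use the outputs of the previous iteration as inputs.
        new_seq = refer_fn(seq, resp)   # response-aware sequence
        new_resp = refer_fn(resp, seq)  # sequence-aware response
        seq, resp = new_seq, new_resp
        # Keep every iteration so that all of them can be aggregated later.
        outputs.append((seq, resp))
    return outputs
```

The same loop would be run twice, once for context-response matching and once for knowledge-response matching.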

4.5 Aggregation

Given these sets of matching matrices , , , and , they are aggregated into the final matching features. Note that we perform the same aggregation operation for each iteration. Here, we take and for example. As aggregation is not the focus of this paper, we adopt the same strategy as that used in DIM Gu et al. (2019b). We will introduce briefly and readers can refer to Gu et al. (2019b) for more details.

First, the context and knowledge matching matrices are converted back to separate matching matrices, one per utterance and one per entry. Then, each matching matrix is aggregated by max pooling and mean pooling operations to derive its embedding. Next, the sequences of utterance embeddings and entry embeddings are further aggregated to get the embedding vectors for the context and the knowledge respectively.

As the utterances in a context are chronologically ordered, the utterance embeddings are sent into another BiLSTM following the chronological order of utterances in the context. Combined max pooling and last-hidden-state pooling operations are then performed to derive the context embeddings .

On the other hand, as the entries in a knowledge description are independent to each other, an attention-based aggregation is designed over to derive the knowledge embeddings .
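As a small illustration of the first pooling step described above, the sketch below turns one matching matrix into a single embedding. Concatenating the max-pooled and mean-pooled vectors is an assumption made here, and `pool_matrix` is a hypothetical name; the BiLSTM and attention-based aggregation that follow are not shown.

```python
def pool_matrix(matrix):
    """Concatenate column-wise max pooling and mean pooling of a matching matrix.

    matrix: list of word vectors (one row per word), all of the same dimension.
    """
    dims = range(len(matrix[0]))
    max_pooled = [max(row[t] for row in matrix) for t in dims]
    mean_pooled = [sum(row[t] for row in matrix) / len(matrix) for t in dims]
    return max_pooled + mean_pooled
```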

The matching feature vector of this iteration is the concatenation of context, knowledge and response embeddings as

(15)

where the first two features describe the context-response matching, and the last two describe the knowledge-response matching.

Finally, we obtain the set of matching features, one for each iteration.

4.6 Prediction

Each matching feature vector is sent into a multi-layer perceptron (MLP) classifier. Here, the MLP is designed to predict whether a context-knowledge-response triple matches appropriately based on the derived matching feature vector, and to return a score denoting the matching degree. A softmax output layer is adopted in this model to return a probability distribution over all response candidates. The probability distributions obtained from the matching feature vectors of all iterations are averaged to derive the final matching score for ranking.
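The averaging step can be written down directly. In this sketch, `scores_per_iteration` is a hypothetical list holding, for each iteration, the MLP scores of all response candidates; the function name `final_scores` is also an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def final_scores(scores_per_iteration):
    """Average the per-iteration softmax distributions over response candidates."""
    dists = [softmax(scores) for scores in scores_per_iteration]
    num_candidates = len(dists[0])
    return [sum(d[i] for d in dists) / len(dists) for i in range(num_candidates)]


# Two iterations, three candidates: the candidate with the highest average wins.
ranking_scores = final_scores([[2.0, 1.0, 0.0], [1.0, 3.0, 0.0]])
best = max(range(len(ranking_scores)), key=lambda i: ranking_scores[i])
```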

4.7 Learning Method

Inspired by Tao et al. (2019), we learn a model by minimizing the summation of the loss of each iteration. By this means, each feature can be directly supervised by the labels in during learning. Furthermore, inspired by Szegedy et al. (2016), we employ the strategy of label smoothing in order to prevent the model from being over-confident. Let denote the parameters of the matching model. The learning objective is formulated as

(16)
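A minimal sketch of such an objective is given below, assuming a cross-entropy loss over the candidate distribution and the common form of label smoothing in which a fraction `epsilon` of the probability mass is spread uniformly over all candidates. The value of `epsilon` and the function names are hypothetical; the exact objective is the one in Eq. (16).

```python
import math


def smoothed_cross_entropy(probs, true_index, epsilon):
    """Cross-entropy between a smoothed one-hot target and predicted probabilities."""
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        one_hot = 1.0 if i == true_index else 0.0
        target = (1.0 - epsilon) * one_hot + epsilon / k
        loss -= target * math.log(p)
    return loss


def total_loss(probs_per_iteration, true_index, epsilon=0.1):
    """Sum the loss of every iteration so each one is supervised directly."""
    return sum(smoothed_cross_entropy(p, true_index, epsilon)
               for p in probs_per_iteration)
```

With smoothing, a very confident prediction is no longer rewarded with a loss close to zero, which discourages over-confidence.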

5 Experiments

Model                                  PERSONA-CHAT (Original)   PERSONA-CHAT (Revised)   CMU_DoG
                                       R@1    R@2    R@5         R@1    R@2    R@5        R@1    R@2    R@5
Starspace Wu et al. (2018)             49.1   60.2   76.5        32.2   48.3   66.7       50.7   64.5   80.3
Profile Memory Zhang et al. (2018)     50.9   60.7   75.7        35.4   48.3   67.5       51.6   65.8   81.4
KV Profile Memory Zhang et al. (2018)  51.1   61.8   77.4        35.1   45.7   66.3       56.1   69.9   82.4
Transformer Mazaré et al. (2018)       54.2   68.3   83.8        42.1   56.5   75.0       60.3   74.4   87.4
DGMN Zhao et al. (2019)                67.6   80.2   92.9        58.8   62.5   87.7       65.6   78.3   91.2
DIM Gu et al. (2019b)                  78.8   89.5   97.0        70.7   84.2   95.0       78.7   89.0   97.1
FIRE (Ours)                            81.6   91.2   97.8        74.8   86.9   95.9       81.8   90.8   97.4
Table 1: Performance of the proposed and previous methods on the test sets of the PERSONA-CHAT and CMU_DoG datasets. The meanings of “Original” and “Revised” can be found in Section 5.1.

5.1 Datasets

We tested our proposed method on the PERSONA-CHAT Zhang et al. (2018) and CMU_DoG datasets Zhou et al. (2018a) which both contain multi-turn dialogues grounded on the background knowledge.

The PERSONA-CHAT dataset consists of 8939 complete dialogues for training, 1000 for validation, and 968 for testing. Response selection is performed at every turn of a complete dialogue, which results in 65719 dialogues for training, 7801 for validation, and 7512 for testing in total. Positive responses are true responses from humans and negative ones are randomly sampled by the dataset publishers. The ratio between positive and negative responses is 1:19 in the training, validation, and testing sets. There are 955 possible personas for training, 100 for validation, and 100 for testing, each consisting of 3 to 5 profile sentences. To make this task more challenging, a version with revised persona descriptions is also provided, obtained by rephrasing, generalizing, or specializing the original ones.

The CMU_DoG dataset consists of 2881 complete dialogues for training, 196 for validation, and 537 for testing. Response selection is also performed at every turn of a complete dialogue, which results in 36159 dialogues for training, 2425 for validation, and 6637 for testing in total. This dataset was built in two scenarios. In the first scenario, only one worker has access to the provided knowledge and is responsible for introducing the movie to the other worker; in the second scenario, both workers know the knowledge and are asked to discuss the content. Since the data size for an individual scenario is small, we followed the setting used in Zhao et al. (2019), which merged the data of the two scenarios in the experiments and filtered out conversations with fewer than 4 turns to avoid noise. Since this dataset did not contain negative examples, we adopted the version shared in Zhao et al. (2019), in which 19 negative candidates were randomly sampled for each utterance from the same set.

5.2 Evaluation Metrics

We used the same evaluation metrics as in previous work Zhang et al. (2018); Zhao et al. (2019). Each model aimed to select the $k$ best-matched responses from the available candidates for the given context and knowledge. We calculated the recall of the true positive replies, denoted as R@$k$.
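For reference, this recall measure can be computed as in the short sketch below, which assumes one true response per example and a list of model scores over its candidates. The function names are hypothetical.

```python
def recall_at_k(scores, true_index, k):
    """Return 1.0 if the true response is among the k highest-scored candidates."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if true_index in ranked[:k] else 0.0


def mean_recall_at_k(examples, k):
    """Average recall at k over (scores, true_index) pairs."""
    return sum(recall_at_k(s, t, k) for s, t in examples) / len(examples)
```

With `k` set to 1 this measure reduces to top-1 accuracy.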

5.3 Training Details

Due to limited space, readers can refer to Appendix A.1 for more details.

5.4 Experimental Results

Table 1 presents the evaluation results of our proposed and previous methods on the PERSONA-CHAT dataset under the configurations of original and revised personas, as well as the results on the CMU_DoG dataset. As the DIM model had not been tested on the CMU_DoG dataset, we used the code released by the original authors Gu et al. (2019b) to test its performance on that dataset. Results show that the FIRE model outperforms previous methods by margins larger than 2.8% on original personas and 4.1% on revised personas on the PERSONA-CHAT dataset, as well as 3.1% on the CMU_DoG dataset in terms of top-1 accuracy, achieving a new state-of-the-art performance for knowledge-grounded response selection in retrieval-based chatbots.
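The quoted margins can be checked against the top-1 figures of DIM and FIRE in Table 1 with a few lines of arithmetic. The dictionary keys below are labels chosen for this illustration only.

```python
# Top-1 results of DIM and FIRE taken from Table 1.
dim = {"PERSONA-CHAT original": 78.8, "PERSONA-CHAT revised": 70.7, "CMU_DoG": 78.7}
fire = {"PERSONA-CHAT original": 81.6, "PERSONA-CHAT revised": 74.8, "CMU_DoG": 81.8}

# Absolute improvement of FIRE over DIM, rounded to one decimal place.
margins = {name: round(fire[name] - dim[name], 1) for name in dim}
```

The resulting margins are 2.8, 4.1 and 3.1 points, matching the figures reported in the text.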

5.5 Analysis

Ablations

Model                  PERSONA-CHAT            CMU_DoG
                       Original    Revised
FIRE                   82.3        75.2        83.4
  - Iterative refer    81.3        73.8        81.6
    - Pre-filter       78.9        71.1        78.8
C-R                    65.6        66.2        79.7
C-R → Fusion           67.0        66.4        80.9
Filter → C-R           78.8        70.2        81.4
K-R                    51.6        34.3        57.8
K-R → Fusion           54.2        39.4        63.1
Filter → K-R           63.6        51.0        73.5
Table 2: Ablation tests of the FIRE model on the validation sets. C-R denotes the context-response matching and K-R denotes the knowledge-response matching. → denotes the operation order.

We conducted the ablation tests as follows. First, we ablated iteratively referring by setting the number of iterations to one. Then we removed the pre-filter. The evaluation results on the validation sets are shown in Table 2. Each ablated model performed worse than its predecessor in terms of selection accuracy, which demonstrates the effectiveness of these components in the FIRE model.

Furthermore, we studied the single context-response and knowledge-response matching in order to show the effectiveness of the context filter and the knowledge filter separately. Three experiments were designed as follows: (1) single context-response matching without knowledge; (2) context-response matching first and then knowledge fusion at a fine-grained utterance level, as in the IMN model in Gu et al. (2019b), to which readers can refer for more details; (3) context filtering first and then context-response matching. The evaluation results on the validation sets are shown in Table 2. They show that both fusion after matching and filtering before matching improve the performance with the help of knowledge. Meanwhile, filtering before matching outperformed fusion after matching by a large margin, which demonstrates the effectiveness of the context filter. We also designed similar experiments for the knowledge-response matching and observed the same trend, which demonstrates the effectiveness of the knowledge filter.

Context-Knowledge Co-Filtering

Figure 3: Performance of FIRE with respect to different values of the filtering threshold on the validation sets.

Figure 3 illustrates the performance of FIRE with respect to different values of the filtering threshold of the knowledge filter on the validation sets. Here, the number of iterations was set to 1 to save computation. Note that the selection strategy is ablated when the threshold is 0, since no entry is then filtered out. From the figure, we observe a consistent trend: performance improved as the threshold increased at the beginning, which indicates that filtering out irrelevant entries indeed improves the selection accuracy. Performance then started to drop when the threshold became too large, since some relevant entries were also filtered out by mistake.

A case study is further conducted in Appendix A.2 by visualizations.

Iteratively Referring

Figure 4: Performance of FIRE with respect to different numbers of iterations on the validation sets.

Figure 4 illustrates how the performance of FIRE changes with respect to the number of iterations on the validation sets. From the figure, we can observe a consistent trend that a significant improvement was achieved during the first few iterations, and then the performance of the model becomes stable. The results indicate that iteratively referring indeed improves accuracy of response selection.

Complexity

We analysed the time and space complexity difference between FIRE and DIM, which shows that FIRE is more time-efficient. Readers can refer to Appendix A.3 for more details.

6 Conclusion

In this paper, we propose a method named Filtering before Iteratively REferring (FIRE) for presenting the background knowledge of dialogue agents in retrieval-based chatbots. FIRE first pre-filters the context and the knowledge and then uses the filtered context and knowledge to perform deep and wide matching with the response. Experimental results show that the FIRE model outperforms previous methods by large margins on the PERSONA-CHAT and CMU_DoG datasets, achieving a new state-of-the-art performance for knowledge-grounded response selection in retrieval-based chatbots. In the future, we will explore employing pre-training methods to select relevant knowledge and incorporate it for response selection.

References

Appendix A Appendices

A.1 Training Details

For training the FIRE model on both the PERSONA-CHAT and CMU_DoG datasets, some common configurations were set as follows. The Adam method Kingma and Ba (2015) was employed for optimization. The initial learning rate was 0.00025 and was exponentially decayed by 0.96 every 5000 steps. Dropout Srivastava et al. (2014) with a rate of 0.2 was applied to the word embeddings and all hidden layers. A word representation is a concatenation of a 300-dimensional GloVe embedding Pennington et al. (2014), a 100-dimensional embedding estimated on the training set using the Word2Vec algorithm Mikolov et al. (2013), and 150-dimensional character-level embeddings with window sizes {3, 4, 5}, each consisting of 50 filters. The word embeddings were not updated during training. All hidden states of the LSTM have 200 dimensions. The MLP at the prediction layer has 256 hidden units with ReLU Nair and Hinton (2010) activation. The validation set was used to select the best model for testing.
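The learning-rate schedule described above can be written as a one-line function. Whether the decay is applied continuously or in discrete steps of 5000 is not stated, so the `staircase` flag below is an assumption covering both readings, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(step, initial=0.00025, decay_rate=0.96, decay_steps=5000,
                  staircase=False):
    """Exponentially decayed learning rate at a given training step."""
    # Staircase decay changes the rate only at multiples of decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial * decay_rate ** exponent
```

Under either reading, the rate after 5000 steps is 0.00025 × 0.96 = 0.00024.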

Some parameters differed according to the characteristics of the two datasets. For the PERSONA-CHAT dataset, the maximum number of characters in a word, of words in a context utterance, of utterances in a context, of words in a response, of words in a knowledge entry, and of entries in a knowledge description were set to 18, 20, 15, 20, 15, and 5 respectively. For the CMU_DoG dataset, these parameters were set to 18, 40, 15, 40, 40 and 20 respectively. We padded with zeros if the number of utterances in a context or the number of knowledge entries in a knowledge description was less than the maximum; otherwise, we kept the last context utterances or the last knowledge entries. Batch size was set to 16 for PERSONA-CHAT and 4 for CMU_DoG. The threshold hyper-parameter of the knowledge filter was set to 0.3 on original personas and 0.2 on revised personas on the PERSONA-CHAT dataset, as well as 0.2 on the CMU_DoG dataset; these values were tuned on the validation sets as shown in Figure 3.
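The padding and truncation rule can be illustrated as follows. Padding on the right is an assumption, as the text does not say on which side the zeros are added, and `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(items, max_len, pad_item):
    """Keep the last `max_len` items, then pad up to `max_len` with `pad_item`."""
    kept = items[-max_len:] if max_len > 0 else []
    # Right padding is an assumption; the source only says zeros are padded.
    return kept + [pad_item] * (max_len - len(kept))
```

For example, with a maximum of 15 utterances, a 20-utterance context keeps its final 15 utterances, while a 3-utterance context receives 12 padding items.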

All code was implemented in the TensorFlow framework Abadi et al. (2016) and will be released after the paper's acceptance to help replicate our results.

A.2 Case Study

A case study was conducted for further illustration. Specifically, the context utterances of this case are U1: hey , are you a student , i traveled a lot , i even studied abroad. U2: no , i work full time at a nursing home . i am a nurses aide . U3: nice , i just got a advertising job myself . do you like your job ? U4: nice . yes i do . caring for people is the joy of my life . U5: nice my best friend is a nurse , i knew him since kindergarten. The knowledge entries of this case are E1: i have two dogs and one cat . E2: i work as a nurses aide in a nursing home . E3: i love to ride my bike . E4: i love caring for people .

The context-to-knowledge attention weights used in Eq. (2) of the context filter are visualized in Figure 5 (a). Meanwhile, the knowledge-to-context attention weights of the knowledge filter are also visualized in Figure 5 (b). Furthermore, the similarity scores in Eq. (7) for each utterance-entry pair are visualized in Figure 6 (a). The final scores in Eq. (8) for each entry are visualized in Figure 6 (b).

We can see that the utterances U2 and U4 obtained large attention weights with the entries E2 and E4 respectively. Meanwhile, the irrelevant entries E1 and E3 obtained small similarity scores with the conversation and were therefore going to be filtered out. These experimental results verify the effectiveness of the pre-filter.

Figure 5: Visualizations of attention weights of (a) the context filter (the weights in each row sum to 1) and (b) the knowledge filter (the weights in each column sum to 1) for a test sample. The darker units correspond to larger values.
Figure 6: Visualizations of similarity scores of (a) each utterance-entry pair in Eq. (7) and (b) each entry in Eq. (8) for a test sample. The darker units correspond to larger values.

A.3 Complexity

Model    Time (s)    Parameters
DIM      160.4       6.5M
FIRE     109.5       13.1M
Table 3: The inference time over the validation set of PERSONA-CHAT under the configuration of original personas using different models, together with their numbers of parameters.

We analysed the time and space complexity by comparing our proposed FIRE model and the state-of-the-art DIM model Gu et al. (2019b) on this task.

In order to explore the efficiency difference between FIRE and DIM, we analysed their time complexity by comparing their run-time computation speed. We recorded the inference time over the validation set of PERSONA-CHAT under the configuration of original personas using a GeForce GTX 1080 Ti GPU. Furthermore, the number of parameters was used to evaluate the space complexity of these two models. The results are shown in Table 3. We can see that FIRE requires more parameters than DIM, as FIRE adds a pre-filter and deepens the matching network. However, FIRE is more time-efficient as it requires less inference time. The reason is that we design a lighter aggregation method in FIRE by replacing the recurrent neural network in the aggregation part of DIM with a single-layer non-linear transformation.
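The trade-off in Table 3 can be quantified with simple ratios. The variable names below are chosen for this illustration.

```python
dim_time, fire_time = 160.4, 109.5    # inference time in seconds (Table 3)
dim_params, fire_params = 6.5, 13.1   # parameters in millions (Table 3)

speedup = dim_time / fire_time                    # how many times faster FIRE is
time_saved = (dim_time - fire_time) / dim_time    # fraction of inference time saved
param_ratio = fire_params / dim_params            # relative model size
```

FIRE is thus roughly 1.46 times faster at inference, saving about 32% of the time, while using about twice as many parameters as DIM.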