Augmenting Transformers with KNN-Based Composite Memory for Dialogue

04/27/2020 ∙ by Angela Fan, et al. ∙ 0

Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialogue modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge from Wikipedia, images, and human-written dialogue utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.




1 Introduction

Machine learning solutions to various tasks, such as game-playing or dialogue, are often dependent on external information. This information can take multi-modal forms, including structured knowledge bases, free text, and images, and also comes in overwhelmingly large quantities. A pressing challenge is to create models that can identify which specific elements of multiple information sources are relevant, and incorporate them into standard architectures on each task. In this work, we focus on human-machine dialog and how to efficiently retrieve external knowledge that is relevant to the task. We consider two scenarios and for each scenario, retrieve two types of knowledge: (i) knowledge about similar dialog contexts and (ii) external knowledge used to ground the conversation into real world information.

Knowledge about similar dialog contexts allows for a hybrid retrieval/generative approach to dialog where the system response is generated based not only on a representation of the current dialog context and of the relevant world knowledge, but also based on a response retrieved from a similar dialog context. In this case, the retrieved knowledge can be viewed as providing information about dialog structure and dialog utterances: which type of response is likely given similar context?

External knowledge is also retrieved to improve the semantic content of the dialog model. In one scenario (Wizard of Wikipedia), it is retrieved from a pre-selected set of Wikipedia sentences associated with the current dialog topic. Retrieval aims to select the sentence that is most relevant at each step of the dialog and thereby to ground system responses in relevant world knowledge (e.g. by referring to Star Wars when talking about science fiction). In the other scenario (Engaging ImageChat), the retrieved external knowledge is images and their associated dialogues. By retrieving images that are similar to the image being talked about and their associated dialog, we aim to enrich system responses with knowledge about what is typically mentioned when describing similar images (e.g. when talking about an image with cats, mentioning their breed).

Previous work has explored incorporating large external memories into neural network layers (Weston et al., 2014; Sukhbaatar et al., 2015, 2019; Lample et al., 2019). Many existing approaches focus on using attention over the memory slots, which is computationally intensive and becomes less effective as the size of the memory grows. In this work, we propose representing multiple sources of external information as fixed encodings and using K Nearest Neighbors search to fetch relevant information. KNN search is computationally efficient and scalable, and libraries like faiss (Johnson et al., 2019) allow KNN to be easily used on GPUs and integrated into neural networks. As the external memories are kept fixed, they do not need to be learned along with the model. We can thus scale more easily to larger memories by learning only the KNN-based read operation to identify relevant information from the memory.

Our core contribution proposes an efficient, KNN-based Information Fetching (KIF) module that can access relevant external knowledge, combine knowledge from different sources, and integrate this information into standard sequence to sequence architectures. We apply these flexible modules to two dialogue datasets, challenging tasks where generative models can leverage external information to write coherent, on-topic responses. We show that relevant information can be identified from hundreds of thousands of candidates in a multi-modal, multi-knowledge-source setting to improve the performance of generative dialogue models. Further, the output of the KIF modules is interpretable as specific knowledge is selected, allowing users to better understand the information the generative model conditions upon when writing the subsequent utterance. On both datasets, we achieve state of the art results compared to generative models and find there is no statistically significant difference in the interestingness or human preference of our model output compared to state of the art retrieval models.

2 Related Work

We discuss related work on learning to incorporate external knowledge into neural networks and efficiently accessing relevant information. We then describe work in generative dialogue that incorporates knowledge.

Figure 1: KIF modules (orange) fetch relevant information from multi-modal external knowledge sources and incorporate it in standard neural architectures.

2.1 Incorporating External Knowledge

Augmenting neural networks with memory, or longer term components that can be accessed with read and write operations, has been explored in various proposed architectures. For example, Memory Networks (Weston et al., 2014; Sukhbaatar et al., 2015, 2019) introduce attention mechanisms over large external memories. Neural cache models (Grave et al., 2016) simplify these to access previous memories with a dot product. Previous work has also studied how to read and write into these memory architectures (Rae et al., 2016; Graves et al., 2014; Joulin and Mikolov, 2015). In contrast, we focus on how to read very large memories.

Another line of research has focused on computational scalability for larger external memories to allow efficient access of information. For example, Chandar et al. (2016) propose a hierarchical memory network rather than a flat one and Rae et al. (2016) learn sparse operations to read and write. Lample et al. (2019) focus on learning memories of up to one million slots and how to efficiently access the slots using product keys. Khandelwal et al. (2019) use nearest neighbor operations to augment language models. Beyond explicit memory representations, it may be possible to store information implicitly during training time by memorizing common patterns present in text (Petroni et al., 2019). We focus on learning to fetch relevant information from multiple explicit external multi-modal knowledge sources and integrate them into one network. Further, our work allows the retrieved information to be interpreted, as each memory slot is an explicit fact that can be read as text, rather than a learned vector such as in Lample et al. (2019).

Work has also focused on computationally efficient softmax operations (Mnih and Hinton, 2009; Grave et al., 2017; Chen et al., 2015). Many approximate softmax techniques use KNN-like operations to form clusters, and the overall softmax operation is constrained by the slow calculation of the exponential. Our usage of KNN benefits from efficient and scalable libraries such as faiss and nmslib.

2.2 Generative Dialogue

We develop a general architecture for incorporating external information and apply it to the case of generative dialogue models. Previous work in dialogue has leveraged knowledge as necessary information to accomplish the task. For example, airline and restaurant booking tasks often use API calls to access information about reservation times and availability (Bordes et al., 2016). In contrast, our work focuses on how to incorporate unstructured knowledge, such as free text found on the web. Previous work has employed architectures that attend over the available knowledge and identify relevant pieces of information, which scales poorly with large quantities of information (Dinan et al., 2018; Qin et al., 2019; Lian et al., 2019). In this work, we replace the use of attention over external information with the output of a KNN module. Other work has investigated incorporating information retrieval in language modeling and question answering (Chen et al., 2017; Fan et al., 2019; Seo et al., 2019; Guu et al., 2020), while we focus on dialogue applications and flexibly incorporating knowledge from multiple, multi-modal sources.

On the modeling side, work has explored both generative (Serban et al., 2016a, b) and retrieval based models (Zhang et al., 2018), which identify the best utterance from the training set to return as the dialogue response. This often leverages self-attention or cross-attention mechanisms (Humeau et al., 2019). Further work has explored hybrid models, for example using the output of a retrieval model as input for a generative model (Dinan et al., 2018; Weston et al., 2018). We extend these approaches by augmenting generative models with retrieval-like operations based on KNN search, allowing dialogue models to flexibly incorporate various sources of external knowledge.

3 KNN-based Information Fetching Modules

Broadly, the KNN-based Information Fetching (KIF) module assumes a model $M$ can access inputs $X$ to produce outputs. In a setting without additional supporting information, the model will process inputs $X$ to make output predictions $M(X)$. However, in many tasks, additional information is present, represented as $E$. To incorporate $E$ into $M$, we encode each element of $X$ and $E$ into a fixed-size vector representation. This can be accomplished in a variety of ways, for example with an encoder neural network.

Then, to make predictions, the model encodes $X$ and uses K Nearest Neighbors to find the closest related information in $E$. The representations of the identified nearest neighbors are combined in a weighted sum, where each of the retrieved neighbors is weighted by its similarity to $X$.

These operations are differentiable, so they can be incorporated into neural networks in a straightforward way. All elements of the knowledge source $E$ are pre-computed and kept fixed; we do not backpropagate to affect the embeddings of the pre-encoded knowledge. However, this lack of backpropagation can introduce a mismatch between the encoding of $E$ and the model that is training, as the training model has constantly changing representations because the weights are being learned. The model must learn a function to align its representations to the external memory. To circumvent this misalignment, we instead learn a mapping operator $f_E$ that trains to map elements of the model's representation of $X$ into the information representation space of $E$. Concretely, $f_E$ is a multi-layer perceptron with ReLU nonlinearities. From the input elements of $X$, $f_E$ learns a representation of an output close to the corresponding projection of $X$ into $E$. This can be interpreted as learning a read operation on a fixed external memory. If there were no change to the encoding of the model compared to the pre-computed knowledge, then the ideal mapping operator would be the identity function. However, as the model changes significantly during the training process, the nonlinear mapping capability of $f_E$ is essential to be able to identify the correct knowledge from the input $X$.

Thus, a model augmented with KIF will incorporate external knowledge in the following manner. First, we find the $k$ nearest elements to the projection of the input $X$ in the knowledge source $E$, based on KNN search using inner product. Then, the relevant elements are encoded by the model $M$. We use the optimized faiss library for KNN search, which can conduct billion-scale KNN efficiently.

These elements are weighted by their nearest neighbor scores and then summed. This weighted sum is subsequently concatenated to the representation of $X$ and used by $M$ to form the final prediction.
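The fetch-and-combine step described above can be summarized in one expression. The following is only a sketch under assumed notation: $M$ is the model, $X$ the input, $E$ the pre-encoded knowledge source, and $f_E$ the mapping operator, while $s_j$ (the nearest-neighbor score), $\mathrm{KNN}_k$ (top-$k$ inner-product search), and the bracket concatenation $[\cdot\,;\cdot]$ are labels introduced here for illustration rather than symbols taken from the source.

```latex
\begin{aligned}
\{(e_j, s_j)\}_{j=1}^{k} &= \mathrm{KNN}_k\big(f_E(M(X)),\, E\big),
  \qquad s_j = \big\langle f_E(M(X)),\, M(e_j) \big\rangle \\
\hat{y} &= M\Big(\big[\, M(X) \,;\, \textstyle\sum_{j=1}^{k} s_j\, M(e_j) \,\big]\Big)
\end{aligned}
```

The first line fetches the $k$ closest knowledge elements and their scores; the second forms the score-weighted sum, concatenates it to the representation of $X$, and passes the result on for prediction.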


This can be easily extended to using multiple modules simultaneously. For instance, two sources of information, $E_1$ and $E_2$, can be combined by identifying the top candidates of each information source. The weighted sum of the KIF output on each information source is concatenated with the representation of the input $X$.

Finally, different sources of information may not be required for every prediction and some information sources can be more important than others. To allow the model to make more fine-grained decisions about what information to use from what source, and how much of it, we add a gating mechanism using a sigmoid function around each weighted sum of KNN representations. The gated terms are the outputs of the KIF module from Equation 1 applied to $E_1$ and $E_2$ respectively.
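The whole module (mapping operator, inner-product KNN, score-weighted sum, per-source gate, concatenation) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the single-hidden-layer MLP, the exact top-k via `argsort` (where faiss would be used at scale), and in particular the scalar gate computed from a linear layer over the fetched vector are all assumptions made for the example.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def mapping_operator(h, w1, b1, w2, b2):
    """MLP with one ReLU layer: maps the model's encoding into the memory space."""
    return relu(h @ w1 + b1) @ w2 + b2


def kif_fetch(query, memory, k):
    """Score-weighted sum of the k memory rows with the largest inner product."""
    scores = memory @ query                  # one inner-product score per memory row
    top = np.argsort(-scores)[:k]            # exact top-k; a KNN library would replace this at scale
    return scores[top] @ memory[top], top


def kif_augment(h, memories, mlps, gates, k):
    """Concatenate h with one gated KIF output per knowledge source."""
    parts = [h]
    for memory, mlp, (w_g, b_g) in zip(memories, mlps, gates):
        fetched, _ = kif_fetch(mapping_operator(h, *mlp), memory, k)
        gate = sigmoid(fetched @ w_g + b_g)  # ASSUMED gate form: a scalar in (0, 1)
        parts.append(gate * fetched)
    return np.concatenate(parts)


rng = np.random.default_rng(0)
dim, hidden = 8, 16
h = rng.normal(size=dim)                      # stand-in for the encoded dialogue
memories = [rng.normal(size=(34, dim)),       # e.g. a small pool of knowledge sentences
            rng.normal(size=(1000, dim))]     # e.g. a larger pool of training utterances
mlps = [(rng.normal(size=(dim, hidden)), np.zeros(hidden),
         rng.normal(size=(hidden, dim)), np.zeros(dim)) for _ in memories]
gates = [(rng.normal(size=dim), 0.0) for _ in memories]
augmented = kif_augment(h, memories, mlps, gates, k=5)
print(augmented.shape)  # (24,)
```

The memories are plain arrays that are never updated, mirroring the fixed external knowledge; only the MLP and gate parameters would be trained.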


4 Applying KIF to Dialogue Tasks

We describe how to apply our method to the task of generative dialogue, a challenging setting where models autoregressively generate engaging and on-topic responses. We investigate dialogue for two main reasons: first, dialogue agents must be able to consult relevant information to maintain the topic of the conversation. Second, retrieval-based agents have strong performance compared to generative ones, due to their ability to copy dialogue utterances from the training set. Using KIF, we can incorporate the benefits of retrieval architectures into generative, knowledge-based models.

KIF for Generative Dialogue

In a dialogue setting, $X$ represents the text of the conversation. A conversation consists of multiple back-and-forth utterances (or turns). For example, a conversation could consist of 4 turns, $X = \{x_1, x_2, x_3, x_4\}$, where $x_4$ is the direct utterance the model should respond to, and the earlier utterances are the conversation context.

Standard generative dialog models use a Transformer neural network as the model $M$ and want to produce an output that is an appropriate response to the conversation. However, in many cases, the conversation history alone does not include all of the information required to produce an appropriate response. To incorporate knowledge, models often concatenate a knowledge source $E$, such as Wikipedia, to $X$, so that the model input is the concatenation of $X$ and $E$, and use attention modules to identify the most relevant knowledge. However, this approach is computationally intensive when handling large quantities of information. Further, attention mechanisms have been found to operate poorly over long sequences, as the mechanism is blurry and struggles to make fine-grained decisions (Fan et al., 2018). The same is true for hierarchical approaches, which lack scalability.

We augment Transformer sequence-to-sequence (seq2seq) networks with KIF to create generative dialogue models. We experiment on two dialogue tasks, Wizard of Wikipedia (Dinan et al., 2018) and Engaging ImageChat (Shuster et al., 2018). In both datasets, models must leverage information external to the dialogue history alone: in Wizard of Wikipedia, the chat requires access to knowledgeable facts, and in Engaging ImageChat, discussion about a specific image. As models must process multiple inputs and ground responses in the knowledgeable facts or images, these tasks challenge existing seq2seq approaches.

Wizard of Wikipedia

The goal of the Wizard of Wikipedia dataset is to train knowledgeable agents that can chat in any domain. The training set covers 1,365 topics discussed in 18,430 dialogues, totalling 166,787 training utterances. The topic is included as the first utterance of the conversation. The Wikipedia knowledge consists of Wikipedia sentences for each dialogue turn, identified by an information retrieval system and released as part of the full dataset.

Our model for Wizard of Wikipedia has access to two sources of external information, $E_1$ and $E_2$:

  • $E_1$ is Wikipedia Knowledge provided by the dataset as evidence to support knowledgeable chitchat. The scale of this KNN search is small: it filters through an average of 34 sentences. The KIF module uses dialogue features to fetch relevant knowledge to condition upon to generate the subsequent utterance.

  • $E_2$ is Training Utterances. To incorporate the benefits of retrieval-based dialogue models to the generative setting, we use KIF to identify relevant utterances from the training set and take their responses as input. If many conversations about dogs have already occurred, models should be able to take advantage of these human-written examples to improve their generations. For example, likely conversations could cover the breed of the dog, daily routine with a pet, and similar topics. The KNN search runs over the training set's dialogue utterances (166,787 in total). This can be interpreted as incorporating the benefits of retrieval models by identifying an utterance with similar structure as the text the model would like to generate. We do not allow the module to fetch the correct response of the current conversation context.

Access to these two sources of knowledge can be seen as learning a template and a topic separately. Sample templates can be identified from the training utterances, and topic-specific information learned by accessing the Wikipedia knowledge.

To better identify relevant training utterances from the large quantity available, we break down $X$ into conversation sub-features for a more fine-grained match in the KNN search step. We concatenate the encoding of the most recent dialogue utterance (e.g. $x_{\text{last}}$) with the encoding of the dialogue context from the current conversation, $x_{-\text{last}}$, and the turn number $t$, such that $[M(x_{\text{last}}), M(x_{-\text{last}}), t]$ is the representation used for KNN search. The turn number is represented as an embedding. Concretely, if the model is trying to produce the 5th turn of the conversation, then $x_{\text{last}}$ is the most recent utterance from the dialogue partner, $x_{-\text{last}}$ would be the last 3 turns of exchange, and $t$ would be 4.

These are known to be salient conversation features. The most recent dialogue utterance is the direct turn the model is responding to, and the dialogue context may provide additional clues. The turn number is important, as earlier turns are often generic (e.g. how are you doing today) and later turns are more specific.
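The feature concatenation used as the KNN query can be illustrated with a short sketch. The encoder outputs are stood in for by random vectors, and `build_knn_query`, the embedding table, and all dimensions are hypothetical names and sizes chosen for the example, not values from the paper.

```python
import numpy as np


def build_knn_query(last_utterance_vec, context_vec, turn_number, turn_embeddings):
    """Concatenate last-utterance encoding, context encoding, and turn-number embedding."""
    turn_vec = turn_embeddings[turn_number]   # the turn number is looked up as an embedding
    return np.concatenate([last_utterance_vec, context_vec, turn_vec])


rng = np.random.default_rng(0)
enc_dim, turn_dim, max_turns = 4, 2, 10       # illustrative sizes only
turn_embeddings = rng.normal(size=(max_turns, turn_dim))
last_vec = rng.normal(size=enc_dim)           # stand-in for the encoded most recent utterance
context_vec = rng.normal(size=enc_dim)        # stand-in for the encoded earlier turns

# Producing the 5th turn: the turn feature is 4, as in the example in the text.
query = build_knn_query(last_vec, context_vec, 4, turn_embeddings)
print(query.shape)  # (10,)
```

Keeping the three features in separate slices of the query lets the nearest-neighbor match reward agreement on each of them, rather than on a single pooled encoding of the whole conversation.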

Engaging ImageChat

The goal of Engaging ImageChat is to create agents capable of chitchatting about images, selected from the YFCC100M dataset (Thomee et al., 2015). The dataset contains 186,782 dialogues in the training set, each about a unique image, totalling 355,862 utterances. Agents are assigned one of 215 personalities (e.g. sweet, caring, excited) to increase engagingness. We use a Multi-Modal neural network designed to handle both image input and text input. Following Shuster et al. (2018), the images are encoded using a pre-trained ResNeXt network (Xie et al., 2017). To extract the final image representation, we project the output of the image encoder with a deep multi-layer perceptron with ReLU activation units. The conversation history, which includes the personality, is encoded with a Transformer encoder network. The image and conversation are integrated using the Multimodal-Sum-Combiner module proposed in Shuster et al. (2018).

Our model for Engaging ImageChat has access to two sources of external information, $E_1$ and $E_2$:

  • $E_1$ is Chat on Similar Images. While each of the 186,782 training dialogues is about a unique image, many of the images are similar. For example, conversations associated with two pictures of dogs could be relevant to each other. The model is able to use the current image features to fetch from these images and returns 6 turns of related chat for each image. Fetching from $E_1$ consists of identifying related image chats, or conversations on related topics (as similar images likely have similar conversations).

  • $E_2$ is Training Utterances. Similar to the motivation for the previous dataset, we allow the model to identify training utterances that could be useful for responding in the current conversation. The scale of this fetching task is large: the 355,862 training utterances. This could be interpreted as identifying utterances with similar structure to what the model would like to generate, and is complementary to the topic-based related image chats.

To identify relevant information from training utterances, we use the same dialogue features in the KNN search step, with one modification: we add the personality provided by the dataset. The concatenation of features used for KNN search is $[M(x_{\text{last}}), M(x_{-\text{last}}), t, p]$, where $x_{\text{last}}$ is the most recent utterance, $x_{-\text{last}}$ is the preceding dialogue context, $t$ is the turn number, and $p$ is the personality. As utterances from speakers with the same personality are more likely to be similar, this feature improves the quality of the fetched information. For example, conversations with the sweet personality often include similar text such as aww, that’s wonderful.

5 Experimental Setup

5.1 Implementation Details

We use ParlAI (Miller et al., 2017) to implement our models. The data for both datasets is available for download from ParlAI as well. We use byte-pair encoding (Sennrich et al., 2015) to represent the text to better handle the rare word problem (Dinan et al., 2018; Fan et al., 2017). Our generative Transformer models have 8 encoder layers and 8 decoder layers, with FFN size 2048, embedding dimension 512, and 4 attention heads. We optimize using Adam (Kingma and Ba, 2014) and the inverse square root learning schedule (Vaswani et al., 2017) with 10k warmup updates. The initial learning rate is 0.0001 and we optimize for model perplexity. We use a dropout of 0.5 and set gradient clipping to 0.1. We use the same value of $k$ for all cases. We pre-train the Transformer seq2seq model used for both datasets on 250M comments from Reddit. The comments are parsed to maintain conversational threads, so the encoder network has been exposed to conversational context at training time. The ResNeXt encoder is pretrained on 3.5 billion images (Mahajan et al., 2018). For both datasets, we model a vocabulary size of 54944 based on the BPE-based vocabulary from the Reddit pretraining. We tuned the learning rate and batch size hyperparameters together. The model size is not tuned, as it was pre-trained with this size and thus kept fixed.
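The inverse square root schedule with warmup can be written as a small function. This sketch assumes a linear warmup from zero and treats the stated 0.0001 as the rate reached at the end of warmup; the source does not spell out either detail, so both are assumptions, as is the function name.

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_updates=10_000):
    """Learning rate at a 1-indexed update step.

    ASSUMPTIONS: linear warmup from 0 to peak_lr, then decay proportional
    to 1 / sqrt(step), continuous at step == warmup_updates.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_updates:
        return peak_lr * step / warmup_updates
    return peak_lr * (warmup_updates / step) ** 0.5


for step in (1_000, 10_000, 40_000):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the rate rises to its peak at update 10,000 and is halved each time the step count quadruples afterwards.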

Model                            Test F1 (Seen)   Test F1 (Unseen)
Retrieval Trans. MemNet*              15.4              12.4
2-Stage Generative MemNet*            18.9              17.4
Generative Trans. MemNet*             16.9              14.4
  + Reddit Pre-Train                  17.6              16.3
Retrieve and Refine                   18.2              17.9
Response Generation with MR           17.5              16.8
KIF-Augmented Transformer             25.9              22.3
Table 1: Results on the Wizard of Wikipedia dataset. * denotes results from Dinan et al. (2018). Retrieve and Refine is from Weston et al. (2018) and Response Generation is from Qin et al. (2019).
Model                            Test F1
Retrieval Trans.*                  9.8 [1]
Generative Trans. MemNet           7.1
  + Reddit Pre-Train              12.8
Retrieve and Refine               13.6
Response Generation with MR       13.2
KIF-Augmented Transformer         14.4
Table 2: Results on the Engaging ImageChat dataset. * denotes results from Shuster et al. (2018). Retrieve and Refine is from Weston et al. (2018) and Response Generation is from Qin et al. (2019).
[1] In Shuster et al. (2018), retrieval Transformer models report Hits@N using a fixed candidate set of 99 distractor candidates and 1 true candidate. We compute F1 using their open-sourced model by scoring the entire set of training utterances with the model and taking the top scoring candidate as the response.

5.2 Evaluation


We generate with beam search using a fixed beam size, and apply 3-gram blocking. This technique prevents any 3-gram from being generated more than once and reduces repetition.
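N-gram blocking is simple to state in code: before choosing the next token, ban every token that would complete an n-gram already present in the hypothesis. The sketch below shows that check for a single hypothesis; the function name is hypothetical, and in a real decoder the banned tokens would have their scores set to negative infinity for each beam.

```python
def blocked_tokens(prefix, n=3):
    """Tokens that would repeat an n-gram already present in `prefix` (requires n >= 2)."""
    context = tuple(prefix[-(n - 1):])        # the last n-1 generated tokens
    banned = set()
    for i in range(len(prefix) - n + 1):
        # An earlier n-gram starts with the current context: ban the token that followed it.
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned


hypothesis = "i love disney and i love".split()
print(blocked_tokens(hypothesis))  # {'disney'}
```

Here the hypothesis already contains "i love disney", so after generating "i love" a second time the decoder may not emit "disney" again.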

Figure 2: Human Evaluation Results on both Datasets. More than 50% indicates the KNN Model is preferred. Stars indicate statistical significance.
Figure 3: Human Evaluation on the Unseen Test set of Wizard of Wikipedia. More than 50% indicates the KNN Model is preferred. Stars indicate statistical significance.
Human: I love watching disney movies! Do you like Disney?
Model: I love disney, I love watching disney movies and different animations and characters that make them unique
Fetched Knowledge: disney consumer products and interactive media - the company specializes in producing animated films
Fetched Training Utterance: i love music, I love going to music shows and dancing and having fun
Human: What’s your favorite disney movie? There are so many good ones.
Model: I love the incredibles, they are my favorite disney movie.
Fetched Knowledge: disney announced intentions to develop additional superhero films after the success of the incredibles
Fetched Training Utterance: i love kiteboarding, it is one of my favorite activities on the water.
Human: omg, me too! the second one just came out i think. I don’t watch a lot of sequels though, I think the last one I saw was toy story.
Model: That’s cool! I remember when toy story was first released in 1995.
Fetched Knowledge: talk of a second film following toy story began around a month after the film’s opening, in december 1995
Fetched Training Utterance: that’s neat! i like so many of her songs! i remember when her first album was released in 2008.
Human: I’ve actually never seen toy story before. Kind of embarrassing…
Model: Ohhh … I think toy story is a classic? It was first released in 1999.
Fetched Knowledge: toy story 2 is a 1999 american computer-animated comedy film produced by pixar animation studios for walt disney pictures
Fetched Training Utterance: Ohhh … What do people with color blindness do to cope with the effects?
Figure 4: Conversation between Human and KIF-Augmented Transformer on Wizard of Wikipedia. The top-scoring Wikipedia knowledge and training utterances fetched by KIF are displayed with model output.

Automatic Metrics

Following Dinan et al. (2018), we compute F1, a metric of unigram overlap, between the generated utterance and the human-written reference utterance from the dataset. For generative models, utterances are generated using beam search. For retrieval models, the next utterance is predicted by ranking the entire set of training utterances, and the highest scoring utterance is chosen.
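As an illustration of the metric, unigram F1 is the harmonic mean of unigram precision and recall between the generated and reference utterances. The sketch below assumes lowercasing and whitespace tokenization; the evaluation code used for the reported numbers may normalize text differently, so this is not a drop-in reproduction.

```python
from collections import Counter


def unigram_f1(generated, reference):
    """Harmonic mean of unigram precision and recall (ASSUMED: lowercase, whitespace tokens)."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    if not gen or not ref:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("I love disney movies", "i love pixar movies"))  # 0.75
```

Three of the four words match in each direction, so precision and recall are both 0.75 and so is their harmonic mean.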

In Wizard of Wikipedia, there are two test sets. The first covers seen topics: topics that appear at training time, paired with new test-time dialogues. The second covers unseen topics, which are not encountered at all during training. We evaluate on both subsets.

Human Evaluation

We follow the setup and use the analysis questions proposed in the Acute-Eval dialogue evaluation system (Li et al., 2019). For reproducibility, we adopt this existing evaluation setting that has been applied to several dialogue datasets. We collect 100 human-bot conversational dialogues on a crowdsourcing platform for both datasets. The dialogues are eight turns long. Then, we show pairs of the collected conversations side by side, one conversation with a human and model A and the other conversation with a human and model B. We ask annotators the following questions:

  • Who would you prefer to talk to for a long conversation?

  • If you had to say one of the speakers is interesting and one is boring, who would you say is more interesting?

  • Which speaker sounds more human?

  • Which speaker has more coherent responses in the conversation?

  • If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable? (Wizard of Wikipedia only)

We measure the percentage of time one model was chosen over the other, taking the majority agreement between three evaluators. To reduce variance, dialogues paired in the evaluation were collected on the same topic for Wizard of Wikipedia and collected on the same image and personalities for Engaging ImageChat. Topic and images used are unique and taken randomly from the test set.
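The aggregation step can be sketched as follows: each conversation pair receives three votes, the majority decides the winner of that pair, and the reported number is the share of pairs won. The vote data and function name below are made up for illustration, and the significance test applied to the resulting percentages is not shown because the text does not specify it.

```python
from collections import Counter


def preference_rate(votes_per_pair, model="A"):
    """Percentage of conversation pairs in which `model` wins the majority vote."""
    wins = 0
    for votes in votes_per_pair:
        # With three evaluators and two choices a strict majority always exists.
        winner, _ = Counter(votes).most_common(1)[0]
        wins += winner == model
    return 100.0 * wins / len(votes_per_pair)


# Hypothetical votes: each tuple holds three evaluators' choices for one pair.
votes = [("A", "A", "B"), ("B", "B", "A"), ("A", "B", "A"), ("A", "A", "A")]
print(preference_rate(votes))  # 75.0
```

A value above 50 means model A was preferred on more pairs than model B for that question.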

5.3 Baselines

We compare Transformers augmented with KIF to the retrieval models published on each dataset. These existing retrieval models are state of the art on both datasets, and have been shown to be strong baselines compared to other retrieval techniques based on TF-IDF (Chen et al., 2017).

We further compare to three additional generative baselines that access knowledge:

  • Transformer Memory Networks. To contrast KIF with existing work, we compare our models to published Transformer Memory Networks (Dinan et al., 2018). These models encode each piece of external information independently with a Transformer Encoder, and these are stored as memory slots. To access information in the memory slots, a model performs dot-product attention between the memory slots and the dialogue context. In Dinan et al. (2018), the knowledge selection from Wikipedia was supervised with either a two-stage model where the first model was trained to predict the right knowledge, or an end-to-end model with an auxiliary loss for knowledge prediction accuracy.

  • Retrieve and Refine. We implement a hybrid model (Weston et al., 2018) that incorporates top retrieval candidates as additional input to Generative Transformer MemNets. In Retrieve and Refine, a fixed number of candidates are retrieved and concatenated to the conversational history in the encoder. Unlike the KIF-Augmented Transformer, the retrieval is conducted with a separate model so there is no backpropagation to affect the retrieval. With KIF, models can alter the retrieved candidates by learning the mapping operator. Further, a fixed amount of information is always retrieved, without the capability to easily rescale to focus on specific candidates.

  • Response Generation with MR. We implement the model proposed in Qin et al. (2019), which embeds the conversation history and document contextually before decoding with a biLSTM. In Qin et al. (2019), the encodings used pretrained CoVE vectors (McCann et al., 2017). We found our pretrained Transformer embeddings to work more effectively as they are trained specifically for dialogue. Thus, we modify this baseline to replace CoVE embeddings with domain-specific ones.
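For contrast with KNN-based fetching, the memory-slot access used by the Transformer Memory Network baseline can be sketched as plain dot-product attention. This is a simplified single-query illustration with hypothetical names, not the baseline's actual code.

```python
import numpy as np


def attend_over_memory(context_vec, memory_slots):
    """Softmax dot-product attention: every slot contributes to the output."""
    scores = memory_slots @ context_vec
    weights = np.exp(scores - scores.max())   # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ memory_slots, weights


slots = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # three toy memory slots
summary, weights = attend_over_memory(np.array([2.0, 0.0]), slots)
print(weights.round(3))
```

Attention assigns a nonzero weight to every slot, so its cost and its blurriness grow with the memory, whereas a KNN read keeps only the top few candidates.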

All of the Transformer generative baselines are initialized with the same pre-training on Reddit that we use for our models, for fair comparison on modeling quality.

6 Results

We describe the results of incorporating KIF modules into Transformer networks. We display an example conversation between a human and our model in Figure 4, and show the top scoring Wikipedia knowledge and Training Utterance fetched by KIF modules. We compare to various baselines using automatic and human evaluation, and discuss our experiments. We present various ablation settings to understand the key features that make our method function.

Human: Hey, how are you doing
Fetched Training Utterances: I’m great, thanks for asking. Craving some chocolate. Do you like chocolate?
Hello, how is it going? I know some trivia about this movie
Hello, it’s lunch time here, and I’m in the mood for a great steak
Human: What are your hobbies?
Fetched Training Utterances: I work at an elementary school. I hope you find a job you love too […]
I have a hound, we just got her. Although, I grew up with Labrador Retrievers.
I just love ice cream. I love the types with fruits and flavours. Do you like ice cream?
Human: hi buddy, what do you think about cinematography?
Gold Chosen Knowledge: cinematographers use a lens to focus reflected light from objects into a real image […]
Fetched Knowledge: cinematography is the art of motion-picture photography
typically, a lens is used to repeatedly focus the light reflected from objects […]
the modern photographic camera evolved from the camera obscura
Human: Speaking of blue skies, have you seen the 1946 movie staring bing crosby?
Gold Chosen Knowledge: blue skies is a 1946 american musical comedy film […] and starring bing crosby […]
Fetched Knowledge: blue skies is a 1946 american musical comedy film […] and starring bing crosby […]
blue skies the band has since broken up
blue skies was was composed in 1926 as a last - minute addition to betsy the musical
Figure 5: Examples of Top-3 Fetched Training Utterances and Fetched Knowledge when responding to a human chat from the dataset using a trained Wizard of Wikipedia model. Examples are taken from validation.
Figure 6: Ablations on Wizard of Wikipedia. (a) KIF can scale to hundreds of relevant sentences (blue), while the baseline model, the Generative Transformer MemNet, scales poorly (gray). (b) Gating can remove irrelevant information. In the 3 Sources case, one source of external information is unrelated. (c) Performance as K varies.

6.1 KIF is Effective for Incorporating Knowledge

Automatic Evaluation.

Comparing KIF augmented Transformer networks to published baselines and Retrieve and Refine, we find improved results.

For Wizard of Wikipedia, the improvement in F1 score over the best baseline is around 8 points (see Table 1). A major contributing factor is the construction of the dataset: as each dialogue turn is grounded in a specific knowledge sentence from Wikipedia, improving the ability to identify the relevant fact strongly improves performance. Contrasting the results from the seen and unseen test sets in Table 1, the improvement on unseen is smaller, as it is harder to fetch training utterances for unseen topics.

While ImageChat has no explicit dependency on knowledge, we still see a 2 point improvement compared to the Generative Transformer MemNet (with the additional Reddit pre-training), indicating that KIF can be generally useful (see Table 2). Compared to an even stronger baseline that we tune in this work, Retrieve and Refine, we see a 1 point improvement.
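The F1 scores above compare generated responses with reference responses. Assuming this is the word-overlap (unigram) F1 commonly used for these dialogue datasets, a simplified version can be sketched as follows. The function name `unigram_f1` and the lowercase-and-split tokenization are simplifications made here for illustration; this is not the evaluation code behind Table 1 or Table 2.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Word-overlap F1 between a generated response and a reference response.

    Tokenization is a plain lowercase whitespace split (an assumption);
    real evaluation scripts typically also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(unigram_f1("my favorite is slinky",
                     "my favorite character is slinky dog"))
```

Under this definition, a response that copies the right content words from the grounding knowledge sentence gains both precision and recall, which is consistent with the observation that better knowledge identification strongly improves the score.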

Human Evaluation. Results are shown in Figure 2. On both datasets, we find a large improvement over existing generative models (green bars) that is statistically significant for some of the evaluation questions. Evaluators agree that KIF-augmented Transformers are generally more coherent and human-sounding compared to the Generative MemNet.

The comparison to existing retrieval models (blue) is more nuanced. Along the lines of existing work (Zhang et al., 2018; Dinan et al., 2018), we find that retrieval-based models score very well in human evaluations that ask how human or interesting a dialogue sounds. This is because retrieval models return human-written utterances from the training set and do not suffer from decoding mistakes present in generative models. For example, on Engaging ImageChat, while our model has significantly improved over the generative baseline (see green bars in Figure 2, right), it does not beat retrieval-based methods in sounding more human or being more interesting (see blue bars in Figure 2, right).

A surprising result is that KIF-augmented Transformers are voted more human sounding than retrieval models on Wizard of Wikipedia. This is because the dataset’s human utterances are long and factual due to the tendency of crowdworkers to copy Wikipedia. Sometimes humans chatting with the retrieval bot would respond “uh… that’s an interesting fact?” Otherwise, our model scores similarly to retrieval models, with most of the evaluations not having statistically significant differences.

We conduct a second evaluation on the Unseen Test Set of the Wizard of Wikipedia dataset. Results are shown in Figure 3. Trends are similar to the results on the Seen Test Set, though the preference for the KIF-augmented Transformer over the retrieval baseline is greater. We hypothesize that because the Unseen Test Set is on entirely held out topics, the retrieval baseline can struggle to identify relevant utterances. In contrast, the KIF-augmented Transformer, similar to the generative baseline from Dinan et al. (2018), can use its generative capability to produce utterances.

Lastly, we conduct an additional study to examine the variance of the comparative dialogue judgements. The evaluation study for Wizard of Wikipedia is repeated three times on different days, and evaluators who have answered on previous days are not allowed to evaluate again in any subsequent experiments. We find there is greater variance on questions asking which dialogue is more human and more interesting, most likely as different evaluators can interpret these in different ways. Further, we see that comparison with the Retrieval model has less variance compared to the Generative model, possibly because the Retrieval model’s human-written text is devoid of mistakes. Overall, we find that the conclusions (and statistical significance) are stable across multiple evaluations.
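Whether a pairwise preference is statistically significant can be checked with a binomial test on the number of times one system is preferred. This section does not state which test was used, so the following is only a sketch under that assumption. The function name and the example counts are hypothetical and are not results from the study.

```python
from math import comb


def two_sided_binomial_p(wins, total):
    """Exact two-sided p-value for the null hypothesis that each of two
    systems is preferred with probability 0.5 in every comparison."""
    # Under p = 0.5 the distribution is symmetric, so the two-sided
    # p-value is twice the upper tail starting at the larger count.
    k = max(wins, total - wins)
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts: system A preferred in 62 of 100 comparisons.
    print(two_sided_binomial_p(62, 100))
```

Repeating such a test on each day's judgements is one way to see whether a conclusion holds across evaluator pools.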

Knowledge | Training Utterance | Generation
buzz lightyear’s name is in honor of astronaut edwin ‘buzz’ aldrin | my favorite character in that book series is hermione granger | cool! my favorite character in that movie is buzz lightyear
mr potato head is based on the real-life mr. potato head toy | my favorite character in that book series is hermione granger | my favorite character in that movie is real-life mr potato head
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character in that book series is hermione granger | cool! my favorite character is the slinky dog

slinky dog is a toy dachschund with a metal slinky for a body | i really like the character hermione granger | cool! i really like slinky dog
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character of all time has to be hermione granger | i love that movie, my favorite character has to be slinky dog the dachshund
slinky dog is a toy dachschund with a metal slinky for a body | i agree with you! that’s my favorite character as well | i think so too! my favorite is slinky
Table 3: Effect of Fetched Information on Generated Utterances. The top section provides examples for a fixed training utterance, changing the knowledge: the generated text maintains the construction of the training utterance but changes the favorite character to match the knowledge. The bottom section provides examples for fixed knowledge but changing the training utterance: the generated text modifies its form to match the training utterance, but the favorite character information remains consistent.

6.2 Scaling KIF to Challenging Retrieval Settings

KIF modules can be used in more realistic and challenging settings for knowledge retrieval that test the scalability of the module. In Figure 6(a), we compare the Generative Transformer MemNet Baseline with KIF-Augmented Transformers in three settings. The first is the standard Wikipedia sentences provided by the dataset (average 34 sentences). Then, we extend to providing the full Wikipedia article (average 57 sentences) and finally to providing multiple Wikipedia articles (average 205 sentences), identified using the conversation’s topic. This increasing size of available knowledge could be realistic for settings where it is unclear what information is most relevant, if filtering steps to preprocess the data remove potentially relevant information, or if information synthesis from multiple knowledge sources is necessary to produce a high quality generation. As the relevant Wikipedia knowledge becomes more difficult to identify, performance decreases, but KIF still outperforms the baseline that uses the dataset-provided set of 34 sentences.

Comparing the scaling capability of KIF to the standard Generative Transformer MemNet Baseline highlights the advantage of using KNN. The attention-based mechanism used in Dinan et al. (2018) struggles to identify salient information when given increasingly larger quantities of knowledge, unlike the KNN information fetch. We hypothesize the attention mechanism is challenged by softmax-ing over a larger quantity of inputs, as it can be difficult to make sharp distinctions.
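The softmax hypothesis can be illustrated with a toy calculation that is not taken from the paper. One relevant item is given a fixed score margin over a growing number of equally scored distractors, and the softmax weight it receives is compared with a hard top-k selection (a brute-force stand-in for KNN search). The margin of 2.0 and all function names are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores, k):
    """Indices of the k highest scores, found by brute force."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def relevant_weight(num_distractors, margin=2.0):
    """Softmax weight on one relevant item that scores `margin` above
    every one of `num_distractors` distractors."""
    scores = [margin] + [0.0] * num_distractors
    return softmax(scores)[0]


if __name__ == "__main__":
    # Memory sizes mirror the average sentence counts of the three settings.
    for n in (34, 57, 205):
        scores = [2.0] + [0.0] * n
        print(n, round(relevant_weight(n), 3), top_k_indices(scores, k=1))
```

With a fixed margin, the softmax weight on the relevant item shrinks as distractors are added, while the top-1 selection still returns it. Real attention scores are learned and not this uniform, so this only shows the direction of the effect.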

6.3 Analysis of Fetched Knowledge

Example conversations from our KIF-augmented generative model are shown in Figure 4 on Wizard of Wikipedia. We find that relevant knowledge is identified that affects the content of the generated utterance. For example, the model finds knowledge sentences about Disney movies as the human conversationalist starts the conversation discussing Disney. The model leverages the fetched knowledge to write the content of the generated utterance. In a concrete example, the fetched sentence disney announced intentions […] after the success of the incredibles leads the model to generate the utterance i love the incredibles, they are my favorite disney movie.

In contrast, the model uses the form of the fetched training utterance often as a template for writing a response. For example, the model copies the training utterance Ohhh … what do people with color blindness do to cope with the effects? and starts the model generation with Ohhh … and continues with the question i think toy story is a classic? following the form of the selected training utterance.

Figure 5 displays the top-3 fetched training set utterances and knowledge sentences on the Wizard of Wikipedia dataset when responding to a human utterance. KIF modules can identify multiple relevant items. In response to the human question about blue skies, the 1946 movie, the model identifies both the comedy film and the band.

Finally, the elements retrieved by KIF modules provide a more interpretable understanding of what the model is conditioning upon to generate a dialogue response. In Table 3, we keep the dialogue history fixed and replace the model’s fetched training utterance and knowledge sentence with our own examples. The model heavily incorporates our manual changes of the fetched information into the generated utterance. For example, changing the knowledge directly affects what the model generates as the favorite character (from buzz lightyear to mr potato head to slinky dog), while changing the fetched training utterance changes the form of the generated sentence.

6.4 Ablations

Importance of Multiple Knowledge Sources.

One benefit of KIF modules is that multiple modules can be used together to fetch information from different sources. In this ablation, we examine the importance of this functionality. For Wizard of Wikipedia and Engaging ImageChat, multiple knowledge sources are used: training utterances to capture the capability of a retrieval-based model, and knowledge from Wikipedia or related chats based on image features. The performance decreases when only using one source (see Table 4).

For Engaging ImageChat, this study also underlines the importance of being able to fetch in a multi-modal fashion. The general form of the KIF module, which requires only a feature vector from which to find nearest neighbors, allows fetching on multiple modalities such as text and images. In Table 4, using the Image-based KIF to fetch text from Related Images is important to reach the strongest performance (compare Training Utterances Only, which uses text-based KIF, with using both Training Utterances and Related Images).

Model Test F1
Wizard of Wikipedia
Training Utterances Only 18.1
Wiki Knowledge Only 23.9
Training Utterances and Wiki Knowledge 25.9
Engaging ImageChat
Training Utterances Only 13.9
Related Images Only 13.8
Training Utterances and Related Images 14.4
Table 4: Using Multiple KIF Modules on Multiple Sources is important for improved performance.
Model Valid F1
Wizard of Wikipedia
Previous Utterance Only 24.6
+ Dialogue Context 26.4
+ Turn Embedding 27.4
Engaging ImageChat
Previous Utterance Only 13.3
+ Dialogue Context 14.5
+ Turn Embedding + Personality 15.1
Table 5: Important Features for KNN Search using KIF. Salient conversation features improve performance on both datasets.
Model Valid F1
KIF-Augmented Transformer 27.4
One KIF Module fetches multiple times
2 Fetches 26.9
3 Fetches 26.0
Multiple KIF Modules fetch once each
2 Fetches 26.5
3 Fetches 25.9
Table 6: Multi-hop with KIF to retrieve information with multiple fetch steps

Multi-Hop Retrieval with KIF.

Work in memory networks (Weston et al., 2014; Sukhbaatar et al., 2015) employed multi-hop mechanisms. Such capacity could be useful in cases where multiple sources are necessary or perhaps more information is incrementally required. To emulate multi-hop memory mechanisms, we use KIF to retrieve relevant information for 2 or 3 fixed hops. As the number of hops is fixed, the multi-hop operation remains differentiable. We do not allow the model to retrieve information in a second hop if that information was already selected.

We experimented in two settings. In the first, the same KIF module is used multiple times to fetch different information, and then all of the fetched knowledge is concatenated. Results are shown in Table 6 (top). In the second setting, we examine spreading out the fetches into different KIF modules at various encoder network depths. This could be interpreted as the model learning to access more required information layer by layer. It is possible that as the model progresses deeper, the more abstract and high level representations that are built allow different knowledge to be retrieved. As the encoder models we experiment with have six layers, we distribute the KIF fetches evenly throughout. Results are shown in Table 6 (bottom).

In both multi-hop settings, no improvement in performance on the Wizard of Wikipedia dataset is observed. We hypothesize this can be partially attributed to the construction of the dataset, as humans explicitly based their written dialogue utterance on one knowledge sentence. Further, it is possible that concatenation brings together too much information for the model to incorporate, and thus adding additional fetches makes the retrieval more noisy.
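The first multi-hop setting, where one module fetches several times and already selected items are excluded, can be sketched roughly as follows. This is a simplified illustration using brute-force dot-product search over small Python lists, not the KIF implementation. The names `multi_hop_fetch` and `update_query` are hypothetical, and how (or whether) the query changes between hops is an assumption left to the optional callback.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_hop_fetch(query, memory, hops=2, k=1, update_query=None):
    """Run a fixed number of fetch steps over `memory` (a list of vectors).

    Each hop selects the k best-scoring items that no earlier hop selected.
    Returns the selected indices from all hops, in fetch order, which the
    caller can use to concatenate the fetched items.
    """
    selected = []
    for _ in range(hops):
        # Items chosen in an earlier hop are not eligible again.
        candidates = [i for i in range(len(memory)) if i not in selected]
        candidates.sort(key=lambda i: dot(query, memory[i]), reverse=True)
        fetched = candidates[:k]
        selected.extend(fetched)
        if update_query is not None:
            # Hypothetical hook: revise the query using what was fetched.
            query = update_query(query, [memory[i] for i in fetched])
    return selected


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(multi_hop_fetch([1.0, 0.0], memory, hops=2, k=1))
```

With a fixed query, this reduces to taking the top `hops * k` items, so any benefit from multiple hops would have to come from the query changing between fetches or from fetching at different network depths, as in the second setting.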

Using Dialogue Features for KNN Performance.

The quality of the KNN search is critical to the performance of KIF modules. As the external knowledge is kept fixed, KIF must be able to align the dialogue context with the knowledge to identify relevant pieces of information. In Table 5, we show that matching on more features can improve the quality of the retrieved information. Using only the encoding of the immediate previous utterance can improve results on Wizard of Wikipedia by 7 F1 points, but this is further improved by also leveraging the encoding of context (+1.8 F1) and using the dialogue turn number (+1 F1). These features are available in the datasets, and we leverage them to improve the relatedness of retrieved knowledge.
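One simple way to match on several dialogue features at once is to build a single search key from them. The sketch below concatenates normalized feature vectors and runs a brute-force nearest-neighbor search. How KIF actually combines the previous utterance, dialogue context, and turn features is not specified in this section, so the concatenation, the normalization, and the names `build_query` and `nearest` are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_query(prev_utterance_vec, context_vec=None, turn_vec=None):
    """Concatenate whichever dialogue features are available into one key."""
    query = []
    for part in (prev_utterance_vec, context_vec, turn_vec):
        if part is not None:
            query.extend(l2_normalize(part))
    return query


def nearest(query, keys, k=1):
    """Brute-force nearest neighbors by dot product.

    Every key must be built from the same features as the query so that
    the dimensions line up.
    """
    order = sorted(
        range(len(keys)),
        key=lambda i: sum(q * x for q, x in zip(query, keys[i])),
        reverse=True,
    )
    return order[:k]


if __name__ == "__main__":
    # Two stored items share the same utterance feature but differ in context.
    keys = [build_query([1.0, 0.0], [1.0, 0.0]),
            build_query([1.0, 0.0], [0.0, 1.0])]
    query = build_query([1.0, 0.0], [0.0, 1.0])
    print(nearest(query, keys, k=1))
```

In this toy case the previous utterance alone cannot separate the two stored items, and the added context feature breaks the tie.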

Effect of Gating.

We analyze the effect of the gating mechanism used in KIF by evaluating the capability of the gate to identify and focus on salient information. On Wizard of Wikipedia, we concatenate a third source of information: dialogue turns from a completely different corpus called PersonaChat (Zhang et al., 2018). This dataset looks quite different — short utterances without factual knowledge — and should be easy for the model to identify as distinct from Wizard of Wikipedia. As shown in Figure 6(b), if KIF on PersonaChat is included without gating, it has a harmful effect as the model includes irrelevant information. When equipped with gating, the model learns to use the gate to ignore some inputs, and can recover almost the full performance of the model without this irrelevant information source.
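The effect of a gate on an irrelevant source can be shown with a small sketch. It assumes, for illustration only, one scalar sigmoid gate per fetched source; the gate used in KIF may differ in form. The logits here are set by hand, whereas in a trained model they would be learned. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(sources, gate_logits):
    """Scale each fetched source vector by sigmoid(logit), then concatenate.

    A strongly negative logit drives a source's contribution toward zero,
    which is how a gate can suppress an unrelated knowledge source.
    """
    combined = []
    for vec, logit in zip(sources, gate_logits):
        gate = sigmoid(logit)
        combined.extend(gate * x for x in vec)
    return combined


if __name__ == "__main__":
    wiki = [0.4, -0.2]        # stand-in for fetched Wikipedia knowledge
    utterance = [0.1, 0.3]    # stand-in for a fetched training utterance
    unrelated = [5.0, 5.0]    # stand-in for an unrelated third source
    print(gated_combine([wiki, utterance, unrelated], [4.0, 4.0, -8.0]))
```

Without the gate, the large values from the unrelated source would dominate the combined representation. With a low gate value they are scaled close to zero.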

Size of K in KNN.

Figure 6(c) shows the performance on Wizard of Wikipedia when varying the amount of knowledge. Generally, being able to access multiple relevant pieces of information is helpful, but too much information can be harmful. This is likely because the weighted sum operation becomes blurry if too many sentences are summed.

7 Conclusion

We present a KNN-based Information Fetching module that learns to identify relevant information from external knowledge sources by learning a mapping-based read operation. KIF modules benefit from the scalability and efficiency of K Nearest Neighbors search, enabling computation with large external memories. We show in the context of two dialogue datasets that relevant knowledge can be identified and incorporated to create more engaging, high quality dialogue.


References

  • A. Bordes, Y. Boureau, and J. Weston (2016) Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683. Cited by: §2.2.
  • S. Chandar, S. Ahn, H. Larochelle, P. Vincent, G. Tesauro, and Y. Bengio (2016) Hierarchical memory networks. arXiv preprint arXiv:1605.07427. Cited by: §2.1.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051. Cited by: §2.2, §5.3.
  • W. Chen, D. Grangier, and M. Auli (2015) Strategies for training large vocabulary neural language models. arXiv preprint arXiv:1512.04906. Cited by: §2.1.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of wikipedia: knowledge-powered conversational agents. ICLR. Cited by: §2.2, §2.2, §4, 1st item, §5.1, §5.2, Table 1, §6.1, §6.1, §6.2.
  • A. Fan, C. Gardent, C. Braud, and A. Bordes (2019) Using local knowledge graph construction to scale seq2seq models to multi-document inputs. arXiv preprint arXiv:1910.08435. Cited by: §2.2.
  • A. Fan, D. Grangier, and M. Auli (2017) Controllable abstractive summarization. arXiv preprint arXiv:1711.05217. Cited by: §5.1.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. arXiv preprint arXiv:1805.04833. Cited by: §4.
  • E. Grave, A. Joulin, M. Cissé, H. Jégou, et al. (2017) Efficient softmax approximation for gpus. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1302–1310. Cited by: §2.1.
  • E. Grave, A. Joulin, and N. Usunier (2016) Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426. Cited by: §2.1.
  • A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §2.1.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Realm: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909. Cited by: §2.2.
  • S. Humeau, K. Shuster, M. Lachaux, and J. Weston (2019) Real-time inference in multi-sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969. Cited by: §2.2.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data. Cited by: §1.
  • A. Joulin and T. Mikolov (2015) Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in neural information processing systems, pp. 190–198. Cited by: §2.1.
  • U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2019) Generalization through memorization: nearest neighbor language models. arXiv preprint arXiv:1911.00172. Cited by: §2.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • G. Lample, A. Sablayrolles, M. Ranzato, L. Denoyer, and H. Jégou (2019) Large memory layers with product keys. arXiv preprint arXiv:1907.05242. Cited by: §1, §2.1.
  • M. Li, J. Weston, and S. Roller (2019) ACUTE-eval: improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087. Cited by: §5.2.
  • R. Lian, M. Xie, F. Wang, J. Peng, and H. Wu (2019) Learning to select knowledge for response generation in dialog systems. arXiv preprint arXiv:1902.04911. Cited by: §2.2.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: §5.1.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305. Cited by: 3rd item.
  • A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston (2017) ParlAI: a dialog research software platform. arXiv preprint arXiv:1705.06476. Cited by: §5.1.
  • A. Mnih and G. E. Hinton (2009) A scalable hierarchical distributed language model. In Advances in neural information processing systems, pp. 1081–1088. Cited by: §2.1.
  • F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel (2019) Language models as knowledge bases?. arXiv preprint arXiv:1909.01066. Cited by: §2.1.
  • L. Qin, M. Galley, C. Brockett, X. Liu, X. Gao, B. Dolan, Y. Choi, and J. Gao (2019) Conversing by reading: contentful neural conversation with on-demand machine reading. arXiv preprint arXiv:1906.02738. Cited by: §2.2, 3rd item, Table 1, Table 2.
  • J. Rae, J. J. Hunt, I. Danihelka, T. Harley, A. W. Senior, G. Wayne, A. Graves, and T. Lillicrap (2016) Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pp. 3621–3629. Cited by: §2.1, §2.1.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §5.1.
  • M. Seo, J. Lee, T. Kwiatkowski, A. P. Parikh, A. Farhadi, and H. Hajishirzi (2019) Real-time open-domain question answering with dense-sparse phrase index. arXiv preprint arXiv:1906.05807. Cited by: §2.2.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016a) Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §2.2.
  • I. V. Serban, R. Lowe, L. Charlin, and J. Pineau (2016b) Generative deep neural networks for dialogue: a short review. arXiv preprint arXiv:1611.06216. Cited by: §2.2.
  • K. Shuster, S. Humeau, A. Bordes, and J. Weston (2018) Engaging image chat: modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945. Cited by: §4, §4, Table 2, footnote 1.
  • S. Sukhbaatar, E. Grave, G. Lample, H. Jegou, and A. Joulin (2019) Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470. Cited by: §1, §2.1.
  • S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in neural information processing systems, pp. 2440–2448. Cited by: §1, §2.1, §6.4.
  • B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li (2015) YFCC100M: the new data in multimedia research. arXiv preprint arXiv:1503.01817. Cited by: §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §5.1.
  • J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §1, §2.1, §6.4.
  • J. Weston, E. Dinan, and A. H. Miller (2018) Retrieve and refine: improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776. Cited by: §2.2, 2nd item, Table 1, Table 2.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §4.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243. Cited by: §2.2, §6.1, §6.4.