Multi-level Memory for Task Oriented Dialogs

10/24/2018 ∙ by Revanth Reddy, et al. ∙ IBM

Recent end-to-end task oriented dialog systems use memory architectures to incorporate external knowledge in their dialogs. Current work makes simplifying assumptions about the structure of the knowledge base, such as the use of triples to represent knowledge, and combines dialog utterances (context) as well as knowledge base (KB) results as part of the same memory. This causes an explosion in the memory size, and makes the reasoning over memory harder. In addition, such a memory design forces hierarchical properties of the data to be fit into a triple structure of memory. This requires the memory reader to infer relationships across otherwise connected attributes. In this paper we relax the strong assumptions made by existing architectures and separate memories used for modeling dialog context and KB results. Instead of using triples to store KB results, we introduce a novel multi-level memory architecture consisting of cells for each query and their corresponding results. The multi-level memory first addresses queries, followed by results and finally each key-value pair within a result. We conduct detailed experiments on three publicly available task oriented dialog data sets and we find that our method conclusively outperforms current state-of-the-art models. We report a 15-25% increase in both entity F1 and BLEU scores.





Task oriented dialog systems are designed to complete a user specified goal, or service an information request using natural language exchanges. Unlike open domain end-to-end neural dialog models, task oriented systems also need to rely on external knowledge sources, outside of the current conversation context, to return a response [Henderson, Thomson, and Williams2014a, Su, Carreras, and Duh2016, Bordes and Weston2017a, Eric et al.2017, El Asri et al.2017]. For instance, in the example shown in Table 1 a dialog agent giving tour package recommendations needs to be able to first query an external knowledge source to determine packages that meet a user’s requirement, and then respond accordingly.

Origin   Dest.     Hotel         Price     Cat.  Rating
Dallas   Mannheim  Regal Resort  $2800     5.0   8.98
Toronto  Calgary   Amusement     $1864.4   4.0   6.91
Dallas   Mannheim  Vertex Inn    $3592.8   3.0   7.15
Dallas   Santos    Globetrotter  $2000     3.0   8.37
..
Dallas   Mannheim  Starlight     $4018.1   4.0   8.56

Role   Turn  Utterance
Agent  1     how can i help
User   2     i’ve got a few days off from aug 26-aug 31. I’m not flexible on this, but i want to somehow treat myself with an 8 day trip. I’m leaving Dallas and I want to check out Munich
Agent  2     sorry we dont have any package for Munich
User   3     how about Mannheiem?
Agent  3     would a 5 day trip suffice
User   4     sure dude
Agent  4     A 5 star hotel called the Regal resort, it has free wifi and a spa
User   5     dates?
Agent  5     starts on august 27th until the 30th
User   6     ok that could work. I would like to see my options in Santos as well
Agent  6     there is Hotel Globetrotter in Santos. it has 3 stars and comes with breakfast and wifi, it leaves on august 25th and returns on 30th, all for $2000
User   7     ahh i can’t leave until august 26th though. I guess i’ll just have to go to Mannheim then.
Agent  7     the Regal resort package costs $2800 with economy class flights.
User   8     yeah i will book it
Table 1: A goal oriented dialog based on the Frames dataset [El Asri et al.2017] along with an external knowledge source with each row containing a tour package.

In order to enable end-to-end goal oriented dialog tasks, current state of the art methods use neural memory architectures to incorporate external knowledge [Su, Carreras, and Duh2016, Eric et al.2017, Fung, Wu, and Madotto2018]. As can be seen in Table 1, agent responses may also include entity values present only in the dialog context (eg: “Munich” in the Agent response in Turn 2). In order to support such utterances, models also include tokens from the input dialog context in the same memory [Fung, Wu, and Madotto2018].

Existing memory based architectures for task oriented dialog suffer from multiple limitations. First, the creation of a shared memory for copying values from dialog context, as well as the knowledge base (KB) results, forces the use of a common memory reader for two different types of data. This makes reasoning over memory harder: not only does the memory reader need to determine the right entries from a large memory (since each word from context also occupies a memory cell), it also needs to learn to distinguish between the two forms of data (context words and KB results) stored in the same memory.

Subject       Relation  Object     Subject       Relation  Object
Vertex Inn    Price     $3592.8    Vertex Inn    Category  3.0
Regal Resort  Price     $2800      Regal Resort  Rating    8.98
Regal Resort  Category  5.0        Starlight     Price     $4018.1
Starlight     Rating    8.56       Starlight     Category  4.0
Table 2: Results from Dallas to Mannheim stored in the form of triples.

Second, all current neural memory architectures store results returned by a knowledge source in the form of triples (e.g., subject-relation-object). This modeling choice makes it hard for the memory reader to infer relationships across otherwise connected attributes. For instance, consider the example triple store in Table 2, which shows the results of a query executed for packages between “Dallas” and “Mannheim”. If the user asks the dialog agent to check the price of stay at a 5 star hotel, the memory reader needs to infer that the correct answer is $2800 by learning that the price, the category and the hotel need to be linked to return an answer (shown in blue).
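To make the linking problem concrete, here is a small Python sketch built on the rows of Tables 1 and 2. The helper names `price_of_category_triples` and `price_of_category_rows` are illustrative and do not come from the paper; the sketch only shows that a triple store needs a join through the subject, whereas an intact result row answers the question directly.

```python
from typing import Dict, List, Tuple

# (subject, relation, object) triples, as listed in Table 2.
TRIPLES: List[Tuple[str, str, str]] = [
    ("Vertex Inn", "Price", "$3592.8"), ("Vertex Inn", "Category", "3.0"),
    ("Regal Resort", "Price", "$2800"), ("Regal Resort", "Rating", "8.98"),
    ("Regal Resort", "Category", "5.0"), ("Starlight", "Price", "$4018.1"),
    ("Starlight", "Rating", "8.56"), ("Starlight", "Category", "4.0"),
]

# The same Dallas-to-Mannheim results kept as intact rows (values from Table 1).
ROWS: List[Dict[str, str]] = [
    {"Hotel": "Regal Resort", "Price": "$2800", "Category": "5.0", "Rating": "8.98"},
    {"Hotel": "Vertex Inn", "Price": "$3592.8", "Category": "3.0", "Rating": "7.15"},
    {"Hotel": "Starlight", "Price": "$4018.1", "Category": "4.0", "Rating": "8.56"},
]


def price_of_category_triples(triples, category: str) -> List[str]:
    """Two-step lookup: find matching subjects, then join back to get prices."""
    subjects = {s for s, r, o in triples if r == "Category" and o == category}
    return [o for s, r, o in triples if r == "Price" and s in subjects]


def price_of_category_rows(rows, category: str) -> List[str]:
    """Single-step lookup: price and category already sit in the same row."""
    return [row["Price"] for row in rows if row["Category"] == category]


print(price_of_category_triples(TRIPLES, "5.0"))
print(price_of_category_rows(ROWS, "5.0"))
```

A neural memory reader has to learn the equivalent of the join in the first function implicitly, which is the difficulty the paragraph above describes.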

Lastly, current models treat conversations as a sequential process, involving only the use of the most recent information request/query. In contrast, in real world dialogs such as the one shown in Table 1, the agent may have to refer to results (to Mannheim) from a previously executed query (see Turn 7). Thus, at each turn, the system has to memorize all the information exchanged during the dialog and infer which package the user is referring to. In order to support such dialogs, the memory needs to store results of all queries executed during the course of the dialog. The problem of storing and inferring over such results (which may be from multiple queries) is exacerbated when memory is represented in the form of triples.

In this paper we present our novel multi-level memory architecture that overcomes the limitations of existing methods: (i) We separate the memory used to store tokens from the input context and the results from the knowledge base. Thus, we learn different memory readers for context words as well as for knowledge base entities. (ii) Instead of using a triple store, we develop a novel multi-level memory architecture which encodes the natural hierarchy exhibited in knowledge base results by storing queries and their corresponding results and values at each level. We first attend on the queries, followed by the results in each query, to identify the result being referred to by the user. We then attend on the individual entries in the result to determine which value to copy in the response. Figure 1(c) shows our multi-level memory storing the results from queries executed as part of the dialog in Table 1.

Our paper makes the following contributions:

  1. We propose the use of separate memories for copying values from context and KB results. Thus, the model learns separate memory readers for each type of data.

  2. Our novel multi-level memory for KB results models the queries, results and their values in their natural hierarchy. As our experiments show, both the separation of memory and our multi-level memory architecture contribute to significant performance improvements.

  3. We present detailed experiments demonstrating the benefit of our memory architecture along with model ablation studies. Our experiments on three publicly available datasets (CamRest676 [Su, Carreras, and Duh2016], InCar assistance [Eric et al.2017], Maluuba Frames [El Asri et al.2017]) show a substantial improvement of 15-25% in both entity F1 and BLEU scores as compared to existing state of the art architectures. To the best of our knowledge we are the first to attempt end-to-end modeling of task oriented dialogs with non-sequential references as well as multiple queries, as seen in the Maluuba Frames dataset.

(a) Architecture of our model with multi-level memory attention.
(b) Context memory created using the hidden states
(c) Expanded view of the multi-level memory corresponding to example in Table 1
Figure 1: Model architecture (a) along with schematic representation of context memory (b) and multi-level KB memory (c)

Related work

Recent methods such as [Vinyals and Le2015, Serban et al.2016, Serban et al.2017] proposed for end-to-end learning of dialogs were aimed at modeling open-domain dialogs. While they can be used for learning task oriented dialogs, they are not well suited to interface with a structured KB. To better adapt them to handle task oriented dialogs: 1) Bordes and Weston proposed a memory network based architecture to better encode KB tuples and perform inference over them, and 2) Mem2Seq [Fung, Wu, and Madotto2018] incorporated a copy mechanism to enable copying of words from the past utterances and words from the KB while generating responses.

All successful end-to-end task oriented dialog networks [Eric et al.2017, Bordes and Weston2017b, Fung, Wu, and Madotto2018] make assumptions while designing the architecture: 1) KB results are assumed to be a triple store, 2) KB triples and past utterances are forced to be represented in a shared memory to enable copying over them. Both these assumptions make the task of inference much harder. Any two fields linked directly in the KB tuple are now linked indirectly by the subject of the triples. Further, placing the KB results and the past utterances in the same memory forces the architecture to encode them using a single strategy. In contrast, our work uses two different memories for past utterances and KB results. The decoder is equipped with the ability to copy from both memories while generating the response. The KB results are represented using a multi-level memory which better reflects the natural hierarchy encoded by sets of queries and their corresponding result sets.

Memory architectures have also been found to be helpful in other tasks such as question answering. Work such as [Xu et al.2016] defines a hierarchical memory architecture consisting of sentence level memory followed by word memory for a QA task, while [Chandar et al.2016] defines a memory structure that speeds up loading and inference over large knowledge bases. Recent work by [Chen et al.2018] uses a variational memory block along with a hierarchical encoder to improve diversity of open domain dialog responses.

Multi-Level Memory Network

In this section, we describe our end-to-end model for task oriented dialogues. Our model (Figure 1(a)) consists of: (i) a hierarchical encoder which encodes the current input context consisting of the user and agent utterances, (ii) a multi-level memory that maintains the queries and knowledge base results seen so far in the course of the dialogue, and (iii) a copy augmented sequence decoder that uses a separate context memory. The decoder uses a gating mechanism for memory selection while generating a response.


Our model uses a standard hierarchical encoder as proposed by [Sordoni et al.2015]. The encoder takes a sequence of utterances as input. At each turn, the dialogue context consists of the user utterances and system utterances seen so far, and each utterance is in turn a sequence of words. We first embed each word using a word embedding function that maps each word to a fixed-dimensional vector. We then generate utterance representations using a single layer bi-directional GRU; the hidden state of each word in this bi-directional GRU is kept for later use as the context memory. The input representation is generated by passing each utterance representation through another single layer GRU.

Multi-level Memory


Current approaches break down KB results by flattening them into (subj-rel-obj) triples. However, converting KB results into triples leads to a loss of the relationships amongst attributes in the result set. This makes reasoning over memory hard, as the model now has to infer relationships when retrieving values from memory. We hypothesize that a representation of all the values in the result, and not just one of the values, should be used while attending over a result in the KB. This is one of the main motivations of our multi-level memory, in which we keep the results intact in memory and attend over a representation of the result before attending on the individual key-value pairs in each result.


Consider the queries fired to the knowledge base so far over the course of the dialogue. Every query is a set of key-value pairs, corresponding to the query’s slots and their arguments. For example, after the user utterance at Turn 3 in Table 1, the query fired by the system on the knowledge base would be {’origin’: ’Dallas’, ’destination’: ’Mannheim’, ’Start’: ’Aug 26’, ’end’: ’Aug 31’, ’Adults’: 1}. The execution of a query on an external knowledge base returns a set of results. Each result is also a set of slot-value pairs, one per attribute of the result. A visualization of the memory with queries and their corresponding results can be seen in Figure 1(c).
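The query/result/cell hierarchy described above can be pictured as a nested container. The following is a minimal sketch, assuming hypothetical class names (`Result`, `QueryEntry`, `MultiLevelMemory`) that are not from the paper; the Santos query is abbreviated to the two slots that are evident from Table 1.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Result:
    cells: Dict[str, str]  # slot name -> value (third level: result cells)


@dataclass
class QueryEntry:
    slots: Dict[str, str]  # query slot -> argument (first level)
    results: List[Result] = field(default_factory=list)  # second level


@dataclass
class MultiLevelMemory:
    queries: List[QueryEntry] = field(default_factory=list)

    def add_query(self, slots: Dict[str, str], results: List[Dict[str, str]]) -> None:
        """Store a query together with all results it returned."""
        self.queries.append(QueryEntry(dict(slots), [Result(dict(r)) for r in results]))

    def values(self) -> Iterator[Tuple[int, int, str, str]]:
        """Yield (query index, result index, slot, value) for every copyable value."""
        for qi, query in enumerate(self.queries):
            for ri, result in enumerate(query.results):
                for slot, value in result.cells.items():
                    yield qi, ri, slot, value


memory = MultiLevelMemory()
memory.add_query(
    {"origin": "Dallas", "destination": "Mannheim", "start": "Aug 26", "end": "Aug 31", "adults": "1"},
    [
        {"Hotel": "Regal Resort", "Price": "$2800", "Cat.": "5.0", "Rating": "8.98"},
        {"Hotel": "Vertex Inn", "Price": "$3592.8", "Cat.": "3.0", "Rating": "7.15"},
        {"Hotel": "Starlight", "Price": "$4018.1", "Cat.": "4.0", "Rating": "8.56"},
    ],
)
memory.add_query(
    {"origin": "Dallas", "destination": "Santos"},  # remaining slots omitted in this sketch
    [{"Hotel": "Globetrotter", "Price": "$2000", "Cat.": "3.0", "Rating": "8.37"}],
)
```

Because results of earlier queries stay in the container, a later reference to the Mannheim package (Turn 7 in Table 1) can still be resolved after the Santos query has been executed.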

The first level of memory contains the query representations. Each query is represented by a bag of words over the word embeddings of the values in that query. The second level of memory contains the result representations. Each result is likewise represented by a bag of words over the word embeddings of the values in that result. The third level of memory contains the result cells, which hold the key-value pairs of the results. The values which are to be copied into the system response are thus present in the final level of memory. We now describe how we apply attention over the context and multi-level memory.
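A bag-of-words representation of a query or a result can be sketched as follows. This is an illustration under two assumptions that the text does not settle: the bag of words is taken to be a plain sum of token embeddings (a mean would be an equally plausible reading), and `TOY_EMBED` is a made-up two-dimensional embedding table standing in for the pre-trained vectors.

```python
from typing import Dict, Iterable, List

# Hypothetical 2-d embeddings, for illustration only.
TOY_EMBED: Dict[str, List[float]] = {
    "regal": [1.0, 0.0],
    "resort": [0.0, 1.0],
    "$2800": [0.5, 0.5],
    "5.0": [2.0, 0.0],
}


def bag_of_words(values: Iterable[str], embed: Dict[str, List[float]]) -> List[float]:
    """Sum the embeddings of every token in the given values.

    Tokens missing from the table contribute a zero vector.
    """
    dim = len(next(iter(embed.values())))
    total = [0.0] * dim
    for value in values:
        for token in str(value).lower().split():
            vec = embed.get(token, [0.0] * dim)
            total = [a + b for a, b in zip(total, vec)]
    return total


# Representation of one result: all of its values contribute, not just one.
result_repr = bag_of_words(["Regal Resort", "$2800", "5.0"], TOY_EMBED)
print(result_repr)
```

The same function would be applied to the values of a query to obtain the first-level representation.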


The model generates the agent response word-by-word; at each time step, a word is either generated from the decoder vocabulary or is a value copied from one of the two memories (knowledge base or context memory). A soft gate controls whether a value is generated from the vocabulary or copied from memory. Another gate determines which of the two memories is used to copy values.

Generating words:

At each time step, the decoder maintains a hidden state, which is used to apply attention over the input context memory. Attention is applied over the hidden states of the input bi-directional (BiDi) GRU encoder using the “concat” scheme as given in [Luong, Pham, and Manning2015], yielding one attention score for every word of every utterance.

The attention scores are combined to create an attended context representation. Similar to [Luong, Pham, and Manning2015], the decoder word-generation distribution is then computed from the decoder hidden state and the attended context representation.
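As a sketch of the “concat” attention and the generation step described above, the following uses assumed notation that is not taken from the paper: $h_t$ is the decoder hidden state, $h^{e}_{ij}$ the encoder hidden state of word $j$ in utterance $i$, $d_t$ the attended context representation, and $W_1$, $W_2$, $w$, $b_1$ are learned parameters.

```latex
\begin{align}
a_{ij} &= \frac{\exp\big(w^{\top}\tanh(W_2\,[h_t ; h^{e}_{ij}])\big)}
               {\sum_{i',j'} \exp\big(w^{\top}\tanh(W_2\,[h_t ; h^{e}_{i'j'}])\big)} \\
d_t &= \sum_{i,j} a_{ij}\, h^{e}_{ij} \\
P_g(y_t) &= \mathrm{softmax}\big(W_1\,[h_t ; d_t] + b_1\big)
\end{align}
```

The first line is the standard concat score of Luong et al. followed by a softmax over all context words; the exact parameterization used by the authors may differ.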


Copying words from context memory:

The input context memory is represented using the hidden states of the input Bi-Di GRU encoder. Similar to [Gulcehre et al.2016], the attention scores are used as the probability scores to form the copy distribution over the input context memory.


Copying entries from KB memory:

The context representation, along with the hidden state of the decoder, is used to attend over the multi-level memory. The first level attention is applied over the queries.


The second level attention is the attention over the results of each query.


The product of first level attention and second level attention is the attention over results of all the queries in the multi-level memory. Weighting the result representations by this product and summing them gives us the attended memory representation.


Each result is further composed of multiple result cells. On the last level of memory, which contains the result cells, we apply key-value attention similar to [Eric et al.2017]. The key of a result cell is the word embedding of its slot in the result. The attention scores for the keys represent the attention over the result cells of each result.


The product of the first level attention, second level attention and third level attention gives the final attention score of a value in the KB memory. These final attention scores, when combined (Eq. 10), form the copy distribution over the values in KB memory.
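The three-level product can be illustrated with a short Python sketch. It assumes the raw (pre-softmax) scores for queries, results and cells are already available as nested lists; in the model these scores come from learned functions of the decoder state and context representation, which are not reproduced here. The function name `kb_copy_distribution` is hypothetical, every query is assumed to have at least one result, and the aggregation of cells that hold the same value is omitted.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kb_copy_distribution(
    query_scores: List[float],
    result_scores: List[List[float]],
    cell_scores: List[List[List[float]]],
) -> Dict[Tuple[int, int, int], float]:
    """Multiply query-, result- and cell-level attention for every cell.

    query_scores[q], result_scores[q][r] and cell_scores[q][r][c] are raw scores.
    """
    dist: Dict[Tuple[int, int, int], float] = {}
    alpha = softmax(query_scores)                      # first level: over queries
    for q, a_q in enumerate(alpha):
        beta = softmax(result_scores[q])               # second level: results of query q
        for r, b_r in enumerate(beta):
            gamma = softmax(cell_scores[q][r])         # third level: cells of result r
            for c, g_c in enumerate(gamma):
                dist[(q, r, c)] = a_q * b_r * g_c
    return dist


# Two queries: the first has two results, the second has one; two cells per result.
dist = kb_copy_distribution(
    query_scores=[0.0, 2.0],
    result_scores=[[0.0, 0.0], [1.0]],
    cell_scores=[[[0.0, 0.0], [0.0, 0.0]], [[0.0, 3.0]]],
)
best = max(dist, key=dist.get)
```

Because each level is normalized within its parent, the products over all cells sum to one, so the result is a valid distribution over every value stored in the KB memory.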



Lastly, similar to [Gulcehre et al.2016], we combine the generate and copy distributions using soft gates. We use one gate (Eq. 11) to obtain the copy distribution (Eq. 12) by combining the copy distribution over the KB memory with the copy distribution over the context memory.

Finally, we use a second gate to obtain the final output distribution by combining the generate distribution and the copy distribution.
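The two-gate mixture can be sketched in a few lines of Python. The function name `combine_distributions` and the toy distributions are hypothetical; in the model the gate values are predicted at each time step rather than passed in. The gate semantics follow the paper's own description: one gate is the probability of generating from the vocabulary, the other the probability of copying from the KB rather than from the context.

```python
from typing import Dict


def combine_distributions(
    p_gen: Dict[str, float],
    p_kb: Dict[str, float],
    p_ctx: Dict[str, float],
    g_gen: float,
    g_kb: float,
) -> Dict[str, float]:
    """Mix generate, KB-copy and context-copy distributions with two soft gates.

    g_gen: probability of generating from the vocabulary.
    g_kb:  probability of copying from KB memory (vs. context memory).
    """
    out: Dict[str, float] = {}
    for tok in set(p_gen) | set(p_kb) | set(p_ctx):
        # Copy distribution: gate between the two memories.
        p_copy = g_kb * p_kb.get(tok, 0.0) + (1.0 - g_kb) * p_ctx.get(tok, 0.0)
        # Final distribution: gate between generating and copying.
        out[tok] = g_gen * p_gen.get(tok, 0.0) + (1.0 - g_gen) * p_copy
    return out


# Toy distributions; the gate values 0.08 and 0.99 are those reported for the
# attention visualization example.
final = combine_distributions(
    p_gen={"the": 0.6, "rating": 0.4},
    p_kb={"8.86": 0.9, "8.22": 0.1},
    p_ctx={"rating": 0.5, "guest": 0.5},
    g_gen=0.08,
    g_kb=0.99,
)
top_token = max(final, key=final.get)
```

With a low generation gate and a high KB gate, almost all of the probability mass lands on KB values, which matches the behaviour reported for the example in Table 7.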


We train our model by minimizing the cross entropy loss.



We present our experiments using three real world publicly available multi-turn task oriented dialogue datasets: the InCar assistant [Eric et al.2017], CamRest [Su, Carreras, and Duh2016] and the Maluuba Frames dataset [El Asri et al.2017]. All three datasets contain human-human task oriented dialogues which were collected in a Wizard-of-Oz [Wen et al.2017] setting.

(i) The InCar assistant dataset consists of multi-turn dialogues in three distinct domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. Each dialogue has its own KB information provided, and thus the system does not have to make any queries.

(ii) The CamRest dataset consists of human-to-human dialogues set in the restaurant reservation domain. There are three queryable slots (food, price range, area) that users can specify. This dataset has so far been used for evaluating slot-tracking systems. Recent work by [Lei et al.2018] uses an end-to-end network without a KB and substitutes slot values with placeholders bearing the slot names in agent responses. However, we formatted the data to evaluate end-to-end systems by adding API call generation from the slot values so that the restaurant suggestion task can proceed from the KB results. (We will release our pre-processed data and code for further research.)

(iii) The Maluuba Frames dataset consists of dialogues developed to study the role of memory in task oriented dialogue systems. The dataset is set in the domain of booking travel packages, which involves flights and hotels. In contrast to the previous two datasets, this dataset contains dialogs that require the agent to remember all information presented previously as well as support results from multiple queries to the knowledge base. A user’s preferences may change as the dialogue proceeds, and the user can also refer to previously presented queries (non-sequential dialog). Thus, to store multiple queries, we require three levels in our multi-level memory as compared to two levels in the other datasets, since they don’t have more than one query. We do not use the dialogue frame annotations and use only the raw text of the dialogues. We map ground-truth queries to API calls that are also required to be generated by the model. Recent work has used this dataset only for frame tracking [Schulz et al.2017] and dialogue act prediction [Peng et al.2017, Tang et al.2018]. To the best of our knowledge we are the first to attempt the end-to-end dialog task using this dataset. Table 3 summarizes the statistics of the datasets.

                            InCar   CamRest   Maluuba Frames
Train Dialogs               2425    406       1095
Val Dialogs                 302     135       137
Test Dialogs                304     135       137
Avg. no. of turns           2.6     5.1       9.4
Avg. length of sys. resp.   8.6     11.7      14.8
Avg. no. of sys. entities   1.6     1.7       2.9
Avg. no. of queries         0       1         2.4
Avg. no. of KB entries      66.1    13.5      141.2
Table 3: Statistics for different datasets.


Our model is trained end-to-end using Adam optimizer [Kingma and Ba2014] with a learning rate of . The batch-size is sampled from [8,16]. We use pre-trained Glove vectors [Pennington, Socher, and Manning2014] with an embedding size of 200. The GRU hidden sizes are sampled from [128, 256]. On all the datasets, we tuned the hyper-parameters with grid search over the validation set and selected the model which gives best entity F1.

Evaluation Metrics


We use the commonly used BLEU metric [Papineni et al.2002] to study the performance of our systems as it has been found to have strong correlation [Sharma et al.2017] with human judgments in a task-oriented dialog setting. We use the Moses multi-bleu.perl script in our evaluation.

Entity F1

To explicitly study the behaviour of different memory architectures, we use the entity F1 to measure how effectively values from a knowledge base are used in the dialog. To compute the entity F1, we micro-average the precision and recall over the entire set of system responses to compute the micro F1 (we observe that [Fung, Wu, and Madotto2018] reports the micro average of recall as the micro F1). For the InCar Assistant dataset, we compute a per-domain entity F1 as well as the aggregated entity F1. Since our model does not have slot-tracking by design, we evaluate on entity F1 instead of the slot-tracking accuracy as in [Henderson, Thomson, and Williams2014b, Wen et al.2017].
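A micro-averaged entity F1 can be sketched as below. This is an illustration rather than the authors' evaluation script: it assumes entities have already been extracted from each gold and generated response and are compared as sets of strings, and the function name `micro_entity_f1` is hypothetical.

```python
from typing import List, Set


def micro_entity_f1(gold_entities: List[Set[str]], predicted_entities: List[Set[str]]) -> float:
    """Micro F1: pool true/false positives and false negatives over all responses."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_entities, predicted_entities):
        tp += len(gold & pred)   # entities correctly produced
        fp += len(pred - gold)   # entities produced but not in the gold response
        fn += len(gold - pred)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [{"regal resort", "$2800"}, {"8.86"}]
pred = [{"regal resort"}, {"8.22"}]
f1 = micro_entity_f1(gold, pred)
```

In this toy case precision is 1/2 and recall is 1/3, which illustrates why reporting the micro-averaged recall alone is not the same quantity as the micro F1.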


We experiment with the following baseline models for comparing the performance of our Multi-Level Memory architecture:

  • Attn seq2seq [Luong, Pham, and Manning2015]: A model with simple attention over the input context at each time step during decoding.

  • Ptr-UNK [Gulcehre et al.2016]: The model augments a sequence-to-sequence architecture with an attention-based copy mechanism over the encoder context.

  • KVRet [Eric et al.2017]: The model uses a key value knowledge base in which the KB is represented as triples in the form of subject-relation-object. This model does not support copying words from context. The sum of the word embeddings of the subject and relation is used as the key of the corresponding object.

  • Mem2Seq [Fung, Wu, and Madotto2018]: The model uses a memory networks based approach for attending over dialog history and KB triples. During decoding, at each time step, the hidden state of the decoder is used to perform multiple hops over a single memory which contains both dialog history and the KB triples to get the pointer distribution used for generating the response.

For Attn seq2seq, Ptr-UNK and Mem2Seq we use the implementation provided by [Fung, Wu, and Madotto2018].


Model                                         InCar BLEU  InCar F1  Sched. F1  Weather F1  Nav. F1  CamRest BLEU  CamRest F1  Frames BLEU  Frames F1
Attn seq2seq [Luong, Pham, and Manning2015]   11.3        28.2      36.9       35.7        10.1     7.7           25.3        3.7          16.2
Ptr-UNK [Gulcehre et al.2016]                 5.4         20.4      22.1       24.6        14.6     5.1           40.3        5.6          25.8
KVRet [Eric et al.2017]                       13.2        48.0      62.9       47.0        41.3     13.0          36.5        10.7         31.7
Mem2Seq [Fung, Wu, and Madotto2018]           11.8        40.9      61.6       39.6        21.7     14.0          52.4        7.5          28.5
Multi-level Memory Model (MM)                 17.1        55.1      68.3       53.3        44.5     15.9          61.4        12.4         39.7
Table 4: Comparison of our model with baselines

Table 4 shows the performance of our model against our baselines. We find that our multi-level memory architecture comprehensively beats all existing models establishing new state-of-the-art benchmarks on all three datasets. Our model outperforms each baseline on both BLEU and F1 metrics.

InCar: On this dataset, we show entity F1 scores for each of the scheduling, weather and navigation domains. Our model has the highest F1 scores across all the domains. It can be seen that our model strongly outperforms Mem2Seq on each domain. A detailed study reveals that models relying on triples cannot handle cases where a user queries with non-subject entries, or cases where the response requires inference over multiple entries. In contrast, our model is able to handle such cases since we use a compound representation of the entire result (bag of words over values) while attending on that result.

CamRest: Our model achieves the highest BLEU and entity F1 score on this dataset. From Table 4, we see that simpler baselines like Ptr-UNK show competitive performance on this dataset because, as shown in Table 3, CamRest dataset has relatively fewer KB entries. Thus, a simple mechanism for copying from context results in good entity F1 scores.

Maluuba Frames: The Maluuba Frames dataset was introduced for the frame tracking task, where a dialog frame is a structured representation of the current dialog state. Instead of explicitly modeling the dialog frames, we use the context representation to directly attend on the multi-level memory. As Table 3 shows, this dataset contains significantly longer contexts as well as a larger number of entities, as compared to the previous two datasets. In addition, unlike other datasets, it also contains non-linear dialog flows where a user may refer to previously executed queries and results. The complexity of this dataset is reflected in the relatively lower BLEU and F1 scores as compared to other datasets.


Entity source-wise performance: To further understand the effect of separating context memory from KB memory and using a multi-level memory for KB, Table 5 shows the percentage of ground-truth entities, according to their category, which were also present in the generated response. For example, on the InCar dataset, out of the entities in the ground-truth responses that were to be copied from the KB, our model was able to copy 37.5% of them into the generated response. From Table 5, it can be seen that our model is able to copy a significantly larger number of entities from both KB and context, as compared to the Mem2Seq model, on all datasets.

Model              InCar Ctxt.  InCar KB  CamRest Ctxt.  CamRest KB  Maluuba Ctxt.  Maluuba KB
Mem2Seq            66.2         25.3      63.7           36.5        17.7           8.9
Multi-level Mem.   81.6         37.5      70.1           53.4        27.2           14.6
Table 5: Percentage (%) of category-wise (context vs KB) ground truth entities captured in generated response. Abbreviation Ctxt denotes context.
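The category-wise percentages in Table 5 amount to a per-source recall. The sketch below shows one way such a number could be computed; the function name `captured_percentage`, the source labels and the toy examples are hypothetical, and entity matching is simplified to a lower-cased substring check.

```python
from typing import List, Tuple

Example = Tuple[List[Tuple[str, str]], str]  # ([(entity, source), ...], generated response)


def captured_percentage(examples: List[Example], source: str) -> float:
    """Percentage of gold entities from `source` that appear in the generated response."""
    total = 0
    hits = 0
    for gold, generated in examples:
        text = generated.lower()
        for entity, entity_source in gold:
            if entity_source != source:
                continue
            total += 1
            if entity.lower() in text:
                hits += 1
    return 100.0 * hits / total if total else 0.0


# Toy data: each gold entity is tagged with where it would have to be copied from.
EXAMPLES: List[Example] = [
    ([("8.86", "kb"), ("pittsburgh", "context")],
     "the hotel in pittsburgh has a 8.22 guest rating"),
    ([("$2800", "kb")], "the package costs $2800"),
]

kb_pct = captured_percentage(EXAMPLES, "kb")
ctx_pct = captured_percentage(EXAMPLES, "context")
```

Splitting recall by source in this way separates errors of the KB memory reader from errors of the context memory reader.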

Model ablation study

Model                                         InCar BLEU  InCar F1  Sched. F1  Weather F1  Nav. F1  CamRest BLEU  CamRest F1  Frames BLEU  Frames F1
Unified Context and KB memory (Mem2Seq)       11.8        40.9      61.6       39.6        21.7     14.0          52.4        7.5          28.5
Separate Context and KB Memory                14.3        44.2      56.9       54.1        24.0     14.3          55.0        12.1         36.5
+Replace KB Triples With Multi-level memory   17.1        55.1      68.3       53.3        44.5     15.9          61.4        12.4         39.7
Table 6: Model ablation study : Effects of separate memory and memory design

We report our ablation studies on all three datasets. Table 6 shows the incremental benefit contributed by individual components used in our model. We investigate the incremental gains made by (i) using separate memory for context and KB triples and (ii) replacing triples with a multi-level memory. We use the recent Mem2Seq model for comparison with a unified context and KB memory.

As can be seen from Table 6, the separation of context memory and KB memory leads to a significant improvement in BLEU and F1 scores on all datasets. This validates our hypothesis that storing context words and KB results in a single memory confuses the memory reader. The use of a multi-level memory instead of triples leads to further gains. This suggests that better organization of KB result memory is beneficial.

Error Analysis

We analyzed the errors made by our dialog model on 100 dialog samples in the test set of Maluuba Frames. We observed that the errors can be divided into five major classes: (i) the model outputs a wrong result due to incorrect attention (27%); (ii) the model returns package details instead of asking the user for more information (16%); (iii) the model incorrectly captures user intent (13%); (iv) the model makes an error due to the non-sequential nature of the dialog (22%), in which case it either generates an API call for a result already present in the memory or asks for a query-slot value that was already provided by the user; (v) data specific characteristics, such as insufficient samples for certain classes of utterances (eg: more than one package returned) or returning different but meaningful package attributes as compared to gold data, contribute to 22% of the errors.

Attention Visualization

(a) Attention over the multi-level KB memory
(b) Decreasing order of attention scores over words in context.
Figure 2: Visualization of attention over memory for example in Table 7

Analyzing the attention weights is a good way to understand how the model reasons over the memory to copy entities from it. We show an example from the Maluuba Frames dataset in Table 7. Here, the user first wants to know about packages to Manas and then asks about trips to Pittsburgh. Later, the user becomes interested in the 3.5 star hotel in Pittsburgh which was suggested by the agent and wants to know its guest rating. It can be seen from Table 7 that our model outputs the correct guest rating (8.86) of the hotel. Mem2Seq fails to understand the context and generates an irrelevant response. KVRet generates a readable response but points to the guest rating of a different hotel.

The attention over the memory while generating the word ‘8.86’ for this example is shown in Fig 2. Fig 2(a) shows that the query with destination as ‘Pittsburgh’ gets the highest attention and, among the results of this query, the package with the 3.5 star rated hotel gets the highest attention. Within this result, the model gives the highest score to the result cell with guest rating as the key. To further understand why the correct result gets higher attention, Fig 2(b) shows the attention scores over the words in context memory. The context representation captures the correct words (3.5, guest, rating) in context, which are in turn used to apply attention over the multi-level memory.

Lastly, studying the values of the two gates (the probability of generating from the vocabulary and the probability of copying from KB), we found that the first gate had a probability value of 0.08, thereby driving the model to copy from memory instead of generating from the output vocabulary, and the second gate, with a probability value of 0.99, was responsible for selecting KB memory over context memory.

Role       Turn  Utterance
Agent      1     hello ! how can i help you today ?
User       2     i just need to dream a bit .. i have 4500 dollars but no vacation days. i’m wondering what a hypothetical trip to Manas would be like
Agent      2     i can help with that, where are you departing from
User       3     i’m in Tijuana, not departing any time soon unfortunately though
Agent      5     [API_CALL manas tijuana 29/8 16/9 4500 ....]
User       7     that’s fine, what about pittsburgh
Agent      7     [API_CALL pittsburgh tijuana ... ... 4500 ...]
Agent      8     that i do have! would you be satisfied with a 3.5 star hotel ?
User       9     as of now i’m satisfied with a tent and a backpack . what are the hotel details?
Agent      9     the hotel is situated near a park, comes with free parking, wifi and breakfast ! does this sound good
User       10    sounds quite nice, what about the guest rating ?
Gold             the guest rating is impressive : 8.86
Mem2Seq          this is the only available with the same
KVRet            it’s a 2.5 star hotel near a park, museum and has a guest rating of 8.22
Our Model        the onyx isle hotel has a 8.86 guest rating and offers free parking, breakfast and wifi.
Table 7: Comparing the responses generated by various models on an example in test set of Maluuba Frames dataset.


In this paper, we presented an end-to-end trainable novel architecture with multi-level memory for task oriented dialogues. Our model separates the context and KB memory and combines the attention on them using a gating mechanism. The multi-level KB memory reflects the natural hierarchy present in KB results. This also allows our model to support non-sequential dialogs where a user may refer to a previously suggested result. We find that our model beats existing models by 15-25% on both entity F1 and BLEU scores, establishing state-of-the-art results on three publicly available real-world task oriented datasets. We study the benefits of each of our design choices using an ablation study. In future work, we would like to incorporate better modeling of latent dialog frames so as to improve the attention signal on our multi-level memory. As our error analysis suggests, nearly 22 % of the errors could possibly be reduced by improved modeling of the dialog context to better capture the state of dialog.