Memory-augmented Dialogue Management for Task-oriented Dialogue Systems

05/01/2018 ∙ by Zheng Zhang, et al. ∙ Tsinghua University

Dialogue management (DM) decides the next action of a dialogue system according to the current dialogue state, and thus plays a central role in task-oriented dialogue systems. Since dialogue management requires access to not only local utterances but also the global semantics of the entire dialogue session, modeling long-range history information is a critical issue. To this end, we propose a novel Memory-Augmented Dialogue management model (MAD) which employs a memory controller and two additional memory structures, i.e., a slot-value memory and an external memory. The slot-value memory tracks the dialogue state by memorizing and updating the values of semantic slots (for instance, cuisine, price, and location), and the external memory augments the representation of the hidden states of traditional recurrent neural networks by storing more context information. To update the dialogue state efficiently, we also propose slot-level attention on user utterances to extract specific semantic information for each slot. Experiments show that our model obtains state-of-the-art performance and outperforms existing baselines.


1. Introduction

Task-oriented dialogue systems offer a natural and effective interface for users to seek information and complete complex tasks in an interactive manner. Such systems often collect users’ preferences in the course of dialogue before issuing the final query to the knowledge base (such as booking a flight ticket). There are also some works (Hixon et al., 2015; Saha et al., 2018) viewing the task-oriented dialogue task as a context-aware, multi-turn question answering (QA) task in which a user can interact with the system in multi-turn contexts and the system also has access to the knowledge base.

Different from open-domain conversational systems, which are often modeled in an end-to-end manner, task-oriented dialogue systems are generally composed of several cascaded processes, as shown in Figure 1, including natural language understanding (NLU), dialogue management (DM), and natural language generation (NLG). Dialogue management, which is in charge of selecting actions in response to user inputs, plays a central role in task-oriented dialogue systems (Williams and Young, 2007; Ge and Xu, 2015a). It takes as input the user intent analyzed by NLU, interacts with the knowledge base, and decides the next system action. Sometimes NLU and DM are coupled into a single module which can be trained end-to-end to read directly from the user utterance and produce the system action. The system action produced by DM is then translated into a natural language utterance by NLG (Wen et al., 2015) to interact with users.

Figure 1. The processing flow of a task-oriented dialogue system. Natural language understanding (NLU) parses the user utterance and extracts structured semantic information from the utterance, dialogue management receives the semantic information and decides the next dialogue act that the system should take, and natural language generation (NLG) translates the dialogue act to a natural language response. In some cases, NLU and DM can be coupled together as a single module, and the semantic information produced by NLG is often unstructured in this situation, such as the output of a neural network.

In order to decide the next action a dialogue system should take, dialogue management, particularly in task-oriented dialogue systems, should deal with the dialogue context information. It needs to access not only local utterances, but also the global information about what has been addressed several turns ago. The global history information, which is often referred to as dialogue state, is a key factor in dialogue systems. Based on the dialogue state, the dialogue manager then produces system action according to its policy. The task of dialogue management is sometimes divided into two subtasks, namely dialogue state tracking which maintains dialogue history information, and dialogue policy which selects the next system action based on the dialogue state.

Early methods of modeling dialogue management are mostly rule-based, in which the state update and dialogue policy are manually defined, but these methods do not take into account the uncertainty inherent in dialogue. Bayesian network methods (Paek and Chickering, 2005; Williams and Young, 2007) formulated dialogue management as a probabilistic graphical model which captures the conditional dependency between different states, with each specific state bound to an action to be taken; however, the definition of the dialogue state still needs manually-crafted rules. Recently, many neural network methods have been proposed for dialogue management due to their capability of semantic representation and automatic feature extraction, and they obtain state-of-the-art performance on many dialogue tasks (Ge and Xu, 2015b; Serban et al., 2016a). More specifically, most neural dialogue models are RNN (Recurrent Neural Network) based: they take as input the user utterance and system response at each dialogue turn, and the hidden state of the RNN is utilized as the representation of the dialogue state (Henderson et al., 2014b; Williams et al., 2017).

However, despite the success of RNNs on various text modeling tasks, a simple RNN is proven to perform poorly on dialogue tasks (Williams et al., 2017) because its single hidden state vector is deficient at modeling long-range contexts. Hierarchical RNN structures (Serban et al., 2016b) and memory networks (Weston et al., 2014; Dodge et al., 2015; Bordes and Weston, 2016) are feasible solutions to this issue, but existing neural models still lack an explicit memorization of the history semantics of the entire dialogue session: the dialogue act types, semantic slots, and the values of the slots are not explicitly processed during the interaction.

Another important issue is to extract semantic information from the user utterance when combining NLU and DM together, which is the case in most end-to-end dialogue systems. Such semantic information is critical for the dialogue state update. Existing methods either extract information from predefined features (such as POS and NER tags) by heuristic rules (Henderson et al., 2014b), or from pretrained word embeddings by a neural network encoder (Mrkšić et al., 2016). However, words in a user utterance have different importance for updating dialogue states and predicting the next action, which is not taken into consideration by previous methods. For example, in the user utterance "I want to book a table in Beijing Hotel", the word "book" apparently contributes more than the word "want" to the user intent. Furthermore, each word contributes differently to different slots, e.g., the word "British" is more related to slot Cuisine while "north" is more related to Location, as shown in Figure 2.

To address the above issues, we propose a novel Memory-Augmented Dialogue management model (MAD) which attentively receives user utterances as input and predicts the next dialogue act. (The dialogue act can be translated into a natural language utterance by a language generator, as shown in (Wen et al., 2015).) The dialogue act is composed of two parts in our model: dialogue act type and slot-value pairs, as shown in Table 1. The dialogue act type indicates the intent type such as Query or Recommendation, which is a high-level representation of the dialogue act. Slot-value pairs denote key elements of a task and represent the key semantic information supplied by the user during the interaction, which also indicates the state of the dialogue.

We design two memory modules, namely a slot-value memory and an external memory, which can be written (or updated) and read, to enhance the ability of modeling history semantics of dialogues. A memory controller is introduced to control the write and read operations to the two memories. The slot-value memory explicitly memorizes and updates the values of the semantic slots during interaction. The write to the slot-value memory units, each corresponding to a slot, is implemented by a slot-level attention mechanism. In this manner, the slot-value memory provides an observable and interpretable representation of the dialogue state. The external memory serves as a supplement to the single hidden state of a RNN structure and provides a better capacity to store more historical dialogue information. A complete dialogue act (consisting of dialogue act type and slot-value pairs) for the next interaction is predicted based on the slot-value memory and external memory.

Utterance           How about a British restaurant in north part of town.
Dialogue act type   Query
Slot-value pairs    Cuisine=British, Location=north
Mask (auxiliary)    Rating=0  Cuisine=1  Price=0  Service=0  Location=1

Table 1. An example of the dialogue act for a given utterance. The dialogue act type is a high-level representation of an utterance. Slot-value pairs are the task-specific semantic elements that are mentioned in an utterance.

Our contributions are summarized as follows:

  • We propose a novel memory-augmented dialogue management model by introducing two memory networks. The slot-value memory network maintains the values of semantic slots during interaction, and the external-memory augments the single state representation of the recurrent networks. Both memory modules enable the model to access not only local utterances, but also the global semantics of the entire dialogue session.

  • We propose an attention mechanism for updating the dialogue state. In particular, the model first computes a weight distribution over all words in a user utterance for each slot. Then, the weighted representation of the utterance is used to update the memory unit for each slot.

  • The model can offer more observable and interpretable results in that the slot-value memory can track the change of dialogue states explicitly.

2. Related Work

The role of dialogue management (DM) is to launch the next interaction through predicting the next action the system should take, or by generating an utterance directly in response to user’s query. The previous studies on DM can be broadly classified into three types: rule-based models, Bayesian network models, and neural models.

Rule-based approaches date back to very early dialogue systems (Weizenbaum, 1966). Several architectures are proposed to formulate the process of dialogue management. The flow diagram approach (McTear, 1998) used a finite-state machine to model state transition in dialogue, where the state represents a certain dialogue status, and the transition between states is triggered by the corresponding type of a user utterance. Slot-filling approaches (Goddeau et al., 1996) expanded the definition of dialogue state to an aggregation of slots and values. In such models, user can talk about each slot by issuing constraints and requesting the values of slots, and the dialogue state will be updated as long as a user provides new values for the slots during interaction. Though rule-based DM models work well in some applications, these approaches have apparent difficulties in task and domain adaptation (Zukerman and Albrecht, 2001) because the rules are usually tailored to a specific scenario. Due to the nature of hand-crafted rules, the variety and diversity of language is not well addressed. The need for hand-crafted rules also makes it expensive to build a rule-based system.

Bayesian network approaches were proposed to address the issues of rule-based methods. Dialogue management was first formalized as a Markov decision process (MDP) (Levin et al., 1998) under the Markov assumption (Paek and Chickering, 2005), in which the new state at turn $t$ is conditioned only on the previous state and system action. MDP models the uncertainty in dialogue and is more robust to the errors induced by speech recognition and NLU. The partially observable Markov decision process (POMDP) (Williams and Young, 2007) provides a more principled way in that it takes the environment observation into consideration. On top of this framework, state transition and dialogue policy are trained using reinforcement learning. However, the POMDP model becomes difficult to train for domains with a large state space. An improved version of POMDP, the Hidden Information State (HIS) model (Young et al., 2007), was proposed to address this problem by grouping dialogue states into partitions. Another key problem in building Bayesian dialogue models is the lack of training corpora; thus user simulation (Schatzmann et al., 2006) is employed to enhance the training procedure, where dialogue data can be collected through interactions between a user simulator and a target system. In spite of the success of Bayesian network methods, designing an appropriate reward function and manually crafting features limit the applicability of such approaches. As a noticeable defect, the state in these approaches is still manually defined, requiring a large amount of human labor.

A variety of neural models have recently been applied to the dialogue management task. Since a dialogue session naturally follows a sequence-to-sequence learning problem at the turn level, recurrent neural networks (RNN) have been proposed to model the process (Henderson et al., 2014b; Mrkšić et al., 2016; Wen et al., 2017). At each turn, the RNN takes as input the structured semantic representation produced by NLU (or the raw user utterance when combining NLU and DM together) and predicts the system action, where the hidden state of the RNN is utilized as the representation of the dialogue state. There are also some neural end-to-end models which directly take the dialogue context as input and generate a natural language response (Shang et al., 2015; Li et al., 2016; Serban et al., 2016a, 2017) in open-domain conversational systems. However, due to the vanishing gradient problem and the limited ability of state representation, an RNN has difficulty capturing the long-range context in dialogue. Hybrid Code Networks (Williams et al., 2017) propose to handle the state representation problem by combining rule-based and RNN-based models together, while the performance is still highly dependent on the hand-crafted rules.

Figure 2. Slot-level attention: word mentions in user utterance are mapped to semantic slots such as rating, cuisine, price, service, and location.

Memory networks provide a principled approach for modeling long-range dependency and making multi-hop reasoning, which has advanced many NLP tasks such as machine translation (Wang et al., 2016) and question answering (Sukhbaatar et al., 2015). Neural Turing Machines (Graves et al., 2014) were proposed to augment existing neural models with additional memory units to solve complicated tasks; the model is analogous to a Turing machine but is differentiable end-to-end. (Weston et al., 2014) proposed fully supervised memory networks which employ supervision signals not only from answer labels but also from pre-specified supporting facts. (Sukhbaatar et al., 2015) proposed end-to-end memory networks (MemN2N) which can be trained end-to-end without any intervention on which supporting fact should be used during training. The dynamic memory network proposed by (Kumar et al., 2016) uses a sentence-level attention mechanism to update its internal memory during multi-hop inference. The key-value memory network (Miller et al., 2016) encodes prior knowledge by introducing a key memory structure which stores facts to address the relevant memory value. There are already some works which introduced memory networks into the task of dialogue management (Perez and Liu, 2017), where memory networks are straightforwardly applied in a machine reading manner. In comparison, our model better captures the long-range history semantics of the dialogue session by explicitly memorizing and updating the dialogue act types and the values of semantic slots, implemented through a slot-value memory and an external memory.

Extracting semantic information from the user utterance is a key issue in task-oriented dialogue systems when combining NLU and DM together. Early methods used hand-crafted rules and semantic features, including NER and POS tags, to construct semantic features for the user utterance. (Henderson et al., 2014b) proposed to use the speech recognition confidence score as an additional feature. (Serban et al., 2016a, 2017) used hierarchical RNN models, where the user utterance is processed by a word-level RNN, and utterances are sequentially connected through an utterance-level RNN. (Mrkšić et al., 2016) proposed to use a convolutional neural network (CNN) for semantic feature extraction. However, existing approaches did not consider the fact that words in an utterance contribute differently to different slots, which is important for updating the dialogue state.

3. Memory-augmented Dialogue Management with Slot-Attention

Figure 3. Memory-augmented Dialogue Management (MAD): At each dialogue turn $t$, the model takes as input the current user utterance and the previous system response, and predicts the next dialogue act. The slot-value memory is updated with an attentive read of the user utterance by a slot-level attention mechanism, while the external memory is read and updated by the controller. The memory controller along with the two memory modules predicts the next dialogue act of the system by a classifier.

3.1. Task Definition

This paper deals with task-oriented dialogue management. We start by defining the input and output of our model. At the current turn $t$ of a dialogue, given a user utterance $U_t$ along with the system response $R_{t-1}$ of the previous turn, the task of the dialogue management module is to predict the next system dialogue act $A_t$ that will be utilized to generate a natural language utterance. This procedure can be formalized as follows:

$A_t = \operatorname{argmax}_{A} P(A \mid U_t, R_{t-1}; \theta)$

where $U_t$ and $R_{t-1}$ are the user utterance at the current turn and the system response at the previous turn, respectively, and $A_t$ is the next dialogue act which can be used to generate the system response. $\theta$ represents the parameters of the model. The next system response $R_t$ will be generated from $A_t$ by a natural language generator, which is beyond the scope of this paper.

To exemplify the concept of dialogue act in our model, we take the task of restaurant reservation as an example, as shown in Table 1. A dialogue act ($A_t$) is composed of two elements: dialogue act type and slot-value pairs. The dialogue act type is a general description of user intents, such as Query where the user may search for some information, and Recommend where the user may ask for some recommendations. A slot-value pair represents a filled value for a slot (generally speaking, a slot in task-oriented dialogue systems is a category of semantic features which defines some key attribute or element for accomplishing a task), such as Location=north, Price=expensive and Cuisine=British. The slot-value pairs are usually regarded as the state representation in many dialogue state tracking studies (Henderson et al., 2014b). During the interaction, the filled value for each slot may be provided or updated by the user, and correspondingly, the dialogue state changes. For instance, when the user says "How about a British restaurant in north part of town.", two slot-value pairs, Cuisine=British and Location=north, will be updated. However, not all slot-value pairs mentioned in the context are to be addressed in the dialogue act of the system response. We thus introduce an auxiliary variable Mask, a binary vector whose dimension $N$ is the number of slots, to decide which slot-value pairs are to be included in the next dialogue act. As shown in Table 1, the slots that appear in the dialogue act are only Cuisine and Location, and their mask value is set to 1. In previous dialogue turns, the values of other slots may have already been mentioned, but they are useless for the system response of this turn, and their Mask value is 0. Generally speaking, a dialogue act can be viewed as the structured semantic representation of a natural language sentence.
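The dialogue act structure above (an act type, slot-value pairs, and a mask) can be sketched as a small data structure. The slot ontology and the `to_semantic_form` helper below are illustrative assumptions, not part of the paper:

```python
from dataclasses import dataclass, field
from typing import Dict

SLOTS = ["Rating", "Cuisine", "Price", "Service", "Location"]  # illustrative ontology

@dataclass
class DialogueAct:
    act_type: str                  # e.g. "Query" or "Recommend"
    slot_values: Dict[str, str]    # latest filled values, e.g. {"Cuisine": "British"}
    mask: Dict[str, int] = field(default_factory=dict)  # 1 = include slot in this act

    def to_semantic_form(self) -> str:
        # Assemble only the masked slot-value pairs, as in Table 1.
        pairs = [f"{s}={self.slot_values[s]}" for s in SLOTS
                 if self.mask.get(s, 0) == 1 and s in self.slot_values]
        return f"{self.act_type}({', '.join(pairs)})"

act = DialogueAct(
    act_type="Query",
    slot_values={"Cuisine": "British", "Location": "north", "Price": "cheap"},
    mask={"Cuisine": 1, "Location": 1, "Price": 0},
)
print(act.to_semantic_form())  # Query(Cuisine=British, Location=north)
```

Note how Price, although mentioned earlier in the dialogue, is masked out of this turn's act, mirroring the Mask row of Table 1.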

3.2. Overview

As shown in Figure 3, the memory-augmented dialogue management model has two novel memory components, namely the slot-value memory and the external memory $M^e$. The slot-value memory consists of a static slot memory ($M^s$) and a dynamic value memory ($M^v$), where each memory unit $M^s(i)$ in $M^s$ is mapped to a unique unit $M^v(i)$ in $M^v$. $M^s$ remains unchanged during the interaction, while $M^v$ and $M^e$ are updated at each turn $t$. We also design an RNN-based memory controller, with state $s_t$, which controls the read and write of the slot-value memory and the external memory. The slot-value memory is updated with an attentive read of the user utterance by a slot-level attention mechanism, while the external memory is read and updated by the controller. The memory controller along with the two memory modules predicts the next dialogue act of the system by a set of classifiers.

Let $U_t = (w_1, w_2, \ldots, w_n)$ and $R_{t-1} = (w'_1, w'_2, \ldots, w'_m)$ denote the word embedding sequences of the user utterance at turn $t$ and the preceding system response at turn $t-1$, respectively (note that $R_t$ is the system response at turn $t$, which is to be generated from the predicted $A_t$), where $w_i$ and $w'_j$ are word embeddings, and $n$ and $m$ are the lengths of the two sequences. At each turn $t$, our model works in the following procedure:

1. Memory Read: The controller reads information from the value memory and the external memory. The read of $M^v$ is conditioned on the controller state ($s_{t-1}$) and the value memory ($M^v_{t-1}$) at the previous turn, and the slot memory, formally as follows:

(1)  $r^v_t = f_{rv}(s_{t-1}, M^v_{t-1}, M^s)$

and the read of the external memory is conditioned on the controller state and the external memory at the previous turn:

(2)  $r^e_t = f_{re}(s_{t-1}, M^e_{t-1})$

Inspired by (Graves et al., 2014), we introduce content-based addressing for the memory read. $r^v_t$ and $r^e_t$ are content vectors read from the slot-value memory and the external memory, respectively.

2. Controller State Update: The controller state is then updated by the information read from the value memory and the external memory, and the content of $U_t$ and $R_{t-1}$:

(3)  $s_t = \mathrm{GRU}\big(s_{t-1}, [r^v_t; r^e_t; u_t; r_{t-1}]\big)$

where GRU stands for gated recurrent units (Cho et al., 2014), and $[\cdot;\cdot]$ denotes the concatenation of vectors. For simplicity, an utterance ($u_t$ for $U_t$, and $r_{t-1}$ for $R_{t-1}$) is represented by the averaged word embeddings, but more elaborate representation models are also applicable.
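A minimal NumPy sketch of the controller update in Eq. 3, assuming a standard GRU cell (Cho et al., 2014); the weight names, dimensions, and random initialization below are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(s_prev, x, P):
    """One GRU update s_t = GRU(s_prev, x), where x stands for the
    concatenation [r_v; r_e; u_t; r_prev] described in the text."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ s_prev)        # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ s_prev)        # reset gate
    h = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * s_prev))  # candidate state
    return (1 - z) * s_prev + z * h

rng = np.random.default_rng(0)
d_s, d_x = 8, 20                                       # toy dimensions
P = {k: rng.normal(scale=0.1, size=(d_s, d_x if k.startswith("W") else d_s))
     for k in ["Wz", "Wr", "Wh", "Uz", "Ur", "Uh"]}
s = np.zeros(d_s)                                      # initial controller state
x = rng.normal(size=d_x)                               # [r_v; r_e; u_t; r_prev]
s = gru_step(s, x, P)
```

The gating keeps the new state a convex combination of the old state and the candidate, which is what lets the controller retain information across turns.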

3. Memory Write: Memory vectors in $M^v$ and $M^e$ are updated based on $s_t$ and their previous values:

(4)  $M^v_t = f_{wv}(M^v_{t-1}, s_t)$
(5)  $M^e_t = f_{we}(M^e_{t-1}, s_t)$

The output at turn $t$ is obtained based on $s_t$, $M^v_t$ and $M^e_t$. The output consists of the elements of a dialogue act, that is, the dialogue act type, slot-value pairs and a mask. Note that the slot memory $M^s$ is static and does not need to be updated.

3.3. Slot-Value Memory

The slot-value memory tracks the dialogue state by storing and updating the value of each semantic slot during interaction. It is composed of two components, the slot memory and the value memory, and both are composed of the same number ($N$) of column vectors. The slot memory $M^s$ is kept constant during the dialogue, with each column vector $M^s(i)$ corresponding to a semantic slot $i$, such as Location, Price, or Cuisine. Inspired by (Miller et al., 2016), each slot memory unit $M^s(i)$ in our model acts as an index which helps to locate the content in $M^v$. In our proposed model, we further apply the slot memory units to extract slot-relevant information from the user utterance. Thus we keep $M^s$ unchanged during training and test time, and $M^s(i)$ is initialized with the averaged embeddings of the words in slot $i$.

The value memory $M^v$ stores the value of each slot in $M^s$. During the dialogue, the value of a slot may be added into the memory when a new slot is mentioned, or an old value can be updated to a new value of a previously mentioned slot. That is, each memory unit in the value memory stores the latest value (possibly empty) of a semantic slot.

Read from the slot-value memory In our model, the main function of the slot-value memory is to trace the latest value of each slot, which is critical for predicting the slot-value pairs in the dialogue act. However, the effect of the slot-value memory on the state update of the controller is not straightforward. Thus, we employ a simple method for the read from the slot-value memory, namely the average of the vectors in the value memory:

(6)  $r^v_t = \frac{1}{N} \sum_{i=1}^{N} M^v_{t-1}(i)$

where $N$ is the number of slots.

Write to the slot-value memory The write to $M^v(i)$ depends on slot addressing, which decides how much information should be updated for each slot given a user utterance. Ideally, the value memory is supposed to update its values for all slots that are mentioned in a user utterance. For example, when the user inputs the utterance "I want a Chinese restaurant", the model updates slot Cuisine with a new value Chinese.

Inspired by (Graves et al., 2014; Miller et al., 2016), we apply a slot addressing technique to decide the amount of information that should be written to each value memory vector of the corresponding slot given a user utterance:

(7)  $M^v_t(i) = g^i_t\, c^i_t + (1 - g^i_t)\, M^v_{t-1}(i)$

The first term is the new information obtained from the attentive representation ($c^i_t$) of utterance $U_t$, and the second term is the old information maintained. The attentive representation of the utterance, described soon, essentially decides the relatedness of the user utterance to slot $i$. $g^i_t$ is a gate which controls how much should be updated, and it depends on the attentive read $c^i_t$ and the last system response $R_{t-1}$:

(8)  $g^i_t = \sigma\big(W_g [c^i_t; r_{t-1}] + b_g\big)$

If utterance $U_t$ mentions slot $i$, $g^i_t$ will be large and the corresponding value memory unit will be updated substantially; otherwise much less information will be updated with a smaller $g^i_t$. In order to better train these gates, we employ additional heuristic supervision on the weights, as defined in Section 3.7 (see Eq. 26).
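The gated write of Eq. 7 can be sketched as follows; the slot count, dimensions, and gate values are toy examples (in the model the gates come from Eq. 8, not by hand):

```python
import numpy as np

def write_slot_value(M_v, contexts, gates):
    """Gated update of the value memory (Eq. 7, per slot i):
    M_v[i] <- g_i * c_i + (1 - g_i) * M_v[i]."""
    g = gates[:, None]                 # broadcast each gate over the embedding dim
    return g * contexts + (1 - g) * M_v

M_v = np.zeros((2, 3))                           # 2 slots, dim-3 value vectors
contexts = np.array([[1., 1., 1.],               # attentive reads c_i
                     [2., 2., 2.]])
gates = np.array([1.0, 0.0])                     # slot 0 mentioned, slot 1 not
M_v_new = write_slot_value(M_v, contexts, gates)
```

With a fully open gate, slot 0's value is overwritten by the attentive read; with a closed gate, slot 1 keeps its old (here empty) value.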

Figure 4. Slot-level attention mechanism for updating the slot-value memory. For each slot $i$, the attention score for each word is calculated based on the word embeddings and the slot memory $M^s(i)$. The context vector $c^i_t$ is the weighted sum of the word embeddings of the utterance. Finally, the value memory is updated based on the previous value vector and the context vector. Note that the attention mechanism is applied to each slot $i$.

3.4. Slot-level Attention

The context vector $c^i_t$ in the above section is an attentive representation of utterance $U_t$, conditioned on the $i$-th slot vector. More formally, for a user utterance $U_t = (w_1, w_2, \ldots, w_n)$, we compute attention weights ($a^i_1, \ldots, a^i_n$), where each weight indicates the similarity of a word embedding $w_j$ to a slot memory unit $M^s(i)$, as follows:

(9)   $e^i_j = w_j^{\top} M^s(i)$
(10)  $a^i_j = \dfrac{\exp(e^i_j)}{\sum_{k=1}^{n} \exp(e^i_k)}$
(11)  $c^i_t = \sum_{j=1}^{n} a^i_j\, w_j$

For the previous example, the weight between the word Chinese and slot Cuisine will be large, while the weights between other words and this slot will be much smaller. The learning of $a^i_j$ is also supervised, as shown in Section 3.7 (see Eq. 26).
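A minimal NumPy sketch of the slot-level attention in Eqs. 9-11, using dot-product scores (the toy 2-d embeddings below are illustrative; in the model they are learned):

```python
import numpy as np

def slot_attention(word_embs, slot_vec):
    """Attentive read of an utterance for one slot:
    dot-product scores (Eq. 9), softmax normalization (Eq. 10),
    and a weighted sum as the context vector (Eq. 11)."""
    scores = word_embs @ slot_vec           # e_j, shape (n,)
    scores = scores - scores.max()          # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    c = a @ word_embs                       # context vector, shape (d,)
    return a, c

# toy embeddings: "chinese" is aligned with the Cuisine slot vector
words = {"i": [0., 0.], "want": [0., 1.], "chinese": [1., 0.]}
embs = np.array([words[w] for w in ["i", "want", "chinese"]])
cuisine = np.array([1., 0.])
a, c = slot_attention(embs, cuisine)
```

As in the example in the text, the word "chinese" receives the largest weight with respect to the Cuisine slot.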

3.5. External Memory

The external memory is used to augment the representation capacity of the single hidden state of an RNN (Henderson et al., 2014b), and it is sometimes referred to as a memory state (Wang et al., 2016) in other works. Unlike the slot-value memory, the external memory is not endowed with explicit semantic meaning in our framework. The external memory $M^e$ consists of $K$ columns of $d$-dimensional memory unit vectors, which are read and written during the dialogue under the control of the memory controller.
Read The read vector at turn $t$ is a weighted sum of the memory units:

(12)  $r^e_t = \sum_{k=1}^{K} w^r_{t,k}\, M^e_{t-1}(k)$

where $K$ is the number of external memory units. The weight $w^r_{t,k}$ is given by

(13)  $w^r_{t,k} = g^r_t\, \hat{w}_{t,k} + (1 - g^r_t)\, w^r_{t-1,k}$

where $g^r_t$ is an update gate which controls the amount of $w^r_{t-1,k}$ to be updated, and $\hat{w}_{t,k}$ is a weight controlling the new information to read from $M^e_{t-1}(k)$, conditioned on the state of the controller $s_{t-1}$:

(14)  $g^r_t = \sigma(W_r s_{t-1} + b_r)$
(15)  $\hat{w}_{t,k} = \operatorname{softmax}_k\big(\cos(s_{t-1}, M^e_{t-1}(k))\big)$

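A sketch of the gated, content-based read (Eqs. 12-15), assuming cosine similarity for content addressing as in (Graves et al., 2014); the exact parameterization of the gate and scores is an assumption, and the gate is supplied as a scalar rather than computed from the controller state:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_external(M_e, w_prev, s_prev, g):
    """Blend content-based weights (cosine similarity to the controller
    state, Eq. 15) with the previous read weights via the gate g (Eq. 13),
    then take the weighted sum of memory units (Eq. 12)."""
    sims = (M_e @ s_prev) / (np.linalg.norm(M_e, axis=1)
                             * np.linalg.norm(s_prev) + 1e-8)
    w_hat = softmax(sims)               # content-based addressing
    w = g * w_hat + (1 - g) * w_prev    # gated blend of old and new weights
    return w @ M_e, w                   # read vector and updated weights

rng = np.random.default_rng(1)
M_e = rng.normal(size=(4, 6))           # K=4 memory units, dim 6
w_prev = np.full(4, 0.25)               # uniform previous read weights
s_prev = rng.normal(size=6)
r, w = read_external(M_e, w_prev, s_prev, g=0.5)
```

Because both blended terms are probability distributions over the $K$ units, the resulting read weights remain a valid distribution.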
Write There are two operations during the write to the external memory: erase and add. The erase operation controls how much old information should be removed from the memory, and the add operation controls the addition of new information. Formally,

(16)  $M^e_t(k) = M^e_{t-1}(k) \odot \big(\mathbf{1} - w^r_{t,k}\, e_t\big) + w^r_{t,k}\, a_t$

where the first term is the information left after erasure by the erase vector $e_t$, and the second is the new information added by the add vector $a_t$. The scalar $w^r_{t,k}$ is the read weight on memory unit $k$, as defined in Eq. 13.

Both the erase vector $e_t$ and the add vector $a_t$ are obtained conditioned on the state of the controller $s_t$, as follows:

(17)  $e_t = \sigma(W_e s_t + b_e)$
(18)  $a_t = \tanh(W_a s_t + b_a)$
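The erase-then-add write of Eq. 16 can be sketched as follows; the weights and erase/add vectors are set by hand here for illustration (in the model they come from Eqs. 13, 17 and 18):

```python
import numpy as np

def write_external(M_e, w, erase, add):
    """Erase-then-add write (Eq. 16), applied to every memory unit k:
    M_e[k] <- M_e[k] * (1 - w_k * erase) + w_k * add."""
    return M_e * (1 - np.outer(w, erase)) + np.outer(w, add)

M_e = np.ones((3, 4))                    # K=3 units, dim 4
w = np.array([1.0, 0.0, 0.0])            # only unit 0 is addressed
erase = np.ones(4)                       # fully erase addressed units
add = np.full(4, 5.0)                    # new content to add
M_e_new = write_external(M_e, w, erase, add)
```

Unit 0 is fully overwritten with the new content, while the unaddressed units are untouched, showing how the write weights localize the update.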
Figure 5. Dialogue act prediction of MAD: $a_t$ is the dialogue act type of the system response at turn $t$. $m_t$ is the mask for slot-value pairs at turn $t$, and the color of each mask block indicates its value, with white indicating 1 and black indicating 0. $v^i_t$ represents the value of slot $i$. The predictions of $m^i_t$ and $v^i_t$ are both based on $M^v_t(i)$.

3.6. Dialogue Act Prediction

As illustrated in Figure 5, our memory-augmented network predicts a dialogue act as follows: first, the dialogue act type is predicted by a dialogue act type classifier; second, each slot $i$ is associated with a binary classifier ($m^i_t$) that decides whether the $i$-th slot should be included in the final dialogue act; third, if a slot is selected, the value of the slot is predicted by a value classifier. The final dialogue act is assembled from these predicted results.

Predicting the dialogue act type: this classifier outputs a distribution over dialogue act types such as Inform, Request, and Recommendation. It is implemented by an MLP conditioned on the controller state and all memory units:

(19)  $P(a_t) = \operatorname{softmax}\big(\mathrm{MLP}([s_t; M^v_t; M^e_t])\big)$

where $a_t$ is one of the dialogue act types.

Predicting a slot: there is a slot mask which controls the slots to be included in the final dialogue act. There is a binary classifier for each slot, conditioned on the controller state $s_t$, the external memory $M^e_t$ and its corresponding value memory unit $M^v_t(i)$:

(20)  $m^i_t = \sigma\big(\mathrm{MLP}([s_t; M^e_t; M^v_t(i)])\big)$

where $m^i_t \in \{0, 1\}$; $m^i_t = 1$ indicates that slot $i$ should be included in the next dialogue act.

Predicting the value of a slot: once we know which slots should be included in the dialogue act, we need to decide which value of each slot should be mentioned. This is given by a classifier which estimates a probability distribution over all the values for a slot:

(21)  $P(v^i_t) = \operatorname{softmax}\big(\mathrm{MLP}([s_t; M^v_t(i)])\big)$

where $v^i_t$ ranges over all the values of slot $i$.
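The three prediction heads can be sketched as follows. Reducing each "MLP" to a single random matrix and the exact feature concatenations are simplifying assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_act(s, M_v, M_e, heads):
    """Dialogue act type (Eq. 19), per-slot mask (Eq. 20),
    and per-slot value distribution (Eq. 21)."""
    flat = np.concatenate([s, M_v.ravel(), M_e.ravel()])
    act_probs = softmax(heads["act"] @ flat)
    mask, values = [], []
    for i in range(M_v.shape[0]):
        f_i = np.concatenate([s, M_e.ravel(), M_v[i]])
        mask.append(sigmoid(heads["mask"][i] @ f_i))
        values.append(softmax(heads["val"][i] @ np.concatenate([s, M_v[i]])))
    return act_probs, np.array(mask), values

rng = np.random.default_rng(2)
d, N, K, A, V = 4, 2, 3, 5, 6   # state dim, slots, ext. units, act types, values
s = rng.normal(size=d)
M_v, M_e = rng.normal(size=(N, d)), rng.normal(size=(K, d))
heads = {
    "act": rng.normal(size=(A, d + N * d + K * d)),
    "mask": rng.normal(size=(N, d + K * d + d)),
    "val": rng.normal(size=(N, V, d + d)),
}
act_probs, mask, values = predict_act(s, M_v, M_e, heads)
```

The final act is then assembled by taking the argmax act type and, for each slot with mask above 0.5, the argmax value.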

3.7. Loss Function

We adopt cross entropy as our objective function. There are three terms in the function, corresponding to the prediction of dialogue act types ($\mathcal{L}_a$), slot-value pairs ($\mathcal{L}_v$), and the slot mask ($\mathcal{L}_m$), as presented in the previous section.

The loss function is defined as follows:

(22)  $\mathcal{L} = \mathcal{L}_a + \alpha \mathcal{L}_v + \beta \mathcal{L}_m$

where

(23)  $\mathcal{L}_a = -\sum_{j=1}^{T} \hat{P}(a_t = j) \log P(a_t = j)$
(24)  $\mathcal{L}_v = -\sum_{i=1}^{N} \sum_{j=1}^{V_i} \hat{P}(v^i_t = j) \log P(v^i_t = j)$
(25)  $\mathcal{L}_m = -\sum_{i=1}^{N} \big[\hat{m}^i_t \log m^i_t + (1 - \hat{m}^i_t) \log (1 - m^i_t)\big]$

where $T$ is the number of dialogue act types, $V_i$ is the number of values for slot $i$, $\hat{P}$ and $\hat{m}^i_t$ are the gold distributions obtained from the training data, and $P$ and $m^i_t$ are defined in the preceding subsection. $\alpha$ and $\beta$ are hyper-parameters.
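A sketch of the combined objective of Eqs. 22-25 for one training example, with the gold act and values given as indices and the gold mask as 0/1 flags; `alpha`/`beta` are the hyper-parameters above, and skipping the value term for unfilled slots is an assumption:

```python
import numpy as np

def mad_loss(P_act, gold_act, P_vals, gold_vals, m, gold_m,
             alpha=1.0, beta=1.0):
    """Cross-entropy over act types (Eq. 23), slot values (Eq. 24),
    and the binary slot mask (Eq. 25), combined as in Eq. 22."""
    eps = 1e-12
    L_a = -np.log(P_act[gold_act] + eps)
    L_v = -sum(np.log(P_vals[i][gold_vals[i]] + eps)
               for i in range(len(P_vals)) if gold_vals[i] is not None)
    L_m = -sum(g * np.log(p + eps) + (1 - g) * np.log(1 - p + eps)
               for p, g in zip(m, gold_m))
    return L_a + alpha * L_v + beta * L_m

# near-perfect predictions give a near-zero loss
loss = mad_loss(np.array([0.99, 0.01]), 0,
                [np.array([0.99, 0.01]), np.array([0.5, 0.5])],
                [0, None],
                np.array([0.99, 0.01]), [1, 0])
```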

Furthermore, we found that a performance improvement can be observed when applying weak heuristic supervision on the intermediate variables, and the supervision signal can be easily obtained by simple string matching rules. This is a common practice for training sophisticated neural networks (Liu et al., 2016; Kiddon et al., 2016). More specifically, we apply extra supervision on the update gate of the value memory (see Eq. 8) and the attention weights of an utterance (see Eq. 10). This intermediate supervision is applied with a two-stage training schema: first, we pretrain our model only with the heuristic loss ($\mathcal{L}_h$, see below) for several epochs, and then we train the model further with the loss ($\mathcal{L}$) defined by Eq. 22 for the remaining epochs.

The heuristic supervision loss is defined as follows:

(26)  $\mathcal{L}_h = -\sum_{i=1}^{N} \Big[ \hat{g}^i_t \log g^i_t + (1 - \hat{g}^i_t) \log (1 - g^i_t) + \sum_{j=1}^{n} \hat{a}^i_j \log a^i_j \Big]$

where $n$ is the number of words in $U_t$ at turn $t$ and $i$ is the slot index.

Note that $\hat{g}^i_t$ and $\hat{a}^i_j$ represent the gold distributions of the update and attention weights, respectively. For each word $w_j$ of utterance $U_t$, if $w_j$ appears in the values of slot $i$, then $\hat{a}^i_j = 1$ and $\hat{g}^i_t = 1$; otherwise $\hat{a}^i_j = 0$ and $\hat{g}^i_t = 0$. This means that if a value of a slot appears in the utterance, the value (also the word) should be attended to w.r.t. that slot, and the update weight should be 1. In this way, the value memory of the corresponding slot can be updated accordingly.
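The string-matching construction of the gold labels can be sketched as follows; the tokenization and the toy slot ontology are illustrative assumptions:

```python
def heuristic_labels(tokens, slot_values):
    """Gold attention / gate labels by string matching: a word gets
    attention label 1 for slot i iff it appears among slot i's known
    values, and slot i's gate label is 1 iff any word matched."""
    gold_attn, gold_gate = {}, {}
    for slot, values in slot_values.items():
        hits = [1 if tok in values else 0 for tok in tokens]
        gold_attn[slot] = hits
        gold_gate[slot] = int(any(hits))
    return gold_attn, gold_gate

tokens = "i want a chinese restaurant".split()
ontology = {"Cuisine": {"chinese", "british"}, "Location": {"north", "south"}}
attn, gate = heuristic_labels(tokens, ontology)
```

For the running example, only "chinese" matches a Cuisine value, so Cuisine gets a gate label of 1 and an attention label on that word, while Location is left untouched.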

4. Experiment

4.1. Data Preparation

We first evaluated our memory-augmented dialogue management model on two synthetic datasets adapted from the dialog bAbI dataset (Bordes and Weston, 2016) and the Second Dialogue State Tracking Challenge dataset (Henderson et al., 2014a), which were originally proposed for end-to-end dialogue systems and the dialogue state tracking task, respectively. However, both of the above datasets are small-scale. To better assess the performance of our proposed model on large-scale datasets, we collected a new Chinese dialogue management dataset consisting of real conversations from the flight booking domain.

4.1.1. DMBD: Dialogue Management bAbI Dataset

The original dialogue bAbI dataset (DBD) is designed to evaluate the performance of end-to-end dialogue systems on the task of restaurant reservation. In (Bordes and Weston, 2016), the task is formulated as a machine comprehension task by applying the MEMN2N model (Sukhbaatar et al., 2015), treating the dialogue context and the last user utterance as story and question respectively; the system response is selected from a fixed answer set. The DBD dataset is composed of five manually constructed subtasks: issuing API calls, updating API calls, displaying options, providing extra information, and conducting full dialogues, which examine system performance on different tasks; the full-dialogue task is a combination of the first four. The data for these tasks were collected through a simulator based on an underlying knowledge base along with some manually crafted natural language patterns, so the simulator rules can be used to derive dialogue act annotations. For more details of DBD, please refer to (Bordes and Weston, 2016).

Informable slots Requestable slots
Name #Value Name
Cuisine 10 Address
Location 10 Telephone
Price 3
Size 4
Table 2. Ontologies of the DMBD dataset. An informable slot is one whose values the user can provide to constrain a KB query, while a requestable slot can only be queried from the KB without any user-provided value.

Since dialogue act types and slot-value pairs are not annotated in DBD, we had to annotate them ourselves to train our model. Fortunately, the system response utterances are easy to annotate because the original data are generated from an underlying knowledge base and simple natural language patterns. We thus performed reverse engineering, conducting automatic annotation with manually crafted rules over the DBD knowledge base to label the dialogue act type and slot-value pairs of each utterance. This processed dataset for dialogue management is termed the Dialogue Management bAbI Dataset (DMBD) in the following sections.
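The reverse-engineering step could look roughly like the following sketch, which matches KB values back into a system utterance; the act-typing rule shown is a placeholder, not the paper's actual annotation rules.

```python
# Hedged sketch of automatic dialogue act annotation: DBD utterances are
# generated from templates over a KB, so matching KB values back into an
# utterance recovers its slot-value pairs. Real rules would also map
# templates to the fifteen DA types; "inform"/"other" here is a placeholder.
def annotate(utterance, kb_values):
    """kb_values maps slot name -> set of surface values."""
    words = utterance.lower().split()
    pairs = [(slot, v) for slot, values in kb_values.items()
             for v in values if v in words]
    act_type = "inform" if pairs else "other"  # placeholder act typing
    return act_type, pairs
```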

In DMBD, the original user and system utterances are preserved to serve as the input of each dialogue turn, while the output is changed from the system utterance to its dialogue act, as detailed in Table 1. The resulting DMBD dataset has fifteen dialogue act types, four informable slots, and two requestable slots, as shown in Table 2. Note that DMBD shares the same KB with DBD. As the requestable slots are only used for issuing API calls, in our implementation we design a special informable slot called Ask Slot, which tracks the slots that are to be queried; the values of Ask Slot are the names of the requestable slots.

4.1.2. DM-DSTC: Dialogue Management of the Second Dialogue State Tracking Challenge dataset

The dialogues in DMBD are collected via a simulator with hand-crafted templates and are thus more or less synthetic. To evaluate the performance of our model on a real-world dialogue corpus, we conducted another experiment based on DSTC2, a real-world dialogue dataset that is also about restaurant reservation.

The original DSTC2 dataset is for dialogue state tracking, in which the output at each turn is the set of filled slots and the values that the user has presented so far. The dialogue act of the system utterance is also annotated and is thus directly utilized as the model output. We transform the original DSTC2 dataset to our dialogue management setting, referred to as DM-DSTC. The ontologies of dialogue act types and slots in the original dataset are directly reused in DM-DSTC.

The resulting DM-DSTC is composed of four informable and nine requestable slots. The average number of values per informable slot is 54, which is much higher than in DMBD; this enhanced complexity reflects the characteristics of real-world data, which is more stochastic and noisy. As in the DMBD experiment, we also created a special slot for the requestable slots. Some statistics of DM-DSTC are shown in Table 3.

Informable slots Requestable slots
Name #Value Name
Food 91 Addr, Area, Food
Pricerange 3 Phone, Pricerange
Res_name 113 Postcode, Signature
Area 5 Res_name
Table 3. Ontology of the DM-DSTC dataset. Res_name indicates the restaurant name. The average number of values per informable slot is 54, much higher than in the DMBD dataset; this enhanced complexity reflects the characteristics of real-world dialogue data.

4.1.3. ALDM: Alibaba Dialogue Management Dataset

Since the sizes of the above two datasets are limited, we propose ALDM to test our model's performance on a large-scale dataset. ALDM is a Chinese dataset consisting of real conversations from the flight-booking domain, in which the system is supposed to acquire the departure city, arrival city, and departure date from the user to book a flight ticket. To better fit our model, the departure date values in the corpus are preprocessed into a uniform MM.DD format, e.g., 12.25 for Dec. 25th. ALDM is much larger than the other two datasets: there are 15,330 sessions for training, 7,665 for validation, and 3,832 for test. On average, a session has 5 turns. The average sentence length is 4, and most user responses contain only one word, since users typically provide only the departure or arrival city, or the departure date. One difference from the other two datasets is that the departure city slot and the arrival city slot share the same value list, which raises an additional difficulty: the model must identify which slot a city name in the user utterance should fill. To handle this, the model should fill slots conditioned on the dialogue context. For example, if the user responds with Beijing to the last system response Where are you flying from?, the value Beijing should fill the departure city slot. Another difference is that there are no requestable slots, because ALDM is system-driven.
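A sketch of the MM.DD normalization; the source formats handled here are assumptions (the actual corpus likely contains Chinese date expressions handled by additional rules).

```python
# Normalize a date string into the uniform MM.DD format used in ALDM.
# The accepted input formats are illustrative assumptions; unparseable
# strings are returned unchanged.
from datetime import datetime

def normalize_date(text):
    for fmt in ("%d %b %Y", "%B %d", "%m/%d"):
        try:
            d = datetime.strptime(text, fmt)
            return f"{d.month:02d}.{d.day:02d}"
        except ValueError:
            continue
    return text  # leave unparseable dates untouched
```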

DA type Informable Slots
ask_dep_loc Name #Value
ask_arr_loc Dep_city 174
ask_dep_date Arr_city 174
offer, end Date 100
Table 4. Ontology of the ALDM dataset. An ask_ DA type means the system is asking the user for information, offer means the system is giving a recommendation, and end means the dialogue session is done. Dep_city and Arr_city represent the departure city and arrival city slots respectively, and they share the same value list. The values of the Date slot are transformed into a uniform MM.DD format.

As shown in Table 4, ALDM is composed of 3 informable slots, with an average of 150 values per slot, which is remarkably larger than in the other two datasets. There are also 5 dialogue act types, as shown in Table 4.

4.2. Experimental Setup

Our model is implemented with Tensorflow (Abadi et al., 2016). The word embeddings used for each dataset were pretrained on its own dialogue corpus (15,000 sessions in DMBD, 3,000 per task; 2,118 sessions in DM-DSTC; and 26,827 sessions in ALDM) using the GloVe algorithm (Pennington et al., 2014). The dimensions of the word embeddings, memory column vectors, and state vectors were all set to 128, and there are 8 columns in the external memory. We first pretrain our model with the heuristic loss defined above for 2 epochs and then continue to train it using the loss defined in Eq. 22.

The two weighting parameters in the loss are not constant during training. More specifically, in the first 7 epochs, the first coefficient increases linearly from 0 to 1 while the second remains zero; in the following 7 epochs, the second also rises linearly from 0 to 1 with the first unchanged. The reason for this setting is that the value update process in the slot-value memory has a strong influence on the training of the other components. All other parameters are initialized from a random uniform distribution.
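The staged warm-up of the two coefficients can be sketched as below; the names lambda1/lambda2 are placeholders for the paper's coefficient symbols.

```python
# Linear staged warm-up of the two loss coefficients: the first ramps
# from 0 to 1 over epochs 0..6, the second over epochs 7..13, each
# staying at 1 afterwards.
def loss_coefficients(epoch, phase_len=7):
    def ramp(e):
        return max(0.0, min(1.0, e / (phase_len - 1)))
    lambda1 = ramp(epoch)
    lambda2 = ramp(epoch - phase_len)
    return lambda1, lambda2
```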

We used the train/valid/test partition of the original DBD for each task, where there are 1,000 sessions in each set; and the partition of DM-DSTC is 1412/353/353. For ALDM, we split the dataset into 15,330/7,665/3,832.

We trained our model using ADAM (Kingma and Ba, 2014) with a learning rate of 0.002 and the default momentum parameters. For each dataset, the model is trained for at most 15 epochs. We use the model parameters with the lowest validation loss for testing.

4.3. Baseline

We included two types of baselines in the evaluation. The first type is to select a sentence as answer from a predefined candidate answer set in a machine comprehension manner, as described in (Bordes and Weston, 2016). The second type is to predict a structured dialogue act, the same as our model, where the models need to make predictions over all combinations of dialogue act type and slot-value pairs.

In the baselines of the first type, each candidate answer is a natural language utterance that lexicalizes an underlying dialogue act (lexicalizing a dialogue act means converting it from a formal semantic representation to a natural language utterance). However, the candidate answer set is not complete: not all possible combinations of dialogue act type and slot-value pairs are included. In other words, the answer space of the first type is smaller than that of the second type, and the first setting is therefore easier than the second.

The baselines of the first type, which select an utterance from a predefined candidate answer set (Bordes and Weston, 2016), are listed as follows:

  • TF-IDF: A TF-IDF matching algorithm (Salton and McGill, 1986) which computes a cosine similarity score between the input (the whole dialogue history) and a candidate sentence; the sentence with the highest score is selected as the final answer. Both the input and the candidate sentence are represented by the average of their bag-of-words vectors.

  • TF-IDF(+ type): An enhanced version of TF-IDF by introducing additional entity type features.

  • Supervised Ebd: An information retrieval model based on trainable word embeddings. The similarity score between an input and a candidate sentence is the inner product of their averaged word embeddings. The model is trained with a margin ranking loss (Bai et al., 2009).

  • MEMN2N: Standard end-to-end memory networks (Sukhbaatar et al., 2015; Bordes and Weston, 2016). It stores the dialogue history information in a memory network and chooses a response by running multi-hop reasoning upon the history.

  • MEMN2N(+ match): A variant of MEMN2N which included additional features about entity types.
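As an illustration of the retrieval-style baselines above, the following sketch scores candidates by cosine similarity over bag-of-words vectors; the actual TF-IDF baseline additionally applies TF-IDF weighting, which is omitted here for brevity.

```python
# Bag-of-words retrieval sketch: represent history and candidates as
# word-count vectors and pick the candidate with highest cosine similarity.
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_candidate(history, candidates):
    h = bow(history)
    return max(candidates, key=lambda c: cosine(h, bow(c)))
```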

The baselines of the second type, which predict a structured dialogue act, the same as our proposed model, are as follows:

  • MEM: A memory network model which predicts dialogue acts. For each output structure (DA type, slot-value, and mask), a separate MEMN2N is introduced to make the prediction.

  • RNN: A recurrent neural network model with turn-level input and output. The dialogue act predictions (type and slot-value) are based on the hidden state at each time step.

  • MAD - SM: A variant of our proposed model without the slot-value memory. Predictions involving the slot-value memory are modified to use only the memory controller state.

  • MAD - Attn: A variant of our model without the slot-level attention mechanism. In this setting, the averaged word embeddings of an utterance are used to update the slot-value memory.

  • MAD - EM: A variant of our model without the external memory. Predictions involving the external memory are modified to use only the memory controller state, as in MAD-SM.

It should be noted that the MEMN2N and MEM baselines take a context-question pair as input at each round, which means they must recompute over the cumulated dialogue context at each turn. Thus, as the dialogue proceeds, the per-turn computation cost keeps growing. In our model, by contrast, the context information is stored in the memory, and the computation time of each turn is essentially constant.
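A toy cost model makes the contrast concrete; the unit costs are illustrative assumptions, not measurements.

```python
# Compare total work over a session: a baseline that re-reads the full
# cumulated context does work proportional to the turn index (quadratic
# total), while a constant per-turn memory update gives linear total work.
def total_costs(num_turns):
    rereading = sum(range(1, num_turns + 1))  # baseline: re-encode all turns
    incremental = num_turns                   # MAD-style: constant per turn
    return rereading, incremental
```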

4.4. Performance on DMBD

In this section, we evaluated the performance of our model and the baselines on the DMBD dataset. The prediction accuracy at both the turn level and the session level is reported, similar to (Bordes and Weston, 2016). Based on the distributions defined in Section 3.6, our model outputs the dialogue act with maximal probability, separately for DA type, slot-value, and mask. For DA type and mask, a prediction is judged correct only if the output matches the target. As mentioned in Section 3.1, mask is an auxiliary variable that filters undesired slot-value pairs out of a predicted dialogue act; thus for slot-value, we only need to correctly predict those slot-value pairs whose mask value is 1. The overall dialogue act is correct only if its DA type, slot-value, and mask are all correctly predicted, and a dialogue session is correct only if all dialogue acts in the session are correctly predicted. We term this session-level evaluation.
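The two metrics can be sketched as follows, representing each turn's prediction simply as correct or not (an illustrative simplification of the DA type/slot-value/mask check).

```python
# Turn-level vs. session-level accuracy: a session counts as correct
# only if every one of its turns is predicted correctly.
def accuracies(sessions):
    """sessions: list of lists of per-turn booleans (turn correct or not)."""
    turns = [t for s in sessions for t in s]
    turn_acc = sum(turns) / len(turns)
    session_acc = sum(all(s) for s in sessions) / len(sessions)
    return turn_acc, session_acc
```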

Metrics 1 Issuing 2 Updating 3 Displaying 4 Providing 5 Full
API calls API calls options options dialogs
TF-IDF (no type) 5.6 (0) 3.4 (0) 8.0 (0) 9.5 (0) 4.6 (0)
TF-IDF (+ type) 22.4 (0) 16.4 (0) 8.0 (0) 17.8 (0) 8.1 (0)
Nearest Neighbor 55.1 (0) 68.3 (0) 58.8 (0) 28.6 (0) 57.1 (0)
Supervised Ebd 100 (100) 68.4 (0) 64.9 (0) 57.2 (0) 75.4 (0)
MEMN2N (no match) 99.9 (99.6) 100 (100) 74.9 (2.0) 59.5 (3.0) 96.1 (49.4)
MEMN2N (+ match) 100 (100) 98.3 (83.9) 74.9 (0.0) 100 (100) 93.4 (19.7)
MEM 47.4 (0.1) 61.1 (0.1) 24.6 (0.1) 56.7 (0.8) 25.2 (0.1)
RNN 80.6 (0.1) 45.5 (0.0) 30.0 (0.0) 57.2 (0.0) 3.7 (0.0)
MAD 99.0 (94.2) 100 (100) 99.1 (90.6) 100 (100) 99.9 (97.8)
Table 5. The accuracy across all tasks and methods. The numbers in brackets are the accuracy at the session level, and numbers without brackets are at the turn level. A session is correct only if all the sentences in the session are predicted correctly.

4.4.1. Overall Performance Analysis

We first evaluated our proposed model based on the overall accuracy of dialogue act prediction, as shown in Table 5. The results of baselines of the first type are reprinted from their original paper (Bordes and Weston, 2016), because the partitions of training/validation/test data are the same as ours, and the results are hence directly comparable. Both turn-level and session-level results on all the five tasks are reported. We have the following observations:

  • MAD obtains the best performance on most of the tasks, reaching an accuracy of about 100% at both turn- and session-level evaluation, which shows the effectiveness of our proposed model. On Task 1, MAD is in second place: the Supervised Ebd and MEMN2N (+match) methods obtain 100% accuracy at both turn and session level, 1% higher than ours. MAD's deficit on Task 1 can be attributed to a latent rule in that task: if the user does not provide enough values to form a query, the agent requests the missing slot values in a fixed order (Cuisine, Location, Size, Price). This ordering rule is not essential for a practical application, where the agent can request values in an arbitrary order as long as it obtains all necessary values.

4.4.2. Fine-grained Performance Analysis

To better understand how the slot-value memory and the external memory influence the performance, we further analyzed the fine-grained prediction accuracy of MAD and its variants in addition to the overall dialogue act prediction. Evaluation on the fine-grained predictions is shown in Table 6. We have the following observations:

  • MAD-SM, the variant that ablates the slot-value memory module, obtains degraded overall accuracy compared to MAD. MAD-Attn, which removes the slot-level attention mechanism, works worse than MAD but still slightly better than MAD-SM on each task. The performance of MAD-EM drops even more than MAD-SM on all tasks except Task 1. The RNN model, which can be regarded as MAD without either the slot-value memory or the external memory, performs even worse on most of the 5 tasks.

  • The fine-grained results demonstrate the effectiveness of our proposed model more specifically. Here we can see that the accuracy of MAD on both slot-value and mask is 100%, while the prediction on DA type has very few errors. The high accuracy of slot-value prediction indicates that the slot addressing and the attentive question representation work well, which is attributed to the slot-value memory and attention supervision we applied. The contribution of the external memory is also shown by the high performance of DA type and mask prediction.

  • The slot-value memory leads to significant improvements in slot-value accuracy. In our model, the role of the slot-value memory is to extract semantic information about slots during the dialogue, so the ability to track slot-value information should decrease if the slot-value memory is removed. As shown in Table 6, the slot-value prediction accuracy of MAD-SM drops sharply, from 100% to around 30%. However, dialogue act type and mask prediction are not heavily affected, with accuracy still above 90%.

  • The slot-level attention mechanism applied to semantic information extraction has a remarkable influence on performance. In MAD-Attn, the slot-level attention mechanism is removed, and the value update is based on the averaged word embeddings of the user utterance. Intuitively, without attention the update of the slot-value memory cannot concentrate on relevant words, so slot-value prediction should suffer heavily. The experimental results support this hypothesis: the accuracy of slot-value prediction degrades remarkably, but is still better than that of MAD-SM since MAD-Attn retains the slot-value memory. The attention mechanism affects dialogue act type and mask prediction only slightly.

  • The external memory significantly improves DA type and mask accuracy by enhancing the representation capacity of the original RNN state. In MAD-EM, the external memory is removed, and the predictions involving it, namely DA type and mask, are changed to use the memory controller state, which is identical to the hidden state of an RNN model. Compared to MAD, the accuracy of MAD-EM decreases heavily. This gap is attributed to the enhanced representation capacity provided by the external memory, which helps the model capture longer-term temporal dependencies in dialogue.

From the above analysis, we can see that the slot-value memory mainly affects slot-value prediction, while the external memory mainly affects dialogue act type and mask prediction. However, the influence of the modules on performance is more complex: Table 6 shows that DA type and mask accuracy also decrease when the slot-value memory is removed, as does slot-value accuracy when the external memory is removed. This means the two memory networks in our model are coupled through the memory controller and can affect each other's performance.

Task 1 2 3 4 5
DA type MAD-SM 93.9 (65.9) 100 (100) 95.6 (58.2) 100 (100) 90.9 (11.9)
MAD-EM 95.8 (80.5) 65.7 (3.5) 56.3 (5.8) 100 (100) 17.8 (0)
MAD-Attn 99.5 (96.9) 100 (100) 99.0 (90.3) 100 (100) 99.9 (98.6)
MAD 99.0 (94.2) 100 (100) 99.1 (90.6) 100 (100) 99.9 (97.8)
slot-value MAD-SM 21.1 (0.3) 22.3 (0) 18.4 (0) 40.3 (0.1) 20.9 (0)
MAD-EM 100 (100) 95.3 (65.8) 27.5 (0.1) 100 (100) 22.6 (0)
MAD-Attn 26.8 (0.5) 24.8 (0) 27.5 (0) 41.3 (0.1) 31.4 (0)
MAD 100 (100) 100 (100) 100 (100) 100 (100) 100 (100)
mask MAD-SM 1.0 (1.0) 100 (100) 99.9 (99.9) 100 (100) 98.8 (6)
MAD-EM 99.1 (88.8) 87.8 (2.6) 87.8 (16.4) 100 (100) 66.8 (0)
MAD-Attn 100 (100) 100 (100) 100 (100) 100 (100) 100 (100)
MAD 100 (100) 100 (100) 100 (100) 100 (100) 100 (100)
Overall MAD-SM 77.2 (0.2) 78.9 (0) 70.7 (0) 57.3 (0.1) 59.6 (0)
MAD-EM 95.2 (78.2) 57.4 (0.2) 40.5 (0.0) 1.0 (1.0) 3.1 (0.0)
MAD-Attn 82.7 (0.5) 79.0 (0) 73.9 (0) 57.3 (0.1) 67.7 (0)
MAD 99.0 (94.2) 100 (100) 99.1 (90.6) 100 (100) 99.9 (97.8)
Table 6. Fine-grained performance on the DMBD dataset. We tested our proposed model and three of its variants at both the turn and session level; for each model, the dialogue act type, slot-value, mask, and overall prediction accuracy on each task is reported. The highest turn-level accuracy below 100% is shown in bold.

4.5. Performance on DM-DSTC

Although our proposed model obtains good results on DMBD, it should be noted that the performance reflected by the above results is somewhat optimistic, for two reasons: first, these dialogues are generated by rules and are thus much simpler than real dialogue data; second, the number of slots and values in DMBD is quite small, while in real applications it may become very large.

To assess the performance of our proposed model on real dialogue data, we conducted another experiment on DM-DSTC. Different from DMBD, there is only one task in the DM-DSTC dataset. We only report the results of the methods that predict a dialogue act as output. It should be pointed out that in this dataset, many values in the dialogue act annotations do not appear verbatim in user utterances (such as asian oriental); for those values we cannot provide precise attention supervision, which affects the performance of the slot-level attention. Moreover, the Res_name slot degrades accuracy because its values do not appear in the dialogue context but are queried from a knowledge base conditioned on previous search constraints, which is not consistent with our model setting. We report the fine-grained and overall accuracy at the turn level and session level in Table 7.

Metrics DA type slot-value mask All
MEM 62.5 (9.9) 14.2 (0.0) 71.0 (0.1) 0 (0.0)
RNN 50.9 (0.3) 14.3 (0.1) 61.8 (0.3) 0.1 (0.0)
MAD-SM 64.1 (13.6) 11.6 (0.1) 81.6 (0.4) 17.1 (0.1)
MAD-Attn 64.6 (12.5) 18.5 (0.1) 80.8 (1.0) 16.9 (0.0)
MAD-EM 44.9 (2.3) 17.5 (0.1) 69.7 (0) 5.7 (0.0)
MAD 63.8 (11.0) 27.3 (0.1) 82.1 (1.3) 18.8 (0)
Table 7. Fine-grained and overall accuracy on the DM-DSTC dataset. The numbers in brackets are the accuracy at the session level, and the numbers without brackets are at the turn level.

The results in Table 7 demonstrate that our model remains superior to the vanilla memory network model. Compared to MEM and RNN, our proposed method obtains higher accuracy on turn-level overall prediction, as well as on dialogue act type and mask prediction. Although MEM's accuracy on DA type, slot-value, and mask prediction is only slightly lower than ours, its overall turn-level accuracy is far below our proposed model's. This can be attributed to the framework of MEM, where the DA type, mask, and slot-value predictors are trained separately, while in our model these three tasks are trained jointly. For the variants of MAD, the experimental results are consistent with what we observed on DMBD. MAD-SM obtains lower accuracy on slot-value prediction compared to MAD, while maintaining similar accuracy on DA type and mask. MAD-Attn behaves similarly to MAD-SM relative to MAD, but its slot-value accuracy is clearly higher than that of MAD-SM since it retains the slot-value memory network. MAD-EM, which removes the external memory, obtains significantly lower accuracy on dialogue act type and mask prediction, and its slot-value accuracy is also reduced.

We can see that slot-value prediction is the bottleneck for improving overall accuracy. This can be attributed to a characteristic of DM-DSTC: many slot values do not appear verbatim in the user utterance, which makes it hard to acquire accurate attention supervision, so the model's capacity for extracting semantic features from user utterances is negatively affected. The accuracy of DA type and mask prediction, although far better than that of slot-value, is still not as high as on DMBD. This can be attributed to the characteristics of real-world data, which contains much more uncertainty and noise than DMBD. More specifically, across sessions the DA type of the agent response varies considerably even given the same dialogue context. Moreover, the agent response in the original DSTC2 dataset is conditioned on the knowledge base query result, which is not provided, and this also restricts our model's ability to predict DA type and mask.

4.6. Performance on ALDM

Metrics DA type Slot-value Mask All
MEM 64.9 (1.4) 73.5 (0.0) 100.0 (100.0) 0.0 (0.0)
RNN 60.0 (0.0) 80.0 (0.0) 100.0 (100.0) 40.0 (0.0)
MAD-SM 60.3 (0.0) 80.0 (0.0) 100.0 (100.0) 40.3 (0.0)
MAD-Attn 76.4 (15.7) 100.0 (100.0) 100.0 (100.0) 76.4 (17.1)
MAD-EM 76.4 (15.4) 98.6 (92.8) 100.0 (100.0) 74.9 (14.2)
MAD 76.7 (16.3) 100.0 (100.0) 100.0 (100.0) 76.7 (16.3)
Table 8. Fine-grained and overall accuracy on the ALDM dataset. The numbers in brackets are the accuracy at the session level, and the numbers without brackets are at the turn level.

We report the results of the methods that output a structured dialogue act, as in Section 4.5. Mask prediction is relatively simple for ALDM, since most slot values appear only in the last system response, so all models reach an accuracy of 100%. The following analysis therefore focuses on DA type and slot-value.

A difference of ALDM from the other two datasets is that it is more system-driven, which makes it hard for our model to correctly predict the order of the ask_ DA types. For instance, ask_dep_loc is based only on the currently filled slots: if the departure location has been provided by the user, the system can ask for either the arrival location or the departure date in the next turn, which makes the next DA type difficult to predict. Thus the DA type accuracy is not as good as on DMBD. However, when all but one of the slots are already filled (out of the total number of slots needed to complete a booking), the next slot to be asked is determinate. The dialogue state thus still has an impact on DA type prediction, as shown by the results of MAD-SM and RNN, the two models that remove the slot-value memory.

Although the average number of slot values in ALDM is much larger than in the other two datasets, we still obtain high slot-value accuracy. This can be attributed to the high data quality of ALDM, which was carefully cleaned before training. Removing the slot-value memory (RNN and MAD-SM) decreases slot-value accuracy remarkably, which shows the importance of the slot-value memory in maintaining dialogue states. As seen in Table 8, the slot-value accuracy of our full model is the same as that of MAD-Attn. This is due to the nature of the ALDM dataset: user responses are mostly one-word sentences, which leaves no difference between the models with and without the attention mechanism.

Metrics Departure-City Arrive-City
MEM 2.7 4.1
RNN 0.2 0.1
MAD-SM 0.5 0.3
MAD-Attn 100.0 100.0
MAD-EM 96.5 96.2
MAD 100.0 100.0
Table 9. Prediction accuracy at the turn level on the departure city and arrival city slots.

To verify the model's ability to use context information in slot filling, we further analyzed the prediction accuracy on the Departure_City and Arrive_City slots which, as described in Section 4.1.3, share the same value list. The ability to assign values to the correct slot is mainly controlled by the update gate defined in Section 3.3. The slot-value memory dominates the prediction of slot values, as can be seen from the results of MAD-SM, RNN, and MEM in Table 9: the results drop dramatically when the slot-value memory is removed (RNN and MAD-SM). Although MEM's accuracy is higher than that of RNN and MAD-SM, it is still much lower than our proposed model's. This is because 1) the number of cities is too large for MEM to predict, and 2) MEM fails to identify which slot a value belongs to.

4.7. Parameter Tuning

Generally speaking, the performance of neural network models is highly correlated with the number of parameters. There are many important hyper-parameters in our model, including the dimensions of the slot-value memory and external memory, and the number of column vectors in the external memory. We evaluated the influence of these hyper-parameters on performance. The following experiments were performed on the DM-DSTC dataset.

Figure 6. Fine-grained prediction accuracy on DM-DSTC with different numbers of column vectors in the external memory. The optimal number is 8.

First, we studied how performance is influenced by the number of column vectors in the external memory, varying it from 3 to 9 with a step size of 1. The accuracy changes for dialogue act type, slot-value, and mask are shown in Figure 6. For predicting dialogue act type and mask, the optimal number of columns is 8, and the resulting accuracy is significantly better than the others. For predicting slot-values, although the optimum is 4 columns with an accuracy of 0.331, the accuracy is almost unchanged (from 0.321 to 0.331) as the number varies from 4 to 8.

Figure 7. Accuracy change on DM-DSTC with different dimensions of the column vectors in the external memory. The optimal dimension is 128.

Second, we studied the influence of the dimension of the column vectors, as shown in Figure 7. The dimension ranges from 32 to 256 with a step size of 32. The accuracies of dialogue act type and mask are highly correlated, and both peak at a dimension of 128, while the best slot-value accuracy is obtained at a dimension of 64.

4.8. Visualization Analysis

Figure 8 illustrates an example of the slot-level attention mechanism. For each slot, the model generates a distribution over the words of an utterance; each row is thus a probability distribution over words, where the largest probability corresponds to the word that should be attended to most. For the utterance "can you book a table with British cuisine for six people in Madrid in an expensive price range", the most attended word for slot Cuisine is British, for slot Price it is expensive, and for slot Number it is six. Note that the attention weight for the Rating slot is also large on some word, which is intuitively wrong since rating information has not yet been mentioned. However, this kind of faulty attention weight does not harm model performance, because the inclusion of a slot-value pair in the predicted dialogue act is decided by two distributions: the value distribution and the slot mask distribution, as mentioned in Section 3.6. The effect of faulty attention is filtered out by the mask when deciding which slots are addressed in the final dialogue act.
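A sketch of this mask filtering when assembling the final dialogue act; the function and data structures are illustrative, not the authors' implementation.

```python
# Assemble the final dialogue act from per-slot value distributions and
# mask probabilities: a slot-value pair is included only if its mask
# probability passes the threshold, so faulty attention on an unmentioned
# slot (e.g. Rating) is filtered out.
def assemble_dialogue_act(da_type, value_dists, mask_probs, threshold=0.5):
    pairs = {slot: max(dist, key=dist.get)
             for slot, dist in value_dists.items()
             if mask_probs.get(slot, 0.0) >= threshold}
    return {"type": da_type, "slots": pairs}
```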

Figure 8. Attention visualization. For each slot, the attention weights (in a row) form a distribution over the words of an utterance. For the utterance "can you book a table with british cuisine for six people in madrid in an expensive price range.", the predicted slot-value pairs are (Cuisine, british), (Number, six), (Location, madrid), and (Price, expensive).

Figure 9 illustrates the change of the dialogue state and the predicted next dialogue act in an exemplar dialogue session. We visualize the values stored in the slot-value memory and show the next dialogue act type predicted by the model. At each turn, the model computes an update gate (Eq. 8) for each slot. If a value of a slot appears in the user utterance, the gate increases and the color of the corresponding cell becomes darker; the darkness of a cell thus represents the gate value, which is calculated independently for each slot at each turn. The value in each cell is computed by Eq. 21, and we only display the value of a slot once its update gate has become large at some turn. These values compose the search constraint at each turn. In the exemplar session, each value in the user utterance is captured by the attention mechanism and filled into the slot-value memory with a large update gate.

For instance, when the user asks "can you book a table in a cheap price range in london?", the Price slot is filled with the value cheap, and the Location slot is filled with the value london. The model predicts the next dialogue act ask_cuisine, which prompts the user for a cuisine preference. As the user supplies new information with the utterance "with french food", the Cuisine slot is filled with the value french. At this state, the model predicts the next dialogue act ask_people, which asks the user how many people are involved. As the dialogue proceeds, the slot-value memory explicitly tracks the dialogue state, and the next dialogue act is predicted according to that state.
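The state-to-act flow in this walkthrough can be mirrored with a deliberately simple rule-based stand-in (the paper's model predicts acts neurally from the memory state; the slot names and act labels here follow the example above):

```python
# Slots the system needs before it can issue the final query.
REQUIRED = ["Price", "Location", "Cuisine", "People"]

def next_act(state):
    # Ask for the first slot that is still missing; otherwise query the KB.
    for slot in REQUIRED:
        if slot not in state:
            return f"ask_{slot.lower()}"
    return "issue_query"

state = {"Price": "cheap", "Location": "london"}  # first user turn
print(next_act(state))                            # → ask_cuisine
state["Cuisine"] = "french"                       # "with french food"
print(next_act(state))                            # → ask_people
```

The point of the sketch is the dependency, not the policy: once the tracked state changes (Cuisine gets filled), the predicted next act changes accordingly.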

Figure 9. An example of DA prediction for a dialogue session. Each row shows a user utterance and the corresponding system response. The values of slots at each turn are predicted by Eq. 21. The color darkness of each cell represents the value of the update gate defined in Eq. 8; darker colors indicate larger values.

5. Conclusion

In this paper, we presented a memory-augmented dialogue management model that captures long-range dialogue semantics by explicitly memorizing and updating dialogue act types and slot-value pairs during interactions in task-oriented dialogue systems. The model employs two memory modules, a slot-value memory and an external memory, to retain the history semantics of the entire dialogue session. The slot-value memory tracks the dialogue state by memorizing and updating the values of semantic slots, and the external memory augments the single state representation of an RNN by storing more context information. We also proposed a slot-level attention mechanism for attentively reading a user utterance to update the slot-value memory; the mechanism extracts the slot-related information addressed in the utterance. Through the attention mechanism and the memory modules, the proposed model interprets the dialogue context in a more observable and explainable way, which also helps to predict the next dialogue act given the current dialogue state. Results show that our model outperforms state-of-the-art baselines and, moreover, offers more observable dialogue semantics by presenting the predicted slot-value pairs at each dialogue turn. We believe that research on interactive IR may benefit from our work, particularly from the idea of enhancing the interpretability of dialogue management.

References