With the rapid development of deep learning, there has been a recent boom in task-oriented dialogue systems, in terms of both algorithms and datasets. The goal of task-oriented dialogue is to fulfill a user’s requests, such as booking hotels, via communication in natural language. Due to the complexity and ambiguity of human language, previous systems have included semantic decoding (nbt) to project natural language input into pre-defined dialogue states. These states are typically represented by slots and values: slots indicate the category of information and values specify the content of information. For instance, the user utterance “can you help me find the address of any hotel in the south side of the city” can be decoded as inform(area, south) and request(address), meaning that the user has specified the value south for slot area and requested another slot address.
Numerous methods have been put forward to decode a user’s utterance into slot values. Some use hand-crafted features and domain-specific delexicalization methods to achieve strong performance (henderson2014word; zilka2015incremental). The Neural Belief Tracker (nbt) employs CNNs and pretrained embeddings to further improve state tracking accuracy. statnbt extends this work with two additional statistical update mechanisms. liu2018dialogue uses human teaching and feedback to boost state tracking performance. The GLAD model (gl) utilizes both global and local attention mechanisms and obtains state-of-the-art results on the WoZ and DSTC2 datasets. However, most of these methods require slot-specific neural structures for accurate prediction. For example, GLAD defines a parametrized local attention matrix for each slot. Slot-specific mechanisms become unwieldy when the dialogue task involves many topics and slots, as is typical in a complex conversational setting like product troubleshooting. Furthermore, due to the sparsity of labels, there may not be enough data to thoroughly train each slot-specific network structure. SMDST (smdst) and LSBT (lsbt) both propose to remove the model’s dependency on dialogue slots, but they do not modify the representation component, which can be crucial to textual understanding, as we will show later.
To solve this problem, we need a state tracking model independent of dialogue slots. In other words, the network should depend on the semantic similarity between slots and utterance instead of slot-specific modules. To this end, we propose the Slot-Independent Model (SIM). Our model complexity does not increase when the number of slots in dialogue tasks goes up. Thus, SIM has far fewer parameters than existing dialogue state tracking models. To compensate for the exclusion of slot-specific parameters, we incorporate better feature representations of the user utterance and dialogue states using syntactic information and convolutional neural networks (CNN). The refined representation, together with cross- and self-attention mechanisms, enables our model to achieve even better performance than slot-specific models. For instance, on the Wizard-of-Oz (WoZ) 2.0 dataset (woz), SIM obtains a joint goal accuracy of 89.5%, 1.4% higher than the previously best model GLAD, with only 22% of the number of parameters. On the DSTC2 dataset, SIM achieves performance comparable to previous best models with only 19% of the model size.
2 Problem Formulation
As outlined in young2010hidden, the dialogue state tracking task is formulated as follows: at each turn of dialogue, the user’s utterance is semantically decoded into a set of slot-value pairs. There are two types of slots. Goal slots indicate the category, e.g. area, food, and the values specify the constraint given by users for the category, e.g. South, Mediterranean. Request slots refer to requests, and the value is the category that the user demands, e.g. phone, area. Each user’s turn is thus decoded into turn goals and turn requests. Furthermore, to summarize the user’s goals so far, the union of all previous turn goals up to the current turn is defined as joint goals.
Similarly, the dialogue system’s reply from the previous round is labeled with a set of slot-value pairs denoted as system actions. The dialogue state tracking task requires models to predict turn goal and turn request given user’s utterance and system actions from previous turns.
Formally, the ontology of dialogue, $O$, consists of all possible slots $S$ and the set of values for each slot, $V(s), \forall s \in S$. Specifically, req is the name of the request slot, and its values include all the requestable category information. The dialogue state tracking task is that, given the user’s utterance in the $i$-th turn, $U$, and the system actions from the $(i-1)$-th turn, $A = \{(s_1, v_1), \dots, (s_q, v_q)\}$, where $s_j \in S, v_j \in V(s_j)$, the model should predict:
Turn goals: $\{(s_1, v_1), \dots, (s_b, v_b)\}$, where $s_j \in S, v_j \in V(s_j)$,
Turn requests: $\{(\mathrm{req}, v_1), \dots, (\mathrm{req}, v_t)\}$, where $v_j \in V(\mathrm{req})$.
The joint goals at turn $i$ are then computed by taking the union of all the predicted turn goals from turn $1$ to turn $i$.
Usually this prediction task is cast as a binary classification problem: for each slot-value pair $(s, v)$, determine whether it should be included in the predicted turn goals/requests. Namely, the model is to learn a mapping function $f(U, A, (s, v)) \to \{0, 1\}$.
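The formulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy ontology, the 0.5 decision threshold, and the names `predict_turn` and `joint_goals` are hypothetical and do not come from the paper; the scoring function stands in for the learned mapping.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

SlotValue = Tuple[str, str]

# Toy ontology (hypothetical values, for illustration only).
ONTOLOGY: Dict[str, List[str]] = {
    "area": ["south", "north"],
    "food": ["mediterranean", "thai"],
    "req": ["address", "phone"],
}


def predict_turn(
    score: Callable[[SlotValue], float], threshold: float = 0.5
) -> Tuple[Set[SlotValue], Set[SlotValue]]:
    """Make one binary decision per slot-value pair in the ontology."""
    goals: Set[SlotValue] = set()
    requests: Set[SlotValue] = set()
    for slot, values in ONTOLOGY.items():
        for value in values:
            if score((slot, value)) > threshold:
                # Pairs under the request slot become turn requests,
                # all other pairs become turn goals.
                (requests if slot == "req" else goals).add((slot, value))
    return goals, requests


def joint_goals(turn_goals: Iterable[Set[SlotValue]]) -> Set[SlotValue]:
    """Union of the predicted turn goals from the first turn onwards."""
    joint: Set[SlotValue] = set()
    for goals in turn_goals:
        joint |= goals
    return joint
```

Here `score` would be replaced by the trained model; any callable returning a probability per slot-value pair fits the interface.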
3 Slot-Independent Model
To predict whether a slot-value pair should be included in the turn goals/requests, previous models (nbt; gl) usually define network components for each slot $s$. This can be cumbersome when the ontology is large, and it suffers from the insufficient data problem: the labelled data for a single slot may not suffice to effectively train the parameters of the slot-specific neural network structure.
Therefore, we propose that in the classification process, the model should rely on the semantic similarity between the user’s utterance and the slot-value pair, together with system action information. In other words, the model should have only a single global neural structure independent of slots. We henceforth refer to this model as the Slot-Independent Model (SIM) for dialogue state tracking.
3.1 Input Representation
Suppose the user’s utterance in the $i$-th turn contains $m$ words, $U = (w_1, w_2, \dots, w_m)$. For each word $w_j$, we use GloVe word embedding $e_j$, character-CNN embedding $c_j$, Part-Of-Speech (POS) embedding $\mathrm{POS}_j$, Named-Entity-Recognition (NER) embedding $\mathrm{NER}_j$ and exact match feature $\mathrm{EM}_j$. The POS and NER tags are extracted by spaCy and then mapped into fixed-length vectors. The exact match feature has two bits, indicating whether a word and its lemma can be found in the slot-value pair representation, respectively. This is the first step to establish a semantic relationship between user utterance and slots. To summarize, we represent the user utterance as $X^U = \{u_1, u_2, \dots, u_m\} \in \mathbb{R}^{m \times d_u}$, where $u_j = [e_j; c_j; \mathrm{POS}_j; \mathrm{NER}_j; \mathrm{EM}_j]$ and $d_u$ is the total feature dimension.
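The two-bit exact match feature can be illustrated as follows. This is a minimal sketch, not the authors' code: the lemma table `TOY_LEMMAS` is a hypothetical stand-in for a real lemmatizer (the paper uses spaCy), and lower-casing before comparison is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS: Dict[str, str] = {"hotels": "hotel", "restaurants": "restaurant"}


def exact_match_features(
    utterance: List[str], slot_value_text: List[str]
) -> List[Tuple[int, int]]:
    """Return (word_match, lemma_match) bits for every utterance word."""
    target = {w.lower() for w in slot_value_text}
    features = []
    for word in utterance:
        w = word.lower()
        lemma = TOY_LEMMAS.get(w, w)  # fall back to the word itself
        features.append((int(w in target), int(lemma in target)))
    return features
```

For example, the word "hotels" does not literally occur in the text "request hotel", but its lemma does, so only the second bit is set.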
For each slot-value pair either in system action or in the ontology, we get its text representation by concatenating the contents of slot and value (to align with previous work, we prepend the word “inform” to goal slots). We use GloVe to embed each word in the text. Therefore, each slot-value pair in system actions is represented as $X^A \in \mathbb{R}^{a \times d}$ and each slot-value pair in the ontology is represented as $X^O \in \mathbb{R}^{o \times d}$, where $a$ and $o$ are the numbers of words in the corresponding text and $d$ is the GloVe embedding dimension.
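The text form of a slot-value pair can be sketched as below. The function name `slot_value_text` and the literal slot name `"request"` are assumptions made for illustration; only the prepending of “inform” to goal slots is taken from the text.

```python
from typing import List

REQUEST_SLOT = "request"  # assumed name of the request slot


def slot_value_text(slot: str, value: str) -> List[str]:
    """Concatenate slot and value words; goal slots get an 'inform' prefix."""
    words = slot.split() + value.split()
    if slot != REQUEST_SLOT:
        words = ["inform"] + words
    return words
```

Each word of the resulting list would then be looked up in the GloVe table to build the matrix for that pair.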
3.2 Contextual Representation
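To incorporate contextual information, we employ a bi-directional RNN layer on the input representation. For instance, the user utterance representation $X^U$ is encoded as $R^U = \mathrm{BiRNN}(X^U) \in \mathbb{R}^{m \times d_{rnn}}$, where $m$ is the number of words and $d_{rnn}$ is the output dimension of the RNN.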
We apply variational dropout (vd) for RNN inputs, i.e. the dropout mask is shared over different timesteps.
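A minimal sketch of a dropout mask shared across timesteps is given below. The inverted-dropout rescaling by `1 / (1 - rate)` and the function name are assumptions for illustration; the text only states that the mask is shared over timesteps.

```python
import random
from typing import List, Tuple


def variational_dropout(
    sequence: List[List[float]], rate: float, rng: random.Random
) -> Tuple[List[List[float]], List[float]]:
    """Apply one dropout mask, sampled once, to every timestep (rate < 1)."""
    dim = len(sequence[0])
    keep = 1.0 - rate
    # Sample the mask a single time per sequence, not once per timestep.
    mask = [(1.0 / keep if rng.random() < keep else 0.0) for _ in range(dim)]
    dropped = [[x * m for x, m in zip(step, mask)] for step in sequence]
    return dropped, mask
```

In contrast, standard dropout would draw a fresh mask inside the loop over timesteps, so different feature dimensions would be zeroed at different positions of the same sequence.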
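After the RNN, we use linear self-attention to get a single summarization vector $s^U$ for the user utterance, using weight vector $w \in \mathbb{R}^{d_{rnn}}$ and bias scalar $b$. With $R^U$ denoting the RNN outputs and $R^U_j$ its $j$-th row:
$$z^U = R^U w + b, \qquad p^U = \mathrm{softmax}(z^U), \qquad s^U = \sum_{j=1}^{m} p^U_j R^U_j.$$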
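The summarization step can be written out in plain Python as a sketch. The helper names are hypothetical, and lists of floats stand in for the tensors a real implementation would use.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_self_attention(
    rows: List[List[float]], w: List[float], b: float
) -> Tuple[List[float], List[float]]:
    """Score each row with w and b, then return the softmax-weighted sum."""
    scores = [sum(x * wi for x, wi in zip(row, w)) + b for row in rows]
    probs = softmax(scores)
    dim = len(rows[0])
    summary = [sum(p * row[k] for p, row in zip(probs, rows)) for k in range(dim)]
    return summary, probs
```

With a zero weight vector every row receives the same score, so the summary reduces to the plain average of the rows.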
For each slot-value pair in the system actions and ontology, we conduct RNN and linear self-attention summarization in a similar way. As the slot-value pair input is not a sentence, we only keep the summarization vectors $s^A$ and $s^O$ for each slot-value pair in the system actions and the ontology, respectively.
To determine whether the current user utterance refers to a slot-value pair in the ontology, the model employs inter-attention between user utterance, system action and ontology. Similar to the framework in gl, we employ two sources of interactions.
The first is the semantic similarity between the user utterance, represented by the RNN outputs $R^U$, and each slot-value pair $(s, v)$ from the ontology, represented by its summarization vector $s^O$. We linearly combine the vectors in $R^U$, weighted by their normalized inner products with $s^O$, and the combined vector is then used to compute the similarity score $y_1$.
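The first interaction can be sketched as follows. The attention part follows the description above; the final linear layer with parameters `w1` and `b1` is an assumption made for illustration, since the text does not spell out how the combined vector is mapped to a score.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def utterance_ontology_score(
    R_u: List[List[float]], s_o: List[float], w1: List[float], b1: float
) -> float:
    """Attend over utterance rows with the slot-value vector, then score."""
    probs = softmax([dot(row, s_o) for row in R_u])  # normalized inner products
    dim = len(R_u[0])
    q = [sum(p * row[k] for p, row in zip(probs, R_u)) for k in range(dim)]
    return dot(w1, q) + b1  # assumed linear scoring layer
```

Note that `w1` and `b1` are shared by all slot-value pairs in this sketch, which is what keeps the parameter count independent of the ontology size.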
The second source involves the system actions. The reason is that if the system requested certain information in the previous round, it is very likely that the user will give an answer in this round, and the answer may refer to the question, e.g. “yes” or “no” to the question. Thus, we first attend to system actions from the user utterance and then combine the result with the ontology to get a similarity score $y_2$. Suppose there are $L$ slot-value pairs in the system actions from the previous round (including a special sentinel action which refers to ignoring the system action), represented by their summarization vectors $s^A_1, \dots, s^A_L$.
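A possible reading of the second interaction is sketched below. It assumes that the utterance summary vector attends over the action vectors via inner products and that the attended action vector is compared with the ontology vector by another inner product; both choices, and all names, are assumptions for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def action_ontology_score(
    action_vecs: List[List[float]], s_u: List[float], s_o: List[float]
) -> float:
    """Attend to system actions from the utterance, then match the ontology."""
    probs = softmax([dot(a, s_u) for a in action_vecs])
    dim = len(action_vecs[0])
    q = [sum(p * a[k] for p, a in zip(probs, action_vecs)) for k in range(dim)]
    return dot(q, s_o)  # assumed inner-product comparison
```

The sentinel action would simply be one more row in `action_vecs`, letting the attention put its weight there when the previous system turn is irrelevant.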
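The final similarity score between the user utterance and a slot-value pair $(s, v)$ from the ontology is a linear combination of the two scores above, denoted $y_1$ and $y_2$, normalized using the sigmoid function:
$$p_{s,v} = \sigma(y_1 + \beta y_2),$$
where $\beta$ is a learned coefficient. The loss function is the sum of binary cross entropy over all slot-value pairs in the ontology:
$$L(\theta) = -\sum_{(s,v)} \left[ y_{s,v} \log p_{s,v} + (1 - y_{s,v}) \log (1 - p_{s,v}) \right],$$
where $y_{s,v} \in \{0, 1\}$ is the ground truth. We illustrate the model structure of SIM in Figure 1.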
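The score combination and the loss can be sketched as follows. The function names and the small clipping constant `eps` (added to avoid `log(0)`) are assumptions for illustration.

```python
import math
from typing import List


def final_probability(y1: float, y2: float, beta: float) -> float:
    """Sigmoid of the linear combination of the two similarity scores."""
    return 1.0 / (1.0 + math.exp(-(y1 + beta * y2)))


def bce_loss(probs: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Sum of binary cross entropy over all slot-value pairs."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip for numerical safety
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

In this sketch `probs` and `labels` hold one entry per slot-value pair in the ontology for a single turn.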
We evaluated our model on Wizard of Oz (WoZ) (woz) and the second Dialog State Tracking Challenge, DSTC2 (dstc2). Both tasks are for restaurant reservation and have slot-value pairs of both goal and request types. WoZ has 4 kinds of slots (area, food, price range, request) and 94 values in total. DSTC2 has an additional slot, name, and 220 values in total. WoZ has 800 dialogues in the training and development sets and 400 dialogues in the test set, while DSTC2 consists of 2118 dialogues in the training and development sets and 1117 dialogues in the test set.
We use accuracy on the joint goal and turn request as the evaluation metrics. Both are sets of slot-value pairs, so the predicted set must exactly match the answer to be judged as correct. For joint goals, if a later turn generates a slot-value pair where the slot has been specified in previous rounds, we replace the value with the latest content.
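The evaluation rule can be illustrated with a short sketch. The function names are hypothetical; the logic follows the description above: a later value for an already specified slot replaces the earlier one, and a turn counts as correct only if the predicted set matches the answer exactly.

```python
from typing import Dict, Iterable, List, Set, Tuple

SlotValue = Tuple[str, str]


def update_joint_goals(
    joint: Dict[str, str], turn_goals: Iterable[SlotValue]
) -> Dict[str, str]:
    """Overwrite the value of any slot that is specified again."""
    updated = dict(joint)
    for slot, value in turn_goals:
        updated[slot] = value
    return updated


def exact_match_accuracy(
    predicted: List[Set[SlotValue]], gold: List[Set[SlotValue]]
) -> float:
    """Fraction of turns whose predicted set equals the gold set exactly."""
    correct = sum(1 for p, g in zip(predicted, gold) if set(p) == set(g))
    return correct / len(gold)
```

Because matching is done on whole sets, a single wrong or missing slot-value pair makes the entire turn count as incorrect.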
|Model||WoZ joint goal||WoZ turn request||DSTC2 joint goal||DSTC2 turn request|
|Delex. Model + Semantic Dictionary||83.7%||87.6%||72.9%||95.7%|
|Neural Belief Tracker (NBT)||84.2%||91.6%||73.4%||96.5%|
4.3 Training Details
We fix GloVe (glove) as the word embedding matrix. The models are trained using the Adam optimizer (adam) with an initial learning rate of 1e-3. The dimensions of the POS and NER embeddings are 12 and 8, respectively. In the character-CNN, each character is embedded into a vector of length 50. The CNN window size is 3 and the hidden size is 50. We apply a dropout rate of 0.1 to the input of each module. The hidden size of the RNN is 125.
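For reference, the settings listed above can be collected in a single configuration object. The dictionary and its key names are hypothetical; only the values are taken from the text.

```python
# Hyperparameters as stated in the training details; key names are illustrative.
SIM_HYPERPARAMS = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "freeze_word_embeddings": True,  # GloVe is kept fixed
    "pos_embedding_dim": 12,
    "ner_embedding_dim": 8,
    "char_embedding_dim": 50,
    "char_cnn_window": 3,
    "char_cnn_hidden_size": 50,
    "input_dropout": 0.1,
    "rnn_hidden_size": 125,
}
```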
During training, we select the model with the highest joint goal accuracy on the development set and report its result on the test set.
For DSTC2, we adhere to the standard procedure of using the N-best list from the noisy ASR results for testing. Because the ASR results are very noisy, we experimented with several strategies and ended up using only the top result from the N-best list. Training and validation on DSTC2 are based on noise-free user utterances. The WoZ task does not have ASR results available, so we directly use noise-free user utterances.
4.4 Baseline models and result
We compare our model SIM with a number of baseline systems: the delexicalization model (woz; henderson2014word), the neural belief tracker model NBT (nbt), the global-locally self-attentive model GLAD (gl), the large-scale belief tracking model LSBT (lsbt) and the scalable multi-domain dialogue state tracking model SMDST (smdst).
Table 1 shows that, on the WoZ dataset, SIM achieves a new state-of-the-art joint goal accuracy of 89.5%, a significant improvement of 1.4% over GLAD, and a turn request accuracy of 97.3%, 0.2% above GLAD. On the DSTC2 dataset, where noisy ASR results are used as user utterances during testing, SIM obtains results comparable to GLAD. Furthermore, the better representation in SIM makes it significantly outperform the previous slot-independent models LSBT and SMDST.
Furthermore, as SIM has no slot-specific neural network structures, its model size is much smaller than that of previous models. Table 2 shows that SIM has the same number of parameters on the WoZ and DSTC2 datasets, which is only 23% and 19%, respectively, of the number of parameters in GLAD.
Ablation Study. We conduct an ablation study of SIM on the WoZ dataset. As shown in Table 3, the additional utterance word features, including character, POS, NER and exact match embeddings, boost the performance by 2.4% in joint goal accuracy. This indicates that, for the dialogue state tracking task, syntactic information and text matching are very useful. Character-CNN captures sub-word level information and is effective in handling spelling errors, contributing 1.2% to joint goal accuracy. Variational dropout is also beneficial, contributing 0.9% to joint goal accuracy, which shows the importance of uniform masking during dropout.
|Model||Joint Goal||Turn Request|
In this paper, we propose a slot-independent neural model, SIM, to tackle the dialogue state tracking problem. By incorporating better feature representations, SIM effectively reduces model complexity while still achieving superior or comparable results on various datasets, compared with previous models.
For future work, we plan to design general slot-free dialogue state tracking models that can be adapted to different domains at inference time, given domain-specific ontology information. This will make the model more agile in real applications.
We thank the anonymous reviewers for the insightful comments. We thank William Hinthorn for proof-reading our paper.