SUMBT: Slot-Utterance Matching for Universal and Scalable Belief Tracking

07/17/2019 ∙ by Hwaran Lee, et al.

In goal-oriented dialog systems, belief trackers estimate the probability distribution of slot-values at every dialog turn. Previous neural approaches have modeled domain- and slot-dependent belief trackers and have difficulty in adding new slot-values, which limits the flexibility of domain ontology configurations. In this paper, we propose a new approach to a universal and scalable belief tracker, called the slot-utterance matching belief tracker (SUMBT). The model learns the relations between domain-slot-types and slot-values appearing in utterances through attention mechanisms based on contextual semantic vectors. Furthermore, the model predicts slot-value labels in a non-parametric way. In our experiments on two dialog corpora, WOZ 2.0 and MultiWOZ, the proposed model improved performance over slot-dependent methods and achieved state-of-the-art joint accuracy.






Code repository: SUMBT: Slot-Utterance Matching for Universal and Scalable Belief Tracking (ACL 2019)

1 Introduction

With the prevalent use of conversational agents, goal-oriented systems have received increasing attention from both academia and industry. Goal-oriented systems help users achieve goals such as making restaurant reservations or booking flights by the end of a dialog. As the dialog progresses, the system is required to update a distribution over dialog states, which consist of users’ intent, informable slots, and requestable slots. This is called belief tracking or dialog state tracking (DST). For instance, for a given domain and slot-type (e.g., ‘restaurant’ domain and ‘food’ slot-type), it estimates the probability of the corresponding slot-value candidates (e.g., ‘Korean’ and ‘Modern European’) that are pre-defined in a domain ontology. Since the system uses the predicted outputs of DST to choose the next action based on a dialog policy, the accuracy of DST is crucial to the overall performance of the system. Moreover, dialog systems should be able to deal with newly added domains and slots in a flexible manner (for example, as reported by Kim et al. (2018), hundreds of new skills are added per week in personal assistant services), and thus developing scalable dialog state trackers is essential. In this regard, Chen et al. (2016) proposed a model that captures relations from intent-utterance pairs for intent expansion.

Traditional statistical belief trackers Henderson et al. (2014) are vulnerable to lexical and morphological variations because they depend on manually constructed semantic dictionaries. With the rise of deep learning approaches, several neural belief trackers (NBT) have been proposed and have improved performance by learning semantic neural representations of words Mrkšić et al. (2017); Mrkšić and Vulić (2018). However, scalability remains a challenge: the previously proposed methods either model each domain and/or slot individually Zhong et al. (2018); Ren et al. (2018); Goel et al. (2018) or have difficulty in adding new slot-values that are not defined in the ontology Ramadan et al. (2018); Nouri and Hosseini-Asl (2018).

In this paper, we focus on developing a “scalable” and “universal” belief tracker, whereby a single belief tracker handles any domain and slot-type. To tackle this problem, we propose a new approach, called the slot-utterance matching belief tracker (SUMBT), which is a domain- and slot-independent belief tracker as shown in Figure 1. Inspired by machine reading comprehension techniques Chen et al. (2017); Seo et al. (2017), SUMBT considers a domain-slot-type (e.g., ‘restaurant-food’) as a question and finds the corresponding slot-value in a pair of user and system utterances, assuming the desired answer exists in the utterances. SUMBT encodes system and user utterances using the recently proposed BERT Devlin et al. (2018), which provides contextualized semantic representations of sentences. Moreover, the domain-slot-types and slot-values are also literally encoded by BERT. SUMBT then learns which utterance words to attend to for a given domain-slot-type, based on their contextual semantic vectors. The model predicts the slot-value label in a non-parametric way based on a certain metric, so the model architecture does not structurally depend on domains and slot-types. Consequently, a single SUMBT can deal with any pair of domain-slot-type and slot-value, and can also utilize knowledge shared among multiple domains and slots.

We experimentally demonstrate the efficacy of the proposed model on two goal-oriented dialog corpora: WOZ 2.0 and MultiWOZ. We also qualitatively analyze how the model works. Our implementation is publicly available.

2 SUMBT

The proposed model consists of four parts as illustrated in Figure 1: BERT encoders for encoding slots, values, and utterances (the grey and blue boxes); a slot-utterance matching network (the red box); a belief tracker (the orange box); and a non-parametric discriminator (the dashed line on top).

Figure 1: The architecture of slot-utterance matching belief tracker (SUMBT). An example of system and user utterances, a domain-slot-type, and a slot-value is denoted in red.

2.1 Contextual Semantic Encoders

For sentence encoders, we employed a pre-trained BERT model Devlin et al. (2018), which is a deep stack of bi-directional Transformer encoders. Rather than static word vectors, it provides effective contextual semantic word vectors. Moreover, it offers an aggregated representation of a word sequence such as a phrase or sentence, and therefore we can obtain an embedding vector of slot-types or slot-values that consist of multiple words.

The proposed method literally encodes words of domain-slot-types $s$ and slot-values $v_t$ at turn $t$ as well as the system and user utterances. For the pair of system and user utterances, $x_t^{sys} = (w_1^{sys}, \dots, w_n^{sys})$ and $x_t^{usr} = (w_1^{usr}, \dots, w_m^{usr})$, the pre-trained BERT encodes each word $w$ into a contextual semantic word vector $u$, and the encoded utterances are represented in the following matrix representation:

$$U_t = \mathrm{BERT}\left([x_t^{sys}, x_t^{usr}]\right)$$

Note that the sentence pairs are concatenated with a separation token [SEP], and BERT will be fine-tuned with the loss function defined in Section 2.4.

For the domain-slot-type $s$ and slot-value $v_t$, another pre-trained BERT, denoted as $\mathrm{BERT}_{sv}$, encodes their word sequences $x^s$ and $x_t^v$ into contextual semantic vectors $q^s$ and $y_t^v$, respectively:

$$q^s = \mathrm{BERT}_{sv}(x^s), \qquad y_t^v = \mathrm{BERT}_{sv}(x_t^v)$$

We use the output vectors corresponding to the classification embedding token [CLS] that summarizes the whole input sequence.

Note that we consider $x^s$ as a phrase of domain and slot words (e.g., $x^s$ = “restaurant – price range”) so that $q^s$ represents both domain and slot information. Moreover, fixing the weights of $\mathrm{BERT}_{sv}$ during training allows the model to maintain the encoded contextual vector of any new pairs of domain and slot-type. Hence, simply by forwarding them into the slot-value encoder, the proposed model can scale to new domains and slots.
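As a rough illustration of the input layout described above, the following sketch builds the token sequences that would be handed to the two encoders. The function names are hypothetical, and plain whitespace-split words are assumed here in place of BERT's WordPiece tokenization; it is not taken from the authors' implementation.

```python
def build_utterance_input(sys_words, usr_words):
    """Lay out a system/user utterance pair as [CLS] sys [SEP] usr [SEP]."""
    tokens = ["[CLS]"] + sys_words + ["[SEP]"] + usr_words + ["[SEP]"]
    # Segment ids: 0 covers [CLS], the system utterance and the first [SEP];
    # 1 covers the user utterance and the final [SEP].
    segment_ids = [0] * (len(sys_words) + 2) + [1] * (len(usr_words) + 1)
    return tokens, segment_ids


def build_phrase_input(words):
    """Lay out a domain-slot-type or slot-value phrase as [CLS] phrase [SEP]."""
    return ["[CLS]"] + words + ["[SEP]"]


# Illustrative inputs (made up for this sketch).
utt_tokens, utt_segments = build_utterance_input(
    "how can i help".split(), "a reasonably priced restaurant".split()
)
slot_tokens = build_phrase_input("restaurant price range".split())
```

Under this layout, the encoder output at the first position of `slot_tokens` (the [CLS] token) would serve as the single vector summarizing the whole domain-slot-type phrase.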

2.2 Slot-Utterance Matching

In order to retrieve the relevant information corresponding to the domain-slot-type from the utterances, the model uses an attention mechanism. Considering the encoded vector of the domain-slot-type as a query, the model matches it to the contextual semantic vectors at each word position, and then the attention scores are calculated.

Here, we employed multi-head attention Vaswani et al. (2017) for the attention mechanism. The multi-head attention maps a query matrix $Q$, a key matrix $K$, and a value matrix $V$ with different linear projections, and then the scaled dot-product attention is performed on those matrices. The attended context vector $h_t^s$ between the slot $s$ and the utterances at turn $t$ is

$$h_t^s = \mathrm{MultiHead}(Q, K, V),$$

where $Q$ is the encoded domain-slot-type vector $q^s$, and $K$ and $V$ are the encoded utterance matrix $U_t$.
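The matching step can be illustrated with a minimal single-head sketch in plain Python. It is a simplification made for readability: the learned linear projections and the multiple heads of the actual multi-head attention are omitted, and the vectors below are invented toy values rather than BERT outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, word_vectors):
    """Scaled dot-product attention of one slot query over utterance word vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * w for q, w in zip(query, vec)) / scale for vec in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    # The context vector is the attention-weighted sum of the word vectors.
    context = [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors)) for i in range(dim)
    ]
    return context, weights


# Toy example: the query is most similar to the first word vector.
context, weights = attend([1.0, 0.0], [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
```

The returned weights are the per-word attention scores, which is the kind of quantity visualized when inspecting where the model attends for a given slot.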

2.3 Belief Tracker

As the conversation progresses, the belief state at each turn is determined by the previous dialog history and the current dialog turn. The flow of dialog can be modeled by RNNs such as LSTM and GRU, or Transformer decoders (i.e., left-to-right uni-directional Transformer).

In this work, the attended context vector $h_t^s$ is fed into an RNN,

$$d_t^s = \mathrm{RNN}(d_{t-1}^s, h_t^s).$$

It learns to output a vector that is close to the target slot-value’s semantic vector.

Since the output of BERT is normalized by layer normalization Ba et al. (2016), the output of the RNN is also fed into a layer normalization layer to help training convergence,

$$\hat{y}_t^s = \mathrm{LayerNorm}(d_t^s).$$
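The recurrence followed by layer normalization can be sketched as below. This is only an illustration under stated assumptions: a simple tanh (Elman-style) cell stands in for the RNN, the weight matrices are hypothetical hand-picked values, and the layer normalization omits the learnable gain and bias.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def rnn_step(prev_state, context, w_state, w_input):
    """One tanh recurrent update: tanh(W_state * prev_state + W_input * context)."""
    new_state = []
    for row_s, row_x in zip(w_state, w_input):
        pre = sum(a * b for a, b in zip(row_s, prev_state))
        pre += sum(a * b for a, b in zip(row_x, context))
        new_state.append(math.tanh(pre))
    return new_state


def track(contexts, w_state, w_input):
    """Run the recurrence over per-turn context vectors; return normalized outputs."""
    state = [0.0] * len(w_state)
    outputs = []
    for context in contexts:
        state = rnn_step(state, context, w_state, w_input)
        outputs.append(layer_norm(state))
    return outputs


# Two dialog turns with toy 2-dimensional context vectors.
outputs = track(
    [[1.0, -1.0], [0.5, 0.2]],
    w_state=[[0.5, 0.0], [0.0, 0.5]],
    w_input=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the state is carried across turns, the output at the second turn depends on both the first and the second context vector, which is how dialog history enters the belief state.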


2.4 Training Criteria

The proposed model is trained to minimize the distance between the tracker outputs $\hat{y}_t^s$ and the target slot-value’s semantic vectors $y_t^v$ under a certain distance metric. The probability distribution of a slot-value $v_t$ is calculated as

$$p(v_t \mid x_{\le t}^{sys}, x_{\le t}^{usr}, s) = \frac{\exp\left(-d(\hat{y}_t^s, y_t^v)\right)}{\sum_{v' \in \mathcal{C}_s} \exp\left(-d(\hat{y}_t^s, y_t^{v'})\right)},$$

where $d$ is a distance metric such as Euclidean distance or negative cosine distance, and $\mathcal{C}_s$ is the set of candidate slot-values of slot-type $s$ defined in the ontology. This discriminative classifier is similar to the metric learning method proposed in Vinyals et al. (2016), but the distance metric is measured in the fixed space that BERT represents rather than in a trainable space.
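A minimal sketch of this distance-based classifier follows, assuming Euclidean distance and toy 2-dimensional vectors in place of BERT encodings. The function names and the candidate vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def slot_value_distribution(output_vec, candidate_vecs, distance=euclidean):
    """Softmax over negative distances from the tracker output to each candidate."""
    names = list(candidate_vecs)
    neg_dists = [-distance(output_vec, candidate_vecs[n]) for n in names]
    top = max(neg_dists)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in neg_dists]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}


# Toy candidate encodings for a 'price range' slot.
candidates = {"cheap": [1.0, 0.0], "moderate": [0.0, 1.0], "none": [-1.0, -1.0]}
probs = slot_value_distribution([0.1, 0.9], candidates)

# Adding a new slot-value only requires its encoded vector, not a new output layer.
candidates["expensive"] = [1.0, 1.0]
probs_extended = slot_value_distribution([0.1, 0.9], candidates)
```

The second call shows why the prediction is called non-parametric: the candidate set can grow without changing any model parameters, since the classifier has no weights tied to a fixed list of slot-values.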

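Finally, the model is trained to minimize the negative log likelihood over all dialog turns $t$ and slot-types $s \in \mathcal{D}$ as follows:

$$\mathcal{L}(\theta) = -\sum_{s \in \mathcal{D}} \sum_{t=1}^{T} \log p(v_t \mid x_{\le t}^{sys}, x_{\le t}^{usr}, s).$$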


By training all domain-slot-types together, the model can learn general relations between slot-types and slot-values, which helps to improve performance.
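The joint training objective can be sketched as a sum of negative log probabilities over every slot-type and turn. The data layout below (a dictionary from slot-type to the per-turn probability assigned to the correct slot-value) is an assumption made for this illustration, and the numbers are invented.

```python
import math


def negative_log_likelihood(correct_probs):
    """Sum -log p over all slot-types and turns.

    correct_probs maps each domain-slot-type to a list holding, for every turn,
    the probability the model assigned to the correct slot-value.
    """
    return -sum(math.log(p) for turns in correct_probs.values() for p in turns)


loss = negative_log_likelihood(
    {"restaurant-food": [0.5, 1.0], "restaurant-area": [0.25]}
)
```

Since all slot-types contribute to one scalar loss, a single set of shared parameters receives gradients from every domain-slot-type.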

3 Experimental Setup

3.1 Datasets

To demonstrate the performance of our approach, we conducted experiments on the WOZ 2.0 Wen et al. (2017) and MultiWOZ Budzianowski et al. (2018) datasets. The WOZ 2.0 dataset covers a single ‘restaurant reservation’ domain, in which belief trackers estimate three slots (area, food, and price range). The MultiWOZ dataset is a multi-domain conversational corpus, in which the model has to estimate 35 slots across 7 domains. Before conducting experiments on MultiWOZ, we performed data cleansing such as correcting misspelled words.

3.2 Baselines

We designed three baseline models: BERT+RNN, BERT+RNN+Ontology, and a slot-dependent SUMBT. 1) The BERT+RNN consists of a contextual semantic encoder (BERT), an RNN-based belief tracker (RNN), and a linear layer followed by a softmax output layer for slot-value classification. The contextual semantic encoder in this model outputs aggregated output vectors like those of the slot and slot-value encoder in Section 2.1. 2) The BERT+RNN+Ontology consists of all components of the BERT+RNN, an ontology encoder (Ontology), and an ontology-utterance matching network which performs element-wise multiplications between the encoded ontology and utterances as in Ramadan et al. (2018). Note that the two aforementioned models, BERT+RNN and BERT+RNN+Ontology, use the linear layer to transform a hidden vector to an output vector whose size depends on the candidate slot-value list. In other words, these models require re-training if the ontology is changed, which means they lack scalability. 3) The slot-dependent SUMBT has the same architecture as the proposed model; the only difference is that the model is trained individually for each slot.

3.3 Configurations

We employed the publicly released pre-trained BERT model that has 12 layers of 768 hidden units and 12 self-attention heads. We experimentally searched for the best configuration of hyper-parameters. For slot and utterance matching, we used multi-head attention with 768 hidden units, searching over the number of heads. We employed a single-layer RNN as the belief tracker, searching over its cell type and number of hidden units. For the distance measure, both Euclidean and negative cosine distances were investigated. The model was trained with the Adam optimizer, with a learning rate that linearly increased during the warm-up phase and then linearly decreased. The warm-up proportion, the number of epochs, and the peak learning rate were also searched. Training stopped early when the validation loss had not improved for 20 consecutive epochs. We report the mean and standard deviation of joint goal accuracies over 20 different random seeds. For reproducibility, we publish our PyTorch implementation code and the pre-processed MultiWOZ dataset.
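The learning-rate schedule described above (linear warm-up followed by linear decay) can be sketched as follows. The function name is hypothetical and the numeric settings in the example are illustrative placeholders, not the values used in the experiments.

```python
def learning_rate(step, total_steps, peak_lr, warmup_proportion):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Warm-up phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Illustrative settings only: 100 steps, 10% warm-up, peak rate 1.0.
schedule = [learning_rate(s, 100, 1.0, 0.1) for s in range(101)]
```

With these placeholder settings the rate rises during the first 10 steps, peaks at step 10, and then falls linearly to zero at step 100.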

4 Experimental Results

4.1 Joint Accuracy Performance

The experimental results on the WOZ 2.0 corpus are presented in Table 1. The joint accuracy of SUMBT is compared with those of the baseline models described in Section 3.2 as well as previously proposed models. The models incorporating the contextual semantic encoder BERT beat all previous models. Furthermore, the three baseline models, BERT+RNN, BERT+RNN+Ontology, and the slot-dependent SUMBT, showed no significant performance differences. On the other hand, the slot-independent SUMBT, which learned shared information across all domains and slots, significantly outperformed those baselines, resulting in 91.0% joint accuracy. This implies the importance of utilizing common knowledge through a single model.

Table 2 shows the experimental results of the slot-independent SUMBT model on the MultiWOZ corpus. Note that MultiWOZ has more domains and slots to be learned than the WOZ 2.0 corpus. SUMBT greatly surpassed the performance of previous approaches by yielding 42.4% joint accuracy. The proposed model achieved state-of-the-art performance on both the WOZ 2.0 and MultiWOZ datasets.
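For reference, joint goal accuracy counts a turn as correct only when every slot is predicted correctly at that turn. The sketch below assumes each turn is represented as a dictionary from slot-type to value, which is a layout chosen for this illustration; the example dialog is made up.

```python
def joint_goal_accuracy(predictions, labels):
    """Fraction of turns whose predicted slot-values all match the labels."""
    # Dictionary equality holds only if every slot has the same value.
    correct = sum(1 for pred, gold in zip(predictions, labels) if pred == gold)
    return correct / len(labels)


gold_turns = [
    {"area": "north", "food": "none"},
    {"area": "north", "food": "korean"},
]
pred_turns = [
    {"area": "north", "food": "none"},
    {"area": "north", "food": "thai"},  # one wrong slot makes the whole turn wrong
]
accuracy = joint_goal_accuracy(pred_turns, gold_turns)  # 0.5
```

This all-or-nothing scoring per turn is why joint accuracy tends to drop as the number of slots grows, as with the 35 slots of MultiWOZ compared with the three slots of WOZ 2.0.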

Model Joint Accuracy
NBT-DNN Mrkšić et al. (2017)
BT-CNN Ramadan et al. (2018)
GLAD Zhong et al. (2018)
GCE Nouri and Hosseini-Asl (2018)
StateNetPSI Ren et al. (2018)
BERT+RNN (baseline 1)
BERT+RNN+Ontology (baseline 2)
Slot-dependent SUMBT (baseline 3)
Slot-independent SUMBT (proposed)
Table 1: Joint goal accuracy on the evaluation dataset of WOZ 2.0 corpus.

Model Joint Accuracy
Benchmark baseline (the model proposed in Ramadan et al. (2018))
GLAD Zhong et al. (2018)
GCE Nouri and Hosseini-Asl (2018)
Table 2: Joint goal accuracy on the evaluation dataset of MultiWOZ corpus.

4.2 Attention Weights Analysis

Figure 2: Attention visualizations of the first three turns in a dialog (WOZ 2.0). At each turn, the first and second columns are the attention weights when the given slots are ‘area’ and ‘price range’, respectively. The slot-value labels are denoted in the parentheses.

Figure 2 shows an example of attention weights as a dialog progresses. The model attends to the parts of the utterances that are semantically related to the given slots, even when the slot-value labels are not expressed with the same words. For example, for the ‘price range’ slot-type at the first turn, the slot-value label is ‘moderate’ but the attention weights are relatively high on the phrase ‘reasonably priced’. When appropriate slot-values corresponding to the given slot-type are absent (i.e., the label is ‘none’), the model attends to the [CLS] or [SEP] tokens.

5 Conclusion

In this paper, we propose a new approach to a universal and scalable belief tracker, called SUMBT, which attends to words in utterances that are relevant to a given domain-slot-type. In addition, the contextual semantic encoders and the non-parametric discriminator enable a single SUMBT to deal with multiple domains and slot-types without increasing model size. The proposed model achieved state-of-the-art joint accuracy on the WOZ 2.0 and MultiWOZ corpora. Furthermore, we experimentally showed that sharing knowledge by learning from multiple domain data helps to improve performance. As future work, we plan to explore whether SUMBT can continually learn new knowledge when the domain ontology is updated.


We would like to thank Jinyoung Yeo and the anonymous reviewers for their constructive feedback and helpful discussions. We are also grateful to the SK T-Brain Meta AI team for GPU cluster support in conducting large-scale experiments.