Task-oriented dialog systems have attracted increasing attention in recent years because they let users interact naturally to accomplish simple tasks such as flight booking or restaurant reservation. Dialog state tracking (DST) is an important component of task-oriented dialog systems. Its purpose is to keep track of the state of the conversation from past user inputs and system outputs. Based on this estimated dialog state, the dialog system then plans the next action and responds to the user. In a slot-based dialog system, a state in DST is often expressed as a set of slot-value pairs. The set of slots and their possible values are typically domain-specific and defined in a domain ontology.
With the renaissance of deep learning, many neural network based approaches have been proposed for the task of DST [13, 11, 24, 17, 15, 14, 10, 3]. These methods achieve highly competitive performance on standard DST datasets such as DSTC-2 or WoZ 2.0. However, most of these methods still have some limitations. First, many approaches require training a separate model for each slot type in the domain ontology [13, 24, 14]. Therefore, the number of parameters is proportional to the number of slot types, making the scalability of these approaches a significant issue. Second, some methods only operate on a fixed domain ontology [11, 24]. The slot types and possible values need to be defined in advance and cannot change dynamically. Finally, state-of-the-art neural architectures for DST are typically heavily engineered and conceptually complex [24, 17, 15, 14]. Each of these models consists of a number of different kinds of sub-components. In general, complicated deep learning models are difficult to implement, debug, and maintain in a production setting.
Recently, several pretrained language models, such as ELMo and BERT, have been used to achieve state-of-the-art results on many NLP tasks. In this paper, we show that by finetuning a pretrained BERT model, we can build a conceptually simple but effective model for DST. Given a dialog context and a candidate slot-value pair, the model outputs a score indicating the relevance of the candidate. Because the model shares parameters across all slot types, the number of parameters does not grow with the ontology size. Furthermore, because each candidate slot-value pair is simply treated as a sequence of words, the model can be directly applied to new types of slot-value pairs not seen during training. We do not need to retrain the model every time the domain ontology changes. Empirical results show that our proposed model outperforms prior work on the standard WoZ 2.0 dataset. However, a drawback of the model is that it is too large for resource-limited systems such as mobile devices.
To make the model less computationally demanding, we propose a compression strategy based on the knowledge distillation framework. Our final compressed model achieves results comparable to those of the full model, but it is around 8 times smaller and performs inference about 7 times faster on a single CPU.
BERT is built on the Transformer, a self-attention based architecture. The base version of BERT consists of 12 Transformer layers, each with a hidden size of 768 units and 12 self-attention heads. The input to BERT is a sequence of tokens (words or pieces of words). The output is a sequence of vectors, one for each input token. The input representation of BERT is flexible enough that it can unambiguously represent both a single text sentence and a pair of text sentences in one token sequence. The first token of every input sequence is always a special classification token, [CLS]. The output vector corresponding to this token is typically used as the aggregate representation of the original input. For single text sentence tasks (e.g., sentiment classification), the [CLS] token is followed by the actual tokens of the input text and a special separator token, [SEP]. For sentence pair tasks (e.g., entailment classification), the tokens of the two input texts are separated by another [SEP] token, and the input sequence also ends with a [SEP] token. Figure 1 demonstrates the input representation of BERT.
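The token layout described above can be illustrated with a minimal sketch. The helper name `build_input` and the whitespace-level tokens are hypothetical; real BERT inputs use WordPiece tokens, and the segment ids shown here (0 for the first text, 1 for the second) reflect the standard BERT convention rather than anything spelled out in this section.

```python
from typing import List, Optional, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_input(first: List[str],
                second: Optional[List[str]] = None) -> Tuple[List[str], List[int]]:
    """Return (tokens, segment_ids) for a single text or a pair of texts."""
    # Every sequence starts with [CLS]; the first text is closed by [SEP].
    tokens = [CLS] + first + [SEP]
    segment_ids = [0] * len(tokens)
    if second is not None:
        # The second text follows and is closed by its own [SEP].
        tokens += second + [SEP]
        segment_ids += [1] * (len(second) + 1)
    return tokens, segment_ids


# Single-sentence input (e.g., sentiment classification).
single_tokens, single_segments = build_input(["great", "movie"])

# Sentence-pair input (e.g., entailment classification).
pair_tokens, pair_segments = build_input(["it", "rains"],
                                         ["the", "ground", "is", "wet"])
```

In both cases the output vector at position 0 (the [CLS] position) would serve as the aggregate representation.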
During pretraining, BERT was trained using two self-supervised tasks: masked language modeling (masked LM) and next sentence prediction (NSP). In masked LM, some of the tokens in the input sequence are randomly selected and replaced with a special token, [MASK], and the objective is to predict the original vocabulary ids of the masked tokens. In NSP, BERT needs to predict whether two input segments follow each other in the original text. Positive examples are created by taking consecutive sentences from the text corpus, whereas negative examples are created by picking segments from different documents. After the pretraining stage, BERT can be applied to various downstream tasks such as question answering and language inference, without substantial task-specific architecture modifications.
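As a rough illustration of the masked LM input corruption, the sketch below selects a fraction of the non-special tokens and replaces them with [MASK]. The function name `mask_tokens` is hypothetical, and the default rate of 15% is taken from the distillation setup in Section 2.3. The original BERT recipe also sometimes keeps or randomly replaces a selected token; that refinement is omitted here.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens: List[str], mask_rate: float = 0.15,
                rng: Optional[random.Random] = None) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of tokens with [MASK].

    Returns the corrupted sequence and a map from masked position
    to the original token (the prediction target).
    """
    rng = rng or random.Random()
    # Special tokens are never masked.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    num_to_mask = min(len(candidates), max(1, round(len(candidates) * mask_rate)))
    chosen = rng.sample(candidates, num_to_mask)
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for i in chosen:
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets


example = ["[CLS]"] + [f"w{i}" for i in range(20)] + ["[SEP]"]
masked_example, mask_targets = mask_tokens(example, rng=random.Random(0))
```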
2.2 BERT for Dialog State Tracking
Figure 2 shows our proposed application of BERT to DST. At a high level, given a dialog context and a candidate slot-value pair, our model outputs a score indicating the relevance of the candidate. In other words, the approach is similar to a sentence pair classification task. The first input corresponds to the dialog context, and it consists of the system utterance from the previous turn and the user utterance from the current turn. The two utterances are separated by a [SEP] token. The second input is the candidate slot-value pair, which we simply represent as a sequence of tokens (words or pieces of words). The two input segments are concatenated into a single token sequence and passed to BERT to obtain the output vectors $(h_1, h_2, \dots, h_M)$. Here, $M$ denotes the total number of input tokens (including special tokens such as [CLS] and [SEP]).
Based on the output vector corresponding to the first special token [CLS] (i.e., $h_1$), the probability of the candidate slot-value pair being relevant is:
$$y = \sigma(W h_1 + b)$$
where the transformation matrix $W$ and the bias term $b$ are model parameters, and $\sigma$ denotes the sigmoid function, which squashes the score to a probability between 0 and 1.
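The following sketch shows how one scoring example might be assembled and scored. It is only an illustration under stated assumptions: `build_dst_input` and `relevance_probability` are hypothetical names, the candidate is laid out as slot tokens followed by value tokens, and the [CLS] output vector is a made-up three-dimensional vector standing in for the 768-dimensional BERT output.

```python
import math
from typing import List, Sequence


def build_dst_input(system_utt: List[str], user_utt: List[str],
                    slot: str, value: List[str]) -> List[str]:
    """Concatenate dialog context and candidate pair into one token sequence."""
    # First segment: previous system utterance and current user utterance.
    context = ["[CLS]"] + system_utt + ["[SEP]"] + user_utt + ["[SEP]"]
    # Second segment: the candidate slot-value pair as plain tokens
    # (this exact layout is an assumption).
    candidate = [slot] + value + ["[SEP]"]
    return context + candidate


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relevance_probability(h_cls: Sequence[float], w: Sequence[float], b: float) -> float:
    """Apply a linear layer and a sigmoid to the [CLS] output vector."""
    score = sum(wi * hi for wi, hi in zip(w, h_cls)) + b
    return sigmoid(score)


dst_tokens = build_dst_input(["what", "food", "?"], ["chinese", "please"],
                             "food", ["chinese"])
# Toy stand-ins for the BERT [CLS] output and the learned parameters.
toy_prob = relevance_probability([0.2, -0.1, 0.4], [1.0, 2.0, 0.5], 0.0)
```

Because the candidate is just another token segment, the same function handles slot-value pairs that never appeared in training.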
At each turn, the proposed BERT-based model is used to estimate the probability score of every candidate slot-value pair. After that, only pairs with a predicted probability of at least 0.5 are chosen as the final prediction for the turn. To obtain the dialog state at the current turn, we use the newly predicted slot-value pairs to update the corresponding values in the state of the previous turn. For example, suppose the user specifies a food=chinese restaurant during the current turn. If the dialog state has no existing food specification, then we can add food=chinese to the dialog state. If food=korean had been specified before, we replace it with food=chinese.
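The update rule above can be sketched in a few lines. The function name `update_state` is hypothetical, and one detail is an assumption not stated in the text: if several values of the same slot pass the threshold in one turn, this sketch keeps the one with the highest probability.

```python
from typing import Dict, Tuple


def update_state(prev_state: Dict[str, str],
                 scored: Dict[Tuple[str, str], float],
                 threshold: float = 0.5) -> Dict[str, str]:
    """Merge this turn's predicted slot-value pairs into the previous state."""
    best: Dict[str, Tuple[str, float]] = {}
    for (slot, value), prob in scored.items():
        # Keep only pairs with probability of at least the threshold.
        if prob < threshold:
            continue
        # Assumption: on ties within a slot, the most probable value wins.
        if slot not in best or prob > best[slot][1]:
            best[slot] = (value, prob)
    new_state = dict(prev_state)
    for slot, (value, _) in best.items():
        # New predictions overwrite the previous value of the same slot.
        new_state[slot] = value
    return new_state


# No earlier food preference: food=chinese is added.
state_a = update_state({}, {("food", "chinese"): 0.93, ("food", "korean"): 0.04})

# food=korean was set earlier: it is replaced, other slots are untouched.
state_b = update_state({"food": "korean", "area": "north"},
                       {("food", "chinese"): 0.81, ("price", "cheap"): 0.2})
```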
Compared to previous works [24, 17, 15, 14], our model is conceptually simpler. For example, in the GLAD model, there are two scoring modules: the utterance scorer and the action scorer. Intuitively, the utterance scorer determines whether the current user utterance mentions the candidate slot-value pair or not, while the action scorer determines the degree to which the slot-value pair was expressed by the previous system actions. In our proposed approach, we simply use a single BERT model to examine the information from all sources at the same time.
2.3 Model Compression
BERT is a powerful language representation model because it was pretrained on large text corpora (Wikipedia and BooksCorpus). However, the original pretrained BERT models are computationally expensive and have a huge number of parameters. For example, the base version of BERT has about 110M parameters. Therefore, if we directly integrate an existing BERT model into our DST model, it will be difficult to deploy the final model in resource-limited systems such as mobile devices. In this section, we describe our approach to compressing BERT into a smaller model.
Over the years, many model compression methods have been proposed [1, 6, 7, 9]. In this work, we propose a strategy for compressing BERT based on the knowledge distillation framework. Knowledge distillation (KD) aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller. We generally assume that the teacher has previously been trained, and that we are estimating parameters for the student. KD suggests training by matching the student’s predictions to the teacher’s predictions. In other words, we train the student to mimic the output activations that the teacher produces for individual data examples.
We choose the pretrained base version of BERT as the teacher model. Our student model has the same general architecture as BERT but is much smaller than the teacher model (Table 1). The student model has 8 Transformer layers, each with a hidden size of 256 units and 8 self-attention heads, and a feedforward/filter size of 1024. In total, the student model has about 14M parameters, making it roughly 8x smaller and 7x faster at inference than the teacher model.
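The parameter counts quoted above can be sanity-checked with a back-of-the-envelope estimate. The sketch below rests on assumptions that are not stated in the text: a 30,522-entry WordPiece vocabulary, 512 position embeddings, 2 segment embeddings, a pooler layer, a teacher feedforward size of 3072, and 4 bytes per parameter for storage. The function name is hypothetical.

```python
def transformer_encoder_params(num_layers: int, hidden: int, ffn: int,
                               vocab_size: int = 30522,    # assumed vocabulary size
                               max_positions: int = 512,   # assumed position table
                               type_vocab: int = 2) -> int:
    """Approximate parameter count of a BERT-style encoder."""
    layer_norm = 2 * hidden  # scale and shift
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the position-wise feedforward block.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


teacher_params = transformer_encoder_params(num_layers=12, hidden=768, ffn=3072)
student_params = transformer_encoder_params(num_layers=8, hidden=256, ffn=1024)

# Storage estimate in megabytes, assuming 32-bit floats.
teacher_mb = teacher_params * 4 / 1e6
student_mb = student_params * 4 / 1e6
size_ratio = teacher_params / student_params
```

Under these assumptions the estimate lands near 109M parameters for the teacher and near 14M for the student, which is consistent with the sizes reported in Table 1. The number of attention heads does not change the count, since the heads split the same projection matrices.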
We first extract sentences from the BooksCorpus, a large-scale corpus consisting of about 11,000 books and nearly 1 billion words. For each sentence, we use the WordPiece tokenizer [19, 5] to tokenize the sentence into a sequence of tokens. Similar to the pretraining phase of BERT, we mask 15% of the tokens in the sequence at random. After that, we define the cross-entropy loss for each token as follows:
$$\mathcal{L} = H\left(\mathrm{softmax}(z^{t}/T),\ \mathrm{softmax}(z^{s}/T)\right)$$
where $H$ refers to the cross-entropy function, $z^{t}$ is the teacher model’s pre-softmax logit for the current token, $z^{s}$ is the student model’s pre-softmax logit, and $T$ is the temperature hyperparameter. In this work, we set $T$ to 10. Intuitively, $\mathcal{L}$ will be small if the student’s prediction for the current token is similar to that of the teacher model. The distillation loss for the entire sentence is simply defined as the sum of the cross-entropy losses of all the tokens. To summarize, during the KD process, we use this distillation loss to train our student model from scratch using the teacher model’s logits on unlabeled examples extracted from the BooksCorpus. After that, we can integrate our distilled student BERT model into our DST model (Figure 2) and use the final model for tracking the state of the conversation.
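A minimal sketch of this distillation loss is given below, using plain Python lists in place of real logit tensors. The function names are hypothetical, and the three-entry logit vectors are toy values; in practice each logit vector would have one entry per vocabulary item.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled log-softmax, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def token_distillation_loss(teacher_logits: Sequence[float],
                            student_logits: Sequence[float],
                            temperature: float = 10.0) -> float:
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def sentence_distillation_loss(teacher: Sequence[Sequence[float]],
                               student: Sequence[Sequence[float]],
                               temperature: float = 10.0) -> float:
    """Sum of the per-token losses over all tokens in the sentence."""
    return sum(token_distillation_loss(t, s, temperature)
               for t, s in zip(teacher, student))


teacher_tok = [2.0, 0.5, -1.0]
close_student = [1.9, 0.6, -1.0]   # nearly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]     # ranks the vocabulary in reverse

loss_close = token_distillation_loss(teacher_tok, close_student)
loss_far = token_distillation_loss(teacher_tok, far_student)
```

A high temperature such as 10 flattens both distributions, so the student is also pushed to match the relative scores the teacher assigns to less likely tokens.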
Different from the very first work on exploring knowledge distillation for BERT, our approach does not use any data augmentation heuristic. We only extract unlabeled sentences from the BooksCorpus to build training examples for distillation. Our work is similar in spirit to DistilBERT, which also uses the original BERT as the teacher and a large-scale unlabeled text corpus as the basic learning material. However, as shown in Table 1, the DistilBERT model is about 5 times larger than our student model. Recently, at WWDC 2019, Apple presented a BERT-based on-device model for question answering. Instead of using knowledge distillation, Apple used the mixed precision training technique to build their model. From Table 1, we see that Apple's model is much larger than our student model, as it has 8x more parameters and requires 4x more storage space. This suggests that our student model is small enough to be deployed on mobile systems. To the best of our knowledge, we are the first to explore the use of knowledge distillation to compress neural networks for DST.
3 Experiments and Results
Table 1: Model sizes.
|Model|Number of parameters|Storage size|
|---|---|---|
|Teacher Model|110M|440 MB|
|Student Model|14M|55 MB|
|Apple Core ML BERT|110M|220 MB|
|DistilBERT|66M|270 MB|
3.1 Data and Evaluation Metrics
To evaluate the effectiveness of our proposed approach, we use the standard WoZ 2.0 dataset. The dataset consists of user conversations with dialog systems designed to help users find suitable restaurants. The ontology contains three informable slots: food, price, and area. In a typical conversation, a user would first search for restaurants by specifying values for some of these slots. As the dialog progresses, the dialog system may ask the user for more information about these slots, and the user answers these questions. The user’s goal may also change during the dialog. For example, the user may want an ‘expensive’ restaurant initially but change to wanting a ‘moderately priced’ restaurant by the end. Once the system suggests a restaurant that matches the user criteria, the user can also ask about the values of up to eight requestable slots (phone, address, …). The dataset has 600/200/400 dialogs for the train/dev/test split. Similar to previous work, we focus on two key evaluation metrics: joint goal accuracy and turn request accuracy.
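The two metrics can be sketched as follows, assuming the definitions commonly used for this dataset: a turn counts toward joint goal accuracy only if the full set of predicted informable slot values matches the gold goal, and toward turn request accuracy only if the set of predicted requested slots matches the gold requests. The function names and the toy dialog are hypothetical.

```python
from typing import Dict, List, Set


def joint_goal_accuracy(pred_goals: List[Dict[str, str]],
                        gold_goals: List[Dict[str, str]]) -> float:
    """Fraction of turns whose entire predicted goal matches the gold goal."""
    correct = sum(1 for p, g in zip(pred_goals, gold_goals) if p == g)
    return correct / len(gold_goals)


def turn_request_accuracy(pred_requests: List[Set[str]],
                          gold_requests: List[Set[str]]) -> float:
    """Fraction of turns whose predicted requested slots match the gold set."""
    correct = sum(1 for p, g in zip(pred_requests, gold_requests) if p == g)
    return correct / len(gold_requests)


# A toy three-turn dialog.
gold_goals = [{"food": "chinese"},
              {"food": "chinese", "price": "expensive"},
              {"food": "chinese", "price": "moderate"}]
pred_goals = [{"food": "chinese"},
              {"food": "chinese", "price": "expensive"},
              {"food": "chinese", "price": "expensive"}]  # missed the goal change

gold_requests = [set(), set(), {"phone", "address"}]
pred_requests = [set(), set(), {"phone", "address"}]

toy_joint_goal = joint_goal_accuracy(pred_goals, gold_goals)
toy_turn_request = turn_request_accuracy(pred_requests, gold_requests)
```

Joint goal accuracy is the stricter of the two, since a single wrong slot value makes the whole turn count as incorrect.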
3.2 Dialog State Tracking Results
Table 2: Test accuracies on the WoZ 2.0 dataset.
|Model|Joint goal accuracy (%)|Turn request accuracy (%)|
|---|---|---|
|Full BERT-based model|90.5|97.6|
|Distilled BERT-based model|90.4|97.7|
|BERT-DST PS|87.7|-|
|Neural Belief Tracker - CNN|84.2|91.6|
|Neural Belief Tracker - DNN|84.4|91.2|
Table 2 shows the test accuracies of different models on the WoZ 2.0 dataset. Our full BERT-based model uses the base version of BERT (i.e., the teacher model in the knowledge distillation process), whereas the distilled BERT-based model uses the compressed student model. Both our full model and the compressed model outperform previous methods by a large margin. Even though our compressed model is 8x smaller and 7x faster than the full model (Table 1), it still achieves almost the same results as the full model. In fact, the smaller model has a slightly higher turn request accuracy. From Table 1, we see that our compressed model even has fewer parameters than GLAD, a DST model that is not based on BERT. This demonstrates the effectiveness of our proposed knowledge distillation approach.
Note that BERT-DST PS is a recent work that also utilizes BERT for DST. However, that work focuses only on the situation where the target slot value (if any) can be found as a word segment in the dialog context. According to Table 2, our models outperform BERT-DST PS on the WoZ dataset. Furthermore, BERT-DST PS only uses the original version of BERT, making it large and slow.
Table 3: Inference time on CPU and GPU.
|Model|Inference time on CPU (s)|Inference time on GPU (s)|
|---|---|---|
|Full BERT-based Model|1.465|0.113|
|Distilled BERT-based Model|0.205|0.024|
3.3 Size and Inference Speed
Table 1 shows that our distilled student model is much smaller than many other BERT-based models in previous works. Table 3 shows the inference speed of our models on CPU (Intel Core i7-8700K @3.70GHz) and on GPU (GeForce GTX 1080). On average, on CPU, our compressed model is 7x faster on inference than our full model and 3x faster than DistilBERT. On GPU, our compressed model is about 5x faster than the full model and 2x faster than DistilBERT.
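The speedups of the compressed model over the full model follow directly from the timings in Table 3, as the short calculation below shows. The dictionary layout and variable names are only for illustration.

```python
# Inference times in seconds, copied from Table 3.
timings = {
    "full": {"cpu": 1.465, "gpu": 0.113},
    "distilled": {"cpu": 0.205, "gpu": 0.024},
}

# Speedup of the distilled model over the full model on each device.
speedup = {
    device: timings["full"][device] / timings["distilled"][device]
    for device in ("cpu", "gpu")
}

for device, factor in speedup.items():
    print(f"{device}: {factor:.1f}x faster")
```

The ratios come out at roughly 7.1 on CPU and 4.7 on GPU, matching the rounded figures of 7x and 5x quoted above.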
In this paper, we propose a simple but effective model based on BERT for the task of dialog state tracking. Because the original version of BERT is large, we apply the knowledge distillation method to compress our model. Our compressed model achieves state-of-the-art performance on the WoZ 2.0 dataset while being 8x smaller and 7x faster at inference than the original model. In future work, we will experiment on larger-scale datasets such as the MultiWOZ dataset.
[1] Do deep nets really need to be deep? In NIPS, 2013.
[2] MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In EMNLP, 2018.
[3] BERT-DST: scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer. ArXiv abs/1907.03040, 2019.
[4] Core ML models. https://developer.apple.com/machine-learning/models/ (accessed 2019-10-20).
[5] BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[6] Deep compression: compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149, 2015.
[7] Learning both weights and connections for efficient neural networks. In NIPS, 2015.
[8] The second dialog state tracking challenge. In SIGDIAL Conference, 2014.
[9] Distilling the knowledge in a neural network. ArXiv abs/1503.02531, 2015.
[10] Dialogue state tracking with convolutional semantic taggers. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7220–7224, 2019.
[11] An end-to-end trainable neural network model with belief tracking for task-oriented dialog. ArXiv abs/1708.05956, 2017.
[12] Mixed precision training. ArXiv abs/1710.03740, 2017.
[13] Neural belief tracker: data-driven dialogue state tracking. In ACL, 2016.
[14] Fully statistical neural belief tracking. In ACL, 2018.
[15] Toward scalable neural dialogue state tracking model. arXiv preprint arXiv:1812.00899, 2018.
[16] Deep contextualized word representations. In NAACL-HLT, 2018.
[17] Towards universal dialogue state tracking. In EMNLP, 2018.
[18] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv abs/1910.01108, 2019.
[19] Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152, 2012.
[20] Distilling task-specific knowledge from BERT into simple neural networks. ArXiv abs/1903.12136, 2019.
[21] Attention is all you need. In NIPS, 2017.
[22] A network-based end-to-end trainable task-oriented dialogue system. In EACL, 2016.
[23] POMDP-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101, pp. 1160–1179, 2013.
[24] Global-locally self-attentive encoder for dialogue state tracking. In ACL, Melbourne, Australia, pp. 1458–1467, 2018.
[25] Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27, 2015.