Task-oriented dialogue systems help users accomplish tasks such as restaurant reservations and flight bookings. In such systems, the dialogue state tracker (DST) is a core component that maintains a distribution over dialogue states based on the dialogue history. A dialogue state at any turn is typically represented as a set of slot-value pairs, such as (price, moderate) or (food, italian) in the context of restaurant reservation. The goal of the DST is to determine the user's intent and goal during the dialogue and to represent them as such slot-value pairs. The downstream components of a dialogue system (e.g., the dialogue manager), which are responsible for choosing the next system action, rely on an accurate DST to produce an effective dialogue strategy.
Because of the importance of DST in dialogue systems, its development has attracted considerable research in both academia and industry. Typical dialogue systems are modeled for a fixed ontology consisting of a single domain [9, 20, 14], where the domain ontology schema defines the intents, the slots and the values for each slot of the domain. Although this approach simplifies the DST task by making it domain specific, it has significant real-world limitations. Firstly, it requires the values for each slot to be predefined, while in real situations the number of possible values for a given slot can be large (e.g., departure city), and it is not feasible to enumerate all possible values of a slot exposed by an external service via an API. Secondly, any modification of the domain schema, such as the inclusion of a new slot, requires retraining the model before deploying it. Thirdly, the knowledge learned by the system on one domain cannot be transferred to a new domain. This last point is particularly relevant for large-scale conversational agents, such as Alexa, Siri, Google Assistant or Cortana, as they have to interact with various external services and APIs and would benefit from reusing knowledge from one domain in another, similar domain.
Recent research has proposed new approaches to tackle some of the above challenges. The concept of a candidate set was proposed in [12] to handle the unbounded set of values for a given slot. A neural belief tracker (NBT) for multi-domain belief tracking was proposed in [11]; NBT uses the semantic similarity between the dialogue utterance and the ontology terms to make predictions, but it cannot predict non-categorical slot values. The Transferable Multi-Domain State Generator (TRADE) [18] generates the dialogue state directly from the dialogue utterance using a copy mechanism. Since TRADE is a generation-based approach, predicting categorical values is not straightforward and requires a change in architecture.
In this paper we propose a domain-aware dialogue state tracker that is able to serve any new domain, intent, slot or value without the need for retraining. We propose this model for the Schema-Guided State Tracking challenge [13] (https://github.com/google-research-datasets/dstc8-schema-guided-dialogue) at the Eighth Dialog System Technology Challenge (DSTC8) [8]. The proposed model is inspired by earlier work on dialogue state tracking.
The proposed domain-aware DST is designed to read the schema of a domain for a dialogue and then make predictions for the domain based on this schema. Since the schema of the domain is not predefined in the model, new schemas can be integrated easily and knowledge can be transferred across domains, including zero-shot dialogue state prediction. We achieve this by leveraging a pre-trained language model, BERT [4], which provides semantic representations for both the domain schema and the dialogue history. The domain-aware DST model then learns the relationship between the schema and the dialogue history from the training data, and applies it to new schemas in the test data. In particular, we use a multi-head attention mechanism to extract both domain-specific and slot-specific representations from the dialogue history based on the schema, and to learn the relationship between these representations and the schema representation in order to make dialogue state predictions. The evaluation shows that the proposed approach outperforms the baseline for both single-domain and multi-domain dialogues, on the schema-guided dataset (SGD) as well as on the WoZ2.0 dataset.
2 Schema-Guided Approach to DST
Typically, a task-oriented dialogue dataset consists of a single domain with a predefined ontology, which is used both for training and for testing [6, 16]. Recent work has shown promise in modeling multi-domain dialogues and has led to the release of multi-domain dialogue datasets, such as MultiWOZ [2] and FRAMES [5]. However, these datasets still do not sufficiently capture the challenges that arise in scaling virtual assistants for real production, such as the ability to easily integrate new external services and to handle new domains [13]. Thus, DST models built on existing datasets cannot be evaluated for their ability to integrate new services or to predict dialogue states for a different domain, possibly without training data (zero-shot learning). This is mainly because different datasets use different schemas that are incompatible with each other.
To address the above challenges, a schema-guided approach was proposed in [13], which allows easy integration of new services and APIs. In this approach, each service provides a list of the intents and slots it supports, along with their natural language descriptions. The supported slots can be either categorical or non-categorical; for categorical slots, all the possible values are also defined. The details in the schema, such as service name, slot names, intent names, value names and their natural language descriptions, are used to obtain a semantic representation of the domain/service, which can be used in prediction. A sample schema for a given service in the schema-guided dataset is shown in Figure 1.
Each service provides a schema, similar to the one reported in Figure 1, listing all the slots and the intents supported by the service. The model is then trained to make predictions based on the schema of the service. Adopting this schema-guided approach helps to build models that contain no domain or service specific parameters.
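To make the structure of such a schema concrete, the following minimal sketch encodes one service as a Python dictionary and separates its categorical from its non-categorical slots. The service, its field names and its values are hypothetical and chosen only for illustration; they are not copied from the dataset.

```python
# Hypothetical schema for one service; names and values are illustrative only.
schema = {
    "service_name": "Restaurants_1",
    "description": "A service for finding and reserving restaurants",
    "intents": [
        {"name": "FindRestaurants", "description": "Find a restaurant by cuisine and city"},
        {"name": "ReserveRestaurant", "description": "Reserve a table at a restaurant"},
    ],
    "slots": [
        {"name": "city", "description": "City where the restaurant is located",
         "is_categorical": False, "possible_values": []},
        {"name": "price_range", "description": "Price range of the restaurant",
         "is_categorical": True, "possible_values": ["cheap", "moderate", "expensive"]},
    ],
}


def split_slots(service_schema):
    """Return (categorical, non_categorical) slot names of a service schema."""
    categorical = [s["name"] for s in service_schema["slots"] if s["is_categorical"]]
    non_categorical = [s["name"] for s in service_schema["slots"] if not s["is_categorical"]]
    return categorical, non_categorical


categorical, non_categorical = split_slots(schema)
```

Only categorical slots carry an enumerated list of values; non-categorical slots such as the city are open-ended and have to be extracted from the dialogue itself.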
3 Domain-Aware Dialogue State Tracker
In this section we describe the proposed domain-aware DST framework. First, we briefly introduce the pre-trained model, based on BERT, used for this work, and then we describe the different components of the Domain-Aware Dialogue State Tracker that relies on the BERT representations. While a single domain might have multiple services, for easier interpretability of the model, we refer to each service in the schema as a separate domain.
3.1 BERT Encoder
BERT (Bidirectional Encoder Representations from Transformers) is a multi-layer bidirectional Transformer encoder designed to learn representations from unlabelled text by conditioning on both the left and the right context [4, 15]. The input to BERT can be either a single sentence or a pair of sentences, both referred to as a “sequence”. A special token, [CLS], is always prepended to the input sequence, while another special token, [SEP], is used to separate the sentences in the sequence. The input sequence is thus formed as follows:
$$[\mathrm{CLS}],\ A_1, \ldots, A_n,\ [\mathrm{SEP}],\ B_1, \ldots, B_m,\ [\mathrm{SEP}]$$
where $A$ and $B$ are the two sentences of the pair. The representation of the input sequence consists of i) a token embedding constructed using WordPiece embeddings; ii) a segment embedding indicating whether a token belongs to $A$ or to $B$; and iii) a position embedding indicating the position of each token in the sequence.
The model is then pre-trained on two unsupervised tasks, namely a) masked language modelling (MLM) and b) next sentence prediction (NSP). The BooksCorpus (800M words) and English Wikipedia (2,500M words) are used as the pre-training data. Once pre-trained, the representations learnt by BERT are fine-tuned for downstream tasks using task-specific labelled data and an additional task-specific output layer. During fine-tuning, the output representation of the [CLS] token is used for sentence-level classification tasks, such as sentiment analysis, and the output representations of the individual tokens are used for token-level tasks, such as sequence tagging.
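The input format described above can be sketched in a few lines of Python. This is a simplified illustration: whitespace splitting stands in for WordPiece tokenization, and the function name `build_bert_input` is an assumption made for this example rather than part of any library.

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Return (tokens, segment_ids, position_ids) for one or two sentences.

    Whitespace splitting stands in for WordPiece tokenization (an assumption
    made to keep the sketch free of external dependencies).
    """
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # segment 0 covers [CLS], A and the first [SEP]
    if sentence_b is not None:
        tokens_b = sentence_b.split() + ["[SEP]"]
        tokens += tokens_b
        segment_ids += [1] * len(tokens_b)   # segment 1 covers B and the final [SEP]
    position_ids = list(range(len(tokens)))  # positions index the whole sequence
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_input(
    "which city ?", "i want to eat in rome"
)
```

The three returned lists correspond to the token, segment and position inputs whose embeddings are summed to form the input representation.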
3.2 Schema Embedder
Given a schema consisting of the possible intents, slots (both categorical and non-categorical) and categorical values for a domain, the schema embedder is used to obtain the representations of the corresponding domain, intents, slots and categorical values. We construct an input sequence for each of these schema elements as shown in Table 1, based on their name and description.
Table 1: Input sequences used by the schema embedder.
| | Sentence 1 | Sentence 2 |
|---|---|---|
| Domain | Domain name | Domain description |
| Intent | Intent name | Intent description |
| Slot | Slot name | Slot description |
These sequences are then passed to the pre-trained BERT model, and the output of the [CLS] token is used as the corresponding embedding. For a given schema, the schema embedder outputs the following embeddings: the domain embedding ($E_d$), the intent embeddings ($E_i$), the slot embeddings ($E_s$) and the categorical value embeddings ($E_v$). For clarity, we represent the slot embeddings of categorical and non-categorical slots separately, as $E_{s_c}$ and $E_{s_n}$ respectively, while $E_s$ denotes all the slots.
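As a rough illustration of how the sentence pairs of Table 1 could be assembled before being passed to BERT, consider the sketch below. The schema content and the helper names `schema_input_pairs` and `to_sequence` are hypothetical; the actual embedding step (running BERT and reading the [CLS] output) is not shown.

```python
def schema_input_pairs(schema):
    """Build (kind, sentence_1, sentence_2) triples following Table 1."""
    pairs = [("domain", schema["service_name"], schema["description"])]
    for intent in schema["intents"]:
        pairs.append(("intent", intent["name"], intent["description"]))
    for slot in schema["slots"]:
        pairs.append(("slot", slot["name"], slot["description"]))
    return pairs


def to_sequence(sentence_1, sentence_2):
    """Format a sentence pair as a BERT-style input sequence."""
    return f"[CLS] {sentence_1} [SEP] {sentence_2} [SEP]"


# Hypothetical service schema used only for this example.
schema = {
    "service_name": "Trains_1",
    "description": "Book train journeys between cities",
    "intents": [{"name": "FindTrains", "description": "Find trains to a destination"}],
    "slots": [{"name": "to", "description": "Destination city of the journey"}],
}

sequences = [(kind, to_sequence(a, b)) for kind, a, b in schema_input_pairs(schema)]
```

Each resulting sequence would yield one embedding, so a schema with one domain, one intent and one slot produces three embeddings.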
3.3 Utterance Encoder
The utterance encoder is similar to the one presented in [3, 13], which uses BERT to encode the previous system response $S_t$ and the current user utterance $U_t$. We follow their approach of treating $S_t$ and $U_t$ as a sentence pair to form the input sequence for the BERT model. The output from BERT corresponding to the tokens in the sentence pair is used as the token-level representation $T_t$, while the representation of the [CLS] token is used as the sequence representation $u_t$.
The decoder relies on the schema embeddings obtained from the schema embedder and makes predictions based on the given schema. As the input schema is dynamic, the decoder is designed to accommodate any new domain, and relies on the pre-trained knowledge of BERT to make predictions for the new domain.
For a given domain $d$, let the intent embeddings $E_i$ represent the possible intents. By default, the intent of the user is NONE until a specific intent is initiated by the user. The set of possible intents is therefore represented as follows:
$$I_d = \{E_{\mathrm{NONE}}\} \cup E_i$$
where $E_{\mathrm{NONE}}$ is a trainable parameter that represents the NONE intent. The representations of the intents are then combined with the utterance representation $u_t$ and passed through a classifier with trainable weight and bias parameters and a non-linear activation function [7], which outputs the probability distribution over all intents in the domain $d$.
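As an illustration of the intent prediction step, the sketch below scores each candidate intent (including NONE) against the utterance representation and normalizes the scores with a softmax. The dot-product scorer, the toy vectors and the function names are assumptions made for this example; the trained scoring layer of the model is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent(utterance, intent_embs):
    """Return (best_intent, probabilities) using dot-product scores.

    A plain dot product stands in for the trained scoring layer (assumption).
    """
    names = list(intent_embs)
    scores = [sum(u * e for u, e in zip(utterance, intent_embs[n])) for n in names]
    probs = dict(zip(names, softmax(scores)))
    return max(probs, key=probs.get), probs


# Toy embeddings; NONE is an extra entry alongside the schema intents.
intent_embs = {
    "NONE": [0.0, 0.0, 0.0],
    "FindTrains": [1.0, 0.2, 0.0],
    "BuyTicket": [0.1, 1.0, 0.3],
}
utterance = [0.9, 0.1, 0.0]
intent, probs = predict_intent(utterance, intent_embs)
```

Because the softmax normalizes over all candidates, exactly one intent (possibly NONE) is selected per turn.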
Domain and Slot Specific Representation.
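Given a domain representation $E_d$ and the corresponding slot representation $E_s$, we employ multi-head attention (MHA) [15] to extract a domain-specific representation $H_d$ and a slot-specific representation $H_s$ from the token-level representation $T_t$. $H_d$ and $H_s$ are then combined to obtain a context representation $C$. The architecture of the model that extracts the context representation from the input sequence is shown in Figure 2. Formally:
$$H_d = \mathrm{MHA}(E_d, T_t), \qquad H_s = \mathrm{MHA}(E_s, T_t), \qquad C = H_d \oplus H_s$$
where $\oplus$ denotes the combination of the two representations.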
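The extraction of schema-specific representations with multi-head attention can be sketched as follows. This is a simplified, dependency-free illustration under several assumptions: the schema embedding is used as the query over the token representations, the learned projection matrices of standard multi-head attention are omitted, and the two outputs are combined by concatenation. None of these choices is confirmed by the text above.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over a sequence."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def multi_head_attention(query, tokens, num_heads):
    """Split vectors into heads, attend per head and concatenate the results.

    Learned input/output projections are omitted for brevity (assumption).
    """
    head_dim = len(query) // num_heads
    output = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_tokens = [t[lo:hi] for t in tokens]
        output += attend(query[lo:hi], head_tokens, head_tokens)
    return output


# Toy 4-dimensional token representations and schema embeddings (made up).
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
domain_emb = [1.0, 0.0, 0.0, 1.0]
slot_emb = [0.0, 1.0, 1.0, 0.0]

h_d = multi_head_attention(domain_emb, tokens, num_heads=2)
h_s = multi_head_attention(slot_emb, tokens, num_heads=2)
context = h_d + h_s  # list concatenation; the combination rule is an assumption
```

Each head attends most strongly to the tokens that resemble its slice of the query, so the domain and slot queries pull out different summaries of the same token sequence.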
The context representation is then used to predict if a slot is expressed in the input sequence by the user. Based on the slot-type, the decoding strategy varies.
The prediction of the requested slots follows a similar approach to the intent prediction. Unlike in intent prediction, where only a single intent is active at a time, the user may request multiple slots in the same turn. We therefore use a binary classification approach for each possible slot. The classifier has its own trainable weight and bias parameters and outputs, for each slot of a given domain $d$, the probability that the slot is requested by the user.
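The difference from intent prediction can be illustrated with a short sketch in which every slot is scored independently with a sigmoid, so that any number of slots can be selected in one turn. The logits, the 0.5 threshold and the function name `requested_slots` are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def requested_slots(slot_logits, threshold=0.5):
    """Return every slot whose independent sigmoid probability exceeds the threshold."""
    return [slot for slot, logit in slot_logits.items() if sigmoid(logit) > threshold]


# Hypothetical per-slot logits produced by the classifier for one user turn.
logits = {"phone_number": 2.3, "address": 0.8, "price_range": -1.7}
requested = requested_slots(logits)
```

Here two slots pass the threshold at once, which a softmax over slots could not express.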
Given the slot representations $E_s$ in a domain $d$, we adopt a two-step strategy, similar to [3, 13], to decode the values of a slot. First, we predict whether a value (other than dontcare) was expressed by the user, and only then decode the value for the slot.
The first step classifies the status of each slot, where dontcare denotes that the user does not care about the value for the slot, while active denotes that the user has expressed a value in the input that needs to be decoded.
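The two-step strategy can be sketched as a small dispatch function. The status labels `none`, `dontcare` and `active` and the function name `decode_slot` are assumptions made for this example; the value decoder is passed in as a callable so that it is only invoked when a value was actually expressed.

```python
def decode_slot(status_probs, decode_value):
    """Two-step decoding: classify the slot status, then decode a value if needed.

    status_probs maps each status label to its probability; decode_value is a
    callable that is only invoked when the predicted status is "active".
    """
    status = max(status_probs, key=status_probs.get)
    if status == "none":
        return None        # the slot was not mentioned in this turn
    if status == "dontcare":
        return "dontcare"  # the user accepts any value
    return decode_value()  # defer to the categorical or span classifier


value_active = decode_slot({"none": 0.1, "dontcare": 0.2, "active": 0.7}, lambda: "italian")
value_none = decode_slot({"none": 0.8, "dontcare": 0.1, "active": 0.1}, lambda: "italian")
value_dontcare = decode_slot({"none": 0.2, "dontcare": 0.6, "active": 0.2}, lambda: "italian")
```

Separating the status decision from the value decoding keeps the value classifiers from being forced to output a value for slots the user never mentioned.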
To decode the corresponding value for the slot, the respective classifier is used, depending on the slot type (i.e., categorical or non-categorical).
For categorical slots, the context representation of the input sequence and the representations of the possible values for each slot are combined to obtain a logit score for each value. The classifier has trainable weight and bias parameters and outputs a probability for each possible categorical value ($E_v$) of a given slot in domain $d$. We adopt a binary classification for each value rather than a multi-class classification approach. This lets the model predict a value only if there is strong evidence that the value is expressed in the input sequence.
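A minimal sketch of the per-value binary decision is given below. The logits, the 0.5 threshold and the function name `pick_categorical_value` are illustrative assumptions; how ties or multiple confident values are resolved in the actual model is not specified above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pick_categorical_value(value_logits, threshold=0.5):
    """Return the most probable value, or None when no value is confident enough."""
    probs = {value: sigmoid(logit) for value, logit in value_logits.items()}
    best = max(probs, key=probs.get)
    return best if probs[best] > threshold else None


confident = pick_categorical_value({"cheap": -2.0, "moderate": 1.5, "expensive": -0.5})
uncertain = pick_categorical_value({"cheap": -2.0, "moderate": -0.3, "expensive": -0.5})
```

With a softmax, the second call would still have to return one of the three values; with independent binary decisions it can abstain.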
For non-categorical slots, the context representation of the input sequence and the representations of the non-categorical slots ($E_{s_n}$) are combined to obtain, for each token, the probability of being either the start or the end of a span.
We denote by $p^{start}$ the probability of a token being at the beginning of a span, and by $p^{end}$ the probability of a token being at the end of a span.
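Given per-token start and end probabilities, a value span still has to be selected. The sketch below uses a common decoding rule, namely choosing the pair that maximizes the product of start and end probability subject to the start not following the end. This rule, the length limit and the toy probabilities are assumptions for illustration, not necessarily the procedure used by the model.

```python
def best_span(start_probs, end_probs, max_len=10):
    """Return (start, end) maximizing p_start * p_end with start <= end."""
    best, best_score = None, -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


tokens = ["i", "am", "leaving", "from", "san", "francisco", "tomorrow"]
start_probs = [0.01, 0.01, 0.02, 0.05, 0.80, 0.10, 0.01]
end_probs = [0.01, 0.01, 0.01, 0.02, 0.10, 0.80, 0.05]

start, end = best_span(start_probs, end_probs)
value = " ".join(tokens[start:end + 1])
```

Because the value is copied from the input tokens, this kind of decoding is not limited to a predefined list of values.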
In this section we describe the dataset and the experimental setting used for the state tracking task.
Schema-Guided Dataset (SGD). This is the official dataset for the schema-guided state tracking challenge at DSTC8 [13]. The dataset covers 20 domains with a total of 45 services, and contains both single-domain and multi-domain dialogues. The same domain can have multiple services, because different external service providers may use different schemas. Each dialogue in the dataset is associated with the service it corresponds to, and a schema file captures all possible service schemas. The dataset contains a large number of unseen services in both the development set and the test set. The statistics of the data across the different splits are shown in Table 2.
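Table 2: Statistics of the SGD dataset splits.
| | Train | Dev | Test |
|---|---|---|---|
| No. of dialogs | 16142 | 2482 | 4201 |
| No. of slots | 214 | 136 | |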
We also used the WoZ2.0 dataset [9], consisting of written text conversations for the restaurant booking domain, to evaluate the proposed model. Unlike the schema-guided dataset, WoZ2.0 is a single-domain dataset collected using the Wizard of Oz framework. WoZ2.0 consists of a total of 1200 dialogues, of which 600 are for training, 200 for development and 400 for testing. It comes with a predefined ontology listing all possible slots and values. We use this ontology to generate a schema consisting only of the slot names and to extract a schema representation based on these slot names. We experiment with treating the slots both as categorical and as non-categorical slots.
4.2 Evaluation Metrics
We evaluated our models using the official evaluation script of [13] and compared the models on the following metrics.
Intent Accuracy: The fraction of user turns for which the active intent is correctly predicted by the model.
Requested Slots F1: This indicates the model performance in correctly predicting whether a slot is requested by the user. It is the macro-averaged F1 score over all requested slots.
Average Goal Accuracy: This is the average accuracy of predicting the correct value for a slot.
Joint Goal Accuracy: This indicates the model performance in predicting all slots in a given turn correctly. Joint goal accuracy is the primary metric for the state tracking task.
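The relationship between the two goal accuracy metrics can be illustrated with a simplified computation. The sketch below uses exact string matching over the gold slots of each turn; the official evaluation script may treat some cases differently (for example, the scoring of non-categorical values), so the function `goal_accuracies` should be read as an assumption-laden approximation.

```python
def goal_accuracies(turns):
    """Compute (average, joint) goal accuracy over (predicted, gold) state pairs.

    Each state is a dict mapping slot names to values. Exact string match is
    used here; the official script may score values differently (assumption).
    """
    slot_hits = slot_total = joint_hits = 0
    for predicted, gold in turns:
        correct = [predicted.get(slot) == value for slot, value in gold.items()]
        slot_hits += sum(correct)
        slot_total += len(correct)
        # A turn counts for joint accuracy only if the whole state matches.
        joint_hits += int(all(correct) and set(predicted) == set(gold))
    return slot_hits / slot_total, joint_hits / len(turns)


turns = [
    ({"food": "italian", "price": "moderate"}, {"food": "italian", "price": "moderate"}),
    ({"food": "italian", "price": "cheap"}, {"food": "italian", "price": "moderate"}),
]
average_ga, joint_ga = goal_accuracies(turns)
```

A single wrong slot lowers the average goal accuracy only slightly but makes the whole turn count as incorrect for joint goal accuracy, which is why the joint metric is much stricter.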
4.3 Experimented models
The proposed approach is designed to be applicable to both single-domain and multi-domain dialogues. This means that the same model trained on single-domain dialogues can also be used to make predictions on multi-domain dialogues, because the model makes predictions based on the schema rather than on a predefined ontology. As mentioned in Section 3, we build on top of the model proposed in [13], which also uses BERT to extract representations and which we consider as the baseline. We evaluate the proposed approach by applying both domain and slot representations extracted from the schema embedder. The default model defined in Section 3 is referred to as D + S, meaning that both domain and slot representations are used to extract the context representation from the dialogue history. We also investigate an approach that uses only the domain representation to extract the context representation from the dialogue history; this model is referred to as D.
We use the PyTorch library [10] to implement the domain-aware state tracker. The encoder is the BERT-Base model [4] as implemented by HuggingFace [17] (https://github.com/huggingface/transformers/), consisting of 12 layers with 768 hidden dimensions and 12 self-attention heads. The schema representation is extracted from the BERT model based on the schema for the dialogue, following the input sequence template shown in Table 1. This schema representation is not fine-tuned during training. We use a learning rate of 5e-5 with a batch size of 32 for training.
5 Results and discussion
5.1 Performance on SGD
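Table 3: Results on the single-domain dialogues of the SGD dev set (* trained on single-domain dialogues only; + trained on all dialogues).
| Model | Int. Acc | Req. F1 | Avg. GA | Jnt. GA |
|---|---|---|---|---|
| D + S* | 0.971 | 0.964 | 0.844 | 0.597 |
| D + S+ | 0.963 | 0.971 | 0.861 | 0.627 |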
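Table 4: Results on the complete SGD dev set (* trained on single-domain dialogues only; + trained on all dialogues).
| Model | Int. Acc | Req. F1 | Avg. GA | Jnt. GA |
|---|---|---|---|---|
| D + S* | 0.933 | 0.963 | 0.728 | 0.432 |
| D + S+ | 0.958 | 0.976 | 0.783 | 0.502 |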
The evaluation of the proposed model on the single-domain dialogues of the SGD dev set is shown in Table 3. All models presented in Table 3 perform very well on the intent accuracy and requested slot F1 metrics. Using only the single-domain data of the SGD dataset for training, the model with the domain-specific representation (D*) obtains a joint goal accuracy of 0.541, while the model with both domain-specific and slot-specific representations (D+S*) obtains 0.597. When the same models are trained on all the dialogues in the training set (models D+ and D+S+), they outperform their counterparts trained on fewer dialogues (single-domain dialogues only). Evaluating the four models on the complete dev set, which includes both single-domain and multi-domain dialogues, we observe the same trend, as shown in Table 4. This indicates that using both domain-specific and slot-specific representations helps the model to represent the input sequence better than the baseline approaches.
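Table 5: Results on the SGD test set (* trained on single-domain dialogues only; + trained on all dialogues).
| Model | Int. Acc | Req. F1 | Avg. GA | Jnt. GA |
|---|---|---|---|---|
| D + S* | 0.890 | 0.954 | 0.610 | 0.254 |
| D + S+ | 0.900 | 0.968 | 0.638 | 0.303 |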
The results on the test set of the SGD dataset are shown in Table 5. Again, we observe the same pattern, which indicates that integrating domain and slot representations to learn a better representation of the input sequence helps to improve the model's performance.
5.2 Performance on WoZ2.0
Table 6: Results on the WoZ2.0 test set.
| Model | Slot Type | Req. F1 | Avg. GA | Jnt. GA |
|---|---|---|---|---|
| D + S | Non-Cat. | 0.978 | 0.948 | 0.869 |
| D + S | Cat. | 0.968 | 0.962 | 0.899 |
The evaluation of the proposed model on the WoZ2.0 test set is shown in Table 6. As with the SGD results, the model with both the domain-specific and slot-specific representations outperforms the other approaches in terms of joint goal accuracy. In addition, the proposed models outperform the baseline model regardless of whether the slots are treated as non-categorical or as categorical. This shows the model's ability to extract relevant information from the dialogue history for making predictions.
5.3 Zero-Shot Dialogue State Tracking
|Domain||Avg. GA||Joint GA|
The architecture of the proposed model is robust to variations in the service schema, which enables it to make predictions for any new schema that was not seen in training (unseen). Such zero-shot prediction capability of the domain-aware DST is crucial, as it enables knowledge transfer from high-resource domains to similar domains with little or no training data. The overall performance of the D+S+ model on the test set, for both seen and unseen services, is shown in Figure 3. The performance on the unseen services is lower than on the seen services, because unseen services contain a higher number of out-of-vocabulary (OOV) slots and values. The performance of the model on each domain in the test set is shown in Table 7. Of all the domains in the test set, Alarm, Messaging, Payment and Trains do not have any service in the training data. As a result, Alarm, Messaging and Payment obtain the lowest average goal accuracy among all domains. For the Trains domain, however, the average goal accuracy is even higher than on some of the seen domains. This is because the Trains domain is similar to other domains in the training set, such as Buses and Flights. This shows that the model is able to use the knowledge learned from the domains in the training data and to apply it effectively to new, similar domains without any additional training data.
We presented a schema-guided dialogue state tracker that is robust to changes that may occur in the schema. The proposed model does not rely on any domain-specific or slot-specific parameters; rather, it uses domain and slot representations to learn the corresponding representations from the input sequence. We showed that this approach outperforms the baseline approach and is able to transfer knowledge between domains effectively. This is particularly promising for low-resource domains, where training data are scarce or even absent. In these settings, the proposed approach could leverage data from high-resource domains, achieving significant data efficiency.
References
[1] (2019) Scalable neural dialogue state tracking. arXiv preprint arXiv:1910.09942.
[2] MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 5016–5026.
[3] (2019) BERT-DST: scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer. CoRR abs/1907.03040.
[4] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
[5] (2017) Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, pp. 207–219.
[6] (1990) The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
[7] (2016) Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415.
[8] (2019) The eighth dialog system technology challenge. arXiv preprint.
[9] (2017) Neural belief tracker: data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1777–1788.
[10] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
[11] (2018) Large-scale multi-domain belief tracking with knowledge sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 432–437.
[12] Scalable multi-domain dialogue state tracking. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[13] (2019) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.
[14] (2018) Towards universal dialogue state tracking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2780–2786.
[15] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[16] (2017) A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 438–449.
[17] (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.
[18] (2019) Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 808–819.
[19] (2018) An end-to-end approach for handling unknown slot values in dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1448–1457.
[20] (2018) Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1458–1467.