Goal-Oriented Multi-Task BERT-Based Dialogue State Tracker

02/05/2020 ∙ by Pavel Gulyaev, et al. ∙ Moscow Institute of Physics and Technology

Dialogue State Tracking (DST) is a core component of virtual assistants such as Alexa or Siri. To accomplish various tasks, these assistants need to support an increasing number of services and APIs. The Schema-Guided State Tracking track of the 8th Dialogue System Technology Challenge highlighted the DST problem for unseen services. The organizers introduced the Schema-Guided Dialogue (SGD) dataset with multi-domain conversations and released a zero-shot dialogue state tracking model. In this work, we propose a GOaL-Oriented Multi-task BERT-based dialogue state tracker (GOLOMB) inspired by architectures for reading comprehension question answering systems. The model "queries" dialogue history with descriptions of slots and services as well as possible values of slots. This allows the model to transfer slot values in multi-domain dialogues and to scale to unseen slot types. Our model achieves a joint goal accuracy of 53.97% on the dev set.




Introduction

The advent of virtual assistants such as Amazon Alexa, Google Assistant and many others has created a strong demand for applications providing a natural language interface to services and APIs. These task-oriented dialogue systems can be implemented with either knowledge-based or data-driven approaches. Dialogue state tracking (DST) is the main component of a task-oriented dialogue system: it is responsible for extracting the user's goals and the (slot, value) pairs corresponding to them. For example, if the user's goal is to order a taxi (intent: OrderTaxi), then the slots are destination, number_of_passengers, etc. The development of task-oriented dialogue systems has been driven by releases of task-oriented dialogue corpora such as NegoChat [konovalov2016negochat], ATIS [hemphill1990atis] and many others. However, these single-domain datasets do not fully represent the challenges of the real world, where a conversation often revolves around multiple domains.

The release of the multi-domain dialogue dataset (Multi-WOZ) raised the bar in DST due to its mixed-domain conversations [eric2019multiwoz]. This dataset contains dialogues where the domain switches over time. For example, a user might start a conversation by asking to reserve a restaurant, then go on to request a taxi ride to that restaurant. In this case, the DST has to determine the corresponding domain, slots and values at each turn of the dialogue, taking into account the history of the conversation if necessary.

The largest public task-oriented dialogue corpus, the Schema-Guided Dialogue (SGD) dataset, has been recently released by Google [rastogi2019towards]. It contains over 16,000 dialogues in the training set spanning 26 services in 16 domains. In addition, to measure the model’s ability to perform in zero-shot settings, the evaluation sets (dev and test) contain unseen services and domains.

SGD provides a schema for each service represented in the dialogues. A schema is a list of slots and intents for the service accompanied by their natural language description. The dialogue state consists of three fields: active_intent, requested_slots and slot_values. SGD encourages the model to recognize the semantics of intents and slots from their descriptions while predicting the dialogue state, enabling zero-shot generalization to new schemas. The authors also proposed a single unified task-oriented dialogue model for all services and APIs, achieving 41.1% joint goal accuracy when trained and evaluated on the entire dataset and 48.6% joint goal accuracy for single-domain dialogues only. The proposed model encodes each schema element (intents, slots and categorical slot values) using its natural language description provided in the schema file. These embedded representations are not fine-tuned afterwards, which seems to be a disadvantage of the suggested approach.

In this paper, we introduce a GOaL-Oriented Multi-task BERT-based dialogue state tracker (GOLOMB) inspired by recent reading comprehension question answering neural architectures such as [devlin2018bert]. These models search the text for a span that contains the answer to the user's question. We reformulate the dialogue state tracking task to have a similar form. Given a dialogue history and a "question" comprising the slot, service and intent descriptions, the model should return values for a dialogue state as an "answer". To predict a dialogue state update, our model solves several classification tasks and a span-prediction task. For each of these tasks, there is a special head implemented as a fully connected linear layer. This architecture makes it possible to jointly train the representations for schema elements and dialogue history. Our approach is robust to changes in schema due to zero-shot adaptation to new intents and slots. In addition, the proposed model does not rely on a pre-calculated schema representation. GOLOMB outperforms the baseline and achieves a joint goal accuracy of 53.97% on the dev set. The model is publicly available at https://gitlab.com/zagerpaul/squad_dst.

Related Work

The main task of dialogue state tracking is the identification of all existing slots, their values and the intentions that form them. The pairs of slots and their values form the state of the dialogue. The dialogue state defines the interaction with the external backend API and the selection of the system’s next action.

Classic dialogue state tracking models combine the semantics extracted by a natural language understanding module with the previous dialogue context to estimate the current dialogue state [thomson2010bayesian, wang2013simple, williams2014web] or jointly learn speech understanding and dialogue tracking [henderson2014word, zilka2015incremental, wen2016network]. In past tasks, such as DSTC2 [henderson-etal-2014-second] or WoZ [wen2016network], it was required to track the dialogue within one domain. At the same time, all possible values for all slots were given in the dataset ontologies. Thus, the dialogue state tracking task was reduced to enumerating and then selecting pairs of slot values. This results in ontology-specific solutions unable to adapt to new data domains. For example, the Neural Belief Tracker uses word representation learning to obtain independent representations for each slot-value pair [mrkvsic2017neural]. Developers of the Global-Locally Self-Attentive Dialogue State Tracker (GLAD) found that 38.6% of turns in the WoZ dataset contain rare slot-value pairs with fewer than 20 training examples [zhong2018global]. There is therefore not enough training data for many of the slots, which greatly decreases joint goal accuracy. To solve this problem, the authors of GLAD proposed sharing parameters between all slots. Thus, information extracted from some slots can be used for other slots during training, which increases the quality of state tracking and makes it possible to work with multi-domain dialogues. However, the model uses both parameters common to all slots and parameters trained individually for each slot.

As technology for dialogue state tracking developed, a more complex task was proposed in the MultiWoZ dataset [budzianowski2018multiwoz, eric2019multiwoz]. Here, the system needs to extract a state from dialogues where the user can switch between domains or even mention multiple domains at the same time. As the number of possible slots and their possible values grew, iterating over all pairs became labor-intensive and learning slot-specific parameters became less efficient.

The Globally-Conditioned Encoder (GCE) [nouri2018toward] is an improved version of GLAD. This model, in which all parameters are shared between all slots, surpassed the previous model on the WoZ and MultiWoZ tasks. StateNet [ren2018towards] generates a representation of the dialogue history and compares it to the slot value representations in the candidate set. Here, the dialogue history consists of a system's act and the subsequent user utterance. The HyST model [goel2019hyst] forms a representation of the user's utterances with a hierarchical LSTM, and then combines two approaches for selecting slot values. The first one independently estimates the probability of filling the slot by each candidate from the candidate set. The second estimates the probability distribution over all the possible values for that slot type.

The majority of the aforementioned models require a vocabulary with all the values supported by the model. Thus, it is not possible to process out-of-vocabulary values. To address this issue, the PtrNet [xu2018end] model uses an index-based pointer network for different slots. The TRADE model [wu2019transferable] tracks the dialogue state using a biGRU-based encoder and decoder. The encoder encodes each token in the dialogue history. The decoder generates slot value tokens using a soft-copy mechanism that combines attention over the dialogue history and value selection from the vocabulary. The authors also studied zero- and few-shot learning to track the state of the out-of-domain dialogues. Also, pre-trained language models can help with handling unknown values and zero-shot learning. BERT-DST [chao2019bert] uses BERT to predict the start and the end tokens of the value span for each slot.


Figure 1: The architecture of the GOaL-Oriented Multi-task BERT-based dialogue state tracker (GOLOMB). The slot gate head is used to decide whether a slot has to be included in the final state. The requested slot gate predicts whether a slot has been requested by the user. The intent classifier head chooses the active intent. Depending on whether the slot is categorical or non-categorical, different heads are used. For a non-categorical slot, the free-form slot filler selects positions of the beginning and the end of the slot value in the dialogue history. For a categorical slot, the categorical slot filler selects the slot value among the possible values.

GOLOMB Model

In this section, we provide a detailed description of the proposed GOLOMB model (see Figure 1). The input of the model consists of a slot description and a dialogue history followed by the possible slot values and the supported intent descriptions. A BERT-based encoder converts the input into contextualized sentence-level and token-level representations. These representations are then fed into the following task-specific output heads:

  1. Classification heads:

    • Intent classifier is responsible for active intent prediction.

    • Requested slot gate predicts the list of slots requested by the user in the current turn.

    • Slot gate predicts whether a slot is presented in the context.

    • Categorical slot filler performs slot value prediction by selecting the most probable value from the list specified in the schema.

  2. Span-prediction head:

    • Free-form slot filler performs a slot value prediction by identifying it as a span in the context.

Each head is implemented as a fully connected linear layer.
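The shared-encoder-plus-linear-heads setup can be sketched in a few lines of numpy (head names follow the list above; the dimensions and random weights are illustrative stand-ins, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # encoder hidden size (1,024 for the BERT-large encoder; tiny here)

def linear_head(out_dim):
    # each head is a single fully connected layer: logits = W @ u + b
    W = rng.normal(size=(out_dim, H))
    b = np.zeros(out_dim)
    return lambda u: W @ u + b

slot_gate     = linear_head(3)   # ptr / dontcare / none
req_slot_gate = linear_head(2)   # requested / not_requested

u_cls = rng.normal(size=H)       # a stand-in sentence-level embedding
status_logits = slot_gate(u_cls)
status = ("ptr", "dontcare", "none")[int(np.argmax(status_logits))]
```

All heads read different parts of the same encoder output, so gradients from every sub-task flow back into one jointly trained encoder.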

Schema-Guided Dialogue Task

The dialogue state in the Schema-Guided Dialogue dataset is a frame-based representation of the user’s goals retrieved from the dialogue context. It is used to identify an appropriate service call and assign the values of the slots required for that call. The dialogue state consists of active_intent, requested_slots and slot_values.

A dialogue in the SGD dataset is represented by a sequence of turns between a user and a system. The turn annotation is organized into frames where each frame corresponds to a single service. A separate dialogue state is maintained for each service in the corresponding frame. A state update is defined as the difference between the slot values present in the current service frame and the frame for the same service for the previous user utterance. The task is state update prediction.
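The state update defined above amounts to a dictionary difference between consecutive frames of the same service; a minimal sketch (the field layout is ours, not the exact SGD JSON schema):

```python
def state_update(prev_slot_values, curr_slot_values):
    """Return the slot values that are new or changed in the current frame
    relative to the previous frame for the same service."""
    return {
        slot: values
        for slot, values in curr_slot_values.items()
        if prev_slot_values.get(slot) != values
    }

prev = {"restaurant_name": ["Oren Hummus"], "city": ["Palo Alto"]}
curr = {"restaurant_name": ["Oren Hummus"], "city": ["Palo Alto"],
        "party_size": ["2"]}
# → {'party_size': ['2']}
```

Predicting the update rather than the full state keeps the target small at each turn; the full state is then accumulated by applying updates in order.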

BERT Encoder

We assemble a group of input sequences for every frame according to its service schema. For each slot in the schema an input sequence is formed and then fed into the pre-trained BERT encoder. We adopt the input structure from the BERT-based model for question answering [devlin2018bert] on the SQuAD dataset [rajpurkar2016squad], which consists of a question part and a context part. In our case, the context is a dialogue history and the question is a concatenation of slot and domain descriptions. Table 1 shows the components of the input sequence.


Input sequence
Question Slot and service description
Context Dialogue history
Possible intents Descriptions of intents supported by the service
Possible values Possible slot values (for categorical slots only)


Table 1: The components of GOLOMB input.

The full input sequence is shown at the bottom of the diagram in Figure 1. It starts with a [CLS] token followed by the concatenation of the slot and domain descriptions. The next part, separated by [SEP] tokens, is the dialogue history. In our case, it is the current user utterance with the preceding system utterance. We then pad the input until max_hist_len is reached (by default max_hist_len=250). After that, we add all relevant intent descriptions separated by the special [int] token and padded to max_intent_len (max_intent_len=50 by default). Finally, if the slot is categorical, we complete the input with its possible values accompanied by the special token [pv]. We also add the special value "NONE" to the possible intents and slot values so as not to penalize the model if the intent or slot is not present in the context. For every input token, the BERT-based encoder generates a contextualized embedding. Different parts of the encoder output are read out by different heads (see Figure 1).
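The assembly described above can be sketched with plain strings (a real implementation operates on wordpiece ids from the BERT tokenizer; the helper's name and signature are ours):

```python
def build_input(slot_desc, service_desc, history, intents,
                possible_values=None, max_hist_len=250, max_intent_len=50):
    """Assemble one GOLOMB input sequence as a flat token list."""
    # question part: slot and service descriptions after [CLS]
    tokens = ["[CLS]"] + slot_desc.split() + service_desc.split() + ["[SEP]"]
    # context part: dialogue history, padded to max_hist_len
    hist = history.split()[:max_hist_len]
    tokens += hist + ["[PAD]"] * (max_hist_len - len(hist)) + ["[SEP]"]
    # intent descriptions, each introduced by [int], padded to max_intent_len
    intent_part = []
    for desc in ["NONE"] + intents:
        intent_part += ["[int]"] + desc.split()
    intent_part = intent_part[:max_intent_len]
    tokens += intent_part + ["[PAD]"] * (max_intent_len - len(intent_part))
    # possible values, each introduced by [pv] (categorical slots only)
    for value in ["NONE"] + (possible_values or []):
        tokens += ["[pv]"] + value.split()
    return tokens

seq = build_input("number of seats", "restaurant booking",
                  "SYSTEM: How many people? USER: Two please",
                  ["ReserveRestaurant"], possible_values=["1", "2", "3"])
```

Because the slot description sits in the "question" position, each slot of a frame produces its own input sequence against the same dialogue history.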


All heads perform a linear transformation of the corresponding embeddings. Let u be a vector from R^H, where H is the hidden size of the encoder, and let d be an arbitrary positive integer. Then, for head h, W_h is a projection that transforms u into the prediction vector p ∈ R^d:

p = W_h u + b_h

W_h is implemented as a single fully connected layer without an activation function.


Slot gate

For each slot, the model predicts its values, but not all slots should be included in the state update. The Slot gate head predicts the slot status, which can have three values: ptr, dontcare and none. If the slot status is predicted to be none, then this slot and its value will not be included in the state update. If the prediction is dontcare, then the special value dontcare is assigned to the slot. If the slot status is ptr, the slot value predicted by one of the slot fillers will be included in the state update.

The slot status is obtained by applying W_status to the sentence-level [CLS] embedding u_CLS:

l_status = W_status u_CLS,  l_status ∈ R^3

The logits l_status are normalized using softmax to yield a distribution over the three possible statuses. During inference, the status with the highest probability is assigned to the slot.
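The gate's routing rules can be paraphrased in a few lines (our own restatement of the three cases, with a toy logit vector):

```python
import numpy as np

STATUSES = ("ptr", "dontcare", "none")

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def apply_slot_gate(status_logits, slot, filler_value, state_update):
    """Decide whether and how a slot enters the state update."""
    status = STATUSES[int(np.argmax(softmax(status_logits)))]
    if status == "ptr":
        state_update[slot] = filler_value   # take the slot filler's prediction
    elif status == "dontcare":
        state_update[slot] = "dontcare"     # the special dontcare value
    # "none": the slot is left out of the update entirely
    return state_update

update = apply_slot_gate(np.array([2.0, 0.1, -1.0]), "city", "Palo Alto", {})
# → {'city': 'Palo Alto'}
```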

Categorical slot filler

We apply a fully connected layer to each possible slot value embedding u_[pv]_i to obtain a logit:

l_i = W_cat u_[pv]_i,  i = 1, …, V + 1

where V is the maximum number of possible categorical slot values. The additional value corresponds to the "NONE" value. The calculated logits are combined into a vector and normalized with softmax to get a distribution over all possible values.
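A toy numpy sketch of this head, with hand-picked one-hot embeddings so the winning value is visible by inspection (the weight vector and values are illustrative):

```python
import numpy as np

def categorical_slot_filler(pv_embeddings, w, values):
    # one logit per [pv] token embedding, produced by a shared projection w
    logits = pv_embeddings @ w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over the candidate values
    return values[int(np.argmax(probs))]

values = ["NONE", "economy", "business", "first"]
pv_embeddings = np.eye(4)                 # toy [pv] token outputs
w = np.array([0.1, 2.0, -1.0, 0.5])      # toy projection weights
# logits are [0.1, 2.0, -1.0, 0.5], so the filler picks "economy"
```

Because each candidate is scored through its own contextualized [pv] embedding, the head needs no per-slot output layer and works for unseen value sets.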

Free-form slot filler

To get a span for a non-categorical slot value, we predict the span start and span end distributions over the token-level representations u_i of the dialogue history:

s_i = W_start u_i,  e_i = W_end u_i,  i = 1, …, n

where n is the input sequence length (typically 384 or 512, as required for BERT input).
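Decoding can then pick the best (start, end) pair; the paper does not spell out its decoding rule, so the sketch below uses the common SQuAD-style constraint start ≤ end:

```python
import numpy as np

def predict_span(token_embeddings, w_start, w_end):
    # per-token start and end scores from two shared projections
    s = token_embeddings @ w_start
    e = token_embeddings @ w_end
    # exhaustively decode the best pair (i, j) with i <= j
    best, pair = -np.inf, (0, 0)
    for i in range(len(s)):
        for j in range(i, len(e)):
            if s[i] + e[j] > best:
                best, pair = s[i] + e[j], (i, j)
    return pair

tokens = ["USER", ":", "book", "a", "table", "for", "two"]
rng = np.random.default_rng(3)
emb = rng.normal(size=(len(tokens), 4))   # toy token-level embeddings
start, end = predict_span(emb, rng.normal(size=4), rng.normal(size=4))
```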

Requested slot gate

A request for a slot value by the user is predicted by applying W_req to the [CLS] embedding u_CLS:

l_req = W_req u_CLS,  l_req ∈ R^2

The calculated logits are normalized with softmax to yield a probability distribution over two possible requested slot statuses: requested or not_requested. If the predicted status is requested, then the slot is added to the requested slots list.

Intent classifier

To predict the active user intent for a given service, we apply a fully connected layer to every [int] token embedding u_[int]_i and then obtain the probability distribution with softmax:

l_i = W_int u_[int]_i,  i = 1, …, I + 1

where I is the maximum number of intents per service and the additional value corresponds to the "NONE" intent.

Experimental Setup

Schema-Guided Dialogue Dataset

To demonstrate the performance of our model, we use the recently released Schema-Guided Dialogue dataset (SGD). It is the largest public task-oriented dialogue corpus, as announced by its authors [rastogi2019towards]. SGD incorporates 34 services related to 16 different domains, with over 18,000 dialogues in the train and dev sets combined. The evaluation sets contain unseen services and domains, so the model is expected to generalize in zero-shot settings. The dataset consists of single-domain and multi-domain dialogues. A single-domain dialogue involves interactions with only one service, while a multi-domain dialogue has interactions with two or more different services.

The authors also proposed the schema-guided approach for the task-oriented dialogue. A schema defines the interface for a backend API as a list of user intents and slots, as well as their natural language descriptions. Each dialogue in the dataset is accompanied by one or more schemas relevant to the dialogue (one schema corresponds to a single service). The model should use the service’s schema as input to produce predictions over the intents and slots listed in the schema. The natural language descriptions of slots and intents allow the model to handle unseen services.

Training Details

As an encoder, we use the pre-trained BERT model (bert-large-cased-whole-word-masking-finetuned-squad, from https://huggingface.co/transformers/pretrained_models.html) with 24 layers of 1,024 hidden units, 16 self-attention heads and 340 million parameters. We fine-tune the model parameters using the Adam optimizer with weight decay [loshchilov2018decoupled]. The total loss is defined as the sum of cross-entropy losses of all heads. We train the model for 5 epochs with a batch size of 8 and 12 gradient accumulation steps on one Tesla V100 32GB.

Due to our training procedure, we get a substantial number of examples where a slot is not present in the state update and the model has to predict either an empty span or a "NONE" value. These instances (we term them "negative") push the model towards constant predictions. To mitigate this issue, we introduce the cat_neg_sampling_prob (by default 0.1) and noncat_neg_sampling_prob (by default 0.2) sampling rates for keeping negative examples of categorical and non-categorical slots, respectively. In addition, the number of non-categorical examples overwhelms that of categorical ones. We deal with this class imbalance by providing separate batches for categorical and non-categorical examples.
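The down-sampling of negatives can be sketched as a per-example filter (field names are ours; the probabilities follow the defaults above):

```python
import random

def keep_example(example, cat_neg_sampling_prob=0.1,
                 noncat_neg_sampling_prob=0.2, rng=random):
    """Keep all positives; keep negatives with a small probability."""
    if not example["is_negative"]:
        return True
    p = (cat_neg_sampling_prob if example["is_categorical"]
         else noncat_neg_sampling_prob)
    return rng.random() < p

rng = random.Random(0)
negatives = [{"is_negative": True, "is_categorical": False}
             for _ in range(1000)]
kept = sum(keep_example(ex, rng=rng) for ex in negatives)
# roughly 20% of the non-categorical negatives survive
```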


Evaluation Metrics

The following metrics were used to evaluate the dialogue state tracking task:

  • Active Intent Accuracy: The portion of user turns for which the active intent was correctly predicted.

  • Requested Slot F1: The macro-averaged F1 score for requested slots over all turns.

  • Average Goal Accuracy: For each user utterance, the model predicts a single value for every slot present in the dialogue state. Only the slots which have a non-empty assignment in the ground truth dialogue state are considered for this metric. This is the average accuracy of predicting the value of a slot correctly. A fuzzy matching score is used for non-categorical slots to reward partial matches with the ground truth.

  • Joint Goal Accuracy: This is the average accuracy of predicting all slot assignments for a turn correctly.
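The strictest of these, joint goal accuracy, can be sketched as an exact per-turn comparison (our simplification: the official metric also applies fuzzy matching to non-categorical values, omitted here):

```python
def joint_goal_accuracy(predictions, ground_truths):
    """Fraction of turns whose full set of slot assignments matches exactly."""
    correct = sum(pred == gold
                  for pred, gold in zip(predictions, ground_truths))
    return correct / len(ground_truths)

preds = [{"city": "Palo Alto"}, {"city": "Palo Alto", "time": "7 pm"}]
golds = [{"city": "Palo Alto"}, {"city": "Palo Alto", "time": "8 pm"}]
# one of the two turns is fully correct → 0.5
```

A single wrong slot makes the whole turn count as incorrect, which is why joint goal accuracy sits well below average goal accuracy in Table 2.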

Experimental Results

The results of GOLOMB evaluation on the dev and test sets and the dev scores of the baseline model are shown in Table 2. The comparison between our model and the baseline model across the different domains is provided in Figure 2.

As we can see from Table 2, our model outperforms the baseline in terms of joint goal accuracy and average goal accuracy, whereas the baseline has better scores for requested slot F1 and active intent accuracy. A plausible explanation for the significantly higher active intent score of the baseline is that it uses the [CLS] token output for intent predictions. We also tried to employ the [CLS] token output in the intent classifier and obtained better intent accuracy, but joint goal accuracy degraded at the same time.


Active Int Acc Req Slot F1 Avg GA Joint GA
GOLOMB, dev scores 0.660 0.969 0.817 0.539
Baseline, dev scores 0.908 0.973 0.740 0.411
GOLOMB, test scores 0.747 0.971 0.750 0.465


Table 2: Performance comparison between the baseline and our model on the dev set, and our model’s scores on the test set.
Figure 2: Per-domain performance comparison by joint goal accuracy and average goal accuracy between the baseline and our model. Here, “*” denotes a domain with a service present in dev and not present in train, and “**” denotes the domain with one seen and one unseen service. The other domains contain services from train only.
Figure 3: First 20 slots sorted by the error rate on the dev set. The location slot, which appears in the Hotels, Restaurants and Travel domains, has the highest error rate, 12%. The director slot from the Media and Movies domains has the lowest error rate, 1.6%.

Figure 2 shows a comparison between our model and the baseline by joint goal accuracy and average goal accuracy. Our model exhibits better performance in every domain by joint goal accuracy. In the Events domain, both models performed well due to the large number of training examples in that domain. The greatest gap is evident in the Services domain, where our model shows superior performance. As one can observe from Figure 2, the model's performance is in general better on the domains whose services were present in the train set. However, the best joint goal accuracy is achieved on the Banks domain, even though the corresponding service was unseen during training.

The Alarm domain exhibits the worst performance by joint goal accuracy. The most likely explanation is that no examples from this domain were seen by the model during training. However, its relatively good average goal accuracy suggests that the errors are concentrated in a few especially deceptive slots, on which the model makes more mistakes than elsewhere.

Figure 4: An example of domain switch in the dialogue 13_00089 from the dev set. The user requested a slot address from the Services domain, but its value was assigned to the destination slot from the RideSharing domain.


            NLD  CLS  PV  Intents  SQuAD  Active Int Acc  Req Slot F1  Avg GA  Joint GA
(a)          –    +   –      –       –          –            0.969      0.782    0.460
(b)          +    +   –      –       –          –            0.969      0.778    0.464
(c)          +    –   +      –       –          –            0.969      0.814    0.524
(d)          +    –   +      +       –        0.657          0.969      0.820    0.535
Final model  +    –   +      +       +        0.660          0.969      0.817    0.539


Table 3: Ablation study. Here, "NLD" denotes the case when the natural language descriptions of slots and domains are used, and "PV" denotes possible slot values. For categorical slot value prediction, two approaches were implemented. The first approach ("CLS"), described in detail below, uses the [CLS] token output. The second approach ("PV"), which is part of our final architecture, uses the outputs of the special [pv] tokens to select a slot value among the possible values.

The error rate across the different slots is shown in Figure 3. Not surprisingly, the location slot has the highest error rate (12%), as it appears in three domains: Hotels, Restaurants and Travel, of which only the Travel domain was seen during training. The date slot has the second-highest error rate of 7% and also appears in three domains, of which the Restaurants domain was unseen. The slots destination and destination_city also have high error rates of 6% and 4% respectively. These slots were often filled with origin places instead of destination places. However, the mismatch of origin and destination points was a frequent mistake in the SGD dataset itself, so the model could be confused by incorrect labels.

In multi-domain dialogues, we noticed that our model frequently makes mistakes on the turns where a domain switch happens. Typically, tracking fails in a situation where a slot has been tracked for one domain and its value needs to be transferred to a slot in the new domain. The main obstacle is that there is no mention of the value for the new slot in the recent context, so our model cannot find this slot's value by design.

An example of such a situation can be found in the dialogue 13_00089 from the dev set (see Figure 4). Here, the user requested a slot address which corresponds to the domain Services, but its value was assigned to the destination slot from the RideSharing domain. Our model is not able to share slot values directly between different domains, so the slot destination was filled incorrectly with the word “there” (the ending of the last user utterance).

Ablation Study

The results of an ablation study for our model are provided in Table 3. We perform the following ablation experiments:

  1. CLS for categorical slots. Our first version of the categorical slot filler used the [CLS] token output to predict categorical slot values. The [CLS] embedding was fed into a fully connected layer with V + 1 outputs, where V is the maximum number of possible categorical slot values. The last position always corresponded to the "NONE" value. If a slot had k possible values, the positions between k + 1 and V were filled with –INF to get zero probabilities after applying softmax. We also fed only the slot and domain names to the BERT encoder, without their natural language descriptions.

  2. CLS for categorical slots + NLD. We added natural language descriptions (NLD) to the previous setup. Surprisingly, the increment in performance was not as substantial as we expected.

  3. PV for categorical slots + NLD. Introducing special tokens for possible values led to a huge increase of around 6% in performance by joint goal accuracy.

  4. PV for categorical slots + NLD + Intents. We implemented intent prediction by introducing special tokens for possible user intents (in the same manner as for categorical slots). Though the intent prediction accuracy is not particularly high, the overall performance showed an increase of around 1% by joint goal accuracy.

  5. PV for categorical slots + NLD + Intents + SQuAD pre-training. The encoder we used for our final model was the BERT model fine-tuned on the SQuAD dataset. We got an increase in joint goal accuracy, so we did not give up SQuAD pre-training, even though the average goal accuracy deteriorated.


Conclusion

We proposed a multi-task BERT-based model for multi-domain dialogue state tracking in zero-shot settings. Our approach is robust to schema modifications and is able to transfer the extracted knowledge to unseen domains. The model is consistent with real-life scenarios raised by virtual assistants and achieves substantial improvements over the baseline.


Acknowledgements

The work was supported by National Technology Initiative and PAO Sberbank project ID 0000000007417F630002.