Fine-Tuning BERT for Schema-Guided Zero-Shot Dialogue State Tracking

02/01/2020 ∙ by Yu-Ping Ruan, et al. ∙ USTC 0

We present our work on Track 4 of the Dialogue System Technology Challenges 8 (DSTC8). DSTC8-Track 4 targets dialogue state tracking (DST) under zero-shot settings, in which the model needs to generalize to unseen service APIs given a schema definition of these target APIs. Serving as the core of many virtual assistants such as Siri, Alexa, and Google Assistant, DST keeps track of the user's goal and what has happened in the dialogue history, mainly through intent prediction, slot filling, and user state tracking, which tests a model's natural language understanding ability. Recently, pretrained language models have achieved state-of-the-art results and shown impressive generalization ability on various NLP tasks, which makes them a promising way to perform zero-shot learning for language understanding. Based on this, we propose a schema-guided paradigm for zero-shot dialogue state tracking (SGP-DST) by fine-tuning BERT, one of the most popular pretrained language models. The SGP-DST system contains four modules, for intent prediction, slot prediction, slot transfer prediction, and user state summarizing respectively. According to the official evaluation results, our SGP-DST (team12) ranked 3rd on joint goal accuracy (the primary evaluation metric for ranking submissions) and 1st on requested slots F1 among 25 participant teams.


1 Introduction

Virtual assistants have been commercialized and provide many services such as finding flights, checking the weather, and booking hotels. Among these, Siri, Alexa, Google Assistant, and Cortana are the most popular and advanced frameworks, providing a conversational interface to a large number of services and APIs spanning multiple domains. Dialogue state tracking (DST) is a core component of such task-oriented dialogue systems. DST keeps track of the user’s goal and what has happened in the dialogue history, and outputs the dialogue state after each user utterance, which is a summary of the entire conversation up to the current turn. The dialogue state is then used to determine what action the system should take next.

Deep learning models have achieved state-of-the-art results in dialogue state tracking [Mrkšić et al.2017, Liu and Lane2017, Rastogi, Gupta, and Hakkani-Tur2018, Zhong, Xiong, and Socher2018]. Common public datasets for DST such as DSTC2 [Henderson, Thomson, and Williams2014], MultiWOZ [Budzianowski et al.2018], and M2M [Shah et al.2018] cover few domains and assume a single static ontology per domain, so they do not sufficiently capture a number of challenges that arise when scaling DST in production [Rastogi et al.2019b]. In production, DST needs to support a large and constantly increasing number of services over a large number of domains.

To highlight these challenges, DSTC8-Track 4 [Rastogi et al.2019a] presents the task of schema-guided zero-shot dialogue state tracking and released the schema-guided dialogue (SGD) dataset [Rastogi et al.2019b], which is the largest public task-oriented dialogue corpus, with over 16,000 dialogues in the training set (see Table 2) spanning 26 services belonging to 16 domains. Specifically, the SGD dataset is designed to develop and test models’ ability to generalize in zero-shot settings, since the evaluation sets in SGD contain unseen services and domains. Zero-shot models utilizing domain and/or slot descriptions have been gaining popularity for spoken language understanding tasks [Bapna et al.2017, Kumar et al.2017, Lee and Jha2019]. The SGD dataset in DSTC8-Track 4 also aims to motivate similar approaches for dialogue state tracking [Rastogi et al.2019a]. The targets of DST on the SGD dataset mainly include intent prediction and slot filling.

The DST task relies on a model’s natural language understanding ability. Recently, pretrained language models have achieved state-of-the-art results and shown impressive generalization ability on a wide variety of NLP tasks. Among these, GPT [Radford et al.], ELMo [Peters et al.2018], BERT [Devlin et al.2019], and XLNet [Yang et al.2019] are the most popular and advanced ones. These pretrained language models provide a promising way to perform zero-shot learning for language understanding. Considering the zero-shot settings in the SGD dataset, we propose a schema-guided paradigm for zero-shot dialogue state tracking (SGP-DST) by fine-tuning BERT [Devlin et al.2019]. The SGP-DST system contains four modules: 1) intent prediction, which predicts the user intent for each user utterance; 2) slot prediction, which predicts slot values and user-requested slots from the corresponding user utterance; 3) slot transfer prediction, which decides which slot values should be transferred (copied) from system history actions or user history states; and 4) user state summarization, which summarizes the prediction results from the previous three modules and updates the user state for the current turn. According to the official evaluation results (footnote 1: https://docs.google.com/spreadsheets/d/19Z1e1mXch4HnPoXfMGxw2UEHBt3cTTcwulZ1H˙7U7DM/edit#gid=1291434552), our proposed SGP-DST system (team12) ranked 3rd among 25 participant teams.

2 The Schema-Guided Dialogue Dataset

service_name: “Banks_1”
description: “Manage bank accounts and transfer money”
  slot_name: “account_type”
    categorical: True
    description: “The account type of the user”
    possible_values: [“checking”, “savings”]
  slot_name: “amount”
    categorical: False
    description: “The amount of money to transfer”
    possible_values: []
  intent_name: “CheckBalance”
    is_transactional: False
    description: “Check the amount of money in a user’s bank account”
    required_slots: [“account_type”]
    optional_slots: []
Table 1: Example schema (incomplete) for a bank service.
                                Train      Dev     Test
No. of dialogs                 16,142    2,482    4,201
No. of domains                     16       16       18
No. of services                    26       17       21
Avg. turns per dialogue         20.44    19.63    20.13
Avg. tokens per turn             9.75     9.66    10.40
Dialogs with unseen APIs (%)      0.0    42.01    69.64
Table 2: Statistics of the SGD dataset.

The schema-guided dialogue (SGD) dataset consists of dialogues between a human and a virtual assistant, generated with the help of a dialogue simulator and crowd-workers, along with schemas for one or more APIs relevant to the dialogues [Rastogi et al.2019b]. A schema defines the interface of a backend service API and contains a description, a set of slots, and a set of intents, as shown in Table 1. The description is a natural language summary of the function of the API. Different from traditional schema definitions, the schemas in SGD also give a natural language description for each slot and intent to help models generalize to unseen API schemas. There are two types of slots in the SGD dataset: 1) categorical: a slot that takes one of a finite set of possible values, which are listed in the corresponding schema definition; 2) free-form: a slot that can take any string value, which can be derived from the dialogue history.
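As a rough illustration of the schema structure described above, the sketch below encodes the Table 1 example as a Python dictionary and separates categorical from free-form slots. The field names follow Table 1; the nesting under `slots` and `intents` and the helper `split_slots` are assumptions made for this example and may differ from the released JSON files.

```python
# Illustrative schema mirroring Table 1. The "slots"/"intents" nesting is an
# assumption for this sketch, not necessarily the released file layout.
schema = {
    "service_name": "Banks_1",
    "description": "Manage bank accounts and transfer money",
    "slots": [
        {
            "slot_name": "account_type",
            "categorical": True,
            "description": "The account type of the user",
            "possible_values": ["checking", "savings"],
        },
        {
            "slot_name": "amount",
            "categorical": False,
            "description": "The amount of money to transfer",
            "possible_values": [],
        },
    ],
    "intents": [
        {
            "intent_name": "CheckBalance",
            "is_transactional": False,
            "description": "Check the amount of money in a user's bank account",
            "required_slots": ["account_type"],
            "optional_slots": [],
        }
    ],
}


def split_slots(service_schema):
    """Return (categorical, free_form) lists of slot names for one service."""
    categorical = [s["slot_name"] for s in service_schema["slots"] if s["categorical"]]
    free_form = [s["slot_name"] for s in service_schema["slots"] if not s["categorical"]]
    return categorical, free_form


categorical, free_form = split_slots(schema)
print(categorical, free_form)
```

Categorical slots carry their candidate values in the schema, while free-form slots have an empty `possible_values` list and must be filled from the dialogue text.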

Dialogue state tracking on the SGD dataset contains the following sub-targets: 1) predicting the user’s intent for each user utterance, 2) predicting the slot values, which may come from the corresponding user utterance or be transferred (copied) from the system history actions or user history states, and 3) predicting which slots are requested by the user.

The main statistics of the SGD dataset are shown in Table 2. Besides the total number of dialogues, service domains, and service APIs, Table 2 also gives the average turns per dialogue, the average tokens per turn, and the percentage of dialogues with unseen service APIs. The test set has more dialogues than the dev set and, more critically, a higher percentage of dialogues with unseen APIs (69.64% versus 42.01%). For more details about the SGD dataset, we refer readers to the descriptions on the data website (footnote 2: https://github.com/google-research-datasets/dstc8-schema-guided-dialogue).

3 System Description

As introduced in Section 1, our proposed SGP-DST system has four modules, which work jointly in a pipeline to track the user dialogue state. The intent prediction, slot prediction, and slot transfer prediction modules are all based on fine-tuned BERT, and the final user state summarizing module merges the results from the previous three modules using simple rules. The details of the module design follow.

3.1 Intent Prediction

Figure 1: BERT input representation in the intent prediction module. In addition to the token embedding, segment embedding, and position embedding, we introduce a context feature embedding.

As shown in Figure 1, the inputs to BERT are the sum of the token embeddings, segment embeddings, and position embeddings, which are the same as those in the original BERT [Devlin et al.2019], plus additional context feature embeddings. The token sequence is the concatenation of the utterance tokens and the intent description tokens. Specifically, we concatenate the last system utterance and the current user utterance in the form “sys: [system_utterance] usr: [user_utterance]” as the input utterance, and we pad the intent description fragments to the maximum number of intents, since different service APIs have different numbers of intents.

We design the context feature as an indicator of the intent of the last user utterance, considering that the user intent should transition smoothly from the last user intent. There are therefore only two kinds of context feature embeddings: the feature is set to “1” when the token belongs to the description of the last intent and to “0” otherwise. When deriving the intent indicator feature, we use the ground-truth last intent labels for training and the predicted intent labels for inference.

Different from the original attention flow in BERT, in which each token can attend to all tokens in the input sequence, the tokens in intent_i can only attend to the utterance tokens and the tokens belonging to the same intent_i. The utterance tokens can attend to all tokens in the input sequence.
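The restricted attention flow can be pictured as a square 0/1 mask over the input positions. The following is a minimal sketch, assuming each token is tagged with a group id (0 for utterance tokens, i for tokens of intent_i); the function name and the group-id encoding are illustrative, and special or padding tokens are ignored here.

```python
def intent_attention_mask(group_ids):
    """Build a mask where mask[q][k] == 1 means token q may attend to token k.

    group_ids: one int per token, 0 for utterance tokens and i >= 1 for the
    tokens of the i-th intent description (an assumed encoding).
    """
    n = len(group_ids)
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            query_is_utterance = group_ids[q] == 0   # utterance tokens see everything
            key_is_utterance = group_ids[k] == 0     # everyone sees the utterance
            same_group = group_ids[q] == group_ids[k]  # intent tokens see their own intent
            if query_is_utterance or key_is_utterance or same_group:
                mask[q][k] = 1
    return mask


# Two utterance tokens, two tokens of intent_1, one token of intent_2.
mask = intent_attention_mask([0, 0, 1, 1, 2])
for row in mask:
    print(row)
```

In this toy example the intent_1 tokens cannot see the intent_2 token and vice versa, so each intent description is encoded independently of the others while still being conditioned on the utterance.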

After BERT outputs the encoded sequence, we derive a contextual representation vector $h_i$ for intent_i by max pooling the encoded token representations that belong to the corresponding intent description. The representation vector $h_i$ is then fed into a feed-forward network (FFN) followed by a softmax over all intents:

$s_i = \mathrm{FFN}(h_i)$ (1)
$p_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}$ (2)

in which $p_i$ gives the probability of selecting intent_i as the user intent. The cross-entropy between the predicted and ground truth distributions of intents is used as the training loss.
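The pooling and scoring steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a single linear layer (`weights`, `bias`) stands in for the feed-forward network, whose exact shape is not given here, and the group-id encoding (0 for utterance tokens, i for intent_i) is hypothetical.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def intent_probabilities(encoded, group_ids, num_intents, weights, bias):
    """Score each intent from its max-pooled token vectors, then normalize.

    A single linear layer replaces the feed-forward network (assumption).
    """
    scores = []
    for i in range(1, num_intents + 1):
        tokens = [vec for vec, g in zip(encoded, group_ids) if g == i]
        pooled = max_pool(tokens)
        scores.append(sum(w * x for w, x in zip(weights, pooled)) + bias)
    return softmax(scores)


# Toy encoded sequence: one utterance token, two tokens of intent_1, one of intent_2.
encoded = [[0.0, 0.0], [1.0, 0.0], [0.5, 2.0], [0.0, 3.0]]
probs = intent_probabilities(encoded, [0, 1, 1, 2], num_intents=2,
                             weights=[1.0, 0.0], bias=0.0)
print(probs)
```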

3.2 Slot Prediction

The slot prediction module has three submodules, for categorical slot value prediction, free-form slot value prediction, and requested slot prediction respectively. Table 3 shows an example of the slot prediction targets.

Service_name: “Restaurants_2”
user utterance: “I want to make the reservation for 2 people at half past 11 in the morning, and can you give me the address?”
categorical slots: [“number_of_seats”: “2”]
free-form slots: [“time”: “half past 11 in the morning”]
requested slots: [“address”]
Table 3: Example of slot prediction targets, including categorical slot value prediction, free-form slot value prediction, and requested slot prediction.

Categorical slot value prediction

The BERT input representation for categorical slot value prediction is presented in Figure 2(a). The input tokens contain the utterance tokens, the slot description tokens, and the corresponding possible value tokens, in which the utterance tokens are the concatenation of the last system utterance and the current user utterance. The slot description tokens are the concatenation of the slot name and the original slot description tokens. The possible value fragments are padded to the maximum number of possible values. Additionally, we add a “null” value, as there may be no corresponding value present in the utterance.

The context feature here indicates whether the slot has been requested by the system and whether the system has offered a value for the current slot, both of which are binary. So there are 4 kinds of context feature embeddings in total. We derive the context feature from the history of system actions.
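One simple way to turn such binary features into an embedding index is to pack them into an integer, so that n features yield 2^n distinct indices (4 for the two features here). The bit-packing scheme and the function name below are assumptions for illustration; the text does not specify how the combinations are indexed.

```python
def context_feature_id(flags):
    """Pack a sequence of booleans into one integer index.

    The first flag becomes the most significant bit, so n flags map onto
    the range 0 .. 2**n - 1. This indexing is an assumed convention.
    """
    index = 0
    for flag in flags:
        index = (index << 1) | int(bool(flag))
    return index


# Flags: (slot requested by the system, value offered by the system).
print(context_feature_id([False, False]))
print(context_feature_id([True, False]))
print(context_feature_id([True, True]))
```

The resulting index can then select one row of a small embedding table that is added to the other input embeddings for every token of the sequence.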

For the attention flow, the tokens in the utterance and the slot description can attend to all tokens in the input sequence. Similar to the attention flow of the intent tokens in Section 3.1, the tokens in value_i can only attend to the tokens in the utterance, the slot description, and value_i itself.

For the final categorical slot value prediction, we follow the same procedure as for intent prediction in Section 3.1: we first derive a contextual representation vector for each possible value, including the “null” value, and then feed these vectors into a feed-forward network with softmax output to get the probability distribution over all possible values. The training objective is the cross-entropy between the predicted and ground truth distributions of categorical slot values.

Figure 2: BERT input representation in the slot prediction module: a) for categorical slot value prediction, and b) for free-form slot value prediction or requested slot prediction.

Free-form slot value prediction

We cast free-form slot value prediction as a reading comprehension task, in which the model needs to return a text span from the user utterance as the predicted slot value. The input representation for BERT is shown in Figure 2(b), in which the utterance tokens, slot description tokens, and context feature are derived in the same way as for categorical slot value prediction. The utterance tokens are further concatenated with a “null” token, considering that there may be no value present in the utterance for the corresponding slot. The attention flow is identical to that in the original BERT.

For text span prediction, we follow the way BERT handles the SQuAD task [Devlin et al.2019]. Let $T_i$ denote the encoded representation vector of the $i$-th token in the utterance. The probability of token $i$ being the start or end of the value span is computed as a dot product between $T_i$ and a start vector $S$ or an end vector $E$, followed by a softmax over all tokens:

$P^{start}_i = \frac{\exp(S \cdot T_i)}{\sum_j \exp(S \cdot T_j)}$ (3)
$P^{end}_i = \frac{\exp(E \cdot T_i)}{\sum_j \exp(E \cdot T_j)}$ (4)

The maximum scoring span is then used as the predicted slot value. When the maximum scoring span is “null”, we do not return any prediction for the slot concerned. The cross-entropy between the predicted and ground truth distributions of the start and end positions is used as the training objective.
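The span selection step can be sketched as below. Start and end probabilities come from a softmax over the dot products of each token vector with a start vector and an end vector, as in BERT for SQuAD. Scoring a span by the product of its start and end probabilities, the `max_len` limit, and all names are assumptions of this sketch; the text only states that the maximum scoring span is used.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_span(token_vectors, start_vec, end_vec, max_len=30):
    """Return (start, end, score) for the best span with start <= end.

    The span score is P_start[start] * P_end[end] (assumed scoring rule);
    max_len is a hypothetical cap on the span length in tokens.
    """
    p_start = softmax([dot(start_vec, t) for t in token_vectors])
    p_end = softmax([dot(end_vec, t) for t in token_vectors])
    best = (0, 0, 0.0)
    n = len(token_vectors)
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            score = p_start[i] * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy example: token 0 looks like a start, token 2 looks like an end.
tokens = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
start, end, score = best_span(tokens, start_vec=[1.0, 0.0], end_vec=[0.0, 1.0])
print(start, end, round(score, 3))
```

If the appended “null” token is treated as one more position, a best span that lands on it would mean that no value is returned for the slot.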

Requested slot prediction

The input representation for requested slot prediction is identical to that for free-form slot value prediction. We treat requested slot prediction as a sequence-level classification task and follow the sequence classification setup in [Devlin et al.2019], in which the final encoded vector for “[CLS]” is fed into a classifier.

3.3 Slot Transfer Prediction

Service_name: “Buses_1”
user: “Can you find a bus? It’s for a group of 4.”
system: “How about a bus with 0 stops, departing at 7 am, and costs $29?”
user: “Okay, what bus station is it leaving from? What bus station am I arriving at?”
system: “The destination station is Santa Fe Depot and you will be departing from Downtown Station.”
user: “Sounds great. Book the bus.”
Table 4: Example of in-domain slot transfer. When the user makes the decision by saying “Sounds great. Book the bus.”, the value of slot “leaving_time” should be copied from the system history action “OFFER: leaving_time: 7 am”.
Last turn:
  Service_name: “RentalCars_1”
  system utterance: “Your car has been reserved in NYC.”
  user utterance: “That’s good.”
  user states: {“pickup_date”: “March 11th”, “pickup_city”: “NYC”,…}
Current turn:
  Service_name: “Buses_1”
  system utterance: “Would you want to get there by taxi?”
  user utterance: “No. I’d like a bus to get there.”
  user states: {“to_location”: “NYC”}
Table 5: Example of cross-domain slot transfer. When the user state tracking switches from service “RentalCars_1” to service “Buses_1”, the value of the target slot “to_location” in the “Buses_1” frame should be copied from the source slot “pickup_city” in the “RentalCars_1” frame. Note that the source slot can also come from system history actions in the “RentalCars_1” frame.

A slot value in the user state may be transferred (copied) from the corresponding slot value in system history actions or user history states, in which case it cannot be derived from the current user utterance. In the SGD dataset, which mainly contains multi-domain dialogues, slot values can be transferred within a domain or across domains, referred to here as “in-domain slot transfer” and “cross-domain slot transfer”.

For in-domain slot transfer, as shown in Table 4, when the user makes an agreement or transaction with the system, certain slots should be transferred from the system history actions that provide values for them. Table 5 shows an example of cross-domain slot transfer, which usually happens when the user state tracking switches from one service API to another.

For both in-domain and cross-domain slot transfer, there may be multiple values for a slot in the dialogue history; in that case we use the most recently mentioned slot value for transfer.
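Picking the most recently mentioned value amounts to scanning the history backwards. The sketch below assumes the history is a list of per-turn dictionaries ordered from oldest to newest; this representation, the slot name “fare”, and the earlier “6 am” value are hypothetical additions to the Table 4 example.

```python
def most_recent_value(history, slot):
    """Return the latest value recorded for `slot`, or None if it never appears.

    history: list of dicts (oldest turn first) mapping slot name -> value.
    """
    for turn in reversed(history):
        if slot in turn:
            return turn[slot]
    return None


# Hypothetical history loosely based on Table 4.
history = [
    {"leaving_time": "6 am"},
    {"fare": "$29"},
    {"leaving_time": "7 am"},
]
print(most_recent_value(history, "leaving_time"))
```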

Figure 3: BERT input representation in the slot transfer prediction module: a) for in-domain slot transfer prediction, and b) for cross-domain slot transfer prediction.

In-domain slot transfer

Similar to the previous settings, the BERT input representation for in-domain slot transfer is presented in Figure 3(a). The input tokens are the concatenation of the service description tokens, the utterance tokens, and the slot description tokens, in which the utterance and slot description tokens are derived in the same way as in categorical slot value prediction. The context feature used here indicates whether the slot is optional/required in the current user intent, whether the system has given a value for the current slot, and whether the slot has appeared in the user history states; these are encoded as four binary features, so there are 16 kinds of context feature embeddings in total. The attention flow is identical to that in the original BERT.

Finally, we treat slot transfer prediction as a sequence-level classification task and adopt the standard BERT setup for sequence classification [Devlin et al.2019].

Cross-domain slot transfer

The BERT input representation for cross-domain slot transfer prediction is shown in Figure 3(b), in which the input tokens are the concatenation of the utterance tokens, target slot tokens, and source slot tokens. For the context feature, 5 binary features are used in total, covering whether the current service frame is the continuation of the previous turn, whether the target slot is optional/required in the current user intent, whether the target slot has appeared in the history of the current service frame, and whether the source slot has appeared in the source service frame. So we use 32 kinds of context feature embeddings here.

Cross-domain slot transfer prediction is also treated as a sequence-level classification task, again using the standard BERT setup for sequence classification [Devlin et al.2019].

                Active Intent Acc.  Requested Slot F1  Average Goal Acc.  Joint Goal Acc.
SGP-DST
  All APIs      0.9529              0.9839             0.9387             0.8001
  Seen APIs     0.9571              0.9845             0.9659             0.8831
  Unseen APIs   0.9476              0.9832             0.9027             0.6923
SGP-DST without in-domain slot transfer
  All APIs      0.9529              0.9839             0.8100             0.4975
  Seen APIs     0.9571              0.9845             0.8362             0.5416
  Unseen APIs   0.9476              0.9832             0.7753             0.4400
SGP-DST without cross-domain slot transfer
  All APIs      0.9529              0.9839             0.8406             0.6361
  Seen APIs     0.9571              0.9845             0.8748             0.7106
  Unseen APIs   0.9476              0.9832             0.7954             0.5393
SGP-DST without in&cross-domain slot transfer
  All APIs      0.9529              0.9839             0.7048             0.3940
  Seen APIs     0.9571              0.9845             0.7353             0.4170
  Unseen APIs   0.9476              0.9832             0.6644             0.3640
Table 6: Evaluation results on the dev set. The “all APIs”, “seen APIs”, and “unseen APIs” rows represent the whole dev set, the subset whose APIs appear in the training set, and the subset whose APIs are unseen in the training set, respectively.

3.4 User State Summarization

The user state summarization module works at the inference stage. It summarizes the prediction results from the above three modules and updates the user state at each turn. Specifically, the intent prediction and slot prediction modules first finish their predictions; the slot transfer prediction module then runs, since it uses the user intent and the user history states as features.

The rules for user state summarizing and updating are very simple and mainly include: 1) for each prediction target in the above three modules, we set a score threshold and only consider the predicted items whose prediction score returned by the model is above the threshold. The predicted probabilities are used directly as the scores, and the thresholds are set to 0.8, 0.5, 0.9, 0.85, and 0.9 for categorical slot, free-form slot, requested slot, in-domain slot transfer, and cross-domain slot transfer prediction respectively. 2) We remove the slots that are neither required nor optional in the corresponding user intent according to the schema definition.
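The two rules can be sketched as a small filter over scored predictions. The thresholds are the ones listed above; the tuple layout of `predictions`, the key names, and the choice to let later predictions overwrite earlier ones are assumptions of this sketch. Requested slots are left out of the example on the assumption that they are reported separately from the slot-value state.

```python
# Thresholds from the text, keyed by hypothetical prediction-kind names.
THRESHOLDS = {
    "categorical_slot": 0.8,
    "free_form_slot": 0.5,
    "requested_slot": 0.9,
    "in_domain_transfer": 0.85,
    "cross_domain_transfer": 0.9,
}


def summarize_slot_values(predictions, intent_schema):
    """Apply rule 1 (score threshold) and rule 2 (schema filter).

    predictions: list of (kind, slot, value, score) tuples (assumed layout).
    intent_schema: dict with "required_slots" and "optional_slots" lists.
    """
    allowed = set(intent_schema["required_slots"]) | set(intent_schema["optional_slots"])
    state = {}
    for kind, slot, value, score in predictions:
        if score <= THRESHOLDS[kind]:
            continue  # rule 1: keep only scores above the threshold
        if slot not in allowed:
            continue  # rule 2: slot is neither required nor optional for the intent
        state[slot] = value  # later predictions overwrite earlier ones (assumption)
    return state


intent_schema = {"required_slots": ["account_type"], "optional_slots": []}
predictions = [
    ("categorical_slot", "account_type", "checking", 0.95),
    ("free_form_slot", "amount", "$50", 0.70),            # dropped by rule 2
    ("categorical_slot", "account_type", "savings", 0.60),  # dropped by rule 1
]
state = summarize_slot_values(predictions, intent_schema)
print(state)
```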

4 Experiments

4.1 Training Details

We extract training and dev samples from the originally released SGD dialogue corpus for each module in our SGP-DST system. All dialogues in the SGD training set, both single-domain and multi-domain, are used for model training. We use the PyTorch implementation of BERT-base with the pretrained model file bert-base-cased provided by Google (footnote 3: https://github.com/huggingface/pytorch-pretrained-BERT#Fine-tuning-with-BERT-running-the-examples). For the fine-tuning process, we use mostly default settings. Specifically, the learning rate is 2e-05, the batch size is 128, and the maximum number of training epochs is 3. In total, we fine-tuned 6 individual BERT-base models in our SGP-DST system, for intent, categorical slot, free-form slot, requested slot, in-domain slot transfer, and cross-domain slot transfer prediction respectively.

4.2 Evaluation Metrics

The official metrics for the evaluation of DST on the SGD dataset are listed below, in which the joint goal accuracy is used as the primary metric for ranking systems.

  • Active intent accuracy: The fraction of user turns for which the active intent has been correctly predicted.

  • Requested slots F1: The macro-averaged F1 score for requested slots over the turns. For a turn, if there are no requested slots in both the ground truth and the prediction, that turn is skipped. The reported number is the average F1 score for all un-skipped user turns.

  • Average goal accuracy: The average accuracy of predicting the value of a slot correctly. A fuzzy matching based score is used for non-categorical slots. Only the slots that have a non-empty assignment in the ground truth dialogue state are considered.

  • Joint goal accuracy: This is the average accuracy of predicting all slot assignments for a turn correctly. For non-categorical slots a fuzzy matching score is used to reward partial matches with the ground truth.
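To make two of these definitions concrete, the sketch below computes requested slots F1 and joint goal accuracy over a list of turns. It is not the official evaluation script: exact matching is substituted for the fuzzy matching used on non-categorical slots, and the function names and input layout are assumptions.

```python
def requested_slots_f1(gold_turns, pred_turns):
    """Macro-averaged F1 over turns; turns with no gold and no predicted
    requested slots are skipped, as described in the metric definition."""
    scores = []
    for gold, pred in zip(gold_turns, pred_turns):
        gold, pred = set(gold), set(pred)
        if not gold and not pred:
            continue
        true_pos = len(gold & pred)
        if true_pos == 0:
            scores.append(0.0)
            continue
        precision = true_pos / len(pred)
        recall = true_pos / len(gold)
        scores.append(2 * precision * recall / (precision + recall))
    # The value for "all turns skipped" is not defined in the text; 0.0 is a placeholder.
    return sum(scores) / len(scores) if scores else 0.0


def joint_goal_accuracy(gold_states, pred_states):
    """Fraction of turns whose full slot assignment matches exactly.

    Exact dict equality replaces the fuzzy matching score (simplification).
    """
    correct = sum(1 for g, p in zip(gold_states, pred_states) if g == p)
    return correct / len(gold_states)


f1 = requested_slots_f1(
    [["address"], [], ["address", "phone"]],
    [["address"], [], ["address"]],
)
jga = joint_goal_accuracy(
    [{"time": "7 am"}, {"time": "7 am", "seats": "2"}],
    [{"time": "7 am"}, {"time": "7 am"}],
)
print(f1, jga)
```

The second turn of the joint goal example counts as wrong even though one of its two slots is correct, which is why joint goal accuracy is much stricter than average goal accuracy.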

                Active Intent Acc.  Requested Slot F1  Average Goal Acc.  Joint Goal Acc.
With 0% of dev set for training
  All APIs      0.9529              0.9839             0.9387             0.8001
  Seen APIs     0.9571              0.9845             0.9659             0.8831
  Unseen APIs   0.9476              0.9832             0.9027             0.6923
With 50% of dev set for training
  All APIs      0.9620              0.9839             0.9557             0.8452
  Seen APIs     0.9625              0.9852             0.9755             0.9127
  Unseen APIs   0.9614              0.9823             0.9295             0.7573
With 90% of dev set for training
  All APIs      0.9629              0.9856             0.9631             0.8698
  Seen APIs     0.9633              0.9942             0.9794             0.9225
  Unseen APIs   0.9623              0.9864             0.9431             0.8059
Table 7: Evaluation results under the few-shot settings. We include 0%, 50%, and 90% of the dialogues in the dev set for training respectively and conduct the evaluation on the remaining dialogues in the dev set.
                  Active Intent Acc.  Requested Slot F1  Average Goal Acc.  Joint Goal Acc.
With 0% of dev set for training
  All APIs        0.9185              0.9899             0.9125             0.7222
  Seen APIs       0.9564              0.9935             0.9582             0.8798
  Unseen APIs     0.9059              0.9887             0.8966             0.6696
With 90% of dev set for training
  All APIs        0.9234              0.9948             0.9199             0.7375
  Seen APIs       0.9581              0.9965             0.9566             0.8795
  Unseen APIs     0.9118              0.9943             0.9071             0.6901
With 100% of dev set for training
  All APIs        0.9260              0.9954             0.9151             0.7257
  Seen APIs       0.9574              0.9973             0.9631             0.8877
  Unseen APIs     0.9156              0.9947             0.8983             0.6717
Best submitted results among all participants
  Best Reported   0.9692              0.9954             0.9712             0.8653
  Our Best (Rank) 0.9260 (11)         0.9954 (1)         0.9199 (5)         0.7375 (3)
Table 8: Official evaluation results on the test set. We submitted the SGP-DST systems trained without dev samples and with 90% and 100% of the dev samples respectively.

4.3 Results

Overall performance

Table 6 presents the overall performance of our SGP-DST system on the dev set. To analyze the zero-shot dialogue state tracking ability of SGP-DST more comprehensively, we present the results on the whole dev set, the subset whose APIs appear in the training set, and the subset whose APIs are unseen in the training set, denoted as “all APIs”, “seen APIs”, and “unseen APIs” respectively in Table 6.

There is a very small performance gap on active intent and requested slot prediction between the “seen APIs” subset and the “unseen APIs” subset, which indicates that our SGP-DST system generalizes well on these two targets. However, on average goal accuracy and joint goal accuracy, our SGP-DST system performs much worse on the “unseen APIs” subset than on the “seen APIs” subset (0.6923 versus 0.8831 joint goal accuracy), which indicates that it is more difficult to generalize on slot value prediction.

We also evaluate the importance of the slot transfer prediction module in our SGP-DST by removing the in-domain slot transfer prediction and the cross-domain slot transfer prediction respectively. Both have a significant effect on system performance, especially in-domain slot transfer: on all APIs, removing it lowers joint goal accuracy from 0.8001 to 0.4975, while removing cross-domain slot transfer lowers it to 0.6361.

Evaluation under few-shot settings

From the results in Table 6, we can see that our system performs significantly worse on the “unseen APIs” subset than on the “seen APIs” subset. Generally, the zero-shot setting is a tough test bed for neural network models, and an economical way to improve performance is to include a small number of labelled samples from unseen services in the training process, which is called few-shot learning. To do this, we simply sample some dialogues at random from the dev set and add them to the training data, and then evaluate the model on the remaining dialogues in the dev set.
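The random split of the dev set can be sketched as follows. The fixed seed, the rounding rule, and the function name are assumptions made for reproducibility of the example; the text only states that dialogues are sampled at random.

```python
import random


def split_dev(dialogue_ids, train_fraction, seed=0):
    """Randomly move `train_fraction` of the dev dialogues into training.

    Returns (extra_training_ids, held_out_ids); the two lists are disjoint.
    """
    ids = list(dialogue_ids)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rng.shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]


extra_train, held_out = split_dev(range(10), train_fraction=0.9)
print(len(extra_train), len(held_out))
```

Evaluating only on the held-out part keeps the few-shot numbers free of dialogues that were seen during training.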

Table 7 presents the evaluation results when dev samples are included for training. Including samples from the dev set brings a significant improvement in joint goal accuracy, especially on the “unseen APIs” subset (from 0.6923 to 0.8059 with 90% of the dev set), which demonstrates the effect of few-shot learning.

Official evaluation on test set

We submitted our SGP-DST systems trained without dev samples and with 90% and 100% of the dev samples respectively for the final official evaluation on the test set. The results are presented in Table 8.

For the SGP-DST system trained without dev samples, its performance on the “seen APIs” test subset is very close to that on the “seen APIs” dev subset. However, its performance on the “unseen APIs” test subset is clearly poorer than on the corresponding dev subset. The SGP-DST system also performs much worse on the whole test set than on the whole dev set, since the test set contains many more dialogues with unseen APIs.

Both SGP-DST systems trained with dev samples outperform the one trained without dev samples on all four metrics for the “all APIs” set. Specifically, the system trained with 100% of the dev samples achieves the best results on active intent accuracy and requested slot F1, whereas the one trained with 90% of the dev samples achieves the best results on average goal accuracy and joint goal accuracy, which suggests that our SGP-DST systems trained with dev samples may be overfitting.

Compared with the best reported results among all 25 participants, our SGP-DST system ranks 1st on requested slots F1 but achieves a significantly worse result on active intent prediction (rank 11). For slot value prediction, there is a significant performance gap between our system and the best reported one: our SGP-DST ranks 5th and 3rd on average goal accuracy and joint goal accuracy respectively.

5 Conclusion

In this paper, we present our SGP-DST system for DSTC8-Track 4, which aims to perform dialogue state tracking (DST) under zero-shot settings. Our proposed SGP-DST system includes four modules, for intent prediction, slot prediction, slot transfer prediction, and user state summarizing respectively. According to the official evaluation results, our SGP-DST (team12) ranked 3rd on joint goal accuracy (the primary evaluation metric for ranking submissions) and 1st on requested slots F1 among all 25 participant teams. Zero-shot dialogue state tracking still calls for further study of more powerful models with better generalization ability in language understanding.

References