
MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines

by   Mihail Eric, et al.

MultiWOZ is a recently released multi-domain dialogue dataset spanning 7 distinct domains and containing over 10,000 dialogues, making it one of the largest resources of its kind to date. Although it is an immensely useful resource, while building different classes of dialogue state tracking models on MultiWOZ we detected substantial errors in the state annotations and dialogue utterances, which negatively impacted the performance of our models. To alleviate this problem, we use crowdsourced workers to fix the state annotations and utterances in the original version of the data. Our correction process results in changes to over 32% of state annotations across 40% of the dialogue turns. In addition, we fix 146 dialogue utterances throughout the dataset, focusing in particular on slot value errors represented within the conversations. We then benchmark a number of state-of-the-art dialogue state tracking models on this new MultiWOZ 2.1 dataset and show joint state tracking performance on the corrected state annotations. We are publicly releasing MultiWOZ 2.1 to the community, hoping that this resource will allow more effective dialogue state tracking models to be built in the future.



1 Introduction

In task-oriented conversational systems, dialogue state tracking refers to the important problem of estimating a user’s goals and requests at each turn of a dialogue. The state is typically defined by the underlying ontology of the domains represented in a dialogue, and a system’s job is to learn accurate distributions for the values of certain domain-specific slots in the ontology. There have been a number of public datasets and challenges released to assist in building effective dialogue state tracking modules Williams et al. (2013); Henderson et al. (2014); Wen et al. (2017).

One of the largest resources of its kind is the MultiWOZ dataset, which spans 7 distinct task-oriented domains including hotel, taxi, and restaurant booking, among others Budzianowski et al. (2018). This dataset has been a unique resource, in terms of the multi-domain interactions as well as slot value transfers between these domains, and has quickly attracted researchers for dialogue state tracking Nouri and Hosseini-Asl (2018); Goel et al. (2019); Wu et al. (2019) and dialogue policy learning Zhao et al. (2019).

Though the original MultiWOZ dataset comes with fine-grained dialogue state annotations for all the domains at the turn level, in practice we have found substantial noise in the annotations of dialogue state values. While some amount of noise in annotations cannot be avoided, it is desirable to have clean data so that the error patterns in various models can be attributed to model mistakes rather than to the data. To this end, we re-annotated states in the MultiWOZ data with a different set of annotators. We specifically accounted for five kinds of common mistakes in MultiWOZ, detailed in Section 2.1. In addition, we corrected spelling errors and canonicalized entity names, as detailed in Section 2.3.

Finally, we ran state-of-the-art dialogue state tracking models on the corrected data to provide competitive baselines for this new dataset. With this work, we release a corrected version of MultiWOZ 2.0, which we call MultiWOZ 2.1, as well as baselines consisting of state-of-the-art dialogue state tracking techniques on this new data.

Slot Name | Train: % changed | Train: # changed | Dev: % changed | Dev: # changed | Test: % changed | Test: # changed
taxi-leaveAt 0.43% 246 0.30% 22 0.73% 54
taxi-destination 1.46% 830 1.33% 98 1.38% 102
taxi-departure 1.47% 833 1.29% 95 1.41% 104
taxi-arriveBy 0.29% 167 0.26% 19 0.43% 32
restaurant-people 0.74% 423 0.64% 47 0.71% 52
restaurant-day 0.72% 410 0.62% 46 0.68% 50
restaurant-time 0.74% 422 0.71% 52 0.77% 57
restaurant-food 2.77% 1574 2.45% 181 2.13% 157
restaurant-pricerange 2.36% 1338 1.83% 135 2.71% 200
restaurant-name 8.20% 4656 5.84% 431 9.58% 706
restaurant-area 2.34% 1328 1.55% 114 2.75% 203
bus-people 0.00% 0 0.00% 0 0% 0
bus-leaveAt 0.00% 0 0.00% 0 0% 0
bus-destination 0.00% 0 0.00% 0 0% 0
bus-day 0.00% 0 0.00% 0 0% 0
bus-arriveBy 0.00% 0 0.00% 0 0% 0
bus-departure 0.00% 0 0.00% 0 0% 0
hospital-department 0.12% 68 0.00% 0 0% 0
hotel-people 1.06% 603 0.61% 45 0.61% 45
hotel-day 1.00% 565 0.69% 51 0.65% 48
hotel-stay 1.18% 671 0.61% 45 0.84% 62
hotel-name 6.90% 3917 5.84% 431 5.81% 428
hotel-area 3.43% 1947 2.03% 150 3.95% 291
hotel-parking 2.69% 1526 2.78% 205 2.67% 197
hotel-pricerange 3.09% 1753 2.18% 161 2.39% 176
hotel-stars 1.69% 962 1.38% 102 1.95% 144
hotel-internet 2.27% 1290 2.17% 160 3.05% 225
hotel-type 3.58% 2035 2.64% 195 2.79% 206
attraction-type 4.57% 2594 4.43% 327 4.03% 297
attraction-name 5.99% 3400 6.60% 487 8.86% 653
attraction-area 2.13% 1212 1.79% 132 3.23% 238
train-people 0.92% 520 0.53% 39 0.75% 55
train-leaveAt 2.07% 1178 2.12% 156 4.64% 342
train-destination 0.91% 518 0.69% 51 0.87% 64
train-day 0.84% 476 0.54% 40 0.85% 63
train-arriveBy 1.29% 730 1.06% 78 2.82% 208
train-departure 1.01% 573 0.94% 69 0.66% 49
Joint 41.34% 23473 37.96% 2799 45.02% 3319
Table 1: Percentage of changes in dialogue state values before and after re-annotation. The largest numbers of changed values are in the name slots (e.g., restaurant-name, attraction-name, and hotel-name). Such slots had particularly large numbers of spelling mistakes (e.g., shanghi family restaurant to shanghai family restaurant). Note that while the share of changes to any individual slot is small, we ended up changing the joint dialogue state for over 40% of dialogue turns.

In Section 2 we describe the data correction process and provide examples and statistics on the corrections. We detail our baseline models in Section 3 and discuss their performance on the new dataset in Section 4.

# Values Previous Value New Value
6279 none dontcare
2011 none yes
1159 none hotel
1049 dontcare none
920 none centre
Table 2: Top 5 slot value changes (all data) between MultiWOZ 2.1 and MultiWOZ 2.0 by frequency count
Slot Name 2.0 2.1
taxi-leaveAt 119 106
taxi-destination 277 247
taxi-departure 261 244
taxi-arriveBy 101 95
restaurant-people 9 9
restaurant-day 10 9
restaurant-time 61 61
restaurant-food 104 99
restaurant-pricerange 11 5
restaurant-name 183 166
restaurant-area 19 7
bus-people 1 1
bus-leaveAt 2 2
bus-destination 5 5
bus-day 2 2
bus-arriveBy 1 1
bus-departure 2 2
hospital-department 52 49
hotel-people 11 9
hotel-day 11 9
hotel-stay 10 9
hotel-name 89 66
hotel-area 24 7
hotel-parking 8 5
hotel-pricerange 9 5
hotel-stars 13 8
hotel-internet 8 4
hotel-type 18 5
attraction-type 37 33
attraction-name 137 136
attraction-area 16 7
train-people 14 12
train-leaveAt 134 149
train-destination 29 23
train-day 11 9
train-arriveBy 107 114
train-departure 35 27
Table 3: Comparison of slot value vocabulary sizes (training set) between MultiWOZ 2.0 and MultiWOZ 2.1. Note that vocabulary sizes shrank for most slots (except train-arriveBy and train-leaveAt) due to the data cleaning.

2 Dataset Corrections

Correction Type | % of Slot Values
no change | 98.16%
none → value | 1.23%
valueA → valueB | 0.44%
value → none | 0.17%
value → dontcare | 0.23%
Table 4: % of values of slots changed in MultiWOZ 2.1 vs. MultiWOZ 2.0
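
The correction categories in Table 4 can be illustrated with a minimal sketch. This is not the authors' script: the function names, the string labels, and the order in which the rules are checked (for example, whether a change from none to dontcare counts as "none -> value") are assumptions made for illustration only.

```python
from collections import Counter

NONE = "none"
DONTCARE = "dontcare"


def correction_type(old: str, new: str) -> str:
    """Assign one (old, new) slot value pair to a Table 4 style category.

    Assumption: rules are checked in this order, so a change from
    "none" to "dontcare" is counted as "none -> value".
    """
    old, new = old.strip().lower(), new.strip().lower()
    if old == new:
        return "no change"
    if old == NONE:
        return "none -> value"
    if new == NONE:
        return "value -> none"
    if new == DONTCARE:
        return "value -> dontcare"
    return "valueA -> valueB"


def correction_stats(pairs):
    """Return the percentage of slot values falling into each category."""
    counts = Counter(correction_type(old, new) for old, new in pairs)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}
```

Given aligned (old, new) value pairs for every slot at every turn, `correction_stats` yields a breakdown in the same form as Table 4.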

2.1 Dialogue State Error Types

The most common error types in the original dialogue state annotations include the following:

  • Delayed markups. These refer to slot values that were annotated one or more turns after the value appeared in the user utterances. Row 1 of Table 5 shows this case where the “Turkish” value appears one turn late in the MultiWOZ 2.0 dialogue.

  • Multi-annotations. The same value is annotated as belonging to multiple slots, usually one of these is correct and the other one is spurious. Row 2 of Table 5 shows such a case where “belf” is spurious.

  • Mis-annotations. The value is annotated as belonging to a wrong slot type. In row 3 of Table 5 we can see a case where “Thursday” appears in a wrong slot.

  • Typos. The value is annotated, but it includes a typo or is not canonicalized. Row 4 of Table 5 exhibits such a case with “centre” misspelled.

  • Forgotten values. The slot value never occurs in the dialogue state, even though it was mentioned by the user. Row 5 of Table 5 is an example where “dontcare” is never seen in the data.

Type | Conversation | MultiWOZ 2.0 | MultiWOZ 2.1
Delayed markups | User: I’d also like to try a Turkish restaurant. Is that possible? Agent: I’m sorry but the only restaurants in that part of town serve either Asian food or African food. User: I don’t mind changing the area. I just need moderate pricing and want something that serves Turkish food. | None after the first user turn; Turkish after the second user turn | Turkish after the first user turn
Multi-annotations | User: Can you tell me more about Cambridge Belfry | The Cambridge Belfry; belf | The Cambridge Belfry; None
Mis-annotations | User: Yes, I need to leave on Thursday and am departing from London Liverpool Street. | train.leaveAt: Thursday; Not Mentioned | train.leaveAt: None; Thursday
Typos | Although, I could use some help finding an attraction in the centre of town. | attraction.area: cent | attraction.area: Centre
Forgotten values | User: No particular price range, but I do need a restaurant that is available to book 7 people on Friday at 19:15. | restaurant.pricerange: None | restaurant.pricerange: Dontcare
Value canonicalization | User: I think you should try again. Cambridge to Bishop Stafford on Thursday. | train.destination: Bishop Stortford | train.destination: Bishops Stortford
Table 5: Examples of annotation errors between MultiWOZ 2.0 and 2.1

2.2 Dialogue State Corrections

Our corrections were of two types: manual corrections and automated corrections. Manual corrections involved asking annotators to go over each dialogue turn by turn and correct mistakes detected in the original annotations. During this step, we noticed that the dialogue state could sometimes include multiple values, and hence we annotated them as such. Table 7 includes examples of these cases. MultiWOZ 2.1 has over 250 such multi-value slot values.

After the first manual pass of annotation correction, we wrote scripts to canonicalize slot values for lookup in the domain databases provided as part of the corpus. Row 6 of Table 5 shows one such example. We also present some of the most frequent corrections for state values in Table 2. Table 4 presents statistics on the types of corrections made.

Due to our canonicalization and re-annotation, the vocabulary sizes of most slots decreased significantly (Table 3), except for two slots: “train-leaveAt” and “train-arriveBy”. For these slots, we noticed that some times (such as “20:07”) were missing from the dialogue states, and our annotations introduced them. We also canonicalized all times to the 24-hour format.

Model MultiWOZ 2.0 MultiWOZ 2.1
FJST 40.2% 38.0%
HJST 38.4% 35.55%
TRADE 48.6% 45.6%
DST Reader 39.41% 36.4%
HyST 42.33% 38.1%
Table 6: Test set joint state accuracies for various models on the MultiWOZ 2.0 and MultiWOZ 2.1 data. FJST refers to the Flat Joint State Tracker, and HJST refers to the Hierarchical Joint State Tracker.
Conversation | State value(s)
Agent: I have two restaurants. They are Pizza Hut Cherry Hinton and Restaurant Alimentum. User: What type of food do each of them serve? | Pizza Hut Cherry Hinton, Restaurant Alimentum
User: I would like to visit a museum or a nice nightclub in the north. | attraction.type: museum, nightclub
User: I would also like a reservation at a Jamaican restaurant in that area for seven people at 12:45, if there is none Chinese would also be good. | Jamaican (preferred), Chinese
User: I would prefer one in the cheap range, a moderately priced one is fine if a cheap one isn’t there. | restaurant.pricerange: cheap (preferred), moderate
Table 7: Example dialogue sections with multi-value slots in their states.

2.3 Dialogue Utterance Corrections

It is often the case when building dialogue state systems that the target slot values are mentioned verbatim in the dialogue history. Many copy-based dialogue state tracking models heavily rely on this assumption Goel et al. (2018). In these situations, it is crucial that the slot values are represented correctly within the user and system utterances. However, because dialogue datasets are often collected via crowdsourced platforms where workers are asked to provide utterances via free-form text inputs, these slot values within the utterances may be misspelled or they may not be consistent with the true values from the ontology.

To detect potential error cases within the utterances, for every single dialogue turn, we computed the terms that have Levenshtein distance less than 3 from the slot values annotated for that turn. We then performed string matching for these terms within the turn, forming a set of error candidates. This created a candidate set of 225 potential errors which we then manually inspected to filter out those candidates which were false positives, leaving a collection of 67 verified errors. We then programmatically scanned the entire dataset applying corrections to the verified errors, changing 146 total utterances.

As an example of a corrected utterance: “I’m leaving from camgridge and county folk museum.” was changed to “I’m leaving from cambridge and county folk museum.” Without such a correction, it would be very difficult for a span-based copy mechanism to correctly identify the slot value “cambridge and county folk museum” in its original form.
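
The candidate detection step described above can be sketched as follows. This is a hedged illustration rather than the script used for the dataset: the tokenization, the sliding-window matching, and the names `levenshtein` and `error_candidates` are assumptions. Only the threshold (edit distance below 3) comes from the text.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def error_candidates(utterance, slot_values, max_distance=3):
    """Find near-miss mentions of annotated slot values in one utterance.

    Assumption: each slot value is compared against every window of the
    same number of tokens; exact matches (distance 0) are not errors.
    """
    tokens = re.findall(r"\w+", utterance.lower())
    found = []
    for value in slot_values:
        value_tokens = value.lower().split()
        target = " ".join(value_tokens)
        n = len(value_tokens)
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            distance = levenshtein(window, target)
            if 0 < distance < max_distance:
                found.append((window, target, distance))
    return found
```

Candidates produced this way would still need the manual inspection described above, since a small edit distance can also occur between two distinct but valid values.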

3 Baseline Models

Within dialogue state tracking, there are two primary classes of models: fixed vocabulary and open vocabulary. In fixed vocabulary models, the state tracking mechanism operates on a predefined ontology of possible slot values, usually defined to be the values seen in the training and validation data splits. These models benefit from being able to fluidly predict values that aren’t present in a given dialogue history but suffer from the rigidity of having to define the potentially large slot value list per domain during the model training phase. By contrast, open vocabulary models are able to flexibly extract slot values from a dialogue history but struggle to predict slot values that have not been seen in the history.

In order to benchmark performance on our updated dataset, we provide joint dialogue state accuracies for a number of fixed vocabulary and open vocabulary models, which are reported in Table 6. For the models, the dialogue history up to turn $t$ is defined as $H_t = \{U_1, S_1, \ldots, U_{t-1}, S_{t-1}, U_t\}$, where $U_i$ and $S_i$ are the user and system utterances at turn $i$, respectively. Note that this history also includes the current ($t$-th) user utterance.
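
As a reference point for the metric, joint state accuracy can be sketched as below. The representation of a state as a slot-to-value dictionary, the function names, and the choice to treat a missing slot as equivalent to "none" are assumptions for illustration, not details confirmed by the text.

```python
def normalize(state):
    """Drop unfilled slots so that a missing slot and "none" compare equal.

    Assumption: this equivalence is a common convention, not one stated here.
    """
    return {slot: value.strip().lower()
            for slot, value in state.items()
            if value.strip().lower() != "none"}


def joint_accuracy(predictions, references):
    """Fraction of turns whose full predicted state matches the reference.

    A turn counts as correct only if every slot value is correct.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align turn by turn")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)
```

Because a single wrong slot makes the whole turn incorrect, joint accuracy is much stricter than per-slot accuracy, which is why small per-slot changes in Table 1 add up to a large share of changed joint states.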

The Flat Joint State Tracker refers to a bidirectional LSTM network that encodes the full dialogue history and then applies a separate feedforward network to the encoded hidden state for every single state slot. In practice this amounts to 37 separately branching feedforward networks that are trained jointly. The Hierarchical Joint State Tracker incorporates a similar architecture but instead encodes the history using a hierarchical network in the vein of Serban et al. (2016). TRADE is a recently proposed model that achieved state-of-the-art results on the original MultiWOZ 2.0 data, using a generative state tracker with a copy mechanism Wu et al. (2019). The DST Reader is a newly proposed model that frames state tracking as a reading comprehension problem, learning to extract slot values as spans from the dialogue history Review (2019). HyST is another new model, which combines a hierarchical encoder system with an n-gram copy-based system Goel et al. (2019).

4 Results and Discussion

As we can see from Table 6, the relative performances of the models have remained the same across the data updates. However, we also noticed a consistent drop in performance for all models on MultiWOZ 2.1 compared to MultiWOZ 2.0, which was a particularly surprising result.

In order to understand the source of this drop, we investigated the performances of the Flat Joint State Tracker and Hierarchical Joint State Tracker on the MultiWOZ 2.0 and the MultiWOZ 2.1 datasets. Across the two datasets, we observed that there are 937 new turn-level prediction errors that the Flat Joint State Tracker makes on MultiWOZ 2.1 that it did not make on MultiWOZ 2.0. These turns contain 1370 slot value prediction errors in total. Of these slot value errors, 184 (13.4%) are the result of a dontcare target label for which our model predicts another value.

When we looked at predictions of the Hierarchical Joint State Tracker, we saw that a model trained on MultiWOZ 2.0 generated 331 errors for which the ground truth label was dontcare but it predicted none. Meanwhile, a model trained on MultiWOZ 2.1 generated 748 such errors, an increase of more than 2.25x. As shown in Table 4, 0.23% of our corrections involved changing a value to a dontcare label, so we hypothesize that our corrections have increased the complexity of learning the dontcare label correctly. Given that effectively capturing user ambiguity is an important characteristic of conversational systems, this leaves ample room for improvement in future models.

Also noteworthy is the fact that 439 new errors for the Flat Joint State Tracker (32.0%) occur when the target label is none but the model predicts another value. As Table 4 shows, 0.17% of our corrections involved changing a slot from a value to none, suggesting that MultiWOZ 2.1 now more heavily penalizes spurious slot value predictions.
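
The proportions in this error analysis can be checked with a few lines of arithmetic. The counts are taken from the text; the assumption, labeled in the code, is that both Flat Joint State Tracker error counts are measured against the same 1370 new slot value errors.

```python
# Counts reported in the text for the Flat Joint State Tracker (FJST).
NEW_SLOT_ERRORS = 1370        # new slot value errors on MultiWOZ 2.1
DONTCARE_TARGET_ERRORS = 184  # target is dontcare, model predicts another value
NONE_TARGET_ERRORS = 439      # target is none, model predicts another value

# Counts for the Hierarchical Joint State Tracker (HJST):
# target is dontcare but the model predicts none.
HJST_ERRORS_2_0 = 331
HJST_ERRORS_2_1 = 748


def share(part: int, whole: int) -> float:
    """Percentage of `whole` accounted for by `part`."""
    return 100.0 * part / whole


# Assumption: both FJST shares use the 1370 new slot errors as denominator.
dontcare_share = share(DONTCARE_TARGET_ERRORS, NEW_SLOT_ERRORS)
none_share = share(NONE_TARGET_ERRORS, NEW_SLOT_ERRORS)
hjst_increase = HJST_ERRORS_2_1 / HJST_ERRORS_2_0

print(f"dontcare-target share: {dontcare_share:.1f}%")
print(f"none-target share: {none_share:.1f}%")
print(f"HJST dontcare->none error increase: {hjst_increase:.2f}x")
```

Together the two categories account for under half of the new slot value errors, so the remaining errors come from other label changes.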

For the Flat Joint State Tracker, we also observed that the largest slot accuracy decrease from MultiWOZ 2.0 to MultiWOZ 2.1 occurred for the slot (). We inspected the kinds of errors the model was generating and found that the vast majority of these errors were legitimate model prediction mistakes on correctly annotated dialogue states. This encourages further research in enhancing the performance of these state-tracking models, especially on proper name extraction.

5 Conclusion

We publicly release the state-corrected MultiWOZ 2.1 and rerun competitive state tracking baselines on this dataset. The dataset will be available on the original Cambridge University webpage. We hope that the cleaner data allows for better model and performance comparisons on the task of multi-domain dialogue state tracking.


6 Appendix

We also present the percentage of state values that can be filled by copying values directly from the conversation up to that turn. We call this the Copy Oracle. This is the upper performance limit for copy-based approaches. The oracle accuracy is far higher than what existing state-of-the-art systems achieve on this dataset. This accuracy gap motivates the need for accurate slot value annotations as well as correct slot values within the dialogue utterances, as these will allow us to continue to improve on the performance of these open vocabulary systems. The statistics are provided in Table 8.

Slot Name | Train set: Oracle 2.0 | Train set: Oracle 2.1 (change vs. 2.0) | Dev set: Oracle 2.0 | Dev set: Oracle 2.1 (change vs. 2.0)
taxi-leaveAt 99.07% 0.11% 98.75% 0.11%
taxi-destination 98.90% -0.04% 99.38% -0.45%
taxi-departure 99.15% -0.19% 99.36% -0.14%
taxi-arriveBy 99.67% 0.01% 99.61% 0.05%
restaurant-people 99.36% -0.02% 99.28% -0.08%
restaurant-day 99.66% 0.10% 99.84% -0.01%
restaurant-time 99.40% 0.05% 99.73% -0.11%
restaurant-food 99.00% -0.52% 99.35% -0.05%
restaurant-pricerange 98.60% 0.12% 99.44% -0.12%
restaurant-name 98.33% -1.38% 98.58% -1.25%
restaurant-area 99.11% 0.16% 99.51% 0.03%
bus-people 100.00% 0.00% 100.00% 0.00%
bus-leaveAt 100.00% 0.00% 100.00% 0.00%
bus-destination 99.99% 0.00% 100.00% 0.00%
bus-day 100.00% 0.00% 100.00% 0.00%
bus-arriveBy 100.00% 0.00% 100.00% 0.00%
bus-departure 100.00% 0.00% 100.00% 0.00%
hospital-department 99.82% -0.01% 99.91% 0.00%
hotel-people 99.08% -0.03% 99.17% -0.05%
hotel-day 99.57% 0.06% 99.82% 0.01%
hotel-stay 99.15% 0.02% 99.34% -0.11%
hotel-name 97.88% -0.76% 98.63% -0.64%
hotel-area 99.18% 0.11% 99.82% 0.03%
hotel-parking 99.60% 0.13% 99.58% 0.42%
hotel-pricerange 98.80% 0.09% 99.30% -0.12%
hotel-stars 97.97% 0.01% 98.17% -0.29%
hotel-internet 99.52% 0.20% 99.57% 0.23%
hotel-type 98.44% 0.34% 98.41% 0.49%
attraction-type 96.30% -0.24% 96.09% -0.52%
attraction-name 98.96% -1.10% 99.40% -1.75%
attraction-area 99.53% 0.06% 99.68% 0.12%
train-people 97.82% 0.17% 97.46% 0.11%
train-leaveAt 97.56% -0.05% 96.91% 0.20%
train-destination 99.53% 0.02% 99.04% 0.16%
train-day 99.70% 0.07% 99.80% 0.04%
train-arriveBy 98.53% -0.05% 98.18% 0.16%
train-departure 99.49% 0.07% 99.21% 0.22%
Joint 74.15% -1.93% 75.60% -2.43%
Table 8: Oracle copy accuracy and accuracy percentage change between MultiWOZ 2.0 and MultiWOZ 2.1
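
A minimal sketch of how a Copy Oracle of this kind could be computed is given below. It is an illustration under stated assumptions, not the procedure behind Table 8: verbatim substring matching, the treatment of "none" as trivially fillable, and the function names are all hypothetical.

```python
def is_copyable(value, history, special_values=("none",)):
    """True if the value can be filled by copying it from the history.

    Assumption: values listed in `special_values` need no copying and
    count as fillable; everything else must appear verbatim.
    """
    v = value.strip().lower()
    return v in special_values or v in history.lower()


def copy_oracle(turns, special_values=("none",)):
    """Compute (slot-level accuracy, joint accuracy) of a copy oracle.

    `turns` is a list of (history_text, state) pairs, where `state`
    maps slot names to annotated values for that turn.
    """
    slot_hits = slot_total = joint_hits = 0
    for history, state in turns:
        flags = [is_copyable(v, history, special_values)
                 for v in state.values()]
        slot_hits += sum(flags)
        slot_total += len(flags)
        joint_hits += all(flags)  # every slot of the turn must be copyable
    slot_acc = slot_hits / slot_total if slot_total else 0.0
    joint_acc = joint_hits / len(turns) if turns else 0.0
    return slot_acc, joint_acc
```

Under this sketch, a misspelled mention such as “camgridge” makes the annotated value uncopyable, which illustrates why utterance corrections matter for copy-based trackers.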