Oh My Mistake!: Toward Realistic Dialogue State Tracking including Turnback Utterances

by   Takyoung Kim, et al.
Sejong University
Korea University

The primary purpose of dialogue state tracking (DST), a critical component of an end-to-end conversational system, is to build a model that responds well to real-world situations. Although we often change our minds during ordinary conversations, current benchmark datasets do not adequately reflect such occurrences and instead consist of over-simplified conversations in which no one changes their mind. The main question inspiring the present study is: “Are current benchmark datasets sufficiently diverse to handle casual conversations in which one changes their mind?” We found that the answer is “No” because simply injecting template-based turnback utterances significantly degrades the DST model performance. The test joint goal accuracy on the MultiWOZ decreased by over 5%p when the simplest form of turnback utterance was injected. Moreover, the performance degeneration worsens when facing more complicated turnback situations. However, we also observed that the performance rebounds when a turnback is appropriately included in the training dataset, implying that the problem is not with the DST models but rather with the construction of the benchmark dataset.




1 Introduction

Figure 1: Dialogue flow example of MultiWOZ 2.1 (MUL1514.json): (a) benchmark; (b) in reality.

The dialogue state tracking (DST) module is a part of a task-oriented dialogue system, the main role of which is to extract essential information from various conversational situations. Based on the given information from the previous module, the DST module finds appropriate slot-value pairs to understand the current situation, and these pairs are then delivered to the next module to continue the conversation. Hence, building an accurate DST model is a key success factor of an end-to-end task-oriented dialogue system, not only because it can convince users that the system understands what they are talking about, but also because appropriate responses can be generated based on the result of the DST model. As in other natural language processing (NLP) tasks, two main components are mandatory to build a good DST model: (1) well-structured machine learning models and (2) sufficiently large datasets that contain various real-world conversational situations with fewer biases for training the model. Since the introduction of Transformer and BERT vaswani2017attention; devlin2018ert, various breakthrough model structures have been designed for DST, such as SUMBT and SOM-DST lee2019sumbt; kim2019efficient, and have shown excellent performance. With respect to DST-specific datasets, by contrast, some benchmark datasets, such as WOZ wen2017network and MultiWOZ budzianowski2018large, have been introduced; however, their sizes and coverage are not yet satisfactory owing to the relatively high labeling cost. For example, the MultiWOZ consists of only approximately 10,000 dialogues from several different domains, which is significantly smaller than other NLP datasets such as SQuAD or IMDB maas-etal-2011-learning; rajpurkar-etal-2016-squad.

Although the MultiWOZ has been used as a standard benchmark dataset for DST, an increasing number of recent studies have reported concerns regarding the inherent limitations of this dataset. First, newer versions of MultiWOZ have been proposed to address issues such as annotation errors, typos, standardization, and annotation consistency eric2019multiwoz; zang2020multiwoz; han2020multiwoz; ye2021multiwoz. In addition, qian2021annotation pointed out an entity bias issue, i.e., only a small number of values in the ontology account for the majority of labels. For example, a large number of ‘train-destination’ slots take the value ‘cambridge’ in the MultiWOZ qian2021annotation. Furthermore, CoCo li2020coco pointed out an overestimation of the held-out accuracy by showing that the training and evaluation sets of the MultiWOZ have a similar distribution, and proposed controllable counterfactual goals that do not change the original dialogue flow but generate a new dialogue with different responses.


Although previous studies have raised inherent problems in the MultiWOZ, most have tended to focus on correcting the annotation inconsistency or entity biases, which makes the dialogues in the dataset even more idealistic. However, in real-world conversations, the dialogue flow between two speakers is not always as fluent as those in the MultiWOZ, e.g., one can occasionally change one’s mind during a conversation. For example, Figure 1(a) shows a sample dialogue in the MultiWOZ. No slot that appears once appears again in the subsequent dialogue turns. The main hypothesis motivating this study is that real conversations do not always continue as shown in Figure 1(a), but often continue as shown in Figure 1(b). Individuals change their mind during a conversation, and thus some slots appear repeatedly, with different values, over an entire dialogue. This hypothesis has led us to raise the main question of this paper: “Can the current benchmark dataset handle a situation in which users change their mind?” Our assumption is that the turnback situation of a user will degrade the performance of a DST model because such models do not have a chance to learn the situation in which the values of specific slots are changed during the conversation. To experimentally verify our assumption, we investigate how the DST performance changes when template-based utterances with the turnback situation are injected into the MultiWOZ. To the best of our knowledge, this is the first approach investigating the change in decision of a user in a dialogue.

It is common for users to change their decisions in various ways in the real world, and thus we define four turnback situations as follows:

  • Single turnback: This is the simplest form in which the user changes the decision of a single slot only once.

  • Return turnback: The user changes the decision for a single slot and then returns to the original value, i.e., the decision is reversed twice.

  • Dual value turnback: The decision for a single slot is changed twice and thus the corresponding values are also changed twice.

  • Dual slot turnback: The decisions for two slots are changed sequentially. Each corresponding value is changed only once.

The latter three situations are more complicated variants of the single turnback, obtained by modifying the number of repetitions or slots. However, it is not easy to generate utterances that the benchmark does not consider because newly creating dialogues and annotating belief states incurs substantial costs. Hence, we use a simple template-based method to generate new turnback utterances and inject them at the end of the existing dialogues. We believe that an additional turn, constructed by extracting the belief state of the previous turns and applying a template, is a cost-effective way to sufficiently convey the intent to the model.

In this paper, we evaluate the performance in turnback situations with TRADE, SUMBT, and Transformer-DST WuTradeDST2019; lee2019sumbt; zeng2020jointly. The results show that the joint goal accuracy decreases significantly when injecting turnback utterances in the test set; the performance degeneration is up to 5.72%p even with the simplest single turnback situation. Moreover, when applying more complicated turnback utterances, i.e., return, dual value, and dual slot turnback, the performance decreases by up to 10.41%p. We further determined that including turnback utterances appropriately during the training phase can make a model robust without compromising the performance on the existing data because the model performance rebounds. The main contributions of this paper can be summarized as follows:

  • We define a novel problem that the current benchmark cannot handle, i.e., the change in decision of the user, which must be considered when constructing an end-to-end conversational system.

  • We quantitatively and qualitatively evaluate three representative DST models to verify the effect of the turnback situation by simply injecting template-based utterances into the existing dataset. The more complicated the turnback situation is, the more significantly the performance of the DST decreases.

  • We explore the effect of various turnback proportions in both the training and testing datasets: When a turnback utterance is not considered during the training, the degradation in the performance of the DST increases as the proportion of the turnback increases in the test dataset. When the models are trained using turnback utterances, the performances of the DST become more robust to the turnback proportions in the test dataset.

2 Related Work

2.1 Dialogue state tracking (DST)

The goal of DST is to extract user goals/intentions expressed during a conversation and to encode them as a compact set of dialogue states, i.e., a set of slots and their corresponding values WuTradeDST2019. There are three main directions for the DST model development. First, classification-based methods select an appropriate value based on a predefined ontology. By contrast, extraction-based methods utilize the context of a dialogue to extract a span suitable for a slot. Generation-based methods, the most advanced strategy for DST, generate appropriate values by referring to dialogue contexts.

2.2 Previous DST models

Before the advent of generative models, DST was applied based on two different approaches: a delexicalization method using a semantic dictionary zilka2015ncremental; rastogi2017calable, or a neural network with word embeddings mrksic2016eural. Because these approaches cannot transfer information across different domains, their extendability and practical usefulness are limited. To overcome this limitation, TRADE proposed a generative multi-domain DST model, employing a pointer-generator network as a decoder to apply a copy mechanism WuTradeDST2019; see2017et. Because of its knowledge transferability, TRADE has become a baseline model to evaluate the performances of later DST models.

Although TRADE can transfer knowledge of different domains by adopting a copy mechanism, it uses a recurrent neural network (RNN) structure in the encoder such that the dialogue context is occasionally not fully preserved owing to a capacity limitation of the RNN encoder. This is resolved when a Transformer-based encoder structure is introduced, along with a large-scale pre-trained model such as BERT devlin2018ert or RoBERTa liu2019roberta. Based on these advances in general language models, SUMBT lee2019sumbt employed a BERT-based encoder and used a single scalable and universal belief tracker to predict all domain and slot values. It scores every candidate slot-value pair in a non-parametric manner using a distance measure. By doing so, SUMBT significantly improves the flexibility and scalability of the previous DST models.

In contrast to TRADE, which generates a new value for every slot in each dialogue turn, SOM-DST selectively overwrites the values by regarding the dialogue state as an explicit fixed-sized memory kim2019efficient. It predicts an operation for each slot, such as CARRYOVER to maintain the current value and UPDATE to change the slot value. Another advance took place in the encoder-decoder structure. Because the operation prediction objective affects only the BERT encoder, whereas the value generation objective affects only the RNN decoder, Transformer-DST upgrades SOM-DST by using a single BERT as both an encoder and a decoder to jointly optimize the BERT for dialogue state tracking zeng2020jointly.
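The selective-overwrite idea can be illustrated with a minimal Python sketch. This is not the SOM-DST implementation: the operation predictor and the value generator are replaced by hand-written inputs, only the two operations named above are modelled, and the function name `apply_operations` is hypothetical.

```python
def apply_operations(prev_state, operations, generated_values):
    """Return the dialogue state after applying one operation per slot.

    CARRYOVER keeps the previous value; UPDATE overwrites it with a newly
    generated value. In a real tracker both `operations` and
    `generated_values` would be model predictions; here they are supplied
    by hand (an assumption made for illustration).
    """
    new_state = {}
    for slot, prev_value in prev_state.items():
        operation = operations.get(slot, "CARRYOVER")
        if operation == "CARRYOVER":
            new_state[slot] = prev_value
        elif operation == "UPDATE":
            new_state[slot] = generated_values[slot]
        else:
            raise ValueError(f"unsupported operation: {operation}")
    return new_state


# The state is a fixed set of slots; only the slots marked UPDATE change.
prev = {
    "taxi-departure": "la raza",
    "taxi-leaveat": "11:45",
    "taxi-destination": "restaurant 17",
}
new = apply_operations(prev, {"taxi-leaveat": "UPDATE"}, {"taxi-leaveat": "15:00"})
print(new)
```

A turnback utterance corresponds to an UPDATE on a slot that already holds a value, which is the case the experiments below probe.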

2.3 Data limitation

MultiWOZ budzianowski2018large is one of the most popular multi-domain task-oriented dialogue datasets. Although a new task-oriented dialogue dataset, such as SGD rastogi2020towards, has been recently proposed, most previous studies still evaluate the performance based on MultiWOZ. However, it has been revealed that the MultiWOZ has inherent errors and biases, and several studies have been proposed to resolve the reported issues.

Annotation error

Even the recent versions of MultiWOZ still have incorrect labels and inconsistent annotations eric2019multiwoz; zang2020multiwoz; han2020multiwoz; ye2021multiwoz; hosseini2020simple. Such noise is the primary reason why it is challenging to accurately evaluate the model performance. Fortunately, the benchmark is continuously updated by progressively correcting any annotation errors found.

Biased slot

The slots in MultiWOZ are biased. The slots in the training and test sets overlap by more than 90%, and the co-occurrence between slots in the test set is also unequally distributed. DST models are vulnerable to unseen slots because biased slots do not consider rare but realistic slot combinations. CoCo generates counterfactual dialogues to allow the existing dataset to cover realistic conversation scenarios li2020coco.

Biased entity

Entities in the MultiWOZ are also significantly biased (e.g., “cambridge” accounts for 50% of the destination cities in the train domain). The test dataset shares most of its entities with the training dataset, and existing models are vulnerable to unseen entities qian2021annotation. Thus, a new test dataset consisting of unseen entities was proposed, on which the performance decreases qian2021annotation.

Change my mind

During a real conversation, people often change their minds. For example, when making a reservation for a restaurant, one might change the number of visitors, arrival time, or menu. When catching a taxi, the rider might ask the driver to go to their office first, and suddenly decide to go home to take a rest instead. Someone might want to sleep more, so they might delay their departure time. There are many other examples in which speakers change their mind or decision during a conversation. Unfortunately, the current well-known DST benchmark dataset does not seem to take these scenarios into serious consideration. All conversations continue naturally, and no one reverses what they have said. Our contention regarding the conditions of a good DST benchmark dataset is that the conversations in the dataset should reflect more realistic situations, e.g., frequent turnback utterances, which are a main component of ordinary conversations in the real world.

Figure 2: Template utterances of each phase (Training, Validation, Testing).
Figure 3: An example of the proposed turnback situations: (a) single turnback; (b) return turnback; (c) dual value turnback; (d) dual slot turnback. Text in orange denotes a domain, blue denotes a slot, and green denotes a value.
Figure 4: Process of single turnback dialogue generation.

3 Method

To test whether a model trained with the current DST dataset can track the change in value in a turnback situation, we assume four turnback scenarios (single, return, dual value, and dual slot turnback) and inject these turnback utterances at the end of every dialogue. In other words, each dialogue D containing T turns can be formulated as D = {(S_1, U_1), ..., (S_T, U_T)}, where S_t and U_t denote the system and user utterances at turn t, and we then append extra template-generated turns with one of the aforementioned turnback situations at the end of the existing data, resulting in D' = {(S_1, U_1), ..., (S_{T+k}, U_{T+k})}, where k = 1 for a single turnback situation or k = 2 for multiple situations. Figure 2 shows examples of the turnback templates used in each dataset. Note that we used different templates for different datasets to avoid an overlap across the datasets. Whenever a template-based utterance is generated, a template of the corresponding phase is selected arbitrarily for each injected turn.

Single turnback

Users change the value of a particular slot only once, as shown in Figure 3(a). A single turnback utterance is constructed using the last turn of the dialogue because it contains the belief states accumulated throughout the dialogue. Figure 4 shows the process of generating a single turnback utterance; the process is skipped when no belief state appears during the dialogue.

Return turnback

Users change the value of a particular slot but return to the original value again, as shown in Figure 3(b). This means that the final belief state after injecting a return turnback utterance is the same as the belief state of the original dataset. In this case, the first turnback utterance is generated as in the single turnback process, and the second turnback utterance is then generated in the same way by replacing the changed value with the original value.

Dual value turnback

Users sequentially change the value of a particular slot twice, as shown in Figure 3(c). Dual value turnback utterances can be generated in the same way as return turnback utterances, and can be generalized to a triple or quadruple value turnback if the ontology contains enough distinct values for the slot.

Dual slot turnback

Users first change the value of a particular slot and then also change the value of a different slot, as represented in Figure 3(d). This can be generated simply by applying a single turnback twice; however, there must be more than two total belief states to apply this scenario.
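The four scenarios can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the templates and the toy ontology are invented for the example, the names `inject_turnback`, `make_turn`, and `pick_value` are hypothetical, and the sketch only requires two changeable slots for the dual slot case.

```python
import random

# Hypothetical templates; the paper's own templates (Figure 2) are not reproduced.
TEMPLATES = [
    "Wait, it might be better to change {domain} {slot} to {value}.",
    "Sorry, please change {domain} {slot} to {value} instead.",
]
# Toy ontology mapping "domain-slot" keys to candidate values (illustrative only).
ONTOLOGY = {
    "taxi-leaveat": ["11:45", "15:00", "10:15", "12:00"],
    "taxi-destination": ["restaurant 17", "finches bed and breakfast"],
    "taxi-departure": ["la raza", "the copper kettle"],
}
SCENARIOS = {"single", "return", "dual_value", "dual_slot"}


def make_turn(state, key, value, rng):
    """Build one template turn setting `key` to `value`: (utterance, new state)."""
    domain, slot = key.split("-", 1)
    utterance = rng.choice(TEMPLATES).format(domain=domain, slot=slot, value=value)
    return utterance, {**state, key: value}


def pick_value(key, exclude, rng):
    """Pick an ontology value for `key` that is not in `exclude`."""
    return rng.choice([v for v in ONTOLOGY[key] if v not in exclude])


def inject_turnback(state, scenario, rng):
    """Return the extra (utterance, belief state) turns for one scenario."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario}")
    # A dual value turnback needs three distinct values for the chosen slot.
    needed = 3 if scenario == "dual_value" else 2
    keys = sorted(k for k in state if len(ONTOLOGY.get(k, [])) >= needed)
    if not keys or (scenario == "dual_slot" and len(keys) < 2):
        return []  # skip dialogues without a usable belief state
    first = rng.choice(keys)
    original = state[first]
    changed = pick_value(first, {original}, rng)
    turns = [make_turn(state, first, changed, rng)]
    current = turns[-1][1]
    if scenario == "return":
        turns.append(make_turn(current, first, original, rng))
    elif scenario == "dual_value":
        third = pick_value(first, {original, changed}, rng)
        turns.append(make_turn(current, first, third, rng))
    elif scenario == "dual_slot":
        second = rng.choice([k for k in keys if k != first])
        turns.append(make_turn(current, second, pick_value(second, {state[second]}, rng), rng))
    return turns


last_state = {
    "taxi-departure": "la raza",
    "taxi-leaveat": "11:45",
    "taxi-destination": "restaurant 17",
}
for utterance, belief in inject_turnback(last_state, "dual_slot", random.Random(0)):
    print(utterance, belief)
```

Each returned turn carries the updated belief state, so the appended turns can serve directly as labelled examples.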

4 Experiments

4.1 Experimental setup

We verified our hypothesis using the MultiWOZ 2.1, the most commonly used DST dataset in previous studies. As a performance metric, the joint goal accuracy was employed. The joint goal accuracy is a standard criterion used to check whether the model tracks the triplets of (domain, slot, value) precisely. For each turn, the prediction is scored as 1 when all triplets are tracked correctly and as 0 otherwise. The numbers of dialogues in the training, validation, and test sets are 8420, 1000, and 999, respectively. The open-source code for the TRADE model was taken from the CoCo repository (https://github.com/salesforce/coco-dst), while the code for SUMBT (https://github.com/SKTBrain/SUMBT) and Transformer-DST (https://github.com/zengyan-97/Transformer-DST) was taken from the original authors. All the experiments explained later were conducted using a machine with eight NVIDIA GeForce RTX 3090 GPUs.
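As a minimal sketch of the metric, the joint goal accuracy can be computed as the fraction of turns whose predicted state matches the gold state exactly. The function name and the set-of-triplets representation are assumptions made for illustration; the example states follow the taxi dialogue used later in the qualitative comparison.

```python
def joint_goal_accuracy(predictions, labels):
    """Fraction of turns whose predicted state equals the gold state exactly.

    Each element is a collection of (domain, slot, value) triplets for one
    turn; a turn scores 1 only if every triplet matches, otherwise 0.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for pred, gold in zip(predictions, labels) if set(pred) == set(gold))
    return correct / len(labels)


def taxi_state(leave_at, destination):
    """Helper building a taxi-domain state with a fixed departure."""
    return {
        ("taxi", "departure", "la raza"),
        ("taxi", "leaveat", leave_at),
        ("taxi", "destination", destination),
    }


# Gold states for the last original turn and two injected turnback turns.
gold = [
    taxi_state("11:45", "restaurant 17"),
    taxi_state("15:00", "restaurant 17"),
    taxi_state("15:00", "finches bed and breakfast"),
]
# A tracker that misses the change in the leave-at slot.
missed = [
    taxi_state("11:45", "restaurant 17"),
    taxi_state("11:45", "restaurant 17"),
    taxi_state("11:45", "finches bed and breakfast"),
]
jga_missed = joint_goal_accuracy(missed, gold)
jga_perfect = joint_goal_accuracy(gold, gold)
print(jga_missed, jga_perfect)
```

Because a single wrong slot makes the whole turn count as incorrect, a missed turnback also penalizes every later turn in which the stale value persists.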

4.2 Injecting turnback dialogues into a test set

Table 1: Joint goal accuracy (%) of turnback-injected test set. Values in parentheses are the decrease (%p) from the original test set.

Model            Original  Single         Return         Dual value      Dual slot
TRADE            49.55     44.05 (5.50)   42.13 (7.42)   39.76 (9.79)    39.40 (10.15)
TRADE + CoCo     50.21     44.49 (5.72)   42.80 (7.41)   39.95 (10.26)   39.80 (10.41)
SUMBT            46.99     42.72 (4.27)   40.41 (6.58)   39.29 (7.70)    38.63 (8.36)
Transformer-DST  54.47     49.84 (4.63)   47.74 (6.73)   46.00 (8.47)    44.98 (9.49)
Figure 5: Performance gap based on the existence of turnback in the training data (one panel per model, including (b) TRADE + CoCo and (d) Transformer-DST).

We first injected one of four turnback situations into every dialogue in the test dataset. The joint goal accuracy of the turnback-injected test dataset is shown in Table 1. Note that the original performances reported are the result of our implementation. For TRADE, we additionally evaluated when the CoCo-augmented dataset was included li2020coco. When models trained with the existing dataset face a turnback situation in the test phase, they cannot recognize the changes well. Even using the simplest single turnback injection, the performance decreases by up to 5.72%p. Moreover, more complicated turnback situations lead to even worse performances; in particular, reversing two belief states, i.e., a dual slot turnback, showed the biggest performance decrease of up to 10.41%p. We believe that performance degeneration is sufficiently large to support our hypothesis on the present research question, i.e., current DST models trained with the existing dataset cannot appropriately respond to user turnback requests.

Turn # Dialogue History
1 System: “ ”
User: “I need a taxi. I’ll be departing from la raza.”
2 System: “I can help you with that. When do you need to leave?”
User: “I would like to leave after 11:45 please.”
3 System: “Where will you be going?”
User: “I’ll be going to restaurant 17.”
4 System: “I have booked for you a black volkswagen, the contact number is 07552762364. Is there anything else I can help you with?”
User: “No, that’s it. Thank you!”
5 System: “Completed.”
User: “Wait , it might be better to change taxi leave at to 15:00.”
6 System: “Sure. Anything else?”
User: “Hold on , I’ve been thinking about it and I think changing taxi destination to finches bed and breakfast will be better.”
Table 2: Sample dialogue of test set with additional dual slot turnback situation (SNG01367.json).
Table 3: The model prediction on dual slot turnback situation at turns 4, 5, and 6 (SNG01367.json).

Turn 4
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (dual-slot-trained model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
Turn 5
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-15:00", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (dual-slot-trained model): "taxi-departure-la raza", "taxi-leaveat-15:00", "taxi-destination-restaurant 17"
Turn 6
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-15:00", "taxi-destination-finches bed and breakfast"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-finches bed and breakfast"
  Predicted state (dual-slot-trained model): "taxi-departure-la raza", "taxi-leaveat-15:00", "taxi-destination-finches bed and breakfast"

4.3 Including turnback dialogues in the training set

Because the main hypothesis was sufficiently supported by the first experiment, we further investigated whether including turnback situations in the training dataset enables the model to trace the changing values. We inserted turnback utterances at the end of all training and validation dialogues, and different template utterances were randomly used for the training and validation phases.

Figure 5 shows the joint goal accuracy for each turnback scenario before and after the turnback utterances are included in the training and validation datasets. The performance always improves irrespective of the turnback scenarios and DST models. Also note that the performance recovery is more significant for more complicated turnback scenarios. Injecting turnback utterances increases the joint goal accuracy by 1.83%p on average for a single turnback, whereas the average improvement is 4.90%p for the dual slot turnback.

In addition to the quantitative rebound in performance, we also conducted a qualitative comparison of the model predictions before and after the turnback injection in the training and validation datasets. Table 2 shows an example of a dual slot turnback dialogue, and the predicted states of the Transformer-DST model are shown in Table 3. The prediction results of the remaining three turnback situations are provided in Tables A1, A2, and A3 in the Appendix. The first row of Table 3 is the last turn of the original dialogue, and we can see that both the original and dual-slot-trained models predict the belief states correctly. In the second and third rows of the same table, when the values of two slots are sequentially changed, the original model catches only one changed value (‘finches bed and breakfast’). This failure to follow all changes is frequently observed with the original model in other test dialogues. By contrast, the model trained with the turnback utterances correctly predicts the entire belief state, as shown in the last row and the last column of Table 3.

Based on the results shown in Figure 5 and Table 3, we can conclude that the performance degeneration of the DST models is not because the DST model structures are incorrect but because the models do not have a chance to learn from such turnback utterances with the current benchmark DST dataset, which means that the MultiWOZ dataset does not yet sufficiently cover real-world dialogues.

Single turnback
Test \ Train  Train-0%  Train-30%  Train-50%  Train-70%  Train-100%  Difference
Test-0%       54.47     54.40      54.32      54.44      52.80       -0.03%p
Test-30%      53.04     53.81      53.84      54.00      52.22       0.96%p
Test-50%      52.06     53.44      53.36      53.46      51.88       1.40%p
Test-70%      50.90     52.81      52.78      52.73      51.12       1.91%p
Test-100%     49.84     51.98      52.23      52.32      50.65       2.48%p
Drop JGA      4.63%p    2.98%p     2.09%p     2.12%p     2.15%p
* Bold denotes the best, and underline denotes second-best performance.
Table 4: Joint goal accuracy (%) of Transformer-DST with different single turnback proportions.

4.4 Difference in performance according to turnback proportion

We also conducted an ablation study on how the turnback utterance proportions in the training and test datasets affect the DST performance. We evaluate five different proportions of turnback-injected training and test datasets (i.e., 0%, 30%, 50%, 70%, and 100%), resulting in a total of 25 combinations of training-test turnback proportions. We named each turnback-mixed dataset phase-N%. For example, Train-30% denotes the training dataset in which turnback utterances are applied to 30% of the existing dialogues, and the remaining 70% of the original dialogues are unmodified. The performances of Transformer-DST are shown in Table 4. The performances of the other models are provided in Tables A4, A5, and A6 in the Appendix. The last row of the table is the decrease in the joint goal accuracy from the original test dataset to the fully turnback-applied Test-100% dataset, and the last column of the table is the difference between the best-proportion model performance and the original performance.
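A phase-N% dataset of this kind can be built with a few lines of Python. The sketch below is an assumption about the procedure rather than the authors' code: it selects the dialogues to modify uniformly at random with a fixed seed, and the names `mix_turnback` and `append_marker` are hypothetical.

```python
import random


def mix_turnback(dialogues, proportion, inject, seed=0):
    """Apply `inject` to a random `proportion` of dialogues; keep the rest as-is."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    n_inject = round(proportion * len(dialogues))
    chosen = set(rng.sample(range(len(dialogues)), n_inject))
    return [inject(d) if i in chosen else d for i, d in enumerate(dialogues)]


def append_marker(dialogue):
    """Stand-in for a real turnback generator: returns a new, longer dialogue."""
    return dialogue + ["<turnback turn>"]


# Ten toy dialogues with two turns each.
dialogues = [[f"dialogue {i} turn 1", f"dialogue {i} turn 2"] for i in range(10)]
train_30 = mix_turnback(dialogues, 0.3, append_marker)
n_modified = sum(1 for d in train_30 if d[-1] == "<turnback turn>")
print(n_modified, "of", len(train_30), "dialogues contain a turnback")
```

Fixing the seed keeps the same subset across runs, which makes comparisons between models trained on the same mixture reproducible.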

Based on Table 4, we can make the following observations. First, adding a moderate proportion of turnback utterances does not affect the performance on Test-0%, which is the original test dataset. The joint goal accuracies of Train-30%, Train-50%, and Train-70% are very close to that of Train-0%. Second, adequately mixing the turnback utterances with the existing dataset generally yields a better performance when turnback utterances exist in the test dataset, and the effect is more significant when turnback utterances appear more frequently. The last column of Table 4 shows that the difference for Test-0% is negligible (-0.03%p), but it gradually increases along with the turnback utterance proportion in the test dataset.
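The Difference column can be recomputed from the accuracies reported in Table 4. The short Python sketch below takes, for each test proportion, the best accuracy among the turnback-trained models (Train-30% to Train-100%) and subtracts the Train-0% accuracy; the variable and function names are illustrative.

```python
# Joint goal accuracy (%) of Transformer-DST from Table 4.
# Columns: Train-0%, Train-30%, Train-50%, Train-70%, Train-100%.
TABLE4 = {
    "Test-0%": [54.47, 54.40, 54.32, 54.44, 52.80],
    "Test-30%": [53.04, 53.81, 53.84, 54.00, 52.22],
    "Test-50%": [52.06, 53.44, 53.36, 53.46, 51.88],
    "Test-70%": [50.90, 52.81, 52.78, 52.73, 51.12],
    "Test-100%": [49.84, 51.98, 52.23, 52.32, 50.65],
}


def difference(row):
    """Best turnback-trained accuracy minus the Train-0% accuracy, in %p."""
    baseline, *trained = row
    return round(max(trained) - baseline, 2)


differences = {test: difference(row) for test, row in TABLE4.items()}
for test, value in differences.items():
    print(f"{test}: {value:+.2f}%p")
```

The recomputed values grow monotonically with the test proportion, which is the trend the second observation describes.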

5 Conclusion

A dialogue state tracking model should focus on properly reacting to unpredictable scenarios from a human speaker. From this perspective, using realistic benchmark datasets for the model is crucial. To validate recent DST models trained on the commonly used DST benchmark dataset, we first designed a cost-effective template-based data injection method to create a turnback situation and modified the test dataset by appending one of four turnback scenarios to the end of each dialogue. Our experiment showed that the current models trained using the existing benchmark cannot track the changing values well when users change their decisions. We also conducted another experiment to investigate whether the model performance can be recovered if the turnback utterances are properly included in the training dataset. Experimental results showed that the joint goal accuracy was improved for all turnback scenarios when the models were trained on the dataset with turnback utterances. The ablation study shows that a model trained with a moderate proportion of turnback utterances can handle a broader range of turnback proportions. Our experimental results emphasize that constructing an appropriate benchmark dataset is as important as developing an advanced model structure in an NLP task.

Despite the meaningful results, there are some limitations of the current work that lead us to some future research directions. First, we generated turnback utterances based on simple templates, but it would be more realistic if more diverse turnback dialogue expressions can be generated, which might be possible with large language models. Second, turnback utterance is just one of many situations that can happen in a real-world conversation. If more diverse realistic dialogue scenarios are reflected in the DST benchmark dataset, the bias of models trained on it can be significantly reduced.


Appendix A

Table A1: Model prediction on single turnback situation at turns 4 and 5 (SNG01367.json).

Turn 4
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (single-trained model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
Turn 5
  Gold state (label): "taxi-departure-london liverpool street", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (single-trained model): "taxi-departure-london liverpool street", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"

Table A2: Model prediction on return turnback situation at turns 4, 5, and 6 (SNG01367.json).

Turn 4
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (return-trained model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
Turn 5
  Gold state (label): "taxi-departure-the copper kettle", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (return-trained model): "taxi-departure-the copper kettle", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
Turn 6
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (return-trained model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"

Table A3: Model prediction on dual value turnback situation at turns 4, 5, and 6 (SNG01367.json).

Turn 4
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
  Predicted state (dual-value-trained model): "taxi-departure-la raza", "taxi-leaveat-11:45", "taxi-destination-restaurant 17"
Turn 5
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-10:15", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-10:15", "taxi-destination-restaurant 17"
  Predicted state (dual-value-trained model): "taxi-departure-la raza", "taxi-leaveat-10:15", "taxi-destination-restaurant 17"
Turn 6
  Gold state (label): "taxi-departure-la raza", "taxi-leaveat-12:00", "taxi-destination-restaurant 17"
  Predicted state (original model): "taxi-departure-la raza", "taxi-leaveat-10:15", "taxi-destination-restaurant 17"
  Predicted state (dual-value-trained model): "taxi-departure-la raza", "taxi-leaveat-12:00", "taxi-destination-restaurant 17"

Single turnback
Test \ Train  Train-0%  Train-30%  Train-50%  Train-70%  Train-100%  Difference
Test-0%       49.55     48.47      48.25      48.11      48.81       -0.74%p
Test-30%      47.82     47.41      47.16      47.16      47.82       0.00%p
Test-50%      46.52     46.62      46.41      46.67      47.24       0.72%p
Test-70%      45.31     45.92      45.63      45.85      46.50       1.19%p
Test-100%     44.05     45.12      45.13      45.29      46.36       2.31%p
Drop JGA      5.50%p    3.35%p     3.12%p     2.82%p     2.45%p
* Bold denotes the best, and underline denotes second-best performance.
Table A4: Joint goal accuracy (%) of TRADE with different single turnback proportions.

Single turnback
Test \ Train  Train-0%  Train-30%  Train-50%  Train-70%  Train-100%  Difference
Test-0%       50.21     48.40      49.80      47.73      48.05       -0.41%p
Test-30%      48.36     47.30      48.74      46.81      47.22       0.38%p
Test-50%      47.13     46.57      48.16      46.07      46.62       1.03%p
Test-70%      46.02     45.57      47.42      45.38      45.89       1.40%p
Test-100%     44.49     44.75      46.73      44.75      45.30       2.24%p
Drop JGA      5.72%p    3.65%p     3.07%p     2.98%p     2.75%p
* Bold denotes the best, and underline denotes second-best performance.
Table A5: Joint goal accuracy (%) of TRADE + CoCo with different single turnback proportions.

Single turnback
Test \ Train  Train-0%  Train-30%  Train-50%  Train-70%  Train-100%  Difference
Test-0%       46.99     46.24      46.32      47.16      47.10       0.17%p
Test-30%      45.59     46.57      46.17      47.18      47.38       1.79%p
Test-50%      44.80     46.29      45.70      46.70      47.22       2.42%p
Test-70%      43.73     45.54      45.13      46.11      46.39       2.66%p
Test-100%     42.72     45.01      44.70      45.62      46.04       3.32%p
Drop JGA      4.27%p    1.23%p     1.62%p     1.54%p     1.06%p
* Bold denotes the best, and underline denotes second-best performance.
Table A6: Joint goal accuracy (%) of SUMBT with different single turnback proportions.