Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations

Jingyao Zhou et al. · 07/12/2021

Most recently proposed approaches in dialogue state tracking (DST) leverage the context and the last dialogue states to track current dialogue states, which are often slot-value pairs. Although the context contains the complete dialogue information, that information is usually indirect and may even require reasoning to obtain. The information in the last predicted dialogue states is direct, but when there is a prediction error, the dialogue information from this source will be incomplete or erroneous. In this paper, we propose the Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations network (FPDSC). This model extracts the information of each dialogue turn by modeling interactions among the turn utterance, the corresponding last dialogue states, and the dialogue slots. The representation of each dialogue turn is then aggregated by a hierarchical structure to form the passage information, which is utilized in the current turn of DST. Experimental results validate the effectiveness of the fusion network with 55.03 and 59.07 joint accuracy on the MultiWOZ 2.0 and MultiWOZ 2.1 datasets, reaching state-of-the-art performance. Furthermore, we conduct the deleted-value and related-slot experiments on MultiWOZ 2.1 to evaluate our model.


1 Introduction

Dialogue State Tracking (DST) is utilized by the dialogue system to track dialogue-related constraints and the user's requests in the dialogue context. Traditional dialogue state tracking models combine semantics extracted by language understanding modules to estimate the current dialogue states (Williams and Young, 2007; Thomson and Young, 2010; Wang and Lemon, 2013; Williams, 2014), or jointly learn speech understanding (Henderson et al., 2014; Zilka and Jurcicek, 2015; Wen et al., 2017). They rely on hand-crafted features and complex domain-specific lexicons, which are vulnerable to linguistic variations and difficult to scale. Recently proposed approaches attempt to automatically learn features from the dialogue context and the previous dialogue states. Most of them utilize only the context (Shan et al., 2020), encode the concatenation of the context and dialogue states (Hosseini-Asl et al., 2020), or utilize a simple attention mechanism to merge the information from the above two sources (Ouyang et al., 2020). These methods do not fully exploit the different natures of the two information sources: the information in the context is complete but often indirect, while the information in the predicted dialogue states is direct but may be incomplete or erroneous.

Usr: I need a place to dine in the centre.
State: restaurant-area=centre
Sys: I recommend the rice house. Would you like me to reserve a table?
Usr: Yes, please book me a table for 9.
State: restaurant-area=centre; restaurant-book people=9; restaurant-name=rice house
Sys: Unfortunately, I could not book the rice house for that amount of people.
Usr: Please find another restaurant for that amount of people at that time.
State: restaurant-area=centre; restaurant-book people=9; restaurant-name=none
Sys: How about tang restaurant?
Usr: Yes, please make me a reservation. I also need a taxi.
State: restaurant-area=centre; restaurant-book people=9; restaurant-name=tang
Sys: What is your destination?
Usr: To the restaurant.
State: restaurant-area=centre; restaurant-book people=9; restaurant-name=tang; taxi-destination=tang
Table 1: An example dialogue containing (1) the deleted-value problem, where restaurant-name changes from rice house to none after the user asks for another restaurant, and (2) the related-slot phenomenon, where the value of restaurant-name is carried over to taxi-destination.

Our FPDSC model exploits the interaction among the turn utterance, the corresponding last dialogue states, and the dialogue slots at each turn. A fusion gate (the turn-level fusion gate) is trained to balance the keep-proportion of the slot-related information from the turn utterance and from the corresponding last dialogue states at each turn. The model then applies a hierarchical structure to keep the complete information of all dialogue turns. On top of the model, we employ another fusion gate (the passage-level fusion gate) to strengthen the impact of the last dialogue states. Ouyang et al. (2020) show that such strengthening is vital for solving the related-slot problem, which is illustrated in Table 1. To eliminate the negative impact of errors in the predicted dialogue states, we train our models in two phases. In the teacher-forcing phase, all previous dialogue states are true labels; in the uniform scheduled sampling phase (Bengio et al., 2015), previous dialogue states are predicted dialogue states half of the time and true labels otherwise. Training with such natural data noise from the errors in the predicted dialogue states helps improve the model's robustness.

For ablation studies, we test the following variants of FPDSC: the base model (without turn/passage-level fusion gates), the turn-level model (with only the turn-level fusion gate), the passage-level model (with only the passage-level fusion gate), and the dual-level model (with both turn/passage-level fusion gates). We also conduct experiments for the deleted-value problem, which is explained in Table 1, and the related-slot problem. Besides, we design two comparative networks to validate the effectiveness of the turn-level fusion gate and of the whole previous dialogue states. One comparative network employs only the attention mechanism to merge information from the turn utterance, the corresponding last dialogue states, and the dialogue slots at each turn. The other comparative network utilizes only the most recent dialogue states, rather than the entire history, in the turn-level fusion gate. Our model shows strong performance on the MultiWOZ 2.0 (Budzianowski et al., 2018) and MultiWOZ 2.1 (Eric et al., 2019) datasets. Our main contributions are as follows:

  • We propose a novel model, which utilizes multi-level fusion gates and the attention mechanism to extract the slot-related information from the conversation and the previous dialogue states. The experimental results of the two comparative networks validate the effectiveness of the turn-level fusion gate in merging information and the importance of the whole previous dialogue states for improving DST performance.

  • Both turn-level and passage-level fusion between the context and the last dialogue states help improve the model's inference ability. The passage-level fusion gate at the top of the model is more effective than the turn-level fusion gate at the bottom for the slot-correlation problem, while the turn-level fusion gate is sensitive to signal tokens in the utterance, which helps improve the general DST performance.

  • Experimental results on the deleted-value and related-slot experiments show the ability of the structure to retrieve information. Besides, our models reach state-of-the-art performance on the MultiWOZ 2.0/2.1 datasets.

2 Related Work

Recently proposed methods show promising progress on the challenge of DST. CHAN (Shan et al., 2020) employs a contextual hierarchical attention network, which extracts slot-attention-based representations from the context at both the token and utterance level. Benefiting from the hierarchical structure, CHAN can effectively keep the whole dialogue contextual information. Although CHAN achieves new state-of-the-art performance on the MultiWOZ 2.0/2.1 datasets, it ignores the information from the predicted dialogue states. Figures 1 and 2 show the difference between CHAN and FPDSC in the extraction of the slot-related information in one dialogue turn.

In the work of Ouyang et al. (2020), the problem of slot correlations across different domains is defined as the related-slot problem, and the DST-SC model is proposed to address it. In that approach, the last dialogue states are vital to solving the related-slot problem. The method merges the slot-utterance attention result and the last dialogue states with an attention mechanism. However, the general performance of DST-SC is worse than that of CHAN.

SOM-DST (Kim et al., 2020) and CSFN-DST (Zhu et al., 2020) utilize part of the context and the last dialogue states as information sources. The two methods are based on the assumption of the Markov property in dialogues: they regard the last dialogue states as a compact representation of the whole dialogue history. Once a slot is falsely predicted and the slot-related context has been dropped, the dialogue states will keep the error.

Figure 1: Part of CHAN and FPDSC (base/passage-level)
Figure 2: Part of FPDSC (turn/dual-level)

3 Model

Figure 3: The structure of dual-level FPDSC. The dialogue utterance at the $t$-th turn is [CLS] R_t [SEP] U_t [SEP]. The dialogue state is a list of slot-value pairs ([CLS] Slot [SEP] Value [SEP], ..., [CLS] Slot [SEP] Value [SEP]). All slot values are none in the initial dialogue states. The turn-level approach omits the top-level slot-state attention and the passage-level fusion gate. The passage-level approach omits the per-turn slot-state attention and the turn-level fusion gate. The base approach omits both attention units and both fusion gates; it has the same structure as CHAN with a different early-stop mechanism.

Figure 3 shows the overall structure of FPDSC (dual-level). The following are the important notations for our model.

Inputs: the context $D_{t} = \{(R_{1}, U_{1}), \ldots, (R_{t}, U_{t})\}$, where $R_{t}$ and $U_{t}$ represent the system and user utterances at the $t$-th dialogue turn; the previous dialogue states $B_{t-1} = \{(s, v_{t-1}^{s}) \mid s \in \mathcal{S}\}$, where $\mathcal{S}$ is the slot set, $s$ is one of the slot names, and $v_{t-1}^{s}$ is the corresponding slot value at the $(t-1)$-th turn; $\mathcal{V}^{s}$ denotes the slot value candidates of slot $s \in \mathcal{S}$.

Turn-level Information: The slot-related information for each dialogue turn in Figure 2 is the turn-level information. In Figure 3, the turn-level information is denoted as $m_{t}^{s}$, which is the result of fusing (via the turn-level fusion gate) the slot-utterance attention result $c_{t}^{s,u}$ and the slot-dialogue-states attention result $c_{t}^{s,d}$. The weights come from the same fusion gate, which allocates the keep-proportion between the conversation and the previous dialogue states. The turn-level information of a slot over all turns is fed to a transformer encoder to form the mutual interaction information $e_{t}^{s}$.

Passage-level Information: The attention result $p_{t}^{s}$ of the mutual interaction information $e_{t}^{s}$ and a slot is the passage-level information of that slot.

Core Feature: The weight $\lambda_{t}^{s}$ is applied to balance the turn-level information of the current dialogue turn and the passage-level information of a slot, yielding $f_{t}^{s}$. We employ the attention mechanism between the turn/passage-level balanced information $f_{t}^{s}$ and the last dialogue states to strengthen the impact of the last dialogue states. Another weight $\gamma_{t}^{s}$ (from the passage-level fusion gate) merges the turn/passage-level balanced information and the strengthened information to form the core feature $r_{t}^{s}$, which is utilized in the downstream tasks.

3.1 BERT-Base Encoder

Due to pre-trained models' (e.g., BERT) strong language understanding capabilities (Mehri et al., 2020), we use a fixed-parameter BERT-Base encoder ($\mathrm{BERT}_{fixed}$) to extract the representations of slot names, slot values, and the previous dialogue states. The three parts share the same parameters from HuggingFace (https://huggingface.co/). We also apply a tunable BERT-Base encoder ($\mathrm{BERT}_{tuned}$) to learn the distribution of the informal and noisy utterances (Zhang et al., 2020b) in the dialogue context. The two BERT-Base encoders are the input layers of the model. [CLS] and [SEP] represent the beginning and the end of a text sequence. We use the output at [CLS] to represent the whole text for $\mathrm{BERT}_{fixed}$. A slot-value pair in the last dialogue states, used at the $t$-th turn, is denoted as:

$d_{t-1}^{s} = \mathrm{BERT}_{fixed}([\mathrm{CLS}] \oplus s \oplus [\mathrm{SEP}] \oplus v_{t-1}^{s} \oplus [\mathrm{SEP}])$   (1)

where $d_{t-1}^{s}$ is the slot-related representation of the last dialogue states when the dialogue comes to the $t$-th turn. Thus the full representation of the last dialogue states at the $t$-th turn is as follows:

$d_{t-1} = d_{t-1}^{s_{1}} \oplus d_{t-1}^{s_{2}} \oplus \cdots \oplus d_{t-1}^{s_{|\mathcal{S}|}}$   (2)

$\oplus$ means concatenation. The entire history of the dialogue states is $\{d_{0}, d_{1}, \ldots, d_{t-1}\}$. The representations of slot $s$ and its corresponding candidate value $v$ are as follows:

$q^{s} = \mathrm{BERT}_{fixed}([\mathrm{CLS}] \oplus s \oplus [\mathrm{SEP}])$   (3)
$y^{s,v} = \mathrm{BERT}_{fixed}([\mathrm{CLS}] \oplus v \oplus [\mathrm{SEP}])$   (4)

$\mathrm{BERT}_{tuned}$ extracts the utterance distribution of the user and system at the $t$-th turn, which is marked as:

$h_{t} = \mathrm{BERT}_{tuned}([\mathrm{CLS}] \oplus R_{t} \oplus [\mathrm{SEP}] \oplus U_{t} \oplus [\mathrm{SEP}])$   (5)

The dialogue context up to the $t$-th turn is $\{h_{1}, h_{2}, \ldots, h_{t}\}$.
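As a concrete illustration of the two encoders, the sketch below uses the HuggingFace transformers library; the function names (encode_state, encode_utterance) and the bert-base-uncased checkpoint are assumptions for illustration, not the released implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_fixed = BertModel.from_pretrained("bert-base-uncased")  # frozen: slots, values, dialogue states
bert_tuned = BertModel.from_pretrained("bert-base-uncased")  # fine-tuned: dialogue utterances
for p in bert_fixed.parameters():
    p.requires_grad = False

def encode_state(slot: str, value: str) -> torch.Tensor:
    """Represent one slot-value pair of the last dialogue states by its [CLS] output (cf. Eq. 1)."""
    enc = tokenizer(slot, value, return_tensors="pt")   # [CLS] slot [SEP] value [SEP]
    with torch.no_grad():
        out = bert_fixed(**enc).last_hidden_state
    return out[:, 0]                                    # [CLS] vector, shape (1, 768)

def encode_utterance(system: str, user: str) -> torch.Tensor:
    """Token-level representation of one dialogue turn, i.e., system + user utterance (cf. Eq. 5)."""
    enc = tokenizer(system, user, return_tensors="pt")  # [CLS] R_t [SEP] U_t [SEP]
    return bert_tuned(**enc).last_hidden_state          # all token vectors
```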

3.2 MultiHead-Attention Unit

We utilize MultiHead-Attention (Vaswani et al., 2017) here to get the slot-related information from the turn utterance and the corresponding last dialogue states. The representations at the $t$-th turn are as follows:

$c_{t}^{s,u} = \mathrm{MultiHead}(q^{s}, h_{t}, h_{t})$   (6)
$c_{t}^{s,d} = \mathrm{MultiHead}(q^{s}, d_{t-1}, d_{t-1})$   (7)

Another attention unit is applied to get the passage-level information $p_{t}^{s}$ of a slot from the mutual interaction information $e_{t}^{s}$, which is described in Section 3.3.

$p_{t}^{s} = \mathrm{MultiHead}(q^{s}, e_{t}^{s}, e_{t}^{s})$   (8)

We apply an attention unit to connect the representation of the merged turn/passage-level balanced information $f_{t}^{s}$ and the last dialogue states to enhance the impact of the last dialogue states.

$a_{t}^{s} = \mathrm{MultiHead}(f_{t}^{s}, d_{t-1}, d_{t-1})$   (9)

$a_{t}^{s}$ is the enhanced result. The attention units above do not share parameters.
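A minimal sketch of the slot-utterance and slot-state attention units is given below, using torch.nn.MultiheadAttention; the variable names and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 4
slot_utt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
slot_state_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # no parameter sharing

def slot_related_info(slot_vec, utt_tokens, state_vecs):
    """slot_vec: (B, 1, d); utt_tokens: (B, Lu, d); state_vecs: (B, Ls, d)."""
    c_utt, _ = slot_utt_attn(query=slot_vec, key=utt_tokens, value=utt_tokens)      # cf. Eq. (6)
    c_state, _ = slot_state_attn(query=slot_vec, key=state_vecs, value=state_vecs)  # cf. Eq. (7)
    return c_utt, c_state  # later merged by the turn-level fusion gate
```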

3.3 Transformer Encoder

The complete turn-level merged information $[m_{1}^{s}, \ldots, m_{t}^{s}]$ carries no dialogue sequence information. Besides, the turn representations do not fully share information with each other. Thus we apply a transformer encoder (Vaswani et al., 2017).

$e_{t}^{s} = [e_{t,1}^{s}, \ldots, e_{t,t}^{s}] = \mathrm{TransformerEncoder}([m_{1}^{s}, \ldots, m_{t}^{s}])$   (10)

where $e_{t}^{s}$ denotes the mutual interaction information and $e_{t,k}^{s}$ denotes the slot-related representation of the $k$-th dialogue turn after turn interaction, when the dialogue comes to the $t$-th turn. The transformer encoder utilizes positional encoding to record the position information and self-attention to let the dialogue turns exchange information.
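The sketch below shows this turn-interaction step with a standard PyTorch TransformerEncoder and sinusoidal positional encodings; the exact encoder configuration is an assumption.

```python
import math
import torch
import torch.nn as nn

d_model = 768
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
turn_encoder = nn.TransformerEncoder(layer, num_layers=6)

def positional_encoding(length: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encodings of shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def interact_turns(turn_reprs: torch.Tensor) -> torch.Tensor:
    """turn_reprs: (B, T, d) fused turn-level vectors for one slot; returns (B, T, d), cf. Eq. (10)."""
    T = turn_reprs.size(1)
    x = turn_reprs + positional_encoding(T, d_model).to(turn_reprs.device)
    return turn_encoder(x)  # mutual interaction information across dialogue turns
```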

3.4 Fusion Gate

The fusion gate is applied to merge information as follows:

$g_{t}^{s} = \sigma(W_{g} \cdot [c_{t}^{s,u} \oplus c_{t}^{s,d}])$   (11)
$m_{t}^{s} = (1 - g_{t}^{s}) \otimes c_{t}^{s,u} + g_{t}^{s} \otimes c_{t}^{s,d}$   (12)

$\cdot$ and $\otimes$ mean the matrix product and the point-wise product. $\sigma$ is the sigmoid function. $g_{t}^{s}$ is the output weight of the turn-level fusion gate, which keeps the information from the last dialogue states. $m_{t}^{s}$ is the turn-level information;

$\lambda_{t}^{s} = \sigma(W_{\lambda} \cdot [m_{t}^{s} \oplus p_{t}^{s}])$   (13)
$f_{t}^{s} = (1 - \lambda_{t}^{s}) \otimes m_{t}^{s} + \lambda_{t}^{s} \otimes p_{t}^{s}$   (14)

$\lambda_{t}^{s}$ is the weight to balance the turn-level merged information $m_{t}^{s}$ and the passage-level extracted information $p_{t}^{s}$;

$\gamma_{t}^{s} = \sigma(W_{\gamma} \cdot [f_{t}^{s} \oplus a_{t}^{s}])$   (15)
$r_{t}^{s} = (1 - \gamma_{t}^{s}) \otimes f_{t}^{s} + \gamma_{t}^{s} \otimes a_{t}^{s}$   (16)

$\gamma_{t}^{s}$ is the weight to balance the merged turn/passage-level balanced information $f_{t}^{s}$ and the enhanced result $a_{t}^{s}$ from Equation (9). $r_{t}^{s}$ is the slot-related core feature derived from the context and the entire history of dialogue states.
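The following is a minimal sketch of the fusion-gate pattern used at both levels: a sigmoid gate computed from the two concatenated sources yields a keep-proportion that interpolates between them. The layer name and the single-linear-layer form are assumptions.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Gated interpolation of two information sources (cf. the pattern of Eqs. (11)-(16))."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """a: e.g., slot-utterance info; b: e.g., slot-state info; g is the keep-proportion of b."""
        g = torch.sigmoid(self.proj(torch.cat([a, b], dim=-1)))
        return (1 - g) * a + g * b
```

In the dual-level variant, one such instance would serve as the turn-level gate (merging $c_{t}^{s,u}$ and $c_{t}^{s,d}$) and another as the passage-level gate (merging $f_{t}^{s}$ and $a_{t}^{s}$).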

3.5 Loss Function

Here we follow Shan et al. (2020) to calculate the probability distribution of the value $v$ and to predict whether the slot $s$ should be updated or kept compared with the last dialogue states. Thus our loss functions are as follows:

$p_{t}^{s}(v \mid D_{t}, B_{t-1}) = \dfrac{\exp(-\lVert r_{t}^{s} - y^{s,v} \rVert_{2})}{\sum_{v' \in \mathcal{V}^{s}} \exp(-\lVert r_{t}^{s} - y^{s,v'} \rVert_{2})}$   (17)
$L_{value,t}^{s} = -\log p_{t}^{s}(v_{t}^{s*} \mid D_{t}, B_{t-1})$   (18)
$L_{value} = \sum_{t=1}^{T} \sum_{s \in \mathcal{S}} L_{value,t}^{s}$   (19)

$L_{value}$ is the distance-based loss for the true value $v_{t}^{s*}$ of slot $s$;

$u_{t}^{s} = \sigma(W_{u} \cdot r_{t}^{s} + b_{u})$   (20)
$L_{trans,t}^{s} = -\left[\hat{u}_{t}^{s} \log u_{t}^{s} + (1 - \hat{u}_{t}^{s}) \log(1 - u_{t}^{s})\right]$   (21)
$L_{trans} = \sum_{t=1}^{T} \sum_{s \in \mathcal{S}} L_{trans,t}^{s}$   (22)

$L_{trans}$ is the loss function for state transition prediction, whose label set is {update, keep}. $u_{t}^{s}$ is the update probability for slot $s$ at the $t$-th turn, and $\hat{u}_{t}^{s}$ is the state transition label, with $\hat{u}_{t}^{s} = 1$ for update and $\hat{u}_{t}^{s} = 0$ for keep. We optimize the sum of the above losses in the training process:

$L = L_{value} + L_{trans}$   (23)
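A hedged sketch of the two objectives follows: a distance-based value-matching loss in the CHAN style and a binary update/keep loss, summed as the total loss. The exact reductions and parameterizations are assumptions.

```python
import torch
import torch.nn.functional as F

def value_loss(core_feat, cand_value_reprs, target_idx):
    """core_feat: (B, d); cand_value_reprs: (V, d); target_idx: (B,) index of the true value."""
    dist = torch.cdist(core_feat, cand_value_reprs)  # (B, V) L2 distances
    return F.cross_entropy(-dist, target_idx)        # softmax over negative distances, cf. Eqs. (17)-(19)

def transition_loss(update_logits, update_labels):
    """update_logits: (B,) raw scores; update_labels: (B,) floats with 0.0 = keep, 1.0 = update."""
    return F.binary_cross_entropy_with_logits(update_logits, update_labels)  # cf. Eqs. (20)-(22)

def total_loss(core_feat, cand_value_reprs, target_idx, update_logits, update_labels):
    """Sum of the two objectives, cf. Eq. (23)."""
    return (value_loss(core_feat, cand_value_reprs, target_idx)
            + transition_loss(update_logits, update_labels))
```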
Model                                    MultiWOZ 2.0    MultiWOZ 2.1
                                         Joint Acc (%)   Joint Acc (%)
TRADE (Wu et al., 2019)                  48.62           46.00
DST-picklist (Zhang et al., 2020a)       54.39           53.30
TripPy (Heck et al., 2020)               -               55.30
SimpleTOD (Hosseini-Asl et al., 2020)    -               56.45
CHAN (Shan et al., 2020)                 52.68           58.55
CHAN* (Shan et al., 2020)                -               57.45
FPDSC (base)                             51.03           54.91
FPDSC (passage-level)                    52.31           55.86
FPDSC (turn-level)                       55.03           57.88
FPDSC (dual-level)                       53.17           59.07
Table 2: Joint accuracy on the test sets of MultiWOZ 2.0 and 2.1. CHAN* denotes performance without adaptive objective fine-tuning, which addresses the slot-imbalance problem; CHAN denotes performance with the full strategy. The overall structure of FPDSC (dual-level) is illustrated in Figure 3.

4 Experiments Setup

4.1 Datasets

We evaluate our model on the MultiWOZ 2.0 and MultiWOZ 2.1 datasets, which are multi-domain task-oriented dialogue datasets. MultiWOZ 2.1 identifies and fixes many erroneous annotations and user utterances in MultiWOZ 2.0 (Zang et al., 2020).

4.2 Baseline

We compare FPDSC with the following approaches:

TRADE is composed of an utterance encoder, a slot gate, and a generator. The approach generates a value for every slot using a copy-augmented decoder (Wu et al., 2019).

CHAN employs a contextual hierarchical attention network to enhance the DST. The method applies an adaptive objective to alleviate the slot imbalance problem (Shan et al., 2020).

DST-picklist adopts a BERT-style reading comprehension model to jointly handle both categorical and non-categorical slots, matching the value from ontologies (Zhang et al., 2020a).

TripPy applies three copy mechanisms to get value span. It regards user input, system inform memory and previous dialogue states as sources (Heck et al., 2020).

SimpleTOD is an end-to-end approach that regards the sub-tasks in task-oriented dialogue as a sequence prediction problem (Hosseini-Asl et al., 2020).

4.3 Training Details

Our code is publicly available at https://github.com/helloacl/DST-DCPDS and is developed based on CHAN's code at https://github.com/smartyfh/CHAN-DST. In our experiments, we use the Adam optimizer (Kingma and Ba, 2015). We use a batch size of 2 and a maximal sequence length of 64 for each dialogue turn. The transformer encoder has 6 layers. The multi-head attention units have 4 heads and a hidden size of 784. The training process consists of two phases: 1) teacher-forcing training and 2) uniform scheduled sampling (Bengio et al., 2015). The warmup proportion is 0.1 and the peak learning rate is 1e-4. The model is saved according to the best joint accuracy on the validation data. The training process stops when there is no improvement for 15 continuous epochs. Our training devices are a GeForce GTX 1080 Ti GPU and an Intel Core i7-6800 CPU @ 3.40GHz. One training epoch takes around 0.8 hours in the teacher-forcing phase and 1.6 hours in the uniform scheduled sampling phase with a GPU.
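The sketch below illustrates the two training phases; `model`, the dialogue/turn attributes, and the returned values are hypothetical placeholders rather than the released training code.

```python
import random

def run_dialogue(model, dialogue, scheduled_sampling: bool):
    """Accumulate the loss over one dialogue, choosing the source of the previous states per turn."""
    prev_states = dialogue.initial_states      # all slot values are "none" before the first turn
    total_loss = 0.0
    for turn in dialogue.turns:
        loss, predicted_states = model(turn.utterances, prev_states, turn.gold_states)
        total_loss = total_loss + loss
        if scheduled_sampling and random.random() < 0.5:
            prev_states = predicted_states     # model's own (possibly erroneous) prediction
        else:
            prev_states = turn.gold_states     # teacher forcing: ground-truth states
    return total_loss
```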

5 Results and Analysis

Deleted-Value
Base Turn Passage Dual
Related-Slot
Base Turn Passage Dual
Table 3: Success change rate of the deleted-value and related-slot experiments for FPDSC. Turn, Passage, and Dual mean the turn-level, passage-level, and dual-level FPDSC.

5.1 Main Results

We use the joint accuracy to evaluate the general performance. Table 2 shows that our models get 55.03 and 59.07 joint accuracy, with improvements of 0.64 and 0.52 over the previous best results, on MultiWOZ 2.0 and 2.1. All of our approaches perform better on 2.1 than on 2.0, probably because of the fewer annotation errors in MultiWOZ 2.1. Though Table 3 shows that the passage-level variant performs better than the turn-level variant in the deleted-value and related-slot tests, the passage-level variant gets worse results in the general test. The small proportion of such problems in the MultiWOZ dataset and the strong sensitivity of the turn-level fusion gate to signal tokens in the utterance explain this phenomenon.
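For reference, joint accuracy counts a dialogue turn as correct only when the predicted values of all slots exactly match the gold state; a minimal computation looks as follows.

```python
def joint_accuracy(predictions, references):
    """predictions, references: lists of dicts mapping slot name -> value, one dict per dialogue turn."""
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return correct / len(references)
```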

5.2 The Comparative Experiment for the Fusion Gate

We design a comparative network to validate the effectiveness of the turn-level fusion gate. Figure 4 shows the part of the comparative network without the turn-level fusion gate; the rest of the comparative network is the same as FPDSC (turn-level). Table 4 shows the performance of the comparative network and FPDSC (turn-level) on MultiWOZ 2.1. The result validates the effectiveness of the fusion gate in merging the different information sources.

Figure 4: Part of the comparative network
Dataset    no-gate†    no-gate    turn-level†    turn-level
dev        46.38       52.58      56.17          61.39
test       43.03       49.24      54.08          57.88
Table 4: Joint accuracy of the comparative network (no-gate) and FPDSC (turn-level) on the MultiWOZ 2.1 dataset. † indicates that the approach is trained only with teacher forcing; otherwise it is further trained with uniform scheduled sampling after the teacher-forcing phase.

5.3 The Comparative Experiment for the Complete Dialogue States

We design another comparative network to validate the effectiveness of the complete previous dialogue states. As Figure 3 shows, the turn-level representations, each built with the corresponding last dialogue states of its own turn, are fed to the transformer encoder in FPDSC (turn-level). In the comparative network (single), the representations built with only the most recent dialogue states are fed to the transformer encoder. Table 5 shows that the complete previous dialogue states improve the general performance of the model.

Dataset    single†    single    turn-level†    turn-level
dev        57.25      60.94     56.17          61.39
test       54.40      56.70     54.08          57.88
Table 5: Joint accuracy of the comparative network (single) and FPDSC (turn-level) on the MultiWOZ 2.1 dataset. † indicates that the approach is trained only with teacher forcing; otherwise it is further trained with uniform scheduled sampling after the teacher-forcing phase.

5.4 Deleted-value Tests

We select the dialogues containing the deleted-value problem from the test data of MultiWOZ 2.1. We regard these dialogues as templates and augment the test data by replacing the original slot value with other slot values from the ontology. There are 800 dialogues in the augmented data. We only count the slots in the dialogue turns where the deleted-value problem occurs. As shown in Table 6, if restaurant-name=rice house at one turn and restaurant-name=None at the turn where the value is deleted, we regard it as successful tracking. We use the success change rate to evaluate the effectiveness. Table 3 shows that the explicit introduction of the previous dialogue states at both the turn level and the passage level helps solve the problem.
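A rough sketch of the template-based augmentation described above is given below; the dialogue/ontology data structures and field names are assumptions made for illustration.

```python
import copy

def augment_deleted_value(template_dialogue, slot, original_value, ontology):
    """Create new test dialogues by substituting the targeted slot value with other ontology values."""
    augmented = []
    for new_value in ontology[slot]:
        if new_value == original_value:
            continue
        dialogue = copy.deepcopy(template_dialogue)
        for turn in dialogue["turns"]:
            # substitute the value in both the utterances and the state annotations
            turn["utterance"] = turn["utterance"].replace(original_value, new_value)
            turn["state"] = {s: (new_value if v == original_value else v)
                             for s, v in turn["state"].items()}
        augmented.append(dialogue)
    return augmented
```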

Usr: Find me a museum please
restaurant-name: None
Sys: There are 23 museums. Do you have an area as preference?
Usr: I just need the area and address for one of them.
restaurant-name: None
Sys: I have the broughton house gallery in the centre at 98 king street.
Usr: Thank you so much. I also need a place to dine in the centre that serves chinese food.
restaurant-name: None
Sys: I have 10 place in the centre. Did you have a price range you were looking at?
Usr: I would like the cheap price range.
restaurant-name: None
Sys: I recommend the rice house. Would you like me to reserve a table?
Usr: yes, please book me a table for 9 on monday at 19:30.
restaurant-name: rice house
Sys: Unfortunately, I could not book the rice house for that day and time. Is there another day or time that would work for you?
Usr: Can you try a half hour earlier or later and see if the have anything available?
restaurant-name: rice house
Dual-level: restaurant-name: rice house
Base: restaurant-name: rice house
Sys: No luck, would you like me to try something else?
Usr: Yes, please find another cheep restaurant for that amount of people at that time.
restaurant-name: None
Dual-level: restaurant-name: None
Base: restaurant-name: rice house
Table 6: Dialogue MUL2359 from MultiWOZ 2.1. The restaurant-name lines give the gold labels; the Dual-level and Base lines give the predictions of FPDSC (dual-level) and FPDSC (base).

5.5 Related-slot Tests

We focus on the multi-domain dialogues that contain the taxi domain for the related-slot test. We select 136 dialogue turns from the MultiWOZ 2.1 test data that contain templates such as book a taxi from A to B or commute between A and B. We replace the explicit expression in order to focus on the actual related-slot filling situation. For example, in the dialogue from Table 7, we replace the value Ballare with attraction in the user utterance at the last turn. We only count the slots taxi-departure and taxi-destination whose values are not None in the dialogue turns that contain the related-slot phenomenon. We divide the sum of successful tracking counts by the number of the above slots to get the success change rate. Table 3 shows the result.

Usr: Can you give me information on an attraction called ballare?
taxi-departure: None; taxi-destination: None
Sys: The Ballare is located in Heidelberg Gardens, Lion Yard postcode cb23na, phone number is 01223364222. The entrance fee is 5 pounds.
Usr: Thanks. I’m also looking for somewhere to stay in the north. It should be in the moderate price range and has a star of 2 as well
taxi-departure: None; taxi-destination: None
Sys: Would you want to try the lovell lodge, which is in the moderate price range and in the north.
Usr: Let’s do that. Please reserve it for 6 people and 5 nights starting from thursday.
taxi-departure: None; taxi-destination: None
Sys: The booking goes through and the reference number is TY5HFLY1.
Usr: Can you help me to book a taxi from the hotel to the Ballare. I want to leave by 17:30.
taxi-departure: lovell lodge; taxi-destination: ballare; taxi-leave at: 17:30
Table 7: Dialogue MUL2657 from MultiWOZ 2.1
Joint Acc        Normal Evaluation    Evaluation with Teacher Forcing
Dataset          dev      test        dev      test
Base             58.01    54.91       -        -
Turn-level†      56.17    54.08       69.13    65.82
Turn-level       61.39    57.88       -        -
Passage-level†   55.21    52.40       66.84    61.92
Passage-level    61.11    55.86       -        -
Dual-level†      56.17    54.08       70.22    67.17
Dual-level       61.89    59.07       -        -
Table 8: Joint accuracy of the variants of our approach in different training phases on MultiWOZ 2.1. Normal evaluation means that the approach uses predicted dialogue states as inputs; evaluation with teacher forcing means that it uses the true labels as previous dialogue states. † indicates that the approach is trained only with teacher forcing; otherwise it is further trained with uniform scheduled sampling after the teacher-forcing phase.
Figure 5: Visualization of output weights in the fusion gates. The weight represents the proportion of the information from the previous dialogue states. The large weight with dark color means that the prediction of the slot value pays much attention to the previous dialogue states. Turn, Passage, Dual mean FPDSC with turn-level, passage-level and dual-level.

5.6 Gate Visualization

Figure 5 shows the output weight of the turn/passage-level fusion gates in dialogue MUL2359 (Table 6) and MUL2657 (Table 7) from MultiWOZ 2.1. Turn, Passage, Dual in titles of subplots represent FPDSC with turn-level, passage-level, and dual-level. All the weights in Figure 5 mean the information keep-proportion from the last dialogue states.

When we focus on the slot restaurant-name in dialogue MUL2359, the output weight of the turn-level fusion gate in the turn/dual-level approaches is small at the turn where rice house is first mentioned and at the turn where the constraint is released. Since the slot value is introduced and later withdrawn in the current utterances at those turns, the change of the weight for slot restaurant-name is reasonable. When we focus on the slots taxi-departure, taxi-destination, and taxi-leave at in the last turn of dialogue MUL2657, the respective information sources for the three slots are: only the previous dialogue states (hotel-name carries over to taxi-departure), both the previous dialogue states and the current user utterance (Ballare can be found in both the user utterance and the previous dialogue states as attraction-name), and only the user utterance (17:30 appears only in the user utterance of that turn). As shown in Figure 5, at that turn of MUL2657, taxi-departure has a large weight, taxi-destination has a middle weight, and taxi-leave at has a small weight. This trend is as expected.

Figure 5 also shows that the turn-level fusion gate is sensitive to signal tokens in the current user expression. At the 4th dialogue turn of MUL2359, the word cheap triggers a low output weight of the turn-level fusion gate for the slots hotel-price range and restaurant-price range, which is reasonable because no domain signal appears in the 4th utterance. The output of the passage-level fusion gate keeps a relatively low weight once the corresponding slot has been mentioned in the dialogue, except for the name-related slots.

Although the output weights of the passage-level fusion gate share a similar distribution in the passage/dual-level methods at the final dialogue turn of MUL2359, FPDSC (passage-level) makes a false prediction of restaurant-name while FPDSC (dual-level) is correct. The two fusion gates can work together to improve performance, which explains the high performance of the dual-level strategy.

5.7 Ablation Study

Table 2 shows that the passage/turn/dual-level approaches obtain improvements (0.95, 2.97, and 4.16) over the base approach on MultiWOZ 2.1. The results show that the turn-level fusion gate is vital to our approaches and that the entire history of dialogue states is helpful for DST. The uniform scheduled sampling training is crucial for improving our models' performance. In Table 8, dev and test represent the validation and test data. As the table shows, all of our approaches improve the joint accuracy by around 3-5 points after uniform scheduled sampling training. The falsely predicted dialogue states work as data noise, which improves the model's robustness. The base approach utilizes only the information from the context and is trained without uniform scheduled sampling.

6 Conclusion

In this paper, we combine the entire history of the predicted dialogue states and the contextual representation of the dialogue for DST. We use a hierarchical fusion network to merge the turn-level and passage-level information. Both levels of information are useful for solving the deleted-value and related-slot problems, and our models reach state-of-the-art performance on MultiWOZ 2.0 and MultiWOZ 2.1.

The turn-level fusion gate is sensitive to signal tokens in the current turn utterance, whereas the passage-level fusion gate is relatively stable. Uniform scheduled sampling training is crucial for improving our models' performance. The entire history of dialogue states helps extract information from each dialogue utterance. Although some errors exist in the predicted dialogue states, these errors work as data noise during training and enhance the proposed model's robustness.

Although our approach is based on a predefined ontology, the strategy for information extraction is universal. Besides, the core feature can be fed to a decoder to generate the slot value, which suits most open-domain DST approaches.

Acknowledgement

We thank the anonymous reviewers for their helpful comments. This work is supported by the NSFC projects (No. 62072399, No. 61402403), Hithink RoyalFlush Information Network Co., Ltd, Hithink RoyalFlush AI Research Institute, Chinese Knowledge Center for Engineering Sciences and Technology, MoE Engineering Research Center of Digital Library, and the Fundamental Research Funds for the Central Universities.

References

  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, pp. 1171–1179.
  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić (2018) MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026.
  • M. Eric, R. Goel, S. Paul, A. Sethi, S. Agarwal, S. Gao, and D. Hakkani-Tür (2019) MultiWOZ 2.1: multi-domain dialogue state corrections and state tracking baselines.
  • M. Heck, C. van Niekerk, N. Lubis, C. Geishauser, H. Lin, M. Moresi, and M. Gasic (2020) TripPy: a triple copy strategy for value independent neural dialog state tracking. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 35–44.
  • M. Henderson, B. Thomson, and S. Young (2014) Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 292–299.
  • E. Hosseini-Asl, B. McCann, C. Wu, S. Yavuz, and R. Socher (2020) A simple language model for task-oriented dialogue. In Advances in Neural Information Processing Systems 33.
  • S. Kim, S. Yang, G. Kim, and S. Lee (2020) Efficient dialogue state tracking by selectively overwriting memory. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 567–582.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015).
  • S. Mehri, M. Eric, and D. Hakkani-Tur (2020) DialoGLUE: a natural language understanding benchmark for task-oriented dialogue. arXiv preprint arXiv:2009.13570.
  • Y. Ouyang, M. Chen, X. Dai, Y. Zhao, S. Huang, and J. Chen (2020) Dialogue state tracking with explicit slot connection modeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 34–40.
  • Y. Shan, Z. Li, J. Zhang, F. Meng, Y. Feng, C. Niu, and J. Zhou (2020) A contextual hierarchical attention network with adaptive objective for dialogue state tracking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6322–6333.
  • B. Thomson and S. Young (2010) Bayesian update of dialogue state: a POMDP framework for spoken dialogue systems. Computer Speech & Language 24 (4), pp. 562–588.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
  • Z. Wang and O. Lemon (2013) A simple and generic belief tracking mechanism for the dialog state tracking challenge: on the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference, pp. 423–432.
  • T. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2017) A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 438–449.
  • J. D. Williams and S. Young (2007) Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language 21 (2), pp. 393–422.
  • J. D. Williams (2014) Web-style ranking and SLU combination for dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 282–291.
  • C. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung (2019) Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 808–819.
  • X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, and J. Chen (2020) MultiWOZ 2.2: a dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pp. 109–117.
  • J. Zhang, K. Hashimoto, C. Wu, Y. Wang, P. Yu, R. Socher, and C. Xiong (2020a) Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pp. 154–167.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020b) DialoGPT: large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 270–278.
  • S. Zhu, J. Li, L. Chen, and K. Yu (2020) Efficient context and schema fusion networks for multi-domain dialogue state tracking. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 766–781.
  • L. Zilka and F. Jurcicek (2015) Incremental LSTM-based dialog state tracker. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 757–762.