Spoken task-oriented dialogue systems guide the user to complete a certain task through speech interaction. While such speech systems generally include an explicit automatic speech recognition (ASR) component (ASR-based pipelines), they now tend to be replaced by end-to-end architectures in which the system takes speech as input and directly produces a decision from it. Examples include recently proposed end-to-end approaches for spoken language understanding (SLU) [9, 1, 4]. However, these end-to-end models currently achieve performance equivalent to, but not better than, pipeline approaches based on ASR (see for instance [9, 1]). Besides, in some specific use cases, it may be preferable to deploy a modular system (instead of a monolithic one) in which only one component (ASR, SLU, dialogue state tracker) is modified at a time.
This article is positioned in this latter context and studies how to better account for ASR ambiguity in a voice-based dialogue state tracking system. Most recent work on spoken dialogue systems uses the top-N ASR hypotheses to track the dialogue state and infer user needs. However, ASR lattices provide a richer hypothesis space than the one represented by the top-N hypothesis list. More precisely, we revisit the use of word confusion networks (simply denoted as confnets), derived from ASR lattices, as a compact and efficient representation of an ASR output. Encoding such graphical representations with existing state-of-the-art dialogue systems enables recent dialogue state trackers (DST) to incorporate speech-based user input in addition to textual inputs, thus increasing the scope of such systems. We introduce a neural confusion network encoder (see Figure 1) which can be used as a plug-in to any dialogue state tracker and achieves better results than using a list of top-N ASR hypotheses.
Our research contributions are the following:
we introduce a system which encodes the confusion network into a representation that can be used with any state-of-the-art dialogue state tracker as a plug-in component,
we propose two methods to incorporate ASR confusion networks as well as true transcripts while training the DST system,
we demonstrate that using a richer hypothesis space in the form of a confusion network leads to a better performance/computation trade-off compared to using a list of top-N ASR hypotheses.
2 Related Work
Dialog state tracking. Recent work on dialogue state trackers [10, 8, 7] infers the state of the dialogue from the conversational history and the current user utterance. These systems assume a text-based user utterance and accumulate the user goal across multiple user turns in the dialogue. One of these approaches generalizes to rare slot-value pairs by using global modules, which share parameters, and local modules, which learn slot-specific feature representations. Another proposes a universal state tracker which generates a fixed-length representation for each slot from the dialogue history and compares the distance between this representation and value vectors to make a prediction.
Using ASR graphs with neural models. Word lattices from ASR have been used for intent classification in SLU with RNNs. Building on that work, word confusion networks have also been proposed for DST. However, there are several differences with what we propose in this paper: (1) they only use average pooling to aggregate the hidden GRU states corresponding to the alternative word hypotheses, whereas we introduce the use of attention to pool the different alternative words (see the illustration in Figure 6); (2) our word confusion network encoder can be plugged into any neural architecture for dialogue state tracking, as it basically amounts to having a first layer that transforms a 2D data structure (confnets) into a 1D sequence of embeddings, whereas their approach keeps the 2D data structure in the hidden layers and, consequently, is limited to simple RNNs such as GRU and (bi-)LSTM; and (3) they experiment with a simpler RNN-based dialogue state tracker, while we plug our confnet encoder into the state-of-the-art 'Global-locally Self-Attentive Dialogue State Tracker' (GLAD) model. Finally, our confnet encoder is most similar to one that used ASR confnets for classification of user intent, question type and named entities, while we apply our encoder to a dialogue state tracking task.
3 Word Confusion Network for DST
3.1 Confusion Network Encoder
Inspired by prior work, we use a word confusion network encoder to transform the graph into a representation space which can be used with any dialogue state tracker. The multiple aligned ASR hypotheses, represented as parallel arcs at each position of the confusion network, are treated as a set of weighted arcs over which a self-attention layer is applied to emphasize the more relevant words in the aligned word space. The final embedding representation at time/position $t$ can be interpreted as an attended word embedding over the aligned word hypotheses at that position. More formally, a confnet is a sequence of sets of parallel weighted arcs, noted $C = (A_1, \dots, A_T)$, where $a_{t,i}$ is the $i$-th arc (token) at time/position $t$, and $w_{t,i}$ its associated weight. The equations defining the embedding representation $c_t$ for the set of parallel arcs $A_t$ at position $t$ in the confnet are the following:
$$s_{t,i} = \mathbf{v}^\top \tanh\left(\mathbf{W}\,[e(a_{t,i});\, w_{t,i}]\right)$$
$$\alpha_{t,i} = \frac{\exp(s_{t,i})}{\sum_j \exp(s_{t,j})}, \qquad c_t = \sum_i \alpha_{t,i}\, e(a_{t,i})$$
where $e(\cdot)$ designates a standard trainable embedding layer for word/token $a_{t,i}$; the matrix $\mathbf{W}$ and the vector $\mathbf{v}$ are also trainable parameters of our model. Note that the training of these parameters is done jointly with the main task (see next subsection).
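The position-wise attention pooling described above can be sketched as follows in NumPy. This is a minimal illustration, not the paper's implementation; in particular, the exact way the ASR confidence weight enters the attention score (here, concatenated to the word embedding before the projection) is an assumption.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encode_confnet_position(arcs, emb, W, v):
    """Attend over the parallel arcs at one confnet position.

    arcs : list of (token_id, asr_weight) pairs for this position
    emb  : (vocab_size, d) trainable embedding matrix
    W    : (d, d + 1) projection matrix; v : (d,) scoring vector
    Returns a single d-dimensional embedding for the position.
    """
    tokens = np.array([t for t, _ in arcs])
    weights = np.array([w for _, w in arcs])
    E = emb[tokens]                                        # (n_arcs, d)
    # score each arc from its embedding and ASR confidence
    feats = np.concatenate([E, weights[:, None]], axis=1)  # (n_arcs, d + 1)
    scores = np.tanh(feats @ W.T) @ v                      # (n_arcs,)
    alpha = softmax(scores)                                # attention weights
    return alpha @ E                                       # attended embedding
```

Applying this function at every position of the confnet yields the one-dimensional sequence of embeddings that is fed to the downstream tracker.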
3.2 Dialogue State Tracking with Confnet
The confusion network encoder can be used as a plug-in to any state-of-the-art DST system. We have used the state-of-the-art 'Global-locally Self-Attentive Dialogue State Tracker' (GLAD) model with our confusion network encoder. GLAD addresses the issue of rare slot-value pairs, which were not explicitly handled by previous DST models. The GLAD encoder module is a global-local self-attentive encoder which separately encodes the transcript/ASR hypothesis, the system actions from previous turns and the slot-value pair under consideration. We extend GLAD by replacing the user utterance representation, namely a sequence of trainable token embeddings, by the confnet embedding sequence. Recall that a confnet is also encoded as a one-dimensional sequence of embeddings, one per time/position in the confnet (see Figure 1). This enables the GLAD architecture to use graph-based inputs instead of (or even in addition to) token sequence inputs.
4 Model Training Strategies
At training time, both the clean transcript and the ASR graph are available. It is therefore tempting to use these two pieces of information to facilitate model training while making it robust to ASR errors. We propose two radically different strategies to take the clean transcript and the ASR graph into account at training time.
4.1 Data Augmentation
The confusion network contains noisy hypotheses with lower ASR confidence scores, which makes training hard. Augmenting the confnet dataset with the clean transcripts should help the system converge faster and to a better solution. We encode each transcript in the form of a confnet with a single arc between nodes. At training time we merge the noisy ASR confnets and the clean (single-arc) confnets. Consequently, each dialogue is used twice for training.
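The augmentation step amounts to converting each transcript into a degenerate confnet and concatenating the two datasets; a minimal sketch (the data representation, a list of arc lists, is illustrative):

```python
def transcript_to_confnet(tokens):
    """Encode a clean transcript as a degenerate confusion network:
    a single arc per position, with confidence 1.0."""
    return [[(tok, 1.0)] for tok in tokens]

def augment(confnet_dataset, transcript_dataset):
    """Merge noisy ASR confnets with single-arc transcript confnets,
    so each dialogue is seen twice during training."""
    return confnet_dataset + [transcript_to_confnet(t) for t in transcript_dataset]
```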
4.2 Adding a Similarity Loss to Train Confnet Embeddings
We want the confnet representation to be very close to that of the clean transcript in the embedding space. To enforce this, we add a similarity loss to the loss corresponding to our main task (binary classification for each (slot, value) pair). Binary cross-entropy is used as the classification loss, denoted $\mathcal{L}_{cls}$, and the squared Euclidean distance between the GLAD representations is used as the similarity loss, denoted $\mathcal{L}_{sim}$. The binary cross-entropy loss is
$$\mathcal{L}_{cls} = -\sum_{s,v} \left[ y_{s,v} \log p_{s,v} + (1 - y_{s,v}) \log (1 - p_{s,v}) \right],$$
where $p_{s,v}$ represents the prediction for slot $s$ and value $v$, and $y_{s,v}$ is the binary ground truth. Let $f_c$ be the transformation realised by the confnet encoder, $f_e$ the standard embedding layer on word tokens, and $g$ the transformation function of the GLAD encoder, which takes as input a sequence of one-dimensional embeddings (either standard word token embeddings or confnet embeddings) and outputs a global-local context vector. The similarity loss is then
$$\mathcal{L}_{sim} = \left\| g(f_c(\text{confnet})) - g(f_e(\text{transcript})) \right\|_2^2,$$
and the final loss $\mathcal{L}$ of the model is defined as a linear combination of the two:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{sim}.$$
Note that, while the similarity loss alone has a trivial solution (namely, projecting everything to a null vector), its combination with the loss associated with the primary task makes this trivial solution impossible.
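The combined objective can be sketched as follows; the interpolation coefficient `lam` is a hyperparameter and the value 0.1 is purely illustrative, not the paper's setting:

```python
import numpy as np

def total_loss(p, y, h_confnet, h_transcript, lam=0.1):
    """Combined loss: binary cross-entropy over (slot, value) predictions
    plus a weighted squared-distance similarity term between the GLAD
    representation of the confnet and that of the clean transcript.

    p, y         : predicted probabilities / binary targets, one per (slot, value)
    h_confnet    : GLAD context vector computed from the confnet input
    h_transcript : GLAD context vector computed from the clean transcript
    """
    eps = 1e-12  # avoid log(0)
    l_cls = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    l_sim = np.sum((h_confnet - h_transcript) ** 2)
    return l_cls + lam * l_sim
```

When the two representations coincide, the similarity term vanishes and only the classification loss drives training.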
5 DSTC-2 Dataset
We have evaluated our system on the standard Dialogue State Tracking Challenge 2 (DSTC-2) dataset. DSTC is a research challenge focused on improving the state of the art in tracking the state of spoken dialogue systems. DSTC-2 contains dialogue states which may change throughout the dialogue, in the restaurant information domain. There are 1612, 506 and 1117 dialogues in the training, development and test sets respectively. The dialogue state is captured by informable slots (slots for which the user can provide a value, to use as a constraint on their search) and non-informable (unconstrained) slots. All the slots in the dataset are requestable, as the user can request the value of any slot (for example, for a given restaurant, the user can request the value of the phone-number or price-range slot). DSTC-2 provides a representation of the user speech utterance in the form of the top-10 ASR hypotheses and word confusion networks. We followed a similar pre-processing pipeline for the confusion networks, i.e., we removed the interjections (um, ah, etc.) and pruned arcs with a score of less than 0.001. To compare the performance of the dialogue state tracker on confusion networks with a top-N list of ASR hypotheses, we extracted the top-10 best paths from the confnet.
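The pre-processing step can be sketched as follows. The interjection list and the renormalization of the surviving arc weights are assumptions of this sketch, not details taken from the paper:

```python
INTERJECTIONS = {"um", "uh", "ah", "oh"}  # illustrative list, not exhaustive

def prune_confnet(confnet, min_score=0.001):
    """Remove interjection arcs and arcs scored below min_score, then
    renormalize the remaining weights at each position.

    confnet : list of positions, each a list of (token, score) arcs
    """
    pruned = []
    for arcs in confnet:
        kept = [(tok, w) for tok, w in arcs
                if tok not in INTERJECTIONS and w >= min_score]
        if not kept:          # drop positions where no arc survives
            continue
        z = sum(w for _, w in kept)
        pruned.append([(tok, w / z) for tok, w in kept])
    return pruned
```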
6 Experiments and Results
To get a reference baseline, we first trained the GLAD model on transcripts and tested it on the top-N ASR hypotheses using a weighted sum of scores, i.e., the prediction from each hypothesis is weighted by its ASR confidence score. However, we were only able to achieve an accuracy similar to the one reported in the original paper after pruning the extremely noisy user utterances. Since we were not able to obtain more details on the filtering (if any) applied to the test set in the original paper, although we contacted the authors by email, we decided not to prune the test set and to keep it entirely: on the entire test set, we obtained a joint-goal accuracy of 68.55%.
Our baseline is the model trained on the ASR-10 hypotheses extracted from the confusion network. The final prediction is the weighted sum of the prediction probabilities obtained from each ASR hypothesis given as input to the system. We also trained a separate model on augmented data, composed of ASR hypotheses and transcripts, to measure the benefits of adding clean samples. To demonstrate that a richer hypothesis space (confusion network) helps in improving accuracy, we trained two separate models on confusion networks (we will release the code, composed of GLAD made compatible with DSTC-2 and the confusion network encoder) with augmented and non-augmented datasets. The augmented dataset is composed of confusion networks and transcripts (modeled as graphs with one arc between nodes). While the ASR-N models become progressively slower as N increases, confusion-network-based models are faster and show improvements in the joint-goal accuracy (calculated for each turn by comparing the distribution over the values of each slot to the ground truth). The results are shown in Figures 4 and 5. We performed a Wilcoxon signed-rank test (one-tailed) and compared (A) augmented top-N ASR hypotheses with (B) non-augmented Confnet-N [max arcs = 8] and (C) augmented Confnet-N [max arcs = 7]. We observe that (B) is significantly better than (A) at the 10% significance level (p-value = 0.068), (C) is significantly better than (A) at the 1% level (p-value < 0.0001), and (C) is significantly better than (B) at the 1% level (p-value < 0.0001).
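The weighted-sum combination used by the ASR-N baselines can be sketched as follows; `predict` stands in for a trained tracker scoring one hypothesis (a hypothetical callable, not the paper's API), and normalizing the ASR scores before averaging is an assumption:

```python
import numpy as np

def ensemble_topn(predict, hypotheses):
    """Combine per-hypothesis slot-value probabilities, weighting each
    hypothesis by its (normalized) ASR confidence score.

    predict    : callable mapping a hypothesis string to a probability vector
    hypotheses : list of (text, asr_score) pairs, the top-N ASR output
    """
    scores = np.array([s for _, s in hypotheses], dtype=float)
    scores = scores / scores.sum()                      # normalize confidences
    preds = np.stack([predict(text) for text, _ in hypotheses])
    return scores @ preds                               # weighted average
```

Note that the cost of this baseline grows linearly with N (one tracker forward pass per hypothesis), whereas the confnet encoder processes all alternatives in a single pass.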
Figure 6 shows that the network learns to attend to the more relevant words across the more confident hypotheses from the ASR. In the top-10 ASR hypotheses, the weighted sum of predictions gives preference to 'that's food' instead of 'basque food' due to its higher ASR scores and frequency of occurrence in the ASR-N list.
In this paper, we have demonstrated that exploiting the rich hypothesis space of a confusion network, instead of being limited to the top-N ASR hypotheses, leads to gains in both computation time and accuracy for dialogue state tracking. The time gain is significant if we want to incorporate a larger set of alternative ASR hypotheses. Moreover, designed as an initial embedding layer transforming the confnet into a one-dimensional sequence of position-wise embeddings, this module can be plugged into numerous state-of-the-art text-based dialogue state tracking systems.
References
- Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability. CoRR abs/1906.07601. Cited by: §1.
- Encoding word confusion networks with recurrent neural networks for dialog state tracking. In Proceedings of the Workshop on Speech-Centric Natural Language Processing, Copenhagen, Denmark, pp. 10–17. Cited by: §2, §5.
- (2016) LatticeRnn: recurrent neural networks over lattices. In INTERSPEECH. Cited by: §2.
- (2019) Speech model pre-training for end-to-end spoken language understanding. CoRR abs/1904.03670. Cited by: §1.
- (2000) Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech & Language 14(4), pp. 373–400. Cited by: §1.
- (2018) Neural confnet classification: fully neural network based spoken utterance classification using word confusion networks. pp. 6039–6043. Cited by: §2, §3.1.
- (2016) Neural belief tracker: data-driven dialogue state tracking. CoRR abs/1606.03777. Cited by: §2.
- (2018) Towards universal dialogue state tracking. CoRR abs/1810.09587. Cited by: §2.
- (2018) Towards end-to-end spoken language understanding. CoRR abs/1802.08395. Cited by: §1.
- (2018) Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1458–1467. Cited by: §2, §3.2, §6.