Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding

by Chen Liu, et al.
Shanghai Jiao Tong University

Spoken Language Understanding (SLU) converts hypotheses from an automatic speech recognizer (ASR) into structured semantic representations. ASR recognition errors can severely degrade the performance of the subsequent SLU module. To address this issue, word confusion networks (WCNs) have been used to encode the input for SLU, since they contain richer information than 1-best or n-best hypothesis lists. To further reduce ambiguity, the last system act of the dialogue context is also used as additional input. In this paper, a novel BERT-based SLU model (WCN-BERT SLU) is proposed to encode WCNs and the dialogue context jointly. It integrates both the structural information and the ASR posterior probabilities of WCNs into the BERT architecture. Experiments on DSTC2, a benchmark for SLU, show that the proposed method is effective and outperforms previous state-of-the-art models significantly.



1 Introduction

The spoken language understanding (SLU) module is a key component of a spoken dialogue system (SDS), parsing user utterances into corresponding semantic representations (e.g., dialogue acts [young2007cued]). For example, the utterance “I want a high priced restaurant which serves Chinese food” can be parsed into the set of semantic tuples “inform(pricerange=expensive), inform(food=Chinese)”. In this paper, we focus on SLU with semantic labels in the form of act(slot=value) triplets (i.e., unaligned annotations), which do not require word-by-word annotation. Both discriminative [Mairesse2009SpokenLU, henderson2012discriminative, zhu2014semantic, barahona2016exploiting] and generative [zhao2018improving, zhao2019hierarchical] methods have been developed to extract semantics from ASR hypotheses of the user utterance.

SLU systems trained on manual transcripts suffer a dramatic drop in performance when applied to ASR hypotheses [mesnil2013investigation]. To reduce the ambiguity caused by ASR errors, two kinds of input features can be exploited to enhance SLU models: (1) ASR hypotheses and (2) dialogue context information.

(1) To account for the uncertainty of ASR hypotheses, previous works have used ASR 1-best results [henderson2012discriminative, zhao2019hierarchical, zhu2018robust, li2019robust], N-best lists [henderson2012discriminative, robichaud2014hypotheses, khan2015hypotheses], word lattices [vsvec2015word, ladhak2016latticernn, huang2019adapting] and word confusion networks (WCNs) [hakkani2006beyond, henderson2012discriminative, tur2013semantic, yang2015using, jagfeld2017encoding, masumura2018neural] as inputs for training an SLU model. Masumura et al. [masumura2018neural] proposed a fully neural network based method, neural ConfNet classification, to encode WCNs. Each bin of a WCN contains multiple word candidates for the same time step, together with their ASR posterior probabilities. The method first obtains a vector for each bin as the weighted sum of the word embeddings in that bin, and then uses a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) to integrate all bin vectors into an utterance vector. Nevertheless, bin vectors are extracted locally, ignoring contextual features beyond the individual bin.

(2) Furthermore, the last system dialogue act [young2007cued] can be used to track context [henderson2012discriminative, barahona2016exploiting] and provides hints about the user intent under noisy conditions. However, the utterance and the context are encoded independently by different models to generate the final representation, resulting in a lack of interaction between them.

Recently, pre-trained language models, such as GPT [radford2018improving] and BERT [devlin2018bert], have been successfully adopted in various NLP tasks. Huang et al. [huang2019adapting] adapted GPT for modeling word lattices, where lattices are represented as directed acyclic graphs. However, GPT is a unidirectional Transformer [vaswani2017attention] that ignores future context, which makes it less expressive than BERT. Although both word lattices and WCNs contain more information than N-best lists, WCNs have been shown to be more efficient in terms of size and structure [hakkani2006beyond].

To these ends, we propose a novel BERT (Bidirectional Encoder Representations from Transformers) [devlin2018bert] based SLU model that jointly encodes WCNs and system acts, named WCN-BERT SLU. It consists of three parts: a BERT encoder for joint encoding, an utterance representation model, and an output layer for predicting semantic tuples. The BERT encoder exploits the posterior probabilities of word candidates in WCNs to inject ASR confidences. Multi-head self-attention is applied over both WCNs and system acts to learn context-aware hidden states. The utterance representation model produces an utterance-level vector by aggregating the final hidden vectors. Finally, we add both discriminative and generative output layers to predict semantic tuples. To the best of our knowledge, this is the first work to leverage the structure and probabilities of input tokens in BERT. Our method is evaluated on the DSTC2 dataset [henderson2014second], and the experimental results show that it outperforms previous state-of-the-art models significantly.

2 WCN-BERT SLU

In this section, we will describe the details of the proposed framework, as shown in Fig. 1. For each user turn, the model takes the corresponding WCN and the last system act as input, and predicts the semantic tuples at turn level.

Figure 1: An overview of our proposed WCN-BERT SLU, which contains a BERT encoder, an utterance representation model, and an output layer. First, the WCN and the last system act are arranged as a sequence to be fed into the BERT encoder. The token-level BERT outputs are then integrated into an utterance-level vector representation. Finally, either a discriminative (semantic tuple classifier, STC) or generative (hierarchical decoder, HD) approach is used in the output layer for predicting the act-slot-value triplets.

2.1 Input representation

The WCN is a compact lattice structure in which candidate words, paired with their posterior probabilities, are aligned at each position [feng2009effects]. It is commonly treated as a sequence of word bins $B = (b_1, \dots, b_M)$. The $m$-th bin can be formalized as $b_m = \{(w_m^1, P(w_m^1)), \dots, (w_m^{I_m}, P(w_m^{I_m}))\}$, where $I_m$ denotes the number of candidates in $b_m$, and $w_m^i$ and $P(w_m^i)$ are respectively the $i$-th candidate and its posterior probability given by the ASR system. The WCN is flattened into a word sequence, $w^{\mathrm{WCN}} = (w_1^1, \dots, w_1^{I_1}, \dots, w_M^1, \dots, w_M^{I_M})$.

System acts contain dialogue context in the form of “act-slot-value” triplets. We consider the last system act just before the current turn, $A = ((a_1, s_1, v_1), \dots, (a_K, s_K, v_K))$, where $K$ is the number of triplets, and $a_k$, $s_k$ and $v_k$ are the $k$-th act, slot and value, respectively. It is also arranged as a sequence $w^{\mathrm{SA}} = (a_1, s_1, v_1, \dots, a_K, s_K, v_K)$. Act and slot names are tokenized, e.g., “pricerange” is split into “price” and “range”.

To feed WCNs and system acts into BERT together, we denote the input token sequence as $w = (w_1, \dots, w_T) = [\mathrm{CLS}] \oplus \mathrm{TOK}(w^{\mathrm{WCN}}) \oplus [\mathrm{SEP}] \oplus \mathrm{TOK}(w^{\mathrm{SA}}) \oplus [\mathrm{SEP}]$, where $\oplus$ concatenates sequences, $\mathrm{TOK}(\cdot)$ is the BERT tokenizer which splits words into sub-words, [CLS] and [SEP] are auxiliary tokens for separation, and $T = |w|$. Considering the structural characteristics of the WCN, i.e., multiple words compete in a bin, we let all words (in fact, sub-words after tokenization) in the same bin share the same position ID.

Finally, the BERT input layer embeds $w$ into $d_x$-dimensional continuous representations by summing three embeddings [devlin2018bert] as follows:

$$x_t = E_{\mathrm{token}}(w_t) + E_{\mathrm{pos}}(w_t) + E_{\mathrm{seg}}(w_t),$$

where $E_{\mathrm{token}}$, $E_{\mathrm{pos}}$ and $E_{\mathrm{seg}}$ stand for WordPiece [wu2016google], positional and segment embeddings, respectively.
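The input construction described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_tokenize` is a hypothetical stand-in for the BERT WordPiece tokenizer, the segment IDs (0 for the WCN part, 1 for the system act part) follow the usual BERT two-segment convention rather than anything stated in the text, and the probability 1.0 assigned to non-WCN tokens is a placeholder value.

```python
from typing import List, Tuple

Bin = List[Tuple[str, float]]  # (candidate word, ASR posterior probability)


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical sub-word splitter standing in for the BERT tokenizer."""
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]


def build_input(wcn: List[Bin], system_act: List[str]):
    """Flatten a WCN and a system act into BERT-style parallel sequences."""
    tokens, pos_ids, seg_ids, probs = ["[CLS]"], [0], [0], [1.0]
    pos = 1
    for bin_ in wcn:
        for word, prob in bin_:
            for sub in toy_tokenize(word):
                tokens.append(sub)
                pos_ids.append(pos)  # shared by every sub-word in this bin
                seg_ids.append(0)
                probs.append(prob)   # a sub-word inherits its word's posterior
        pos += 1
    tokens.append("[SEP]")
    pos_ids.append(pos)
    seg_ids.append(0)
    probs.append(1.0)
    pos += 1
    for word in system_act:
        for sub in toy_tokenize(word):
            tokens.append(sub)
            pos_ids.append(pos)      # ordinary, strictly increasing positions
            seg_ids.append(1)        # assumption: second BERT segment
            probs.append(1.0)        # assumption: placeholder for non-WCN tokens
            pos += 1
    tokens.append("[SEP]")
    pos_ids.append(pos)
    seg_ids.append(1)
    probs.append(1.0)
    return tokens, pos_ids, seg_ids, probs
```

Competing candidates such as “expensive” and “inexpensive” end up at the same position ID, so the encoder sees them as alternatives for one slot in the utterance rather than as consecutive words.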

2.2 BERT encoder

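BERT consists of a stack of bidirectional Transformer encoder layers [vaswani2017attention], each of which contains a multi-head self-attention module and a feed-forward network with residual connections [he2016deep]. For the $l$-th layer, assuming the input is $x^l = (x_1^l, \dots, x_T^l)$, the output is computed via the self-attention layer (consisting of $H$ heads):

$$
\begin{aligned}
e_{ij}^{l,h} &= \frac{(W_Q^{l,h} x_i^l)^\top (W_K^{l,h} x_j^l)}{\sqrt{d_x / H}}, \qquad
\alpha_{ij}^{l,h} = \frac{\exp(e_{ij}^{l,h})}{\sum_{k=1}^{T} \exp(e_{ik}^{l,h})}, \\
z_i^{l,h} &= \sum_{j=1}^{T} \alpha_{ij}^{l,h} \, W_V^{l,h} x_j^l, \qquad
z_i^{l} = \mathrm{Concat}(z_i^{l,1}, \dots, z_i^{l,H}), \\
\tilde{x}_i^{l+1} &= \mathrm{LayerNorm}\big(x_i^l + \mathrm{FC}(z_i^l)\big), \\
x_i^{l+1} &= \mathrm{LayerNorm}\big(\tilde{x}_i^{l+1} + \mathrm{FC}(\sigma(\mathrm{FC}(\tilde{x}_i^{l+1})))\big),
\end{aligned}
$$

where FC is a fully-connected layer, $\sigma$ is the feed-forward activation function, LayerNorm denotes layer normalization [ba2016layer], $1 \le h \le H$, and $W_Q^{l,h}, W_K^{l,h}, W_V^{l,h} \in \mathbb{R}^{(d_x/H) \times d_x}$.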

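WCN probability-aware self-attention  We propose an extension to the self-attention mechanism that takes the posterior probabilities of tokens in the WCN into account. Following the notation in Section 2.1, for each input token sequence $w = (w_1, \dots, w_T)$, we define its corresponding probability sequence as $p = (p_1, \dots, p_T)$, with

$$p_t = P(w_t) \quad \text{for every token } w_t \text{ in the WCN part},$$

where $P(w_t)$ denotes the ASR posterior probability of a token. Note that the probability of a sub-word equals that of the original word in the WCN. Tokens in the system acts are assigned a fixed constant probability. We then inject ASR posterior probabilities into the BERT encoder by changing the computation of the attention score $e_{ij}^{l,h}$ between tokens $i$ and $j$ in head $h$ of layer $l$:

$$e_{ij}^{l,h} = \frac{(W_Q^{l,h} x_i^l)^\top (W_K^{l,h} x_j^l)}{\sqrt{d_x / H}} + \lambda^{l,h} p_j,$$

where $W_Q^{l,h}$ and $W_K^{l,h}$ are the query and key projections, $H$ is the number of heads, and $\lambda^{l,h}$ is a trainable parameter.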

Finally, token-level representations are produced after the stacked encoder layers, denoted as $h = (h_1, \dots, h_T)$.
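As a rough illustration of probability-aware attention, the sketch below computes one attention head in plain Python. It assumes that the ASR posterior of the attended token is added to the scaled dot-product score as a bias scaled by a single scalar `lam`; this additive form, the function names, and the use of pre-projected query, key and value vectors are assumptions made for the example, not a transcription of the model.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prob_aware_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    probs: Vector,
    lam: float,
) -> Tuple[List[Vector], List[Vector]]:
    """One attention head whose scores are biased by token posteriors.

    Assumed score: q_i . k_j / sqrt(d) + lam * probs[j].
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(dim) + lam * p
            for k, p in zip(keys, probs)
        ]
        alpha = softmax(scores)
        weights.append(alpha)
        # Weighted sum of value vectors, one output component at a time.
        outputs.append([
            sum(a * v[c] for a, v in zip(alpha, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights
```

With `lam = 0` this reduces to ordinary scaled dot-product attention; a positive `lam` shifts attention mass toward candidates the recognizer is more confident about.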

Figure 2: Illustration of the feature aggregation. Only the WCN part of an input token sequence is presented. The token-level features are in blue while bin-level features are in purple.

2.3 Utterance representation

The output hidden vector corresponding to the [CLS] token, denoted $h_{\mathrm{CLS}}$, is usually used to represent the whole utterance. In addition, we propose to gather the other hidden vectors by considering the structural information of the WCN.

First, the token-level hidden vectors of the WCN part are aggregated to the bin level in two steps: (1) the BERT sub-word vectors belonging to an input word are averaged to form the word vector; (2) the word-level vectors of each bin are then weighted and summed to obtain a bin vector. An example of the feature aggregation of the WCN part is illustrated in Fig. 2, while the features corresponding to the system acts are unchanged. After aggregation, we obtain a new sequence of feature vectors $u = (u_1, \dots, u_N)$, in which the first $M$ vectors correspond to the $M$ bins of the WCN and the remaining vectors are the unchanged features of the system act part.

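Then, we summarize the bin-level features with a self-attentive approach. Writing $u_1, \dots, u_N$ for the aggregated feature vectors, the summary vector $\bar{u}$ is computed as follows:

$$\beta_i = \frac{\exp\big(v^\top \tanh(W u_i)\big)}{\sum_{j=1}^{N} \exp\big(v^\top \tanh(W u_j)\big)}, \qquad \bar{u} = \sum_{i=1}^{N} \beta_i u_i,$$

where $W$ and $v$ are trainable parameters. The final utterance representation is obtained by concatenating $\bar{u}$ with the hidden state of [CLS], i.e., $r = \mathrm{Concat}(h_{\mathrm{CLS}}, \bar{u})$.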
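The two aggregation steps and the self-attentive summary can be sketched as follows. The sketch assumes that the weights in step (2) are the ASR posterior probabilities of the candidates and that the pooling score has the common form $v^\top \tanh(W u_i)$; the index lists `word_of_token` and `bin_of_word` are hypothetical bookkeeping structures introduced for the example.

```python
import math
from typing import List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def aggregate_bins(
    hidden: List[Vector],       # token-level vectors of the WCN part
    word_of_token: List[int],   # word index of each sub-word token
    bin_of_word: List[int],     # bin index of each word
    prob_of_word: List[float],  # assumed weights: ASR posterior of each word
) -> List[Vector]:
    """Sub-words -> word vectors (mean) -> bin vectors (weighted sum)."""
    word_vecs = [
        average([h for h, w in zip(hidden, word_of_token) if w == idx])
        for idx in range(len(bin_of_word))
    ]
    dim = len(hidden[0])
    bins = [[0.0] * dim for _ in range(max(bin_of_word) + 1)]
    for vec, b, p in zip(word_vecs, bin_of_word, prob_of_word):
        for c in range(dim):
            bins[b][c] += p * vec[c]
    return bins


def self_attentive_pool(
    features: List[Vector], w_matrix: List[Vector], v_vector: Vector
) -> Vector:
    """Softmax-weighted sum of features with scores v . tanh(W f)."""
    scores = []
    for f in features:
        hid = [math.tanh(sum(wc * fc for wc, fc in zip(row, f))) for row in w_matrix]
        scores.append(sum(vc * hc for vc, hc in zip(v_vector, hid)))
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * f[c] for e, f in zip(exps, features))
        for c in range(len(features[0]))
    ]
```

The final utterance vector would then be the concatenation of the [CLS] hidden state and the pooled vector, e.g. `h_cls + pooled` for Python lists.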

2.4 Output layer

To evaluate the validity and portability of the encoding model, we apply both discriminative (Sec. 2.4.1) and generative (Sec. 2.4.2) approaches for predicting the act-slot-value labels.

2.4.1 Semantic tuple classifier (STC)

On top of the final utterance representation $r$, we apply a binary classifier for predicting the existence of each act-slot pair, and a multi-class classifier over all possible values existing in the training set for each act-slot pair (for act-slot pairs requiring no value, such as thankyou, goodbye, request-phone and request-food, the value classification is omitted). Therefore, this method cannot predict values unseen in the training set.
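A minimal sketch of the STC decision rule is given below, assuming plain linear layers on top of the utterance vector and a sigmoid threshold of 0.5 for the binary decision. The data structures, the threshold, and the function names are illustrative assumptions.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

Vector = List[float]
Linear = Tuple[Vector, float]      # (weights, bias)
Pair = Tuple[str, Optional[str]]   # (act, slot); slot may be absent
Triplet = Tuple[str, Optional[str], Optional[str]]


def linear(r: Vector, layer: Linear) -> float:
    """Score of one linear unit applied to the utterance vector."""
    weights, bias = layer
    return sum(w * x for w, x in zip(weights, r)) + bias


def stc_predict(
    r: Vector,
    pair_layers: Dict[Pair, Linear],
    value_layers: Dict[Pair, Dict[str, Linear]],
    threshold: float = 0.5,  # assumption: standard sigmoid cut-off
) -> Set[Triplet]:
    """Predict act-slot-value triplets from an utterance vector."""
    result: Set[Triplet] = set()
    for pair, layer in pair_layers.items():
        exists = 1.0 / (1.0 + math.exp(-linear(r, layer)))
        if exists < threshold:
            continue
        act, slot = pair
        candidates = value_layers.get(pair)
        if not candidates:
            # Pairs that take no value skip the value classifier.
            result.add((act, slot, None))
        else:
            # Only values seen in training have an output unit.
            best = max(candidates, key=lambda val: linear(r, candidates[val]))
            result.add((act, slot, best))
    return result
```

Because the value classifier can only choose among output units created from training data, the limitation to seen values follows directly from this design.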

2.4.2 Transformer-based hierarchical decoder (HD)

To improve the generalization capability of value prediction, we follow Zhao et al. [zhao2019hierarchical] to construct a hierarchical decoder consisting of an act classifier, a slot classifier, and a value generator. However, there are two main differences listed as follows:

1) Acts and slots are tokenized and then embedded by the BERT embedding layer as additional features for the slot classifier and the value generator. For each act $a$ or slot $s$, the sub-word level vectors from the BERT embedding layer are averaged to obtain a single feature vector ($e_a$ for $a$ and $e_s$ for $s$).

2) We replace the LSTM-based value generator with a Transformer-based one [vaswani2017attention]. Tokenized values are embedded by the BERT embedding layer and generated at the sub-word level. This allows us to tie BERT’s token embeddings with the weight matrix of the linear output layer in the value generator.

3 Experiments

We experiment on the dataset from the second Dialog State Tracking Challenge (DSTC2) [henderson2014second], which is split into training, validation and test sets. To shorten the flattened WCN sequences, we prune WCNs by removing interjections (such as uh and oh, following Jagfeld et al. [jagfeld2017encoding]) and word candidates with probabilities below a threshold, as recommended in [jagfeld2017encoding]. The evaluation metrics are the F1 score of act-slot-value triplets and utterance-level accuracy. We do not assume that all value candidates of each slot are known in advance.
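The two evaluation metrics can be computed as in the sketch below. It assumes that F1 is micro-averaged over all triplets in the corpus and that an utterance counts as correct only when the predicted set of triplets equals the reference set exactly; both conventions are assumptions made for the example.

```python
from typing import List, Optional, Set, Tuple

Triplet = Tuple[str, Optional[str], Optional[str]]


def evaluate(
    predictions: List[Set[Triplet]], references: List[Set[Triplet]]
) -> Tuple[float, float]:
    """Return (triplet-level F1, utterance-level accuracy)."""
    tp = fp = fn = correct = 0
    for pred, ref in zip(predictions, references):
        tp += len(pred & ref)   # triplets predicted and annotated
        fp += len(pred - ref)   # spurious predictions
        fn += len(ref - pred)   # missed annotations
        correct += int(pred == ref)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(references) if references else 0.0
    return f1, accuracy
```

Utterance-level accuracy is the stricter of the two, since a single spurious or missing triplet makes the whole turn count as wrong.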

In our experiments, we use the English uncased BERT-base model, which has 12 layers of 768 hidden units and 12 attention heads. During training, Adam [kingma2014adam] is used for optimization. We select the initial learning rate from {5e-5, 3e-5, 2e-5}, and apply learning-rate warm-up, L2 weight decay and gradient clipping by maximum norm. Dropout is applied in the BERT encoder, and separately in the utterance representation model and the output layer. The model is saved according to the best performance on the validation set. Each experimental setting was run five times with different random seeds, and we report the averaged result.

3.1 Main results

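| Models | Output layer | Train → Test | F1 | Acc. |
|---|---|---|---|---|
| BLSTM + Self-Attn | STC | manual → manual | 98.56 | 97.29 |
| BLSTM + Self-Attn | STC | manual → 1-best | 83.57 | 74.91 |
| BLSTM + Self-Attn | STC | 1-best → 1-best | 84.06 | 75.26 |
| BLSTM + Self-Attn | STC | 1-best → 10-best | 84.92 | 75.70 |
| BLSTM + Self-Attn | STC | 10-best → 10-best | 85.05 | 77.07 |
| Neural ConfNet | STC | WCN → WCN | 85.01 | 77.03 |
| Lattice-SLU | STC | WCN → WCN | 86.09 | 78.78 |
| WCN-BERT (ours) | STC | WCN → WCN | 87.91 | 81.14 |
| WCN-BERT (ours) | HD | WCN → WCN | 87.33 | 80.74 |

Table 1: F1 scores (%) and utterance-level accuracies (%) of baseline models and our proposed model on the test set.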

As shown in Table 1, different types of inputs are used for the baselines, including manual transcriptions, ASR 1-best, 10-best lists, and WCNs. The baseline model for the first three types (manual, 1-best, and 10-best) is a BLSTM with self-attention [masumura2018neural]. (In these baselines, word embeddings are initialized with 100-dim GloVe 6B [pennington2014glove]; the learning rate is kept fixed during training, and gradient clipping and dropout are applied.) The BLSTM encodes the input sequence and obtains the utterance representation with self-attention similar to Eq. (4). To test on 10-best lists with the model trained on 1-best, we run the model on each hypothesis in the list and average the results, weighted by the ASR posterior probabilities. For direct training and evaluation on 10-best lists, the representation vector is calculated as $r = \sum_{k} \gamma_k r_k$, where $r_k$ is the representation vector of hypothesis $k$ and $\gamma_k$ is the corresponding ASR posterior probability.
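The test-time fusion over a 10-best list can be illustrated as follows. The sketch assumes that the per-hypothesis results are label probabilities and that the ASR posteriors are renormalized to sum to one over the list; both choices, and the name `fuse_nbest`, are assumptions for the example.

```python
from typing import Dict, List


def fuse_nbest(
    label_probs: List[Dict[str, float]], posteriors: List[float]
) -> Dict[str, float]:
    """Posterior-weighted average of per-hypothesis label probabilities."""
    total = sum(posteriors)  # assumption: renormalize over the N-best list
    fused: Dict[str, float] = {}
    for probs, gamma in zip(label_probs, posteriors):
        for label, p in probs.items():
            fused[label] = fused.get(label, 0.0) + gamma / total * p
    return fused
```

A label that is supported only by low-confidence hypotheses therefore receives a small fused score, even if the model is certain about it for those hypotheses.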

For modeling WCNs, we follow the Neural ConfNet Classification [masumura2018neural] method. WCNs are fed into the model through a simple weighted sum representation method, where all word vectors are weighted by their posterior probabilities and then summed. We also apply the Lattice-SLU method with GPT [huang2019adapting] on WCNs. The output layer of all baselines is STC (Sec. 2.4.1).

Comparing across the baselines, we find that performance improves as the ASR hypothesis space grows. The Neural ConfNet Classification method outperforms the system trained and tested with 1-best, and achieves results comparable to the 10-best system. With a powerful pre-trained language model (GPT [radford2018improving]), Lattice-SLU beats the other baselines above. The last system act is not exploited in the baselines; its effect is analyzed in the following ablation study.

| Models | F1 (%) |
|---|---|
| SLU2 [williams2014web] | 82.1 |
| CNN+LSTM_w4 [barahona2016exploiting] | 83.6 |
| CNN [zhao2018improving] | 85.3 |
| S2S-Attn-Ptr-Net [zhao2018improving] | 85.8 |
| Hierarchical Decoding [zhao2019hierarchical] | 86.9 |
| WCN-BERT + STC (ours) | 87.9 |

Table 2: Comparison with prior work on the DSTC2 dataset.

By jointly modeling WCNs and the last system act with the pre-trained BERT, our proposed framework outperforms the baselines significantly in F1 score and utterance-level accuracy, and achieves new state-of-the-art performance on the DSTC2 dataset, as shown in Table 1 and Table 2.

3.2 Ablation study

In this section, we perform ablation experiments of our WCN-BERT SLU with both STC and HD, as presented in Table 3.

| Model variations | STC | HD |
|---|---|---|
| (a) WCN-BERT | 87.91 | 87.33 |
| (b) w/o WCN Prob | 87.15 | 86.47 |
| (c) [CLS] hidden state only as utterance representation | 87.75 | 86.99 |
| (d) w/o BERT | 86.12 | 86.01 |
| (e) w/o system act | 86.69 | 86.18 |
| (f) w/o BERT and system act | 85.85 | 85.06 |
| (g) Replace system act with system utterance | 88.01 | 87.28 |

Table 3: F1 scores (%) of the ablation study on the test set.

Row (b) shows the results without WCN probabilities in the BERT encoder. In this case, the model lacks prior knowledge of ASR confidences, resulting in a clear drop in F1 (0.76 for STC and 0.86 for HD, from Table 3). When only the hidden state of [CLS] is used as the utterance representation (row (c)), the F1 scores decrease with both STC and HD, indicating that the structural information of the WCN is beneficial for utterance representation.

In row (d), we remove BERT: a vanilla Transformer jointly encodes WCNs and system acts instead of a fine-tuned pre-trained BERT. We use 100-dim word embeddings, initialized with GloVe 6B [pennington2014glove]. The results show that removing BERT leads to a marked decrease in F1 (1.79 for STC and 1.32 for HD, from Table 3).

We also investigate the effect of dialogue context by removing the last system act from the input, as shown in row (e). The results show that jointly encoding the WCN and the last system act improves performance substantially (from 86.69 to 87.91 F1 with STC, and from 86.18 to 87.33 with HD). Row (e) is also a fair comparison with the Lattice-SLU baseline in Table 1 (86.09 F1), since both use pre-trained language models without the system act. The result implies that our model is more effective, owing to (1) the capability of the bidirectional Transformer, which considers future context, and (2) the different treatment of WCN structure.

In row (f), neither BERT nor the dialogue context is included. This also gives a fair comparison with the Neural ConfNet Classification method [masumura2018neural] at the model level (Transformer vs. BLSTM). With STC as the output layer, the Transformer with the WCN probability-aware self-attention mechanism shows better modeling capability (an absolute improvement of 0.84 in F1, 85.85 vs. 85.01) as well as higher training efficiency.

Moreover, we replace the last system act with the last system utterance in the input (row (g)), which causes only a slight change in F1 (at most 0.10 in absolute terms, from Table 3). This indicates that our model is also applicable to datasets that provide only system utterances and no system acts.

3.3 Analysis of hierarchical decoder

As the previous results show, models with the hierarchical decoder (HD) perform worse than those with STC. However, the generative approach equips the model with generalization capability, and the pointer-generator network helps to handle out-of-vocabulary (OOV) tokens. To analyze the generalization capability of the proposed model with the hierarchical decoder, we randomly select a certain proportion of the training set to train our proposed WCN-BERT SLU with STC or HD. The validation and test sets remain unchanged. Furthermore, we evaluate the F1 scores of seen and unseen act-slot-value triplets separately, according to whether a triplet occurs in the training set.

| Train size | Models | overall | seen | unseen |
|---|---|---|---|---|
| 1% | + STC | 62.9 | 67.1 | 0.0 |
| 1% | + HD | 71.6 | 77.0 | 29.7 |
| 5% | + STC | 78.7 | 80.2 | 0.0 |
| 5% | + HD | 81.5 | 83.1 | 35.3 |
| 10% | + STC | 81.3 | 81.9 | 0.0 |
| 10% | + HD | 82.9 | 83.8 | 22.9 |
| 100% | + STC | 87.9 | 88.2 | 0.0 |
| 100% | + HD | 87.3 | 87.5 | 4.8 |

Table 4: F1 scores (%) of our WCN-BERT SLU on the test set with varying training size.

As shown in Table 4, as the training size decreases, the overall performance of HD degrades less sharply than that of STC (71.6 vs. 62.9 F1 with 1% of the training data). Moreover, the hierarchical decoder shows better generalization capability on unseen labels, where STC scores 0.0 by construction.

4 Conclusion

In this paper, we propose to jointly encode the WCN and the dialogue context with BERT for SLU. To reduce the ambiguity caused by ASR errors, WCNs are used to capture the uncertainty of ASR hypotheses, and the dialogue context implied by the last system act is exploited as an auxiliary feature. In addition, the pre-trained language model BERT is introduced to better encode WCNs and system acts with self-attention. Experimental results show that our method outperforms all baselines and achieves new state-of-the-art performance on the DSTC2 dataset.