Joint Contextual Modeling for ASR Correction and Language Understanding

by Yue Weng, et al.

The quality of automatic speech recognition (ASR) is critical to dialogue systems, as ASR errors propagate to and directly impact downstream tasks such as language understanding (LU). In this paper, we propose multi-task neural approaches that perform contextual language correction on ASR outputs jointly with LU, improving the performance of both tasks simultaneously. To measure the effectiveness of this approach, we use a public benchmark, the 2nd Dialogue State Tracking Challenge (DSTC2) corpus. As baselines, we train task-specific statistical language models (SLMs) and fine-tune a state-of-the-art Generative Pre-Training (GPT) language model to re-rank the n-best ASR hypotheses, followed by a model to identify the dialogue act and slots. i) We further train ranker models using GPT and hierarchical CNN-RNN models with discriminative losses to select the best output from the n-best hypotheses. We extend these ranker models to first select the best ASR output and then identify the dialogue act and slots in an end-to-end fashion. ii) We also propose a novel joint ASR error correction and LU model, a word confusion pointer network (WCN-Ptr) with multi-head self-attention on top, which consumes the word confusions populated from the n-best list. We show that the error rates of an off-the-shelf ASR system and the subsequent LU system can be reduced significantly, by 14%, with joint models trained on small amounts of in-domain data.



1 Introduction

Goal-oriented dialogue systems aim to automatically identify the intent of the user as expressed in natural language, extract associated arguments or slots, and take actions accordingly to satisfy the user’s requests [SLUBook]. In such systems, the speakers’ utterances are typically recognized using an ASR system. Then the intent of the speaker and related slots are identified from the recognized word sequence using an LU component. Finally, a dialogue manager (DM) interacts with the user (not necessarily in natural language) and helps the user achieve the task that the system is designed to support. As a result, the quality of ASR systems has a direct impact on downstream tasks such as LU and DM. This becomes more evident in cases where a generic ASR is used, instead of a domain-specific one [DBLP:conf/slt/MorbiniAASSGTN12].

A standard approach to improve ASR output is to use an SLM or a neural model to re-rank different ASR hypotheses and use the one with the highest score for downstream tasks. Moreover, neural language correction models can also be trained to recover from the errors introduced by the ASR system via mapping ASR outputs to the ground-truth text in end-to-end speech recognition [DBLP:conf/interspeech/TanakaMMA18, among others]. In this paper we experiment with training ASR reranking/correction models jointly with LU tasks in an effort to improve both tasks simultaneously, towards End-to-End Spoken Language Understanding (SLU).

The major contributions of this work are as follows:

  • We present a cascaded approach that first selects the best ASR output and then performs LU.

  • We present a novel alignment scheme to create a word confusion network from the ASR n-best transcriptions, ensuring consistency between model training and inference.

  • We propose a framework for using the ASR n-best output to improve end-to-end SLU via multi-task learning, i.e., joint ASR correction and LU (intent and slot detection).

  • We propose several novel architectures adopting GPT [GPT2018] and Pointer networks [oriol:nips15] with a multi-head self-attention mechanism.

  • We provide comprehensive experiments that compare the different model architectures, uncover their strengths and weaknesses, and demonstrate the effectiveness of end-to-end learning of ASR ranking/correction and LU models.

2 Related Work

Word Confusion Networks: A compact and normalized class of word lattices, called word confusion networks (WCNs), was initially proposed for improving ASR performance [lidia-es99]. WCNs are much smaller than ASR lattices but have better or comparable word and oracle accuracy, and because of this they have been used for many tasks, including SLU [gokhan-icslp02, among others]. However, to the best of our knowledge, they have not been used with neural semantic parsers implemented with recurrent neural networks (RNNs) or similar architectures. The closest prior work proposes to traverse an input lattice in topological order and use the RNN hidden state of the lattice's final state as the dense vector representing the entire lattice. However, word confusion networks provide a better and more efficient solution thanks to their token alignments. We use this idea to first infer WCNs from the ASR n-best and then directly use them for ASR correction and LU in a joint fashion.

ASR Correction: Neural language correction models have been widely used to tackle a variety of tasks, including grammar correction, text or spelling correction, and completion for ASR systems. [DBLP:conf/interspeech/TanakaMMA18, DBLP:journals/corr/abs-1902-07178] are highly relevant to our work, as they performed spelling correction on top of ASR errors to improve the quality of speech recognition. However, our work differs significantly from existing work in that we tackle neural language correction together with a downstream task (LU in this case) in a multi-task learning setting. In addition, we exploit the alignment information contained in the n-best list via an inferred word confusion network and feed all n-best hypotheses into a single neural network.

Re-ranking and Joint Modeling: [DBLP:conf/cicling/MaCB17, among others] showed that n-best re-ranking helps in reducing WER, while [DBLP:conf/slt/MorbiniAASSGTN12, DBLP:conf/ijcnlp/CoronaTM17] showed that using ranking or in-domain language models or semantic parsers over n-best hypotheses significantly improves LU accuracy. Moreover, [DBLP:conf/slt/Jonson06, DBLP:conf/interspeech/RajuHLGKMVR18] showcased the importance of context in ASR performance. However, none of the above-mentioned works involved joint or contextual modeling with end-to-end comparison. [DBLP:conf/slt/HaghaniNBCGMPQW18, among others] showcased that audio features can be directly used for LU, however, such systems are less robust for task completion, especially those which involve multi-turn state tracking. Moreover, another objective of our research is to evaluate if generalized language models such as GPT [GPT2018] can be useful for joint ASR re-ranking and LU tasks.

3 SLU Background and Baselines

3.1 ASR Ranking and Error Correction

To prevent the propagation of ASR errors to downstream applications such as NLU in a dialogue system, ASR error correction [Roark2004CorrectiveLM, Kumar17] has been explored extensively using a variety of approaches such as language modeling and neural language correction. In the following, we cover the formulation of ASR error corrections using both approaches.

Language Modeling: Significant research has been conducted on count-based and neural LMs [DBLP:conf/naacl/1993, DBLP:conf/icassp/2010, among others]. Even though RNN-LMs have significantly advanced the state of the art (through re-ranking and Seq2Seq architectures), they still do not fully preserve context, especially in ASR for dialogue systems, where the context for a word might not correspond to the words immediately preceding it. Bidirectional and attention-based neural LMs such as Embeddings from Language Models (ELMo) and Contextual Word Vectors (CoVe) have shown some improvements [DBLP:conf/naacl/PetersNIGCLZ18, DBLP:conf/nips/McCannBXS17]. More recently, Transformer-based LMs such as Bidirectional Encoder Representations from Transformers (BERT) [DBLP:journals/corr/abs-1810-04805] and GPT [DBLP:journals/corr/abs-1801-10198, GPT2018] have significantly outperformed most baselines on a variety of tasks.

Statistical and Neural LMs for Re-ranking/Re-scoring: We trained a variety of LMs on the DSTC2 training data, which are then used to re-rank the ASR hypotheses based on perplexity. We trained the following LMs: (1) a count-based, word-level statistical language model (SLM), experimenting with several context sizes with backoff; and (2) the Transformer-based OpenAI GPT LM [GPT2018], which uses multi-headed self-attention over the context followed by position-wise feed-forward layers to generate a distribution over the output sequence. The GPT LM operates on sub-word units, as proposed in the original architecture. We start with the pre-trained GPT-LM released by OpenAI [GPT2018] and fine-tune it on DSTC-2 data, passing contextual information (past system and user turns along with the current system turn, separated by a special token) as input to the model. We experimented with the number of previous turns provided as context to the language model and picked the best configuration based on the development data. These LMs are used for re-ranking and obtaining the best hypothesis, which is then fed into a Bi-LSTM CRF [huang2015bidirectional] for intent and slot detection; these serve as our baselines.
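To make the re-ranking step concrete, here is a minimal sketch that scores each hypothesis with a toy count-based bigram LM (add-one smoothing) and picks the lowest-perplexity one. The `BigramLM` and `rerank` names are ours for illustration; the actual system uses the SLM and fine-tuned GPT scorers described above, not this toy model.

```python
import math
from collections import Counter

class BigramLM:
    """Toy count-based bigram LM with add-one smoothing (illustrative
    stand-in for the SLM/GPT perplexity scorers described in the text)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in corpus:
            toks = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(self.unigrams)

    def perplexity(self, sent):
        toks = ["<s>"] + sent.split() + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(toks, toks[1:]):
            # Add-one smoothing so unseen bigrams get non-zero probability.
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(toks) - 1))

def rerank(nbest, lm):
    """Return the hypothesis with the lowest LM perplexity."""
    return min(nbest, key=lm.perplexity)
```

In use, the LM would be trained on in-domain transcriptions and `rerank` applied to each utterance's n-best list before the downstream LU tagger.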

Neural Language Correction (NLC): Neural language correction [DBLP:journals/corr/XieAAJN16] uses neural architectures to map an input sentence containing errors to a ground-truth output sentence. We use the WCN (inferred from the n-best) to align the n-best list with the ground truth. This way, the input and output have the same length and are aligned at the word level: each input token and its corresponding output token form a highly plausible pair. As a result, we can use the same RNN decoder for slot tagging as described in Section 4.1. Note that sequence tagging architectures can be used for multi-task learning with multiple prediction heads for word correction and IOB tag prediction.

3.2 Language Understanding

The state of the art in SLU relies on RNN- or Transformer-based approaches and their variations, which were first used for slot filling by [yao2013RNN] and [mesnil2013RNN] simultaneously. More formally, the task is to estimate a sequence of tags in IOB form, as in [raymond-riccardi07] (with outputs corresponding to ‘B’, ‘I’, and ‘O’), corresponding to an input sequence of tokens; the RNN architecture consists of an input layer, a number of hidden layers, and an output layer. Nowadays, state-of-the-art slot filling methods usually rely on sequence models like RNNs [dilekIS16, RNN-TASL, among others]. Extensions include encoder-decoder models [bingIS16, zhu2016], transformers [DBLP:journals/corr/abs-1902-10909], or memory [vivianIS16]. Historically, intent determination has been seen as a classification problem and slot filling as a sequence classification problem, and in the pre-deep-learning era these two tasks were typically modeled separately. To this end, [dilekIS16] proposed a single RNN architecture that integrates intent detection and slot filling. The input of this RNN is the sequence of words (e.g., user queries) and the output is the full semantic frame (intent and slots).
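As a concrete example of the IOB encoding used above, the hypothetical helper below converts span-based slot annotations into IOB tags. The function name and the slot names in the usage are our DSTC-2-style illustrations, not the corpus's actual annotation format.

```python
def to_iob(tokens, slots):
    """Convert span-based slot annotations into IOB tags.

    `slots` maps a slot type to its (start, end) token span, end-exclusive.
    The first token of a span gets 'B-<type>', the rest 'I-<type>',
    and all other tokens get 'O'.
    """
    tags = ["O"] * len(tokens)
    for slot_type, (start, end) in slots.items():
        tags[start] = "B-" + slot_type
        for i in range(start + 1, end):
            tags[i] = "I-" + slot_type
    return tags
```

For instance, tagging "i want cheap thai food" with a `pricerange` span on "cheap" and a `food` span on "thai" yields `["O", "O", "B-pricerange", "B-food", "O"]`.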

4 Joint ASR Correction and NLU Models

4.1 Word Confusion Network and N-best Alignment

The n-best output from off-the-shelf ASR systems is usually not aligned, so for the WCN-based models (Section 4.4) an extra alignment step is needed. Our approach is as follows: we use word-level Levenshtein distance to align every ASR hypothesis with the 1-best hypothesis (since we do not have the transcription during testing). To unify these n alignments, we merge insertions across all hypotheses to create a global reference, which is then used to expand each of the original n-best hypotheses to the same length as the global reference. During training, we align the transcriptions with the global reference for the ASR correction and NLU tasks such as slot tagging.
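A simplified sketch of this alignment, using Python's `difflib` for the Levenshtein-style alignment against the 1-best pivot. Unlike the full scheme described above, this sketch drops insertions rather than merging them into a global reference, and the function names are ours.

```python
import difflib

EPS = "<eps>"

def align_to_ref(ref, hyp):
    """Align `hyp` to `ref` word-by-word (Levenshtein-style, via difflib).

    Returns a sequence the same length as `ref`: matched or substituted
    words, with <eps> where `hyp` has a deletion. Simplification: extra
    words inserted in `hyp` are dropped here, whereas the paper merges
    them into a global reference so no words are lost.
    """
    out = []
    sm = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.extend(hyp[j1:j2])
        elif op == "replace":
            seg = hyp[j1:j2][: i2 - i1]
            out.extend(seg + [EPS] * ((i2 - i1) - len(seg)))
        elif op == "delete":  # ref words missing from hyp
            out.extend([EPS] * (i2 - i1))
        # 'insert': extra hyp words, dropped in this simplified sketch
    return out

def build_wcn(nbest):
    """Column-wise word bins over the aligned n-best (1-best is the pivot)."""
    ref = nbest[0]
    aligned = [align_to_ref(ref, h) for h in nbest]
    return [list(col) for col in zip(*aligned)]
```

Each resulting "bin" holds one candidate word per hypothesis for a given time step, which is exactly the input the WCN models in Section 4.4 consume.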

Experiments                                         WER    SER    DA-Acc     Slot-F1    TER        FER
1-best (C)                                          29.99  54.48  88.28      81.01      14.90      21.43
oracle (C)                                          19.83  41.00  90.37      85.78       9.14      17.69
ground truth (C)                                     0.00   0.00  94.91      98.88       0.31       5.42
SLM (C)                                             27.95  51.94  87.75      79.45      14.83      22.17
GPT-LM (C)                                          26.82  49.90  89.09      81.30      15.20      21.07
Hier-CNN-RNN_Ranker (C)                             25.84  47.92  88.67      81.94      13.14      20.30
GPT_Ranker (C)                                      26.06  49.13  89.58      82.38      13.81      19.28
WCN_Pointer_Head_No_Attention (J)                   26.99  49.13  89.67      81.84      13.18      20.87
WCN_Pointer_Head_Multiheads_Attention (J)           26.73  49.01  89.48 (J)  82.04 (J)  13.06 (J)  20.82 (J)
                                                                  90.39 (C)  82.06 (C)  13.05 (C)  20.79 (C)
WCN_Word_Generation_Head_Multiheads_Attention (J)   -      -      89.48 (J)  83.21 (J)  13.51 (J)  19.71 (J)
                                                                  89.76 (C)  82.68 (C)  13.39 (C)  18.45 (C)
GPT_MultiHead (J)                                   25.97  49.05  92.69 (J)  77.80 (J)  13.24 (J)  23.77 (J)
                                                                  89.73 (C)  82.78 (C)  13.63 (C)  19.09 (C)
GPT_MultiHead_Context (J)                           25.80  48.65  92.28 (J)  78.09 (J)  13.08 (J)  23.09 (J)
                                                                  89.82 (C)  82.67 (C)  13.56 (C)  18.92 (C)
Table 1: SLU results. All “(J)” models perform LU in a joint manner, while for the other models, marked “(C)”, the output of the ranker or correction module is fed into a separate BiLSTM-CRF for tagging and dialogue act detection. For the joint models, two LU rows are shown: the joint heads’ output (J) and the cascaded BiLSTM-CRF applied to the corrected output (C). The ASR correction performance is masked to ensure a fair comparison for WCN_Word_Generation_Head_Multiheads_Attention.

4.2 GPT based Joint SLU

As described in Section 3, a GPT-based LM is used for re-scoring the n-best hypotheses. We extend the GPT-LM with three additional heads (Figure 1): discriminative ranking, dialogue act classification, and slot tagging. In addition to the likelihood of the sequence obtained from the LM, we train a discriminative ranker to select the oracle.

Figure 1: GPT joint model for ranking, intent and slot detection

The ranker takes the last state (or the ‘clf’ token embedding) of each hypothesis as input and outputs 1 if the hypothesis is the oracle and 0 otherwise. Similarly, we sum the last states over all hypotheses and use the result for dialogue act classification. For tagging, we use the transcription during training and the hypothesis selected by the ranker during testing or validation. We add a Bi-LSTM layer on top of the embeddings obtained from the GPT-LM to predict the IOB tags. The model inputs are the context (last system and user utterances, plus the current system utterance) and the n-best hypotheses, all separated by the delimiter used in the original GPT.

4.3 Hierarchical CNN-RNN Neural Ranker

Given the n-best as input, we built a multi-head hierarchical CNN-RNN (Hier-CNN-RNN) model to predict the index of the oracle directly. The n-best ASR hypotheses are first input to a 1D convolutional neural network (CNN) to extract n-gram information. The motivation for using a CNN is to align the words in the n-best hypotheses, since convolutional filters are translation-invariant. The features extracted by the CNN are then fed to an RNN to capture the sequential information in the sentences. The hidden states from the RNN are concatenated together, and the last hidden states from all n-best hypotheses are averaged to predict the index of the oracle in the n-best list. For the joint model, the predicted oracle is fed into an LU head module to predict the intent and slots. The joint model did not perform well, so we have excluded it from the results in the interest of space.

4.4 WCN Pointer Joint Neural Correction and NLU

Figure 2: WCN Pointer Model With Multiheads Self Attention

The WCN model, as illustrated in Figure 2, takes in all the n-best hypotheses at the same time. Specifically, for a given n-best list, a word confusion network alignment is constructed. Then, at each time step, the model concatenates the embeddings of all the n-best words into a word bin and processes them through a multi-headed Bi-LSTM, where each hidden state is concatenated with the embedding vectors as a residual connection. Next, a multi-head self-attention layer is applied to all hidden states, which, in addition to predicting the IOB tags, generates the correct word from the vocabulary (word generation head) or predicts the index at which the correct word is found (pointer head) at each time step. The rationale behind the pointer head is that the correct word often exists in the WCN alignment but may be at different positions; if no correct word is present, we select the word from the 1-best. We append an EOS token at the last time step and use its hidden state for intent prediction.
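To make the pointer head's supervision concrete, the hypothetical helper below derives per-step pointer targets from the aligned word bins: the index of the reference (correct) word within each bin, falling back to the 1-best (index 0) when the correct word is absent, as described above. This is our illustration of the target construction, not the authors' code.

```python
def pointer_targets(word_bins, reference):
    """Per-time-step training targets for the pointer head.

    `word_bins[t]` holds the aligned n-best words at step t (index 0 is
    the 1-best); `reference[t]` is the ground-truth word. The target is
    the bin index of the reference word, or 0 (1-best) when it is absent.
    """
    targets = []
    for bin_words, ref_word in zip(word_bins, reference):
        targets.append(bin_words.index(ref_word) if ref_word in bin_words else 0)
    return targets
```

At inference time the predicted index simply selects a word from the bin, so the corrected output is always drawn from words the ASR actually hypothesized.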

5 Experiments and Results

Data: We use the DSTC-2 data [dstc2], in which the system provides information about restaurants that fit the user's preferences (price, food type, and area), and for each user utterance the 10-best hypotheses are given. We modified the original dialogue act labels to a combination of dialogue act and slot type (e.g., the dialogue act for “what's the price range” becomes “request_pricerange” instead of “request”), which yields a total of 25 unique dialogue acts instead of the initial 14. Further, we address slot detection as a slot tagging problem by using the slot annotations and converting them into IOB format. In our analysis, we ignore cases with empty n-best hypotheses or dialogue acts, and those with the following transcriptions: “noise”, “unintelligible”, “silence”, “system”, “inaudible”, and “hello and welcome”. This leads to 10,881 train, 9,159 test, and 3,560 development utterances. Our objective is not to outperform the state of the art on DSTC-2, but to evaluate whether we can leverage the ASR n-best in a contextual manner for overall better LU through multi-task learning. We also plan to release the data to enable future research.
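The dialogue-act relabeling described above can be sketched as follows; the helper name is ours, and the act/slot strings mirror the example in the text.

```python
def relabel_act(act, slot_types):
    """Combine a dialogue act with its slot type(s) into a single label,
    e.g. ("request", ["pricerange"]) -> "request_pricerange", mirroring
    the relabeling described in the text. Acts without an associated
    slot type keep their original label."""
    return "_".join([act] + slot_types) if slot_types else act
```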

Baseline and Upper Bound: We report word error rate (WER) and sentence error rate (SER) to evaluate ASR, and dialogue act accuracy (DA-Acc), tag error rate (TER), slot F1, and frame error rate (FER) to evaluate LU. We compare the metrics obtained by the joint models with those obtained by cascading (i.e., non-joint models). For ASR, we consider three baselines: the 1-best, and the SLM- and GPT-based re-ranked hypotheses. For LU, we trained a separate Bi-LSTM CRF tagger with an extra head for dialogue act classification, which we run on top of the three baselines above to obtain the LU baseline numbers. To better understand the upper bound, we also report the metrics for the oracle and the ground-truth transcription.
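The ASR-side metrics are standard; for concreteness, here is a minimal sketch of corpus-level WER and SER (the helper names are ours).

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (r != h)) # substitution / match
        prev = cur
    return prev[-1]

def wer(refs, hyps):
    """Corpus word error rate: total word edits / total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return edits / words

def ser(refs, hyps):
    """Sentence error rate: fraction of hypotheses that do not match exactly."""
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)
```

TER and FER on the LU side are computed analogously, at the level of IOB tags and full semantic frames respectively.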

5.1 Results and Discussion

As shown in Table 1, all models outperform the 1-best on the ASR metrics. Even the SLM trained, and the GPT-LM fine-tuned, on 11k training utterances perform significantly better than the 1-best on the ASR metrics. However, this does not translate into improvement on the LU metrics; in fact, the output re-ranked with the SLM does worse on the LU metrics. This indicates that merely reducing WER and SER does not lead to improved LU. The Hier-CNN-RNN ranker achieves 14% lower WER while also improving the LU metrics (5.2% reduction in FER). The GPT-based discriminative ranker also improves both ASR (13% reduction in WER) and LU (10% reduction in FER). This indicates that training a discriminative ranker that identifies the oracle does better than training a task-specific language model. Some of the models even outperform the oracle on DA-Acc (2% absolute improvement), because the dialogue act prediction head uses an encoding of all hypotheses (including the oracle).

On the other hand, the WCN models lead to the best LU slot tagging performance, outperforming the baseline with a 2.2% absolute improvement in slot F1, a 12% TER reduction, and, most importantly, an 8% FER reduction. The GPT joint models improve the TER, but their slot F1 is significantly lower compared to the GPT ranker, probably because there are many more ‘O’ tags than ‘B’ and ‘I’ tags. We noticed that we were able to achieve even higher accuracy by running the baseline tagger on the corrected output of the joint models; our lowest FER is achieved by running the baseline tagger on the output of the joint WCN model with the word generation head. While the WCN model's performance is improved by using the baseline tagger, the difference is much more pronounced for the GPT models (FER drops by almost 4%). We believe this is because the WCN models consume aligned n-best hypotheses, which improves learning efficiency and helps the models converge better when the data size is relatively small. Furthermore, we observed that adding the multi-head self-attention layer and multiple prediction heads helps the WCN models across all metrics.

6 Conclusions

We have presented joint ASR re-ranking and LU models and shown experimental results with significant improvements on the DSTC-2 corpus. To the best of our knowledge, this is the first deep learning based study to this end. We have also contrasted these models with cascaded approaches built on state-of-the-art GPT-based rankers. Our future work involves extending such end-to-end LU approaches towards tighter integration with a generic ASR model.