Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding

by   Kai Wei, et al.

Recent years have seen significant advances in end-to-end (E2E) spoken language understanding (SLU) systems, which directly predict intents and slots from spoken audio. While dialogue history has been exploited to improve conventional text-based natural language understanding systems, current E2E SLU approaches have not yet incorporated such critical contextual signals in multi-turn and task-oriented dialogues. In this work, we propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts (actions taken by the voice assistant) of a multi-turn dialogue. We detail alternative methods to integrate these contexts into the state-of-the-art recurrent and transformer-based models. When applied to a large de-identified dataset of utterances collected by a voice assistant, our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively. We also present results on a publicly available dataset and show that our method significantly improves performance over a non-contextual baseline.




1 Introduction

End-to-end (E2E) spoken language understanding (SLU) aims to infer intents and slots from spoken audio via a single neural network. For example, when a user says order some apples, the model maps this spoken utterance (in the form of audio) to the intent Shopping and slots such as Apple: Item. Recent research has made significant advances in E2E SLU [22, 2, 26, 21, 32, 30]. Notably, [30] develops a jointly trained E2E model, consisting of automatic speech recognition (ASR) and natural language understanding (NLU) models connected by a differentiable neural interface, that outperforms the compositional SLU where ASR and NLU models are trained separately. Yet, how to incorporate contexts into E2E SLU remains unexplored.

Figure 1: A multi-turn dialogue example.

Contexts have been shown to significantly improve performance separately for ASR [14, 13, 12, 17, 8, 27, 24, 38, 31] and NLU [8, 25, 34, 1, 3, 9, 4]. For example, [38] proposes a multi-hot encoding to incorporate contextual information into an RNN transducer network (RNN-T) via the speech encoder sub-network and finds that contexts such as dialogue state can improve ASR accuracy. [13] uses a cross-attention mechanism in an E2E speech recognizer decoder for two-party conversations. [9] shows that encoding dialogue acts from the dialogue history with a feedforward network results in a faster and more generalizable model without any accuracy degradation compared to [4]. [36] encodes historical utterances with a BiLSTM and external knowledge with ConceptNet.

In this work, we propose a novel approach to encode dialogue history in a multi-turn E2E SLU system. Figure 1 illustrates a task-oriented turn-by-turn dialogue between a user and a voice assistant (VA). In this figure, the first turn is order some apples. To clarify the apple type, the VA asks What type of apples do you want? at the second turn, and the user’s answer is Fuji. To clarify the quantity, the VA asks How many Fuji apples do you want? at the third turn, and the user’s answer is three. If three is treated as a single-turn utterance, it is ambiguous, since it can mean three apples or three o’clock. However, this utterance can be correctly interpreted as three apples when presented with the previous dialogue contexts (e.g., order some apples and Fuji). Prior E2E SLU research has focused on single-turn interactions where the VA receives the user’s speech signals from just the current turn. These approaches ignore the relevant contexts from previous turns that can enhance the VA’s ability to correctly disambiguate the user’s intent.

In contrast to prior works, where dialogue acts are encoded singularly for ASR (e.g., [38]) or NLU (e.g., [9]), we encode both dialogue acts and previous utterances to improve an E2E SLU architecture.

Specifically, we propose a multi-head gated attention mechanism to encode dialogue contexts. The attention-based context can be integrated at different layers of a neural E2E SLU model. We explore variants where either the audio frames, the neural interface layer (from ASR to NLU), or both are supplemented by the attention-based context vectors. Furthermore, the learnable gating mechanism in our proposed multi-head gated attention can downscale the contribution of the context when needed.

Our proposed approach improves the performance of state-of-the-art E2E SLU models, namely the recurrent neural network transducer SLU and the transformer transducer SLU, on both internal industrial voice assistant datasets and publicly available ones.

2 Problem Definition

We formulate the problem of multi-turn E2E SLU as follows. In a multi-turn setting, a dialogue between a user and the voice assistant system has $T$ turns. Each turn $t$ extends a growing list of dialogue acts $F_t$, corresponding to the preceding system responses, and a list of the user’s previous utterance transcripts $U_t$. Each dialogue act in $F_t$ comprises a dialogue action from an action set and a dialogue slot from a slot set. Take the second turn in Figure 1 as an example: the previous utterance is order some apples, and the dialogue act (an action paired with a slot) corresponds to the VA’s question What type of apples do you want?

Inputs and Outputs: The inputs of each turn include the acoustic input and the dialogue contexts. The acoustic input consists of a sequence of frame-level acoustic features, $X_t$. The dialogue contexts include the preceding dialogue acts $F_t$ and the previous utterance transcripts $U_t$. Our goal is to build a contextual neural E2E SLU architecture that correctly generates transcription and semantic outputs for each spoken turn, namely the intent, the transcript token sequence, and the slot sequence (one slot per token).
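The per-turn inputs described above can be pictured as a small container whose two context lists grow by one entry per turn. The sketch below is illustrative only: the class name `TurnInput`, the helper `next_turn`, the act labels such as `("REQUEST", "apple_type")`, and the dummy frame values are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TurnInput:
    """Everything the model sees at one turn (hypothetical container)."""
    acoustic_frames: List[List[float]]                       # one feature vector per frame
    dialogue_acts: List[Tuple[str, str]] = field(default_factory=list)   # (action, slot) pairs
    previous_transcripts: List[str] = field(default_factory=list)


def next_turn(prev, transcript, system_act, frames):
    """Build the next turn's inputs by extending both context lists."""
    return TurnInput(
        acoustic_frames=frames,
        dialogue_acts=prev.dialogue_acts + [system_act],
        previous_transcripts=prev.previous_transcripts + [transcript],
    )


# Figure 1 dialogue; act labels and frame values are made up for illustration.
turn1 = TurnInput(acoustic_frames=[[0.0] * 4])
turn2 = next_turn(turn1, "order some apples", ("REQUEST", "apple_type"), [[0.1] * 4])
turn3 = next_turn(turn2, "Fuji", ("REQUEST", "quantity"), [[0.2] * 4])
```

At the third turn the model would thus receive the audio for three together with two dialogue acts and the two earlier transcripts.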

3 Proposed Contextual E2E SLU

The proposed contextual E2E SLU architecture consists of a context encoder component, a context combiner, and a base E2E SLU model. The base model consists of ASR and neural NLU modules jointly trained via a differentiable neural interface [29, 30], which has been shown to achieve state-of-the-art SLU performance.

Figure 2 shows the contextual E2E SLU model architecture using speech encoder context ingestion. The context encoder, described in Section 3.1, converts dialogue acts and utterance transcriptions of previous turns into contextual embeddings. The contextual embeddings are then combined with the input audio features (described in Section 3.2) and processed by the ASR module to obtain the output token sequence, where the outputs are transcription graphemes, words, or subword units [7]. Context encoder embeddings are trained along with the rest of the E2E SLU architecture. The hidden interface (or ASR-NLU interface) [28] is connected to the speech encoder via the joint network, which is a feedforward neural network that combines the outputs from the encoder and the prediction network. This interface passes the intermediate hidden representation sequence to a neural NLU module that predicts the intent and a sequence of slots, one per token. Our objective is to minimize the E2E SLU loss, which combines the loss for word prediction, the loss for slot prediction, and the loss for intent prediction. The following section describes the context encoder in detail.

Figure 2: Contextual joint SLU model architecture using speech encoder context ingestion.

3.1 Context Encoder

In this section, we describe approaches to encode dialogue acts and previous utterance transcripts. We first describe the Dialogue Act Encoder that encodes the dialogue acts. Then, we describe the Previous Utterance Transcript Encoder that encodes transcripts from previous utterances.

Figure 3: Encoding previous utterance transcripts and dialogue contexts in a multi-turn dialog between a user and a voice assistant system.

3.1.1 Dialogue Act Encoder

Input: For the $t$-th turn, a list of dialogue acts for all previous turns is provided as the input. We set a maximum number of dialogue action-slot pairs. If the list has fewer pairs than this maximum, we pad it to the maximum length with a default action and slot.

Embedding layer: The embedding layer maintains two embedding matrices: a dialogue action embedding matrix and a dialogue slot embedding matrix, whose sizes are set by the total numbers of unique dialogue actions and slot types in the system, respectively. By passing each dialogue action and dialogue slot through their respective embedding matrices, we obtain their corresponding embeddings.

Encoding layer: Given the dialogue action and slot embeddings, we fuse both embeddings via an element-wise addition followed by a nonlinear transformation [9].

Output: We produce the output as a stack of dialogue act embeddings by aggregating the fused embeddings of all action-slot pairs.
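A minimal sketch of the dialogue act encoder steps (pad, embed, add, transform, stack) is given below in plain Python. It is not the authors' implementation: the ReLU activation, the single linear layer with a bias, the truncation to the first `max_pairs` pairs, the use of id 0 for padding, and all sizes are assumptions chosen for illustration, and the random matrices merely stand in for trained parameters.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random matrix standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode_dialogue_acts(acts, action_emb, slot_emb, weight, bias, max_pairs, pad=(0, 0)):
    """Return max_pairs fused embeddings, one per (action_id, slot_id) pair."""
    acts = list(acts)[:max_pairs]                    # truncation rule is an assumption
    acts += [pad] * (max_pairs - len(acts))          # pad with a default action and slot
    encoded = []
    for action_id, slot_id in acts:
        # Element-wise addition of the action and slot embeddings.
        fused = [a + s for a, s in zip(action_emb[action_id], slot_emb[slot_id])]
        # Nonlinear transformation: linear layer followed by ReLU (assumed activation).
        hidden = [sum(w * x for w, x in zip(row, fused)) + b
                  for row, b in zip(weight, bias)]
        encoded.append([max(0.0, h) for h in hidden])
    return encoded


rng = random.Random(0)
dim = 4
action_emb = make_matrix(3, dim, rng)   # 3 hypothetical actions, id 0 = padding
slot_emb = make_matrix(5, dim, rng)     # 5 hypothetical slot types, id 0 = padding
weight, bias = make_matrix(dim, dim, rng), [0.0] * dim
stack = encode_dialogue_acts([(1, 2), (2, 4)], action_emb, slot_emb, weight, bias, max_pairs=5)
```

Here `stack` has five rows: two for the real action-slot pairs and three identical rows produced by the padding pair.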

3.1.2 Previous Utterance Transcript Encoder

Input: A list of previous utterance transcripts in the dialogue. We first tokenize each previous utterance transcript using the pre-trained BERT-base tokenizer. Next, we prepend a [CLS] token and append a [SEP] token to the tokenized transcript. We set a maximum number of previous utterance transcripts. If the list is shorter than this maximum, we pad it with empty sequences; if it is longer, we keep only the latest transcripts up to the maximum.

Encoding layer: From the tokenized transcripts, we apply the pre-trained BERT-base [6] model to obtain an utterance transcript embedding for each previous utterance, using the [CLS] token embedding as the summary embedding of the full utterance transcript.

Output: Similar to the dialogue act encoder, we output a stack of the utterance embeddings from previous turns.
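The list handling in the input step (keep the latest transcripts, pad short histories, wrap tokens in [CLS] and [SEP]) can be sketched as follows. This is only an illustration under stated assumptions: padding is appended at the end, an empty string represents an empty sequence, and whitespace splitting stands in for the BERT-base tokenizer, which is not called here.

```python
def select_previous_transcripts(transcripts, max_prev, pad=""):
    """Keep the latest max_prev transcripts and pad shorter histories."""
    latest = list(transcripts)[-max_prev:] if max_prev > 0 else []
    return latest + [pad] * (max_prev - len(latest))   # pad position is an assumption


def add_special_tokens(tokens):
    """Prepend [CLS] and append [SEP] to an already tokenized transcript."""
    return ["[CLS]"] + list(tokens) + ["[SEP]"]


history = ["order some apples", "Fuji"]
selected = select_previous_transcripts(history, max_prev=3)
# Whitespace split is a stand-in for the real subword tokenizer.
tokenized = [add_special_tokens(text.split()) for text in selected]
```

An empty padded transcript becomes just `["[CLS]", "[SEP]"]`, so every slot in the history still yields a [CLS] position to summarize.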

Figure 4: Architecture of Gated Multi-head attentions.

3.2 Context Combiner

The context combiner combines the dialogue act encodings and the previous utterance transcript encodings to create the final context vectors that are fed into the model. We explore different ways to combine the context encodings into the model: (i) averaged contextual carryover, (ii) attentive contextual carryover, and (iii) gated attentive contextual carryover.

To illustrate, we detail our approaches with an example that combines the dialogue act encodings and the previous utterance transcript encodings with the acoustic embeddings of the $t$-th turn. Note that the same process can be applied to combine context encodings at different ingestion points in the model (see Section 3.3). We describe our context combiner methods below.

3.2.1 Averaged Contextual Carryover

Recall that the context encoder outputs a stack of dialogue act embeddings and a stack of previous utterance transcript embeddings at turn $t$. In this method, we first compute the average [38] of all dialogue act embeddings and the average of all previous utterance transcript embeddings. Then, we combine the averaged contextual embeddings with the input by concatenating them with the acoustic embedding at each acoustic time step.
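As a rough illustration of averaged contextual carryover, the sketch below averages each context stack and appends both averages to every acoustic frame. The function names, the concatenation order (acoustic, then dialogue act, then utterance), and the toy numbers are assumptions for this example.

```python
def mean_vector(rows):
    """Element-wise mean over a stack of equal-length embeddings."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def averaged_carryover(acoustic, act_embeddings, utt_embeddings):
    """Append the two averaged context vectors to every acoustic frame."""
    context = mean_vector(act_embeddings) + mean_vector(utt_embeddings)
    return [list(frame) + context for frame in acoustic]


acoustic = [[1.0, 2.0], [3.0, 4.0]]            # 2 frames, 2 features each (toy values)
act_embeddings = [[1.0, 3.0], [3.0, 5.0]]      # 2 dialogue act embeddings
utt_embeddings = [[0.0, 2.0]]                  # 1 previous utterance embedding
augmented = averaged_carryover(acoustic, act_embeddings, utt_embeddings)
```

Every frame receives the same context vector, which is exactly the limitation that the attentive variant in the next subsection addresses.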


3.2.2 Attentive Contextual Carryover

Averaging contextual embeddings of the previous turns can hamper the ability of the model to access fine-grained contextual information for a specific turn and time step. Therefore, we utilize the multi-head attention mechanism [35], which uses the acoustic embedding of each time step to attend to relevant dialogue contexts and create the final contextual embeddings.

Specifically, we create the queries, keys, and values via learned linear projections: the queries are projected from the acoustic embeddings of the $t$-th turn, while the keys and values are projected from the dialogue act embeddings and the previous utterance embeddings. A scaled dot-product attention is then used to calculate the final dialogue and utterance context vectors as weighted sums of the projected contextual embeddings of the previous turns. The dot products are scaled using the hidden size of the attention layer for numerical stability [35]. The attention outputs for the dialogue acts and the previous utterances are then concatenated with the acoustic embeddings provided as input.
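The attention step can be sketched with a single head in plain Python. This is a simplified illustration rather than the paper's model: it uses one head instead of several, separate attention passes for dialogue acts and utterances, division of the scores by the square root of the hidden size as in [35], and identity projection matrices in the example; all names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x r) by matrix b (r x c), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(acoustic, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from frames to context rows."""
    queries = matmul(acoustic, w_q)                      # n x d
    keys = matmul(context, w_k)                          # m x d
    values = matmul(context, w_v)                        # m x d
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]           # d x m
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(queries, keys_t)]
    weights = [softmax(row) for row in scores]           # each row sums to 1
    return matmul(weights, values), weights


def attentive_carryover(acoustic, act_emb, utt_emb, act_proj, utt_proj):
    """Concatenate each frame with its dialogue act and utterance context vectors."""
    act_ctx, _ = attend(acoustic, act_emb, *act_proj)
    utt_ctx, _ = attend(acoustic, utt_emb, *utt_proj)
    return [list(x) + a + u for x, a, u in zip(acoustic, act_ctx, utt_ctx)]


identity = [[1.0, 0.0], [0.0, 1.0]]
proj = (identity, identity, identity)                    # stand-ins for learned projections
acoustic = [[1.0, 0.0], [0.0, 1.0]]
act_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
utt_emb = [[0.5, 0.5]]
_, act_weights = attend(acoustic, act_emb, *proj)
fused = attentive_carryover(acoustic, act_emb, utt_emb, proj, proj)
```

Unlike averaging, each frame now gets its own mixture of context rows, but the softmax weights for a frame always sum to one, so some context is always carried over.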

3.2.3 Gated Attentive Contextual Carryover

One limitation of the attention mechanism is that it cannot downscale the contribution of a context when needed [39]. Take a two-turn dialogue as an example:

A user asks a voice assistant to call uncle sam in the first turn, and the system confirms back to see if the user wants to call Uncle Sam’s Sandwich Bar (associated dialogue act is REQUEST(restaurant)). Then, in the second turn, the user corrects that she wants to “call my uncle sam”.

In this case, simply applying multi-head attention as described in Eqs. (3)-(4) to the previous-turn utterance call uncle sam and the dialogue act REQUEST(restaurant) can lead to a wrong interpretation of the second turn. This is because the Softmax function in Eq. (4) always assigns the dialogue act context a positive score, misleadingly associating uncle sam with a restaurant name rather than a person name.

Inspired by gating mechanisms that control information flow or integrate different types of information [16, 5, 11, 15, 19, 13], we introduce a learnable gating mechanism on top of the multi-head attentive contextual carryover to further reduce a context’s influence when it does not help the interpretation. Specifically, we concatenate all the contextual embeddings from the dialogue act and previous utterance encodings. Then, we obtain the gating scores by computing the similarity between linear projections of the concatenated contextual embeddings and of the acoustic embeddings, where the projection weights are learnable parameters. There is one gating score per acoustic frame: the $i$-th entry indicates how much the contexts contribute to the acoustic embedding at the $i$-th frame. We replicate the gating scores so that they have the same dimensions as the attention scores for the dialogue acts and the previous utterances. The gated attention scores are then computed as the element-wise product between the attention scores and the gating scores.

We compute the gated attentive contextual embeddings in each attention head as the sums of the projected context values weighted by the gated attention scores. Finally, the gated dialogue act and previous utterance context embeddings are row-wise concatenated with the acoustic embeddings as input.
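One way to picture the gating step is sketched below for a single head. Several details are assumptions, since the text fixes only the overall recipe: the similarity is taken as a dot product between the projected frame and the projected, flattened context, a sigmoid squashes it into (0, 1), and the queries, keys, and values passed to `gated_attend` are assumed to be already projected. All names and numbers are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def project(vec, mat):
    """Row vector (length r) times matrix (r x c)."""
    return [sum(v * m for v, m in zip(vec, col)) for col in zip(*mat)]


def gate_scores(acoustic, context, w_x, w_c):
    """One gate in (0, 1) per acoustic frame."""
    flat_context = [v for row in context for v in row]    # concatenate all context embeddings
    c_proj = project(flat_context, w_c)
    gates = []
    for frame in acoustic:
        similarity = sum(a * b for a, b in zip(project(frame, w_x), c_proj))
        gates.append(1.0 / (1.0 + math.exp(-similarity)))  # sigmoid (assumed)
    return gates


def gated_attend(queries, keys, values, gates):
    """Scaled dot-product attention whose weights are multiplied by the frame's gate."""
    d = len(keys[0])
    outputs = []
    for q, g in zip(queries, gates):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        gated = [g * w for w in softmax(scores)]           # replicate the gate over all contexts
        outputs.append([sum(w * v[j] for w, v in zip(gated, values))
                        for j in range(len(values[0]))])
    return outputs


acoustic = [[1.0, 0.0], [0.0, 1.0]]
context = [[1.0, 0.0], [0.0, 1.0]]
w_x = [[1.0, 0.0], [0.0, 1.0]]
w_c = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]    # maps the flattened context to 2 dims
learned_gates = gate_scores(acoustic, context, w_x, w_c)
# Hand-set gates show the two extremes: fully closed and fully open.
extremes = gated_attend(acoustic, context, context, [0.0, 1.0])
```

With a gate of 0 the context vector for that frame is all zeros, which plain softmax attention cannot produce; with a gate of 1 the result matches ungated attention.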

3.3 Context Ingestion Scenarios

We consider the integration of the context encoder using three schemes: ingestion by the speech encoder network, ingestion with the hidden ASR-NLU interface, and finally at both insertion points.

Speech encoder ingestion: In this method, we incorporate the output context embeddings only into the acoustic embeddings for the ASR pre-training/training task. This approach is motivated by prior research showing that context benefits the speech encoder more than the prediction network of ASR transducer models [31]. To combine context with acoustic embeddings, we use the acoustic embeddings as the queries, while the dialogue act and previous utterance encodings serve as the keys and values in the context combiner. The outputs with ingested context (Equation (2)) are then used to perform the ASR task.

ASR-NLU interface ingestion: In this approach, we ingest the output context embeddings only into the ASR-NLU interface embeddings for the SLU training task. As such, we now use the ASR-NLU interface embeddings, instead of the acoustic embeddings, as the queries for the context combiner.

Shared context ingestion: In this method, we integrate context into both the acoustic embeddings and the ASR-NLU interface embeddings. We maintain a shared context encoder between the ASR and NLU submodules, so both receive the same dialogue act and previous utterance encodings. For fusion, we maintain two separate context combiners to increase the flexibility of context ingestion. Specifically, we establish one gated multi-head attentive context combiner for the ASR submodule with the acoustic embeddings as queries, and another for the NLU submodule with the ASR-NLU interface embeddings as queries.

In the following sections, we perform experiments on incorporating multi-turn context into two SLU architectures: a Transformer-based Joint SLU model and an RNN-T based Joint SLU model.

4 Experimental Settings

4.1 Datasets

The internal industrial voice assistant (IVA) dataset is a far-field dataset with more than 10k hours of audio data and their corresponding intent and slot annotations. It is a multi-domain dataset with both single-turn and multi-turn utterances. In total, there are 55 intents, 183 slot types, and 49 dialogue acts. In addition, we built a synthetic and publicly available multi-turn E2E SLU (Syn-Multi) dataset based on [33], which contains two text-only datasets from the Restaurant (11,234 turns in 1,116 training dialogues) and Movie (3,562 turns in 384 training dialogues) domains. To obtain audio signals, we used a Transformer text-to-speech model to synthesize the audio, and we combined the two datasets into one dataset for model training and evaluation. Finally, we used SpecAugment [23] to augment the audio feature inputs. In total, Syn-Multi has 3 intents, 12 slot types, and 21 user dialogue act types.

4.2 Implementation setup

Audio features: The input audio features are 64-dimensional log filter-bank energy (LFBE) features extracted every 10 ms with a window size of 25 ms from audio samples. The features of each audio frame are stacked with the features of the two previous audio frames, followed by downsampling by a factor of 3 to achieve a low frame rate, resulting in 192 feature dimensions per audio frame. We use a token set with 4,000 wordpieces trained by the sentence-piece tokenization model [20].
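The frame stacking and downsampling arithmetic (3 stacked frames of 64 dimensions give 192 dimensions, and keeping every third stacked frame turns a 10 ms hop into a 30 ms hop) can be checked with a short sketch. The choice of which frames to keep, starting at the first frame that has two real predecessors, is an assumption; the text does not specify the offset or any padding.

```python
def stack_and_downsample(frames, left_context=2, factor=3):
    """Stack each kept frame with its previous frames, keeping every factor-th frame."""
    stacked = []
    # Start at the first frame with enough history (offset is an assumption).
    for i in range(left_context, len(frames), factor):
        window = []
        for j in range(i - left_context, i + 1):
            window.extend(frames[j])
        stacked.append(window)
    return stacked


# Nine dummy 64-dimensional frames, where frame i is filled with the value i.
frames = [[float(i)] * 64 for i in range(9)]
low_rate = stack_and_downsample(frames)
```

Nine input frames produce three stacked frames, each covering a disjoint group of three consecutive inputs.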

Model setup: Table 1 shows our model setup details. We built contextual E2E SLU models based on the Recurrent Neural Network Transducer (RNN-T) [10] and the Transformer Transducer (T-T) [40], respectively. Both E2E SLU models comprise an audio encoder network that encodes LFBE features, a prediction network that encodes a sequence of predicted wordpieces, a joint network that combines the encoder and the prediction network, and an NLU tagger that predicts intents and slots. The intent tagger contains two feedforward layers before projecting into the number of intents, and the slot tagger directly takes the output embeddings from the NLU tagger and projects them into the slot size. The audio encoders in the E2E T-T SLU and E2E RNN-T SLU models are Transformer layers (with 4 attention heads) and LSTM layers, respectively. The NLU taggers in the E2E T-T SLU and E2E RNN-T SLU models are Transformer layers (with 8 attention heads) and BiLSTM layers, respectively. We set both the maximum number of dialogue action-slot pairs and the maximum number of previous utterance transcripts to 5 on the IVA dataset and to 20 on the Syn-Multi dataset.

Training setup: We adopt a stage-wise joint training strategy for the proposed contextual models and the baseline non-contextual models. We first pre-train an ASR model to minimize the RNN-T loss [7]. We then freeze the ASR module and train the NLU module to minimize the cross-entropy losses for intent and slot prediction. During training, all constituent subwords of a word are tagged with its slot. During inference, the constituent subwords are combined to form the word, and the slot tag of the last constituent subword is taken as the slot tag for the word. Last, we jointly tune the ASR and NLU modules to minimize all three losses. We use the teacher forcing technique [37], which uses the human-annotated transcripts of previous turns for training and the automatic transcripts of previous turns from our model for inference. We apply the Adam optimizer [18] for all model training. For E2E RNN-T SLU, the learning rate is warmed up linearly from 0 to its peak value during the first 3K steps, held constant until 150K steps, and then decayed exponentially until 620K steps. For E2E T-T SLU, the learning rate is warmed up from 0 to its peak value in the first 16K steps and then decayed exponentially over the following 604K steps. We used 24 NVIDIA® V100 Tensor Core GPUs and a batch size of 32 for training the model.
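The inference-time rule for slot tags (merge the subwords of a word and keep the tag of the last subword) might look like the sketch below. It assumes SentencePiece-style pieces in which a leading "\u2581" marks the start of a word; the function name and the tag labels `Other` and `Item` are hypothetical.

```python
def merge_subword_slots(pieces, slot_tags, boundary="\u2581"):
    """Merge subword pieces into words; each word takes its last piece's slot tag."""
    words, tags = [], []
    for piece, tag in zip(pieces, slot_tags):
        if piece.startswith(boundary) or not words:
            words.append(piece.lstrip(boundary))   # a new word starts here
            tags.append(tag)
        else:
            words[-1] += piece                     # continue the current word
            tags[-1] = tag                         # the last subword's tag wins
    return list(zip(words, tags))


pieces = ["\u2581order", "\u2581some", "\u2581app", "les"]
predicted = ["Other", "Other", "Other", "Item"]
merged = merge_subword_slots(pieces, predicted)
```

In this toy case the pieces for apples disagree, and the word ends up tagged `Item` because that is the tag predicted for its final piece.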

                                 IVA Dataset          Syn-Multi Dataset
                                 RNN-T     T-T        RNN-T     T-T
Audio encoder network
    # Layers                     5         6          4         2
    Layer embed-size             736       256        640       256
    # Attention heads            -         4          -         4
    # FeedForward layer          1         1          1         1
    FeedForward embed-size       512       2048       256       512
Prediction network
    # Layers                     2         2          2         2
    Layer embed-size             736       736        640       640
    # FeedForward layer          1         1          1         1
    FeedForward embed-size       512       512        256       512
Joint network
    Vocab embed-size             512       512        512       512
    # FeedForward layer          1         1          1         1
    FeedForward embed-size       512       512        512       512
    Activation                   tanh      tanh       tanh      tanh
NLU decoder network
    # Layers                     2         2          2         2
    Layer embed-size             256       256        256       256
    # FeedForward layer          1         1          1         1
    FeedForward size             256       256        256       256
    Intent Predictor Network
        # FeedForward layer      2         2          2         2
        FeedForward size         512       512        512       512
        Activation               relu      relu       relu      relu
        # FeedForward layer      1         1          1         1
        FeedForward size         #intent   #intent    #intent   #intent
    Slot Tagger Network
        # FeedForward layer      1         1          1         1
        FeedForward size         #slots    #slots     #slots    #slots
    # Attention heads            -         8          -         8
Table 1: Model setup for E2E SLU.

4.3 Evaluation Metrics and Baselines

We evaluate the model performance on word error rate (WER), intent classification error rate (ICER), and semantic error rate (SemER). WER measures the proportion of words that are misrecognized (deleted, inserted, or substituted) in the hypothesis relative to the reference. ICER measures the proportion of utterances with a misclassified intent. SemER combines intent and slot accuracy into a single metric, i.e., SemER = # (slot errors + intent errors) / # (slots + intents in reference). We only show relative error rate reduction results on the IVA dataset. Taking WER as an example, given a method A’s WER ($\mathrm{WER}_A$) and a baseline B’s WER ($\mathrm{WER}_B$), the relative word error rate reduction (WERR) of A over B can be computed as $\mathrm{WERR} = (\mathrm{WER}_B - \mathrm{WER}_A) / \mathrm{WER}_B$; the higher the value, the greater the improvement. We denote the relative error rate reductions for WER, ICER, and SemER as WERR, ICERR, and SemERR.
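The three quantities above are simple to compute, as the sketch below shows. WER is implemented with a standard word-level edit distance, SemER follows the ratio given in the text, and the relative reduction is reported as a percentage. The function names and all example numbers are hypothetical and are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def semantic_error_rate(slot_errors, intent_errors, ref_slots, ref_intents):
    """SemER = (slot errors + intent errors) / (slots + intents in the reference)."""
    return (slot_errors + intent_errors) / (ref_slots + ref_intents)


def relative_reduction(baseline_rate, method_rate):
    """Relative error rate reduction of a method over a baseline, in percent."""
    return 100.0 * (baseline_rate - method_rate) / baseline_rate


wer = word_error_rate("order some apples", "order sum apples")   # one substitution
semer = semantic_error_rate(slot_errors=1, intent_errors=1, ref_slots=3, ref_intents=1)
werr = relative_reduction(baseline_rate=20.0, method_rate=18.0)  # made-up rates
```

A method that lowers a baseline error rate from 20% to 18% therefore has a relative reduction of 10%, even though the absolute change is only 2 points.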

5 Results

Improving E2E SLU with contexts: Table 2 shows the overall model performance and the total number of parameters (in millions) of the baseline and our proposed models on the IVA dataset. We observe that contexts play a crucial role in improving E2E SLU across speech recognition and semantic interpretation. In particular, our contextual E2E RNN-T SLU model achieves relative reductions of 7.75% in WER, 10.96% in ICER, and 14.56% in SemER. Our contextual E2E T-T SLU model achieves relative reductions of 13.83% in WER, 11.06% in ICER, and 10.60% in SemER. Interestingly, encoding contexts with gated attentive contextual carryover performed better than the traditional multi-head attention [35]. It gave the best performance, with relative SemER improvements of 14.56% and 10.60% for the RNN-T and T-T based models, respectively.

For all subsequent discussion, we focus on SemER, as it summarizes the performance across all tasks.

                                                 Relative Error Reduction
Model            Config. (# params)              WERR      ICERR     SemERR
E2E RNN-T SLU    No Context (35.12M)             Baseline  Baseline  Baseline
                 w/ DA (35.31M)                  5.86%     8.94%     8.06%
                 PrevUtt (37.38M)                7.44%     6.62%     12.23%
                 DA + PrevUtt + AvC (37.57M)     7.38%     8.74%     12.66%
                 DA + PrevUtt + AttC (37.72M)    7.88%     6.92%     13.14%
                 DA + PrevUtt + GAttC (37.94M)   7.75%     10.96%    14.56%
E2E T-T SLU      No Context (28.58M)             Baseline  Baseline  Baseline
                 w/ DA (28.61M)                  5.39%     4.59%     1.48%
                 PrevUtt (28.78M)                12.37%    8.87%     6.32%
                 DA + PrevUtt + AvC (28.80M)     11.50%    8.46%     7.85%
                 DA + PrevUtt + AttC (30.67M)    12.63%    9.50%     9.27%
                 DA + PrevUtt + GAttC (30.89M)   13.83%    11.06%    10.60%
Table 2: Overall results on the IVA dataset. No Context: E2E without contexts. DA and PrevUtt: dialogue act and previous utterance context. AvC: averaged contextual carryover. AttC: attentive contextual carryover. GAttC: AttC with gating layers.

Table 3 summarizes the results for utterances with two turns, three turns, and at least four turns. We observe that encoding contexts can lead to an average relative improvement of 40.73% and 37.09% across RNN-T and T-T E2E SLU.

Model            Config.                 2-turn    3-turn    4-turn +
E2E RNN-T SLU    No Context              Baseline  Baseline  Baseline
                 DA + PrevUtt + GAttC    30.35%    37.80%    54.04%
E2E T-T SLU      No Context              Baseline  Baseline  Baseline
                 DA + PrevUtt + GAttC    37.07%    37.91%    36.30%
Table 3: Results on the IVA multi-turn utterances.
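The averages quoted for Table 3 (40.73% and 37.09%) are consistent with an unweighted mean over the three turn-count columns, as this small check shows. Treating the average as unweighted is an assumption, since the text does not say how the buckets were combined.

```python
# Relative improvements from Table 3 (2-turn, 3-turn, 4-turn +), in percent.
table3 = {
    "E2E RNN-T SLU": [30.35, 37.80, 54.04],
    "E2E T-T SLU": [37.07, 37.91, 36.30],
}

# Unweighted mean per model, rounded to two decimals as in the text.
averages = {model: round(sum(values) / len(values), 2)
            for model, values in table3.items()}
```

The RNN-T model gains most on dialogues with four or more turns, while the T-T gains are nearly flat across the three buckets.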

The effect of context ingestion: Table 4 and Table 5 show the effects of context ingestion on the E2E SLU performance. We observe that the context encoder improves E2E SLU for all scenarios, giving an average relative improvement of 13.69% and 12.07%, respectively, across RNN-T and T-T E2E SLU. Compared to the speech encoder and hidden interface ingestion, the shared context ingestion gave the biggest improvement on T-T E2E SLU with a relative improvement of 19.4%.

                                        Relative Error Reduction
Model            Config                 WERR      ICERR     SemERR
E2E RNN-T SLU    No Context             Baseline  Baseline  Baseline
                 Speech Encoder         7.75%     10.96%    14.56%
                 ASR-NLU Interface      8.83%     8.23%     13.56%
                 Shared Context         9.14%     7.13%     12.95%
E2E T-T SLU      No Context             Baseline  Baseline  Baseline
                 Speech Encoder         13.83%    11.06%    10.6%
                 ASR-NLU Interface      1.26%     6.89%     6.21%
                 Shared Context         15.16%    20.81%    19.4%
Table 4: The effect of context ingestion: IVA datasets.
                                        Absolute Error Rate
Model            Config                 WER       ICER      SemER
E2E RNN-T SLU    No Context             16.02%    32.49%    40.76%
                 Speech Encoder         19.76%    3.62%     29.24%
                 ASR-NLU Interface      10.59%    0.36%     18.99%
                 Shared Context         12.14%    0.25%     18.55%
E2E T-T SLU      No Context             13.06%    30.4%     36.83%
                 Speech Encoder         14.1%     2.38%     26.56%
                 ASR-NLU Interface      12.81%    0.25%     18.49%
                 Shared Context         13.68%    0.21%     18.62%
Table 5: The effect of context ingestion: Syn-Multi datasets.

We also qualitatively examined the effect of contexts. Contextual models recognized cancel correctly with the Select(Time) dialogue act context, whereas the non-contextual model recognized the word as cascal. Further, contextual models can better handle ambiguous utterances. For example, contextual models correctly assign the utterance next Monday for inferno to the BuyMovieTickets intent because its previous utterance is i want to buy movie tickets, whereas non-contextual models confuse this utterance with the ReserveRestaurant intent.

6 Conclusion

We propose a novel E2E SLU approach where a multi-head gated attention mechanism is introduced to effectively incorporate the dialogue history from the spoken audio. Our proposed approach significantly improves E2E SLU accuracy on the internal industrial voice assistant and publicly available datasets compared to the non-contextual E2E SLU models. In the future, we will apply our proposed approach on other datasets and further improve our contextual model architecture.


  • [1] W. A. Abro, G. Qi, H. Gao, M. A. Khan, and Z. Ali (2019) Multi-turn intent determination for goal-oriented dialogue systems. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1.
  • [2] S. Bhosale, I. Sheikh, S. H. Dumpala, and S. K. Kopparapu (2019) End-to-end spoken language understanding: bootstrapping in low resource scenarios. In Interspeech, pp. 1188–1192. Cited by: §1.
  • [3] Q. Chen, Z. Zhuo, W. Wang, and Q. Xu (2019) Transfer learning for context-aware spoken language understanding. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 779–786. Cited by: §1.
  • [4] Y. Chen, D. Hakkani-Tür, G. Tür, J. Gao, and L. Deng (2016) End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Interspeech, pp. 3245–3249. Cited by: §1.
  • [5] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.2.3.
  • [6] J. Devlin, M. Chang, K. Lee, and K. N. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §3.1.2.
  • [7] A. Graves (2012) Sequence transduction with recurrent neural networks. CoRR, vol. abs/1211.3711, 2. Cited by: §3, §4.2.
  • [8] A. Gupta, P. Zhang, G. Lalwani, and M. T. Diab (2019) CASA-nlu: context-aware self-attentive natural language understanding for task-oriented chatbots. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1285–1290. Cited by: §1.
  • [9] R. Gupta, A. Rastogi, and D. Hakkani-Tür (2018) An efficient approach to encoding context for spoken language understanding. In Interspeech 2018, pp. 3469–3473. Cited by: §1, §1, §3.1.1.
  • [10] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Cited by: §4.2.
  • [11] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.3.
  • [12] S. Kim, S. Dalmia, and F. Metze (2019) Cross-attention end-to-end asr for two-party conversations. In Interspeech 2019, pp. 4380–4384. Cited by: §1.
  • [13] S. Kim, S. Dalmia, and F. Metze (2019) Gated embeddings in end-to-end speech recognition for conversational-context fusion. arXiv preprint arXiv:1906.11604. Cited by: §1, §3.2.3.
  • [14] S. Kim and F. Metze (2018) Dialog-context aware end-to-end speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 434–440. Cited by: §1.
  • [15] S. Kim and M. L. Seltzer (2018) Towards language-universal end-to-end speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4914–4918. Cited by: §3.2.3.
  • [16] S. Kim (2019) End-to-end speech recognition on conversations. Ph.D. Thesis, Carnegie Mellon University. Cited by: §3.2.3.
  • [17] S. Kim (2020) End-to-end speech recognition on conversations. Cited by: §1.
  • [18] D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In ICLR 2015 : International Conference on Learning Representations 2015, Cited by: §4.2.
  • [19] J. Kiros, W. Chan, and G. Hinton (2018) Illustrative language understanding: large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 922–933. Cited by: §3.2.3.
  • [20] T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 66–75. Cited by: §4.2.
  • [21] L. Lugosch, B. H. Meyer, D. Nowrouzezahrai, and M. Ravanelli (2020) Using speech synthesis to train end-to-end spoken language understanding models. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8499–8503. Cited by: §1.
  • [22] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio (2019) Speech model pre-training for end-to-end spoken language understanding. In Interspeech 2019, pp. 814–818. Cited by: §1.
  • [23] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Interspeech 2019, pp. 2613–2617. Cited by: §4.1.
  • [24] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao (2018) Deep context: end-to-end contextual speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 418–425. Cited by: §1.
  • [25] L. Qin, W. Che, M. Ni, Y. Li, and T. Liu (2021) Knowing where to leverage: context-aware graph convolutional network with an adaptive fusion layer for contextual spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 1280–1289. Cited by: §1.
  • [26] M. Radfar, A. Mouchtaris, and S. Kunzmann (2020) End-to-end neural transformer based spoken language understanding. In Interspeech 2020, pp. 866–870. Cited by: §1.
  • [27] A. Raju, B. Hedayatnia, L. Liu, A. Gandhe, C. Khatri, A. Metallinou, A. Venkatesh, and A. Rastrow (2018) Contextual language model adaptation for conversational agents. In Interspeech 2018, pp. 3333–3337. Cited by: §1.
  • [28] A. Raju, G. Tiwari, M. Rao, P. Dheram, B. Anderson, Z. Zhang, B. Bui, and A. Rastrow (2021) End-to-end spoken language understanding using RNN-transducer ASR. arXiv preprint arXiv:2106.15919. Cited by: §3.
  • [29] M. Rao, P. Dheram, G. Tiwari, A. Raju, J. Droppo, A. Rastrow, and A. Stolcke (2021) Do as I mean, not as I say: sequence loss training for spoken language understanding. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Cited by: §3.
  • [30] M. Rao, A. Raju, P. Dheram, B. Bui, and A. Rastrow (2020) Speech to semantics: improve ASR and NLU jointly via all-neural interfaces. In Interspeech 2020, pp. 876–880. Cited by: §1, §3.
  • [31] S. N. Ray, M. Wu, A. Raju, P. Ghahremani, R. Bilgi, M. Rao, H. Arsikere, A. Rastrow, A. Stolcke, and J. Droppo (2021) Listen with intent: improving speech recognition with audio-to-intent front-end. Interspeech. Cited by: §1, §3.3.
  • [32] S. Rongali, B. Liu, L. Cai, K. Arkoudas, C. Su, and W. Hamza (2021) Exploring transfer learning for end-to-end spoken language understanding. AAAI. Cited by: §1.
  • [33] P. Shah, D. Hakkani-Tur, B. Liu, and G. Tur (2018) Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), Vol. 3, pp. 41–51. Cited by: §4.1.
  • [34] S. Su, P. Yuan, and Y. Chen (2019) Dynamically context-sensitive time-decay attention for dialogue modeling. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7200–7204. Cited by: §1.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Vol. 30, pp. 5998–6008. Cited by: §3.2.2, §3.2.2, §5.
  • [36] Y. Wang, T. He, R. Fan, W. Zhou, and X. Tu (2019) Effective utilization of external knowledge and history context in multi-turn spoken language understanding model. In 2019 IEEE International Conference on Big Data (Big Data), pp. 960–967. Cited by: §1.
  • [37] R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (2), pp. 270–280. Cited by: §4.2.
  • [38] Z. Wu, B. Li, Y. Zhang, P. S. Aleksic, and T. N. Sainath (2020) Multistate encoding with end-to-end speech RNN transducer network. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7819–7823. Cited by: §1, §1, §3.2.1.
  • [39] L. Xue, X. Li, and N. L. Zhang (2020) Not all attention is needed: gated attention network for sequence data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 6550–6557. Cited by: §3.2.3.
  • [40] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar (2020) Transformer transducer: a streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833. Cited by: §4.2.