Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding

04/15/2021
by Seunghyun Seo, et al.

Most end-to-end (E2E) spoken language understanding (SLU) networks leverage pre-trained automatic speech recognition (ASR) networks but still lack the capability to understand the semantics of utterances, which is crucial for the SLU task. To address this, recent studies incorporate pre-trained natural language understanding (NLU) networks. However, fully utilizing both pre-trained networks is not trivial; many approaches have been proposed, such as knowledge distillation, cross-modal shared embeddings, and network integration with an interface. We propose a simple and robust integration method for the E2E SLU network built on a novel interface, the Continuous Token Interface (CTI): the junctional representation between the ASR and NLU networks when both are pre-trained with the same vocabulary. Because the only difference from clean text is the noise level, we feed the ASR network's output directly into the NLU network. This lets us train the SLU network end to end without additional modules such as Gumbel-Softmax. We evaluate our model on SLURP, a challenging SLU dataset, and achieve state-of-the-art scores on both intent classification and slot filling. We also verify that an NLU network pre-trained with masked language modeling can utilize the noisy textual representation produced by CTI. Moreover, we show that our model can be trained with multi-task learning on heterogeneous data even after integration with CTI.
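The abstract leaves the mechanics of the interface implicit, so below is a minimal, hypothetical PyTorch sketch of how a continuous token interface could keep the ASR-to-NLU pipeline differentiable. It assumes the ASR head emits posteriors over the same subword vocabulary that indexes the NLU embedding table, and feeds the NLU the expected embedding under that distribution instead of discretized tokens. All module names, sizes, and the expected-embedding formulation here are illustrative assumptions, not the paper's verified implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a Continuous Token Interface (CTI).
# Assumption: the ASR decoder and the NLU encoder share one subword
# vocabulary, so ASR posteriors index the same rows as the NLU
# embedding table.

VOCAB_SIZE = 1000   # shared subword vocabulary (assumed size)
HIDDEN = 256

class ContinuousTokenInterface(nn.Module):
    """Maps ASR token posteriors to continuous NLU input embeddings.

    Instead of discretizing with argmax (non-differentiable) or
    resorting to Gumbel-Softmax, the interface passes the expected
    embedding under the ASR distribution, so gradients flow end to end.
    """
    def __init__(self, nlu_embedding: nn.Embedding):
        super().__init__()
        self.nlu_embedding = nlu_embedding  # reused NLU input table

    def forward(self, asr_logits: torch.Tensor) -> torch.Tensor:
        # asr_logits: (batch, seq_len, VOCAB_SIZE)
        probs = asr_logits.softmax(dim=-1)
        # Expected embedding: probability-weighted mix of table rows.
        return probs @ self.nlu_embedding.weight  # (batch, seq, HIDDEN)

# Toy stand-ins for the pre-trained networks (hypothetical).
asr_head = nn.Linear(HIDDEN, VOCAB_SIZE)          # final ASR projection
nlu_embedding = nn.Embedding(VOCAB_SIZE, HIDDEN)  # NLU input table
nlu_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True),
    num_layers=2,
)
intent_head = nn.Linear(HIDDEN, 18)  # e.g. SLURP-style intent classes

cti = ContinuousTokenInterface(nlu_embedding)

# Fake ASR encoder states for a batch of 2 utterances, 20 tokens each.
asr_states = torch.randn(2, 20, HIDDEN)
nlu_inputs = cti(asr_head(asr_states))            # continuous tokens
intent_logits = intent_head(nlu_encoder(nlu_inputs).mean(dim=1))
intent_logits.sum().backward()                    # reaches the ASR side
print(asr_head.weight.grad is not None)           # True: differentiable
```

Because the interface is just a soft lookup into the shared embedding table, swapping it for a hard argmax at inference time recovers ordinary discrete tokens, which is consistent with the abstract's claim that the ASR output differs from clean text only in its noise level.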


