Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding

by Seunghyun Seo, et al.

Most end-to-end (E2E) spoken language understanding (SLU) networks leverage pre-trained automatic speech recognition (ASR) networks but still lack the capability to understand the semantics of utterances, which is crucial for the SLU task. To address this, recent studies also use pre-trained natural language understanding (NLU) networks. However, fully utilizing both pre-trained networks is not trivial; proposed solutions include knowledge distillation, cross-modal shared embedding, and network integration through an interface. We propose a simple and robust integration method for E2E SLU networks built on a novel interface, the Continuous Token Interface (CTI), which is the junctional representation of the ASR and NLU networks when both are pre-trained with the same vocabulary. Because the only difference between the ASR network's output and the NLU network's expected input is the noise level, we feed the ASR network's output directly to the NLU network. Thus, we can train our SLU network in an E2E manner without additional modules such as Gumbel-Softmax. We evaluate our model on SLURP, a challenging SLU dataset, and achieve state-of-the-art scores on both intent classification and slot filling. We also verify that the NLU network, pre-trained with a masked language modeling objective, can utilize the noisy textual representation provided by CTI. Moreover, we show that our model can be trained with multi-task learning from heterogeneous data even after integration with CTI.
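The idea of feeding the ASR output directly to the NLU network can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' confirmed implementation: it assumes the "continuous token" is the ASR network's per-frame probability distribution over the shared vocabulary, and that the NLU input is the probability-weighted average of the NLU token embeddings. The function names, shapes, and random data are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def continuous_token_embeddings(asr_logits, nlu_embedding_table):
    """Map ASR logits of shape (T, V) to NLU input embeddings of shape (T, D).

    Assumption (not stated in the abstract): each position's distribution over
    the shared vocabulary is used as soft weights on the NLU embedding table,
    so no discrete sampling step (e.g. Gumbel-Softmax) is needed and the whole
    mapping stays differentiable.
    """
    token_probs = softmax(asr_logits, axis=-1)  # (T, V), rows sum to 1
    return token_probs @ nlu_embedding_table    # (T, D)


# Hypothetical sizes: 4 output positions, vocabulary of 10, embedding dim 8.
rng = np.random.default_rng(0)
T, V, D = 4, 10, 8
asr_logits = rng.normal(size=(T, V))
nlu_embedding_table = rng.normal(size=(V, D))

soft_inputs = continuous_token_embeddings(asr_logits, nlu_embedding_table)

# With a very confident (nearly one-hot) ASR output, the soft input reduces
# to an ordinary embedding lookup of the predicted token ids.
token_ids = np.array([3, 1, 7, 0])
peaked_logits = 50.0 * np.eye(V)[token_ids]
hard_like_inputs = continuous_token_embeddings(peaked_logits, nlu_embedding_table)
```

Under this reading, a clean transcript and a noisy ASR output differ only in how peaked the distributions are, which matches the abstract's statement that the noise level is the only difference between the two representations.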




