Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

06/12/2023
by Anderson R. Avila et al.

Recent voice assistants are usually based on the cascaded spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine followed by a natural language understanding (NLU) system. Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs) such as BERT and RoBERTa. Moreover, we propose a multimodal language understanding (MLU) module to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically wav2vec for speech and BERT/RoBERTa for language. Our MLU combines an encoder network that embeds the audio signal with a text encoder that processes the text transcript, followed by a late fusion layer that fuses the audio and text logits. We find that the proposed MLU remains robust to poor-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely compromised. The model is evaluated on five tasks from three SLU datasets, and robustness is tested using ASR transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLM baselines across all datasets for the academic ASR engine.
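To make the late-fusion idea concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes pre-pooled, fixed-size audio and text embeddings (e.g., mean-pooled wav2vec and BERT/RoBERTa outputs), a hypothetical class count, and reads "late fusion layer" as concatenating the two per-branch logit vectors and mapping them to final logits with a linear layer; all dimensions and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LateFusionMLU(nn.Module):
    """Sketch of a multimodal SLU classifier: an audio branch and a text branch
    each produce class logits, which are combined by a late-fusion layer."""

    def __init__(self, audio_dim=768, text_dim=768, num_classes=31):
        super().__init__()
        # Stand-in linear heads over pre-extracted embeddings; in the paper the
        # audio branch builds on wav2vec features and the text branch on
        # BERT/RoBERTa (assumption: embeddings are already pooled to vectors).
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.text_head = nn.Linear(text_dim, num_classes)
        # Late fusion: concatenate the two sets of logits, then map to final logits.
        self.fusion = nn.Linear(2 * num_classes, num_classes)

    def forward(self, audio_emb, text_emb):
        audio_logits = self.audio_head(audio_emb)   # (batch, num_classes)
        text_logits = self.text_head(text_emb)      # (batch, num_classes)
        fused = torch.cat([audio_logits, text_logits], dim=-1)
        return self.fusion(fused)

# Usage with random stand-in embeddings; in practice these would come from
# wav2vec and BERT/RoBERTa encoders pooled over time steps / tokens.
model = LateFusionMLU()
audio_emb = torch.randn(4, 768)
text_emb = torch.randn(4, 768)
logits = model(audio_emb, text_emb)   # shape: (4, 31)
```

Because the fusion operates on logits rather than on raw features, the text branch can degrade (e.g., on noisy ASR transcripts) while the audio branch still contributes a usable signal, which is the intuition behind the robustness claim above.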

Related research:

04/04/2022 · Deliberation Model for On-Device Spoken Language Understanding
11/29/2021 · Do We Still Need Automatic Speech Recognition for Spoken Language Understanding?
11/03/2020 · Warped Language Models for Noise Robust Language Understanding
02/12/2021 · Do as I mean, not as I say: Sequence Loss Training for Spoken Language Understanding
08/30/2021 · ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding
09/07/2020 · Robust Spoken Language Understanding with RL-based Value Error Recovery
11/08/2022 · Robust Unstructured Knowledge Access in Conversational Dialogue with ASR Errors
