Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

05/23/2023
by   Tian-Hao Zhang, et al.
0

Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. In particular, unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. Moreover, a variant Semi-ASCD is proposed to balance accuracy and computational cost. Our proposal is evaluated on the publicly available AISHELL-1 and aidatatang_200zh datasets using Transformer, Conformer, and Branchformer as encoders, respectively. The experimental results show that ASCD significantly improves the performance by leveraging both the acoustic and semantic information cooperatively.

READ FULL TEXT
research
06/18/2020

Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR

The Transformer has shown impressive performance in automatic speech rec...
research
12/06/2019

Semantic Mask for Transformer based End-to-End Speech Recognition

Attention-based encoder-decoder model has achieved impressive results fo...
research
07/28/2018

Back-Translation-Style Data Augmentation for End-to-End ASR

In this paper we propose a novel data augmentation method for attention-...
research
05/18/2023

A Lexical-aware Non-autoregressive Transformer-based ASR Model

Non-autoregressive automatic speech recognition (ASR) has become a mains...
research
06/29/2021

Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs

We present two multimodal fusion-based deep learning models that consume...
research
06/24/2019

Multimodal and Multi-view Models for Emotion Recognition

Studies on emotion recognition (ER) show that combining lexical and acou...
research
07/26/2021

Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

Although text recognition has significantly evolved over the years, stat...

Please sign up or login with your details

Forgot password? Click here to reset