Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

05/14/2021
by   Swayambhu Nath Ray, et al.
16

Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56 (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33 further detailed analysis of the streaming system indicates that our proposed method brings especially good gain on media-playing related intents (e.g. 9.12 relative WERR on PlayMusicIntent).

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/23/2019

Incremental Online Spoken Language Understanding

Spoken Language Understanding (SLU) typically comprises of an automatic ...
research
10/27/2022

Contextual-Utterance Training for Automatic Speech Recognition

Recent studies of streaming automatic speech recognition (ASR) recurrent...
research
09/17/2020

Utterance-level Intent Recognition from Keywords

This paper focuses on wake on intent (WOI) techniques for platforms with...
research
10/20/2020

Knowledge Transfer for Efficient On-device False Trigger Mitigation

In this paper, we address the task of determining whether a given uttera...
research
07/11/2023

Improving RNN-Transducers with Acoustic LookAhead

RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-...
research
05/07/2020

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

In recent years, all-neural end-to-end approaches have obtained state-of...
research
08/06/2019

Two-stage Training for Chinese Dialect Recognition

In this paper, we present a two-stage language identification (LID) syst...

Please sign up or login with your details

Forgot password? Click here to reset