Improving RNN-T ASR Performance with Date-Time and Location Awareness

by Swayambhu Nath Ray, et al.

In this paper, we explore the benefits of incorporating context into a Recurrent Neural Network Transducer (RNN-T) based Automatic Speech Recognition (ASR) model to improve speech recognition for virtual assistants. Specifically, we use meta information extracted from the time at which the utterance is spoken and from approximate location information to make ASR context aware. We show that these contextual signals, when used individually, improve overall performance by as much as 3.48%. When they are combined, the model learns complementary features and recognition improves by 4.62%, [...] improvements as high as 11.5%. We ran experiments with models trained on datasets of 30K hours and 10K hours. The scale of improvement with the 10K-hour dataset is much higher than the one obtained with the 30K-hour dataset. Our results indicate that when limited data is available to train the ASR model, contextual signals can improve performance significantly.
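
As an illustration of what "meta information extracted from the time and approximate location" could look like in practice, the following is a minimal sketch and not the authors' feature pipeline. It assumes the hour of day and day of week are encoded cyclically and that location is reduced to a coarse region index; the names `cyclical`, `context_features`, `region_id`, and `num_regions` are hypothetical and do not come from the paper.

```python
import math
from datetime import datetime


def cyclical(value, period):
    """Map a periodic value (e.g. hour 0-23) to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return [math.sin(angle), math.cos(angle)]


def context_features(timestamp, region_id, num_regions):
    """Build a small context vector from a timestamp and a coarse region index.

    Assumption: the abstract does not describe the encoding; this is one
    plausible choice made purely for illustration.
    """
    # Hour of day and day of week, each as a (sin, cos) pair.
    time_part = cyclical(timestamp.hour, 24) + cyclical(timestamp.weekday(), 7)
    # One-hot encoding of a hypothetical coarse region identifier.
    region_part = [0.0] * num_regions
    region_part[region_id] = 1.0
    return time_part + region_part


features = context_features(datetime(2021, 6, 4, 18, 30), region_id=2, num_regions=5)
```

A vector like this could then be supplied to the recognizer alongside the acoustic features; how the paper actually injects the context into the RNN-T is not stated in the abstract.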
