Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers

08/22/2021
by Juntae Kim, et al.

Recurrent neural network transducers (RNN-T) are a promising end-to-end speech recognition framework that transduces input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which can effectively model local-global context information via its convolution and self-attention layers. Although the Conformer RNN-T has shown outstanding performance (typically measured by word error rate, WER), most studies have verified it only in settings where the training and test data are drawn from the same domain. The domain mismatch problem for the Conformer RNN-T has not yet been intensively investigated, although it is an important issue for product-level speech recognition systems. In this study, we identified that the fully connected self-attention layers in the Conformer caused high deletion errors, specifically on long-form out-domain utterances. To address this problem, we introduce sparse self-attention layers for Conformer-based encoder networks, which can exploit local and generalized global information by pruning most of the in-domain-fitted global connections. Furthermore, we propose a state reset method for the prediction network so that it generalizes to long-form utterances. Applying the proposed methods to an out-domain test set, we obtained 24.6% and 6.5% relative character error rate (CER) reductions compared to fully connected and local self-attention layer-based Conformers, respectively.
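
The abstract does not include code, so the following is only a minimal sketch of the general idea of restricting self-attention to local connections rather than keeping all in-domain-fitted global links. The function name `local_attention`, the NumPy implementation, and the window size `w` are illustrative assumptions, not taken from the paper, which additionally retains generalized global information.

```python
# Minimal sketch (not the authors' implementation): banded "sparse" self-attention
# where each frame attends only to a local window of w frames on each side,
# pruning the fully connected global links that can overfit to in-domain data.
import numpy as np

def local_attention(q, k, v, w=16):
    """q, k, v: (T, d) arrays for one attention head; returns a (T, d) context."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (T, T) attention logits
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) > w    # True outside the local band
    scores[mask] = -np.inf                            # prune non-local connections
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v

# Usage: 200 acoustic frames, 64 dimensions per head (illustrative sizes)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((200, 64)) for _ in range(3))
ctx = local_attention(q, k, v, w=16)                  # (200, 64) context vectors
```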

