Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

08/13/2020
by Wenyong Huang, et al.

The Transformer has achieved performance competitive with state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with its encoder-decoder architecture, is suitable only for offline ASR: it relies on an attention mechanism to learn alignments and encodes input audio bidirectionally. The high computational cost of Transformer decoding also limits its use in production streaming systems. To make the Transformer suitable for streaming ASR, we explore the Transducer framework as a streamable way to learn alignments. For audio encoding, we apply a unidirectional Transformer with interleaved convolution layers. The interleaved convolution layers model future context, which is important for performance. To reduce computational cost, we also use the interleaved convolution layers to gradually downsample the acoustic input. Moreover, we limit the length of the history context in self-attention to keep the computational cost of each decoding step constant. We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models. The performance is comparable to that of previously published streamable Transformer Transducers and strong hybrid streaming ASR systems, and is achieved with a smaller look-ahead window (140 ms), fewer parameters, and a lower frame rate.
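The core encoder ideas from the abstract, namely interleaved convolutional downsampling with a small look-ahead and unidirectional self-attention over a bounded history window, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names (ConvDownsample, ConvTransformerBlock, limited_history_mask) and the hyperparameters (kernel_size=3, lookahead=1, history=64, dim=512) are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDownsample(nn.Module):
    """Strided 1-D convolution that halves the frame rate.
    Asymmetric padding limits the look-ahead to `lookahead` future frames,
    so each conv stage contributes only a small fixed amount of latency."""
    def __init__(self, dim, kernel_size=3, lookahead=1):
        super().__init__()
        self.pad_left = kernel_size - 1 - lookahead   # past context
        self.pad_right = lookahead                    # bounded future context
        self.conv = nn.Conv1d(dim, dim, kernel_size, stride=2)

    def forward(self, x):                     # x: (batch, time, dim)
        x = x.transpose(1, 2)                 # -> (batch, dim, time)
        x = F.pad(x, (self.pad_left, self.pad_right))
        return self.conv(x).transpose(1, 2)   # time axis halved

def limited_history_mask(t, history, device=None):
    """Boolean attention mask: each frame may attend to itself and at most
    `history` past frames. True marks disallowed positions, so attention is
    unidirectional and per-step cost stays constant as the input grows."""
    idx = torch.arange(t, device=device)
    rel = idx[None, :] - idx[:, None]          # key position - query position
    return (rel > 0) | (rel < -history)        # mask future and distant past

class ConvTransformerBlock(nn.Module):
    """One encoder stage: convolutional downsampling followed by a
    unidirectional Transformer layer with bounded history context."""
    def __init__(self, dim=512, heads=8, history=64):
        super().__init__()
        self.downsample = ConvDownsample(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.history = history

    def forward(self, x):
        x = self.downsample(x)
        mask = limited_history_mask(x.size(1), self.history, x.device)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))
```

Stacking several such stages progressively lowers the frame rate, while the per-stage look-ahead bounds keep the encoder's total future-context window small; the paper reports an overall look-ahead of 140 ms. The bounded history mask is what keeps decoding cost constant per step, in contrast to full causal attention, whose cost grows with the length of the utterance.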

Related research

- Streaming Automatic Speech Recognition with the Transformer Model (01/08/2020): Encoder-decoder based sequence-to-sequence models have demonstrated stat...
- Compute Cost Amortized Transformer for Streaming ASR (07/05/2022): We present a streaming, Transformer-based end-to-end automatic speech re...
- Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss (02/07/2020): In this paper we present an end-to-end speech recognition model with Tra...
- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation (03/17/2021): End-to-end automatic speech recognition (ASR), unlike conventional ASR, ...
- Accelerating Transducers through Adjacent Token Merging (06/28/2023): Recent end-to-end automatic speech recognition (ASR) systems often utili...
- Sequentially Sampled Chunk Conformer for Streaming End-to-End ASR (11/21/2022): This paper presents an in-depth study on a Sequentially Sampled Chunk Co...
- Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution (10/07/2021): This paper improves the streaming transformer transducer for speech reco...
