Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

01/15/2020
by Haoran Miao, et al.

Recently, the Transformer has gained success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose a Transformer-based online CTC/attention E2E ASR architecture, which contains a chunk self-attention encoder (chunk-SAE) and a monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks; to reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Secondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into the online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.
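The chunk-SAE idea above restricts self-attention so each frame only sees its own chunk, and the state-reuse variant additionally exposes cached frames from previous chunks instead of recomputing them. A minimal sketch of such a chunk-wise attention mask is shown below; the function name, chunk sizes, and the `left_chunks` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def chunk_attention_mask(n_frames: int, chunk_size: int, left_chunks: int = 0) -> np.ndarray:
    """Boolean mask where mask[i, j] = True if frame i may attend to frame j.

    Each frame attends within its own chunk. With left_chunks > 0, frames
    also see that many preceding chunks, loosely mimicking the state-reuse
    idea (cached history from earlier chunks). Illustrative sketch only.
    """
    mask = np.zeros((n_frames, n_frames), dtype=bool)
    for i in range(n_frames):
        chunk = i // chunk_size
        start = max(0, (chunk - left_chunks) * chunk_size)  # leftmost visible frame
        end = min(n_frames, (chunk + 1) * chunk_size)       # end of current chunk
        mask[i, start:end] = True
    return mask

# Isolated chunks: frame 5 (second chunk of size 4) cannot see frame 3.
iso = chunk_attention_mask(8, chunk_size=4, left_chunks=0)
# State reuse with one cached left chunk: frame 5 now sees frame 0.
reuse = chunk_attention_mask(8, chunk_size=4, left_chunks=1)
```

Such a mask would typically be passed to the encoder's self-attention (e.g. as an additive or boolean attention mask) so that online chunks can be processed as they arrive.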


