Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition

02/18/2021
by Yosuke Kashiwagi, et al.

Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, accuracy is known to degrade when SA is applied to long sequence data. This is mainly due to the length mismatch between inference and training data, because training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture based on a variant of the Gaussian kernel, which is a shift-invariant kernel. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant, with relative position information embedded using a frame indexing technique. The proposed Gaussian kernelized SA was applied to connectionist temporal classification (CTC) based ASR. An experimental evaluation on the Corpus of Spontaneous Japanese (CSJ) and TEDLIUM 3 benchmarks shows that the proposed SA achieves a significant improvement in accuracy (e.g., from 24.0 WER to 6.0).
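To illustrate the core idea, below is a minimal NumPy sketch of Gaussian kernelized self-attention as the abstract describes it: a single weight matrix is shared between queries and keys, and the softmax similarity is replaced by a row-normalized Gaussian (shift-invariant) kernel. All names and the bandwidth parameter are hypothetical, and the frame indexing technique the paper uses to embed relative position is omitted; this is a sketch of the concept, not the authors' implementation.

import numpy as np

def gaussian_kernel_self_attention(x, w_shared, w_value, sigma=1.0):
    """x: (T, d_in) input frames; w_shared: (d_in, d_k) projection shared
    by queries and keys; w_value: (d_in, d_v) value projection.
    sigma is an assumed kernel bandwidth hyperparameter."""
    qk = x @ w_shared            # shared projection, so queries == keys
    v = x @ w_value
    # Squared Euclidean distance between every pair of frames: (T, T)
    diff = qk[:, None, :] - qk[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    # Gaussian (shift-invariant) kernel replaces the softmax similarity
    scores = np.exp(-dist2 / (2.0 * sigma ** 2))
    # Row-normalize so each frame's attention weights sum to one,
    # mirroring the normalized-kernel view of self-attention
    attn = scores / scores.sum(axis=-1, keepdims=True)
    return attn @ v

# Usage: 8 frames of 16-dim features attended in a 4-dim kernel space
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
out = gaussian_kernel_self_attention(
    x, rng.standard_normal((16, 4)), rng.standard_normal((16, 4)))
print(out.shape)  # (8, 4)

Because the kernel depends only on the difference between projected frames, the attention pattern is unchanged when the whole sequence is shifted, which is the property the paper exploits to handle sequences longer than the training segments.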

