Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition

09/10/2021
by Rong Gong, et al.

When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results. Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain. Experiments conducted on multichannel playback test data show that the SACC achieved a 9.3% word error rate reduction (WERR) compared to a state-of-the-art fixed beamformer-based frontend, both jointly optimized with a ContextNet-based ASR backend. We also demonstrate the connection between the SACC and traditional beamformers, and analyze the intermediate outputs of the SACC.
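
To make the idea of combining channels in the magnitude spectral domain more concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the function name `attention_channel_combine`, the projection matrices `w_q`, `w_k`, `w_v`, the `log1p` compression, and all dimensions are assumptions chosen for illustration, and the weights are random rather than learned jointly with an ASR backend. The sketch applies scaled dot-product self-attention across channels for each time frame, turns the result into one softmax weight per channel, and forms a weighted sum of the channel magnitudes.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_channel_combine(mag, w_q, w_k, w_v):
    """Combine multichannel magnitude spectrograms with self-attention.

    mag: (C, T, F) non-negative magnitudes (channels, frames, frequency bins).
    w_q, w_k: (F, D) hypothetical query/key projections.
    w_v: (F, 1) hypothetical value projection giving one scalar per channel.
    Returns the combined spectrogram (T, F) and channel weights (T, C).
    """
    # Assumption: log compression before the projections.
    feats = np.log1p(mag).transpose(1, 0, 2)           # (T, C, F)
    q = feats @ w_q                                     # (T, C, D)
    k = feats @ w_k                                     # (T, C, D)
    v = feats @ w_v                                     # (T, C, 1)

    # Scaled dot-product attention across channels, per time frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (T, C, C)
    attn = softmax(scores, axis=-1)

    # One logit per channel, normalized so the weights sum to 1 per frame.
    channel_logits = (attn @ v)[..., 0]                 # (T, C)
    weights = softmax(channel_logits, axis=-1)          # (T, C)

    # Weighted sum over channels in the magnitude domain.
    combined = np.einsum("tc,ctf->tf", weights, mag)    # (T, F)
    return combined, weights


# Toy usage with random data standing in for STFT magnitudes.
rng = np.random.default_rng(0)
C, T, F, D = 4, 10, 257, 16
mag = np.abs(rng.standard_normal((C, T, F)))
w_q = rng.standard_normal((F, D)) * 0.01
w_k = rng.standard_normal((F, D)) * 0.01
w_v = rng.standard_normal((F, 1)) * 0.01
combined, weights = attention_channel_combine(mag, w_q, w_k, w_v)
```

Because the channel weights in this sketch are non-negative and sum to one in every frame, each output bin is a convex combination of the corresponding input bins across channels. In a trained system the projections would be learned parameters optimized together with the recognizer.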

