Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition

06/12/2023
by   Belen Alastruey, et al.

Convolutional frontends are a typical choice for Transformer-based automatic speech recognition: they preprocess the spectrogram, reduce its sequence length, and combine local information in time and frequency in the same way. However, the width and height of an audio spectrogram carry different information. For example, due to reverberation as well as the articulatory system, the time axis has a clear left-to-right dependency. In contrast, vowels and consonants show very different patterns and occupy almost disjoint frequency ranges. We therefore hypothesize that global attention over frequencies is more beneficial than local convolution. We obtain a 2.4% relative word error rate reduction (rWERR) on a production-scale Conformer transducer by replacing its convolutional neural network frontend with the proposed F-Attention module on Alexa traffic. To demonstrate generalizability, we validate this on public LibriSpeech data with a long short-term memory-based listen, attend and spell architecture, obtaining 4.6% rWERR and demonstrating robustness to noisy conditions.
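To make the contrast with local convolution concrete, the following is a minimal NumPy sketch of global self-attention applied across the frequency bins of each time frame. It is an illustration of the general idea only, not the authors' F-Attention module: the function name `frequency_self_attention`, the scalar-to-vector embedding, the random (untrained) weights, the learned-position stand-in `pos`, and the size `d_model` are all assumptions made for this example, and the multi-view aspect named in the title is not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frequency_self_attention(spec, d_model=8, seed=0):
    """Single-head self-attention over the frequency axis of a spectrogram.

    spec: array of shape (T, F), e.g. T frames of F log-mel bins.
    Returns (out, weights) with shapes (T, F, d_model) and (T, F, F).
    All weights are random placeholders (assumption), not trained parameters.
    """
    rng = np.random.default_rng(seed)
    _, num_bins = spec.shape

    # Embed each scalar bin value into d_model dimensions and add a
    # per-frequency position embedding so bins are distinguishable.
    w_in = rng.normal(size=(1, d_model))
    pos = rng.normal(size=(num_bins, d_model))
    x = spec[..., None] * w_in + pos  # (T, F, d_model)

    # Query, key, and value projections.
    scale = np.sqrt(d_model)
    w_q = rng.normal(size=(d_model, d_model)) / scale
    w_k = rng.normal(size=(d_model, d_model)) / scale
    w_v = rng.normal(size=(d_model, d_model)) / scale
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Every frequency bin attends to every other bin in the same frame,
    # in contrast to a convolution kernel that only sees neighboring bins.
    scores = q @ k.transpose(0, 2, 1) / scale  # (T, F, F)
    weights = softmax(scores, axis=-1)
    out = weights @ v  # (T, F, d_model)
    return out, weights


if __name__ == "__main__":
    spec = np.random.default_rng(1).normal(size=(5, 10))  # 5 frames, 10 bins
    out, weights = frequency_self_attention(spec)
    print(out.shape, weights.shape)
```

In this sketch each time frame is processed independently, so no information is mixed along the time axis; only the frequency axis receives global attention. How the actual module handles time, downsampling, and multiple views is described in the full paper, not here.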

research
11/20/2019

On using 2D sequence-to-sequence models for speech recognition

Attention-based sequence-to-sequence models have shown promising results...
research
02/14/2021

Thank you for Attention: A survey on Attention-based Artificial Neural Networks for Automatic Speech Recognition

Attention is a very popular and effective mechanism in artificial neural...
research
07/23/2018

Automatic Speech Recognition for Humanitarian Applications in Somali

We present our first efforts in building an automatic speech recognition...
research
10/22/2022

Speech Emotion Recognition via an Attentive Time-Frequency Neural Network

Spectrogram is commonly used as the input feature of deep neural network...
research
06/30/2020

Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition

Acoustic models in real-time speech recognition systems typically stack ...
research
09/01/2022

Attention Enhanced Citrinet for Speech Recognition

Citrinet is an end-to-end convolutional Connectionist Temporal Classific...
research
10/19/2017

Visual Speech Recognition Using PCA Networks and LSTMs in a Tandem GMM-HMM System

Automatic visual speech recognition is an interesting problem in pattern...