Similarity and Content-based Phonetic Self Attention for Speech Recognition

03/19/2022
by   Kyuhong Shim, et al.

Transformer-based speech recognition models have achieved great success thanks to the self-attention (SA) mechanism, which utilizes every frame in the feature extraction process. In particular, SA heads in lower layers capture various phonetic characteristics through the query-key dot product, which computes the pairwise relationship between frames. In this paper, we propose a variant of SA that extracts more representative phonetic features. The proposed phonetic self-attention (phSA) combines two different types of phonetic attention: one similarity-based and the other content-based. In short, similarity-based attention utilizes the correlation between frames, while content-based attention considers each frame on its own, unaffected by the others. We identify which parts of the original dot product correspond to these two attention patterns and improve each part through simple modifications. Our experiments on phoneme classification and speech recognition show that replacing SA with phSA in the lower layers improves recognition performance without increasing latency or parameter count.
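To make the similarity/content split concrete, below is a minimal PyTorch sketch of one possible decomposition of the attention logits: a pairwise query-key term plus a per-frame score computed from each key alone. This is an illustrative assumption, not the authors' exact phSA formulation; the `PhoneticSelfAttention` class, its `content` projection, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhoneticSelfAttention(nn.Module):
    """Sketch of a single attention head whose logits combine:
      * similarity-based term: q_i . k_j (pairwise correlation between frames)
      * content-based term:    w . k_j   (depends on frame j alone)
    Hypothetical simplification of the phSA idea described in the abstract.
    """

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)
        # Learned vector producing a per-frame (content) score from each key.
        self.content = nn.Linear(d_head, 1)
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Similarity term: scaled pairwise query-key dot product, (B, T, T).
        sim = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        # Content term: score of each key frame on its own, broadcast over queries.
        cont = self.content(k).transpose(-1, -2)  # (B, 1, T)
        attn = F.softmax(sim + cont, dim=-1)
        return torch.matmul(attn, v)

# Usage: a batch of 2 utterances, 100 frames, 256-dim features.
x = torch.randn(2, 100, 256)
out = PhoneticSelfAttention(256, 64)(x)  # (2, 100, 64)
```

Note that the content term adds the same bias to every query row, so it only reweights which key frames receive attention, independent of the querying frame; this mirrors the abstract's distinction between the two attention patterns.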

