Knowing What to Listen to: Early Attention for Deep Speech Representation Learning

09/03/2020
by Amirhossein Hajavi, et al.

Deep learning techniques have considerably improved speech processing in recent years. Speech representations extracted by deep learning models are used in a wide range of tasks such as speech recognition, speaker recognition, and speech emotion recognition. Attention models play an important role in improving deep learning models. However, current attention mechanisms are unable to attend to fine-grained information items. In this paper, we propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals. This model is capable of focusing on information items as small as frequency bins. We evaluate the proposed model on two popular tasks, speaker recognition and speech emotion recognition, using two widely used public datasets, VoxCeleb and IEMOCAP. The model is implemented on top of several prominent deep models as backbone networks to evaluate its impact on performance compared to the original networks and other related work. Our experiments show that adding FEFA to different CNN architectures consistently improves performance by substantial margins, even setting a new state of the art for the speaker recognition task. We also test our model under different levels of added noise and find that it is more robust and less sensitive to noise than the backbone networks.
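
To make the idea of attending to individual frequency bins concrete, here is a minimal NumPy sketch. It is not the FEFA architecture, which the abstract does not specify. It assumes a generic form of early attention: a sigmoid gate computed per frequency bin and per frame, applied to the spectrogram before it would reach a backbone network. The function name `frequency_bin_attention`, the parameters `w` and `b`, and the random toy input are all hypothetical and chosen for illustration only.

```python
import numpy as np


def frequency_bin_attention(spec, w, b):
    """Gate each frequency bin of a spectrogram (illustrative assumption, not FEFA).

    spec: (n_bins, n_frames) non-negative magnitude spectrogram
    w:    (n_bins, n_bins) hypothetical projection weights
    b:    (n_bins,) hypothetical bias
    """
    # Score every bin of every frame from all bins in that frame.
    scores = w @ spec + b[:, None]            # (n_bins, n_frames)
    # Sigmoid maps scores to gates strictly between 0 and 1.
    weights = 1.0 / (1.0 + np.exp(-scores))
    # Element-wise gating: each bin is scaled individually.
    return weights * spec, weights


# Toy input: 8 frequency bins, 5 time frames (random stand-in for real audio).
rng = np.random.default_rng(0)
n_bins, n_frames = 8, 5
spec = np.abs(rng.standard_normal((n_bins, n_frames)))
w = 0.1 * rng.standard_normal((n_bins, n_bins))
b = np.zeros(n_bins)

attended, weights = frequency_bin_attention(spec, w, b)
```

In this sketch the gated spectrogram keeps the shape of the input, so it could be passed to any CNN backbone in place of the raw spectrogram. In a trained model, `w` and `b` would be learned jointly with the backbone rather than drawn at random.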
