Frequency and temporal convolutional attention for text-independent speaker recognition

10/16/2019
by Sarthak Yadav, et al.

The majority of recent approaches to text-independent speaker recognition apply attention or similar techniques to aggregate frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose methods of convolutional attention for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system uses convolutional block attention modules (CBAMs) [1], appropriately modified to accommodate spectrogram inputs. The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb [2, 3] speaker verification benchmark, and our best model achieves an equal error rate of 2.031%, outperforming the existing state-of-the-art result by a significant margin. For a more thorough assessment of the effects of frequency and temporal attention in real-world conditions, we conduct ablation experiments that randomly drop frequency bins and temporal frames from the input spectrograms, and conclude that simultaneously modelling temporal and frequency attention, rather than modelling either one alone, translates to better real-world performance.
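
The abstract does not spell out how CBAM's spatial attention is adapted to spectrograms. The following is a minimal NumPy sketch of one plausible reading, not the authors' implementation: a feature map of shape (channels, frequency, time) is average- and max-pooled over every axis except the attended one, the two 1-D descriptors pass through a small convolution, and a sigmoid yields per-bin weights (frequency attention) or per-frame weights (temporal attention). The sequential frequency-then-time ordering, the kernel size of 7, and the helper names `axis_attention`, `freq_temporal_attention` and `drop_bins_and_frames` are all hypothetical. The last helper mimics the ablation by zeroing randomly chosen bins and frames; how the paper actually drops them is an assumption here.

```python
import numpy as np


def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def _conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; output length equals len(x)."""
    k = len(kernel)
    left = k // 2
    xp = np.pad(x, (left, k - 1 - left))
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(x))])


def axis_attention(feat, kernel, axis):
    """Attend over one axis of a (C, F, T) feature map (hypothetical CBAM-style design).

    axis=1 gives frequency attention, axis=2 gives temporal attention.
    kernel has shape (2, k): one row for the average-pooled descriptor,
    one row for the max-pooled descriptor.
    """
    other = 2 if axis == 1 else 1
    # Pool over channels and the non-attended axis, leaving a 1-D profile.
    avg = feat.mean(axis=(0, other))
    mx = feat.max(axis=(0, other))
    logits = _conv1d_same(avg, kernel[0]) + _conv1d_same(mx, kernel[1])
    weights = _sigmoid(logits)  # one weight per bin or frame, in (0, 1)
    shape = [1, 1, 1]
    shape[axis] = -1
    return feat * weights.reshape(shape), weights


def freq_temporal_attention(feat, k_freq, k_time):
    """Assumed ordering: frequency attention first, then temporal attention."""
    x, w_f = axis_attention(feat, k_freq, axis=1)
    x, w_t = axis_attention(x, k_time, axis=2)
    return x, w_f, w_t


def drop_bins_and_frames(spec, n_freq, n_time, rng):
    """Zero n_freq random frequency bins and n_time random frames of an (F, T) spectrogram."""
    out = spec.copy()
    out[rng.choice(spec.shape[0], size=n_freq, replace=False), :] = 0.0
    out[:, rng.choice(spec.shape[1], size=n_time, replace=False)] = 0.0
    return out


# Usage with random kernels (these would be learned parameters in a real model).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 40, 100))  # (channels, frequency bins, frames)
k_f = 0.1 * rng.standard_normal((2, 7))
k_t = 0.1 * rng.standard_normal((2, 7))
out, w_f, w_t = freq_temporal_attention(feat, k_f, k_t)
masked = drop_bins_and_frames(feat[0], n_freq=3, n_time=5, rng=rng)
```

Keeping the two attention maps separate means each one is a single vector over its own axis, which is what lets the frequency and temporal contributions be studied independently, as in the ablation described above.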

research
07/31/2018

Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

In this paper a novel cross-device text-independent speaker verification...
research
09/12/2018

Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model

In this paper, we propose a Convolutional Neural Network (CNN) based spe...
research
01/14/2020

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

This paper presents an improved deep embedding learning method based on ...
research
07/10/2022

Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning

Recently, attention mechanisms have been applied successfully in neural ...
research
03/28/2019

Deep Neural Network Embeddings with Gating Mechanisms for Text-Independent Speaker Verification

In this paper, gating mechanisms are applied in deep neural network (DNN...
research
05/14/2020

Large Scale Font Independent Urdu Text Recognition System

OCR algorithms have received a significant improvement in performance re...
research
04/03/2022

Selective Kernel Attention for Robust Speaker Verification

Recent state-of-the-art speaker verification architectures adopt multi-s...