Decomposed Temporal Dynamic CNN: Efficient Time-Adaptive Network for Text-Independent Speaker Verification Explained with Speaker Activation Map

03/29/2022
by   Seong-Hu Kim, et al.
0

Temporal dynamic models for text-independent speaker verification extract consistent speaker information regardless of phonemes by using temporal dynamic CNN (TDY-CNN) in which kernels adapt to each time bin. However, TDY-CNN shows limitations that the model is too large and does not guarantee the diversity of adaptive kernels. To address these limitations, we propose decomposed temporal dynamic CNN (DTDY-CNN) that makes adaptive kernel by combining static kernel and dynamic residual based on matrix decomposition. The baseline model using DTDY-CNN maintained speaker verification performance while reducing the number of model parameters by 35 detailed behaviors of temporal dynamic models on extraction of speaker information was explained using speaker activation maps (SAM) modified from gradient-weighted class activation mapping (Grad-CAM). In DTDY-CNN, the static kernel activates voiced features of utterances, and the dynamic residual activates unvoiced high-frequency features of phonemes. DTDY-CNN effectively extracts speaker information from not only formant frequencies and harmonics but also detailed unvoiced phonemes' information, thus explaining its outstanding performance on text-independent speaker verification.

READ FULL TEXT

page 3

page 4

research
12/13/2019

Short-duration Speaker Verification (SdSV) Challenge 2020: the Challenge Evaluation Plan

This document describes task1 of the Short-Duration Speaker Verification...
research
05/26/2017

Text-Independent Speaker Verification Using 3D Convolutional Neural Networks

In this paper, a novel method using 3D Convolutional Neural Network (3D-...
research
08/25/2020

Few Shot Text-Independent speaker verification using 3D-CNN

Facial recognition system is one of the major successes of Artificial in...
research
02/03/2022

MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances

The time delay neural network (TDNN) represents one of the state-of-the-...
research
08/30/2021

RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification

The convolutional neural network (CNN) based approaches have shown great...
research
04/28/2021

Personalized Keyphrase Detection using Speaker and Environment Information

In this paper, we introduce a streaming keyphrase detection system that ...

Please sign up or login with your details

Forgot password? Click here to reset