High-resolution embedding extractor for speaker diarisation

11/08/2022
by Hee-Soo Heo, et al.

Speaker embedding extractors significantly influence the performance of clustering-based speaker diarisation systems. Conventionally, only one embedding is extracted from each speech segment. However, because of the sliding-window approach, a segment can easily contain two or more speakers when it spans a speaker change point. This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE), which extracts multiple high-resolution embeddings from each speech segment. HEE consists of a feature-map extractor and an enhancer, where the enhancer, built on the self-attention mechanism, is the key to success. The enhancer of HEE replaces the aggregation process: instead of applying a global pooling layer, it combines relative information for each frame via attention that leverages the global context. Each of the extracted dense frame-level embeddings can represent a speaker, so multiple speakers in a segment can be represented by different frame-level features. We also propose a training framework that uses artificially generated mixture data to train the proposed HEE. Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least a 10% improvement on each evaluation set, except for one dataset in which, according to our analysis, rapid speaker changes are rare.
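The contrast between global pooling and an attention-based enhancer can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes a toy segment of hand-made 4-dimensional frame features and a single-head scaled dot-product self-attention with no learned projections, and the names `mean_pool`, `self_attention`, `spk_a` and `spk_b` are hypothetical. It only shows why pooling collapses a two-speaker segment into one blended vector, whereas attention returns one embedding per frame.

```python
import math
import random


def mean_pool(frames):
    """Global average pooling: T frame vectors -> one segment-level vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention: T frame vectors -> T frame vectors.

    Simplifying assumption: queries, keys and values are the raw frames
    (no learned projection matrices, single head).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    enhanced = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        enhanced.append(
            [sum(w * value[d] for w, value in zip(weights, frames)) for d in range(dim)]
        )
    return enhanced


# Toy segment (assumed data): three frames of speaker A, then three of speaker B.
random.seed(0)
spk_a = [4.0, 0.0, 0.0, 0.0]
spk_b = [0.0, 4.0, 0.0, 0.0]


def noisy(vector):
    """Add small Gaussian noise so frames of one speaker are not identical."""
    return [x + random.gauss(0.0, 0.1) for x in vector]


segment = [noisy(spk_a) for _ in range(3)] + [noisy(spk_b) for _ in range(3)]

pooled = mean_pool(segment)        # one vector that blends both speakers
enhanced = self_attention(segment) # six vectors, each dominated by one speaker

print("pooled:", [round(x, 2) for x in pooled])
for i, frame in enumerate(enhanced):
    print("frame", i, [round(x, 2) for x in frame])
```

In this toy setting the pooled vector sits between the two speakers, while each attended frame stays close to the speaker that produced it, which is the property a clustering back-end would need in order to separate speakers within a single segment.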


research
04/06/2019

Self-supervised speaker embeddings

Contrary to i-vectors, speaker embeddings such as x-vectors are incapabl...
research
06/01/2023

A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures

We introduce a monaural neural speaker embeddings extractor that compute...
research
05/16/2021

X-Vectors with Multi-Scale Aggregation for Speaker Diarization

Speaker diarization is the process of labeling different speakers in a s...
research
04/03/2020

Neural i-vectors

Deep speaker embeddings have been demonstrated to outperform their gener...
research
06/20/2019

Unleashing the Unused Potential of I-Vectors Enabled by GPU Acceleration

Speaker embeddings are continuous-value vector representations that allo...
research
05/15/2020

Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

Identifying multiple speakers without knowing where a speaker's voice is...
research
03/30/2022

Multi-target Filter and Detector for Unknown-number Speaker Diarization

A strong representation of a target speaker can aid in extracting import...
