Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation

12/12/2017
by Fabian-Robert Stöter, et al.

The task of estimating the maximum number of concurrent speakers from single channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance or auditory scene classification. Building upon powerful machine learning methodology, we develop a Deep Neural Network (DNN) that estimates a speaker count. While DNNs efficiently map input representations to output targets, it remains unclear how best to handle the network output to infer integer source count estimates, as a discrete count estimate can be tackled either as a regression or as a classification problem. In this paper, we investigate this important design decision and also address complementary parameter choices such as the input representation. We evaluate a state-of-the-art DNN audio model based on a Bi-directional Long Short-Term Memory network architecture for speaker count estimation. Through experimental evaluations, we aim to identify the best overall strategy for the task and show results for five-second speech segments in mixtures of up to ten speakers.
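The design decision described above can be illustrated with a minimal sketch of the two ways an integer count might be read off a network output. This is not the authors' implementation: the function names, the clamping range, and the convention that class index k stands for k speakers are assumptions made for illustration. Only the upper limit of ten speakers comes from the abstract.

```python
import math

MAX_SPEAKERS = 10  # upper limit taken from the abstract ("up to ten speakers")


def count_from_regression(value, max_count=MAX_SPEAKERS):
    """Regression view: the network emits one real number.

    Round half up to the nearest integer, then clamp to [0, max_count].
    """
    rounded = math.floor(value + 0.5)
    return min(max(rounded, 0), max_count)


def count_from_classification(scores):
    """Classification view: the network emits one score per candidate count.

    Assumption: index k of `scores` corresponds to k speakers.
    The estimate is the index with the highest score.
    """
    return max(range(len(scores)), key=lambda k: scores[k])


# Hypothetical network outputs, for illustration only.
print(count_from_regression(3.4))
print(count_from_classification([0.1, 0.2, 0.6, 0.1]))
```

The regression decoder keeps the ordering of counts (predicting 4 for a true 5 is a smaller error than predicting 1), while the classification decoder treats every count as an unrelated label. That difference is the trade-off the paper sets out to evaluate.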

research
05/31/2023

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

In reverberant conditions with multiple concurrent speakers, each microp...
research
08/29/2017

Improving Source Separation via Multi-Speaker Representations

Lately there have been novel developments in deep learning towards solvi...
research
11/24/2020

Multi-Decoder DPRNN: High Accuracy Source Counting and Separation

We propose an end-to-end trainable approach to single-channel speech sep...
research
03/17/2020

High-Resolution Speaker Counting In Reverberant Rooms Using CRNN With Ambisonics Features

Speaker counting is the task of estimating the number of people that are...
research
07/31/2018

DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Human auditory cortex excels at selectively suppressing background noise...
research
05/19/2022

Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization

Majority of speech signals across different scenarios are never availabl...
research
01/06/2021

Multichannel CRNN for Speaker Counting: an Analysis of Performance

Speaker counting is the task of estimating the number of people that are...