Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

07/24/2023
by Martin Lebourdais, et al.

Voice activity detection (VAD) and overlapped speech detection (OSD) are key pre-processing tasks for speaker diarization, and the final segmentation performance relies heavily on the robustness of these sub-tasks. Recent studies have shown that VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain and give little information about how well the systems generalize. This paper proposes a new, complete benchmark of different VAD and OSD models on multiple audio setups (single- and multi-channel) and speech domains (e.g. media, meetings). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that joint training of these two tasks yields F1-scores similar to those of two dedicated VAD and OSD systems while reducing the training cost. This single architecture can also be used for both single-channel and multichannel speech processing.
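To illustrate how one multi-class output can serve both tasks, here is a minimal sketch. It assumes the three classes are non-speech, single speaker, and overlap, which is a common formulation but is not spelled out in the abstract. The function names, the class encoding, and the frame-level F1 computation are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Assumed class encoding for a 3-class joint VAD+OSD output (illustrative only).
NON_SPEECH, SINGLE_SPEAKER, OVERLAP = 0, 1, 2


def count_to_class(num_active_speakers: int) -> int:
    """Map the number of active speakers in a frame to one of three classes."""
    if num_active_speakers <= 0:
        return NON_SPEECH
    if num_active_speakers == 1:
        return SINGLE_SPEAKER
    return OVERLAP


def split_decisions(classes: List[int]) -> Tuple[List[int], List[int]]:
    """Derive binary VAD and OSD decisions from per-frame 3-class labels."""
    vad = [1 if c != NON_SPEECH else 0 for c in classes]  # any speech at all
    osd = [1 if c == OVERLAP else 0 for c in classes]     # two or more speakers
    return vad, osd


def f1_score(reference: List[int], hypothesis: List[int]) -> float:
    """Frame-level F1 of the positive class for two equal-length binary sequences."""
    tp = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: speaker counts per frame, not real data.
speaker_counts = [0, 1, 2, 3, 1, 0]
labels = [count_to_class(n) for n in speaker_counts]
vad_ref, osd_ref = split_decisions(labels)
```

Under this encoding, a single classifier yields both decisions from one forward pass, which is one way to read the reduced training cost reported for the joint system.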

research
06/07/2023

Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features

Speaker diarization is the task of answering Who spoke and when? in an a...
research
12/18/2019

End-to-end training of time domain audio separation and recognition

The rising interest in single-channel multi-speaker speech separation sp...
research
09/24/2022

Joint Speech Activity and Overlap Detection with Multi-Exit Architecture

Overlapped speech detection (OSD) is critical for speech applications in...
research
04/12/2021

Improvement of Noise-Robust Single-Channel Voice Activity Detection with Spatial Pre-processing

Voice activity detection (VAD) remains a challenge in noisy environments...
research
03/30/2022

Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation

Speech distortions are a long-standing problem that degrades the perform...
research
09/15/2023

Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection

Existing deepfake speech detection systems lack generalizability to unse...
research
10/22/2020

Position-Agnostic Multi-Microphone Speech Dereverberation

Neural networks (NNs) have been widely applied in speech processing task...
