Is CQT more suitable for monaural speech separation than STFT? an empirical study

02/02/2019
by Ziqiang Shi, et al.

The short-time Fourier transform (STFT) is the front end of many successful monaural speech separation methods, such as deep clustering (DPCL), permutation invariant training (PIT), and their variants. However, the STFT frequency axis is linear, whereas the frequency resolution of the human auditory system is nonlinear. In this work we propose, and study empirically, an alternative front end, the constant Q transform (CQT), which better approximates the frequency resolving power of the human auditory system. The upper bound on the signal-to-distortion ratio (SDR) of ideal speech separation based on the CQT's ideal ratio mask (IRM) is higher than that based on the STFT. Under the same experimental setting on the WSJ0-2mix corpus, we examined the performance of CQT with different back ends, including the original DPCL, utterance-level PIT, and some of their variants. All CQT-based methods outperform their STFT-based counterparts, with an average gain of 0.4 dB in SDR improvement.
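The contrast between the two front ends, and the mask behind the upper-bound comparison, can be illustrated with a minimal sketch. It is not the authors' implementation: the sample rate, FFT size, minimum frequency, bin counts, and function names are all assumptions chosen for illustration, and the ratio mask shown is one common definition (each source's magnitude divided by the sum of all source magnitudes), which may differ from the paper's exact formulation.

```python
import numpy as np


def stft_bin_frequencies(n_fft, sample_rate):
    """Centre frequencies of STFT bins: equally spaced in Hz (linear axis)."""
    return np.arange(n_fft // 2 + 1) * sample_rate / n_fft


def cqt_bin_frequencies(f_min, n_bins, bins_per_octave):
    """Centre frequencies of CQT bins: geometrically spaced (log axis)."""
    return f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)


def ideal_ratio_mask(source_mags, eps=1e-8):
    """One common IRM definition (assumed here, not taken from the paper).

    source_mags has shape (n_sources, n_freq, n_frames) and holds the
    magnitude of each clean source in some time-frequency representation.
    """
    source_mags = np.asarray(source_mags, dtype=float)
    return source_mags / (source_mags.sum(axis=0, keepdims=True) + eps)


# Illustrative settings only (assumed, not from the paper).
stft_freqs = stft_bin_frequencies(n_fft=256, sample_rate=8000)
cqt_freqs = cqt_bin_frequencies(f_min=32.0, n_bins=84, bins_per_octave=12)

# STFT bins differ by a constant number of Hz; CQT bins by a constant ratio.
stft_step_hz = np.diff(stft_freqs)
cqt_ratio = cqt_freqs[1:] / cqt_freqs[:-1]

# Toy magnitudes for two sources, standing in for real spectrograms.
rng = np.random.default_rng(0)
toy_mags = rng.random((2, 5, 4)) + 0.1
toy_masks = ideal_ratio_mask(toy_mags)
```

With these assumed settings the STFT bins are a fixed 31.25 Hz apart, while each CQT bin is a semitone (a factor of 2^(1/12)) above the previous one, so the CQT spends more of its bins on low frequencies. The masks for the two toy sources sum to approximately one in every time-frequency cell.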

