Speech Enhancement using Separable Polling Attention and Global Layer Normalization followed with PReLU

by Dengfeng Ke, et al.

Single-channel speech enhancement is a challenging task in the speech community. Recently, various neural-network-based methods have been applied to speech enhancement. Among these models, PHASEN and T-GSA achieve state-of-the-art performance on the publicly available VoiceBank+DEMAND corpus. Both models reach a COVL score of 3.62. PHASEN achieves the highest CSIG score, 4.21, while T-GSA achieves the highest PESQ score, 3.06. However, both models are very large, and the trade-off between model performance and model size is hard to reconcile. In this paper, we introduce three techniques to shrink the PHASEN model and improve its performance. First, separable polling attention is proposed to replace the frequency transformation blocks in PHASEN. Second, global layer normalization followed by PReLU replaces batch normalization followed by ReLU. Finally, the BLSTM in PHASEN is replaced with a Conv2d operation, and the phase stream is simplified. With all these modifications, the PHASEN model shrinks from 33M parameters to 5M parameters, while performance on VoiceBank+DEMAND improves to a CSIG score of 4.30, a PESQ score of 3.07, and a COVL score of 3.73.
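
To make the second modification concrete, here is a minimal NumPy sketch of global layer normalization followed by PReLU. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the normalization axes, so the sketch assumes statistics are computed per example over the channel, time, and frequency axes of a (batch, channels, time, freq) feature map, with a per-channel scale and shift. The function names, the tensor layout, and the PReLU slope of 0.25 are all assumptions made for the example.

```python
import numpy as np

def global_layer_norm(x, gamma, beta, eps=1e-8):
    """Normalize each example over all non-batch axes (assumed layout:
    batch, channels, time, freq), then apply a per-channel scale and shift."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

def prelu(x, alpha):
    """PReLU: identity for non-negative inputs, slope `alpha` for negative ones."""
    return np.where(x >= 0, x, alpha * x)

def gln_prelu(x, gamma, beta, alpha=0.25):
    """Global layer normalization followed by PReLU (alpha is an assumed value)."""
    return prelu(global_layer_norm(x, gamma, beta), alpha)

# Toy feature map: 2 examples, 4 channels, 10 time frames, 8 frequency bins.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 10, 8))
gamma = np.ones(4)   # per-channel scale (learnable in a real model)
beta = np.zeros(4)   # per-channel shift (learnable in a real model)

normed = global_layer_norm(x, gamma, beta)
out = gln_prelu(x, gamma, beta)
```

Unlike batch normalization, the statistics here depend only on the example being processed, so the output for one utterance does not change with the rest of the batch. In a trained network, `gamma`, `beta`, and `alpha` would be learned parameters rather than fixed constants.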



Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

Speech enhancement is a demanding task in automated speech processing pi...

Speech Enhancement with Fullband-Subband Cross-Attention Network

FullSubNet has shown its promising performance on speech enhancement by ...

A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

We propose a multi-dimensional structured state space (S4) approach to s...

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

The discrepancy between the cost function used for training a speech enh...

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

Score-based generative models (SGMs) have recently shown impressive resu...

Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

This paper investigates different trade-offs between the number of model...

PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network

Time-frequency (T-F) domain masking is a mainstream approach for single-...
