Multi-channel Speech Enhancement with 2-D Convolutional Time-frequency Domain Features and a Pre-trained Acoustic Model

by Quandong Wang, et al.

We propose a multi-channel speech enhancement approach with a novel two-stage feature fusion method and a pre-trained acoustic model in a multi-task learning paradigm. In the first fusion stage, the time-domain and frequency-domain features are extracted separately. In the time domain, the multi-channel convolution sum (MCS) and the inter-channel convolution differences (ICDs) features are computed and then integrated with a 2-D convolutional layer, while in the frequency domain, the log-power spectra (LPS) features from both the original channels and the super-directive beamforming output are combined with another 2-D convolutional layer. To fully integrate the rich information of multi-channel speech, i.e., the time-frequency domain features and the array geometry, we apply a third 2-D convolutional layer in the second fusion stage to obtain the final convolutional features. Furthermore, we propose to use a fixed clean acoustic model, trained with the end-to-end lattice-free maximum mutual information criterion, to enforce that the enhanced output has the same distribution as the clean waveform, alleviating the over-estimation problem of the enhancement task and constraining distortion. On the Task 1 development dataset of the ConferencingSpeech 2021 challenge, PESQ improvements of 0.24 and 0.19 are attained compared to the official baseline and a recently proposed multi-channel separation method, respectively.
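The two-stage fusion described above can be sketched in NumPy with random stand-in features. This is a minimal illustration only: the feature shapes, 3x3 kernel sizes, and the naive `conv2d` helper are assumptions, and in the actual system the kernels are learned while the MCS, ICD, and LPS features are computed from real multi-channel waveforms rather than sampled at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernel):
    """Naive 'valid' 2-D convolution: x is (C, H, W), kernel is (C, kH, kW);
    input channels are summed, yielding one (H-kH+1, W-kW+1) feature map."""
    c, h, w = x.shape
    kc, kh, kw = kernel.shape
    assert kc == c
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * kernel)
    return out

n_ch, frames, bins = 4, 20, 16  # hypothetical: 4 mics, 20 frames, 16 freq bins

# Stage 1a (time domain): one MCS map plus (n_ch - 1) ICD maps, stacked along
# the channel axis and fused by a first 2-D convolutional layer.
mcs  = rng.standard_normal((1, frames, bins))
icds = rng.standard_normal((n_ch - 1, frames, bins))
time_feats = np.concatenate([mcs, icds], axis=0)            # (n_ch, F, B)
time_fused = conv2d(time_feats, rng.standard_normal((n_ch, 3, 3)))

# Stage 1b (frequency domain): per-channel LPS plus the LPS of the
# super-directive beamformer output, fused by a second 2-D conv layer.
lps_channels = rng.standard_normal((n_ch, frames, bins))
lps_beam     = rng.standard_normal((1, frames, bins))
freq_feats = np.concatenate([lps_channels, lps_beam], axis=0)  # (n_ch + 1, F, B)
freq_fused = conv2d(freq_feats, rng.standard_normal((n_ch + 1, 3, 3)))

# Stage 2: stack the two fused maps and apply a third 2-D conv layer to obtain
# the final convolutional features.
stacked = np.stack([time_fused, freq_fused], axis=0)        # (2, F-2, B-2)
final_feats = conv2d(stacked, rng.standard_normal((2, 3, 3)))
print(final_feats.shape)  # (16, 12): two 'valid' 3x3 convs shrink 20x16 twice
```

With 'valid' padding, each 3x3 convolution trims one frame/bin from each border, so the 20x16 input becomes 18x14 after stage 1 and 16x12 after stage 2; a learned implementation would typically pad to preserve the time-frequency resolution.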


