
Spatial-DCCRN: DCCRN Equipped with Frame-Level Angle Feature and Hybrid Filtering for Multi-Channel Speech Enhancement

10/17/2022
by   Shubo Lv, et al.

Recently, multi-channel speech enhancement has drawn much interest due to the use of spatial information to distinguish target speech from interfering signals. To make full use of spatial information and neural network based mask estimation, we propose a multi-channel denoising neural network, Spatial-DCCRN. First, we extend S-DCCRN to the multi-channel scenario, performing a cascaded sub-channel and full-channel processing strategy that models different channels separately. Moreover, instead of only adopting the multi-channel spectrum or concatenating the first channel's magnitude and IPD as the model's input, we apply an angle feature extraction (AFE) module to extract frame-level angle feature embeddings, which helps the model perceive spatial information explicitly. Finally, since residual noise is more severe when noise and speech occupy the same time-frequency (TF) bin, we design a masking and mapping filtering method to replace the traditional filter-and-sum operation, cascading coarse denoising, dereverberation and residual noise suppression. The proposed Spatial-DCCRN surpasses EaBNet, FaSNet and several other competitive models on the L3DAS22 Challenge dataset. Beyond the 3D scenario, Spatial-DCCRN also outperforms the state-of-the-art (SOTA) MIMO-UNet by a large margin on multiple evaluation metrics on the multi-channel ConferencingSpeech2021 Challenge dataset. Ablation studies further demonstrate the effectiveness of each contribution.


1 Introduction

Recently, with the tremendous success of deep learning, speech enhancement has been formulated as a supervised learning problem [28, 30]. Meanwhile, multi-channel speech enhancement is gaining increasing interest due to the utilization of spatial information to distinguish target speech from interfering signals [34, 8, 29, 6]. Challenges such as the L3DAS22 challenge [10] and the ConferencingSpeech2021 challenge [21] have recently been organized to promote research on multi-channel speech processing.

A typical strategy is to combine DNNs with traditional beamforming techniques. Specifically, a TF-mask is predicted by a DNN and then used to determine minimum variance distortionless response (MVDR) [31] or generalized eigenvalue (GEV) [11] beamforming weights. However, as the second stage is purely based on statistical theory and is usually decoupled from the mask estimation, the mask estimation error may heavily hamper the subsequent beamforming results [13]. More recently, neural beamformer methods, including FaSNet [15], EaBNet [13] and MIMO-UNet [24], have shown outstanding performance. They employ DNNs to estimate beamforming filters and apply a filter-and-sum operation to obtain a single-channel enhanced complex spectrum or waveform.

In the frequency domain, EaBNet [13] designs two core modules: the embedding module (EM), which learns a 3D spectral and spatial embedding tensor, and the beamforming module (BM), which estimates the beamforming weights to implement the filter-and-sum operation. MIMO-UNet [24] uses a convolutional U-Net to estimate beamforming filters and applies the filter-and-sum operation to estimate a single-channel enhanced complex spectrum. Working in the time domain, FaSNet [15] estimates linear spatial filters for filter-and-sum beamforming.
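To make the operation these neural beamformers share concrete, the snippet below is a minimal sketch (not the authors' code) of a frequency-domain filter-and-sum: once a network has produced per-channel filters, each channel of the noisy spectrum is multiplied by its filter and the results are summed over microphones. Shapes and names are illustrative assumptions.

```python
# Minimal sketch of frequency-domain filter-and-sum beamforming.
import torch

def filter_and_sum(filters: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
    """filters, spec: complex tensors of shape [B, M, F, T] (batch, mics, freq, time).
    Returns a single-channel enhanced complex spectrum of shape [B, F, T]."""
    return (filters * spec).sum(dim=1)  # per-channel filtering, then sum over mics

# Toy usage with random placeholders
B, M, F, T = 1, 4, 257, 100
spec = torch.randn(B, M, F, T, dtype=torch.cfloat)     # noisy multi-channel STFT
filters = torch.randn(B, M, F, T, dtype=torch.cfloat)  # DNN-estimated filters (placeholder)
enhanced = filter_and_sum(filters, spec)               # [B, F, T]
```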

Spatial information is of vital importance in the multi-channel scenario. However, without any explicit spatial features as input, the above approaches only adopt the multi-channel spectrum as the model input and let the network learn spatial information implicitly. Other strategies adopt the reference channel's magnitude and the inter-channel phase difference (IPD) as the network input [9]. However, using only phase information to reflect channel correlation leads to apparent information loss. Furthermore, the above methods mostly estimate a group of filters and apply a filter-and-sum operation; when noise and speech occupy the same TF-bin, the residual noise remains difficult to suppress.

In this paper, to address the above problems, we propose Spatial-DCCRN for multi-channel speech enhancement. The contributions of this work are three-fold, summarized as follows.

  • We extend the cascaded sub-band and full-band processing strategy of super wide band DCCRN (S-DCCRN) [17] to the multi-channel scenario, executing sub-channel and full-channel processing to benefit from both local and global channel information. Different from the original S-DCCRN, the LSTM in the sub/full-channel DCCRN accepts the concatenation of the angle feature embedding and the encoder's output. In addition, the complex mask is replaced with the masking and mapping filter (MMF), aiming at denoising and dereverberation simultaneously. Compared with monaural processing, the sub/full-channel processing strategy in the multi-channel scenario can exploit spatial information meticulously and model different channels separately.

  • We design an angle feature extraction (AFE) module, which only adopts the cosIPD feature as input, to extract frame-level angle features. With the help of convolution layers and a dense block, the channel- and time-correlation information can be effectively modeled by the AFE module, which helps the network perceive spatial information explicitly.

  • We design a masking and mapping filtering (MMF) method to replace the typical filter-and-sum operation. In this cascading strategy, the masking operation targets dereverberation and coarse denoising, while the mapping operation is designed to further remove residual noise. Specifically, we apply the sub/full-channel DCCRN to estimate the mapping filters. The masking filters are then estimated by a group of conv3d blocks that receive the stack of the noisy spectrum and the mapping filters as input, since the mapping filters contain the masking filters. With the assistance of conv3d, the magnitude mask of the target channel can be estimated from the input of the MMF module. After applying the masking filters to the magnitude, the mapping filters are employed on the coarse real/imaginary parts to acquire the enhanced speech.

Combining the contributions above, Spatial-DCCRN surpasses several SOTA models, including FaSNet [15] and EaBNet [13], and obtains a metric score of 0.956, composed of STOI and WER, ranking fourth on the L3DAS22 challenge dataset. Moreover, in experiments on the ConferencingSpeech2021 challenge dataset, our system outperforms the baseline by a large margin and also surpasses MIMO-UNet [24], which ranked first in that challenge.

2 Proposed System

2.1 Signal Model

Assume $y_m(n)$, with $m \in \{1, \dots, M\}$, denotes the time-domain noisy and reverberant speech signal at the $m$-th microphone. The signal model of multi-channel speech enhancement in the short-time Fourier transform (STFT) domain can be given by:

$Y_m(t, f) = A_m(f)\,S(t, f) + N_m(t, f),$  (1)

where $\{Y_m(t,f), S(t,f), N_m(t,f)\}$ denote the noisy signal, clean signal and noise respectively, with frequency index $f$ and time index $t$. Furthermore, $A_m(f)$ denotes the relative transfer function (RTF) of the source speech.

2.2 Multi-Channel DCCRN

Previously, S-DCCRN [17] was proposed for super wide-band speech enhancement. S-DCCRN is equipped with a cascaded sub-band and full-band (SAF) processing module, aiming at benefiting from both local and global frequency information processing. In detail, the SAF module consists of cascaded sub/full-band DCCRNs, which substitute the complex convolution of the original DCCRN with group complex convolution. In this paper, inspired by this idea of local and global modeling, we extend S-DCCRN to the multi-channel scenario by applying sub-channel and full-channel processing to fully utilize the spatial information. The overall architecture of the proposed Spatial-DCCRN is shown in Fig. 1. Similar to the original S-DCCRN, the learnable spectrum compression (LSC) module, which consists of a group of trainable compression ratios, is applied to adjust the energy of different frequency bands. The motivation for this design is that the higher frequency bands of far-field audio are likely to have low-energy components. Moreover, the complex feature encoder/decoder modules (CFE/CFD) are adopted to extract information from the multi-channel complex spectrum. With the help of convolution layers and a dense block [20], the CFE/CFD blocks can refine channel-correlation and time-correlation information. More details of LSC and CFE/CFD can be found in [17].
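As a rough illustration of the LSC idea, the sketch below applies trainable per-band compression exponents to the magnitude of the input spectrum while preserving its phase. The band-to-bin mapping, initialization and module name are assumptions; the original S-DCCRN implementation may differ in detail.

```python
# Illustrative sketch of learnable spectrum compression (LSC), under the assumption
# that a group of trainable ratios compresses the magnitude per frequency band.
import torch
import torch.nn as nn

class LearnableSpectrumCompression(nn.Module):
    def __init__(self, num_bands: int, freq_bins: int, init_ratio: float = 0.5):
        super().__init__()
        self.ratios = nn.Parameter(torch.full((num_bands,), init_ratio))  # trainable ratios
        self.num_bands = num_bands
        self.freq_bins = freq_bins

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: complex STFT of shape [B, M, F, T]
        mag, phase = spec.abs().clamp_min(1e-8), spec.angle()
        # map every frequency bin to one of the bands, then gather its exponent
        band = torch.arange(self.freq_bins, device=spec.device) * self.num_bands // self.freq_bins
        alpha = self.ratios[band].view(1, 1, -1, 1)          # [1, 1, F, 1]
        comp_mag = mag.pow(alpha)                            # per-band magnitude compression
        return torch.polar(comp_mag, phase)                  # back to a complex spectrum
```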

Different from the original S-DCCRN, we substitute the complex masking strategy with our proposed masking and mapping filtering (MMF) method, aiming at denoising and dereverberation simultaneously, as reverberation is a key issue for far-field speech. Furthermore, we design a learnable angle feature extraction (AFE) module to extract frame-level angle features. A modified LSTM layer of the sub/full-channel DCCRN then accepts the concatenation of the angle feature embedding and the encoder's output as input for joint modeling. With the help of the AFE module, the model can exploit spatial information explicitly. The AFE and MMF modules of Spatial-DCCRN are introduced in detail in the following.

2.3 Angle Feature Extraction

The angle feature is critical for multi-channel enhancement. A widespread strategy is to adopt cosIPDs and the reference channel's magnitude as input features. However, concatenating these features may increase the modeling difficulty, since the neural network has to model angle features and magnitude features simultaneously. Another method is to employ a module to estimate the direction of arrival (DOA) of the source speech and noise. However, due to the complex acoustic conditions in real scenarios, it is extremely difficult to estimate such DOA features accurately. To this end, we propose a block that only adopts the cosIPD features as input to estimate frame-level angle features.
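For reference, the cosIPD feature itself can be computed as below; this is a minimal sketch, and the choice of taking IPDs between a reference microphone and every other microphone is our assumption, since the paper does not specify the pair selection.

```python
# Minimal sketch of the cosIPD input feature for the AFE module.
import torch

def cos_ipd(spec: torch.Tensor, ref: int = 0) -> torch.Tensor:
    """spec: complex STFT of shape [B, M, F, T]. Returns cosIPDs of shape [B, M-1, F, T]."""
    phase = spec.angle()
    ref_phase = phase[:, ref:ref + 1]                              # [B, 1, F, T]
    others = torch.cat([phase[:, :ref], phase[:, ref + 1:]], dim=1)
    return torch.cos(others - ref_phase)                           # inter-channel phase differences
```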

As shown in Fig 2, the angle feature extraction module is similar to CFE. We employ conv2d with a kernel size of 1 to extract high-dimensional information. Then a dilated dense blockwhose depth is 2 is used to capture long-term contextual angle features from time scale. Finally, a conv2d is adopted to extract local angle features. LayerNorm and PReLU activation are placed after each convolution layer. Afterwards, the LSTM layer of sub/full-channel DCCRN receives the frame-level time-variant angle feature embedding as the input for temporal dependency modeling. With the help of the angle feature extraction module, the complex spectrum enhancement module can apparently perceive the angle information of every frame.
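The sketch below mirrors that description (1x1 conv2d, a dilated dense block of depth 2, a final conv2d, each followed by LayerNorm and PReLU). It is only a sketch: the kernel sizes, dilation pattern, padding and dense connectivity are assumptions, while the hidden channel count of 16 follows the configuration given in Sec. 3.2.

```python
# Sketch of the AFE module, assuming tensors shaped [B, C, F, T].
import torch
import torch.nn as nn

class ChannelFreqLayerNorm(nn.Module):
    """LayerNorm over (channel, freq) dims for [B, C, F, T] tensors (layout assumption)."""
    def __init__(self, channels: int, freq_bins: int):
        super().__init__()
        self.norm = nn.LayerNorm([channels, freq_bins])

    def forward(self, x):
        x = x.permute(0, 3, 1, 2)          # [B, T, C, F]
        x = self.norm(x)                   # normalize over (C, F)
        return x.permute(0, 2, 3, 1)       # back to [B, C, F, T]

class AngleFeatureExtraction(nn.Module):
    """1x1 conv -> dilated dense block (depth 2) -> conv, each with LayerNorm + PReLU."""
    def __init__(self, in_ch: int, hidden: int = 16, freq_bins: int = 257):
        super().__init__()
        def block(cin, cout, kernel, dilation, padding):
            return nn.Sequential(nn.Conv2d(cin, cout, kernel, dilation=dilation, padding=padding),
                                 ChannelFreqLayerNorm(cout, freq_bins), nn.PReLU())
        self.inp = block(in_ch, hidden, 1, 1, 0)                          # high-dimensional projection
        self.dense1 = block(hidden, hidden, (1, 3), (1, 1), (0, 1))       # time-dilated dense layer 1
        self.dense2 = block(2 * hidden, hidden, (1, 3), (1, 2), (0, 2))   # time-dilated dense layer 2
        self.out = block(3 * hidden, hidden, (1, 3), (1, 1), (0, 1))      # local angle features

    def forward(self, cos_ipds: torch.Tensor) -> torch.Tensor:
        # cos_ipds: [B, M-1, F, T]
        x = self.inp(cos_ipds)
        d1 = self.dense1(x)
        d2 = self.dense2(torch.cat([x, d1], dim=1))
        return self.out(torch.cat([x, d1, d2], dim=1))    # frame-level angle feature embedding

# Toy usage: 4-mic input -> 3 cosIPD channels
afe = AngleFeatureExtraction(in_ch=3)
embedding = afe(torch.randn(1, 3, 257, 100))               # [1, 16, 257, 100]
```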

Figure 1: Network structure of the proposed Spatial-DCCRN. 'LSC' denotes learnable spectrum compression, 'AFE' denotes angle feature extraction, 'CFE' denotes complex feature encoder, 'CFD' denotes complex feature decoder, 'MMF' denotes masking and mapping filtering and 'CAT' denotes concatenation.
Figure 2: Network structure of the angle feature extraction and the sub/full-channel DCCRN. 'GCC' denotes group complex convolution, 'GCTC' denotes group complex transpose convolution and 'CP' denotes convolution pathway, a convolution layer between the encoder and decoder [18]. 'AF' denotes angle feature embedding.
Figure 3: Network structure of masking and mapping filtering.

2.4 Masking and Mapping Filtering

Previous studies usually estimate a group of filter weights to perform mask-based beamforming in the frequency or time domain. However, when noise and speech occupy the same TF-bin, residual noise components are relatively hard to remove. Aiming at denoising and dereverberation simultaneously, we design a novel masking and mapping filtering method. Specifically, we first define the estimated clean speech as

$\hat{S}(t, f) = F_{\mathrm{mask}}(t, f) \cdot Y_m(t, f) + F_{\mathrm{map}}(t, f),$  (2)

where $F_{\mathrm{mask}}$ and $F_{\mathrm{map}}$ denote the masking matrix and mapping matrix respectively. Combining Eq. (1) and Eq. (2), $F_{\mathrm{mask}}$ and $F_{\mathrm{map}}$ can be described as

$F_{\mathrm{mask}}(t, f) = \frac{1}{A_m(f)}, \qquad F_{\mathrm{map}}(t, f) = -\frac{N_m(t, f)}{A_m(f)}.$  (3)

As a result, the masking operation aims at dereverberation while the mapping operation works for removing residual noise. The detailed design of MMF is shown in Figure 3. Inspired by the collaborative reconstruction module (CRM) in [14], we first employ masking on the magnitude domain and then apply the mapping residual operation on the complex domain. The difference between CRM and MMF is that we use the prior information ($F_{\mathrm{map}}$) to estimate $F_{\mathrm{mask}}$.

As shown in Eq. (3), $F_{\mathrm{map}}$ consists of the noise and the RTF of the speech, so it should be estimated by a complex neural network. Therefore, the sub/full-channel processing block is employed to estimate $F_{\mathrm{map}}$. Since $F_{\mathrm{map}}$ involves $F_{\mathrm{mask}}$, we stack $F_{\mathrm{map}}$ with the noisy complex spectrum to estimate $F_{\mathrm{mask}}$. Specifically, we first rearrange the input channels to ensure that the real and imaginary parts of the input are placed alternately along the channel axis. Then we stack the noisy spectrum and the output of the SAF block as the input of the conv3d layer. On the one hand, with the conv3d, the spectral information contained in the noisy spectrum and $F_{\mathrm{map}}$ can be well captured. On the other hand, we set the stride along the microphone-channel dimension to 2, which ensures that the magnitude mask of the target channel is estimated from its corresponding real and imaginary components. After generating $F_{\mathrm{mask}}$, we apply it to the noisy magnitude. Finally, $F_{\mathrm{map}}$ is applied to the coarse real/imaginary parts to remove the residual noise.
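As a rough sketch of how the two filters might be combined at the output, the snippet below masks the noisy magnitude (keeping the noisy phase) and then adds the mapping filter as a complex residual, following Eq. (2) and the CRM-style design of [14]. This additive combination and the tensor shapes are our assumptions; the exact operator used in the paper may differ.

```python
# Sketch of applying masking and mapping filtering (MMF) to one channel.
import torch

def apply_mmf(noisy: torch.Tensor, f_mask: torch.Tensor, f_map: torch.Tensor) -> torch.Tensor:
    """noisy: complex [B, F, T]; f_mask: real-valued magnitude mask [B, F, T];
    f_map: complex mapping filter [B, F, T]. Returns the enhanced complex spectrum."""
    coarse = torch.polar(f_mask * noisy.abs(), noisy.angle())  # masking on the magnitude domain
    return coarse + f_map                                      # residual mapping on the complex domain
```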

2.5 Loss Function

For the learning objective, we adopt a hybrid loss function strategy. Specifically, besides taking SI-SNR [16] as the time-domain loss function, we also adopt the PHASEN loss to emphasize the phase of T-F bins with higher amplitude, which helps the network focus on the high-amplitude T-F bins where most of the speech signal is located [33]. Finally, the STOI loss [27] is applied to directly improve the objective results [1]. The three losses are optimized jointly by

$\mathcal{L} = \mathcal{L}_{\mathrm{SI\text{-}SNR}} + \frac{1}{2TF}\sum_{t,f}\Big[\big(|S(t,f)|^{p} - |\hat{S}(t,f)|^{p}\big)^{2} + \big|\,|S(t,f)|^{p} e^{j\varphi(S(t,f))} - |\hat{S}(t,f)|^{p} e^{j\varphi(\hat{S}(t,f))}\,\big|^{2}\Big] + \mathcal{L}_{\mathrm{STOI}},$  (4)

where $\hat{S}$ and $S$ denote the network output and the clean spectrum respectively. The hyper-parameter $p$ is a spectral compression factor empirically set to 0.3, and the operator $\varphi(\cdot)$ calculates the argument of a complex number.
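The sketch below implements the SI-SNR and PHASEN-style power-compressed terms of this hybrid loss; the equal weighting is an assumption, and the STOI term is omitted here (a differentiable STOI implementation from an external package would be added analogously).

```python
# Sketch of the hybrid loss: negative SI-SNR plus a PHASEN-style compressed-spectrum loss.
import torch

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR; est, ref: time-domain waveforms [B, N]."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

def phasen_loss(est_spec: torch.Tensor, ref_spec: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """PHASEN-style loss on power-compressed spectra; est_spec, ref_spec: complex [B, F, T]."""
    est_mag = est_spec.abs().clamp_min(1e-8) ** p
    ref_mag = ref_spec.abs().clamp_min(1e-8) ** p
    amp = (est_mag - ref_mag).pow(2).mean()                       # compressed amplitude term
    est_cplx = torch.polar(est_mag, est_spec.angle())             # compressed complex spectrum
    ref_cplx = torch.polar(ref_mag, ref_spec.angle())
    pha = (est_cplx - ref_cplx).abs().pow(2).mean()               # phase-aware term
    return 0.5 * (amp + pha)

def hybrid_loss(est_wav, ref_wav, est_spec, ref_spec):
    # equal weighting is our assumption; a differentiable STOI loss would be added here
    return si_snr_loss(est_wav, ref_wav) + phasen_loss(est_spec, ref_spec)
```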

| # | Model | Cau. | Ch. | Loss Function | Para. (M) | STOI | WER | Metric |
|---|-------|------|-----|---------------|-----------|------|-----|--------|
| 1 | Spatial-DCCRN | ✓ | 4 | SR+PN+ST | 2.34 | 0.913 | 0.086 | 0.914 |
| 2 | + MMF | ✓ | 4 | SR+PN+ST | 2.34 | 0.921 | 0.080 | 0.921 |
| 3 | + AFE | ✓ | 4 | SR+PN+ST | 2.61 | 0.926 | 0.075 | 0.925 |
| 4 | + AFE + MMF | ✓ | 4 | SR+PN+ST | 2.61 | 0.931 | 0.071 | 0.930 |
| 5 | + AFE + MMF | ✓ | 8 | ST | 2.61 | 0.890 | 0.671 | 0.609 |
| 6 | + AFE + MMF | ✓ | 8 | SR | 2.61 | 0.837 | 0.270 | 0.783 |
| 7 | + AFE + MMF | ✓ | 8 | PN | 2.61 | 0.941 | 0.057 | 0.941 |
| 8 | + AFE + MMF | ✓ | 8 | PN+ST | 2.61 | 0.941 | 0.053 | 0.944 |
| 9 | + AFE + MMF | ✓ | 8 | PN+SR | 2.61 | 0.946 | 0.056 | 0.945 |
| 10 | + AFE + MMF | ✓ | 8 | SR+ST | 2.61 | 0.916 | 0.118 | 0.899 |
| 11 | + AFE + MMF | ✓ | 8 | SR+PN+ST | 2.61 | 0.946 | 0.055 | 0.946 |
| 12 | + AFE + MMF | ✗ | 8 | SR+PN+ST | 8.86 | 0.957 | 0.045 | 0.956 |
| 13 | MIMO-UNet [23] | ✓ | 4 | L1 | 5.52 | 0.889 | 0.175 | 0.857 |
| 14 | EaBNet [13] | ✓ | 4 | SR+PN+ST | 2.84 | 0.877 | 0.110 | 0.884 |
| 15 | FaSNet [15] | ✓ | 4 | SR+PN+ST | 3.7 | 0.832 | 0.241 | 0.795 |

Table 1: Results of various models and ablation experiments of the proposed model on the L3DAS22 dataset. '+ AFE + MMF' denotes applying both AFE and MMF, 'Cau.' denotes causal, 'Ch.' denotes the number of input channels, 'SR' denotes SI-SNR, 'PN' denotes PHASEN and 'ST' denotes STOI.

3 Experiments

3.1 Datasets

We first conduct experiments to verify the effectiveness of each proposed sub-module on the L3DAS22 challenge dataset [10]. The objective of the 3D speech enhancement task in this challenge is to enhance speech signals immersed in the spatial sound field of a reverberant office environment, collected by first-order Ambisonics microphones. Specifically, the dataset contains more than 40000 virtual 3D audio environments with a duration of up to 12 seconds each, reaching a total duration of more than 80 hours. Clean utterances are selected from the clean subset of Librispeech [19] (approximately 53% male and 47% female speech), while the monophonic noise signals come from FSD50K [5]. There are a total of 1440 noise sound files, covering 14 transient noise classes and 4 continuous noise classes. The 3D audio signals are generated by convolving the monophonic audio signals with RIRs, which are obtained by performing a circular convolution between the recorded sound and the time-inverted analytic signal [4]. In total, the fixed training and validation sets, with SNRs ranging from 6 to 16 dB, contain 37,398 utterances (81 h) and 2,362 utterances (4 h), respectively.

Meanwhile, the proposed Spatial-DCCRN is trained and evaluated on the ConferencingSpeech 2021 challenge dataset [21] to show its robustness in the video-conferencing multi-talker scenario. In this dataset, the source speech comes from AISHELL-1 [3], AISHELL-3 [25], VCTK [32] and Librispeech (train-clean-360) [19]. Speech utterances with an SNR larger than 15 dB are selected for training, and the total duration of the clean training speech is around 550 hours. The noise dataset is composed of MUSAN [26] and Audioset [7], with a total duration of about 120 hours. Besides these two open-source datasets, 98 real meeting-room noise files recorded by high-fidelity devices are also used. The training data are generated on-the-fly and segmented into 8 s chunks per batch, with SNRs ranging from -5 to 25 dB.
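A minimal sketch of such on-the-fly mixing at a random SNR is given below; it only shows the single-channel noise-scaling step, and the 16 kHz sampling rate is our assumption (reverberation and multi-channel simulation are omitted).

```python
# Sketch of on-the-fly mixing of a clean chunk and a noise chunk at a random SNR.
import torch

def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float, eps: float = 1e-8) -> torch.Tensor:
    """Scale `noise` so the mixture has the requested SNR (in dB) relative to `speech`."""
    speech_power = speech.pow(2).mean() + eps
    noise_power = noise.pow(2).mean() + eps
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

sr, chunk = 16000, 8 * 16000                       # 8 s chunks at an assumed 16 kHz
speech = torch.randn(chunk)                        # placeholder clean chunk
noise = torch.randn(chunk)                         # placeholder noise chunk
snr_db = torch.empty(1).uniform_(-5.0, 25.0).item()
mixture = mix_at_snr(speech, noise, snr_db)
```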

3.2 Training setup and baselines

For the proposed models, the window length, frame shift and future-frame look-ahead are 25 ms, 6.25 ms and 6.25 ms, respectively, resulting in a 37.5 ms processing time. The STFT length is 512. For the 3D speech enhancement task on L3DAS22, all models are trained for 40 epochs on a total of 3200 h of training data. For the models trained on ConferencingSpeech, the total data 'seen' by the model is more than 9900 h after 18 epochs of training. The initial learning rate of all models is 0.001 and is halved if the validation loss does not decrease. We also compare the proposed Spatial-DCCRN and its ablated variants with other SOTA models on the L3DAS22 dataset.
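For concreteness, the analysis setup described above can be expressed as follows; the 16 kHz sampling rate (so that 25 ms equals 400 samples and 6.25 ms equals 100 samples) and the Hann window are assumptions.

```python
# Sketch of the STFT analysis settings: 25 ms window, 6.25 ms hop, 512-point FFT.
import torch

sr = 16000                        # assumed sampling rate
win_length = int(0.025 * sr)      # 25 ms   -> 400 samples
hop_length = int(0.00625 * sr)    # 6.25 ms -> 100 samples
n_fft = 512                       # STFT length from the paper

x = torch.randn(4, sr)            # placeholder: 4-channel, 1 s waveform
window = torch.hann_window(win_length)
spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length, win_length=win_length,
                  window=window, return_complex=True)    # [4, 257, frames]
```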

The configuration of our Spatial-DCCRN is as follows. The number of channels of the sub-channel DCCRN is {32, 64, 64, 64, 128, 128}, and the convolution kernel size and stride are set to (5, 2) and (2, 1) respectively. The configuration of the full-channel DCCRN is similar, except that the channel number of the first layer is 64. One LSTM layer with 256 nodes, followed by a 256 × 256 fully connected layer, is adopted to process the concatenation of the encoder outputs of the sub/full-channel DCCRN and the angle feature embeddings. Each encoder/decoder module handles the current frame and one previous frame. The number of output channels of the complex feature encoder/decoder module is 32, and the depth of its dense block is 5. LayerNorm and PReLU are applied after each convolution, except for the last layer of the CFD module. For the AFE module, the number of hidden channels is 16 and the depth of the dense block is 2. For the MMF module, the kernel size, number of output channels and stride of the first conv3d are set to (2, 5, 3), 8 and (2, 1, 1) respectively; those of the second conv3d are (1, 5, 3), 1 and (1, 1, 1) respectively.
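To illustrate the group complex convolution ('GCC' in Fig. 2) used by the sub/full-channel DCCRN encoders, the sketch below realizes a complex 2-D convolution with channel groups from four grouped real convolutions via the complex multiplication rule. The input channel count, padding and group number in the usage example are assumptions; only the kernel size (5, 2) and stride (2, 1) follow the configuration above.

```python
# Sketch of a group complex 2-D convolution layer.
import torch
import torch.nn as nn

class GroupComplexConv2d(nn.Module):
    """(a + jb) * (w_r + j w_i) = (a*w_r - b*w_i) + j(a*w_i + b*w_r), with grouped real convs."""
    def __init__(self, in_ch, out_ch, kernel_size, stride, padding, groups=2):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, groups=groups)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, groups=groups)

    def forward(self, real, imag):
        return (self.conv_r(real) - self.conv_i(imag),   # real part
                self.conv_r(imag) + self.conv_i(real))   # imaginary part

# Illustrative first encoder layer (input channels, padding and groups are assumptions)
layer = GroupComplexConv2d(in_ch=8, out_ch=32, kernel_size=(5, 2), stride=(2, 1),
                           padding=(2, 0), groups=2)
real = torch.randn(1, 8, 257, 100)
imag = torch.randn(1, 8, 257, 100)
out_r, out_i = layer(real, imag)
```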

3.3 Experimental results and discussion

As presented in Table 1, ablation studies are conducted to evaluate the effectiveness of the different components of Spatial-DCCRN, including a) Spatial-DCCRN without AFE and MMF (the base version), b) Spatial-DCCRN without AFE, c) Spatial-DCCRN without MMF, d) different numbers of input channels of the observed signal, e) the non-causal version of Spatial-DCCRN and f) an ablation over different loss functions. For the non-causal model, we substitute the LSTM in Spatial-DCCRN with a BLSTM and look ahead one frame in each convolution layer. The evaluation metric is a combination of STOI [27] and WER, according to the official challenge rule [10]:

$\text{Metric} = \frac{\text{STOI} + (1 - \text{WER})}{2}.$  (5)

The WER is computed between the transcription of the estimated target signal and that of the reference signal, both decoded by a pre-trained Wav2Vec2.0-based ASR model [2].
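The composite metric of Eq. (5) can be computed directly, as in this small sketch:

```python
def l3das22_metric(stoi: float, wer: float) -> float:
    """Task-1 composite metric from Eq. (5): (STOI + (1 - WER)) / 2."""
    return (stoi + (1.0 - wer)) / 2.0

print(l3das22_metric(0.913, 0.086))   # 0.9135, i.e. ~0.914 as in row 1 of Table 1
```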

The results show that the performance of Spatial-DCCRN (the base version) is clearly better than MIMO-UNet, yielding a 0.057 metric improvement with a smaller model and causal inference. Compared with the frequency-domain SOTA model EaBNet and the time-domain SOTA model FaSNet, Spatial-DCCRN yields 0.030 and 0.119 metric improvements respectively, which demonstrates that our model is competitive. Adding the MMF module and the AFE module leads to 0.007 and 0.011 metric gains respectively; the AFE module yields relatively better performance than the MMF module, as angle information is more essential in the multi-channel scenario. Combining AFE and MMF, we achieve a 0.016 metric improvement over the base version of Spatial-DCCRN. Moreover, when more channels are available (4→8), considerable further improvements are achieved. Finally, the non-causal version of Spatial-DCCRN obtains the best metric score.

Comparing different loss functions, several observations can be made. 1) From experiments No. 1, 7 and 8, we observe that when only the STOI loss is adopted, the performance is poor; however, when it is combined with other loss functions, better metric scores are obtained. The STOI loss can thus be considered a useful auxiliary loss. 2) From experiments No. 6 and 7, the PHASEN loss yields better performance than SI-SNR. This is because WER and STOI are sensitive to spectral distortion, while the time-domain loss is unstable with respect to these metrics. 3) From experiments No. 7 and 9, adding the SI-SNR loss to the PHASEN loss brings considerable improvement, which shows that optimizing the model in both the time and frequency domains is beneficial.

| # | Model | Cau. | PESQ | STOI | E-STOI | SI-SNR |
|---|-------|------|------|------|--------|--------|
| 1 | Noisy | - | 1.515 | 0.823 | 0.690 | 4.474 |
| 2 | Baseline | ✓ | 1.999 | 0.888 | 0.780 | 9.159 |
| 3 | MIMO-UNet [24] | ✓ | 2.215 | 0.908 | 0.817 | 9.287 |
| 4 | Spatial-DCCRN | ✓ | 2.523 | 0.923 | 0.847 | 10.167 |

Table 2: Results of various models on the ConferencingSpeech2021 challenge development set. 'Cau.' denotes causal.

Figure 4 illustrates an example of the proposed masking and mapping filtering (MMF). It can be observed that the noisy spectrum is coarsely denoised and dereverberated after the masking operation. As expressed in Eq. (3), $F_{\mathrm{mask}}$ is the inverse of the RTF; as a result, the masking operation contributes to dereverberation. It should be noted that we do not explicitly guide the learning of the masking operation during training. As shown in Figure 4 (c), after the masking operation, the mapping operation recovers the lost details from the complex-domain perspective and removes the residual noise.

Figure 4: Masking and mapping filtering results on a testing noisy clip.

We further evaluate Spatial-DCCRN on the development set of ConferencingSpeech2021 challenge task 1. The official baseline system of ConferencingSpeech2021 is a network composed of 3 LSTM layers and a dense layer, whose input is the complex spectrum of the first channel together with IPD features. We also take the first-ranked system, MIMO-UNet [24], for comparison, measured in PESQ [22], STOI, E-STOI [12] and SI-SNR. As shown in Table 2, Spatial-DCCRN outperforms the baseline by a large margin and clearly surpasses MIMO-UNet in all metrics.

4 Conclusions

In this paper, we propose a novel multi-channel complex-domain denoising network, Spatial-DCCRN, which is extended from S-DCCRN [17]. With the help of the cascaded sub-channel and full-channel processing strategy, the model benefits from both local and global channel information processing. Importantly, an angle feature extraction module is adopted to extract frame-level angle features, assisting the network to perceive spatial information more explicitly. Finally, a masking and mapping filtering method is employed to replace the traditional filter-and-sum operation. The proposed Spatial-DCCRN obtains excellent performance with a 0.956 metric score on the L3DAS22 dataset. Furthermore, Spatial-DCCRN surpasses the first-ranked model MIMO-UNet on the task 1 development set of the ConferencingSpeech 2021 challenge. Enhanced clips can be found at https://imybo.github.io/Spatial-DCCRN/

References

  • [1] M. Pariente, S. Cornell, J. Cosentino, et al. (2020) Asteroid: the PyTorch-based audio source separation toolkit for researchers. In Proc. Interspeech. Cited by: §2.5.
  • [2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33, pp. 12449–12460. Cited by: §3.3.
  • [3] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng (2017) Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline. In Proc. O-COCOSDA, pp. 1–5. Cited by: §3.1.
  • [4] A. Farina (2000) Simultaneous measurement of impulse response and distortion with a swept-sine technique. In Audio engineering society convention 108, Cited by: §3.1.
  • [5] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra (2022) FSD50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, pp. 829–852. Cited by: §3.1.
  • [6] Y. Fu, J. Wu, Y. Hu, M. Xing, and L. Xie (2021) DESNet: a multi-channel network for simultaneous speech dereverberation, enhancement and separation. In Proc. SLT, pp. 857–864. Cited by: §1.
  • [7] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In Proc. ICASSP, pp. 776–780. Cited by: §3.1.
  • [8] R. Gu, L. Chen, S. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu (2019) Neural spatial filter: target speaker speech separation assisted with directional information.. In Proc. Interspeech, pp. 4290–4294. Cited by: §1.
  • [9] R. Gu, J. Wu, S. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu (2019) End-to-end multi-channel speech separation. arXiv preprint arXiv:1905.06286. Cited by: §1.
  • [10] E. Guizzo, C. Marinoni, M. Pennese, X. Ren, X. Zheng, C. Zhang, B. Masiero, A. Uncini, and D. Comminiello (2022) L3DAS22 challenge: learning 3d audio sources in a real office environment. arXiv e-prints, pp. arXiv–2202. Cited by: §1, §3.1, §3.3.
  • [11] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach (2015) BLSTM supported gev beamformer front-end for the 3rd chime challenge. In Proc. ASRU, pp. 444–451. Cited by: §1.
  • [12] J. Jensen and C. H. Taal (2016) An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), pp. 2009–2022. Cited by: §3.3.
  • [13] A. Li, W. Liu, C. Zheng, and X. Li (2021) Embedding and beamforming: all-neural causal beamformer for multichannel speech enhancement. arXiv preprint arXiv:2109.00265. Cited by: §1, §1, §1, Table 1.
  • [14] A. Li, C. Zheng, L. Zhang, and X. Li (2022) Glance and gaze: a collaborative learning framework for single-channel speech enhancement. Applied Acoustics 187, pp. 108499. Cited by: §2.4.
  • [15] Y. Luo, C. Han, N. Mesgarani, E. Ceolini, and S. Liu (2019) FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. In Proc. ASRU, pp. 260–267. Cited by: §1, §1, §1, Table 1.
  • [16] Y. Luo and N. Mesgarani (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing 27 (8), pp. 1256–1266. Cited by: §2.5.
  • [17] S. Lv, Y. Fu, M. Xing, J. Sun, L. Xie, J. Huang, Y. Wang, and T. Yu (2021) S-dccrn: super wide band dccrn with learnable complex feature for speech enhancement. arXiv e-prints, pp. arXiv–2111. Cited by: 1st item, §2.2, §4.
  • [18] S. Lv, Y. Hu, S. Zhang, and L. Xie (2021) DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement. In Proc. Interspeech, pp. 2816–2820. Cited by: Figure 2.
  • [19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In Proc. ICASSP, pp. 5206–5210. Cited by: §3.1, §3.1.
  • [20] A. Pandey and D. Wang (2020) Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain. In Proc. ICASSP, pp. 6629–6633. Cited by: §2.2.
  • [21] W. Rao, Y. Fu, Y. Hu, X. Xu, Y. Jv, J. Han, Z. Jiang, L. Xie, Y. Wang, S. Watanabe, et al. (2021) INTERSPEECH 2021 conferencingspeech challenge: towards far-field multi-channel speech enhancement for video conferencing. In Proc. Interspeech, Cited by: §1, §3.1.
  • [22] I. Rec (2005) P. 862.2: wideband extension to recommendation p. 862 for the assessment of wideband telephone networks and speech codecs. International Telecommunication Union, CH–Geneva. Cited by: §3.3.
  • [23] X. Ren, L. Chen, X. Zheng, C. Xu, X. Zhang, C. Zhang, L. Guo, and B. Yu (2021) A neural beamforming network for b-format 3d speech enhancement and recognition. In Proc. MLSP, pp. 1–6. Cited by: Table 1.
  • [24] X. Ren, X. Zhang, L. Chen, X. Zheng, C. Zhang, L. Guo, and B. Yu (2021) A causal u-net based neural beamforming network for real-time multi-channel speech enhancement. In Proc. Interspeech, pp. 1832–1836. Cited by: §1, §1, §1, §3.3, Table 2.
  • [25] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li (2020) Aishell-3: a multi-speaker mandarin tts corpus and the baselines. arXiv preprint arXiv:2010.11567. Cited by: §3.1.
  • [26] D. Snyder, G. Chen, and D. Povey (2015) Musan: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484. Cited by: §3.1.
  • [27] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proc. ICASSP, pp. 4214–4217. Cited by: §2.5, §3.3.
  • [28] D. Wang and J. Chen (2018) Supervised speech separation based on deep learning: an overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10), pp. 1702–1726. Cited by: §1.
  • [29] Z. Wang and D. Wang (2018) Combining spectral and spatial features for deep learning based blind speaker separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2), pp. 457–468. Cited by: §1.
  • [30] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proc. LVA/ICA, pp. 91–99. Cited by: §1.
  • [31] X. Xiao, C. Xu, Z. Zhang, S. Zhao, S. Sun, S. Watanabe, L. Wang, L. Xie, D. L. Jones, E. S. Chng, et al. (2016) A study of learning based beamforming methods for speech recognition. In CHiME 2016 workshop, pp. 26–31. Cited by: §1.
  • [32] J. Yamagishi, C. Veaux, K. MacDonald, et al. (2019) Cstr vctk corpus: english multi-speaker corpus for cstr voice cloning toolkit (version 0.92). Cited by: §3.1.
  • [33] D. Yin, C. Luo, Z. Xiong, and W. Zeng (2020) Phasen: a phase-and-harmonics-aware speech enhancement network. In Proc. AAAI, Vol. 34, pp. 9458–9465. Cited by: §2.5.
  • [34] J. Zhang, C. Zorilă, R. Doddipatla, and J. Barker (2020) On end-to-end multi-channel time domain speech separation in reverberant environments. In Proc. ICASSP, pp. 6389–6393. Cited by: §1.