Monaural speech separation aims to separate speech from noisy backgrounds using a single microphone. In recent years, speech separation has been formulated as a supervised learning problem. Thanks to the rise of deep learning, supervised speech separation has made significant progress.
Speech separation for improving human speech intelligibility and quality has been systematically evaluated and successfully applied. In general, speech separation methods can be divided into three groups: masking-based methods, mapping-based methods and signal approximation. Masking-based methods predict a mask computed from the premixed noise and clean speech, e.g., the ideal ratio mask, the phase-sensitive mask and the complex ratio mask. Mapping-based methods enhance speech by finding a mapping function between the noisy features and the spectrum of the clean speech. The idea of signal approximation (SA) is to train a ratio mask estimator that minimizes the difference between the spectral magnitude of the clean speech and that of the estimated speech. Many learning machines have been introduced for speech separation. In [2, 4], deep neural networks (DNNs) are employed to predict ideal masks. Lu et al. used a deep denoising auto-encoder (DDAE) to obtain a clean Mel frequency power spectrogram (fbank) from a noisy one. In [8, 9], convolutional neural networks (CNNs) are introduced. Besides feed-forward networks, recurrent neural networks (RNNs) have also become a popular choice in the speech separation community. As for features, Wang et al. proposed a complementary feature set, and Chen et al. found that the multi-resolution cochleagram is a better feature in low signal-to-noise-ratio conditions.
Compared with human listeners, ASR is more sensitive to noise interference and speech distortion. In general, three strategies have been introduced to improve the robustness of ASR. The first is using a separation front-end to enhance both the training and test sets and retraining the acoustic model with the enhanced features [12, 13]. The second is jointly training the front-end enhancement model with the back-end acoustic model [14, 15]. The third is multi-condition training, which performs acoustic modeling on noisy speech; at the test stage, the extracted features are directly fed to the acoustic model for decoding. This strategy is effective in matched conditions but gives unremarkable performance on unseen noise.
All the above strategies require retraining or jointly training an acoustic model, which can be time-consuming and sophisticated. Compared with speech separation, collecting training data for speech recognition is relatively hard because it requires handcrafted annotation. In practice, a preferred choice is to train the front-end speech separation and the back-end ASR independently. We therefore ask whether supervised speech separation methods can directly improve the performance of ASR under real noisy conditions, without retraining or joint training. Wang et al. evaluated a masking-based method on a simulated noisy dataset derived from the Google Voice dataset, which yielded a 0.3% improvement for the multi-condition-trained ASR. Wang et al. also investigated the effectiveness of front-end processing under reverberant conditions. However, there is still no work that systematically examines the ability of supervised speech separation methods to help a multi-condition-trained ASR. In this paper, different speech separation methods based on various time-frequency (T-F) representations are investigated on the third CHiME challenge.
2 Speech separation methods
In the speech separation community, RNNs with long short-term memory (LSTM) units have been widely employed to leverage the sequential information of speech signals and have shown superior performance compared with DNNs and CNNs [3, 18]. For optimization objectives, ratio masking, direct mapping and signal approximation are three popular choices. Note that all these methods can be performed on different T-F representations, such as the log-power spectrogram and the log-fbank feature. In this investigation, we ask which combination of optimization objective and T-F representation is most appropriate for robust ASR. Therefore, we fix our learning machine as an RNN with bidirectional LSTMs (BiLSTM) and focus on the different optimization objectives and T-F representations.
2.1 Optimization objectives
The general training objective of supervised speech separation is defined as:

J = \sum_{t} D\big(f(\mathbf{y}_t; \Phi), \mathbf{d}_t\big),

where \mathbf{d}_t is the desired output at frame t, \mathbf{y}_t is the noisy T-F representation and the input of the separation model f, which is parameterized by \Phi, and D denotes the squared loss, defined as:

D(\hat{\mathbf{x}}, \mathbf{x}) = \lVert \hat{\mathbf{x}} - \mathbf{x} \rVert_2^2,

where \lVert \cdot \rVert_2 is the 2-norm of a vector.
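The general objective above can be sketched in a few lines of numpy. The identity "model" and the toy frames below are purely illustrative assumptions, not the paper's BiLSTM:

```python
import numpy as np

def squared_loss(estimate, target):
    # D(a, b) = ||a - b||_2^2, the squared 2-norm of the difference.
    diff = estimate - target
    return float(np.sum(diff ** 2))

def utterance_loss(model, noisy_frames, desired_frames):
    # Sum the per-frame squared loss over the whole utterance:
    # J = sum_t D(f(y_t), d_t).
    return sum(squared_loss(model(y), d)
               for y, d in zip(noisy_frames, desired_frames))

# Identity "model" for illustration only (a real f would be a BiLSTM).
identity = lambda y: y
noisy = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]
clean = [np.array([1.0, 1.0]), np.array([0.0, 0.0])]
loss = utterance_loss(identity, noisy, clean)
```

Any of the objectives in the following subsections is obtained by choosing a specific desired output d_t.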
2.1.1 Ratio Masking
Masking-based methods learn a mapping function from the noisy T-F representations to the T-F masks of the clean speech. The training target of the ratio mask is defined as:

J = \sum_{t} D\big(f(\mathbf{y}_t; \Phi), \mathbf{m}_t\big),

where \mathbf{m}_t is the desired ratio mask at frame t. We investigate a direct masking method, which is defined as:

\mathbf{m}_t = \mathbf{x}_t / \mathbf{y}_t,

where \mathbf{x}_t and \mathbf{y}_t are the T-F representations of the clean and noisy speech at frame t, respectively, and the division is element-wise. Because the direct masks are not well bounded, we clip them to [0, 1] for training stability.
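Computing the clipped direct mask can be sketched as follows; this is a minimal numpy sketch, and the small eps guard against division by zero is an assumption for numerical safety, not from the paper:

```python
import numpy as np

def direct_mask(clean_tf, noisy_tf, eps=1e-8):
    """Element-wise direct mask x_t / y_t, clipped to [0, 1].

    Clipping bounds the target: x/y can exceed 1 when the clean
    and noise components interfere destructively in a T-F bin.
    """
    mask = clean_tf / (noisy_tf + eps)
    return np.clip(mask, 0.0, 1.0)

# Toy T-F magnitudes (frames x frequency bins), illustrative only.
clean = np.array([[0.5, 2.0],
                  [1.0, 0.0]])
noisy = np.array([[1.0, 1.0],
                  [1.0, 1.0]])
mask = direct_mask(clean, noisy)  # the 2.0/1.0 bin is clipped to 1.0
```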
2.1.2 Direct Mapping
Mapping-based methods train the learning machine to predict the T-F representation of the clean speech directly from the noisy one. The optimization objective of direct mapping is defined as:

J = \sum_{t} D\big(f(\mathbf{y}_t; \Phi), \mathbf{x}_t\big),

where \mathbf{x}_t and \mathbf{y}_t are the T-F representations of the clean and noisy speech at frame t, respectively.
2.1.3 Signal Approximation
SA-based methods implicitly learn a ratio mask from the noisy T-F representations. Different from the masking-based methods, which directly reduce the training loss between the desired mask and the predicted one, SA-based methods reduce the loss between the T-F representations of the target speech and the estimated ones. The SA-based optimization objective is defined as:

J = \sum_{t} D\big(f(\mathbf{y}_t; \Phi) \odot \mathbf{y}_t, \mathbf{x}_t\big),

where \odot is element-wise multiplication. The output of f is restricted to the range [0, 1] and is therefore bounded like a ratio mask.
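The SA loss for one frame can be sketched as follows; the toy values are illustrative, and the mask estimate is assumed to come from a bounded (e.g. sigmoid) output as described above:

```python
import numpy as np

def sa_loss(mask_estimate, noisy_tf, clean_tf):
    # Signal-approximation loss: apply the bounded [0, 1] mask
    # estimate to the noisy T-F representation and compare the
    # result with the clean one, ||m (.) y - x||_2^2.
    est = mask_estimate * noisy_tf   # element-wise multiplication
    return float(np.sum((est - clean_tf) ** 2))

noisy = np.array([2.0, 4.0])
clean = np.array([1.0, 1.0])
mask = np.array([0.5, 0.5])          # perfect for bin 0, wrong for bin 1
loss = sa_loss(mask, noisy, clean)   # (1-1)^2 + (2-1)^2 = 1.0
```

Note that, unlike ratio masking, the loss is measured in the signal domain, so T-F bins with larger energy automatically contribute more to the gradient.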
2.2 Target domains
The above optimization objectives can be applied on different target domains. In the ASR community, log-fbank is the most widely used feature, so we optimize our models on the log-fbank domain. Because the log-fbank features can be directly extracted from the spectrograms (fft domain), we also perform the optimization on the fft domain and its logarithmic counterpart.
Different learning tasks benefit from appropriate features. Log-fbank features are widely used for training acoustic models, while log-fft spectrograms are usually fed to speech separation models. In this paper, the targets on the log-fbank and fbank domains are predicted from log-fbank features, and log-fft spectrograms are fed to the models on the log-fft and fft domains. The input features, output domains and optimization objectives of the evaluated methods are shown in Table 1.
| Evaluated methods | Input domain | Output domain | Optimization objectives |
| --- | --- | --- | --- |
| log-fbank masking | log-fbank | log-fbank | ratio masking |
| log-fft masking | log-fft | log-fft | ratio masking |
| fbank masking | log-fbank | fbank | ratio masking |
| fft masking | log-fft | fft | ratio masking |
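The relationship between the fft and fbank domains used above, i.e., extracting (log-)fbank features from the power spectrogram via a mel filterbank, can be sketched as follows. This is a generic textbook triangular filterbank, not the exact Kaldi implementation, and the filter count, FFT size and sampling rate are illustrative assumptions:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters, equally spaced on the mel scale, mapped
    # onto the one-sided magnitude-spectrogram bins.
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    hz_points = mel_to_hz(np.linspace(low, high, n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising slope
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def log_fbank(power_spec, fbank, eps=1e-10):
    # power_spec: (frames, n_fft//2 + 1) -> (frames, n_filters).
    return np.log(power_spec @ fbank.T + eps)

fb = mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000)
spec = np.abs(np.random.RandomState(0).randn(10, 257)) ** 2
feat = log_fbank(spec, fb)   # log-fbank features, shape (10, 26)
```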
3 Experimental settings
We perform our investigation on the CHiME-3 challenge, which provides multi-channel data for distant-talking automatic speech recognition; we only use the fifth channel in this paper.
In the training phase of ASR, we follow the recipe for CHiME-3 in the newest Kaldi release to build our baseline. There are two differences between our training and the default one. First, we train the recognizer with a multi-condition training (MCT) strategy, i.e., we train the GMM-based and DNN-based acoustic models with the clean utterances, the simulated noisy utterances in the fifth channel, the real noisy utterances in the fifth channel and the real close-talk utterances in channel zero, while the default training only uses the real and simulated noisy utterances in the fifth channel. The intuition behind this MCT is that the front-end processing tries to reconstruct the clean features, so training the recognizer only on noisy utterances is obviously unreasonable. Second, we train the recognizer with fbank features instead of MFCCs. The fbank feature has been widely used in the robust speech recognition community. With the MCT strategy and fbank features, our ASR baseline achieves performance similar to that reported in the CHiME-3 challenge.
For the front-end processing, we employ a 4-layer RNN with 512 bidirectional LSTM cells in each layer. A dense layer with softplus activations follows for the mapping-based methods, and the sigmoid function is employed for the masking-based and SA-based methods. Different methods are evaluated on the log-fft and log-fbank domains; however, the fft and fbank domains are only evaluated with the masking-based method because of their large value range. To evaluate the effect of noisy phases, the recognizer is also fed with synthesized waveforms which are reconstructed from the noisy phases and the estimated magnitudes via the inverse STFT. In the training phase of the front-end models, the T-F representations extracted from the simulated and real noisy utterances are fed to the models and the corresponding clean counterparts are estimated. We also expand the training set by mixing the clean utterances with the noise recordings in the training set at 0 dB, 3 dB and 6 dB.
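The waveform resynthesis described above, combining the estimated magnitude with the noisy phase and applying the inverse STFT, can be sketched with scipy; the STFT parameters below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import stft, istft

def resynthesize(noisy_wave, estimated_magnitude,
                 n_fft=512, hop=256, fs=16000):
    # Take the phase of the *noisy* STFT, attach the enhanced
    # magnitude to it, and invert back to a time-domain waveform.
    _, _, noisy_stft = stft(noisy_wave, fs=fs,
                            nperseg=n_fft, noverlap=n_fft - hop)
    phase = np.angle(noisy_stft)
    enhanced = estimated_magnitude * np.exp(1j * phase)
    _, wave = istft(enhanced, fs=fs,
                    nperseg=n_fft, noverlap=n_fft - hop)
    return wave
```

As a sanity check, feeding the noisy magnitude itself back in reproduces the noisy waveform, since magnitude and phase then come from the same STFT.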
In the evaluation phase, the word error rate (WER) is calculated for the simulated and real noisy utterances in the development and test sets. The front-end processing is also applied to the clean and close-talk utterances to check whether it degrades performance on relatively clean utterances.
4 Results and discussions
Tables 2 and 3 show the WERs of the GMM-based and DNN-based ASR respectively. The columns with dt_* and et_* show the results on the development and test sets. The WERs of utterances recorded in the booth and in real noisy environments are given in the *_bth and *_real columns. The *_close columns represent the results of the close-talk utterances in channel zero, and the WERs of the simulated noisy speech in the fifth channel are shown in the *_simu columns. The rows marked "+noisy phases" indicate that we reconstruct waveforms in the time domain and extract the ASR features from those waveforms. We do this because, in many real scenarios, speech enhancement runs on a local system while the ASR is located on a cloud server, and the interface between them always needs waveforms. The average performance on the simulated and real noisy utterances is given in the *_avg columns.
For the GMM-based ASR (see Table 2), the masking-based method in the log-fbank domain achieves the best performance, a 36.40% relative improvement from 31.70% to 20.16% on the noisy test set. SA in the log-fbank domain gets the lowest WER on the noisy development set. It seems that the mapping-based method is not a good choice for automatic speech recognition. When the noisy phase is involved, the masking-based method in the fft domain degrades significantly on the test set. Although the methods in the log-fft domain are only slightly affected by the noisy phase, their performance is much worse than that of the methods in the fft domain.
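The relative improvements quoted throughout this section follow the usual definition, the WER reduction divided by the baseline WER:

```python
def relative_improvement(wer_baseline, wer_enhanced):
    # Relative WER reduction, in percent.
    return 100.0 * (wer_baseline - wer_enhanced) / wer_baseline

# The figures from the GMM-based noisy test set above:
ri = relative_improvement(31.70, 20.16)  # ~36.40
```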
For the DNN-based ASR (see Table 3), the masking-based method in the log-fbank domain is a good choice and achieves 7.78% and 11.78% relative improvements on the noisy development and test sets respectively. The masking-based method in the fft domain gets lower WER than all methods in the log-fft domain, but it is significantly degraded by the noisy phase. The mapping-based front-end processing and the methods in the log-fft domain no longer improve the performance of ASR.
These front-end processing methods cause very little degradation on the relatively clean speech utterances (see the *_clean and *_close columns). Surprisingly, some methods can even improve the performance of ASR on the close-talk utterances in the test set, possibly because the close-talk utterances are not completely clean but slightly noisy.
From Tables 3 and 4, we can see that independent front-end processing can dramatically improve ASR performance under matched noise conditions. To evaluate the generalization ability, we calculate the WERs of noisy utterances interfered by babble noise, which does not appear in either the ASR or the speech enhancement training data. The method in the log-fbank domain, which also gets the lowest WER under the matched-noise condition, achieves the best performance on the unseen babble noise. We find that the ASR with the MCT strategy does not generalize well to unseen noise, while the speech enhancement efficiently leverages the noise information and performs better under the unmatched condition.
5 Conclusions
In this paper, we investigate independent front-end processing methods for ASR, without retraining or joint training, on the CHiME-3 challenge. The masking-based, mapping-based and SA-based methods are evaluated in the log-fbank domain, the log-fft domain and their linear counterparts. From this investigation, we find that the masking-based method is a good choice for ASR. Direct masking in the log-fbank domain achieves the lowest WER under both matched and unmatched noise conditions compared with the baseline, which is a strong DNN-based acoustic model.
The noisy phase leads to a considerable degradation for the masking-based methods in the fft domain, while the effect in the log-fft domain is very slight. The independent front-end generalizes better than MCT to unseen noise. In the future, we will try to further reduce the WER of the DNN-based ASR with independent front-end processing.
This research was supported by the National Science Foundation of China under Grant No. 61876214, the National Key Research and Development Program of China under Grant 2017YFB1002102, and the National Natural Science Foundation of China under Grant U1736210.
-  DeLiang Wang and Jitong Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
-  Yuxuan Wang, Arun Narayanan, and DeLiang Wang, “On training targets for supervised speech separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
-  Hakan Erdogan, John R Hershey, Shinji Watanabe, and Jonathan Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2015, pp. 708–712.
-  Donald S Williamson, Yuxuan Wang, and DeLiang Wang, “Complex ratio masking for monaural speech separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
-  Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, 2015.
-  Felix Weninger, John R Hershey, Jonathan Le Roux, and Björn Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in GlobalSIP, Atlanta, GA, USA, 2014.
-  Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, “Speech enhancement based on deep denoising autoencoder,” in Interspeech, 2013, pp. 436–440.
-  Like Hui, Meng Cai, Cong Guo, Liang He, Wei-Qiang Zhang, and Jia Liu, “Convolutional maxout neural networks for speech separation,” in IEEE International Symposium on Signal Processing and Information Technology. IEEE, 2015, pp. 24–27.
-  Ke Tan, Jitong Chen, and DeLiang Wang, “Gated residual networks with dilated convolutions,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, p. 5.
-  Yuxuan Wang, Kun Han, and DeLiang Wang, “Exploring monaural features for classification-based speech segregation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 270–279, 2013.
-  J. Chen, Y. Wang, and D. Wang, “A feature study for classification-based speech separation at very low signal-to-noise ratio,” in IEEE International Conference on Acoustics, Speech and Signal Processing, May 2014, pp. 7039–7043.
-  Kun Han, Yanzhang He, Deblin Bagchi, Eric Fosler-Lussier, and DeLiang Wang, “Deep neural network based spectral feature mapping for robust speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
-  Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller, “Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.
-  Zhong-Qiu Wang and DeLiang Wang, “A joint training framework for robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796–806, 2016.
-  Bin Liu, Shuai Nie, Yaping Zhang, Dengfeng Ke, Shan Liang, and Wenju Liu, “Boosting noise robustness of acoustic model via deep adversarial training,” arXiv preprint arXiv:1805.01357, 2018.
-  Feipeng Li, Phani S Nidadavolu, and Hynek Hermansky, “A long, deep and wide artificial neural net for robust speech recognition in unknown noise,” in Interspeech, 2014.
-  Yuxuan Wang, Ananya Misra, and Kean K Chin, “Time-frequency masking for large scale robust speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
-  Ke Wang, Junbo Zhang, Sining Sun, Yujun Wang, Fei Xiang, and Lei Xie, “Investigating generative adversarial networks based speech dereverberation for robust speech recognition,” in Proc. Interspeech 2018, 2018, pp. 1581–1585.
-  Mike Schuster and Kuldip K Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
-  Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 504–511.
-  Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, 2014.