Speech enhancement aims to suppress environmental noise without distorting the target speech, and its importance is well recognized given its wide application in communication, hearing aids and automatic speech recognition (ASR). Although multichannel speech enhancement using a microphone array has shown promising advantages over single-channel processing, single-channel speech enhancement is still of great interest due to its simple setup in most practical scenarios.
Speech enhancement is traditionally treated within the framework of statistical signal processing. Generally, the goal is to design a linear filter based on carefully chosen statistical models of speech and noise. Typical algorithms include Wiener filtering (WF) [Wiener1949, Lim1979, Chen2006b], the minimum mean-squared error (MMSE) amplitude estimator [Ephraim1984, Ephraim1985, Martin2002a], and maximum a posteriori (MAP) [Lotter2005, Wolfe2003] based methods. Because prior expert knowledge is used to design the statistical models, these methods are fully unsupervised and need no training data; the filter coefficients are computed adaptively from the derived analytical formulas and the updated statistics of speech and noise. However, since the model assumptions are not always valid, the performance of statistical signal processing based methods is limited in realistic cases, especially when the noise is non-stationary.
The performance of speech enhancement has been dramatically improved by using deep neural networks (DNNs). With the capability of modelling complex non-linear transformations, the DNN does not rely on expert knowledge but learns a mapping from the noisy speech to the clean target directly in a supervised, data-driven manner. Different DNN structures have been explored, including the feed-forward network (FNN) [Xu2014, Xu2015, Tu2017], the convolutional neural network (CNN) [Mamun2019, Ouyang2019, Park2017], and the recurrent neural network (RNN) [Weninger2015, Gao2018, Sun2017].
Generalization to unseen noise types, speakers and transmission channels is an essential issue for DNN based methods. Since the statistical signal processing based methods are unsupervised and knowledge-driven, some approaches have been proposed to exploit expert knowledge for speech enhancement [Wang2016a, Xia2014a, Yang2018]. Generally, the statistics needed for filter design are estimated by a DNN, and linear filtering is then applied to obtain the enhanced speech. However, as the strategy is essentially two-stage, the expert knowledge is not well exploited in the DNN design and optimization, and the performance of linear filtering largely depends on the accuracy of the estimated statistics.
In this paper, we propose a new speech enhancement method by fully integrating statistical signal processing into the DNN. A neural Kalman filter (NKF) is developed, which extends conventional Kalman filtering (KF) to the supervised learning scheme, and implicitly exploits both the modelling capability of the DNN and the expert knowledge used in signal processing for speech enhancement. Clean speech estimates from a recurrent neural network (RNN) and linear WF are obtained, and are linearly combined by an NKF gain to yield the NKF output. By integrating the signal processing components, a network structure designed specifically for speech enhancement is obtained, and these components can serve as regularization for the network. In addition, the proposed method also overcomes the problem of unrealistic model assumptions in KF. We conduct experiments in different noisy conditions, and evaluations on both objective speech quality and ASR demonstrate the effectiveness of the proposed method.
2 Signal Model and Conventional KF
We first describe the conventional statistical signal processing based KF, which will facilitate the introduction of the proposed NKF in the following sections. Compared with other signal processing based methods, KF additionally incorporates the temporal evolution of speech into the optimal filter design, such that the artifacts caused by inaccurate noise level estimation are suppressed. The algorithm generally works in the modulation domain [So2011a, Wang2013, Dionelis2017a, Wang2014], which regards the signal amplitude in each frequency bin as a time-varying sequence. The clean speech is first predicted according to the linear prediction (LP) model of speech, and then updated by incorporating the instantaneous noisy observation.
Assuming the speech and noise are uncorrelated, the amplitude of the noisy speech is expressed by:
where , , and represent the short-time Fourier transform (STFT)-domain signals of the noisy speech, clean speech and noise, respectively. By modelling the clean speech amplitude as an auto-regressive (AR) process, can be further expressed by a -order LP model, and in the matrix form, we have:
is the hidden state vector of KF, is the state transition matrix defined in [So2011a] according to the LP coefficients, is a vector, and is the LP residual. In practice, the unknown LP coefficients are estimated via LP analysis on the output of WF.
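For orientation, one common way to write the additive amplitude model and the LP state-space model described above is sketched below. The notation is assumed for illustration and is not taken from the text: $l$ and $k$ index frames and frequency bins, $Y$, $S$ and $N$ are the STFT coefficients of the noisy speech, clean speech and noise, $p$ is the LP order, $a_i$ are the LP coefficients (sign convention assumed), and $W$ is the LP residual. In this sketch $\mathbf{F}$ is a companion matrix whose last row holds the LP coefficients.

```latex
\begin{align}
  |Y(l,k)| &= |S(l,k)| + |N(l,k)| \\
  |S(l,k)| &= \sum_{i=1}^{p} a_i(l,k)\,|S(l-i,k)| + W(l,k) \\
  \mathbf{s}(l,k) &= \mathbf{F}(l,k)\,\mathbf{s}(l-1,k) + \mathbf{g}\,W(l,k), \\
  \mathbf{s}(l,k) &= \big[\,|S(l-p+1,k)|,\ \dots,\ |S(l,k)|\,\big]^{T}, \qquad
  \mathbf{g} = [\,0,\ \dots,\ 0,\ 1\,]^{T}
\end{align}
```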
As shown in Fig. 1, the KF based speech enhancement has two stages: predicting and updating. Given the hidden state that consists of the clean speech estimates in the previous frame, the LP estimation of the clean speech is first obtained using the LP model (2), as:
The estimation is then updated by incorporating the noisy observation in the current frame, using a Kalman gain , as
is determined by comparing the noise variance and the variance matrix of the LP residual , as
such that the KF output approximates the LP estimation in strong noise cases and relies more on the observation otherwise. The statistics of the LP residual can be updated according to the LP model and the KF output; the details can be found in [So2011a, Wang2013, Dionelis2017a, Wang2014] and are omitted here for simplicity. It has been shown by Xue et al. [Xue2018, Xue2018a, Xue2018b] that the KF becomes the WF if the LP information is not exploited, thus KF can be seen as a combination of temporal LP and instantaneous linear filtering.
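The predict and update stages described above can be illustrated with a minimal sketch. It is simplified to a first-order (scalar) model for a single frequency bin, whereas the KF in the text uses a vector state of LP order; the function name and the parameters `a` (a single LP coefficient), `q` (LP residual variance, assumed positive) and `noise_var` are hypothetical and chosen only for illustration.

```python
def kalman_step(s_prev, p_prev, y, a, q, noise_var):
    """One predict-update step of a scalar Kalman filter (illustrative only).

    s_prev, p_prev : previous clean-amplitude estimate and its error variance
    y              : noisy amplitude observed in the current frame
    a, q           : assumed first-order LP coefficient and LP residual variance
    noise_var      : noise variance supplied by an external estimator
    """
    # Predict: propagate the previous estimate through the LP model.
    s_pred = a * s_prev
    p_pred = a * a * p_prev + q

    # Gain: compare the prediction uncertainty with the noise variance.
    gain = p_pred / (p_pred + noise_var)

    # Update: correct the prediction with the noisy observation.
    s_new = s_pred + gain * (y - s_pred)
    p_new = (1.0 - gain) * p_pred
    return s_new, p_new, gain
```

When `noise_var` is large the gain approaches zero and the output stays close to the LP prediction; when `noise_var` is zero the gain is one and the output equals the observation, which matches the behaviour described in the text.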
3 Proposed Method
Although the speech evolution is exploited in the design of the conventional signal processing based KF, the statistical models, for instance the modulation-domain additive signal model in (1) and the LP model of speech evolution in (2), may not adequately represent different realistic noisy conditions. An external estimator is also required to provide the noise level estimate in Fig. 1, which, in the context of statistical signal processing, is usually based on the unrealistic assumption of stationary noise.
The above shortcomings of the conventional KF can be overcome by exploiting DNNs to learn the complex models from data. A novel NKF is proposed by extending the concept of conventional KF to the supervised learning scheme. Unlike conventional DNN-based methods, which typically learn end-to-end non-linear mappings from the noisy signal to the clean target, we integrate statistical signal processing into the network, and use the KF's “predict-update” scheme to control the behaviour of the network more effectively. The statistical signal processing components can be seen as providing prior expert knowledge to the network, and serve as regularization for network optimization.
The diagram of the proposed NKF is shown in Fig. 2, which is similar to the KF, and consists of a) long short-term memory (LSTM) prediction, b) linear WF, and c) linear weighting between the LSTM prediction and WF output based on a learned NKF gain. Details will be described in the following subsections.
3.1 LSTM Prediction
The LSTM has a strong capability for sequential modelling and has shown superior performance in noise reduction [Weninger2015, Gao2018, Sun2017]. An LSTM prediction network is constructed as in Fig. 3, which not only predicts the clean speech amplitude but also estimates the prediction residual from the noisy input, in accordance with the KF framework.
With the notations in Section 2, in each frame , an feature vector is formed using the noisy amplitudes in different frequency bins as
where is the number of frequency bins. A sequence of the feature vectors is first fed into the LSTM layers to model the temporal evolution of speech. Then, in each time step, two separate fully-connected output layers transform the outputs of the LSTM layers into the clean amplitude prediction and the prediction residual, respectively.
We note that unlike the conventional KF, which relies solely on the previous filter output to perform LP, the LSTM prediction network uses the noisy amplitude spectrum as input. This is because we believe that the speech evolution has already been modelled by the hidden state propagation in the LSTM, and the additional noisy observation in each frame can help the LSTM to achieve more accurate prediction.
3.2 Linear WF
The LSTM prediction is updated by combining it with the output of linear WF. Based on the signal model in (1), in each time-frequency (TF) bin, the optimal Wiener filter under the MMSE criterion is given by
where is the variance of noisy speech, and are defined similarly on clean speech and noise, respectively. Then the clean speech amplitude is obtained simply as
Practically is unknown and can conventionally be estimated by algorithms such as [Rangachari2006, Cohen2002, Gerkmann2011]. can be computed by averaging the noisy power spectrum over the past few frames.
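The Wiener filtering step described above can be sketched for a single TF bin as follows. This is an illustration under stated assumptions, not the exact formulas of the paper: the clean speech variance is taken as the noisy variance minus the noise variance (which follows from the uncorrelated assumption), negative values are clipped to zero, and all function and parameter names are hypothetical.

```python
def noisy_variance(power_frames):
    """Estimate the noisy-speech variance by averaging the noisy power
    |Y|^2 of one TF bin over the supplied past frames."""
    return sum(power_frames) / len(power_frames)


def wiener_gain(var_noisy, var_noise, floor=0.0):
    """Wiener gain var_clean / var_noisy for one TF bin.

    Assumption: var_clean = max(var_noisy - var_noise, 0), i.e. speech and
    noise are uncorrelated and negative estimates are clipped.
    """
    if var_noisy <= 0.0:
        return floor
    var_clean = max(var_noisy - var_noise, 0.0)
    return max(var_clean / var_noisy, floor)


def wiener_output(noisy_amp, var_noisy, var_noise):
    """Clean-amplitude estimate: the gain applied to the noisy amplitude."""
    return wiener_gain(var_noisy, var_noise) * noisy_amp
```

The clipping and the optional `floor` are practical safeguards added for this sketch; they keep the gain within [0, 1] when the noise variance is overestimated.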
Here we integrate the above linear filtering process into the proposed NKF framework. Since the noise variance is unknown, a noise estimation network is first constructed by taking the amplitude vector and the variance vector
as inputs, and outputs the noise variances in different frequency bins. The noise estimation network is a ReLU-activated FNN, and uses a left-side context window to concatenate the input features and capture short-term dependencies. As shown in Fig. 2, once the noise variance is known, the clean speech amplitude is calculated by (7) and (8).
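The left-side context window mentioned above can be sketched as a simple frame-stacking routine. The function name is hypothetical, and padding the first frames by repeating frame 0 is an assumption of this sketch; the text does not specify how the start of an utterance is handled.

```python
def add_left_context(frames, context):
    """Concatenate each frame with its `context` preceding frames.

    frames  : list of per-frame feature lists (all of the same length)
    context : number of past frames placed in front of the current frame
    Frames before the start of the sequence are replaced by frame 0
    (an assumption made for this illustration).
    """
    stacked = []
    for l in range(len(frames)):
        window = []
        # Oldest frame first, current frame last.
        for offset in range(context, -1, -1):
            idx = max(l - offset, 0)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked
```

Each output vector has `(context + 1)` times the original feature dimension and contains no future frames, so the stacking stays causal.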
3.3 Linear Weighting
The clean amplitude estimations from the LSTM prediction network and WF are finally combined by linear weighting to yield the NKF output . Similar to (4), the weighting is controlled by an NKF gain :
where the TF bin index “” is omitted for simplicity.
The Kalman gain in (5) trades off between the LSTM prediction and WF output, and can actually be designed more flexibly according to the speech dominance in each TF bin [Xue2018b]. In the proposed method, the NKF gain is computed from an NKF gain network. The network takes the concatenation of the LSTM prediction residual and the noise variance as input and outputs the linear weight within the range . The network is an FNN with ReLU activation in the hidden layers and Sigmoid activation in the output layer.
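The linear weighting can be sketched as below. By analogy with the KF update, the sketch assumes the output is the LSTM prediction plus the gain times the difference between the WF output and the LSTM prediction; the direction of the weighting, the function names and the use of raw pre-activation values (`gain_logits`) in place of a trained gain network are all assumptions for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def nkf_combine(lstm_pred, wf_out, gain_logits):
    """Combine two clean-amplitude estimates per frequency bin.

    lstm_pred   : amplitudes predicted by the LSTM network
    wf_out      : amplitudes from the Wiener filter
    gain_logits : pre-activation outputs standing in for the gain network
    """
    combined = []
    for s_lstm, s_wf, z in zip(lstm_pred, wf_out, gain_logits):
        g = sigmoid(z)  # Sigmoid output layer keeps the weight between 0 and 1
        combined.append(s_lstm + g * (s_wf - s_lstm))
    return combined
```

Because the combination is a convex mix of two differentiable branches, gradients flow to both the LSTM prediction and the quantities feeding the Wiener filter.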
3.4 Network Optimization
Since the integrated linear filtering components are all differentiable, the proposed NKF can be directly optimized through back propagation (BP). We choose the amplitude spectrum of the clean speech as the target and train the network under the mean-squared error (MSE) criterion. The network is trained in a sequence-to-sequence manner; as a whole, the noisy features consist of the noisy amplitude spectrum in different frames, together with the left-side contexts and variances of the amplitude spectrum in each frame.
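The MSE criterion over a training sequence can be sketched as follows. The nested-list layout (frames by frequency bins) and the function name are assumptions of this sketch; in practice the loss would be computed by the training framework so that gradients can be back-propagated.

```python
def mse_loss(predicted, target):
    """Mean-squared error between predicted and clean amplitude spectra.

    predicted, target : lists of frames, each frame a list of per-bin
                        amplitudes with matching shapes
    """
    total = 0.0
    count = 0
    for pred_frame, tgt_frame in zip(predicted, target):
        for p, t in zip(pred_frame, tgt_frame):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty input")
    return total / count
```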
Once the clean amplitude spectrum is obtained, the time-domain speech is recovered by the inverse STFT, using the phase spectrum of the noisy speech.
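Reusing the noisy phase amounts to attaching the phase of each noisy STFT coefficient to the enhanced amplitude before the inverse transform. A minimal sketch for one frame is given below; the function name is hypothetical and the inverse STFT itself (windowing and overlap-add) is omitted.

```python
import cmath


def apply_noisy_phase(enhanced_amp, noisy_stft):
    """Build complex STFT coefficients from enhanced amplitudes and the
    phase of the noisy STFT, bin by bin."""
    return [
        amp * cmath.exp(1j * cmath.phase(y))
        for amp, y in zip(enhanced_amp, noisy_stft)
    ]
```

The resulting coefficients have the enhanced magnitude and the unmodified noisy phase, and would then be passed to an inverse STFT routine.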
4.1 Experimental Setup
The experiments are conducted using the Librispeech [Panayotov2015] corpus which contains clean speech, the PNL-100Nonspeech-Sounds (PNL) [Hu2010a] corpus which contains 100 types of noise that are mostly non-stationary, as well as the noise subset of the MUSAN corpus (MUSAN-Noise) [musan2015].
We create unmatched noisy conditions for training and testing. A 300-hour training set is prepared by repeatedly picking a pair of clean speech and noise signals from the “CLEAN-360” subset of Librispeech and the PNL dataset, and then mixing them at a speech-to-noise ratio (SNR) level randomly chosen from dB. The test set contains all utterances of the “TEST-CLEAN” and “TEST-OTHER” subsets of Librispeech, corrupted by 20 types of noise signals randomly selected from MUSAN-Noise at SNR levels of dB.
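Mixing a speech and noise pair at a prescribed SNR can be sketched as below. This is a generic illustration rather than the authors' data pipeline: the function name is hypothetical, both signals are assumed to have equal length, and power is measured over the whole signal.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db` (in dB), then add it to `speech`.

    Returns the mixture and the noise scaling factor.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise signal has zero power")
    # Solve 10*log10(p_speech / (scale^2 * p_noise)) = snr_db for scale.
    scale = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    mixture = [s + scale * n for s, n in zip(speech, noise)]
    return mixture, scale
```

In a real pipeline the noise clip would typically be cropped or looped to the utterance length before mixing; that step is left out here.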
The sample rate of all speech and noise signals is kHz, and the analysis window for STFT and feature extraction is samples with overlap. The LSTM prediction network of the proposed method has two 1024-node hidden LSTM layers, and both the noise estimation and the NKF gain network have three layers with 1024 nodes in the hidden layer. The left-side context window for the noise estimation network is and the in (8) is computed using previous frames.
We compare the proposed method with a) LSTM-2L, which has the same structure as the LSTM prediction network of the proposed method, except that the prediction residual output is discarded; b) LSTM-4L, a variation of LSTM-2L with four hidden LSTM layers, so that a larger LSTM network is included in the comparison, since additional components are integrated into the NKF; c) WF; and d) KF. For fair comparison, the WF and KF also use the noise estimation from a separately trained DNN which has the same structure as the noise estimation network of the proposed method. During training, the batch size of all methods is , and the sequence length of the proposed method and the LSTM-based baselines is frames. All networks are trained for epochs.
The speech enhancement performance of the different methods is evaluated in terms of objective speech quality measures and the word error rate (WER) of ASR. We use the Itakura-Saito distance (ISD) [Quackenbush1988], frequency-weighted segmental SNR (FwSegSNR) [Hu2006a] and perceptual evaluation of speech quality (PESQ) [ITU_T_P862_TRa] as objective metrics. ASR is evaluated using ESPNET [Shinji2018] with an officially released pretrained model (large Transformer with SpecAug and Transformer language model) on the Librispeech dataset.
The objective evaluation results of the different methods at different SNRs are depicted in Fig. 4 and Fig. 5. It can be seen that the proposed NKF consistently yields the lowest ISD and the highest FwSegSNR in different scenarios, indicating that the proposed method has the best performance in suppressing noise while controlling speech distortion. In almost all cases, the proposed method achieves PESQ comparable to that of LSTM-2L. Comparing LSTM-2L with LSTM-4L, we notice that simply using a larger model does not necessarily improve performance, and may instead cause over-fitting, which degrades performance in unmatched noise conditions.
The superiority of the proposed NKF is also demonstrated by the ASR performance, which is summarized in Tab. 1 and Tab. 2. In all SNR conditions of both the TEST-CLEAN and TEST-OTHER subsets, the utterances after NKF noise reduction have the lowest WER. On the TEST-CLEAN subset, the proposed NKF improves on LSTM-2L by a 12.14% relative WER reduction when the SNR is dB, and reduces the WER by 53.45% compared with the raw noisy signal in the dB condition. Similar conclusions can be drawn from the performance on the TEST-OTHER subset.
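For reference, a relative WER reduction is conventionally computed as the WER difference divided by the baseline WER. The helper below is a generic sketch with a hypothetical name; the numbers in the usage note are made up for illustration and are not results from the tables.

```python
def relative_wer_reduction(wer_baseline, wer_system):
    """Relative WER reduction in percent of `wer_system` over `wer_baseline`."""
    if wer_baseline <= 0.0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_system) / wer_baseline
```

For example, a hypothetical drop from a baseline WER of 20.0 to 15.0 corresponds to `relative_wer_reduction(20.0, 15.0)`, a 25% relative reduction.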
An NKF-based speech enhancement method is proposed by integrating the DNN with statistical signal processing, following the framework of conventional KF. The statistical signal processing components can be seen as providing prior expert knowledge and serve as regularization for the network. Experimental results in different noisy conditions show the effectiveness of the proposed method in both objective speech quality evaluation and ASR.