1 Introduction
Speech enhancement aims to recover the target speech from a noisy observed signal. In the single-channel case, the standard method is time-frequency (TF) masking in the short-time Fourier transform (STFT) domain. Recently, speech enhancement has been advanced by the use of deep neural networks (DNNs) to estimate the TF mask. To effectively model a speech signal, which is time-sequential data, recurrent neural networks (RNNs) are used in various speech signal processing applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].
While RNNs have been effectively applied to speech enhancement, they are difficult to train in general because the gradient vanishes or explodes at an exponential rate as backpropagation passes through the same layer repeatedly. This difficulty is the so-called vanishing/exploding gradient problem [15], and several methods have been proposed to solve it [16, 17, 18]. One of the popular DNN structures for mitigating this problem is the long short-term memory (LSTM) [18] illustrated in Fig. 1(a). By combining three gated units (input gate, forget gate, and output gate), LSTM solves the vanishing gradient problem to some extent. As they can be trained effectively in practice, LSTM and the bidirectional LSTM (BLSTM) have been applied to speech enhancement and performed better than the conventional methods at the time [2, 4, 8, 9, 10, 11, 12, 13, 14].
Considering practical situations in the real world, some research on DNN-based speech enhancement has focused on real-time application [19, 20, 21, 22, 23]. To apply an enhancement method in real time, the system must be causal, i.e., it uses past information only and does not require future information to estimate the enhanced signal. Therefore, unidirectional LSTMs are often used for that task [19, 20, 21, 22]. However, as the price of mitigating the vanishing gradient problem, LSTM consists of a large number of parameters, as in Fig. 1(a). Since more parameters require more computational resources, a simpler RNN should be more suitable for real-time speech enhancement than LSTM if the gradient problem can be solved in a different way.
In this paper, we propose a real-time speech enhancement method using a causal RNN with far fewer parameters than LSTM. In the proposed method, the equilibriated recurrent neural network (ERNN) [24] is used as the TF mask estimator. ERNN is a simpler RNN, as shown in Fig. 1(b), and can avoid the vanishing/exploding gradient problem by iteratively applying the same layer to the hidden state vector. Ideally, the gradient of ERNN in backpropagation does not vanish or explode [24], and therefore long-term dependencies in the sequential data can be learned without gated units such as those in LSTM. As a result, the number of parameters of ERNN can be noticeably decreased while maintaining the speech enhancement performance. The experimental results confirmed that the proposed method can reduce the number of parameters to less than 1/14 of that of the LSTM network without sacrificing the performance.
2 DNN-based speech enhancement
This paper focuses on TF masking for speech enhancement. In this section, after briefly introducing DNN-based TF masking and RNN, real-time speech enhancement is explained.
2.1 Time-frequency masking based on DNN
The aim of speech enhancement is to recover the target speech signal $s$ degraded by noise $n$ from an observed signal $x$,
$$x[t] = s[t] + n[t], \tag{1}$$
where $t$ is the time index. It can be rewritten in the TF domain as
$$X[\omega, k] = S[\omega, k] + N[\omega, k], \tag{2}$$
where $X$ is the TF representation of $x$ (the spectrogram obtained by STFT in this paper), and $\omega$ and $k$ denote the indices of frequency and time frame, respectively. In TF masking, the estimated target signal $\hat{S}$ is acquired by the element-wise multiplication of a TF mask $G$ with the observation $X$:
$$\hat{S}[\omega, k] = G[\omega, k]\, X[\omega, k]. \tag{3}$$
Then, the enhanced result is transformed back to the time domain by the inverse transform. The TF mask $G$ must be estimated solely from $X$, which is the difficult part.
Many methods have applied DNNs to estimate the TF mask. In the deep-learning-based approach, a TF mask $G$ is estimated as
$$G = \mathcal{M}(\phi \mid \theta), \tag{4}$$
where $\mathcal{M}$ is a regression function implemented by a DNN, $\theta$ is the set of its parameters, and $\phi$ is the input acoustic feature. Since the signal is time-sequential data indexed by $k$, an RNN is often used for realizing the regression function $\mathcal{M}$.
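As a small illustration of the masking step, the following sketch applies a real-valued mask to a toy complex spectrogram bin by bin. The ratio mask computed from the known speech and noise is only a hypothetical stand-in for the DNN estimate, and the function names and toy values are assumptions made for this example.

```python
def oracle_ratio_mask(speech, noise, eps=1e-12):
    """Hypothetical stand-in for the DNN: |S| / (|S| + |N|) per TF bin."""
    return [
        [abs(s) / (abs(s) + abs(n) + eps) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech, noise)
    ]


def apply_tf_mask(mask, observation):
    """Element-wise product of a real mask and a complex spectrogram."""
    if len(mask) != len(observation):
        raise ValueError("mask and observation must have the same shape")
    return [
        [g * x for g, x in zip(g_row, x_row)]
        for g_row, x_row in zip(mask, observation)
    ]


# Toy spectrograms indexed as [frequency][frame]; the values are illustrative.
S = [[1 + 1j, 0j], [2 + 0j, 0.5j]]
N = [[0.1 + 0j, 0.3j], [0j, 0.5 + 0j]]
X = [[s + n for s, n in zip(s_row, n_row)] for s_row, n_row in zip(S, N)]

G = oracle_ratio_mask(S, N)   # mask values lie between 0 and 1
S_hat = apply_tf_mask(G, X)   # masked (enhanced) spectrogram
```

Bins dominated by speech receive a mask value close to one and are passed almost unchanged, whereas bins containing only noise are suppressed.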
2.2 Recurrent neural network (RNN) and LSTM
Among many DNN structures, RNN is a popular network for modelling time-sequential data including speech. An RNN consists of a function $\mathcal{F}$ which outputs the current hidden state vector $h_k$ from the past state vector $h_{k-1}$ and the current input feature $\phi_k$ as follows:
$$h_k = \mathcal{F}(h_{k-1}, \phi_k), \tag{5}$$
where the recurrent structure on the state vector enables the network to learn the long-term dependencies of time series with fewer parameters compared to non-recurrent networks.
Although an RNN can effectively handle information from the past, it may not perform well in practice because of the difficulty of its training, the so-called vanishing/exploding gradient problem. When backpropagation is performed on an RNN, the gradient passes through the same layer repeatedly. Then, by the chain rule, the gradient of the current state vector $h_k$ with respect to a past state vector $h_{k-\tau}$ is the product of the gradients for all intermediate state vectors:
$$\frac{\partial h_k}{\partial h_{k-\tau}} = \prod_{i=k-\tau+1}^{k} \frac{\partial h_i}{\partial h_{i-1}}. \tag{6}$$
Therefore, the backpropagated gradient vanishes or explodes at an exponential rate unless the norm of each gradient is equal to one, i.e., $\|\partial h_i / \partial h_{i-1}\| = 1$. Even though an RNN has the ability to model long-term dependencies, learning them is difficult because the dependency between the current and past information is quickly lost as the gradient vanishes.
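The exponential behaviour can be checked numerically in the scalar case, where the product of gradients reduces to a power of a single per-step gain. The sketch below is only an illustration; the gains 0.9, 1.0, and 1.1 and the 100 time steps are arbitrary choices.

```python
def backpropagated_gain(per_step_norm, steps):
    """Product of identical per-step gradient norms over `steps` time steps."""
    gain = 1.0
    for _ in range(steps):
        gain *= per_step_norm
    return gain


# A norm slightly below one vanishes, slightly above one explodes,
# and only a norm of exactly one is preserved over long time spans.
gains = {norm: backpropagated_gain(norm, 100) for norm in (0.9, 1.0, 1.1)}
for norm, gain in gains.items():
    print(f"per-step norm {norm}: gain after 100 steps = {gain:.3e}")
```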
To mitigate the vanishing or exploding gradient problem of RNN, several methods have been developed [16, 17, 18]. One of the most standard methods is LSTM [18] illustrated in Fig. 1(a). It includes an additional recurrent loop of the so-called cell state so that the information from the past is retained unless the forget gate eliminates it. The magnitude of the gradient of LSTM does not decrease when the forget gate is open, and thus the vanishing gradient problem is avoided on the assumption that the forget gate properly selects the information. Since it works well in practice, LSTM plays an important role in DNN-based speech enhancement. However, since the gated unit consists of twice as many parameters as the linear layer, LSTM has a large number of parameters, which requires more computational resources compared to a simpler RNN.
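To make the parameter overhead of the gated units concrete, the following sketch counts the parameters of two-layer LSTM/BLSTM networks with a 257-dimensional input and a 257-dimensional fully-connected output. It assumes four weight blocks per layer (three gates and the cell candidate) with two bias vectors each, which is the convention of some common frameworks and not something stated in this paper; the function names are hypothetical. Under these assumptions, the totals round to the parameter counts reported for LSTM2 and BLSTM2 in the result tables (1.12M, 3.81M, 2.76M, and 9.72M).

```python
def lstm_layer_params(input_size, cells):
    """Parameters of one unidirectional LSTM layer.

    Assumes four blocks (input, forget, output gates and cell candidate),
    each with input weights, recurrent weights, and two bias vectors.
    """
    return 4 * (input_size * cells + cells * cells + 2 * cells)


def linear_params(input_size, output_size):
    """Weights plus bias of a fully-connected layer."""
    return input_size * output_size + output_size


def lstm2_params(feature_size, cells, bidirectional=False):
    """Two stacked (B)LSTM layers followed by a fully-connected output layer."""
    directions = 2 if bidirectional else 1
    layer1 = directions * lstm_layer_params(feature_size, cells)
    layer2 = directions * lstm_layer_params(directions * cells, cells)
    output = linear_params(directions * cells, feature_size)
    return layer1 + layer2 + output


for cells in (256, 512):
    print(cells, lstm2_params(257, cells), lstm2_params(257, cells, True))
```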
2.3 Real-time speech enhancement and causal RNN
Some research on DNN-based speech enhancement has focused on real-time application in practical real-world situations [19, 20, 22, 23]. To apply an enhancement method in real time, the system must be causal, as illustrated in Fig. 2. In general, the TF mask $G_k$ at time index $k$ can be estimated from the input features obtained from both the past and the future,
$$G_k = \mathcal{M}(\phi_{k-\tau}, \ldots, \phi_k, \ldots, \phi_{k+\tau} \mid \theta), \tag{7}$$
as illustrated in Fig. 2(a), where $\tau > 0$. However, such a non-causal network cannot be applied in real time because the future information is not available at the time of estimating $G_k$. It requires some delay for buffering the input features until all necessary information is obtained. For real-time applications, the network must be causal, i.e., estimation must be performed based on the past information only:
$$G_k = \mathcal{M}(\phi_{k-\tau}, \ldots, \phi_k \mid \theta). \tag{8}$$
This requirement makes RNN suitable for real-time applications because the past information can be encoded into the hidden state vector so that only the input feature $\phi_k$ and the state vector $h_{k-1}$ are required for estimating the mask at time $k$ as
$$G_k = \mathcal{M}(\phi_k, h_{k-1} \mid \theta). \tag{9}$$
Owing to the causality and the advantage in training explained in the previous subsection, unidirectional LSTMs are often used in real-time speech enhancement [19, 20, 22]. While LSTM performs well in practice, any causal RNN written in the form of Eq. (9) can be used for real-time speech enhancement. That is, it should be possible to construct a computationally cheaper RNN for real-time applications if the training issue can be solved.
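The frame-by-frame processing implied by a causal RNN can be sketched as follows. The recurrent cell with fixed scalar weights is a hypothetical toy, not the network used in this paper; the point is only that each mask depends on the current feature and the stored state, so later frames cannot affect earlier outputs.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def rnn_step(h_prev, phi, w_h=0.5, w_x=0.8):
    """Hypothetical toy cell: new state from the past state and current feature."""
    return [math.tanh(w_h * h + w_x * x) for h, x in zip(h_prev, phi)]


def stream_masks(features):
    """Yield one mask per incoming frame, keeping only the hidden state."""
    h = None
    for phi in features:
        if h is None:
            h = [0.0] * len(phi)  # zero state before the first frame
        h = rnn_step(h, phi)
        yield [sigmoid(v) for v in h]


features = [[0.1, 0.2], [0.3, -0.4], [1.0, 1.0]]
masks = list(stream_masks(features))
```

Because the generator consumes one frame at a time, no buffering of future frames is needed, which is the property required for real-time operation.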
3 Proposed method
For real-time speech enhancement, a causal DNN with fewer parameters is preferred for reducing the computational requirement. Considering such conditions, we propose a causal DNN-based speech enhancement method using ERNN illustrated in Fig. 1(b).
3.1 Equilibriated recurrent neural network (ERNN)
ERNN is an RNN which avoids the vanishing/exploding gradient problem by skip connections and repeated application of the same block [24]. It is inspired by the fixed-point recursion of the implicit discretization scheme for an ordinary differential equation. By introducing an intermediate variable $h_k^{(i)}$ with iteration index $i$, a simple form of ERNN can be written as
(10)
where $\eta^{(i)}$ is a small trainable scalar, $K$ is the total number of iterations, the initial value is typically $h_k^{(0)} = 0$, and the updated state vector is given as the iterated result $h_k = h_k^{(K)}$, i.e., ERNN returns $h_k$ after $K$ iterations based on the inputs $h_{k-1}$ and $\phi_k$ as in Eq. (5). Here, $\mathcal{F}$ is a nonlinear function implemented by a neural network, which makes Eq. (10) a multilayer RNN as in Fig. 1(b).
The notable property of ERNN is that its gradient does not vanish or explode in the ideal situation [24]. That is, the norm of the gradient is equal to one: $\|\partial h_k / \partial h_{k-1}\| = 1$. Therefore, it is expected that ERNN can learn long-term dependencies without suffering from the training issue because the gradient survives in the parameter update for all time instances. This property should allow us to simplify the network because the gated units used in LSTM are no longer necessary for alleviating the difficulty of training. We show experimentally in the next section that a simple ERNN with far fewer parameters can compete with LSTM.
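The iteration described above can be sketched in code under an explicit assumption about its form. Following the fixed-point recursion of the ERNN in [24], the sketch assumes the update $h_k^{(i+1)} = h_k^{(i)} + \eta^{(i)} \left[ \mathcal{F}(h_k^{(i)} + h_{k-1}, \phi_k) - (h_k^{(i)} + h_{k-1}) \right]$ with $h_k^{(0)} = 0$. This is an assumed form and may differ from Eq. (10) in detail; the toy nonlinear function and all names are hypothetical.

```python
import math


def toy_nonlinearity(state, phi):
    """Hypothetical stand-in for the neural network F."""
    return [math.tanh(0.5 * s + 0.5 * x) for s, x in zip(state, phi)]


def ernn_step(h_prev, phi, etas, nonlinearity=toy_nonlinearity):
    """One time step of the assumed ERNN recursion.

    `etas` holds one small scalar per inner iteration, so K = len(etas).
    """
    h = [0.0] * len(h_prev)  # assumed initial value of the inner iteration
    for eta in etas:
        state = [a + b for a, b in zip(h, h_prev)]
        target = nonlinearity(state, phi)  # the same F is reused every iteration
        h = [a + eta * (t - s) for a, t, s in zip(h, target, state)]
    return h  # result after K iterations, used as the new hidden state


h_next = ernn_step([0.2, -0.1], [1.0, 0.5], etas=[0.1, 0.1, 0.1])
```

Since the same nonlinear function is applied in every inner iteration, increasing the number of iterations adds computation but only one extra scalar parameter per iteration.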
3.2 Proposed speech enhancement method using ERNN
We propose a speech enhancement method based on a causal ERNN. The proposed method estimates the TF mask $G_k$ for the current time frame by an ERNN whose input is based only on the current input feature $\phi_k$ and the hidden state vector $h_{k-1}$ as
$$G_k = \sigma\big(W\, \mathrm{ERNN}_K(\phi_k, h_{k-1}) + b\big), \tag{11}$$
where $W$ and $b$ are the matrix and bias of the fully-connected layer, respectively, $\sigma$ is the sigmoid function, and $\mathrm{ERNN}_K$ is the function iterating Eq. (10) $K$ times from the initial value $h_k^{(0)}$ using the nonlinear function $\mathcal{F}$.
Since the expressive power of ERNN is determined by the nonlinear function $\mathcal{F}$, the performance and the computational requirements can be traded off by appropriately designing it. For real-time applications, we aim to reduce the number of parameters of the system while maintaining the performance so that the proposed method can compete with the standard LSTM-based methods. As the first step of the investigation of the proposed method, we consider a fully-connected DNN with the ReLU activation as illustrated in Fig. 3 because it is easy to adjust the number of parameters by changing the size of the fully-connected layers. As the DNN $\mathcal{F}$ is common for all $i$ in the iteration of Eq. (10), the number of parameters is that of $\mathcal{F}$ plus $K$ (which comes from the scalars $\eta^{(i)}$). Note that this choice of $\mathcal{F}$ is merely an example, and it should be possible to design a better network consisting of fewer parameters.
4 Experiment
To confirm the effectiveness of the proposed method, its speech enhancement performance was compared with that of LSTM-based baselines. We conducted two experiments. In the first experiment, we compared the performance and the number of parameters of the proposed and conventional methods with the same number of cell units. In the second experiment, the number of parameters of the proposed method was decreased to see how the performance varies for smaller DNNs. Our implementation of these experiments is openly available online (https://github.com/dtake1336/ERNNforspeechenhancement).
4.1 Experimental condition
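Table 1: DNN architectures used in the experiments.
| Layer | Type | Size (activation) |
|---|---|---|
| LSTM2/BLSTM2 | | |
| Layer1 | LSTM/BLSTM | 257 |
| Layer2 | LSTM/BLSTM | |
| output | Fully connected | 257 (sigmoid) |
| ERNN | | |
| Layer1 | ERNN | 257 |
| output | Fully connected | 257 (sigmoid) |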
4.1.1 Dataset
We utilized the VoiceBank-DEMAND dataset constructed by Valentini et al. [25], which is openly available (http://dx.doi.org/10.7488/ds/1356) and frequently used in the literature on DNN-based speech enhancement. It consists of a training set and a test set, each containing noisy mixtures and the corresponding clean speech signals, i.e., noise and speech signals were already mixed by the authors [25]. The training and test sets consist of 28 and 2 speakers (11 572 and 824 utterances) [26], contaminated by 10 types of noise (DEMAND, speech-shaped noise, and babble) and 5 types of noise (DEMAND) [27], respectively. All data were downsampled from 48 kHz to 16 kHz.
4.1.2 DNN architecture, loss function and training setup
The STFT used a 512-point (32 ms) Hann window, a 256-point time shift, and a 512-point FFT length, and the inverse STFT was implemented by its canonical dual [28]. In the proposed method, the DNN in ERNN was that illustrated in Fig. 3, and the iteration number $K$ was varied as 1/3/5. The size of the hidden vector and the dimension of the linear layer were varied as 512/256 and 512/256/128/64/32, respectively. For the baseline methods, two-layered LSTM and BLSTM, which are popular and have been successfully applied to speech enhancement [2], consisting of 512/256 cells were used as summarized in Table 1. For the input feature, the log-magnitude spectrogram,
$$\phi = \log(|X|), \tag{12}$$
was used for all networks, where $|\cdot|$ denotes the absolute value. As the activation function of the output layer, the sigmoid function was used to limit the values to the range of 0 to 1.
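The STFT settings above fix the feature dimension and the frame rate, which the following sketch works out. The small constant `eps` added before the logarithm is an assumption for numerical safety and is not stated in the paper; the function names are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # Hz, after downsampling
WINDOW_LENGTH = 512  # samples
HOP_LENGTH = 256     # samples
FFT_LENGTH = 512     # points


def stft_geometry(sample_rate, window_length, hop_length, fft_length):
    """Derived quantities of a one-sided STFT."""
    return {
        "freq_bins": fft_length // 2 + 1,
        "window_ms": 1000.0 * window_length / sample_rate,
        "hop_ms": 1000.0 * hop_length / sample_rate,
    }


def log_magnitude(frame, eps=1e-8):
    """Log-magnitude feature of one complex STFT frame (eps is an assumption)."""
    return [math.log(abs(x) + eps) for x in frame]


geometry = stft_geometry(SAMPLE_RATE, WINDOW_LENGTH, HOP_LENGTH, FFT_LENGTH)
feature = log_magnitude([1 + 0j, 0.5j, 0j])
```

The 257 frequency bins obtained here correspond to the layer size of 257 in Table 1, and the 256-sample shift means that a causal system must produce one mask every 16 ms.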
For the loss function in the training, the mean absolute error measured in the time domain was used:
$$\mathcal{L} = \frac{1}{T} \left\| s - \mathrm{STFT}^{-1}(G \odot X) \right\|_1, \tag{13}$$
where $T$ is the number of time samples, $\odot$ is the element-wise multiplication, and $\mathrm{STFT}^{-1}$ denotes the inverse STFT. Each DNN was trained for 200 epochs, where each epoch contained 11 572 utterances. A one-second-long segment was randomly picked from each utterance, and the minibatch size was 16. The Adam optimizer [29] was utilized with a fixed learning rate of 0.0001.
The performance of speech enhancement was measured by PESQ [30] and three measures, CSIG, CBAK, and COVL [31], which are popular predictors of the mean opinion score (MOS) of the signal distortion, the background noise interference, and the overall effect, respectively.
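The time-domain mean absolute error and the random one-second segment selection used in training can be sketched as below. The inverse STFT is omitted, so the loss is evaluated on plain sample lists; the function names, the fixed random seed, and the silent toy signal are assumptions made for this example.

```python
import random

SAMPLE_RATE = 16000  # Hz, so a one-second segment has 16000 samples


def mean_absolute_error(target, estimate):
    """Mean absolute difference between two equal-length signals."""
    if len(target) != len(estimate):
        raise ValueError("signals must have the same length")
    return sum(abs(t - e) for t, e in zip(target, estimate)) / len(target)


def random_segment_start(num_samples, segment_length, rng):
    """Start index of a random segment; 0 if the signal is too short."""
    if num_samples <= segment_length:
        return 0
    return rng.randrange(num_samples - segment_length + 1)


rng = random.Random(0)
clean = [0.0] * (3 * SAMPLE_RATE)  # toy three-second utterance
start = random_segment_start(len(clean), SAMPLE_RATE, rng)
segment = clean[start:start + SAMPLE_RATE]

loss = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
```

The same start index would be applied to the noisy and clean signals of an utterance so that the pair stays aligned.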
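Table 2: Results with 256 cell units.
| DNN | Hidden size | Linear dim. | Iterations $K$ | Params. | PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|---|
| ERNN | 256 | 256 | 1 | 329k | 2.42 | 3.57 | 2.58 | 2.98 |
| ERNN | 256 | 256 | 3 | 329k | 2.49 | 3.58 | | 3.02 |
| ERNN | 256 | 256 | 5 | 329k | 2.43 | 3.56 | 2.58 | 2.98 |
| BLSTM2 | 256 | – | – | 2.76M | | | | |
| LSTM2 | 256 | – | – | 1.12M | 2.34 | 3.49 | 2.55 | 2.90 |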
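Table 3: Results with 512 cell units.
| DNN | Hidden size | Linear dim. | Iterations $K$ | Params. | PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|---|
| ERNN | 512 | 512 | 1 | 1.05M | 2.43 | 3.65 | 2.60 | 3.03 |
| ERNN | 512 | 512 | 3 | 1.05M | 2.43 | 3.60 | 2.59 | 3.00 |
| ERNN | 512 | 512 | 5 | 1.05M | 2.41 | | 2.58 | 3.02 |
| BLSTM2 | 512 | – | – | 9.72M | | | | |
| LSTM2 | 512 | – | – | 3.81M | 2.45 | 3.63 | 2.61 | 3.03 |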
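Table 4: Results of ERNN with hidden size 256 and reduced linear-layer dimension.
| DNN | Hidden size | Linear dim. | Iterations $K$ | Params. | PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|---|
| ERNN | 256 | 32 | 1 | 215k | 2.30 | 3.34 | 2.52 | 2.80 |
| ERNN | 256 | 32 | 3 | 215k | 2.39 | 3.54 | 2.57 | 2.95 |
| ERNN | 256 | 32 | 5 | 215k | 2.32 | 3.45 | 3.45 | 2.87 |
| ERNN | 256 | 64 | 1 | 231k | 2.43 | 3.60 | 2.59 | 3.00 |
| ERNN | 256 | 64 | 3 | 231k | 2.40 | 3.56 | 2.58 | 2.97 |
| ERNN | 256 | 64 | 5 | 231k | 2.37 | 3.57 | 2.56 | 2.96 |
| ERNN | 256 | 128 | 1 | 264k | 2.45 | 3.64 | 2.61 | 3.03 |
| ERNN | 256 | 128 | 3 | 264k | 2.40 | 3.57 | 2.58 | 2.97 |
| ERNN | 256 | 128 | 5 | 264k | | | | |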
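Table 5: Results of ERNN with hidden size 512 and reduced linear-layer dimension.
| DNN | Hidden size | Linear dim. | Iterations $K$ | Params. | PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|---|
| ERNN | 512 | 32 | 1 | 560k | 2.35 | 3.43 | 2.54 | 2.87 |
| ERNN | 512 | 32 | 3 | 560k | 2.40 | 3.58 | 2.58 | 2.98 |
| ERNN | 512 | 32 | 5 | 560k | 2.41 | 3.62 | 2.58 | 3.00 |
| ERNN | 512 | 64 | 1 | 593k | 2.44 | 3.56 | 2.59 | 2.98 |
| ERNN | 512 | 64 | 3 | 593k | 2.47 | 3.60 | 2.61 | 3.02 |
| ERNN | 512 | 64 | 5 | 593k | 2.45 | 3.63 | 2.61 | 3.03 |
| ERNN | 512 | 128 | 1 | 658k | 2.49 | 3.70 | 2.62 | 3.08 |
| ERNN | 512 | 128 | 3 | 658k | 2.36 | 3.52 | 2.56 | 2.93 |
| ERNN | 512 | 128 | 5 | 658k | 2.52 | 3.69 | 2.64 | 3.09 |
| ERNN | 512 | 256 | 1 | 790k | 2.52 | 3.68 | 2.63 | 3.09 |
| ERNN | 512 | 256 | 3 | 790k | 2.48 | | 2.63 | 3.10 |
| ERNN | 512 | 256 | 5 | 790k | | 3.74 | | |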
4.2 Results
The results of the comparison are summarized in Tables 2 and 3, where the cell sizes were 256 and 512, respectively. As is well known in the speech enhancement literature, BLSTM performed better than LSTM because BLSTM is non-causal and can use information from the future, while LSTM is causal and can only use past information. Since BLSTM cannot be utilized for real-time applications, its scores are merely a reference, and LSTM is the direct competitor of the proposed method. Compared with LSTM, the proposed method obtained almost the same performance in every situation. In some situations, the proposed method also obtained performance similar to that of BLSTM even though the proposed method is causal and contains about 1/9 of the parameters. This is presumably because ERNN was able to successfully learn the long-term dependencies of the speech signals.
Since our aim is to construct a network with fewer parameters, the number of parameters of the proposed method was reduced by changing the dimension of the linear layer (see Fig. 3). The results are summarized in Tables 4 and 5, where the results for the case in which the linear-layer dimension equals the hidden-vector size can be found in Tables 2 and 3. As a general tendency, reducing the number of parameters gradually degrades the performance. However, the amount of degradation is not very significant, which indicates that the proposed method can reduce the computational requirement without losing much performance. In terms of the number of iterations $K$, more iterations tend to slightly improve the performance. The proposed method can also reduce the computational requirement by reducing $K$, where $K = 1$ means that the network is applied only once at each time frame.
Note that, comparing the best scores in Table 4 with LSTM in Table 3, the proposed method outperformed LSTM with less than 1/14 of the parameters. It is also comparable to BLSTM in Table 3 with less than 1/36 of the parameters and to BLSTM in Table 2 with around 1/10 of the parameters. Again, BLSTM cannot operate in real time as it is non-causal, and thus the proposed method should be compared with LSTM. While LSTM lost a noticeable amount of performance when its parameters were reduced, as shown in Tables 2 and 3, the proposed method can reduce the number of parameters with a moderate amount of performance degradation, as shown in the tables. Therefore, we confirmed the effectiveness of the proposed method in real-time speech enhancement, as it can be performed at a much lower computational cost than the standard LSTM networks.
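The quoted parameter ratios can be checked against the rounded counts in the result tables. The sketch below assumes that the ratios refer to the 264k-parameter ERNN (hidden size 256, linear-layer dimension 128); this reading is an inference from the tables, and the dictionary keys are hypothetical labels.

```python
# Rounded parameter counts as listed in the result tables.
PARAMS = {
    "ERNN (hidden 256, linear 128)": 264e3,
    "LSTM2 (512 cells)": 3.81e6,
    "BLSTM2 (512 cells)": 9.72e6,
    "BLSTM2 (256 cells)": 2.76e6,
}


def reduction_factor(baseline, proposed="ERNN (hidden 256, linear 128)"):
    """How many times more parameters the baseline has than the proposed network."""
    return PARAMS[baseline] / PARAMS[proposed]


factors = {
    name: reduction_factor(name) for name in PARAMS if not name.startswith("ERNN")
}
for name, factor in factors.items():
    print(f"{name}: {factor:.1f} times the parameters of the ERNN")
```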
5 Conclusions
In this paper, a causal DNN-based speech enhancement method using ERNN was proposed for real-time applications. By using ERNN, the number of parameters can be decreased thanks to its ability to learn long-term dependencies without the vanishing gradient problem. The experimental results indicated that, while the standard LSTM lost performance when the number of parameters was reduced, the proposed method can effectively trade off performance against computational requirements, which is preferable for performing speech enhancement on resource-limited devices. As this paper only considered a simple fully-connected DNN with ReLU activation as an example for ERNN, our future work includes investigating a better network that performs well with fewer parameters.
References
 [1] D. L. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, 2018.
 [2] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in 2015 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2015, pp. 708–712.
 [3] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). IEEE, 2016, pp. 31–35.
 [4] H. Zhao, S. Zarar, I. Tashev, and C. Lee, “Convolutional-recurrent neural networks for speech enhancement,” in 2018 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2018, pp. 2401–2405.
 [5] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in Int. Conf. on Mach. Learn., 2018, pp. 2415–2424.
 [6] L. Perotin, R. Serizel, E. Vincent, and A. Guérin, “Multichannel speech separation with recurrent neural networks from high-order ambisonics recordings,” in 2018 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2018, pp. 36–40.
 [7] J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, “Phasebook and friends: Leveraging discrete representations for source separation,” IEEE J. Sel. Top. Signal Process., vol. 13, no. 2, pp. 370–382, 2019.
 [8] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement,” in 2019 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2019, pp. 596–600.
 [9] Y. Koizumi, N. Harada, and Y. Haneda, “Trainable adaptive window switching for speech enhancement,” in 2019 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). IEEE, 2019, pp. 616–620.
 [10] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement,” in Int. Conf. Mach. Learn., 2019, pp. 2031–2041.
 [11] S. Chakrabarty and E. A. P. Habets, “Time-frequency masking based online multi-channel speech enhancement with convolutional recurrent neural networks,” IEEE J. Sel. Top. Signal Process., 2019.
 [12] N. Zheng and X.-L. Zhang, “Phase-aware speech enhancement based on deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 1, pp. 63–76, 2019.
 [13] Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, and D. Takeuchi, “Speech enhancement using self-adaptation and multi-head self-attention,” in 2020 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020.
 [14] M. Kawanaka, Y. Koizumi, R. Miyazaki, and K. Yatabe, “Stable training of DNN for speech enhancement based on perceptually-motivated black-box cost function,” in 2020 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020.
 [15] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, 1994.
 [16] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neural networks,” in Int. Conf. Mach. Learn., 2016, pp. 1120–1128.
 [17] S. Wisdom, T. Powers, J. R. Hershey, J. Le Roux, and L. Atlas, “Full-capacity unitary recurrent neural networks,” 2016.
 [18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
 [19] G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, and T. Virtanen, “Low latency sound source separation using convolutional recurrent neural networks,” in 2017 IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), 2017, pp. 71–75.
 [20] K. Tan and D. L. Wang, “A convolutional recurrent neural network for real-time speech enhancement,” in Interspeech 2018, 2018, pp. 3229–3233.
 [21] M. Parviainen, P. Pertilä, T. Virtanen, and P. Grosche, “Time-frequency masking strategies for single-channel low-latency speech enhancement using neural networks,” in Int. Workshop Acoust. Signal Enhanc. (IWAENC), 2018, pp. 51–55.
 [22] A. Pandey and D. L. Wang, “TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain,” in 2019 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2019, pp. 6875–6879.
 [23] G. S. Bhat, N. Shankar, C. K. A. Reddy, and I. M. S. Panahi, “A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone,” IEEE Access, vol. 7, pp. 78421–78433, 2019.
 [24] A. Kag, Z. Zhang, and V. Saligrama, “RNNs evolving in equilibrium: A solution to the vanishing and exploding gradients,” arXiv preprint arXiv:1908.08574, 2019.
 [25] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,” in 9th ISCA Speech Synth. Workshop, 2016, pp. 146–152.
 [26] C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in 2013 Int. Conf. Orient. COCOSDA held jointly with 2013 Conf. Asian Spok. Lang. Res. Eval. (O-COCOSDA/CASLRE), 2013, pp. 1–4.
 [27] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multichannel acoustic noise database: A database of multichannel environmental noise recordings,” J. Acoust. Soc. Am., vol. 133, no. 5, pp. 3591–3591, 2013.
 [28] K. Yatabe, Y. Masuyama, T. Kusano and Y. Oikawa, “Representation of complex spectrogram via phase conversion,” Acoust. Sci. Tech, vol. 40, no. 3, 2019.
 [29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. Learn. Represent. (ICLR), 2015.
 [30] P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs, ITU-T Std. P.862.2, 2007.
 [31] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, 2008.