Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition

11/10/2018 ∙ by Yih-Liang Shen, et al. ∙ 0

Conventional deep neural network (DNN)-based speech enhancement (SE) approaches aim to minimize the mean square error (MSE) between the enhanced speech and the clean reference. An MSE-optimized model may not directly improve the performance of an automatic speech recognition (ASR) system. If the target is to minimize the recognition error, the recognition results should be used to design the objective function for optimizing the SE model. However, the structure of an ASR system, which consists of multiple units, such as acoustic and language models, is usually complex and not differentiable. In this study, we propose to adopt a reinforcement learning algorithm to optimize the SE model based on the recognition results. We evaluated the proposed SE system on the Mandarin Chinese broadcast news corpus (MATBN). Experimental results demonstrate that the proposed method can effectively improve the ASR results, with notable relative character error rate reductions of 12.40% and 19.23% at the 5 dB and 0 dB signal-to-noise ratio conditions, respectively.




1 Introduction

The performance of automatic speech recognition (ASR) has significantly improved in recent years. However, a long-standing issue remains: ASR suffers severe performance degradation in noisy environments [1]. Many approaches have been proposed to address the noise issue. One category of these approaches is speech enhancement (SE) [2, 3]. The goal of SE is to generate enhanced speech signals that closely match clean and undistorted speech signals by removing the noise components from the noisy speech [4, 5, 6]. Traditional SE approaches are designed based on some assumptions about speech and noise characteristics [7, 8]. Generally, these approaches can yield satisfactory performance in terms of speech quality but may not directly improve the ASR performance [9, 5].

Recently, deep-learning-based SE approaches have received increased attention and it has been confirmed that they yield better performances than traditional methods in many tasks [10, 11, 12]. Because of the deep structure, the deep-learning-based models can effectively characterize the complex transformation of noisy speech to clean speech, or they can precisely estimate a mask to filter out noise components from the noisy speech. To train the deep-learning-based models, the mean square error (MSE) criterion is usually used as the objective function. Specifically, the model is trained to minimize the MSE of the enhanced speech and clean references. Although it has been proven that the MSE-based objective function is effective for noise reduction, it is not optimal for improving speech quality and intelligibility, or the ASR performance [13, 14, 15, 16].

Clearly, the ASR results should be the optimal objective function for SE. However, most of the commonly used ASR systems consist of multiple modules, such as the acoustic models and language models. Correspondingly, the input–output relationship is extremely complicated and may not be differentiable. Thus, it is difficult to use the recognition results to directly optimize the SE models. Moreover, it takes a considerable amount of resources to build an ASR system, and the use of a well-established ASR system from a third party is thus favorable. In this study, we propose to adopt the reinforcement learning (RL) algorithm to train an SE model to minimize the recognition errors.

The main concept of the RL algorithm is to take an action in an environment in order to maximize some notion of a cumulative reward [17]. Different from supervised and unsupervised learning algorithms, the RL algorithm learns how to attain a (complex) goal in an iterative manner. To date, RL algorithms have been successfully applied to various tasks, such as robot control [18], dialogue management [19], and computer game playing [20].

The RL algorithm has also been adopted in the speech signal processing field. In [21], RL was used to improve the ASR performance. Based on hypothesis selection by the users, the system can improve the recognition accuracy as compared to unsupervised adaptation. Meanwhile, RL has been used for DNN-based source enhancement by optimizing objective sound quality assessment scores [22]. The results show that by using the RL algorithm, both the perceptual evaluation of speech quality (PESQ) [23] and the short-time objective intelligibility measure (STOI) [24] scores can be improved as compared to the MSE-based training criterion [25].

In this study, we adopt the same idea presented in [22] to establish an RL-based SE system to optimize the ASR performance. Instead of estimating the ratio mask as used in [22], the proposed SE system determines the optimal binary mask to minimize the recognition errors. Notably, the ASR system is fixed in the proposed method. This simulates the most realistic scenarios, in which a well-trained ASR system is provided by a third party and an SE system is built to generate suitable inputs to the ASR system. We evaluated the proposed RL-based SE system on a Mandarin Chinese broadcast news corpus (MATBN) [26]. According to our experimental results, the proposed RL-based SE system effectively decreases the character error rate (CER) when recognition is tested in the presence of noise. The remainder of this paper is organized as follows. Section 2 reviews related techniques. Section 3 introduces the proposed system. Section 4 presents the experimental setup and results. Finally, Section 5 provides concluding remarks.

2 Related Works

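In the time domain, a noisy speech signal $y$ is formulated as a combination of a clean speech signal $s$ and an additive noise signal $n$. By performing the short-time Fourier transform (STFT), log-power operation, and mel-frequency-based filtering, the mel-frequency power spectrogram (MPS) of $y$ can be expressed as:

$$\mathbf{Y} = \mathbf{S} + \mathbf{N}, \tag{1}$$

where $\mathbf{Y}$, $\mathbf{S}$, and $\mathbf{N}$ denote the MPS features of $y$, $s$, and $n$, respectively. In this study, $U$ frames of the STFT MPS feature vectors are concatenated to form one chunk vector for $\mathbf{Y}$, $\mathbf{S}$, and $\mathbf{N}$. Accordingly, we have:

$$\mathbf{Y} = [\mathbf{Y}_1, \ldots, \mathbf{Y}_m, \ldots, \mathbf{Y}_M], \tag{2}$$

where $m$ is the chunk index, and $M$ is the total number of chunk vectors within $\mathbf{Y}$. Note that when $U = 1$, the chunk vector is the STFT MPS feature vector.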
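The chunking step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `make_chunks`, the non-overlapping grouping, and the dropping of a partial final chunk are all assumptions, and the toy 3-dimensional frames stand in for real MPS feature vectors.

```python
from typing import List


def make_chunks(frames: List[List[float]], frames_per_chunk: int) -> List[List[float]]:
    """Concatenate every `frames_per_chunk` consecutive frames into one chunk vector.

    Assumption: chunks do not overlap, and trailing frames that do not fill
    a whole chunk are dropped (the text does not specify either point).
    """
    if frames_per_chunk < 1:
        raise ValueError("frames_per_chunk must be >= 1")
    n_chunks = len(frames) // frames_per_chunk
    chunks = []
    for m in range(n_chunks):
        start = m * frames_per_chunk
        chunk: List[float] = []
        for frame in frames[start:start + frames_per_chunk]:
            chunk.extend(frame)  # stack the frames end to end
        chunks.append(chunk)
    return chunks


# Toy example: 4 frames of 3-dimensional features.
toy_frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
chunks_one_frame = make_chunks(toy_frames, 1)   # each chunk is a single frame
chunks_two_frames = make_chunks(toy_frames, 2)  # each chunk spans two frames
```

With one frame per chunk, the chunk vectors are simply the original feature vectors, which matches the remark at the end of the paragraph above.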

2.1 Ideal Binary Mask-based SE System

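It has been reported that when the goal is to improve the ASR performance, the ideal binary mask (IBM) is more suitable than the ideal ratio mask (IRM) or direct mapping [27] for designing the SE system. Therefore, we implement an IBM-based SE system in this study. For the IBM-based SE system, the noisy input $\mathbf{Y}$ (the MPS of the noisy speech) is filtered by the IBM to obtain the enhanced output $\hat{\mathbf{S}}$:

$$\hat{\mathbf{S}} = \mathbf{B} \odot \mathbf{Y}, \tag{3}$$

where "$\odot$" represents element-wise multiplication, and $\mathbf{B}$ is the IBM matrix, which is defined as:

$$\mathbf{B} = u(\mathbf{S} - \mathbf{N}), \tag{4}$$

where $\mathbf{S}$ and $\mathbf{N}$ are the MPS of the clean speech and of the noise, respectively, and $u(\cdot)$ is the unit step function applied to each element of $\mathbf{S} - \mathbf{N}$.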
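As a concrete illustration of the masking operation, the following sketch computes a binary mask from one clean-speech feature vector and one noise feature vector, and applies it to the noisy vector by element-wise multiplication. The function names are hypothetical, and the convention that the unit step returns 1 at exactly zero is an assumption that the text does not settle.

```python
from typing import List


def ideal_binary_mask(clean: List[float], noise: List[float]) -> List[int]:
    """Unit step applied element-wise to (clean - noise).

    Assumption: the step function returns 1 when the difference is exactly 0.
    """
    return [1 if s - n >= 0 else 0 for s, n in zip(clean, noise)]


def apply_mask(mask: List[int], noisy: List[float]) -> List[float]:
    """Element-wise product of the mask and the noisy feature vector."""
    return [b * y for b, y in zip(mask, noisy)]


# Toy 4-dimensional feature vectors (illustrative values only).
clean_vec = [3.0, 1.0, 2.0, 0.5]
noise_vec = [1.0, 2.0, 2.0, 1.5]
noisy_vec = [4.0, 3.0, 4.0, 2.0]

mask = ideal_binary_mask(clean_vec, noise_vec)
enhanced_vec = apply_mask(mask, noisy_vec)
```

Bins where the speech feature is at least as large as the noise feature are kept unchanged, and all other bins are set to zero.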

2.2 DNN-based SE Model with the MSE Criterion

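For the DNN-based SE, a set of noisy-clean training pairs is prepared as the input and reference of a DNN model. For the noisy input, neighboring chunk vectors are cascaded to include more context information: $\tilde{\mathbf{Y}}_m = [\mathbf{Y}_{m-\tau}^{\top}, \ldots, \mathbf{Y}_m^{\top}, \ldots, \mathbf{Y}_{m+\tau}^{\top}]^{\top}$, where $\mathbf{Y}_m$ is the $m$th noisy chunk vector and $\tau$ sets the context width. The mapping process of a feedforward DNN with $L$ hidden layers is then formulated as,

$$\mathbf{h}_1 = F_h(\mathbf{W}_1 \tilde{\mathbf{Y}}_m + \mathbf{b}_1), \quad \mathbf{h}_l = F_h(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l), \; l = 2, \ldots, L, \quad \hat{\mathbf{S}}_m = F_o(\mathbf{W}_{L+1} \mathbf{h}_L + \mathbf{b}_{L+1}), \tag{5}$$

where $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weight matrices and bias vectors, respectively. Both $F_h$ and $F_o$ are activation functions, in which $F_h$ is the sigmoid function while $F_o$ represents a linear transformation. When the MSE is used as the cost function, the parameter set $\theta$ that consists of all of $\mathbf{W}_l$ and $\mathbf{b}_l$ in Eq. (5) is estimated by,

$$\theta^{*} = \arg\min_{\theta} \frac{1}{M} \sum_{m=1}^{M} \left\| \hat{\mathbf{S}}_m - \mathbf{S}_m \right\|_2^2, \tag{6}$$

where $\mathbf{S}_m$ is the $m$th clean chunk vector and $M$ is the number of chunk vectors.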


3 Proposed Method

Figure 1 illustrates the proposed system, which consists of three modules: “IBM clustering”, “Action estimation”, and “Target action determination”.

Figure 1: The block diagram of the proposed SE system, which includes “IBM clustering”, “Action estimation”, and “Target action determination”.

3.1 IBM clustering module

In the IBM-based SE system, an IBM filter is computed for each feature vector. The IBM clustering module groups the entire set of IBM vectors collected from the training data into $K$ clusters based on the k-means algorithm. Each cluster is represented as $\mathbf{C}_k$ with respect to the cluster index $k$. The ensemble of these clusters is denoted as $\mathcal{C}$. Thus, we have,

$$\mathcal{C} = \{\mathbf{C}_1, \ldots, \mathbf{C}_k, \ldots, \mathbf{C}_K\}. \tag{7}$$

Since the elements in each IBM vector take binary values, the Hamming distance [28] is used to compute the distance between two vectors in this study. We used $K = 32$ clusters for the k-means grouping.
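A small sketch may help make the clustering step concrete. It runs k-means on binary vectors with the Hamming distance for the assignment step. The centroid update by per-dimension majority vote, the tie-breaking rules, and the caller-supplied initial centroids are assumptions made for this illustration; the text only states that k-means and the Hamming distance were used.

```python
from typing import List, Tuple


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def majority_vote(vectors: List[List[int]]) -> List[int]:
    """Binary centroid: 1 where at least half of the members are 1 (assumed rule)."""
    n = len(vectors)
    return [1 if 2 * sum(column) >= n else 0 for column in zip(*vectors)]


def binary_kmeans(masks: List[List[int]], init_centroids: List[List[int]],
                  n_iters: int = 10) -> Tuple[List[List[int]], List[int]]:
    centroids = [list(c) for c in init_centroids]
    assignments = [0] * len(masks)
    for _ in range(n_iters):
        # Assignment step: nearest centroid under the Hamming distance.
        assignments = [
            min(range(len(centroids)), key=lambda k: hamming(mask, centroids[k]))
            for mask in masks
        ]
        # Update step: majority vote within each cluster; empty clusters keep
        # their previous centroid.
        new_centroids = []
        for k in range(len(centroids)):
            members = [m for m, a in zip(masks, assignments) if a == k]
            new_centroids.append(majority_vote(members) if members else centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Toy data: five 4-dimensional binary masks grouped into two clusters.
toy_masks = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 0]]
toy_centroids, toy_assignments = binary_kmeans(toy_masks, [toy_masks[0], toy_masks[2]])
```

In the setting described above, the same procedure would be run on the 64-dimensional IBM vectors of the training data with 32 clusters.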

3.2 Action estimation module

To effectively use the training data, we first pre-train the DNN model by placing the noisy features at the input and the clean features at the output. This pre-trained model is then re-trained with additional hidden layers to compute the $K$-dimensional action vector $\mathbf{A}_m$ at the $m$th chunk, where $K$ is the number of IBM clusters. Among the elements in $\mathbf{A}_m$, the index with the maximum value is determined,

$$\hat{k}_m = \arg\max_{k} A_m(k), \tag{8}$$

where $A_m(k)$ represents the $k$th element of the vector $\mathbf{A}_m$, and $k = 1, \ldots, K$. In addition, different from the spectral mapping in Eq. (5), the softmax operation is used in the final layer of the re-trained DNN. The cost function for the re-training process, Eq. (9), is defined with respect to the reference target, which is derived from the Target action determination module and is described in the next section.
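The final-layer softmax and the selection of the index with the maximum value can be illustrated as follows. The function names and the example scores are hypothetical; the scores stand in for the pre-softmax outputs of the re-trained DNN, and indices are 0-based here, whereas the text counts clusters from 1.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the final-layer scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(action_vector: List[float]) -> int:
    """Index of the largest element of the action vector (0-based)."""
    return max(range(len(action_vector)), key=lambda k: action_vector[k])


# Hypothetical final-layer scores for a 4-cluster toy problem.
toy_scores = [0.1, 2.0, -1.0, 0.5]
toy_action_vector = softmax(toy_scores)
toy_index = select_action(toy_action_vector)
```

Because the softmax is monotonic, the selected index is the same whether the maximum is taken before or after the softmax; the softmax matters for the re-training cost, not for the selection itself.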

3.3 Target action determination module

Figure 2 shows the flowchart of the Target action determination module. First, the action vector estimated by the Action estimation module is used to determine the cluster index via Eq. (8). Then, the IBM selection function selects the corresponding IBM vector from the set of clustered IBM vectors according to this index. Next, the SE function uses the selected IBM vector to enhance the input chunk vector. After all chunk vectors are enhanced, both the noisy input and the IBM-enhanced STFT–MPS features are reconstructed back into time-domain signals and then passed to the ASR system to calculate the utterance-based error rates (ERs) of the noisy and enhanced speech, respectively. Both ERs are used in the Target action determination function, which is a two-stage operation consisting of the reward calculation and the action update.
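An utterance-based error rate of the kind used here can be computed from the edit distance between the reference transcript and the recognized transcript. The sketch below is a generic character error rate computation, not the scoring tool used in the paper; the function names and the two example strings are illustrative assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance counting substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER in percent, normalized by the reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Illustrative transcripts: one substituted and one deleted character.
reference = "今天天氣很好"
hypothesis = "今天天汽好"
toy_cer = character_error_rate(reference, hypothesis)
```

The same function would be applied once to the recognition result of the noisy utterance and once to that of the enhanced utterance, giving the two ERs that enter the reward calculation.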

Figure 2: The flowchart of Target action determination module, which is used to update the input action vector.

3.3.1 Reward calculation

Rather than directly using the ER as the reward, we applied the relative value between the ERs of the noisy and enhanced speech, given in Eq. (10), to avoid the influence of external factors, such as the variation of an ASR system and environmental noises. The scalar factor in this equation is set to 10 in this study. A positive reward denotes that the ER of the noisy speech is larger than that of the enhanced speech, suggesting that the enhanced speech provides better recognition results. On the other hand, a negative reward denotes that the ER of the noisy speech is smaller than that of the enhanced speech, suggesting that the enhanced speech gives worse recognition performance.
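The sign convention of the reward can be illustrated with a deliberately simple stand-in. The exact functional form of Eq. (10) is not reproduced here, so the scaled difference below is an assumption made only for this sketch; the function name and the example error rates (expressed as fractions) are likewise hypothetical. Only the scalar factor of 10 and the sign behavior come from the text.

```python
def utterance_reward(er_noisy: float, er_enhanced: float, scale: float = 10.0) -> float:
    """Assumed stand-in for the utterance-based reward: a scaled ER difference.

    Positive when the noisy input is recognized worse than the enhanced
    speech, negative when enhancement hurts recognition.
    """
    return scale * (er_noisy - er_enhanced)


reward_when_helpful = utterance_reward(er_noisy=0.50, er_enhanced=0.40)
reward_when_harmful = utterance_reward(er_noisy=0.40, er_enhanced=0.50)
```

Under this stand-in, an enhancement that lowers the error rate from 0.50 to 0.40 receives a reward of about 1, and the reverse case receives about -1.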

In addition to the utterance-based reward, we also consider a chunk-based reward, because the action for each chunk vector may contribute differently to the utterance-level ER. That is, an effective enhancement can contribute positively to the ASR performance. Therefore, we defined a time-varying reward in Eqs. (11)–(13). In these equations, the weighting factor at each chunk is computed from the normalized square error. When an erroneous IBM vector is selected, the normalized error in Eq. (12) is large, and accordingly the weighting factor is small, which penalizes this wrong action, as introduced in the next subsection.

3.3.2 Action update

To update the action vector, we first determine two different action indices. To obtain the first index, we follow Eq. (4) to determine an IBM vector, which is then used to locate the closest cluster among the clustered IBM vectors; the index of the located cluster is the first action index. The second action index is the cluster index determined by Eq. (8), as presented in the Action estimation module.
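Locating the closest cluster for a given IBM vector amounts to a nearest-neighbor search under the Hamming distance, which can be sketched as follows. The function names, the tie-breaking toward the lowest index, and the 3-entry toy codebook are assumptions for illustration.

```python
from typing import List


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Number of positions in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_cluster_index(mask: List[int], codebook: List[List[int]]) -> int:
    """0-based index of the codebook entry closest to `mask`.

    Ties are broken toward the lowest index (an assumption).
    """
    return min(range(len(codebook)), key=lambda k: hamming_distance(mask, codebook[k]))


# Toy codebook of three clustered masks and two query masks.
toy_codebook = [[1, 1, 0, 0], [0, 1, 1, 1], [0, 0, 0, 0]]
index_a = nearest_cluster_index([1, 1, 1, 0], toy_codebook)
index_b = nearest_cluster_index([0, 0, 1, 1], toy_codebook)
```

During training, the mask passed to this search would be the one computed from the clean speech and noise features, so the returned index plays the role of the first action index described above.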

With the two determined action indices, the input action vector is updated to produce the output action vector based on the following equations:




3.4 Testing procedure

Figure 3: The block diagram of testing part for the proposed algorithm.

After the DNN is trained with the objective function in Eq. (9), testing proceeds as illustrated in Fig. 3. The well-trained DNN model is applied to the noisy STFT–MPS features, which are first extracted from the time-domain noisy signal. The IBM indices estimated with Eq. (8) for each chunk are then used to enhance the noisy input, and the enhanced features are provided at the output of the IBM–SE function. The waveform is reconstructed from the enhanced features and is then passed to the ASR system for recognition.
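The per-chunk part of the testing procedure can be sketched as below: for every chunk, the index of the largest action-vector element selects a clustered mask, which is then multiplied element-wise with the noisy chunk. The function name and the toy values are hypothetical, the action vectors stand in for the outputs of the trained DNN, and feature extraction, waveform reconstruction, and recognition are outside the scope of the sketch.

```python
from typing import List


def enhance_utterance(noisy_chunks: List[List[float]],
                      action_vectors: List[List[float]],
                      codebook: List[List[int]]) -> List[List[float]]:
    """Select one clustered mask per chunk and apply it to the noisy chunk."""
    enhanced = []
    for chunk, action in zip(noisy_chunks, action_vectors):
        # Index of the largest element of the action vector (0-based).
        k = max(range(len(action)), key=lambda i: action[i])
        mask = codebook[k]
        enhanced.append([b * y for b, y in zip(mask, chunk)])
    return enhanced


# Toy setting: two clustered masks, two 3-dimensional noisy chunks.
toy_masks_codebook = [[1, 0, 1], [0, 1, 1]]
toy_noisy_chunks = [[2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
toy_action_vectors = [[0.9, 0.1], [0.2, 0.8]]  # stand-ins for DNN outputs
toy_enhanced = enhance_utterance(toy_noisy_chunks, toy_action_vectors, toy_masks_codebook)
```

No clean reference is needed at this stage, since the mask is chosen from the noisy input alone.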

4 Experiments

4.1 Experimental setup

We conducted our experiments on the MATBN task, which is a 198-hour Mandarin Chinese broadcast news corpus [26]. The utterances in MATBN were originally recorded at a 44.1 kHz sampling rate and were further down-sampled to 16 kHz. A 25-hour gender-balanced subset of the speech utterances was used to train a set of CD-DNN-HMM acoustic models. A set of trigram language models was trained on a collection of text news documents published by the Central News Agency (CNA) between 2000 and 2001 (the Chinese Gigaword Corpus released by LDC) with the SRI Language Modeling Toolkit [29]. The overall ASR system was implemented with the Kaldi [30] toolkit. Each speech waveform was parameterized into a sequence of 40-dimensional filter-bank features. The DNN structure for the acoustic models consisted of six hidden layers, and each layer had 2048 nodes. The dimensions of the input and output layers were 440 (40 × 11) and 2596, respectively [31]. The evaluation results are reported as the average CER.

To train the RL–SE system, another 460 utterances were selected from the MATBN corpus. The overall RL–SE and ASR systems were evaluated using another 30 utterances from the MATBN testing set. In this study, we used baby-cry noise as the background noise. The baby-cry noise waveform was divided into two parts: the first part was artificially added to the 460 training utterances at a signal-to-noise ratio (SNR) level of 5 dB, and the second part was artificially added to the 30 testing utterances at 0 and 5 dB SNR levels. Notably, the training and testing utterances were simulated using different segments of the noise source waveform, and thus their properties were slightly different. In total, we prepared 460 noisy–clean pairs to train the RL-based SE system. For all of the training and testing data, the frame size and the frame shift for the STFT were 32 and 16 ms, respectively. The 64-dimensional MPS features were then extracted from all noisy and clean utterances.

Next, we established two RL-based SE models with two different settings for the chunk vectors. Both models were composed of one hidden layer with 64 nodes and an output layer with 32 nodes. The input dimensions of the two models were 704 and 640, respectively; these values follow from the chunk size and from the number of neighboring chunk vectors cascaded to provide the context information (as mentioned in Section 2.2).
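Two of the data-preparation steps in this setup involve simple arithmetic that is easy to check in code: converting the 32 ms frame size and 16 ms frame shift to samples at 16 kHz, and scaling a noise segment so that the mixture reaches a target SNR. The sketch below uses hypothetical function names and synthetic signals; the no-padding frame count and the power-based SNR definition are assumptions, since the text does not describe the mixing tool that was used.

```python
import math
from typing import List, Tuple


def ms_to_samples(duration_ms: float, sample_rate_hz: int) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(duration_ms * sample_rate_hz / 1000.0))


def num_frames(n_samples: int, frame_len: int, frame_shift: int) -> int:
    """Number of full frames when no padding is applied (an assumption)."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // frame_shift


def mean_power(signal: List[float]) -> float:
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech: List[float], noise: List[float],
               snr_db: float) -> Tuple[List[float], List[float]]:
    """Scale `noise` so that speech power / noise power matches `snr_db`.

    Assumes the noise segment is at least as long as the speech.
    """
    noise = noise[:len(speech)]
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled_noise)]
    return noisy, scaled_noise


frame_len = ms_to_samples(32, 16000)    # samples per 32 ms frame
frame_shift = ms_to_samples(16, 16000)  # samples per 16 ms shift
frames_in_one_second = num_frames(16000, frame_len, frame_shift)

# Synthetic stand-ins for a speech utterance and a noise segment.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [((7 * i) % 13 - 6) / 6.0 for i in range(1200)]
noisy, scaled_noise = mix_at_snr(speech, noise, 5.0)
achieved_snr_db = 10.0 * math.log10(mean_power(speech) / mean_power(scaled_noise))
```

At 16 kHz, the 32 ms frame and 16 ms shift correspond to 512 and 256 samples, so consecutive frames overlap by half.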

4.2 Experimental results

Figure 4: Clustered IBMs derived by the k-means algorithm.

SNR     Noisy   1-NN    RL-based SE (first)   RL-based SE (second)
5 dB    56.14   73.09   55.60                 49.18
0 dB    81.40   85.79   77.20                 65.75

Table 1: The average CERs (%) of Noisy (the baseline), the one-nearest-neighbor (1-NN) method, and the two RL-based SE systems at 0 and 5 dB SNR conditions.

Figure 4 shows all 32 IBM vectors, each with 64 dimensions. These are the clustered IBM vectors in Eq. (7) that are used in the system. Bright yellow elements in the figure denote ones (in terms of their binary values) and the blue elements denote zeros. From the figure, we observe that the low-dimensional MPS features, which correspond to the low-frequency regions, are dominated by speech components. One possible explanation is that the noise signals did not mask the human speech in the low-frequency regions. In addition, the entire first column consists of ones, suggesting that silence frames were also contained in the baby-cry noise.

We then compared the average CER results of the two RL-based SE systems, and the corresponding results are listed in Table 1. The unprocessed noisy speech was also recognized by the ASR system, and the corresponding results are denoted as "Noisy". To test the effectiveness of RL, we designed another set of experiments: the same 32 IBM vectors were used, while the one-nearest-neighbor (1-NN) method was used to determine the IBM vector for enhancement. The enhanced speech was then recognized by the same ASR system, and the corresponding results are also listed in Table 1.

When the recognition was tested using the original clean testing utterances, the CER was considerably lower than in the noisy conditions. As shown in Table 1, when noise was present in the background, the CER degraded considerably to 56.14% and 81.40% for the 5 dB and 0 dB SNR levels, respectively. We then noted that the one-nearest-neighbor method could not provide any improvement over Noisy, showing that it could not select the optimal IBM vectors for SE to improve the ASR performance. Furthermore, both RL-based SE systems provided better recognition results than Noisy and the one-nearest-neighbor method, and the second RL-based SE system (the last column of Table 1) outperformed the first. The relative CER reductions of the second RL-based SE system over Noisy are 12.40% (from 56.14% to 49.18%) at the 5 dB SNR level and 19.23% (from 81.40% to 65.75%) at the 0 dB SNR level. The results in Table 1 clearly demonstrate the effectiveness of RL-based SE for improving ASR performance in the presence of noise.
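The relative CER reductions can be recomputed from the Table 1 entries. The short sketch below does this for the Noisy baseline and the best-performing system (the last column of Table 1); the function name and the dictionary layout are illustrative choices, while the four CER values are taken directly from the table.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (baseline - system) / baseline


# CER values (%) from Table 1: Noisy baseline and the last column.
table1_cer = {
    "5 dB": {"noisy": 56.14, "best": 49.18},
    "0 dB": {"noisy": 81.40, "best": 65.75},
}

reductions = {
    snr: round(relative_reduction(row["noisy"], row["best"]), 2)
    for snr, row in table1_cer.items()
}

for snr, value in reductions.items():
    print(f"{snr}: {value:.2f}% relative CER reduction")
```

The absolute reductions are 6.96 and 15.65 percentage points, which correspond to 12.40% and 19.23% of the respective baseline CERs.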

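Figure 5: The spectrograms of (a) noisy speech, (b) clean speech, (c) speech enhanced by the first RL-based SE system, and (d) speech enhanced by the second RL-based SE system.

        STOI                                          PESQ
SNR     Noisy   RL-based SE (first)   RL-based SE (second)   Noisy   RL-based SE (first)   RL-based SE (second)
5 dB    0.82    0.82                  0.86                   1.85    1.67                  1.96
0 dB    0.74    0.77                  0.81                   1.45    1.42                  1.59

Table 2: The STOI and PESQ scores of the two RL-based SE systems and Noisy at 0 and 5 dB SNR conditions.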

To visually analyze the effect of the derived RL-based SE system, we present the spectrograms of one noisy utterance at the 5 dB SNR level (Fig. 5 (a)), as well as its clean version (Fig. 5 (b)) and the versions enhanced by the two RL-based SE systems (Fig. 5 (c) and (d)). From the figure, the noise components of the noisy utterance were effectively removed by both RL-based SE systems, showing that although the goal was to improve the ASR performance, the RL-based SE also performed denoising on the input speech.

Recent studies have reported a positive correlation between objective intelligibility scores and ASR performance [27, 32]. In Table 2, we show the STOI and PESQ scores of the enhanced speech processed by the two RL-based SE systems at SNR levels of 0 and 5 dB. The results of the unprocessed noisy speech, shown as Noisy, are also listed for comparison. From this table, both RL-based SE systems yield STOI scores at least as high as those of Noisy, and the second system again provides clear improvements over the first. From Tables 1 and 2, we can clearly note positive correlations between the STOI scores and the ASR performance. As for the PESQ scores, the second RL-based SE system outperformed Noisy, but the first slightly underperformed Noisy. It can be noted that the correlation between the PESQ scores and the ASR results is not as strong as that between the STOI scores and the ASR results.

5 Conclusion

In this study, we present an RL-based SE system for robust speech recognition that does not require retraining the ASR system. By using the recognition errors as the objective function, the RL-based SE can effectively reduce the CERs by 12.40% and 19.23% (relative) at the 5 and 0 dB SNR conditions, respectively. We also noted that although the objective is to improve ASR performance, the enhanced speech was denoised and had improved STOI scores. This study serves as a pioneering work for building an SE system with the aim of directly improving ASR performance. The designed scenario is practical in many real-world applications where an ASR engine is supplied by a third party. In future work, more noise types and SNR levels will be considered to build the RL-based SE system.


  • [1] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press, 2015.
  • [2] B. Li, Y. Tsao, and K. C. Sim, “An investigation of spectral restoration algorithms for deep neural networks based noise robust speech recognition,” in Proc. INTERSPEECH, pp. 3002–3006, 2013.
  • [3] Z.-Q. Wang and D. Wang, “A joint training framework for robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796–806, 2016.
  • [4] A. Acero and R. M. Stern, “Environmental robustness in automatic speech recognition,” in Proc. ICASSP, pp. 849–852, 1990.
  • [5] H.-J. Hsieh, B. Chen, and J.-w. Hung, “Employing median filtering to enhance the complex-valued acoustic spectrograms in modulation domain for noise-robust speech recognition,” in Proc. ISCSLP, pp. 1–5, 2016.
  • [6] H. Zhang, X. Zhang, and G. Gao, “Training supervised speech separation system to improve stoi and pesq directly,” in Proc. ICASSP, pp. 5374–5378, 2018.
  • [7] I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” IEEE signal processing letters, vol. 9, no. 1, pp. 12–15, 2002.
  • [8] R. McAulay and M. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 2, pp. 137–145, 1980.
  • [9] J. Du, Q. Wang, T. Gao, Y. Xu, L.-R. Dai, and C.-H. Lee, “Robust speech recognition with speech enhanced deep neural networks,” in Proc. INTERSPEECH, pp. 616–620, 2014.
  • [10] D. Baby, J. F. Gemmeke, T. Virtanen, et al., “Exemplar-based speech enhancement for deep neural network based automatic speech recognition,” in Proc. ICASSP, pp. 4485–4489, 2015.
  • [11] A. J. R. Simpson, “Probabilistic binary-mask cocktail-party source separation in a convolutional deep neural network,” CoRR, vol. abs/1503.06962, 2015.
  • [12] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
  • [13] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, 2015.
  • [14] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. INTERSPEECH, pp. 436–440, 2013.
  • [15] Z. Meng, J. Li, Y. Gong, et al., “Adversarial feature-mapping for speech enhancement,” arXiv preprint arXiv:1809.02251, 2018.
  • [16] Z. Meng, J. Li, Y. Gong, et al., “Cycle-consistent speech enhancement,” arXiv preprint arXiv:1809.02253, 2018.
  • [17] R. S. Sutton, A. G. Barto, and R. J. Williams, “Reinforcement learning is direct adaptive optimal control,” IEEE Control Systems, vol. 12, no. 2, pp. 19–22, 1992.
  • [18] N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in Proc. ICRA, vol. 3, pp. 2619–2624, 2004.
  • [19] S. Singh, D. Litman, M. Kearns, and M. Walker, “Optimizing dialogue management with reinforcement learning: Experiments with the njfun system,” Journal of Artificial Intelligence Research, vol. 16, pp. 105–133, 2002.
  • [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [21] T. Kala and T. Shinozaki, “Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection,” in Proc. ICASSP, pp. 5759–5763, 2018.
  • [22] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “Dnn-based source enhancement self-optimized by reinforcement learning using sound quality measurements,” in Proc. ICASSP, pp. 81–85, 2017.
  • [23] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, pp. 749–752, 2001.
  • [24] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
  • [25] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “Dnn-based source enhancement to increase objective sound quality assessment score,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1780–1792, 2018.
  • [26] H.-M. Wang, B. Chen, J.-W. Kuo, and S.-S. Cheng, “Matbn: A mandarin chinese broadcast news corpus,” International Journal of Computational Linguistics & Chinese Language Processing, vol. 10, no. 2, pp. 219–236, 2005.
  • [27] A. H. Moore, P. P. Parada, and P. A. Naylor, “Speech enhancement for robust automatic speech recognition: Evaluation using a baseline system and instrumental measures,” Computer Speech & Language, vol. 46, pp. 574–584, 2017.
  • [28] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov, “Hamming distance metric learning,” in Proc. NIPS, pp. 1061–1069, 2012.
  • [29] S. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
  • [30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., “The kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
  • [31] S.-S. Wang, P. Lin, Y. Tsao, J.-W. Hung, and B. Su, “Suppression by selecting wavelets for feature compression in distributed speech recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 26, no. 3, pp. 564–579, 2018.
  • [32] S. Xia, H. Li, and X. Zhang, “Using optimal ratio mask as training target for supervised speech separation,” in Proc. APSIPA, pp. 163–166, 2017.