Distributed Microphone Speech Enhancement based on Deep Learning

11/19/2019 · by Syu-Siang Wang, et al.

Speech-related applications deliver inferior performance in complex noisy environments. This study addresses the problem by introducing speech-enhancement (SE) systems based on deep neural networks (DNNs) applied to a distributed-microphone architecture. The first system constructs a DNN model for each microphone to enhance the recorded noisy speech signal; the second system combines all the noisy recordings into a large feature structure that is then enhanced through a single DNN model. In the third system, a channel-dependent DNN first enhances the corresponding noisy input, and all the channel-wise enhanced outputs are fed into a DNN fusion model to construct a nearly clean signal. All three DNN SE systems operate in the acoustic frequency domain of speech signals in a diffuse-noise-field environment. Evaluation experiments were conducted on the Taiwan Mandarin Hearing in Noise Test (TMHINT) database, and the results indicate that all three DNN-based SE systems improve the speech quality and intelligibility of the original noise-corrupted signals, with the third system delivering the highest signal-to-noise ratio (SNR) improvement and the best speech intelligibility.




1 Introduction

Real-world environments always contain stationary and/or time-varying noises that are received together with speech signals by recording devices. The received noises inevitably degrade the performance of multi-channel (MC)-based human–human and human–machine interfaces, an issue that has attracted significant attention over the years [1, 2, 3]. In recent decades, numerous MC speech-enhancement (SE) approaches have been proposed to alleviate the effect of noise and improve the quality and intelligibility [4, 5, 6, 7] of received speech signals. In general, most of these approaches are designed for a microphone-array architecture, wherein multiple microphones are compactly placed in a small space. For example, the beam-forming algorithm, one of the most popular methods, exploits the spatial diversity of the received signals to design a linear filter in the frequency domain that preserves the signal arriving from the target direction while attenuating noise and interference from other directions [8, 9]. Recently, several novel approaches have combined deep-learning-based algorithms with the beam-forming process to further improve the capability of an MC SE system [10, 11]. In addition to beam-forming-based approaches, in [12], the multiple recordings are directly enhanced in the time domain along a specified spatial direction through a denoising autoencoder, and this method benefits automatic speech recognition (ASR) systems by reducing recognition errors.

In contrast, some researchers have paid more attention to performing SE on a distributed-microphone architecture [13, 14, 15, 16]. This physical configuration, consisting of many individual self-powered microphones or microphone arrays, can be deployed over a large area [17, 18]. Therefore, one or more received signals with a higher signal-to-noise ratio (SNR) and direct-to-reverberant ratio can be used by the distributed-microphone enhancement system to achieve better sound quality and intelligibility.

In general, a fusion center (FC) and a distributed signal processor (also called ad hoc) are the two alternative forms used in the distributed-microphone architecture [19]. With an FC, each recording device transmits the recorded sounds to a powerful central processor that aims at reconstructing nearly clean speech relative to the selected target speaker. Some successful approaches associated with this architecture include robust principal component analysis [15], generalized eigenvalue decomposition [20], the MC Wiener filter [21], and optimal MC frequency-domain estimators [22]. In comparison, an ad-hoc-based enhancement system enhances the noisy input locally on each individual device and then shares the result with its neighbors for further refinement. Some well-known techniques of this type include distributed linearly constrained minimum variance beamforming [23], linearly constrained distributed adaptive node-specific signal estimation [24], the distributed generalized sidelobe canceler [25], and distributed maximum signal-to-interference-plus-noise filtering [26].

In this study, three novel SE systems based on deep neural networks (DNNs) are introduced and investigated for the distributed-microphone architecture. In the first system, we train a DNN model for each microphone channel to enhance the corresponding noisy recordings. In other words, a large MC SE system is divided into several single-channel noise-reduction tasks; this system is called “DNN–S”. The second system follows a process similar to the work in [12], wherein the multiple received noisy utterances are transmitted to an FC, aggregated, and used as input to a single DNN model that produces the final enhanced signals. Because an FC is used here, this system is called “DNN–F”. Finally, the third system comprises two operational stages: the first stage acts locally on each channel device by enhancing the noisy input with a DNN, and in the second stage, all the locally enhanced channel signals are combined and further processed with a fusion DNN. We call this system “DNN–C” because it essentially combines the two previous systems to facilitate the enhancement.

The rest of this paper is organized as follows. Section 2 introduces the aforementioned three systems, namely DNN–S, DNN–F, and DNN–C. Experiments and the respective analysis are given in Section 3. Section 4 presents conclusions and future directions.

2 Three different multi-channel enhancement systems

Figure 1: Block diagram of a DNN-based speech-enhancement system with a distributed-microphone architecture.

Figure 1 shows the general diagram of the distributed-channel SE architecture common to the three presented systems. The original clean signal shown in this figure is first corrupted by the background diffuse noise and/or reverberation and is received by the distributed microphones. Next, the short-time Fourier transform (STFT) and a logarithmic operation are performed on the received signals in the speech signal processor block to obtain the log-power magnitude spectra (LPS). The DNN-based MC SE system, DNN–S, DNN–F, or DNN–C, is then used to generate the enhanced version of the noisy LPS input. Finally, in the enhanced signal processor block, the inverse STFT (ISTFT) is applied to the enhanced LPS, together with the original phase component preserved from a specific channel, to provide the final enhanced speech waveform. In the following three sub-sections, we introduce the detailed processes of the aforementioned three DNN-based SE systems.
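The analysis side of this chain (framing, STFT, and LPS extraction) can be sketched as follows. This is a minimal numpy illustration, not the authors' code; the frame settings (512-sample Hann frames with 50% overlap, i.e., 32 ms / 16 ms at 16 kHz) are taken from the experimental setup in Section 3, and the function names are hypothetical.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Naive STFT: Hann-windowed overlapping frames -> complex spectra."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i*hop:i*hop+frame_len] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)          # (n_frames, frame_len//2 + 1)

def lps(spec, eps=1e-10):
    """Log-power magnitude spectra used as the DNN input/target features."""
    return np.log(np.abs(spec) ** 2 + eps)

x = np.random.randn(16000)                      # 1 s of toy audio at 16 kHz
S = stft(x)
feat = lps(S)
print(feat.shape)                               # (61, 257)
```

The 257-dimensional frame-wise LPS produced here matches the feature dimensionality reported in Section 3.1.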

2.1 DNN–S enhancement system

Figure 2: Block diagram of the DNN–S SE system.

For the DNN–S architecture shown in Fig. 2, a DNN, denoted by DNN–S$_m$, was trained for each of the $M$ microphone channels using the channel-wise noisy speech feature set $\{\mathbf{x}_m\}$ with the respective clean counterpart $\{\mathbf{y}_m\}$, where $M$ is the total number of microphone channels and $m$ is the channel index. Consider an $L$-layer DNN–S$_m$; an arbitrary $l$-th layer is formulated in Eq. (1) in terms of its input–output relationship $(\mathbf{u}_l, \mathbf{v}_l)$:

$$\mathbf{v}_l = \sigma(\mathcal{F}_l(\mathbf{u}_l)), \quad l = 1, 2, \dots, L, \tag{1}$$

where $\sigma$ and $\mathcal{F}_l$ are the ReLU activation and linear transformation functions, respectively. Notably, the input and output layers correspond to the first and $L$-th layers, respectively. In addition, for DNN–S$_m$, we have $\mathbf{u}_1 = \mathbf{x}_m$ and $\mathbf{v}_L = \hat{\mathbf{y}}_m$, where $\hat{\mathbf{y}}_m$ is the ultimate output of this system. The DNN parameters are obtained by means of supervised training that minimizes the mean squared error (MSE) between $\hat{\mathbf{y}}_m$ and the noise-free counterpart $\mathbf{y}_m$.
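The layer stacking described above — a linear transformation followed by ReLU per hidden layer, with a linear output layer regressing the clean LPS — can be sketched as a toy numpy forward pass. This is an illustration under assumed conventions, not the authors' implementation; the 771-in / 2,048-hidden / 257-out sizes follow Section 3.1, MSE training is omitted, and all helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def init_dnn(sizes):
    """Random (untrained) weights for a fully connected network."""
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def dnn_forward(params, u):
    """Each layer applies a linear map followed by ReLU, as in Eq. (1);
    the final layer is kept linear so it can regress real-valued LPS."""
    for i, (W, b) in enumerate(params):
        u = u @ W + b
        if i < len(params) - 1:
            u = relu(u)
    return u

# One DNN-S_m per channel: 771-dim context input -> 257-dim clean LPS estimate
model = init_dnn([771, 2048, 2048, 257])
x_m = rng.standard_normal((4, 771))      # 4 noisy frames from channel m
y_hat = dnn_forward(model, x_m)
print(y_hat.shape)                        # (4, 257)
```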

2.2 DNN–F enhancement system

Figure 3: Block diagram of the DNN–F speech-enhancement system.

Figure 3 illustrates the DNN–F block diagram. According to this figure, we collect the channel-wise noisy features $\mathbf{x}_m$ from all of the $M$ microphone channels and concatenate them in a frame-wise manner to form a long feature $\mathbf{x} = [\mathbf{x}_1^T, \mathbf{x}_2^T, \dots, \mathbf{x}_M^T]^T$. Then, the long features in the training set are used to train a DNN model in the FC, which can be formulated as follows:

$$\hat{\mathbf{y}} = \mathcal{F}(\mathbf{x}), \tag{2}$$

where $\mathcal{F}$ represents the operation of the used DNN model, and $\hat{\mathbf{y}}$ is the corresponding enhanced output. DNN–F applies a single fusion model to enhance the noisy features from all channels concurrently, whereas DNN–S exploits $M$ channel-dependent models.
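The frame-wise concatenation that builds the long DNN–F input can be sketched in a couple of lines; the dimensions below (7 channels × 771-dim context features → 5,397) follow Section 3.1, and the variable names are illustrative.

```python
import numpy as np

M, dim = 7, 771                           # 7 channels, 771-dim context LPS each
rng = np.random.default_rng(1)
channel_feats = [rng.standard_normal((4, dim)) for _ in range(M)]  # 4 frames/channel

# Frame-wise concatenation into one long feature vector per frame for the FC DNN
x_long = np.concatenate(channel_feats, axis=1)
print(x_long.shape)                        # (4, 5397)
```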

2.3 DNN–C enhancement system

Figure 4: The DNN–C block diagram depicted in (a) contains both the DNN–FC and DNN–DP functions. The training target for DNN–FC is the clean LPS $\mathbf{y}$, while that for each DNN–DP$_m$ is the channel-wise clean speech $\mathbf{y}_m$. In addition, the DNN–DP models are determined first in (b), and then fixed when performing DNN–FC in (a).

The third proposed system, DNN–C, consists of two stages: the distributed processing stage and the fusion stage. Both stages use DNN models and are therefore denoted by “DNN–DP” and “DNN–FC”, respectively. The general diagram of DNN–C is shown in Fig. 4(a), and the detailed configuration of the DNN–DP stage is shown in Fig. 4(b). According to Fig. 4(b), $M$ self-powered devices, one for each microphone channel, are employed, and a channel-specific DNN model denoted by DNN–DP$_m$ runs on each device to suppress noise in the input $\mathbf{x}_m$ and produce the enhanced features $\tilde{\mathbf{y}}_m$. Like the DNN–S and DNN–F models, each DNN–DP$_m$ is composed of layers with ReLU activation and linear transformation functions, and can be formulated as follows:

$$\tilde{\mathbf{y}}_m = \mathcal{G}_m(\mathbf{x}_m), \quad m = 1, 2, \dots, M. \tag{3}$$

Therefore, the $M$ DNN models—DNN–DP$_1$, DNN–DP$_2$, …, DNN–DP$_M$—are first estimated in the DNN–DP stage. In particular, the target feature for each of the $M$ DNN models, denoted by $\mathbf{y}_m$, is created from the channel-dependent clean speech recorded by the $m$-th microphone in the noise-free environment. That is, the speech pair $\{\mathbf{x}_m, \mathbf{y}_m\}$ is used to train the associated DNN–DP$_m$ model.

As for the second stage, DNN–FC, the $M$ DNN outputs of the first stage are concatenated to form a new feature $\tilde{\mathbf{y}} = [\tilde{\mathbf{y}}_1^T, \tilde{\mathbf{y}}_2^T, \dots, \tilde{\mathbf{y}}_M^T]^T$, which is used together with the clean target $\mathbf{y}$ to train the DNN–FC model, formulated by:

$$\hat{\mathbf{y}} = \mathcal{H}(\tilde{\mathbf{y}}). \tag{4}$$

Therefore, the overall operation of the DNN–C system can be represented by

$$\hat{\mathbf{y}} = \mathcal{H}\big([\mathcal{G}_1(\mathbf{x}_1)^T, \mathcal{G}_2(\mathbf{x}_2)^T, \dots, \mathcal{G}_M(\mathbf{x}_M)^T]^T\big). \tag{5}$$
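The two-stage composition — per-channel enhancers followed by a fusion model over their concatenated outputs — can be sketched as follows. For brevity a single linear layer stands in for each multi-layer network, so this is a structural sketch of the data flow only; all names are hypothetical, and the 771 / 257 / 1,799 dimensions follow Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 7

def make_layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.01

# Stage 1: one DNN-DP_m per channel, 771-dim context LPS -> 257-dim estimate
dp_models = [make_layer(771, 257) for _ in range(M)]

# Stage 2: fusion DNN-FC over the concatenated stage-1 outputs, 1799 -> 257
fc_model = make_layer(257 * M, 257)

def dnn_c(channel_inputs):
    """Stage-1 outputs are concatenated frame-wise and fused by stage 2."""
    stage1 = [x @ W for x, W in zip(channel_inputs, dp_models)]
    fused = np.concatenate(stage1, axis=1)           # (frames, 1799)
    return fused @ fc_model                          # (frames, 257)

xs = [rng.standard_normal((4, 771)) for _ in range(M)]
y_hat = dnn_c(xs)
print(y_hat.shape)                                   # (4, 257)
```

In the paper the DNN–DP models are trained first and then frozen while DNN–FC is trained; that two-phase schedule is omitted here.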
2.4 Phase component

Briefly, the MC SE systems presented here enhance multiple signal sources in the frequency domain. Therefore, an ISTFT is applied to the updated spectrogram to produce the enhanced time-domain signal. It is worth mentioning that the phase component of the enhanced spectrogram is taken from the STFT of the noisy speech of a specific channel. For DNN–S, the enhanced signal processor shown in Fig. 1 operates individually on each channel to reconstruct the speech waveform from the enhanced LPS and the preserved noisy phase. Conversely, the phase used for DNN–F and DNN–C is extracted from one of the noisy channels, where it is assumed that the recorded sound has the highest SNR and best speech quality among all channels.
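Pairing the enhanced magnitude (recovered from the LPS) with the preserved noisy phase before the ISTFT can be sketched as below. This is an assumed, simplified reconstruction (plain overlap-add without window compensation), with hypothetical function names; it is not the authors' signal processor.

```python
import numpy as np

def istft(spec, frame_len=512, hop=256):
    """Simple overlap-add inverse of an rFFT-based STFT (no window compensation)."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
    return out

def reconstruct(enhanced_lps, noisy_spec):
    """Combine enhanced magnitude (from LPS) with the preserved noisy phase."""
    mag = np.sqrt(np.exp(enhanced_lps))          # invert the log-power mapping
    phase = np.angle(noisy_spec)                 # phase from the reference channel
    return istft(mag * np.exp(1j * phase))

rng = np.random.default_rng(3)
noisy_spec = rng.standard_normal((61, 257)) + 1j * rng.standard_normal((61, 257))
enhanced_lps = np.log(np.abs(noisy_spec) ** 2 + 1e-10)   # stand-in for a DNN output
wave = reconstruct(enhanced_lps, noisy_spec)
print(wave.shape)                                         # (15872,)
```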

3 Experiments

In the following subsections, we first describe the experimental setup of the SE distributed-microphone tasks and then present the experimental results together with some discussions for the presented systems.

3.1 Experimental setup

Figure 5: The MC system consists of seven microphones (“m1”, “m2”, …, “m7”); microphones m1, m2, …, m6 were placed around the speaker T with a radius of 0.5 meter, and m7 was placed behind m1 and oriented towards T at a distance of 1 meter.

The layout of the distributed MC system is shown in Fig. 5. Seven microphones ($M = 7$) of the same brand and model (Sanlux HMT-11) were used, denoted by “m1”, “m2”, …, “m7”, respectively, while the target speaker is represented by “T”. In this system, six microphones, m1, m2, …, m6, were placed around the speaker at a 0.5-meter radius, equally spaced by an angle of 60°. The m7 microphone was placed right behind the m1 microphone and oriented towards the speaker at a distance of 1 meter.

For the evaluation task, we used the Taiwan Mandarin Hearing in Noise Test (TMHINT) [27] to prepare the speech dataset. According to the script provided by the TMHINT dataset, 300 phrases were selected as the training set, while the remaining 20 utterances were used for testing. Phrases in the training set were individually pronounced by a male and a female speaker in a noise-free environment at a sampling rate of 16 kHz. These recordings were then corrupted by eight different types of noise (cockpit, machine gun, alarm, cough, PC fan, pink, babble, and engine) at eight different SNR levels spaced at 3 dB intervals. Thus, 38,400 utterances (300 × 2 × 8 × 8) were produced and then received by each of the seven microphones. On the other hand, the test utterances were first recorded by another male and female speaker and then contaminated with car and street noises at -5, 0, and 5 dB SNRs. Therefore, there were 240 noisy utterances (20 × 2 × 2 × 3) transmitted to each of the seven microphones.

For each of the seven microphone channels, each received utterance was first split into overlapping frames with a 32-ms frame duration and a 16-ms frame shift, and a series of 257-dimensional frame-wise LPS was constructed accordingly. The context feature for each frame was then created by concatenating the LPS of three neighboring frames. Therefore, the input dimensions were 771 (257 × 3) for each DNN–S, 771 (257 × 3) for each DNN–DP, 5397 (771 × 7) for DNN–F, and 1799 (257 × 7) for DNN–FC, respectively. In contrast, the output dimensions of DNN–S, DNN–F, DNN–DP, and DNN–FC were all 257. In addition, each DNN–S and DNN–F model consisted of seven layers with 2,048 nodes per hidden layer. Each DNN–DP model was arranged to have five layers, whereas the DNN–FC model had four layers. The number of nodes in each hidden layer of the DNN–DP and DNN–FC models was set to 2,048.
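The dimension bookkeeping above can be verified with a short sketch of the 3-frame context construction; the edge-padding choice at utterance boundaries is an assumption (the paper does not specify it), and the helper name is hypothetical.

```python
import numpy as np

lps_dim, M = 257, 7
frames = np.arange(10 * lps_dim, dtype=float).reshape(10, lps_dim)  # toy LPS frames

def context(feats):
    """Concatenate each frame with its left/right neighbor (3-frame context);
    edge frames are padded by repetition (an assumed convention)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return np.hstack([padded[:-2], padded[1:-1], padded[2:]])

ctx = context(frames)
print(ctx.shape[1])            # 771 -> DNN-S / DNN-DP input size (257 x 3)
print(ctx.shape[1] * M)        # 5397 -> DNN-F input size (771 x 7)
print(lps_dim * M)             # 1799 -> DNN-FC input size (257 x 7)
```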

Three metrics were used to evaluate the enhanced utterances, including perceptual evaluation of speech quality (PESQ) [28], short-time objective intelligibility (STOI) [29], and segmental SNR improvement (SSNRI). Higher scores for PESQ, STOI, and SSNRI indicate better enhanced performance.
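PESQ and STOI require their reference implementations, but the segmental-SNR part of SSNRI can be illustrated with a common frame-wise, clipped definition. This is an assumed formulation (the paper does not give its exact SSNR recipe), with hypothetical names; SSNRI is then the enhanced-minus-noisy difference of segmental SNRs.

```python
import numpy as np

def seg_snr(clean, test, frame=256, floor=(-10.0, 35.0)):
    """Frame-wise SNR in dB, clipped to a typical [-10, 35] range, then averaged."""
    n = len(clean) // frame
    snrs = []
    for i in range(n):
        c = clean[i*frame:(i+1)*frame]
        e = c - test[i*frame:(i+1)*frame]          # residual error in this frame
        snr = 10 * np.log10((np.sum(c**2) + 1e-10) / (np.sum(e**2) + 1e-10))
        snrs.append(np.clip(snr, *floor))
    return float(np.mean(snrs))

rng = np.random.default_rng(4)
clean = rng.standard_normal(4096)
noisy = clean + 0.5 * rng.standard_normal(4096)      # heavily corrupted
enhanced = clean + 0.1 * rng.standard_normal(4096)   # mildly corrupted

ssnri = seg_snr(clean, enhanced) - seg_snr(clean, noisy)
print(ssnri > 0)               # enhancement should raise segmental SNR
```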

3.2 Experimental results

The averaged STOI scores of the test utterances (input) and enhanced utterances (output) for the individual microphone channels of the DNN–S system are listed in Table 1. It is clear that almost all DNN–S models improved the STOI score, except for channel m1. One possible reason is overfitting, which may have degraded the generalization capability of that model in the testing environments. Meanwhile, the STOI scores of the noisy test utterances varied across recording channels, which may be owing to the different locations of the microphones deployed in the space, as shown in Fig. 5, despite the fact that microphones m1–m6 were at the same distance from the speaker. In addition, Table 2 lists the SSNRI scores achieved by the individual DNN–S models. Similar to the case in Table 1, all the channel-wise DNN–S models brought significant SNR improvements, whereas DNN–S (m1) was less effective than the models of the other channels.

STOI m1 m2 m3 m4 m5 m6 m7
Noisy 0.672 0.581 0.631 0.668 0.671 0.666 0.663
DNN–S 0.670 0.677 0.668 0.685 0.695 0.679 0.673
Table 1: Average STOI scores of the noisy utterances tests (input) and enhanced utterances (output) of the individual DNN–S channels.
SSNRI m1 m2 m3 m4 m5 m6 m7
Noisy 0.000 0.000 0.000 0.000 0.000 0.000 0.000
DNN–S 8.722 13.179 12.859 12.132 8.699 12.530 12.127
Table 2: Average SSNRI of enhanced utterances of individual channels of DNN–S.
System DNN–S DNN–F DNN–C
STOI 0.678 0.764 0.770
SSNRI 11.464 12.760 16.729
Table 3: Average STOI and SSNRI scores of the enhanced utterances of DNN–S, DNN–F, and DNN–C under all noise conditions and microphone channels.

Table 3 shows the averaged STOI and SSNRI scores of the enhanced utterances of DNN–S, DNN–F, and DNN–C under all noise conditions and microphone channels. For DNN–S, we report the STOI and SSNRI scores averaged over all the channels shown in Tables 1 and 2. From these tables, it is clear that both DNN–F and DNN–C achieve higher evaluation scores than DNN–S, and these results further confirm that the two MC SE systems outperform DNN–S, a single-channel SE system, in improving the intelligibility and SNR of noisy utterances. Furthermore, both evaluation metrics indicate that DNN–C outperforms DNN–F, revealing the superiority of the two-stage SE architecture in promoting intelligibility and noise reduction for distorted signals.

Figure 6 compares DNN–F and DNN–C under siren and street noise conditions across all SNR levels in terms of the averaged (a) STOI and (b) SSNRI evaluation metrics. This figure further confirms that DNN–C provides higher intelligibility and more significant SNR improvements than DNN–F.

Figure 6: The (a) STOI and (b) SSNRI scores of the DNN–F- and DNN–C-enhanced noisy utterances in siren and street noise environments, averaged over the three SNR levels.

Figures 7(a)–(d) show the spectrograms of an utterance under four conditions: (a) clean and noise-free, (b) noise-corrupted (with a PESQ score of 1.642), (c) noise-corrupted and then enhanced by DNN–F (with a PESQ score of 1.834), and (d) noise-corrupted and then enhanced by DNN–C (with a PESQ score of 1.923). The utterance was corrupted with street noise at -5 dB SNR, and the noisy utterance was recorded by the m1 microphone. From these figures, we find that the spectrogram of the DNN–C-enhanced utterance in Fig. 7(d) is quite similar to that of the clean utterance in Fig. 7(a). In addition, comparing Fig. 7(d) with Fig. 7(c) shows that DNN–C recovers clearer spectral characteristics and sound structures than DNN–F, as indicated by the red blocks. These observations also explain why DNN–C achieved higher evaluation scores than DNN–F.

Figure 7: Spectrograms of (a) a clean utterance, (b) the noisy signal recorded by m1, (c) the DNN–F enhanced speech, and (d) the DNN–C enhanced version.

4 Conclusion

In this study, we presented three DNN-based SE systems for a distributed-microphone scenario in a diffuse-noise-field environment. The three systems (DNN–S, DNN–F, and DNN–C) were evaluated on the TMHINT dataset. Experimental results showed that all of these systems were able to reduce the effect of noise and improve speech intelligibility. Meanwhile, the two-stage DNN–C system achieved the best objective intelligibility score among the three systems. In the future, we plan to perform the DNN-based distributed-microphone SE task by properly selecting the recording channels rather than using all of them as the system input.


  • [1] M. S. Kavalekalam, J. K. Nielsen, M. G. Christensen, and J. B. Boldt, “Hearing aid-controlled beamformer for binaural speech enhancement using a model-based approach,” in Proc. ICASSP, pp. 321–325, 2019.
  • [2] J. Bitzer, K. U. Simmer, and K.-D. Kammeyer, “Multi-microphone noise reduction techniques as front-end devices for speech recognition,” Speech Communication, vol. 34, no. 1-2, pp. 3–12, 2001.
  • [3] P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, “End-to-end binaural sound localisation from the raw waveform,” in Proc. ICASSP, pp. 451–455, 2019.
  • [4] F. de la Hucha Arce, M. Moonen, M. Verhelst, and A. Bertrand, “Adaptive quantization for multichannel wiener filter-based speech enhancement in wireless acoustic sensor networks,” Wireless Communications and Mobile Computing, vol. 2017, p. 15, 2017.
  • [5] S. Bagheri and D. Giacobello, “Exploiting multi-channel speech presence probability in parametric multi-channel wiener filter,” in Proc. INTERSPEECH, pp. 101–105, 2019.
  • [6] W. Yang, G. Huang, J. Benesty, I. Cohen, and J. Chen, “On the design of flexible kronecker product beamformers with linear microphone arrays,” in Proc. ICASSP, pp. 441–445, 2019.
  • [7] I. A. McCowan and H. Bourlard, “Microphone array post-filter based on noise field coherence,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 709–716, 2003.
  • [8] O. Hoshuyama, A. Sugiyama, and A. Hirano, “A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters,” IEEE Transactions on signal processing, vol. 47, no. 10, pp. 2677–2684, 1999.
  • [9] S. Emura, S. Araki, T. Nakatani, and N. Harada, “Distortionless beamforming optimized with -norm minimization,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 936–940, 2018.
  • [10] X. Zhang and D. Wang, “Deep learning based binaural speech separation in reverberant environments,” IEEE/ACM transactions on audio, speech, and language processing, vol. 25, no. 5, pp. 1075–1084, 2017.
  • [11] Z.-Q. Wang and D. Wang, “All-neural multi-channel speech enhancement.,” in Proc. INTERSPEECH, pp. 3234–3238, 2018.
  • [12] N. Tawara, T. Kobayashi, and T. Ogawa, “Multi-channel speech enhancement using time-domain convolutional denoising autoencoder,” in Proc. INTERSPEECH, pp. 86–90, 2019.
  • [13] A. Hassani, A. Bertrand, and M. Moonen, “Real-time distributed speech enhancement with two collaborating microphone arrays,” in Proc. ICASSP, pp. 6586–6587, 2017.
  • [14] T. Matheja, M. Buck, and T. Fingscheidt, “A dynamic multi-channel speech enhancement system for distributed microphones in a car environment,” EURASIP Journal on Advances in Signal Processing, vol. 2013, no. 1, pp. 191–211, 2013.
  • [15] X. Li, M. Fan, L. Liu, and W. Li, “Distributed-microphones based in-vehicle speech enhancement via sparse and low-rank spectrogram decomposition,” Speech Communication, vol. 98, pp. 51–62, 2018.
  • [16] J. Tu and Y. Xia, “Fast distributed multichannel speech enhancement using novel frequency domain estimators of magnitude-squared spectrum,” Speech Communication, vol. 72, pp. 96 – 108, 2015.
  • [17] A. Plinge, F. Jacob, R. Haeb-Umbach, and G. A. Fink, “Acoustic microphone geometry calibration: An overview and experimental evaluation of state-of-the-art algorithms,” IEEE Signal Processing Magazine, vol. 33, no. 4, pp. 14–29, 2016.
  • [18] A. Bertrand, “Applications and trends in wireless acoustic sensor networks: A signal processing perspective,” in Proc. SCVT, pp. 1–6, 2011.
  • [19] S. Markovich-Golan, A. Bertrand, M. Moonen, and S. Gannot, “Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks,” Signal Processing, vol. 107, pp. 4–20, 2015.
  • [20] S. Markovich, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1071–1086, 2009.
  • [21] R. Serizel, M. Moonen, B. Van Dijk, and J. Wouters, “Low-rank approximation based multichannel wiener filter algorithms for noise reduction with application in cochlear implants,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 785–799, 2014.
  • [22] M. B. Trawicki and M. T. Johnson, “Distributed multichannel speech enhancement with minimum mean-square error short-time spectral amplitude, log-spectral amplitude, and spectral phase estimation,” Signal Processing, vol. 92, no. 2, pp. 345 – 356, 2012.
  • [23] A. Bertrand and M. Moonen, “Distributed lcmv beamforming in a wireless sensor network with single-channel per-node signal transmission,” IEEE Transactions on Signal Processing, vol. 61, no. 13, pp. 3447–3459, 2013.
  • [24] A. Bertrand and M. Moonen, “Distributed node-specific LCMV beamforming in wireless sensor networks,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 233–246, 2011.
  • [25] S. Markovich-Golan, S. Gannot, and I. Cohen, “Distributed multiple constraints generalized sidelobe canceler for fully connected wireless acoustic sensor networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 343–356, 2012.
  • [26] V. M. Tavakoli, J. R. Jensen, R. Heusdens, J. Benesty, and M. G. Christensen, “Distributed max-SINR speech enhancement with ad hoc microphone arrays,” in Proc. ICASSP, pp. 151–155, 2017.
  • [27] M. Huang, “Development of taiwan mandarin hearing in noise test,” Department of speech language pathology and audiology, National Taipei University of Nursing and Health science, 2005.
  • [28] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, pp. 749–752, 2001.
  • [29] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.