1 Introduction
Almost all voice-based applications, such as mobile communications, hearing aids or human-to-machine interfaces, require a clean version of speech for optimal use. Single-channel speech enhancement can substantially improve the speech intelligibility and speech recognition of a noisy mixture [1, 2]. However, the improvement achievable with a single-channel filter is limited by the distortions introduced during the filtering operation. These distortions can be reduced by multichannel processing, which exploits spatial information [3, 4]. The multichannel Wiener filter (MWF) [5], for example, yields the optimal filter in the mean squared error (MSE) sense and can be extended to a speech distortion weighted multichannel Wiener filter (SDW-MWF), where the noise reduction is balanced against the speech distortion [6].
Up to a certain point, the effectiveness of these algorithms increases with the number of microphones. More microphones allow for a wider coverage of the acoustic scene and a more accurate estimation of the statistics of the source signals. In large rooms, or even in flats, this implies the need for large microphone arrays which, if they are constrained, can become prohibitively expensive and lack flexibility. However, in our daily life, with the omnipresence of computers, we are surrounded by an increasing number of embedded microphones, telephones and tablets. These devices can be viewed as unconstrained ad hoc microphone arrays, which are promising but also challenging [7]. A distributed adaptive node-specific signal estimation (DANSE) algorithm [8], where the nodes exchange a single linear combination of their local signals, was proposed for a fully connected microphone array. It was shown to converge to the centralized MWF [9]. The constraint of a fully connected array can be lifted with randomized gossiping-based algorithms, where beamformer coefficients are computed in a distributed fashion [10]. Message passing [11] or diffusion-based [12] algorithms can increase the rather slow convergence rate of these solutions. Another way to exploit the broad coverage of the acoustic field by ad hoc microphone arrays is to gather the microphones into clusters dominated by a single common source, which can then be estimated more efficiently [13].
All these algorithms require knowledge of either the direction of arrival (DOA) or the speech activity to compute the filters, and they are sensitive to signal mismatches [14] or detection errors [6]. Deep learning-based approaches have been proposed to estimate these quantities accurately through the prediction of a time-frequency (TF) mask [15, 16, 17] or of the spectrum of the desired signals [18]. Although often used in a multichannel context, most of these solutions use single-channel data as input to their deep neural networks. Multichannel information was first taken into account through spatial features [19], but it can also be exploited by using the magnitude and phase of several microphones as the input of a convolutional recurrent neural network (CRNN) [20, 21]. This yields better results than single-channel prediction, but combining all the sensor signals is not scalable and seems suboptimal because of the redundancy of the data. To cope with this redundancy, Perotin et al. [22] combined a single estimate of the source signals with the input mixture and used the resulting tensor to train a long short-term memory (LSTM) recurrent neural network (RNN).
In this paper, we consider a fully connected microphone array with synchronized sensors. This allows the use of the MWF-based DANSE algorithm, which was reported to achieve good speech enhancement performance [9]. Following the results shown by Perotin et al. [22], we take advantage of the DANSE paradigm [9] by combining, at each node, one local signal with the estimates of the target signal sent by the other nodes. This exploits the multichannel context for the mask estimation but avoids the redundancy brought by the signals of the same node. Additionally, this scheme takes advantage of the internal filtering performed in DANSE and reduces the costs in terms of bandwidth and computational power compared to a network combining all the sensor signals.
2 Problem formulation
2.1 Signal model
We consider an additive noise model expressed in the short-time Fourier transform (STFT) domain as $y(f, t) = s(f, t) + n(f, t)$, where $y(f, t)$ is the recorded mixture at frequency index $f$ and time frame index $t$. The speech target signal is denoted $s$ and the noise signal $n$. For the sake of conciseness, we will drop the time and frequency indexes $f$ and $t$. The signals are captured by $M$ microphones and stacked into a vector $\mathbf{y} = [y_1, \dots, y_M]^T$. In the following, regular lowercase letters denote scalars; bold lowercase letters indicate vectors and bold uppercase letters indicate matrices.
2.2 Multichannel Wiener filter
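The MWF operates in a fully connected microphone array. It aims at estimating the speech component $s_i$ of a reference signal at microphone $i$. Without loss of generality, we take the reference microphone as $i = 1$ in the remainder of the paper. The MWF minimises the MSE cost function expressed as follows:
$$J(\mathbf{w}) = \mathbb{E}\{|s_1 - \mathbf{w}^H \mathbf{y}|^2\}, \qquad (1)$$
where $\mathbf{w}$ is the filter applied to the stacked microphone signals $\mathbf{y}$, $\mathbb{E}\{\cdot\}$ is the expectation operator and $\cdot^H$ denotes the Hermitian transpose. The solution to (1) is given by
$$\hat{\mathbf{w}} = \mathbf{R}_{yy}^{-1} \mathbf{R}_{ys} \mathbf{e}_1, \qquad (2)$$
with $\mathbf{R}_{yy} = \mathbb{E}\{\mathbf{y}\mathbf{y}^H\}$, $\mathbf{R}_{ys} = \mathbb{E}\{\mathbf{y}\mathbf{s}^H\}$ and $\mathbf{e}_1 = [1\ 0\ \dots\ 0]^T$, where $\mathbf{s}$ and $\mathbf{n}$ stack the speech and noise components of $\mathbf{y}$. Under the assumption that speech and noise are uncorrelated and that the noise is locally stationary, $\mathbf{R}_{ys} = \mathbf{R}_{ss} = \mathbf{R}_{yy} - \mathbf{R}_{nn}$, where $\mathbf{R}_{ss} = \mathbb{E}\{\mathbf{s}\mathbf{s}^H\}$ and $\mathbf{R}_{nn} = \mathbb{E}\{\mathbf{n}\mathbf{n}^H\}$. Computing these matrices requires knowledge of the noise-only periods and of the speech-plus-noise periods. This is typically obtained with a voice activity detector (VAD) [6, 9].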
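As an illustration of how VAD-driven statistics feed the filter in (2), the following minimal NumPy sketch estimates the mixture covariance on speech-plus-noise frames and the noise covariance on noise-only frames of one frequency bin, then solves for the filter. The function name `mwf_from_vad`, the toy steering vector and the noise level are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def mwf_from_vad(Y, vad, ref=0):
    """MWF for one frequency bin.

    Y   : (M, T) complex STFT coefficients of the M microphones.
    vad : (T,) boolean, True on speech-plus-noise frames.
    """
    Y_sn = Y[:, vad]                      # speech-plus-noise frames
    Y_n = Y[:, ~vad]                      # noise-only frames
    R_yy = Y_sn @ Y_sn.conj().T / Y_sn.shape[1]
    R_nn = Y_n @ Y_n.conj().T / Y_n.shape[1]
    R_ss = R_yy - R_nn                    # holds if speech and noise are uncorrelated
    # w = R_yy^{-1} R_ss e_ref (column `ref` of R_ss equals R_ss e_ref)
    return np.linalg.solve(R_yy, R_ss[:, ref])

# Toy data (hypothetical): one source seen through a random steering vector.
rng = np.random.default_rng(0)
M, T = 4, 2000
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)
vad = np.arange(T) >= T // 2              # speech active in the second half
s = np.where(vad, rng.standard_normal(T) + 1j * rng.standard_normal(T), 0)
S = np.outer(a, s)                        # speech component at each microphone
N = 0.3 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
Y = S + N

w = mwf_from_vad(Y, vad)
s_hat = w.conj() @ Y                      # estimate of the speech at microphone 1
```

The estimate `s_hat` targets the speech component at the reference microphone, not the dry source, which matches the cost function in (1).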
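The SDW-MWF provides a trade-off between the noise reduction and the speech distortion [6]. The filter parameters minimise the cost function
$$J(\mathbf{w}) = \mathbb{E}\{|s_1 - \mathbf{w}^H \mathbf{s}|^2\} + \mu\, \mathbb{E}\{|\mathbf{w}^H \mathbf{n}|^2\}, \qquad (3)$$
with $\mu$ the trade-off parameter. The solution to (3) is given by
$$\hat{\mathbf{w}} = (\mathbf{R}_{ss} + \mu \mathbf{R}_{nn})^{-1} \mathbf{R}_{ss} \mathbf{e}_1, \qquad (4)$$
where $\mathbf{R}_{ss}$ and $\mathbf{R}_{nn}$ are the speech and noise covariance matrices.
If the desired signal comes from a single source, the speech covariance matrix is theoretically of rank 1. Under this assumption, Serizel et al. [23] proposed a rank-1 approximation of $\mathbf{R}_{ss}$ based on a generalized eigenvalue decomposition (GEVD), delivering a filter that is more robust in low-SNR scenarios and provides stronger noise reduction.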
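The rank-1 idea can be sketched as follows. This is a generic GEVD-based rank-1 approximation written for illustration, not necessarily the exact procedure of [23]: the noise covariance is whitened through a Cholesky factor, the dominant generalized eigenpair of the mixture and noise covariances is kept, and the result is plugged into the SDW-MWF solution (4). The function names and the toy covariances are assumptions.

```python
import numpy as np

def sdw_mwf(R_ss, R_nn, mu=1.0, ref=0):
    """SDW-MWF filter (R_ss + mu R_nn)^{-1} R_ss e_ref."""
    return np.linalg.solve(R_ss + mu * R_nn, R_ss[:, ref])

def rank1_gevd_speech_cov(R_yy, R_nn):
    """Rank-1 speech covariance from the GEVD of (R_yy, R_nn)."""
    L = np.linalg.cholesky(R_nn)          # R_nn = L L^H
    L_inv = np.linalg.inv(L)
    A = L_inv @ R_yy @ L_inv.conj().T     # whitened mixture covariance
    lam, U = np.linalg.eigh(A)            # eigenvalues in ascending order
    q = L @ U[:, -1]                      # dominant generalized eigenvector
    # R_yy = Q diag(lam) Q^H and R_nn = Q Q^H, so speech power is lam - 1
    return max(lam[-1] - 1.0, 0.0) * np.outer(q, q.conj())

# Toy covariances (hypothetical): one source plus correlated noise.
rng = np.random.default_rng(0)
M = 4
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_ss_true = 2.0 * np.outer(a, a.conj())
R_nn = B @ B.conj().T / M + 0.1 * np.eye(M)
R_yy = R_ss_true + R_nn

R_ss_r1 = rank1_gevd_speech_cov(R_yy, R_nn)
w_mu1 = sdw_mwf(R_ss_r1, R_nn, mu=1.0)
w_mu5 = sdw_mwf(R_ss_r1, R_nn, mu=5.0)
noise_out_mu1 = np.real(w_mu1.conj() @ R_nn @ w_mu1)
noise_out_mu5 = np.real(w_mu5.conj() @ R_nn @ w_mu5)
```

With exact covariances the rank-1 estimate recovers the true speech covariance; with estimated covariances it acts as a constraint that discards the spurious eigen-directions. Increasing $\mu$ lowers the residual noise power at the output, at the price of more speech distortion.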
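2.3 DANSE
In this section, we briefly describe the DANSE algorithm under the assumption that a single target source is present. We consider $M$ microphones spread over $K$ nodes, each node $k$ containing $M_k$ microphones. The signals of one node are stacked in $\mathbf{y}_k = [y_{k,1}, \dots, y_{k,M_k}]^T$. As can be seen in (2), the array-wide MWF should be computed from all signals of the array, which can result in high bandwidth and computational costs. In DANSE, only a single compressed signal $z_k$ is sent from node $k$ to the other nodes. A node $k$ therefore has $M_k + K - 1$ signals, stacked in $\tilde{\mathbf{y}}_k = [\mathbf{y}_k^T, \mathbf{z}_{-k}^T]^T$, where $\mathbf{z}_{-k}$ is a column vector gathering the compressed signals coming from the other nodes $j \neq k$. Replacing $\mathbf{y}$ by $\tilde{\mathbf{y}}_k$ and solving (3) yields the DANSE solution to the SDW-MWF:
$$\tilde{\mathbf{w}}_k = (\mathbf{R}_{ss,k} + \mu \mathbf{R}_{nn,k})^{-1} \mathbf{R}_{ss,k} \mathbf{e}_1, \qquad (5)$$
where $\tilde{\mathbf{w}}_k$, the filter at node $k$, can be decomposed into two filters as $\tilde{\mathbf{w}}_k = [\mathbf{w}_{kk}^T, \mathbf{g}_{k,-k}^T]^T$. The first filter $\mathbf{w}_{kk}$ is applied on the local signals and $\mathbf{g}_{k,-k}$ is applied on the compressed signals sent from the other nodes. The covariance matrices $\mathbf{R}_{ss,k}$ and $\mathbf{R}_{nn,k}$ are computed from the speech and noise components of $\tilde{\mathbf{y}}_k$. The compressed signal is computed as $z_k = \mathbf{w}_{kk}^H \mathbf{y}_k$. Bertrand and Moonen proved that this solution converges to the MWF solution with $\mu = 1$, while dividing the bandwidth load by a factor $M_k$ at each node [9].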
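The sequential node updating described above can be sketched with exact (population) covariances. The code below is a simplified illustration, not the authors' implementation: each node builds its extended signal from its local microphones and one compressed signal per remote node, solves the SDW-MWF on that reduced set, and updates its local filter. The function `danse`, the compression matrix `T`, the rank-1 toy source and the diagonal noise covariance are all assumptions chosen so that the result can be compared with the centralized filter.

```python
import numpy as np

def sdw_mwf(R_ss, R_nn, mu=1.0, ref=0):
    """SDW-MWF filter (R_ss + mu R_nn)^{-1} R_ss e_ref."""
    return np.linalg.solve(R_ss + mu * R_nn, R_ss[:, ref])

def danse(R_ss, R_nn, nodes, n_rounds=3, mu=1.0):
    """Sequential DANSE with one compressed signal per node.

    nodes : list of microphone index lists, one per node.
    Returns the equivalent network-wide filter of each node.
    """
    M, K = R_ss.shape[0], len(nodes)
    # Local filters w_kk, initialised to select the first local microphone
    w_local = [np.eye(len(idx), 1, dtype=complex)[:, 0] for idx in nodes]
    w_net = [None] * K
    for _ in range(n_rounds):
        for k in range(K):                       # nodes update one after another
            others = [j for j in range(K) if j != k]
            M_k = len(nodes[k])
            # T maps all microphones to [local signals, compressed signals]
            T = np.zeros((M, M_k + len(others)), dtype=complex)
            T[nodes[k], np.arange(M_k)] = 1.0
            for col, j in enumerate(others):
                T[nodes[j], M_k + col] = w_local[j]   # z_j = w_jj^H y_j
            R_ss_k = T.conj().T @ R_ss @ T
            R_nn_k = T.conj().T @ R_nn @ T
            w_tilde = sdw_mwf(R_ss_k, R_nn_k, mu)     # [w_kk, g_k,-k]
            w_local[k] = w_tilde[:M_k]
            w_net[k] = T @ w_tilde                    # filter expressed on all mics
    return w_net

# Toy scene (hypothetical): 2 nodes of 3 microphones, one source,
# noise uncorrelated across microphones.
rng = np.random.default_rng(0)
nodes = [[0, 1, 2], [3, 4, 5]]
a = rng.standard_normal(6) + 1j * rng.standard_normal(6)
R_ss = 2.0 * np.outer(a, a.conj())
R_nn = np.diag(rng.uniform(0.5, 1.5, 6)).astype(complex)

w_net = danse(R_ss, R_nn, nodes)
w_central = [sdw_mwf(R_ss, R_nn, ref=idx[0]) for idx in nodes]
```

In this toy setting each node ends up with the same filter as a centralized MWF that has access to all six microphones, although it only ever receives one extra signal from the other node. With noise correlated across nodes, more update rounds are generally needed.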
3 Deep neural network based distributed multichannel Wiener filter
Heymann et al. predicted TF masks from a single signal of the microphone array [16]. Perotin et al. [22] and Chakrabarty and Habets [21] included several other signals to improve the speech recognition or speech enhancement performance. We propose to extend these scenarios to the multi-node context of DANSE. In DANSE, at node $k$, a single VAD is used to estimate the source and noise statistics required for both the local filter $\mathbf{w}_{kk}$ and the filter $\mathbf{g}_{k,-k}$ applied on the received compressed signals. The first part of our contribution is to replace the VAD by a TF mask predicted by a DNN. Besides, since the compressed signals $\mathbf{z}_{-k}$ are sent from one node to the others, we also examine the option of exploiting this extra source of information by using it for the mask prediction.
The schematic principle of DANSE is depicted in Figure 1. As can be seen, an initialisation phase is required to compute the initial compressed signal $z_k$. We propose to do this with a first neural network. The second stage of DANSE is represented in the greyed box in Figure 1 and expanded in Figure 2. Our second contribution is highlighted with the red arrow: it is to exploit the presence of $\mathbf{z}_{-k}$ at one node to better predict the masks with the DNN. Several iterations are necessary for the filter to converge to the solution (4). In DANSE, iterations are done at every time step. As we developed an offline batch-mode algorithm, we stopped the processing after the first iteration. To analyse the effectiveness of combining $\mathbf{z}_{-k}$ with a reference signal to predict the mask, we compare our solution with a single-channel prediction, where the masks required for both the initialisation and iteration stages are predicted by a single-channel model seeing only the local signal $y_{k,1}$.
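To make the difference between a wideband VAD and a TF mask concrete, the sketch below shows one common way of turning a mask into per-frequency covariance estimates: each TF bin of the mixture contributes to the speech statistics with weight equal to the mask and to the noise statistics with the complementary weight. The function `masked_covariances` and its normalisation are assumptions for illustration; the paper does not spell out its exact estimator.

```python
import numpy as np

def masked_covariances(Y, mask, eps=1e-8):
    """Mask-weighted covariance estimates for every frequency bin.

    Y    : (M, F, T) complex STFT of the M available signals.
    mask : (F, T) values in [0, 1], close to 1 where speech dominates.
    Returns two arrays of shape (F, M, M).
    """
    R_speech = np.einsum('ft,mft,nft->fmn', mask, Y, Y.conj())
    R_noise = np.einsum('ft,mft,nft->fmn', 1.0 - mask, Y, Y.conj())
    R_speech /= mask.sum(axis=1)[:, None, None] + eps
    R_noise /= (1.0 - mask).sum(axis=1)[:, None, None] + eps
    return R_speech, R_noise

# Toy input (hypothetical): a mask that is constant over frequency behaves
# like a wideband VAD that is active during the first 50 frames.
rng = np.random.default_rng(0)
M, F, T = 3, 5, 100
Y = rng.standard_normal((M, F, T)) + 1j * rng.standard_normal((M, F, T))
mask = np.zeros((F, T))
mask[:, :50] = 1.0
R_speech, R_noise = masked_covariances(Y, mask)
```

A predicted mask simply replaces the binary pattern above with values that vary over both time and frequency, which is what allows a finer estimate than a VAD.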
We compare two different architectures for each of these schemes. The first architecture is a bidirectional LSTM introduced by Heymann et al. [16]. When additional inputs are used with an RNN, they are stacked over the frequency axis [22]. Although this might deliver improved performance compared to the single-channel version, stacking the inputs over the frequency axis is not efficient, as many connections are used to represent relations between TF bins that might not be related. That is why we propose a CRNN architecture, which is more appropriate for processing multichannel data. At each node, the compressed signals and the local reference signal are considered as separate convolutional channels.
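The two input layouts can be contrasted with array shapes. In this small sketch, 257 frequency bins correspond to a 512-point FFT and 21 frames to the input sequence length; the use of magnitude spectra and the variable names are assumptions for illustration.

```python
import numpy as np

F, T = 257, 21                                   # frequency bins, frames per sequence
rng = np.random.default_rng(0)
local_mag = np.abs(rng.standard_normal((T, F)))  # local reference signal (placeholder data)
z_mag = np.abs(rng.standard_normal((T, F)))      # compressed signal from another node

# RNN: additional inputs are stacked over the frequency axis
rnn_input = np.concatenate([local_mag, z_mag], axis=1)   # (T, 2F)

# CRNN: each signal is a separate convolutional channel
crnn_input = np.stack([local_mag, z_mag], axis=0)        # (channels, T, F)
```

In the stacked layout, the input dimension of the recurrent layer grows with every added signal, whereas in the channel layout only the number of input channels of the first convolution changes.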
During the training, in order to take into account the spectral shape of the speech, we weight the MSE loss between the predicted mask and the ground truth mask by the STFT frame of the input , corresponding to the predicted frame. Both models are thus trained to minimise the cost function
where represents the empirical mean.
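One plausible reading of this weighted loss is sketched below: the mask error of each TF bin is scaled by the magnitude of the input STFT before squaring and averaging, so that errors in high-energy bins cost more. The exact form used for training (for instance whether the weight sits inside or outside the square) is an assumption here, as is the function name `weighted_mask_mse`.

```python
import numpy as np

def weighted_mask_mse(mask_pred, mask_true, y_mag):
    """Empirical mean of the squared, magnitude-weighted mask error."""
    return np.mean((y_mag * (mask_pred - mask_true)) ** 2)

# Toy example (hypothetical values): the same mask error placed in a
# high-energy bin and in a low-energy bin.
y_mag = np.array([[2.0, 1.0], [0.5, 1.0]])
mask_true = np.array([[1.0, 0.0], [1.0, 0.0]])
pred_err_loud = np.array([[0.5, 0.0], [1.0, 0.0]])   # error where |y| = 2.0
pred_err_quiet = np.array([[1.0, 0.0], [0.5, 0.0]])  # error where |y| = 0.5

loss_loud = weighted_mask_mse(pred_err_loud, mask_true, y_mag)
loss_quiet = weighted_mask_mse(pred_err_quiet, mask_true, y_mag)
```

The weighting focuses the training on bins that carry signal energy, where a wrong mask value has the largest effect on the estimated statistics.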
Lastly, since the filter is also applied on the compressed signals $\mathbf{z}_{-k}$, we use the GEVD of the covariance matrices to compute the MWF of equation (4). Contrary to equation (2), this does not explicitly take the first microphone as a reference. It also assigns higher importance to the compressed signals, which is desirable since they are pre-filtered and therefore potentially have higher signal-to-noise ratios than the local signals.
4 Experimental setup
4.1 Dataset
Training as well as test data were generated by convolving clean speech and noise signals with simulated room impulse responses (RIRs), and then mixing the convolved signals at a specific SNR. The anechoic speech material was taken from the clean subset of LibriSpeech [24]. The RIRs were obtained with the Matlab toolbox Roomsimove (homepages.loria.fr/evincent/software/Roomsimove_1.4.zip), simulating shoebox-like rooms.
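Mixing at a prescribed SNR amounts to rescaling the noise before adding it to the speech. The helper below is a minimal sketch of that step on already convolved signals; the name `mix_at_snr`, the placeholder signals and the power-based SNR definition over the whole file are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

# Toy signals (hypothetical): 1 s of placeholder "speech" and "noise" at 16 kHz.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = 0.1 * rng.standard_normal(16000)
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
achieved_snr = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```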
In the training set, the length of the room was drawn uniformly as m, the width as m, the height as m. Two nodes of four microphones each recorded the acoustic scene. The distance between the nodes was set to m, the microphones being cm away from the node centre. Each node was at least 1 m away from the closest wall. One source of noise and one of speech were placed at m from the array centre. Both sources had an angular distance relative to the array centre. The microphones as well as the sources were at the constant height of m. The SNR was drawn uniformly between dB and
dB. The noise was white noise modulated in the spectral domain by the long-term spectrum of speech. We generated files of 10 seconds each, corresponding to about 25 hours of training material.
The test configuration was the same as the training configuration but with restricted values for some parameters. The length of the room was randomly selected among m, the width among m, and the height was set to m. The angular distance between the sources was randomly selected in . The noise was a random part of the third CHiME challenge dataset [25] in the cafeteria or pedestrian environment. We generated files representing about 2 hours of test material.
4.2 Setup
All the data was sampled at 16 kHz. The STFT was computed with an FFT length of 512 samples (32 ms), 50% overlap and a Hann window.
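These STFT settings translate into the following NumPy sketch, which gives 257 frequency bins per frame. The framing convention (no padding at the signal edges) and the function name `stft` are assumptions; a library implementation may pad or centre frames differently.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """STFT with a Hann window; returns an array of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

fs = 16000                                        # sampling rate in Hz
x = np.random.default_rng(0).standard_normal(fs)  # 1 s of placeholder audio
X = stft(x)                                       # 512 samples = 32 ms, hop of 50%
```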
Our CRNN model was composed of three convolutional layers with 32, 64 and 64 filters respectively. They all had
kernels, with stride
and ReLU activation functions. Each convolutional layer was followed by a batch normalization over the frequency axis and a maximum pooling layer of size
(along the frequency axis). The recurrent part of the network was a layer with 256 gated recurrent units, and the last layer was a fully connected layer with a sigmoid activation function. The input data of both the CRNN and RNN networks was made of sequences of 21 STFT frames, and the mask corresponding to the middle frame was predicted. We trained them with the RMSprop optimizer [26].
5 Results
We evaluate the speech enhancement performance based on the source-to-artifacts ratio (SAR), source-to-interferences ratio (SIR) and source-to-distortion ratio (SDR) [27], computed with the mir_eval toolbox (https://github.com/craffel/mir_eval/). The reported performance corresponds to the mean, over the test samples, of the objective measures computed at the node with the best input SNR. We also report the 95% confidence interval.
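For reference, a mean and 95% confidence interval over per-file scores can be computed as in the sketch below. It assumes a normal approximation of the sampling distribution of the mean (half-width of 1.96 standard errors); the paper does not state which estimator was used, and the function name `mean_ci95` and the scores are hypothetical.

```python
import numpy as np

def mean_ci95(values):
    """Mean and half-width of the 95% confidence interval (normal approximation)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    half_width = 1.96 * values.std(ddof=1) / np.sqrt(len(values))
    return mean, half_width

# Hypothetical per-file SDR values in dB
sdr_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
sdr_mean, sdr_ci = mean_ci95(sdr_scores)   # reported as mean ± half-width
```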
The GEVD filter does not explicitly take one sensor signal as the reference signal to minimise the cost function, but rather a projection of the input signals onto the space spanned by the common eigenvectors of the covariance matrices. Because of that, the objective measures computed with respect to the convolved signals did not give results that were coherent with perceptual listening tests performed internally on random samples. Indeed, differences between the enhanced signal and the reference signal are interpreted as artefacts, whereas they are due to the decomposition of the input signals into the eigenvalue space of the covariance matrices. Therefore, we compute the objective measures using the dry (source) signals as reference signals. This decreases the SAR because the reverberation is then considered as an artefact, but the comparison between methods correlates better with the perceptual listening tests.
We present the objective metrics for the different approaches in Table 1. In this table, single-node filters are referred to as MWF (upper part of the table) and distributed filters as DANSE (lower part of the table). For each filter, the architecture used to obtain the masks is indicated in parentheses. RNN refers to Heymann’s architecture and CRNN to the network introduced in Section 4.2. The subscript of the network architecture indicates the channels considered at the input. The results obtained with the single-channel DNN models are denoted with "SC". When the compressed signals were used as additional input to the DNN to predict the mask of the second filtering stage, models are denoted with "MC". Additionally, we report the number of trainable parameters of each model in Table 2.
5.1 Oracle performance
The VAD gives information about the speech-plus-noise and noise-only periods in a wideband manner only, whereas a mask gives spectral information that enables a finer estimation of the speech and noise covariance matrices. This additional information translates into an improvement of the speech enhancement performance with both types of filters (MWF and DANSE). In the following section, we analyse whether this conclusion still holds when the masks are predicted by a neural network.
5.2 Performance with predicted masks
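Table 1: Objective measures in dB (mean ± 95% confidence interval).
| (dB) | SAR | SIR | SDR |
|---|---|---|---|
| MWF (oracle VAD) | 2.4 ± 0.3 | 24.7 ± 0.3 | 2.3 ± 0.3 |
| MWF (oracle mask) | 4.0 ± 0.3 | 26.7 ± 0.3 | 3.9 ± 0.3 |
| MWF (RNN) | 3.4 ± 0.3 | 25.1 ± 0.4 | 3.3 ± 0.3 |
| MWF (CRNN) | 3.3 ± 0.3 | 25.1 ± 0.4 | 3.2 ± 0.3 |
| DANSE (oracle VAD) | 2.6 ± 0.3 | 25.2 ± 0.3 | 2.6 ± 0.3 |
| DANSE (oracle mask) | 4.8 ± 0.3 | 27.6 ± 0.3 | 4.8 ± 0.3 |
| DANSE (RNN_SC) | | 26.0 ± 0.4 | |
| DANSE (CRNN_SC) | | | |
| DANSE (RNN_MC) | | | |
| DANSE (CRNN_MC) | 4.7 ± 0.3 | 27.4 ± 0.3 | 4.6 ± 0.3 |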
First, replacing the oracle VAD by masks brings significant improvement in terms of all objective measures. This confirms the idea that TF masks are better activity detectors than VADs, even oracle ones. Second, the objective measures corresponding to the output signals of DANSE filters are always better than those of the MWF filters. This confirms the benefit of using the DANSE algorithm. Although these differences are not high, increasing the number of nodes and the distance between them might enhance the utility of the distributed method.
From the results in Table 1, there is no clear advantage of using a CRNN over an RNN in the single-channel case. Indeed, the objective measures of the RNN and the CRNN match on all metrics. In the multichannel case, the performance of the RNN-based approach does not increase. This tends to confirm that the RNN is not able to efficiently exploit multichannel information. Since the RNN delivered good results in the single-channel scenario, this leads to the conclusion that stacking multichannel input on the frequency axis is not appropriate. In addition, as shown in Table 2, the number of parameters of the RNN almost doubles when a second signal is used, whereas it barely increases for the CRNN. This is due to the convolutional layers of the CRNN, which can process multichannel data much more efficiently than recurrent layers.
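The difference in growth can be illustrated by counting the parameters of the first layer only. The sketch below uses textbook parameter counts for an LSTM layer (single bias per gate) and a 2-D convolution; the hidden size of 256 and the 3×3 kernel are hypothetical values chosen for the example, so the totals do not reproduce Table 2.

```python
def lstm_params(input_dim, hidden):
    """Parameters of one LSTM direction: 4 gates with input weights,
    recurrent weights and one bias vector each."""
    return 4 * (input_dim * hidden + hidden * hidden + hidden)

def conv2d_params(in_channels, out_channels, kernel_h, kernel_w):
    """Parameters of a 2-D convolution: weights plus one bias per filter."""
    return out_channels * in_channels * kernel_h * kernel_w + out_channels

F = 257        # frequency bins of a 512-point FFT
hidden = 256   # assumed hidden size

lstm_sc = lstm_params(F, hidden)        # one signal
lstm_mc = lstm_params(2 * F, hidden)    # two signals stacked over frequency
conv_sc = conv2d_params(1, 32, 3, 3)    # one input channel, 32 filters
conv_mc = conv2d_params(2, 32, 3, 3)    # two input channels, 32 filters

print(lstm_mc - lstm_sc, conv_mc - conv_sc)
```

Under these assumptions, adding a second signal costs 263168 extra parameters in the first recurrent layer but only 288 in the first convolutional layer, which is the mechanism behind the trend described above.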
The CRNN solution can exploit the multichannel inputs efficiently, and the performance increases for all metrics. The biggest improvement is obtained for the SIR. Indeed, one of the main difficulties for the models is to predict noise-only regions, because of people talking in the CHiME noise database. Since the compressed signals are pre-filtered, they contain less noise and are less ambiguous. This makes it easier for the model to recognize noise-only regions, without degrading its predictions of speech-plus-noise regions.
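Table 2: Number of trainable parameters of each model.
| Model | Number of parameters |
|---|---|
| RNN_SC | |
| CRNN_SC | |
| RNN_MC | |
| CRNN_MC | |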
6 Conclusion and future work
We introduced an efficient way of estimating masks in a multi-node context. We developed multichannel models combining an estimate of the target signal sent by the other nodes with a local sensor signal. This proved to better predict TF masks, which led to speech enhancement performance exceeding that obtained with an oracle VAD. A CRNN was compared to an RNN, and the CRNN exploited the multichannel information much better. In addition, the RNN architecture is limited by its number of parameters, especially if the number of nodes were to increase. In such scenarios, the performance difference between single-channel and multichannel models might be even larger, but this remains to be explored. To attain performance closer to that of the oracle masks, several options are possible. First, the rather simple architectures that were used could be replaced by state-of-the-art architectures. Besides, given the increase in performance when the target estimate is provided, it would also be interesting to additionally provide the noise estimate at the input of the models.
References
 [1] T. Gerkmann and R. C. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 4, pp. 1383–1393, 2012.
 [2] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in IEEE GlobalSIP. IEEE, 2014, pp. 577–581.
 [3] O.L. Frost, “An algorithm for linearly constrained adaptive array processing,” Proceedings of the IEEE, vol. 60, no. 8, pp. 926–935, 1972.
 [4] E. Vincent, T. Virtanen, and S. Gannot, Eds., Audio source separation and speech enhancement, John Wiley edition, 2018.
 [5] S. Doclo and M. Moonen, “GSVD-based optimal filtering for single and multi-microphone speech enhancement,” IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2230–2244, 2002.
 [6] S. Doclo, A. Spriet, J. Wouters, and M. Moonen, “Frequency-domain criterion for the speech distortion weighted multichannel Wiener filter for robust noise reduction,” Speech Communication, vol. 49, no. 7-8, pp. 636–656, 2007.
 [7] A. Bertrand, S. Doclo, S. Gannot, N. Ono, and T. van Waterschoot, “Special issue on wireless acoustic sensor networks and ad hoc microphone arrays,” Signal Processing, vol. 107, no. C, pp. 1–3, 2015.
 [8] A. Bertrand, J. Callebaut, and M. Moonen, “Adaptive distributed noise reduction for speech enhancement in wireless acoustic sensor networks,” in Proc. of IWAENC, 2010.
 [9] A. Bertrand and M. Moonen, “Distributed adaptive node-specific signal estimation in fully connected sensor networks - Part I: Sequential node updating,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5277–5291, 2010.
 [10] Y. Zeng and R. C. Hendriks, “Distributed estimation of the inverse of the correlation matrix for privacy preserving beamforming,” Signal Processing, vol. 107, pp. 109–122, 2015.
 [11] R. Heusdens, G. Zhang, R. C. Hendriks, Y. Zeng, and W. B. Kleijn, “Distributed MVDR beamforming for (wireless) microphone networks using message passing,” in IWAENC, 2012, pp. 1–4.
 [12] M. O’Connor and W. B. Kleijn, “Diffusion-based distributed MVDR beamformer,” in IEEE Proc. of ICASSP, 2014, pp. 810–814.
 [13] S. Gergen, R. Martin, and N. Madhu, “Source separation by feature-based clustering of microphones in ad hoc arrays,” IWAENC, pp. 530–534, 2018.
 [14] S. A. Vorobyov, A. B. Gershman, and Z. Q. Luo, “Robust adaptive beamforming using worst-case performance optimization: A solution to the signal mismatch problem,” IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 313–324, 2003.
 [15] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” IEEE ICASSP, pp. 7092–7096, 2013.
 [16] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in IEEE ICASSP, 2016, vol. 2016-May, pp. 196–200.
 [17] L. Perotin, R. Serizel, E. Vincent, and A. Guérin, “CRNN-based joint azimuth and elevation localization with the ambisonics intensity vector,” in 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2018, pp. 241–245.
 [18] A.A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 10, pp. 1652–1664, 2016.
 [19] Y. Jiang, D. Wang, R. Liu, and Z. Feng, “Binaural classification for reverberant speech segregation using deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 12, pp. 2112–2121, 2014.
 [20] S. Adavanne, A. Politis, and T. Virtanen, “Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network,” in EUSIPCO, Sep. 2018, pp. 1462–1466.
 [21] S. Chakrabarty and E. A. P. Habets, “Time-Frequency Masking Based Online Multi-Channel Speech Enhancement With Convolutional Recurrent Neural Networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 1–1, 2019.
 [22] L. Perotin, R. Serizel, E. Vincent, and A. Guérin, “Multichannel speech separation with recurrent neural networks from high-order ambisonics recordings,” in IEEE Proc. of ICASSP, 2018, pp. 36–40.
 [23] R. Serizel, M. Moonen, B. Van Dijk, and J. Wouters, “Low-rank Approximation Based Multichannel Wiener Filter Algorithms for Noise Reduction with Application in Cochlear Implants,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 785–799, 2014.
 [24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in IEEE Proc. of ICASSP, 2015, pp. 5206–5210.
 [25] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” in IEEE ASRU, December 2015, pp. 504–511.
 [26] G. Hinton, N. Srivastava, and K. Swersky, “COURSERA: Neural networks for machine learning – lecture 6a,” 2012, available online at http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
 [27] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.