Distant speech processing is a challenging task, as the target speech is usually corrupted by reverberation and interfering noise sources from the environment. Training an automatic speech recognition (ASR) system in the same environmental conditions is desirable, yet remains difficult when the additive noise and reverberation to be observed at test time are unknown. Enhancing speech with a microphone array prior to the ASR task stands as a simple solution to cope with the mismatch between training and testing conditions.
Delay-and-Sum (DS) and Minimum Variance Distortionless Response (MVDR) beamformers use the direction of arrival (DoA) of the target sound source to enhance speech [3, 9, 4, 20]. However, they rely on the anechoic assumption, i.e. free-field propagation of sound, which is an inaccurate approximation for reverberant environments. Alternatively, generalized eigenvalue (GEV) beamforming can enhance speech using only estimates of the speech and noise spatial covariance matrices (SCMs), based on an estimated time-frequency mask that distinguishes speech from background noise. A neural network usually estimates this time-frequency mask, which implies the background interference type has to be known in advance at training time [12, 13, 11, 6]. While this can yield satisfactory performance when trained and evaluated on large synthetic datasets, this approach remains challenging in real environments with unknown interference.
The idea behind KISS-GEV is to provide a mask estimation method that is computationally much lighter than state-of-the-art machine learning-based approaches, while providing adequate enhancement in real-world scenarios with unseen noise, making it viable to run on low-cost embedded hardware. Results demonstrate that this method works without training a neural network to predict a time-frequency mask to differentiate speech from noise, which makes it well suited to enhancement in unseen conditions. The proposed method can also be used to separate speech from a mixture, provided the speech sources come from different directions. The proposed method relies on the same minimal DoA assumption as DS beamforming, yet results show that it outperforms this traditional approach.
2 Proposed approach
KISS-GEV compares the power of the signal obtained with DS beamforming pointing in the direction of the target against the average power from all channels. A two-filter bank then computes the power in low and high frequency regions, and generates a coarse binary time-frequency mask representing the target signal. These two broadband filters are suitable for reverberant environments as they can cope with early reflections, which usually significantly impact the spectral shape of the DS beamformed signal. This idea is similar to the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) approach, where analyzing the normalized broadband spectrum makes time difference of arrival (TDoA) estimation more robust to early reflections [1]. With $T$ representing the target and $I$ the interference, the masks obtained with the filterbank ($\Lambda_T$ and $\Lambda_I$) then provide an estimation of the target and interference spatial covariance matrices, denoted as $\boldsymbol{\Phi}_{TT}$ and $\boldsymbol{\Phi}_{II}$, respectively. The two SCMs are finally used to perform GEV beamforming. Figure 1 illustrates this pipeline.
The Short-Time Fourier Transform (STFT) is computed for the signal captured by each microphone $m \in \{1, \dots, M\}$ (where $M$ denotes the number of microphones) to generate the time-frequency frames $X_m(l,k)$, with frames of $N$ samples in the time domain, where $l$ stands for the frame index and $k$ for the frequency bin index. The proposed approach then defines a binary filterbank made of $B$ filters, where $b \in \{1, \dots, B\}$ is the filter index:
$$H_b(k) = \begin{cases} 1 & \underline{k}_b \le k \le \overline{k}_b \\ 0 & \text{otherwise} \end{cases}$$
where $\underline{k}_b$ and $\overline{k}_b$ represent the lower and the upper bounds of the filter $b$, respectively. In this paper, we assume there is no overlap between adjacent filters ($\underline{k}_{b+1} = \overline{k}_b + 1$), and the spectrum is fully covered by the filters in the filterbank ($\underline{k}_1 = 0$, and $\overline{k}_B = N/2$). More specifically, we restrict the number of filters to $B = 2$ (to intuitively capture either low-frequency voiced phonemes or high-frequency fricatives), which implies the only parameter of the filterbank is the separator position $k_s = \overline{k}_1$, as shown in Figure 2.
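As a minimal illustration of the two-filter binary filterbank described above, the following NumPy sketch builds one low-band and one high-band filter from a separator bin. The function name `binary_filterbank`, its arguments, and the convention that the separator bin belongs to the low band are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def binary_filterbank(n_bins, separator):
    """Return a (2, n_bins) array holding a low-band and a high-band binary filter.

    Assumption: bins 0..separator belong to the low band, the rest to the high band.
    """
    k = np.arange(n_bins)
    low = (k <= separator).astype(float)
    high = 1.0 - low  # complementary filter: no overlap and full coverage
    return np.stack([low, high])


# Toy example with 9 frequency bins and the separator at bin 3.
H = binary_filterbank(n_bins=9, separator=3)
```

Because the two filters are complementary, every frequency bin is covered by exactly one filter, which matches the no-overlap and full-coverage conditions stated in the text.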
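The delay-and-sum beamformer is computed using the anechoic steering vector $\mathbf{A}(k) = [A_1(k), \dots, A_M(k)]^T$ towards the target:
$$A_m(k) = \exp\left(-j 2\pi k \tau_m / N\right), \qquad Y_{DS}(l,k) = \frac{1}{M} \sum_{m=1}^{M} A_m^*(k) X_m(l,k)$$
where $\tau_m$ is the time difference of arrival (TDoA) in samples at microphone $m$, obtained from the known array geometry and target DoA.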
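The following NumPy sketch shows one way to build an anechoic steering vector from per-microphone delays and apply delay-and-sum beamforming to an STFT. The sign convention of the phase term, the array layout `(M, L, K)` (microphones, frames, bins), and the function names are assumptions for this example.

```python
import numpy as np


def steering_vector(tdoa_samples, n_fft):
    """Anechoic steering vector, shape (M, K) with K = n_fft // 2 + 1.

    Assumed convention: microphone m observes the target delayed by tdoa_samples[m].
    """
    k = np.arange(n_fft // 2 + 1)
    return np.exp(-2j * np.pi * np.outer(tdoa_samples, k) / n_fft)


def delay_and_sum(X, A):
    """Phase-align each channel and average. X: (M, L, K), A: (M, K) -> (L, K)."""
    return np.mean(np.conj(A)[:, None, :] * X, axis=0)


# Toy check: a single anechoic source is recovered by the beamformer.
rng = np.random.default_rng(0)
A = steering_vector(np.array([0.0, 1.5, -2.0]), n_fft=16)
S = rng.standard_normal((4, 9)) + 1j * rng.standard_normal((4, 9))
X = A[:, None, :] * S[None, :, :]  # each channel is a delayed copy of S
Y = delay_and_sum(X, A)
```

In this idealized anechoic case `Y` matches `S` up to floating-point error; in a reverberant room the steering vector is only an approximation, which is the limitation the text points out.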
The power of the beamformed signal corresponds to $P_{DS}(l,k) = \left| \frac{1}{M} \sum_{m=1}^{M} A_m^*(k) X_m(l,k) \right|^2$ for each frequency bin $k$, while the sum of the power of each channel is $P_{\Sigma}(l,k) = \sum_{m=1}^{M} |X_m(l,k)|^2$. It is then possible to compute the total beamformed power, as well as the average power, for each filter in the filterbank, by multiplying by the expression $H_b(k)$ and summing across all frequencies. The ratio of the beamformed and average power for each filter is hence defined as:
$$R_b(l) = M \, \frac{\sum_{k} H_b(k) P_{DS}(l,k)}{\sum_{k} H_b(k) P_{\Sigma}(l,k)}$$
where the constant $M$ is introduced for normalization purposes.
In other words, when the frequency region covered by the filter $b$ is dominated by the target signal, we expect the ratio $R_b(l)$ to get closer to a value of $1$. Similarly, when there is little power from the target signal in the region covered by filter $b$, the ratio gets closer to a value of $0$. The complete time-frequency ratio for frame $l$ and bin $k$ is then obtained by summing every $R_b(l)$ with the filterbank:
$$R(l,k) = \sum_{b=1}^{B} H_b(k) R_b(l)$$
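The ratio described above can be sketched in NumPy as follows. This is an illustrative reading of the text, not the authors' code: the function name `power_ratio`, the array shapes, and the small `eps` safeguard against division by zero are assumptions.

```python
import numpy as np


def power_ratio(X, A, H, eps=1e-12):
    """Band-wise ratio of beamformed power to average channel power.

    X: (M, L, K) STFT, A: (M, K) steering vector, H: (B, K) binary filterbank.
    Returns R_band with shape (L, B) and its per-bin expansion R with shape (L, K).
    """
    p_ds = np.abs(np.mean(np.conj(A)[:, None, :] * X, axis=0)) ** 2  # (L, K)
    p_avg = np.mean(np.abs(X) ** 2, axis=0)  # (L, K)
    # eps is an added safeguard (assumption) for silent bands.
    R_band = (p_ds @ H.T) / (p_avg @ H.T + eps)
    R = R_band @ H  # each bin inherits the ratio of the filter covering it
    return R_band, R


rng = np.random.default_rng(0)
M, L, K = 4, 6, 8
A = np.exp(-2j * np.pi * rng.uniform(0, 1, (M, K)))
H = np.stack([np.arange(K) < 3, np.arange(K) >= 3]).astype(float)
S = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
_, R_target = power_ratio(A[:, None, :] * S[None], A, H)  # target only
noise = rng.standard_normal((M, L, K)) + 1j * rng.standard_normal((M, L, K))
_, R_noise = power_ratio(noise, A, H)  # spatially uncorrelated noise
```

A target-only input yields a ratio of about 1 in every band, while spatially uncorrelated noise yields a lower ratio, which is the contrast the mask estimation relies on.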
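Then, only both extremities of $R(l,k)$, with a width defined by parameter $\alpha$, are kept to define binary masks in order to capture only the most significant spectral features. Thresholds are defined as follows:
$$\theta_T(k) = \rho_{100-\alpha}(k), \qquad \theta_I(k) = \rho_{\alpha}(k)$$
where $\rho_p(k)$ is the $p$-th percentile of $R(l,k)$ for bin $k$ across all $l$.
Using these thresholds, the binary time-frequency masks are defined as:
$$\Lambda_T(l,k) = \begin{cases} 1 & R(l,k) \ge \theta_T(k) \\ 0 & \text{otherwise} \end{cases}, \qquad \Lambda_I(l,k) = \begin{cases} 1 & R(l,k) \le \theta_I(k) \\ 0 & \text{otherwise} \end{cases}$$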
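A possible NumPy sketch of the percentile-based thresholding is given below. It assumes that frames with the highest ratio are assigned to the target and frames with the lowest ratio to the interference; the function name `percentile_masks` and the `width` argument (expressed in percent) are hypothetical.

```python
import numpy as np


def percentile_masks(R, width):
    """Keep only the two extremities of the ratio, per frequency bin.

    R: (L, K) time-frequency ratio; width: percentile width, 0 < width < 50.
    Returns binary target and interference masks, both of shape (L, K).
    """
    thr_i = np.percentile(R, width, axis=0)          # lower threshold, per bin
    thr_t = np.percentile(R, 100.0 - width, axis=0)  # upper threshold, per bin
    mask_t = (R >= thr_t).astype(float)
    mask_i = (R <= thr_i).astype(float)
    return mask_t, mask_i


rng = np.random.default_rng(0)
R = rng.uniform(0.0, 1.0, size=(100, 4))  # 100 frames, 4 bins
mask_t, mask_i = percentile_masks(R, width=25.0)
```

Since the thresholds are computed per bin across all frames, each bin contributes the same fraction of frames to each mask, regardless of the absolute level of the ratio in that bin.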
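With the masks defined, both SCMs can be estimated as follows:
$$\boldsymbol{\Phi}_{TT}(k) = \sum_{l} \Lambda_T(l,k) \, \mathbf{X}(l,k) \mathbf{X}^H(l,k), \qquad \boldsymbol{\Phi}_{II}(k) = \sum_{l} \Lambda_I(l,k) \, \mathbf{X}(l,k) \mathbf{X}^H(l,k)$$
where $\mathbf{X}(l,k) = [X_1(l,k), \dots, X_M(l,k)]^T$ stands for a vector that concatenates the time-frequency frames of all microphones, and $\{\cdot\}^H$ stands for the Hermitian operator.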
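The masked SCM estimation can be written compactly with `numpy.einsum`. The sketch below assumes an STFT array of shape `(M, L, K)` and a mask of shape `(L, K)`; the function name `masked_scm` is hypothetical, and no normalization by the mask sum is applied.

```python
import numpy as np


def masked_scm(X, mask):
    """Mask-weighted spatial covariance matrix for every frequency bin.

    X: (M, L, K) multichannel STFT, mask: (L, K). Returns an array of shape (K, M, M)
    whose entry [k, m, n] is sum_l mask[l, k] * X[m, l, k] * conj(X[n, l, k]).
    """
    return np.einsum("lk,mlk,nlk->kmn", mask, X, np.conj(X))


rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5, 4)) + 1j * rng.standard_normal((3, 5, 4))
phi = masked_scm(X, np.ones((5, 4)))
# Reference value for bin 0, computed as an explicit sum of outer products.
manual = sum(np.outer(X[:, l, 0], np.conj(X[:, l, 0])) for l in range(5))
```

Calling the same function once with the target mask and once with the interference mask gives the two matrices needed for GEV beamforming.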
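The beamforming vector is then obtained by performing the generalized eigenvalue decomposition of the target and interference SCMs:
$$\mathbf{w}_{GEV}(k) = \mathcal{P}\left\{ \boldsymbol{\Phi}_{II}^{-1}(k) \, \boldsymbol{\Phi}_{TT}(k) \right\}$$
where $\mathcal{P}\{\cdot\}$ returns the eigenvector associated with the largest eigenvalue.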
While the target and noise masks estimated with the filterbank have a poor frequency resolution, this remains acceptable for estimating the target and noise SCMs, as the eigenvalue decomposition leads to the same eigenvector as long as the target SCM captures more energy from the target signal than the noise signal, and vice-versa.
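The GEV step can be sketched with NumPy alone by taking the principal eigenvector of the interference-whitened target SCM in each bin. The diagonal loading term is an assumption added here for numerical stability and is not described in the paper; the function name `gev_weights` is hypothetical. A generalized Hermitian eigensolver such as `scipy.linalg.eigh(a, b)` would be an alternative.

```python
import numpy as np


def gev_weights(phi_tt, phi_ii, loading=1e-6):
    """Principal generalized eigenvector per frequency bin.

    phi_tt, phi_ii: (K, M, M) target and interference SCMs. Returns w of shape (K, M).
    """
    K, M, _ = phi_tt.shape
    w = np.zeros((K, M), dtype=complex)
    for k in range(K):
        # Diagonal loading (assumption) keeps the interference SCM invertible.
        reg = (loading * np.trace(phi_ii[k]).real / M + 1e-10) * np.eye(M)
        vals, vecs = np.linalg.eig(np.linalg.solve(phi_ii[k] + reg, phi_tt[k]))
        w[k] = vecs[:, np.argmax(vals.real)]
    return w


# Toy check: rank-1 target SCM and white interference give w aligned with a.
rng = np.random.default_rng(0)
a = rng.standard_normal(3) + 1j * rng.standard_normal(3)
phi_tt = np.outer(a, np.conj(a))[None, :, :]
phi_ii = np.eye(3, dtype=complex)[None, :, :]
w = gev_weights(phi_tt, phi_ii)
alignment = np.abs(np.vdot(w[0], a)) / (np.linalg.norm(w[0]) * np.linalg.norm(a))
```

The eigenvector is only defined up to a complex scale factor, which is why the check compares directions rather than values.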
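Finally, as demonstrated by Heymann et al. [12], a Blind Analytic Normalization (BAN) gain, defined as:
$$g_{BAN}(k) = \frac{\sqrt{\mathbf{w}_{GEV}^H(k) \, \boldsymbol{\Phi}_{II}(k) \boldsymbol{\Phi}_{II}(k) \, \mathbf{w}_{GEV}(k) / M}}{\mathbf{w}_{GEV}^H(k) \, \boldsymbol{\Phi}_{II}(k) \, \mathbf{w}_{GEV}(k)}$$
can be applied as a post-filter in order to reduce the distortion that can be introduced by the GEV beamformer in the target direction. The enhanced STFT representation of the target signal is then obtained as follows:
$$Y(l,k) = g_{BAN}(k) \, \mathbf{w}_{GEV}^H(k) \, \mathbf{X}(l,k)$$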
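The sketch below computes a BAN gain in its commonly used form and applies the normalized beamformer to a multichannel STFT. The function names, array shapes, and the `eps` safeguard are assumptions for this example.

```python
import numpy as np


def ban_gain(w, phi_ii, eps=1e-12):
    """Blind Analytic Normalization gain. w: (K, M), phi_ii: (K, M, M) -> (K,)."""
    M = w.shape[1]
    pw = np.einsum("kmn,kn->km", phi_ii, w)  # Phi_II w
    # w^H Phi_II Phi_II w equals ||Phi_II w||^2 because Phi_II is Hermitian.
    num = np.sqrt(np.einsum("km,km->k", np.conj(pw), pw).real / M)
    den = np.einsum("km,km->k", np.conj(w), pw).real  # w^H Phi_II w
    return num / (den + eps)


def apply_beamformer(w, g, X):
    """Apply g(k) * w(k)^H X(l, k). X: (M, L, K) -> (L, K)."""
    return g[None, :] * np.einsum("km,mlk->lk", np.conj(w), X)


# Toy example: 4 microphones, 2 bins, white interference with variance 2.
w = np.ones((2, 4), dtype=complex)
phi_ii = np.broadcast_to(2.0 * np.eye(4, dtype=complex), (2, 4, 4))
g = ban_gain(w, phi_ii)
g_scaled = ban_gain(3.0 * w, phi_ii)
Y = apply_beamformer(w, g, np.ones((4, 3, 2), dtype=complex))
```

In this toy case the product of the gain and the beamforming vector is unchanged when the vector is rescaled by a positive factor, which illustrates how the post-filter compensates for the arbitrary scale of the eigenvector.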
The algorithm’s performance was evaluated by calculating the Signal-to-Distortion Ratio (SDR) improvement of the enhanced signal on a dataset containing 20 target speech tracks, 20 different speech tracks to use as interference, 20 ambient noise tracks with both stationary and non-stationary noise, and 20 music tracks. Every combination of target and interference was simulated with 5 different room impulse responses (RIRs). Each simulation used a new RIR, with none reused between configurations, for a total of 6000 unique RIRs. The RIRs were simulated using the image method [10] with the geometry of a ReSpeaker Core v2 microphone array as the receiver, as well as varying parameters as in the BIRD dataset [8], such as the room dimensions (width between and m, length between and m, and height between and m), absorption coefficients (between and ), speed of sound (between and m/sec), and target and interference source positions. The speech segments are from the Librispeech [15] dataset, while the noise and music tracks were selected from the Musan [18] dataset.
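The total of 6000 simulations follows directly from the track counts given above. The short enumeration below makes the arithmetic explicit; the variable names and the tuple layout are illustrative only.

```python
from itertools import product

n_targets = 20
interference = {"speech": 20, "noise": 20, "music": 20}
rirs_per_pair = 5

# One configuration per (target, interference type, interference track, RIR index).
configs = [
    (t, kind, i, r)
    for kind, n in interference.items()
    for t, i, r in product(range(n_targets), range(n), range(rirs_per_pair))
]
n_simulations = len(configs)
```

This gives 20 targets times 60 interference tracks times 5 RIRs, that is 2000 simulations per interference type and 6000 in total.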
A value of is defined to separate between voiced and unvoiced spectrum segments (which represents a separation at Hz), and is used as an adequate compromise between good target/interference separation and having enough data to properly calculate the SCM.
The mean SDR was calculated for the following four outputs: unprocessed mixture (from channel 1), delay-and-sum beamforming, KISS-GEV beamforming, and GEV beamforming with an oracle ideal ratio mask (IRM) [14]. The results of this experiment are shown in Table 1.
|Method|Types of interference|||
|---|---|---|---|
|GEV with oracle mask|16.98|17.40|18.58|
KISS-GEV significantly outperforms DS thanks to its ability to null interference. The remaining gap between the KISS-GEV and oracle masks is likely explained by the loss of spectral information to interference.
Figure 3 contrasts the time-frequency mask generated with KISS-GEV with the Oracle IRM, on a single simulation with stationary noise. Apart from showing how a coarse mask still leads to effective results for SCM estimation, it also suggests dereverberation can be achieved, as it only takes into account the direct path from the target and ignores the late reverberation.
In the spectrograms of Figure 4, it can be observed that, although KISS-GEV gives a similar noise floor to DS, it yields a much clearer resolution of the harmonics in voiced segments, as well as a more robust dereverberation than the other methods, which is especially visible on the fricatives. It also removes more white noise in the lower frequencies than DS.
Figure 5 shows spectrograms of a simulation using speech as interference on the same target utterance as Figure 4. A sharp dereverberation is again visible, along with a significantly better attenuation of the voiced segments of the interference than with DS. This demonstrates that using KISS-GEV with informed DoA can enhance speech in the direction of interest and deal with the permutation issue observed in speech separation based on neural networks.
This paper presents KISS-GEV, a lightweight mask estimation front-end that generates target and interference SCMs for GEV-BAN beamforming. It requires no training, and hence performs well when enhancing against unseen interference. Results show a significantly better SDR than the popular DS beamformer while relying only on the same target DoA assumption. This approach is thus well suited to low-cost embedded hardware deployed in real-life environments. As the end use of this method is speech processing, Word Error Rate (WER) improvement should also eventually be evaluated with an ASR backend, such as Kaldi [16]. Future work could also involve estimating the masks and SCMs in an online manner, for real-time applications.
References
[1] (1997) A robust method for speech signal time-delay estimation in reverberant rooms. In Proc. IEEE ICASSP, Vol. 1, pp. 375–378.
[2] (2018) Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline. In Proc. Interspeech, pp. 1571–1575.
[3] (2018) Multi-channel overlapped speech recognition with location guided speech extraction network. In Proc. IEEE SLT Workshop, pp. 558–565.
[4] (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Proc. Interspeech, pp. 1981–1985.
[5] (2019) SVD-PHAT: A fast sound source localization method. In Proc. IEEE ICASSP, pp. 4140–4144.
[6] (2020) GEV beamforming supported by DOA-based masks generated on pairs of microphones. In Proc. Interspeech, pp. 3341–3345.
[7] (2020) Audio-visual calibration with polynomial regression for 2-D projection using SVD-PHAT. In Proc. IEEE ICASSP, pp. 4856–4860.
[8] (2020) BIRD: Big impulse response dataset. arXiv preprint arXiv:2010.09930.
[9] (2009) New insights into the MVDR beamformer in room acoustics. IEEE/ACM Trans. Audio Speech Lang. Process. 18 (1), pp. 158–170.
[10] (2006) Room impulse response generator. Technische Universiteit Eindhoven, Tech. Rep 2 (2.4), pp. 1.
[11] (2017) Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system. In Proc. IEEE ICASSP, pp. 5325–5329.
[12] (2015) BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge. In Proc. IEEE ASRU Workshop, pp. 444–451.
[13] (2016) Neural network based spectral mask estimation for acoustic beamforming. In Proc. IEEE ICASSP, pp. 196–200.
[14] (2013) Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proc. IEEE ICASSP, pp. 7092–7096.
[15] (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. IEEE ICASSP, pp. 5206–5210.
[16] (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
[17] (2019) Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 27 (5), pp. 960–971.
[18] (2015) Musan: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
[19] A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition. In Proc. Interspeech, pp. 2928–2932.
[20] (2017) On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition. In Proc. IEEE ICASSP, pp. 3246–3250.