In a multi-speaker scenario the performance of many speech enhancement algorithms depends on correctly identifying the target speaker to be enhanced. Recent advances in electroencephalography (EEG) have shown that it is possible to identify the target speaker which the listener is attending to using single-trial EEG-based auditory attention decoding (AAD) methods [16, 3, 1, 5]. However, many AAD methods rely on the unrealistic assumption that the clean speech signals of the speakers are available as reference signals for decoding. In real-world conditions, obviously only the microphone signals, which consist of a mixture of the speakers, including reverberation and background noise, are available.
Aiming at incorporating AAD in speech enhancement, several algorithms have recently been proposed to generate appropriate reference signals for decoding from the microphone signals [9, 20, 18, 2]. Most cognitive-driven speech enhancement algorithms generate reference signals by separating the speakers from the mixture received at the microphones using either time-domain neural networks, multi-channel Wiener filters or minimum variance distortionless response (MVDR) beamformers. Using AAD, one of the reference signals is then selected as the enhanced attended speaker. More recently, aiming at controlling the suppression of the interfering speaker, which is important when intending to switch attention between speakers, a cognitive-driven beamforming system using linearly constrained minimum variance (LCMV) beamformers has been proposed [18, 2].
While most of the aforementioned cognitive-driven speech enhancement systems are able to suppress the interfering speakers and background noise, they may not be able to suppress (late) reverberation, which is known to have a detrimental effect on speech quality and intelligibility [21]. In this paper we propose a cognitive-driven convolutional beamforming system aiming at enhancing the attended speaker and jointly suppressing the interfering speakers, reverberation and background noise.
The proposed system is depicted in Fig. 1 for a scenario with two speakers. First, time-frequency masks of both speakers are estimated from the noisy and reverberant microphone signals using a speaker-independent speech separation neural network. Then, two beamformers are designed to generate reference signals for AAD by enhancing the speech signal of each speaker based on the estimated masks. The AAD method then selects one of the reference signals as the enhanced attended speech signal. For the beamformers we propose to use a recently proposed weighted minimum power distortionless response (wMPDR) convolutional beamformer as it optimally combines dereverberation, noise suppression and interfering speaker suppression [4]. While suppressing the interfering speaker is desired to improve speech intelligibility, keeping the interfering speaker audible is also important to allow the listener to switch attention between speakers. Therefore, we also propose an extension of the wMPDR convolutional beamformer incorporating an interference suppression constraint, referred to as a weighted linearly constrained minimum power (wLCMP) convolutional beamformer, which allows controlling the level of suppression of the interfering speaker.
We experimentally compare our proposed method with state-of-the-art cognitive-driven systems based on conventional MPDR, LCMP, MVDR and LCMV beamformers, which are steered based on estimated masks or estimated DOAs. The results show that the proposed system outperforms state-of-the-art cognitive-driven systems for dealing with noisy and reverberant speech mixtures and reveal potential future research directions.
2 Cognitive-driven convolutional beamforming
2.1 Signal model
We consider an acoustic scenario comprising competing speakers with the clean signals denoted as , where is the discrete time index. (It should be noted that we provide a general description of the algorithms for speakers, but limit our experiments in Section 4 to two speakers.) We consider a binaural hearing aid setup with microphones. The -th microphone signal can be decomposed as
where denotes the reverberant speech component in the -th microphone signal corresponding to speaker and denotes the background noise component. The reverberant speech components consist of an anechoic speech component (encompassing the head filtering effect), an early reverberation component arriving typically in the order of tens of milliseconds, and a late reverberation component. While early reverberation can be beneficial for speech intelligibility, late reverberation is known to have a detrimental effect on speech quality and intelligibility [21].
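As a compact illustration of the decomposition described above, the signal model can be sketched as follows. All symbols are assumed notation introduced only for this sketch: $J$ speakers, $M$ microphones, $y_m$ the $m$-th microphone signal, $x_{m,j}$ the reverberant speech component of speaker $j$, $n_m$ the background noise, and superscripts a, e and l for the anechoic, early and late parts.

```latex
% Assumed notation, for illustration only
y_m(t) = \sum_{j=1}^{J} x_{m,j}(t) + n_m(t), \qquad m = 1, \dots, M,
\qquad
x_{m,j}(t) = x^{\mathrm{a}}_{m,j}(t) + x^{\mathrm{e}}_{m,j}(t) + x^{\mathrm{l}}_{m,j}(t)
```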
2.2 Mask estimation
The first component of our proposed system is a separation neural network that estimates time-frequency ideal ratio masks corresponding to each speaker from the reverberant and noisy microphone signals. These masks will be used for beamforming and to generate reference signals for AAD (see Section 2.3).
Several neural network-based speech separation approaches have been proposed, both in the frequency domain and in the time domain [12, 13]. In this paper we use a BLSTM-based frequency-domain approach  since it trains faster than time-domain approaches such as , allowing a faster experimental turnover.
The separation neural network takes the STFT coefficients of the -th microphone signal as input features and generates real-valued time-frequency masks, i.e.,
where the matrix contains all STFT coefficients of the -th microphone signal, is the separation neural network, and the matrix for , contains the estimated time-frequency masks for speaker . In addition to the time-frequency masks for the speakers, the network also generates a time-frequency mask for the background noise, i.e., .
The separation neural network is trained using permutation invariant training (PIT) [12] with a scale-dependent SNR loss in the time domain [19]. However, at test time the masks have speaker permutation ambiguity, i.e., it is not known which mask corresponds to which speaker. In addition, the separation neural network in (3) operates on each microphone signal independently, which typically causes speaker permutation ambiguities across the microphones. To resolve this ambiguity, we align the masks obtained for each microphone based on the least-squares error. We then average the masks across the microphones to obtain one mask for each speaker, i.e. . The averaged mask contains the masks of the -th speaker for all time frames and frequencies.
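The alignment and averaging step can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the first microphone serves as the alignment reference and that the number of speakers is small enough for an exhaustive search over permutations. The function name `align_and_average_masks` and the array layout are hypothetical.

```python
import itertools
import numpy as np


def align_and_average_masks(masks):
    """Align per-microphone speaker masks and average them across microphones.

    masks: array of shape (n_mics, n_speakers, n_frames, n_freqs).
    Returns an array of shape (n_speakers, n_frames, n_freqs).
    Assumption: microphone 0 is used as the alignment reference.
    """
    masks = np.asarray(masks, dtype=float)
    n_mics, n_speakers = masks.shape[:2]
    reference = masks[0]
    aligned = [reference]
    for m in range(1, n_mics):
        # Pick the speaker permutation with the smallest squared error
        # with respect to the reference microphone.
        best_perm = min(
            itertools.permutations(range(n_speakers)),
            key=lambda p: np.sum((masks[m][list(p)] - reference) ** 2),
        )
        aligned.append(masks[m][list(best_perm)])
    # Average the aligned masks over the microphone axis.
    return np.mean(aligned, axis=0)
```

For two speakers the search covers only two permutations per microphone, so the exhaustive approach is cheap in the scenario considered in the experiments.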
2.3 Reference signal generation using beamformers
Based on the estimated masks , we design beamformers to extract each speaker with reduced noise and reverberation from the microphone signals (see and in Fig. 1). The output signals of the beamformers are then transformed to the time-domain as , where ISTFT denotes the inverse short-time Fourier transform. These time-domain output signals will be used as reference signals for AAD.
In this paper we investigate different types of beamformers for generating reference signals, i.e., wMPDR and wLCMP convolutional beamformers, and conventional MPDR and LCMP beamformers, which will be described in detail in Section 3.
2.4 Speaker selection using AAD
Based on the reference signals generated by the beamformers, the speaker which the listener is attending to is then selected using the EEG-based auditory attention decoding method proposed in [16]. First, an estimate of the envelope of the attended speech signal , with the sub-sampled time index, is reconstructed from the EEG signals using a trained spatio-temporal filter. Then, the correlation between the reconstructed envelope and the envelopes of the reference signals is computed, i.e.,
where is the Pearson correlation. Finally, the attended speech signal is selected as the reference signal yielding the maximum correlation with the reconstructed envelope, i.e.,
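As a rough illustration of this selection rule, the sketch below computes the Pearson correlation between a reconstructed envelope and each reference envelope and picks the maximum. It assumes the envelopes are already extracted and sub-sampled to a common rate; the EEG-based envelope reconstruction itself is not shown, and the function name `select_attended` is hypothetical.

```python
import numpy as np


def select_attended(reconstructed_env, reference_envs):
    """Return the index of the reference envelope that correlates best
    with the envelope reconstructed from EEG, plus all correlations."""
    correlations = [
        float(np.corrcoef(reconstructed_env, env)[0, 1])  # Pearson correlation
        for env in reference_envs
    ]
    return int(np.argmax(correlations)), correlations
```

The returned index identifies which beamformer output would be passed on as the enhanced attended speech signal.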
In this section, we review the wMPDR convolutional beamformer [15], present the proposed wLCMP convolutional beamformer, and compare them with the conventional MPDR and LCMP beamformers. Since the beamformer operates for each frequency independently, the frequency index will be omitted in this section for notational conciseness.
3.1 Weighted MPDR convolutional beamformer
The wMPDR convolutional beamformer in  aims at 1) suppressing the noise component while preserving the target speech component in one of the microphone signals and 2) suppressing the late reverberation component while preserving the early reverberation component corresponding to the target speaker (i.e., dereverberation). The output signal of a convolutional beamformer is defined as
where , , consists of the observation from frames in the past until frames in the past, i.e., , and and model the frame delay of the start and end time of the late reverberation, respectively.
It has been shown in  that the convolutional beamformer can be factorized into a dereverberation matrix and a beamforming vector , i.e., with . The convolutional beamforming in (6) can hence be written as dereverberation filtering followed by beamforming , i.e.,
Assuming that the output of the convolutional beamformer
follows a zero mean complex Gaussian distribution with a time-varying variance, the wMPDR convolutional beamformer is obtained by maximizing an objective function , which is derived based on the maximum-likelihood estimation with a target speaker preservation constraint (distortionless constraint), i.e.,
where denotes the time-varying variance of the target speech component (including the early reverberation) and denotes the number of frames over which the beamformer coefficients are estimated.
This optimization problem can be solved in an alternating fashion, by first assuming constant and solving for and then updating . Assuming constant, the optimization problem of the wMPDR convolutional beamformer incorporating the target speaker preservation constraint can be written as 
with , , .
To estimate the RETF vector of the target speaker in (11), we use the masks of the target speaker , assuming the target speaker index is . The RETF vector is estimated using the covariance whitening method [14], i.e.,
where is the covariance matrix of the target speaker and is the covariance matrix of all interfering speakers and background noise.
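A minimal sketch of mask-based covariance estimation followed by covariance whitening is given below for a single frequency bin. It is an illustration under assumptions, not the exact estimator used here: whitening uses a Cholesky factor of the interference-plus-noise covariance, the result is normalized to a reference microphone, and the function names are hypothetical.

```python
import numpy as np


def masked_covariance(Y, mask):
    """Mask-weighted spatial covariance for one frequency bin.

    Y: (n_mics, n_frames) complex STFT coefficients.
    mask: (n_frames,) real-valued mask in [0, 1].
    """
    return (Y * mask) @ Y.conj().T / max(float(np.sum(mask)), 1e-8)


def rtf_covariance_whitening(R_target, R_noise, ref=0):
    """Estimate a relative transfer function vector by covariance whitening."""
    L = np.linalg.cholesky(R_noise)          # R_noise = L @ L^H
    L_inv = np.linalg.inv(L)
    whitened = L_inv @ R_target @ L_inv.conj().T
    whitened = 0.5 * (whitened + whitened.conj().T)  # enforce Hermitian symmetry
    _, eigvecs = np.linalg.eigh(whitened)    # eigenvalues in ascending order
    principal = eigvecs[:, -1]               # principal eigenvector
    v = L @ principal                        # de-whiten
    return v / v[ref]                        # normalize to the reference microphone
```

When the target covariance is a rank-one term plus a scaled copy of the noise covariance, this procedure recovers the rank-one direction exactly, which is the property the method relies on.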
The estimation methods discussed in this section are used to iteratively update the output signal of the wMPDR convolutional beamformer. First, the dereverberation filtering in (7) is performed using in (11). Based on the dereverberated signals and the estimated masks , the RETF vector of the target speaker is updated using (12) to steer the beamformer in (11). Using the steered beamformer, the output signal in (7) is obtained. The variance of the target speech component is then updated as for the next iteration.
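The alternating update can be sketched for one frequency bin as follows. This is a simplified illustration under several assumptions that differ from the procedure above: it uses the unfactorized closed-form solution w = R^{-1} v / (v^H R^{-1} v) on stacked current and delayed frames instead of the factorized dereverberation-plus-beamforming form, it keeps a given steering vector fixed rather than re-estimating it from the masks in each iteration, and the variance initialization, regularization and default values of `delay`, `taps` and `n_iter` are hypothetical.

```python
import numpy as np


def stack_frames(Y, delay, taps):
    """Stack the current frame on top of `taps` past frames starting `delay` frames back.

    Y: (n_mics, n_frames). Returns (n_mics * (1 + taps), n_frames).
    """
    n_frames = Y.shape[1]
    blocks = [Y]
    for k in range(taps):
        shift = delay + k
        past = np.zeros_like(Y)
        if shift < n_frames:
            past[:, shift:] = Y[:, :n_frames - shift]
        blocks.append(past)
    return np.concatenate(blocks, axis=0)


def wmpdr_convolutional(Y, steering, delay=2, taps=3, n_iter=3, floor=1e-8):
    """Iterative weighted-MPDR convolutional beamformer for one frequency bin."""
    n_mics, n_frames = Y.shape
    Y_bar = stack_frames(Y, delay, taps)
    v_bar = np.zeros(Y_bar.shape[0], dtype=complex)
    v_bar[:n_mics] = steering                 # constraint acts on the current frame only
    lam = np.mean(np.abs(Y) ** 2, axis=0)     # assumed initialization of the variance
    for _ in range(n_iter):
        lam = np.maximum(lam, floor)
        # Variance-weighted covariance of the stacked observations.
        R = (Y_bar / lam) @ Y_bar.conj().T / n_frames
        R = R + floor * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
        R_inv_v = np.linalg.solve(R, v_bar)
        w = R_inv_v / (v_bar.conj() @ R_inv_v)  # distortionless: w^H v_bar = 1
        z = w.conj() @ Y_bar                    # beamformer output
        lam = np.abs(z) ** 2                    # variance update for the next iteration
    return z, w
```

Stacking the delayed frames lets a single filter perform dereverberation and beamforming jointly, while the zero-padded steering vector restricts the distortionless constraint to the current frame.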
3.2 Weighted LCMP convolutional beamformer
As an alternative to the wMPDR convolutional beamformer, we propose the wLCMP convolutional beamformer, which allows controlling the suppression of the interfering speakers. The wLCMP convolutional beamformer is derived by adding interfering speaker suppression constraints to the optimization problem of the wMPDR convolutional beamformer, i.e.,
where contains the RETF vectors of interfering speakers, with , and controls the amount of suppression of the interfering speakers. This optimization problem is the same as the optimization problem of the conventional LCMP beamformer [7], but with different RTF vectors and covariance matrix. Therefore the wLCMP convolutional beamformer can be obtained as
with and . Setting to zero in (15) corresponds to a complete suppression of the -th interfering speaker, while leads to a controlled suppression.
where is the covariance matrix of the -th interfering speaker and .
The output signal of the wLCMP convolutional beamformer is iteratively updated similarly as for the wMPDR convolutional beamformer.
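The closed-form solution of a linearly constrained minimum power problem can be sketched as below. This is the standard textbook expression w = R^{-1} C (C^H R^{-1} C)^{-1} g, shown as an illustration rather than the exact formulation used here; the names `lcmp_weights`, `C` and `g` are hypothetical, and real-valued suppression factors are assumed so that the constraint C^H w = g needs no conjugation.

```python
import numpy as np


def lcmp_weights(R, C, g):
    """Linearly constrained minimum power beamformer.

    R: (D, D) Hermitian positive-definite covariance matrix.
    C: (D, K) constraint matrix, target vector first, interferers after.
    g: (K,) desired responses, e.g. [1, delta] with delta controlling
       the residual level of the interfering speaker (assumed real).
    """
    R_inv_C = np.linalg.solve(R, C)
    # Solve the small K x K system instead of forming explicit inverses.
    return R_inv_C @ np.linalg.solve(C.conj().T @ R_inv_C, g)
```

Setting the interferer entry of `g` to zero places a null on the interferer, whereas a small positive value keeps it audible at a controlled level.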
3.3 Relation with conventional MPDR and LCMP beamformers
The conventional MPDR beamformer aims at minimizing the power spectral density (PSD) of the output signal while preserving the reverberant target speech component in one of the microphone signals [6]. The MPDR beamformer is given by
where and denotes the reverberant RTF vector corresponding to the target speaker. The MPDR beamformer in (17) is similar to the wMPDR convolutional beamformer in (11) except that the covariance matrix and the RTF vector are estimated using the microphone signals instead of the dereverberated microphone signals . In addition, in contrast to the wMPDR convolutional beamformer, the MPDR beamformer is obtained using a non-iterative optimization procedure.
A similar relation exists between the conventional LCMP beamformer incorporating interfering speaker suppression constraints and the wLCMP convolutional beamformer in (15). The conventional LCMP beamformer is given by [7]
with and containing the reverberant RTF vectors of interfering speakers.
The output signals of the MPDR and the LCMP beamformer are obtained as
These output signals are obviously computed without involving a dereverberation step compared to the output signals of wMPDR and wLCMP convolutional beamformers in (6).
4 Experimental setup
4.1 Acoustic simulation setup
In the experimental evaluation we consider two competing speakers, i.e., . Two German audio stories, uttered by two different male speakers, were used as the clean speech signals and . Speech pauses that exceeded s were shortened to s, resulting in two highly overlapping (competing) audio stories. The hearing aid microphone signals were generated at a sampling frequency of kHz by convolving the clean speech signals with non-individualized measured binaural impulse responses (anechoic or reverberant) from [11], and adding diffuse babble noise, simulated according to [8]. The hearing aid setup in [11] consisted of two hearing aids, each equipped with three microphones (), mounted on a dummy head. The left and the right competing speaker were simulated at and . We consider three acoustic conditions, i.e., an anechoic-noisy condition with an average frequency-weighted segmental SNR () of dB, a reverberant condition (reverberation time s) with an average of dB, and a reverberant-noisy condition with an average of dB. The average is computed by averaging the highest fwSSNR corresponding to speaker and to speaker among the microphone signals. The reference signals used to compute the fwSSNR are the anechoic speech signals of the speakers at the first microphone of the hearing aid located at the same side of each speaker.
4.2 Mask estimation
The mask estimation neural network consisted of 3 BLSTM layers of 896 units. The network was trained on simulated noisy and reverberant mixtures obtained by mixing Librispeech [17] utterances convolved with room impulse responses generated with the image method for reverberation times between s and s, and adding babble noise at between 5 and 15 dB. The number of training mixtures was 50k. Note that there is a large mismatch between the training and the testing condition with respect to reverberation, background noise and head shadow effect, and also a large linguistic dissimilarity, as Librispeech consists of English read speech but the test data consists of German audio stories.
All considered beamformers were implemented using a weighted overlap-add (WOLA) framework with an STFT frame length , an overlap of between successive frames and a Hann window. For the wMPDR and wLCMP convolutional beamformers, the frame delay was set to and the length of the dereverberation filter was set to , and for frequency ranges kHz, kHz and kHz, respectively. The variance of the target speech component was initialized as . For the wLCMP convolutional beamformer and the LCMP beamformer, we set the interference suppression parameter to to partially suppress the unattended speaker. The output signals of the wMPDR and wLCMP convolutional beamformers were obtained with iterations.
To investigate the impact of mask estimation errors on the speech enhancement performance of the proposed system, we consider oracle ideal ratio masks (oMASK) and estimated ideal ratio masks (eMASK), obtained by the mask estimation neural network in (3).
We also compare our proposed system with a state-of-the-art cognitive-driven system proposed in [2], which uses either a conventional MVDR beamformer or a conventional LCMV beamformer to generate reference signals. Contrary to the MPDR and LCMP beamformers described in Section 3.3, these MVDR and LCMV beamformers use a diffuse noise covariance matrix instead of and are steered using estimated anechoic RTF vectors (based on estimated DOAs of both speakers) instead of estimated reverberant RTF vectors. For the LCMV beamformer, the interference suppression parameter was set to . Similarly as in [2], the DOAs of both speakers were estimated using a classification-based method [10] and the anechoic RTF vectors corresponding to the estimated DOAs were selected from a database of (measured) prototype RTF vectors [11].
4.4 Speaker selection using AAD
We used EEG responses recorded for 16 native German-speaking participants, where participants were instructed to attend to the left speaker and participants to the right speaker. See [2] for details about the EEG recording and the AAD training and decoding configuration.
For the AAD training and decoding steps (see Section 2.4), the EEG recordings were split into -second trials, resulting in trials for the anechoic-noisy condition as well as for the reverberant-noisy condition, and trials for the reverberant condition. Each participant’s own data were used for training the spatio-temporal filter used for reconstructing the speech envelope from the EEG data.
4.5 Performance measures
We evaluate the cognitive-driven beamformers both in terms of AAD and speech enhancement performance. To evaluate the AAD performance, a trial is considered to be correctly decoded if the fwSSNR corresponding to the selected beamformer output signal (as the attended speech signal) is larger than the fwSSNR corresponding to the discarded beamformer output signal. To compute fwSSNR, the anechoic speech component of the attended speaker in the first microphone signal of the hearing aid at the side of the attended speaker was used as the fwSSNR reference signal. The AAD performance is then computed by averaging the percentage of correctly decoded trials over all considered trials and all participants.
The speech enhancement performance of the complete proposed system is evaluated in terms of the fwSSNR improvement () using the same reference signals as used for AAD performance evaluation. The input fwSSNR is defined as the highest fwSSNR among the microphone signals. The output fwSSNR is defined as the fwSSNR of the selected beamformer output signals .
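The two performance measures described above can be sketched as follows. The sketch assumes the fwSSNR values (in dB) have already been computed by some external routine, which is not shown; the function names are hypothetical.

```python
def aad_accuracy(selected_fwssnr, discarded_fwssnr):
    """Percentage of trials in which the selected beamformer output has a
    higher fwSSNR than the discarded one (both lists hold one value per trial)."""
    if not selected_fwssnr or len(selected_fwssnr) != len(discarded_fwssnr):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(s > d for s, d in zip(selected_fwssnr, discarded_fwssnr))
    return 100.0 * correct / len(selected_fwssnr)


def fwssnr_improvement(output_fwssnr, microphone_fwssnrs):
    """fwSSNR improvement: output fwSSNR minus the highest fwSSNR among
    the microphone signals (the input fwSSNR)."""
    return output_fwssnr - max(microphone_fwssnrs)
```

Per-participant accuracies obtained this way would then be averaged over participants to give the overall AAD performance.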
To investigate the impact of the errors of speaker selection using AAD on the speech enhancement performance of the complete proposed system, we will consider oracle AAD (oAAD) where the attended speech signal is determined based on the highest among the output signals of and , and estimated AAD (eAAD) where is determined based on the highest Pearson correlation coefficients as described in Section 2.4.
5 Experimental results
In this section, we evaluate the AAD performance and the speech enhancement performance of the proposed cognitive-driven convolutional beamforming system. In Section 5.1 we investigate the impact of mask estimation errors on the AAD performance. In Section 5.2, we investigate the impact of AAD errors on the speech enhancement performance.
5.1 Auditory attention decoding performance
Figure 2 depicts the average AAD performance for the anechoic-noisy, the reverberant and the reverberant-noisy condition, when using the output signals of the wMPDR or wLCMP convolutional beamformer, the MPDR or LCMP beamformer and the MVDR or LCMV beamformer as reference signals for decoding. We observe that all considered beamformers yield an AAD performance that is significantly higher than chance level. For all considered acoustic conditions the wMPDR convolutional beamformer and the wLCMP convolutional beamformer using the oracle masks (wMPDR-oMASK and wLCMP-oMASK) yield the highest AAD performance, showing the potential of using convolutional beamformers for AAD.
When using estimated masks instead of oracle masks for the convolutional beamformers (wMPDR-eMASK and wLCMP-eMASK) the AAD performance decreases, especially in the reverberant-noisy condition. In the reverberant-noisy condition, the MVDR and LCMV beamformers using anechoic RTF vectors based on estimated DOAs (MVDR-eDOA and LCMV-eDOA) yield a larger average AAD performance than the beamformers using reverberant RTF vectors based on the estimated masks. This suggests that in order to improve the AAD performance, a better estimation of RTF vectors is required, e.g., based on prototype RTF vectors or neural networks that are more robust to background noise and reverberation.
5.2 Speech enhancement performance
Figure 3a depicts the fwSSNR improvement of the complete proposed system averaged over all considered acoustic conditions, either using oracle AAD or estimated AAD. It can be observed that the convolutional beamformers outperform all other considered beamformers for both oracle and estimated AAD. When using estimated AAD instead of oracle AAD, for all considered beamformers the fwSSNR improvement decreases by – dB, showing the sensitivity to AAD errors. Nevertheless, the fwSSNR improvement of the convolutional beamformers is about – dB larger than the state-of-the-art MVDR and LCMV beamformers using estimated DOAs.
Figure 3b depicts the fwSSNR improvement of the complete proposed system for the anechoic-noisy, reverberant and reverberant-noisy conditions when using estimated AAD. It can be observed that all beamformers yield a significant fwSSNR improvement for the anechoic-noisy condition. However, for the reverberant condition the systems using conventional beamformers (MPDR-eMASK, LCMP-eMASK, MVDR-eDOA, LCMV-eDOA) tend to degrade the fwSSNR, whereas only the proposed system using convolutional beamformers (wMPDR-eMASK, wLCMP-eMASK) provides an fwSSNR improvement, showing the influence of dereverberation. It should be noted that the considered reverberant-noisy condition with an interfering speaker is an extremely adverse condition with babble noise at a signal-to-interference-plus-noise ratio (SINR) of dB and a reverberation time of s, which makes it very challenging for speech enhancement.
The experimental results show that for the considered acoustic setup the AAD performance and the fwSSNR improvement of the proposed cognitive-driven speech enhancement system using convolutional beamformers are sensitive to mask estimation errors, particularly for the reverberant and reverberant-noisy conditions. The mask estimation errors can be mainly attributed to the linguistic dissimilarity of training and testing conditions of the neural-network-based mask estimation algorithm and also the intrinsic difficulty of separating out two competing speakers with the same gender in the reverberant-noisy condition.
The results show that the wMPDR convolutional beamformer yields a larger fwSSNR improvement than the wLCMP convolutional beamformer. Although the wMPDR convolutional beamformer can strongly suppress the interfering speaker, it may deprive the listener of the ability to switch attention between the speakers. In contrast, the wLCMP convolutional beamformer is able both to control the interfering speaker suppression and to yield a considerable fwSSNR improvement.
Lastly, the results show that the convolutional beamformers (wLCMP-eMASK and wMPDR-eMASK) yield the highest fwSSNR improvement for all considered acoustic conditions, whereas the conventional LCMV beamformer (LCMV-eDOA) yields the highest AAD performance in the reverberant and reverberant-noisy conditions. Future work could therefore investigate the potential of combining the convolutional and the conventional beamformers to improve both the decoding and the speech enhancement performance.
In this paper, we proposed a cognitive-driven speech enhancement system which combines neural-network-based mask estimation, convolutional beamformers and AAD. We considered the wMPDR convolutional beamformer, which jointly enhances the attended speaker and suppresses the unattended speaker, reverberation and background noise. In addition, we proposed a wLCMP convolutional beamformer which enables controlling the amount of suppression for the unattended speaker. The experimental results showed that the proposed system using convolutional beamformers is able to considerably improve the fwSSNR both for noisy and reverberant conditions compared to state-of-the-art cognitive-driven speech enhancement systems.
References
[1] (2019) A tutorial on auditory attention identification methods. Frontiers in Neuroscience 13, pp. 153.
[2] (2020) Cognitive-driven binaural beamforming using EEG-based auditory attention decoding. IEEE Transactions on Audio, Speech, and Language Processing 28, pp. 862–875.
[3] (2019-04) Impact of different acoustic components on EEG-based auditory attention decoding in noisy and reverberant conditions. IEEE Transactions on Neural Systems and Rehabilitation Engineering 27 (4), pp. 652–663.
[4] (2020-05) Jointly optimal dereverberation and beamforming. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 216–220.
[5] (2019-08) Comparison of two-talker attention decoding from EEG with nonlinear neural networks and linear methods. Scientific Reports, Nature 9 (11538).
[6] (2015-03) Multichannel signal enhancement algorithms for assisted listening devices. IEEE Signal Processing Magazine 32 (2), pp. 18–30.
[7] (2012-03) A speech distortion and interference rejection constraint beamformer. IEEE Transactions on Audio, Speech, and Language Processing 20 (3), pp. 854–867.
[8] (2008-11) Generating nonstationary multisensor signals under a spatial coherence constraint. Journal of the Acoustical Society of America 124 (5), pp. 2911–2917.
[9] (2019) Speaker-independent auditory attention decoding without access to clean speech sources. Science Advances 5 (5).
[10] (2014-09) A discriminative learning approach to probabilistic acoustic source localization. In Proc. International Workshop on Acoustic Signal Enhancement (IWAENC), Juan-les-Pins, France, pp. 99–103.
[11] (2009) Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses. EURASIP Journal on Advances in Signal Processing 2009, pp. 6.
[12] Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1901–1913.
[13] (2018-05) TaSNet: time-domain audio separation network for real-time, single-channel speech separation. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 696–700.
[14] Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Transactions on Audio, Speech, and Language Processing 17 (6), pp. 1071–1086.
[15] (2019-09) Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation. In Proc. of the European Signal Processing Conference (EUSIPCO), A Coruna, Spain, pp. 1–5.
[16] (2014) Attentional selection in a cocktail party environment can be decoded from single-trial EEG.
[17] (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, Australia, pp. 5206–5210.
[18] (2019-05) A joint auditory attention decoding and adaptive binaural beamforming algorithm for hearing devices. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 311–315.
[19] (2019) SDR – half-baked or well done? In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 626–630.
[20] (2017) EEG-informed attended speaker extraction from recorded speech mixtures with application in neuro-steered hearing prostheses. 64 (5), pp. 1045–1056.
[21] (2013-01) Effects of spatial and temporal integration of a single early reflection on speech intelligibility. The Journal of the Acoustical Society of America 133 (1), pp. 269–282.