We consider the problem of stereo speech enhancement, that is, estimating clean stereo speech from noisy stereo recordings. There is a rich signal processing literature on mono speech enhancement. To name a few, ephraim1985lsa enhance the speech by estimating its log-spectral amplitude [ephraim1985lsa], and oppenheim1979enh discuss various enhancement methods such as Wiener filtering and all-pole speech modeling [oppenheim1979enh]. Over the past decade, deep learning has gained a lot of attention for speech enhancement [wang2018supervised]; this is partly due to the growing number of available training datasets (i.e., clean speech and its noisy counterpart), and partly because learning-based approaches outperform classical methods [kaz2020low, soni2018tfgan, weninger2015speech].
Learning-based mono speech enhancement mainly takes two forms: a) predicting the clean speech directly through a deep neural network such as U-Net [ronneberger2015u], or b) estimating a real [chakrabarty2018time, li2019tflstm] or complex [williamson2015complex, tolooshams2020ch] time-frequency (TF) mask that, when applied to the mixture, predicts the target speech. In the case of multichannel speech enhancement [li2019tflstm, tolooshams2020ch, gu2019end], prior work focuses on extracting spatial features either explicitly at the input or implicitly through the network. wang2018all combine spectral features estimated through monaural speech enhancement with directional features to improve performance. tolooshams2020ch capture the spatial information with a beamforming-inspired architecture to perform complex ratio masking.
Despite the use of spatial information within the network for speech enhancement, the preservation of the spatial image, such as sound image locations and sensations of depth, has barely been studied. Prior work on stereo enhancement mainly provides overall objective evaluations through metrics such as PESQ [rix2001pesq] and STOI [taal2011stoi] rather than focusing on perceptual enhancement through subjective tests. Moreover, where subjective evaluations are reported, they mainly study overall performance rather than the sound image [subramanian2019speechmushra, braithwaite2019speech, polyak2021regen]. We note that spatial cue preservation has been studied previously for source separation [han2020mimotasnet].
To fill this gap, this paper proposes a framework to preserve the stereo image and provides subjective evaluation along with objective metrics to assess the method. The approach is model-independent and focuses entirely on training through a stereo-aware loss function that helps to preserve spatial information. Specifically, the method regularizes to preserve interchannel intensity difference (IID), interchannel phase difference (IPD), interchannel coherence (IC), and overall phase difference (OPD). This is inspired by traditional methods [faller2003binaural, herre2004joint, breebaart2005parametric], specifically parametric coding of stereo [breebaart2005parametric], originally developed for efficient stereo coding to reduce bit-rate.
Section 2 formulates the problem, introduces the stereo-aware training, and describes the network architecture. The dataset, training details, and evaluation metrics are explained in Section 3. We show in Section 4 that the stereo-aware training not only results in an overall improvement of the enhanced speech, but also refines the stereo image. This is supported by both objective and subjective evaluation. Finally, Section 5 concludes.
2.1 Problem formulation
Consider the discrete-time noisy speech observed at a stereo microphone. In the stereo speech enhancement problem, we aim to estimate the clean reverberated stereo speech given the mixture following the model
for . The received speech at the microphone is the result of convolving the speaker speech with stereo room impulse responses (RIRs). Similarly, the noise is reverberated through the room and recorded at the microphone as and with time frames and frequency bins.
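As an illustration of the signal model described above, the following minimal sketch builds a stereo clean/mixture pair by convolving mono speech and mono noise with per-channel RIRs and summing them. The function names (`convolve`, `stereo_mixture`) and the dictionary layout are hypothetical choices made for this example, not taken from the paper.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two real-valued sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def stereo_mixture(speech, noise, speech_rirs, noise_rirs):
    """Return (clean, mixture), each a dict with 'L' and 'R' channels.

    speech_rirs and noise_rirs map a channel name to that channel's RIR
    (an assumed layout for this sketch).
    """
    clean, mixture = {}, {}
    for ch in ("L", "R"):
        reverberant_speech = convolve(speech, speech_rirs[ch])
        reverberant_noise = convolve(noise, noise_rirs[ch])
        # Truncate both to a common length before summing.
        n = min(len(reverberant_speech), len(reverberant_noise))
        clean[ch] = reverberant_speech[:n]
        mixture[ch] = [s + v for s, v in zip(clean[ch], reverberant_noise[:n])]
    return clean, mixture
```

The enhancement target in this sketch is `clean`, i.e., the reverberated speech, in line with the text's aim of estimating the clean reverberated stereo speech.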
2.2 Stereo-aware training
Given the model-independence of the proposed framework, we focus mainly on the training loss that enables stereo image preservation and discuss the choice of neural architecture in Section 2.3. Given a set of training data, a network is trained to minimize a combination of speech reconstruction and stereo image preservation losses, i.e.,
where is an estimate of clean speech given the mixture . We design the speech reconstruction loss to suppress the noise and improve signal-to-noise ratio (SNR), and the image preservation loss to conserve features related to the position of the speaker and microphone.
Speech reconstruction: Given and , the loss consists of log-spectral distortion (LSD) [wu2008minimum] and time loss (TL):
where is the generalized logarithmic function with [kobayashi1984gen]. LSD helps to minimize the spectral error, and TL compensates for phase enhancement in the time domain.
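To make the structure of the reconstruction loss concrete, here is a minimal sketch. It assumes the common form of the generalized logarithm, (x^γ − 1)/γ with the natural logarithm as the γ = 0 limit, a mean squared error on generalized-log magnitudes for LSD, and a mean absolute error for TL. The exact norms, the weighting, and all names (`generalized_log`, `time_weight`, and so on) are assumptions for illustration and may differ from the paper's definitions.

```python
import math


def generalized_log(x, gamma):
    """Generalized logarithm: (x**gamma - 1) / gamma, or log(x) when gamma == 0."""
    if gamma == 0.0:
        return math.log(x)
    return (x ** gamma - 1.0) / gamma


def log_spectral_distortion(clean_mag, est_mag, gamma, eps=1e-8):
    """Mean squared difference of generalized-log magnitudes over all TF bins.

    clean_mag and est_mag are lists of frames, each a list of magnitudes.
    """
    total, count = 0.0, 0
    for clean_frame, est_frame in zip(clean_mag, est_mag):
        for c, e in zip(clean_frame, est_frame):
            diff = generalized_log(c + eps, gamma) - generalized_log(e + eps, gamma)
            total += diff * diff
            count += 1
    return total / count


def time_loss(clean_wave, est_wave):
    """Mean absolute error between waveforms (an assumed choice of norm)."""
    return sum(abs(c - e) for c, e in zip(clean_wave, est_wave)) / len(clean_wave)


def reconstruction_loss(clean_mag, est_mag, clean_wave, est_wave, gamma, time_weight):
    """Spectral term plus weighted time-domain term."""
    return (log_spectral_distortion(clean_mag, est_mag, gamma)
            + time_weight * time_loss(clean_wave, est_wave))
```

The spectral term only sees magnitudes, so it is blind to phase errors; the time-domain term is what penalizes them, which matches the role the text assigns to TL.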
Image preservation: We study the stereo image of the signal based on four spatial properties. We follow an approach similar to [breebaart2005parametric] and quantify the interchannel intensity difference, interchannel phase difference, interchannel coherence, and overall phase difference. Although we use these parameters for image preservation, they were originally introduced to reduce the bit-rate of the audio for more efficient transmission or storage. For example, instead of transmitting the stereo signal, one may encode it with a mono downmix and stereo parameters. The decoder then uses the parameters to reinstate spatial cues and reconstruct stereo [breebaart2005parametric].
Given STFT for , the frequency bins are grouped into non-overlapping subbands such that there are a total of bins in each band. We leave non-uniform bands with equivalent rectangular bandwidth (ERB) [glasberg1990derivation] for future work. For , there would be bands. For each band , we extract IID, IPD, IC, and OPD as follows:
where we denote the frequencies in band by and denotes complex conjugation. While IID and interchannel time difference cues are known to be useful for evaluating sound source localization [rayleigh1907xii, sayers1964acoustic, bogaert2007wiener], they have not been used during network training. We capture the time difference through IPD, which highlights the delay between the channels. IC quantifies the correlation between the left and right channels given an aligned phase. Finally, OPD encodes the phase difference between the source and its estimate. Given the spatial parameters, image preservation error is defined as:
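As an illustration of the four spatial parameters, the sketch below computes them for one band of one STFT frame using the standard parametric-stereo style definitions: an energy ratio in dB for IID, the angle of the cross-spectrum for IPD, and the normalized cross-spectrum magnitude for IC. The OPD here is taken as the angle of the cross-spectrum between a clean channel and its estimate, following the verbal description in the text. All function names, the `eps` stabilizer, and the exact normalizations are assumptions and may differ from the paper's equations.

```python
import cmath
import math


def split_bands(frame, band_size):
    """Group the bins of one STFT frame into non-overlapping uniform bands."""
    return [frame[i:i + band_size] for i in range(0, len(frame), band_size)]


def band_parameters(left, right, eps=1e-12):
    """Return (IID in dB, IPD in radians, IC) for one band.

    left and right are lists of complex STFT bins belonging to the band.
    """
    cross = sum(l * r.conjugate() for l, r in zip(left, right))
    power_left = sum(abs(l) ** 2 for l in left)
    power_right = sum(abs(r) ** 2 for r in right)
    iid = 10.0 * math.log10((power_left + eps) / (power_right + eps))
    ipd = cmath.phase(cross)
    # Taking the magnitude of the cross-spectrum removes the interchannel
    # phase, i.e. coherence is measured after phase alignment.
    ic = abs(cross) / math.sqrt(power_left * power_right + eps)
    return iid, ipd, ic


def overall_phase_difference(clean, estimate):
    """Angle of the band cross-spectrum between a clean channel and its estimate."""
    return cmath.phase(sum(c * e.conjugate() for c, e in zip(clean, estimate)))
```

An image preservation error could then compare each parameter of the estimate with that of the clean speech, band by band and frame by frame.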
2.3 Network architecture
The network (Figure 1) consists of three main blocks: an encoder, a denoiser, and a decoder. Given the noisy input , the encoder computes its STFT and scales it by . Then, the signal is passed through a band compressor (BC) and is output as a stack of the real and imaginary components . BC compresses the mixture by a factor of along the frequency axis. Precisely, it passes the low frequency bins in unchanged, and compresses the bins in and the high frequencies in by a factor of and , respectively. This compression is achieved by averaging the neighbouring frequencies.
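The band compressor described above can be sketched as follows: low-frequency bins pass through unchanged, while mid and high bins are averaged in groups of neighbouring frequencies. The band boundaries (`low_end`, `mid_end`) and the compression factors are hypothetical parameters of this example, since the paper's values are not reproduced here.

```python
def average_groups(bins, factor):
    """Average each run of `factor` neighbouring bins (the last run may be shorter)."""
    groups = [bins[i:i + factor] for i in range(0, len(bins), factor)]
    return [sum(group) / len(group) for group in groups]


def band_compress(frame, low_end, mid_end, mid_factor, high_factor):
    """Compress one STFT frame along frequency.

    Bins [0, low_end) are kept as they are, bins [low_end, mid_end) are
    averaged in groups of mid_factor, and the remaining bins in groups of
    high_factor. Works for real or complex bins.
    """
    low = list(frame[:low_end])
    mid = average_groups(frame[low_end:mid_end], mid_factor)
    high = average_groups(frame[mid_end:], high_factor)
    return low + mid + high
```

For instance, `band_compress(frame, 2, 6, 2, 4)` on a 10-bin frame keeps 2 bins, reduces the next 4 bins to 2, and reduces the last 4 bins to 1, giving 5 bins in total.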
The denoiser has a U-Net structure [ronneberger2015u] consisting of a feature extractor ( blocks), down-blocks ( blocks), enhancer ( blocks), and up-blocks ( blocks). The architecture has skip-connections between the down- and up-blocks. The building blocks contain convolution layers that are causal along the time axis. All blocks have leaky ReLU activations with (except the first extractor block) and batch normalization (except the last up-block). The up-blocks contain pixel shufflers to reshape the feature map into the desired number of channels.
The decoder decompresses the signal to reverse the BC operation, and applies an inverse STFT to construct the signal in time domain. The main results are based on this U-Net architecture. To emphasize the model-independence of the stereo-aware framework, we additionally train a similar architecture, which we call U-NetCM, with a decoder estimating a complex TF mask for enhancement [williamson2015complex, tolooshams2020ch].
Both training and testing data are sampled at kHz. We use the Deep Noise Suppression (DNS) challenge dataset [reddy2021interspeech] to generate stereo training data. We pick mono clean and noise tracks at random and apply RIRs to create stereo signals (RIRs are used instead of head-related transfer functions in order to focus on the stereo image at the recording device). Then, clean and noise are mixed with an SNR sampled from with a range of dB. The signals are leveled up or down using a scale following . We generate approximately M stereo signals. During training, a random segment of s is selected, i.e., samples.
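Mixing clean and noise at a sampled SNR can be sketched as below. The sketch applies a single gain to both noise channels, computed from the power pooled over left and right, so that the interchannel level relations of the noise are left untouched; that pooling choice and the function names are assumptions of this example rather than details stated in the paper.

```python
import math


def signal_power(channels):
    """Mean power pooled over all channels of a stereo signal [left, right]."""
    total = sum(v * v for ch in channels for v in ch)
    count = sum(len(ch) for ch in channels)
    return total / count


def mix_at_snr(clean, noise, snr_db):
    """Scale the stereo noise to reach snr_db and add it to the stereo clean speech.

    Returns (mixture, scaled_noise), both in [left, right] layout.
    """
    target_noise_power = signal_power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / signal_power(noise))
    scaled_noise = [[gain * v for v in ch] for ch in noise]
    mixture = [[c + n for c, n in zip(clean_ch, noise_ch)]
               for clean_ch, noise_ch in zip(clean, scaled_noise)]
    return mixture, scaled_noise
```

A level scale, as mentioned in the text, would then multiply both the clean target and the mixture by the same constant.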
We use two test sets. For each, we create mono utterances, and construct a stereo test set using test RIRs. The sets are divided into five groups each with SNR of , , , , and dB. The speech and mixture are scaled by a constant following .
Room impulse responses (RIR): RIRs are simulated using an image method similar to [rao2021interspeech]. The room size ranges from to . The microphones are cm apart and are placed in the room uniformly at random such that their height ranges from to m, and they are within the second and third quarter in the middle. The speaker and noise are randomly located in the room with a height ranging from to m. The speaker and noise are at distances of m and m from the microphone, respectively. Finally, we make sure that the angle between the speech and noise is at least , and the sound velocity is m/s. The generated RIRs are s long and fall into two groups: with and without reverberation.
We create around training rooms and generate utterances for each. For the training set, there are no-reverberation RIRs, and reverberated RIRs with dB attenuation time sampled from s. For each test set, rooms (i.e., one for each test example) are created. The sets follow similar characteristics to the training set, except that Test set II uses a room height of m (i.e., a typical meeting room). The test set RIRs are divided into four categories of no, short, medium, and long reverberation, which have dB attenuation in s, respectively.
The network is trained using the Adam optimizer with , , and an initial learning rate of . The learning rate is scheduled with piecewise constant decay to after iterations. Training is performed on four GPUs using a batch size of for iterations. For TL, is set to . Additionally, , , , and whenever the particular error is present. The above weights are chosen such that all loss components are of the same order of magnitude. The STFT and inverse-STFT blocks use Hanning windows of length (i.e., ) with a hop size of . Then, time frames are cropped.
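A piecewise constant learning-rate schedule of the kind mentioned above can be sketched in a few lines. The function name and the boundary and rate values used in the usage note are placeholders for illustration only, not the paper's settings.

```python
def piecewise_constant_lr(step, boundaries, values):
    """Return the learning rate for a training step.

    boundaries is an increasing list of step counts and values has one more
    entry than boundaries: values[i] applies while step < boundaries[i], and
    values[-1] applies afterwards.
    """
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]
```

With placeholder settings `boundaries=[1000]` and `values=[1e-3, 1e-4]`, the rate stays at `1e-3` for the first 1000 steps and drops to `1e-4` from step 1000 onward.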
Signal-to-distortion ratio (SDR) and perceptual objective listening quality assessment (POLQA) [beerends2013perceptual], along with stereo preservation errors, are used as objective metrics. Given the enhanced speech, the SDR and POLQA metrics (higher is better) are computed independently for each channel, and the average is reported. Additionally, we quantify the errors for IID, IPD, IC, and OPD (lower is better).
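The per-channel averaging of SDR can be sketched as follows. This uses the plain energy-ratio definition of SDR, 10 log10(‖s‖² / ‖s − ŝ‖²); the paper may rely on a different SDR variant (for example one that allows a distortion filter), so the formula, the `eps` stabilizer, and the function names are assumptions of this example.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB between a reference channel and its estimate."""
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (distortion + eps))


def stereo_sdr_db(reference, estimate):
    """Average of the SDR computed independently on each channel.

    reference and estimate are in [left, right] layout.
    """
    per_channel = [sdr_db(r, e) for r, e in zip(reference, estimate)]
    return sum(per_channel) / len(per_channel)
```

Because each channel is scored on its own, this metric does not see interchannel relations, which is why the stereo preservation errors are reported alongside it.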
We perform a listening test following the MUSHRA standard [series2014method] through a vendor specialized in designing cloud-based tests. Approximately listeners, wearing headphones, participated in the experiments; each evaluated a subset of the test set for either the OVRL or the IMG task. Given a reference (i.e., clean speech), listeners are asked to grade several tracks including a hidden reference and an anchor, i.e., the noisy mixture [polyak2021regen, deng2015speech, braithwaite2019speech]. We evaluate two attributes (OVRL and IMG). For OVRL, the assessors are asked to evaluate the overall quality of the audio clips. For IMG, they rate the stereophonic image quality of the clips (i.e., how close the clips are to the reference in terms of sound image locations, sensations of depth, and reality of the speaker). We categorize the MUSHRA grading scheme from to as () Bad, () Poor, () Fair, () Good, and () Excellent.
| Network | Method | SDR (I) | POLQA (I) | IID (I) | IPD (I) | IC (I) | OPD (I) | Subjective OVRL (I) | Subjective IMG (I) | SDR (II) | POLQA (II) | IID (II) | IPD (II) | IC (II) | OPD (II) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net | downmix - spec | 6.46 | 2.98 | 2.68 | 2.79 | 0.30 | 1.61 | x | x | 6.16 | 2.95 | 2.70 | 2.83 | 0.31 | 1.62 |
| U-Net | LRindp - spec | 6.82 | 3.26 | 2.36 | 1.99 | 0.28 | 1.62 | x | x | 6.67 | 3.19 | 2.48 | 2.02 | 0.27 | 1.63 |
| U-Net | downmix - spec - time | 10.10 | 2.95 | 2.39 | 2.78 | 0.29 | 1.40 | 0.34 | 0.30 | 9.65 | 2.92 | 2.42 | 2.82 | 0.29 | 1.40 |
| U-Net | LRindp - spec - time | 12.89 | 3.31 | 2.42 | 1.92 | 0.27 | 1.27 | 0.42 | 0.35 | 12.27 | 3.24 | 2.55 | 1.95 | 0.26 | 1.27 |
| U-Net | stereo - spec - time | 12.56 | 3.01 | 1.85 | 1.91 | 0.26 | 1.25 | 0.38 | 0.37 | 11.97 | 2.96 | 1.90 | 1.93 | 0.28 | 1.23 |
| U-Net | stereo - spec - time - IID | 14.17 | 3.33 | 1.55 | 1.76 | 0.35 | 1.42 | 0.45 | 0.41 | 13.64 | 3.26 | 1.59 | 1.79 | 0.39 | 1.43 |
| U-Net | stereo - spec - time - IPD | 13.88 | 3.36 | 1.67 | 1.71 | 0.32 | 1.27 | 0.63 | 0.46 | 13.24 | 3.30 | 1.71 | 1.73 | 0.36 | 1.28 |
| U-Net | stereo - spec - time - IC | 12.09 | 3.04 | 1.80 | 2.08 | 0.21 | 1.43 | 0.31 | 0.37 | 11.47 | 2.98 | 1.85 | 2.12 | 0.20 | 1.40 |
| U-Net | stereo - spec - time - OPD | 14.05 | 3.33 | 1.86 | 2.10 | 0.23 | 0.99 | 0.42 | 0.49 | 13.35 | 3.28 | 1.90 | 2.15 | 0.22 | 1.00 |
| U-Net | stereo - spec - time - all | 13.78 | 3.32 | 1.64 | 1.81 | 0.21 | 1.10 | 0.45 | 0.43 | 13.16 | 3.25 | 1.69 | 1.85 | 0.19 | 1.11 |
| U-NetCM | stereo - spec | 6.28 | 3.34 | 2.24 | 2.14 | 0.25 | 2.48 | x | x | 6.10 | 3.27 | 2.29 | 2.18 | 0.23 | 2.46 |
| U-NetCM | stereo - spec - time - all | 15.02 | 3.28 | 1.96 | 1.93 | 0.24 | 1.05 | x | x | 14.30 | 3.22 | 2.01 | 1.97 | 0.23 | 1.06 |
We train the stereo network using various combinations of the loss. We denote the presence of LSD and TL in the training loss by spec and time, respectively. For example, spec-time-OPD denotes the case where the loss includes the LSD, TL, and OPD errors. Given the rich deep learning literature on enhancing mono signals, we consider two baselines. The first is a mono network trained on the downmix (i.e., ); at prediction time, the phase difference between the stereo mixture and the enhanced downmix is added to reinstate stereo. We call this method downmix. The other baseline, LRindp, is a mono network trained using the left and right channels independently. We first compare the baselines to one another, then highlight the effect of the time loss on performance, and finally focus on the stereo-aware training. Table 1 reports the evaluations on the test sets; the comparisons we highlight below hold for both test sets.
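One way to read the downmix baseline described above is sketched below in the STFT domain: the two channels are averaged into a mono signal, a mono network enhances it, and each output channel is obtained by rotating the enhanced mono bin by the phase difference between the corresponding mixture channel and the enhanced downmix. The averaging downmix and the function names are assumptions of this sketch, and the mono enhancement step itself is left out.

```python
import cmath


def downmix(left, right):
    """Mono downmix as the average of the two channels (an assumed definition)."""
    return [(l + r) / 2 for l, r in zip(left, right)]


def reinstate_stereo(enhanced_mono, mixture_left, mixture_right):
    """Rebuild stereo from an enhanced mono STFT and the noisy stereo STFT.

    Each output bin keeps the enhanced magnitude and is rotated by the phase
    difference between the mixture channel and the enhanced downmix, so it
    ends up with the phase of the noisy channel.
    """
    def rotate_towards(channel):
        out = []
        for m, y in zip(enhanced_mono, channel):
            delta = cmath.phase(y) - cmath.phase(m)
            out.append(m * cmath.exp(1j * delta))
        return out

    return rotate_towards(mixture_left), rotate_towards(mixture_right)
```

Under this reading both output channels share the same magnitude, so the interchannel intensity difference is lost and the phase is inherited from the noisy mixture.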
Downmix vs. LRindp: LRindp-spec outperforms downmix-spec in terms of SDR and POLQA and also results in better image preservation (lower IID, IPD, and IC errors). Despite the better performance, LRindp approximately doubles the inference time compared to downmix. The drawbacks of downmix may come from the addition of noisy phase at channel-upsampling time.
Presence of time loss: Comparing spec with spec-time, the time loss results in a drastic improvement in SDR at the cost of an occasional decrease in POLQA. TL also helps to preserve OPD. Figure 1(a) highlights the overall phase preservation using the time loss; compared to LRindp-spec, LRindp-spec-time has lower OPD in magnitude, particularly at low-frequency bands.
Mono to stereo: Moving from mono to stereo, we observe that the stereo image of LRindp-spec-time gets worse, but that of stereo-spec-time is improved. The stereo method preserves IID much better than the mono case. However, LRindp-spec-time (with doubled inference time) performs better in terms of SDR and POLQA than stereo-spec-time. This motivates us to regularize with the image preservation loss, which cannot be done in a mono network. We show how this helps the stereo network to outperform LRindp.
Image dynamics during training: We first study the effect of each stereo parameter independently. Figure 1(c) shows how the IID loss for the training batch changes as a function of training iterations; specifically, it shows that IPD regularization alone helps to achieve a lower IID error, and the lowest IID is achieved when the IID loss is present. Moreover, Figure 1(d) demonstrates that regularizing for IID or IPD preservation results in a worse IC than no image regularization (i.e., stereo-spec-time). The figure highlights that OPD does not have much effect on IC, and that IC can be further improved by regularizing for IC. Finally, we observe (not shown) that including the IC loss during training results in a worse IPD compared to training without regularization.
Image preservation loss: Given stereo-spec-time, the addition of IID, IPD, IC, or OPD results in a POLQA improvement. Furthermore, regularizing for IID, IPD, or OPD improves SDR. We observed that regularizing for the preservation of a single stereo metric (e.g., stereo-spec-time-IC) results in the best preservation of that metric (e.g., IC) on the test sets among all methods. Figure 1(b) visualizes the IC of the clean speech and of the signals predicted by stereo-spec-time and stereo-spec-time-IC from Test set I. The figure highlights that stereo-spec-time-IC preserves IC better than stereo-spec-time. The results show that IID regularization yields the largest SDR improvement, and IPD regularization yields the largest POLQA improvement. Overall, compared to the unregularized case, stereo-spec-time-all preserves all aspects of the stereo image. Finally, we note that the subjective results may contain uncertainties in the evaluation of the stereo image, as it is challenging to fully ignore the speech distortion while scoring the stereo image.
Subjective evaluation: We conduct four tests on Test set I. In each test, five tracks (i.e., three methods plus the hidden noisy and hidden reference tracks) are compared. To combine the results, we report the mean of the relative score with respect to the hidden noisy track in each test (higher is better). The “Subjective” column of Table 1 reports the result of this listening test, where the average relative difference of hidden reference and noisy is and for OVRL and IMG, respectively. The table demonstrates that stereo-spec-time-IPD, which also has the highest POLQA among all methods, achieves the highest OVRL score. Moreover, all stereo-aware training methods result in a higher IMG score ( at highest) compared to downmix () and LRindp (). We emphasize that the inference complexity of our stereo networks is approximately half that of LRindp. Among the stereo-aware regularization methods, the listeners gave the highest IMG scores of and when OPD and IPD, respectively, are preserved the best (i.e., stereo-spec-time-OPD/IPD). These results highlight the benefits of the proposed training approach in subjectively improving the image.
Model independence: Lastly, we apply the stereo-aware training framework to a different architecture, U-NetCM (last two rows of Table 1). We observe that including all the stereo errors along with the time loss results in lower stereo image errors and an improvement in SDR.
This paper studied the perceptual enhancement of stereo speech. The paper proposed a stereo-aware training loss to preserve the image while aiming to estimate the clean speech from the noisy mixture. The trained architecture was a variant of a causal U-Net, and the image preservation loss consists of errors related to interchannel intensity and phase differences, interchannel coherence, and overall phase. We showed that accounting for preservation of the stereo image improves the enhancement both objectively, with the SDR and POLQA metrics, and subjectively, through a MUSHRA listening test.