Corrupted speech signal by noise and reverberation is one of the most common signals we hear in our everyday life. The desire to listen to clean speech signal, therefore, is strong let alone its usage for machine such as speech recognition system. While there has been numerous studies to address the problem of single channel denoising and dereverberation, only few have tried to solve this problem with a single deep learning model[sun2018enhanced]. This motivates us to tackle this real-world problem using a single-stage deep learning model. We address this problem by dissecting the elements that compose the noisy-reverberant mixture, that is, noise, direct source, and reverberation. By dissecting each part of the mixture, one could handle each element at hand and even mix them with a desired proportion. Note that this is a desirable property for users as the effective amount of reverberation is important to achieve better speech intelligibility for both impaired and nonimpaired listeners [bradley2003importance, hu2014effects].
The key contributions of our proposed approach are three folds. First, to suppress the noise and reverberation, we propose a new type of complex-valued mask called phase-aware -sigmoid mask (PHM). While the complex-valued mask suggested by [williamson2016complex] separately estimates the real and imaginary part of a complex spectrogram, we believe the phase part of it can be effectively estimated by reusing an estimated magnitude value of it in a trigonometric perspective as suggested in [wang2019masking]. The major difference between PHM and the suggested approach in [wang2019masking] is that PHM is designed to respect the triangular relationship between mixture, source and the rest, and hence the sum of the estimated source and the rest is always equal to the mixture. By exploiting this property, we train the deep network to output two different PHMs simultaneously to effectively deal with both denoising and derverbration problem. Second, we propose a new time-domain loss function, an emphasized multi-scale cos similarity loss function. A time-domain loss function has recently been used as a popular loss function [choi2019phase, yao2019coarse, wang2018end, koizumi2020speech, le2019sdr]. To better design the time-domain cos similarity loss function proposed in [choi2019phase], we change it into a multi-scale version of it with proper emphasis functions and show the effectiveness of it. Finally, we suggest an optimization strategy for two-dimensional U-Net to significantly reduce the computational inefficiency in runtime.
2 Related Works
Recently, there has been an increasing interests in phase-aware speech enhancement because of the sub-optimality of reusing the phase of mixture signal. The first work that tried to address this problem was by using phase-sensitive mask (PSM) [erdogan2015phase]. PSM estimates the real-part of the signal which is still sub-optimal. As a more direct remedy for this, a complex masking [williamson2016complex, choi2019phase, wang2019masking] or complex spectral mapping [tan2019complex] has also been proposed to estimate a clean phase part. Another line of research is to sequentially estimate the clean phase part using an additional sub-module [takahashi2018phasenet, afouras2018conversation, yin2019phasen]. This, however, is limited in that it requires an additional module resulting in inefficient computation. While most of these works tried to estimate the clean phase by using phase mask or an additional network, the absolute phase difference between mixture and source can be actually computed using the law of cosines using the estimated magnitude values as the three sides of a triangle [mowlaee2012phase, mowlaee2014time]. Inspired by this, [wang2019deep] proposed to estimate a rotational direction of the absolute phase difference using a sign-prediction network.
The efforts to deal with denoising or dereverberation using deep networks have been tried in many works. Recently, [zhao2018two, maciejewski2019whamr] tried to address this problem with two modules for each task. We believe, however, a two-stage framework is not necessary and can be achieved using a single deep network.
3 Single-stage Denoising and Dereverberation
A noisy-reverberant mixture signal is commonly modeled as the sum of additive noise and reverberant source , where is a result of convolution between room impulse response (RIR) and dry source as follows, . More concretely, we can break down into two parts, that is, direct path part which does not includes the reflection path and the rest of the part that includes all the reflection paths as follows, , where and denotes direct path source and reverberation, respectively. In this setting, our goal is to separate into three elements , , and . Each of the corresponding time-frequency representations computed by STFT is denoted as , , , , and the estimated values will be denoted by the hat operator .
3.1 Phase-aware -sigmoid mask
Designing a mask that is not limited to output the optimal value of ideal mask requires two conditions to satisfy. First, the range of magnitude mask should not be limited. Second, the mask has to be complex-valued so that it can correct both the magnitude part and phase part of the mixture signal. The proposed phase-aware -sigmoid mask (PHM) is designed to handle both conditions while systemically restricting the sum of estimated complex values to be exactly the value of mixture, . The PHM separates the mixture in STFT domain into two parts as one-vs-rest approach, that is, the signal and the sum of the rest of the signals , where index could be one of direct path source (), reverberation (), and noise () in our setting, . The complex-valued mask estimates the magnitude and phase value of the source of interest . The mask is composed of two parts, (1) magnitude mask estimation, (2) phase estimation by reusing the magnitude estimation from (1) and two-class sign prediction.
First, the network outputs the magnitude part of two masks and
with sigmoid functionmultiplied by coefficient as follows,
where is the output located at
from the last layer of neural-network function, and is a function composed of network layers before the last layer. serves as a magnitude mask to estimate source and the value of it ranges from 0 to . The role of is to design a mask that is close to the optimal mask with a flexible magnitude range so that the value is not bounded between 0 and 1 unlike the typically used sigmoid mask. In addition, because the sum of the complex valued masks and must compose a triangle, it is reasonable to design a mask that satisfies the triangle inequalities, that is, and . To address the first inequality we designed the network to output
from the last layer with a softplus activation function as follows,, where denotes an additional network layer to output . The second inequality can be satisfied by clipping the upper bound of the by .
Once the magnitude masks are decided, we can construct a phase mask . Given the magnitudes as three sides of a triangle, we can compute the cosine of absolute phase difference between the mixture and source as follows, . Next, the rotational direction (clockwise or counterclockwise) for phase correction can be decided by estimating sign value as follows,
Two-class straight-through Gumbel-softmax estimator was used to estimate [jang2016categorical]. It allows us to discretize the output of the Gumbel-softmax function with and still be able to train the network in an end-to-end manner using a continuous approximation in the backward pass. is defined as follows,
where is defined as follows,
and and are samples from Gumbel,
is an additional network layer to output logit value, and is a temperature parameter for Gumbel-softmax. Finally, is defined as follows,
3.2 Masking from the perspective of quadrangle
As we desire to extract both direct source and reverberant source, two pairs of PHMs are used for each of them. The first pair of masks separates direct source and the rest of the component, denoted as and . The second pair of masks separates noise and reverberant source component, denoted as and . Since PHM guarantees the mixture and separated components to construct a triangle in the complex STFT domain, the outcome of the separation can be seen from the perspective of quadrangle as in Fig 1. In this setting, as the three sides and two side angles are already determined by the two pairs of PHMs, the last fourth side of quadrangle, the reverberation component , is uniquely decided.
3.3 Emphasized multi-scale cosine similarity loss
Learning to maximize cosine similarity can be regarded as maximizing the signal-to-distortion ratio (SDR)[choi2019phase]. Cosine similarity loss between estimated signal and ground truth signal is defined as follows,
where denotes the temporal dimensionality of a signal and denotes the type of signal (). Consider a sliced signal , where denotes the segment index and denotes the number of segment. By slicing the signal and normalize it by its norm, each sliced segment is considered as an unit for computing . Therefore, we hypothesize that it is important to choose a proper segment length unit when computing . In our case, we used multiple settings of segment lengths as follows,
where denotes the number of sliced segments. In our case the set of 's was chosen as follows, , assuming they moderately cover the range of duration of phonemes in speech.
To further improve the design of the loss function, we applied two simple techniques — 1. pre-emphasis () and 2. -law encoding () — on signals. As most of the speech signal components are concentrated in the lower frequency bands, we found that applying pre-emphasis on loss function can help penalize high frequency components. In addition, since the samples of speech signals are usually centered around zero, we found that it is helpful to use 16-bit -law encoding as it distributes samples more uniformly by the nature of continuous logarithmic transform. The proposed loss function is defined as follows,
Finally, we used the proposed loss function for every and combinations as follows,
4 Optimization for Real-Time U-Net
To connect each encoder layer with its corresponding decoder, U-Net is often composed of convolutional layers with zero padding for dynamic input sizes. Without zero padding, the valid size of input and output is uniquely determined by the kernel sizes and strides. This obviously takes less computation and allows to keep only the essential part. In our real-time setting, the input spectrogram has 253 frequency bins and 65 frames by discarding the four lowest bins from the original spectrogram with a 512-point FFT, assuming that 16 kHz speech signals have no significant spectral component below 93.75 Hz. We followed the network architecture of the real-valued U-Net proposed in[choi2019phase] (model10 and model20
specifically) with the modification of the last layer to output PHM. All the batch normalizations were fused into convolution filters.
In the encoder, the naïve implementation of U-Net repeatedly performs the same computation that has already been computed previously. This redundancy can be efficiently reduced by caching the pre-computed values using queues. Likewise, we utilized a similar concept with 2D convolution, but more than one queues are needed for the strided convolution in each layer. The number of required queues for depth is derived by , where denotes the temporal stride of the -th encoder layer.
Most computation of the naïve U-Net is concentrated on a few decoder layers before the output. Fortunately, only a single frame of the output mask is needed for real-time inference. Although using the latest frame can achieve the shortest latency, it is better to preview a few milliseconds for performance. It is computationally less efficient to use a longer lookahead because more frames should be calculated in the previous decoder layer. Our real-time implementation previews 32ms which is shorter than allowed in the DNS challenge [reddy2020interspeech]. The schematic details are shown in Fig. 2.
We used the DNS challenge dataset [reddy2020interspeech] and internally collected dataset for training. The former is a large scale dataset where the speech samples were collected from Librivox [mcguirelibrivox], and noise samples from Audioset [gemmeke2017audio] and Freesound [fonseca2017freesound]. Note that we did not use the provided noisy speech data from the DNS dataset but used on-the-fly augmentation with the clean speech and noise in the two datasets during the training phase. Since our goal is to perform both denoising and dereverberation, we used pyroomacoustics [scheibler2018pyroomacoustics]
to simulate an artificial reverberation with randomly sampled absorption, room size, location of source and microphone distance. We also trimmed random segments of 2 seconds from speech and noise data, and mixed them with uniformly distributed source-to-noise ratio (SNR) ranging from -10 dB to 30 dB.
For test, we used two datasets such as the synthesized testset in the DNS challenge (DNS) and WHAMR [maciejewski2019whamr]. The DNS synthesized testset provides noisy-reverberant mixtures and noisy mixtures without reverb. DNS was used only to test the denoising performance since it does not provide the direct source signal of synthesized mixture samples. Therefore, a reverberant source was given as ground truth when the model is tested on noisy-reverberant mixtures. Both the denoising and dereverberation performance were tested on the min subset of WHAMR dataset which contains 3,000 audio files. To test the denoising and dereverberation performances both simultaneously and separately, we tested our models on four scenarios: 1) nr2d: noisy-reverberant mixture to direct source 2) nr2r: noisy-reverberant mixture to reverberant source 3) n2d: noisy mixture to direct source 4) r2d: reverberant source to direct source. The corresponding four pair of test subsets, denoted in a following way (mixture, ground_truth), were used as follows, 1. nr2d: (mix_single_reverb, s1_anechoic), 2. nr2r: (mix_single_reverb, s1_reverb), 3. n2d: (mix_single_anechoic, s1_anechoic), 4. r2d: (s1_reverb, s1_anechoic).
Input features were used as a channel-wise concatenation of log-magnitude spectrogram, real and imaginary part of demodulated phase [takahashi2018phasenet], group delay, and delta-phase [mccowan2011delta]. The window size of model20 was 1024 with 256 hop size and the window size of model10 was 512 with 128 hop size. All models were trained for 125k iterations with AdamW optimizer [reddi2019convergence]. The learning rate was set to 0.0004 and halved at 62.5k iteration. Every test was done with a non-causal inference using model20 except the experiments in subsection 5.5.
5.3 Ablation studies
To show the effect of loss functions we observed SI-SDR [le2019sdr] and PESQ [rix2001perceptual] while changing four different loss functions. Complex MSE (MSE) and three different cosine similarity based loss functions — SingleScale, MultiScale, and MultiScale+ — were compared each of which corresponds to Eq. 6, Eq. 7, and Eq. 8, respectively. The quantitative results in Table 1 show that the proposed multi-scale and emphasis functions are beneficial for both denoising and dereverberation tasks in most of the cases.
5.4 Analysis on phase enhancement
Here, we used the phase distance defined in [choi2019phase] to quantitatively measure the phase enhancement performance. Phase distance () between spectrogram and is formulated as follows,
where is the angle between and , ranging from 0°to 180°. We measured between ground truth and mixture , and between ground truth and estimation , and checked how much was obtained. This was tested on all four scenarios of WHAMR testset and shown in Table 2. We found that the network is able to give a reasonable in tasks including dereverberation (nr2d, nr2r). However, was marginal for only-denoising-tasks (nr2r, n2d). We conjecture that this is because the network is not able to estimate a precise magnitude value for noisy mixture and this issue is left for futurework. A visualization of enhanced phase group delay tested on a reverberant source is shown in Fig. 3. Fig. 3 (b) shows the enhanced harmonic structure of phase group delay.
5.5 Computation of real-time U-Net
Following the constraint for real-time model suggested by the DNS challenge, we measured the elapsed time to compute a single frame. model20 that took 40 ms to compute a frame will be denoted as non-real-time (NRT) model, and model10 that took 4.32 ms to compute a frame will be denoted as real-time (RT) model. To compare the two models and how causal inference affects the model performance, we compare four combinations in nr2d task. Table 3 shows that both non-causal inference and model size are significant factors for performance.
Finally, we report the Mean Opinion Score (MOS) results from the DNS challenge based on the online subjective evaluation framework ITU-T P.808 [p808]. For better perceptual quality, we linearly added the estimated direct source and reverberant source with a 15 dB ratio, and implemented a simple and zero-delay dynamic range compression to apply on it. Our causal-NRT and causal-RT model achieved a mean opinion score of 3.36 and 3.24, respectively.
We proposed a new mask and loss function to improve the performance of single-stage denoising and dereverberation. As the proposed PHM and loss function are orthogonal to the network structure, we believe that a better performance can be achieved using the variant of U-Net architectures such as [takahashi2017multi, tolooshams2020channel].