State-of-the-art speech and audio communication codecs such as 3GPP Extended Adaptive Multi-Rate-Wideband (AMR-WB+) [AMRWB+] and 3GPP Enhanced Voice Services (EVS) [EVS:2014alt] typically use Code-Excited Linear Prediction (CELP) and transform coding to encode speech and music, respectively, at lower bitrates. However, CELP-based coding has higher complexity compared to transform coding, especially at the encoder side. Therefore, the recently standardized low complexity, low delay codec (LC3) [LC3:2018Std] [schnell2021lc3] completely relies on transform coding which involves quantizing and coding the spectral coefficients after an MDCT (Modified Discrete Cosine Transform), thus reducing the complexity by a factor of 6 compared to EVS [EVS:2014alt] in super-wideband mode. At medium to higher bitrates, due to the availability of sufficient bits, transform-based coding yields sufficiently good to transparent quality. Conversely, at low bitrates, due to insufficient bits, spectral holes are created, leading to audible artefacts [disch2016intelligent].
To enhance the perceptual quality of coded speech at these low bitrates, tools such as noise filling, gap filling [disch2016intelligent] and the LTPF (Long Term Post-filter) are employed [EVS:2014alt] [Fuchs_2015]. While noise filling and gap filling typically mitigate the audible artefacts by treating the spectral holes, the LTPF improves the voiced parts of coded speech by attenuating the inter-harmonic noise [chen1995adaptive]. All of the above-mentioned techniques require the transmission of side information to the decoder, which causes an overhead in bit consumption.
In recent years, several data-driven post-filters which rely solely on statistics obtained from the coded speech have been proposed in order to enhance the quality of coded speech [Das2020_ESSV] [Korse_2020] [Biswas2020] [Skoglund2019]. While [Das2020_ESSV] designs a post-filter in the MDCT domain, based on a simplistic statistical model of the quantization noise, [Korse_2020] trains a DNN to estimate a real-valued mask per time-frequency tile based on the log-magnitude as input in the STFT (Short Time Fourier Transform) domain. In contrast, [Biswas2020] and [Skoglund2019] have proposed post-filters in the time domain using generative models, namely a GAN (Generative Adversarial Network) and LPCNet, respectively. While post-filters based on generative models can process both magnitude and phase, in contrast to methods that operate only on the magnitude, they suffer from a significant complexity overhead and can generalize poorly to unseen speakers. In addition, [Skoglund2019] relies on spectral features from the decoded speech and also needs features derived from LPC coefficients in the bitstream, which are usually unavailable in transform coding, whereas [Korse_2020] needs to perform a forward and an inverse STFT for the enhancement.
In this paper, we propose a mask-based post-filter that operates in the MDCT domain. Instead of working on the decoded speech signal, our proposed post-filter directly enhances the quantized MDCT coefficients available at the LC3 decoder before the inverse transformation, thus saving the overhead caused by an additional analysis, synthesis or feature extraction. We discuss the constraints associated with a mask-based approach in the MDCT domain, as it has been shown that a simple ratio mask-based approach similar to [Korse_2020], when directly applied to MDCT coefficients, produces audible artefacts [Kuech_2007] [Koizumi2018]. To mitigate such artefacts, we propose to train our model to estimate a magnitude mask from the MCLT (Modulated Complex Lapped Transform) domain and show that such a mask can be used to enhance MDCT coefficients during inference. We also show that this training method does not require an inverse transform during DNN training and avoids the need to compute the loss in the time domain as suggested in [Koizumi2018].
2 Problem Formulation
2.1 System Overview
Fig. 1 shows the integration of our proposed post-filter with the MDCT-based LC3 codec. In this setup, the post-filter operates in the MDCT domain at the decoder side, before the inverse transformation into the time domain. It does not require additional feature extraction or time-frequency analysis, but is constrained by the MDCT used in the codec. In our experimental setup, we use LC3 with 10 ms frames [schnell2021lc3], a framing that the post-filter inherits.
2.2 Mask Formulation
In simple terms, coded speech can be described as:
$$\tilde{x}(n) = x(n) + q(n), \tag{1}$$
where $\tilde{x}(n)$ is the coded speech, $x(n)$ is the uncoded speech and $q(n)$ is the quantization noise. In a transform codec, the quantization noise is the approximation error arising from the spectral quantization. Spectral noise shaping based on a perceptual model is used to make the quantization noise less perceivable. As a result, the introduced quantization noise is correlated with the speech signal.
A post-filter that predicts a real-valued mask applied to real-valued transform coefficients (e.g. MDCT coefficients) can be used to clean the quantization noise resulting from transform coding. However, the MDCT domain is not well suited for signal manipulation in the frequency domain for several reasons [wangmdct]. Its basis vectors are not shift-invariant, and the MDCT does not conserve energy. Perfect reconstruction is only achieved by considering adjacent windows and the principle of time-domain aliasing cancellation (TDAC). Any manipulation in the MDCT domain can violate these conditions and leave residual time-domain aliasing [Kuech_2007]. Moreover, the MDCT coefficients are real-valued and cannot be easily interpreted in terms of magnitude and phase. Therefore, we propose to train our model to estimate a real-valued magnitude mask computed on the magnitude spectrum of the MCLT, a complex-valued transform similar to the STFT but with time and frequency shifts, whose real part is the MDCT.
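The role of TDAC can be illustrated numerically. The following is a minimal pure-Python sketch, assuming a symmetric sine window (not the asymmetric low-delay window of LC3), an unscaled forward MDCT and a $2/N$ inverse scaling; all function names are hypothetical. It compares the MDCT analysis-synthesis chain against an aliasing-free reference in which each frame is only windowed, scaled and overlap-added.

```python
import math

def sine_window(N):
    # Symmetric window with w(n)^2 + w(n+N)^2 = 1 (assumption, not the LC3 window).
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(frame, N):
    # 2N windowed samples -> N real coefficients.
    return [sum(frame[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N)) for k in range(N)]

def imdct(coeffs, N):
    # N coefficients -> 2N time-aliased samples.
    return [2.0 / N * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                          for k in range(N)) for n in range(2 * N)]

def analysis_synthesis(signal, N, frame_gain):
    # MDCT -> per-frame gain on the coefficients -> IMDCT -> windowed overlap-add.
    w = sine_window(N)
    out = [0.0] * len(signal)
    for i, start in enumerate(range(0, len(signal) - 2 * N + 1, N)):
        frame = [w[n] * signal[start + n] for n in range(2 * N)]
        aliased = imdct([frame_gain(i) * c for c in mdct(frame, N)], N)
        for n in range(2 * N):
            out[start + n] += w[n] * aliased[n]
    return out

def aliasing_free_reference(signal, N, frame_gain):
    # Same gains applied without any transform: window twice, scale, overlap-add.
    w = sine_window(N)
    out = [0.0] * len(signal)
    for i, start in enumerate(range(0, len(signal) - 2 * N + 1, N)):
        for n in range(2 * N):
            out[start + n] += frame_gain(i) * w[n] ** 2 * signal[start + n]
    return out

def residual_aliasing(signal, N, frame_gain):
    a = analysis_synthesis(signal, N, frame_gain)
    b = aliasing_free_reference(signal, N, frame_gain)
    # Only samples covered by two overlapping frames are compared.
    return max(abs(a[n] - b[n]) for n in range(N, len(signal) - N))

N = 8
signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(6 * N)]
alias_untouched = residual_aliasing(signal, N, lambda i: 1.0)
alias_modified = residual_aliasing(signal, N, lambda i: 0.5 if i % 2 else 1.0)
```

With untouched coefficients the aliasing terms of adjacent frames cancel, whereas a gain that differs between adjacent frames leaves a residual, which is the effect a mask applied in the MDCT domain has to keep small.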
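The MCLT of the time-domain signal is given by:
$$X_{\mathrm{MCLT}}(m,k) = X_{\mathrm{MDCT}}(m,k) - j\,X_{\mathrm{MDST}}(m,k), \tag{2}$$
where $m$ and $k$ are the time and frequency index of the MCLT bins, respectively, and $X_{\mathrm{MDCT}}(m,k)$ and $X_{\mathrm{MDST}}(m,k)$ are the MDCT and the MDST (Modified Discrete Sine Transform) of the time-domain signal, defined as:
$$X_{\mathrm{MDCT}}(m,k) = \sum_{n=0}^{2N-1} w(n)\,x_m(n)\,\cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right], \tag{3}$$
$$X_{\mathrm{MDST}}(m,k) = \sum_{n=0}^{2N-1} w(n)\,x_m(n)\,\sin\!\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right], \tag{4}$$
where $w(n)$ is the low-delay asymmetric window used in LC3 [schnell2021lc3], $x_m(n)$ is the input signal of length $2N$ in frame $m$, and $k = 0, \dots, N-1$. The MCLT maps $2N$ input samples to $N$ complex output coefficients. It is then straightforward to design an optimal filter by considering the magnitude of the complex transform. We define the ideal magnitude mask of our post-filter in the MCLT domain as
$$M(m,k) = \frac{|X_{\mathrm{MCLT}}(m,k)|}{|\tilde{X}_{\mathrm{MCLT}}(m,k)| + \epsilon}, \tag{5}$$
where $|X_{\mathrm{MCLT}}(m,k)|$ and $|\tilde{X}_{\mathrm{MCLT}}(m,k)|$ denote the MCLT magnitude of clean and coded speech, respectively. A small constant $\epsilon$ is added to the denominator to prevent division by zero. The magnitude mask obtained in this way can be applied to the MDCT coefficients alone, ignoring the MDST components during the inverse transform, which results in the following processed MDCT coefficients:
$$\hat{X}_{\mathrm{MDCT}}(m,k) = M(m,k)\,\tilde{X}_{\mathrm{MDCT}}(m,k). \tag{6}$$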
The MDCT carries neither explicit phase information nor exact magnitude information. Processing the MDCT coefficients as in Eq. (6) with a mask derived from the MCLT magnitude spectrum simulates a magnitude manipulation in the MDCT domain. It can then be assumed that the phase is either unaffected or only very slightly affected by the masking operation. The use of the post-filter is thereby greatly simplified, and no specific care is required to avoid the time-domain aliasing artefacts that arise when the TDAC principle is broken by manipulating the MDCT coefficients.
In our proposed post-filter, the model takes the MDCT alone as input and predicts an optimal mask computed on the MCLT. The ability of a DNN to achieve such a prediction is not overly surprising, since the MDCT and MCLT spectra are highly similar and the missing MDST part differs from the MDCT only in its basis functions [Chen_2010]. Thus, our DNN-based post-filter can infer this missing information in the hidden layers based on the past context and the current MDCT input.
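The oracle form of this mask formulation can be sketched in a few lines. The example below is a minimal pure-Python illustration under stated assumptions: a sine window stands in for the LC3 low-delay window, a crude time-domain rounding stands in for the codec, and the function names and the value of `eps` are hypothetical. It computes the MDCT and MDST of one frame, forms the MCLT magnitude, derives the ideal mask from clean and coded frames, and applies it to the coded MDCT coefficients only.

```python
import math

def sine_window(N):
    # Stand-in analysis window (assumption); LC3 uses an asymmetric low-delay window.
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct_mdst(frame, N):
    # Cosine (MDCT) and sine (MDST) projections of one windowed 2N-sample frame.
    w = sine_window(N)
    mdct, mdst = [], []
    for k in range(N):
        c = s = 0.0
        for n in range(2 * N):
            phase = math.pi / N * (n + 0.5 + N / 2) * (k + 0.5)
            c += w[n] * frame[n] * math.cos(phase)
            s += w[n] * frame[n] * math.sin(phase)
        mdct.append(c)
        mdst.append(s)
    return mdct, mdst

def mclt_magnitude(frame, N):
    # Magnitude of the complex MCLT coefficient built from MDCT and MDST.
    mdct, mdst = mdct_mdst(frame, N)
    return [math.hypot(c, s) for c, s in zip(mdct, mdst)]

def ideal_mask(clean_frame, coded_frame, N, eps=1e-8):
    # Ratio of clean to coded MCLT magnitude; eps (assumed value) avoids division by zero.
    clean_mag = mclt_magnitude(clean_frame, N)
    coded_mag = mclt_magnitude(coded_frame, N)
    return [a / (b + eps) for a, b in zip(clean_mag, coded_mag)]

def apply_mask(mask, coded_mdct):
    # The mask scales the real-valued MDCT coefficients; the MDST is not needed here.
    return [m * c for m, c in zip(mask, coded_mdct)]

N = 16
clean = [math.sin(0.4 * n) + 0.3 * math.sin(1.3 * n) for n in range(2 * N)]
coded = [round(v * 4) / 4 for v in clean]  # toy quantiser, not a transform codec
mask = ideal_mask(clean, coded, N)
coded_mdct, _ = mdct_mdst(coded, N)
enhanced_mdct = apply_mask(mask, coded_mdct)
```

Because the mask is non-negative, each enhanced MDCT coefficient keeps the sign of the coded coefficient and only its size changes, which is the sense in which the operation acts as a magnitude manipulation.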
2.3 Mask Analysis
In order to understand the impact of the magnitude mask, an oracle experiment is performed in which the ideal magnitude mask computed using Eq. (5) is applied to the MDCT coefficients as shown in Eq. (6). Since the mask values are unbounded, a threshold $\gamma$ is applied to constrain the mask values to lie within $[0, \gamma]$. The bounded mask can be defined as:
$$M_{\gamma}(m,k) = \min\big(M(m,k),\, \gamma\big), \tag{7}$$
where $M(m,k)$ is the ideal magnitude mask of Eq. (5).
Fig. 2 compares the average Perceptual Objective Listening Quality Assessment (POLQA) [POLQA] results of applying the magnitude mask to the MDCT coefficients with different values of $\gamma$ as an upper bound. It can be observed that the bounding value $\gamma = 2$ can be considered an ideal upper limit, as it provides a quality improvement similar to that of an unbounded mask. The threshold is important because it clips the range of values to be predicted and thereby eases the task of the DNN.
The assessment also validates the use of a mask derived from the MCLT magnitude on MDCT coefficients. It shows that the mask-based post-filter can operate with and without the internal LTPF, providing a quality improvement over the coded signal in either case. The best performance is observed when the proposed post-filter operates in conjunction with the LTPF.
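Bounding the mask amounts to a simple clipping operation, sketched below in pure Python. The function names are hypothetical, and the default upper bound of 2.0 is an assumption chosen to match the [0, 2] output range of the DNN described in Section 3.

```python
def bound_mask(mask, gamma=2.0):
    # Clip each non-negative mask value to the upper bound gamma (assumed default).
    return [min(m, gamma) for m in mask]

def clipped_fraction(mask, gamma=2.0):
    # Share of time-frequency tiles whose ideal mask value exceeds the bound.
    return sum(m > gamma for m in mask) / len(mask)

mask = [0.0, 0.7, 1.0, 2.0, 5.3]
bounded = bound_mask(mask)
fraction = clipped_fraction(mask)
```

The clipped fraction gives a quick indication of how much of the oracle mask is altered by a given bound, which is one way to reason about why a moderate bound can perform close to the unbounded mask.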
3 Experimental Setup
A CNN-based encoder-decoder (CED) architecture is implemented as shown in Table 1, largely inspired by the model used in [Korse_2020]. The input to the DNN consists of MDCT coefficients of size 160 each for 5 past frames and 1 current frame. Each layer of the CED uses batch normalization and an ELU (Exponential Linear Unit) activation function. Skip connections are used between encoder and decoder layers, with the required zero-padding inserted to match the frequency-bin dimensions. The output layer uses a sigmoid activation function multiplied by a factor of 2 in order to estimate the real-valued mask in the range [0, 2]. The model is trained with the ADAM optimizer [Kingma2014] with a learning rate of 0.001 and a batch size of 32. Training is run until convergence using early stopping.
Table 1: Architecture of the CED. Each layer is specified by its kernel size, strides and number of output channels.
|Layer|Kernel size|Strides|Output channels|
|Conv2d_1| |(1,2)|16|
|Conv2d_2| |(1,2)|32|
|Conv2d_3| |(1,2)|64|
|Conv2d_4| |(1,2)|128|
|Deconv2d_1| |(1,2)|64|
|Deconv2d_2| |(1,2)|32|
|Deconv2d_3| |(1,2)|16|
|Deconv2d_4| |(1,2)|1|
|Conv2d_5| |(1,1)|1|
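The output activation and the padding of skip connections described above can be sketched independently of any deep learning framework. The following pure-Python helpers are illustrative only: the function names are hypothetical, and padding at the end of the frequency axis is an assumption, since the text does not state on which side the zeros are inserted.

```python
import math

def scaled_sigmoid(z, scale=2.0):
    # Numerically stable logistic function, scaled so the output lies in [0, scale].
    if z >= 0:
        return scale / (1.0 + math.exp(-z))
    e = math.exp(z)
    return scale * e / (1.0 + e)

def zero_pad_to(row, target_len):
    # Append zeros along the frequency axis until the row has target_len bins
    # (assumed padding side), so that it can be combined with a skip connection.
    return list(row) + [0.0] * (target_len - len(row))

mask_at_zero = scaled_sigmoid(0.0)
padded = zero_pad_to([0.3, 0.1, 0.4], 5)
```

With this activation a pre-activation of zero maps to a mask value of 1, which leaves the corresponding MDCT coefficient unchanged.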
3.2 Training and Inference
The input to the model is the logarithm of the absolute value of the MDCT coefficients obtained from the core decoding tools of the LC3 decoder. Since the speech signal exhibits temporal dependency, the input to the model contains the 5 past frames stacked together with the current frame. The MCLT log-magnitudes required during training are computed from the coded speech (the signal to be enhanced) and from the original speech (the target of the loss function). During the training phase, the DNN estimates a magnitude mask which is multiplied with the MCLT magnitude of the coded speech for enhancement. The MSE (Mean Squared Error) between the log-magnitude of the clean speech MCLT and that of the enhanced MCLT is used as the loss function. During inference, however, the estimated mask is applied directly to the MDCT coefficients, making inference completely independent of the complex-valued transform.
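The input construction and the training loss described above can be written down compactly. The sketch below is a pure-Python illustration under stated assumptions: the constant `EPS`, the zero-padding of context frames before the start of the signal, and all function names are hypothetical and not taken from the paper.

```python
import math

EPS = 1e-8  # assumed floor to keep the logarithm finite

def log_magnitude(coeffs):
    # Log of the absolute value of real MDCT coefficients (the model input feature).
    return [math.log(abs(c) + EPS) for c in coeffs]

def stack_context(frames, t, past=5):
    # Stack `past` previous frames and the current frame t; frames before the
    # start of the signal are replaced by zeros (assumption).
    n_bins = len(frames[0])
    return [list(frames[i]) if i >= 0 else [0.0] * n_bins
            for i in range(t - past, t + 1)]

def log_mclt_mse(mask, coded_mag, clean_mag):
    # MSE between the log-magnitude of the masked coded MCLT and the clean MCLT.
    sq = [(math.log(m * c + EPS) - math.log(s + EPS)) ** 2
          for m, c, s in zip(mask, coded_mag, clean_mag)]
    return sum(sq) / len(sq)

# Toy data: 8 frames with 4 MDCT bins each.
mdct_frames = [[0.1 * (t + 1) * (-1) ** k for k in range(4)] for t in range(8)]
features = [log_magnitude(f) for f in mdct_frames]
model_input = stack_context(features, t=2)

coded_mag = [1.0, 2.0, 4.0]
clean_mag = [0.5, 2.0, 8.0]
loss_ideal = log_mclt_mse([0.5, 1.0, 2.0], coded_mag, clean_mag)
loss_unity = log_mclt_mse([1.0, 1.0, 1.0], coded_mag, clean_mag)
```

The loss only involves magnitudes in the transform domain, so no inverse transform is needed during training; at inference the same mask would simply multiply the MDCT coefficients.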
For both training and testing, files are encoded and decoded with LC3 at 16 kbps, with the internal LTPF either enabled or disabled. The training is based on the NTT-AT [nttdb:2012] database, which contains clean stereo speech signals sampled at 48 kHz. The files are resampled to 16 kHz and a passive mono down-mix is obtained from the stereo files. Out of 3960 files, 3612 files are used for training, 198 files for validation and 150 for testing. The MCLT is computed using the same low-delay window employed in the LC3 codec [schnell2021lc3]. For a signal with a sampling rate of 16 kHz, the frame size is 10 ms and there is a lookahead of 2.5 ms. We use the asymmetric low-delay window of LC3 with a size of 320 samples, obtaining 160 MCLT magnitudes per frame. Inputs to the model are normalised by the mean and standard deviation calculated over the entire dataset.
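The framing figures quoted above follow directly from the sampling rate, and the normalisation is a standard standardisation step. The sketch below is a pure-Python illustration; the function names are hypothetical, and the use of a single global mean and standard deviation (rather than per-bin statistics) is an assumption, since the text does not specify which is used.

```python
import math

def frame_parameters(sample_rate_hz=16000, frame_ms=10.0, lookahead_ms=2.5):
    # Frame length, window length (two frames) and lookahead in samples.
    frame_len = round(sample_rate_hz * frame_ms / 1000)
    return {
        "frame_len": frame_len,
        "window_len": 2 * frame_len,
        "lookahead": round(sample_rate_hz * lookahead_ms / 1000),
        "bins_per_frame": frame_len,
    }

def fit_normaliser(frames):
    # Global mean and standard deviation over all bins of all frames (assumption).
    values = [v for frame in frames for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalise(frame, mean, std, floor=1e-12):
    # Standardise one frame; the floor guards against a zero standard deviation.
    return [(v - mean) / max(std, floor) for v in frame]

params = frame_parameters()
train_frames = [[1.0, 2.0], [3.0, 4.0]]
mean, std = fit_normaliser(train_frames)
normalised = [normalise(f, mean, std) for f in train_frames]
```

The statistics would be estimated once on the training data and reused unchanged for validation, testing and inference.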
To assess the proposed setup, both subjective and objective tests are conducted. For objective assessment, we use POLQA, whereas for subjective assessment, we follow the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) methodology [MUSHRA]. For a complete performance evaluation, the assessment is provided for the following configurations:
Coded speech with LC3 at 16 kbps. Both cases, LTPF enabled (On) and LTPF disabled (Off) at the decoder, are analysed.
Enhancement of the MDCT coefficients from LC3 at 16 kbps using the proposed post-filter. Both cases, LTPF enabled (On) and LTPF disabled (Off) at the decoder, are analysed. No extra delay is introduced over LC3.
The DNN-based post-filter proposed in [Korse_2020], modified to operate up to 8 kHz. It works with forward and inverse STFTs using 32 ms frames with 50% overlap, hence operating on 256 frequency bins. This model treats the codec as a black box and takes the decoded time-domain signal from the codec with LTPF enabled as input. An additional delay of 30 ms (3 frames of 10 ms each) is thereby added to the coding scheme.
For comparison with our proposed method, the STFT-based method serves as the baseline for MDCT enhancement with LTPF, and coded speech with LTPF serves as the baseline for MDCT enhancement without LTPF.
The POLQA scores are calculated and averaged over 150 test files from the NTT-AT database that are not used during the training or validation phase. The MUSHRA listening test is done with 10 items in 5 languages taken from various unseen databases, thus analysing the robustness and generalization capabilities of our proposed method. Both subjective and objective results confirm that our proposed post-filter improves the perceptual quality of coded speech with and without LTPF. In line with the observation made in the oracle experiment described in Section 2.3, the post-filter combined with the LTPF provides a substantial improvement. The LTPF attenuates the inter-harmonic noise in the low-frequency regions of the spectrum, whereas the DNN-based post-filter enhances all regions of the spectrum. Thus, when used in conjunction, the two systems provide complementary improvements, leading to a better enhancement of the speech signal.
The objective and subjective scores differ in their assessment of speech quality across the configurations. The POLQA scores indicate that the considered baselines are always better than our proposed post-filter, whereas the MUSHRA test shows that our post-filter provides a good improvement and is comparable in performance to the baselines. From the subjective scores we can infer that our post-filter in the MDCT domain is capable of suppressing the quantization noise and generalizes well across different speakers and languages. Moreover, the proposed post-filter does not add any delay to the coding scheme and does not require an additional frequency transformation, unlike the baseline STFT-based system.
In terms of complexity, the proposed post-filter requires 1.3 GFLOPS, similar to the STFT-based system. Although our model is about 1000 times more complex than the heuristic post-filter, it is less than half as complex as generative models such as [Skoglund2019].
We proposed a DNN-based post-filter that estimates an optimal magnitude mask derived from the MCLT but applied in the MDCT domain. This method is highly relevant for transform coding, where the MDCT is commonly used. Integrating the post-filter into the decoder in the MDCT domain avoids additional algorithmic delay and allows it to work directly on the quantized coefficients.
Subjective and objective evaluations have demonstrated the effectiveness and robustness of the proposed approach, which can compete with the conventional method of using an additional time-frequency decomposition in a post-processing stage.
Future work can be devoted to exploring the ability of the mask-based approach to enhance signals other than clean speech, such as music and noisy or reverberant speech.