1 Introduction
State-of-the-art speech and audio communication codecs such as 3GPP Extended Adaptive Multi-Rate Wideband (AMR-WB+) [AMRWB+] and 3GPP Enhanced Voice Services (EVS) [EVS:2014alt] typically use Code-Excited Linear Prediction (CELP) and transform coding to encode speech and music, respectively, at lower bitrates. However, CELP-based coding has higher complexity than transform coding, especially at the encoder side. Therefore, the recently standardized low-complexity, low-delay codec LC3 [LC3:2018Std] [schnell2021lc3] relies entirely on transform coding, which quantizes and codes the spectral coefficients after an MDCT (Modified Discrete Cosine Transform), thus reducing the complexity by a factor of 6 compared to EVS [EVS:2014alt] in super-wideband mode. At medium to high bitrates, the available bits allow transform-based coding to yield good to transparent quality. Conversely, at low bitrates, insufficient bits create spectral holes, leading to audible artefacts [disch2016intelligent].
To enhance the perceptual quality of coded speech at these low bitrates, tools such as noise filling, gap filling [disch2016intelligent] and the LTPF (Long-Term Postfilter) are employed [EVS:2014alt] [Fuchs_2015]. While noise filling and gap filling typically mitigate the audible artefacts by treating the spectral holes, the LTPF improves the voiced parts of coded speech by attenuating inter-harmonic noise [chen1995adaptive]. All of the above-mentioned techniques require the transmission of additional side information to the decoder, causing an overhead in bit consumption.
In recent years, several data-driven postfilters that rely solely on statistics obtained from the coded speech have been proposed to enhance the quality of coded speech [Das2020_ESSV] [Korse_2020] [Biswas2020] [Skoglund2019]. While [Das2020_ESSV] designs a postfilter in the MDCT domain based on a simplistic statistical model of the quantization noise, [Korse_2020] trains a DNN to estimate a real-valued mask per time-frequency tile based on the log-magnitude input in the STFT (Short-Time Fourier Transform) domain. In contrast, both
[Biswas2020] and [Skoglund2019] have proposed postfilters in the time domain using generative models, namely GANs (Generative Adversarial Networks) and LPCNet, respectively. While postfilters based on generative models can process both magnitude and phase, in contrast to methods that operate only on magnitude, they suffer from a significant complexity overhead and can be prone to a lack of generalization for unseen speakers. In addition,
[Skoglund2019] relies on spectral features from decoded speech and also needs features derived from the LPC coefficients in the bitstream, which are usually unavailable in transform coding, whereas [Korse_2020] needs to perform a forward and inverse STFT for the enhancement.

In this paper, we propose a mask-based postfilter that operates in the MDCT domain. Instead of working on the decoded speech signal, our proposed postfilter directly enhances the quantized MDCT coefficients available at the LC3 decoder before the inverse transformation, thus saving the overhead caused by an additional analysis, synthesis or feature extraction. We discuss the constraint associated with the mask-based approach in the MDCT domain, as it has been shown that a simple ratio-mask-based approach similar to
[Korse_2020], when directly applied to MDCT coefficients, produces audible artefacts [Kuech_2007] [Koizumi2018]. To mitigate such artefacts, we propose to train our model to estimate a magnitude mask in the MCLT (Modulated Complex Lapped Transform) domain and show that such a mask can be used to enhance MDCT coefficients during inference. We also show that such a training method does not require an inverse transform during DNN training and avoids the need to compute the loss in the time domain as suggested in [Koizumi2018].

2 Problem Formulation
2.1 System Overview
Fig. 1 shows the integration of our proposed postfilter with the MDCT-based LC3 codec. In this setup, the postfilter operates in the MDCT domain at the decoder side, before the inverse transformation into the time domain. It does not require additional feature extraction or time-frequency analysis, but is constrained by the MDCT transform used in the codec. In our experimental setup, we use LC3 with 10 ms frames [schnell2021lc3], which is then inherited by the postfilter.
2.2 Mask Formulation
In simple terms, coded speech can be described as:
(1) $\hat{s}(n) = s(n) + q(n)$

where $s(n)$ is the uncoded speech and $q(n)$ is the quantization noise. In a transform codec, the quantization noise is the approximation error arising from the spectral quantization. Spectral noise shaping based on a perceptual model is used to make the quantization noise less perceivable. As a result, the introduced quantization noise is correlated with the speech signal.
A postfilter that predicts a real-valued mask applied to real-valued transform coefficients (e.g. MDCT coefficients) can be used to clean the quantization noise resulting from transform coding. However, the MDCT domain is not well suited for signal manipulation in the frequency domain for several reasons [wangmdct]. Its basis vectors are not shift-invariant, and the MDCT does not conserve energy. Perfect reconstruction can only be achieved by considering adjacent windows and the principle of time-domain aliasing cancellation (TDAC). Any manipulation in the MDCT domain can violate these conditions and result in time aliasing [Kuech_2007]. Moreover, the MDCT coefficients are real-valued and cannot be easily interpreted in terms of magnitude and phase. Therefore, we propose to train our model to estimate a real-valued magnitude mask computed on the magnitude spectrum of the MCLT, a complex-valued transform similar to the STFT but with time and frequency shifts, for which the MDCT is given by its real part.

The MCLT of the time-domain signal $x(n)$ is given by:
(2) $X(m,k) = X_{\mathrm{MDCT}}(m,k) + j\,X_{\mathrm{MDST}}(m,k)$

where $m$ and $k$ are the time and frequency index of the MCLT bins, respectively, $X_{\mathrm{MDCT}}(m,k)$ are the MDCT and $X_{\mathrm{MDST}}(m,k)$ are the MDST (Modified Discrete Sine Transform) coefficients of the time-domain signal $x(n)$, defined as:

(3) $X_{\mathrm{MDCT}}(m,k) = \sum_{n=0}^{2N-1} w(n)\,x(n+mN)\,\cos\!\left[\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right]$

(4) $X_{\mathrm{MDST}}(m,k) = \sum_{n=0}^{2N-1} w(n)\,x(n+mN)\,\sin\!\left[\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right]$

where $w(n)$ is a low-delay asymmetric window used in LC3 [schnell2021lc3], $x(n)$ is the input signal of length $2N$ and $k = 0, \dots, N-1$. The MCLT maps $2N$ input samples to $N$ complex output coefficients. It is then straightforward to design an optimal filter by considering the magnitude of the complex transform. We define the ideal magnitude mask of our postfilter in the MCLT domain as:
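As a concrete illustration of Eqs. (2)-(4), the following numpy sketch evaluates the MDCT and MDST sums directly and combines them into the MCLT. It is a minimal sketch: a generic sine window stands in for the LC3 low-delay asymmetric window, and the direct matrix evaluation is chosen for clarity, not efficiency.

```python
import numpy as np

def mclt(x, w):
    """MCLT via the MDCT/MDST definitions of Eqs. (3)-(4).

    x : input signal; w : analysis window of length 2N (a generic sine
    window here stands in for the LC3 low-delay asymmetric window).
    Returns a (num_frames, N) complex array X = X_MDCT + 1j * X_MDST.
    """
    N = len(w) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    # (2N x N) cosine/sine basis matrices from the common phase term
    phase = (np.pi / N) * np.outer(n + 0.5 + N / 2, k + 0.5)
    C, S = np.cos(phase), np.sin(phase)
    num_frames = len(x) // N - 1  # frames advance by N (50% overlap)
    X = np.empty((num_frames, N), dtype=complex)
    for m in range(num_frames):
        frame = w * x[m * N:m * N + 2 * N]
        X[m] = frame @ C + 1j * (frame @ S)
    return X
```

Note that the real part of the result is exactly the MDCT, which is why a mask learned on MCLT magnitudes can later be applied to MDCT coefficients alone.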
(5) $M(m,k) = \dfrac{|X(m,k)|}{|\hat{X}(m,k)| + \epsilon}$

where $|X(m,k)|$ and $|\hat{X}(m,k)|$ denote the MCLT magnitudes of clean and coded speech, respectively. A small constant $\epsilon$ is added to prevent division by zero. The so-obtained magnitude mask can be applied to the MDCT coefficients, ignoring the MDST components during the inverse transform, which results in the following processed MDCT coefficients:

(6) $\tilde{X}_{\mathrm{MDCT}}(m,k) = M(m,k)\,\hat{X}_{\mathrm{MDCT}}(m,k)$

The MDCT carries neither explicit phase information nor exact magnitude information. Nevertheless, processing the MDCT coefficients in Eq. (6) with a mask derived from the MCLT magnitude spectrum is able to approximate a magnitude manipulation in the MDCT domain. It can then be assumed that the phase is either unaffected or only very slightly affected by the so-derived masking operation. The postfilter usage is thereby greatly simplified, and no specific care is required to avoid artefacts arising from time-domain aliasing, which would otherwise occur when the TDAC principle is broken by manipulating the MDCT coefficients.
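The oracle mask of Eq. (5) and its application to the MDCT coefficients in Eq. (6) reduce to a few lines; this is a sketch in which the exact value of the small constant is an assumption:

```python
import numpy as np

EPS = 1e-9  # small constant of Eq. (5); exact value is an assumption

def oracle_mask(X_clean, X_coded):
    """Ideal magnitude mask of Eq. (5) from clean/coded MCLT coefficients."""
    return np.abs(X_clean) / (np.abs(X_coded) + EPS)

def enhance_mdct(mask, mdct_coded):
    """Eq. (6): apply the MCLT-derived magnitude mask to the real-valued
    MDCT coefficients, leaving the (implicit) phase untouched."""
    return mask * mdct_coded
```

When coded and clean speech coincide, the mask is close to one everywhere and the MDCT coefficients pass through unchanged, which is the desired no-op behaviour.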
In our proposed postfilter, the model takes the MDCT alone as input and predicts an optimal mask computed on the MCLT. The ability of a DNN to achieve such a prediction is not overly surprising, since the MDCT and MCLT spectra are highly similar and the missing MDST part differs from the MDCT only in its basis functions [Chen_2010]. Thus, our DNN-based postfilter can infer this missing information in its hidden layers based on the past context and the current MDCT input.
2.3 Mask Analysis
In order to understand the impact of the magnitude mask, an oracle experiment is performed where the ideal magnitude mask computed using Eq. (5) is applied to the MDCT coefficients as shown in Eq. (6). Since the mask values are unbounded, a threshold $\beta$ is applied to constrain the mask values to be within $[0, \beta]$. The bounded mask is defined as:

(7) $M_{\beta}(m,k) = \min\left(M(m,k),\,\beta\right)$

Fig. 2 compares the average Perceptual Objective Listening Quality Assessment (POLQA) [POLQA] results of applying the magnitude mask to the MDCT coefficients with different values of $\beta$ as an upper bound. It can be observed that the bounding value $\beta = 2$ can be considered an ideal upper limit, as it provides a similar quality improvement to an unbounded mask. The threshold is important in order to clip the range of values to be predicted, which eases the DNN task.
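The bounding of Eq. (7) is a simple element-wise clip; in this sketch the default bound of 2 mirrors the [0, 2] output range of the DNN described in Section 3.1:

```python
import numpy as np

def bound_mask(mask, beta=2.0):
    """Eq. (7): clip the unbounded oracle mask to [0, beta]."""
    return np.minimum(mask, beta)
```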
The assessment also validates the usage of a mask derived from the MCLT magnitude on MDCT coefficients. It shows that the mask-based postfilter can operate with or without the internal LTPF, providing a quality improvement over the coded signal in either case. The best performance is observed when the proposed postfilter operates in conjunction with the LTPF.
3 Experimental Setup
3.1 Model
A CNN-based encoder-decoder (CED) architecture, shown in Table 1, is implemented, largely inspired by the model used in [Korse_2020]. The input to the DNN is the MDCT coefficients, 160 each for 5 past frames and 1 current frame. Each layer of the CED uses batch normalization and the ELU (Exponential Linear Unit) activation function. Skip connections are used between encoder and decoder layers, with the required zero-padding inserted to match the frequency-bin dimensions. The output layer uses a sigmoid activation function multiplied by a factor of 2 in order to estimate the real-valued mask in the range [0, 2]. The model is trained with the ADAM optimizer [Kingma2014] with a learning rate of 0.001 and a batch size of 32. Training is run until convergence using early stopping.

Layer name | Input | Hyperparameter* | Output
Reshape    |       |                 |
Conv2d_1   |       | (1,2), 16       |
Conv2d_2   |       | (1,2), 32       |
Conv2d_3   |       | (1,2), 64       |
Conv2d_4   |       | (1,2), 128      |
Deconv2d_1 |       | (1,2), 64       |
Deconv2d_2 |       | (1,2), 32       |
Deconv2d_3 |       | (1,2), 16       |
Deconv2d_4 |       | (1,2), 1        |
Conv2d_5   |       | (1,1), 1        |
Flatten    |       |                 |

*Hyperparameter: kernel size, strides, out-channels
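The effect of the (1, 2) strides in Table 1 on the 160 frequency bins can be traced with a short shape walk-through. This is only a sketch of the dimension arithmetic; it assumes 'same'-style padding and does not restate kernel sizes or channel counts:

```python
def halve(freq_bins):
    """Frequency bins after one stride-(1, 2) layer, assuming 'same' padding."""
    return (freq_bins + 1) // 2

freq_bins = 160            # MDCT coefficients per frame (Sec. 3.1)
encoder_shapes = []
for _ in range(4):         # Conv2d_1 .. Conv2d_4, each with stride (1, 2)
    freq_bins = halve(freq_bins)
    encoder_shapes.append(freq_bins)
# the four Deconv2d layers mirror the encoder, upsampling 10 -> 160,
# which is where the zero-padding for the skip connections comes in
```

The time dimension (6 stacked frames) is left untouched by the stride-1 time axis, so only the frequency axis is repeatedly halved and then restored.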
3.2 Training and Inference
Based on the analysis in Section 2.3, which demonstrated the benefit of the magnitude mask applied directly in the MDCT domain, we propose the training and inference setup shown in Fig. 3. The input to the model is the logarithm of the absolute values of the MDCT coefficients obtained from the core decoding tools of the LC3 decoder. Since the speech signal exhibits temporal dependency, the input to the model contains 5 past frames stacked together with the current frame. The MCLT log-magnitudes required during the training phase are obtained from the coded speech for the enhancement and from the original speech for the loss function. During the training phase, the DNN estimates a magnitude mask that is multiplied with the MCLT of the coded speech for enhancement. The MSE (Mean Squared Error) between the log-magnitudes of the clean-speech MCLT and the enhanced MCLT is used as the loss function for training. During inference, however, the estimated mask is directly applied to the MDCT coefficients, thus making the inference completely independent of the complex-valued transform.
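The training-time objective can be sketched in a few lines of numpy; the log regulariser `eps` is an assumed detail, and in a real training loop the same computation would be expressed in the autodiff framework used for the DNN:

```python
import numpy as np

def training_loss(mask, X_coded_mag, X_clean_mag, eps=1e-9):
    """MSE between log-magnitudes of clean and enhanced MCLT.

    mask        : DNN output in [0, 2]
    X_coded_mag : MCLT magnitude of coded speech
    X_clean_mag : MCLT magnitude of clean speech
    eps is an assumed regulariser to keep the log finite.
    """
    enhanced = mask * X_coded_mag
    err = np.log(enhanced + eps) - np.log(X_clean_mag + eps)
    return np.mean(err ** 2)
```

A perfect mask (the ratio of clean to coded magnitude) drives the loss to zero, while the trivial all-ones mask leaves the coded error in place.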
3.3 Datasets
For both training and testing, files are encoded and decoded with LC3 at 16 kbps with the internal LTPF enabled or disabled. The training is based on the NTT-AT [nttdb:2012] database containing clean speech stereo signals sampled at 48 kHz. It is resampled to 16 kHz and a passive mono downmix is obtained from the stereo files. Out of 3690 files, 3612 files are used for training, 198 files for validation and 150 for testing. The MCLT is computed using the same low-delay window employed in the LC3 codec [schnell2021lc3]. For signals with a sampling rate of 16 kHz, the frame size is 10 ms and there is a lookahead of 2.5 ms. We use the asymmetric low-delay window of LC3 of size 320 samples, obtaining 160 MCLT magnitudes per frame. Inputs to the model are normalised by the mean and standard deviation calculated over the entire dataset.
4 Results
For the assessment of the proposed setup, both subjective and objective tests are conducted. For the objective assessment we use POLQA, whereas for the subjective assessment we follow the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) methodology [MUSHRA]. For a complete performance evaluation, the assessment covers the following configurations:

Coded speech with LC3 at 16 kbps. Both cases of LTPF enabled (On) and LTPF disabled (Off) at the decoder are analysed.

Enhancement of the MDCT coefficients from LC3 at 16 kbps using the proposed postfilter. Both cases of LTPF enabled (On) and LTPF disabled (Off) at the decoder are analysed. No extra delay is introduced over LC3.

The DNN-based postfilter proposed in [Korse_2020], modified to operate up to 8 kHz, working with forward and inverse STFTs using 32 ms frames with 50% overlap, hence operating on 256 frequency bins. This model treats the codec as a black box and takes the decoded time-domain signal from the codec with LTPF enabled as input. An additional delay of 30 ms (3 frames of 10 ms each) is thereby added to the coding scheme.
For the comparison of our proposed method, the STFT-based method serves as the baseline for MDCT enhancement with LTPF, and coded speech with LTPF serves as the baseline for MDCT enhancement without LTPF.
The POLQA scores are calculated and averaged over the 150 test files from the NTT-AT database that were not used during the training or validation phase. The MUSHRA listening test is conducted with 10 items in 5 languages taken from various unseen databases, thus analysing the robustness and generalization capabilities of our proposed method. Both subjective and objective results confirm that our proposed postfilter improves the perceptual quality of coded speech with and without LTPF. In line with the observation made in the oracle experiment described in Section 2.3, the postfilter together with the LTPF provides a substantial improvement. The LTPF attenuates the inter-harmonic noise in the low-frequency regions of the spectrum, whereas the DNN-based postfilter enhances all regions of the spectrum. Thus, when used in conjunction, the two systems provide orthogonal improvements leading to a better enhancement of the speech signal.
The objective and subjective scores differ in their assessment of the quality of speech in the different configurations. The POLQA scores show that the considered baselines are always better than our proposed postfilter, whereas the MUSHRA test shows that our postfilter provides a good improvement and is comparable in performance to the baselines. From the subjective scores we can infer that our postfilter in the MDCT domain is capable of suppressing the quantization noise and generalizes well across different speakers and languages. Moreover, the proposed postfilter does not add any additional delay to the coding scheme and does not require an additional frequency transformation, unlike the baseline STFT-based system.
In terms of complexity, the proposed postfilter has a complexity of 1.3 GFLOPS, similar to the STFT-based system. Although the complexity of our model is 1000 times higher than that of a heuristic postfilter, it is less than half as complex as other generative models [Skoglund2019].

5 Conclusions
We proposed a DNN-based postfilter that estimates an optimal magnitude mask derived from an MCLT but applied in the MDCT domain. This method is highly relevant for transform coding, where the MDCT is commonly used for its favourable properties. Integrating the postfilter into the decoder in the MDCT domain eliminates the need for additional algorithmic delay and operates directly on the quantized coefficients.
Subjective and objective evaluations have demonstrated the effectiveness and robustness of the proposed approach, which can compete with the conventional method of using an additional time-frequency decomposition in a post-processing stage.
Future work can be devoted to exploring the ability of the mask-based approach to enhance signals other than clean speech, such as music and noisy or reverberant speech.