
TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining

We introduce a block-online variant of the temporal feature-wise linear modulation (TFiLM) model for bandwidth extension. The proposed architecture simplifies the UNet backbone of TFiLM to reduce inference time and employs an efficient transformer at the bottleneck to alleviate performance degradation. We also utilize self-supervised pretraining and data augmentation to enhance the quality of bandwidth-extended signals and reduce sensitivity to downsampling methods. Experimental results on the VCTK dataset show that the proposed method outperforms several recent baselines in terms of spectral distance and source-to-distortion ratio. Pretraining and filter augmentation also help stabilize and enhance the overall performance.





1 Introduction

Bandwidth extension (BWE), or audio super-resolution, enhances speech by generating a wideband (WB) signal from a narrowband (NB) signal. The NB signal is usually sampled at 8 kHz or below, resulting in low auditory quality. Such sampling rates are widely used in the G.711, G.729, and AMR audio codecs because of their streaming efficiency. Including a BWE module at the receiver side therefore improves audio fidelity.

Compared to conventional BWE approaches such as [13, 12, 5], recent end-to-end deep neural networks generate WB signals directly from NB signals without the need for feature engineering. For instance, inspired by the well-known UNet architecture [14] in image processing, AudioUNet [6] is a wave-to-wave BWE model that has outperformed traditional methods. In [2], the limitation of convolution on long-range dependency modeling in UNet is addressed by introducing the TFiLM layer, which modulates blocks of convolutional feature maps with information learned by recurrent layers. Generative models such as the NU-Wave [7] neural vocoder rely on conditional diffusion models with a modified noise level embedding and a local conditioner. On the other hand, WSRGlow [22] models the distribution of the output conditioned on the input using normalizing flows.


Figure 1: TUNet architecture for speech enhancement. The encoder downsamples the waveform input while the decoder does the reverse. A Transformer block is placed in the middle to model the attention of the bottleneck. [Diagram omitted; layer labels: encoder Conv(C=64, K=66), Conv(C=128, K=18), Conv(C=256, K=8); decoder TConv(C=128, K=8), TConv(C=64, K=18), TConv(C=1, K=66).]

While convolutional neural network architectures exhibit promising results for end-to-end BWE training, their effectiveness in long-range dependency modeling is still limited by the receptive fields of convolutions [8]. Stacking more convolution layers would expand the receptive field at the expense of increased computation. In addition, training end-to-end BWE models requires high-rate target signals, making valuable low-rate data collected from 8-kHz telephony infrastructure unusable. It has also been observed that BWE models are susceptible to low-pass filtering [2, 16], generating severe distortion at the transition band of the anti-aliasing filter. This problem can be mitigated by data augmentation [16].

We propose a Transformer-aided UNet (TUNet) that employs a low-complexity transformer encoder at the bottleneck of a lightweight UNet. Here, the Transformer assists the small UNet with its captured global dependencies, while the UNet efficiently downsamples the waveform input with strided convolutions to reduce the computation the Transformer must perform. In addition, inspired by masked language modeling in natural language processing [4], we propose masked speech modeling, a self-supervised representation learning scheme that reconstructs original signals from masked signals. The advantage of this pretraining is that it requires only low-rate data and thus makes full use of telephony databases, allowing the model to learn the underlying statistics of the low-band speech and generalize to downstream tasks [1]. Finally, similar to [16], we make our model robust to downsampling methods by generating training data with different parameter sets of the Chebyshev Type I filter.

2 Review of TFiLM-UNet and proposed TUNet algorithm

2.1 TFiLM-UNet baseline

TFiLM-UNet is an offline UNet-based audio super-resolution model [2]. To assist convolution layers in capturing long-range information, the Temporal Feature-Wise Linear Modulation (TFiLM) layer was proposed. It acts as a normalization layer combining max pooling and long short-term memory (LSTM): while max pooling reduces the temporal dimension into blocks, LSTMs refine the convolutional feature maps with the long-range dependencies they capture.

In the TFiLM-UNet model, the encoder contains four downsampling (D) blocks, each comprising a convolution layer, a max-pooling layer, ReLU activation, and a TFiLM layer, in that order. In the decoder, upsampling (U) blocks apply convolution, dropout, ReLU, DimShuffle, and TFiLM in sequence, where the DimShuffle layer doubles the time dimension by reshaping the feature maps. Stacking and additive skip connections are applied between D/U blocks and between input and output, respectively.

2.2 Lightweight UNet with Transformer

With reference to Fig. 1, our proposed model follows the same waveform-to-waveform UNet structure as TFiLM-UNet. Unlike TFiLM-UNet, the proposed model is significantly smaller due to fewer convolution filters and higher dimensionality reduction rates. Precisely, the encoder consists of three strided 1D convolution layers, whose filter counts and kernel sizes are given in Fig. 1. The stride of all these layers is set to 4, making the time dimension of the bottleneck 64 times shorter than the input length. Consequently, the bottleneck features can be processed efficiently by the subsequent Performer [3] blocks. We employ Performers since their self-attention mechanism has linear time complexity, compared to the quadratic complexity of conventional attention [17]. On the decoder side, three transposed 1D convolution layers matching the downsampling rates of the encoder generate output signals of the same length as the input. We use Tanh activation for the last transposed convolution and LeakyReLU [10] for the rest. TFiLM layers are applied after the convolution layers, except for the last encoder layer, which is replaced by the Performer blocks. To smooth the loss landscape [18], skip connections from the TFiLM encoder layers to the corresponding decoder layers are employed.

Compared to TFiLM-UNet, our model has four key differences: i) our encoder and decoder require one fewer layer and four times fewer filters than the baseline; ii) each encoding layer reduces the time dimension by a factor of four instead of two, compressing the input quickly; iii) we replace the downsampling and upsampling blocks in TFiLM with strided convolution and transposed convolution layers, respectively; and iv) instead of the stacking skip connections in TFiLM, we employ additive skip connections, which further reduce the number of parameters in the decoder. These modifications make our model significantly lighter than the baseline while preserving its learning capability.
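The time-dimension bookkeeping above can be sketched in a few lines: three stride-4 convolutions shrink the input 64x, and three matching stride-4 transposed convolutions restore the original length. This is only shape arithmetic, not the model itself, and it assumes padding is chosen so each layer divides the length exactly by its stride.

```python
def encoder_lengths(n_in, strides=(4, 4, 4)):
    """Time dimension after each strided conv, assuming padding keeps
    the length exactly divisible by the stride."""
    lengths, n = [], n_in
    for s in strides:
        n //= s
        lengths.append(n)
    return lengths

def decoder_length(n_bottleneck, strides=(4, 4, 4)):
    """Transposed convs multiply the time dimension back up."""
    n = n_bottleneck
    for s in strides:
        n *= s
    return n

lengths = encoder_lengths(8192)              # [2048, 512, 128]
assert lengths[-1] == 8192 // 64             # bottleneck is 64x shorter
assert decoder_length(lengths[-1]) == 8192   # output matches input length
```

With an 8192-sample input, the bottleneck sequence seen by the Performer blocks is only 128 steps long, which is what keeps the attention cost small.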

2.3 Masked speech modeling

Figure 2: Masked speech modeling pretraining pipeline.

We propose masked speech modeling (MSM) pretraining as illustrated in Fig. 2. Since audio signals have fine-grained temporal structure, instead of masking the sequence at the sample level, we mask 20% of 256-sample blocks to create the masked input. The model then minimizes the mean squared error between its output and the original (unmasked) input. Compared to the masked reconstruction pretraining in [19], both the encoder and the decoder are pretrained in our approach.
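A minimal sketch of the block-masking step follows. Only the block size (256 samples) and ratio (20%) come from the text; masking with zeros and uniform random block selection are assumptions.

```python
import numpy as np

def mask_blocks(x, block_size=256, mask_ratio=0.2, rng=None):
    """Zero out a random 20% of 256-sample blocks (assumed mask value)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x_masked = x.copy()
    n_blocks = len(x) // block_size
    n_masked = int(round(mask_ratio * n_blocks))
    masked_idx = rng.choice(n_blocks, size=n_masked, replace=False)
    for i in masked_idx:
        x_masked[i * block_size:(i + 1) * block_size] = 0.0
    return x_masked, masked_idx

x = np.random.default_rng(1).standard_normal(8192).astype(np.float32)
x_masked, idx = mask_blocks(x)
assert len(idx) == round(0.2 * (8192 // 256))  # 6 of the 32 blocks masked
```

During pretraining, the model would receive `x_masked` and be trained with MSE against the original `x`.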

2.4 Improving robustness to downsampling methods by augmentation

The performance of BWE models is highly sensitive to the anti-aliasing filter when the downsampling method used in testing differs from that used in training [6, 2, 16]. Similar to [16], to improve the robustness of our model, we generate the low-rate signals by downsampling the high-rate speech dataset with random anti-aliasing filters. More specifically, we adopt the Chebyshev Type I anti-aliasing filter and randomize its ripple and order parameters. This creates variations in the transition band of the anti-aliasing filter.
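The augmentation can be sketched with SciPy. The paper specifies only that the ripple and order of a Chebyshev Type I filter are randomized; the parameter ranges below, the zero-phase filtering, and the plain decimation step are assumptions for illustration.

```python
import numpy as np
from scipy.signal import cheby1, filtfilt

def random_downsample(x, sr=16000, cutoff=4000, rng=None):
    """Apply a randomized Chebyshev Type I low-pass filter, then
    decimate 16 kHz -> 8 kHz."""
    rng = rng if rng is not None else np.random.default_rng(0)
    order = int(rng.integers(2, 11))          # assumed range of filter orders
    ripple_db = float(rng.uniform(0.05, 1))   # assumed passband ripple range
    b, a = cheby1(order, ripple_db, cutoff, btype="low", fs=sr)
    filtered = filtfilt(b, a, x)              # zero-phase anti-aliasing filter
    return filtered[::2]

x = np.random.default_rng(1).standard_normal(16000)
y = random_downsample(x)
assert len(y) == 8000 and np.all(np.isfinite(y))
```

Redrawing the filter parameters for every training example varies the transition band, which is what makes the trained model less sensitive to the particular low-pass filter seen at test time.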

2.5 Learning objectives

Since the mean squared error (MSE) loss may not guarantee good perceptual quality [11], we combine the MSE loss with a multi-resolution short-time Fourier transform (STFT) loss [21] in the Mel scale. Given a reconstructed signal ŷ and a target signal y, the training loss is given by

L(ŷ, y) = α · L_MSE(ŷ, y) + L_MR(ŷ, y),

where α denotes the weight of the MSE loss, and L_MR is the multi-resolution STFT loss (MR loss).
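A simplified sketch of this combined objective is below. The paper computes the multi-resolution STFT loss in the Mel scale (via the auraloss library); here plain log-magnitude spectra are used to keep the example dependency-free, and the STFT resolutions and the weight `alpha` are placeholders, not the paper's settings.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window (no padding at the edges)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def mr_stft_loss(y_hat, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average log-magnitude L1 distance over several STFT resolutions."""
    total = 0.0
    for n_fft, hop in resolutions:
        a = stft_mag(y_hat, n_fft, hop)
        b = stft_mag(y, n_fft, hop)
        total += np.mean(np.abs(np.log(a + 1e-7) - np.log(b + 1e-7)))
    return total / len(resolutions)

def training_loss(y_hat, y, alpha=1.0):  # alpha: MSE weight (placeholder)
    return alpha * np.mean((y_hat - y) ** 2) + mr_stft_loss(y_hat, y)

y = np.random.default_rng(0).standard_normal(4096)
assert training_loss(y, y) == 0.0       # perfect reconstruction costs nothing
assert training_loss(y + 0.1, y) > 0.0  # any error is penalized
```

Using several FFT sizes penalizes spectral errors at multiple time-frequency trade-offs, so the model cannot overfit to a single analysis resolution.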

3 Experiments

3.1 Setup

We focus on extending the 4-kHz bandwidth (8 kHz sampling rate) to the 8-kHz bandwidth (16 kHz sampling rate). Training data was segmented into smaller chunks with a window size of 8192 samples and 50% overlap. We used the VCTK Corpus [20] for training and testing. This dataset includes 109 English speakers; recordings of the first 100 speakers were used for training and the remaining ones for testing.
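The chunking described above (8192-sample windows, 50% overlap) can be written as:

```python
import numpy as np

def segment(x, win=8192, overlap=0.5):
    """Split a waveform into overlapping fixed-size training chunks."""
    hop = int(win * (1 - overlap))
    starts = range(0, len(x) - win + 1, hop)
    return np.stack([x[i:i + win] for i in starts])

x = np.arange(8192 * 3, dtype=np.float32)
chunks = segment(x)
assert chunks.shape == (5, 8192)   # hop of 4096 yields 5 chunks
assert chunks[1][0] == 4096        # consecutive chunks overlap by 50%
```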

Besides VCTK, we further used the VIVOS dataset [9] to verify the effectiveness of our pretraining approach. This dataset consists of 15-hour speech recordings from 65 Vietnamese speakers, recorded in a quiet environment with high-quality microphones. We followed the dataset’s default split: 46 speakers for training, 19 speakers for testing.

To evaluate the quality of the generated audio, we used three objective metrics: log-spectral distance (LSD), high-frequency log-spectral distance (LSD-HF), and scale-invariant source-to-distortion ratio (SI-SDR) [15]. LSD-HF computes LSD only on the high-frequency band, i.e., 4-8 kHz. As opposed to LSD, LSD-HF focuses only on the regeneration of the high-band spectrum and ignores artifacts or distortions in the low-band spectrum. A lower LSD/LSD-HF score implies a spectrum more similar to the target, while a higher SI-SDR score indicates better performance.
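For concreteness, minimal reference implementations of two of these metrics are sketched below. The FFT size and hop length for LSD are assumptions, since the paper does not state its analysis parameters.

```python
import numpy as np

def lsd(y_hat, y, n_fft=2048, hop=512, eps=1e-10):
    """Log-spectral distance: RMS of log power-spectrum differences
    per frame, averaged over frames."""
    def power_spec(x):
        win = np.hanning(n_fft)
        frames = [x[i:i + n_fft] * win
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    p, q = power_spec(y_hat) + eps, power_spec(y) + eps
    return np.mean(np.sqrt(np.mean((np.log10(p) - np.log10(q)) ** 2, axis=1)))

def si_sdr(y_hat, y, eps=1e-10):
    """Scale-invariant source-to-distortion ratio in dB [15]."""
    y_hat, y = y_hat - y_hat.mean(), y - y.mean()
    s = (np.dot(y_hat, y) / (np.dot(y, y) + eps)) * y  # scaled target
    e = y_hat - s                                      # residual distortion
    return 10 * np.log10((np.dot(s, s) + eps) / (np.dot(e, e) + eps))

y = np.random.default_rng(0).standard_normal(8192)
assert lsd(y, y) == 0.0
assert si_sdr(y, y) > 50.0  # near-perfect reconstruction gives very high SDR
```

LSD-HF and LSD-LF follow the same recipe, restricted to the frequency bins above or below 4 kHz, respectively.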

The hyperparameters of our model are given in Fig. 1. The Performer block has three hidden layers with two attention heads each, and each head has dimension 32; the local window length equals the bottleneck length divided by 8. Hyperparameters of the MR loss, such as the STFT resolutions, were set to the default values of the auraloss v2.0.1 library. The MSE weight was set to . We trained our models for 150 epochs using the Adam optimizer, learning rate , with 800 samples in each batch. For the baseline TFiLM-UNet model, while an official implementation is available, we adopted an unofficial implementation that reportedly produces slightly better results and trains much faster.

3.2 Performance comparison with baselines

We compared our model’s performance and inference speed with TFiLM-UNet and two recent generative models, NU-Wave [7] and WSRGlow [22]. All baselines were trained on the VCTK dataset, with low-rate data generated from the 16-kHz data using a single 8th-order Chebyshev Type I low-pass filter. In this experiment, MSM pretraining was excluded from our method.

Figure 3: Objective scores of the baselines and our model. Lower LSD/LSD-HF is better, and higher SI-SDR is better.

Results in Fig. 3 show that our TUNet model significantly outperformed all the baselines, except that its SI-SDR improvement over TFiLM-UNet was modest. This modest improvement is due to speech energy being concentrated in the low-frequency bands. Compared to our TUNet, the WSRGlow model achieves comparable LSD-HF scores but a notably worse LSD, indicating that our model better preserves low frequencies.

System       #Params  Inference time (ms)
WSRGlow      229M     3146
TFiLM-UNet   68.2M    1335
NU-Wave      3M       398 (1 iter)
TUNet        2.9M     22.6
Table 1: Model size and inference time on a single CPU core.

We measured single-threaded inference time on an AMD EPYC 7742 CPU using the ONNX inference engine. Our proposed model was significantly faster and more lightweight than the others. As shown in Table 1, TUNet requires only 22.63 ms to process a single 512-ms audio frame, while WSRGlow, TFiLM-UNet, and one inference step of NU-Wave took approximately 139, 59, and 17 times longer, respectively. Assuming 87.5% overlap between consecutive audio chunks, a new block arrives every 64 ms for a chunk size of 8192 samples at a sampling rate of 16 kHz. Since our inference time is shorter than 64 ms, the proposed method is suited for semi-real-time applications, unlike the baselines.
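The real-time budget above is a short calculation: with 87.5% overlap between 8192-sample chunks at 16 kHz, only 12.5% of each chunk is new audio, so a new block arrives every 64 ms.

```python
chunk_size = 8192
sample_rate = 16000
overlap = 0.875

hop_samples = int(chunk_size * (1 - overlap))  # new samples per block
hop_ms = 1000 * hop_samples / sample_rate      # time between new blocks

assert hop_samples == 1024
assert hop_ms == 64.0
assert 22.63 < hop_ms  # TUNet's inference time fits within the budget
```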

3.3 Ablation studies

To study the effects of its two main components, the TFiLM layers and the Performer blocks, we created three variants of TUNet: ‘No Transformer’, TUNet without Performer blocks at the bottleneck; ‘LSTMs bottleneck’, TUNet with the Transformer bottleneck replaced by a 3-layer, 256-unit (same as the Transformer) LSTM network; and ‘No TFiLM’, TUNet without TFiLM layers.

As shown in Table 2, both the Performer and TFiLM layers play significant roles in the proposed model, since excluding either component led to noticeably lower scores on all metrics. The ‘No Transformer’ model, which excluded the Transformer from the bottleneck, performed worst in terms of LSD and SI-SDR, and its performance improved only by a small margin when an LSTM bottleneck was added instead. Removing TFiLM also led to a significant degradation, though less severe than removing the Transformer.

Model             LSD ↓  LSD-HF ↓  SI-SDR ↑
No Transformer    1.45   2.64      21.61
LSTMs bottleneck  1.44   2.70      21.76
No TFiLM          1.44   2.69      21.89
TUNet             1.36   2.54      21.91
Table 2: Effectiveness of components of our model.

To determine the effectiveness of MSM pretraining, we pretrained TUNet on VCTK low-rate data with the pipeline described in Section 2.3 and subsequently trained it on the BWE task on the VCTK dataset. In this experiment, we used only the single anti-aliasing filter from Section 3.2 to generate training data. To assess the generalization ability of MSM, we included an additional scenario where the pretraining dataset is VCTK but the BWE training and test sets are in a different language. We adopted one more metric, the low-frequency log-spectral distance (LSD-LF), to measure the approximation error in the low band (0-4 kHz) caused by MSM pretraining.

Results in Table 3 show that models pretrained with MSM achieve significant improvements on the spectral metrics, while the SI-SDR gains were modest. The scores indicate that the pretraining scheme not only enhanced high frequencies but also helped preserve low frequencies. Furthermore, the performance gain on VIVOS was consistent with that on VCTK, implying that the BWE model adapted well to the VIVOS dataset even though it was pretrained on a different language.

Dataset  Model        LSD ↓  LSD-HF ↓  LSD-LF ↓  SI-SDR ↑
VCTK     input        4.75   8.27      1.23      20.32
         w/o MSM      1.36   2.54      0.18      21.69
         MSM on VCTK  1.28   2.45      0.11      22.08
VIVOS    input        5.59   9.79      1.39      21.75
         w/o MSM      1.36   2.49      0.23      25.08
         MSM on VCTK  1.29   2.42      0.16      26.15
Table 3: BWE results on the VCTK and VIVOS datasets when employing MSM pretraining.

We next assessed the sensitivity of our models to anti-aliasing filters when trained with and without filter augmentation. The first model, ‘Single Cheby’, is the best model obtained from the above experiments, trained with a single Chebyshev Type I anti-aliasing filter. The other, ‘Multi-Cheby’, was trained with a set of random filters as described in Section 2.4. Both models employed the same MSM pretraining as above. The BWE dataset for this experiment was the VIVOS dataset. The test set was downsampled using all resampling methods available in the resampy library. However, due to space constraints, we report only the results on test sets generated by single/multiple Chebyshev filters (matching the training of ‘Single Cheby’ and ‘Multi-Cheby’, respectively), Kaiser filters (‘best’ and ‘fast’ variants), and sinc downsampling.

Figure 4: LSD scores of our models trained with a single and multiple anti-aliasing filter(s) on the VIVOS test set.

As shown in Fig. 4, the ‘Single Cheby’ model achieved the best score when evaluated with the same filter it was trained on. Although this model performed well on several downsampling methods such as ‘kaiser_fast’, its performance degraded significantly on test sets processed by other downsampling methods such as the sinc algorithm. In contrast, ‘Multi-Cheby’ showed stable performance across all the methods.

4 Conclusions

We have proposed a Transformer-aided UNet for bandwidth extension. Despite its strong performance, our model remains lightweight and fast at inference. By leveraging only narrowband audio data for pretraining, we achieved an overall improvement in performance. With multiple anti-aliasing filters applied during training, the model is robust to different low-pass filters, an essential property for real-world applications.


  • [1] A. Baevski, M. Auli, and A. Mohamed (2019) Effectiveness of self-supervised pre-training for speech recognition. arXiv:1911.03912. Cited by: §1.
  • [2] S. Birnbaum, V. Kuleshov, S. Z. Enam, P. W. Koh, and S. Ermon (2019) Temporal FiLM: capturing long-range sequence dependencies with feature-wise modulations. In Proc. Neural Inf. Process. Syst. Cited by: §1, §2.1, §2.4.
  • [3] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller (2021) Rethinking attention with Performers. In Proc. Int. Conf. Learn. Representations. Cited by: §2.2.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. North Amer. Chapter Assoc. Comput. Linguistics. Cited by: §1.
  • [5] P. Jax and P. Vary (2003) On artificial bandwidth extension of telephone speech. Signal Processing 83. Cited by: §1.
  • [6] V. Kuleshov, S. Z. Enam, and S. Ermon (2017) Audio super resolution using neural networks. In Int. Conf. Learn. Representations, Workshop Track. Cited by: §1, §2.4.
  • [7] J. Lee and S. Han (2021) NU-Wave: a diffusion probabilistic model for neural audio upsampling. In Proc. Interspeech. Cited by: §1, §3.2.
  • [8] D. Linsley, J. Kim, V. Veerabadran, C. Windolf, and T. Serre (2018) Learning long-range spatial dependencies with horizontal gated recurrent units. In Proc. Neural Inf. Process. Syst., pp. 152–164. Cited by: §1.
  • [9] H. T. Luong and H. Q. Vu (2016) A non-expert Kaldi recipe for Vietnamese speech recognition system. In WLSI/OIAF4HLT@COLING. Cited by: §3.1.
  • [10] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learn. for Audio, Speech and Lang. Process. Cited by: §2.2.
  • [11] J. Martín-Doñas, A. Gomez, J. Gonzalez Lopez, and A. Peinado (2018) A deep learning loss function based on the perceptual evaluation of the speech quality. IEEE Signal Process. Lett. Cited by: §2.5.
  • [12] A. H. Nour-Eldin and P. Kabal (2008) Mel-frequency cepstral coefficient-based bandwidth extension of narrowband speech. In Proc. Interspeech. Cited by: §1.
  • [13] Y. Qian and P. Kabal (2002) Wideband speech recovery from narrowband speech using classified codebook mapping. In Proc. Australian Int. Conf. Speech Sci., Technol. (Melbourne). Cited by: §1.
  • [14] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. Med. Image Comput. Comput. Assist. Interv. Cited by: §1.
  • [15] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR – half-baked or well done?. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Cited by: §3.1.
  • [16] S. Sulun and M. E. P. Davies (2021) On filter generalization for music bandwidth extension using deep neural networks. IEEE J. of Sel. Topics in Signal Process. 15 (1). Cited by: §1, §2.4.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. Neural Inf. Process. Syst. Cited by: §2.2.
  • [18] L. Wang, B. Shen, N. Zhao, and Z. Zhang (2020) Is the skip connection provable to reform the neural network loss landscape?. In Proc. Int. Joint Conf. Artif. Intell. Cited by: §2.2.
  • [19] W. Wang, Q. Tang, and K. Livescu (2020) Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Cited by: §2.3.
  • [20] J. Yamagishi, C. Veaux, and K. MacDonald (2019) CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). University of Edinburgh, The Centre for Speech Technology Research (CSTR). Cited by: §3.1.
  • [21] R. Yamamoto, E. Song, and J. Kim (2020) Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Cited by: §2.5.
  • [22] K. Zhang, Y. Ren, C. Xu, and Z. Zhao (2021) WSRGlow: a Glow-based waveform generative model for audio super-resolution. In Proc. Interspeech. Cited by: §1, §3.2.