ARMAS: Active Reconstruction of Missing Audio Segments

by Sachin Pokharel, et al. (BTH)
11/21/2021

Abstract

Digital audio signal reconstruction of lost or corrupt segments using deep learning algorithms has been explored intensively in recent years. Nevertheless, prior traditional methods based on linear interpolation, phase coding, and tone insertion are still in use. However, we found no research on the reconstruction of audio signals through the fusion of dithering, steganography, and machine learning regressors. This paper therefore proposes combining steganography, halftoning (dithering), and state-of-the-art shallow (RF: Random Forest; SVR: Support Vector Regression) and deep (LSTM: Long Short-Term Memory) learning methods. The results, including a comparison to the SPAIN and autoregressive methods, are evaluated with four different metrics. The observations from the results show that the proposed solution is effective and can enhance the reconstruction of audio signals by leveraging the side information (a noisy latent representation) that steganography provides. This work may trigger interest in optimizing this approach and/or transferring it to other domains (e.g., image reconstruction).


1 Introduction

Corrupt audio files and lost audio transmissions and signals are severe issues in several audio processing activities, such as audio enhancement and restoration. For example, in different applications and in music enhancement and restoration situations, gaps can occur for several seconds [5]. Audio signal reconstruction remains a fundamental challenge in machine learning and deep learning despite the remarkable recent development of neural networks [5]. The restoration of lost information in audio has been referred to as audio inpainting, audio interpolation/extrapolation, or waveform replacement. The reconstruction aims to provide consistent and relevant information while eliminating audible artifacts, so as to keep the listener unaware of any occurring issues [16]. Active reconstruction can be considered a preemptive security measure that allows for self-healing when part of an audio file becomes corrupted. To this end, and to the best of our knowledge, we found no prior research on the active reconstruction of audio signals through the fusion of steganography (an information hiding technique), halftoning, and machine learning (ML) models. The initial idea (without ML) was proposed in a PhD thesis as an application of steganography. The hiding strategy of steganography can be tailored to act as an intelligent streaming audio/video system that uses techniques to conceal transmission faults, due to lost or delayed packets on wireless networks with bursty arrivals, from the listener, thus providing a disruption-tolerant broadcasting channel [2].

2 Related Work

In the work of Khan et al. [11], a modern neuro-evolution algorithm, the Enhanced Cartesian Genetic Programming Evolved Artificial Neural Network (ECGPANN), was proposed as a predictor of lost signal samples in real time. The authors trained and tested the algorithm on audio speech signals and evaluated it on music signals. A deep neural network (DNN)-based regression method was proposed in [14] for a packet loss concealment (PLC) algorithm to predict a missing frame's characteristics. Two further DNNs were developed for the training stage by integrating the log-power spectra and phases based on unsupervised pre-training and supervised fine-tuning. To reconstruct the missing frames, the algorithm feeds the previous frame's features to the trained DNN. In [5], researchers analyzed audio gaps (500-550 ms) and used Wasserstein Generative Adversarial Network (WGAN) and Dual Discriminator WGAN (D2WGAN) models to reconstruct the lost audio content. In Khan et al. [10], the authors proposed an audio signal reconstruction model called the Cartesian Genetic Programming evolved Artificial Neural Network (CGPANN), which was more efficient than interpolation-extrapolation techniques; the developed model remained robust when recovering signals contaminated with up to 50% noise. In [16], the authors proposed a DNN structure to restore missing audio content based on the audio gaps, where the signals provided to the DNN for the audio gaps were time-frequency coefficients (either complex-valued or magnitude). In the work of Sperschneider et al. [18], the authors presented a delay-less packet-loss concealment (PLC) method for stationary tonal signals, which addresses audio codecs that utilize a modified discrete cosine transform (MDCT). In the case of a frame loss, tonal components are identified using the last two received spectra and their pitch information; the MDCT coefficients of the tonal components are then estimated using phase prediction based on the detected tonal components. Mokrý et al. [17] presented an inpainting algorithm called SPAIN (SParse Audio INpainter), developed by adapting the successful declipping method SPADE [13] to the context of audio inpainting. The authors show that the analysis variant of their algorithm performs best in terms of SNR (signal-to-noise ratio) among sparsity-based methods and reaches results on a par with the state-of-the-art Janssen algorithm [9] (which iteratively fits autoregressive models, using all points before a gap for forward estimation and all points after it for backward estimation) for audio inpainting. Finally, a composite model for audio enhancement that combines the Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models was proposed in [8].

3 Method

The aim of our method is to reconstruct a realistic segment of an audio file containing a corrupted or dropped region using active embedding and machine learning. The experimental sample in this paper is a single audio file taken from "The Sound Archive" data repository (sounds from the Star Wars movies: https://www.thesoundarchive.com/star-wars.asp, last accessed 2021-09-14) as a proof of concept. As the focus of this research is active reconstruction, the audio input is in the .wav file format. The main reason for selecting this format is its sound quality, as it preserves the originality of the analog audio without any compression.
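As a minimal illustration of this setup, the following Python sketch loads a .wav clip and simulates a dropped segment. The file name, gap position, and mono mixdown are illustrative assumptions; only the 4000 ms gap length matches the experiment in Section 4.

```python
# Minimal sketch: load a .wav sample and simulate a dropped segment.
# File name and gap position are hypothetical placeholders.
import numpy as np
from scipy.io import wavfile

rate, audio = wavfile.read("star_wars_clip.wav")  # hypothetical file name
audio = audio.astype(np.float64)
if audio.ndim > 1:                                # mix down stereo to mono
    audio = audio.mean(axis=1)

gap_ms = 4000                                     # large gap, as in Section 4
start = len(audio) // 2                           # arbitrary gap location
gap_len = int(rate * gap_ms / 1000)
corrupted = audio.copy()
corrupted[start:start + gap_len] = 0.0            # the lost segment
```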

3.1 Halftone-based Compression and Reconstruction (HCR)

In the field of bit-rate reduction, or data compression, the Lempel–Ziv (LZ) algorithm and its variant, Lempel–Ziv–Welch (LZW), are popular methods; however, they are slow and prone to failure if data corruption occurs, as in our case where more than 800 samples are deleted from the audio signal. Therefore, a compression method is needed that is more resilient to data corruption while still providing heavy compression and a good approximate reconstruction. The proposed HCR is, hence, meant to exploit halftoning for this purpose. The algorithm this work adapts is that of Floyd and Steinberg, which applies forward error diffusion [7]. The rationale behind conceiving the notion of HCR is that embedding data in the least significant bits (LSB) of a bit-stream necessitates dealing with binary data. Let the original audio sampled data be denoted by the vector $\mathbf{v}$, which is then transformed into a matrix $D$ with suitable dimensions, whose automatic estimation is outside the scope of this work; see Eq. 1.

$D = \operatorname{reshape}(\mathbf{v}, [m, n]), \quad \text{with } m \times n = |\mathbf{v}| \qquad (1)$

The matrix is then passed to the dithering phase using the Floyd–Steinberg algorithm, which results in a binary matrix (as seen in Fig. 1b) that can be partially inverted; this is what enables the heavy compression we obtain. For instance, the original audio sampled data and the corresponding compressed vector pertaining to Fig. 1b (vectorized) show the following properties: original audio (length 1,316,019; size 10.04 MB; type byte) versus its compressed vector (length 1,316,019; size 1.26 MB; type binary).
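For concreteness, a sketch of this dithering step is given below. It assumes the reshaped matrix has first been normalised to [0, 1]; this normalisation is our assumption rather than a stated detail of the paper.

```python
# Sketch of Floyd–Steinberg error diffusion [7] applied to the reshaped
# audio matrix D; input values are assumed normalised to [0, 1].
import numpy as np

def floyd_steinberg(d):
    """Binarise a 2-D array by diffusing the quantisation error forward."""
    d = d.astype(np.float64).copy()
    h, w = d.shape
    for y in range(h):
        for x in range(w):
            old = d[y, x]
            new = 1.0 if old >= 0.5 else 0.0      # quantise to binary
            d[y, x] = new
            err = old - new                       # diffuse the error forward
            if x + 1 < w:
                d[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    d[y + 1, x - 1] += err * 3 / 16
                d[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    d[y + 1, x + 1] += err * 1 / 16
    return d.astype(np.uint8)                     # binary matrix (cf. Fig. 1b)

# Usage: reshape the sampled vector v to a matrix (Eq. 1), then dither:
# D = v.reshape(m, n); halftone = floyd_steinberg(D)
```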

The error diffusion algorithm exploits the optical system of the human eye, which acts as a low-pass filter removing high frequencies, resulting in the illusion of perceiving a dithered (purely binary) image as a continuous-tone image. Hence, it follows that, in order to partially invert the dithering operation, we need to apply a low-pass filter to attenuate the high frequencies; in our case, we choose a 2D Gaussian filter, as in Eq. 2, with a suitable kernel size.

$G(x, y) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} \qquad (2)$

where $\sigma$ is the standard deviation of the Gaussian and x, y are the coordinates of the image D. Recently, the development of deep machine learning has rekindled interest in addressing the inverse halftoning problem via optimization-based filtering [12] [15]. Nevertheless, we opt for the aforementioned simple and computationally inexpensive method.
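The partial inversion can then be sketched with scipy's Gaussian filter; the sigma value below is an assumed parameter, since the kernel settings are not specified here.

```python
# Sketch: approximate inverse halftoning by low-pass filtering (Eq. 2).
# The sigma value is an assumed parameter, not taken from the paper.
import numpy as np
from scipy.ndimage import gaussian_filter

def inverse_halftone(binary_matrix, sigma=2.0):
    """Blur the binary halftone to estimate the continuous-tone matrix."""
    return gaussian_filter(binary_matrix.astype(np.float64), sigma=sigma)

halftone = (np.random.rand(64, 64) > 0.5).astype(np.uint8)  # stand-in input
approx = inverse_halftone(halftone)                         # cf. Fig. 1c
```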

Figure 1: HCR visual inspection: (a) original audio data reshaped using Eq. 1 and visualised, (b) halftone of (a) (a binary image), (c) reconstruction of (a) from (b), and (d) small patches cropped from each image, left to right, respectively.

In order to scrutinize the efficiency of the reconstruction, we plot the original sampled data against the values estimated by the above HCR process (see Fig. 2). Given that the dithered version consists of only binary values (either 0 or 1), the reconstruction still demonstrates a good correlation, R = 0.62. The fitted linear regression model is summarized in Table 1. In Fig. 1c, we observe that the process captures a noisy version of the structure and orientation of the original data shown in Fig. 1a; therefore, when this side information is wedded to machine learning (particularly state-of-the-art models), it improves the quality of the audio the algorithm reconstructs. This observation was the impetus for this study. To gauge the performance improvement, we tested the ML models described in Section 3.2.

Figure 2: A dot plot depicting the correlation between the original data and its estimated version. Note that the reconstruction is made from merely binary values (without ML and without signal drop).
Term          Estimate   SE           tStat    P-value
(Intercept)   0.23791    0.00032051   742.29   0
x1            0.54719    0.00060376   906.3    0

Table 1: Effect estimates and P-values (Wald tests) from fitting a linear regression model to the data plotted in Fig. 2. Number of observations: 1,316,019; error degrees of freedom: 1,316,017; root mean squared error: 0.0527; R-squared: 0.384; adjusted R-squared: 0.384.

3.2 Deployed deep/machine learning architectures

In this section, we list the different machine learning models deployed in this study.

3.2.1 Shallow machine learning

Random Forest (RF) - The RF regressor is a supervised learning technique for regression that employs an ensemble of decision trees. Its most important parameter is the number of trees, which is set to 100 as per the recommendation in the literature [3]. It has been demonstrated that RF is more robust than other similar approaches, being able to handle small sample sizes, high-dimensional data, and nonlinear fitting [4] [6].
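A minimal sketch of how such a regressor might be trained in a sliding-window formulation is shown below; the window length and the stand-in signal are our assumptions, while n_estimators=100 follows the recommendation above.

```python
# Sketch: windowed regression with a Random Forest (n_estimators=100 as
# recommended in [3]); window size and data are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_windows(signal, window=32):
    """Turn a 1-D signal into (past-window, next-sample) training pairs."""
    X = np.stack([signal[i:i + window] for i in range(len(signal) - window)])
    return X, signal[window:]

signal = np.sin(np.linspace(0, 100, 5000))        # stand-in for the HCR vector
X, y = make_windows(signal)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[:4000], y[:4000])                        # cf. the red training segments
pred = rf.predict(X[4000:])                       # cf. the green test segment
```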

Support Vector Regression (SVR) - SVR is an adaptation of Support Vector Machines to regression. Unlike most other regression models, SVR seeks to fit the optimal line within a threshold value (the distance between the hyperplane and the boundary lines) [1].
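The same windowed formulation carries over to SVR, as in the following sketch; the RBF kernel and epsilon value are illustrative defaults, not parameters reported in the paper.

```python
# Sketch: SVR on the same windowed formulation; kernel and epsilon are
# illustrative choices, not values from the paper.
import numpy as np
from sklearn.svm import SVR

window = 32
signal = np.sin(np.linspace(0, 100, 5000))        # stand-in signal
X = np.stack([signal[i:i + window] for i in range(len(signal) - window)])
y = signal[window:]

svr = SVR(kernel="rbf", epsilon=0.01)
svr.fit(X[:4000], y[:4000])
pred = svr.predict(X[4000:])                      # predicted gap samples
```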

3.2.2 Deep machine learning

Long Short-Term Memory (LSTM) - LSTM is a type of recurrent neural network (RNN) model; it is best suited to making predictions from time-series data by learning long-term dependencies [19].
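A small Keras sketch of such a one-step-ahead predictor is given below; the layer width, window length, and training budget are assumptions, as the exact LSTM architecture is not specified here.

```python
# Sketch: one-step-ahead LSTM prediction in Keras; the architecture and
# hyperparameters are assumptions, not the paper's exact configuration.
import numpy as np
import tensorflow as tf

window = 32
signal = np.sin(np.linspace(0, 100, 5000)).astype("float32")
X = np.stack([signal[i:i + window] for i in range(len(signal) - window)])
y = signal[window:]
X = X[..., np.newaxis]                       # shape: (samples, timesteps, 1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:4000], y[:4000], epochs=5, batch_size=64, verbose=0)
pred = model.predict(X[4000:])               # reconstructed segment samples
```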

3.2.3 Training and testing segments

This section briefly discusses the mechanism whereby the three deep/machine learning models are trained and tested. In Fig. 3 (top row), the original audio sampled signal is displayed for comparison. Fig. 3 (second row) shows the stego-audio with the embedded copy and with a simulation of signal loss (i.e., an empty segment). Note that the stego-audio and the original look identical, as all that was flipped is the last LSB value. Subsequently, we extract the hidden data from the LSB plane and, using the same secret key, rearrange the values in the extracted binary vector. The vector is then transformed into a matrix using Eq. (1) (which should correspond to the dithered version), a 2D Gaussian filter is applied using Eq. (2), and the result is vectorised to yield the plot shown in Fig. 3 (third row). Part of this vector is used for training and validation (the red segments), and the tuned model is finally applied to the test data (the green segment) to predict (reconstruct) the lost segment. A sketch of the embedding and extraction step is given after Fig. 3.

Figure 3: Audio track segments to train (red) and test (green) machine learning models.
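As announced above, the following sketch illustrates the LSB embedding and extraction; the seeded permutation standing in for the secret key is a deliberate simplification of the key-based rearrangement described in the text.

```python
# Sketch of the LSB embed/extract step: payload bits are written into the
# least significant bit of 16-bit samples and read back. The seeded
# permutation is a simplified stand-in for the secret-key rearrangement.
import numpy as np

def embed_lsb(samples, payload_bits, key=1234):
    rng = np.random.default_rng(key)
    order = rng.permutation(len(payload_bits))    # key-driven rearrangement
    stego = samples.copy()
    n = len(payload_bits)
    stego[:n] &= ~1                               # clear only the last LSB
    stego[:n] |= payload_bits[order].astype(samples.dtype)
    return stego

def extract_lsb(stego, n_bits, key=1234):
    rng = np.random.default_rng(key)
    order = rng.permutation(n_bits)
    bits = np.empty(n_bits, dtype=np.uint8)
    bits[order] = (stego[:n_bits] & 1).astype(np.uint8)  # undo permutation
    return bits

samples = np.zeros(16, dtype=np.int16)            # stand-in audio samples
payload = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
stego = embed_lsb(samples, payload)
assert np.array_equal(extract_lsb(stego, len(payload)), payload)
```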

Performing global tuning based on locally adaptive learnt statistics was discussed in a previous work though in a different context  [20].

4 Results and Discussion

The audio files of our experiments are all available online (audio clips: https://ardisdataset.github.io/ARMAS/). The performance of the models applied in this research is evaluated using correlation (Corr), Peak Signal-to-Noise Ratio (PSNR), Weighted Peak Signal-to-Noise Ratio (WPSNR), and the Structural Similarity Index Measure (SSIM). The comparison is performed between the original and the reconstructed audio signals. In this experiment, an audio gap of 4000 ms in length was considered a large drop in the audio signal. After extracting the required training data from the sequence (see Fig. 3), the data was passed to the RF, SVR, and LSTM models for training. The test set was the hidden data embedded in the stego-audio. The reconstructed signals were evaluated against the original signal (which acts as the ground truth for validation) by calculating the statistical metrics listed above. A quantitative assessment is given in Table 2 (only the reconstructed gap is measured), while a qualitative evaluation can be inferred from Fig. 4. Comparison to the state-of-the-art audio inpainting algorithm is restricted to very short gaps (up to 45 ms), since these algorithms are unable to handle the gap range in our study; Table 3 compares the performance on this range. We also observed good performance of the proposed approach for longer gap reconstruction (8000 ms), but due to the page limit we could not report it here (nevertheless, readers can find the results of both experiments at the URL above).
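For reference, a sketch of how three of the four metrics can be computed on the reconstructed gap is given below; WPSNR is omitted because its weighting scheme is not detailed here, and applying scikit-image's functions directly to 1-D segments is our assumption.

```python
# Sketch: Corr, PSNR, and SSIM between the original and reconstructed gap.
# WPSNR is omitted since its weighting scheme is not detailed in the text.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(original, reconstructed):
    data_range = original.max() - original.min()
    corr = np.corrcoef(original, reconstructed)[0, 1]
    psnr = peak_signal_noise_ratio(original, reconstructed,
                                   data_range=data_range)
    ssim = structural_similarity(original, reconstructed,
                                 data_range=data_range)
    return corr, psnr, ssim
```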

Figure 4: Reconstruction of a short audio signal using RF, SVR, and LSTM.
Model            Corr     PSNR      WPSNR     SSIM
HCR (Baseline)   0.5073   20.7395   34.4945   0.4972
HCR-RF           0.6281   24.6815   38.1425   0.6430
HCR-SVR          0.5073   22.5008   36.4070   0.5267
HCR-LSTM         0.6246   24.6223   38.0182   0.6528
AutoREGR [9]     0.0941   22.6077   35.6863   0.6143
SPAIN [17]       -        -         -         -

Table 2: Performance of the ML models on the reconstruction of a dropped signal (4000 ms); only the reconstructed gap is measured. No values are reported for SPAIN [17], whose performance deteriorates when handling gaps longer than 45 ms.
Model            Corr      PSNR      WPSNR     SSIM
HCR (Baseline)   0.0483    20.0771   33.5243   0.4861
HCR-RF           0.1267    26.6411   40.1574   0.6789
HCR-SVR          0.0483    22.2346   35.8378   0.5286
HCR-LSTM         0.1237    26.9244   40.3381   0.6971
SPAIN [17]       -0.0494   5.7143    18.4342   0.0274

Table 3: Performance of the ML models on the reconstruction of a very short gap (36 ms); only the reconstructed gap is measured, to comply with the algorithm of Mokrý et al. [17], which is optimized for very short gaps.

5 Conclusions

This paper proposes the fusion of audio dither-based steganography with machine/deep learning for the active reconstruction of lost signals. The results show that the proposed solution is feasible and can enhance the reconstruction of lost audio signals (readers may wish to listen to the audio online; see the URL in Section 4). We conducted experiments on two types of signal drops, of 4000 ms and 8000 ms length (the latter not shown due to the page limit). As a proof of concept, we can assert that, in general, the LSTM and RF models are good models to utilize. Our approach is not meant to replace current audio inpainting methods, but rather to assist them by providing latent side information. It can also benefit security systems in protecting audio files from unauthorised manipulation. The contribution of this work is twofold: (i) a halftone-based compression and reconstruction (HCR) scheme, and (ii) the orchestration of three scientific disciplines: steganography, compression, and audio processing. To the best of our knowledge, there is no similar implementation in the literature for the reconstruction of missing audio segments. Thus, we conclude this paper by stating that the fusion of steganography and state-of-the-art machine learning algorithms can be considered for the active reconstruction of audio signals; however, there is room for enhancement, as pointed out in the discussion section.

References

  • [1] E.Y. Boateng, J. Otoo, and D.A. Abaye (2020) Basic tenets of classification algorithms k-nearest-neighbor, support vector machine, random forest and neural network: a review. Journal of Data Analysis and Information Processing 8 (4), pp. 341–357. External Links: Document Cited by: §3.2.1.
  • [2] A. Cheddad (2009) Steganoflage: a new image steganography algorithm. Ph.D. Thesis, University of Ulster. Cited by: §1.
  • [3] S.K. Dasari, A. Cheddad, and P. Andersson (2019) Random forest surrogate models to support design space exploration in aerospace use-case. In IFIP Advances in Information and Communication Technology, pp. 532–544. External Links: Document Cited by: §3.2.1.
  • [4] S.K. Dasari, A. Cheddad, and P. Andersson (2020-01) Predictive modelling to support sensitivity analysis for robust design in aerospace engineering. Structural and Multidisciplinary Optimization 61 (5), pp. 2177–2192. External Links: Document Cited by: §3.2.1.
  • [5] P. P. Ebner and A. Eltelt (2020-03) Audio inpainting with generative adversarial network. Cited by: §1, §2.
  • [6] R. Espinosa, J. Palma, F. Jiménez, J. Kamińska, G. Sciavicco, and E. Lucena-Sánchez (2021-12) A time series forecasting based multi-criteria methodology for air quality prediction. Applied Soft Computing 113, pp. 107850. External Links: Document Cited by: §3.2.1.
  • [7] R.W. Floyd and L. Steinberg (1976) An adaptive algorithm for spatial greyscale. In Proceedings of the Society for Information Display, Vol. 17, pp. 75–77. Cited by: §3.1.
  • [8] M. Hasannezhad, W.P. Zhu, and B. Champagne (2021) A novel low-complexity attention-driven composite model for speech enhancement. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. External Links: Document Cited by: §2.
  • [9] A. Janssen, R. Veldhuis, and L. Vries (1986) Adaptive interpolation of discrete-time signals that can be modeled as autoregressive processes. IEEE Transactions on Acoustics, Speech, and Signal Processing 34 (2), pp. 317–330. External Links: Document Cited by: §2, Table 2.
  • [10] N.M. Khan and G.M. Khan (2017) Audio signal reconstruction using cartesian genetic programming evolved artificial neural network (cgpann). In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 568–573. External Links: Document Cited by: §2.
  • [11] N.M. Khan and G.M. Khan (2020) Real-time lossy audio signal reconstruction using novel sliding based multi-instance linear regression/random forest and enhanced cgpann. Neural Processing Letters 53 (1), pp. 227–255. External Links: Document Cited by: §2.
  • [12] R.H. Kim and S. Park (2018) Deep context-aware descreening and rescreening of halftone images. ACM Transactions on Graphics 37 (4), pp. 1–12. External Links: Document Cited by: §3.1.
  • [13] S. Kitić, N. Bertin, and R. Gribonval (2015-06) Sparsity and cosparsity for audio declipping: a flexible non-convex approach. In Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation, LNCS, Vol. 9237, Springer. External Links: Document Cited by: §2.
  • [14] B.K. Lee and J.H. Chang (2016) Packet loss concealment based on deep neural networks for digital speech transmission. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (2), pp. 378–387. External Links: Document Cited by: §2.
  • [15] Y. Li, J.B. Huang, N. Ahuja, and M. Yang (2016) Deep joint image filtering. In Computer Vision – ECCV 2016, pp. 154–169. External Links: Document Cited by: §3.1.
  • [16] A. Marafioti, N. Perraudin, N. Holighaus, and P. Majdak (2019) A context encoder for audio inpainting. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2362–2372. External Links: Document Cited by: §1, §2.
  • [17] O. Mokrý, P. Záviška, P. Rajmic, and V. Veselý (2019) Introducing spain (sparse audio inpainter). In 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5. External Links: Document Cited by: §2, Table 2, Table 3.
  • [18] R. Sperschneider, J. Sukowski, and G. Marković (2015) Delay-less frequency domain packet-loss concealment for tonal audio signals. In 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 766–770. External Links: Document Cited by: §2.
  • [19] L. Sun, J. Du, L.R. Dai, and C.H. Lee (2017) Multiple-target deep learning for lstm-rnn based speech enhancement. In 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), pp. 136–140. External Links: Document Cited by: §3.2.2.
  • [20] P. Yogarajah, J. Condell, K. Curran, P. McKevitt, and A. Cheddad (2012) A dynamic threshold approach for skin tone detection in colour images. International Journal of Biometrics 4 (1), pp. 38. External Links: Document Cited by: §3.2.3.