Corrupt audio files and lost audio transmissions are severe issues in several audio processing activities such as audio enhancement and restoration. For example, in music enhancement and restoration scenarios, gaps can occur for several seconds. Audio signal reconstruction remains a fundamental challenge in machine learning and deep learning despite the remarkable recent development of neural networks. The restoration of lost information in audio has been referred to as audio inpainting, audio interpolation/extrapolation, or waveform replacement. The reconstruction aims to provide coherent and relevant information while eliminating audible artifacts, so that the listener remains unaware of any occurring issues. Active reconstruction can be considered a preemptive security measure that allows for self-healing when part of an audio file becomes corrupted. To this end, and to the best of our knowledge, we found no prior research on the active reconstruction of audio signals through the fusion of steganography (an information hiding technique), halftoning, and machine learning (ML) models. The initial idea (without ML) was proposed in a PhD thesis as an application of steganography. The hiding strategy of steganography can be tailored to act as an intelligent streaming audio/video system that conceals transmission faults, caused by lost or delayed packets on wireless networks with bursty arrivals, from the listener, thus providing a disruption-tolerant broadcasting channel.
2 Related Work
In the work of Khan et al., a modern neuro-evolution algorithm, the Enhanced Cartesian Genetic Programming Evolved Artificial Neural Network (ECGPANN), was proposed as a real-time predictor of lost signal samples. The authors trained and tested the algorithm on audio speech data and evaluated it on music signals. A deep neural network (DNN)-based regression method was proposed for a packet loss concealment (PLC) algorithm to predict a missing frame's characteristics. Two further DNNs were developed for the training stage by integrating the log-power spectra and phases, based on unsupervised pre-training and supervised fine-tuning; to reconstruct a missing frame, the algorithm feeds the previous frame's features to the trained DNN. Other researchers have analyzed audio gaps (500-550 ms) and used the Wasserstein Generative Adversarial Network (WGAN) and Dual Discriminator WGAN (D2WGAN) models to reconstruct the lost audio content. In Khan et al., the authors proposed an audio signal reconstruction model called the Cartesian Genetic Programming evolved Artificial Neural Network (CGPANN), which was more efficient than interpolation-extrapolation techniques; the model was also robust in recovering signals contaminated with up to 50% noise. A further study proposed a DNN structure to restore missing audio content based on the audio gaps, where the signals provided at the gaps were time-frequency coefficients (either complex-valued or magnitude). In the work of Sperschneider et al., the authors presented a delay-less packet-loss concealment (PLC) method for stationary tonal signals, which addresses audio codecs that utilize a modified discrete cosine transformation (MDCT). In the case of a frame loss, tonal components are identified using the last two received spectra and their pitch information, and the MDCT coefficients of the tonal components are then estimated via phase prediction based on the detected tonal components. Mokrý et al. presented an inpainting algorithm called SPAIN (SParse Audio INpainter), developed by adapting the successful declipping method SPADE to the context of audio inpainting; they show that the analysis variant of SPAIN performs best in terms of SNR (signal-to-noise ratio) among sparsity-based methods and reaches results on a par with the state-of-the-art Janssen algorithm for audio inpainting (the Janssen algorithm iteratively fits autoregressive models, using all points before a gap for forward estimation and all points after it for backward estimation). A composite model for audio enhancement that combines the Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models has also been proposed.
The aim of our method is to reconstruct a realistic segment of an audio file containing a corrupted or dropped region using active embedding and machine learning. The experimental sample in this paper is a single audio file taken from "The Sound Archive" data repository (sounds from the Star Wars movies: https://www.thesoundarchive.com/star-wars.asp, last accessed 2021-09-14) as a proof of concept. As the focus of this research is active reconstruction, the audio input is in the .wav file format. The main reason for selecting this format is sound quality, as it preserves the originality of analog audio without any compression.
3.1 Halftone-based Compression and Reconstruction (HCR)
In the field of bit-rate reduction, or data compression, the Lempel–Ziv (LZ) algorithm and its variant, Lempel–Ziv–Welch (LZW), are popular methods; however, they are slow and prone to failure if data corruption occurs, as in our case where more than 800 samples are deleted from the audio signal. Therefore, a compression method is needed that is more resilient to data corruption and that can provide heavy compression together with a good approximate reconstruction. The proposed HCR is, hence, meant to exploit halftoning for this purpose. The algorithm this work adapts is that of Floyd–Steinberg, which applies forward error diffusion. The rationale behind conceiving the notion of HCR is that embedding data in the least significant bits (LSB) of a bit-stream necessitates dealing with binary data. Let the original audio sampled data be denoted by a vector, which is then transformed into a matrix with suitable dimensions (whose automatic estimation is outside the scope of this work), see Equation 1.
The matrix is then passed to the dithering phase using the Floyd–Steinberg algorithm, which results in a binary matrix (as seen in Fig. 1b) that can be partially inverted. This contributes to the heavy compression that we obtain. For instance, the original audio sampled data and the resulting compressed vector pertaining to Fig. 1b (vectorized) show the following properties: original audio (length 1316019, size 10.04 MB, type byte) and its corresponding compressed vector (length 1316019, size 1.26 MB, type binary).
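The dithering step can be sketched as follows. This is a minimal pure-Python illustration of Floyd–Steinberg forward error diffusion; the function name and the assumption that samples are normalized to [0, 1] are ours, not the paper's:

```python
def floyd_steinberg_dither(img):
    """Binarize a 2-D list of floats in [0, 1] by forward error diffusion.

    Each pixel is thresholded at 0.5 and the quantization error is pushed
    to the yet-unvisited neighbours with the classic Floyd-Steinberg
    weights 7/16, 3/16, 5/16, 1/16.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # working copy; mutated in raster order
    for y in range(h):
        for x in range(w):
            old = out[y][x]
            new = 1.0 if old >= 0.5 else 0.0
            out[y][x] = new
            err = old - new
            if x + 1 < w:
                out[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    out[y + 1][x - 1] += err * 3 / 16
                out[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    out[y + 1][x + 1] += err * 1 / 16
    return out  # every entry is now 0.0 or 1.0
```

Since every sample is reduced to a single bit instead of a byte, this accounts for the roughly 8:1 size reduction reported above (10.04 MB to 1.26 MB).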
The error diffusion algorithm exploits the optical system of the human eye, which acts as a low-pass filter removing high frequencies, resulting in the illusion of perceiving a dithered (binary-only) image as a continuous-tone image. Hence, it follows that, in order to partially invert the dithering operation, we need to apply a low-pass filter to attenuate high frequencies (in our case, a 2D Gaussian filter as in Eq. 2 with a suitable kernel size).
where x and y are the coordinates of the image D. Recently, the development of deep machine learning has rekindled interest in addressing the inverse halftoning problem by optimization-based filtering. Nevertheless, we opt for the aforementioned simple and computationally inexpensive method.
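The approximate inversion can be sketched as below: build a normalized 2D Gaussian kernel and convolve it with the binary matrix D. The kernel size and sigma here are illustrative assumptions (the paper's exact kernel size was lost in this copy), and the replicate border handling is our choice:

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel exp(-(x^2 + y^2) / (2*sigma^2))."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def lowpass(D, size=5, sigma=1.0):
    """Approximately invert dithering by blurring the binary matrix D."""
    k = gaussian_kernel(size, sigma)
    h, w, c = len(D), len(D[0]), size // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(size):
                for i in range(size):
                    yy = min(max(y + j - c, 0), h - 1)  # replicate borders
                    xx = min(max(x + i - c, 0), w - 1)
                    acc += k[j][i] * D[yy][xx]
            out[y][x] = acc
    return out
```

Because all kernel weights are positive and sum to one, the output is a local average of the 0/1 dither pattern, i.e. an estimate of the original continuous tone.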
In order to scrutinize the efficiency of the reconstruction, we plot the original sampled data against the values estimated by the above HCR process (see Fig. 2). Given that the dithered version consists of only binary values (either 0 or 1), the reconstruction still demonstrates a good correlation, R = 0.62. The fitted linear regression model is depicted in Table 1 (R-squared = 0.384, consistent with R = 0.62 since 0.62 squared is approximately 0.384). In Fig. 2c, we observe that the process captures a noisy version of the structure and orientation of the original data shown in Fig. 2a; therefore, when this side-information is wedded to machine learning (particularly state-of-the-art models), it improves the quality of the audio the algorithm reconstructs. This was the impetus for initiating this study. To gauge this performance improvement, we tested several ML models, whose results appear in Section 3.2.
Number of observations: 1316019; error degrees of freedom: 1316017; root mean squared error: 0.0527; R-squared: 0.384; adjusted R-squared: 0.384.
3.2 Deployed deep/machine learning architectures
In this section, we list the different machine learning models deployed in this study.
3.2.1 Shallow machine learning
Random Forest (RF) - The RF regressor is a supervised learning technique for regression that employs the ensemble learning method over decision trees. Its most important parameter is the number of trees, which is set to 100 as per the recommendation in the literature. It has been demonstrated that RF is more robust than other similar approaches, being able to handle a small number of samples, high dimensionality, and nonlinear fitting.
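A minimal scikit-learn sketch of such an RF regressor is shown below, with 100 trees as stated above. The sliding-window feature construction and the toy signal are our assumptions for illustration, not the paper's exact setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def windowed(signal, width=8):
    """Turn a 1-D signal into (window, next-sample) training pairs."""
    X = np.array([signal[i:i + width] for i in range(len(signal) - width)])
    y = np.array(signal[width:])
    return X, y

# Toy stand-in for the HCR side-information (a noisy sine wave).
rng = np.random.default_rng(0)
sig = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.standard_normal(500)

X, y = windowed(sig)
rf = RandomForestRegressor(n_estimators=100, random_state=0)  # 100 trees
rf.fit(X[:400], y[:400])          # train on the early part of the signal
pred = rf.predict(X[400:])        # predict the held-out tail
```

The window width is a free hyperparameter; in practice it would be tuned on the validation segments described in Section 3.2.3.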
3.2.2 Deep machine learning
3.2.3 Training and testing segments
This section briefly discusses the mechanism whereby the three deep/machine learning models are trained and tested. In Fig. 3 (top row), the original audio sampled signal is displayed for comparison. Fig. 3 (second row) shows the stego-audio with the embedded copy and with a simulation of signal loss (i.e., an empty segment). Note that the stego-audio and the original look identical, as all that was flipped is the last LSB value. Subsequently, we extract the hidden data from the LSB plane, and by using the same secret key the values in the extracted binary vector are rearranged. The vector is then transformed into a matrix using Eq. (1) (which should correspond to the dithered version), a 2D Gaussian filter is applied using Eq. (2), and the result is vectorized to yield the plot shown in Fig. 3 (third row). Part of this vector is used for training and validation (i.e., the red segments), and the tuned model is finally applied to the test data (i.e., the green segment) to predict (reconstruct) the lost segment.
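The LSB embedding and key-driven extraction described above can be sketched as follows. The function names and the use of a seeded permutation as the "secret key" are our illustrative assumptions; the paper only specifies that the last LSB is flipped and that the same key rearranges the extracted bits:

```python
import random

def embed_lsb(samples, bits, key):
    """Hide `bits` in the LSBs of integer `samples`, scattered by a
    key-seeded permutation of sample positions."""
    order = list(range(len(samples)))
    random.Random(key).shuffle(order)        # secret-key-driven positions
    stego = list(samples)
    for b, pos in zip(bits, order):
        stego[pos] = (stego[pos] & ~1) | b   # flip only the last LSB
    return stego

def extract_lsb(stego, n_bits, key):
    """Recover and rearrange the hidden bit vector using the same key."""
    order = list(range(len(stego)))
    random.Random(key).shuffle(order)
    return [stego[pos] & 1 for pos in order[:n_bits]]
```

Since only the least significant bit of each carrier sample can change, each sample is perturbed by at most one quantization level, which is why the stego-audio and the original look identical in Fig. 3.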
Performing global tuning based on locally adaptive learnt statistics was discussed in previous work, though in a different context.
4 Results and Discussion
The audio files of our experiments are all available online (Audio Clips: https://ardisdataset.github.io/ARMAS/). The performance of the models applied in this research is evaluated using correlation (Corr), Peak Signal-to-Noise Ratio (PSNR), Weighted Peak Signal-to-Noise Ratio (WPSNR), and the Structural Similarity Index Measure (SSIM). The comparison is performed between the original and the reconstructed audio signals. In this experiment, an audio gap of 4000 ms length was considered a large drop in the audio signals. After extracting the required training data from the sequence (see Fig. 3), the data were passed to the RF, SVR, and LSTM models for training. The test set was the hidden data embedded in the stego-audio. The reconstructed signals were evaluated against the original signal (which acts as the ground truth for validation) by calculating the statistical metrics reported above. A quantitative assessment is presented in Table 2 (only the reconstructed gap is measured), while a qualitative evaluation can be inferred from Fig. 4. Comparison to state-of-the-art audio inpainting algorithms is restricted to very short gaps (45 ms), since these algorithms are unable to handle the gap range in our study; Table 3 compares the performance on such a range. We also observed good performance of the proposed approach for longer gap reconstruction (8000 ms), but due to the page limit we could not report it here (nevertheless, readers can find the results of both experiments at the URL above).
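Two of the metrics above have simple closed forms; a minimal sketch of PSNR and Pearson correlation for 1-D signals (our helper names, with the signal peak assumed normalized to 1.0) is:

```python
import math

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio between a reference and a reconstruction."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def corr(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

WPSNR and SSIM additionally weight the error by perceptual masking and local structure, respectively, and are usually taken from an existing implementation rather than re-derived.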
5 Conclusion
This paper proposes the fusion of audio dither-based steganography with machine/deep learning for the active reconstruction of lost signals. The results show that the proposed solution is feasible and can enhance the reconstruction of lost audio signals (readers may wish to listen to the audio online; see the URL in Section 4). We conducted experiments on two types of signal drops, of 4000 ms and 8000 ms length (the latter not shown due to the page limit). As a proof of concept, we can assert that, in general, the LSTM and RF models are good models to utilize. Our approach is not meant to replace current audio inpainting methods, but rather to assist them by providing latent side-information. It can also benefit security systems in protecting audio files from unauthorised manipulation. The contribution of this work is twofold: (i) a halftone-based compression and reconstruction (HCR) method, and (ii) the orchestration of three scientific disciplines: steganography, compression, and audio processing. To the best of our knowledge, there is no similar implementation in the literature for audio missing-segment reconstruction. Thus, we conclude by stating that the fusion of steganography and state-of-the-art machine learning algorithms can be considered for the active reconstruction of audio signals; however, there is room for enhancement, as pointed out in the discussion section.
- (2020) Basic tenets of classification algorithms K-nearest-neighbor, support vector machine, random forest and neural network: a review. Journal of Data Analysis and Information Processing 8 (4), pp. 341–357.
- (2009) Steganoflage: a new image steganography algorithm. Ph.D. thesis, University of Ulster.
- (2019) Random forest surrogate models to support design space exploration in aerospace use-case. In IFIP Advances in Information and Communication Technology, pp. 532–544.
- (2020) Predictive modelling to support sensitivity analysis for robust design in aerospace engineering. Structural and Multidisciplinary Optimization 61 (5), pp. 2177–2192.
- (2020) Audio inpainting with generative adversarial network.
- (2021) A time series forecasting based multi-criteria methodology for air quality prediction. Applied Soft Computing 113, 107850.
- (1976) An adaptive algorithm for spatial greyscale. In Proceedings of the Society of Information Display, Vol. 17, pp. 75–77.
- (2021) A novel low-complexity attention-driven composite model for speech enhancement. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5.
- (1986) Adaptive interpolation of discrete-time signals that can be modeled as autoregressive processes. IEEE Transactions on Acoustics, Speech, and Signal Processing 34 (2), pp. 317–330.
- (2017) Audio signal reconstruction using Cartesian genetic programming evolved artificial neural network (CGPANN). In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 568–573.
- (2020) Real-time lossy audio signal reconstruction using novel sliding based multi-instance linear regression/random forest and enhanced CGPANN. Neural Processing Letters 53 (1), pp. 227–255.
- (2018) Deep context-aware descreening and rescreening of halftone images. ACM Transactions on Graphics 37 (4), pp. 1–12.
- (2015) Sparsity and cosparsity for audio declipping: a flexible non-convex approach. In Proc. 12th International Conference on Latent Variable Analysis and Signal Separation, LNCS, Vol. 9237, Springer.
- (2016) Packet loss concealment based on deep neural networks for digital speech transmission. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (2), pp. 378–387.
- (2016) Deep joint image filtering. In Computer Vision – ECCV 2016, pp. 154–169.
- (2019) A context encoder for audio inpainting. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2362–2372.
- (2019) Introducing SPAIN (SParse Audio INpainter). In 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5.
- (2015) Delay-less frequency domain packet-loss concealment for tonal audio signals. In 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 766–770.
- (2017) Multiple-target deep learning for LSTM-RNN based speech enhancement. In 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), pp. 136–140.
- (2012) A dynamic threshold approach for skin tone detection in colour images. International Journal of Biometrics 4 (1), p. 38.