Steganography hides secret messages in seemingly innocent carriers without perceptible distortion, and is therefore widely used for covert communication. Steganalysis, conversely, aims to distinguish stego objects (objects containing a secret message) from cover objects. VoIP is a real-time service that enables users to make phone calls anywhere through IP data networks. With the popularity of instant messaging tools such as WeChat, Skype and Snapchat, VoIP network traffic has increased sharply. It has been widely reported that VoIP is an excellent channel for covert communication because of its particular characteristics, such as instantaneity, a large volume of carrier data, high covert bandwidth and flexible conversation length [17, 9, 18, 11]. Although the problem has been studied, practical online steganalysis tools for VoIP streams are still rare because most proposed methods cannot balance detection accuracy and efficiency. Thus, it is crucial to develop a fast and efficient steganalysis tool for VoIP streams that is suitable for online systems.
Steganography in VoIP streams can be carried out in the network protocol or in the payload field [11, 20, 5]. Compared with embedding secret data into the network protocol, embedding into the payload can achieve higher concealment. To reduce bandwidth consumption, VoIP often integrates low-bit-rate compressed speech coding standards such as G.729 and G.723.1. Quantization Index Modulation (QIM) makes it possible to integrate secret data into low-bit-rate speech. QIM-based steganography hides information mainly by using the QIM algorithm to partition the codebooks and encode with the resulting sub-codebooks during speech quantization, which achieves high concealment and is hard to detect [18, 12].
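The codebook-partition idea behind QIM can be illustrated with a minimal sketch. It assumes a hypothetical scalar codebook split by index parity, which is a deliberate simplification: real codecs quantize LSF vectors, and schemes such as CNV-QIM choose the partition differently. The function names are illustrative only.

```python
def qim_embed(value, bit, codebook):
    """Return the index of the nearest codeword whose index parity equals `bit`.

    Assumption: the codebook is partitioned into two sub-codebooks by index
    parity (even indices carry bit 0, odd indices carry bit 1).
    """
    candidates = [i for i in range(len(codebook)) if i % 2 == bit]
    return min(candidates, key=lambda i: abs(codebook[i] - value))


def qim_extract(index):
    """Recover the hidden bit from the transmitted codeword index."""
    return index % 2


# Hypothetical scalar codebook; real codecs quantize LSF vectors.
codebook = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
index_for_0 = qim_embed(0.33, 0, codebook)  # nearest even-index codeword
index_for_1 = qim_embed(0.33, 1, codebook)  # nearest odd-index codeword
```

The receiver needs only the codeword index to recover the bit, while the quantization error stays bounded by the spacing of the chosen sub-codebook. This is why the distortion is hard to perceive, although it does alter the codeword statistics.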
Steganalysis of digital audio typically follows the same pattern: statistical features are extracted directly from the carrier and then passed to a classifier. Mel Frequency Cepstrum Coefficients (MFCC), Markov transition features and other combined features, such as high-order moments of the carrier, are the main focus [14, 7, 15]. The classifier adopted is mostly a Support Vector Machine (SVM). Speech steganalysis follows the same intuition but pays more attention to correlation features between codewords and frames. Researchers usually extract handcrafted features based on prior knowledge. Although this is simple and fast, these methods perform poorly in detection accuracy. Recently, neural network architectures such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have been studied to model different levels of correlation, and they have achieved the best results to date [16, 19, 1]. However, most of these deep-learning-based models place high demands on both storage and computation, which makes real-time serving hard and leaves covert VoIP channels a serious threat to the security of cyberspace. Thus, in this letter, we propose a lightweight and highly efficient steganalysis scheme to tackle this problem.
II Problem Definition
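An information-theoretic model for steganography was proposed by Cachin in [4]. Given a message $m$, one can embed $m$ into a given cover carrier $c$, which follows the distribution $P_C$, using an embedding function $f$, producing a modified carrier $s = f(c, m)$, which follows the distribution $P_S$. The goal of steganography is to reduce the difference between the statistical distributions of the carrier before and after embedding as much as possible, which can be expressed as:
$$d(P_C, P_S) \le \varepsilon, \tag{1}$$
where $d(\cdot, \cdot)$ measures the difference between two distributions (the relative entropy in Cachin's model) and $\varepsilon$ is a small non-negative constant.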
Steganalysis, conversely, should exploit the differences between cover and stego carriers to judge whether an extra message is embedded in a given carrier. Given a sample $x$, steganalysis can be viewed as a map $\phi: x \mapsto \{0, 1\}$, where $\phi(x) = 0$ means that $x$ is detected as cover and $\phi(x) = 1$ means that it is detected as stego.
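To make the notion of "difference in statistical distribution" concrete, the following sketch computes the relative entropy (Kullback-Leibler divergence), the measure used in Cachin's model, between two discrete distributions. The example distributions are invented purely for illustration and are not taken from any experiment.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats for two discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical codeword distributions over a 4-entry codebook.
cover = [0.25, 0.25, 0.25, 0.25]
stego_undetectable = [0.25, 0.25, 0.25, 0.25]  # embedding left the statistics intact
stego_detectable = [0.40, 0.10, 0.40, 0.10]    # embedding skewed the statistics

gap_none = kl_divergence(cover, stego_undetectable)
gap_some = kl_divergence(cover, stego_detectable)
```

A divergence of zero leaves a steganalyzer nothing to work with, whereas any positive gap is, in principle, a statistical handle for detection.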
In a VoIP stream, speech frames are first converted to Line Spectrum Frequency (LSF) coefficients, which are then encoded by Vector Quantization (VQ). Taking the low-bit-rate codec G.729 as an example, the quantized LSF coefficients in the $i$-th frame are described by a quantization codeword set $s_i = \{s_{i,1}, s_{i,2}, s_{i,3}\}$ using codebooks $L_1$, $L_2$ and $L_3$, respectively. A steganography scheme such as QIM affects the statistical distribution of these codewords. Thus, in this paper, we adopt the quantized LSF codewords as the clue for steganalysis of VoIP streams. In the application scenario, we can use a sliding detection window [10] to collect one or several consecutive packets at a time and construct the quantized LSF codeword sequences for steganalysis. Assuming that the sample window size is $T$, the quantized LSF codewords can be expressed as $S = [s_1, s_2, \ldots, s_T]$, where $s_i$ represents the $i$-th speech frame. Our goal is to construct an end-to-end model $\phi(S)$ that predicts a label $\hat{y}$ for a given carrier. In the experiments, we obtain labeled samples from a dataset $\{(S_k, y_k)\}_{k=1}^{N}$, where $N$ is the total number of samples in the dataset and $y_k$ is the true label of the $k$-th sample.
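The sliding detection window can be sketched as a simple generator over a stream of per-frame codeword triples. The helper name, the `step` parameter and the codeword values are assumptions made for illustration; the text does not specify how far the window advances between detections.

```python
def sliding_windows(frames, window_size, step=1):
    """Yield consecutive windows of `window_size` frames, advancing by `step`."""
    for start in range(0, len(frames) - window_size + 1, step):
        yield frames[start:start + window_size]


# Hypothetical stream: each frame is a triple of quantized LSF codeword indices.
frames = [(12, 3, 7), (45, 9, 1), (45, 9, 2), (8, 30, 30), (100, 0, 15)]
windows = list(sliding_windows(frames, window_size=3))
```

Each window is one sample handed to the detector, so a stream shorter than the window size simply yields no samples.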
III Proposed Method
III-A Correlation Analysis
There is a correspondence between text and speech [3]. Given a speech sequence $A$ as input, we can obtain an estimate $\hat{W}$ of the corresponding text sequence in the following way:
$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W), \tag{2}$$
where $P(A \mid W)$ is taken as the acoustic model and $P(W)$ as the language model. Because of the dependencies within the speech sequence and the word sequence, and the correspondence between the acoustic model and the language model in Equation 2, codewords and frames in compressed speech are strongly correlated.
Define $s_{i,j}$ as the $j$-th codeword of frame $i$ in the sequence $S$; for G.729 and G.723, $j \in \{1, 2, 3\}$. When all codewords are uncorrelated, their occurrences are independent. Therefore, for any two codewords $s_{i,j}$ and $s_{k,l}$ taking values $a$ and $b$, we have:
$$P(s_{i,j} = a,\ s_{k,l} = b) = P(s_{i,j} = a)\, P(s_{k,l} = b). \tag{3}$$
When the codewords $s_{i,j}$ and $s_{k,l}$ are correlated, the two sides of Equation 3 differ, and the size of the difference indicates the degree of correlation. Specifically, a larger imbalance between the two sides indicates a stronger correlation. In addition to this explicit correlation between codewords, implicit correlations such as word correlation also exist but are difficult to model. One possible way is to map the codewords to a continuous semantic space. Recently, many remarkable results have shown that neural networks can map items such as words into a continuous semantic space by self-learning from plenty of data [6, 13]. Thus, we use this mechanism and explore semantic correlations at different levels by analyzing the distribution of these codeword vectors.
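The imbalance between the joint probability and the product of the marginals can be estimated empirically from observed codeword pairs. The sketch below is only an illustration of that idea under the assumption of simple frequency estimates; the function name and the toy observations are hypothetical.

```python
def independence_gap(pairs, a, b):
    """Estimate |P(x=a, y=b) - P(x=a) * P(y=b)| from observed (x, y) pairs."""
    n = len(pairs)
    joint = sum(1 for x, y in pairs if x == a and y == b) / n
    p_a = sum(1 for x, _ in pairs if x == a) / n
    p_b = sum(1 for _, y in pairs if y == b) / n
    return abs(joint - p_a * p_b)


# Hypothetical observations of two codeword positions.
correlated = [(0, 0), (0, 0), (1, 1), (1, 1)]    # second value always follows the first
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]   # every combination equally likely

gap_correlated = independence_gap(correlated, 0, 0)
gap_independent = independence_gap(independent, 0, 0)
```

A gap near zero is consistent with independence, while a larger gap signals the kind of explicit correlation that embedding tends to disturb.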
III-B Correlation Extraction and Classification
The input of the correlation extraction module is the quantized LSF codeword sequence $S = [s_1, s_2, \ldots, s_T]$. At the start of the correlation extraction process, we first map each vector quantization codeword to a dense vector, which carries richer information than the original codeword index. The mapping is based on an embedding matrix, which can be expressed as follows:
$$E_j = \begin{bmatrix} e_{j,1} \\ e_{j,2} \\ \vdots \\ e_{j,N_j} \end{bmatrix} \in \mathbb{R}^{N_j \times d}, \tag{4}$$
where $N_j$ is the size of the quantization codebook $L_j$, $d$ is the embedding dimension and the $k$-th row $e_{j,k}$ represents the $k$-th codeword in $L_j$. We convert each vector quantization codeword $s_{i,j}$ to a one-hot vector $o_{i,j} \in \{0, 1\}^{N_j}$ based on its value, and then directly obtain a dense vector as follows:
$$v_{i,j} = o_{i,j} E_j. \tag{5}$$
After the transformation, the representation of the $i$-th frame can be denoted as:
$$x_i = v_{i,1} \oplus v_{i,2} \oplus v_{i,3}, \tag{6}$$
where $\oplus$ is the concatenation operator. The sequence is then flattened into a one-dimensional vector $x = x_1 \oplus x_2 \oplus \cdots \oplus x_T \in \mathbb{R}^{D}$ for further correlation analysis, where $D = 3dT$.
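The embedding step can be sketched in plain Python as a table lookup followed by concatenation. Multiplying a one-hot vector by an embedding matrix selects a single row, so the lookup below is equivalent to that product. The codebook sizes (128, 32, 32, corresponding to 7-, 5- and 5-bit indices for G.729), the tiny embedding dimension and the random initialization are assumptions for illustration; in the actual model the embedding rows are learned.

```python
import random


def make_embedding(num_codewords, dim, rng):
    """Create a randomly initialized embedding table (one row per codeword)."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_codewords)]


def embed_sequence(frames, tables):
    """Look up each codeword's dense vector and flatten the whole window."""
    flat = []
    for frame in frames:
        for codeword, table in zip(frame, tables):
            # Row lookup is equivalent to one-hot vector times embedding matrix.
            flat.extend(table[codeword])
    return flat


rng = random.Random(0)
CODEBOOK_SIZES = (128, 32, 32)  # assumed sizes of the three G.729 LSF codebooks
EMBED_DIM = 4                   # kept tiny here; the experiments use 64
tables = [make_embedding(n, EMBED_DIM, rng) for n in CODEBOOK_SIZES]

frames = [(12, 3, 7), (45, 9, 1)]  # hypothetical window of T = 2 frames
x = embed_sequence(frames, tables)
```

The flattened vector has length 3 x (embedding dimension) x (number of frames), which is the input size seen by the subsequent layer.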
With these dense representations, we build the correlation extraction process on a hidden layer. Specifically, we use a weight matrix $W$ on the flattened representation $x$ to calculate the probability that the speech carrier contains covert information:
$$z = W x + b, \tag{7}$$
where $W$ and $b$ are the learned weight matrix and bias, respectively. The dimension of the output vector $z$ is 2. Finally, we add a softmax classifier to the output layer to calculate the probability of each category:
$$\hat{y}_c = \frac{\exp(z_c)}{\sum_{c'=0}^{1} \exp(z_{c'})}, \quad c \in \{0, 1\}. \tag{8}$$
The output value $\hat{y}_1$ reflects the probability, according to the model, that the input sequence contains confidential messages. We can set a detection threshold $\tau$, such as 0.5, and the final detection result can then be expressed as
$$\text{result} = \begin{cases} \text{stego}, & \hat{y}_1 \ge \tau, \\ \text{cover}, & \hat{y}_1 < \tau. \end{cases} \tag{9}$$
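The classification step (a single affine layer, a softmax and a threshold) can be sketched as follows. The weights, bias and input are invented numbers, and treating output index 1 as the stego class is an assumption; a trained model would supply learned parameters.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def detect(x, weights, bias, threshold=0.5):
    """Apply one affine layer plus softmax, then threshold the stego probability.

    Assumption: output index 0 is the cover class and index 1 the stego class.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    label = "stego" if probs[1] >= threshold else "cover"
    return label, probs[1]


# Hypothetical flattened input and parameters (2 output classes, 3 inputs).
x = [1.0, -2.0, 0.5]
weights = [[0.2, 0.1, 0.0], [-0.3, -0.4, 0.6]]
bias = [0.0, 0.1]
label, p_stego = detect(x, weights, bias)
```

Raising the threshold trades missed detections for fewer false alarms, which is a deployment choice rather than a property of the model.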
III-C Training Framework
The training process of the proposed model follows a supervised framework and incorporates knowledge distillation (KD) [8] to improve performance. Knowledge distillation is a framework proposed by Hinton to compress a large model into a simplified one while improving the performance of the latter. The framework uses a teacher-student setting in which the student learns from both the ground-truth labels (hard labels) and the soft labels provided by the teacher. The probability mass associated with each class in the soft labels allows the student to learn more about the label similarities from a given sample.
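In our setting, the proposed model is designed to be as simple as possible in order to maximize detection efficiency. It is therefore essential to use this framework when training the proposed model. In this paper, we design a teacher model that contains three Fully Connected (FC) layers to capture the correlations in the sequence. The parameters of the teacher model are learned by minimizing the log loss over all training samples:
$$\mathcal{L}_{\text{teacher}} = -\frac{1}{N} \sum_{k=1}^{N} \left[ y_k \log \hat{y}^{t}_k + (1 - y_k) \log\left(1 - \hat{y}^{t}_k\right) \right], \tag{10}$$
where $N$ is the total number of samples, $y_k$ is the true label of the $k$-th sample and $\hat{y}^{t}_k$ is the stego probability predicted by the teacher. Our key idea is to train the proposed (student) model with the distribution produced by the teacher trained with Equation 10 rather than with the ground-truth labels. Denoting the label generated by the teacher model as $\tilde{y}_k$, the new loss function for the student model can be set as:
$$\mathcal{L}_{\text{student}} = -\frac{1}{N} \sum_{k=1}^{N} \left[ \tilde{y}_k \log \hat{y}_k + (1 - \tilde{y}_k) \log\left(1 - \hat{y}_k\right) \right], \tag{11}$$
where $\hat{y}_k$ is the stego probability predicted by the student.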
The framework of this training process is presented in Figure 2.
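A minimal sketch of the student objective, assuming the student is trained only on the teacher's soft labels with a cross-entropy loss. The probabilities below are made up for illustration, and the temperature scaling used in Hinton's original formulation is omitted for brevity.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def student_loss(teacher_probs, student_probs):
    """Mean cross-entropy of student outputs against the teacher's soft labels."""
    total = sum(cross_entropy(t, s) for t, s in zip(teacher_probs, student_probs))
    return total / len(teacher_probs)


# Hypothetical [cover, stego] probabilities for two samples.
teacher = [[0.9, 0.1], [0.2, 0.8]]
good_student = [[0.88, 0.12], [0.25, 0.75]]  # close to the teacher
poor_student = [[0.3, 0.7], [0.7, 0.3]]      # disagrees with the teacher

loss_good = student_loss(teacher, good_student)
loss_poor = student_loss(teacher, poor_student)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, so minimizing it pulls the single-layer student toward the behaviour of the deeper teacher.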
IV Experiments and Analysis
Our experiments are based on a dataset (https://github.com/fjxmlzn/RNN-SM) published by Lin [16]. This dataset contains more than 100 hours of speech data and includes speech in both Chinese and English. Since different languages have different characteristics, we test the speech in each language separately. In the experiments, we use the low-bit-rate speech codec G.729 to compress and encode the original audio. The steganography algorithm we adopt is CNV-QIM [18], a representative steganography scheme. To test detection performance, we cut the speech into clips of different lengths. We also generate speech samples with different embedding rates to test the models' adaptation to various embedding rates. The embedding rate is defined as the ratio of the number of embedded bits to the whole embedding capacity. Embedding positions are chosen randomly.
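The embedding-rate definition and the random choice of positions can be sketched as below. The helper name, the fixed seed and the capacity value are assumptions for illustration; the text does not specify how the random positions were drawn.

```python
import random


def choose_embedding_positions(capacity, rate, seed=None):
    """Pick round(capacity * rate) distinct positions uniformly at random."""
    rng = random.Random(seed)
    count = round(capacity * rate)
    return sorted(rng.sample(range(capacity), count))


# Hypothetical clip with 30 embeddable positions and a 20% embedding rate.
capacity = 30
positions = choose_embedding_positions(capacity, rate=0.2, seed=1)
actual_rate = len(positions) / capacity
```

Only the selected positions carry secret bits; the remaining codewords are quantized as usual, which is what makes low embedding rates harder to detect.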
In the training process, we divide the entire sample set into a training set, a validation set and a test set at a ratio of 8:1:1. Model hyperparameters are selected via cross-validation. In particular, the embedding size is 64. When the sample length is 0.1s, the total number of frames in the sliding window is 10. The number of hidden layers in the proposed model is 1. The number of hidden layers in the teacher model is 3, and the dimensions of the last two layers are 128 and 64. We use Adam [2] as the learning algorithm to optimize the model parameters. Our code is implemented in Keras (https://keras.io/). The training and testing environment for the experiments is an Intel(R) Xeon(R) CPU E5-2683 v3 at 2.00 GHz, with a GeForce GTX 1080 GPU for acceleration.
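The 8:1:1 partition can be sketched as a shuffle followed by two cut points. The shuffling and the fixed seed are assumptions added for reproducibility; the text only states the ratio.

```python
import random


def split_8_1_1(samples, seed=0):
    """Shuffle the samples and split them into train/validation/test at 8:1:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical sample identifiers standing in for labeled speech clips.
train, val, test = split_8_1_1(range(100))
```

Any remainder left by the integer division goes to the test set, so no sample is dropped.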
To validate the performance of our model, we choose several representative steganalysis algorithms as baselines [15, 19, 16]. Li [15] constructed a model called the Quantization Codeword Correlation Network (QCCN), based on split VQ codewords from adjacent speech frames, to capture correlation and built a high-performance detector with a Support Vector Machine (SVM) classifier. Lin [16] identified four types of correlation between codewords and frames and proposed a codeword correlation model based on a Recurrent Neural Network (RNN). The authors of [19] showed how to take advantage of the two main deep learning architectures, CNN and RNN, and proposed a novel CNN-LSTM model to detect steganography in VoIP streams, which achieves state-of-the-art performance.
The metrics used to evaluate the different models in the experiments are detection accuracy and inference time. Detection accuracy is defined as the ratio of the number of correctly classified samples to the total number of samples.
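The accuracy metric defined above amounts to a few lines of code. The label encoding (0 for cover, 1 for stego) and the example predictions are assumptions for illustration.

```python
def detection_accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels: 0 = cover, 1 = stego.
accuracy = detection_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```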
IV-A Time Efficiency
We mainly compare the proposed method with several previous models that extract correlations automatically. The results of the experiment are shown in Table I and the accompanying figure. The results show that detection time increases with sample length for every model, but that of our algorithm rises the slowest. Moreover, the detection efficiency of our algorithm is significantly higher than that of the other methods at every sample length. When the sample length is 1s, the detection efficiency of our model is 20 times that of the other two models. Even when the sample length is 0.1s, the inference time of the proposed model is about 0.05 milliseconds, only 1/10 of that of the previous model, which shows strong practical value for online serving.
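Per-sample inference time of this kind can be measured with a small harness such as the one below. It is only a sketch of one reasonable measurement procedure, not the protocol used in the experiments; `predict` stands for any detector callable, and the dummy detector is a placeholder.

```python
import time


def mean_inference_ms(predict, samples, repeats=100):
    """Average wall-clock time per sample, in milliseconds, over several passes."""
    start = time.perf_counter()
    for _ in range(repeats):
        for sample in samples:
            predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / (repeats * len(samples))


def dummy_detector(sample):
    """Placeholder detector; a real model's prediction call would go here."""
    return sum(sample) % 2


samples = [(12, 3, 7), (45, 9, 1), (8, 30, 30)]
per_sample_ms = mean_inference_ms(dummy_detector, samples, repeats=10)
```

Averaging over repeated passes smooths out timer resolution and scheduling noise, which matters when a single prediction takes only a fraction of a millisecond.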
IV-B Detection Accuracy
| Language | Method | Embedding Rates (%) |
| | Ours (no kd) | 59.69 | 69.80 | 77.80 | 84.32 | 89.21 |
| | Ours (no kd) | 59.84 | 69.88 | 77.71 | 84.06 | 88.99 |

| Language | Method | Sample Length (s) |
| | Ours (no kd) | 61.99 | 69.80 | 74.45 | 78.12 | 81.64 |
| | Ours (no kd) | 61.74 | 69.88 | 74.44 | 77.68 | 81.87 |
Sample duration and embedding rate are the two most important factors affecting detection performance. In a practical scenario, the two communicating parties usually adopt a low embedding rate and communicate for a short period of time to reduce the probability of being detected. Thus, our experiments test performance at low embedding rates and at short durations, respectively. As can be seen from the tables, our model performs well across different embedding rates and durations. In addition, the experiments show that the proposed model trained with knowledge distillation performs better than the same model trained in the conventional way. We therefore conclude that a training framework based on knowledge distillation is valuable for modeling other real-time steganalysis problems.
V Conclusion
In this paper, we present a novel and extremely fast steganalysis method for VoIP streams, driven by the need for quick and accurate detection of possible steganography in streaming media. Despite its simple structure for exploring correlations in the carrier, it achieves state-of-the-art performance in both detection accuracy and efficiency. In particular, the average processing time of the proposed method is only about 0.05 milliseconds when the sample length is as short as 0.1s, which gives it strong practical value for online steganography monitoring. The proposed correlation extraction scheme and training framework can therefore serve as a guide for other researchers exploring real-time steganalysis, since the solution has the potential to address similar runtime efficiency problems in online systems.
References
[1] (2017) Audio steganalysis with convolutional neural network. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, pp. 85–90.
[2] (2014) Adam: a method for stochastic optimization. In 3rd Int. Conf. Learn. Representations.
[3] (2003) Emotions, speech and the ASR framework. Speech Communication 40 (1), pp. 213–225.
[4] (2004) An information-theoretic model for steganography. 192 (1), pp. 306–318.
[5] (2002) Quantization index modulation: a class of provably good methods for digital watermarking and information embedding. IEEE Transactions on Information Theory 47 (4), pp. 1423–1443.
[6] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7] (2005) Steganography and steganalysis in voice-over IP scenarios: operational aspects and first experiences with a new steganalysis tool set. In Security, Steganography, & Watermarking of Multimedia Contents VII.
[8] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[9] (2011-06) Steganography in inactive frames of VoIP streams encoded by source codec. IEEE Transactions on Information Forensics and Security 6 (2), pp. 296–306.
[10] (2011) Detection of covert voice-over internet protocol communications using sliding window-based steganalysis. IET Communications 5 (7), pp. 929–936.
[11] (2012) Steganography integration into a low-bit rate speech codec. IEEE Transactions on Information Forensics & Security 7 (6), pp. 1865–1875.
[12] (2014) Improving security of quantization-index-modulation steganography in low bit-rate speech streams. Multimedia Systems 20 (2), pp. 143–154.
[13] (2016) Bag of tricks for efficient text classification.
[14] (2007) Mel-cepstrum-based steganalysis for VoIP steganography. In Security, Steganography, & Watermarking of Multimedia Contents IX.
[15] (2017) Steganalysis of QIM steganography in low-bit-rate speech signals. IEEE/ACM Transactions on Audio Speech & Language Processing 25 (99), pp. 1–1.
[16] (2018) RNN-SM: fast steganalysis of VoIP streams using recurrent neural network. IEEE Transactions on Information Forensics & Security PP (99), pp. 1–1.
[17] (2008-08) Perceptually transparent information hiding in G.729 bitstream. In 2008 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 406–409.
[18] (2008) An approach to information hiding in low bit-rate speech stream. In Global Telecommunications Conference, 2008. IEEE GLOBECOM, pp. 1–5.
[19] (2019) Steganalysis of VoIP streams with CNN-LSTM network. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, pp. 204–209.
[20] (2007) Covert channels and countermeasures in computer network protocols. IEEE Communications Surveys & Tutorials 9 (3), pp. 44–57.