Steganography tries to hide messages in plain sight, while steganalysis tries to detect the existence of secret data in suspicious carriers. VoIP is a real-time service that enables users to make phone calls through IP data networks. It is utilized in many popular apps such as WeChat, Skype and Snapchat, which are widely used for instant messaging in daily life. Compared with traditional steganographic audio carriers, VoIP possesses many particular advantages, including instantaneity, a large mass of carrier data, high covert bandwidth and flexible conversation length, making it a very popular scheme for covert communication [1, 2, 3]. However, like many other security techniques, VoIP-based steganography might also be employed by lawbreakers, terrorists and hackers for illegitimate purposes, which poses serious threats to cybersecurity. Thus, it is crucial to develop a powerful and practical steganalysis tool for VoIP streams.
Recent years have witnessed a variety of neural network models that boost the performance of audio steganalysis. For example, Paulin et al. used deep belief networks to detect time-domain LSB steganography. In later work, a general steganalysis scheme named Spec-ResNet (Deep Residual Network of Spectrogram) was proposed to detect steganography schemes operating in different embedding domains for AAC and MP3. Lin et al. proposed the Codeword Correlation Model (CCM), which used a recurrent neural network (RNN) to extract correlation features for speech steganalysis. Other authors proposed a model combining the strengths of bi-directional Long Short-Term Memory (LSTM) and Convolutional Neural Networks to conduct audio steganalysis, achieving the current state-of-the-art results.
Obviously, recent works often switch between two types of deep neural networks (DNNs): RNNs and CNNs. RNNs suit sequential architectures and can capture long-range dependencies, but have low computational efficiency. CNNs, whose hierarchical structure is good at extracting local or position-invariant features, are hard pressed to capture long-range dependencies. Compared with RNNs and CNNs, the attention mechanism is more flexible for modeling sequences and is more task/data-driven when modeling dependencies. Unlike sequential models, its computation can be easily and significantly accelerated by existing distributed/parallel computing schemes, which makes it very suitable for real-time steganalysis in scenarios like VoIP. However, to the best of our knowledge, a steganalysis method based solely on attention for extracting correlation features has not been studied for VoIP steganalysis, or even for audio steganalysis in general. Thus, it is very promising to design a model based entirely on the attention mechanism for steganalysis.
In this paper, we propose a light-weight and RNN/CNN-free neural network named the Fast Correlation Extract Model (FCEM) to conduct steganalysis of QIM steganography in VoIP streams. In the proposed model, compressed speech frames are first converted to dense vectors based on an embedding matrix. Then, a variant of attention called "multi-head attention" is used to capture the correlations, or frame context dependencies, producing context-aware representations of the compressed speech frames. Finally, the new representations are passed to a classification module to compute the final prediction for the steganalysis task. The simple architecture of FCEM leads to fewer parameters, less computation and easier parallelization, making it well suited for wide deployment in online services.
The contributions of this work are: 1) we are the first to propose a method based solely on the attention mechanism to extract correlation features in steganalysis of VoIP streams, or even in audio steganalysis generally; 2) experiments show that the proposed steganalysis method achieves excellent performance in detecting low embedding rates and short samples. What's more, the proposed model significantly reduces inference time when the sample length is as short as 0.1s, so it can be easily utilized in online services.
The rest of the paper is structured as follows. In Section 2, we introduce the proposed method in detail. Section 3 presents the experimental results. Finally, we draw conclusions and offer some directions for future research in Section 4.
2 The Proposed Network
The architecture of the proposed Fast Correlation Extract Model is illustrated in Figure 1. Each layer in our model, from bottom to top, is explained in detail in the following parts.
2.1 Input Layer-Quantization Index Sequence
In general, low bit-rate speech codecs such as ITU-G.723.1 are widely used in VoIP communications for compressing the speech or audio signal components of streaming media. Most speech codecs are based on the linear predictive coding (LPC) model, which uses an LPC filter to analyze and synthesize acoustic signals between the encoding and decoding endpoints. In LPC encoding, the LPC coefficients are first converted into Line Spectrum Frequency (LSF) coefficients, and the LSFs are then encoded by Vector Quantization (VQ). The Quantization Index Sequence (QIS) is generated in the process of VQ, which can be denoted as follows:

$\mathrm{QIS} = (c_1, c_2, \dots, c_T)$

where $c_t$ denotes the quantization indices of the $t$-th frame and $T$ is the total frame number in the sample window of the speech. In a VoIP scenario, we can utilize a sliding detection window to collect one or several continuous packets each time to construct these quantization index sequences for steganalysis.
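As an illustration of how such detection windows could be assembled from a running stream, here is a minimal NumPy sketch; the window size, hop size and the choice of three codewords per frame are hypothetical values for illustration, not settings from the paper:

```python
import numpy as np

def sliding_windows(qis_stream, window_frames, hop_frames):
    """Collect overlapping QIS detection windows from a running VoIP stream.

    qis_stream: (N, k) array with one row of k quantization integers per frame.
    Returns an array of shape (num_windows, window_frames, k).
    """
    windows = []
    for start in range(0, len(qis_stream) - window_frames + 1, hop_frames):
        windows.append(qis_stream[start:start + window_frames])
    return np.stack(windows)

# 20 frames with 3 quantization integers each (illustrative dummy data)
stream = np.arange(60).reshape(20, 3)
w = sliding_windows(stream, window_frames=10, hop_frames=5)
```

Each window is then fed to the steganalysis model independently, so detection latency is bounded by the window length rather than the whole call.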
QIM steganography divides the quantization codebook of VQ into several parts according to the embedded data. The VQ codewords at different positions then use independent quantization codebooks, which distorts the statistical characteristics of the QIS. Thus, the QIS can be taken as a clue for steganalysis, and we take it as the model input.
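To make the codebook-division idea concrete, the following toy sketch embeds one bit per quantization step by restricting the quantizer to one of two sub-codebooks. The scalar codebook and its parity-based partition are illustrative assumptions only, not the actual CNV-QIM construction:

```python
import numpy as np

# toy scalar VQ codebook and a partition into two sub-codebooks by index parity
codebook = np.linspace(-1.0, 1.0, 8)
sub = {0: codebook[0::2], 1: codebook[1::2]}

def qim_quantize(x: float, bit: int) -> float:
    """Quantize x using only the sub-codebook selected by the secret bit."""
    cb = sub[bit]
    return cb[np.argmin(np.abs(cb - x))]

# the embedded bit biases which codewords can appear, distorting QIS statistics
c0 = qim_quantize(0.30, bit=0)
c1 = qim_quantize(0.30, bit=1)
```

For the same input value, the two secret bits yield different codewords, which is exactly the statistical footprint a steganalyzer tries to detect.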
2.2 Embedding Layer
An embedding layer is a common approach to represent objects using low-dimensional dense vectors, a very popular technique in tasks such as natural language processing and speech processing. Specifically, in our task, the weights of the embedding layer are of the shape $V \times d$, where $V$ is the codebook size and $d$ is the embedding dimension. For each training sample, the input codewords are the quantization integers in the QIS, which lie in the range $[0, V-1]$. The embedding layer transforms each quantization integer $c$ into the $c$-th row of the embedding weight matrix, i.e.,

$e = \mathrm{onehot}(c) \cdot W_e$

where $W_e$ is a randomly initialized embedding matrix for codewords, $\mathrm{onehot}(c)$ is the corresponding one-hot vector created by the quantization integer codeword $c$ in the QIS, and $e$ is a dense vector whose dimension is $d$. After the embedding layer, the representation of the $t$-th QIS frame can be denoted as:

$x_t = e_{t,1} \oplus e_{t,2} \oplus \cdots \oplus e_{t,k}$

where $\oplus$ is the concatenation operator and $k$ is the number of codewords per frame.
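A minimal NumPy sketch of this lookup-and-concatenate step follows; the codebook size, embedding dimension and number of codewords per frame are hypothetical values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 128, 100                          # hypothetical codebook size / embedding dim
E = rng.normal(size=(V, d))              # randomly initialized embedding matrix

# one frame's quantization integers (illustrative values)
frame_codewords = np.array([17, 92, 5])

# row lookup is equivalent to multiplying a one-hot vector by the matrix
one_hot = np.eye(V)[frame_codewords[0]]
assert np.allclose(one_hot @ E, E[frame_codewords[0]])

# frame representation: concatenation of its codeword embeddings
frame_repr = np.concatenate([E[c] for c in frame_codewords])
```

In a real network the matrix would be a trainable parameter (e.g. a Keras `Embedding` layer) rather than a fixed random array.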
2.3 Position Encoding
The attention mechanism cannot distinguish between different positions, so it is crucial to encode the position of each codeword. There are various ways to encode positions, and the simplest one is to use an additional position encoding module. In this paper, we adopt the sinusoidal approach proposed for the Transformer, which is formulated as follows:

$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d}\right), \quad PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d}\right)$

where $pos$ is the frame position, $2i$ and $2i+1$ index positions in the embedded vector, and $d$ is the dimension of the frame representation $x_t$. The position encodings are simply added to the input embeddings to encode position information.
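The sinusoidal encoding can be sketched in a few lines of NumPy; the frame count and dimension below are arbitrary illustrative values:

```python
import numpy as np

def positional_encoding(num_frames: int, dim: int) -> np.ndarray:
    """Sinusoidal position encoding in the style of Vaswani et al. (2017)."""
    pe = np.zeros((num_frames, dim))
    pos = np.arange(num_frames)[:, None]      # frame positions
    i = np.arange(0, dim, 2)[None, :]         # even dimension indices
    angle = pos / np.power(10000.0, i / dim)
    pe[:, 0::2] = np.sin(angle)               # even dims: sine
    pe[:, 1::2] = np.cos(angle)               # odd dims: cosine
    return pe

pe = positional_encoding(num_frames=50, dim=100)
# the encodings are added element-wise to the frame embeddings
```

Because the encoding is deterministic, it adds no parameters and can be precomputed once for the maximum window length.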
2.4 Correlation Extraction Layer
Correlations between different frames are crucial in speech steganalysis. Here, we adopt multi-head self-attention, which has recently achieved remarkable performance in modeling complicated relations between different codewords. Taking the $i$-th frame feature $x_i$ as an example, we explain how to identify multiple meaningful correlation features involving this feature based on such a mechanism. At first, we define the correlation between feature $x_i$ and feature $x_j$ under a specific attention head $h$ as follows:

$\alpha_{ij}^{(h)} = \frac{\exp\left(\psi^{(h)}(x_i, x_j)\right)}{\sum_{l} \exp\left(\psi^{(h)}(x_i, x_l)\right)}, \quad \psi^{(h)}(x_i, x_j) = \left\langle W_Q^{(h)} x_i,\ W_K^{(h)} x_j \right\rangle$

where $\psi^{(h)}(\cdot,\cdot)$ is an attention function which defines the correlation between feature $x_i$ and feature $x_j$; in most cases, it can be defined as an inner product. $W_Q^{(h)}$ and $W_K^{(h)}$ are transformation matrices which map the original embedding space into a new semantic space. Next, we recalibrate the representation of feature $x_i$ in subspace $h$ by combining all relevant features guided by the coefficients $\alpha_{ij}^{(h)}$:

$\tilde{x}_i^{(h)} = \sum_{j} \alpha_{ij}^{(h)} \left(W_V^{(h)} x_j\right)$

where $W_V^{(h)}$ is the value transformation matrix. Since $\tilde{x}_i^{(h)}$ is a combination of feature $x_i$ and its relevant features under head $h$, it represents a new combinatorial feature learned by our method. Furthermore, a feature is also likely to be involved in different combinatorial features, and we achieve this by using multiple heads, which create different subspaces and learn distinct feature interactions separately. We collect the combinatorial features learned in all subspaces as follows:

$\tilde{x}_i = \tilde{x}_i^{(1)} \oplus \tilde{x}_i^{(2)} \oplus \cdots \oplus \tilde{x}_i^{(H)}$

where $\oplus$ is the concatenation operator and $H$ is the total number of heads. With such an interacting layer, the representation of each feature $x_i$ is updated into $\tilde{x}_i$ to represent high-order correlation features.
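The whole interaction layer can be sketched in NumPy as follows. The sequence length, model dimension, head count and per-head dimension are illustrative, and the random projection matrices stand in for learned parameters; the scaled dot-product form follows the standard Transformer formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv):
    """X: (T, d) frame representations; Wq/Wk/Wv: per-head (d, d_h) projections.

    Returns (T, H*d_h): concatenation of the per-head recalibrated features.
    """
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        d_h = Q.shape[-1]
        # correlation coefficients between every pair of frames
        alpha = softmax(Q @ K.T / np.sqrt(d_h), axis=-1)
        heads.append(alpha @ V)               # weighted combination of frames
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
T, d, H, d_h = 10, 100, 8, 32                 # illustrative sizes
X = rng.normal(size=(T, d))
Wq = [rng.normal(size=(d, d_h)) for _ in range(H)]
Wk = [rng.normal(size=(d, d_h)) for _ in range(H)]
Wv = [rng.normal(size=(d, d_h)) for _ in range(H)]
out = multi_head_self_attention(X, Wq, Wk, Wv)
```

Since every head and every frame position is computed with dense matrix products, the whole layer parallelizes trivially, which is the efficiency advantage exploited here.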
2.5 Output Layer
The output of the correlation extraction layer is a set of feature vectors. We simply concatenate them and then feed them into a classification layer for the final prediction. The process can be expressed as:

$\hat{y} = \mathrm{softmax}\left(W_o \left(\tilde{x}_1 \oplus \tilde{x}_2 \oplus \cdots \oplus \tilde{x}_T\right) + b_o\right)$

where $W_o$ and $b_o$ are the parameter matrix and bias term of the linear transformation and $T$ is the total number of frames in the sample window. After this step, the model outputs the probabilities that a sample belongs to the cover or stego class. Besides, dropout is used in this layer to prevent over-fitting.
The proposed model is trained in a supervised framework with the cross-entropy loss. The parameters of the model are updated by minimizing this loss via gradient descent.
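The classification step and its cross-entropy loss can be sketched as follows in NumPy; all shapes and the random parameters are illustrative stand-ins for the learned weights:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(Z, W, b):
    """Z: (T, d) correlation features; flatten, apply linear layer, softmax."""
    logits = Z.reshape(-1) @ W + b            # two logits: cover vs. stego
    return softmax(logits)

def cross_entropy(p, y):
    """Negative log-likelihood of the true class y under prediction p."""
    return -np.log(p[y])

rng = np.random.default_rng(1)
T, d = 10, 256                                # illustrative sizes
Z = rng.normal(size=(T, d))
W = rng.normal(size=(T * d, 2)) * 0.01
b = np.zeros(2)

p = predict(Z, W, b)
loss = cross_entropy(p, y=0)                  # label 0 = cover (by convention here)
```

In training, the gradient of this loss with respect to the weights would be followed by an optimizer such as Adam, as described in the experimental settings.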
3 Experiments
3.1 Experimental Settings
Our experiments were conducted on a public dataset (https://github.com/fjxmlzn/RNN-SM). The dataset contains speech from different types of native speakers, giving it good diversity. Each speech file in the dataset was encoded according to the G.729a standard, and secret data were embedded using CNV-QIM steganography with different embedding rates. In our experiments, in order to test the performance of the proposed model at low embedding rates, we chose samples embedded with secret data over a range of low embedding rates. Besides, we cut samples into 0.1s, 0.3s, 0.5s, 0.7s and 1s segments to test the model's performance on short samples. The efficiency test is also based on these samples. We split the dataset into training, testing and validation sets with a ratio of 8:1:1. The proportions of "cover" and "stego" samples are equal in the experiments.
The hyperparameters in our model were selected via cross-validation on the validation set. More specifically, the embedding dimension is 100 and the dimension of the hidden units in each attention head is 32. The dropout rate was 0.6 for the output layer. The batch size in the training process was 256, and the maximal number of training epochs was set to 100. We used Adam as the optimizer for network training. Our model was implemented in Keras. We trained all networks on a GeForce GTX 1080 GPU with 16G graphics memory. Prediction was performed both on the above GPU and on an Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz.
3.2 Impact of Sample Length
The duration of samples has a significant influence on the detection accuracy of QIM steganography in VoIP streams. From Table 1, we can see that short samples are harder to detect than long ones. The reason is that longer samples provide more clues for steganalysis. Besides, we can also observe that the proposed model improves performance across various sample lengths, especially for samples as short as 0.1s, which is more practical in real applications. Thus, we conclude that the proposed method can effectively detect QIM steganography by capturing only a small segment of the speech stream of a monitored VoIP session.
Table 1. Detection results by language, method and sample length (s).
3.3 Impact of Embedding Rates
The embedding rate is an important factor influencing detection accuracy. In order to evade steganalysis, the two communicating parties often adopt a low-embedding-rate strategy, so detection performance at low embedding rates matters greatly in practical scenarios. In this part, we mainly conduct steganalysis at low embedding rates to show the effectiveness of the proposed model. From Table 2, we can observe that the proposed model outperforms all of the previous methods at low embedding rates.
3.4 Evaluation of Time Efficiency of Proposed Model
VoIP is a real-time service used in IP networks, which means that the detection of steganography in VoIP streams should be conducted as fast as possible. Compared with previous machine learning frameworks, the proposed model utilizes attention, which has the advantage of easily parallelizable computation. The experiments in Table 3 show that our model drastically reduces steganalysis time compared to recent works. In particular, even when the sample is as short as 0.1s, the proposed model accelerates steganalysis by more than a factor of two.
4 Conclusion
VoIP is a very popular real-time streaming medium for steganography. Previous steganalysis methods have extracted correlation features in VoIP streams by adopting RNNs, CNNs or their combinations, which either have low computational efficiency or struggle to model long-range dependencies. In this paper, we introduced attention mechanisms to extract correlations between codewords for speech steganalysis tasks and designed a fast correlation extraction network without any RNN/CNN structure to detect QIM-based steganography in VoIP streams. Experimental results indicate that our model outperforms all previous work and achieves state-of-the-art results. The excellent detection performance of the proposed method shows that a steganalysis model based solely on attention is a powerful alternative for extracting correlation features in steganalysis of VoIP streams, and can be easily extended to similar problems in audio steganalysis.
-  L. Liu, M. Li, Q. Li, and Y. Liang, “Perceptually transparent information hiding in g.729 bitstream,” in 2008 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Aug 2008, pp. 406–409.
-  Y. F. Huang, S. Tang, and J. Yuan, “Steganography in inactive frames of voip streams encoded by source codec,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 296–306, June 2011.
-  Bo Xiao, Yongfeng Huang, and Shanyu Tang, “An approach to information hiding in low bit-rate speech stream,” in Global Telecommunications Conference, 2008. IEEE GLOBECOM, 2008, pp. 1–5.
-  Catherine Paulin, Sid Ahmed Selouani, and Éric Hervet, “Audio steganalysis using deep belief networks,” International Journal of Speech Technology, vol. 19, no. 3, pp. 1–7, 2016.
-  Chen B, Luo W, and Li H, “Audio steganalysis with convolutional neural network,” in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, 2017, pp. 85–90.
-  Yanzhen Ren, Dengkai Liu, Qiaochu Xiong, Jianming Fu, and Lina Wang, “Spec-resnet: A general audio steganalysis scheme based on deep residual network of spectrogram,” 2019.
-  Zinan Lin, Yongfeng Huang, and Jilong Wang, “Rnn-sm: Fast steganalysis of voip streams using recurrent neural network,” IEEE Transactions on Information Forensics & Security, vol. PP, no. 99, pp. 1–1, 2018.
-  Hao Yang, Zhongliang Yang, and Yongfeng Huang, “Steganalysis of voip streams with cnn-lstm network,” in Proceedings of the ACM Workshop on Information Hiding and Multimedia Security. ACM, 2019, pp. 204–209.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  B. Chen and G. W. Wornell, “Quantization index modulation: a class of provably good methods for digital watermarking and information embedding,” IEEE Transactions on Information Theory, vol. 47, no. 4, pp. 1423–1443, 2001.
-  D. O’Shaughnessy, “Linear predictive coding,” IEEE Potentials, vol. 7, no. 1, pp. 29–32, 1988.
-  Y. F. Huang, S. Tang, and Y. Zhang, “Detection of covert voice-over internet protocol communications using sliding window-based steganalysis,” Iet Communications, vol. 5, no. 7, pp. 929–936, 2011.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “Fastspeech: Fast, robust and controllable text to speech,” arXiv preprint arXiv:1905.09263, 2019.
-  Diederik P. Kingma and Jimmy Lei Ba, “Adam: A method for stochastic optimization,” in 3rd Int. Conf. Learn. Representations, 2014.
-  Songbin Li, Yizhen Jia, and C. C. Jay Kuo, “Steganalysis of qim steganography in low-bit-rate speech signals,” IEEE/ACM Transactions on Audio Speech & Language Processing, vol. 25, no. 99, pp. 1–1, 2017.
-  Hao Yang, Zhongliang Yang, YongJian Bao, and Yongfeng Huang, “Hierarchical representation network for steganalysis of qim steganography in low-bit-rate speech signals,” arXiv preprint arXiv:1910.04433, 2019.