Text Steganalysis with Attentional LSTM-CNN

12/30/2019 · by YongJian Bao, et al. · Tsinghua University

With the rapid development of Natural Language Processing (NLP) technologies, text steganography methods have been significantly innovated recently, which poses a great threat to cybersecurity. In this paper, we propose a novel attentional LSTM-CNN model to tackle the text steganalysis problem. The proposed method first maps words into a semantic space for better exploitation of the semantic features in texts, and then utilizes a combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) recurrent neural networks to capture both local and long-distance contextual information in steganographic texts. In addition, we apply an attention mechanism to recognize and attend to important clues within suspicious sentences. After merging the feature clues from the CNN and the RNN, we use a softmax layer to categorize the input text as cover or stego. Experiments showed that our model achieves state-of-the-art results on the text steganalysis task.




I Introduction

Steganography is an ancient technique that aims to embed secret messages into carriers such as images [5], texts [25] and voices [14] in undetectable ways. Conversely, steganalysis aims to detect hidden messages in suspicious carriers. Text steganography has attracted considerable attention because text has become the most widely used information carrier in daily life, and the massive amount of text on the Internet provides rich carriers for text steganography. However, like many other security techniques, text steganography can also be exploited by lawbreakers, terrorists and hackers for illegitimate purposes, posing serious threats to cybersecurity. It is therefore crucial to develop a powerful and practical steganalysis tool for text steganography.

Fig. 1: Structure of Proposed Attentional LSTM-CNN network

Text steganography methods can mainly be divided into two families: modification based methods and generation based methods. The former usually embed secret information by modifying the cover texts; representative methods include synonym substitution [13], [20] and word-shifting [16]. The latter usually generate stego texts directly, based on technologies such as Markov chains and deep learning [24]. Compared with the modification based methods, the generation based methods have attracted increasing attention, because recently proposed methods can generate more natural texts that conform to the statistical distribution of natural language and achieve greater hiding capacity, which makes them harder to detect.

Text steganalysis has typically followed the same pattern: extracting statistical features directly from the carrier and then conducting classification. For example, Taskiran et al. [18] use a universal steganalysis method based on language models and Support Vector Machines (SVMs) to distinguish stego texts produced by a lexical steganography algorithm. Chen et al. [2] propose a blind steganalytic method named natural frequency zoned word distribution analysis (NFZ-WDA) to deal with translation based steganography (TBS). Yang and Cao [21] present a linguistic steganalysis approach based on meta features and an immune clone mechanism. However, these methods rely on handcrafted features, which depend heavily on domain knowledge and are hard to adapt to the wide variety of text steganography methods.

Recently, neural network architectures such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have achieved significant results in many fields [6, 10, 23, 22]. Neural network based methods can automatically extract features from carriers and then conduct steganalysis in an end-to-end manner, which has attracted much attention in steganalysis. CNNs can learn local responses from temporal or spatial data but lack the ability to learn sequential correlations, whereas RNNs can handle sequences of arbitrary length and capture long-term contextual dependencies. In this paper, we show how to take advantage of both architectures and propose a novel attentional LSTM-CNN model for text steganalysis. In the proposed model, we first use a word embedding layer to map words into dense vectors, so that we obtain more accurate representations of words and can extract their semantic and syntactic features. Then, a bidirectional Long Short-Term Memory (Bi-LSTM) recurrent neural network captures long-term contextual information from the texts, and a CNN layer with different kernel sizes extracts local features. Features from the CNN layer and the Bi-LSTM layer are merged as clues for steganalysis. Finally, a fully-connected layer and a softmax layer serve as a classifier that outputs the classification probabilities.

The rest of this paper is organized as follows. Section 2 describes the details of the proposed attentional LSTM-CNN architecture. Section 3 shows the experimental results and the model is discussed in this part. Finally, concluding remarks are given in Section 4.

II Proposed Approach

The framework of our attentional LSTM-CNN model is illustrated in Figure 1. Each layer of the network is introduced from bottom to top in the following sections.

Method | Meng et al. [12] | Samanta et al. [15] | Din et al. [3] | Yang et al. [26] | Proposed method
Format bpw | Acc P R | Acc P R | Acc P R | Acc P R | Acc P R
News 1 0.532 0.517 0.382 0.763 0.739 0.812 0.840 0.869 0.801 0.858 0.858 0.858 0.913 0.930 0.894
2 0.513 0.535 0.204 0.786 0.762 0.832 0.835 0.867 0.791 0.864 0.915 0.803 0.920 0.923 0.916
3 0.597 0.679 0.367 0.824 0.767 0.931 0.897 0.909 0.882 0.920 0.922 0.918 0.962 0.966 0.958
4 0.755 0.831 0.640 0.859 0.797 0.962 0.938 0.962 0.911 0.961 0.979 0.942 0.973 0.981 0.966
5 0.847 0.918 0.761 0.881 0.829 0.959 0.961 0.976 0.945 0.973 0.988 0.958 0.985 0.983 0.987
IMDB 1 0.577 0.642 0.345 0.767 0.779 0.744 0.787 0.829 0.722 0.845 0.941 0.736 0.901 0.953 0.844
2 0.713 0.807 0.560 0.849 0.934 0.871 0.869 0.911 0.818 0.918 0.947 0.886 0.957 0.972 0.940
3 0.840 0.925 0.741 0.900 0.877 0.931 0.916 0.944 0.885 0.941 0.950 0.932 0.966 0.983 0.949
4 0.909 0.969 0.845 0.937 0.905 0.975 0.962 0.975 0.947 0.976 0.986 0.966 0.987 0.990 0.983
5 0.909 0.989 0.828 0.929 0.921 0.940 0.977 0.987 0.966 0.990 0.988 0.992 0.995 0.996 0.993
Twitter 1 0.538 0.520 0.387 0.654 0.652 0.658 0.665 0.664 0.670 0.745 0.811 0.621 0.786 0.873 0.657
2 0.544 0.523 0.399 0.745 0.762 0.712 0.750 0.827 0.631 0.793 0.914 0.647 0.834 0.883 0.770
3 0.577 0.669 0.303 0.809 0.798 0.826 0.834 0.889 0.764 0.879 0.939 0.812 0.908 0.950 0.861
4 0.729 0.836 0.570 0.842 0.824 0.871 0.885 0.950 0.813 0.934 0.988 0.879 0.943 0.986 0.899
5 0.850 0.916 0.770 0.851 0.839 0.870 0.899 0.961 0.832 0.921 0.960 0.879 0.936 0.958 0.911
TABLE I: Detection Performance Compared with Previous State-of-the-Art Methods

II-A Word Embedding Layer

The first layer is a word embedding layer, which converts the sequence of words in a sentence into a low-dimensional vector sequence. In the proposed model, each input sentence can be expressed as $S = (w_1, w_2, \dots, w_n)$, where each word $w_i$ has a unique id in the word dictionary. With this word index, we obtain a dense vector from the embedding layer. After the embedding process, the sentence is expressed as a vector sequence, denoted as $E = (e_1, e_2, \dots, e_n)$, where $e_i \in \mathbb{R}^{d}$ and $d$ is the embedding dimension.
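The embedding layer is effectively a row lookup in a trainable matrix. A minimal NumPy sketch follows; the vocabulary size and word ids are illustrative placeholders, not values from the trained model (only the 256-dimensional embedding matches the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 5000, 256  # 256 is the embedding size used in the paper
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

# A sentence is a sequence of word ids from the dictionary (hypothetical ids here).
word_ids = np.array([12, 845, 3, 1999, 7])

# Embedding is a row lookup: each id selects its dense vector.
E = embedding_matrix[word_ids]  # shape (n, d) = (5, 256)
print(E.shape)                  # (5, 256)
```

In a real model the matrix is learned jointly with the rest of the network rather than fixed at random.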

II-B Bi-LSTM Layer

Considering the close relevance between consecutive words, we use a Bi-LSTM encoder to capture abstract information from both directions. LSTM [8] is a variant of RNN that alleviates the vanishing gradient problem. Given an input sequence $E = (e_1, \dots, e_n)$, an LSTM unit can be described using the following formulas:

$i_t = \sigma(W_i e_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f e_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o e_t + U_o h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W_c e_t + U_c h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

where $\sigma$ is the logistic sigmoid function, $\odot$ is element-wise multiplication, and $W_*$, $U_*$, $b_*$ are parameters. Bi-LSTM consists of a forward LSTM that encodes the sentence from $e_1$ to $e_n$ and a backward LSTM that encodes the sentence in reverse. This can be expressed as follows:

$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{h_{t-1}}), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{h_{t+1}}), \qquad h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \in \mathbb{R}^{2u}$

where $u$ is the hidden dimension of the LSTM unit, and the forward and backward LSTMs have separate parameters.
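The gate equations above can be sketched as a single NumPy step; the dimensions and random parameters are placeholders for illustration, not the trained weights (a Bi-LSTM simply runs this recurrence once forward and once over the reversed sequence, then concatenates the two hidden states):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(e_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4u, d), U: (4u, u), b: (4u,), gates stacked as [i, f, o, c~]."""
    u = h_prev.shape[0]
    z = W @ e_t + U @ h_prev + b
    i, f, o = sigmoid(z[:u]), sigmoid(z[u:2*u]), sigmoid(z[2*u:3*u])
    c_tilde = np.tanh(z[3*u:])
    c_t = f * c_prev + i * c_tilde  # cell state mixes old memory with the new candidate
    h_t = o * np.tanh(c_t)          # hidden state is a gated view of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
d, u = 8, 4                         # toy input and hidden sizes
W, U, b = rng.normal(size=(4*u, d)), rng.normal(size=(4*u, u)), np.zeros(4*u)
h, c = np.zeros(u), np.zeros(u)
for e_t in rng.normal(size=(5, d)): # run the recurrence over a 5-step sequence
    h, c = lstm_step(e_t, h, c, W, U, b)
print(h.shape)                      # (4,)
```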

II-C Attention Mechanism

An attention layer [19] is incorporated after the Bi-LSTM layer to automatically select and attend to important words. The hidden states generated by the Bi-LSTM layer are denoted as $H = (h_1, \dots, h_n)$, and for each $h_t$, its attention weight $\alpha_t$ can be formulated as follows:

$u_t = \tanh(W_a h_t + b_a), \qquad \alpha_t = \dfrac{\exp(u_t^{\top} u_w)}{\sum_{k=1}^{n} \exp(u_k^{\top} u_w)}$

where $W_a$, $b_a$ and the context vector $u_w$ are the parameters of the attention layer. Therefore, the output representation $v$ is given by:

$v = \sum_{t=1}^{n} \alpha_t h_t$
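This additive attention can be sketched compactly in NumPy; the state dimension, attention dimension and random parameters below are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

def attention(H, W_a, b_a, u_w):
    """H: (n, 2u) Bi-LSTM states. Returns per-step weights and the pooled vector."""
    U = np.tanh(H @ W_a + b_a)           # (n, a): project states into attention space
    scores = U @ u_w                     # (n,): alignment with the context vector u_w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over time steps
    v = alpha @ H                        # (2u,): attention-weighted sum of the states
    return alpha, v

rng = np.random.default_rng(0)
n, h2, a = 6, 8, 5                       # 6 steps, state dim 8, attention dim 5
H = rng.normal(size=(n, h2))
alpha, v = attention(H, rng.normal(size=(h2, a)), np.zeros(a), rng.normal(size=a))
print(alpha.shape, v.shape)              # (6,) (8,)
```

Subtracting `scores.max()` before exponentiation is a standard numerical-stability trick and does not change the softmax result.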
II-D CNN Layer

We utilize a Convolutional Neural Network (CNN) to capture local contexts. The one-dimensional convolution involves a filter vector sliding over a sequence and detecting features at different positions, similar to the convolution used in [10]. A filter $m \in \mathbb{R}^{k \times d}$ convolves with the window vectors at each position in a valid way to generate a feature map $c \in \mathbb{R}^{n-k+1}$, each element of which is produced as:

$c_j = f(m \odot x_{j:j+k-1} + b)$

where $x_{j:j+k-1}$ denotes a window of $k$ consecutive frame vectors in the input sequence, $\odot$ is element-wise multiplication (summed over the window), $b$ is a bias term, and $f$ is a nonlinear transformation function, for which ReLU [7] is used in our model. Three different convolution kernel sizes are used to exploit different lengths of local context.
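A minimal NumPy sketch of this "valid" one-dimensional convolution follows; the sentence length, embedding dimension and random filters are illustrative placeholders (the kernel sizes 3, 4 and 5 match the configuration described in Section III):

```python
import numpy as np

def conv1d_valid(E, m, b):
    """E: (n, d) embedded sentence; m: (k, d) filter. Valid 1-D convolution + ReLU."""
    n, _ = E.shape
    k = m.shape[0]
    c = np.empty(n - k + 1)
    for j in range(n - k + 1):
        # element-wise product of the filter with k consecutive frame vectors, summed
        c[j] = np.sum(m * E[j:j + k]) + b
    return np.maximum(c, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 16))  # a 10-word sentence with 16-dim embeddings
feature_maps = [conv1d_valid(E, rng.normal(size=(k, 16)), 0.0) for k in (3, 4, 5)]
print([fm.shape for fm in feature_maps])  # [(8,), (7,), (6,)]
```

Each kernel size yields a feature map of length $n-k+1$, so wider kernels see longer local contexts but produce shorter maps.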

II-E Feature Fusion and Classification

In order to better exploit features at different levels, and inspired by the residual connections in ResNet, we apply a concatenating layer that concatenates the features from the Bi-LSTM layer and the CNN layer. The compound feature vector $F$ can be denoted as:

$F = [F_{\mathrm{lstm}}; F_{\mathrm{cnn}}]$

where $F_{\mathrm{lstm}}$ is the representative feature from the Bi-LSTM layer and $F_{\mathrm{cnn}}$ is the representative feature from the CNN layer. Generally, the dimension of $F$ is still very high, which carries a risk of overfitting. Therefore, we first utilize global average pooling to reduce the dimension of the features. After that, the pooled features $F'$ are fed into a classification layer to generate the probability distribution over the label set. The classification layer can be formulated as:

$p = \mathrm{softmax}(W_s F' + b_s)$

where $W_s$ and $b_s$ are the parameter matrix and bias term of the linear transformation.
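The fusion and classification steps can be sketched in NumPy; the per-step feature dimensions and random weights are hypothetical stand-ins for the Bi-LSTM and CNN outputs:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n = 20
F_lstm = rng.normal(size=(n, 8))  # per-step Bi-LSTM features (toy dimensions)
F_cnn = rng.normal(size=(n, 6))   # per-step CNN features, aligned to the same length

F = np.concatenate([F_lstm, F_cnn], axis=1)  # (n, 14) compound feature sequence
F_pooled = F.mean(axis=0)                    # global average pooling -> (14,)

W_s, b_s = rng.normal(size=(2, 14)), np.zeros(2)  # two classes: cover vs. stego
p = softmax(W_s @ F_pooled + b_s)
print(p.shape)                                    # (2,)
```

Global average pooling collapses the time dimension, so the classifier sees a fixed-size vector regardless of sentence length.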

II-F Training Framework

The whole proposed model is trained under a supervised learning framework where the cross-entropy loss is chosen as the loss function of the network. Given a training sample $x$ with true label $y$, where $K$ is the number of possible labels and $\hat{p}_j$ is the estimated probability for each label $j$, the error is defined as:

$L(x, y) = -\sum_{j=1}^{K} \mathbb{1}\{y = j\} \log \hat{p}_j$

where $\mathbb{1}\{\text{condition}\}$ is an indicator such that $\mathbb{1}\{\text{condition is true}\} = 1$ and $\mathbb{1}\{\text{condition is false}\} = 0$. Moreover, in order to mitigate overfitting, we apply the dropout technique [17] and Batch Normalization [9] to regularize our model.
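Because the indicator selects exactly one term of the sum, the loss reduces to the negative log-probability assigned to the true class. A small worked sketch with hypothetical probabilities:

```python
import numpy as np

def cross_entropy(p_hat, y):
    """p_hat: (K,) predicted probabilities; y: true label index in {0, ..., K-1}.
    L = -sum_j 1{y=j} * log(p_hat[j]) reduces to -log(p_hat[y])."""
    return -np.log(p_hat[y])

p_hat = np.array([0.9, 0.1])     # model says: 90% cover, 10% stego
loss_correct = cross_entropy(p_hat, 0)  # true label is cover: small loss
loss_wrong = cross_entropy(p_hat, 1)    # true label is stego: large loss
```

The loss is near zero when the model assigns high probability to the true label and grows without bound as that probability approaches zero.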

III Experiments and Analysis

Our experiments are based on the T-Steg dataset collected by Yang et al. [27], with texts generated by the model proposed by Fang et al. [4]. T-Steg covers the most common text media on the Internet, including Twitter posts, movie reviews from IMDB, and news. The steganographic methods can generate stego texts at different embedding rates by altering the number of bits hidden per word (bpw); for each type of text and each embedding rate, T-Steg contains 10,000 steganographic sentences.

The hyper-parameters of the proposed model were determined by cross-validation on the validation set. Specifically, the embedding size was 256, the number of LSTM units per layer was 200, the kernel sizes in the CNN layer were 3, 4 and 5 with 128 feature maps each, and the dimension of the fully-connected layer in the classifier was 100. We chose Adam [1] as the optimization method, with an initial learning rate of 0.001, a batch size of 128, and a dropout rate of 0.5.

To validate the performance of our model, we chose several representative steganalysis algorithms as baselines [12, 15, 3, 26], and used evaluation indicators commonly applied in classification tasks: Accuracy (Acc), Precision (P) and Recall (R) [27].

III-A Detection Performance

The comparison with previous state-of-the-art methods is shown in Table I. From the results, we can observe that in most cases, as the amount of steganographic information in the generated texts increases, the detection performance also increases. This is easy to understand: the more information is embedded in a text, the less natural the generated text becomes, which damages the coherence of its semantics and provides more clues for steganalysis. Moreover, compared to the other text steganalysis methods, the proposed model achieves the best detection results on all metrics, across different text formats and different embedding rates.

III-B Model Discussion

In this part, we investigate the function of the different parts of the proposed model by comparing it with several variants. The baseline in this experiment is the LSTM-CNN architecture (LSTM+CNN). The variant that utilizes bidirectional information is denoted (Bi-LSTM+CNN). On top of this, the attention mechanism is added, written as (Bi-LSTM+CNN+ATT). Finally, our full model combines the features of the Bi-LSTM and CNN layers, namely (Bi-LSTM+CNN+ATT+CL). The performance of each variant is shown in Table II. From the table, we can see that using information from both directions of the text, adding attention to the network, and combining the features of the Bi-LSTM and CNN layers are all effective in the proposed model.

Index Network Description Accuracy (%)
#0 LSTM+CNN 89.09
#1 Bi-LSTM+CNN 89.37
#2 Bi-LSTM+CNN+ATT 90.62
#3 Bi-LSTM+CNN+ATT+CL 91.35
TABLE II: Results on test data under different model variants

IV Conclusions

In this paper, we propose a novel attentional LSTM-CNN model to tackle the text steganalysis problem. The proposed method first maps words into a semantic space and then utilizes a combination of CNNs and LSTMs to capture both local and long-distance contextual information in steganographic texts. In addition, we apply an attention mechanism to recognize and attend to important clues within suspicious sentences. A concatenating layer combines feature clues from the CNN and the RNN, and finally a softmax layer classifies the input text as cover or stego. Experiments showed that our model achieves state-of-the-art results on the text steganalysis task. Our code will be released once this paper is accepted.


This research is supported by the National Key R&D Program (SQ2018YGX210002) and the National Natural Science Foundation of China (No. U1536207 and No. U1636113).


  • [1] D. P. Kingma and J. L. Ba (2014) Adam: a method for stochastic optimization. In 3rd Int. Conf. Learn. Representations. Cited by: §III.
  • [2] Z. Chen, L. Huang, P. Meng, W. Yang, and H. Miao (2011) Blind linguistic steganalysis against translation based steganography. Cited by: §I.
  • [3] R. Din, S. A. M. Yusof, A. Amphawan, H. S. Hussain, H. Yaacob, N. Jamaludin, and A. Samsudin (2015) Performance analysis on text steganalysis method using a computational intelligence approach. Proceeding of the Electrical Engineering Computer Science and Informatics 2 (1), pp. 67–73. Cited by: TABLE I, §III.
  • [4] T. Fang, M. Jaggi, and K. Argyraki (2017) Generating steganographic text with lstms. arXiv preprint arXiv:1705.10742. Cited by: §III.
  • [5] J. Fridrich (2009) Steganography in digital media. Cambridge University Press Cambridge, pp. xxiv+437. Cited by: §I.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision & Pattern Recognition. Cited by: §I.
  • [7] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. Cited by: §II-D.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §II-B.
  • [9] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, JMLR.org, pp. 448–456. Cited by: §II-F.
  • [10] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §I, §II-D.
  • [11] X. Li and H. H. Yu (2000) Transparent and robust audio data hiding in subband domain. In The International Conference on Information Technology: Coding and Computing, pp. 74. Cited by: §I.
  • [12] P. Meng, L. Hang, W. Yang, Z. Chen, and H. Zheng (2009) Linguistic steganography detection algorithm using statistical language model. In 2009 International Conference on Information Technology and Computer Science, Vol. 2, pp. 540–543. Cited by: TABLE I, §III.
  • [13] N. Provos and P. Honeyman (2003) Hide and seek: an introduction to steganography. IEEE Security & Privacy 1 (3), pp. 32–44. Cited by: §I.
  • [14] G. B. Rhoads (2004) Audio steganography. Cited by: §I.
  • [15] S. Samanta, S. Dutta, and G. Sanyal (2016) A real time text steganalysis by using statistical method. In 2016 IEEE International Conference on Engineering and Technology (ICETECH), pp. 264–268. Cited by: TABLE I, §III.
  • [16] M. H. Shirali-Shahreza and M. Shirali-Shahreza (2006) A new approach to persian/arabic text steganography. In 5th IEEE/ACIS International Conference on Computer and Information Science and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse (ICIS-COMSAR’06), pp. 310–315. Cited by: §I.
  • [17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §II-F.
  • [18] C. M. Taskiran, U. Topkara, M. Topkara, and E. J. Delp (2006) Attacks on lexical natural language steganography systems. Cited by: §I.
  • [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §II-C.
  • [20] K. Winstein (1999) Lexical steganography through adaptive modulation of the word choice hash. secondary education at the illinoismathematics and science academy, january. Cited by: §I.
  • [21] H. Yang and X. Cao (2010) Linguistic steganalysis based on meta features and immune mechanism. Chinese Journal of Electronics 19 (4), pp. 661–666. Cited by: §I.
  • [22] H. Yang, Z. Yang, Y. Bao, and Y. Huang (2019) Hierarchical representation network for steganalysis of qim steganography in low-bit-rate speech signals. arXiv preprint arXiv:1910.04433. Cited by: §I.
  • [23] H. Yang, Z. Yang, and Y. Huang (2019) Steganalysis of voip streams with cnn-lstm network. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, pp. 204–209. Cited by: §I.
  • [24] Z. Yang, X. Guo, Z. Chen, Y. Huang, and Y. Zhang (2018) RNN-stega: linguistic steganography based on recurrent neural networks. IEEE Transactions on Information Forensics and Security 14 (5), pp. 1280–1295. Cited by: §I.
  • [25] Z. Yang, X. Guo, Z. Chen, Y. Huang, and Y. Zhang (2018) RNN-stega: linguistic steganography based on recurrent neural networks. IEEE Transactions on Information Forensics and Security. Cited by: §I.
  • [26] Z. Yang, Y. Huang, and Y. Zhang (2019) A fast and efficient text steganalysis method. IEEE Signal Processing Letters 26 (4), pp. 627–631. Cited by: TABLE I, §III.
  • [27] Z. Yang, N. Wei, J. Sheng, Y. Huang, and Y. Zhang (2018) TS-cnn: text steganalysis from semantic space based on convolutional neural network. arXiv preprint arXiv:1810.08136. Cited by: §III, §III.