
AAG-Stega: Automatic Audio Generation-based Steganography

09/10/2018
by   Zhongliang Yang, et al.
Tsinghua University

Steganography, as one of the three basic information security systems, has long played an important role in safeguarding the privacy and confidentiality of data in cyberspace. Audio is one of the most common media of information transmission in our daily life, so it is of great practical significance to use audio as a carrier for information hiding. At present, almost all audio-based information hiding methods follow the carrier modification mode. However, this mode is equivalent to adding noise to the original signal, which creates a difference in the statistical feature distribution of the carrier before and after steganography and impairs the concealment of the entire system. In this paper, we propose an automatic audio generation-based steganography (AAG-Stega), which can automatically generate high-quality audio covers on the basis of the secret bit stream that needs to be embedded. During the automatic audio generation process, we encode the conditional probability distribution of each sampling point and select the corresponding signal output according to the bit stream, thereby embedding the secret information. We designed several experiments to test the proposed model from the perspectives of information imperceptibility and information hiding capacity. The experimental results show that the proposed model can guarantee high hiding capacity and concealment at the same time.

1 Introduction

In his monograph on information security [Shannon1949], Shannon summarized three basic information security systems: the encryption system, the privacy system, and the concealment system. An encryption system encodes information in a special way so that only authorized parties can decode it while unauthorized parties cannot; it ensures the security of information by making the message indecipherable. A privacy system mainly restricts access to information, so that only authorized users can access important information and unauthorized users cannot access it by any means under any circumstances. However, while these two systems ensure information security, they also expose the existence and importance of the information, making it more vulnerable to attacks such as interception and cracking [Bernaille and Teixeira2007]. A concealment system is very different from these two secrecy systems: it embeds secret information into various carriers and transmits them through public channels, hiding the very existence of the secret information so that it does not easily arouse suspicion or attract attack [Simmons1984]. Due to this powerful hiding ability, concealment systems play an important role in protecting trade secrets, military security and even national defense security.

Figure 1: The overall framework of the proposed method. The sender uses the proposed model to automatically generate a piece of music based on the secret bit stream that needs to be transmitted, and then sends it over the open network channel. The receiver uses the decoding algorithm to decode the received music and obtain the secret information.

Steganography is the key technology in a concealment system. A concealment system can be illustrated by Simmons' "Prisoners' Problem" [Simmons1984]: Alice and Bob are in jail, locked up in separate cells far apart from each other, and they wish to devise an escape plan without being perceived by the warden Eve. Faced with this situation, Alice and Bob intend to hide their true information in a normal carrier. We can model this task mathematically as follows. Alice and Bob need to transmit a secret message m in the secret message space M. Alice gets a cover c from the cover space C. Under the guidance of a certain key k in the key space K, the mapping function f is used to map c to s, which is in the hidden space S, that is:

$$s = f(c, k, m), \qquad f: C \times K \times M \rightarrow S \tag{1}$$

Bob uses the extraction function g to extract the correct secret message m from the hidden object s under the guidance of the key k in the key space K:

$$m = g(s, k), \qquad g: S \times K \rightarrow M \tag{2}$$

In order not to expose the existence of the embedded information, it is usually required that the elements in C and S be exactly the same, that is, C = S. But generally speaking, the mapping function f will affect the probability distributions of the covers and the hidden objects, denoted P_C and P_S. In order to prevent suspicion, we generally hope that the steganographic operation does not cause a big difference between these two distributions, that is:

$$d(P_C, P_S) < \varepsilon \tag{3}$$

where d(·,·) measures the difference between the two distributions and ε is a small positive number.

There are various media that can serve as carriers for information hiding, including images [Fridrich2009], audio [Yang, Peng, and Huang2017, Huang, Tang, and Yuan2011], text [Luo and Huang2017, Fang, Jaggi, and Argyraki2017] and so on [Johnson and Sallee2008]. Audio is a main carrier of human communication and information transmission, but it is also very easy to monitor. Therefore, it is of great significance to study audio steganography and find an effective way to use audio carriers to transmit secret messages and ensure information security. However, audio steganography is considered more difficult than image steganography because the Human Auditory System (HAS) is more sensitive than the Human Visual System (HVS) [Gopalan2003]. For these reasons, audio steganography has attracted the interest of many researchers, and in recent years more and more audio-based information hiding methods have emerged [Yang, Peng, and Huang2017, Gupta and Sharma2014, Huang, Tang, and Yuan2011].

Fridrich [Fridrich2009] summarized that, in general, steganographic algorithms can follow three fundamental architectures that determine the internal mechanism of the embedding and extraction algorithms: steganography by cover selection, by cover modification, and by cover synthesis. In steganography by cover selection, Alice first encodes all the covers in a cover set and then selects different covers for transmission to achieve covert message delivery. The advantage of this approach is that the cover is always "100% natural", but an obvious disadvantage is an impractically low payload. The most studied paradigm today is steganography by cover modification, in which Alice embeds the secret information by modifying a given carrier. This kind of method has a wide range of applications on multiple carriers such as images [Fridrich2009], speech [Huang, Tang, and Yuan2011], and text [Topkara, Topkara, and Atallah2006]. However, directly modifying the carrier usually affects its statistical distribution, making the hidden information easy for Eve to detect. The third type of method is steganography by cover synthesis: Alice automatically generates a carrier based on the secret message that needs to be delivered, and embeds the covert information during the generation process. The biggest advantage of this method is that it does not need a carrier in advance; it can directly generate a carrier that conforms to the corresponding statistical distribution. It therefore has broader application prospects than the first two methods and is considered a very promising research direction in the current steganography field. Methods based on automatic carrier generation have appeared in text steganography [Luo and Huang2017, Fang, Jaggi, and Argyraki2017] and image steganography [Hayes and Danezis2017]. However, to the best of our knowledge, there has been no effective information hiding method based on automatic audio generation.

In this paper, we propose an audio steganography based on Recurrent Neural Networks (AAG-Stega), which belongs to the third category: it automatically generates high-quality audio based on the secret bit stream that needs to be embedded. In the audio generation process, we encode each note according to its conditional probability distribution, and then control the generation according to the bit stream. By finely adjusting the encoding we can control the information embedding rate, so that concealment and hidden capacity can be balanced. In this way, our model guarantees good concealment while achieving a high hidden capacity.

2 Related Work

2.1 Audio Steganography

Most previous audio steganography is based on the carrier modification mode; the differences lie in which features are modified and how. Currently, the most commonly used audio steganography methods include Least Significant Bit (LSB) encoding, phase coding, echo hiding, and the Spread Spectrum (SS) method [Jayaram, Ranganatha, and Anupama2011].

The basic idea of LSB encoding is to replace the least significant bits of the cover file with a sequence of bits containing the hidden data [Chowdhury et al.2016]. This type of method is simple and easy to implement, but its fatal shortcoming is poor robustness: channel interference, data compression, filtering and the like will destroy the hidden information. Phase coding [Bender et al.1996] exploits the human ear's insensitivity to absolute phase: it replaces the absolute phase of an original audio segment with a reference phase representing the secret information, and adjusts the other segments to maintain the relative phase between segments. This method has little effect on the original audio and is hard to detect; however, the hidden capacity is small, and when the reference phase indicating the secret information changes abruptly, a significant phase difference occurs. The basic principle of echo hiding is to embed secret information into the original audio by introducing echoes [Ghasemzadeh and Kayvanrad2015], taking advantage of the fact that the human auditory system cannot detect short echoes (a few milliseconds). Echo hiding is robust and resistant to active attacks, and with a reasonable choice of echo parameters the additional echo is difficult for the human auditory system to perceive. The spread spectrum (SS) method [Kaur et al.2015] spreads the secret information over the widest possible spectrum, or over a specified frequency band, by using a spreading sequence that is independent of the hidden information. However, the SS method shares a disadvantage with LSB and parity coding in that it introduces noise into the sound file.
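To make the carrier-modification mode concrete, the following minimal Python sketch (our illustration, not part of the paper) embeds and extracts a bit stream via LSB replacement on 16-bit PCM samples:

```python
import numpy as np

def lsb_embed(samples: np.ndarray, bits: list) -> np.ndarray:
    """Classic LSB audio steganography (illustrative sketch only).

    Each sample's lowest bit is replaced by one secret bit, so the
    distortion is at most one quantization step -- but the LSB-plane
    statistics of the cover are altered, which is what steganalysis
    looks for.
    """
    assert len(bits) <= len(samples), "cover audio too short for payload"
    stego = samples.copy()
    for i, b in enumerate(bits):
        stego[i] = (stego[i] & ~1) | b  # clear the LSB, then set the secret bit
    return stego

def lsb_extract(stego: np.ndarray, n_bits: int) -> list:
    """Read the payload back from the lowest bit of each sample."""
    return [int(s) & 1 for s in stego[:n_bits]]

# toy usage on random 16-bit samples
cover = np.random.randint(-2**15, 2**15, size=64, dtype=np.int16)
payload = [1, 0, 1, 1, 0, 0, 1, 0]
assert lsb_extract(lsb_embed(cover, payload), len(payload)) == payload
```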

These methods all require a given audio carrier in advance, and then achieve information hiding by modifying some relatively insensitive features. However, such modifications are equivalent to adding noise to the original signal, resulting in a difference in the statistical distribution of the carrier before and after information hiding. Methods of this kind inevitably face an irreconcilable contradiction between concealment and hidden capacity: satisfying formula (3) greatly limits the scope and extent of the modification, and thus the hidden capacity, while improving one aspect (such as hidden capacity) greatly impairs the other (such as concealment).

In contrast, steganography based on automatic carrier generation does not need a carrier in advance; it automatically generates a piece of carrier according to the covert information. During generation, the model can learn the statistical distribution of a large number of samples and produce steganographic carriers that conform to it, so it can effectively alleviate or even avoid this dilemma, achieving high concealment and high-capacity information hiding at the same time. For this reason, steganography by carrier generation is considered a very promising research direction in the field of steganography.

2.2 Automatic Audio Generation Based on RNN

Automatic generation of sufficiently realistic information carriers has always been hard. In recent years, with the development of deep neural networks, more and more works have shown that we can use the powerful feature extraction and representation capabilities of deep neural networks to model information carriers and generate sufficiently realistic covers such as images, text and audio. The Recurrent Neural Network (RNN) [Mikolov et al.2010] is a special artificial neural network model. Unlike other deep neural networks, an RNN is actually not "deep" in space; the simplest RNN can have only one hidden layer. Its fundamental feature is a feedback connection at each step, so it can be unrolled along the time dimension to form a network that is "deep" in time, as shown in Figure 2.

Figure 2: The structure of RNN.

Due to its special structural characteristics, the RNN is very suitable for modeling sequential signals, and audio is a typical sequential signal. A piece of audio can be represented as a sequence X = {x_1, x_2, ..., x_n}, with each element x_t representing the signal at moment t. Currently, the majority of automatic audio generation work is modeled in this way: the signal at each time step t is expressed as a conditional probability distribution given the first t-1 signals, and the probability distribution of the entire audio is the product of these conditional probabilities:

$$p(X) = \prod_{t=1}^{n} p(x_t \mid x_1, x_2, \dots, x_{t-1}) \tag{4}$$

With the powerful self-learning ability of neural networks, given enough training samples, the model can automatically learn an optimal estimate of these conditional probability distributions. We can then iteratively select the signal with the highest conditional probability as the output and finally generate a sufficiently realistic information carrier. A lot of prior work has proven that this approach to automatic audio generation is effective [Sturm et al.2016].
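As a sketch of this generation loop (the `model.predict` interface is a hypothetical stand-in for a trained network returning the conditional distribution of Equation (4)), greedy decoding looks like:

```python
import numpy as np

def generate_greedy(model, start_token: int, length: int) -> list:
    """Plain autoregressive generation (no steganography yet).

    At each step, query p(x_t | x_1..x_{t-1}) and emit the most likely
    next note; Section 3.2 replaces this argmax choice with a
    bit-driven selection from the candidate pool.
    """
    history = [start_token]
    for _ in range(length):
        probs = model.predict(history)          # conditional distribution, Eq. (4)
        history.append(int(np.argmax(probs)))   # greedy: take the argmax note
    return history
```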

Figure 3: A detailed explanation of the proposed model and the information hiding algorithm. The top of the figure shows the bit stream that needs to be embedded. The middle part is the generated steganographic audio, and the bottom is a two-layer LSTM with the Lookback mechanism and the Attention mechanism. We encode the probability distribution of the notes, and then select the corresponding note according to the secret bit stream, so as to hide the information.

3 AAG-Stega Methodology

Compared to other steganographic modes, steganography based on automatic carrier generation is characterized by the fact that it does not need to be given a carrier in advance; instead, it automatically generates the carrier based on the secret information. This type of method omits the cover c in Equation (1), and the embedding function becomes:

$$s = f(k, m), \qquad f: K \times M \rightarrow S \tag{5}$$

However, it still has to satisfy formula (3); that is, the steganographic operation should minimize its impact on the statistical distribution of the carrier space. This is very hard, but also very promising. The overall structure of our model is shown in Figure 3. The whole system consists of three modules: the automatic audio generation (AAG) module, the information embedding module and the information extraction module.

3.1 AAG Module

Modeling the audio as a product of the conditional probabilities at each moment, as in Equation (4), makes it easy to see that generating high-quality audio requires an optimal estimate of the conditional probability distribution at each moment, that is, p(x_t | x_1, x_2, ..., x_{t-1}).

The simplest recurrent neural network, which has only one hidden layer, can be described by the following formulas:

$$h_t = f(U x_t + W h_{t-1} + b_h), \qquad y_t = g(V h_t + b_y) \tag{6}$$

where x_t and y_t indicate the input and output vectors at the t-th step respectively, h_t represents the hidden-layer vector, U, V, W and b_h, b_y are learned weight matrices and biases, and f and g are nonlinear functions, for which the sigmoid or tanh function is usually used.
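The step function of Equation (6) is straightforward to write down; the following numpy sketch (shapes are our assumption) makes the recurrence explicit:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b_h, b_y):
    """One step of the simplest RNN, Eq. (6).

    Assumed shapes: x_t (d_in,), h_prev (d_h,), U (d_h, d_in),
    W (d_h, d_h), V (d_out, d_h), b_h (d_h,), b_y (d_out,).
    """
    h_t = np.tanh(U @ x_t + W @ h_prev + b_h)  # hidden state carries the past
    y_t = V @ h_t + b_y                        # output logits for this step
    return h_t, y_t
```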

Theoretically, this simplest RNN model can deal with sequence signals of arbitrary length. However, due to the vanishing gradient problem [Hochreiter1998], it cannot handle long-range dependence effectively. Its improved variant, the Long Short-Term Memory (LSTM) model [Hochreiter and Schmidhuber1997], effectively solves this problem through elaborately designed unit nodes. The main improvement of the LSTM is the hidden-layer unit, which is composed of four components: a cell, an input gate, an output gate and a forget gate. It stores past input information in the cell unit so as to overcome the problem of long-distance dependence and model long time series. An LSTM unit can be described by the following formulas:

$$\begin{aligned} i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)\\ f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)\\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)\\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)\\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \tag{7}$$

where i_t indicates the input gate, which controls the amount of new information to be stored in the memory cell. The forget gate f_t enables the memory cell to throw away previously stored information. In this regard, the memory cell c_t is a summation of the incoming information modulated by the input gate i_t and the previous memory modulated by the forget gate f_t. The output gate o_t allows the memory cell to affect the current hidden state and output, or to block its influence.
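A direct numpy transcription of Equation (7) (again with assumed shapes, and the gate weights gathered into dicts for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step matching Eq. (7).

    W, U, b are dicts keyed by gate name ("i", "f", "o", "c");
    x_t has shape (d_in,), h_prev and c_prev have shape (d_h,).
    """
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate memory
    c = f * c_prev + i * g                                # cell state update
    h = o * np.tanh(c)                                    # new hidden state
    return h, c
```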

A basic RNN or LSTM can generate a short melody that stays in key, but has trouble generating a longer one. To improve the model's ability to learn longer-term structure, our automatic audio generation module uses the Lookback mechanism [Waite2016] and the Attention mechanism [Bahdanau, Cho, and Bengio2014]. In the basic RNN, the input to the model is a one-hot vector of the previous event and the label is the target next event. In the Lookback RNN, additional information is added to the input vector: 1) the events from 1 and 2 bars ago; 2) a signal indicating whether the last event created something new or just repeated an already established melody; 3) the current position within the measure.

To learn even longer-term structure, we further combine the attention mechanism. Attention is one of the ways a model can access previous information without having to store it in the RNN cell's state. The difference is that our model does not attend to the outputs of all previous moments, but only to the outputs of the last n steps when generating the output for the current step, which can be described by the following formulas:

$$e_i = v^{T} \tanh(W_1 h_i + W_2 c_t), \qquad a_i = \mathrm{softmax}(e_i), \qquad h'_t = \sum_{i=1}^{n} a_i h_i \tag{8}$$

where v^T tanh(W_1 h_i + W_2 c_t) is a multilayer perceptron conditioned on the hidden states of the last n steps; the h_i are the RNN outputs from the previous n steps, and c_t is the current step's RNN cell state. These values are used to calculate e, an n-length vector with one value for each of the previous n steps; the values represent how much attention each step should receive. A softmax normalizes these values into a mask-like vector a, called the attention mask. The RNN outputs from the previous n steps are then multiplied by these attention-mask values and summed together to get h'_t.

The vector h'_t is essentially all previous outputs combined, with each output contributing in proportion to the attention its step received. This vector is then concatenated with the RNN output from the current step, and a linear layer is applied to the concatenated vector to create the new output for the current step. Unlike some other attention models that only apply this vector to the RNN output, in our module the vector is also applied to the input of the next step: it is concatenated with the next step's input vector, and a linear layer is applied to that concatenated vector to create the new input to the RNN cell. This lets attention affect not only the data coming out of the RNN cell, but also the data being fed into it.
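The attention computation of Equation (8) can be sketched in a few lines (a minimal numpy version; the parameter shapes are our assumption):

```python
import numpy as np

def attention(H, c_t, W1, W2, v):
    """Attention over the last n RNN outputs, as in Eq. (8).

    H: (n, d) matrix stacking the previous n outputs h_i;
    c_t: (d,) current cell state; W1, W2: (d, d); v: (d,).
    Returns the attention mask a and the combined vector h'_t.
    """
    e = np.array([v @ np.tanh(W1 @ h_i + W2 @ c_t) for h_i in H])  # scores
    a = np.exp(e - e.max())
    a /= a.sum()                                                   # softmax mask
    h_prime = a @ H                                                # weighted sum
    return a, h_prime
```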

As mentioned before, the output at time step t is based not only on the input vector at the current time t, but also on the information of the previous t-1 moments. Therefore, the output of the last hidden layer at time step t can be regarded as a fusion of the information of the previous t-1 steps. Based on these features, after all the hidden layers we add a softmax layer to calculate the probability distribution of the t-th value, that is

$$p(x_t \mid x_1, x_2, \dots, x_{t-1}) = \mathrm{softmax}(W h_t + b) \tag{9}$$

where W and b are learned parameters. All the parameters of the neural network are obtained through training: we update the network parameters using the softmax cross-entropy loss and the backpropagation algorithm [Rumelhart, Hinton, and Williams1988]. After minimizing the loss function through iterative optimization, we obtain a good estimate of the probability distribution in Equation (4).
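The output layer and the per-step training loss of Equation (9) amount to the following (a numpy sketch; the variable names are ours):

```python
import numpy as np

def note_distribution(h_t, W_out, b_out):
    """Softmax projection of Eq. (9): hidden state -> p(x_t | x_<t).

    W_out has shape (|D|, d_h) for a note dictionary D.
    """
    logits = W_out @ h_t + b_out
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()

def step_loss(p, target_idx):
    """Softmax cross-entropy for one step: -log p(true next note)."""
    return -np.log(p[target_idx] + 1e-12)
```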

3.2 Information Hiding Algorithm

In the information embedding module, we mainly encode each note based on its conditional probability distribution, that is, p(x_t | x_1, ..., x_{t-1}), to form a mapping from the binary bit stream to the note space. Our idea rests on the fact that when the model is well trained, there is actually more than one feasible note at each time point. After sorting the prediction probabilities of all the notes in the dictionary D in descending order, we can choose the top m notes to build the Candidate Pool (CP). To be more specific, if cp_i represents the i-th note in the Candidate Pool, the CP can be written as CP = {cp_1, cp_2, ..., cp_m}.

In fact, when we choose a suitable size for the Candidate Pool, any note in the CP is a reasonable output at that time step and will not affect the quality of the generated audio, so each step becomes a place where information can be embedded. It is worth noting that whenever a different note is chosen at some moment, the probability distribution of the next note changes accordingly, following Equation (4). Once we have the Candidate Pool, we need an effective method to encode the notes in it.

Figure 4: Encoding notes in Candidate Pool using a Huffman tree.

To make the coding of each note better reflect its conditional probability, we use a Huffman tree to encode the notes in the Candidate Pool. In computer science and information theory, a Huffman code is a particular type of optimal prefix code, and the output of Huffman's algorithm can be viewed as a variable-length code table for encoding source symbols. The construction takes the probability distribution of the source symbols into account and guarantees that symbols with higher probability receive shorter codes [Huffman1952]. In the audio generation process, at each moment we represent each note in the Candidate Pool by a leaf node of the tree; the edges connecting each non-leaf node (including the root) to its two children are encoded with 0 and 1 respectively, 0 on the left and 1 on the right, as shown in Figure 4.

After all the notes in the Candidate Pool are encoded, information embedding amounts to selecting, at each time step, the leaf node that matches the binary code stream to be embedded and outputting its note. Details of the proposed information hiding method are shown in Algorithm 1. With this method, we can generate a piece of natural audio driven by the input secret code stream; the generated audio can then be sent over the open channel, achieving covert transmission of the secret information with high concealment.

Algorithm 1   Information Hiding Algorithm
Input:
   Secret bit stream B
   Size of the Candidate Pool (CPS): m
   Start notes list A
Output:
   Generated steganographic audio
 if (not the end of the current audio) then
  Calculate the probability distribution of the next note from the previously generated notes, using the well-trained RNN;
  Sort the prediction probabilities of all notes in descending order and select the top m notes to form the Candidate Pool (CP);
  Construct a Huffman tree according to the probability distribution of the notes in the CP and encode it;
  Read the binary stream and search from the root of the tree according to the encoding rules until the corresponding leaf node is found, then output its note;
 else
  Randomly select a start note from the start notes list A as the start of the next audio;
 end if
    Return: Generated steganographic audio
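For concreteness, here is a minimal Python sketch of the coding and selection steps of Algorithm 1 (the candidate-pool interface and all variable names are our assumptions, not the authors' code):

```python
import heapq
from itertools import count

def huffman_code(pool):
    """Build a Huffman code for the candidate pool.

    `pool` maps note -> conditional probability (the top-m notes).
    Returns a dict note -> bit string; higher-probability notes get
    shorter codes, as described in Section 3.2.
    """
    tie = count()  # tie-breaker so heapq never compares the dicts
    heap = [(p, next(tie), {note: ""}) for note, p in pool.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # two least probable subtrees
        p1, _, right = heapq.heappop(heap)
        merged = {n: "0" + c for n, c in left.items()}          # 0 on the left
        merged.update({n: "1" + c for n, c in right.items()})   # 1 on the right
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

def embed_step(probs, bits, m):
    """One step of Algorithm 1.

    `probs` is a dict note -> p(x_t | x_<t) from the trained model;
    `bits` is the remaining secret stream as a string of '0'/'1'.
    Returns the selected note and how many bits were consumed.
    """
    top_m = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:m])
    code = huffman_code(top_m)
    for note, c in code.items():
        if bits.startswith(c):  # prefix code: exactly one note matches
            return note, len(c)
    raise ValueError("remaining bit stream too short for any codeword")
```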

3.3 Information Extraction Algorithm

Information embedding and extraction are two completely opposite operations. After receiving the steganographic audio, the receiver feeds the first note of each audio, acting as a key, into the same RNN, which calculates the probability distribution of the notes at each subsequent time point in turn. At each time point, given the probability distribution of the current note, the receiver first sorts all the notes in the dictionary in descending order of probability and selects the top m notes to form the Candidate Pool. He then builds a Huffman tree with the same rules to encode the notes in the Candidate Pool. Finally, the actually transmitted note at the current moment determines the path from the root to the corresponding leaf node, so the bits embedded in the current note can be decoded accurately. In this way, the bit stream embedded in the audio can be extracted quickly and without errors.
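Mirroring the embedding sketch above, a single decoding step can be sketched as follows (reusing the hypothetical huffman_code helper):

```python
def extract_step(probs, received_note, m):
    """One decoding step of Section 3.3.

    The receiver reruns the identical RNN, rebuilds the same candidate
    pool and Huffman tree from `probs`, and reads off the codeword of
    the note actually received.
    """
    top_m = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:m])
    return huffman_code(top_m)[received_note]  # the recovered secret bits

# usage sketch: concatenating over all steps recovers the full stream
# bits = "".join(extract_step(model.predict(notes[:t]), notes[t], m)
#                for t in range(1, len(notes)))
```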

4 Experiments and Analysis

In this section, we first introduce the training dataset and the model details. We then test and analyze the performance of the proposed model from two aspects: information imperceptibility and hiding capacity.

4.1 Data Preparing and Model Detail

For model training, we use the Lakh MIDI dataset [Raffel2016] as our training data. It contains 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset [Bertin-Mahieux et al.2011].

Almost all the parameters of our model are obtained through training, but several hyper-parameters still need to be determined. Based on comparison tests, they are set as follows: the model uses two LSTM hidden layers (see Figure 3). To strengthen regularization and prevent overfitting, we adopt the dropout mechanism [Srivastava et al.2014] with a dropout rate of 0.5 during training. We choose Adam [Kingma and Ba2014] as the optimization method, the learning rate is initially set to 0.001, the batch size is 64, and the attention length is 40.

4.2 Information Hiding Capacity

The Embedding Rate (ER) measures how much information can be embedded in the audio. It is calculated by dividing the actual number of embedded bits by the number of bits occupied by the entire generated audio file in the computer:

$$ER = \frac{1}{N}\sum_{i=1}^{N}\frac{k_i \cdot n_i}{B_i} \tag{10}$$

where N is the number of generated audios, n_i is the number of notes in the i-th audio, k_i indicates the average number of bits embedded per note, and B_i indicates the number of bits occupied by the i-th audio in the computer. Obviously, the size of the candidate pool (CPS) directly affects the embedding rate. In the experiment, for each CPS we generated 50 pieces of audio and counted the average number of notes and the average file size of the generated audio. The final statistics and the calculated embedding rates are shown in Table 1.

CPS                      2      4      8      16     32     64
Avg. bits per note       1      1.95   2.78   3.59   4.37   5.22
Avg. notes per audio     147.9  146.3  146.9  160.5  139.5  141.8
Avg. file size (bytes)   505.8  518.2  524.9  504.5  530.8  495.4
ER                       3.7%   6.9%   9.7%   14.3%  14.4%  18.7%
Table 1: The calculated information embedding rate under different CPS.
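As a sanity check on Table 1 (our arithmetic, using the CPS = 2 column), the numbers are consistent with Equation (10):

```python
# CPS = 2: 1 bit per note, 147.9 notes on average, 505.8 bytes on disk
k, n, B_bytes = 1, 147.9, 505.8
er = (k * n) / (8 * B_bytes)  # embedded bits / total bits of the file
print(f"ER = {er:.1%}")       # -> ER = 3.7%, matching the table
```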

4.3 Information Imperceptibility

The purpose of a concealment system is to hide the existence of information in the carrier to ensure the security of important information. Therefore, the imperceptibility of the information is the most important performance factor of a concealment system. Currently, there are two kinds of evaluation methods for measuring audio quality: objective evaluation and subjective evaluation.

4.3.1 Objective Evaluation

In order to ensure sufficiently high concealment, according to formula (3), the statistical distribution of the carriers before and after steganography should be as consistent as possible. For the 50 pieces of audio generated under each CPS, we calculated the average (negative log-)likelihood score as follows:

$$\mathrm{score} = -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{t=1}^{n_i}\log p(x_t \mid x_1, \dots, x_{t-1}) \tag{11}$$

where N is the number of samples and n_i indicates the number of notes in the i-th audio. The test results are shown in Table 2.

CPS    2      4      8      16     32     64
score  0.306  0.417  0.315  0.503  0.547  0.783
Table 2: Results of the average likelihood score.

From Table 2, we can see that the calculated values are relatively small, indicating that the steganographic samples generated by our model are close to the training samples in probability distribution. Although the score gradually increases with the embedding rate, it remains within a reasonable range.

4.3.2 Subjective Evaluation

The "A/B/X" test is a standard subjective evaluation method in the field of audio steganography [Huang, Tang, and Yuan2011]. The details of this test are as follows. Suppose there are three types of audio files, denoted A, B, and X. A represents a stego audio file containing hidden information, B denotes an audio file without any hidden information, and X is either A or B. Evaluators are employed to listen to the audio files and then asked to decide whether X is A or B.

Evaluator  Accuracy     Recall       F1-score
Group1     0.486±0.049  0.480±0.082  0.585±0.064
Group2     0.388±0.085  0.389±0.161  0.480±0.129
Group3     0.475±0.063  0.486±0.079  0.584±0.067
Group4     0.432±0.100  0.406±0.143  0.514±0.126
Group5     0.440±0.055  0.444±0.064  0.548±0.057
Total      0.440±0.077  0.437±0.107  0.539±0.094
Table 3: Testing results of the five evaluator groups.

We invited 5 different test groups, each containing 10 people. For each group, we randomly selected 10 steganographic audios generated by the proposed model under each of the CPS settings in Table 4 (50 in total), together with 15 randomly selected audio samples without hidden information, and mixed them to form a test set of 65 audio samples. Each evaluator listened to the 65 audios one by one and judged, for each, whether it belongs to A or to B. To help them judge, we gave them 3 reference audios without covert information and 3 reference steganographic audios generated by the proposed model. This test set can be found in the supplemental materials (https://github.com/YangzlTHU/AAG-Stega).

We first calculated the accuracy, recall, and F1 score with which steganographic and non-steganographic audio were correctly identified, uniformly marking the steganographic audios at different embedding rates as negative samples and the audios without steganographic information as positive samples. The final results are shown in Table 3. The accuracy, recall and F1 scores are all around 0.5, which means the evaluators could not reliably distinguish the stego audio from the original audio by the A/B/X test when secret information was embedded using the proposed method.

CPS      2      4      8      16     32
Group1   44%    48%    56%    52%    56%
Group2   66%    48%    62%    70%    56%
Group3   52%    40%    58%    52%    52%
Group4   70%    62%    60%    54%    50%
Group5   74%    50%    50%    48%    56%
Average  61.2%  49.6%  57.2%  55.2%  54.0%
Table 4: Percentages of failure using the A/B/X test.

Table 4 shows the percentage of failures to identify the stego audio file under different CPS. We can see that, first, when CPS = 2, that is, when the embedding rate is lowest, the average percentage of failed judgments is the highest, 61.2%. Second, as the embedding rate increases, the average percentage of failed judgments does not decrease significantly, remaining around 50%. This further shows that our steganographic audios are very difficult to recognize, and thus the proposed model achieves a high degree of concealment.

5 Conclusion

In this paper, we proposed an automatic audio generation-based steganography (AAG-Stega), which is completely different from previous audio steganography. The proposed model can automatically generate high-quality audio covers on the basis of the secret bit stream that needs to be embedded. The experimental results showed that the proposed model can guarantee high hiding capacity and concealment at the same time.

References

  • [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [Bender et al.1996] Bender, W.; Gruhl, D.; Morimoto, N.; and Lu, A. 1996. Techniques for data hiding. IBM Systems Journal 35(3&4):313–336.
  • [Bernaille and Teixeira2007] Bernaille, L., and Teixeira, R. 2007. Early recognition of encrypted applications. In International Conference on Passive and Active Network Measurement, 165–175.
  • [Bertin-Mahieux et al.2011] Bertin-Mahieux, T.; Ellis, D. P.; Whitman, B.; and Lamere, P. 2011. The million song dataset. In ISMIR, volume 2, 10.
  • [Chowdhury et al.2016] Chowdhury, R.; Bhattacharyya, D.; Bandyopadhyay, S. K.; and Kim, T.-h. 2016. A view on lsb based audio steganography. International Journal of Security and Its Applications 10(2):51–62.
  • [Fang, Jaggi, and Argyraki2017] Fang, T.; Jaggi, M.; and Argyraki, K. 2017. Generating steganographic text with lstms. arXiv preprint arXiv:1705.10742.
  • [Fridrich2009] Fridrich, J. 2009. Steganography in digital media: principles, algorithms, and applications. Cambridge University Press.
  • [Ghasemzadeh and Kayvanrad2015] Ghasemzadeh, H., and Kayvanrad, M. H. 2015. Toward a robust and secure echo steganography method based on parameters hopping. In Signal Processing and Intelligent Systems Conference (SPIS), 2015, 143–147. IEEE.
  • [Gopalan2003] Gopalan, K. 2003. Audio steganography using bit modification. In Multimedia and Expo, 2003. ICME’03. Proceedings. 2003 International Conference on, volume 1, I–629. IEEE.
  • [Gupta and Sharma2014] Gupta, N., and Sharma, N. 2014. Dwt and lsb based audio steganography. In Optimization, Reliabilty, and Information Technology (ICROIT), 2014 International Conference on, 428–431. IEEE.
  • [Hayes and Danezis2017] Hayes, J., and Danezis, G. 2017. Generating steganographic images via adversarial training. In Advances in Neural Information Processing Systems, 1954–1963.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • [Hochreiter1998] Hochreiter, S. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(2):107–116.
  • [Huang and Yeo2002] Huang, D.-Y., and Yeo, T. Y. 2002. Robust and inaudible multi-echo audio watermarking. In Pacific-Rim Conference on Multimedia, 615–622. Springer.
  • [Huang, Tang, and Yuan2011] Huang, Y. F.; Tang, S.; and Yuan, J. 2011. Steganography in inactive frames of voip streams encoded by source codec. IEEE Transactions on information forensics and security 6(2):296–306.
  • [Huffman1952] Huffman, D. A. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40(9):1098–1101.
  • [Jayaram, Ranganatha, and Anupama2011] Jayaram, P.; Ranganatha, H.; and Anupama, H. 2011. Information hiding using audio steganography–a survey. The International Journal of Multimedia & Its Applications (IJMA) Vol 3:86–96.
  • [Johnson and Sallee2008] Johnson, N. F., and Sallee, P. A. 2008. Detection of hidden information, covert channels and information flows. Wiley Handbook of Science and Technology for Homeland Security.
  • [Kaur et al.2015] Kaur, R.; Thakur, A.; Saini, H. S.; and Kumar, R. 2015. Enhanced steganographic method preserving base quality of information using lsb, parity and spread spectrum technique. In Advanced Computing & Communication Technologies (ACCT), 2015 Fifth International Conference on, 148–152. IEEE.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Luo and Huang2017] Luo, Y., and Huang, Y. 2017. Text steganography with high embedding rate: Using recurrent neural networks to generate chinese classic poetry. In ACM Workshop on Information Hiding and Multimedia Security, 99–104.
  • [Matsuoka2006] Matsuoka, H. 2006. Spread spectrum audio steganography using sub-band phase shifting. In Intelligent Information Hiding and Multimedia Signal Processing, 2006. IIH-MSP’06. International Conference on, 3–6. IEEE.
  • [Mikolov et al.2010] Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September, 1045–1048.
  • [Raffel2016] Raffel, C. 2016. Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. Ph.D. dissertation, Columbia University.
  • [Rumelhart, Hinton, and Williams1988] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1988. Learning representations by back-propagating errors. Readings in Cognitive Science 323(6088):399–421.
  • [Shannon1949] Shannon, C. E. 1949. Communication theory of secrecy systems. Bell Labs Technical Journal 28(4):656–715.
  • [Simmons1984] Simmons, G. J. 1984. The prisoners' problem and the subliminal channel. In Advances in Cryptology: Proceedings of Crypto 83, 51–67.
  • [Sridevi, Damodaram, and Narasimham2009] Sridevi, R.; Damodaram, A.; and Narasimham, S. 2009. Efficient method of audio steganography by modified lsb algorithm and strong encryption key with enhanced security. Journal of Theoretical & Applied Information Technology 5(6).
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
  • [Sturm et al.2016] Sturm, B. L.; Santos, J. F.; Ben-Tal, O.; and Korshunova, I. 2016. Music transcription modelling and composition using deep learning. arXiv preprint arXiv:1604.08723.
  • [Topkara, Topkara, and Atallah2006] Topkara, U.; Topkara, M.; and Atallah, M. J. 2006. The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions. In Proceedings of the 8th workshop on Multimedia and security, 164–174. ACM.
  • [Waite2016] Waite, E. 2016. Generating long-term structure in songs and stories. https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn/. Jul 15, 2016.
  • [Yang, Peng, and Huang2017] Yang, Z.; Peng, X.; and Huang, Y. 2017. A sudoku matrix-based method of pitch period steganography in low-rate speech coding. In International Conference on Security and Privacy in Communication Systems, 752–762. Springer.