DNA Steganalysis Using Deep Recurrent Neural Networks

04/27/2017
by   Ho Bae, et al.
Seoul National University

The technique of hiding messages in digital data is called steganography. With improved sequencing techniques, increasing attempts have been made to hide messages in deoxyribonucleic acid (DNA) sequences, which have become a medium for steganography. Many detection schemes have been developed for conventional digital data, but these schemes are not applicable to DNA sequences because of DNA's complex internal structures. In this paper, we propose the first DNA steganalysis framework for detecting hidden messages and conduct an experiment based on the random oracle model. Among the models suitable for this framework, splice junction classification using deep recurrent neural networks (RNNs) is the most appropriate for performing DNA steganalysis. In our DNA steganalysis approach, we extract the hidden layer composed of RNNs to model the internal structure of a DNA sequence. We analyze the security of steganography schemes based on mutual entropy and provide simulation results that illustrate how our model detects hidden messages, independent of the regions of a targeted reference genome. We apply our method to human genome datasets and show that hidden messages in DNA sequences with a minimum sample size of 100 are detectable, regardless of the presence of hidden regions.


1 Introduction

Steganography serves to conceal the existence and content of messages in media using various techniques, including changing the pixels in an image [bennett2004linguistic]. Generally, steganography is used to achieve two main goals. On the one hand, it is used as digital watermarking to protect intellectual property. On the other hand, it is used as a covert approach to communicating without the possibility of detection by unintended observers. In contrast, steganalysis is the study of detecting hidden messages. Steganalysis also has two main goals, which are detection and decryption of hidden messages [bennett2004linguistic, mitras2013proposed].

Among the various media employed to hide information, deoxyribonucleic acid (DNA) is appealing owing to its chemical stability and is thus a suitable candidate as a carrier of concealed information. As a storage medium, DNA has the capacity to store amounts of data that exceed the capacity of current storage media [beck2012finding]. For instance, a gram of DNA contains approximately 10^21 DNA bases (10^8 terabytes), which indicates that only a few grams of DNA can store all available information [gehani2003dna]. In addition, with the advent of next-generation sequencing, individual genotyping has become affordable [cordell2005genetic], and DNA in turn has become an appealing covert channel.

To hide information in a DNA sequence, steganography methods require a reference target sequence and a message to be hidden [katzenbeisser2000information]. A naïve example of a substitution-based watermarking method that exploits the preservation of amino acids is shown in Figure 1 (see the caption for details). The hiding space of this method is restricted to exon regions, using complementary pairs that do not interfere with protein translation. However, most DNA steganography methods are designed without considering these hiding spaces; they convert a sequence into binary format and apply well-known encryption techniques.

In this regard, Clelland et al. [clelland1999hiding] first proposed DNA steganography utilizing the microdot technique. Yachie et al. [yachie2007alignment] demonstrated that living organisms can be used as data storage media by inserting artificial DNA into the genomes of living organisms using a substitution cipher coding scheme. This technique is reproducible and has been used to successfully insert four watermarks into the cell of a living organism [gibson2010creation]. Several other encoding schemes have also been proposed [brenner2000vitro, tanaka2005public]. The DNA-Crypt coding scheme [heider2007dna] translates a message into 5-bit sequences, and the ASCII coding scheme [jiao2008code] translates words into their ASCII representation, converts them from decimal to binary, and then replaces 00 with adenine (A), 01 with cytosine (C), 10 with guanine (G), and 11 with thymine (T).
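As a hedged illustration of the ASCII-style coding scheme described above (the binary-to-nucleotide mapping is taken from the text; the helper names are ours, not from the cited papers), the following Python sketch encodes a word into a DNA string and back:

```python
# Sketch of an ASCII-style text-to-DNA encoding (mapping 00->A, 01->C, 10->G, 11->T
# as described above); function names are illustrative, not from the original schemes.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {bits: base for base, bits in BITS_TO_BASE.items()}
BASE_TO_BITS = {v: k for k, v in BITS_TO_BASE.items()}

def text_to_dna(text):
    bits = "".join(format(ord(ch), "08b") for ch in text)   # ASCII -> binary
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_text(dna):
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

if __name__ == "__main__":
    encoded = text_to_dna("HI")       # 'H' = 01001000, 'I' = 01001001
    print(encoded)                    # CAGACAGC
    print(dna_to_text(encoded))       # HI
```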

With the recent advancements with respect to steganography methods, various steganalysis studies have been conducted using traditional storage media. Detection techniques based on statistical analysis, neural networks, and genetic algorithms [maitra2011digital] have been developed for common covert objects such as digital images, video, and audio. For example, Bennett [bennett2004linguistic] exploits letter frequency, word frequency, grammar style, semantic continuity, and logical methodologies. However, these conventional steganalysis methods have not been applied to DNA steganography.

In this paper, we show that conventional steganalysis methods are not directly applicable to DNA steganography. Currently, the most commonly employed detection schemes, i.e., statistical hypothesis testing methods, are limited by the number of input queries required to estimate the distribution on which the statistical test is performed [grosse2017statistical]. To overcome the limitations of these existing methods, we propose a DNA steganalysis method based on learning the internal structure of unmodified genome sequences (i.e., intron and exon modeling [lee2015boosted, lee2015dna]) using deep recurrent neural networks (RNNs). The RNN-based classifier is used to identify modified genome sequences. In addition, we enhance our proposed model using unsupervised pre-training of a sequence-to-sequence autoencoder to improve the robustness of its detection performance. Finally, we compare our proposed method to various machine learning-based classifiers and biological sequence analysis methods implemented on top of our framework.

2 Background

Figure 1: DNA hiding scheme using synonymous codons. A watermark is a scheme used to deter unauthorized dissemination by marking hidden symbols or text. To conserve the amino acid sequence, a DNA watermark can be changed to one of the synonymous codons.
Figure 2: Learned representation of DNA sequences. The learned representations for the coding and non-coding regions are projected into a two-dimensional (2-D) space using t-SNE [maaten2008visualizing]. The representation is based on sequence-to-sequence learning using an autoencoder and stacked RNNs.

We use the standard terminology of information hiding [anderson1996information] to provide a brief explanation of the related background. For example, two hypothetical parties (i.e., a sender and a receiver) wish to exchange genetically modified organisms (GMOs) protected by patents. A third party attempts to detect a watermark sequence in the GMOs to identify unauthorized use. The sender and receiver use the random oracle model [canetti2004random], which posits existing steganography schemes, to embed their watermark message, and the third party uses our proposed model to detect the watermarked signal. A random oracle model posits a randomly chosen function H that can be evaluated only by querying the oracle, which returns H(x) for a given input x.

2.1 Notations

The notations used in this paper are as follows: D is a set of DNA sequences of a species; D' is a set of DNA sequences in which hidden messages have been embedded for some species; x = (x_1, ..., x_n) is the input sequence, where n is the length of the input sequence; c is the encrypted value of x, where |c| is the length of the encrypted sequence; Enc is an encryption function, which takes input x and returns the encrypted sequence c; M is a trained model that takes the target species as training input; y is an averaged output score; P(x) is the probability output given by the trained model M for input x, where 0 ≤ P(x) ≤ 1; A is a probabilistic polynomial-time adversary [bellare1993random], i.e., an attacker that queries messages to the oracle model; σ is the standard deviation of the score y.

2.2 Hiding Messages

The hiding positions within a DNA sequence segment are limited compared to those of a conventional covert channel, because exon sequences are carried over through the transcription and translation processes. For example, assume that ACGGTTCCAATGC is a reference sequence and 01001100 is the message to be hidden. The reference sequence is first translated according to a coding scheme. In this example, we apply the DNA-Crypt coding scheme [heider2007dna], which converts the DNA sequence to binary by replacing A with 00, C with 01, G with 10, and T with 11. The reference sequence is thus translated to 00011010111101010000111001 and divided into blocks of key-length bits, where the key is defined by the sender and receiver. Assuming that the length of the key is 3, the reference sequence can be expressed as 000, 110, 101, 111, 010, 100, 001, 110, 01, and one message bit is concealed at the first position of each block. The DNA sequence with the concealed message is then represented as 0000, 1110, 0101, 0111, 1010, 1100, 0001, 0110, 01. Finally, the sender transmits the transformed DNA sequence AATGCCCTGGTAACCGC. The recipient can extract the hidden message using the pre-defined key.
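The following Python sketch reproduces the example above under our reading of the scheme (one message bit prepended to each full key-length block); it is illustrative only, and the function names are not from DNA-Crypt itself:

```python
# Illustrative sketch of the key-based hiding example above (not the official DNA-Crypt code).
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def hide(reference, message_bits, key_len=3):
    bits = "".join(BASE_TO_BITS[b] for b in reference)
    blocks = [bits[i:i + key_len] for i in range(0, len(bits), key_len)]
    stego_bits = ""
    for i, block in enumerate(blocks):
        # Prepend one message bit to each full key-length block.
        if i < len(message_bits) and len(block) == key_len:
            stego_bits += message_bits[i] + block
        else:
            stego_bits += block
    return "".join(BITS_TO_BASE[stego_bits[i:i + 2]]
                   for i in range(0, len(stego_bits), 2))

print(hide("ACGGTTCCAATGC", "01001100"))  # -> AATGCCCTGGTAACCGC
```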

2.3 Determination of Message-Hiding Regions

Genomic sequence regions (i.e., exons and introns) are utilized depending on whether the task is data storage or transport. Intron regions are suitable for transport since they are not translated and are removed by splicing [keren2010alternative, lockhart2000genomics] during transcription. This property of introns provides a large sequence space for concealing data, creating potential covert channels. In contrast, data storage (watermarking) requires data to be resistant to degradation or truncation. Exons are suitable candidates for storage because the underlying DNA sequence is conserved through the transcription and translation processes [shimanovsky2002hiding]. These two internal structural components of eukaryotic genes are involved in DNA steganography as the payload (watermarking) or the carrier (covert channels). Figure 2 shows the learned representations of introns and exons, which are calculated by the softmax function; the softmax function reduces the outputs for introns and exons to a range between 0 and 1. The 2-D projected positions of introns and exons will change if hidden messages are embedded without considering the shared patterns between the genetic components (e.g., complementary pair rules). Thus, the construction of a classification model that provides a clear separation axis for these shared patterns is an important factor in the detection of hidden messages.

3 Methods

Figure 3: Flowchart of proposed DNA steganalysis pipeline.

Our proposed method uses RNNs [schmidhuber2015deep] to detect hidden messages in DNA. Figure 3 shows our proposed steganalysis pipeline, which comprises training and detection phases. In the model training phase, the model learns the distribution of unmodified genome sequences, distinguishing between introns and exons (see Section 3.2 for the model architecture). In the detection phase, we obtain a prediction score reflecting the distribution of introns and exons. Exploiting this prediction score, we formulate a detection principle, described in detail in Section 3.1.

3.1 Proposed DNA Steganalysis Principle

The security of the random oracle is based on an experiment involving an adversary A and A's ability to distinguish encryptions. Assume that we have a random oracle that acts like a current steganography scheme against which an adversary has only a negligible success probability. The experiment can be defined for any encryption scheme over the message space D and for the adversary A. We describe the proposed method to detect hidden messages using the random oracle. For the experiment, the random oracle chooses a random steganography scheme. The chosen scheme modifies or extends the input, mapping a sequence of length n to a sequence of length m with a random sequence as output. This mapping can be viewed as a table that indicates, for each possible input x, the corresponding output value Enc(x). Given the chosen scheme, A chooses a pair of sequences (x_0, x_1). The random oracle, which posits the scheme, selects a bit b and sends the encrypted message Enc(x_b) to the adversary. The adversary outputs a bit b'. Finally, the output of the experiment is defined as 1 if b' = b, and 0 otherwise; A succeeds in the experiment if it distinguishes the two sequences. Our methodology using this experiment is described as follows:

  1. We construct a model M (Figure 3-A) that runs on the random oracle for the selected species D. Note that the model can be based on any classification model, but the key consideration in selecting a model is to reduce the standard deviation of its scores. Our proposed model is described in Section 3.2.

  2. We compute the averaged score y (Figure 3-B4) using M given D.

  3. We compute the standard deviation σ of the score y (Figure 3-B).

  4. We compute the averaged score y' (Figure 3-C3) using M given D'.

  5. D' is successfully detected (Figure 3-C4) if

    |y - y'| > σ.    (1)
Figure 4: Final scores for intron/exon sequences obtained from the softmax of the neural network (best viewed in color). (a) Kernel density differences between two stego-free intron sequences. (b) Kernel density differences between stego-free and 1% perturbed stego intron sequences. (c) Kernel density differences between stego-free and 5% perturbed stego intron sequences.

This gives two independent scores, y and y', from M. The score y will fall within the range of the unmodified genome sequences, whereas the score y' will fall within a different range for modified genome sequences. If the score difference between y and y' is larger than the standard deviation of the unmodified genome sequence distribution, it is likely that the sequence has been forcibly changed. Figure 4 shows the histograms of the final scores y and y' returned from the softmax of the neural network. If a message is hidden, the final score from the softmax of the neural network shifts over the score range. In what follows, we show that detection is possible using an information-theoretic argument based on entropy (Ref. [blahut1987principles]).
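As a minimal sketch of the detection rule in Eq. (1), assuming the trained model exposes a function that returns a softmax score per sequence (the `score` callable below is a placeholder, not part of the authors' released code), the decision can be written as:

```python
import numpy as np

def detect_stego(score, reference_seqs, suspect_seqs):
    """Decide whether suspect_seqs contain hidden messages via Eq. (1).

    score: callable mapping one DNA sequence to a softmax score in [0, 1]
           (placeholder for the trained model M).
    """
    ref_scores = np.array([score(s) for s in reference_seqs])
    sus_scores = np.array([score(s) for s in suspect_seqs])
    y, sigma = ref_scores.mean(), ref_scores.std()   # averaged score and std for D
    y_prime = sus_scores.mean()                      # averaged score for D'
    return abs(y - y_prime) > sigma                  # Eq. (1)
```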

A DNA steganography scheme is not secure if the mutual information I(D; D') between the cover distribution D and the stego distribution D' is nonzero. The joint entropy H(D, D') is the union of both entropies for the distributions D and D'. Following Gallager [gallager1968information], the mutual information of D and D' is given as I(D; D') = H(D) - H(D | D'). It is symmetric in D and D', such that I(D; D') = I(D'; D), and is always non-negative, and the conditional entropy between the two distributions is 0 if and only if the distributions are equal. Thus, the mutual information must be zero to define a secure DNA steganography scheme:

I(D; D') = H(D) - H(D | D') = 0,    (2)

where C is the message hiding space and, because C is independent of D, it follows that:

H(D') = H(D) + H(C).    (3)

Eq. (2) indicates that the amount of entropy must not decrease given knowledge of D and D'. It follows that a secure steganography scheme is obtained if and only if:

H(D) = H(D').

Note that for H(D) = H(D') it is not possible to distinguish between the original sequence and the stego sequence. Considering that the representations of nucleotides are limited to {A, C, G, T}, it is nearly impossible to satisfy this condition, because current steganography schemes are all based on addition or substitution. Because C is independent of D, the amount of information increases over the distribution D when hidden messages drawn from the hiding space C are inserted, producing the distribution D'. We can conclude that such schemes are not secure, since H(D') > H(D).
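As an illustrative sketch (ours, not from the paper) of why insertion increases measurable entropy, one can compare the empirical Shannon entropy of k-mer frequencies for a cover sequence and its stego counterpart; the choice of k and the estimator below are our own assumptions:

```python
import math
from collections import Counter

def kmer_entropy(seq, k=3):
    """Empirical Shannon entropy (bits) of the k-mer distribution of a DNA sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy comparison: a repetitive cover sequence vs. one with extra bases inserted.
cover = "ACGT" * 50
stego = cover[:100] + "GGATTACA" + cover[100:]
print(kmer_entropy(cover), kmer_entropy(stego))  # the stego entropy is higher here
```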

Figure 5: Overview of proposed RNN methodology.

3.2 Proposed Steganalysis RNN Model

The proposed model is based on sequence-to-sequence learning using an autoencoder and stacked RNNs [peterson2014common]. Model training consists of two main steps: 1) unsupervised pre-training of a sequence-to-sequence autoencoder for modeling an overcomplete case, and 2) supervised fine-tuning of stacked RNNs for modeling patterns between canonical and non-canonical splice sites (see Figure 5). In the proposed model, we use a set of DNA sequences labeled as introns and exons. These sequences are converted into binary vectors by orthogonal encoding [baldi2001bioinformatics], which employs 4-bit one-hot encoding. For x_i ∈ {A, C, G, T}, x_i is encoded by

A → (1, 0, 0, 0), C → (0, 1, 0, 0), G → (0, 0, 1, 0), T → (0, 0, 0, 1).    (4)

For example, the sequence ATTT is encoded into a 16-dimensional binary vector (1000, 0001, 0001, 0001).
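A small Python sketch of the orthogonal (one-hot) encoding of Eq. (4); the helper below is illustrative and not the exact preprocessing code used in the experiments:

```python
import numpy as np

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode_one_hot(seq):
    """Encode a DNA string into a (len(seq), 4) binary matrix as in Eq. (4)."""
    return np.array([ONE_HOT[base] for base in seq], dtype=np.float32)

print(encode_one_hot("ATTT"))   # 4 x 4 matrix; flattening gives the 16-dim vector
```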

The encoded sequence, a tuple of four-dimensional (4-D) dense vectors, is connected to the first layer of an autoencoder, which is used for the unsupervised pre-training of sequence-to-sequence learning. An autoencoder is an artificial neural network (ANN) used to learn a meaningful encoding for a set of data in an unsupervised manner, and it consists of two components: an encoder and a decoder. The encoder RNN encodes the one-hot-encoded input x to a representation of sequence features h, and the decoder RNN decodes h to a reconstruction of x, thus minimizing the reconstruction error between the input and its reconstruction. Through unsupervised learning of this encoder-decoder model [srivastava2015unsupervised], we obtain representations of the inherent features h, which are directly connected to the second activation layer. The second layer is an RNN layer used to construct the model, which in turn is used to determine patterns between canonical and non-canonical splice signals. We then obtain the tuple of fine-tuned features (h_1, ..., h_d), where h is the representation of sequence features (a representation of introns and exons in the hidden layers) and d is the dimension of the vector.

The features h learned from the autoencoder are connected to the second, stacked RNN layer, which completes our proposed architecture for outputting a classification score for a given sequence. For the fully connected output layer, we use the sigmoid function as the activation. The activation score is given by ŷ ∈ [0, 1], where the label y indicates whether the given region contains introns or exons. For our training model, we use Adam [kingma2014adam], a recently proposed optimizer, with a logarithmic (cross-entropy) loss function. The objective function to be minimized is defined as follows:

L = -(1/m) Σ_{i=1..m} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ],    (5)

where m is the mini-batch size. A trained model M thus yields a score y for one species, where y is the score obtained from non-perturbed sequences.
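A minimal Keras-style sketch of the two-step procedure described above (unsupervised sequence-to-sequence pre-training followed by supervised fine-tuning with stacked recurrent layers). The hidden-layer size follows Section 4.3, but the window length, layer choices, and remaining hyperparameters here are our assumptions, not the authors' released code:

```python
from tensorflow.keras import layers, models

SEQ_LEN, N_BASES, HIDDEN = 100, 4, 60   # assumed window length; 60 hidden units per Sec. 4.3

# Step 1: unsupervised pre-training of a sequence-to-sequence autoencoder.
inputs = layers.Input(shape=(SEQ_LEN, N_BASES))
encoded = layers.LSTM(HIDDEN, name="encoder")(inputs)
decoded = layers.RepeatVector(SEQ_LEN)(encoded)
decoded = layers.LSTM(HIDDEN, return_sequences=True)(decoded)
decoded = layers.TimeDistributed(layers.Dense(N_BASES, activation="softmax"))(decoded)
autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
# autoencoder.fit(x_unlabeled, x_unlabeled, epochs=50, batch_size=100)

# Step 2: supervised fine-tuning with stacked LSTMs on top of the learned features.
features = layers.RepeatVector(SEQ_LEN)(encoded)          # reuse the pre-trained encoder output
x = layers.LSTM(HIDDEN, return_sequences=True)(features)
x = layers.LSTM(HIDDEN)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)        # intron/exon score in [0, 1]
classifier = models.Model(inputs, outputs)
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# classifier.fit(x_train, y_train, epochs=100, batch_size=100)
```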

4 Results

Figure 6: Comparison of learning algorithms with random hiding algorithms (best viewed in color). (a) Differences in accuracy for the intron region. (b) Differences in accuracy for the exon region. (c) Differences in accuracy for both regions. The performance of four supervised learning algorithms in detecting hidden messages is shown for six variable lengths of nucleotides (nts).

4.1 Dataset

We simulated our approach using the Ensembl human genome dataset and the UCSC-hg38 human genome dataset [kent2002human], which include sequences from 24 human chromosomes (22 autosomes and 2 sex chromosomes). The Ensembl human genome dataset supports a two-class classification (coding and non-coding), and the UCSC-hg38 dataset supports a three-class classification (donor, acceptor, and non-site).

4.2 Input Representation

Machine learning approaches typically employ a numerical representation of the input for downstream processing. Orthogonal encoding, such as one-hot coding [baldi2001bioinformatics], is widely used to convert DNA sequences into a numerical format; it employs 4-bit one-hot encoding, in which, for x_i ∈ {A, C, G, T}, x_i is encoded as described in Eq. (4). According to Lee et al. [lee2015dna], the vanilla one-hot encoding scheme tends to limit generalization because of the sparsity of its encoding (75% of the elements are zero). Thus, our approach encodes nucleotides into a 4-D dense vector via a normal neural network layer [chollet2015keras], which is trained by the gradient descent method.
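As a sketch of the dense 4-D representation described above, one option (our illustration, not necessarily the exact layer configuration used in the paper) is a trainable embedding layer:

```python
from tensorflow.keras import layers, models

# Map each nucleotide index (A=0, C=1, G=2, T=3) to a trainable 4-D dense vector.
SEQ_LEN = 100                               # assumed sequence window length
inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
dense_repr = layers.Embedding(input_dim=4, output_dim=4)(inputs)   # (SEQ_LEN, 4) dense vectors
embedder = models.Model(inputs, dense_repr)  # the embedding is trained jointly by gradient descent
```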

4.3 Model Training

The proposed RNN-based approach uses unsupervised training for the autoencoder and supervised training for the fine-tuning. The first layer, trained without supervision, uses 4 input units and 60 hidden RNN units trained for 50 epochs, with 4 output units connected to the second layer. The second layer, trained with supervision, uses 4 input units connected to stacked LSTM layers (the full LSTM variant, including forget gates and peephole connections). The 4 input units feed 60 hidden units trained for 100 epochs, and the output is a fully connected layer containing k units for k-class prediction. In our experiment, we used k = 2 to classify sequences (coding or non-coding). For the fully connected output layer, we used the softmax function to classify sequences and the sigmoid function to classify splice sites. For our training model, we used Adam [kingma2014adam], a recently proposed optimizer, with a logarithmic loss function; the objective function to be minimized is as described in Eq. (5). We used a batch size of 100 and applied batch normalization [ioffe2015batch]. We initialized the weights according to a uniform distribution, as described by Glorot and Bengio [glorot2010understanding]. The training time was approximately 46 hours, and the inference time was less than 1 second (Ubuntu 14.04 on a 3.5 GHz i7-5930K with a 12 GB Titan X).
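The following Keras-style sketch illustrates the training configuration described above (batch size 100, Glorot-uniform initialization, batch normalization, Adam, and a softmax output for k-class prediction); it is a hedged approximation, not the authors' exact training script:

```python
from tensorflow.keras import layers, models, optimizers, initializers

SEQ_LEN, HIDDEN, K = 100, 60, 2             # assumed window length; 60 hidden units; k = 2 classes
init = initializers.GlorotUniform()         # uniform initialization per Glorot and Bengio

inputs = layers.Input(shape=(SEQ_LEN, 4))
x = layers.LSTM(HIDDEN, return_sequences=True, kernel_initializer=init)(inputs)
x = layers.LSTM(HIDDEN, kernel_initializer=init)(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(K, activation="softmax", kernel_initializer=init)(x)

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.Adam(), loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=100, epochs=100)
```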

Figure 7: Comparison of learning algorithms in terms of robustness (best viewed in color). The mean and variance of accuracy are measured for a fixed DNA sequence length of 6000 over 500 cases by changing one percent of the hidden message. The shaded line represents the standard deviation of the inference accuracy.

4.4 Evaluation Procedure

To evaluate performance, we used the score obtained from the softmax of the neural network. We exploited a state-of-the-art algorithm [mitras2013proposed] to embed hidden messages. We randomly selected DNA sequences from the validation set of the Ensembl human genome dataset and obtained the scores of the stego-free sequences. In the next step, we embedded hidden messages into a selected DNA sequence from the validation set and obtained its score. Using the score distributions of the stego-free and stego sequences, we evaluated the difference in scores. The output from the softmax of the neural network is expected to have a score distribution similar to that of the unmodified genome sequences; however, the score distribution changes if messages are embedded. As shown in Figure 4(b) and Figure 4(c), modified sequences are distinguishable using our RNN model.

4.5 Performance Comparison

We evaluated the performance of our proposed method against four supervised learning algorithms (RNNs, SVM, random forests, and adaptive boosting) for detecting hidden messages. For the performance metric, we used differences in accuracy, where accuracy = (TP + TN) / (TP + FP + FN + TN), and TP, FP, FN, and TN represent the numbers of true positives, false positives, false negatives, and true negatives, respectively. Using the prediction performance data, we evaluated the learning algorithms with respect to the following three regions: intron only, exon only, and both regions together.

For each algorithm, we generated simulated data for different lengths of DNA sequences (6000, 12000, 18000, 24000, 30000, and 60000) using the UCSC-hg38 dataset [kent2002human]. We also randomly selected 1000 cases of the fixed DNA sequence length for each modification rate from 1 to 10%. Using the selected DNA sequences, we obtained the average prediction accuracy for different numbers of samples against non-perturbed DNA sequences over the 1000 randomly selected cases. In the next step, we obtained the prediction accuracy for the modified data generated according to the hiding algorithms. Using the averaged prediction accuracy for both the perturbed and non-perturbed cases, we evaluated the differences between the prediction accuracy rates for varying numbers of samples. We carried out five-fold cross-validation to obtain the mean and variance of the differences in accuracy.
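A small Python sketch of the evaluation metric (difference in accuracy between predictions on non-perturbed and perturbed sequences); the functions are illustrative placeholders rather than the original evaluation code:

```python
import numpy as np

def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

def accuracy_difference(y_true, pred_clean, pred_perturbed):
    """Difference in accuracy between predictions on clean and perturbed sequences."""
    acc_clean = np.mean(np.asarray(pred_clean) == np.asarray(y_true))
    acc_perturbed = np.mean(np.asarray(pred_perturbed) == np.asarray(y_true))
    return acc_clean - acc_perturbed
```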

Table 1: Detection performance of sequence alignment and denoising tools.

Method                          Both Regions (%)    Intron Region (%)    Exon Region (%)
RNN (proposed)                  99.93               99.96                99.94
BLAST [altschul1990basic]       84.00               85.00                85.00
Coral [salmela2010correction]    0.00                0.00                 0.00
Lighter [song2014lighter]        0.00                0.00                 0.00

Figure 6 shows the experiments for each algorithm using the six variable DNA sequence lengths. Each algorithm was compared across the three regions for the six sequence lengths. The experiments were conducted by changing from one to ten percent of the hidden message. SVM showed good detection performance in the exon region but inferior performance in the intron and both-regions categories. Adaptive boosting showed similar detection performance in the both-regions and intron-only categories but performed poorly in exon regions. Random forests performed well in the exon and both-regions categories except at some modification rates, and in the intron regions their detection performance was similar to that of the other learning algorithms. Notably, our proposed methodology based on RNNs outperformed all of the other algorithms for detecting hidden messages in all of the genomic regions evaluated.

In addition, we examined our proposed methodology against denoising methods, namely Coral [salmela2010correction] and Lighter [song2014lighter]. The UCSC-hg38 dataset was used to preserve local base structures, and perturbed data samples were treated as random noise. As shown in Table 1, both Coral and Lighter missed the hidden messages at all modification rates in all regions. In addition, the sequence alignment method (BLAST) performed poorly; its results suggest that there is a 15 to 16% chance that hidden messages may not be detected in all three regions.

To validate the learning algorithms with respect to robustness, we tested them with a fixed DNA sequence length of 6000 over 500 cases for each modification rate and measured the mean and variance of the test accuracy. Figure 7 shows how the performance measures (mean and variance of accuracy differences) change for modification rates ranging from 1 to 10% in the intron, exon, and both-regions categories. The plotted entries represent the mean averaged over the 500 cases, and the shaded lines show the average of the variances over the 500 cases. The results indicate that hidden messages may not be detected if the prediction difference is less than the variance. The overall analysis with respect to robustness showed that SVM, random forests, and adaptive boosting performed poorly.

5 Discussion

The development of next-generation sequencing has reduced the price of personal genomics [schuster2008next], and the discovery of the CRISPR-Cas9 system has provided unprecedented control over the genomes of many species [hsu2014development]. While the technology has yet to be applied to simulations involving artificial DNA, human DNA sequences may become an area in which DNA watermarking is applied. Our experiments using the real UCSC-hg38 human genome implicitly indicate that unknown relevant sequences are also detectable because of the similar patterns that characterize non-canonical splice sites; the proportions of donors with GT pairs and acceptors with AG pairs were found to be 86.32% and 84.63%, respectively [lee2015boosted]. Existing steganography techniques modify several nucleotides. Considering that only a few single-nucleotide modifications are involved, the DNA steganalysis task can be cast as a variant calling problem. In this regard, we believe that our methodology can be extended to the field of variant calling.

Although there are many advantages to using machine learning techniques to detect hidden messages [lyu2004steganalysis, erfani2016high, min2017deep], the following aspects require improvement: parameter tuning depends on the steganalyst (e.g., the training epochs, learning rate, and size of the training set), and a failure to detect hidden messages cannot be corrected by the steganalyst. We expect that future developments of such techniques will resolve these limitations. According to Alvarez and Salzmann [alvarez2016learning], the numbers of layers and neurons of deep networks can be determined by adding sparsity regularization to the objective function. The vectors of grouped parameters of each neuron in each layer incur penalties as the loss converges, and neurons whose parameters are driven to zero are removed.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2014M3C9A3063541, 2018R1A2B3001628], and the Brain Korea 21 Plus Project in 2018.

References