1 Introduction
DNA is perceived as a sequence over the letters {A,C,G,T}, the alphabet of nucleotides. This sequence constitutes the code that acts as a blueprint for all processes taking place in a cell. But beyond merely reflecting primary sequence, DNA is a molecule, which implies that DNA assumes spatial structure and shape. The spatial organization of DNA is achieved by integrating ("recruiting") other molecules, the histone proteins, that help to assume the correct spatial configuration. The combination of DNA and helper molecules is called chromatin; the spatial configuration of the chromatin, finally, defines the functional properties of local regions of the DNA (de Graaf and van Steensel, 2013).
Chromatin can assume several function-defining epigenetic states, which vary along the genome (Ernst et al., 2011). The key determinant of spatial configuration is the underlying primary DNA sequence: sequential patterns are responsible for recruiting histone proteins and their chemical modifications, which in turn give rise to, or even define, the chromatin states. The exact configuration of the chromatin and its interplay with the underlying raw DNA sequence are under active research. Despite many enlightening recent findings (e.g. Brueckner et al., 2016; The ENCODE Project Consortium, 2012; Ernst and Kellis, 2013), a comprehensive understanding has not yet been reached. Methods that predict chromatin-related states from primary DNA sequence are therefore of utmost interest. In machine learning, many prediction methods are available, among which deep neural networks have recently been shown to be promising in many applications
(LeCun et al., 2015). Deep neural networks have also proven valuable in biology (see Angermueller et al. (2016) for a review). Although DNA is primarily viewed as a sequence, treating genome sequence data as just a sequence neglects its inherent and biologically relevant spatial configuration and the resulting interactions between distal sequence elements. We hypothesize that a deep neural network designed to account for long-term interactions can improve performance. Additionally, the molecular spatial configuration of DNA suggests the relevance of a higher-dimensional spatial representation of DNA. However, due to the lack of a comprehensive understanding of chromatin structure, sensible suggestions for such higher-dimensional representations of DNA do not exist.
One way to enable a neural network to identify long-term interactions is the use of fully connected layers. However, when the number of input nodes to the fully connected layer is large, this comes with a large number of parameters. We therefore use three other techniques to detect long-term interactions. First, most convolutional neural networks (CNNs) use small convolution filters. Using larger filters already at an early stage in the network allows for early detection of long-term interactions without the need for fully connected layers with a large input. Second, a deep network similar to the ResNet (He et al., 2015) or Inception (Szegedy et al., 2015) design prevents features found in early layers from vanishing. Such designs also reduce the size of the layers, so that the final fully connected layers have a smaller input and do not require a huge number of parameters. Third, we propose a novel kind of DNA representation by mapping DNA sequences to higher-dimensional images using space-filling curves. Space-filling curves map a 1-dimensional line to a 2-dimensional space by assigning each element of the sequence to a pixel in the 2D image. By doing so, proximal elements of the sequence stay in close proximity to one another, while the distance between distal elements is reduced.
The space-filling curve used in this work is the Hilbert curve, which has several advantages. (i) [Continuity] Hilbert curves optimally ensure that the pixels representing two sequence elements that are close within the sequence are also close within the image (Bader, 2016; Aftosmis et al., 2004). (ii) [Clustering property] Cutting out rectangular subsets of pixels (which is what convolutional filters do) yields a minimal amount of disconnected subsequences (Moon et al., 2001). (iii) If a rectangular subimage cuts out two subsequences that are disconnected in the original sequence, chances are maximal that the two subsequences are relatively far apart (see our analysis in Appendix A).
The combination of these points arguably renders Hilbert curves an interesting choice for representing DNA sequences as two-dimensional images. (i) is a basic requirement for mapping short-term sequential relationships, which are ubiquitous in DNA (such as codons, motifs or intron-exon structure). (ii) relates to the structure of the chromatin, which, without all details being fully understood, is in general tightly packaged and organized. Results from Elgin (2012) indicate that when arranging DNA sequence based on Hilbert curves, contiguous areas belonging to identical chromatin states cover rectangular areas. In particular, the combination of (i) and (ii) motivates the application of convolutional layers to Hilbert curves derived from DNA sequence: rectangular subspaces, in other words submatrices encoding the convolution operations, contain a minimal amount of disconnected pieces of DNA. (iii), finally, is beneficial insofar as long-term interactions affecting DNA can also be mapped. This in particular applies to so-called enhancers and silencers, which exert positive (enhancer) or negative (silencer) effects on the activity of regions harboring genes, even though they may be far from those regions in terms of sequential distance.
1.1 Related Work
Since Watson and Crick first described the double-helix structure of DNA in 1953 (Watson and Crick, 1953), researchers have attempted to interpret biological characteristics from DNA. DNA sequence classification is the task of determining whether a sequence belongs to an existing class, and it is one of the fundamental tasks in bioinformatics research (Z. Xing, 2010). Many methods have been used, ranging from statistical learning (Vapnik, 1998) to machine learning methods (Michalski et al., 2013). Deep neural networks (LeCun et al., 2015) form the most recent class of methods used for DNA sequence classification (R.R. Bhat, 2016; Salimans et al., 2016; Zhou and Troyanskaya, 2015; Angermueller et al., 2016).
Both Pahm et al. (2005) and Higashihara et al. (2008) use support vector machines (SVMs) to predict chromatin state from DNA sequence features. While Pahm et al. (2005) use the entire set of features as input to the SVM, Higashihara et al. (2008) use random forests to preselect a subset of features that are expected to be highly relevant for the prediction of chromatin state, and use those as input to the SVM. Only Nguyen et al. (2016) use a CNN, as we do. There are two major differences between their approach and ours. First and foremost, the model architecture is different: the network in Nguyen et al. (2016) consists of two convolution layers followed by pooling layers, a fully connected layer and a sigmoid layer, while our model architecture is deeper, uses residual connections to reuse learned features, has larger convolution filters, and has small layers preceding the fully connected layers (see Methods). Second, while we use a space-filling curve to transform the sequence data into an image-like tensor, Nguyen et al. (2016) keep the sequential form of the input data. Apart from Elgin (2012), the only example we are aware of where Hilbert curves were used to map DNA sequence into two-dimensional space is from Anders (2009), who demonstrated the power of Hilbert curves for visualizing DNA. Beyond our theoretical considerations, these last two studies suggest there are practical benefits to mapping DNA using Hilbert curves.
1.2 Contribution
Our contributions are twofold. First, we predict chromatin state using a CNN that, in terms of architecture, resembles conventional CNNs for image classification and is designed for detecting distal relations. Second, we propose a method to transform DNA sequence patches into two-dimensional image-like arrays, using space-filling curves, in particular the Hilbert curve, to leverage the strengths of CNNs. Our experiments demonstrate the benefits of our approach: the developed CNN decisively outperforms all existing approaches for predicting chromatin state in terms of prediction performance measures as well as runtime, an improvement which is further enhanced by the conversion of the DNA sequence into a 2D image. In summary, we present a novel, powerful way to harness the power of CNNs in image classification for predicting biologically relevant features from primary DNA sequence.
2 Methods
2.1 DNA sequence representation
We transform DNA sequences into images in three steps. First, we represent a sequence as a list of k-mers. Next, we transform each k-mer into a one-hot vector, so that the sequence is represented as a list of one-hot vectors. Finally, we create an image-like tensor by assigning each element of the list of k-mers to a pixel in the image using Hilbert curves. Each of the steps is explained in further detail below.
From a molecular biology point of view, the nucleotides that constitute a DNA sequence do not mean much individually. Instead, nucleotide motifs play a key role in protein synthesis. In bioinformatics it is common to consider a sequence's k-mers, defined as the k-letter words from the alphabet {A,C,G,T} that together make up the sequence. In computer science the term n-gram is more frequently used, often applied in text mining (Tomovic et al., 2006). As an example, the sequence TGACGAC can be transformed into the list of 3-mers {TGA, GAC, ACG, CGA, GAC} (note that these overlap). The first step in our approach is thus to transform the DNA sequence into a list of k-mers. Previous work has shown that 3-mers and 4-mers are useful for predicting epigenetic state (Pahm et al., 2005; Higashihara et al., 2008). Through preliminary experiments, we found that k = 4 yields the best performance: lower values of k result in reduced accuracy, while higher values yield a high risk of overfitting. Only for the Splice dataset (see Experiments) did we use k = 1 to prevent overfitting, as this is a small dataset.
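The k-mer decomposition described above can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
def kmers(seq, k):
    """Return the list of overlapping k-mers of a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("TGACGAC", 3))  # → ['TGA', 'GAC', 'ACG', 'CGA', 'GAC']
```

A sequence of length L thus yields L − k + 1 overlapping k-mers, e.g. 497 4-mers for a 500 base-pair sequence.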
In natural language processing, it is common to use word embeddings such as GloVe or word2vec, or one-hot vectors (Goldberg, 2017). The latter approach is most suitable for our method. Each element of such a vector corresponds to a word, so a vector of length n can be used to represent n different words. A one-hot vector has a one in the position corresponding to the word it represents, and a zero in all other positions. In order to represent all k-mers in a DNA sequence, we need a vector of length 4^k, as this is the number of words of length k that can be constructed from the alphabet {A,C,G,T}. For example, if we wish to represent all 1-mers, we can do so using a one-hot vector of length 4, where A corresponds to (1,0,0,0), C to (0,1,0,0), G to (0,0,1,0) and T to (0,0,0,1). In our case, the DNA sequence is represented as a list of 4-mers, which can be converted to a list of one-hot vectors, each of length 4^4 = 256.
Our next step is to transform the list of one-hot vectors into an image. For this purpose, we aim to assign each one-hot vector to a pixel. This gives us a 3-dimensional tensor, which is similar in shape to the tensor that serves as input to image classification networks: the color of a pixel in an RGB image is represented by a vector of length 3, while in our approach each pixel is represented by a one-hot vector of length 256.
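The one-hot encoding can be sketched as follows (a minimal illustration assuming the alphabetical ordering A, C, G, T; names are ours):

```python
from itertools import product

def onehot_table(k):
    """Map each of the 4**k possible k-mers to a one-hot vector of length 4**k."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    table = {}
    for idx, word in enumerate(words):
        vec = [0] * len(words)
        vec[idx] = 1          # single one at the position encoding this k-mer
        table[word] = vec
    return table

t1 = onehot_table(1)
print(t1["A"])  # → [1, 0, 0, 0]
```

For k = 4 this produces 256 vectors of length 256, matching the channel dimension used in the rest of the paper.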
What remains now is to assign each of the one-hot vectors in the list to a pixel in the image. For this purpose, we can make use of space-filling curves, as they map 1-dimensional sequences to a 2-dimensional surface while preserving continuity of the sequence (Bader, 2016; Aftosmis et al., 2004). Various types of space-filling curves are available. We have compared the performance of several such curves, and concluded that Hilbert curves yield the best performance (Appendix A). This corresponds with our intuition: the Hilbert curve has several properties that are advantageous in the case of DNA sequences, as discussed in the introduction.
The Hilbert curve is a well-known space-filling curve that is constructed in a recursive manner: in the first iteration, the curve is divided into four parts, which are mapped to the four quadrants of a square (Fig. 1a). In the next iteration, each quadrant is divided into four subquadrants, which, in a similar way, each hold 1/16 of the curve (Fig. 1b). The quadrants of these subquadrants each hold 1/64 of the curve, etc. (Figs. 1c and d).
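The recursive construction above can also be computed iteratively. The following is a minimal sketch using the standard bit-manipulation formulation of the Hilbert index-to-coordinate mapping (not the authors' code; function and variable names are ours):

```python
def d2xy(order, d):
    """Map index d along a Hilbert curve of the given order to an (x, y)
    pixel on the 2**order x 2**order grid."""
    x = y = 0
    t, s = d, 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry    # move into the proper quadrant
        t //= 4
        s *= 2
    return x, y

# The first-order curve visits the four quadrants of a 2x2 square:
print([d2xy(1, d) for d in range(4)])  # → [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Consecutive indices always map to adjacent pixels, which is exactly the continuity property (i) discussed in the introduction.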
By construction, the Hilbert curve yields a square image of size 2^n × 2^n, where n is the order of the curve (see Fig. 1). However, a DNA sequence does not necessarily have exactly 4^n k-mers. In order to fit all k-mers into the image, we choose n such that 4^n is at least the number of k-mers in the sequence, and since we do not wish to make the image too large, we pick the smallest such n. In many cases, a large fraction of the pixels then remains unused, as there are fewer k-mers than pixels in the image. By construction, the used pixels are located in the upper half of the image. Cropping the picture by removing the unused part of the image yields rectangular images and increases the fraction of the image that is used (Fig. 1e).
In most of our experiments we used sequences with a length of 500 base pairs, which we convert to a sequence of 500 − 4 + 1 = 497 4-mers. We thus need a Hilbert curve of order 5, resulting in an image of dimensions 32 × 32 × 256 (recall that each pixel is assigned a one-hot vector of length 256). Almost half of the resulting 1024 pixels are filled, leaving the other half of the image empty while still consuming memory. We therefore remove the empty half of the image and end up with an image of size 16 × 32 × 256.
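The shape arithmetic of this paragraph can be sketched as follows (a simplified calculation assuming, as stated above, that the used pixels occupy the top rows of the image; the function name is ours):

```python
import math

def image_shape(seq_len, k):
    """Derive the cropped Hilbert-image dimensions for a sequence of the
    given length, decomposed into overlapping k-mers."""
    n_kmers = seq_len - k + 1              # number of overlapping k-mers
    order = 1
    while (2 ** order) ** 2 < n_kmers:     # smallest order fitting all k-mers
        order += 1
    side = 2 ** order                      # image is side x side before cropping
    rows = math.ceil(n_kmers / side)       # keep only rows that contain k-mers
    return rows, side, 4 ** k              # (height, width, channels)

print(image_shape(500, 4))  # → (16, 32, 256)
```

The same calculation with the Splice dataset's parameters (length 61, k = 1) yields a much smaller image.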
The data now has the appropriate form to serve as input to our model.
2.2 Network architecture
Modern CNNs and other image classification systems mainly focus on grayscale and standard RGB images, resulting in channels of length 1 or 3, respectively, for each pixel. In our approach, each pixel in the generated image is assigned a one-hot vector representing a k-mer. For increasing k, the length of this vector, and thus the number of channels, increases. Here, we use k = 4, resulting in 256 channels, which implies that each channel contains very sparse information. Due to the curse of dimensionality, standard network architectures applied to such images are prone to severe overfitting.
Here, we design a CNN specifically for the kind of high-dimensional image that is generated from a DNA sequence. The architecture is inspired by ResNet (He et al., 2015) and Inception (Szegedy et al., 2015). The network consists of a stack of layers, each implementing a non-linear function of the output of the preceding layer. These functions are composed of basic building blocks such as convolution, batch normalization, pooling and non-linear activation layers. The first part of the network has the objective of reducing the sparseness of the input image (Figure 2), and consists of consecutive convolution, batch normalization, activation and pooling layers. The main body of the network enhances the ability of the CNN to extract the relevant features from DNA space-filling curves. For this purpose, we designed a specific Computation Block inspired by the ResNet residual blocks (He et al., 2015). The last part of the network consists of 3 fully connected layers, and a softmax is used to obtain the output classification label. The complete model is presented in Table 1, and code is available on Github (https://github.com/Bojian/HilbertCNN/tree/master). A simplified version of our network with two Computation Blocks is illustrated in Figure 2.
Computation Block.
In the Computation Block, the outputs of two Residual Blocks and one identity mapping are first summed, followed by a batch normalization and an activation layer (Figure 2). In total, the Computation Block has four convolutional layers, two in each Residual Block (see Figure 3). Each Residual Block first computes a composite function of five consecutive layers (convolution, batch normalization and activation layers), followed by the concatenation of the output with the input tensor. The Residual Block concludes with an activation layer.
The Residual Block can thus be viewed as a new kind of non-linear layer, parameterized by the filter sizes of its two convolutional layers and by the output dimensions of the first convolutional layer and of the Residual Block as a whole; fixing the relation between these two output dimensions simplifies the network architecture and reduces the computational cost. The Computation Block combines two such Residual Blocks with an identity mapping; here we chose the same parameters for both Residual Blocks in a Computation Block.
Implementation details.
Most convolutional layers use small square filters, except for the layers in the first part of the network, where larger filters are applied to capture long-range features. We use Exponential Linear Units (ELU; Clevert et al., 2015) as our activation function to reduce the effect of vanishing gradients: preliminary experiments showed that ELU performed significantly better than other activation functions (data not shown).
For the pooling layers we used average pooling, which generally outperformed max pooling in terms of prediction accuracy, as it reduces the high variance of the sparse generated images. Cross-entropy was used as the loss function.
Layers  Description
Input
Conv1  conv, BN
Conv2  conv, BN
Activation
AveragePool
ComputationBlock
ComputationBlock
AveragePool
ComputationBlock
ComputationBlock
ComputationBlock
AveragePool
BN, Activation
AveragePool
Classification layer 1  activation, dropout 0.5
Classification layer 2  activation, dropout 0.5

Name  Samples  Description
H3  14965  H3 occupancy
H4  14601  H4 occupancy
H3K9ac  27782  H3K9 acetylation
H3K14ac  33048  H3K14 acetylation
H4ac  34095  H4 acetylation
H3K4me1  31677  H3K4 mono-methylation
H3K4me2  30683  H3K4 di-methylation
H3K4me3  36799  H3K4 tri-methylation
H3K36me3  34880  H3K36 tri-methylation
H3K79me3  28837  H3K79 tri-methylation
Splice  3190  splice-junction sequences

3 Experiments
We test the performance of our approach using ten publicly available datasets from Pokholok et al. (2005). The datasets contain DNA subsequences with a length of 500 base pairs. Each sequence is labeled either as “positive” or “negative”, indicating whether or not the subsequence contains regions that are wrapped around a histone protein. The ten datasets each consider a different type of histone protein, indicated by the name of the dataset. Details can be found in Table 2.
A randomly chosen part of each dataset is used for training the network, a second part for validation and early stopping, and the remainder for evaluation. We train the network using the Adam optimizer (Kingma and Ba, 2017). (The LSTM model was implemented in Keras (Chollet et al., 2015); all other models were implemented in Tensorflow (Abadi et al., 2015).) The learning rate is set to 0.003, the batch size to 300 samples, and the maximum number of epochs to 10. After each epoch the level of generalization is measured as the accuracy obtained on the validation set. We use early stopping to prevent overfitting. To ensure the model stops at the correct time, we combine a measurement of generalization capability (Prechelt, 1998) with the No-Improvement-In-N-Steps (NiiN) method (Prechelt, 1998). For instance, Nii2 means that the training process is terminated when the generalization capability has not improved in two consecutive epochs.
We compare the performance of our approach, referred to as HCNN, to existing models. One of these is the support vector machine (SVM) model by Higashihara et al. (2008), for which results are available in their paper. Second, in close communication with the authors, we reconstructed the SeqCNN model presented in Nguyen et al. (2016) (the original software was no longer available); see Appendix C for detailed settings. Third, we constructed a commonly used LSTM, where the so-called 4-mer profile of the sequence is used as input. A 4-mer profile is a list containing the number of occurrences of all 256 4-mers over the alphabet {A,C,G,T} in a sequence. Preliminary tests showed that using all 256 4-mers resulted in overfitting, and that including only the 100 most frequent 4-mers is sufficient. Details of the LSTM architecture can be found in Table 8 in Appendix C.
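The NiiN early-stopping rule described above can be sketched as follows (a minimal illustration applied to a given trace of per-epoch validation accuracies; names and return convention are ours):

```python
def train_with_niin(epoch_scores, n=2):
    """Sketch of the No-Improvement-In-N-Steps rule: stop once validation
    accuracy has not improved for n consecutive epochs (Nii-n).
    Returns (stopping_epoch, best_validation_score)."""
    best, since_best = float("-inf"), 0
    for epoch, score in enumerate(epoch_scores):
        if score > best:
            best, since_best = score, 0   # improvement: reset the counter
        else:
            since_best += 1
        if since_best >= n:               # n epochs without improvement
            return epoch, best
    return len(epoch_scores) - 1, best    # ran to the epoch limit

print(train_with_niin([0.70, 0.74, 0.73, 0.72, 0.75], n=2))  # → (3, 0.74)
```

With n = 2 (Nii2), training stops after the second consecutive epoch without validation improvement, as in the example trace above.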
In order to assess the effect of using a 2D representation of the DNA sequence in isolation, we compare HCNN to a neural network using a sequential representation as input, referred to as seqHCNN. As in HCNN, the DNA sequence is converted into a list of one-hot vectors representing k-mers, though the mapping of the sequence into a 2D image is omitted. The network architecture is a "flattened" version of the one used in HCNN: for example, a 7×7 convolution filter in HCNN is transformed into a 49×1 convolution filter in the 1D-sequence model. In terms of model size, the SeqCNN model contains 1.1M parameters, both HCNN and seqHCNN have 961K parameters, and the LSTM has 455K parameters.
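The 4-mer profile used as the LSTM input above can be computed as follows (a sketch; the function name and the `top` truncation parameter are ours):

```python
from collections import Counter

def kmer_profile(seq, k=4, top=None):
    """Count occurrences of each k-mer in a sequence; optionally keep only
    the `top` most frequent ones (the text keeps the 100 most frequent)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if top is not None:
        counts = Counter(dict(counts.most_common(top)))
    return counts

profile = kmer_profile("ACGTACGTACGT", k=4)
print(profile["ACGT"])  # → 3
```

Truncating to the most frequent 4-mers (e.g. `top=100`) is the overfitting countermeasure mentioned above.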
In order to test whether our method is also applicable to DNA sequence classification tasks other than chromatin state prediction, we performed additional tests on the splice-junction gene sequences dataset from Lichman (2013). Most of the DNA sequence is unused; splice-junctions refer to positions in the genetic sequence where the transition from an unused subsequence (intron) to a used subsequence (exon), or vice versa, takes place. The dataset consists of DNA subsequences of length 61, each of which is known to be an intron-to-exon splice-junction, an exon-to-intron splice-junction, or neither. As the dataset is relatively small, we used 1-mers instead of 4-mers. Note that the sequences are much shorter than in the other datasets, resulting in smaller images.
4 Results
The results show that both SVM and SeqCNN are outperformed by HCNN and seqHCNN, while the LSTM shows poor performance. HCNN and seqHCNN show similar performance in terms of prediction accuracy, though HCNN yields more consistent results over the ten folds, indicating that the 2D representation of the sequence improves robustness. Furthermore, HCNN performs better than seqHCNN in terms of precision, recall, AP and AUC (Table 5), enabling one to reliably vary the trade-off between recall and false discoveries. HCNN also outperforms all methods in terms of training time (Table 4).
Dataset  SVM  LSTM  SeqCNN  seqHCNN  HCNN
H3  87.34
H4  87.33
H3K9ac  79.19
H3K14ac  75.06  74.79
H4ac  77.06
H3K4me1  73.47  73.21
H3K4me2  74.27
H3K4me3  74.54  74.45
H3K36me3  77.18  77.03
H3K79me3  81.66  81.63
Splice  94.11
Dataset  LSTM  SeqCNN  seqHCNN  HCNN

H3  35:43  95:23  6:47  3:40 
H4  45:32  95:53  5:12  3:12 
H3K9ac  76:06  173:18  17:24  7:40 
H3K14ac  81:21  180:56  17:42  13:24 
H4ac  93:32  181:33  24:48  17:32 
H3K4me1  93:44  192:20  18:30  10:38 
H3K4me2  94:22  188:13  18:23  14:38 
H3K4me3  96:03  162:32  20:40  11:33 
H3K36me3  93:48  161:12  21:52  16:37 
H3K79me3  64:28  158:34  14:25  10:13 
Splice  6:42  35:12  3:42  1:30 
Dataset  Recall  Precision  AP  AUC
  seqHCNN  HCNN  seqHCNN  HCNN  seqHCNN  HCNN  seqHCNN  HCNN
H3  
H4  
H3K9ac  
H3K14ac  
H4ac  
H3K4me1  
H3K4me2  
H3K4me3  
H3K36me3  
H3K79me3  
Splice 
The good performance of HCNN observed above may be attributable either to the conversion from DNA sequence to image, or to the use of the Hilbert curve specifically. In order to address this question, we adapted our approach by replacing the Hilbert curve with other space-filling curves (Moon et al., 2001) and compared their prediction accuracy (see Appendix A). In Figure 4, we compare the performance of our model with the different mapping strategies on the various datasets. We find that the images generated by the Hilbert curve provide the best accuracy on most datasets, while the 1D sequence performs worst.
5 Discussion
In this paper we developed a CNN that outperforms the state-of-the-art for the prediction of epigenetic states from primary DNA sequence. Our method shows improved prediction accuracy and training time compared to the currently available chromatin state prediction methods of Pahm et al. (2005), Higashihara et al. (2008) and Nguyen et al. (2016), as well as an LSTM model. Additionally, we showed that representing DNA sequences as 2D images using Hilbert curves further improves precision and recall, as well as training time, compared to a 1D-sequence representation.
We believe that the improved performance over the CNN developed by Nguyen et al. (2016) (SeqCNN) is the result of three factors. First, our network uses larger convolutional filters, allowing the model to detect long-distance interactions. Second, despite HCNN being deeper, both HCNN and seqHCNN have fewer parameters, allowing for faster optimization. This is due to the size of the layer preceding the fully connected layer, which is large in the method proposed by Nguyen et al. (2016) and thus yields a huge number of parameters in the fully connected layer; in HCNN, on the other hand, the number of nodes is strongly reduced before the fully connected layer is introduced. Third, the use of a two-dimensional input further enhances the model's ability to incorporate long-term interactions.
We showed that seqHCNN and HCNN are not only capable of predicting chromatin state, but can also predict the presence or absence of splice-junctions in DNA subsequences. This suggests that our approach could be useful for DNA sequence classification problems in general.
Hilbert curves have several properties that are desirable for DNA sequence classification. The intuitive motivation for the use of Hilbert curves is supported by good results when comparing Hilbert curves to other space-filling curves. Additionally, Hilbert curves have previously been shown to be useful for the visualization of DNA sequences (Anders, 2009).
The main limitation of Hilbert curves is their fixed length, which implies that the generated image contains some empty spaces. These spaces consume computational resources; nevertheless, the 2D representation still yields reduced training times compared to the 1D-sequence representation, presumably due to the high degree of optimization for 2D inputs in standard CNN frameworks.
Given that a substantial part of the improvement in performance is due to our novel architecture, we plan to investigate in more detail how the components of the architecture contribute to the improvements in prediction performance. We also plan to further investigate why Hilbert curves yield the particular advantages in terms of robustness and false discovery control observed here.
References
 Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Aftosmis et al. (2004) M. Aftosmis, M. Berger, and S. Murman. Applications of space-filling curves to Cartesian methods for CFD. 42nd AIAA Aerospace Sciences Meeting and Exhibit, May 2004. doi: 10.2514/6.2004-1232.
 Anders (2009) S. Anders. Visualization of genomic data with the Hilbert curve. Bioinformatics, 25(10):1231–1235, 2009.
 Angermueller et al. (2016) C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle. Deep learning for computational biology. Molecular Systems Biology, 12(7):878, 2016.
 Bader (2016) M. Bader. Space-filling curves: an introduction with applications in scientific computing. Springer, 2016.
 Bergstra and Bengio (2012) J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 Brueckner et al. (2016) L. Brueckner, J. van Arensbergen, W. Akhtar, L. Pagie, and B. van Steensel. Highthroughput assessment of contextdependent effects of chromatin proteins. Epigenetics and Chromatin, 9(1):43, 2016.
 Chollet et al. (2015) F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
 Clevert et al. (2015) D.A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
 de Graaf and van Steensel (2013) C.A. de Graaf and B. van Steensel. Chromatin organization: form to function. Current Opinion in Genetics and Development, 23(2):185–190, 2013.
 Elgin (2012) S.C.R. Elgin. http://slideplayer.com/slide/7605009/, 2012. Slides 5557.
 Ernst and Kellis (2013) J. Ernst and M. Kellis. Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types. Genome Research, 23(7):1142–1154, 2013.
 Ernst et al. (2011) J. Ernst, P. Kheradpour, T.S. Mikkelsen, N. Shoresh, Ward L.D., et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43–49, 2011.
 Goldberg (2017) Y. Goldberg. Neural network methods for natural language processing. Morgan & Claypool, 2017.
 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, Dec 2015. URL https://arxiv.org/abs/1512.03385.

 Higashihara et al. (2008) M. Higashihara, J.D. Rebolledo-Mendez, Y. Yamada, and K. Satou. Application of a feature selection method to nucleosome data: Accuracy improvement and comparison with other methods. WSEAS Transactions on Biology and Biomedicine, 2008.
 Kingma and Ba (2017) D.P. Kingma and J. Ba. Adam: A method for stochastic optimization, Jan 2017. URL https://arxiv.org/abs/1412.6980.
 LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Lichman (2013) M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

 Michalski et al. (2013) R.S. Michalski, J.G. Carbonell, and T.M. Mitchell. Machine Learning: An Artificial Intelligence Approach. Springer Berlin, 2013.
 Moon et al. (2001) B. Moon, H.V. Jagadish, C. Faloutsos, and J.H. Saltz. Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Transactions on Knowledge and Data Engineering, 13(1):124–141, 2001. doi: 10.1109/69.908985.
 Nguyen et al. (2016) N.G. Nguyen, V.A. Tran, D.L. Ngo, D. Phan, F.R. Lumbanraja, et al. DNA sequence classification by convolutional neural network. Journal of Biomedical Science and Engineering, 9(5):280–286, 2016.
 Pahm et al. (2005) T.H. Pahm, D.H. Tran, T.B. Ho, K. Satou, and G. Valiente. Qualitatively predicting acetylation and methylation areas in dna sequences. Genome Informatics, 16(2):3–11, 2005.
 Pokholok et al. (2005) D.K. Pokholok, C.T. Harbison, S. Levine, M. Cole, N.M. Hannett, T.I. Lee, G.W. Bell, K. Walker, P. A. Rolfe, and E. Herbolsheimer. Genomewide map of nucleosome acetylation and methylation in yeast. Cell, 122(4):517–527, 2005. doi: 10.1016/j.cell.2005.06.026.
 Prechelt (1998) L. Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767, 1998.
 R.R. Bhat (2016) R.R. Bhat, V. Viswanath, and X. Li. DeepCancer: Detecting cancer through gene expressions via deep generative learning. arXiv, 2016.
 Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs, Jun 2016. URL https://arxiv.org/abs/1606.03498.

Szegedy et al. (2015)
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pages 1–9, 2015.  The ENCODE Project Consortium (2012) The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, 2012.
 Tomovic et al. (2006) A. Tomovic, P. Janicic, and V. Keselj. ngrambased classification and unsupervised hierarchical clustering of genome sequences. Computer Methods and Programs in Biomedicine, 81(2):137–153, 2006. doi: 10.1016/j.cmpb.2005.11.007.
 Vapnik (1998) V. N. Vapnik. Statistical learning theory., volume 1. Wiley, 1998.
 Watson and Crick (1953) J.D. Watson and F.H. Crick. Molecular structure of nucleic acids. Nature, 171(4356):737–738, 1953.
Xing et al. (2010) Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.
Zhou and Troyanskaya (2015) J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12:931–934, 2015.
Appendix A Comparison of space-filling curves with regard to long-term interactions
As noted before, long-term interactions are highly relevant in DNA sequences. In this section we consider these long-term interactions in four space-filling curves: the reshape curve, the snake curve, the Hilbert curve and the diag-snake curve. See Fig. 5 for an illustration.
As can be seen in Fig. 5, mapping a sequence to an image reduces the distance between two elements that are far from one another in the sequence, while the distance between nearby elements does not increase. Each of the curves, however, has a different effect on the distance between far-away elements. In order to assess these differences, we use a measure based on the distance between two sequence elements as it can be observed in the image. We denote this distance by d(C(x_i), C(x_j)), with x = (x_1, …, x_n) the sequence and C the curve under consideration. Then, for a sequence of length 16 mapped to a 4×4 image, the Euclidean distance between the images of the first and the last element is
15 for the sequence;
3√2 ≈ 4.24 for the reshape curve;
3 for the snake curve;
3√2 ≈ 4.24 for the diagonal snake curve;
3 for the Hilbert curve.
We now introduce a measure S(C) that summarizes the set of weighted distances between all pairs of elements in the sequence: for every pair (x_i, x_j), this set contains the image distance d(C(x_i), C(x_j)), weighted by the distance |i − j| between the two elements in the sequence. Note that a low image distance relative to the sequence distance implies that long-term interactions are strongly accounted for, so a high S(C) is desirable.
The measure introduced above is evaluated for the four space-filling curves as well as for the sequence representation, for sequences of varying lengths. The results, shown in the table below, show that the Hilbert curve yields the highest values and thus performs best in terms of retaining long-distance interactions.
Curve        Sequence length
             16     64     256    1024   4096
Reshape      0.22   0.18   0.16   0.16   0.15
Snake        0.28   0.20   0.17   0.16   0.16
Diag-snake   0.22   0.17   0.16   0.15   0.15
Hilbert      0.30   0.24   0.22   0.21   0.21
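The behaviour compared in this appendix can be checked mechanically. The sketch below (function names are illustrative, not from the paper) enumerates the grid coordinates that each of the four curves assigns to a length-16 sequence on a 4×4 grid, and measures the Euclidean distance between the images of the first and last sequence element.

```python
import math

def reshape_curve(n):
    # Row-major traversal of an n-by-n grid.
    return [(i // n, i % n) for i in range(n * n)]

def snake_curve(n):
    # Like reshape, but every other row is traversed right-to-left.
    coords = []
    for r in range(n):
        cols = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        coords += [(r, c) for c in cols]
    return coords

def diag_snake_curve(n):
    # Traverse anti-diagonals of the grid, alternating direction.
    coords = []
    for s in range(2 * n - 1):
        cells = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 1:
            cells.reverse()
        coords += cells
    return coords

def hilbert_curve(n):
    # Standard index-to-coordinate conversion; n must be a power of two.
    def d2xy(d):
        x = y = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x, y = x + s * rx, y + s * ry
            t //= 4
            s *= 2
        return (x, y)
    return [d2xy(d) for d in range(n * n)]

def endpoint_distance(coords):
    # Euclidean distance between the images of the first and last element.
    (r0, c0), (r1, c1) = coords[0], coords[-1]
    return math.hypot(r1 - r0, c1 - c0)
```

For n = 4 this reproduces the distances listed above: 3√2 for reshape and diag-snake, 3 for snake and Hilbert, versus 15 along the raw sequence.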
Appendix B From sequence to image
Figure 6 shows the conversion from DNA sequence to image.
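Figure 6 itself is not reproduced here; as a sketch of the idea, the snippet below one-hot encodes a DNA string (one channel per nucleotide — an assumption for illustration, as the paper's actual encoding may differ, e.g. be k-mer based) and writes each element's channel vector to the pixel that a Hilbert curve assigns to its sequence position.

```python
import numpy as np

CHANNEL = {"A": 0, "C": 1, "G": 2, "T": 3}

def hilbert_d2xy(n, d):
    # Map index d along a Hilbert curve to (x, y) on an n-by-n grid (n a power of two).
    x = y = 0
    t, s = d, 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def sequence_to_image(seq, n):
    # One 4-channel one-hot pixel per nucleotide, laid out along the Hilbert curve.
    img = np.zeros((4, n, n), dtype=np.float32)
    for d, base in enumerate(seq[: n * n]):
        x, y = hilbert_d2xy(n, d)
        img[CHANNEL[base], y, x] = 1.0
    return img

img = sequence_to_image("ACGTACGTACGTACGT", 4)
```

The resulting 4×4×4 tensor can be fed to a 2D convolutional network in the usual channels-first layout.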
Appendix C Details of alternative neural networks
Layer           Filter size   Stride   Output dim
Convolution 1   7             2        60
Activation
Max pooling     3             3        60
Convolution 2   5             2        30
Activation
Max pooling     3             2        30
Dropout, 0.5
FC layer                               100
Activation
Dropout, 0.5
Classifier                             2
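The output dimensions in the table follow the standard valid-convolution arithmetic, out = ⌊(in − kernel) / stride⌋ + 1, applied per spatial dimension. The sketch below traces the spatial size through the layers for a hypothetical input length of 500 (the true input size is not stated in this excerpt).

```python
def out_len(length, kernel, stride):
    # Valid (no padding) convolution / pooling output length.
    return (length - kernel) // stride + 1

def trace_seq_cnn(length):
    # Follow the layer table: conv(7, s2) -> pool(3, s3) -> conv(5, s2) -> pool(3, s2).
    length = out_len(length, 7, 2)   # Convolution 1
    length = out_len(length, 3, 3)   # Max pooling
    length = out_len(length, 5, 2)   # Convolution 2
    length = out_len(length, 3, 2)   # Max pooling
    return length

# 30 output channels remain before the FC layer, per the table.
flat_features = 30 * trace_seq_cnn(500)
```

This kind of trace is a quick sanity check that the FC layer's input size matches the flattened convolutional output.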
Layer        Description
Embedding
Conv 1       activation function is ReLU
Max pooling  max pooling layer
Bi-LSTM 1    100 units
Bi-LSTM 2    128 units
Dropout      0.3 dropout rate
Classifier   sigmoid
Architecture   # Parameters
seqCNN         1.1M
biLSTM         455K
HilbertCNN     961K
Appendix D Hyperparameter optimization
Accuracy is one of the most intuitive performance measures in machine learning. We therefore optimized the hyperparameters, such as the network architecture and the learning rate, for maximum accuracy. The hyperparameters were optimized through random search [Bergstra and Bengio, 2012], using general principles from successful deep learning strategies together with the following intuitions. First, as our main goal was to capture long-term interactions, we chose a large kernel size for the first layer; among the various values attempted, 7×7 gave the best performance. As is common practice in deep learning, we then opted for a smaller kernel size in the following layer. Second, in order to limit the number of parameters, we made use of residual blocks inspired by ResNet [He et al., 2015] and Inception [Szegedy et al., 2015]. Finally, we applied batch normalization to prevent overfitting.
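Random search as cited above can be sketched as follows; the search space, the log-uniform learning-rate prior, and the evaluation function are illustrative stand-ins, not the authors' actual setup.

```python
import math
import random

def sample_config(rng):
    # Log-uniform learning rate, categorical choices for the architecture knobs.
    return {
        "learning_rate": 10 ** rng.uniform(-5, -2),
        "first_kernel": rng.choice([3, 5, 7, 9]),
        "second_kernel": rng.choice([3, 5]),
        "dropout": rng.choice([0.3, 0.5]),
    }

def random_search(evaluate, n_trials=20, seed=0):
    # Keep the configuration with the highest validation accuracy.
    rng = random.Random(seed)
    best_config, best_acc = None, -math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        acc = evaluate(config)
        if acc > best_acc:
            best_config, best_acc = config, acc
    return best_config, best_acc

# Toy stand-in for "train the network, return validation accuracy";
# it mildly prefers a 7-wide first kernel and a learning rate near 1e-3.
def toy_evaluate(config):
    lr_penalty = abs(math.log10(config["learning_rate"]) + 3) * 0.01
    return 0.8 + 0.05 * (config["first_kernel"] == 7) - lr_penalty

best, acc = random_search(toy_evaluate, n_trials=50)
```

In practice `evaluate` would train the model with early stopping and report held-out accuracy; random search simply keeps the best-scoring draw.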