Comprehensive Evaluation of Deep Learning Architectures for Prediction of DNA/RNA Sequence Binding Specificities

01/29/2019 ∙ by Ameni Trabelsi, et al. ∙ Colorado State University

Motivation: Deep learning architectures have recently demonstrated their power in predicting DNA- and RNA-binding specificities. Existing methods fall into three classes: some are based on Convolutional Neural Networks (CNNs), others use Recurrent Neural Networks (RNNs), and others rely on hybrid architectures combining CNNs and RNNs. However, based on existing studies it is still unclear which deep learning architecture achieves the best performance. Thus an in-depth analysis and evaluation of the different methods is needed to fully evaluate their relative merits.

Results: In this study, we present a systematic exploration of various deep learning architectures for predicting DNA- and RNA-binding specificities. For this purpose, we present deepRAM, an end-to-end deep learning tool that provides an implementation of novel and previously proposed architectures; its fully automatic model selection procedure allows us to perform a fair and unbiased comparison of deep learning architectures. We find that an architecture that uses k-mer embedding to represent the sequence, a convolutional layer and a recurrent layer outperforms all other methods in terms of model accuracy. Our work provides guidelines that will assist the practitioner in choosing the best architecture for the task at hand, and provides some insights on the differences between the models learned by convolutional and recurrent networks. In particular, we find that although recurrent networks improve model accuracy, this comes at the expense of a loss in the interpretability of the features learned by the model.

Availability and implementation: The source code for deepRAM is available at




1 Introduction

DNA- and RNA-binding proteins are involved in many biological processes including transcription, translation, and alternative splicing (Ferré et al. 2016; Gerstberger et al. 2014). Unfortunately, only some of these binding sites have been identified by biological experiments. Moreover, these experiments are expensive and time-consuming. In order to represent binding sites and detect new ones, Position Weight Matrices (PWMs) are the most common method to characterize the sequence specificity of a protein thanks to their simplicity and ease of interpretation (Stormo 2000). However, many studies suggest that sequence specificity can be better captured using more complex models (Rohs et al. 2010; Kazan et al. 2010; Siggers and Gordan 2013).

In recent years, deep neural networks have become the technique of choice for challenging tasks in computer vision (Krizhevsky et al. 2012; LeCun et al. 2015), speech recognition (Hinton et al. 2012), machine translation (Sutskever et al. 2014), and computational biology (Angermueller et al. 2016). Methods based on Convolutional Neural Networks (CNNs) (LeCun et al. 1998) and Recurrent Neural Networks (RNNs) (Bullinaria 2013) have been proposed for the task of identifying protein binding sites in DNA and RNA sequences, and have achieved state-of-the-art performance (Alipanahi et al. 2015; Quang and Xie 2016; Hassanzadeh and Wang 2016; Shen et al. 2018).

DeepBind (Alipanahi et al. 2015) was the first deep learning approach for this task. It used a single convolutional layer and demonstrated the accuracy of these models, as well as their ability to learn signal detectors that recapitulate known motifs. Zeng et al. (2016) further showed the value of CNNs and explored in more detail the effect of architecture parameters, such as the number of layers, and of operations such as pooling. Other studies opted for more complex architectures and introduced hybrid models that integrate both CNNs and RNNs. DeeperBind (Hassanzadeh and Wang 2016) and DanQ (Quang and Xie 2016), for example, add Long Short-Term Memory (LSTM) layer(s) to the DeepBind architecture. The additional RNN layers are designed to improve the accuracy of binding prediction by learning long-range dependencies between the sequence features learned by the CNN layers. Purely RNN-based methods were also examined: the KEGRU method (Shen et al. 2018) used a layer of bidirectional Gated Recurrent Units (bi-GRUs), combined with a k-mer embedding representation of the input sequence, to create an internal state of the network that allows it to capture long-range dependencies and thus obtain good performance. Methods that are specific to RBP binding were also developed. For example, iDeepS, which uses both CNN and RNN layers, identifies sequence and structural motifs simultaneously (Pan et al. 2018).

Despite all these studies, it is still not clear which deep learning architecture performs best for detecting binding in DNA and RNA sequences. A fair and unbiased comparison can be very challenging due to many factors including the sensitivity of deep learning methods to the step of model selection (Lipton and Steinhardt 2018): deep neural networks have many hyper-parameters that require careful tuning, and differences in performance can be the result of the use of different model selection strategies. Therefore, a meaningful comparison requires the use of a coherent model selection strategy applied uniformly across all architectures. In this study, we conduct a systematic exploration of the performance of different architectures using CNNs and/or RNNs for the study of DNA and RNA sequence binding specificity prediction. For this purpose, we have designed a collection of different architecture variants, some of which correspond to published methods, by varying the network components, depth, and input layer representation. To ensure the objectivity of our evaluation, we used the same model selection strategy and made the pipeline fully automatic to avoid the need for any hand-tuning and thus remove any bias.

Our experiments use datasets collected from the Encyclopedia of DNA Elements (ENCODE) project (Consortium 2004) and verified binding sites of RNA binding proteins (RBPs) derived from large-scale CLIP-seq experiments (Stražar et al. 2016). We find that more complex architectures that combine RNNs and CNNs indeed provide improved performance over the vanilla CNN model, and that this advantage grows with the number of available training examples. However, the improvement in accuracy comes at the expense of the interpretability of the learned models and increased training times. Our results also demonstrate the advantage of using a k-mer embedding to represent the input sequence instead of the standard one-hot encoding, especially for RBP binding site prediction. Finally, we present an end-to-end deep learning toolkit called deepRAM that provides a framework for training and evaluating deep learning architectures for DNA/RNA sequence analysis.

2 Methods

In this study, we present a comprehensive evaluation of different deep learning architectures for the task of predicting DNA and RNA protein binding sites. First, we present the benchmark datasets used in our study. Second, we present the architectures used in our experiments. Third, we provide the technical details of the model selection process that we followed to ensure unbiased model comparison. These methods are implemented as an open-source deep learning package called deepRAM, which allows users to evaluate different architectures for predicting DNA and RNA protein binding sites. Finally, we describe our method for extracting motifs from the learned models.

2.1 Datasets

The deep learning models are evaluated on data from ChIP-seq and CLIP-seq experiments. For ChIP-seq, we used data from 83 experiments from the ENCODE project that assayed binding of diverse transcription factors. These datasets were used to evaluate deep learning architectures by Alipanahi et al. (2015) and Zhou and Troyanskaya (2015), and we use the same sequences as training/testing examples. Alipanahi et al. (2015) split the ChIP peak data into three sets, A, B and C: A is the set of the top 500 even-numbered peaks, B is the set of the top 500 odd-numbered peaks, and C is the set of remaining peaks. We use the peaks from A and C for model training and the peaks from B for testing. Positive examples in this binary classification task consist of 101 bp regions centered on each ChIP-seq peak. The negative examples were generated by shuffling the positive sequences while matching dinucleotide composition.

We also evaluate the ability of different architectures to identify RNA binding sites. We use the same benchmark human dataset used by the developers of iONMF (Stražar et al. 2016), which consists of 31 CLIP-seq experiments over 19 proteins. The original data was retrieved from DoRiNA (Blin et al. 2014) and iCount. Positive sites represent nucleotides that were identified as being within clusters of interaction sites derived from CLIP-seq. Negative sites were extracted from genes not participating in the protein-RNA interaction process in any of the 31 experiments. Each experiment consists of 40,000 examples, divided into 30,000 examples for training and 10,000 for model testing and evaluation.

2.2 Model architectures

Figure 1: Overview of the deep learning architectures evaluated in this work. These include CNN-only models known for their ability to detect motifs (left), RNN-only models (center) which excel at capturing long-term sequence dependencies, and hybrid CNN-RNN models. The input for all variants is either a one-hot encoding or a k-mer embedding of the DNA/RNA sequence obtained using word2vec.

In this section we describe the variety of deep learning architectures that can be applied to biological sequences (see Figure 1). In addition to comparing architectures, we compare two ways of representing the input sequence: a one-hot encoding or a k-mer embedding computed using word2vec (Mikolov et al. 2013; Asgari and Mofrad 2015). When using the one-hot encoding, the input sequence is represented by a $4 \times L$ matrix, where $L$ is the length of the sequence and each position in the sequence has a four-element vector with a single nonzero element corresponding to the nucleotide in that position. For the k-mer embedding representation (see Figure 1), we first split the sequence into overlapping k-mers of length $k$ using a sliding window with stride $s$, and then map each k-mer in the obtained sequence into a $d$-dimensional vector space using the word2vec algorithm (Mikolov et al. 2013). word2vec is an unsupervised learning algorithm which maps k-mers from the vocabulary to vectors of real numbers in a low-dimensional space. The embedding representation of k-mers is computed in such a way that their context is preserved, i.e. word2vec produces similar embedding vectors for k-mers that tend to co-occur across sequences.
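The two input representations can be illustrated with a minimal Python sketch. The function names `one_hot` and `kmer_tokens` are hypothetical and are not taken from deepRAM; mapping U to T so that RNA and DNA share one alphabet is an assumption of this sketch. The word2vec training step itself is not shown: the k-mer tokens produced here would be the "words" passed to it.

```python
NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Return a 4 x L matrix (list of 4 rows, one per nucleotide).

    Each column has a single 1 in the row of the nucleotide at that position.
    ASSUMPTION: RNA input is handled by mapping U to T.
    """
    seq = seq.upper().replace("U", "T")
    return [[1 if base == nuc else 0 for base in seq] for nuc in NUCLEOTIDES]


def kmer_tokens(seq, k=3, stride=1):
    """Split a sequence into overlapping k-mers with a sliding window."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
```

With `k=3` and `stride=1`, a sequence of length L yields L - 2 tokens, so consecutive tokens overlap by two nucleotides.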

Convolutional Networks.

CNNs for biological sequence data perform one-dimensional convolution: they slide local signal detectors (filters) along the sequence and integrate their results at increasing spatial scales, generating a representation that is able to abstract away some of the variability observed in binding sites. Each convolutional module is composed of a convolutional layer and a pooling layer (see Figure 1). A convolutional layer consists of a one-dimensional convolution operation with a specified number of kernels or filters. The result of applying a filter at each position of the sequence is transformed using a non-linear activation function. We use the commonly used rectified linear unit (ReLU), which keeps only positive matches and sets the remaining values to 0, which helps avoid the vanishing gradient problem. More specifically, a convolution layer computes

$$\mathrm{convolution}(X)_{ik} = \mathrm{ReLU}\left(\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} W^{k}_{mn}\, X_{n,\,i+m}\right)$$

where $X$ is the input matrix representing the sequence (rows index channels and columns index positions), $i$ is the index of the output position and $k$ is the index of the filter. Each convolutional filter $W^{k}$ is an $M \times N$ weight matrix with $M$ being the window size and $N$ being the number of input channels (for the first convolution layer $N$ equals the input representation dimension, which is 4 for one-hot encoding or $d$ for the word2vec representation; for higher-level convolutional layers $N$ is the number of filters in the previous convolutional layer). Next, the output of convolution undergoes pooling, which aggregates the outputs from neighboring positions for each filter in order to achieve consistency and invariance to small shifts in the input sequence. In this work we use max-pooling, which computes the maximum value over a fixed number of spatially adjacent positions, using overlapping windows over the convolutional layer’s output:

$$\mathrm{pooling}(Y)_{ik} = \max\left(Y_{i,k},\, Y_{i+1,k},\, \ldots,\, Y_{i+P-1,k}\right)$$

where $Y$ is the output of the convolutional layer, $P$ is the pooling window size, $i$ is the index for output position and $k$ is the index of the filter being pooled.
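To make the convolution and pooling operations concrete, the following pure-Python sketch applies a bank of filters to a channels-by-positions input, followed by ReLU and max-pooling. It is an illustration only, not the deepRAM implementation: the function names are hypothetical, no bias term is used, and the pooling stride is assumed to be 1 (consistent with overlapping windows).

```python
def conv1d_relu(x, filters):
    """1D convolution followed by ReLU.

    x:       N x L input (N channels, L positions), as nested lists.
    filters: list of K filters, each an M x N matrix (window size x channels).
    Returns Y as a list of output positions, each holding K activations.
    """
    n_channels, length = len(x), len(x[0])
    window = len(filters[0])
    output = []
    for i in range(length - window + 1):
        row = []
        for w in filters:
            total = sum(w[m][n] * x[n][i + m]
                        for m in range(window) for n in range(n_channels))
            row.append(max(0.0, total))  # ReLU keeps only positive matches
        output.append(row)
    return output


def max_pool(y, pool=3):
    """Max over `pool` adjacent positions per filter (ASSUMPTION: stride 1)."""
    n_filters = len(y[0])
    return [[max(y[i + j][k] for j in range(pool)) for k in range(n_filters)]
            for i in range(len(y) - pool + 1)]
```

For a one-hot input, a filter whose rows are the one-hot vectors of C and G scores 2 exactly where the dinucleotide CG occurs and lower elsewhere, which is the sense in which a first-layer filter behaves as a motif detector.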

Table 1: Overview of the models compared in this work. ’+’ and ’-’ denote the presence and absence of the layer type respectively. ’(.)’ denotes the number of convolution layers if present. In the recurrent layers, if present, the type of RNN is specified.

Layers      | DeepBind | DeepBind* | Dilated | DanQ    | DanQ*   | DeepBind-E* | KEGRU  | ECLSTM | ECBLSTM
Embedding   | -        | -         | -       | -       | -       | +           | +      | +      | +
Convolution | +        | +         | +       | +       | +       | +           | -      | +      | +
Recurrent   | -        | -         | -       | bi-LSTM | bi-LSTM | -           | bi-GRU | LSTM   | bi-LSTM

* The Dilated architecture consists of three convolution layers, one non-dilated followed by two dilated (dilation=2) convolution layers.

The first convolutional layer can be thought of as a motif scanner where each filter is considered as a Position Weight Matrix (PWM) and the convolution operation is equivalent to scanning the PWM with a sliding window across the sequence. However, the weight matrices associated with convolutional filters are not required to be log-odds ratios. Additional layers of convolution and pooling enable the network to extract features from larger spatial ranges such as motif interactions, which allows it to represent more complex patterns than shallower networks. Deeper networks have more parameters and require more data to reach high levels of performance.

RNN-based models.

The second class of architectures we explored is RNN-only models. RNNs have an internal state that is updated as the network progresses along the input sequence. This internal memory allows RNNs to capture interactions between distant elements along the sequence, and they are therefore commonly used in natural language processing (Hirschberg and Manning 2015). Two types of RNN units were tested using deepRAM: LSTM units (Hochreiter and Schmidhuber 1997) and GRU units (Cho et al. 2014; Chung et al. 2014). A GRU unit given an input $x_t$ at position $t$ in the sequence performs the following operations:

$$\begin{aligned}
z_t &= \sigma\left(W_z\,[h_{t-1}, x_t] + b_z\right) \\
r_t &= \sigma\left(W_r\,[h_{t-1}, x_t] + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h\,[r_t \odot h_{t-1}, x_t] + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}$$

where $\odot$ is element-wise multiplication, $\sigma$ is the logistic sigmoid, $[\cdot,\cdot]$ denotes vector concatenation, $z_t$ and $r_t$ are the two GRU gates called the update gate and reset gate, respectively, $W_z$, $W_r$ and $W_h$ are weight matrices, and $b_z$, $b_r$ and $b_h$ are the biases. $h_t$ is the hidden state, which is used as memory to hold information on the data the network has seen so far, and $\tilde{h}_t$ is the candidate memory state, which is considered to potentially overwrite $h_{t-1}$. The reset gate controls how much past information to forget and the update gate controls how much information to throw away and what new information to add. The gates and hidden states are vectors of real numbers of the same dimension, where the dimension is a tunable hyper-parameter.
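The GRU update can be written out as a short pure-Python sketch. It assumes the common formulation in which the gates act on the concatenation of the previous hidden state and the current input, and in which the new state is $(1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$; other libraries swap the roles of $z_t$ and $1 - z_t$. The parameter dictionary keys and function names are hypothetical.

```python
import math


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def _affine(weights, vector, bias):
    """Return weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def gru_step(x_t, h_prev, p):
    """One GRU update; p holds W_z, W_r, W_h (H x (H + D)) and b_z, b_r, b_h."""
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    z = [_sigmoid(v) for v in _affine(p["W_z"], concat, p["b_z"])]  # update gate
    r = [_sigmoid(v) for v in _affine(p["W_r"], concat, p["b_r"])]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t    # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(v) for v in _affine(p["W_h"], gated, p["b_h"])]
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_cand)]


def gru_last_state(inputs, p, hidden_size):
    """Run the cell over a sequence and return the hidden state at the end."""
    h = [0.0] * hidden_size
    for x_t in inputs:
        h = gru_step(x_t, h, p)
    return h
```

With all weights and biases set to zero, both gates equal 0.5 and the candidate state is zero, so each step simply halves the previous hidden state; this gives a quick sanity check of the update rule.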

LSTM units are more complex than GRU units, and we refer the readers to the original publications for details (Hochreiter and Schmidhuber 1997). The basic idea of using a gating mechanism in both LSTM and GRU architectures is to capture short term and long term dependencies in sequences. After the LSTM/GRU cell has iterated over the sequence, we output its hidden state at the last position which contains information about the entire sequence.

bi-RNN (bi-GRU/bi-LSTM) is an extension of the regular RNN which consists of a forward layer and a backward layer representing the positive and negative directions respectively. The forward layer is similar to a regular RNN layer run on the input sequence and the backward layer is another separate RNN layer run on the reverse of the input sequence. The output of the bi-RNN is then computed by concatenating the output vectors of the two layers together.

Hybrid models

The third variant in Figure 1.A is the hybrid convolutional and recurrent deep neural network. The convolution stage, which is composed of one or more convolutional modules, scans the sequence representation using a set of one-dimensional convolutional filters in order to capture sequence patterns or motifs. The convolutional stage is followed by an RNN stage, which is capable of learning complex high-level grammar-like relationships by considering the orientations and spatial distances between the motifs.

The final module in all three types of models is composed of one or two fully connected layers, which integrate information from the entire sequence, followed by a sigmoid layer that computes the probability that the input sequence contains a DNA- or RNA-binding site.

Evaluated architectures

The deepRAM tool provides implementations of several existing architectures: DeepBind (Alipanahi et al. 2015), which uses a single-layer CNN; DanQ, which uses a single-layer CNN and a bidirectional LSTM (Quang and Xie 2016); KEGRU, which uses k-mer embedding and GRU units (Shen et al. 2018); and a dilated multi-layer CNN (Gupta and Rush 2017). To fully evaluate the range of deep learning architectures we considered additional variants denoted as DeepBind* (multi-layer CNN), DanQ* (DanQ with multiple layers of convolution), DeepBind-E* (multi-layer CNN with k-mer embedding), ECLSTM (k-mer embedding with single-layer CNN and LSTM) and ECBLSTM (k-mer embedding with single-layer CNN and bi-directional LSTM). These architectures are summarized in Table 1.

2.3 Model training, selection, and evaluation

Model selection is perhaps the most challenging step in deep learning, as the performance of deep learning algorithms is very sensitive to the calibration parameters (Lipton and Steinhardt 2018). A careful configuration and selection of the hyper-parameters is thus essential. For each dataset, we use automatic calibration that is based on randomly sampling 40 hyper-parameter settings from all possible combinations; for each setting, a model is trained using 3-fold cross-validation. We use the area under the ROC curve (AUC) to evaluate the performance of the model, and each calibration set is scored by its average AUC in 3-fold cross-validation. Next, we use the best hyper-parameter set to train five new models on the full training data, in order to reduce the effect of random initialization, and choose the model with the best training performance as the final model, which is then used for prediction on the test set. This model selection strategy is based on the one used by the authors of DeepBind (Alipanahi et al. 2015).
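For reference, the AUC used as the selection criterion can be computed directly from its rank interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a simple quadratic-time illustration with a hypothetical function name; it is not how deepRAM computes the metric.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels; scores: predicted binding probabilities.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of bound and unbound sequences.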

In the training phase, we consider the number of learning steps as a hyper-parameter. For each of the 40 calibration sets, we train a model for a maximum of 40,000 learning steps and test it on the held-out validation set every 5,000 learning steps. The checkpoint with the best validation score determines the number of learning steps, which is added to the calibration set as an additional hyper-parameter. We select the iteration with the best validation score because we assume that the model starts to over-fit after that iteration. The number of filters in the first convolutional layer is chosen as part of model selection; the number of filters in each subsequent layer is increased by 50% compared to the layer before it.
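The selection procedure can be sketched as a random search wrapped around 3-fold cross-validation. This is a simplified illustration under stated assumptions, not the deepRAM code: `sample_setting` and `train_and_score` are hypothetical callables supplied by the caller, the validation AUC is averaged across folds before the best checkpoint is chosen, and `train_and_score` is called once per checkpoint rather than evaluating a single training run every 5,000 steps. The final stage (training five models with the selected setting and keeping the best one) is omitted.

```python
import random


def k_fold_indices(n_examples, k=3, seed=0):
    """Shuffle example indices and deal them into k disjoint folds."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def select_hyperparameters(n_examples, sample_setting, train_and_score,
                           n_settings=40, k=3, max_steps=40000,
                           eval_every=5000, seed=0):
    """Return (best mean validation AUC, best setting incl. learning_steps)."""
    rng = random.Random(seed)
    folds = k_fold_indices(n_examples, k, seed)
    checkpoints = range(eval_every, max_steps + 1, eval_every)
    best_auc, best_setting = None, None
    for _ in range(n_settings):
        setting = sample_setting(rng)          # one random calibration set
        mean_auc = {}
        for steps in checkpoints:
            fold_aucs = []
            for f in range(k):
                val_idx = folds[f]
                train_idx = [i for g in range(k) if g != f for i in folds[g]]
                fold_aucs.append(
                    train_and_score(setting, train_idx, val_idx, steps))
            mean_auc[steps] = sum(fold_aucs) / k
        steps = max(mean_auc, key=mean_auc.get)  # best checkpoint for this set
        if best_auc is None or mean_auc[steps] > best_auc:
            best_auc = mean_auc[steps]
            best_setting = dict(setting, learning_steps=steps)
    return best_auc, best_setting
```

Because the same routine, budget, and search space are applied to every architecture, differences in the returned AUC reflect the architectures rather than the tuning effort.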

Model training.

To train a given model, we minimize the cross-entropy objective function. Derivatives of the objective function with respect to the model parameters are computed by back-propagation. The objective function is minimized by Stochastic Gradient Descent (SGD) or Adagrad, and the choice is made as part of the model selection process. Examples are processed using a batch size of 128 in all experiments. We use multiple regularization schemes including dropout (applied to max-pooling layers, RNN layers and hidden layers), weight decay, and early stopping. Details of the hyper-parameter space are summarized in Table 2.


We ran our experiments on an Ubuntu server with a TITAN X GPU with 12 GB of memory. Typical running times for model selection in each experiment range from one hour for a single-layer CNN to almost four hours for a network that includes convolutional and bi-LSTM modules (see details in Table S3 in the supplementary file).

Table 2: deepRAM hyper-parameters, search space and sampling method.

Calibration parameter         | Search space               | Sampling
Embedding size                | 50                         | Fixed
Embedding k-mer length        | 3                          | Fixed
Embedding stride              | 1                          | Fixed
Motif length                  | {10, 24}*                  | Fixed
Number of filters             | {16, 32}                   | uniform
Pooling window size           | 3                          | Fixed
Pooling stride                | 1                          | Fixed
RNN hidden size               | {20, 50, 80, 100}          | uniform
Neural net hidden layer       | {Nan, 32 units, 64 units}  | uniform
Optimizer                     | {SGD, Adagrad}             | uniform
Learning rate                 | [1e-3, 1e-1]               | log uniform
Learning momentum (SGD)       | [0.95, 0.99]               | sqrt uniform
Number of learning steps      | [5,000:40,000]**           | evaluate all
Weight initialization         | {xavier, normal}           | uniform
Initial weight scale (motifs) | [1e-6, 1e-1]               | log uniform
Initial weight scale (RNN)    | [1e-6, 1e-1]               | log uniform
Initial weight scale (NN)     | [1e-5, 1e-1]               | log uniform
Weight decay                  | [1e-10, 1e-1]              | log uniform
Dropout expectation           | {0.4, 0.55, 0.7, 0.85, 1}  | uniform

* 10 with k-mer embedding, 24 with one-hot encoding.
** step = 5,000.
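Drawing one calibration set from this search space might look like the sketch below. It covers only a subset of the parameters, and the names are hypothetical. Two points are assumptions of the sketch rather than statements about deepRAM: "sqrt uniform" is interpreted as sampling uniformly in square-root space and squaring the result, and the "Nan" option for the hidden layer is interpreted as "no hidden layer" and represented by `None`.

```python
import math
import random


def log_uniform(rng, low, high):
    """Sample so that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sqrt_uniform(rng, low, high):
    """ASSUMPTION: sample uniformly in sqrt space, then square."""
    return rng.uniform(math.sqrt(low), math.sqrt(high)) ** 2


def sample_setting(rng, use_embedding):
    """Draw one hyper-parameter setting (subset of the full search space)."""
    setting = {
        "motif_length": 10 if use_embedding else 24,   # fixed per input type
        "n_filters": rng.choice([16, 32]),
        "rnn_hidden_size": rng.choice([20, 50, 80, 100]),
        "fc_hidden_units": rng.choice([None, 32, 64]),  # None: no hidden layer
        "optimizer": rng.choice(["SGD", "Adagrad"]),
        "learning_rate": log_uniform(rng, 1e-3, 1e-1),
        "weight_init": rng.choice(["xavier", "normal"]),
        "weight_decay": log_uniform(rng, 1e-10, 1e-1),
        "dropout": rng.choice([0.4, 0.55, 0.7, 0.85, 1]),
    }
    if setting["optimizer"] == "SGD":
        setting["momentum"] = sqrt_uniform(rng, 0.95, 0.99)  # SGD only
    return setting
```

Log-uniform sampling spreads draws evenly across orders of magnitude, which matters for ranges such as the weight decay interval that span nine decades.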

2.4 Motif extraction

In order to make models implemented using deepRAM easily interpretable, we extract motifs from the first convolutional layer following a similar methodology as in DeepBind (Alipanahi et al. 2015). To do so, we feed all test sequences through the convolution stage. For each filter, we extract all sequence fragments that activate the filter and use only activations that are greater than half of the filter’s maximum value over all sequences. Once all the sequence fragments are extracted, they are stacked and the nucleotide frequencies are counted to form a position frequency matrix (PFM). Sequence logos are then constructed using WebLogo (Crooks et al. 2004). Finally, these discovered motifs are aligned using TOMTOM (Gupta et al. 2007) against known motifs from CISBP-RNA (Ray et al. 2013) for RBPs and JASPAR (Mathelier et al. 2013) for transcription factors.
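The counting step of this procedure can be sketched as follows. The function name and the input layout are hypothetical: `activations[s][i]` is taken to be the (ReLU) activation of one filter for the window starting at position `i` of sequence `s`. Sequence logo construction and TOMTOM matching are external tools and are not shown.

```python
def position_frequency_matrix(seqs, activations, motif_len):
    """Build a PFM for one filter from strongly activating sequence fragments.

    A fragment is kept only if its activation is positive and exceeds half of
    the filter's maximum activation over all sequences.
    Returns one dict of nucleotide counts per motif position.
    """
    threshold = max(max(acts) for acts in activations) / 2.0
    counts = [{base: 0 for base in "ACGT"} for _ in range(motif_len)]
    for seq, acts in zip(seqs, activations):
        for start, act in enumerate(acts):
            if act > 0 and act > threshold:
                fragment = seq[start:start + motif_len]
                for pos, base in enumerate(fragment):
                    if base in counts[pos]:   # skip ambiguous bases such as N
                        counts[pos][base] += 1
    return counts
```

Dividing each position's counts by their sum gives per-position nucleotide frequencies, which is the form typically used to draw a sequence logo.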

2.5 deepRAM

deepRAM is an end-to-end deep learning toolkit for predicting protein binding sites and motifs. It helps users run experiments using many state-of-the-art methods and addresses the challenge of selecting model parameters in deep learning models using a fully automatic model selection strategy. This helps avoid hand-tuning and thus removes any bias in running experiments, making it user friendly without losing its flexibility. While it was designed with ChIP-seq and CLIP-seq data in mind, it can be used for any DNA/RNA sequence binary classification problem.

deepRAM allows users the flexibility to choose a deep learning model by selecting its different components: the input sequence representation (one-hot or k-mer embedding), whether to use a CNN and with how many layers, and whether to use an RNN and, if so, the number of layers and their type. For CNNs the user can choose to use dilated convolution as well. Once the model is trained, the learned motifs of the first convolutional module are automatically extracted and visualized using WebLogo, and then matched with known motifs using TOMTOM.

We implemented deepRAM using PyTorch 1.0, which supports GPU acceleration. Our implementation has been packaged to make it runnable on any Unix-based system, and is available at:

3 Results

3.1 Deeper is better

Figure 2: (A) The distribution of AUCs across 83 ChIP-seq datasets. (B) Heatmap annotated with p-values of pairwise model comparison using the Wilcoxon signed-rank test for ChIP-seq datasets. (C) The distribution of AUCs across 31 datasets for predicting RBP binding sites. (D) Heatmap annotated with p-values of pairwise model comparison using the Wilcoxon signed-rank test for predicting RBP binding sites. In subfigures (A) and (C), the triangle represents the average AUC for the respective model, and the annotated vertical line represents the median AUC, whose value is indicated. The models are sorted by their average AUC values. In subfigures (B) and (D), the color (red or blue) at position $(i, j)$ in the heatmap indicates which of the two compared models has the higher average AUC, and its intensity indicates the magnitude of the difference.

We evaluate and compare the performance of the different models discussed in section 2.2 on the two tasks of predicting DNA- and RNA-protein binding sites (see Figure 2). Overall, all models perform well, with all median AUCs greater than 0.90 on ChIP-seq data and greater than 0.91 on CLIP-seq data. The proposed ECBLSTM model (Embedding, Convolution, bi-LSTM) provides the most significant improvement over DeepBind, with a median AUC of 0.930 compared with 0.902 for DeepBind on ChIP-seq data, and with a more pronounced gap for CLIP-seq data: 0.951 for ECBLSTM vs 0.914 for DeepBind. All the performance differences described here are statistically significant except when noted explicitly (see Figure 2). Detailed accuracy values for individual datasets are provided in Tables S1 and S2 in the supplementary file.

DeepBind is the simplest model considered here: it uses one-hot sequence encoding and a single convolutional layer. The results shown in Figure 2 demonstrate that adding multiple convolutional layers, dilated convolution, and sequence embedding all provide improved performance over the original DeepBind. The addition of a recurrent module provides further improvement, as seen by comparing the performance of ECBLSTM to that of DeepBind-E*, which has multiple convolutional layers and an embedding stage. This shows that adding recurrent connections to capture long-term dependencies between motifs detected by the convolutional layer leads to improved performance. The performance advantage of RNNs is further highlighted by DanQ, whose additional bi-directional LSTM layer improves its performance over DeepBind.

We note that iDeepS, which is specifically designed for RNA binding and uses a CNN over sequence and local secondary structure in combination with an LSTM module, achieved a median AUC of 0.917 for the CLIP-seq data, which is lower than that of all the evaluated methods except DeepBind (see Table S4 in the supplementary file). All the deep learning methods performed better than iONMF, which uses multiple sources of data, including k-mer frequency, secondary structure, and GO annotations (see Table S4 in the supplementary file).

We note that in both tasks, our implementation of DeepBind achieved nearly identical performance to the original DeepBind implementation (see Figure S3 in the supplementary file).

3.2 k-mer embedding boosts model performance

We observe that using k-mer embedding to represent input sequences rather than one-hot encoding improves model performance, and more so for the RBP binding datasets. For example, among models with the same architecture, we see that ECBLSTM outperforms DanQ in both tasks (see Figure 2 and Supplementary Figures S2 and S3). We also observe that in the task of RNA-protein binding site prediction, all models that use embedding representation have median AUC higher than 0.94 while all models that use one-hot encoding have median AUC lower than 0.935 (Figure 2.C). These results suggest that one-hot encoding is perhaps not the optimal strategy for representation of DNA and RNA sequences. In contrast, k-mer embedding integrates the contextual information of k-mers by learning the statistical information of k-mer co-occurrence relationships in the input sequences.

In this work, we train the k-mer embedding algorithm for each dataset with $k = 3$ and stride $s = 1$. Other studies (Shen et al. 2018; Min et al. 2017) have encouraged the use of larger values of stride and k-mer length and suggested that the use of small stride values may negatively affect the performance of the embedding algorithm. In preliminary experiments, we found that using small values of stride and k-mer length ($s = 1$ and $k = 3$) gives the best performance.

3.3 Deeper is better with sufficient training data

Figure 3: (A) The distribution of AUCs in predicting DNA protein binding sites across 38 ChIP-seq experiments with fewer than 10,000 peaks. (B) The distribution of AUCs in predicting DNA protein binding sites across 45 ChIP-seq experiments with more than 10,000 peaks. Figure notation follows the description in Figure 2.

Based on the results shown in Figure 2, one may conclude that relatively complex models tend to perform better than simpler models. However, this statement is based on the evaluation of the overall performance across all experiments and does not take into consideration the effect of the number of training examples. To study this aspect, we divided the ENCODE ChIP-seq datasets into two groups according to the number of training examples. The first group consists of 38 datasets with fewer than 10,000 positive training samples, and the second group consists of 45 datasets with more than 10,000 positive training samples. We compare the performance of different models in these two groups and report the results in Figure 3. We observe considerably higher AUCs for the large datasets, with median AUCs between 0.967 (DeepBind) and 0.993 (ECBLSTM), compared to median AUCs between 0.864 (DeepBind) and 0.879 (DeepBind-E*) for the small datasets. It is also worth noting that the effect of the number of training examples is more pronounced with hybrid models (see Figure 3 and Supplementary Figure S4). Indeed, ECBLSTM, ECLSTM and DanQ* tend to perform very strongly for large datasets (median AUCs above 0.983) while, interestingly, they fall behind DeepBind-E* on the smaller datasets. This suggests that hybrid models need sufficient training data. Complex models such as ECBLSTM still perform well even for the smaller datasets, demonstrating that our regularization procedure was effective in preventing over-fitting.

3.4 Dilated convolution

Dilated convolution uses filters with gaps to allow each filter to capture information across larger and larger stretches of the input sequence (Yu and Koltun 2015). Hence, dilated convolution finds usage in applications that benefit from modeling of a wider context without incurring the increased cost of using RNNs (Gupta and Rush 2017; Strubell et al. 2017; Kelley et al. 2018).

In this work, we evaluate a dilated model which consists of three convolutional modules with dilations equal to 1, 2 and 2 in the first, second and third layers, respectively. We find that the dilated convolutional model outperforms DeepBind* with significant p-values in both tasks (Figure 2). In addition, the dilated convolutional model had a slightly higher median AUC than DanQ on the RBP binding site datasets, which suggests that dilated convolution can capture long-range relationships similarly to LSTMs. These findings suggest that dilated convolution is a valuable architecture option to consider. Its benefit is likely to be even more pronounced for longer sequences such as those modeled using the Basenji method (Kelley et al. 2018).
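The wider context provided by dilation can be quantified with the receptive field of a stack of stride-1 convolutions, which is $1 + \sum_{l} (M - 1)\, d_l$ for window size $M$ and per-layer dilations $d_l$. The short sketch below evaluates this formula; it ignores pooling between layers and assumes the same window size in every layer, and the window size of 3 in the usage example is an illustrative value, not one taken from the experiments.

```python
def receptive_field(window_size, dilations):
    """Input positions seen by one output of stacked stride-1 convolutions.

    ASSUMPTIONS: same window size in every layer, no pooling between layers.
    """
    return 1 + sum((window_size - 1) * d for d in dilations)


# Illustrative comparison with a hypothetical window size of 3:
dilated = receptive_field(3, [1, 2, 2])      # dilations used in this work
undilated = receptive_field(3, [1, 1, 1])    # same depth without dilation
```

With these illustrative values the dilated stack covers 11 input positions against 7 for the undilated one, with the same number of weights per filter.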

3.5 Model interpretation and visualization

Figure 4: (A) Examples of motifs detected by first layer convolutional modules learned by DeepBind, DeepBind-E* and ECBLSTM for predicting DNA binding sites of CTCF, SRF, and FOS. E-values are displayed below each motif only if it matches with the known motifs for the same Transcription Factor. Known motifs from the JASPAR database are displayed at the top. (B) Histograms of the counts of convolutional filter activations that are considered for extracting motifs along the sequences for models of CTCF, SRF and FOS binding using DeepBind and ECBLSTM.

To explore the ability of selected architectures to capture informative motifs, we converted the filters of the first convolutional layer to sequence motifs as described in section 2.4. As shown in Figure 4A, DeepBind and DeepBind-E* are able to detect informative motifs that match well with known motifs from the JASPAR database. In contrast, ECBLSTM performs poorly in detecting motifs compared to the two other models, and most of its detected motifs are not informative, despite the fact that it is the best performing model among all the models we compared. We hypothesize that when combined with RNNs, the CNN filters learn information that is geared towards providing the subsequent recurrent layer with the information it needs, which is of a different nature than the localized information learned by CNN-only models.

To further investigate the difference between the behaviour of hybrid models and CNN-only models, we explored the distribution of sequence fragments with positive activation values for a given filter with DeepBind and ECBLSTM in the positive and negative examples (Figure 4B). As expected, the number of activated sequence fragments in positive sequences is much higher than in negative sequences for both methods. In addition, we observe that with DeepBind the activated sequence fragments in positive sequences are concentrated in the middle of the sequence, whereas they are uniformly distributed in negative examples. With ECBLSTM, however, the activated sequence fragments are distributed uniformly across the sequence for both positive and negative sequences. Noting that the centers of positive sequences correspond to the reported ChIP-seq peaks, we conclude that DeepBind is detecting sequence motifs that represent the binding event, while ECBLSTM's convolution stage is extracting features that span the whole sequence. This is in agreement with our finding that RNNs lead to a representation which has reduced interpretability compared to that of CNNs.
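A positional histogram of this kind can be computed as in the following sketch. It assumes that the activations of one filter are available as one list of values per sequence (one value per position); the function name, the binning scheme and the toy activation values are hypothetical and are not taken from deepRAM.

```python
def activation_position_histogram(activations, n_bins):
    """Count positive filter activations per positional bin.

    `activations` holds one list per sequence, with one activation value
    per position; positions are mapped proportionally onto `n_bins` bins.
    """
    counts = [0] * n_bins
    for row in activations:
        length = len(row)
        for pos, value in enumerate(row):
            if value > 0:  # only positive activations are counted
                counts[pos * n_bins // length] += 1
    return counts


# Toy activations (placeholders): positives peak near the center,
# negatives activate only at the edges.
positive_examples = [
    [0.0, 0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.5, 0.0, 0.0],
]
negative_examples = [[0.1, 0.0, 0.0, 0.0, 0.0, 0.2]]

hist_pos = activation_position_histogram(positive_examples, n_bins=3)
hist_neg = activation_position_histogram(negative_examples, n_bins=3)
print(hist_pos, hist_neg)
```

A histogram that peaks in the central bins for positive sequences, as in this toy input, corresponds to the pattern reported for DeepBind, whereas a flat histogram corresponds to the pattern reported for ECBLSTM.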

4 Conclusion

In this work we performed a thorough analysis and evaluation of the performance of commonly used deep learning architectures for DNA and RNA binding site prediction. This study aims to give researchers a better understanding of the performance characteristics and advantages of different architectures, and thereby to help them choose the right architecture for their work. Our experiments demonstrated the accuracy of hybrid CNN/RNN models; however, this accuracy requires the availability of sufficient training data, and these networks are harder to interpret, so their usefulness in motif discovery might be limited. We have made the software used in our experiments available as deepRAM, a user-friendly package for evaluating and analyzing various deep learning architectures for DNA/RNA binding prediction. We hope this work will stimulate further studies on visualizing and understanding deep models and enhance their usefulness for analyzing biological sequence data.


  • Alipanahi et al. [2015] Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015). Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature biotechnology, 33(8), 831.
  • Angermueller et al. [2016] Angermueller, C., Pärnamaa, T., Parts, L., and Stegle, O. (2016). Deep learning for computational biology. Molecular systems biology, 12(7), 878.
  • Asgari and Mofrad [2015] Asgari, E. and Mofrad, M. R. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one, 10(11), e0141287.
  • Blin et al. [2014] Blin, K., Dieterich, C., Wurmus, R., Rajewsky, N., Landthaler, M., and Akalin, A. (2014). DoRiNA 2.0–upgrading the doRiNA database of RNA interactions in post-transcriptional regulation. Nucleic acids research, 43(D1), D160–D167.
  • Bullinaria [2013] Bullinaria, J. A. (2013). Recurrent neural networks. Neural Computation: Lecture, 12.
  • Cho et al. [2014] Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • Chung et al. [2014] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • Consortium [2004] Consortium, E. P. (2004). The ENCODE (ENCyclopedia of DNA elements) project. Science, 306(5696), 636–640.
  • Crooks et al. [2004] Crooks, G. E., Hon, G., Chandonia, J.-M., and Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome research, 14(6), 1188–1190.
  • Ferré et al. [2016] Ferré, F., Colantoni, A., and Helmer-Citterich, M. (2016). Revealing protein-lncRNA interaction. Briefings in Bioinformatics, 17(1), 106–116.
  • Gerstberger et al. [2014] Gerstberger, S., Hafner, M., and Tuschl, T. (2014). A census of human RNA-binding proteins. Nature Reviews Genetics, 15(12), 829.
  • Gupta and Rush [2017] Gupta, A. and Rush, A. M. (2017). Dilated Convolutions for Modeling Long-Distance Genomic Dependencies. arXiv preprint arXiv:1710.01278.
  • Gupta et al. [2007] Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L., and Noble, W. S. (2007). Quantifying similarity between motifs. Genome biology, 8(2), R24.
  • Hassanzadeh and Wang [2016] Hassanzadeh, H. R. and Wang, M. D. (2016). DeeperBind: Enhancing prediction of sequence specificities of DNA binding proteins. In Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on, pages 178–183. IEEE.
  • Hinton et al. [2012] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6), 82–97.
  • Hirschberg and Manning [2015] Hirschberg, J. and Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261–266.
  • Hochreiter and Schmidhuber [1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.
  • Kazan et al. [2010] Kazan, H., Ray, D., Chan, E. T., Hughes, T. R., and Morris, Q. (2010). RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS computational biology, 6(7), e1000832.
  • Kelley et al. [2018] Kelley, D. R., Reshef, Y., Bileschi, M., Belanger, D., McLean, C. Y., and Snoek, J. (2018). Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome research, pages gr–227819.
  • Krizhevsky et al. [2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
  • LeCun et al. [1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  • LeCun et al. [2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.
  • Lipton and Steinhardt [2018] Lipton, Z. C. and Steinhardt, J. (2018). Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341.
  • Mathelier et al. [2013] Mathelier, A., Zhao, X., Zhang, A. W., Parcy, F., Worsley-Hunt, R., Arenillas, D. J., Buchman, S., Chen, C.-y., Chou, A., Ienasescu, H., et al. (2013). JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic acids research, 42(D1), D142–D147.
  • Mikolov et al. [2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Min et al. [2017] Min, X., Zeng, W., Chen, N., Chen, T., and Jiang, R. (2017). Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics, 33(14), i92–i101.
  • Pan et al. [2018] Pan, X., Rijnbeek, P., Yan, J., and Shen, H.-B. (2018). Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC genomics, 19(1), 511.
  • Quang and Xie [2016] Quang, D. and Xie, X. (2016). DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic acids research, 44(11), e107–e107.
  • Ray et al. [2013] Ray, D., Kazan, H., Cook, K. B., Weirauch, M. T., Najafabadi, H. S., Li, X., Gueroussov, S., Albu, M., Zheng, H., Yang, A., et al. (2013). A compendium of RNA-binding motifs for decoding gene regulation. Nature, 499(7457), 172.
  • Rohs et al. [2010] Rohs, R., Jin, X., West, S. M., Joshi, R., Honig, B., and Mann, R. S. (2010). Origins of specificity in protein-DNA recognition. Annual review of biochemistry, 79, 233–269.
  • Shen et al. [2018] Shen, Z., Bao, W., and Huang, D.-S. (2018). Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Scientific reports, 8(1), 15270.
  • Siggers and Gordan [2013] Siggers, T. and Gordan, R. (2013). Protein–DNA binding: complexities and multi-protein codes. Nucleic acids research, 42(4), 2099–2111.
  • Stormo [2000] Stormo, G. D. (2000). DNA binding sites: representation and discovery. Bioinformatics, 16(1), 16–23.
  • Stražar et al. [2016] Stražar, M., Žitnik, M., Zupan, B., Ule, J., and Curk, T. (2016). Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics, 32(10), 1527–1535.
  • Strubell et al. [2017] Strubell, E., Verga, P., Belanger, D., and McCallum, A. (2017). Fast and accurate sequence labeling with iterated dilated convolutions. CoRR.
  • Sutskever et al. [2014] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Yu and Koltun [2015] Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
  • Zeng et al. [2016] Zeng, H., Edwards, M. D., Liu, G., and Gifford, D. K. (2016). Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics, 32(12), i121–i127.
  • Zhou and Troyanskaya [2015] Zhou, J. and Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12(10), 931.