DNA-binding proteins play an important role in gene regulation. It is well known that the transcription of each gene is controlled by a regulatory region of DNA relatively near the transcription start site. Transcription involves two fundamental components: short DNA regulatory elements and their corresponding gene regulatory proteins. DNA binding sites are small but highly variable, which makes them difficult to detect. Several experimental methods (e.g. ChIP-seq) have been developed to solve this problem, but they are usually costly, and each has its own artifacts, biases, and limitations. Given sequence-based data, the problem of predicting DNA-protein binding is to model the sequence specificity of protein binding (i.e. to learn the relationship between sequences and their binary labels). Specifically, the task is a classification problem: given training DNA sequences and their binary labels, predict the labels of the testing sequences in the dataset. Recent work on motif inference includes conventional machine learning-based methods (e.g. SVM, Random Forest) [16, 8, 11] and deep learning-based methods (e.g. CNN, RNN) [2, 31, 25, 26]. CNNs and RNNs have shown their superiority over conventional machine learning-based methods. However, on small datasets the performance of these models is often limited. Besides, CNN models usually learn truncated motifs, which is undesirable.
On the other hand, the binary label of a sequence depends on whether it contains specific regions called motifs. If we regard "A", "C", "G" and "T" as special kinds of characters, k-mers can be treated as words and DNA sequences can be viewed as sentences. The k-mers related to a given motif can then be viewed as keywords, and predicting DNA-protein binding is transformed into a text classification problem.
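The sentence/word analogy can be made concrete with a minimal sketch. The function name `kmers` and the example sequence are illustrative assumptions, not taken from the paper's code; the sketch simply slides a window of length k over a sequence.

```python
def kmers(sequence, k):
    """Return the overlapping k-mers ("words") of a DNA sequence ("sentence")."""
    # A sequence of length n yields n - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("GATTACA", 3)
print(words)  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```

A sequence shorter than k yields an empty list, so such sequences would contribute no words under this tokenization.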
In this paper, we propose DNA-GCN, a novel method based on Graph Convolutional Networks for predicting DNA-protein binding. In DNA-GCN, we first construct a single large graph from the whole dataset, and then utilize GCN to aggregate neighborhood information. In this way, predicting DNA-protein binding is turned into a semi-supervised node classification problem. We evaluate our model on a number of datasets with limited samples. The model shows competitive performance on the task of predicting TF binding sites. All code is publicly available at https://github.com/Tinard/dnagcn. In summary, our contributions in this paper are twofold. (1) We propose a novel graph convolutional network for predicting DNA-protein binding. To the best of our knowledge, we are the first to model the sequence specificity with a graphical model and to utilize GCN to learn sequence and k-mer embeddings. (2) The empirical results show that our proposed model is competitive with the baseline models on many datasets with limited sequences. We suppose that our method could contribute to the study of DNA sequence modeling and other biological models.
2 Related Work
2.1 Deep learning for motif inference
Deep learning methods for motif inference can be categorized into two groups: CNN-based and RNN-based methods. Among the CNN-based models (which may contain RNN layers), DeepBind is the first to predict DNA-protein binding, and since then deep learning has been widely utilized in this field for its strong performance. Subsequent work on CNN architectures [31] shows that deploying more convolutional kernels is always beneficial. iDeepA applies an attention mechanism to automatically search for important positions. DeeperBind and iDeepS add an LSTM layer on top of DeepBind to learn long-range dependencies within sequences and further improve prediction performance. Among the RNN-based models, KEGRU identifies TF binding sites by combining a Bidirectional Gated Recurrent Unit (GRU) network with k-mer embedding. Besides model selection, CONCISE and iDeep integrate other information (e.g. structural information) into predicting RBP-binding sites and preferences. Other work includes data augmentation, circular filters, and convolutional kernel networks [6, 19, 20].
2.2 Graph Convolutional Networks
In the past few years, graph convolutional networks have attracted wide attention for their ability to learn hierarchical information on graphs [17, 7, 12]. It has been shown that the GCN model is a special form of Laplacian smoothing, which makes the features of nodes similar to those of their neighbors and makes the subsequent classification task much easier. The layer-wise propagation rule is formulated as:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ is the node representation matrix and $W^{(l)}$ is the trainable parameter matrix of the $l$-th layer, $H^{(0)} = X$ is the original feature matrix, $\tilde{A} = A + I$ is the adjacency matrix with added self-connections, $\tilde{D}$ is the degree matrix of $\tilde{A}$ (i.e. $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), and $\sigma$ is the activation function, such as ReLU. In addition, there are many variants of GCN, which focus on improving its performance and coping with its storage bottleneck [7, 12, 27, 29, 33]. In recent years, GCN has been used to handle many tasks, such as text classification, drug recommendation, and laboratory test classification [30, 21, 18], and has shown better performance than the baselines on these tasks.
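As a rough illustration of the propagation rule described above, the following NumPy sketch applies one GCN layer to a toy three-node path graph. The helper names (`normalize_adjacency`, `gcn_layer`), the choice of ReLU, the random weights, and the toy graph are assumptions made for illustration only.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-connections
    deg = A_tilde.sum(axis=1)             # diagonal of D~ (>= 1 for non-negative A)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(A_hat, H, W):
    """One propagation step with a ReLU activation: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)


# Toy path graph 0 - 1 - 2 (hypothetical data).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_hat = normalize_adjacency(A)
H0 = np.eye(3)                            # one-hot input features
W0 = np.random.default_rng(0).normal(size=(3, 2))
H1 = gcn_layer(A_hat, H0, W0)
print(H1.shape)  # (3, 2)
```

Each row of `H1` is a weighted average over a node and its neighbors followed by the nonlinearity, which is the smoothing effect the text refers to.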
2.3 Heterogeneous graph
Heterogeneity is an intrinsic property of a heterogeneous graph, i.e., it contains various types of nodes and edges. Different types of nodes have different features, which lie in different feature spaces. If we apply Laplacian smoothing directly, these feature spaces are mixed, which seems unreasonable. HAN utilizes an attention mechanism to generate node embeddings by aggregating features from meta-path-based neighbors in a hierarchical manner. MedGCN assumes that there are no edges between nodes of the same type, and the propagation rule is rewritten as

$$H_i^{(l+1)} = \sigma\left(\sum_{j=1}^{T} A_{ij} H_j^{(l)} W_j^{(l)}\right)$$

where $T$ is the number of node types, $H_i^{(l)}$ is the representation matrix of type-$i$ nodes in layer $l$, $A_{ij}$ is the adjacency matrix between nodes of type $i$ and type $j$, and $W_j^{(l)}$ is the learnable weight matrix for type-$j$ nodes in layer $l$.
3.1 Sequence k-mer graph
First of all, we construct a large heterogeneous graph $G = (V, E)$, where $V$ and $E$ are the sets of nodes and edges respectively, to describe the relationship between sequences and k-mers. As shown in Figure 1, nodes in the graph have two types: sequence nodes and k-mer nodes. The number of nodes in the sequence k-mer graph is the number of sequences (including the training, validation, and testing sets) plus the number of possible k-mers. The weight of the edge between a sequence and a k-mer is the number of occurrences of the k-mer in the sequence multiplied by its inverse sequence frequency (ISF). ISF is the logarithmically scaled inverse fraction of the number of sequences that contain the k-mer. We tested models with and without ISF and found that the former performs better. We suppose that some common k-mers unrelated to the motif in real-world data may hurt performance, and that ISF can reduce the effect of such irrelevant k-mers. Point-wise mutual information (PMI) is utilized to calculate the weights between two k-mers. Overall, the adjacency matrix of the sequence k-mer graph is formulated as:

$$A_{ij} = \begin{cases} \mathrm{PMI}(i, j) & i, j \text{ are k-mers and } \mathrm{PMI}(i, j) > 0 \\ \mathrm{TF}(i, j) \cdot \mathrm{ISF}(j) & i \text{ is a sequence and } j \text{ is a k-mer} \\ 0 & \text{otherwise} \end{cases}$$

where $\mathrm{TF}(i, j)$ is the number of occurrences of k-mer $j$ in sequence $i$, and $A$ is symmetric. The PMI value of a k-mer pair is computed as

$$\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)}, \qquad p(i, j) = \frac{\#(i, j)}{\#S}, \qquad p(i) = \frac{\#(i)}{\#S}$$

where $\#(i)$ is the number of sequences that contain k-mer $i$, $\#(i, j)$ is the number of sequences that contain both k-mer $i$ and k-mer $j$, and $\#S$ is the total number of sequences in the dataset. The k-mer length $k$ is fixed, and because a motif length is between 6 and 20, the information of a motif may be split into several k-mers. We believe that if two k-mers frequently co-occur in a sequence, they are likely to jointly decide whether the sequence contains a motif. From the formulation above, a positive PMI value indicates a high correlation between k-mers, while a negative PMI value indicates little or no correlation. Based on this analysis, we keep only positive PMI values as the weights of edges between k-mers.
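The edge weights described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: ISF is taken as the natural logarithm of (number of sequences / number of sequences containing the k-mer) with no smoothing, co-occurrence is counted per sequence, and the names `build_edge_weights`, `seq_edges`, and `kmer_edges` are hypothetical rather than taken from the released code.

```python
import math
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_edge_weights(sequences, k):
    """Return (sequence-k-mer weights, k-mer-k-mer weights) as dictionaries."""
    n_seq = len(sequences)
    counts = [Counter(kmers(s, k)) for s in sequences]
    # containing[i]: number of sequences that contain k-mer i.
    containing = Counter(km for c in counts for km in c)
    # co_containing[(i, j)]: number of sequences that contain both i and j.
    co_containing = Counter()
    for c in counts:
        for pair in combinations(sorted(c), 2):
            co_containing[pair] += 1

    # Sequence-k-mer edges: occurrence count times ISF (assumed log(N / n_i)).
    seq_edges = {}
    for s_idx, c in enumerate(counts):
        for km, occurrences in c.items():
            seq_edges[(s_idx, km)] = occurrences * math.log(n_seq / containing[km])

    # k-mer-k-mer edges: keep only positive PMI values.
    kmer_edges = {}
    for (i, j), n_ij in co_containing.items():
        pmi = math.log(n_ij * n_seq / (containing[i] * containing[j]))
        if pmi > 0:
            kmer_edges[(i, j)] = pmi
    return seq_edges, kmer_edges


sequences = ["AACGTC", "TAACGT", "GGGGGG"]  # toy data
seq_edges, kmer_edges = build_edge_weights(sequences, k=4)
```

Under this ISF definition, a k-mer that appears in every sequence receives weight zero, which matches the stated goal of damping common, uninformative k-mers. Pairs that never co-occur are simply absent from `kmer_edges`.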
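After constructing the sequence k-mer graph, we feed the graph into GCN. Because the information of sequences is embedded into the graph, we set the feature matrix $X = I$ for simplicity. We feed the graph into a two-layer GCN; the second-layer node embeddings only have one dimension and are then fed into a softmax classifier for classification:

$$Z = \mathrm{softmax}\left(\hat{A}\, \sigma\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix with self-connections and $\mathrm{softmax}(x_i) = \frac{1}{\mathcal{Z}} \exp(x_i)$ with partition function $\mathcal{Z}$ (i.e. $\mathcal{Z} = \sum_i \exp(x_i)$). The model is thus a semi-supervised model, and the loss function is the cross-entropy error over all labeled sequences:

$$\mathcal{L} = -\sum_{s \in \mathcal{Y}_S} \sum_{c} Y_{sc} \ln Z_{sc}$$

where $\mathcal{Y}_S$ is the set of sequence indices that have labels and $Y_s$ is the label indicator vector of sequence $s$.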
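The two-layer model and the masked loss described above can be sketched as follows. This is an illustrative NumPy forward pass only (no training loop), assuming a ReLU hidden activation, identity input features, a hidden size of 8, and two output classes so that the softmax is non-trivial; the toy graph and all function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def two_layer_gcn(A_hat, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1); with X = I, A_hat X W0 = A_hat W0."""
    hidden = np.maximum(A_hat @ W0, 0.0)
    return softmax(A_hat @ hidden @ W1)


def masked_cross_entropy(Z, Y, labeled_idx):
    """Cross-entropy summed over labeled nodes only; other rows are ignored."""
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx]))


# Toy graph: nodes 0-1 are sequences, nodes 2-3 are k-mers (hypothetical weights).
A = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
A_tilde = A + np.eye(4)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 8))    # hidden size 8 is arbitrary here
W1 = rng.normal(size=(8, 2))    # assumed classes: non-binding / binding
Z = two_layer_gcn(A_hat, W0, W1)

Y = np.zeros((4, 2))
Y[0, 1] = 1.0                   # only sequence 0 is labeled (positive)
loss = masked_cross_entropy(Z, Y, labeled_idx=[0])
```

Because the loss indexes only labeled rows, unlabeled sequences and k-mer nodes still take part in propagation through `A_hat` but contribute no supervision, which is what makes the setup semi-supervised.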
We give an idealized example. For a dataset of a specific protein, suppose the positive samples contain the motif "AACGTC" while the negative samples do not. AACG, ACGT, and CGTC are the key 4-mers that guide classification, and these three k-mers are connected to all positive samples. Through training guided by the labeled samples, the three k-mers learn features that point to the positive label, and the model can then predict the label of a testing sequence according to whether the key k-mers are connected to it (i.e. whether the features of the key k-mers can be transferred to the testing sequence). Overall, information is transferred from labeled samples to k-mers, and then from k-mers to unlabeled sequences. From this analysis, we need at least two rounds of Laplacian smoothing (i.e. a two-layer GCN) to construct DNA-GCN. Since our experiments and prior experience show that too many layers lead to over-smoothed features, we set the number of layers to two.
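The two-hop argument can be checked on a toy instance of this example. The sequences below are made up for illustration; the sketch only traces graph connectivity (labeled sequence to k-mers to unlabeled sequences) and does not train anything.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in a sequence (its k-mer neighbors in the graph)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


key_kmers = kmer_set("AACGTC", 4)            # the key 4-mers of the motif
labeled_positive = "TTAACGTCGG"              # hypothetical training sequence
unlabeled = {
    "test_with_motif": "GCAACGTCAT",         # hypothetical test sequences
    "test_without_motif": "GCATGCATGC",
}

# Hop 1: k-mer nodes adjacent to the labeled sequence.
hop1 = kmer_set(labeled_positive, 4)
# Hop 2: unlabeled sequences adjacent to any of those k-mer nodes.
hop2 = {name for name, seq in unlabeled.items() if kmer_set(seq, 4) & hop1}

print(sorted(key_kmers))  # ['AACG', 'ACGT', 'CGTC']
print(hop2)               # {'test_with_motif'}
```

After a single hop, label information has reached only k-mer nodes; the test sequence containing the motif is reached at the second hop, which is why at least two propagation layers are needed in this setting.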
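We also notice that the sequence k-mer graph is a heterogeneous graph. As a result, feeding our graph into GCN directly seems unreasonable. The layer-wise propagation rule can be rewritten as

$$H^{(l+1)} = \sigma\left(\hat{A} \left(H^{(l)} W^{(l)}\right)\right)$$

where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. This formulation means that we feed every node into an identical single-layer perceptron; from this point of view, we can instead feed each type of node into a type-specific perceptron. We assume that $n_s$ nodes represent sequences, $n_k$ nodes represent k-mers, and $n = n_s + n_k$. $\hat{A}$ can be partitioned into two blocks, $\hat{A}_s$ and $\hat{A}_k$, according to node type, with $H^{(l)}$ partitioned into $H_s^{(l)}$ and $H_k^{(l)}$ accordingly. $W_s^{(l)}$ and $W_k^{(l)}$ are weight matrices with the same shape. The layer-wise propagation rule for a heterogeneous graph is formulated as

$$H^{(l+1)} = \sigma\left(\hat{A}_s H_s^{(l)} W_s^{(l)} + \hat{A}_k H_k^{(l)} W_k^{(l)}\right)$$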
We call the second method DNA-HGCN. In DNA-HGCN, the number of parameters doubles. As a result, DNA-HGCN overfits more easily. We also utilize the Simple Graph Convolutional Network (SGC) to build our model. SGC removes nonlinearities and collapses the weight matrices between consecutive layers to reduce excess complexity while keeping the model's performance.
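The two variants mentioned above can be sketched in NumPy. This is a hedged illustration, assuming that sequence nodes are ordered before k-mer nodes, that the normalized adjacency is split into column blocks by node type, and that ReLU is the activation; `hetero_layer` and `sgc_logits` are hypothetical names, and the random matrices are toy data.

```python
import numpy as np


def hetero_layer(A_hat, H, W_seq, W_kmer, n_seq):
    """Type-specific layer: ReLU(A_s H_s W_seq + A_k H_k W_kmer).

    Assumes the first n_seq nodes are sequences and the rest are k-mers.
    """
    A_s, A_k = A_hat[:, :n_seq], A_hat[:, n_seq:]   # column blocks by node type
    H_s, H_k = H[:n_seq], H[n_seq:]                 # row blocks by node type
    return np.maximum(A_s @ H_s @ W_seq + A_k @ H_k @ W_kmer, 0.0)


def sgc_logits(A_hat, X, Theta, n_hops=2):
    """SGC-style logits: A_hat^n_hops X Theta, with no nonlinearity in between."""
    return np.linalg.matrix_power(A_hat, n_hops) @ X @ Theta


rng = np.random.default_rng(0)
n_seq, n_kmer = 2, 3
M = rng.random((n_seq + n_kmer, n_seq + n_kmer))
A_hat = (M + M.T) / 2.0                 # toy symmetric stand-in for A_hat
H = np.eye(n_seq + n_kmer)
W = rng.normal(size=(5, 4))

shared = hetero_layer(A_hat, H, W, W, n_seq)        # same weights for both types
logits = sgc_logits(A_hat, H, rng.normal(size=(5, 2)))
```

When both node types share one weight matrix, the type-specific layer reduces to the ordinary GCN layer, which shows that the heterogeneous variant only adds capacity (and parameters) through the separate weights.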
3.3 Implementation of DNA-GCN
Our model is trained using the Adam optimizer with a learning rate of 0.001 for 10000 epochs. A portion of the sequences is chosen from the labeled set to construct the validation set. We choose the best model according to its performance on the validation set. We set the embedding size of the first convolution layer to 100.
To evaluate the performance of our model, we chose 50 ChIP-seq ENCODE datasets. Each of these datasets corresponds to a specific DNA-binding protein (e.g., a transcription factor); its positive samples are 101bp DNA sequences that were experimentally confirmed to bind to this protein, and its negative samples were created by shuffling these positive samples. All these datasets were downloaded from http://cnn.csail.mit.edu/. We did not test our model on all available datasets for two reasons. On one hand, DeepBind tends to perform better when the training set contains more samples. From this view, further raising performance on these large datasets, where the AUC is already high, is of little value. On the other hand, our model needs to be trained in the presence of both training and test data. Moreover, the recursive neighborhood expansion across layers poses time and memory challenges for training with large graphs (i.e. large datasets). As a result, we selected 50 datasets with limited samples to test the performance of our model.
gkm-SVM [11] introduces alternative feature sets using gapped k-mers and develops an efficient tree data structure for computing the kernel matrix. Compared to the original kmer-SVM and alternative approaches, gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy.
The CNN-based model [31] is similar to the architecture of DeepBind. We utilize its best variant, with 128 convolutional kernels, as our baseline. The overall performance of the CNN-based model is better than that of DeepBind on the ENCODE datasets. Model 1 and model 2 refer to the CNN-based model with 1 and 128 convolutional kernels, respectively.
4.3 Our model outperforms on many datasets
As shown in Figure 2, the performance of DNA-HGCN across the datasets is slightly better than that of the best baseline. The three DNA-GCN-based models perform similarly. DNA-GCN based on SGC sacrifices a little performance for the lowest computational cost, while DNA-HGCN achieves better performance with doubled parameters.
As for specific datasets (Figure 3), we found that the performance of the CNN-based and GCN-based models is inconsistent, which suggests that GCN and CNN predict DNA-protein binding from different views. The performance of models of the same kind is consistent, whether GCN-based or CNN-based. We believe that combining the two methods could achieve the best performance.
In this paper, a novel method named DNA-GCN is proposed to predict DNA-protein binding. We build a heterogeneous sequence k-mer graph for the whole dataset and turn the problem of predicting DNA-protein binding into a node classification problem. DNA-GCN can transfer information from labeled sequences to key k-mers and predict the labels of unlabeled sequences. Through experiments comparing numerous models on many datasets with a limited number of sequences, we show that a simple two-layer GCN yields promising results.
On the other hand, we believe that we have not yet made the best use of GCN to predict DNA-protein binding. Much improvement may be achieved by adjusting the architecture and hyper-parameters for a given dataset. Although DNA-GCN cannot produce motif logos as CNN-based models do, it can provide information about which k-mers are important for the classification. Above all, our model offers a new perspective for studying motif inference.
This work was supported by The National Key Research and Development Program of China (No.2016YFA0502303) and the National Natural Science Foundation of China (No.31871342).
-  (2016) Tensorflow: a system for large-scale machine learning.. In OSDI, Vol. 16, pp. 265–283. Cited by: §3.3.
-  (2015) Predicting the sequence specificities of dna-and rna-binding proteins by deep learning. Nature biotechnology 33 (8), pp. 831. Cited by: §1, §2.1.
-  (2017) Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks. Bioinformatics 34 (8), pp. 1261–1269. Cited by: §2.1.
-  (2019) Neural networks with circular filters enable data efficient inference of sequence motifs. Bioinformatics. Cited by: §1, §2.1.
-  (2018) Simple tricks of convolutional neural network architectures improve dna–protein binding prediction. Bioinformatics. Cited by: §2.1.
-  (2019) Biological sequence modeling with convolutional kernel networks.. Bioinformatics (Oxford, England). Cited by: §2.1.
-  (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247. Cited by: §2.2.
-  (2016) RNAcommender: genome-wide recommendation of rna–protein interactions. Bioinformatics 32 (23), pp. 3627–3634. Cited by: §1.
-  (2006) The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pp. 233–240. Cited by: §3.3.
-  (2004) ROC graphs: notes and practical considerations for researchers. Machine learning 31 (1), pp. 1–38. Cited by: §3.3.
-  (2014) Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology 10 (7), pp. e1003711. Cited by: §1, §4.2.
-  (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §2.2.
-  (2016) DeeperBind: enhancing prediction of sequence specificities of dna binding proteins. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 178–183. Cited by: §2.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.2.
-  (2016) LS-gkm: a new gkm-svm for large-scale datasets. Bioinformatics 32 (14), pp. 2196–2198. Cited by: §1.
- Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.2.
- Classifying relations in clinical narratives using segment graph convolutional and recurrent neural networks (seg-gcrns). Journal of the American Medical Informatics Association 26 (3), pp. 262–268. Cited by: §2.2.
-  (2019) Deepprune: learning efficient and interpretable convolutional networks through weight pruning for predicting dna-protein binding. Frontiers in genetics 10, pp. 1145. Cited by: §2.1.
-  (2020) Expectation pooling: an effective and interpretable pooling method for predicting dna–protein binding. Bioinformatics 36 (5), pp. 1405–1412. Cited by: §2.1.
-  (2019) MedGCN: graph convolutional networks for multiple medical tasks. arXiv preprint arXiv:1904.00326. Cited by: §2.2, §2.3.
-  (2018) Prediction of rna-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC genomics 19 (1), pp. 511. Cited by: §2.1.
-  (2017) RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC bioinformatics 18 (1), pp. 136. Cited by: §2.1.
-  (2017) Attention based convolutional neural network for predicting rna-protein binding sites. arXiv preprint arXiv:1712.02270. Cited by: §2.1.
-  (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences. Nucleic acids research 44 (11), pp. e107–e107. Cited by: §1.
-  (2018) Recurrent neural network for predicting transcription factor binding sites. Scientific reports 8 (1), pp. 15270. Cited by: §1, §2.1.
-  (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.2.
-  (2019) Heterogeneous graph attention network. arXiv preprint arXiv:1903.07293. Cited by: §2.3.
-  (2019) Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153. Cited by: §2.2, §3.2.
-  (2018) Graph convolutional networks for text classification. arXiv preprint arXiv:1809.05679. Cited by: §2.2.
-  (2016) Convolutional neural network architectures for predicting dna–protein binding. Bioinformatics 32 (12), pp. i121–i127. Cited by: §1, §2.1, §4.2.
-  (2008) Model-based analysis of chip-seq (macs). Genome biology 9 (9), pp. R137. Cited by: §1.
-  (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 499–508. Cited by: §2.2.