ncrna-deep
Scripts and datasets to reproduce the experiment of the paper "Exploring non-coding RNA functions with deep learning tools"
Non-coding RNA (ncRNA) are RNA sequences which don't code for a gene but instead carry important biological functions. The task of ncRNA classification consists in classifying a given ncRNA sequence into its family. While it has been shown that the graph structure of an ncRNA sequence folding is of great importance for the prediction of its family, current methods make use of machine learning classifiers on hand-crafted graph features. We improve on the state-of-the-art for this task with a graph convolutional network model which achieves an accuracy of 85.73% and an F1-score of 85.61% over 13 classes. Moreover, our model learns in an end-to-end fashion from the raw RNA graphs and removes the need for expensive feature extraction. To the best of our knowledge, this also represents the first successful application of graph convolutional networks to RNA folding data.
Figure 2: The proposed model. The input consists of an RNA graph with a nucleotide on each node. Initially, an embedding layer maps each nucleotide into a continuous vector. After that, multiple graph convolutional layers are used to refine the features on each node by propagating information in the graph. The Set2Set pooling model is then used to aggregate the relevant information from the output of the last convolutional layer and produce a graph-wise representation. A fully-connected layer with softmax activation finally produces the output probability for each class.
RNA, together with DNA, is one of the fundamental carriers of genetic information. While the main function of RNA is in the production of proteins from instructions present in DNA code, RNA has been shown to also carry other important biological functions. In particular, recent findings on cancer research have shifted the attention from protein-coding RNAs to non-coding RNAs, as principal effectors and regulators of tumorigenesis and cancer development (He et al., 2019; Wu et al., 2018; Tian and Zhang, 2017; Yang et al., 2017). Moreover, certain RNAs have been shown to impact gene expression through fulfilling roles encompassing sensory and scaffolding capacities at various stages of the gene regulation process (Mercer and Mattick, 2013). We refer to this functional RNA, which is transcribed from DNA but not translated into proteins, as non-protein coding RNA (ncRNA).
At its most basic, RNA is a sequence of four types of nucleotides: adenine (A), guanine (G), cytosine (C) and uracil (U). The sequence of nucleotides forms what is commonly called the RNA primary structure. However, it is the folding of the RNA sequence into its secondary structure which is more related to its function. The folding is generated by hydrogen bonds between complementary base pairs. The most commonly occurring base pairs are A-U and G-C, also called Watson-Crick base pairs. However, RNA sometimes presents hydrogen bonds between different bases. All base pairs which do not follow Watson-Crick rules are called Wobble base pairs. Figure 1 shows the graph representing the folding of a short ncRNA sequence. We can observe both Watson-Crick pairs and Wobble pairs (G-U). Black edges represent phosphodiester bonds (between adjacent bases in the original sequence) while red edges represent hydrogen bonds.
A wide variety of different classes, or families, of ncRNA have been identified, which differ by function and structure. Since the identification of drugs targeting the regulatory circuits of ncRNA depends on knowing its family, there has been an increasing interest in the development of methods for ncRNA classification. More traditional methods, such as RNA-CODE (Yuan and Sun, 2013), are based on alignment strategies. Other methods, such as RNAcon (Panwar et al., 2014) and GraPPLe (Childs et al., 2009), use standard machine learning classifiers on manually extracted graph properties of the RNA secondary structure. These approaches have shown that graph properties (both local and global) reflect the functional information of different classes of RNAs and are therefore informative for the classification.
More recently, nRC (Fiannaca et al., 2017) uses a convolutional neural network on graph features extracted using MoSS (Borgelt et al., 2005), which finds frequent local sub-structures in a set of graphs. In particular, MoSS is used to extract up to 6483 binary features for each input graph, where each feature represents the presence or absence of a particular sub-structure. To the best of our knowledge, nRC represents the state-of-the-art approach for ncRNA classification.

Deep learning has recently had a remarkable impact on multiple domains, including natural language processing and computer vision (LeCun et al., 2015). However, most popular deep neural models, such as convolutional neural networks (CNNs) (Lecun et al., 1998), only work on grid-structured (Euclidean) data, and are not directly applicable to graphs. For this reason, nRC (Fiannaca et al., 2017) first extracts features from the ncRNA graphs before applying a CNN.
Recently, there has been growing interest in extending deep learning techniques to non-Euclidean data, including graphs (Bronstein et al., 2017). Several models for deep learning on graphs have been developed in the past few years, including graph convolution (Kipf and Welling, 2017), graph attention (Velickovic et al., 2018), mixture models (Monti et al., 2016) and neural message passing (Gilmer et al., 2017).
We are the first to apply graph convolutional networks to RNA folding data, achieving state-of-the-art results on the task of ncRNA classification with an accuracy of 85.73% and an F1-score of 85.61% over 13 classes. Our model is aware of different bond types and uses attention to aggregate information from the most important nodes for the final classification task. Moreover, since it learns directly from the RNA graphs, it removes the need for manual feature extraction.
Most graph convolutional network models can be interpreted as following a standard framework of message passing (Gilmer et al., 2017). In particular, at each layer, the features of a node are updated by aggregating messages from its neighbors. Given a graph $G = (V, E)$, with node features $x_v$ and edge features $e_{vw}$, the update at layer $l$ takes the form:

$$m_v^{l+1} = \sum_{w \in N(v)} M_l(h_v^l, h_w^l, e_{vw}) \tag{1}$$

$$h_v^{l+1} = U_l(h_v^l, m_v^{l+1}) \tag{2}$$

where $h_v^l$ denotes the features of node $v$ at layer $l$, $M_l$ is a learnable function which computes the message from node $w$ to node $v$, $N(v)$ represents the neighbors of $v$ in the graph, $m_v^{l+1}$ represents the aggregation of all incoming messages for node $v$, and $U_l$ is a learnable function which updates the features for node $v$ given its previous features and the incoming aggregated message.
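The two-step update of equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: node and edge features are scalars, and the message and update functions are fixed lambdas rather than learned networks. The names `message_passing_step`, `message_fn` and `update_fn` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int]  # (v, w): node v receives a message from its neighbor w


def message_passing_step(
    h: List[float],
    neighbors: Dict[int, List[int]],
    edge_feat: Dict[Edge, float],
    message_fn: Callable[[float, float, float], float],
    update_fn: Callable[[float, float], float],
) -> List[float]:
    """One layer: aggregate neighbor messages (eq. 1), then update (eq. 2)."""
    new_h = []
    for v, h_v in enumerate(h):
        # Equation (1): sum the messages sent by every neighbor w of v.
        m_v = sum(
            message_fn(h_v, h[w], edge_feat[(v, w)]) for w in neighbors.get(v, [])
        )
        # Equation (2): combine the previous features with the aggregated message.
        new_h.append(update_fn(h_v, m_v))
    return new_h


# Toy path graph 0 - 1 - 2 with scalar node features and unit edge features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
edge_feat = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
h0 = [1.0, 2.0, 3.0]

h1 = message_passing_step(
    h0,
    neighbors,
    edge_feat,
    message_fn=lambda h_v, h_w, e: e * h_w,  # hypothetical, not learned
    update_fn=lambda h_v, m_v: h_v + m_v,    # hypothetical, not learned
)
print(h1)  # [3.0, 6.0, 5.0]
```

Stacking several such steps lets information travel further in the graph: after $k$ layers, each node has seen its $k$-hop neighborhood.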
In graph classification problems, it is also necessary to produce a global graph representation by aggregating the final features for all nodes. This operation is called global pooling and can be defined as:
$$h_G = R(\{h_v^L \mid v \in V\}) \tag{3}$$

where $R$ is a learnable function which is permutation invariant with respect to the order of nodes, and $h_v^L$ represents the features of node $v$ after the last convolutional layer.
Our model is shown in figure 2. It takes as input a graph corresponding to a folded ncRNA sequence. Mathematically, the ncRNA classification task takes the form of a prediction on a graph $G = (V, E)$ with node features $x_v$ and edge features $e_{vw}$. In particular, $x_v$ is just a one-hot representation of the nucleotide of node $v$, and $e_{vw}$ is a one-hot representation of the edge type (either hydrogen bond or phosphodiester bond).
The model consists of one embedding layer which maps each nucleotide to a continuous vector representation, followed by a sequence of graph convolutional layers. In particular, we use a layer similar to the one used in (Gilmer et al., 2017), which is able to propagate information differently based on the edge type. Our convolutional layers take the form:

$$h_v^{l+1} = W h_v^l + \sum_{w \in N(v)} A(e_{vw})\, h_w^l \tag{4}$$

where $A$ is a 2-layer MLP with Leaky-ReLU as non-linearity, which produces a projection matrix from the edge features $e_{vw}$. Since our edge features are one-hot encodings, this amounts to learning a different projection matrix for each edge type, allowing the model to spread information differently based on the bond between two nodes. Lastly, $W$ is a matrix of learnable weights.

After the last convolutional layer, a global pooling mechanism is used to obtain a single representation for the whole graph. In particular, we use the Set2Set model (Vinyals et al., 2016), which is a permutation invariant global pooling operator based on iterative content-based attention:
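$$q_t = \mathrm{LSTM}(q_{t-1}^*) \tag{5}$$

$$e_{v,t} = h_v^L \cdot q_t \tag{6}$$

$$a_{v,t} = \frac{\exp(e_{v,t})}{\sum_{w \in V} \exp(e_{w,t})} \tag{7}$$

$$r_t = \sum_{v \in V} a_{v,t}\, h_v^L \tag{8}$$

$$q_t^* = q_t \,\Vert\, r_t \tag{9}$$

where $q_0^*$ is the zero vector, $\Vert$ denotes concatenation, and $q_T^*$ is the final representation of the graph. At each step $t$, the output $q_t$ of the LSTM is used to compute attention scores $a_{v,t}$ over all nodes. The new input to the LSTM is the concatenation of the LSTM output $q_t$ with a weighted average $r_t$ of the node features, where the weights are given by the attention scores. We use $T$ steps and a single layer for the LSTM. The size of the hidden state is the same as the number of output features from the last convolutional layer.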
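To make the attention part of this pooling concrete, here is a minimal sketch of a single readout step: scoring the nodes against a query, normalizing the scores with a softmax, and averaging the node features. It deliberately leaves out the LSTM; the `query` vector simply stands in for the LSTM output at one step, and the dot-product scoring is an assumption of this sketch. The function name `attention_readout` is hypothetical.

```python
import math
from typing import List


def attention_readout(node_feats: List[List[float]], query: List[float]) -> List[float]:
    """Content-based attention over all nodes for a single query vector."""
    # Score each node by the dot product between its features and the query.
    scores = [sum(x * q for x, q in zip(h_v, query)) for h_v in node_feats]
    # Softmax over nodes (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted average of the node features.
    dim = len(query)
    return [sum(a * h_v[k] for a, h_v in zip(weights, node_feats)) for k in range(dim)]


nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]  # stands in for the LSTM output at one step

r = attention_readout(nodes, query)
r_shuffled = attention_readout(list(reversed(nodes)), query)
print(r)
print(r_shuffled)
```

Reordering the nodes leaves the readout unchanged up to floating-point rounding, which is the permutation invariance required of a global pooling operator.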
A fully connected layer with softmax activation is finally used to produce the probabilities for each of the thirteen ncRNA classes.
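Returning to the convolution of equation (4), the sketch below shows how the bond type can change the way information is propagated. It is an illustration only: since the edge features are one-hot, the MLP that produces the projection matrix is replaced here by a lookup table with one fixed matrix per bond type, and all weights are hand-picked rather than learned. The names `edge_aware_conv`, `w_self` and `w_edge` are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
# (v, w, bond): node v receives a message from neighbor w over an edge of type bond.
TypedEdge = Tuple[int, int, str]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def edge_aware_conv(
    h: List[Vector],
    edges: List[TypedEdge],
    w_self: Matrix,
    w_edge: Dict[str, Matrix],
) -> List[Vector]:
    """Self term plus neighbor messages projected by a bond-specific matrix."""
    out = [matvec(w_self, h_v) for h_v in h]
    for v, w, bond in edges:
        msg = matvec(w_edge[bond], h[w])
        out[v] = [a + b for a, b in zip(out[v], msg)]
    return out


# Three nucleotides: backbone 0-1-2 plus one hydrogen bond between 0 and 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [
    (0, 1, "phosphodiester"), (1, 0, "phosphodiester"),
    (1, 2, "phosphodiester"), (2, 1, "phosphodiester"),
    (0, 2, "hydrogen"), (2, 0, "hydrogen"),
]
w_self = [[1.0, 0.0], [0.0, 1.0]]
w_edge = {
    "phosphodiester": [[0.5, 0.0], [0.0, 0.5]],  # hand-picked, not learned
    "hydrogen": [[0.0, 1.0], [1.0, 0.0]],        # hand-picked, not learned
}

print(edge_aware_conv(h, edges, w_self, w_edge))
# [[2.0, 1.5], [1.0, 1.5], [1.0, 2.5]]
```

In the toy example, messages along the backbone are halved while messages across the hydrogen bond have their two components swapped, so the same neighbor contributes differently depending on the bond.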
Each convolutional layer is followed by batch norm (Ioffe and Szegedy, 2015), before using the Leaky-ReLU activation function. Dropout (Srivastava et al., 2014) has also been used to regularize the model.

Table 1: Statistics of the dataset splits.

Dataset | #Graphs | Avg. #Nodes | Avg. #Edges | #Classes |
---|---|---|---|---|
train | 5670 | 162.02 | 210.46 | 13 |
val | 650 | 163.30 | 212.12 | 13 |
test13 | 2600 | 149.15 | 193.25 | 13 |
test12 | 2400 | 147.52 | 191.13 | 12 |
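
Table 2: Evaluation metrics. TP, TN, FP and FN denote the number of true positives, true negatives, false positives and false negatives.

Metric | Formula |
---|---|
Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ |
Sensitivity | $\frac{TP}{TP + FN}$ |
Specificity | $\frac{TN}{TN + FP}$ |
Precision | $\frac{TP}{TP + FP}$ |
F1-Score | $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$ |
MCC | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |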
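As an illustration of how the metrics listed in this table are commonly computed, the sketch below derives them from a confusion matrix using the standard binary definitions. It assumes a one-vs-rest reduction, in which each class is in turn treated as the positive class; how the per-class values are then combined across the 13 classes is not stated in the text and is left out. The function names and the toy confusion matrix are hypothetical.

```python
import math
from typing import Dict, List


def one_vs_rest_counts(confusion: List[List[int]], k: int) -> Dict[str, int]:
    """TP/FP/FN/TN for class k (rows = true class, columns = predicted class)."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary definitions; assumes no denominator is zero."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Toy 3-class confusion matrix (made-up numbers, not results from the paper).
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
counts = one_vs_rest_counts(confusion, 0)
m = metrics_from_counts(**counts)
print(counts)
print(m)
```

For class 0 of the toy matrix this gives TP = 8, FP = 2, FN = 2 and TN = 18, hence a sensitivity and precision of 0.8, a specificity of 0.9 and an MCC of 0.7.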

Table 3: Results on the test13 (top) and test12 (bottom) datasets.

Model | Dataset | Accuracy | Sensitivity | Specificity | Precision | F1-score | MCC |
---|---|---|---|---|---|---|---|
nRC | test13 | 81.81% | 81.81% | 98.48% | 81.50% | 81.66% | 80.29% |
RNAGCN (ours) | test13 | 85.73% | 86.09% | 98.82% | 86.09% | 85.61% | 84.59% |
RNACon | test12 | 37.17% | 37.17% | 96.26% | 45.84% | 41.05% | 33.43% |
nRC | test12 | 81.04% | 81.04% | 98.42% | 82.11% | 81.57% | 79.46% |
RNAGCN (ours) | test12 | 85.29% | 81.06% | 98.78% | 87.82% | 86.30% | 84.07% |

We used the datasets introduced in (Fiannaca et al., 2017), which consist of a training dataset of 6320 ncRNA sequences and a test dataset of 2600 sequences. Both datasets contain sequences from 13 different ncRNA classes: miRNA, 5S rRNA, 5.8S rRNA, ribozymes, CD-box, HACA-box, scaRNA, tRNA, Intron gpI, Intron gpII, IRES, leader and riboswitch. While the test dataset is perfectly balanced, the training dataset contains only 320 sequences from the IRES class, compared to 500 sequences for each of the other classes.
In line with (Fiannaca et al., 2017), we also report results on a different test dataset, obtained by removing all sequences belonging to the scaRNA class from the original test dataset. This allows for a comparison with RNACon (Panwar et al., 2014), which was not trained on the scaRNA class. We refer to the original test dataset with 13 classes as test13 and to the reduced test dataset with 12 classes as test12.
In order to tune our model, we further split the original training dataset in two: a validation set with 650 sequences (50 from each class) and a training set with the remaining 5670 sequences. The code and the datasets are available at [Link Available Upon Acceptance]. The statistics of the final splits are shown in table 1.
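A per-class split of this kind can be sketched as follows. The class sizes are taken from the text (320 IRES sequences and 500 for each of the other 12 classes); the random selection, the seed and the placeholder class names are assumptions, since the text does not say how the 50 validation sequences per class were chosen.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_per_class(
    labels: Sequence[str], n_val_per_class: int, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Hold out a fixed number of examples from every class for validation."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx: List[int] = []
    val_idx: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random choice is an assumption of this sketch
        val_idx.extend(indices[:n_val_per_class])
        train_idx.extend(indices[n_val_per_class:])
    return train_idx, val_idx


# Label list with the class sizes given in the text (placeholder class names).
labels: List[str] = []
for c in range(12):
    labels += [f"class_{c}"] * 500
labels += ["IRES"] * 320

train_idx, val_idx = split_per_class(labels, n_val_per_class=50)
print(len(train_idx), len(val_idx))  # 5670 650
```

The resulting sizes match the first two rows of table 1: 13 classes times 50 sequences gives 650 validation graphs, leaving 5670 for training.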
For each sequence, we generate the corresponding folding graph using the ViennaRNA (Hofacker, 2003) package.
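Folding tools typically report a secondary structure in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. Assuming that format, the sketch below turns a sequence and its structure into a list of typed edges like the ones the model consumes. The structure string in the example is written by hand for illustration and is not the output of a folding run; the function name `dot_bracket_to_edges` is hypothetical.

```python
from typing import List, Tuple

TypedEdge = Tuple[int, int, str]  # (i, j, bond type), with i < j


def dot_bracket_to_edges(sequence: str, structure: str) -> List[TypedEdge]:
    """Build backbone and base-pair edges from a dot-bracket structure."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")

    # Phosphodiester bonds link bases that are adjacent in the sequence.
    edges: List[TypedEdge] = [
        (i, i + 1, "phosphodiester") for i in range(len(sequence) - 1)
    ]

    # Hydrogen bonds link the positions of matching parentheses.
    stack: List[int] = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            edges.append((stack.pop(), i, "hydrogen"))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return edges


sequence = "GGGAAAUCC"
structure = "(((...)))"  # hand-written toy hairpin, not a predicted folding

edges = dot_bracket_to_edges(sequence, structure)
pairs = [(sequence[i], sequence[j]) for i, j, bond in edges if bond == "hydrogen"]
print(len(edges))  # 11
print(pairs)       # [('G', 'U'), ('G', 'C'), ('G', 'C')]
```

The toy hairpin contains two Watson-Crick pairs (G-C) and one wobble pair (G-U), and its 9 nodes give 8 backbone edges plus 3 hydrogen-bond edges.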
The hyperparameters of the model have been tuned on the held-out validation set using early stopping with a patience of 30 epochs. For the optimization of the model we used Adam (Kingma and Ba, 2015) with a learning rate of 0.0004. The best performing model on the validation set consists of 5 convolutional layers of dimension 80, the Set2Set model for global pooling, and uses a dropout rate of 0.1 for regularization.
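The reported settings and the early-stopping rule can be summarized in a short sketch. The configuration values come from the text; the `EarlyStopping` helper, the key names and the choice of a validation score where higher is better are assumptions, and the simulated scores are made up to keep the example short.

```python
from typing import Optional

# Values reported in the text; the key names are hypothetical.
CONFIG = {
    "learning_rate": 0.0004,
    "num_conv_layers": 5,
    "hidden_dim": 80,
    "dropout": 0.1,
    "patience": 30,
}


class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best: Optional[float] = None
        self.bad_epochs = 0

    def step(self, val_score: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if self.best is None or val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation scores, with a short patience for illustration.
stopper = EarlyStopping(patience=3)
scores = [0.60, 0.70, 0.69, 0.68, 0.70, 0.65]
stopped_at = None
for epoch, score in enumerate(scores):
    if stopper.step(score):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 4 0.7
```

In a real run the stopper would be created with `CONFIG["patience"]`, and the weights from the best epoch would be kept for evaluation on the test sets.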
We first tested our method on the independent test set with 13 classes (test13). The results are shown in the top part of table 3. While nRC obtains an accuracy of 81.81%, our model outperforms it with an accuracy of 85.73%. We also observe similar improvements on all other metrics.
The bottom part of table 3 shows instead the results on the test dataset with only 12 classes (test12). Our model outperforms both RNACon and nRC on all metrics.
We have presented RNAGCN, the first successful application of graph convolutional networks to RNA folding data, which achieves state-of-the-art results on the challenging task of ncRNA classification. Our model combines edge-aware convolutions and an attention-based pooling mechanism. With respect to existing approaches, our model comes with the additional benefit of being trained end-to-end and removing the need for manual feature extraction from the graph.