rna2rna: Predicting lncRNA-microRNA-mRNA Interactions from Sequence with Integration of Interactome and Biological Annotation Data

06/17/2019
by   Nhat Tran, et al.
The University of Texas at Arlington
1

Long non-coding RNA, microRNA, and messenger RNA enable key regulations of various biological processes through a variety of diverse interaction mechanisms. Identifying the interactions and cross-talk between these heterogeneous RNA classes is essential in order to uncover the functional role of individual RNA transcripts, especially for unannotated and newly-discovered RNA sequences with no known interactions. Recently, sequence-based deep learning and network embedding methods are becoming promising approaches that can either predict RNA-RNA interactions from a sequence or infer missing interactions from patterns that may exist in the network topology. However, the majority of these methods have several limitations, e.g., the inability to perform inductive predictions, to distinguish the directionality of interactions, or to integrate various sequence, interaction, and annotation biological datasets. We proposed a novel deep learning-based framework, rna2rna, which learns from RNA sequences to produce a low-dimensional embedding that preserves the proximities in both the interactions topology and the functional affinity topology. In this proposed embedding space, we have designated a two-part "source and target contexts" to capture the targeting and receptive fields of each RNA transcript, while encapsulating the heterogenous cross-talk interactions between lncRNAs and miRNAs. From experimental results, our method exhibits superior performance in AUPR rates compared to state-of-art approaches at predicting missing interactions in different RNA-RNA interaction databases and was shown to accurately perform link predictions to novel RNA sequences not seen at training time, even without any prior information. Additional results suggest that our proposed framework may have successfully captured a manifold for heterogeneous RNA sequences to be used to discover novel functional annotations.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 11

page 12

10/16/2020

Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction

Predicting protein-protein interactions (PPIs) by learning informative r...
07/08/2021

Network and Sequence-Based Prediction of Protein-Protein Interactions

Background:Typically, proteins perform key biological functions by inter...
04/10/2015

Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks

Complex biological systems have been successfully modeled by biochemical...
12/22/2020

Drug-Target Interaction Prediction via an Ensemble of Weighted Nearest Neighbors with Interaction Recovery

Predicting drug-target interactions (DTI) via reliable computational met...
08/23/2021

Genetic Programming for Manifold Learning: Preserving Local Topology

Manifold learning methods are an invaluable tool in today's world of inc...
10/30/2017

Prototype Matching Networks for Large-Scale Multi-label Genomic Sequence Classification

One of the fundamental tasks in understanding genomics is the problem of...
07/23/2021

TargetNet: Functional microRNA Target Prediction with Deep Neural Networks

MicroRNAs (miRNAs) play pivotal roles in gene expression regulation by b...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Regulatory long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) that influences gene expression post-transcriptionally by interacting to target messenger RNAs (mRNA) form a complex network of transcriptomic interactions. These heterogeneous families of noncoding RNAs are associated with nearly all cellular processes, including cell division, senescence, differentiation, stress response, immune activation, and apoptosis [1, 2, 3]

. Recently, lncRNAs are gaining considerable attention as the largest and most diverse class of non-coding RNA, encompassing nearly 30,000 discovered transcripts in human. They are classified as transcribed RNA molecules with a length of more than 200 nt which has a diverse influence upon the function of other non-coding RNAs as well as regulation of protein-coding RNAs. Among many of their known functional interaction mechanisms, lncRNAs are known to act as miRNA decoys, derepress gene expression by competing with miRNAs for shared mRNA targets, or directly regulate gene expression

[4]. Additional studies have also indicated that miRNAs can regulate lncRNAs by triggering decay [5], and moreover, some processed lncRNAs can even generate miRNAs [6]. These types of cross-talk functional interactions between lncRNAs and miRNAs to co-regulate gene expressions highlight the complexity of the non-coding RNA regulatory network. Despite that miRNA’s repression to target mRNAs has been well studied, the discovery of functional interactions for a large number of lncRNAs in the human transcriptome is still at a rather preliminary stage.

Determining the function of the individual lncRNAs remains a challenge as most of these RNA transcripts are currently unannotated, and their known interactions are sparse. Recent advances in RNA sequencing (RNA-Seq), deep sequencing (CLIP-seq, LIGR-Seq), and computational methods allow for an unprecedented analysis of such transcripts and have enabled researchers to generate large-scale interaction and annotation databases. However, the interaction networks generated from such data are often scant and incomplete in the number of lncRNAs covered. Although a large number of lncRNAs have been identified, only a few hundreds have had functional and molecular mechanisms determined to date, as observed in annotation databases such as lncRNAdb [7]. Thus, in silico prediction of RNA-RNA interactions have been widely applied in the task of predicting or inferring missing functional interactions, where experimental studies are in short supply due to time and cost.

To address this challenge, this paper explores the application of a recent advancement in machine learning called ”network embedding”. The methodology enables learning from the interaction topology and attributes of transcriptome-wide RNAs to accurately predict RNA-RNA functional interactions. Mathematically, our ground truth knowledge about RNA interactions can be represented by a directed adjacency matrix, whose rows and columns correspond to individual lncRNAs, miRNAs, and mRNAs. In this matrix, its binary (1 or 0) entries can indicate whether or not an RNA was observed to have a functional interaction to another RNA, supported from experimentally-validated interaction databases. The matrix is exceptionally sparse, especially among lncRNAs, i.e., out of millions of possible interactions, only a few thousands have been identified. Currently, a signification fraction of newly discovered lncRNAs lack any identified interactions or functional annotations besides its basic genomic information such as locus biotype and primary transcript sequence

[8]. These genes might support important biological cell functions and could potentially serve as targets for genomic, diagnostic, or therapeutic studies. Thus, in the effort to functionally characterize these ”hypothetical” lncRNAs, an essential task is to accurately predict their RNA-RNA interactions from sequence.

Many graph-theoretic methods have been applied to biological networks with the intuition that RNAs close together in the interaction topology are more likely to be involved in many of the same functions [9]. Typically, the approach is mining the neighborhood structure of nodes in the network topology, in order to suggest that two nodes are likely to be functionally similar if they share many of the same co-interacting neighbors. The positive results utilizing this approach [10] often give the motivation that perhaps if ground-truth functional annotation information can be incorporated, it can facilitate the process of suggesting the interactions of the presently unknown RNAs.

We propose an algorithm for simultaneously integrating various existing biological annotation data while identifying the complex patterns in the RNA transcript sequences that would allow for accurate prediction of these missing interactions. In this work, we present rna2rna, a novel framework that combines the network-based and deep learning-based approaches to extract a latent representation from nucleotide sequence in order to accurately predict RNA-RNA interaction and identify functional similarity. Our framework processes human lncRNA, miRNA, and mRNA on a transcriptome-wide scale, and was shown to be effective at predicting interactions for RNA with presently unknown interactions. In summary, the main contributions of our method include:

  • Learning a low-dimensional representation from heterogeneous RNA transcript sequences by leveraging existing biological annotation databases, using a strategy to learn an affinity between embeddings which preserves functional similarity between RNAs.

  • The model can make inductive predictions to novel RNA sequences of any length, for various tasks such as inferring missing interactions, clustering of RNA transcripts, and classification of functional families.

  • Proposing a two-part embedding space to represent the ”source context” and ”target context” of an individual RNA. The learned embeddings can simultaneously preserve directed cross-talk functional interactions and undirected functional affinities in the lncRNA-miRNA-mRNA topology.

  • Repurposing the Siamese Network architecture to new applications in networking embedding and link prediction.

To our knowledge, no other tool exists to predict heterogeneous lncRNA-miRNA, miRNA-lncRNA, lncRNA-mRNA, miRNA-mRNA, and mRNA-RNA interactions from sequence, while simultaneously integrating various biological annotation data to characterize RNA-RNA functional similarity.

2 Related Work

Several network embedding methods have been proposed to predict unobserved links in a network by learning the high-order proximity in its topology. The state-of-the-art network embedding methods, e.g., LINE [11] and node2vec [12] utilizes the second-order proximity, which assume that nodes sharing many of the same second-order neighbors have a high affinity to each other. By learning the neighborhood structure similarity between nodes, their semantic similarity can be identified and is used to predict novel connections. Although such techniques have demonstrated competitive link prediction performance in networks of various domains [13]

, their performance in predicting biological network interactions is rather subpar. The reasons for this include: 1) the gene-gene interaction network can be extremely sparse; 2) there is lack of consideration for directionality of gene-gene interaction; 3) a lack of an integrative approach to utilize sequence and annotation data; and/or 4) the predictions are transductive in nature, i.e., constrained to only connected nodes whose connections to other nodes existing in the training set is known. The method proposed in this paper tackles these limitations and aim to accurately estimate the association strength between every possible RNA-RNA pair.

A number of network embedding methods have been applied to biological networks to either predict gene-gene interactions or to infer biological functions. Due to the extreme sparsity of the known interaction network among lncRNAs to miRNAs and mRNAs, it is pertinent to unravel the functional association between lncRNAs by considering its gene/transcript annotation, functional family annotations, gene-disease association, and sequence similarity [14]. Several efforts in recent studies have already been made to meet the urgent need in this area. For example, Kishan et al. [15] uncovered the second-order proximity relationship between interacting genes by integration of the gene regulatory network and gene expression as side information. Additionally, Cho et al. [16] proposed a diffusion-based method to predict a protein’s function by propagating information through direct and indirect associations in the interaction network. It is important to note that our method differs from these techniques in that it incorporate heterogeneous directed RNA-RNA interaction types, as well as RNA sequence data and annotation attributes as side information. On the other hand, several structure-free sequence-based methods [17, 18, 19, 20] have also been proposed for prediction of protein binding sites, family classification, structure prediction, or RNA-RNA interaction prediction from RNA sequence. These methods utilize machine learning models to learn a latent feature representation of target sequences. Motivated by these works, our method aims to unravel the complex hidden features from the RNA sequence that plays a factor in characterizing its functional similarity and interaction to other RNAs.

3 Methods

3.1 Defining the Heterogeneous lncRNA-miRNA-mRNA Interaction Network

We formally define the heterogeneous network of lncRNA, miRNA, and mRNA interactions and functional similarity as two networks of directed and undirected edges. We denote the two networks, and , having the same set of nodes (also called vertices) and two set of edges and . The set of nodes can also be expressed as s.t. , where are the sets containing the lncRNA, microRNA, and mRNA heterogeneous nodes, respectively. The set of directed edges represent directed regulatory interactions that each specify a source and a target. The undirected edges represents the undirected functional affinity associated with the heterogeneous RNA nodes. Each edge is associated with a weight such that , indicating the strength of the connection between RNA and RNA . If , we consider the edge a positive interaction/affinity, and if , we consider the edge a negative (non)interaction/affinity. In this paper, we consider weights to be binary, indicating whether RNA and RNA has an interaction/affinity or not.

Furthermore, every node also has an associated RNA sequence in a condensed word embedding representation. An RNA sequence is denoted as , where is the sequence length of RNA , and the integer number at each entry indexes the four RNA nucleotides.

3.2 Directed lncRNA-miRNA-mRNA Interaction Edges by Integrating Various Data Sources

The directed edges represent the directed regulatory interactions between lncRNAs, miRNAs, and mRNAs. Some interactions can be considered as bi-directional, however, in this study, we interpret the directionality as the regulatory effect of one RNA transcript’s abundance causing a direct inhibition/repression/promotion to another RNA transcript’s abundance. For instance, we can effectively encode miRNA-lncRNA interactions (e.g., miRNA inducing lncRNA decay) to be separate from lncRNA-miRNA interactions (e.g., lncRNAs acting as miRNA decoys) by using directed edges to represent different types of functional interaction. In this study, the different types of interaction collected from various experimentally-verified interaction databases in the lncRNA-miRNA-mRNA interactome considered are:

  • lncRNA-miRNA interaction via miRNA-sponging decoy function of lncRNAs as competing endogenous RNA.

  • lncRNA-mRNA post-transcriptional gene regulation.

  • miRNA-lncRNA interaction by where a binding miRNA causes decay of a lncRNA transcript.

  • miRNA-mRNA post-transcriptional interactions causing degradation of target mRNAs.

  • mRNA-mRNA interaction in the gene regulatory network.

These heterogeneous interactions are combined into an integrated network, and the associated set of edges is , where the binary edge weight indicates whether a regulatory interaction from RNA node to RNA node has been observed in the literature.

Fig. 1: An illustration of the heterogeneous lncRNA-miRNA-mRNA tri-module network.

3.3 Undirected RNA-RNA Functional Affinity Edges

Although the directed interaction edges are given, the set of undirected functional affinity edges must be derived from various biological annotation data associated with the RNA nodes. Each node can have up to annotation data associated with it. For an annotation that is associated with a node

, we denote a feature vector

of binary entries representing the presence/absence of a particular attribute in this annotation field.

We aim to capture the functional similarity between two RNA nodes by calculating an affinity score as a similarity measure of characteristics, suggesting a resemblance in RNA function or structure. The approach we take to calculate an affinity between pairs of same-class RNA nodes is by examining the matching annotation attributes that both shares. For any categorical text annotation (e.g., disease association, transcript biotype, RNA structure family, or GO terms) that two RNA nodes and both have been annotated, the attributes in this annotation are first transformed to binary feature representation. For a categorical annotation denoted as , the binary 1-D feature vector associated with RNA node is denoted as . In this vector contains binary entries that indicate whether or not the RNA node has been associated with each of the total possible attributes of this annotation. Using the Sørensen-Dice coefficient score [21], a similarity score between two binary vectors for node and node for feature can be obtained by:

This similarity measure ranges and gives higher weight to the common attributes present in both RNAs than by the attributes present in only one RNA. Since most RNAs have null annotations, we computed the Dice coefficient score only between pairs of RNA nodes that have both been annotated.

To obtain an aggregate affinity score between a pair of RNA nodes across all similarity values, we utilized a modified version of the Gower’s Similarity score [22]. For each RNA-RNA pair, Gower’s similarity aggregates similarity scores across all the annotation features and perform a weighted average. Typically, a similarity score for a pair of RNA nodes that do not have any associated annotation would be considered a , but in this study, we remove these null pairwise similarity from consideration. Thus, Gower’s similarity will only aggregate the available pairwise similarity scores from annotation that exists between both nodes to compute the average. Among the RNA node pairs that have only one associated annotation pairwise similarity, we further compute the global pairwise sequence alignment [23] score between these pairs of sparsely annotated RNAs. The Needleman-Wunsch algorithm is used to computes the highest score for matching sequence alignment, normalized by the sequence length, to approximately measure the homology between transcript sequences. The RNA node pairs that do not have any matching annotations were not included in the pairwise calculation.

This Gower’s similarity score is computed between all pairs of same-class RNA nodes, and the resulting pairwise affinity matrix is

, where entries with being the set of annotations present in both nodes and . The entries will then be selected as edge weights that represent the functional similarity edges between node. Since our model currently only considers unweighted binary edges, we selected undirected edges with affinity score close to or higher than a chosen hard-threshold to be considered as a positive edge. In our experiments, the hard-threshold was arbitrarily chosen where the number of positive affinity edges covers no more than sparsity of the entire affinity matrix. Then, we also uniformly sampled a set of undirected affinity edges with a weight close to , indicating a negative edge that suggests functional dissimilarity between a pair of RNA nodes. The number of negative edges chosen such that the ratio of negative edges to positive edges is between and .

4 Network Embedding with Source-Target Contexts

A network embedding is mapping from each RNA node to a low-dimensional representation, denoted as a mapping function , where is the dimensionality of the embedding such that . The embedding associated with each node is learned such that, in this embedding space, nodes preserve some meaningful proximities to other nodes according to the given topology in the networks and . Given the learned embeddings for all of the nodes, , various downstream prediction tasks can be applied, such as graph reconstruction, visualization, clustering, link prediction, and node classification [13].

We aim to obtain a biologically meaningful embedding representation that simultaneously captures both the regulatory interactions and functional affinities between RNAs. In other words, we train the embeddings to fit both sets of edges from the networks and . To accomplish this, we propose the embedding space to have two components: source context and target context. That is, each embedding vector is represented as a concatenation of the ”source context”, , and ”target context”, . This embedding representation can simultaneously capture directed and undirected edges by the following definition of proximities:

First-order directed proximity to represent the directed regulatory interaction between node ’s source context and node ’s target context, with:

(1)

Second-order undirected proximity to represent the functional affinity between node and node , with:

(2)

The value of these proximities is the Euclidean distance, where the embeddings are selected such that if two nodes have a positive (directed or undirected) edge, its respective embeddings will be more similar, i.e. having a smaller distance. Otherwise, if two nodes have a negative or non-interactions, their embeddings should be more dissimilar, incurring a greater distance. Note, can take on a different value than .

Conceptually, the desired effects of applying both directed and undirected proximities using the source-context and target-context in the embedding space are two-fold. First, the embeddings can automatically possess the second-order directed proximity, where nodes having similar functional annotations (i.e., a high second-order undirected proximity) will also have a similar set of interacting partners (i.e., first-order directed proximities to other nodes). Likewise, the second-order undirected proximity between a pair of nodes having a similar set of directed interactions will also have a high functional affinity. These two desired effects are in line with the primary intuition that RNAs sharing the same interacting partners are more likely to be involved in many of the same functions. Furthermore, as each of two networks, and , are highly sparse and incomplete, training the complementary sets of edges and can help to connect the functional affinities between known RNA’s to novel RNA sequences. In the following sections, we demonstrate a methodology to simultaneously apply the two proximity definitions to characterize both the interactions and functional affinity to each RNA transcripts.

4.1 Representation Learning for RNA Sequences to Reconstruct the Interactions and Functional Topology

Fig. 2: The rna2na network embedding method utilizing Siamese architecture. Note, the convolutional recurrent network has only one sequence input, and one embedding output, while the siamese network takes in two sequence inputs and outputs one number for the (directed or undirected) proximity.

Aside from the interaction topology data, each RNA also has an associated transcript sequence, extracted into a one-hot vector representation denoted by , where is the length of the sequence. We propose the network embedding function

to be a deep neural network that maps RNA sequence input

to an embedding of dimension that preserves the proximities defined in Eq. 1, 2

. Motivated by its recent successes in facial recognition

[24] and speech modeling [25], we repurposed the Siamese network architecture as an interaction network embedding framework in our method rna2rna.

Originally proposed for signature verification [26]

, Siamese network is an architecture containing an identical pair of the same neural network which shares the same configuration and parameters. A pair of objects can be fed into the two subnetworks to be encoded, where its resulting embeddings can determine if the two objects are similar or dissimilar. Our goal is to decide the relationship between two RNA sequences by using a convolutional recurrent network to output a real-valued multi-dimensional vector that captures the hidden representation of RNA sequences. More specifically, the network learns to output the embeddings for a pair of RNA sequences, guided by edge

as the label that indicates whether the pair is functionally similar or dissimilar. For similar pairs of inputs, their embedding is expected to be closer in proximity, and with dissimilar pairs of inputs, their embeddings are to be farther in proximity. Additionally, for an interacting pair of RNAs, the directed edge would indicate whether RNA interacts with RNA using the corresponding directed proximity. In order for the output embeddings to preserve the proximities across all edges in both and

network topologies, we utilize the binary cross-entropy loss function

[27], defined as,

(3)

The network weights in

are trained with Stochastic Gradient Descent (SGD) with the standard back-propagation algorithm. Since the subnetworks yield two outputs and their weights are shared, the gradient is summed over the network processing input

and network processing input

. We utilized the RMSprop

[28] optimizer to train recurrent network model until convergence. At each SGD iteration, a batch of RNA nodes are sampled along with its associated sets of positive and negative, directed and undirected edges, described further in section 4.3.

4.2 Convolutional Recurrent Network to Obtain Embeddings from Variable-length RNA Sequences

The network inside the Siamese architecture (illustrated in Fig. 2

) encodes the variable-length RNA sequence inputs through a series of non-linear transformations. Each RNA sequence is transformed to a condensed word vector where each integer element indexes a A,C,T,G nucleotide. The lncRNA, miRNA, mRNA transcript sequences can vary widely in length, between 20 nt to a few thousand nt long. For the network to accept such input, the first layer is a 1-D convolutional and pooling layer that yields feature tensors with a timestep dimension that is proportional to the input sequence length. Then, these tensors are passed to the Bidirectional LSTM layer

[29], which outputs the fixed-size hidden states value after processing last timestep features, and is passed to the next fully-connected layers. With this architecture design, the network is not constrained to only RNA samples with a fixed sequence length that is specified at training time.

4.3 Model Optimization with Batch Sampling Strategy

At each training iteration, the model samples a set of training edges and train the neural network on the RNA sequences associated with these edges. The choice of sampling strategy is hugely important, as the computational complexity is a factor when determining the sampling method that gives the best estimation of the degree distribution in the ground-truth network. It is well-established that most biological interaction networks exhibit the scale-free topology property, apparent in the nodes’ degree distribution to follow a power-law degree distribution [30]. In other words, it was frequently observed that a few nodes may have a lot of interactions, and a lot of nodes may have very few interactions. If a set of edges were sampled uniformly, the sub-network would be biased toward nodes with a higher number of connections and nodes with a lower number of connections will be poorly represented. This problem is further exacerbated by the size imbalance between the interaction sets of well-studied RNA classes and newly-emerged RNA classes. Furthermore, as the size of the network grows, the training approach that iterates through all possible pairs of nodes may become quickly intractable. We instead implement a random node sampling strategy where we first randomly select a set of nodes, then train on the set of edges induced by this sub-graph. However, it was shown that sampling nodes uniformly at random does not retain the power-law distribution [31]

. Thus, we employed a biased sampling, where the probability of picking a node

is a function of its degree, . The probability function proposed by Riad et al. [32] is:

(4)

The sampling compression function is chosen to be the square-root function, where , to retain the power-law degree distribution while keeping the linear weighting as a ranking for the frequency of each node, .

When a batch of nodes is sampled without replacement from this distribution, each node and its set of positive edges is such that and . Also, its set of negative edges is where . In the case of undirected edges, both and are given, however, for directed edges, only are given. To obtain the negative directed edges, we then sample by adopting the approach of negative sampling as proposed in [33], where the ratio of negative edges to that of positive edges incident to each node is between and . To sample the set of nodes , we use from the distribution given in Eq. (4), normalized over nodes in . We allow the negative sampling ratio to be a free parameter to be tuned in our experiments.

Given , the sampled batch of nodes, and , the set of directed and undirected edges containing both positive and negative interactions incident to , we train the loss function with batch optimization with

(5)

where is the coefficient parameter to control the effect of the second-order undirected proximity.

4.4 Predicting Interaction or Functional Similarity Between Two RNAs

After training is complete, given two RNA sequence inputs and , the learned model can output the embeddings and , which is used to predict whether a relationship exists between them by computing the respective proximity. Either to predict the existence of an interaction or functional similarity, we use the proximity score or , respectively, and then compute a pairwise affinity using a Gaussian kernel:

In our experiments, we calculated all pairwise Euclidean distances and solved for given and . By fitting given the ground-truth edges and the predicted pairwise affinity matrix, this method can accurately approximate the distribution of interactions over the whole network.

5 Results

5.1 Large-scale Data Integration of lncRNA-miRNA-mRNA Interactions, Annotations, and Sequences

Interaction Training Sets Validation Sets
database Version # interactions # sources # targets Version # interactions # sources # targets
miRTarBase 6.0 [34] 377,318 1,618 miRNAs 14,666 mRNAs 7.0 [35] 64,749 12 miRNAs 702 mRNAs
DIANA-lncBase v2 [36] 53,926 631 miRNAs 2530 lncRNAs Predicted 337,031 0 miRNAs 0 lncRNAs
NPInter v2.0 [37] 85,335 12 lncRNAs 5023 mRNAs v3.0 [38] 123,054 499 lncRNAs 2346 mRNAs
lncRNA2Target v1.0 [39] 1308 79 lncRNAs 471 mRNAs v2.0 [40] 65,624 1037 lncRNAs 10,825 mRNAs
BioGRID v3.4[41] 313,724 13,318 mRNAs 19,429 mRNAs v3.5 [42] 33,522 178 mRNAs 178 mRNAs
TABLE I: Overview of interaction databases used for data selection, harmonization, and integration for prospective evaluations. The training sets are comprised from the interaction network of database versions released before 2015, while the validation sets are comprised of updates from the most recently released database versions. Note, the number source and target RNA nodes listed for validation set are from novel interactions only, which are disjoint from the training set.

We integrated various experimentally verified interaction databases to build a large-scale lncRNA-miRNA-mRNA interaction network. Additionally, various functional annotation, sequence, disease association, were also integrated to enable extraction of the undirected attribute affinity edges. To maximize the number of genes matched between the different databases, miRNA and mRNA transcripts are indexed by standard gene symbols specified by the MirBase [43] and HUGO Gene Nomenclature Community (HGNC)[44]. LncRNA transcripts are indexed by its Ensembl gene name provided by GENCODE Release 29 [45]. In total, there are 12725 lncRNAs, 1870 microRNAs, and 20284 mRNAs considered in this study, comprised of a comprehensive integration of the various databases illustrated in Table I.

To accomplish the primary task of predicting novel interactions not seen at training time, we propose an experimental setup using prospective evaluation. All models were trained exclusively using the prior version of each interactions databases. Then, we validate the link prediction model by using the set of new interactions from the latest database version update. This type of evaluation, rarely done in the literature, is extremely important as it allows us to mimic a realistic scenario where the task is to discover novel RNA-RNA interactions, based on our current knowledge.

5.1.1 Integration of multiple interaction databases.

In this section, we list the databases utilized for both training and prospective evaluation. In all databases, we selected only the human-specific regulatory RNA-RNA interactions and harmonized all miRNA, mRNA, and lncRNA gene names to standardized MirBase, HGNC, and GENCODE gene names.

  • microRNA-target interactions obtained from miRTarBase [35] database. For training, miRTarBase version 6.0 has a total of 377,318 interactions matched between 1618 microRNAs and 14,666 target mRNAs. For testing, version 7.0 has a total of 64,749 new interactions, of which includes interactions data for 12 novel miRNAs.

  • microRNA-lncRNA interactions obtained from experimentally verified databases DIANA-lncBase Experimental v2 [36]. There are a total of 53,926 matched interactions between 631 miRNAs and 2530 lncRNAs. Since this database does not have an updated version since the v2 release, we use the DIANA-lncBase Predicted module for evaluation and selected 337,031 interactions with a confidence score greater than .

  • ncRNA-RNA interactions from NPInter v2.0 [38], where we filtered only lncRNA-miRNA, lncRNA-mRNA, and miRNA-lncRNA interactions, which resulted in 85,355 interactions between 170 matching lncRNAs and 5023 mRNAs. NPInter v3.0 sharply increased in data and contained 123,486 new interactions between 499 novel lncRNAs and 2346 mRNAs.

  • lncRNA-mRNA interactions containing lncRNA-mRNA functional regulatory interaction from lncRNA2Target [40], where its latest version v2.0 contained a total of 65,655 interactions between 1037 lncRNAs and 28,866 genes, and its previous version v1.0 contains 1277 interactions between 79 lncRNAs and 471 mRNAs. Note that in this instance, lncRNA2Target v1.0 contained interaction data derived from microarray experiments while v2.0 is from high-throughput RNA-seq experiments, and there is no overlap between the two versions.

  • mRNA-mRNA interactions obtained from the BioGRID v3.4 database [41], which included more than 313,724 matched interactions among 19,429 mRNAs. For the validation set, BioGRID v3.5 contained 33,522 novel interactions that included 178 novel mRNAs.

After integration of the training databases, self-interactions and redundant interactions edges are removed, and only interactions between RNAs with an associated transcript sequence will be considered. In the validation databases, we selected only interactions that do not overlap with interactions from the training set.

5.1.2 Integration of annotation databases to extract undirected attribute affinity edges.

In this section, we outline the annotation databases utilized to provide functional attributes to individual RNAs. After all RNA-RNA pairwise functional affinities were computed, a number of undirected affinity edges were then added to the undirected interactions training set.

  • lncRNA annotations obtained from the GENCODE Release 28 [45] which contains the transcript biotype annotation. In addition, GO terms for 162 matched lncRNAs were obtained from RNAcentral [46] which aggregated data from NONCODE [47] and lncipedia [48]. Also, disease associations for 150 lncRNAs were obtained from the LncRNADisease database [49]. After computing the affinities for all lncRNA pairs and filtering second-order undirected affinities at a threshold, 65,864 undirected edges were added. With the negative sampling ratio set at per positive edge, a total of 329,320 negative edges were added to the undirected edges training set.

  • microRNA annotations containing miRNA family classified from its seed regions were obtained from the TargetScan Release 7.2 (March 2018) [50]. RNA structure family annotation obtained from Rfam 13.0 [51] and GO terms from RNAcentral were also included. In addition, disease associations for 553 miRNAs were obtained from HMDD miRNA-disease database [52]. After computing the affinities for all miRNA pairs and filtering at a threshold, 405 positive edges were added. With the negative sampling ratio set at per positive edge, 2025 negative edges were included.

  • mRNA annotations were obtained from the GENCODE Release 28 [45], and gene annotations were obtained from the HUGO Gene Names database [53].In addition, disease associations for 7577 mRNA genes were obtained from DisGeNet [54]. After computing the affinities for all mRNA pairs and filtering at a threshold, 362,362 edges were added. At the negative sampling ratio of per positive edge, 724,724 negative edges were included.

5.1.3 Preprocessing of RNA Transcript Sequences.

We collected genome-wide human reference lncRNA and mRNA primary sequences from GENCODE Release 29 [45] and miRNA hairpin sequences from miRBase [43]. A transcript is indexed by their gene name, however, many lncRNAs can encode multiple transcript isoform variants. Since some interaction databases do not identify the specific transcript when referring to the interacting lncRNA, at every training iteration, we uniformly sample from the set of different RNA isoforms for each RNA gene.

In order to speed up the training of the interactions between RNA sequence pairs in the convolutional network, we use batch optimization across multiple GPUs. Due to memory limitations, at training time, we must pad all sequences within the same training batch to one maximum length. The max length was chosen to be 8000 nt long, which would fully represent more than 95% of the longest RNA transcript variants without needing to truncate. For the RNA transcripts exceeding 8000 nt in length, it is important for the network to learn motifs from all regions of the sequence in order to generalize to very long RNA sequences. For this purpose, at each SGD iteration where sequences are sampled, we implement a strategy where a long RNA sequence is truncated either from the first portion or the last portion at random. After training, our model can then process variable-length sequences of any arbitrary length at prediction time. From empirical results, we found this technique does not diminish prediction performance but allows the model to generalize to any RNA sequence length at test time.

Fig. 3: Precision-Recall Curve in Graph Reconstruction evaluations to the ground-truth training set of various interaction databases.

5.2 Comparison Methods

Our experiments include comparative analysis in different evaluation tasks with existing state-of-the-art methods in both varieties of network-based and sequence-based embedding methods. A brief description of the different methods considered are:

  • node2vec [12]: A random walk-based method which preserves higher-order proximity as well as the community structure and structural equivalence between nodes.

  • LINE [11]

    : A method which jointly optimizes two functions for first- and second-order proximities by minimizing the Kullback-Leibler divergence between predicted joint probability distribution for each pair of vertices and the given distribution of training edges.

  • HOPE [55]: A method which uses a two-part embedding to reconstruct a directed adjacency matrix, while preserving asymmetric transitive proximity.

  • SDNE [56]

    : An autoencoder-based method that preserves neighborhood proximity between nodes given the network topology.

  • BioVec [17]

    : A word2vec-based model which learns a distributed representation of individual RNA nucleotide sequences by training from a corpus of 3-mers in each sequence. The corpuses of k-mers were calculated separately for each RNA classes of lncRNAs, miRNAs, and mRNAs.

In the following experiments, each method were assessed by learning a 128-dimensional embedding representation from the training network. All other free parameters are set according to the default value mentioned in the method’s respective papers.

5.3 Graph Reconstruction.

To assess whether the given methods can efficiently embed nodes from the network to a low-dimensional space while preserving all interactions, we evaluate whether each method can accurately reconstruct the original adjacency matrix of the network from the training set. After training each method on the training set of interaction databases, all pairwise proximities between the node’s embeddings were computed to reconstruct an estimated adjacency matrix. Then, we compute the Precision-Recall Curve of predicted (positive) interactions by validation with the ground-truth positive interactions from the training set. In addition to measuring the recall of the positive interactions, we also sample for non-positive interactions to measure the precision rate, which penalizes for false-positives. Candidate sampling was utilized, where for each node, a number of non-positive interactions are sampled in proportion to the number of positive interactions, where the ratio is

. Random edges were sampled according to the non-uniform distribution given by Eq. (4) to generate non-interactions for each node, while preventing accidental hits of existing positive interactions. Fig.

3 shows a precision-recall curve comparison analysis for the training set across different interaction databases. Additionally, Fig. 6 shows the power-law degree distribution of the reconstructed network for each method. Since it is also important for the predicted network to preserve the scale-free topology property from the original interaction network, this result shows rna2rna can reconstruct an interaction network with a power-law degree fit score approximately matching that of the ground-truth training network.

5.4 Novel Link Predictions.

Fig. 4: Precision-Recall Curves for Link Prediction. For each database, each line indicates the Precision-Recall evaluation of a network embedding method to the set of novel ground-truth interactions in that database.

To evaluate the predictive performance of our model at inferring novel RNA-RNA interactions not seen at training time, we perform a prospective evaluation on the validation set containing future version databases. We compose our training set for this prediction task by a union of all ground-truth interactions set from the miRTarBase 6.0, lncBase v2, NPInter v2.0, lncRNA2Target v1.0, and BioGRID v3.4 databases. All methods then train its model on this combined network. The undirected functional affinity edges were not included in the training data for any methods besides rna2rna. After the models have been trained, its estimated adjacency matrix is computed and evaluated on the novel interactions from miRTarBase v7.0, lncBase predicted module, NPInter v3.0, lncRNA2Target v2.0, and BioGRID v3.5 databases separately. The set of interactions from the validation set is entirely disjoint from the training set. For a test to differentiate between positive interactions and random noise interactions, we also uniformly sample a number of interactions from the set of all possible pairwise interactions to consider as negative interaction. To do this, we sample from the distribution defined by , where is from Eq. (4), and are the set of source and target RNA nodes respective of the database. This set is denoted as , and the number of negative interactions is sampled such that the ratio of negative to positive interactions is . At evaluation time, the set of ground truth validation edges and random noise

edges is used to calculate the precision and recall rates. The true positives are the correctly predicted true interactions, and the false positives are the predicted interactions that are present in the random noise

interactions set. Since most network embedding algorithms can yield a predicted probability of the connection, we show the precision-recall curve to evaluate the precision rate at different thresholds of the probability prediction. An area under the precision-recall curve (AUPR) is used to give a single number indicating the performance of the classifier, which is a good criterion considering it punishes for false-positive predictions. Fig. 4 highlights the comparison analysis across five different interaction databases.

In the comparison analysis, all methods were evaluated on the same set of positive and sampled negative interactions. It can be observed that LINE and SDNE do not tend to perform well in this heterogeneous lncRNA-miRNA-mRNA interactions network. For the performance evaluation of predicting the BioGRID mRNA-mRNA gene regulatory interaction set, the interactions between lncRNA and miRNA to mRNA are removed from consideration. It was observed that rna2rna also achieved a superior result in this subnetwork.

5.4.1 Inductive Link Prediction to Novel RNA Sequences

In addition to evaluation of predicting missing edges between connected nodes present in the training set, it is also important for our model to infer interactions for novel lncRNAs from sequence. We evaluated the link prediction performance for novel RNA sequences not present in the training set. In this experiment, there are 47 novel lncRNAs with interactions in the validation set that is not present in the training set. We attempt to recall these true interactions only from processing its RNA sequence input and computing their associated interaction to all existing miRNAs and mRNAs. We follow the same procedure proposed above to sample for random negative interactions. Since our method is the only method that can yield an embedding given a novel RNA sequence, the methods node2vec, LINE, HOPE, and SDNE cannot predict from this evaluation as their link prediction is transductive and constrained to only nodes in the largest connected component. After holding out 47 lncRNAs and attempting to recall 3086 its associated true interactions, our method has achieved an average precision score of , shown in Fig 5. Note that among the 3086 true interactions, which includes lncRNA-miRNA and lncRNA-mRNA interactions, they are comprised of interaction set from the LncRNA2Target v2.0 and NPInter v3.0 updates.

Fig. 5: Comparison results at link prediction for inductive inference to 47 novel lncRNA sequences not seen at training time.
Fig. 6: Comparison analysis of the power-law degree distribution fit score across multiple RNA-RNA interactions predicted by each methods. The ”Databases” bars indicate the scale-free topology fit score of the network composed by the ground-truth edges from lncBase, NPInter, lncRNA2Target, miRTaRBase, and BioGRID, respectively.

5.5 Inferring Functional Similarity From Embeddings

The source-target embedding is not only effective at encoding directed RNA-RNA interactions, but it can also capture the undirected functional affinity of RNAs. Since a pair of functionally similar RNAs would have a small Euclidean distance in the embedding space, we can expect a cluster of RNAs to have the same biological functions. To evaluate whether a network embedding method can effectively identify functional similarity, we performed K-means clustering on the learned embeddings and compared the predicted node’s clusters to the ground-truth RNA annotations. The ground-truth annotations used for evaluation are RNA functional family

[51], and RNA locus type annotations (e.g., sense intronic lncRNAs, lincRNAs, miRNAs, protein-coding, etc.). If an RNA is known to belong in more than one functional family, we select only the first annotation and discard the rest.

In comparison analysis, we first obtained the embeddings from each of the methods and performed K-Means clustering only on the nodes that have an associated functional annotation. The number of clusters in K-Means is the same as the total number of unique labels in a particular annotation. The evaluation measures used are Homogeneity (higher if nodes are of the same type in each cluster), Completeness (higher if all nodes of the same type are only in one cluster), and Normalized Mutual Information (a mean of the two previous scores). The clustering result of different methods are compared over the RNA family and RNA type annotations in Table II and Table III. The result shows that although there is a greater number of RNA nodes to assign to clusters, rna2rna embeddings can achieve the highest NMI score over the RNA functional family annotations. In Table III, other methods besides BioVec achieved a lower score, because the local structure of the interaction topology typically contains a mixture of RNA biotypes.

5.5.1 Training on Interactions Alone Can Reveal RNA Functional Similarity

Here, we test our hypothesis about whether two RNAs are functionally similar if they share the same interacting targets and interacting sources. Since two nodes would have similar embeddings if they have the same set of interacting partners, we can investigate whether training the embeddings from directed interactions alone can produce an embedding that effectively preserves the undirected functional affinity between RNAs. In this evaluation, we trained a rna2rna model on the directed RNA-RNA interaction edges only, while excluding the functional affinities edges. Results in Table III shows that despite holding out functional annotation information, the resulting RNA embeddings can still approximately preserve cluster structures when compared to ground-truth RNA type annotations.

Method Homogeneity Completeness NMI # nodes
node2vec 0.641 0.602 0.621 11735
LINE 0.689 0.614 0.650 11735
HOPE 0.525 0.571 0.570 11735
SDNE 0.613 0.588 0.600 11735
BioVec 0.376 0.467 0.417 14311
rna2rna* 0.508 0.530 0.519 14312
rna2rna 0.685 0.620 0.651 14312

rna2rna* denotes the model trained on the directed interactions data alone, without the undirected functional affinity information.

TABLE II: Clustering Comparison Over 2343 Ground-Truth RNA Functional Family Annotations.
Method Homogeneity Completeness NMI # nodes
node2vec 0.147 0.089 0.111 23940
LINE 0.268 0.158 0.199 23940
HOPE 0.109 0.111 0.110 23940
SDNE 0.079 0.076 0.078 23940
BioVec 0.391 0.298 0.338 32707
rna2rna* 0.178 0.138 0.155 32530
rna2rna 0.355 0.235 0.283 32530

rna2rna* denotes the model trained on the directed interactions data alone, without the undirected functional affinity information.

TABLE III: Clustering Comparison Over 24 Ground-Truth RNA Locus Type Annotations.
Genes lncRNAs KEGG Term Overlap P-value
ZNF177,ZNF175,ZNF607,ZNF606,ZNF72… AC022150.4 Herpes simplex virus 1 infection 269/492 2.47e-323
OR7G2,OR8I2,OR7G1,OR9K2,OR11H1,OR… AC131571.1,LINC00892,… Olfactory transduction 350/444 2.47e-323
HIST2H2AA3,HIST2H2AB,HIST1H2AE,HI… Systemic lupus erythematosus 53/133 6.935e-77
NOTCH1,HDAC1,CUL2,CBL,HIF1A,EGFR,… Pathways in cancer 25/530 2.667e-15
ZNF331,ZNF550,ZNF324B,ZNF490,ZNF7… Herpes simplex virus 1 infection 17/492 4.008e-14
GSK3B,HDAC2,PTGER3,PTEN,MAPK8,ERB… LINC00598 Pathways in cancer 25/530 1.138e-13
CHRND,PTGIR,EDNRB,MTNR1B,LPAR6,CH… UCA1 Neuroactive ligand-receptor inter… 12/338 9.679e-13
RPS15,RPS27,RPS16,RPL31,RPS6,RPL3… Ribosome 11/153 1.721e-10
IKBKB,RB1,CDKN1A,SHC1,CTBP1,CDK4,… Chronic myeloid leukemia 11/76 4.862e-10
IFNA5,IFNA7,IFNA14,IFNA1,IFNA2,IFNA8 Autoimmune thyroid disease 6/53 2.887e-09
RPS28,RPL21,RPS18,RPL11,RPL10L,RP… Ribosome 10/153 1.479e-07
PCNA,YWHAQ,CDKN2A,RAD21,MYC,CDK2… Cell cycle 10/124 1.551e-07
NFKBIA,PSMD14,PSMC3,PSMD4,PSMC4,P… Epstein-Barr virus infection 11/201 2.237e-07
KIR2DS4,KIR2DL1,KIR3DL3,KIR2DL3 Z99756.1,LINC02346,… Antigen processing and presentation 4/77 2.625e-07
MAPK9,PRKAB2,PRKAA1,PRKAA2,MAP3K1… Tight junction 10/170 4.690e-07
P2RY6,ADORA2A,P2RY2,NMUR1,DRD2,DRD4 Neuroactive ligand-receptor inter… 6/338 9.553e-07
CHRNG,HTR1E,PTGER1,KISS1R,NMBR,SSTR5 TMEM202-AS1, AL355297.4 Neuroactive ligand-receptor inter… 6/338 2.357e-06
MYH1,MYH2,MYH8,MYH4 Tight junction 4/170 2.404e-06
P2RY4,TAAR6,C5AR1,HTR5A,TRHR AC008125.1 Neuroactive ligand-receptor inter… 5/338 9.616e-06
NCOA1,MED1,CCND1,SIN3A,NCOA3,PLCG… Thyroid hormone signaling pathway 7/116 1.199e-05
RPS9,RPS5,MRPS11,RPL22,RPS3,RPL38… Ribosome 8/153 1.232e-05
ABCA3,ABCA4,ABCC10 ABC transporters 3/45 1.594e-05
GALR2,GCGR,MLNR,ADRA1B,NTSR2 Neuroactive ligand-receptor inter… 5/338 3.511e-05
PARD6B,PRKCI,CTTN,ACTN1,PRKCE,RAP… Tight junction 6/170 5.527e-05
LAMA5,TNXB,AGRN ECM-receptor interaction 3/82 6.231e-05
B3GALT6,CHST14 Glycosaminoglycan biosynthesis 2/53 8.240e-05
BHMT,GNMT Glycine, serine and threonine met… 2/40 1.165e-04
TIAM1,CAMK2D,CAMK2A,KDR,CD44,VAV2 Proteoglycans in cancer 6/201 1.457e-04
AOC3,PAH LINC01940, ALDH1L1-AS2 Phenylalanine metabolism 2/17 2.195e-04
NAGA,A4GALT Glycosphingolipid biosynthesis 2/45 2.957e-04
TABLE IV: Gene set enrichment analysis over 2000 k-mean clusters. Each row indicates the highest enriched functional term for a given cluster gene set comprised of mRNAs, miRNAs, and lncRNAs.

5.5.2 Clustering of RNA Embeddings Reveal Highly Enriched Gene Sets

Since rna2rna embeddings have demonstrated functionally similarity in the experiments above, an important next step is to assign putative biological functions to novel lncRNAs. To do this, we perform gene set enrichment analysis on clusters of RNAs, select the cluster with the highest enriched functional term, then associate the lncRNAs belonging in this cluster with this term. In this experiment, the embeddings are trained from both training set and validation set, which includes all known functional interactions and Gene Ontology annotation terms associated with the lncRNAs, miRNAs, and mRNAs. We performed k-means clustering over the embeddings of 32,741 different RNAs, where the number of clusters is 2000. We then performed enrichment analysis on these 2000 clusters using Enrichr [57] over the KEGG Human 2019 [58] terms, which includes both functional and disease pathways. Some of the highest enriched clusters are shown in Table IV. Among the 2000 clusters, 559 have an adjusted P-value of less than , and 139 have an adjusted P-value of less than . Interestingly, the highest scoring gene sets often contain some lncRNAs not previously associated with these functional terms. It warrants additional experimental studies to verify the functional associations of these lncRNAs.

5.5.3 Learned Projection of RNA Embeddings Demonstrates an Organized Distribution

We further visualize the learned 128-dimensional embedding to 2-dimensional space using t-SNE [59]. It can be expected that the RNA nodes in this manifold can preserve the local structure of the interactions and functional annotations, as well as exhibit good separation based on their transcript biotype classification. In Fig. 8, nodes are colored based on RNA biotype, and only the 5000 top-scoring interaction edges are shown to increase visibility. It is observed that the microRNAs are well separated from the rest of the nodes, but the mRNAs and lncRNAs may have some overlap, which is expected since the sequence structure of these two RNA classes is similar. In comparison to other methods, rna2rna can map a much higher number of lncRNAs and a more extensive variety of different RNA classes to the embedding representation.

5.6 Subnetwork of LncRNAs Shows Promising Novel Function Interactions

We visualize sub-networks of some well-studied lncRNAs including HOTAIR, GAS5, H19, CASC15, SNHG1, SNHG5, PINK1-AS, UCA1, XIST, and ZFAS1. Each of these subnetworks contains ground-truth interactions between the lncRNA and its miRNA and mRNA interacting neighbors, while predicted interactions are highlighted in red. An illustration of the HOTAIR subnetwork is shown in Fig. 7, while others are in the Supplementary Materials. For other well-studied lncRNAs such as H19, GAS5, and SNHG1, the number of interacting partners reached to nearly one thousand, so we selected only interactions supported by two or more databases for better visibility. In each visualization, the placement of the nodes is determined from the force-directed layout of the subnetwork, with the exception of the HOTAIR subgraph, which obtained node placements from the t-SNE transform. In SNHG5, CASC15, and HOTAIR, it was interesting that they are accompanied by another lncRNA partner. This partner has an interaction to/from the main lncRNA, and also shares some of the same neighbors that the lncRNA is connected to.

Fig. 7: HOTAIR predicted interaction subnetwork. Nodes placement are determined based on the learned network embeddings from our method, and are colored whether they are lncRNA, miRNA, and mRNA. Blue lines represent ground-truth directed regulatory interactions. Red lines represent the top-25 predicted interactions.
(a) node2vec (b) LINE
(c) HOPE (d) SDNE
(e) BioVec (f) rna2rna
Fig. 8: Visualization of the lncRNA-miRNA-mRNA regulatory interaction network across different methods. RNA nodes are mapped to a 2-D projection using t-SNE from the learned 128-D embeddings. Color of a node indicates the RNA locus type. Lines in the network indicate the top-5000 interactions predicted by each node.

6 Discussions

In this study, we have proposed a method to encode the heterogeneous lncRNA-miRNA-mRNA interaction network, being the union of lncRNA-miRNA, miRNA-lncRNA, lncRNA-mRNA, miRNA-mRNA, and mRNA-mRNA interactions databases. With the framework we have developed, heterogeneous annotation data, as well as functional interaction data, are integrated to enable characterization of RNA sequences using an embedding representation. While this method of integrating different functional annotation sources is simple, its purpose is to allow for characterizing the functional affinities for an extensive number of RNAs, even among sparsely annotated ones. While very few lncRNAs have been annotated for all of its attributes, especially functional annotation or disease association attributes, most have already been annotated with the basic transcript biotype and a transcript sequence. Since we did not constrain the calculations to only RNAs that have all non-empty annotations, we can utilize a less stringent affinity scoring method where a similarity measure between sparsely annotated RNAs can be calculated.

To the best of our knowledge, the first-order proximities we proposed involving source-target embeddings is novel compared to the first-order proximity proposed in any other network embedding methods. It has a direct purpose at the task of modeling the regulatory interactions between biological entities. Considering the node2vec, LINE, and SDNE methods that model the first-order proximity without considering the direction of the edge, the directed regulatory interactions may produce the embeddings to represent functional similarity incorrectly. For instance, suppose there exists a directed edge to represent microRNA targeting mRNA . If we model undirected first-order proximity, the resulting embedding representation and would be selected to be similar. This would be misleading because although we know microRNA targets mRNA , mRNA does not target microRNA , they belong to different classes of RNA transcripts, and is unlikely to be involved in the same biological functions. By modeling each node’s embedding representation with both and separately, we can conceptualize a representation for a biological entity by modeling its functional targeting information and receptive field information.

In the prospective evaluations of recalling the interactions from a future version of various databases, it was shown our method could achieve comparable, and in some cases, superior performance, with other state-of-the-art methods. It is important to note that rna2rna was able to achieve this accuracy even when predicting more interactions over a more extensive range of RNA nodes since it can obtain embeddings for 32530 unique interacting lncRNA, miRNA, and mRNA nodes. In other methods, only the interactions among a subset of 17905 RNA nodes were considered for link prediction analysis. This is because most other network embedding methods typically only consider the nodes within the largest connected component of the network, while rna2rna can provide a functionally consistent embedding for all nodes in the network that’s associated with an RNA sequence. Since it can also handle sequences of various length, rna2rna can provide this mapping for a wide range of RNA transcripts of different structures.

Additionally, since our method was able to map the functional affinity between RNA nodes belonging in disconnected components in the interaction topology, we hypothesize rna2rna could effectively map individual RNA’s to a functional manifold in the embedding representation. It is observed in the t-SNE visualization of the embeddings in Fig. 8(f) that there is a clear separation between miRNAs, lncRNAs, and mRNAs, albeit overlaps between lncRNAs and mRNAs. It needs to be noted that although no negative undirected edges between RNAs of different locus types (e.g., lncRNAs v.s. miRNAs) were sampled to explicitly indicate different RNA types to have dissimilar embeddings, the network can still make a distinction between their functional roles. This shows that rna2rna’s source-target embedding is an effective representation that can encode an RNA’s biological function only by its given directed interactions.

7 Conclusions and Future Work

Our main contribution proposes a highly versatile architecture aimed at predicting interactions between heterogeneous RNA transcripts while characterizing the functional landscape of non-coding RNAs. Although rna2rna have demonstrated promising performance at various interaction prediction and clustering tasks in experimental results, we believe further improvements to the framework can help it achieve even better performance and usability. Firstly, it cannot be easy to identify the specific binding motif from the learned convolutional filters for a given RNA-RNA functional interaction. We can consider further implementation of an attention-based network architecture [60] for future works. Additionally, rna2rna’s calculation of the RNA-RNA functional affinity using the Dice distance can be improved, as it simply counts the number of matching functional terms a pair of RNA shares. In this aspect, we plan to apply a method that can calculate a semantic functional similarity, even between non-matching terms. Moreover, while rna2rna was designed to tackle the task of broadening the general knowledge in the human non-coding transcriptome, we also look forward to a modification of the model to allow analysis of the interactome within a specific biological context such as a tissue type or disease condition. Toward this end, we can integrate RNA expression data to identify specific RNA-RNA interacting pairs within a sample cohort.

In conclusion, we intend this method to be the groundwork for further down-stream analysis tasks, where various other downstream genomic prediction tasks such as prediction of gene annotation, gene-disease association, and discovery of unknown gene cluster families can be readily applicable by directly processing the learned embeddings. Further works to this framework can provide an invaluable tool to support significant discoveries in systems biology, especially for newly identified lncRNAs.

Acknowledgments

The authors are grateful for the GAANN support from the Department of Education under the grant P200A150192.

References

  • [1] M. Kitagawa, K. Kitagawa, Y. Kotake, H. Niida, and T. Ohhata, “Cell cycle regulation by long non-coding rnas,” Cellular and molecular life sciences, vol. 70, no. 24, pp. 4785–4794, 2013.
  • [2] S. Geisler and J. Coller, “Rna in unexpected places: long non-coding rna functions in diverse cellular contexts,” Nature reviews Molecular cell biology, vol. 14, no. 11, p. 699, 2013.
  • [3] A. Fatica and I. Bozzoni, “Long non-coding rnas: new players in cell differentiation and development,” Nature Reviews Genetics, vol. 15, no. 1, p. 7, 2014.
  • [4] J.-H. Yoon, K. Abdelmohsen, and M. Gorospe, “Functional interactions among micrornas and long noncoding rnas,” in Seminars in cell & developmental biology, vol. 34.   Elsevier, 2014, pp. 9–14.
  • [5] E. Leucci, F. Patella, J. Waage, K. Holmstrøm, M. Lindow, B. Porse, S. Kauppinen, and A. H. Lund, “microrna-9 targets the long non-coding rna malat1 for degradation in the nucleus,” Scientific reports, vol. 3, p. 2535, 2013.
  • [6] B. K. Dey, K. Pfeifer, and A. Dutta, “The h19 long noncoding rna gives rise to micrornas mir-675-3p and mir-675-5p to promote skeletal muscle differentiation and regeneration,” Genes & development, vol. 28, no. 5, pp. 491–501, 2014.
  • [7] P. P. Amaral, M. B. Clark, D. K. Gascoigne, M. E. Dinger, and J. S. Mattick, “lncrnadb: a reference database for long noncoding rnas,” Nucleic acids research, vol. 39, no. suppl_1, pp. D146–D151, 2010.
  • [8] T. Derrien, R. Johnson, G. Bussotti, A. Tanzer, S. Djebali, H. Tilgner, G. Guernec, D. Martin, A. Merkel, D. G. Knowles et al., “The gencode v7 catalog of human long noncoding rnas: analysis of their gene structure, evolution, and expression,” Genome research, vol. 22, no. 9, pp. 1775–1789, 2012.
  • [9] H. N. Chua, W.-K. Sung, and L. Wong, “Exploiting indirect neighbours and topological weight to predict protein function from protein–protein interactions,” Bioinformatics, vol. 22, no. 13, pp. 1623–1630, 2006.
  • [10] U. Karaoz, T. Murali, S. Letovsky, Y. Zheng, C. Ding, C. R. Cantor, and S. Kasif, “Whole-genome annotation by using evidence integration in functional-linkage networks,” Proceedings of the National Academy of Sciences, vol. 101, no. 9, pp. 2888–2893, 2004.
  • [11] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web.   International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
  • [12] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2016, pp. 855–864.
  • [13] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” arXiv preprint arXiv:1705.02801, 2017.
  • [14] X. Chen, C. C. Yan, C. Luo, W. Ji, Y. Zhang, and Q. Dai, “Constructing lncrna functional similarity network based on lncrna-disease associations and disease semantic similarity,” Scientific reports, vol. 5, p. 11338, 2015.
  • [15] K. Kishan, R. Li, F. Cui, and A. Haake, “Gne: A deep learning framework for gene network inference by aggregating biological information,” bioRxiv, p. 300996, 2018.
  • [16] H. Cho, B. Berger, and J. Peng, “Diffusion component analysis: unraveling functional topology in biological networks,” in International Conference on Research in Computational Molecular Biology.   Springer, 2015, pp. 62–64.
  • [17] E. Asgari and M. R. Mofrad, “Continuous distributed representation of biological sequences for deep proteomics and genomics,” PloS one, vol. 10, no. 11, p. e0141287, 2015.
  • [18] D. Quang and X. Xie, “Danq: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences,” Nucleic acids research, vol. 44, no. 11, pp. e107–e107, 2016.
  • [19]

    X. Pan and H.-B. Shen, “Learning distributed representations of rna sequences and its application for predicting rna-protein binding sites with a convolutional neural network,”

    Neurocomputing, vol. 305, pp. 51–58, 2018.
  • [20] T. Fukunaga and M. Hamada, “Riblast: an ultrafast rna–rna interaction prediction system based on a seed-and-extension approach,” Bioinformatics, vol. 33, no. 17, pp. 2666–2674, 2017.
  • [21] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.
  • [22] J. C. Gower, “A general coefficient of similarity and some of its properties,” Biometrics, pp. 857–871, 1971.
  • [23] T. F. Smith and M. S. Waterman, “Comparison of biosequences,” Advances in applied mathematics, vol. 2, no. 4, pp. 482–489, 1981.
  • [24] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , 2015, pp. 815–823.
  • [25] G. Synnaeve, T. Schatz, and E. Dupoux, “Phonetics embedding learning with side information,” in Spoken Language Technology Workshop (SLT), 2014 IEEE.   IEEE, 2014, pp. 106–111.
  • [26] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a” siamese” time delay neural network,” in Advances in neural information processing systems, 1994, pp. 737–744.
  • [27] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, vol. 2, 2015.
  • [28] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
  • [29]

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,”

    Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [30] R. Albert, “Scale-free networks in cell biology,” Journal of cell science, vol. 118, no. 21, pp. 4947–4957, 2005.
  • [31] M. P. Stumpf, C. Wiuf, and R. M. May, “Subnets of scale-free networks are not scale-free: sampling properties of networks,” Proceedings of the National Academy of Sciences, vol. 102, no. 12, pp. 4221–4224, 2005.
  • [32] R. Riad, C. Dancette, J. Karadayi, N. Zeghidour, T. Schatz, and E. Dupoux, “Sampling strategies in siamese networks for unsupervised speech representation learning,” arXiv preprint arXiv:1804.11297, 2018.
  • [33] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  • [34] C.-H. Chou, N.-W. Chang, S. Shrestha, S.-D. Hsu, Y.-L. Lin, W.-H. Lee, C.-D. Yang, H.-C. Hong, T.-Y. Wei, S.-J. Tu et al., “mirtarbase 2016: updates to the experimentally validated mirna-target interactions database,” Nucleic acids research, vol. 44, no. D1, pp. D239–D247, 2015.
  • [35] C.-H. Chou, S. Shrestha, C.-D. Yang, N.-W. Chang, Y.-L. Lin, K.-W. Liao, W.-C. Huang, T.-H. Sun, S.-J. Tu, W.-H. Lee et al., “mirtarbase update 2018: a resource for experimentally validated microrna-target interactions,” Nucleic acids research, vol. 46, no. D1, pp. D296–D302, 2017.
  • [36] M. D. Paraskevopoulou, I. S. Vlachos, D. Karagkouni, G. Georgakilas, I. Kanellos, T. Vergoulis, K. Zagganas, P. Tsanakas, E. Floros, T. Dalamagas et al., “Diana-lncbase v2: indexing microrna targets on non-coding transcripts,” Nucleic acids research, vol. 44, no. D1, pp. D231–D238, 2015.
  • [37] J. Yuan, W. Wu, C. Xie, G. Zhao, Y. Zhao, and R. Chen, “Npinter v2. 0: an updated database of ncrna interactions,” Nucleic acids research, vol. 42, no. D1, pp. D104–D108, 2013.
  • [38] Y. Hao, W. Wu, H. Li, J. Yuan, J. Luo, Y. Zhao, and R. Chen, “Npinter v3. 0: an upgraded database of noncoding rna-associated interactions,” Database, vol. 2016, 2016.
  • [39] Q. Jiang, J. Wang, X. Wu, R. Ma, T. Zhang, S. Jin, Z. Han, R. Tan, J. Peng, G. Liu et al., “Lncrna2target: a database for differentially expressed genes after lncrna knockdown or overexpression,” Nucleic acids research, vol. 43, no. D1, pp. D193–D196, 2014.
  • [40] L. Cheng, P. Wang, R. Tian, S. Wang, Q. Guo, M. Luo, W. Zhou, G. Liu, H. Jiang, and Q. Jiang, “Lncrna2target v2. 0: a comprehensive database for target genes of lncrnas in human and mouse,” Nucleic acids research, vol. 47, no. D1, pp. D140–D144, 2018.
  • [41] A. Chatr-Aryamontri, R. Oughtred, L. Boucher, J. Rust, C. Chang, N. K. Kolas, L. O’Donnell, S. Oster, C. Theesfeld, A. Sellam et al., “The biogrid interaction database: 2017 update,” Nucleic acids research, vol. 45, no. D1, pp. D369–D379, 2017.
  • [42] R. Oughtred, C. Stark, B.-J. Breitkreutz, J. Rust, L. Boucher, C. Chang, N. Kolas, L. O’Donnell, G. Leung, R. McAdam, F. Zhang, S. Dolma, A. Willems, J. Coulombe-Huntington, A. Chatr-aryamontri, K. Dolinski, and M. Tyers, “The BioGRID interaction database: 2019 update,” Nucleic Acids Research, vol. 47, no. D1, pp. D529–D541, 11 2018. [Online]. Available: https://doi.org/10.1093/nar/gky1079
  • [43] S. Griffiths-Jones, R. J. Grocock, S. Van Dongen, A. Bateman, and A. J. Enright, “mirbase: microrna sequences, targets and gene nomenclature,” Nucleic acids research, vol. 34, no. suppl_1, pp. D140–D144, 2006.
  • [44] S. Povey, R. Lovering, E. Bruford, M. Wright, M. Lush, and H. Wain, “The hugo gene nomenclature committee (hgnc),” Human genetics, vol. 109, no. 6, pp. 678–680, 2001.
  • [45] J. Harrow, A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B. L. Aken, D. Barrell, A. Zadissa, S. Searle et al., “Gencode: the reference human genome annotation for the encode project,” Genome research, vol. 22, no. 9, pp. 1760–1774, 2012.
  • [46] R. Consortium, “Rnacentral: a comprehensive database of non-coding rna sequences,” Nucleic acids research, p. gkw1008, 2016.
  • [47] D. Bu, K. Yu, S. Sun, C. Xie, G. Skogerbø, R. Miao, H. Xiao, Q. Liao, H. Luo, G. Zhao et al., “Noncode v3. 0: integrative annotation of long noncoding rnas,” Nucleic acids research, vol. 40, no. D1, pp. D210–D215, 2011.
  • [48] P.-J. Volders, K. Helsens, X. Wang, B. Menten, L. Martens, K. Gevaert, J. Vandesompele, and P. Mestdagh, “Lncipedia: a database for annotated human lncrna transcript sequences and structures,” Nucleic acids research, vol. 41, no. D1, pp. D246–D251, 2012.
  • [49] G. Chen, Z. Wang, D. Wang, C. Qiu, M. Liu, X. Chen, Q. Zhang, G. Yan, and Q. Cui, “Lncrnadisease: a database for long-non-coding rna-associated diseases,” Nucleic acids research, vol. 41, no. D1, pp. D983–D986, 2012.
  • [50] V. Agarwal, G. W. Bell, J.-W. Nam, and D. P. Bartel, “Predicting effective microrna target sites in mammalian mrnas,” elife, vol. 4, p. e05005, 2015.
  • [51] I. Kalvari, J. Argasinska, N. Quinones-Olvera, E. P. Nawrocki, E. Rivas, S. R. Eddy, A. Bateman, R. D. Finn, and A. I. Petrov, “Rfam 13.0: shifting to a genome-centric resource for non-coding rna families,” Nucleic acids research, vol. 46, no. D1, pp. D335–D342, 2017.
  • [52] Z. Huang, J. Shi, Y. Gao, C. Cui, S. Zhang, J. Li, Y. Zhou, and Q. Cui, “Hmdd v3. 0: a database for experimentally supported human microrna–disease associations,” Nucleic acids research, 2018.
  • [53] T. A. Eyre, F. Ducluzeau, T. P. Sneddon, S. Povey, E. A. Bruford, and M. J. Lush, “The hugo gene nomenclature database, 2006 updates,” Nucleic acids research, vol. 34, no. suppl_1, pp. D319–D321, 2006.
  • [54] J. Piñero, À. Bravo, N. Queralt-Rosinach, A. Gutiérrez-Sacristán, J. Deu-Pons, E. Centeno, J. García-García, F. Sanz, and L. I. Furlong, “Disgenet: a comprehensive platform integrating information on human disease-associated genes and variants,” Nucleic acids research, p. gkw943, 2016.
  • [55] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2016, pp. 1105–1114.
  • [56] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2016, pp. 1225–1234.
  • [57] M. V. Kuleshov, M. R. Jones, A. D. Rouillard, N. F. Fernandez, Q. Duan, Z. Wang, S. Koplev, S. L. Jenkins, K. M. Jagodnik, A. Lachmann et al., “Enrichr: a comprehensive gene set enrichment analysis web server 2016 update,” Nucleic acids research, vol. 44, no. W1, pp. W90–W97, 2016.
  • [58] M. Kanehisa and S. Goto, “Kegg: kyoto encyclopedia of genes and genomes,” Nucleic acids research, vol. 28, no. 1, pp. 27–30, 2000.
  • [59] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [60] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.