BASiNETEntropy: an alignment-free method for classification of biological sequences through complex networks and entropy maximization

03/24/2022
by   Murilo Montanini Breve, et al.
UTFPR

The discovery of nucleic acids and the structure of DNA have brought considerable advances in the understanding of life. The development of next-generation sequencing technologies has led to large-scale generation of data, for which computational methods have become essential for analysis and knowledge discovery. In particular, RNAs have received much attention because of the diversity of their functionalities in the organism and the discovery of different classes with different functions in many biological processes. Therefore, the correct identification of RNA sequences is increasingly important to provide relevant information for understanding the functioning of organisms. This work addresses this context by presenting a new method for the classification of biological sequences through complex networks and entropy maximization. The maximum entropy principle is proposed to identify the edges that are most informative about the RNA class, generating a filtered complex network. The proposed method was evaluated in the classification of different RNA classes from 13 species. It was compared to the PLEK, CPC2 and BASiNET methods, outperforming all of them. BASiNETEntropy classified all RNA sequences with high accuracy and low standard deviation in the results, showing assertiveness and robustness. The proposed method is implemented as open source in the R language and is freely available at https://cran.r-project.org/web/packages/BASiNETEntropy.


1 Introduction

In the 19th century, the Swiss biochemist Friedrich Miescher (1844-1895) isolated an acid containing phosphorus and nitrogen from a cell. Twenty years after his discovery, Richard Altmann named this compound nucleic acid, as we know it today [1]. Because of these scientific advances, in 1953, James D. Watson and Francis H. Crick published “A Structure for Deoxyribose Nucleic Acid” [2], the first mention of DNA structure in the scientific field, and it would be one of the most important contributions to biology.

Because of their indispensable participation in the creation and maintenance of life, nucleic acids have gained space and notoriety in various fields of knowledge, such as computer science, since the amount of information present in these molecules is enormous, which makes manual analysis impractical. Since Phage-X174 was sequenced in 1977, a huge number of organisms have been sequenced and stored in databases. In this context, the need for computer applications in biology created a new term, Bioinformatics [3], which defines the study of computational processes in biotic systems. Since then, bioinformatics methods have become essential for the analysis of biological data [4]. Nowadays, computer programs such as NCBI BLAST are routinely used to perform comparative analysis of biological sequences over 9.9 trillion base pairs and 2.1 billion nucleotide sequences [5].

Regarding genetics and genomics, bioinformatics methods are mainly involved in the sequencing and annotation of an organism's DNA assemblies and their observed mutations [6, 7, 3]. DNA is a nucleic acid that stores genetic information by combining four types of nitrogenous bases (adenine (A), thymine (T), guanine (G) and cytosine (C)), which form distinct DNA molecules according to the structural organization of the bases. The information in the genome largely defines the functionality that a gene exerts in the organism [8]. Hereditary characteristics are also passed on through the genome [9].

RNA is the other type of nucleic acid. It is composed of four different nucleotide subunits joined by phosphodiester bonds, and one of its functions is to serve as a template for protein production (translation). RNA sequences are analysed to determine which genes code for proteins and also to compare genes within a species or between different species, which can show similarities in protein functions or relationships between species. There are several classes of RNA, such as messenger RNAs (mRNAs), which are protein-coding, ribosomal RNAs (rRNAs), transfer RNAs (tRNAs) and non-coding RNAs (ncRNAs), among others [10]. In particular, ncRNAs have recently received much attention because of their functionalities and diversity, such as small non-coding (sncRNA) and long non-coding (lncRNA) RNAs and their respective subclasses [11].

Long non-coding RNAs (lncRNAs) are emerging as an important component in the cancer context, indicating potential roles in oncogenic and tumour suppressor pathways [12]. Small non-coding RNAs (sncRNAs) are part of non-coding regulatory oligo-nucleotides with broad physiological and morphological functions. They control the genetic programming of cells and can modulate differentiation and death processes [13].

In fact, ncRNAs are still little known. However, it is already possible to point out their important functionalities in many biological processes, such as the transcriptional regulation of gene expression and the mediation of epigenetic modifications of DNA, as well as their relationships with cancer and other human diseases, such as neurological, cardiovascular and autoimmune disorders [14, 15, 16].

Therefore, the correct identification of RNA classes between mRNA, lncRNA and sncRNA from unknown biological sequences may contribute to a better understanding of their functionalities and mechanisms. This work addresses this context by presenting a new method for the classification of biological sequences through complex networks and entropy maximization.

2 Biological Background

The central dogma of molecular biology reveals the structure of DNA and explains the flow of genetic information, from DNA to RNA, to make a functional product, a protein [17]. Thus, the information in DNA is transcribed to synthesise RNA molecules. RNAs that encode proteins are defined as messenger RNA (mRNA) and are widely studied given their protein-coding function. However, only about 1.5% of the human genome is transcribed into mRNA; a large part of the transcriptome therefore does not have the ability to code for proteins and is defined as non-coding RNA (ncRNA) [18]. In general, non-coding RNA can be classified by the size of the transcript into small (< 200 nucleotides) (sncRNA) and long (> 200 nucleotides) (lncRNA) [19]. Figure 1 presents an overview of the expanded central dogma of molecular biology.

Fig. 1: Expanded central dogma of molecular biology.

A large proportion of non-coding RNA is present in the nucleus of cells, when compared with mRNA. In particular, a large proportion of lncRNAs are transcribed by RNA polymerase II, showing a transcription mechanism similar to that of mRNA [20]. Unlike lncRNAs, sncRNAs are generated at various stages of transcription, and may be a product of the separation of introns, of the generation of inhibitors and of other stages of transcription [21].

sncRNAs act in many developmental processes, including cell maintenance, development and differentiation, and transcriptional and post-transcriptional gene silencing [22]. The dysregulation of some sncRNAs is associated with several diseases, such as cancer and neurological and cardiac diseases, among others [14, 23]. Thus, sncRNAs are used as biomarkers for the detection of some of these diseases [21].

The functions of lncRNAs can be classified into three major groups based on their localization in the cell: regulation of chromatin states and gene expression, influence on nuclear structure, and regulation of proteins and other RNA molecules [18]. Besides acting in regulation, lncRNAs are related to hematopoiesis, some types of cancer, and diseases related to immune and neurological responses [20].

3 Related Works

Traditionally, biological sequence analysis is performed using alignment tools. Even with computational techniques such as dynamic programming and optimization methods, alignment has a high computational cost, which makes its application to large volumes of data unfeasible [24]. Several computational tools using the alignment approach have been proposed in recent years for RNA prediction and classification, such as Coding Potential Calculator (CPC) [25], LncRNA-ID [26] and PLncPRO [27].

To overcome this limitation, alignment-free methods have been proposed. These methods combine data mining, pattern recognition, and machine learning techniques to solve biological problems. In particular, different alignment-free approaches have been proposed for RNA classification [18].

Among the main alignment-free approaches to RNA classification available in the literature are PLEK [28], CPC2 [29], and BASiNET [30].

PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme) [28] is an alignment-free method that aims to classify RNA sequences into long non-coding and coding RNA. With k ranging from 1 to 5, features are extracted from the raw sequence using a sliding window with a step of one nucleotide. The number of patterns increases with each k value, i.e., there are 4 patterns for 1-mers, 16 patterns for 2-mers, and so on up to 1,024 patterns for 5-mers. The count of each k-mer pattern is incremented at each occurrence. These frequencies are calibrated and used as features in the support vector machine (SVM) algorithm to build a binary classification model that distinguishes lncRNAs from mRNAs.
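The k-mer counting described above can be illustrated with a minimal sketch. It assumes sequences over the alphabet A, C, G, T and uses plain relative frequencies; PLEK's own calibration step is not reproduced here, and the function names `kmer_counts` and `kmer_features` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA-style alphabet, ambiguous bases are skipped


def kmer_counts(seq, k):
    """Count every k-mer using a sliding window with a step of one nucleotide."""
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing other symbols (e.g. N) are ignored
            counts[word] += 1
    return counts


def kmer_features(seq, k_max=5):
    """Concatenate relative k-mer frequencies for k = 1..k_max."""
    features = []
    for k in range(1, k_max + 1):
        counts = kmer_counts(seq, k)
        windows = max(len(seq) - k + 1, 1)  # avoid division by zero on short input
        features.extend(counts[w] / windows for w in sorted(counts))
    return features


features = kmer_features("ACGTACGTAC")
# 4 + 16 + 64 + 256 + 1024 patterns in total
print(len(features))
```

The resulting vector has one entry per pattern, which is why the feature count grows quickly with k.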

Starting from a feature list, the CPC2 method [29] identifies the effective features through the recursive feature elimination (RFE) approach. The Fickett TESTCODE score, open reading frame (ORF) length, ORF integrity, and isoelectric point are then used as features in the SVM algorithm to classify coding and non-coding RNA. To create the feature list, the existence of known protein sequences is necessary, i.e., the feature extraction depends on data other than the nucleotide sequences. As the CPC2 method depends on known sequences, it is limited in extracting features from de novo sequencing of new organisms or proteins.

More recently, BASiNET [30] was proposed as an alignment-free approach to classify biological sequences based on the use of complex networks and thresholds. The method is based on the mapping of a biological sequence to a weighted complex network, from which topological measurements are extracted for its characterisation [31]. Then, thresholds are applied iteratively to remove the less frequent edges and to produce topological measurements at each application of the thresholds. The thresholds are applied while edges remain (at most 200 iterations). Ten topological measures are extracted at each iteration, thus producing up to 2,000 different measurements for each biological sequence. The complex network measurements are merged into a single feature vector, one for each biological sequence, which is adopted as the input for the classification of RNA sequences. However, this huge number of features leads to high dimensionality in the feature space and increases the computational complexity in terms of processing time and memory.

4 Complex Networks

Complex networks are graphs with non-trivial topologies, which have a set of vertices (nodes) that are connected through edges [32]. In fact, many interactions in the real world have connections that can be represented by complex networks, such as social relationships [33], the electric power grid [34], the internet [35], computer vision [36, 37, 38, 39, 40] and biological systems [41, 42, 43], among others.

A graph is composed of a set of nodes (or vertices) and a set of edges that represent the connections among the nodes. Depending on the application, edges may have a direction and an associated weight. When there is an edge between two nodes, they are adjacent. Figure 2 presents an example of a graph in which the nodes C and A are not adjacent, whereas the nodes C and G, and G and A, are adjacent. A graph is usually represented by its adjacency matrix, as shown in Table I, which is an n-by-n matrix whose value in row i and column j gives the weight of the edge from the i-th to the j-th node [31].

Fig. 2: An example of a weighted graph generated from the sequence CGA (nodes C, G and A; edges C-G and G-A, each with weight 1).
     A   C   T   G
A    0   0   0   1
C    0   0   0   1
T    0   0   0   0
G    1   1   0   0
TABLE I: Adjacency matrix of the graph (Figure 2).

Complex network theory presents well-defined topologies that describe the structure and dynamics of a network. Given this, it is possible to extract measurements to characterize its topological structure [44, 31]. To represent the topology, this work adopts 10 measures commonly used in the current literature for the characterization of complex networks. The adopted measurements are briefly described below.

  • Average shortest path length (ASPL): is the average of all the shortest paths in a complex network [44]. Its value depends on how concentrated the complex network is, i.e., it is low for very concentrated networks.

  • Clustering coefficient (CC): measures the probability that two nodes connected to a common node are also connected to each other. For example, in a social network, it is the probability that two people who know a third person also know each other. This transitivity can be seen as the formation of triangles in complex networks [44].

  • Maximum degree (MAX): is the largest number of edges connected to a single node.

  • Minimum degree (MIN): is the smallest number of edges connected to a single node.

  • Average degree (DEG): is the average number of connections of the network nodes.

  • Assortativity (ASS): measures the tendency of nodes with similar degrees to be connected. Positive assortativity values mean that nodes of similar degree tend to connect, and negative assortativity values mean the opposite [45].

  • Average standard deviation (ASD): is the standard deviation of the node degrees. High values show that the network has unbalanced numbers of connections, i.e., there are nodes that connect more than others.

  • Average betweenness centrality (BET): is a standard measure of node centrality that shows how relevant a node is by considering the number of shortest paths going through it [31, 44].

  • Frequency of motifs with size 3 (MT3): motifs are subgraphs or patterns with various shapes that exist within a network. This measure is the count of motifs of size 3 that occur within a network.

  • Frequency of motifs with size 4 (MT4): similar to the previous measure, it is the count of motifs of size 4 that occur within a network.
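Some of the measures listed above can be sketched in a few lines of standard-library Python. This is only an illustration on an unweighted, undirected, connected toy graph, not the implementation used by the method; the helper names are hypothetical, the clustering coefficient is taken as the average of the local coefficients, and ASD is taken as the population standard deviation of the degrees. Both choices are assumptions.

```python
from collections import deque
from itertools import combinations
from statistics import mean, pstdev


def build_graph(edges):
    """Adjacency sets for an undirected, unweighted graph."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def bfs_distances(graph, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_shortest_path_length(graph):
    """ASPL over all reachable pairs of distinct nodes."""
    lengths = []
    for u in graph:
        lengths.extend(d for v, d in bfs_distances(graph, u).items() if v != u)
    return mean(lengths)


def average_clustering(graph):
    """Mean of the local clustering coefficients (0 for nodes of degree < 2)."""
    coefficients = []
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            coefficients.append(0.0)
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        coefficients.append(2 * links / (k * (k - 1)))
    return mean(coefficients)


def degree_measures(graph):
    """MAX, MIN, DEG and ASD computed from the node degrees."""
    degrees = [len(neighbours) for neighbours in graph.values()]
    return {"MAX": max(degrees), "MIN": min(degrees),
            "DEG": mean(degrees), "ASD": pstdev(degrees)}


# Toy graph: a triangle A-B-C with a tail C-D
toy = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(degree_measures(toy))
print(average_shortest_path_length(toy), average_clustering(toy))
```

Assortativity, betweenness and motif counts are omitted to keep the sketch short; graph libraries typically provide them.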

5 Entropy

The concept of entropy was introduced in 1865 by Rudolf Clausius in thermodynamics, considering only macroscopic demonstrations [46]. A few years later, in 1877, Ludwig Boltzmann showed that entropy can be expressed in terms of probabilities associated with the microscopic configuration of a system [47], which came to be known in the literature as Boltzmann-Gibbs entropy. Later, in 1948, entropy was applied to Information Theory by Claude Shannon [48], and in this form it is often called Shannon's entropy. Entropy is often used to indicate the amount of information in a given source, and is also used to measure the disorder (uncertainty) of a data set [49]. Consider a random variable $X$ that can take on discrete values. The Shannon entropy [50], like the Boltzmann-Gibbs entropy, is defined in terms of the probabilities of the possible occurrences of this random variable $X$, as follows:

$$H(X) = -\sum_{x \in X} p(x) \log p(x) \tag{1}$$

so that

$$\sum_{x \in X} p(x) = 1.$$

In Equation 1 the average of the logarithms of the probabilities of the occurrences ($x$) weighted by their probabilities is taken, with $0 \log 0 = 0$ being assumed. Thus, entropy represents a measure of uncertainty associated with a variable, i.e., the higher the entropy of a variable, the greater the uncertainty in predicting the value of that variable.
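As a small worked illustration of Shannon's entropy, the sketch below assumes the usual form H(X) = -sum p(x) log p(x), estimates the probabilities from observed symbol counts, and uses base-2 logarithms (bits). The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2):
    """Entropy of the empirical distribution of the observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    # Only observed symbols appear in the counter, so no 0*log(0) term arises.
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


print(shannon_entropy("ACGT"))      # four equally likely symbols
print(shannon_entropy("AAAAAAAC"))  # one dominant symbol, lower uncertainty
```

A sequence in which all four nucleotides are equally frequent reaches the maximum of 2 bits, while a sequence dominated by one symbol has a much lower entropy.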

Since then, entropy has been used in various fields of knowledge, from classical thermodynamics, where it was first proposed, to statistical physics and information theory. Over time, the term found far-reaching applications in chemistry and physics, and currently entropy also takes part in the study of biological systems and their relationship to life. Several methods with different applications of the entropy concept have emerged. One of the most important was the development of the maximum entropy (ME) method by the physicist Edwin Thompson Jaynes [51]. It was shown that statistically maximizing entropy to observe how gas molecules were distributed would be equivalent to simply maximizing Shannon's entropy [52, 51]. In fact, the ME principle can be applied to measure the amount of uncertainty contained in a probability distribution [53].

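Let $h_1, h_2, \dots, h_n$ be the observed frequencies of a discrete distribution and let

$$p_i = \frac{h_i}{N}, \qquad \sum_{i=1}^{n} h_i = N, \qquad i = 1, \dots, n \tag{2}$$

where $N$ is the total number of samples, $n$ is the number of events (possible outcomes or states of a system) and $p_i$ is the probability of the $i$-th outcome.

Considering a discrete distribution containing two classes $A$ and $B$, separated at position $t$ of the distribution, its respective entropy can be defined as follows:

$$P_A = \sum_{i=1}^{t} p_i, \qquad P_B = \sum_{i=t+1}^{n} p_i \tag{3}$$

$$H(A) = -\sum_{i=1}^{t} \frac{p_i}{P_A} \log \frac{p_i}{P_A} \tag{4}$$

$$H(B) = -\sum_{i=t+1}^{n} \frac{p_i}{P_B} \log \frac{p_i}{P_B} \tag{5}$$

Given the entropy of each class, $H(A)$ and $H(B)$, it is possible to build the distribution of the sum of entropies $H(A) + H(B)$ over $t$. ME can then be defined as follows:

$$ME = \max_{t} \left[ H(A) + H(B) \right] \tag{6}$$

In this context, it is possible to identify the separability of a system regarding its probability distribution, i.e., to find the $t$ in the distribution $H(A) + H(B)$ that produces the maximum ME, leading to the maximum uncertainty between the classes $A$ and $B$ [53].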
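The search for the separation point that maximizes the sum of the two class entropies can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the histogram is split into a lower part A and an upper part B, each part is normalized by its own probability mass before its entropy is computed (as in maximum-entropy image thresholding), natural logarithms are used, and bins are indexed from 0. The function names are hypothetical.

```python
import math


def class_entropy(probabilities, mass):
    """Entropy of one part of the histogram, normalized by its own mass."""
    return -sum((p / mass) * math.log(p / mass) for p in probabilities if p > 0)


def max_entropy_threshold(histogram):
    """Return (t, ME): the split with A = bins [0, t) and B = bins [t, n)."""
    total = sum(histogram)
    p = [h / total for h in histogram]
    best_t, best_sum = None, float("-inf")
    for t in range(1, len(p)):
        mass_a, mass_b = sum(p[:t]), sum(p[t:])
        if mass_a <= 0 or mass_b <= 0:
            continue  # an empty class has no defined entropy
        entropy_sum = class_entropy(p[:t], mass_a) + class_entropy(p[t:], mass_b)
        if entropy_sum > best_sum:
            best_t, best_sum = t, entropy_sum
    return best_t, best_sum


# Toy histogram of edge frequencies: many rare values, few frequent ones
t, me = max_entropy_threshold([40, 30, 15, 8, 4, 2, 1])
print(t, round(me, 4))
```

Each candidate split is evaluated independently, so the cost is quadratic in the number of bins as written; cumulative sums would reduce it if needed.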

Many applications based on the ME principle have been proposed over time in a wide variety of scientific research [54, 55, 56]. In particular, there are bioinformatics and computational biology methods based on the ME principle [57, 58, 59, 60, 61], which have proven very suitable and are becoming increasingly useful.

6 Materials and Methods

6.1 Materials

In order to assess the proposed method, two datasets were adopted. These datasets are commonly used and allow the results to be compared openly with other methods. The first dataset was obtained from PLEK [28] and the second dataset was obtained from CPC2 [29]. These two datasets contain different RNA classes from different species.

The species, classes and number of sequences per organism are shown in Tables II and III. Regarding the RNA classes, two are available in the PLEK dataset, messenger RNA (mRNA) and long non-coding RNA (lncRNA), and three are available in the CPC2 dataset: mRNA, lncRNA and small non-coding RNA (sncRNA). The datasets contain 13 different species, including vertebrates, a plant, a worm and an insect. There are three species shared between the two datasets: Homo sapiens, Mus musculus and Danio rerio.

Species               Class of RNA   Sequences
Homo sapiens          mRNA           4127
                      ncRNA          22389
Mus musculus          mRNA           26062
                      ncRNA          2963
Danio rerio           mRNA           14493
                      ncRNA          419
Bos taurus            mRNA           13190
                      ncRNA          182
Gorilla gorilla       mRNA           33025
                      ncRNA          367
Macaca mulatta        mRNA           5709
                      ncRNA          359
Pan troglodytes       mRNA           1906
                      ncRNA          1166
Pongo abelii          mRNA           3401
                      ncRNA          392
Sus scrofa            mRNA           3978
                      ncRNA          241
Xenopus tropicalis    mRNA           8874
                      ncRNA          279
TABLE II: PLEK dataset [28].

Species                    Class of RNA   Sequences
Homo sapiens               mRNA           6142
                           lncRNA         7485
                           sncRNA         4534
Mus musculus               mRNA           10638
                           lncRNA         6460
                           sncRNA         5791
Danio rerio                mRNA           2344
                           lncRNA         1163
                           sncRNA         365
Arabidopsis thaliana       mRNA           13986
                           lncRNA         2562
                           sncRNA         1291
Caenorhabditis elegans     mRNA           3551
                           lncRNA         1582
                           sncRNA         7888
Drosophila melanogaster    mRNA           3680
                           lncRNA         2776
                           sncRNA         780
TABLE III: CPC2 dataset [29].

6.2 Methods

The development of this work is based on the BASiNET [30] method. Accordingly, graphs (complex networks) were generated from the RNA sequences of the PLEK [28] and CPC2 [29] datasets.

To produce these complex networks, it was necessary to set two parameters, the Step and the Word sizes. The Step size parameter defines the distance that will be travelled in the sequence after an edge has been formed. The Word size parameter defines how many nucleotides will be considered at each node. This process is presented in Figure 3. In this work, the Step and Word size values follow those adopted in BASiNET (a Word size of 3 is used, which yields 64 possible nodes).

Fig. 3: Overview of mapping RNA sequences onto graphs.
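The mapping of a sequence onto a weighted graph with the Word and Step parameters can be sketched as below. This is a hedged illustration, not the BASiNET code: it assumes that each edge links a word to the word immediately following it (adjacent, non-overlapping windows), that edges are undirected, and that the window then advances by Step positions. The function name `sequence_to_network` is hypothetical.

```python
from collections import defaultdict


def sequence_to_network(seq, word=3, step=1):
    """Map a sequence to undirected weighted edges between neighbouring words.

    Assumption: the second word starts right after the first one ends, and the
    pair of windows moves `step` positions after each edge is formed.
    """
    weights = defaultdict(int)
    last_start = len(seq) - 2 * word  # last position where two full words fit
    for i in range(0, last_start + 1, step):
        first = seq[i:i + word]
        second = seq[i + word:i + 2 * word]
        edge = tuple(sorted((first, second)))  # undirected: order is irrelevant
        weights[edge] += 1  # the weight counts how often the edge occurs
    return dict(weights)


# With Word = 1 and Step = 1, the sequence CGA gives the edges C-G and G-A
print(sequence_to_network("CGA", word=1, step=1))
print(sequence_to_network("ACGTTGCAACGT", word=3, step=1))
```

With a Word size of 3 there are 64 possible nodes, so the edge weights fit in a 64x64 adjacency matrix.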

With the graphs produced, an approach is proposed to select the most informational edges of each RNA class. Based on the maximum entropy principle (Sec. 5), the edges of all networks produced for each RNA class were appended, composing a single network. Then, by considering the edge frequencies (i.e., weights), a histogram was produced. Thus, similarly to the image thresholding method [62], the aim is to find the threshold ($t$) that maximizes the sum of the entropies of the two distinct parts (informational and non-informational edges) of each RNA class. In this way, a filter for the network edges is proposed, by considering the 4096 possible edges, resulting from a 64x64 matrix (word size = 3), to identify which edges are important and which are not for each RNA class.

Fig. 4: Overview of the construction of the entropy distribution and its maximum entropy (ME).

The first step of the modelling is to estimate the probabilities of each class $A$ and $B$, i.e., $P_A$ and $P_B$. Then the histogram is traversed from $1$ to $t$ to estimate $P_A$ and from $t+1$ to $n$ to estimate $P_B$, in an iterative way (Sec. 5). Then, the entropies $H(A)$ and $H(B)$ can also be estimated to build the respective distribution $H(A) + H(B)$ and to find its maximum entropy ME, i.e., the threshold $t$ that separates the informational and non-informational edges of an RNA class. Figure 4 presents an overview of the proposed approach.

Fig. 5: Overview of the proposed edge filtering through maximum entropy.

The edges selected as informational by the maximum entropy are kept for each RNA class and the non-informational edges are removed, producing a filtered complex network for each class. Figure 5 shows the filtering process.

Fig. 6: Overview of the proposed method and its steps.

Thus, a filtered complex network is produced for each RNA class and the adopted measurements are extracted for its characterization (Sec. 4). Each complex network measurement has a different numerical range. For example, the average shortest path length is usually on the scale of tens, whereas other measures can reach values in the hundreds or even thousands, which can make some measures more relevant than others to the classifier. Therefore, it is necessary to pre-process the measurements and adjust their values so that they are comparable to each other. In this work, a scaling commonly known as Min-Max is adopted, which rescales the values of the measures to the interval between 0 and 1. Equation 7 defines the rescaling of the values.

$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}, \qquad i = 1, \dots, N \tag{7}$$

where $N$ is the total number of samples, $x_i$ is the $i$-th measure value, $\min(x)$ is the minimum value of the measure and $\max(x)$ is the maximum value of the measure.
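A minimal sketch of Min-Max scaling for a single measure is given below. It assumes the standard form (value minus minimum, divided by maximum minus minimum); the handling of a constant measure, which would otherwise divide by zero, is an assumption of this sketch and the function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Rescale one measure to the interval [0, 1] across all samples."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant measure carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Example: average shortest path lengths and motif counts end up on one scale
print(min_max_scale([10, 20, 30]))
print(min_max_scale([150, 1200, 675]))
```

In a classification setting, the minimum and maximum would normally be taken from the training samples and reused for the test samples.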

It is important to highlight that, because the proposed method is based on the identification of the most informational edges of each class, training is required for each of the classes involved. Figure 6 presents an overview of the proposed method.

7 Results and Discussion

In order to evaluate the proposed method, two datasets presented in Sec. 6.1 were considered. The results were compared with important methods such as: PLEK [28], CPC2 [29], BASiNET [30] and also with BASiNET* without considering its iterative threshold step, i.e., producing the same amount of features as the proposed method. In addition, since the adopted datasets are used in other work available in the literature, the experimental results can be directly compared.

Species                       Class of RNA   PLEK      CPC2      BASiNET*  BASiNET   BASiNETEntropy
Homo sapiens                  mRNA           96.7      94.3      54.4      99.9      99.6
                              ncRNA          99.3      93.9      95.6      100.0     100.0
Mus musculus                  mRNA           88.1      94.7      79.9      100.0     99.9
                              ncRNA          89.9      99.9      75.9      99.9      100.0
Danio rerio                   mRNA           91.3      96.6      99.8      100.0     100.0
                              ncRNA          90.9      94.0      47.9      98.9      99.5
Bos taurus                    mRNA           94.8      95.9      91.2      100.0     100.0
                              ncRNA          99.5      100.0     99.8      98.9      99.5
Gorilla gorilla               mRNA           83.8      91.6      97.6      100.0     99.7
                              ncRNA          99.7      100.0     97.6      100.0     100.0
Macaca mulatta                mRNA           85.0      94.2      99.1      100.0     100.0
                              ncRNA          100.0     100.0     94.7      100.0     99.7
Pan troglodytes               mRNA           87.1      93.9      97.7      100.0     99.8
                              ncRNA          99.9      100.0     88.3      99.8      99.7
Pongo abelii                  mRNA           98.0      94.4      99.9      100.0     99.9
                              ncRNA          100.0     100.0     98.7      99.2      99.7
Sus scrofa                    mRNA           85.1      94.9      99.3      99.9      100.0
                              ncRNA          98.3      98.3      76.3      99.6      99.6
Xenopus tropicalis            mRNA           94.5      96.5      99.1      100.0     100.0
                              ncRNA          100.0     100.0     95.0      100.0     100.0
Average accuracy per class    mRNA           90.44     94.70     92.59     99.98     99.89
                              ncRNA          97.75     98.61     86.98     99.63     99.77
Overall Average Accuracy                     94.10     96.66     89.78     99.81     99.83
Standard Deviation            mRNA           5.28      1.46      14.71     0.04      0.14
                              ncRNA          3.91      2.51      16.26     0.46      0.21
TABLE IV: Classification results (accuracy, %) considering the mRNA and ncRNA classes of sequences from PLEK dataset [28].

Species                       Class of RNA   PLEK      CPC2      BASiNET*  BASiNET   BASiNETEntropy
Homo sapiens                  mRNA           97.0      95.9      80.25     100.0     99.4
                              lncRNA         97.6      92.8      58        100.0     99.9
                              sncRNA         100.0     100.0     53.7      100.0     100.0
Mus musculus                  mRNA           89.2      93.9      89.2      100.0     99.7
                              lncRNA         91.7      95.0      99.8      99.9      100.0
                              sncRNA         100.0     100.0     72.14     99.9      99.9
Danio rerio                   mRNA           94.4      95.5      93.3      99.5      99.7
                              lncRNA         79.2      88.1      99.5      98.9      99.5
                              sncRNA         100.0     100.0     79.2      98.7      99.5
Arabidopsis thaliana          mRNA           63.1      99.7      96.8      99.7      99.9
                              lncRNA         99.6      95.3      86.8      99.7      99.8
                              sncRNA         100.0     100.0     97.3      100.0     100.0
Caenorhabditis elegans        mRNA           53.0      96.5      77.8      100.0     100.0
                              lncRNA         98.4      99.9      100       99.4      99.2
                              sncRNA         100.0     100.0     88.7      99.9      100.0
Drosophila melanogaster       mRNA           82.8      94.6      99.9      98.5      97.3
                              lncRNA         87.5      91.9      84.6      97.3      99.4
                              sncRNA         100.0     100.0     77.4      99.7      100.0
Average accuracy per class    mRNA           79.92     96.02     89.52     99.62     99.33
                              ncRNA          96.17     96.92     83.09     99.45     99.77
Overall Average Accuracy                     90.75     96.62     85.23     99.51     99.62
Standard Deviation            mRNA           17.92     2.03      8.15      0.58      0.68
                              ncRNA          6.67      4.18      15.13     0.81      0.29
TABLE V: Classification results (accuracy, %) considering the mRNA, lncRNA and sncRNA classes of sequences from CPC2 dataset [29].

Species                       Class of RNA   PLEK      CPC2      BASiNET*  BASiNET   BASiNETEntropy
Homo sapiens                  mRNA           90.0      94.8      88.9      99.4      99.7
                              ncRNA          55.0      94.1      82.9      99.2      100.0
Danio rerio                   mRNA           100.0     96.1      99.7      100.0     99.7
                              ncRNA          40.3      94.1      20.5      99.9      100.0
Mus musculus                  mRNA           91.6      96.1      99.8      100       99.9
                              ncRNA          95.8      93.1      95.0      99.9      99.7
Average accuracy per class    mRNA           93.85     95.67     96.13     99.80     99.76
                              ncRNA          63.70     93.76     66.13     99.67     99.90
Overall Average Accuracy                     78.78     94.72     81.13     99.73     99.83
Standard Deviation            mRNA           4.40      0.61      6.26      0.34      0.09
                              ncRNA          23.48     0.47      39.9      0.40      0.14
TABLE VI: Classification results (accuracy, %) considering mRNA and ncRNA sequences for shared species from PLEK [28] and CPC2 [29] datasets. The CPC2 sequences were adopted for the training step and the PLEK sequences were adopted for the classification (test) step.

The first experiment was performed by considering the PLEK dataset. The proposed method was applied in order to extract the complex network topological measurements (features). Then, the Random Forest [63] was adopted as a classifier. The competitor methods were run with their default parameters. All methods were evaluated using 10-fold cross-validation.

Table IV presents the results of the classifications considering the classes mRNA and ncRNA. It can be seen that the proposed method showed the highest overall average accuracy among all the methods compared, with low standard deviations (the lowest for the ncRNA class). In particular, it is important to note that BASiNET extracts 2,000 features with the application of the threshold step, leading to a feature space with high dimensionality. When the threshold step is not considered (BASiNET*), the accuracy clearly decreases, especially for the ncRNA class. The proposed method correctly identifies which edges are important to define the topological structure of the network and, as a result, to characterize each class of RNA. Therefore, extracting only 10 features showed remarkable results, contributing to the correct identification of the RNA sequences and also to the reduction of the feature space. This leads to a simpler and more efficient method for feature extraction from RNA sequences and their classification.

The second experiment was performed considering the CPC2 dataset, to which the proposed method and the competitor methods were applied in the same way as in the previous experiment. Table V presents the results of the classifications considering the classes mRNA, lncRNA and sncRNA. Again, BASiNETEntropy showed the highest overall average accuracy among the compared methods, with low standard deviations (the lowest for the ncRNA class), reinforcing the assertiveness and robustness of the proposed method. It is important to point out that BASiNET* again presents a decrease in accuracy when compared with BASiNET, reinforcing the importance of the threshold step.

The third experiment was performed considering a cross-validation between the shared species in PLEK and CPC2 datasets. The RNA classes considered in this experiment were ncRNA and mRNA, so the lncRNA and sncRNA classes from the CPC2 dataset were grouped into a single ncRNA class. In this way, the RNA sequences of each class and species in the CPC2 dataset were adopted for the training step. The respective RNA sequences in PLEK dataset were adopted in the classification (test) step. Therefore, the goal of this experiment was to test the generalization of the methods when trained with sequences from one dataset and tested with sequences from another dataset, considering the same species.

Table VI presents the results of the cross-validation between the datasets considering the three shared species: Homo sapiens, Danio rerio and Mus musculus. It is possible to notice that PLEK method showed a significant decrease in accuracy considering the ncRNA class from Homo Sapiens and Danio rerio species. On the other hand, the CPC2 method presents similar results to the previous ones, showing its suitability in generalizing the classification of RNA sequences. The BASiNET* method (without the threshold step) presented the lowest accuracy for the ncRNA class considering the Danio rerio, showing again the importance of its threshold step when compared to its original version. BASiNET and BASiNETEntropy were the most assertive methods, with a slight superiority of the proposed method in assertiveness and robustness.

In summary, the results showed that BASiNET* without considering its threshold step had the lowest average accuracy in experiment 1 and 2 and the highest standard deviation. PLEK had the lowest average accuracy in experiment 3, when a cross-validation between data from the adopted datasets was performed, showing low generalization. CPC2 proved stable in all experiments with high accuracy rates and low standard deviation. BASiNET and BASiNETEntropy methods clearly outperform the competitor methods, showing the highest accuracy values and lowest standard deviation in all experiments.

The results show that the proposed entropy maximization approach reduces the complexity in terms of dimensionality, extracting only 10 features, while maintaining high accuracy values and low standard deviation. Therefore, the proposed approach proved efficient in simplifying the characterization of RNA sequences while maintaining high assertiveness and robustness in their classification.
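To make the idea of an entropy-maximizing filter concrete, the sketch below applies a Kapur-style maximum entropy threshold (in the spirit of reference [62]) to a list of edge weights. It is an illustration under stated assumptions, not the implementation in the BASiNETEntropy package: the function names, the histogram with 10 bins and the use of raw edge weights as input are all assumptions made for this example.

```python
import math


def entropy(probabilities):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def max_entropy_threshold(weights, bins=10):
    """Return the cut value that maximizes the summed entropy of the
    histogram parts below and above the cut (Kapur-style criterion).

    weights: edge weights of a network (assumed input for this sketch).
    bins: number of histogram bins (an arbitrary choice here).
    """
    lo, hi = min(weights), max(weights)
    if lo == hi:  # no spread in the weights, so there is nothing to split
        return lo

    # Build a normalized histogram of the weights.
    width = (hi - lo) / bins
    counts = [0] * bins
    for w in weights:
        counts[min(int((w - lo) / width), bins - 1)] += 1
    probs = [c / len(weights) for c in counts]

    best_cut, best_score = None, -math.inf
    for cut in range(1, bins):
        low, high = probs[:cut], probs[cut:]
        low_mass, high_mass = sum(low), sum(high)
        if low_mass == 0 or high_mass == 0:
            continue  # one side is empty, so the split is not meaningful
        # Entropy of each side after renormalizing it to a distribution.
        score = entropy([p / low_mass for p in low]) + entropy(
            [p / high_mass for p in high]
        )
        if score > best_score:
            best_cut, best_score = cut, score
    return lo + best_cut * width
```

A call such as `max_entropy_threshold(edge_weights)` returns a single cut value; edges on one side of it could then be discarded to obtain a filtered network, which is how a small, fixed number of features becomes attainable. Which side is kept, and how the weights are defined, depends on the method's actual definition and is not fixed by this sketch.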

8 Conclusion

The classification of biological sequences has become increasingly challenging because of the amount and variety of sequences currently generated [64]. Traditional methods based on sequence alignment incur a high computational cost, which makes them infeasible for comparing large amounts of data.

This work presents the BASiNETEntropy method as an alignment-free machine learning approach for classifying biological sequences, in particular into different RNA classes. It builds on a previous method, BASiNET, and presents a significant improvement by eliminating the threshold step, which extracts 2,000 features and leads to a high dimensionality in the feature space. Instead, a filter step based on the entropy maximization principle is proposed to select the most representative edges of each class, significantly reducing the number of extracted features.

Two important datasets from the literature were adopted for the assessment of the proposed method, comparing the results with the methods PLEK, CPC2, BASiNET and BASiNET* (BASiNET without its threshold step). The proposed method achieved the highest accuracies among all compared methods on the two adopted datasets. In addition, BASiNETEntropy presented the smallest variations, showing its robustness. Even when cross-validation was performed between the adopted datasets, considering the shared species, the proposed method showed better generalization, classifying the sequences with greater assertiveness than the compared methods. Moreover, the proposed method opens the possibility of training and application in other classes of biological sequences.

The BASiNETEntropy method was implemented as open-source software in the R language. The source code, the confusion matrices and all the materials for the complete replication of this work are available at https://github.com/fabriciomlopes/BASiNETEntropy. The software package can also be downloaded at https://cran.r-project.org/web/packages/BASiNETEntropy.

9 Acknowledgments

MMB thanks the Universidade Tecnológica Federal do Paraná (UTFPR) for the scholarship (PIBIC 2020/2021). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 and the Fundação Araucária e do Governo do Estado do Paraná/SETI (Grant numbers 035/2019, 138/2021 and NAPI - Bioinformática).

References

  • [1] R. Dahm, “Friedrich Miescher and the discovery of DNA,” Developmental Biology, vol. 278, no. 2, pp. 274–288, 2005.
  • [2] M. Feughelman, R. Langridge, W. E. Seeds, A. R. Stokes, H. R. Wilson, C. W. Hooper, M. H. F. Wilkins, R. K. Barclay, and L. D. Hamilton, “Molecular structure of deoxyribose nucleic acid and nucleoprotein,” Nature, vol. 175, no. 4463, pp. 834–838, May 1955.
  • [3] P. Hogeweg, “The Roots of Bioinformatics in Theoretical Biology,” PLoS Computational Biology, vol. 7, no. 3, p. e1002021, Mar. 2011.
  • [4] D. Posada, Bioinformatics for DNA sequence analysis.   Springer, 2009.
  • [5] E. W. Sayers, M. Cavanaugh, K. Clark, K. D. Pruitt, C. L. Schoch, S. T. Sherry, and I. Karsch-Mizrachi, “GenBank,” Nucleic Acids Research, vol. 49, no. D1, pp. D92–D96, 11 2020.
  • [6] J. B. Hagen, “The origins of bioinformatics,” Nature Reviews Genetics, vol. 1, no. 3, pp. 231–236, Dec. 2000.
  • [7] E. S. Lander, “Initial impact of the sequencing of the human genome,” Nature, vol. 470, no. 7333, pp. 187–197, Feb. 2011.
  • [8] A. Varki, “Comparing the human and chimpanzee genomes: Searching for needles in a haystack,” Genome Research, vol. 15, no. 12, pp. 1746–1758, Dec. 2005.
  • [9] Y. I. Wolf, I. B. Rogozin, N. V. Grishin, and E. V. Koonin, “Genome trees and the tree of life,” Trends in Genetics, vol. 18, no. 9, pp. 472–479, Sep. 2002.
  • [10] B. Alberts, A. Johnson, J. Lewis, D. Morgan, M. Raff, K. Roberts, P. Walter, J. Wilson, and T. Hunt, Molecular biology of the cell.   WW Norton & Company, 2017.
  • [11] N. Amin, A. McGrath, and Y.-P. P. Chen, “Evaluation of deep learning in non-coding RNA classification,” Nature Machine Intelligence, vol. 1, no. 5, pp. 246–256, 2019.
  • [12] E. A. Gibb, C. J. Brown, and W. L. Lam, “The functional role of long non-coding RNA in human carcinomas,” Molecular Cancer, vol. 10, no. 1, p. 38, Apr 2011.
  • [13] O. V. Klimenko, “Small non-coding RNAs as regulators of structural evolution and carcinogenesis,” Non-coding RNA Research, vol. 2, no. 2, pp. 88–92, 2017.
  • [14] M. Esteller, “Non-coding RNAs in human disease,” Nature Reviews Genetics, vol. 12, no. 12, pp. 861–874, Nov. 2011.
  • [15] J. L. Rinn and H. Y. Chang, “Genome regulation by long noncoding RNAs,” Annual Review of Biochemistry, vol. 81, pp. 145–166, 2012.
  • [16] X. Dai, S. Zhang, and K. Zaleta-Rivera, “RNA: interactions drive functionalities,” Molecular Biology Reports, vol. 47, no. 2, pp. 1413–1434, 2020.
  • [17] F. Crick, “Central dogma of molecular biology,” Nature, vol. 227, no. 5258, pp. 561–563, 1970.
  • [18] H. Zheng, A. Talukder, X. Li, and H. Hu, “A systematic evaluation of the computational tools for lncRNA identification,” Briefings in Bioinformatics, p. bbab285, aug 2021.
  • [19] T. Qin, J. Li, and K.-Q. Zhang, “Structure, Regulation, and Function of Linear and Circular Long Non-Coding RNAs,” Frontiers in Genetics, vol. 11, p. 150, Mar. 2020.
  • [20] L. Statello, C.-J. Guo, L.-L. Chen, and M. Huarte, “Gene regulation by long non-coding RNAs and its biological functions,” Nature Reviews Molecular Cell Biology, vol. 22, no. 2, pp. 96–118, Feb. 2021.
  • [21] G. Romano, D. Veneziano, M. Acunzo, and C. M. Croce, “Small non-coding RNA and cancer,” Carcinogenesis, vol. 38, no. 5, pp. 485–491, Apr. 2017.
  • [22] R. J. Taft, K. C. Pang, T. R. Mercer, M. Dinger, and J. S. Mattick, “Non-coding RNAs: regulators of disease,” The Journal of Pathology, vol. 220, no. 2, pp. 126–139, 2010.
  • [23] X. Chen, C. C. Yan, X. Zhang, and Z.-H. You, “Long non-coding RNAs and complex diseases: from experimental results to computational models,” Briefings in Bioinformatics, p. bbw060, Jun. 2016.
  • [24] C. R. De Pierri, R. Voyceik, L. G. C. Santos de Mattos, M. G. Kulik, J. O. Camargo, A. M. Repula de Oliveira, B. T. de Lima Nichio, J. N. Marchaukoski, A. C. da Silva Filho, D. Guizelini, J. M. Ortega, F. O. Pedrosa, and R. T. Raittz, “SWeeP: representing large biological sequences datasets in compact vectors,” Scientific Reports, vol. 10, no. 1, p. 91, Dec. 2020.
  • [25] L. Kong, Y. Zhang, Z.-Q. Ye, X.-Q. Liu, S.-Q. Zhao, L. Wei, and G. Gao, “CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine,” Nucleic Acids Research, vol. 35, no. suppl_2, pp. W345–W349, 07 2007.
  • [26] R. Achawanantakun, J. Chen, Y. Sun, and Y. Zhang, “LncRNA-ID: Long non-coding RNA IDentification using balanced random forests,” Bioinformatics, vol. 31, no. 24, pp. 3897–3905, 08 2015.
  • [27] U. Singh, N. Khemka, M. S. Rajkumar, R. Garg, and M. Jain, “PLncPRO for prediction of long non-coding RNAs (lncRNAs) in plants and its application for discovery of abiotic stress-responsive lncRNAs in rice and chickpea,” Nucleic Acids Research, vol. 45, no. 22, pp. e183–e183, 10 2017.
  • [28] A. Li, J. Zhang, and Z. Zhou, “PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme,” BMC Bioinformatics, vol. 15, no. 1, p. 311, Sep 2014.
  • [29] Y.-J. Kang, D.-C. Yang, L. Kong, M. Hou, Y.-Q. Meng, L. Wei, and G. Gao, “CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features,” Nucleic Acids Research, vol. 45, no. W1, pp. W12–W16, 05 2017.
  • [30] E. A. Ito, I. Katahira, F. F. d. R. Vicente, L. F. P. Pereira, and F. M. Lopes, “BASiNET—BiologicAl Sequences NETwork: a case study on coding and non-coding RNAs identification,” Nucleic Acids Research, vol. 46, no. 16, pp. e96–e96, 06 2018.
  • [31] L. d. F. Costa, F. A. Rodrigues, G. Travieso, and P. R. Villas Boas, “Characterization of complex networks: A survey of measurements,” Advances in Physics, vol. 56, no. 1, pp. 167–242, Jan. 2007.
  • [32] A.-L. Barabasi, Linked: How Everything Is Connected to Everything Else and What It Means.   Plume, 2003.
  • [33] A. Nunes da Silva, Junior, M. M. Breve, J. P. Mena-Chalco, and F. M. Lopes, “Analysis of co-authorship networks among brazilian graduate programs in computer science,” PLOS ONE, vol. 17, no. 1, pp. 1–17, 01 2022.
  • [34] R. Albert, I. Albert, and G. L. Nakarado, “Structural vulnerability of the north american power grid,” Physical review E, vol. 69, no. 2, p. 025103, 2004.
  • [35] S. Maslov, K. Sneppen, and A. Zaliznyak, “Detection of topological patterns in complex networks: correlation profile of the internet,” Physica A: Statistical Mechanics and its Applications, vol. 333, pp. 529–540, 2004.
  • [36] A. R. Backes, D. Casanova, and O. M. Bruno, “Texture analysis and classification: A complex network-based approach,” Information Sciences, vol. 219, no. 0, pp. 168–180, 2013.
  • [37] G. V. L. de Lima, T. R. Castilho, P. H. Bugatti, P. Saito, and F. M. Lopes, “A complex network-based approach to the analysis and classification of images,” in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, ser. Lecture Notes in Computer Science, A. Pardo and J. Kittler, Eds., vol. 9423.   Springer International Publishing, 2015, pp. 322–330.
  • [38] J. G. S. Piotto and F. M. Lopes, “Combining SURF descriptor and complex networks for face recognition,” in 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Oct 2016, pp. 275–279.
  • [39] G. V. de Lima, P. T. Saito, F. M. Lopes, and P. H. Bugatti, “Classification of texture based on bag-of-visual-words through complex networks,” Expert Systems with Applications, vol. 133, pp. 215 – 224, 2019.
  • [40] J. G. de Souza Piotto and F. M. Lopes, “A feature extraction approach based on lbp operator and complex networks for face recognition,” in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, J. M. R. S. Tavares, J. P. Papa, and M. González Hidalgo, Eds.   Cham: Springer International Publishing, 2021, pp. 440–450.
  • [41] F. M. Lopes, R. M. Cesar-Jr, and L. d. F. Costa, “Gene expression complex networks: Synthesis, identification, and analysis,” Journal of Computational Biology, vol. 18, no. 10, pp. 1353–1367, 2011.
  • [42] F. M. Lopes, D. C. M. Jr., J. Barrera, and R. M. C. Jr., “A feature selection technique for inference of graphs from their known topological properties: Revealing scale-free gene regulatory networks,” Information Sciences, vol. 272, no. 0, pp. 1–15, 2014.
  • [43] M. M. Breve and F. M. Lopes, “A simplified complex network-based approach to mRNA and ncRNA transcript classification,” in Advances in Bioinformatics and Computational Biology.   Cham: Springer International Publishing, 2020, pp. 192–203.
  • [44] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, “Complex networks: Structure and dynamics,” Physics Reports, vol. 424, no. 4, pp. 175 – 308, 2006.
  • [45] B. Panwar, A. Arora, and G. P. Raghava, “Prediction and classification of ncRNAs using structural information,” BMC Genomics, vol. 15, no. 1, p. 127, Feb 2014.
  • [46] R. Clausius, The mechanical theory of heat.   London: Macmillan, 1879.
  • [47] L. Boltzmann, Theoretical physics and philosophical problems: selected writings, 1st ed.   Netherlands: Springer, 1974.
  • [48] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October 1948.
  • [49] C. M. Bishop, Neural networks for pattern recognition.   Oxford University Press, 1995.
  • [50] C. E. Shannon and W. Weaver, The mathematical theory of communication.   University of Illinois Press, 1963.
  • [51] E. T. Jaynes, “Information theory and statistical mechanics,” Phys. Rev., vol. 106, pp. 620–630, May 1957.
  • [52] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [53] S. Guiasu and A. Shenitzer, “The principle of maximum entropy,” The Mathematical Intelligencer, vol. 7, no. 1, pp. 42–48, 1985.
  • [54] J. N. Kapur and H. K. Kesavan, “Entropy optimization principles and their applications,” in Entropy and energy dissipation in water resources.   Springer, 1992, pp. 3–20.
  • [55] J. R. Banavar, A. Maritan, and I. Volkov, “Applications of the principle of maximum entropy: from physics to ecology,” Journal of Physics: Condensed Matter, vol. 22, no. 6, p. 063101, 2010.
  • [56] A. K. Singh, D. Senapati, T. Mukherjee, and N. K. Rajput, “Adaptive applications of maximum entropy principle,” in Progress in Advanced Computing and Intelligent Engineering, C. R. Panigrahi, B. Pati, P. Mohapatra, R. Buyya, and K.-C. Li, Eds.   Singapore: Springer Singapore, 2021, pp. 373–379.
  • [57] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, “Direct-coupling analysis of residue coevolution captures native contacts across many protein families,” Proceedings of the National Academy of Sciences, vol. 108, no. 49, pp. E1293–E1301, 2011.
  • [58] E. Granot-Atedgi, G. Tkačik, R. Segev, and E. Schneidman, “Stimulus-dependent maximum entropy models of neural population codes,” PLoS computational biology, vol. 9, no. 3, p. e1002922, 2013.
  • [59] W. Boomsma, J. Ferkinghoff-Borg, and K. Lindorff-Larsen, “Combining experiments and simulations using the maximum entropy principle,” PLOS Computational Biology, vol. 10, no. 2, pp. 1–9, 02 2014.
  • [60] G. A. Barros-Carvalho, M.-A. Van Sluys, and F. M. Lopes, “An efficient approach to explore and discriminate anomalous regions in bacterial genomes based on maximum entropy,” Journal of Computational Biology, vol. 24, no. 11, pp. 1125–1133, 2017.
  • [61] S. Bottaro, T. Bengtsen, and K. Lindorff-Larsen, “Integrating molecular simulation and experimental data: a bayesian/maximum entropy reweighting approach,” in Structural Bioinformatics.   Springer, 2020, pp. 219–240.
  • [62] J. Kapur, P. Sahoo, and A. Wong, “A new method for gray-level picture thresholding using the entropy of the histogram,” Computer Vision, Graphics, and Image Processing, vol. 29, no. 3, pp. 273–285, 1985.
  • [63] A. Liaw and M. Wiener, “Classification and regression by randomforest,” R News, vol. 2, no. 3, pp. 18–22, 2002. [Online]. Available: https://CRAN.R-project.org/doc/Rnews/
  • [64] W. De Coster, M. H. Weissensteiner, and F. J. Sedlazeck, “Towards population-scale long-read sequencing,” Nature Reviews Genetics, vol. 22, no. 9, pp. 572–587, 2021.