MCP: a Multi-Component learning machine to Predict protein secondary structure

06/17/2018
by Leila Khalatbari, et al.

The Gene or DNA sequence in every cell does not control genetic properties on its own; rather, this is done through translation of DNA into protein and subsequent formation of a certain 3D structure. The biological function of a protein is tightly connected to its specific 3D structure. Prediction of the protein secondary structure is a crucial intermediate step towards elucidating its 3D structure and function. Traditional experimental methods for prediction of protein structure are expensive and time-consuming. Therefore, various machine learning approaches have been proposed to predict the protein secondary structure. Nevertheless, the average accuracy of the suggested solutions has hardly reached beyond 80%, owing to challenges such as the complexity of the sequence-structure relation, noise in input protein data, class imbalance, and the high dimensionality of the encoding schemes that represent the protein sequence. In this paper, we propose an accurate multi-component prediction machine to overcome the challenges of protein structure prediction. We devise a multi-component designation to address the high complexity challenge in the sequence-structure relation. Furthermore, we utilize a compound string dissimilarity measure to directly interpret protein sequence content and avoid information loss. In order to improve the accuracy, we employ two different classifiers, a support vector machine and a fuzzy nearest neighbor, and collectively aggregate the classification outcomes to infer the final protein secondary structures. We conduct comprehensive experiments to compare our model with the current state-of-the-art approaches. The experimental results demonstrate that given a set of input sequences, our multi-component framework can accurately predict the protein structure. Moreover, the effectiveness of our unified model can be further enhanced through framework configuration.

1 Introduction

In this paper, we focus on interpreting sequential string data, which is crucial to many applications including bioinformatics and molecular biology. Accordingly, this interpretation problem is studied in the context of computational biology. Prediction of protein secondary structure from the input string sequences is the ground to foster protein function determination and design better drugs [50, 46, 13]. Hence, understanding protein sequences to predict protein structure has become an important use-case in molecular biology. Given protein sequences composed of amino-acid molecules, our aim is to predict the secondary structure that each amino-acid adopts through a classification paradigm. In Figure 1, the top string chain illustrates a protein sequence. Each letter of this sequence represents an amino-acid molecule. Our aim is to assign each amino-acid molecule to one of three classes of protein secondary structure, named α-helix (H), β-sheet (E), and coil (C).

Figure 1: sequence-structure mapping

Template-based methods and machine learning strategies are two computational approaches for prediction of protein secondary structure. Template-based methods neither yield higher accuracy compared to machine learning methods nor perform well on non-homologous proteins [31]. As a consequence, effective machine learning-based strategies are much more preferable. Feature extraction from protein sequences is the first step when applying a machine learning approach. However, the extracted features may not reflect all the information a sequence contains and thus can lead to information loss [30, 33, 10]. Nevertheless, from a biological perspective, the protein sequence contains indispensable information to adopt certain structures [31, 33, 34]. While predicting such structures is an appealing task, challenges abound. First, the relationship between a sequence and its corresponding structure is quite complex [50, 27]. Second, the selected features dramatically influence the learner's effectiveness [33]. Additionally, the training data, including protein sequences and their related known structures, are partially noisy. Lastly, known as the class imbalance phenomenon, the amino-acid samples are not distributed equally among the three classes of the protein structures [2].
To address these challenges, we devise a Multi-Component Predictor (MCP) which is capable of directly processing amino-acid sequences into accurate protein structures. While every component participates in the enhancement and correction of the prediction results, the multi-component property of our solution can capture diverse information from the input protein sequences. The MCP framework processes the textual context of protein sequences, which yields three advantages. First, it avoids information loss by processing the primary sequence data and bypassing the feature extraction procedure. Second, our proposed framework eliminates the negative effects of certain feature subsets, which can further promote the effectiveness of the learner. Third, given the input protein sequences, the MCP framework can interpret a latent natural language by employing dissimilarity measures. Such a latent language can reveal hidden relationships between protein sequences.
Our proposed framework employs two efficient algorithms, the Support Vector Machine (SVM) and the Fuzzy K-Nearest Neighbor (FKNN), in parallel. The edit distance function forms the edit kernel for SVM, which infers the dissimilarity among input sequences. Furthermore, we embed a compound dissimilarity measure, called CD, into the FKNN module. CD works based on n-gram scores, LZ scores, and a new parameter called the dissimilarity rate (δ). The output of each learner passes through a filtering component to refine the biologically meaningless structures. The corrected output from filtration enters the aggregation pool, which further makes a consensus among the decisions of edit-SVM and CD-FKNN.
A worthwhile advantage of the proposed method is its extensive flexibility, which promotes the development of more accurate and advanced versions based on the primary solution. For instance, one might want to add more classifiers and expand the ensemble size with different base learners. It is also possible to utilize the dissimilarity measure (CD) in the SVM kernel and subsequently analyze the outcomes in more detail. One might pass numerical features (i.e. protein sequence profiles) to a specific learner and the string sequences to another to collectively investigate the resultant differences. Fuzzification of the SVM module can enhance the aggregation process and consequently enrich the final prediction results. Also, a weighted form of the compound dissimilarity measure can lead to accuracy enhancement. Finally, our proposed solution can incorporate a dynamic parameter optimization module which can significantly improve the effectiveness of the prediction results.
Our contributions in this study are threefold:

  • We devise a flexible multi-component prediction framework (MCP) which can directly process the latent contexts of the input protein sequences and suggest accurate output secondary structures. Our model can be further generalized to perform on an arbitrary number of classes.

  • We employ two different classifiers, edit-SVM and CD-FKNN, and collectively aggregate the filtered classification outcomes to infer the final prediction results.

  • We achieve better accuracy in prediction of protein secondary structure using various modules of classification, aggregation and dissimilarity measures.

The remainder of this paper is organized as follows. In section 2, we briefly summarize related work in the literature of protein secondary structure prediction. In section 3, we provide the preliminary definitions, define the problems, and present the overview of our framework. In section 4, we elucidate the procedure of building the proposed model and provide the underlying techniques. We also show that our model can be generalized to perform on an arbitrary number of classes. In section 5, we examine the effectiveness of the competitor baselines and report the experimental results. In section 6, we offer a conclusion and discuss future work.

2 Related Work

Predicting the structure using a set of string sequences has been well-studied in biology [15, 43, 6, 31, 25, 13]. Neglecting external knowledge-bases [19], machine learning-based methods such as ensemble models and deep learning approaches have recently been exploited to increase the accuracy of the prediction results. In this section, we discuss the related work in two major aspects: first, statistical and mining methods; second, machine learning-based models. Table 1 briefly demonstrates an overview of the literature.

Category of methods | Methods | References
Probabilistic and mining | Chou-Fasman | [11]
 | GOR | [14]
 | Hidden Markov Model (HMM) | [8, 9]
 | Decision tree | [37, 18]
 | Natural Language Processing (NLP) | [31]
 | Distance-based learners | [16, 6, 46]
Machine learning | Neural networks and deep learning | [5, 4, 2, 12, 49, 40, 44, 50]
 | Support vector machines (SVM) | [55, 56, 39, 32]
 | Multi-component approaches | [9, 12, 49, 53, 41, 2, 5, 25, 4, 39, 50, 55, 56, 18, 44, 7, 54]
Table 1: An overview of the literature of protein secondary structure prediction

Probabilistic and mining methods: Primary probabilistic methods [11, 14] are based on empirical analytics and mainly compute the tendency of each amino-acid in a protein sequence to form a particular secondary structure (i.e. probabilities are calculated based on the frequency of each amino-acid in each secondary class). GOR [14] extends Chou-Fasman [11] to improve prediction performance by including the context of each amino-acid. In probabilistic terms, it computes the conditional probability for each amino-acid to adopt a certain secondary structure, given that its neighbors have formed that structure. Since the structure of an amino-acid is correlated with its neighbors, in this paper we also use a sliding window to incorporate the neighboring context in the prediction process. A more recent probabilistic approach for protein secondary structure prediction is the Hidden Markov Model (HMM) [8, 9]. HMM graphical models adapt well to one-dimensional sequence processing. When applying HMMs, the states of the graph are the secondary classes, and structures are determined using the output probabilities of the HMM [8]. Chen et al. [9] employ a third-order Markov Model (as a feature extraction method) to generate a sequence encoding scheme that is subsequently fed into an SVM to predict the protein structure. In our approach, secondary structures are decided from probabilities that are calculated through fuzzy membership functions.
Tree-based methods are also used in mining approaches [52]. Mossos et al. [37] extract rules from an FS-Tree - with a modified support - to leverage sequence-structure mappings. In another tree-based approach, He et al. [18] apply SVM to eliminate the noise and the outlier data, first passing the refined data to a decision tree and subsequently predicting the protein structures using the rules extracted from the tree. Moreover, the NLP-based methods [24] consider the textual context of the input sequences to infer the missing structures. [31] exploit n-gram patterns to create a dictionary of synonymous words that can later be used to compute sequence similarities. In our work, we also employ a compound measure including an n-gram metric which estimates the dissimilarities among sequence chunks. Distance-based classifiers have also been employed in structure prediction [16]. The K-Nearest Neighbor algorithm (KNN) and its fuzzified versions [6, 46] are the most popular distance-based learners. Similarly, our multi-component framework takes advantage of the fuzzified KNN.
Machine learning-based approaches: In the field of sequence-structure mapping, there are three major groups of machine learning models: neural networks and deep learning paradigms, support vector machines, and multi-component learners. Neural Networks (NN) are the first generation of machine learning approaches used for protein structure prediction. In practice, employing a well-designed neural network architecture leads to a fine estimation of class boundaries. The latest and most effective version of neural networks is the Deep Neural Network (DNN). Deep networks learn different levels of information abstraction through multiple hidden layers [40]. The mainstream deep networks comprise recurrent neural networks [5, 4], feedforward multilayer perceptrons [12], and deep convolutional neural fields [40]. According to the literature [36, 5, 42], different versions of recurrent neural networks greatly suit processing sequence data. For instance, bidirectional recurrent neural networks are capable of utilizing the information along the entire sequence. While time is a multi-aspect entity [23, 22, 21, 20], long short-term memory recurrent neural networks can also retain information over long periods of time [40]. Deep convolutional neural fields involve more sequence information in the learning process and take into account the interdependencies of the adjacent context [50]. More recent studies in both shallow [2, 49] and deep neural networks [44] combine the predictions of a number of such networks in an ensemble fashion. Spencer et al. [44] propose an ensemble of three DNNs with a cascade architecture. The model is trained using the restricted Boltzmann machine, which works with real-valued data, and contrastive divergence. Despite the strength of neural networks, the choice of proper architecture parameters, such as the number of neurons, layers, and activation functions, remains an issue that can significantly affect the prediction outcome. Moreover, there is a chance that the algorithm falls into a local minimum. Since support vector machines can resolve the parameter selection and local minima issues, we employ an SVM component in our framework.


SVMs are among the most accurate learners in the literature of protein secondary structure prediction [7, 56]. Because of their optimization nature, SVM models perform more accurately than NNs in many applications [7, 56]. However, the SVM kernel should be tuned properly. Zangooei et al. [55, 56] use a dynamic weight allocation function to assign weights to three ubiquitous kernels and later fuse them into a single kernel. Furthermore, a parallel hierarchical grid search is applied to tune the kernel parameters. According to [55], converting the classification to a regression problem and then employing Support Vector Regression (SVR) can further enhance the prediction accuracy. Therefore, aiming to determine the final protein structure, [55] utilizes a non-dominated sorting genetic algorithm to map the real-valued SVR outputs to integer values. SVM models are also employed as ensemble components [39, 32]. Nguyen et al. [39] develop an SVM-based cascade architecture where the second layer produces the final prediction results by combining the outputs from the first layer.

Multi-component methods employ a variety of complementary modules to unveil the relationship between the input and output vectors. Such computational modules include various learners, optimization strategies, distance or dissimilarity measures, and evolutionary algorithms. In practice, multi-component machines perform more competently than single learners. The reason lies in the fact that each component can overcome a part of the challenge. Several methods reviewed in this section were developed in an ensemble or multi-component manner. The first group of multi-component approaches [9, 12, 49, 53, 41] employ complementary modules beside the learning algorithms to promote the prediction results. Similarly, aiming to foster prediction accuracy, the second category [2, 5, 25, 4, 39] exploits multiple classifiers of the same type with various features. The third class of multi-component approaches [50, 55, 56, 18, 44, 7, 54] combines classifiers of different types [1, 35] that are at times equipped with complementary components. As every module in a multi-component framework is able to enhance or rectify a learner's prediction outcomes, in this work we devise a multi-component framework to better address the challenges of secondary structure prediction.

3 Problem Statement

In this section, we offer primary concepts, notations, and the framework overview.

3.1 Preliminary concepts

Since we study the problem of knowledge extraction from sequential string data in the context of molecular biology, we commence with biological concepts. The Gene or DNA sequence in every cell does not control genetic properties on its own; rather, this is done through the translation of DNA sequence into protein and formation of a certain structure. Hence, the proteins are the functional units of the cells whose functions are tightly connected to their structure.

Definition 1

(protein sequence) A protein sequence P, of length h, is a string composed of the amino-acids a_1 a_2 ... a_h participating in the protein's formation. Each amino-acid molecule in a protein sequence is represented by an alphabetic letter (a_i ∈ Σ). There are 20 different types of amino-acids in nature (|Σ| = 20).

Definition 2

(protein secondary structure) The secondary structure of a protein is formed by the local compositions of neighboring amino-acids through peptide bonds. During this chemical reaction, a water molecule is removed, and what is left of the amino-acid molecules is called an amino-acid residue. Thus, we refer to an amino-acid residue as a residue hereafter. Every residue is assigned a secondary structure label s_i. There are three classes of protein secondary structure, named α-helix, β-sheet, and coil, represented by the three letters H, E, and C respectively; therefore s_i ∈ {H, E, C}. Hence, a protein sequence with h characters corresponds to a sequence of secondary structures of length h. Each character s_i in the structure sequence is associated with its corresponding amino-acid a_i in the primary protein sequence.

Definition 3

(the dissimilarity rate) The dissimilarity rate (δ) is the number of unique non-identical characters divided by the number of unique identical characters that a pair of sequences share. The position of characters is disregarded.

The secondary structure of a residue is strongly influenced by the type of its neighboring residues [28]. Therefore, to effectively predict a residue’s secondary structure, its neighbors must be involved. Hence, we employ the simple but effective sliding window approach to include the adjacent residues in the prediction process. Accordingly, the residue whose structure is going to be predicted will be placed in the middle of the window.

3.2 Problem definition

Given a set of protein sequences, our aim is to model the sequence-to-structure mapping. Therefore, we infer the similarity between the sequences via a compound dissimilarity measure. Concurrently, to address the complexity of the mapping function, we devise our unified framework as a multi-component learning machine.

Problem 1

(sequence similarity inference) Given a set of protein sequences, our aim is to infer the similarity between each pair of sequences in the set through a Compound Dissimilarity measure (CD).

Problem 2

(multi-component learning) Given our compound dissimilarity measure (CD) and the protein sequences, our goal is to devise a multi-component learning machine that takes the sequences as input and consumes protein sequence dissimilarities. Each module of the multi-component machine is expected to enhance the prediction accuracy of the secondary structure and facilitate an effective aggregation among the decisions of the classifiers.

3.3 Framework Overview

Figure 2 illustrates our multi-component framework for prediction of protein secondary structure from the input protein sequences. Since each component contributes to error correction and enhances the prediction accuracy, the multi-component framework can better address the mapping procedure. The strength of a multi-component learning machine mainly stems from the diversity of its components and the effectiveness of its aggregation method. Consequently, we utilize a pair of structurally diverse classifiers (SVM and FKNN) to form the learning core.

Figure 2: The framework of our proposed approach

Initially, a sliding window of size w chunks the input protein sequences into a set of fixed-length strings. Including the neighbors of the central residue through the sliding window involves the long-range interactions among amino-acids. These interactions are valuable information for the prediction of protein secondary structure.
In the learning phase, to infer the sequence-structure mapping, the set of windowed sequences is fed into two parallel classifiers, CD-FKNN and edit-SVM. As these classifiers are capable of learning directly from the string data, we bypass the feature extraction process. The CD-FKNN processes the sequences using the compound dissimilarity measure CD, which is computed using three parameters: LZ scores, n-gram scores, and the dissimilarity rate (δ) between sequence pairs. Each of the three composing elements of CD infers the sequence dissimilarities from a different aspect. LZ complexity scores are able to explore sequence order information, as well as repeated patterns or the degree of randomness [46, 33]. From another perspective, the n-gram scores can capture local similarities between sequences, which reflect sequence variations during evolution [31]. Furthermore, the dissimilarity rate reflects the extent of the difference in the types of amino-acid molecules composing each sequence. Fusing these scores into one dissimilarity measure can better infer the sequence-structure relation. The edit kernel enables SVM to effectively handle string sequence data (Section 6.1). Since CD-FKNN and edit-SVM learn in parallel, better efficiency is provided. The output of each learning module passes through a filtering component to eliminate the biologically meaningless structures. In the aggregation pool, five aggregation rules are accommodated; each makes a consensus between the decisions of CD-FKNN and edit-SVM in a different fashion. The fuzzy property of KNN can further enhance the aggregation process. The final secondary structure is obtained by filtering the aggregation results.
Q. Why is direct sequence processing necessary? An effective measure of dissimilarity over the raw sequences has a few advantages over processing extracted numerical feature vectors. First, it prevents information loss from the rich protein sequences. Second, the learner's performance varies over different sets of numerical features, and it is not always feasible to find the optimal feature set. Additionally, numerical features or encoding schemes (i.e. numerical representations of string protein sequences) may lead to high-dimensional feature vectors, which can negatively affect the learner's performance [33]. From the biological perspective, every protein sequence contains all the essential information for structure adoption [31, 33, 34]. Hence, we utilize the compound dissimilarity measure alongside the edit kernel to process the string sequences effectively.

4 Related Material

In this section, we introduce the whole procedure of secondary structure prediction. After that, we briefly describe the utilized data sets. Then, the evaluation metrics and the train-test methods are introduced. Ultimately, a source for access to the codes and data sets is provided.

4.1 Step-by-step prediction procedure

To construct an effective sequence-based secondary structure predictor, there are five steps to go through carefully, as followed in []:
1) To select valid and appropriate test and train benchmark data sets which are recent and used by many other methods in the related field. The data sets must satisfy certain conditions: the proteins should not be homologous, there must be a sufficient number of proteins to draw valid conclusions from the results, and more than one data set should be employed to verify the outcome. A popular data set provides a comprehensive framework for comparison with the proposed method.
2) To properly encode the input biological sequence before feeding it to a predictor. The encoded format must preserve as much of the information in the original data as possible and also have a high correlation with the target label. It is important to mention that the input representation greatly affects the performance of the predictor and, consequently, the final results. Hence, the representation of protein sequences for a predictor is a challenge and must be thoughtfully addressed.
3) To design a powerful prediction engine to effectively address the challenges of secondary structure prediction. The more robust, accurate, and efficient the engine is, the more effective it will be.
4) To select an appropriate evaluation framework, including proper and comprehensive evaluation measures and methods adapted to the conditions of the problem. For secondary structure prediction as a classification problem, accuracy, specificity, sensitivity, MCC, and SOV are most frequently used and describe the effectiveness of a method well. Cross-validation is also popular for tuning the parameters of the problem as well as for training and testing the prediction method.
5) To establish a public, user-friendly web-server which predicts the secondary structure of the user's input protein data on the basis of the proposed method. Throughout this paper, we elaborate on each step in detail.

4.2 Train, test, and data set description

To validate the effectiveness of our approach, three publicly available data sets, RS126, CASP11, and CASP12, are employed. It is important for a data set to have been used to evaluate various methods, as comparison of methods is only possible on identical data sets. The RS126 data set is widely used throughout the literature of the problem. Also, to evaluate our method against more recent approaches and the latest, more complicated proteins, the two recent data sets CASP11 and CASP12 are utilized.
To accomplish a comprehensive validation, we first employ RS126 for training and testing via 10-fold cross-validation and report the results. Then we utilize CASP11 and CASP12 as two recent independent test sets to further confirm our achievements.
Table 2 provides some quantities of the introduced data sets.

Data set | No. of proteins | Sequence length | No. of residues | Year of creation
RS126 [43] | 126 | 185 | 32465 | 1993
CASP11 | 85 | 44 to 669 | 20498 | 2014
CASP12 | 40 | 75 to 670 | 10526 | 2016
Table 2: Data sets description

The RS126 dataset contains globular non-homologous proteins. The coil, with a portion of 45%, is the most common structure in the dataset, while 23% and 32% of the residues are respectively categorized as β-sheet and α-helix.
According to the official CASP definition, CASP11 contains proteins which are regarded as hard targets, meaning that it is difficult to detect their homologous structure templates among known protein structures. Also, the reported accuracy for CASP12 structures is usually lower than for the other CASP data sets, since it contains proteins that are hard to classify.
It is worthwhile to mention that, as RS126 is far older than CASP11 and CASP12, there are no common targets between these train and test data sets.

4.3 Evaluation measures

To demonstrate the real effectiveness of a method, it is important to employ various evaluation metrics. The reason lies in the fact that a method might show higher values for some measures and possibly much lower values for others. A truly effective and reliable method is one that shows high and concurrently balanced values across various measures. Therefore, to comprehensively evaluate our approach, we employ 5 widely used evaluation measures for protein secondary structure predictors as well as general classifiers. These measures include the overall accuracy (Q3), precision, recall, specificity, and MCC. The mathematical definitions of these measures are given in equations 1-5 [].

Q3 = (TP + TN) / (TP + TN + FP + FN)    (1)
precision = TP / (TP + FP)    (2)
recall = TP / (TP + FN)    (3)
specificity = TN / (TN + FP)    (4)
MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (5)

where TP, FP, TN, and FN respectively stand for the true positives, false positives, true negatives, and false negatives from the confusion matrix.
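For concreteness, the following Python sketch computes these per-class measures from raw predictions in a one-vs-rest fashion; the helper names are ours, not from the paper's released code.

import math

def counts(y_true, y_pred, cls):
    # One-vs-rest confusion counts for a single structure class.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, tn, fn

def precision(tp, fp):   return tp / (tp + fp)
def recall(tp, fn):      return tp / (tp + fn)
def specificity(tn, fp): return tn / (tn + fp)

def mcc(tp, fp, tn, fn):
    # Matthews Correlation Coefficient; 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def q3(y_true, y_pred):
    # Fraction of residues whose H/E/C label is predicted correctly.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)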

4.4 Availability

The links to download our code and the three data sets are available at:

5 Methodology

In this section, we explain the components of our framework as shown in Fig. 2.

5.1 Pre-processing

Conventional machine learning-based methods for prediction of protein secondary structure generate numerical feature vectors from the sequences of protein primary structure. The feature vectors are subsequently fed into a learning machine for structure prediction. Although such numerical features comprise protein evolutionary information or biochemical properties, they cause information loss in comparison with the original sequence. Additionally, some numerical encoding schemes can produce high-dimensional feature vectors. In fact, the learner's performance can differ based on the particular collection of numerical features. All the aforementioned consequences of numerical feature extraction can affect the learner's performance negatively. As a result, in this work, we bypass feature extraction and directly process the input protein sequences. Thus, the only pre-processing procedure we perform is to apply a sliding window of size w on the sequences of protein primary structure, which vary in length. The neighbors of a residue in a protein sequence contribute strongly to its secondary structure. Therefore, the sliding window technique facilitates the involvement of a residue's neighbors in structure prediction. We predict the secondary structure of each residue at the center of the window, which involves (w−1)/2 neighbors on each of the left and right sides. Assuming that there are N residues in the protein dataset, sliding the window leads to N sequences, each of which places one of the data set residues at the center. We use a window size that is justified by [55, 44]. An excessively large w deviates the dissimilarity values between sequence pairs. Furthermore, the value of w cannot be very small, because the dissimilarity measure includes the computation of n-gram scores, and an extremely small chunk is biologically meaningless. Logically, the value of n must be smaller than w.
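As an illustration, a minimal sliding-window chunker could look as follows; padding the sequence ends with a neutral 'X' character and the default window size are our assumptions, since the paper does not state how terminal residues are handled.

def sliding_windows(sequence, w=13):
    # One window per residue; the target residue sits at the center.
    assert w % 2 == 1, "w must be odd so a residue can be centered"
    half = w // 2
    padded = "X" * half + sequence + "X" * half  # 'X' padding: our assumption
    return [padded[i:i + w] for i in range(len(sequence))]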

5.2 The Compound Dissimilarity Measure

The CD-FKNN component performs classification based on the compound dissimilarity measure CD, which is computed using three parameters: LZ scores, n-gram scores, and the dissimilarity rate. Our experiments show that the addition of each parameter to CD further improves the accuracy (Section 6.1).
The LZ complexity measure reflects the degree of repeated patterns or the level of randomness in a sequence and can include the position information [46, 33].
Assume P(i, j) is a fragment of a protein sequence P starting at position i and ending at position j, where 1 ≤ i ≤ j ≤ h. Thus we can write P = P(1, h). The LZ complexity c(P) of the protein sequence is the number of fragments in the exhaustive history H(P) that represents the decomposition of the sequence. Each fragment obtained from the decomposition process must be unique, except in the last step, at which it is permitted to copy a previously generated fragment.
For example, the exhaustive history of a protein sequence lists the fragments generated at each step of the decomposition, separated by '.', and c(P) equals the number of fragments in that history. The initial LZ score between two protein sequences (P_i, P_j) is defined in Eq. 6 [33].

(6)

where P_i P_j denotes the concatenation of the two protein sequences P_i and P_j. The more similar the two sequences are, the smaller the value of the score will be. The final LZ score (the LZ dissimilarity of two sequences) is attained from the normalization of Eq. 6, as formulated in Eq. 7. We utilize the normalized LZ score in our work [33].

(7)

Since each string has a unique exhaustive history [29], Algorithm 1 demonstrates how we generate the exhaustive history of a sequence.

Input: P (a protein sequence with length h).
Output: H(P) (the set of fragments of the exhaustive history of the protein sequence).
1:  H(P) ← ∅, i ← 1, j ← 1
2:  while j ≤ h do
3:     if P(i, j) does not occur in P(1, j − 1) or j = h then
4:        append P(i, j) to H(P), i ← j + 1
5:     end if
6:     j ← j + 1
7:  end while
Algorithm 1 Create the exhaustive history of a protein sequence
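In Python, the greedy decomposition of Algorithm 1 and an LZ-based dissimilarity can be sketched as follows. The exact formula of Eqs. 6-7 is not legible in this copy, so the widely used Otu-Sayood form of the normalized LZ distance is assumed here.

def exhaustive_history(seq):
    # Greedy LZ decomposition: each fragment is the shortest substring
    # that cannot be copied from the text preceding it; the final
    # fragment may be a plain copy.
    fragments, i, n = [], 0, len(seq)
    while i < n:
        j = i + 1
        while j <= n and seq[i:j] in seq[:j - 1]:
            j += 1
        j = min(j, n)
        fragments.append(seq[i:j])
        i = j
    return fragments

def lz_complexity(seq):
    return len(exhaustive_history(seq))

def lz_score(p, q):
    # Assumed normalized LZ dissimilarity (Otu-Sayood style).
    d = max(lz_complexity(p + q) - lz_complexity(p),
            lz_complexity(q + p) - lz_complexity(q))
    return d / max(lz_complexity(p), lz_complexity(q))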

As stated previously, the LZ complexity of a sequence considers the character distribution rather than the characters themselves. Hence, this property leads to the same complexity for sequences with the same distribution but different characters, even though their exhaustive histories differ. Nevertheless, each amino-acid brings along distinct properties to form a certain secondary structure. Therefore, to include sensitivity to the type of amino-acid molecules, we employ a new parameter called the dissimilarity rate (δ). Let A_i and A_j be the respective lists of unique amino-acids composing P_i and P_j, where A_i ⊆ Σ and A_j ⊆ Σ. Here, δ is retrieved by substituting Eq. 9 and Eq. 10 into Eq. 8.

δ(P_i, P_j) = NI_ij / I_ij    (8)
NI_ij = |A_i ∪ A_j| − |A_i ∩ A_j|    (9)
I_ij = |A_i ∩ A_j|    (10)
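Definition 3 translates directly into code; returning infinity when two sequences share no amino-acid type is our own convention, as the paper does not cover that corner case.

def dissimilarity_rate(p, q):
    # Unique non-shared amino-acid types over unique shared types.
    a, b = set(p), set(q)
    shared = len(a & b)
    non_shared = len(a | b) - shared
    return non_shared / shared if shared else float("inf")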

Local similarities among protein sequences are a significant parameter for secondary structure prediction, as they can identify structures conserved during protein evolution [31]. To incorporate the local similarities into our measure, we employ an n-gram score between each pair of sequences (P_i, P_j). The larger the n, the more strictly the similarity will be computed. Let G_i and G_j be the respective sets of n-gram patterns associated with the protein sequences P_i and P_j. We compute the n-gram score using Eq. 11.

(11)

Algorithm 2 shows how we obtain the n-gram patterns of a sequence.

Input: P (a protein sequence with length h).
Output: G (the set of n-gram patterns of the input protein sequence).
1:  G ← ∅, i ← 1
2:  while i ≤ h − n + 1 do
3:     G ← G ∪ {P(i, i + n − 1)}, i ← i + 1
4:  end while
Algorithm 2 Generate the n-gram patterns of a protein sequence
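Algorithm 2 is a plain sliding extraction of n-grams. Since the exact form of Eq. 11 is not recoverable from this copy, the score below uses a Jaccard-style set distance over the two pattern sets as an illustrative stand-in.

def ngram_patterns(seq, n):
    # All contiguous substrings of length n (Algorithm 2).
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def ngram_score(p, q, n=3):
    gp, gq = ngram_patterns(p, n), ngram_patterns(q, n)
    union = gp | gq
    return (1.0 - len(gp & gq) / len(union)) if union else 0.0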


We fuse the LZ scores, the n-gram scores, and the dissimilarity rate (δ) in Eq. 12 to create our final dissimilarity measure CD. As CD is a compound measure with complementary parameters, it can infer dissimilarity more effectively.

(12)
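Eq. 12 fuses the three signals; its exact fusion form is not visible in this copy, so the sketch below assumes a plain unweighted sum (the paper itself suggests a weighted variant as future work), reusing the helpers defined in the previous sketches.

def compound_dissimilarity(p, q, n=3):
    # CD: assumed unweighted fusion of the three dissimilarity signals.
    return lz_score(p, q) + ngram_score(p, q, n) + dissimilarity_rate(p, q)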

5.3 Fuzzy KNN Algorithm

Unlike some model-based learners, such as rule-based methods and decision trees, whose boundaries are triangular and straight lines [45], KNN is capable of implementing complicated and irregular decision boundaries. Nevertheless, the effectiveness of the KNN method [17, 47] is significantly influenced by the selected distance measure. In this paper, we introduced the effective dissimilarity measure CD. We employ CD in the membership function of the fuzzy KNN (CD-FKNN) to determine the nearest neighbors of the input data. Precisely speaking, FKNN does not return one secondary structure for each input residue; rather, it produces the likelihood of the input residue adopting each secondary structure. Eq. 13 [26] computes the fuzzy membership values of a test residue r in each secondary class.

u_c(r) = [ Σ_{j=1..K} u_cj · CD(r, r_j)^(−2/(m−1)) ] / [ Σ_{j=1..K} CD(r, r_j)^(−2/(m−1)) ]    (13)

Here, u_c(r) is the likelihood of the unlabeled input residue r belonging to class c. Also, K is the number of neighbors, CD(r, r_j) denotes the degree of dissimilarity between r and its j-th neighbor r_j, and m determines the degree of fuzziness. Finally, u_cj is the initial fuzzy membership value of the j-th neighbor for class c. As formulated in Eq. 13, the likelihood of each test residue r belonging to each of the three classes depends on the initial membership values of the neighbors and also on the inverse dissimilarity between r and its neighbors. The inverse dissimilarity determines how strongly each neighbor can enforce the membership of the input residue in a class c. In fact, the farther a neighbor is from the input residue, the less weight will be assigned to that neighbor, in a decaying manner [42]. Eq. 14 [26] indicates how we compute the initial membership values of the neighbors. Suppose the class of the training residue r_j (with known structure) is c_j.

u_cj = 0.51 + 0.49 · (n_c / K')   if c = c_j
u_cj = 0.49 · (n_c / K')          otherwise    (14)

where u_cj is the corresponding membership value of a training residue r_j for class c, K' is the number of neighboring training residues of r_j, and n_c is the number of neighboring training residues of r_j that also belong to class c. Eq. 14 assigns three initial fuzzy membership values to each training residue, one for each class of secondary structure. According to Eq. 14, if the label of r_j and all of its neighbors is c, then r_j gets the full membership of 1 for class c. However, if r_j or its neighbors do not belong to class c, r_j gets a membership value of less than 1 for class c. In other words, the function aims to fuzzify the class membership of the labeled residues (with known structure) which lie in the intersecting region of the three classes in the sample space, while it assigns the full membership value of 1 to the samples far away from the intersecting region. As the value of u_c(r) is computed using u_cj, the class of unlabeled residues located in the intersecting region of the classes will be less influenced by the labeled residues lying in that region [26]. The assignment of initial fuzzy membership values to the training residues in fuzzy KNN can be considered a training phase, which leads to performance enhancement compared to the crisp KNN. Another important advantage of the employed fuzzy property is that the fuzzy membership values provide a confidence level for the predictions, which determines how strongly an input sample belongs to each class.
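A compact sketch of Eqs. 13-14 follows, with the class alphabet fixed to {H, E, C}; the small epsilon guarding against zero dissimilarity for identical windows is our addition, as the paper does not discuss that case.

CLASSES = ("H", "E", "C")

def initial_membership(label, neighbor_labels):
    # Eq. 14: soft labels for a training residue from its K' neighbors.
    k = len(neighbor_labels)
    mem = {}
    for c in CLASSES:
        frac = sum(l == c for l in neighbor_labels) / k
        mem[c] = 0.51 + 0.49 * frac if c == label else 0.49 * frac
    return mem

def fknn_membership(r, neighbors, m=2.0, eps=1e-9):
    # Eq. 13: neighbors is a list of (window, init_memberships) pairs.
    weights = [(compound_dissimilarity(r, w) + eps) ** (-2.0 / (m - 1.0))
               for w, _ in neighbors]
    total = sum(weights)
    return {c: sum(wt * mem[c] for wt, (_, mem) in zip(weights, neighbors)) / total
            for c in CLASSES}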

5.4 Edit-SVM Algorithm

The power and efficiency of SVM lie in the fact that it solves an optimization problem to find the globally optimal solution. SVM not only finds a separating boundary, but also discovers the boundary with the maximum margin from the samples of each class. This property boosts the generalization ability of SVM. Furthermore, SVM searches a higher-dimensional feature space in order to find a separating hyperplane, rather than searching the original space for a non-linear separating boundary [17]. Using the kernel trick, the transformation to higher dimensions does not need to be directly calculated, and thus the efficiency of the algorithm remains high. This algorithm outperforms shallow neural networks and almost all individual learners for prediction of protein secondary structure [7, 56]. Thus, it has been selected as one component of our framework. In the SVM module, we aim to directly process the protein language and predict the structure of the residue positioned at the center of the sliding window. Therefore, we employ a kernel capable of taking string data as input. A popular and well-performing kernel in this area is the edit kernel, formulated in Eq. 15 [3], which utilizes the edit distance between two strings in a form similar to the RBF kernel. The RBF kernel is one of the most effective kernels, mapping the original feature space to an infinite-dimensional space [56, 47].

K(s, t) = exp(−γ · ed(s, t))    (15)

Given two strings s and t generated from the alphabet set Σ, the edit distance ed(s, t) is the minimum number of operations that transform s to t. Levenshtein is an edit distance that permits the operations of insertion, deletion, and substitution [48] when transforming strings.
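The kernel is straightforward once a Levenshtein routine is available; gamma is a width parameter to be tuned, and the default value here is arbitrary.

import math

def levenshtein(s, t):
    # Row-by-row dynamic programming over insert/delete/substitute.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def edit_kernel(s, t, gamma=0.1):
    # Eq. 15: RBF-like kernel over the edit distance.
    return math.exp(-gamma * levenshtein(s, t))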

5.5 Biological Filtering

Considering a protein sequence composed of h amino-acids and three classes of secondary structure, the prediction vector can take 3^h different states. Nevertheless, not all of these states are biologically meaningful and valid; for instance, a single isolated structure within a structure sequence is meaningless. Hence, in order to rectify the prediction output, we apply the set of biological transformations proposed by [2, 6] on the secondary structure sequences.
For instance, a rule of the form EHE → EEE means that if the chunk EHE is observed in a predicted structure sequence, it will be transformed to EEE, because a single H structure cannot appear between two E structures. It is important to note that applying such filtration on the output of different predictors can cause varying results. In some methods, enacting filtration on the prediction vectors can significantly improve the accuracy. Such methods are more accurate since they have more isolated false predictions rather than contiguous mispredictions. However, some methods may show only a minor improvement after filtration.
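A rewriting filter of this kind can be implemented as repeated string substitution. The concrete rule set of [2, 6] is not legible in this copy, so the single EHE → EEE rule below is illustrative; the remaining rules plug into the same dictionary.

def biological_filter(structure, rules=None):
    # Repeatedly rewrite biologically invalid chunks until stable.
    rules = rules or {"EHE": "EEE"}  # illustrative rule only
    changed = True
    while changed:
        changed = False
        for bad, good in rules.items():
            if bad in structure:
                structure = structure.replace(bad, good)
                changed = True
    return structure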

5.6 Aggregation Rules

Aggregation is an influential module in the effectiveness of a multi-component learner [7]. In our work, five aggregation rules are proposed to produce the final secondary structures. The aggregators take advantage of the outputs of the CD-FKNN and edit-SVM classifiers. CD-FKNN generates three fuzzy membership values, which provide a confidence level for each decided class and enhance the aggregation process. The accuracy of our method under each aggregation rule is evaluated in Section 6.
Let r_i be the i-th residue of the protein sequence P. Also, assume d1_knn and d2_knn to be the first and second decisions (associated with the classes with the maximum and middle membership values) made by the CD-FKNN classifier, and let d_svm denote the decision of edit-SVM. Suppose s_i is the predicted secondary structure for the residue r_i.
Aggregation 1:
According to aggregation 1, if both CD-FKNN and edit-SVM vote for a certain secondary structure, it is taken as the final prediction. Otherwise, based on the fuzzy levels of confidence, the second decision of CD-FKNN (d2_knn) will most probably be the actual prediction outcome. Aggregations 2, 4, and 5 work based on the weighted decisions of the classifiers. We introduce two strategies for weight assignment. The first strategy assigns a weight proportional to the accuracy of the classifier on a validation set. Accordingly, the more accurate the classifier is, the higher the priority its decision gets in the aggregation process. Suppose w_1 and w_2 are the weights of CD-FKNN and edit-SVM respectively, where 0 ≤ w_1, w_2 ≤ 1 and w_1 + w_2 = 1. Eq. 16 and Eq. 17 are used to compute the weights.

w_1 = acc_FKNN / (acc_FKNN + acc_SVM)    (16)
w_2 = acc_SVM / (acc_FKNN + acc_SVM)    (17)

Subsequently, the interval [0, 1] is divided into two sub-intervals proportional to the weight of each classifier: [0, w_1) is dedicated to the first classifier and [w_1, 1] to the second. Finally, we generate a random number r in the interval [0, 1]. If the value of r lies in [0, w_1), the first classifier decides the final class; otherwise, the second classifier determines the final prediction. In fact, the length of the sub-interval dedicated to a certain classifier is the probability that the decision of that classifier is returned as the final decision. We call this strategy Roulette Wheel 1, and it is particularly applicable when the accuracies of the classifiers are not level.
The second strategy (Roulette Wheel 2) differs from Roulette Wheel 1 in how it dedicates a sub-interval to each classifier. Instead of assigning sub-intervals according to the predefined weights of the classifiers, Roulette Wheel 2 considers a step size and examines the resultant accuracy of the prediction machine. Here we appoint different breakpoints over the interval [0, 1], and the selected breakpoint assigns a sub-interval to each classifier. For example, if the step size is set to 0.1, the breakpoints 0.1, 0.2, ..., 0.9 are examined, and the sub-intervals of the classifiers can be appointed in pairs as ([0, 0.1), [0.1, 1]), ([0, 0.2), [0.2, 1]), ..., ([0, 0.9), [0.9, 1]).
In this work, the accuracy of the two classifiers is nearly even. Hence, we employ the weight assignment strategy of Roulette Wheel 2 in aggregation rules 2, 4, and 5. Figure 3 illustrates the change in the accuracy of the prediction machine as the value of the breakpoint increases.
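Roulette Wheel 2 then reduces to a single comparison against the chosen breakpoint; 0.75 is the best-performing breakpoint reported in Figure 3.

import random

def roulette_wheel(decision_fknn, decision_svm, breakpoint=0.75):
    # The sub-interval [0, breakpoint) belongs to CD-FKNN,
    # [breakpoint, 1] to edit-SVM.
    return decision_fknn if random.random() < breakpoint else decision_svm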

Figure 3: Impact of interval breakpoints on accuracy

In this experiment, we compute the accuracy as an average over 15 generated random numbers (r) for each breakpoint. As shown in Figure 3, the breakpoint 0.75, which assigns the interval [0, 0.75) to CD-FKNN and [0.75, 1] to edit-SVM, gains the highest accuracy.
The remaining aggregation rules are described below.
Aggregation 2:
According to aggregation 2, when d1_knn and d_svm do not agree, Roulette Wheel 2 is performed to determine the final prediction from d1_knn and d_svm based on the associated sub-intervals.
Aggregation 3:
As aggregation 3 denotes, when d1_knn differs from d_svm, the last fuzzy decision of KNN (d3_knn) determines the final structure. Intuitively, selecting the last fuzzy decision is not necessarily the best choice. Nevertheless, we opt for the last fuzzy decision because we aim to investigate how the fuzzy decisions comply with both the edit-SVM decision and the sequence-structure relation.
Aggregation 4:
In aggregation 4, if the two classifiers predict different structures, Roulette Wheel 2 uses d2_knn and d_svm to select the final structure. Aggregation 4 tests the notion that an improper local decision can cause an improper global decision.
Aggregation 5:
Aggregations 5 and 2 work quite similarly, except that aggregation 5 receives the filtered outputs from CD-FKNN and edit-SVM. Theoretically, aggregations 2 and 5 can provide better results compared to the other rules, because they rely on the early fuzzy decisions.
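Putting the pieces together, rules 1 and 2 can be sketched as below; d_knn is the list of fuzzy decisions ordered by decreasing membership value, and roulette_wheel is the helper from the previous sketch.

def aggregate(rule, d_svm, d_knn, breakpoint=0.75):
    # Agreement between the two classifiers settles the prediction.
    if d_svm == d_knn[0]:
        return d_svm
    if rule == 1:   # Aggregation 1: second fuzzy decision breaks ties
        return d_knn[1]
    if rule == 2:   # Aggregation 2: Roulette Wheel 2 breaks ties
        return roulette_wheel(d_knn[0], d_svm, breakpoint)
    raise ValueError("only rules 1 and 2 are sketched here")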
Performing classification using MCP on an arbitrary number of classes: Considering the number of classes to be an arbitrary value q, there will be a unified final decision corresponding to SVM (d_svm), obtained via voting among the decisions of the q(q−1)/2 SVM models (one-versus-one strategy [51]). There will also be q decisions associated with FKNN, d1_knn, d2_knn, ..., dq_knn, where d1_knn and dq_knn correspond to the secondary classes with the maximum and minimum values of the fuzzy membership function respectively. In aggregation rules 1, 2, and 5, we take advantage of the decision of SVM (d_svm) and the first decision of FKNN (d1_knn). Thus, MCP can perform classification on an arbitrary number of classes.

Method | Q3 (%)
FKNN+LZ | 76.38
FKNN+LZ+δ | 80.42
FKNN+LZ+δ+n-gram (CD) | 81.96
Edit-SVM | 82.2701
MCP1 | 83.0906
MCP2 | 85.4072
MCP3 | 75.487
MCP4 | 80.8661
MCP5 | 87.2853
Table 3: The accuracy of the proposed approach and its components

6 Experiment

We conducted comprehensive experiments on the real-world RS126 benchmark dataset to evaluate the effectiveness of our framework in the prediction of the protein secondary structure. Additionally, we compared our approach with four state-of-the-art models to validate the competence of our proposed framework. We also consider four widely-used evaluation measures, accuracy, recall, specificity, and Matthews Correlation Coefficient (MCC), to compare the classification baselines. During the course of development, we used accuracy to gradually evaluate the performance of our proposed framework in protein structure prediction. Table 3 reports a comparison of the effectiveness of the various aggregation rules. From the results in Table 3, we can see that adjoining each component (e.g. the dissimilarity rate parameter) to the primary ensemble machine enhances the overall performance of the framework. According to Table 3, extending the FKNN model with the dissimilarity rate (δ) significantly improves the accuracy. It is also observed that the multi-component variations (MCP1 to MCP5) perform remarkably better than the single classifiers (CD-FKNN and edit-SVM). Our framework gains better effectiveness compared to the other competitors because it is not only multi-component, but also takes advantage of a new dissimilarity measure alongside different aggregation rules. Due to utilizing the first fuzzy and edit-SVM decisions and including an extra filtration, aggregation rules 1, 2, and 5 perform better than the other variations. Because they grant priority to the last and second fuzzy decisions, aggregation rules 3 and 4 gain lower accuracy. Consequently, we exclude aggregation rules 3 and 4 from the rest of the experiments.

Figure 4: Effectiveness comparison between MCP1, MCP2, and MCP5 in terms of (a) precision, (b) recall, (c) specificity, and (d) MCC

Figure 4 compares the effectiveness of the various versions of our proposed framework using the precision, recall, specificity, and MCC metrics. According to Figure 4(a), MCP1 has the highest precision in predicting H and E structures; however, it shows a fairly low precision for class C. Compared to MCP1, the MCP2 model demonstrates a better precision on class C. Moreover, since the filtering transformations reduce the number of false positives, the extra filtration in MCP5 significantly promotes the precision for C structures. Although MCP5 gains a lower precision for H and E structures, Figure 4(b) shows that it achieves a notably better recall than MCP1 and MCP2 in the prediction of H and E structures.
As depicted in Figure 4(b), MCP1 is sensitive to class imbalance. Since the larger portions of structures respectively belong to classes C, H, and E, MCP1 tends to predict samples as C, H, and E in that order. This imbalanced tendency leads to fewer false negative predictions for each class in proportion to its number of samples. Hence, MCP1 exhibits the maximum and minimum recall values for the classes C and E respectively. Nevertheless, while MCP2 partially resolves this sensitivity, MCP5 remarkably eliminates the effect of class imbalance.
As illustrated in Figure 4(c), all variations of our model demonstrate very high and level values of specificity for the H and E classes. For class C, specificity is significantly improved by MCP2 and MCP5 respectively. Except for MCP1 on class C, the specificity values for all variations of our method, in every class of the secondary structure, are remarkably higher than the values of the other metrics (e.g. MCC). Since the rate of true negatives compared to false positives is large, our proposed method reaches a higher effectiveness.
As the MCC metric concurrently considers all the elements of the confusion matrix (TP, FP, TN, and FN), it can reflect reliable evaluation outcomes. MCC is particularly useful under class imbalance, where the sizes of the classes differ, as in our dataset. Based on the experiment, Figure 4(d) shows high values of MCC for class H and even values for the E and C classes.
With regard to the classification task, the more level and simultaneously higher the values of the evaluation metrics, the more effective the classifier will be. Hence, MCP5 and then MCP2 achieve the best performance. Moreover, the overall effectiveness of the advanced versions of our proposed approach (MCP2 and MCP5) is significantly promoted compared to the primary MCP1 model.

Method | Precision | Recall | Specificity | MCC
MCP1 | 89.13 | 61.02 | 97.74 | 0.68
MCP2 | 85.61 | 72.92 | 96.28 | 0.73
MCP5 | 82.25 | 82.54 | 94.59 | 0.77
Table 4: Effectiveness on class H

Method | Precision | Recall | Specificity | MCC
MCP1 | 96.27 | 81.87 | 98.53 | 0.84
MCP2 | 92.12 | 89.76 | 96.43 | 0.87
MCP5 | 90.91 | 94.22 | 95.63 | 0.89
Table 5: Effectiveness on class E

Method | Precision | Recall | Specificity | MCC
MCP1 | 75.18 | 95.38 | 74.23 | 0.7
MCP2 | 81.12 | 88.81 | 83.09 | 0.72
MCP5 | 87.05 | 84.65 | 89.69 | 0.75
Table 6: Effectiveness on class C

Tables 4, 5, and 6 compare MCP2 and MCP5 against the initially devised classifier (MCP1). The results include the evaluation metrics of precision, recall, specificity, and MCC for the H, E, and C classes. The maximum and minimum values are respectively highlighted in bold and underlined formats. For class H (Table 4), MCP1 has the lowest values of recall and MCC and concurrently the highest values of precision and specificity. In an opposite manner, MCP5 has the lowest values of precision and specificity and the highest values of recall and MCC. The same scenario holds for class E (Table 5). The behavior of the models is quite different for class C (Table 6), for which MCP1 exhibits the lowest values of precision, specificity, and MCC, while MCP5 acquires the highest values for these metrics. In total, MCP2 achieves moderate values for all the metrics in the three classes. However, since MCP5 gains the highest MCC values for all three classes, it clearly outperforms the other variations of our proposed framework.
In fact, there are 8 categories of protein secondary structure. The number of categories can be reduced to three by applying various reduction rules (e.g. DSSP [2]) that group similar properties. According to Tables 4, 5, and 6, the measures exhibit higher values for class H compared to the other classes. The reason is that the diversity of the protein categories is lower in class H, while it still includes a sufficient one-third of the dataset samples.

6.1 Comparison and discussion

In this section, we compare the effectiveness of the variations of our framework in the prediction of protein secondary structure against other baselines in the literature. The rival methods in this experiment are as follows:

  • Multi-Component Predictor with Aggregation 1 (MCP1): As explained in Section 5, MCP1 utilizes Aggregation Rule 1 in the framework. Four other versions of our proposed solution are devised by employing the other aggregation rules; for instance, MCP2 utilizes aggregation rule 2.

  • Support Vector Regression Using the Non-Dominated Sorting Genetic Algorithm 2 and Dynamic Weighted Kernel Fusion (SVR-NSGA2): This model, proposed by Zangooei et al. [55], takes sequence profiles (numerical vectors) as input. On the one hand, this baseline employs support vector regression for the classification task; on the other hand, it utilizes NSGA2 to map the real values to integers and subsequently optimize the kernel parameters. The model also applies a weighted fusion of three kernel functions.

  • Support Vector Machine using Parallel Hierarchical Grid Search and Dynamic Weighted Kernel Fusion (SVM-PHGS): The model proposed in [56] takes sequence profiles (numerical vectors of evolutionary information) as the training input. This baseline utilizes SVM with a compound kernel function and PHGS to optimize the kernel parameters.

  • Ensemble Method Using Neural Networks and Support Vector Machines (EM): [7] combines the votes of multiple individual learners, comprising a multi-layer perceptron (MLP), an RBF neural network, and four SVM classifiers. The method applies numerous combination rules to generate the final prediction output. Bouziane et al. [7] also investigated the effect of two input data types: the Position-Specific Scoring Matrix (PSSM) [38] and a coding scheme used to represent the amino-acids.

  • Support Vector Machine Using Hybrid Coding for Protein (SVM-HC): The method proposed by Li et al. [30] initially employs geometry-based similarities; where the similarity comparison is not applicable, the SVM module is used to leverage the final protein structure. Moreover, SVM-HC extracts a 6-bit code from the physiochemical properties of the amino-acids and the tendency factors.

In this experiment, we use the RS126, CASP11, and CASP12 datasets to compare the different versions of our approach with the rivals. The competitors also have similar learning modules; for instance, the frameworks SVR-NSGA2, SVM-PHGS, and EM all utilize SVM to accomplish the learning task. Like our proposed framework, the EM approach accommodates various combination rules. Moreover, SVM-HC, as an SVM-based approach, utilizes similarity metrics for the first phase of its predictions. The abbreviation for each version of our proposed approach is numbered based on the aggregation rule - e.g. MCP1 represents the multi-component predictor with aggregation rule 1.

Method Name                  | Q3    | QH    | QE    | QC    | MCC(H) | MCC(E) | MCC(C)
EM_EXPOP(SC) [7]             | 65.2  | 61.54 | 40.76 | 78.8  | 0.473  | 0.387  | 0.438
EM_LOGOP(SC) [7]             | 65.19 | 61.59 | 40.78 | 78.73 | 0.474  | 0.387  | 0.438
EM_LINOP(SC) [7]             | 65.19 | 61.59 | 40.78 | 78.73 | 0.474  | 0.386  | 0.438
EM_EXPOP(PSSM) [7]           | 78.14 | 77.18 | 65.34 | 84.86 | 0.72   | 0.624  | 0.615
EM_LOGOP(PSSM) [7]           | 78.12 | 77.2  | 65.28 | 84.83 | 0.72   | 0.624  | 0.615
EM_LINOP(PSSM) [7]           | 78.14 | 77.18 | 65.34 | 84.86 | 0.72   | 0.624  | 0.615
SVM-PHGS(DWKF) [56]          | 84.6  | 91.2  | 72.1  | 84.3  | NA     | NA     | NA
SVR-NSGA2(DWKF) [55]         | 85.75 | 92.47 | 78.41 | 85.11 | NA     | NA     | NA
SVM-HC [30]                  | 82.5  | 82.1  | 65.09 | 89.07 | 0.779  | 0.761  | 0.748
MCP1                         | 83.09 | 81.87 | 61.02 | 95.4  | 0.84   | 0.68   | 0.7
MCP2                         | 85.41 | 89.76 | 72.92 | 88.81 | 0.87   | 0.73   | 0.72
MCP5                         | 87.3  | 94.2  | 82.5  | 84.65 | 0.89   | 0.77   | 0.75
Table 7: Effectiveness of structure prediction - various versions of the baselines (per-class accuracies in %; MCC per class)

Table 7 offers a comparison between the introduced baseline methods. Our approach attains the highest values for all evaluation metrics. The two SVM-based methods, SVR-NSGA2 and SVM-PHGS, employ a support vector machine together with a weighted fusion of three kernels, and optimize the kernel parameters through genetic algorithms and grid search, respectively. However, compared to these SVM-based models, our proposed multi-component framework achieves better performance even without any such optimization. The reason lies in the fact that the multi-component architecture can better address the sequence-structure complexity.

The table also lists various versions of the EM method, including sequence coding inputs (postfixed with SC, e.g. EM_EXPOP(SC)) and Position-Specific Scoring Matrix (PSSM) inputs. Despite the fact that the EM method, as an ensemble SVM-based machine, takes advantage of several SVM modules and introduces a variety of combiners, our approach still notably outperforms all variations of this baseline. The EM baseline suffers from two possible deficiencies: first, insufficient diversity among its SVM components, which is a necessary property of an ensemble system; and second, the use of encoding schemes, which can result in information loss.

Additionally, our approach surpasses the SVM-HC baseline. Similar to our fuzzy classification component (FKNN), SVM-HC initially employs a similarity metric for structure prediction and subsequently applies an SVM classifier to the residues that are not classified in the initial step. What empowers our method versus SVM-HC is that our approach further employs the effective compound dissimilarity measure to directly process the protein sequences, and additionally uses an efficient fuzzy aggregation component.

Tables 8 and 9 compare the effectiveness of our approach against the other baselines discussed in the literature (Section 2). Table 8 demonstrates how our approach outperforms various learners including ensemble machines, neural networks, SVMs, and decision trees. Since our method uses the fuzzy aggregation process together with the multi-component designation, it outperforms these baselines. Table 10 further compares the effectiveness of other distance-based classifiers against our FKNN-based extensions using the dissimilarity parameters (i.e. the LZ score, n-gram score, and dissimilarity rate). The results in Table 10 show that taking the string protein sequences as input - as done in our FKNN-based components - is more beneficial than consuming the numerical encoding schemes used in the rival baselines. Furthermore, the results clarify that our compound dissimilarity measure improves the prediction outcomes compared to the distance metrics used in the rival methods.
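Before turning to Tables 8-10, the fuzzy membership rule underlying our FKNN-based components can be summarized in a few lines. The sketch below follows the classic formulation of Keller et al. [26]; the `dissim` argument stands in for our compound string dissimilarity, and the toy Hamming dissimilarity, window strings, and parameter values are illustrative assumptions only.

```python
# Fuzzy k-NN memberships (Keller et al. [26]) with a pluggable dissimilarity.
import numpy as np

def fknn_memberships(query, train_seqs, train_labels, dissim, k=5, m=2.0):
    # Dissimilarity from the query window to every training window; the
    # epsilon guards against division by zero for exact matches.
    d = np.array([max(dissim(query, s), 1e-12) for s in train_seqs])
    nn = np.argsort(d)[:k]                    # indices of the k nearest neighbours
    w = 1.0 / d[nn] ** (2.0 / (m - 1.0))      # inverse-distance fuzzy weights
    memberships = {}
    for c in ("H", "E", "C"):
        in_class = np.array([train_labels[i] == c for i in nn])
        memberships[c] = float(np.sum(w * in_class) / np.sum(w))
    return memberships                        # values sum to 1 across H, E, C

# Toy usage with a Hamming-style dissimilarity over equal-length windows:
ham = lambda a, b: sum(x != y for x, y in zip(a, b))
train = ["AKVLH", "AKVLN", "GGSSA", "PPGSA", "AKVIH", "GGSTA"]
labels = ["H", "H", "C", "C", "H", "C"]
print(fknn_memberships("AKVLQ", train, labels, ham, k=3))
```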

Method's Name                                  | Q3    | QH    | QE    | QC
Ensemble NN (Imbalanced Training Set) [2]      | 73.33 | 68.07 | 52.78 | 73.02
Ensemble NN (Over-Sampling) [2]                | 73.75 | 71.72 | 68.9  | 77.44
Ensemble NN (Under-Sampling) [2]               | 73.02 | 69.76 | 77.48 | 73.2
Ensemble NN (Under- and Over-Sampling) [2]     | 73.73 | 71.05 | 69.38 | 77.38
Ensemble NN (Tree-based) [2]                   | 73.51 | 75.78 | 67.85 | 74.84
NN with SMV voting scheme [2]                  | 74.66 | 74.85 | 72.78 | 75.32
NN with GWMV voting scheme [2]                 | 74.9  | 72.61 | 70.25 | 78.56
NN with WMV voting scheme [2]                  | 74.64 | 72.43 | 69.76 | 78.43
PLM-PBC-HPP [49]                               | 69    | 67.1  | 62.7  | 73.4
ELM-HPP01 [49]                                 | 67.9  | 62.5  | 61.9  | 71.6
Single-Stage One-against-all [39]              | 69.7  | 54.1  | 79.3  | 59
Single-Stage One-against-one [39]              | 67.6  | 54.5  | 79.8  | 58.3
Single-Stage DAG [39]                          | 67.5  | 54.2  | 80    | 58.3
Single-Stage Crammer and Singer [39]           | 70.2  | 55.8  | 78    | 56.5
Two-Stage One-against-all [39]                 | 66.5  | 61.2  | 78.5  | 63.9
Two-Stage One-against-one [39]                 | 66.5  | 57.5  | 81.2  | 65.4
Two-Stage DAG [39]                             | 66.8  | 57.4  | 80.9  | 65.5
Single-Stage Vapnik and Weston [39]            | 70.4  | 55.7  | 78.2  | 57.1
Two-Stage Vapnik and Weston [39]               | 66.1  | 57.8  | 81.9  | 67
Multi-SVM Ensemble [39]                        | 74.98 | 75.37 | 66.43 | 79.26
Two-Stage Crammer and Singer [39]              | 66.8  | 57.9  | 81    | 66.8
DT [18]                                        | NA    | 70.4  | 78.4  | 67.1
SVM-DT [18]                                    | NA    | 72.8  | 79.6  | 69.3
MCP1                                           | 83.09 | 81.87 | 61.02 | 95.38
MCP2                                           | 85.41 | 89.76 | 72.92 | 88.81
MCP5                                           | 87.29 | 94.22 | 82.54 | 84.65
Table 8: The accuracy (in %) of our approach versus other competitors
Method's Name                                  | MCC(H) | MCC(E) | MCC(C)
Ensemble NN (Imbalanced Training Set) [2]      | 0.65   | 0.54   | 0.51
Ensemble NN (Over-Sampling) [2]                | 0.66   | 0.57   | 0.51
Ensemble NN (Under-Sampling) [2]               | 0.64   | 0.56   | 0.49
Ensemble NN (Under- and Over-Sampling) [2]     | 0.65   | 0.56   | 0.51
Ensemble NN (Tree-based) [2]                   | 0.64   | 0.55   | 0.52
NN with SMV voting scheme [2]                  | 0.67   | 0.58   | 0.53
NN with GWMV voting scheme [2]                 | 0.67   | 0.58   | 0.53
NN with WMV voting scheme [2]                  | 0.67   | 0.58   | 0.53
MCP1                                           | 0.84   | 0.68   | 0.7
MCP2                                           | 0.87   | 0.73   | 0.72
MCP5                                           | 0.89   | 0.77   | 0.75
Table 9: The comparison of the per-class MCC metric using the RS126 dataset
Method's Name                                  | Q3
KNN [16]                                       | 49.85
Fuzzy KNN [16]                                 | 53.08
Minimum Distance [16]                          | 59.22
FKNN + LZ score                                | 76.38
FKNN + LZ score + dissimilarity rate           | 80.42
FKNN + LZ score + dissimilarity rate + n-gram  | 81.96
Table 10: The Q3 metric (in %) for the distance-based methods
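To make the LZ component of Table 10 concrete, the following sketch computes the Lempel-Ziv production complexity [29] and one common complexity-based dissimilarity between two sequences. The normalization shown is an assumption in the spirit of [29, 33]; the exact LZ score used in our framework may differ.

```python
def lz_complexity(s):
    # Lempel-Ziv (1976) production complexity [29]: the number of phrases in
    # the exhaustive parsing, where each new phrase is the shortest extension
    # not reproducible from the preceding prefix.
    i, c, n = 0, 0, len(s)
    while i < n:
        l = 1
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c += 1
        i += l
    return c

def lz_dissimilarity(s, q):
    # One common LZ-based normalized dissimilarity; the precise normalization
    # of the paper's LZ score is an assumption here and may differ.
    cs, cq = lz_complexity(s), lz_complexity(q)
    return (lz_complexity(s + q) - min(cs, cq)) / max(cs, cq)

print(lz_complexity("ababab"))               # 3: a | b | abab
print(lz_dissimilarity("AKVLH", "AKVLN"))    # small for similar strings (0.2)
print(lz_dissimilarity("AKVLH", "GGSSA"))    # larger for dissimilar ones (1.0)
```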

Figure 5 compares the secondary structure prediction performance of the best version of each baseline against three variations of our proposed approach (MCP1, MCP2, and MCP5).

Figure 5: Comparison of the accuracy and MCC between the best baselines

As shown in the figure, EM_EXPOP(PSSM) performs poorly across all classes. The MCP5 model attains the highest and most consistent values for all measures on the two major classes, E and H. While MCP5 effectively predicts the protein structure in class E, almost all other competitors obtain a very low accuracy in this class.

6.2 Conclusion

Given the input protein sequences, in this paper we propose a unified multi-component framework to predict protein secondary structure effectively. More specifically, we divide the prediction task into four subtasks: pre-processing, classification, filtration, and aggregation. We utilize a sliding-window approach in the pre-processing task to make the sequences equal in length and to improve the results by including the neighboring residues. We facilitate the classification module by employing two classifiers, Edit-SVM and FKNN. The fuzzy component uses three dissimilarity measures - the LZ score, the n-gram score, and the dissimilarity rate - which are fused into a unified dissimilarity metric. Subsequently, the output of the classification module is filtered biologically to eliminate meaningless structures. The refined output is then fed into the aggregation pool, which comprises multiple rules to better infer the final secondary structure. Our method is capable of directly processing string protein sequences and thus avoids feature extraction, which can cause information loss and increase the dimensionality of the input data. Moreover, by consuming the protein input sequences directly, our model can globally exploit their latent natural language and better explain the hidden relationship between the input sequence and the extracted output structures. Processing the textual content is also beneficial in overcoming the negative influence of high-dimensional numerical feature vectors. The experimental comparison between our approach and the numerical feature-based methods demonstrates the strength of the string dissimilarity measures. Overall, our approach performs prediction effectively and, compared to its rivals, enhances the overall accuracy of the prediction process. A brief sketch of the pre-processing and filtration steps is given below.
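For illustration, the pre-processing and filtration steps admit very compact implementations. In the sketch below, the window size, the padding symbol 'X', and the minimum-helix-length rule are illustrative assumptions rather than the exact configuration of our framework.

```python
import re

def windows(seq, w=13, pad="X"):
    # Sliding-window pre-processing: every residue becomes a fixed-length
    # window of itself plus its neighbours, padded at the chain termini.
    half = w // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + w] for i in range(len(seq))]

def biological_filter(pred, min_helix=3):
    # Hypothetical filtration rule: helix segments shorter than min_helix
    # residues are physically implausible, so they are relabelled as coil.
    out = list(pred)
    for run in re.finditer(r"H+", pred):
        if run.end() - run.start() < min_helix:
            out[run.start():run.end()] = "C" * (run.end() - run.start())
    return "".join(out)

print(windows("MKVLAT", w=5))            # ['XXMKV', 'XMKVL', 'MKVLA', 'KVLAT', 'VLATX', 'LATXX']
print(biological_filter("CHHCCHHHHEC"))  # 'CCCCCHHHHEC'
```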

6.3 Future work

Our multi-component approach is flexible, and its modules can be replaced or modified. As an initial change, one might add more classifiers and expand the ensemble with different base learners. Fuzzification of the Edit-SVM module could enhance the aggregation process and consequently enrich the final prediction results. Moreover, a weighted dissimilarity measure in the Fuzzy KNN could better reflect the role of each dissimilarity component and yield a higher accuracy (see the sketch below). Additionally, a parameter optimization approach could adjust the parameters of the prediction process. Hence, an extensive range of modifications can be applied to our unified multi-component framework; we leave such modifications as future work. Finally, a user-friendly and publicly accessible protein secondary structure prediction web-server would facilitate the development of practically useful prediction methods and computational tools. Indeed, many such web-servers have significantly contributed to the advancement of medical science and chemistry. Consequently, we shall make efforts in our future work to provide a web-server based on the prediction method presented in this paper.
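As a rough illustration of the weighted-dissimilarity direction, the sketch below fuses three stand-in components with tunable weights. The paper's actual LZ score, n-gram score, and dissimilarity rate are not reproduced here; the Jaccard-style n-gram term and the mismatch-rate term are hypothetical substitutes, and `lz_dissimilarity` refers to the sketch in Section 6.1.

```python
def ngram_set(s, n=3):
    # Set of overlapping n-grams of a sequence.
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def weighted_dissimilarity(s, q, weights=(0.4, 0.3, 0.3)):
    # Hypothetical weighted fusion of three dissimilarity components; the
    # weights are free parameters that future work could tune or learn.
    lz = lz_dissimilarity(s, q)                   # LZ term (sketched earlier)
    a, b = ngram_set(s), ngram_set(q)
    ng = 1.0 - len(a & b) / max(len(a | b), 1)    # Jaccard-style n-gram term
    dr = sum(x != y for x, y in zip(s, q)) / max(len(s), len(q))  # mismatch rate
    w1, w2, w3 = weights
    return w1 * lz + w2 * ng + w3 * dr
```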

Author Biographies

Leila Khalatbari

is currently a Ph.D. candidate in Artificial Intelligence at Sharif University of Technology, Tehran, Iran. Since 2014, she has been a researcher in the computational cognitive models laboratory at Iran University of Science and Technology. She received her B.S. degree in computer software engineering from Qazvin. She recently completed her M.S. degree in Artificial Intelligence. Her research interests include machine learning, data analysis, data mining, and bioinformatics.

Mohammad Reza Kangavari

received the B.S. degree in mathematics and computer science from the Sharif University of Technology in 1982, the M.S. degree in computer science from Salford University in 1989, and the Ph.D. degree in computer science from the University of Manchester in 1994. He is currently a lecturer in the Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran. His research interests include Intelligent Systems, Human Computer-Interaction, Cognitive Computing, Data Science, Machine Learning, and Wireless Sensor Networks.

Saeid Hosseini completed the Ph.D. degree in Computer Science at the University of Queensland, Australia. He obtained his M.Sc. from the Queensland University of Technology in 2012 and received the Australian Postgraduate Award in 2015. He is currently a postdoctoral researcher. His research interests mainly focus on diffusion networks, graph mining, crowdsourcing, recommendation systems, spatiotemporal databases, and social network analytics. Dr. Hosseini is an invited reviewer for reputed venues including ICDM and TKDE. He has also been a program committee member for CSS (2017) and DASFAA (2017 and 2018).

Hongzhi Yin

works as a Lecturer and an ARC DECRA Fellow with The University of Queensland, Australia. He received his doctoral degree from Peking University in July 2014 under the supervision of Prof. Bin Cui. His research interests include user behavior modeling, user profiling, recommender systems (especially spatial-temporal recommendation), user linkage across social networks, network embedding, knowledge graph mining and construction, topic discovery and event detection, and deep learning. He has served as a conference organizer and PC member for PVLDB, SIGIR, ICDE, IJCAI, ICDM, CIKM, DASFAA, ASONAM, MDM, WISE, and PAKDD, and as a reviewer for more than 10 reputed journals such as the VLDB Journal, TKDE, TOIS, and TKDD.

Ngai-Man Cheung

received the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, CA, in 2008. He is currently an Assistant Professor with the Singapore University of Technology and Design (SUTD). From 2009-2011, he was a postdoctoral researcher with the Image, Video and Multimedia Systems group at Stanford University, Stanford, CA. He has also held research positions with Texas Instruments Research Center Japan, Nokia Research Center, IBM T. J. Watson Research Center, HP Labs Japan, Hong Kong University of Science and Technology (HKUST), and Mitsubishi Electric Research Labs (MERL). His work has resulted in 11 U.S. patents granted with several pending. His research interests include signal, image, and video processing, computer vision and machine learning.

References

  • [1] H. M. Abachi, S. Hosseini, M. A. Maskouni, M. Kangavari, and N.-M. Cheung. Statistical discretization of continuous attributes using kolmogorov-smirnov test. pages 309–315, 2018.
  • [2] M. Alirezaee, A. Dehzangi, and E. Mansoori. Ensemble of neural networks to solve class imbalance problem of protein secondary structure prediction. International Journal of Artificial Intelligence & Applications, 3(6):9, 2012.
  • [3] E. Aygun, B. J. Oommen, and Z. Cataltepe. Peptide classification using optimal and information theoretic syntactic modeling. Pattern Recognition, 43(11):3891–3899, 2010.
  • [4] S. Babaei, A. Geranmayeh, and S. A. Seyyedsalehi. Protein secondary structure prediction using modular reciprocal bidirectional recurrent neural networks. Computer methods and programs in biomedicine, 100(3):237–247, 2010.
  • [5] S. Babaei, A. Geranmayeh, and S. A. Seyyedsalehi. Towards designing modular recurrent neural networks in learning protein secondary structures. Expert Systems with Applications, 39(6):6263–6274, 2012.
  • [6] R. Bondugula, O. Duzlevski, and D. Xu. Profiles and fuzzy k-nearest neighbor algorithm for protein secondary structure prediction. In Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, pages 85–94. World Scientific, 2005.
  • [7] H. Bouziane, B. Messabih, and A. Chouarfia. Effect of simple ensemble methods on protein secondary structure prediction. Soft Computing, 19(6):1663–1678, 2015.
  • [8] C. Bystroff and A. Krogh. Hidden markov models for prediction of protein features. In Protein Structure Prediction, pages 173–198. Springer, 2008.
  • [9] C. Chen, Y. Tian, X. Zou, P. Cai, and J. Mo. Prediction of protein secondary structure content using support vector machine. Talanta, 71(5):2069–2073, 2007.
  • [10] Y. Chen. Long sequence feature extraction based on deep learning neural network for protein secondary structure prediction. In Information Technology and Mechatronics Engineering Conference (ITOEC), 2017 IEEE 3rd, pages 843–847. IEEE, 2017.
  • [11] P. Y. Chou and G. D. Fasman. Prediction of protein conformation. Biochemistry, 13(2):222–245, 1974.
  • [12] P. M. Dinubhai and H. B. Shah. Protein secondary structure prediction using neural network: a comparative study. Int J Enhanc Res Manag Comput Appl, 3(4):18–23, 2014.
  • [13] C. Fang, Y. Shang, and D. Xu. Mufold-ss, new deep inception-inside-inception networks for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics, 2018.
  • [14] J. Garnier, J.-F. Gibrat, and B. Robson. Gor method for predicting protein secondary structure from amino acid sequence. In Methods in enzymology, volume 266, pages 540–553. Elsevier, 1996.
  • [15] J. Garnier, D. J. Osguthorpe, and B. Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. Journal of molecular biology, 120(1):97–120, 1978.
  • [16] A. Ghosh and B. Parai. Protein secondary structure prediction using distance based classifiers. International journal of approximate reasoning, 47(1):37–44, 2008.
  • [17] J. Han, J. Pei, and M. Kamber. Data mining: concepts and techniques. Elsevier, 2011.
  • [18] J. He, H.-J. Hu, R. Harrison, P. C. Tai, and Y. Pan. Rule generation for protein secondary structure prediction with support vector machines and decision tree. IEEE Transactions on nanobioscience, 5(1):46–53, 2006.
  • [19] S. Hosseini, S. Unankard, X. Zhou, and S. Sadiq. Location oriented phrase detection in microblogs. pages 495–509, 2014.
  • [20] S. Hosseini, H. Yin, N.-M. Cheung, K. P. Leng, Y. Elovici, and X. Zhou. Exploiting reshaping subgraphs from bilateral propagation graphs. pages 342–351, 2018.
  • [21] S. Hosseini, H. Yin, M. Zhang, Y. Elovici, and X. Zhou. Mining subgraphs from propagation networks through temporal dynamic analysis.
  • [22] S. Hosseini, H. Yin, M. Zhang, X. Zhou, and S. Sadiq. Jointly modeling heterogeneous temporal properties in location recommendation. In International Conference on Database Systems for Advanced Applications, pages 490–506. Springer, 2017.
  • [23] S. Hosseini, H. Yin, X. Zhou, S. Sadiq, M. R. Kangavari, and N.-M. Cheung. Leveraging multi-aspect time-related influence in location recommendation. World Wide Web, pages 1–28, 2017.
  • [24] W. Hua, D. T. Huynh, S. Hosseini, J. Lu, and X. Zhou. Information extraction from microblogs: A survey. Int. J. Software and Informatics, 6(4):495–522, 2012.
  • [25] A. K. Johal and R. Singh. Protein secondary structure prediction using improved support vector machine and neural networks. International Journal of Engineering and Computer Science, 3(1):3593–3597, 2014.
  • [26] J. M. Keller, M. R. Gray, and J. A. Givens. A fuzzy k-nearest neighbor algorithm. IEEE transactions on systems, man, and cybernetics, (4):580–585, 1985.
  • [27] E. Krissinel. On the relationship between sequence and structure similarities in proteomics. Bioinformatics, 23(6):717–723, 2007.
  • [28] S. Y. Lee, J. Y. Lee, K. S. Jung, and K. H. Ryu. A 9-state hidden markov model using protein secondary structure information for protein fold recognition. Computers in biology and medicine, 39(6):527–534, 2009.
  • [29] A. Lempel and J. Ziv. On the complexity of finite sequences. IEEE Transactions on information theory, 22(1):75–81, 1976.
  • [30] Z. Li, J. Wang, S. Zhang, Q. Zhang, and W. Wu. A new hybrid coding for protein secondary structure prediction based on primary structure similarity. Gene, 618:8–13, 2017.
  • [31] H.-N. Lin, T.-Y. Sung, S.-Y. Ho, and W.-L. Hsu. Improving protein secondary structure prediction based on short subsequences with local structure similarity. In Bmc Genomics, volume 11, page S4. BioMed Central, 2010.
  • [32] L. Lin, S. Yang, and R. Zuo. Protein secondary structure prediction based on multi-svm ensemble. In Intelligent Control and Information Processing (ICICIP), 2010 International Conference on, pages 356–358. IEEE, 2010.
  • [33] T. Liu, X. Zheng, and J. Wang. Prediction of protein structural class using a complexity-based distance measure. Amino acids, 38(3):721–728, 2010.
  • [34] Y. Liu, J. Cheng, Y. Ma, and Y. Chen. Protein secondary structure prediction based on two dimensional deep convolutional neural networks. In Computer and Communications (ICCC), 2017 3rd IEEE International Conference on, pages 1995–1999. IEEE, 2017.
  • [35] M. A. Maskouni, S. Hosseini, H. M. Abachi, M. Kangavari, and X. Zhou. Auto-ces: An automatic pruning method through clustering ensemble selection. pages 275–287, 2018.
  • [36] F. Masulli and S. Mitra. Natural computing methods in bioinformatics: A survey. Information Fusion, 10(3):211–216, 2009.
  • [37] N. Mossos, D. F. Mejia-Carmona, and I. Tischer. Fs-tree: Sequential association rules and first applications to protein secondary structure analysis. In Advances in Computational Biology, pages 189–198. Springer, 2014.
  • [38] P. C. Ng and S. Henikoff. Predicting deleterious amino acid substitutions. Genome research, 11(5):863–874, 2001.
  • [39] M. N. Nguyen and J. C. Rajapakse. Multi-class support vector machines for protein secondary structure prediction. Genome Informatics, 14:218–227, 2003.
  • [40] K. Paliwal, J. Lyons, and R. Heffernan. A short review of deep learning neural networks in protein structure prediction problems. Advanced Techniques in Biology & Medicine, pages 1–2, 2015.
  • [41] M. S. Patel and H. S. Mazumdar. Knowledge base and neural network approach for protein secondary structure prediction. Journal of theoretical biology, 361:182–189, 2014.
  • [42] G. Pollastri, A. J. M. Martin, C. Mooney, and A. Vullo. Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC bioinformatics, 8(1):201, 2007.
  • [43] B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. Journal of molecular biology, 232(2):584–599, 1993.
  • [44] M. Spencer, J. Eickholt, and J. Cheng. A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM transactions on computational biology and bioinformatics, 12(1):103–112, 2015.
  • [45] P.-N. Tan. Introduction to data mining. Pearson Education India, 2006.
  • [46] Y. T. Tan and B. A. Rosdi. Fpga-based hardware accelerator for the prediction of protein secondary class via fuzzy k-nearest neighbors with lempel-ziv complexity based distance measure, 2015.
  • [47] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Elsevier, San Diego, second edition, 2003.
  • [48] M. P. J. Van der Loo. The stringdist package for approximate string matching. The R Journal, 6(1):111–122, 2014.
  • [49] G. Wang, Y. Zhao, and D. Wang. A protein secondary structure prediction framework based on the extreme learning machine. Neurocomputing, 72(1-3):262–268, 2008.
  • [50] S. Wang, J. Peng, J. Ma, and J. Xu. Protein secondary structure prediction using deep convolutional neural fields. Scientific reports, 6:18962, 2016.
  • [51] J. J. Ward, L. J. McGuffin, B. F. Buxton, and D. T. Jones. Secondary structure prediction with support vector machines. Bioinformatics, 19(13):1650–1655, 2003.
  • [52] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, and S. Y. Philip. Top 10 algorithms in data mining. Knowledge and information systems, 14(1):1–37, 2008.
  • [53] A. Yaseen and Y. Li. Context-based features enhance protein secondary structure prediction accuracy. Journal of chemical information and modeling, 54(3):992–1002, 2014.
  • [54] M. Zamani and S. C. Kremer. A multi-stage protein secondary structure prediction system using machine learning and information theory. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pages 1304–1309. IEEE, 2015.
  • [55] M. H. Zangooei and S. Jalili. Protein secondary structure prediction using dwkf based on svr-nsgaii. Neurocomputing, 94:87–101, 2012.
  • [56] M. H. Zangooei and S. Jalili. Pssp with dynamic weighted kernel fusion based on svm-phgs. Knowledge-Based Systems, 27:424–442, 2012.