Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences

by   Ashish Ranjan, et al.
NIT Patna

An amino acid sequence is the most intrinsic form of a protein and expresses its primary structure. The order of amino acids in a sequence enables a protein to acquire a particular stable conformation that is responsible for the functions of the protein. This relationship between a sequence and its function motivates the need to analyse sequences for predicting protein functions. Early-generation computational methods using BLAST, FASTA, etc. perform function transfer based on sequence similarity with existing databases and are computationally slow. Although machine learning based approaches are fast, they fail to perform well for long protein sequences (i.e., protein sequences with more than 300 amino acid residues). In this paper, we introduce a novel method for constructing two separate feature sets for protein sequences based on the analysis of 1) single fixed-sized segments and 2) multi-sized segments, using a bi-directional long short-term memory network. Further, the model based on the proposed feature set is combined with the state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features based model to improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate promising results for both single-sized and multi-sized segment based feature sets. The former showed an improvement of +3.37% and +5.48%, and the latter an improvement of +5.38% and +8.00%, respectively for the two datasets over the state-of-the-art MLDA based classifier. After combining the two models, there is a significant improvement of +7.41% and +9.21% respectively for the two datasets over the MLDA based classifier. Specifically, the proposed approach performed well for long protein sequences and showed superior overall performance.



I Introduction

Proteins are the most elementary yet complex functional molecules found in all living organisms. Their presence enables various molecular and biological processes that are essential for the smooth operation of different biological components in an organism. Among other things, they act as catalysts and transport agents, serve as signal molecules, and form structural support for an organism [1]. Consequently, estimation of protein functions is important for decoding the underlying mechanism of various complex biological processes of an organism. This is also helpful in the treatment of certain diseases and the development of new drugs.

Although laboratory experimental procedures are reliable for estimation of protein functions, they are slow and expensive [1]. To overcome these limitations, researchers have been working towards developing new computational approaches to protein function prediction. These include analyzing protein sequences [2, 3, 4, 5], protein structures [6, 7, 8] and protein-protein interaction (PPI) networks [9, 10]. Other approaches are hybrid in the sense that they combine two or more of these three approaches [4, 14, 15, 18]. However, protein structures and protein interaction networks are far less available than protein sequences (they are restricted to a few organisms only). The recent successful advent of next generation sequencing technologies [16] has led to a deluge of protein sequences. This has prompted computational biologists to identify the structural [21, 29] and functional [3, 4] behaviour of uncharacterized proteins using characterized ones by analyzing their sequence data. In this context, analyzing and decoding hidden patterns from protein sequences is a crucial task for understanding the functional roles of a protein.

Function characterization of a protein based upon analysis of its amino acid sequence is a complex procedure. In [1], existing procedures were put into three categories: homology based approaches, subsequence based approaches and feature based approaches. In homology based approaches, annotations depend upon the similarity score with known protein sequences, computed using sequence alignment techniques such as FASTA [11] and BLAST [12, 13]. These approaches usually correlate sequence similarity with functional similarity. However, studies highlight that such a rationale based on similarity is a weak hypothesis [6]. Next, subsequence based approaches search for hidden recurrent patterns or segments in a protein sequence, such as motifs [22, 23] and/or functional domains. However, identifying such latent patterns in variable-length protein sequences is still a bottleneck. Lastly, feature based approaches transform raw protein sequences into discriminative features for efficient characterization of new proteins based on machine learning techniques. These approaches frequently use n-mers for the construction of the sequence vector for a protein [18, 24]. Apart from n-mers, PSSM (Position Specific Scoring Matrix) features are also used for functional annotation [2].

Recently, several neural network based techniques have also been introduced for function characterization based on the analysis of entire protein sequences. In [3], Cao et al. proposed a Neural Machine Translation (NMT) based method to translate a protein sequence into possible GO-terms by treating both the sequence and the GO-terms as a language. In another approach, Kulmanov et al. [4] introduced a Convolutional Neural Network (CNN) based method and used it in combination with a hierarchy of neural networks, one for each GO-term, to predict the function(s) of a query protein sequence. Clark et al. [5] formulated a protein vector based on an i-score for each GO-term and used a neural network ensemble for function prediction.

All the methods listed above have made great contributions towards protein function prediction based on protein sequences. However, a few shortcomings still need attention from a machine learning perspective, and these motivate our work:

  • Dealing with variation in sequence lengths. Protein sequences vary in length and can be partitioned into short, medium and long sequences. Generally, protein sequence datasets have a mix of all three. While several approaches have been proposed to transform varied-length protein sequences into equal-sized protein vectors (e.g., [18, 27]), these methods often fail to perform well for long protein sequences. One of the main reasons is that the majority of protein sequences are in the short-to-medium range, resulting in an uneven distribution with respect to sequence lengths, as shown in Fig. 1.

  • Efficient feature vector representation. With reference to function prediction, every sequence is assumed to have conserved regions and non-conserved regions. Of these, only the conserved regions are responsible for functional classification of sequences. In this sense, the non-conserved regions act as noise. As the length of sequences increases, the noisy regions tend to outweigh the conserved regions in terms of relative length. As a result, the existing vector representations of protein sequences tend to be affected by non-conserved regions, resulting in low performance, especially for longer sequences.

  • Performance Improvement: To improve upon the overall prediction accuracy compared to existing methods in case of varied lengths.

Based on the above discussion, the objective of our work is: To develop a multi-label protein function classification framework that gives superior performance with respect to existing approaches and produces consistent results for sequences of varying lengths, especially for long sequences.

Protein functions are represented as Gene Ontology (GO) [25] terms. GO provides a unified vocabulary for gene products across three domains: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). To the best of our knowledge, no work has been done to specifically handle long sequences while preserving the prediction accuracy for short and medium sequences. In this regard, the contributions of this paper are as follows:

Fig. 1: Length-wise distribution of protein sequences used for experimental evaluation, obtained from the UniProtKB database [26]: (a) biological process prediction, (b) molecular function prediction.
  • Efficient feature vector representation: A novel approach to protein vector construction ProtVecGen, based on the proposed segmentation technique and bi-directional LSTM networks [28] is introduced. The proposed technique produces a fixed-length vector for a protein sequence that is robust for function prediction with respect to high variation in sequence lengths.

  • Multi-sized segmentation: Protein vector construction is further improved by combining multiple ProtVecGen features based on different segment sizes to yield a more discriminative set of features called ProtVecGen-Plus. This significantly improved the performance of the proposed framework compared to ProtVecGen.

  • Hybrid approach: The classification model based on ProtVecGen-Plus features is combined with another model based on MLDA features [31] to produce even better results.

  • Superior performance: The proposed features are evaluated using protein sequences from the UniProtKB dataset [26], with 58310 protein sequences (having 295 distinct GO-terms) for BP and 43218 protein sequences (having 135 GO-terms) for MF. The proposed ProtVecGen-Plus based framework achieved an average F1-score of 54.65 ± 0.15 and 65.91 ± 0.10 for BP and MF respectively. This is a significant improvement over the existing state-of-the-art MLDA features [18] based model, whose corresponding average F1-scores in Table I are 49.27 ± 0.12 and 57.91 ± 0.17. The hybrid model outperformed all others, with an average F1-score of 56.68 ± 0.13 and 67.12 ± 0.10 for BP and MF respectively.

  • Consistent results w.r.t. sequence lengths: Proposed approach produces consistent results for a wide range of protein sequence lengths.

The rest of the paper is organized as follows. Section II discusses the proposed framework architecture. Section III describes the proposed weighted hybrid model. This is followed by results and discussion in Sec. IV. Finally, Sec. V concludes the paper.

II Proposed Method

The recent success of deep learning techniques such as Recurrent Neural Networks (RNN) in the analysis of sequence and time series data has made a strong case for their use in the analysis of protein sequences, for tasks such as protein structure prediction [21], protein remote homology detection [19, 20], and protein function prediction [3]. Motivated by this, the proposed framework uses an RNN that processes small segments of a protein sequence to model the protein for enhanced functional annotation. The proposed framework has the following components: 1) ProtVecGen: a novel approach to protein vector construction based on small segments of protein sequences and a bi-directional long short-term memory network (a class of RNN), and 2) Classification Model: for predicting the function(s) of protein sequences. These are described next.

II-A Notations

Let $G = \{g_1, g_2, \dots, g_K\}$ denote the set of $K$ GO-terms (either for biological process or molecular function). Let $D = \{(P_i, Y_i)\}_{i=1}^{N}$ denote a database of $N$ labeled protein sequences, where $P_i$ denotes a protein sequence and $Y_i = (Y_{i1}, Y_{i2}, \dots, Y_{iK})$ denotes the binary vector encoding the set of GO-term annotations corresponding to protein $P_i$; $Y_{ik} = 1$ if $P_i$ performs $g_k$, else 0.

II-B Protein Vector Construction

For each amino acid sequence, a fixed-length feature vector is modeled by averaging segment vectors corresponding to the segments of a protein sequence. Protein feature vector construction involves three sub-steps: (i) Protein sequence segmentation, (ii) Segment vector generator and (iii) Protein vector generator. These are described next.

II-B1 Protein Sequence Segmentation

Most existing approaches to protein function prediction, such as ProLanGo [3], DeepGo [4] and MLDA [18], rely upon complete protein sequences for predicting functions. Nevertheless, the functionality of a protein is the consequence of a relatively small functional subsequence present within the protein sequence [17]; this subsequence can be considered as a latent pattern. More often than not, a family of proteins having a common function shares one or more latent patterns associated with that function [17].

Such a latent pattern is called conserved if it can be directly associated with some GO-term and is common in the family of proteins associated with the GO-term. The part of a protein sequence that is not conserved is called a non-conserved region. Since the non-conserved regions are assumed not to be associated with the functionality of a protein, they are considered noise and may adversely affect the vector formulation. The adverse effect is especially prominent in long sequences because there the noisy regions significantly dominate the conserved ones. In the proposed framework, we have incorporated a segmentation approach to ensure efficient formulation of a protein vector. The aim of segmentation is to reduce the dominance of non-conserved regions over relatively small conserved regions in a protein sequence.

Motivated by the above, instead of using the complete full-length amino acid sequences, a segmented sequence approach is proposed for developing a global feature vector for a protein. As shown in Fig. 2, each protein sequence $P_i$ having label set $Y_i$ is partitioned into equal-sized segments with overlapping regions as $S_i = \{p_{i1}, p_{i2}, \dots, p_{in_i}\}$, where $S_i$ is the set of segments corresponding to protein $P_i$ and $p_{ij}$ represents the $j$-th segment of protein $P_i$. The gap between two adjacent segments $p_{ij}$ and $p_{i(j+1)}$ is taken as 30 amino acids. Each segment is assigned the same label set $Y_i$ as the protein sequence. This constitutes the new training dataset of protein segments $D_{seg} = \{(p_{ij}, Y_i)\}$, where $1 \le i \le N$ and $1 \le j \le n_i$. The choice of an appropriate segment size is discussed in detail in Sec. IV-B.
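The segmentation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the "gap" of 30 amino acids is interpreted here as the offset between the start positions of adjacent segments, which the text does not state explicitly. The last segment may be shorter than the segment size; it is padded later, before being fed to the network.

```python
def segment_sequence(sequence, segment_size, gap=30):
    """Split a sequence into overlapping segments of (at most) segment_size.

    Assumption: `gap` is the offset between the start positions of two
    adjacent segments, so consecutive segments overlap by segment_size - gap.
    """
    if segment_size <= 0 or gap <= 0:
        raise ValueError("segment_size and gap must be positive")
    segments = []
    start = 0
    while True:
        segments.append(sequence[start:start + segment_size])
        if start + segment_size >= len(sequence):
            break  # this segment already reaches the end of the sequence
        start += gap
    return segments


def build_segment_dataset(labeled_proteins, segment_size, gap=30):
    """Give every segment the label set of its parent protein."""
    return [
        (segment, labels)
        for sequence, labels in labeled_proteins
        for segment in segment_sequence(sequence, segment_size, gap)
    ]
```

Under this reading, a protein of 250 residues with a segment size of 100 yields six full-length segments, while a protein shorter than the segment size yields a single short segment.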

Fig. 2: Protein Sequence Segmentation

II-B2 Protein Segment Vector Generator

This section describes the method to generate segment vector based on recurrent neural network. A brief introduction to RNN is given next.

Bi-Directional LSTM. A Bi-Directional Long Short-Term Memory (Bi-LSTM) network [28] is a long chain of repeating units called memory cells. A small buffer element called the cell state $C_t$ is the key to the network and passes through each memory cell in the chain. As the cell state passes through each memory cell, its content can be modified using special logics, namely the forget logic and the input logic. This modification depends upon $z_t$, which is a concatenation of the output of the previous cell $h_{t-1}$ and the current input $x_t$:

$$z_t = [h_{t-1}, x_t]$$

The forget logic is used to remove irrelevant information from the cell state as a new input $x_t$ is encountered. It is composed of a single neural network layer with sigmoid activation function $\sigma$, which acts as a filter and produces a value in the range [0, 1] for each element in the cell state; the value represents the fraction of each element (i.e., information) to let through. Mathematically, it is described as:

$$f_t = \sigma(W_f z_t + b_f)$$

where $W_f$ denotes the weight matrix and $b_f$ denotes the bias vector, with the subscript f indicating the forget gate.

The input logic adds new information to the cell state. It is composed of two independent neural network layers with $\sigma$ and $\tanh$ activations respectively. The $\sigma$ layer does the filtering and the $\tanh$ layer creates a vector of new elements. Mathematically, it is described as:

$$i_t = \sigma(W_i z_t + b_i), \qquad \tilde{C}_t = \tanh(W_C z_t + b_C)$$

where $W_i$ and $W_C$ denote weight matrices, and $b_i$ and $b_C$ denote bias vectors.

Next, the following equation is applied to update the information of the cell state by removing the information filtered out using the forget logic and adding new information using the input logic:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where $\odot$ denotes element-wise multiplication.

Finally, the hidden state $h_t$ at each memory cell is decided based on the updated cell state $C_t$ and the output logic $o_t$, where $o_t$ is obtained using a single neural network layer. They are described as:

$$o_t = \sigma(W_o z_t + b_o), \qquad h_t = o_t \odot \tanh(C_t)$$

where $W_o$ and $b_o$ denote the weight matrix and bias vector respectively, with the subscript o indicating the output gate. In contrast to an LSTM network, a Bi-LSTM reads a sequence in both forward and backward directions. Like an LSTM, it also generates a fixed-length feature vector for a sequence.
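As a rough illustration of the gate logic described above, the following dependency-free Python sketch advances one memory cell by a single time step and then reads a sequence in both directions. It follows the standard LSTM formulation; the function names and the `params` layout (gate name mapped to a weight matrix and bias vector) are assumptions made for this example, not the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, bias, inputs, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """Advance one memory cell; params maps 'f', 'i', 'c', 'o' to (W, b)."""
    concat = h_prev + x_t                            # [h_{t-1}, x_t]
    f_t = dense(*params["f"], concat, sigmoid)       # forget logic
    i_t = dense(*params["i"], concat, sigmoid)       # input filter
    c_hat = dense(*params["c"], concat, math.tanh)   # candidate new elements
    o_t = dense(*params["o"], concat, sigmoid)       # output logic
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def bilstm_encode(inputs, fwd_params, bwd_params, hidden_size):
    """Read the inputs forwards and backwards; concatenate the final states."""
    def run(sequence, params):
        h, c = [0.0] * hidden_size, [0.0] * hidden_size
        for x_t in sequence:
            h, c = lstm_step(x_t, h, c, params)
        return h
    return run(inputs, fwd_params) + run(inputs[::-1], bwd_params)
```

Whatever the length of `inputs`, `bilstm_encode` returns a vector of length `2 * hidden_size`, which is the fixed-length property the text relies on.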

Protein Segment Vector Generator. We propose a novel method, the Protein Segment Vector Generator (ProtSVG), to convert protein segments to segment vectors. In ProtSVG, each segment $p_{ij}$ of a protein sequence is modeled independently, irrespective of its neighboring segments, and is represented as a protein segment vector $v_{ij}$. The encoded vector representation for a protein segment is formulated based on the bi-directional LSTM discussed above. Figure 3 describes the ProtSVG model. It consists of three layers: (i) an embedding layer as the input layer, (ii) a bi-directional LSTM layer, and (iii) a dense layer with sigmoid activation function. The model is trained using (segment, label set) pairs $(p_{ij}, Y_i)$ from the segment training dataset.

Each protein segment is decomposed into $n$-mers (with $n = 4$) to formulate protein words. These words are placed in the same order in which they are found in the protein segment to form a sequence. These sequences of words are fed as input to the embedding layer, which outputs a dense representation for each protein segment. The embedding layer generates a matrix of size $(s - 4 + 1) \times 32$, where $s$ denotes the fixed length of segments and 32 denotes the size of the embedding. Padding is done on the last segment of a protein sequence if its length is less than $s$. Next, the bi-directional LSTM layer has 70 memory cells in one block, which receive dense representations from the embedding layer. Dropout [32] was applied in the bi-directional LSTM layer with a drop proportion of 0.3. This network was implemented using Keras 2.0.6 with the TensorFlow backend.
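The input preparation described above can be illustrated with a short Python sketch. It assumes overlapping 4-mers with a stride of one residue, which is what the $(s - 4 + 1)$ row count of the embedding matrix suggests, and it assumes that index 0 is reserved for padding. The function names and the on-the-fly vocabulary are hypothetical conveniences for this example.

```python
def to_nmers(segment, n=4):
    """Overlapping n-mers in their original order (stride of one residue)."""
    return [segment[i:i + n] for i in range(len(segment) - n + 1)]


def encode_segment(segment, vocabulary, segment_size, n=4, pad_id=0):
    """Map a segment to segment_size - n + 1 word ids, padding short segments.

    `vocabulary` is a dict that grows as new n-mers are seen; ids start at 1
    so that pad_id = 0 never collides with a real protein word.
    """
    ids = [vocabulary.setdefault(word, len(vocabulary) + 1)
           for word in to_nmers(segment, n)]
    target_length = segment_size - n + 1
    return ids + [pad_id] * (target_length - len(ids))
```

With a segment size of 100, every encoded segment has 97 entries, each of which the embedding layer then maps to a 32-dimensional vector.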

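The trained ProtSVG model is then used to produce a fixed-size vector $v_{ij} = (\rho_{ij1}, \rho_{ij2}, \dots, \rho_{ijK})$ for each protein segment $p_{ij}$. Here, $K$ is the total number of GO-terms for either molecular function or biological process, and $\rho_{ijk}$ denotes the posterior probability of GO-term $g_k$ given the protein segment $p_{ij}$:

$$\rho_{ijk} = \Pr(g_k \mid p_{ij})$$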

Fig. 3: ProtSVG : Protein Segment Vector Generator

II-B3 Feature Vector Construction

To generate the global feature vector of an entire protein sequence, we average its segment vectors as obtained from ProtSVG. The entire process is described in Algorithm 1. In step 1, the entire protein sequence $P_i$ is segmented into equal-sized segments of size $s$. Segments created for each amino acid sequence in the training set are then fed to the trained ProtSVG model. This yields the segment vector $v_{ij}$ corresponding to segment $p_{ij}$, as shown in steps 2-4. In step 5, all such $v_{ij}$ corresponding to protein $P_i$ are averaged to yield the global feature vector $V_i = (\bar{\rho}_{i1}, \bar{\rho}_{i2}, \dots, \bar{\rho}_{iK})$, where:

$$\bar{\rho}_{ik} = \frac{1}{n_i} \sum_{j=1}^{n_i} \rho_{ijk}$$

Here, $\rho_{ijk}$ is the entry of $v_{ij}$ for GO-term $g_k$, $n_i$ indicates the number of segments for protein $P_i$, and $\bar{\rho}_{ik}$ is the mean posterior probability for $g_k$. This process allows us to generate an equal-sized descriptor for each protein sequence irrespective of its length.

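Algorithm 1 ProtVecGen: Algorithm for Protein Vector Generator.
Require: Protein sequence $P_i$ and segment size $s$
Ensure: Protein sequence vector $V_i$
1:  Partition $P_i$ into a set of fixed-sized segments $S_i = \{p_{i1}, \dots, p_{in_i}\}$.
2:  for $p_{ij}$ in $S_i$ do
3:     Generate segment vector $v_{ij}$ using the protein segment vector generator ProtSVG.
4:  end for
5:  $V_i = \frac{1}{n_i} \sum_{j=1}^{n_i} v_{ij}$
6:  return $V_i$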
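Algorithm 1 can be sketched in Python as follows. The trained ProtSVG model is stood in for by an arbitrary callable, `segment_vector_fn`, that maps a segment to a list of per-term scores; that callable, the function name, and the interpretation of the 30-residue gap as a start-to-start offset are all assumptions made for illustration.

```python
def prot_vec_gen(sequence, segment_size, segment_vector_fn, gap=30):
    """Average the segment vectors of one protein (sketch of Algorithm 1).

    segment_vector_fn stands in for the trained ProtSVG model: it takes a
    segment string and returns a list of K scores, one per GO-term.
    """
    # Step 1: overlapping segments whose start positions are `gap` apart.
    segments = []
    start = 0
    while True:
        segments.append(sequence[start:start + segment_size])
        if start + segment_size >= len(sequence):
            break
        start += gap
    # Steps 2-4: one vector per segment.
    vectors = [segment_vector_fn(segment) for segment in segments]
    # Step 5: element-wise mean over all segment vectors.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]
```

Because the mean is taken over however many segments a protein has, the output length depends only on the number of GO-terms, not on the sequence length.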

The benefits of the proposed approach are three-fold:

1) It retains the relative ordering of $n$-mers in the protein sequence, which is not the case with the frequently used tf-idf weighting scheme.

2) It fragments the protein sequence into small segments, transforming each such segment into a fixed-size feature vector. This enables machine learning models to learn the latent patterns conserved within small regions without being affected by the remaining non-conserved parts of the complete sequence. Thus, the more often a machine learning model sees such a recurrent conserved pattern, the easier it becomes for the model to associate the pattern with the specific protein functionality.

3) The segmentation approach also avoids the adverse effects of heavy padding for short protein sequences and truncation of long sequences as in [19, 20]. The latter may result in information loss or even the dropping of some conserved patterns.

II-C Classification based on Multi-sized Segment Feature Vectors

Because the size of the conserved patterns can vary from one sequence to another, partitioning a sequence based on a fixed segment size may split a conserved region across two segments. In order to prevent this, we partitioned protein sequences based on multiple segment sizes so that each conserved pattern is preserved in at least one of the segments. Protein vectors were generated separately for each segment size. For a given sequence, the protein vectors corresponding to its different-sized segments are then concatenated to produce a merged vector, which is finally used for training.

We call this method ProtVecGen-Plus, as described in Algorithm 2. In steps 2-4, for each protein sequence $P_i$, separate protein vectors for the three segment sizes 100, 120, and 140 are generated as $V_i^{100}$, $V_i^{120}$, and $V_i^{140}$ respectively using Algorithm 1. These are then concatenated in step 5 ($\oplus$ denotes the concatenation operator) to yield the vector $V_i^{+}$, which is finally used for training. The dimension of $V_i^{+}$ is $3K$, where $K$ denotes the total number of GO-terms as defined earlier. The choice of {100, 120, 140} as segment sizes was made after evaluating a wide range of segment sizes from 60 to 700; a detailed discussion is given in Sec. IV-B.

Algorithm 2 ProtVecGen-Plus: Algorithm for construction of protein vector based on feature concatenation.
Require: Protein sequence $P_i$
Ensure: Protein sequence vector $V_i^{+}$
1:  Initialisation: segSize = {100, 120, 140}
2:  for size in segSize do
3:     $V_i^{size}$ = ProtVecGen($P_i$, size)
4:  end for
5:  $V_i^{+} = [V_i^{100} \oplus V_i^{120} \oplus V_i^{140}]$
6:  return $V_i^{+}$
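The concatenation in Algorithm 2 is simple enough to show directly. In the sketch below, `vector_fns` is a hypothetical mapping from segment size to a callable that returns the ProtVecGen vector of a sequence for that size (for instance, a wrapper around a segment model trained for that size); how those callables are obtained is outside the scope of this example.

```python
def prot_vec_gen_plus(sequence, vector_fns, seg_sizes=(100, 120, 140)):
    """Concatenate per-size protein vectors (sketch of Algorithm 2).

    vector_fns maps each segment size to a callable that takes a sequence
    and returns its K-dimensional ProtVecGen vector for that size.
    """
    merged = []
    for size in seg_sizes:
        merged.extend(vector_fns[size](sequence))  # append V^size to V^+
    return merged
```

With K GO-terms and three segment sizes, the merged vector has 3K entries, ordered by segment size.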

The proposed framework is shown in Fig. 4. It consists of three layers: (i) a protein vector generator layer, (ii) a fully-connected neural network layer and (iii) an SVM layer, connected sequentially as shown. The first layer generates protein vectors as described in Algorithm 2. The generated protein vectors $V_i^{+}$ are fed to the fully-connected neural network layer. The neural network layer consists of an input layer followed by a dense output layer with sigmoid activation function. The output of the neural network layer is fed to the SVM layer. However, the neural network and SVM layers are trained separately.

First, the neural network is trained on the vectors $V_i^{+}$ to produce a posterior probability $\Pr(g_k \mid V_i^{+})$ for each individual functional class $g_k$ given $V_i^{+}$, where $g_k \in G$. Then, the SVM layer is trained. The input to the SVM layer is the output of the trained neural network layer generated for each $V_i^{+}$. Here, a separate SVM is trained for each term $g_k$ that produces a binary output, where value 1 predicts that $P_i$ is annotated with $g_k$ and value 0 predicts otherwise. Using an SVM eliminates the need to manually identify a threshold for converting continuous probabilities to clear-cut binary values.

III Weighted Hybrid Framework Approach

ProtVecGen produces an improvement of +3.37% (for BP) and +5.48% (for MF) over the state-of-the-art MLDA feature based classifier. ProtVecGen-Plus does better, with an improvement of +5.38% over MLDA and +2.01% over ProtVecGen for BP. Similarly, ProtVecGen-Plus showed an improvement of +8.00% and +2.52% over MLDA and ProtVecGen respectively for MF. The detailed results are discussed in Sec. IV. In order to boost the predictive performance further, a new hybrid framework is proposed that combines the proposed ProtVecGen-Plus based model with one based on MLDA (Multi-label Linear Discriminant Analysis) features [18]. The MLDA based model is trained using MLDA features on a two-layer neural network classifier, as described before for the proposed model. There are two reasons for using the MLDA based model: (i) Reference [18] is the state of the art for multi-label protein function prediction using machine learning techniques, and (ii) it produces good results for classifying short and mid-range protein sequences, as elaborated in Sec. IV-D. The complete hybrid framework is shown in Fig. 5. We first describe MLDA in Sec. III-A. Then, the proposed weighted scheme to combine the predictions from the two models is described in Sec. III-B.

Fig. 4: Proposed Framework

III-A Multi-Label Linear Discriminant Analysis

Multi-Label Linear Discriminant Analysis (MLDA) [31] is a generalization of the classical Linear Discriminant Analysis (LDA) method to the multi-label scenario. Specifically, classical LDA assumes that each data element belongs to one class only, whereas MLDA allows an element to belong to multiple classes. Like LDA, MLDA also projects data from the feature space $\mathbb{R}^{p}$ to a subspace $\mathbb{R}^{q}$:

$$y = W^{T} x$$

where $x$ is a data element in the feature space $\mathbb{R}^{p}$, $y$ is the projected element in the subspace $\mathbb{R}^{q}$, and $W$ is the projection matrix.

Like LDA, in MLDA the projection matrix $W$ is obtained by solving the eigenvalue problem for the matrix:

$$S_w^{-1} S_b$$

where $S_b$ and $S_w$ denote the between-class and within-class scatter matrices respectively. In the case of MLDA, these matrices are calculated as:

$$S_b = \sum_{k=1}^{K} \left( \sum_{i=1}^{N} Y_{ik} \right) (m_k - m)(m_k - m)^{T}, \qquad S_w = \sum_{k=1}^{K} \sum_{i=1}^{N} Y_{ik} \, (x_i - m_k)(x_i - m_k)^{T}$$

where $K$ is the cardinality of the GO-term set, $N$ is the number of sequences, $m_k$ denotes the class mean of $g_k$, $m$ denotes the global mean, and $Y_{ik} = 1$ if $P_i$ has $g_k$ as its annotation and 0 otherwise.

III-B Function Prediction based on Hybrid Framework

Functional annotation of proteins is done using the combined weighted predictions of: 1) the proposed model (represented as M1) and 2) the MLDA based model [18] (represented as M2), as shown in Fig. 5. The MLDA based model uses tf-idf features for classification [18]. Each protein sequence is decomposed into n-mers, based on which the corresponding tf-idf weights are computed. Each weight highlights the importance of an n-mer in a protein sequence with respect to the entire input dataset. The tf-idf technique produces a sparse feature matrix. Applying MLDA to this matrix produces a dense feature matrix, which is eventually used for classification.

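The combined weighted prediction is computed as:

$$\hat{y}_k = \alpha \, \hat{y}_k^{M1} + (1 - \alpha) \, \hat{y}_k^{M2}$$

where $\hat{y}_k^{M1}$ denotes the prediction by M1 for GO-term $g_k$, $\hat{y}_k^{M2}$ denotes the prediction by M2 for GO-term $g_k$, and $\alpha$ denotes the trade-off parameter. It is formulated empirically using the avg. F1-scores of M1 and M2:

$$\alpha = \frac{\text{avg. F1-score}(M1)}{\text{avg. F1-score}(M1) + \text{avg. F1-score}(M2)}$$

where avg. F1-score(M1) and avg. F1-score(M2) denote the average F1-scores of M1 and M2 respectively.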
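A minimal sketch of the weighting scheme follows. It assumes that the trade-off parameter is the F1-score of M1 normalised by the sum of both F1-scores, and that the two models' predictions are mixed as a convex combination; both are assumptions consistent with the description above rather than confirmed details, and the function names are hypothetical.

```python
def tradeoff_from_f1(f1_m1, f1_m2):
    """Assumed form of the trade-off: M1's share of the two F1-scores."""
    return f1_m1 / (f1_m1 + f1_m2)


def combine_predictions(pred_m1, pred_m2, alpha):
    """Per-term convex combination of the two models' predictions."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(pred_m1, pred_m2)]


# Illustration with the BP average F1-scores reported in Table I
# (ProtVecGen-Plus + NN: 54.65, MLDA + NN: 49.27).
alpha_bp = tradeoff_from_f1(54.65, 49.27)
```

Under this assumed form, `alpha_bp` comes out slightly above 0.5, so the ProtVecGen-Plus model would receive marginally more weight than the MLDA model.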

Fig. 5: Hybrid Framework

IV Results and Discussion

We have evaluated the proposed models using protein sequences from the UniProtKB database [26], which consists of around 558125 protein sequences reviewed and annotated with GO-terms. Out of these, we randomly chose 103683 protein sequences labeled with BP and 91690 protein sequences labeled with MF for experiments. Further, we removed those terms that annotated fewer than 200 sequences. Protein sequences not annotated with at least one of the remaining terms were dropped. This left us with 58310 sequences (with 295 GO-terms) for BP prediction and 43218 sequences (with 135 GO-terms) for MF prediction. The dataset has been made available by the authors.
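The two filtering steps can be expressed compactly in Python. The sketch below assumes the annotations are held as a dictionary from a protein identifier to its set of GO-terms; that data layout and the function name are illustrative choices, not part of the original pipeline.

```python
from collections import Counter


def filter_dataset(annotations, min_count=200):
    """Drop rare GO-terms, then drop proteins left without any term.

    annotations: dict mapping a protein id to its set of GO-terms.
    Returns the filtered dict and the set of retained terms.
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    kept_terms = {term for term, count in counts.items() if count >= min_count}
    filtered = {pid: terms & kept_terms for pid, terms in annotations.items()}
    return {pid: terms for pid, terms in filtered.items() if terms}, kept_terms
```

A term is retained when it annotates at least `min_count` sequences, which matches the rule of removing terms that annotate fewer than 200 sequences.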

IV-A Evaluation Methods and Metrics

The popular metrics precision, recall, and F1-score were used to evaluate the performance of the proposed models [30, 31]. Let $T_i$ be the actual set of GO-terms annotating protein $P_i$, let $\hat{T}_i$ be the corresponding predicted set of GO-terms, and let $N$ be the number of data samples.

  1. Average Precision: Precision indicates the fraction of predicted GO-terms in set $\hat{T}_i$ that are correct. Average precision is the mean of the precision values for all the data samples: $\frac{1}{N} \sum_{i=1}^{N} \frac{|T_i \cap \hat{T}_i|}{|\hat{T}_i|}$.

  2. Average Recall: Recall captures the fraction of GO-terms in the actual set $T_i$ that are predicted. Average recall is the mean of the recall values for all the data samples: $\frac{1}{N} \sum_{i=1}^{N} \frac{|T_i \cap \hat{T}_i|}{|T_i|}$.

  3. Average F1-Score: The F1-score is the harmonic mean of the precision and recall values. Average F1-score is the mean of the F1-score values for all the data samples: $\frac{1}{N} \sum_{i=1}^{N} \frac{2 \, |T_i \cap \hat{T}_i|}{|T_i| + |\hat{T}_i|}$.
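The three sample-averaged metrics can be computed as in the sketch below. The function name is hypothetical, and the convention of scoring a sample as 0 when its predicted (or actual) set is empty is an assumption, since the text does not say how such cases are handled.

```python
def example_based_scores(actual_sets, predicted_sets):
    """Return (avg precision, avg recall, avg F1) over all samples."""
    precisions, recalls, f1_scores = [], [], []
    for actual, predicted in zip(actual_sets, predicted_sets):
        hits = len(actual & predicted)
        # Assumed convention: an empty set contributes a score of 0.
        precision = hits / len(predicted) if predicted else 0.0
        recall = hits / len(actual) if actual else 0.0
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n
```

Note that the average F1-score is the mean of per-sample F1 values, not the harmonic mean of the average precision and average recall, so the three reported numbers need not satisfy the harmonic-mean relation among themselves.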


IV-B Choosing Appropriate Segment Size

The choice of an appropriate segment size is crucial for the proposed approach. The best segment size was chosen based on empirical evaluation of the proposed framework with different segment sizes in the range [60-700]. Figure 6(a) shows the Average Precision, Average Recall and Average F1 values for different segment sizes for biological process prediction. Figure 6(b) shows the corresponding values for molecular function prediction.

As shown in Fig. 6, the proposed framework yields the best results in the segment size range of [80-180], for both BP and MF. Motivated by this and by the success of the multi-segment approach compared to the single-segment approach, discussed next in Sec. IV-C, we chose sizes 100, 120 and 140 to segment each protein sequence. The vectors corresponding to the three segment sizes were then concatenated for all experiments, as discussed in Sec. II-C.

Fig. 6: Effect of segment size on the performance of the proposed framework for (a) biological process and (b) molecular function.

IV-C Single Segment vs Multi Segment Approach

We also compared the effect of the three chosen segment sizes individually. The performance of the feature vector ProtVecGen-100 (based on segment size 100) for BP is shown in row 4 of Table I. Similarly, the performances of ProtVecGen-120 (based on segment size 120) and ProtVecGen-140 (based on segment size 140) are shown in rows 5 and 6 respectively. The ProtVecGen-Plus feature, formulated from multiple segments of size 100, 120, and 140, is shown in row 7 of Table I. Results in the remaining rows are discussed later. The columns under the heading Molecular Function in Table I show the corresponding values for MF.

The ProtVecGen-Plus based classifier yields overall average F1-scores of 54.65 ± 0.15 and 65.91 ± 0.10 for BP and MF respectively. For BP, this is a significant improvement over the second best classifier, trained using ProtVecGen-120, with an average F1-score of 52.64 ± 0.08. For MF, the ProtVecGen-100 based classifier is the next best, with an average F1-score of 63.39 ± 0.15. These results clearly demonstrate the superiority of the multi-segment based ProtVecGen-Plus feature over the other three, which are based on single segment sizes.

TABLE I: Classification Accuracy (BP: Biological Process, MF: Molecular Function)

| S. No. | Approach | BP Avg. Recall (%) | BP Avg. Precision (%) | BP Avg. F1-Score (%) | MF Avg. Recall (%) | MF Avg. Precision (%) | MF Avg. F1-Score (%) |
|---|---|---|---|---|---|---|---|
| 1 | Bi-LSTM + NN (Complete) | 31.63 ± 0.11 | 33.34 ± 0.11 | 31.59 ± 0.12 | 37.81 ± 0.06 | 38.5 ± 0.05 | 37.43 ± 0.05 |
| 2 | SGT + CNN | 40.95 ± 0.13 | 40.91 ± 0.14 | 39.53 ± 0.12 | 52.82 ± 0.15 | 48.46 ± 0.14 | 48.69 ± 0.16 |
| 3 | MLDA + NN | 49.42 ± 0.10 | 52.61 ± 0.14 | 49.27 ± 0.12 | 58.29 ± 0.16 | 60.20 ± 0.20 | 57.91 ± 0.17 |
| 4 | ProtVecGen-100 + NN | 53.15 ± 0.09 | 54.42 ± 0.11 | 52.11 ± 0.10 | 63.93 ± 0.25 | 65.25 ± 0.11 | 63.39 ± 0.15 |
| 5 | ProtVecGen-120 + NN | 53.58 ± 0.11 | 55.07 ± 0.07 | 52.64 ± 0.08 | 64.06 ± 0.21 | 64.81 ± 0.06 | 63.18 ± 0.11 |
| 6 | ProtVecGen-140 + NN | 52.34 ± 0.12 | 54.22 ± 0.06 | 51.66 ± 0.09 | 63.16 ± 0.21 | 63.93 ± 0.08 | 62.31 ± 0.09 |
| 7 | ProtVecGen-Plus + NN | 56.42 ± 0.24 | 56.65 ± 0.12 | 54.65 ± 0.15 | 66.93 ± 0.21 | 67.42 ± 0.10 | 65.91 ± 0.10 |
| 8 | ProtVecGen-Plus + MLDA + NN | 58.19 ± 0.14 | 58.80 ± 0.13 | 56.68 ± 0.13 | 68.62 ± 0.12 | 68.27 ± 0.13 | 67.12 ± 0.10 |

IV-D Effect of the length of protein sequences

In this subsection, we investigate how the proposed feature sets perform for protein sequences of different lengths. We evaluate the performance of each feature set over nine distinct ranges of sequence lengths, as shown in Figs. 7-9. The two proposed feature sets, ProtVecGen-120 and ProtVecGen-Plus, are evaluated along with the MLDA [31] and Bi-LSTM [19, 20] features from the literature. A neural network based classifier is used as the model. In addition, the performance of the proposed hybrid ProtVecGen-Plus+MLDA approach is compared with the others.
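
The length-wise evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the bin edges are hypothetical (the text only states that nine ranges are used), and the lengths and per-protein F1 values are made up.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence

# Hypothetical upper-exclusive bin edges; eight edges give nine length ranges.
BIN_EDGES = [100, 200, 300, 400, 500, 600, 700, 800]


def length_bin(seq_len: int, edges: Sequence[int] = BIN_EDGES) -> int:
    """Index of the length range containing seq_len (0 is the shortest range)."""
    return bisect_right(edges, seq_len)


def mean_f1_by_length(lengths: List[int], f1_scores: List[float],
                      edges: Sequence[int] = BIN_EDGES) -> Dict[int, float]:
    """Group per-protein F1 values by length range and average each group."""
    grouped = defaultdict(list)
    for n, f1 in zip(lengths, f1_scores):
        grouped[length_bin(n, edges)].append(f1)
    return {b: sum(v) / len(v) for b, v in sorted(grouped.items())}


# Made-up sequence lengths and per-protein F1 values.
lengths = [80, 150, 150, 320, 950]
f1_scores = [0.60, 0.50, 0.70, 0.40, 0.55]

by_bin = mean_f1_by_length(lengths, f1_scores)
for b, score in by_bin.items():
    print(f"length range {b}: mean F1 = {score:.2f}")
```

Reporting one mean per range, rather than a single global mean, keeps the few long sequences from being hidden by the many short-to-medium ones.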

Figures (a) and (b) show the distribution of protein sequences by sequence length for BP and MF respectively. While the majority of protein sequences are of short-to-medium length, very few are long. ProtVecGen-Plus performs better than ProtVecGen-120 for both BP and MF, as shown in Figures 7 (average F1-score), 8 (average precision) and 9 (average recall). For BP, MLDA is better than ProtVecGen-Plus for protein sequences with fewer than 300 amino acid residues. However, both ProtVecGen-Plus and ProtVecGen-120 outperform MLDA for sequences with more than 300 residues. The same pattern holds for MF, with MLDA performing better for sequences with fewer than 200 amino acid residues and the proposed features performing better for longer sequences.

The performance of Bi-LSTM is the worst, while that of ProtVecGen-Plus+MLDA is superior to the others, as shown in Figures 7, 8 and 9, which illustrate the average F1-score, average precision and average recall respectively. The performance of ProtVecGen-Plus+MLDA remains almost consistent over the entire range of sequence lengths. Thus, all three proposed approaches, ProtVecGen-120, ProtVecGen-Plus and ProtVecGen-Plus+MLDA, are able to mitigate the ill effect of the uneven distribution of sequence lengths. Specifically, the proposed features show markedly better results for longer protein sequences than the other approaches from the existing literature.

Fig. 7: F1-Score for (a) biological process and (b) molecular function.
Fig. 8: Precision for (a) biological process and (b) molecular function.
Fig. 9: Recall for (a) biological process and (b) molecular function.

IV-E Overall Comparison with State of the Art and Other Approaches

In the previous subsection, we discussed the performance of the proposed feature sets with respect to sequence lengths. In this subsection, the overall performance of the proposed features is compared with features from the existing literature on the NN-SVM based proposed model described in Sec. II-C. The comparison is done for both biological process (BP) and molecular function (MF) predictions. Apart from the state-of-the-art Multi-label LDA (MLDA) [31] features, features from the existing literature such as the recurrent neural network based Bi-LSTM features [19, 20] and the positional statistics based Sequence Graph Transform (SGT) features [27] are also used for comparative evaluation.

For BP, ProtVecGen-Plus outperforms MLDA and ProtVecGen-120 in average F1-score by +5.38 and +2.01 percentage points respectively. Similar results were obtained for MF, with ProtVecGen-Plus showing an improvement of +8.00 and +2.52 percentage points over MLDA and ProtVecGen-100 respectively. The performances of the Bi-LSTM features, which are based on the analysis of entire sequences, and of the SGT features are very low compared to the state of the art and the proposed features. The performance results for both BP and MF are shown in Table I. The results demonstrate that the predictive performance of every proposed segment-based approach is better than that of the rest. This highlights the superiority of the sub-sequence analysis approach over entire-sequence analysis.

The hybrid ProtVecGen-Plus+MLDA approach does well for both biological process and molecular function. For BP, the average F1-score of the hybrid approach is 56.68 ± 0.13, which is an improvement of +7.41 and +2.03 percentage points over the MLDA and ProtVecGen-Plus approaches respectively. Recall and precision values exhibit a similar trend, as shown in Table I. For MF, ProtVecGen-Plus+MLDA achieves an average F1-score of 67.12 ± 0.10, an improvement of +9.21 and +1.21 percentage points over the MLDA and ProtVecGen-Plus approaches respectively. Recall and precision values again exhibit a similar trend. Overall, the ProtVecGen-Plus+MLDA feature outperforms all other feature sets and does well while annotating new proteins.
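
The improvements quoted in this subsection can be re-derived from the average F1-scores in Table I. The short check below, included only as a worked example, shows that they are absolute differences between F1-scores (percentage points) and not relative gains; the helper names are illustrative.

```python
# Average F1-scores (%) copied from Table I.
F1 = {
    "BP": {"MLDA": 49.27, "ProtVecGen-120": 52.64,
           "ProtVecGen-Plus": 54.65, "ProtVecGen-Plus+MLDA": 56.68},
    "MF": {"MLDA": 57.91, "ProtVecGen-100": 63.39,
           "ProtVecGen-Plus": 65.91, "ProtVecGen-Plus+MLDA": 67.12},
}


def point_gain(task: str, new: str, baseline: str) -> float:
    """Absolute difference in average F1-score, in percentage points."""
    return round(F1[task][new] - F1[task][baseline], 2)


def relative_gain(task: str, new: str, baseline: str) -> float:
    """Relative change in average F1-score, in percent of the baseline."""
    base = F1[task][baseline]
    return 100.0 * (F1[task][new] - base) / base


for task in ("BP", "MF"):
    print(task,
          "Plus vs MLDA:", point_gain(task, "ProtVecGen-Plus", "MLDA"),
          "| hybrid vs MLDA:", point_gain(task, "ProtVecGen-Plus+MLDA", "MLDA"),
          "| hybrid vs Plus:", point_gain(task, "ProtVecGen-Plus+MLDA", "ProtVecGen-Plus"))
```

The relative gain of the hybrid over MLDA for BP is roughly twice the quoted figure, which is why the two readings should not be confused.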

V Conclusion

The proposed framework generates a highly discriminative feature vector for a protein sequence by splitting the sequence into smaller segments, which are then used to construct the feature vector. The approach overcomes the side effects associated with using a protein sequence in its entirety for constructing feature vectors by reducing the effect of noisy or non-conserved regions over the conserved pattern.

The proposed approach also takes care of the variable size of conserved patterns by converting a protein sequence into a vector based on segments of multiple sizes: 100, 120, and 140. The new framework improves the function prediction performance over the existing techniques compared in Table I and also shows consistent performance for protein sequences of different lengths.
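
As a rough illustration of the multi-size segmentation idea, the sketch below cuts one sequence at each of the three segment sizes. The windowing scheme (consecutive, non-overlapping windows with a shorter final window) is an assumption made for this example and may differ from the segmentation actually used in the framework; the toy sequence is made up.

```python
from typing import Dict, List, Sequence

SEGMENT_SIZES = (100, 120, 140)  # segment sizes named in the text


def split_into_segments(sequence: str, size: int) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping windows (assumed scheme)."""
    if size <= 0:
        raise ValueError("segment size must be positive")
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def multi_size_segments(sequence: str,
                        sizes: Sequence[int] = SEGMENT_SIZES) -> Dict[int, List[str]]:
    """Segment the same sequence once per segment size."""
    return {size: split_into_segments(sequence, size) for size in sizes}


# Toy sequence of 251 residues built from a repeated made-up motif.
toy = "M" + "ACDEFGHIKL" * 25

segments = multi_size_segments(toy)
for size, parts in segments.items():
    print(size, [len(p) for p in parts])
```

Each list of segments would then be passed to the segment-level model for its size, and the resulting per-size vectors combined into a single feature vector.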

The proposed framework is also resistant to the uneven distribution of protein sequences with respect to sequence length. Further, the proposed approach is combined with the MLDA feature based model to improve the overall performance, using a new weighting scheme based on the individual average F1-score of each model. Future work will be to incorporate interaction data to improve the performance for biological processes and to handle a large number of GO terms.
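
A weighting scheme driven by each model's average F1-score could look like the sketch below. This is only one plausible reading: the normalisation (weights proportional to F1), the decision threshold of 0.5 and the per-label scores are all assumptions for illustration, not details taken from the paper. The F1 values are the BP figures from Table I.

```python
from typing import Dict, List


def f1_weights(f1_by_model: Dict[str, float]) -> Dict[str, float]:
    """Weights proportional to each model's average F1-score (assumed scheme)."""
    total = sum(f1_by_model.values())
    return {name: f1 / total for name, f1 in f1_by_model.items()}


def combine_scores(scores_by_model: Dict[str, List[float]],
                   weights: Dict[str, float]) -> List[float]:
    """Weighted average of the per-label scores produced by several models."""
    n_labels = len(next(iter(scores_by_model.values())))
    combined = [0.0] * n_labels
    for name, scores in scores_by_model.items():
        for j, score in enumerate(scores):
            combined[j] += weights[name] * score
    return combined


# Average F1-scores for BP from Table I.
weights = f1_weights({"ProtVecGen-Plus": 54.65, "MLDA": 49.27})

# Made-up per-label scores for one protein from the two models.
scores = {"ProtVecGen-Plus": [0.9, 0.2, 0.6], "MLDA": [0.7, 0.4, 0.1]}

combined = combine_scores(scores, weights)
predicted = [j for j, s in enumerate(combined) if s >= 0.5]  # hypothetical threshold
print(weights, combined, predicted)
```

With these weights the stronger model contributes slightly more to every label score, while the weaker model can still tip borderline labels.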


The authors would like to thank…


  • [1] Pandey, G., Kumar, V. & Steinbach, M., (2006). “Computational approaches for protein function prediction: A survey”. Department of Computer Science and Engineering, University of Minnesota, 13-28.
  • [2] Jeong, J.C., Lin, X. and Chen, X.W., 2011. “On position-specific scoring matrix for protein function prediction”. IEEE/ACM transactions on computational biology and bioinformatics, 8(2), pp.308-315.
  • [3] Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., & Chen, Z. (2017). “ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network.” Molecules, 22(10), 1732.
  • [4] Kulmanov, Maxat, Mohammed Asif Khan, & Robert Hoehndorf. “DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier.” Bioinformatics 34.4 (2017): 660-668.
  • [5] Clark, Wyatt T., and Predrag Radivojac. “Analysis of protein function and its prediction from amino acid sequence.” Proteins: Structure, Function, and Bioinformatics 79.7 (2011): 2086-2096.
  • [6] Whisstock, James C., and Arthur M. Lesk. “Prediction of protein function from protein sequence and structure.” Quarterly reviews of biophysics 36.3 (2003): 307-340.
  • [7] Lee, D., Redfern, O. and Orengo, C., 2007. “Predicting protein function from sequence and structure”. Nature Reviews Molecular Cell Biology, 8(12), p.995.
  • [8] Roy, A., Yang, J. and Zhang, Y., 2012. “COFACTOR: an accurate comparative algorithm for structure-based protein function annotation.” Nucleic acids research, 40(W1), pp.W471-W477.
  • [9] R. Sharan, I. Ulitsky, and R. Shamir, “Network-Based Prediction of Protein Function.” Molecular Systems Biology, vol. 3, no. 1, pp. 88-100, 2007.
  • [10] H. Chua, W. Sung, and L. Wong, “Exploiting Indirect Neighbours and Topological Weight to Predict Protein Function from Protein-Protein Interactions.” Bioinformatics, vol. 22, no. 13, pp. 1623-1630, 2006.
  • [11] Lipman, David J., and William R. Pearson. “Rapid and sensitive protein similarity searches.” Science 227.4693 (1985): 1435-1441.
  • [12] Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. “Basic local alignment search tool.” Journal of molecular biology 215, no. 3 (1990): 403-410.
  • [13] Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic acids research 25, no. 17 (1997): 3389-3402.
  • [14] Lan, L., Djuric, N., Guo, Y. and Vucetic, S., 2013, February. “MS-kNN: protein function prediction by integrating multiple data sources”. In BMC bioinformatics (Vol. 14, No. 3, p. S8). BioMed Central.
  • [15] Piovesan, D., Giollo, M., Leonardi, E., Ferrari, C., & Tosatto, S. C. (2015). “INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity.” Nucleic acids research, 43(W1), W134-W140.
  • [16] Metzker, Michael L. “Sequencing technologies—the next generation.” Nature reviews genetics 11, no. 1 (2010): 31.
  • [17] Friedberg, Iddo. “Automated protein function prediction - the genomic challenge.” Briefings in bioinformatics 7.3 (2006): 225-242.
  • [18] Wang, H., Yan, L., Huang, H., & Ding, C. (2017). “From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis”. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 14(3), 503-513.
  • [19] Li, Shumin, Junjie Chen, and Bin Liu. “Protein remote homology detection based on bidirectional long short-term memory.” BMC bioinformatics 18.1 (2017): 443.
  • [20] Liu, B., & Li, S. (2018). “ProtDet-CCH: Protein remote homology detection by combining Long Short-Term Memory and ranking methods.” IEEE/ACM Transactions on Computational Biology and Bioinformatics.
  • [21] Heffernan, R., Yang, Y., Paliwal, K. and Zhou, Y., 2017. “Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility.” Bioinformatics, 33(18), pp.2842-2849.
  • [22] Ben-Hur, A. and Brutlag, D., 2006. “Sequence motifs: highly predictive features of protein function”. In Feature extraction . Springer, Berlin, Heidelberg, pp. 625-645.
  • [23] Wang, X., Schroeder, D., Dobbs, D. & Honavar, V., 2003. “Automated data-driven discovery of motif-based protein function classifiers”. Information Sciences, 155(1-2), pp.1-18.
  • [24] You, Z. H., Zhou, M., Luo, X., & Li, S. (2017). “Highly efficient framework for predicting interactions between proteins.” IEEE transactions on cybernetics, 47(3), 731-743.
  • [25] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. and Harris, M.A., 2000. “Gene Ontology: tool for the unification of biology”. Nature genetics, 25(1), p.25.
  • [26] UniProt Consortium, 2014. “UniProt: a hub for protein information”. Nucleic acids research, 43(D1), pp.D204-D212.
  • [27] Ranjan, C., Ebrahimi, S. and Paynabar, K., 2016. “Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining”. stat, 1050, p.12.
  • [28] Graves, A. and Schmidhuber, J., 2005. “Framewise phoneme classification with bidirectional LSTM and other neural network architectures.” Neural Networks, 18(5-6), pp.602-610.
  • [29] Marks, D. S., Hopf, T. A., & Sander, C. (2012). “Protein structure prediction from sequence variation”. Nature biotechnology, 30(11), 1072.
  • [30] Zhang, M.L. and Zhou, Z.H., 2014. “A review on multi-label learning algorithms”. IEEE transactions on knowledge and data engineering, 26(8), pp.1819-1837.
  • [31] Wang, H., Ding, C. and Huang, H., 2010, September. “Multi-label linear discriminant analysis.” In European Conference on Computer Vision (pp. 126-139). Springer, Berlin, Heidelberg.
  • [32] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. “Dropout: a simple way to prevent neural networks from overfitting.” The Journal of Machine Learning Research, 15(1), pp.1929-1958.