Network-based protein structural classification

04/12/2018 ∙ by Arash Rahnama, et al.

Experimental determination of protein function is resource-consuming. As an alternative, computational prediction of protein function has received attention. In this context, protein structural classification (PSC) can help, by allowing for determining structural classes of currently unclassified proteins based on their features, and then relying on the fact that proteins with similar structures have similar functions. Existing PSC approaches rely on sequence-based or direct ("raw") 3-dimensional (3D) structure-based protein features. Instead, we first model 3D structures as protein structure networks (PSNs). Then, we use ("processed") network-based features for PSC. We are the first ones to do so. We propose the use of graphlets, state-of-the-art features in many domains of network science, in the task of PSC. Moreover, because graphlets can deal only with unweighted PSNs, and because accounting for edge weights when constructing PSNs could improve PSC accuracy, we also propose a deep learning framework that automatically learns network features from the weighted PSNs. When evaluated on a large set of 9,509 CATH and 11,451 SCOP protein domains, our proposed approaches are superior to existing PSC approaches in terms of both accuracy and running time.


1 Introduction

1.1 Motivation and related work

Proteins are major molecules of life, and thus understanding their cellular function is important. However, doing so experimentally is costly and time-consuming [1]. Instead, computational approaches are often used for this purpose; these are much more efficient because they leverage the fact that (sequence or 3-dimensional (3D)) structural similarity of proteins often indicates their functional similarity. One type of such computational approaches is protein structural classification (PSC) [2]. PSC uses structural features of proteins with known labels (typically CATH [3] or SCOP [4] structural classes) to learn a classification model in a supervised manner (i.e., by including the labels in the process of training the model). Then, the structural features of a protein with an unknown label can be used as input to the classification model to determine the structural class of the protein. This information can in turn be used to predict the function of a protein based on the functions of other proteins that belong to the same structural class as the protein of interest. In this paper, we focus on the PSC problem.

Note that there exists a related computational problem which can help with protein function prediction – that of protein structural comparison [5]. However, unlike PSC: 1) protein structural comparison uses structural features of proteins with known or unknown labels in unsupervised rather than supervised manner (i.e., it ignores any potential label information), and 2) it uses the features to compute pairwise similarities between the proteins in hope that highly-similar proteins will have the same label (where the labels are used only after the fact), rather than predicting the label of a single protein. In other words, both the goals and working mechanisms of PSC and protein structural comparison are different. Hence, the two approach categories are not comparable to and cannot be fairly evaluated against each other.

Since proteins with high sequence similarity typically also have high 3D structural and functional similarity, traditional PSC approaches have relied only on sequence-based protein features [6]. A popular baseline sequence feature is the amino acid composition (AAComposition), which measures the relative composition of the different amino acid types in a protein sequence [5]. Some other, more comprehensive sequence features include the position-specific scoring matrix [7], the three-state secondary structure profile [8], and the HMM profile [9], all of which were recently used by a PSC approach called SVMfold [6]. SVMfold integrates the above three sequence features to represent a protein sequence and then uses a support vector machine as the classification algorithm to perform PSC.

Although sequence features have been extensively used for the purpose of PSC, it has been argued that proteins with low sequence similarity can still show high 3D structural and functional similarity [10]. On the other hand, proteins with high sequence similarity can have low 3D structural and functional similarity [11]. Hence, PSC based on 3D-structural as opposed to (or in addition to) sequence features could more correctly identify the structural class of a protein [12].

Typically, 3D-structural approaches extract features directly from the 3D structures of proteins and then use these “raw” features to compare proteins [12, 13]. Interestingly, recent 3D-structural PSC approaches have focused on classification based on protein pairs [14, 15, 2]. For example, they consider a pair of protein structures, one with a known label (class) and the other with an unknown label, and if the proteins are similar enough in terms of their 3D-structural (and possibly also sequence) features, they assign the known label of the classified protein to the currently unclassified protein. As such, these approaches fall somewhere in between PSC (because both are supervised, but PSC analyzes a single protein at a time) and protein structural comparison (because both focus on protein pairs, but protein structural comparison is unsupervised). Therefore, they are not comparable to and cannot be directly evaluated against approaches that solve the PSC problem as defined in our study.

In addition, several fully unsupervised approaches have also been proposed that use 3D-structural features to compare protein structures [16, 17]. For example, recently, a 3D-structural feature called tuned Gauss integrals (GIT) was used to cluster proteins into structurally similar groups [17].

In contrast to the “raw” 3D-structural features, protein 3D structures can first be modeled using protein structure networks (PSNs), in which nodes are amino acids and edges link amino acids that are spatially close enough to each other. Then, network-based features can be extracted from the PSNs and used in the task of PSC. A popular concept in this regard is the notion of protein contact maps [18], which are simply an alternative representation of PSNs. A contact map is the representation of an $n$ amino acid-long 3D protein structure as an $n \times n$ 2-dimensional matrix $C$. In a contact map $C$, $C_{ij}$ has a value of 1 if the amino acids $i$ and $j$ are within a pre-defined distance cutoff, i.e., they are in contact, and 0 otherwise. A recent approach that used contact map-based features for PSC is the cutoff scanning matrix (CSM) [19].
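The contact map definition above can be illustrated with a minimal sketch. It assumes that each amino acid is represented by a single 3D point and that the diagonal is left at 0; the function name `contact_map` and the toy coordinates are hypothetical and are not taken from any of the cited approaches.

```python
import math

def contact_map(coords, cutoff):
    """Return an n x n 0/1 matrix C, where C[i][j] = 1 if residues i and j
    lie within `cutoff` of each other (same unit as the coordinates).
    Assumption: one representative point per residue, diagonal kept at 0."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1  # contacts are symmetric
    return cmap

# Toy example: three hypothetical residues on a line (coordinates in Å).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords, cutoff=4.0)
```

Read as an adjacency matrix, `toy_map` is exactly an unweighted PSN over the same residues, which is why the two representations are interchangeable.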

Unlike contact maps, which are “simple” network representations, there exists a different category of PSN features that are based on graph-theoretic concepts, i.e., that measure different network properties. One such baseline PSN feature is Existing-all, which integrates seven network properties to represent a PSN [5]. Another popular PSN feature, which counts different types of network patterns, is based on the concept of graphlets; graphlets are subgraphs, or small Lego-like building blocks, of complex networks [20].

We believe that the graph-theoretic PSN-based PSC is promising. This is because we recently proposed an unsupervised protein structural comparison approach called GRAFENE that relies on graphlets as PSN features of a protein [5]. Given a set of PSNs as input, GRAFENE first extracts different versions of graphlet features from each PSN. Then, it quantifies structural similarity between each pair of the PSNs by comparing their features in an unsupervised manner. GRAFENE outperformed other state-of-the-art 3D-structural protein comparison approaches, including DaliLite [21] and TM-align [22].

In this work, we use the graphlet-based PSN features for the first time in the task of supervised PSC, with a hypothesis that they will improve upon state-of-the-art non-graphlet PSN features and non-PSN features that have traditionally been used in this task. Note that there exists a supervised approach that used graphlets to study proteins [23]. However, it did so in the task of functional classification of amino acids, i.e., nodes in a PSN, rather than in our task of structural classification of proteins, i.e., PSNs. Also, this approach only used the concept of regular graphlets, while we also test a newer concept of ordered graphlets [24] (see Methods), which outperformed regular graphlets in the GRAFENE study [5].

In general, a PSC approach comprises two key aspects: 1) a method to extract features from a protein structure and 2) selection of a classification algorithm to be trained based on the features (and protein labels). Hence, existing PSC approaches can be divided into two broad categories. The first category includes approaches that focus on improving a classification algorithm by relying on existing features [25, 26, 27]. The second category includes approaches that extract novel features to predict the structural class of a protein by relying on existing classification algorithms [6, 28, 29]. Our study belongs to the second category, since our goal is to evaluate graphlet features against other state-of-the-art PSC features in a fair evaluation framework, i.e., under the same (representative) classifier, without necessarily aiming to find the best classifier.

1.2 Our contributions

We propose a PSC framework called NETPCLASS (network-based protein structural classification). As one part of our framework, we propose the use of graphlet- and thus PSN-based protein features in the PSC task under an existing classification algorithm. As another part of our framework, we aim to achieve the following. Graphlets can deal only with edge-unweighted networks. Yet, we hypothesize that the existing PSN definition, which links with unweighted edges those pairs of amino acids whose 3D spatial distance is below some predefined threshold, can benefit from including as edge weights the actual spatial distances, and by doing so for all pairs of amino acids in the 3D structure rather than only for those pairs that are below the given threshold. So, we model a PSN as a weighted adjacency matrix. Because extracting features from such a matrix is a non-trivial task, we propose a deep learning-based PSC approach that achieves this automatically.

More details about our study are as follows:

  1. We evaluate, in the task of PSC, eight versions of graphlet features that were already used for unsupervised protein structural comparison [5]. In addition, in the same task, we evaluate a non-graphlet network (Existing-all) feature [5], a recent contact map-based (CSM) feature [19], a recent 3D-structural (GIT) feature [17], a state-of-the-art sequence (SVMfold) feature, and a baseline sequence (AAComposition) feature [5]. We use the same classification algorithm for each of the above 13 features to learn (train) their classification models, in order to fairly compare their performance. Here, as a proof-of-concept, we use a simple yet powerful logistic regression (LR) classifier, whose output indicates, for the given input protein and each class, the likelihood that the protein belongs to the given class.

  2. Since the different categories (i.e., sequence, 3D structural, contact-map, or PSN-based) of protein structural features can provide complementary information, we combine the individual features to form new integrated features and evaluate these against each of the individual features.

  3. Because graphlets, which are state-of-the-art network features, are currently designed only for unweighted PSNs, and because the current literature lacks knowledge on how to efficiently extract meaningful features from a weighted network, we aim to extract such features automatically via deep learning (DL).

  4. We evaluate the considered approaches on a large set of CATH and SCOP protein domains. We transform protein domains to PSNs with labels corresponding to CATH and SCOP structural classes, where we study each of the four levels of CATH and SCOP hierarchies [5]. Our evaluation is based on measuring how correctly the trained classification models can predict the classes of labeled proteins in the test data using 10-fold cross-validation.

Our key findings are as follows.

In terms of PSC accuracy, when we compare the individual features, we observe that the best of our graphlet features outperform all of the other individual features except GIT and SVMfold. However, regarding GIT, while it shows only marginally superior performance to the graphlet features, GIT is only applicable to proteins, while graphlets are general-purpose network features and are applicable to many other complex systems that can be modeled as networks. Further, we show that integrating GIT with the best graphlet features improves accuracy of each of GIT and the graphlet features, which means that each of the two individual features contributes to the superior performance of their integrated version. Additionally, we observe that integrating all of the individual features (not just GIT and the best graphlet features) further improves accuracy over each individual feature, yielding the best proposed approach. Regarding SVMfold, while this approach performs well (comparable to the best of our proposed approaches), SVMfold is orders of magnitude slower than our proposed approaches. In fact, SVMfold is so slow that we were able to run it only on 5.5% of our data.

In terms of running time, most of the features show similarly fast performance, followed by CSM, the integrated feature, and SVMfold, respectively.

Accounting for edge weights in PSNs via DL achieves accuracy that is relatively comparable to performance of the individual unweighted network-based methods. Note that here we are comparing as simple as possible weighted network information (the weighted adjacency matrix) against highly sophisticated unweighted network information (graphlet features, which are the state-of-the-art in network science). So, a comparable accuracy of the former and the latter implies a promise of future weighted network-based analyses of protein 3D structures (such as developing and using weighted graphlet-based features).

2 Methods

2.1 Data and protein structure network (PSN) construction

First, we use a set of 17,036 proteins that was previously used in the large-scale unsupervised protein structural comparison GRAFENE study [5]. In this data set, each protein pair is at most sequence similar. To identify protein domains, we use two protein domain categorization databases: CATH and SCOP.

To construct a PSN from a protein domain, we use Protein Data Bank (PDB) files, which contain information about the 3D coordinates of the heavy atoms (i.e., carbon, nitrogen, oxygen, and sulphur) of the amino acids in the domain. In a PSN, nodes are amino acids of a protein domain and there is an edge between any two nodes if they are sufficiently close in the 3D space. Clearly, given a protein domain, its corresponding PSN construction depends on 1) the choice of atom(s) of an amino acid to represent it as a node in the PSN and 2) a distance threshold between a pair of nodes to capture their spatial proximity. It was recently shown, by considering four different combinations of atom choice and distance threshold definitions (any heavy atom with 4 Å, 5 Å, and 6 Å distance thresholds, and α-carbon with 7.5 Å distance threshold), that the choice of atom and distance threshold does not significantly affect the overall protein structural comparison performance [5]. Hence, we consider only one of these PSN construction strategies in our study. Namely, we define an edge between two amino acids if the spatial distance between any of their heavy atoms is within 4 Å. Additionally, in order to only keep “meaningful” PSNs for further analysis, we filter the PSNs using an established guideline that is based on network properties of a PSN [5]. Namely, we only keep a PSN if it has 1) a single connected component, 2) a diameter of at least six, and 3) at least 100 nodes (amino acids). Following the above established criteria to create and filter PSNs, we obtain 9,440 and 11,352 PSNs corresponding to CATH and SCOP, respectively.
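The construction and filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: each residue is given as a list of heavy-atom coordinates already parsed from a PDB file (parsing is omitted), "within 4 Å" is read as less than or equal to 4 Å, and the function names `build_psn`, `bfs_distances`, and `keep_psn` are hypothetical.

```python
import math
from collections import deque

def build_psn(residues, threshold=4.0):
    """residues: list of lists of (x, y, z) heavy-atom coordinates, one list per
    amino acid. Returns an adjacency dict {node: set of neighbors}.
    Two residues are linked if any pair of their heavy atoms is within `threshold`."""
    n = len(residues)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if any(math.dist(a, b) <= threshold
                   for a in residues[i] for b in residues[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj

def bfs_distances(adj, source):
    """Shortest-path (hop) distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def keep_psn(adj, min_nodes=100, min_diameter=6):
    """Apply the three filtering criteria: enough nodes, a single connected
    component, and a large enough diameter."""
    if not adj or len(adj) < min_nodes:
        return False
    diameter = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) < len(adj):  # some node is unreachable: more than one component
            return False
        diameter = max(diameter, max(dist.values()))
    return diameter >= min_diameter
```

The all-pairs breadth-first search is quadratic in the number of nodes and edges, which is acceptable for domain-sized networks but is a design choice of this sketch rather than of the original pipeline.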

Given the CATH PSN data, we do the following. First, we test the power of the considered PSC approaches to predict the top hierarchical level classes of CATH: alpha (α), beta (β), alpha/beta (α/β), and few secondary structures. For few secondary structures, none of the CATH PSNs belongs to this class, so we do not consider this class further. Hence, we take all 9,440 CATH PSNs and identify them as a single PSN set, where the PSNs have labels corresponding to three top-level CATH classes: α, β, and α/β. Second, we compare the approaches on their ability to predict the second-level classes of CATH, i.e., within each of the top-level classes, we classify PSNs into their sub-classes. To ensure enough training data, we focus only on those top-level classes that have at least two sub-classes with at least 30 PSNs each. Three classes satisfy this criterion. For each such class, we take all of the PSNs belonging to that class and form a PSN set, which results in three PSN sets. Third, we compare the approaches on their ability to predict the third-level classes of CATH, i.e., within each of the second-level classes, we classify PSNs into their sub-classes. Again, we focus only on those second-level classes that have at least two sub-classes with at least 30 PSNs each. Nine classes satisfy this criterion. For each such class, we take all of the PSNs belonging to that class and form a PSN set, which results in nine PSN sets. Fourth, we compare the approaches on their ability to predict the fourth-level classes of CATH, i.e., within each of the third-level classes, we classify PSNs into their sub-classes. We again focus only on those third-level classes that have at least two sub-classes with at least 30 PSNs each. Six classes satisfy this criterion. For each such class, we take all of the PSNs belonging to that class and form a PSN set, which results in six PSN sets.
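The selection rule used at every hierarchy level (keep a parent class only if it has at least two sub-classes with at least 30 PSNs each) can be sketched as below. The function name `eligible_parents` and the toy labels are hypothetical; real CATH or SCOP labels would be supplied as (parent class, sub-class) pairs.

```python
from collections import Counter

def eligible_parents(labels, min_subclasses=2, min_size=30):
    """labels: iterable of (parent_class, sub_class) pairs, one pair per PSN.
    Returns the sorted parent classes that have at least `min_subclasses`
    sub-classes containing at least `min_size` PSNs each."""
    sub_counts = Counter(labels)  # number of PSNs per (parent, sub-class)
    large_subclasses = Counter(
        parent for (parent, _sub), count in sub_counts.items() if count >= min_size
    )
    return sorted(p for p, k in large_subclasses.items() if k >= min_subclasses)

# Toy example with made-up class names.
toy_labels = (
    [("A", "A.1")] * 30 + [("A", "A.2")] * 35 +  # two large sub-classes: kept
    [("B", "B.1")] * 40 + [("B", "B.2")] * 5     # only one large sub-class: dropped
)
selected = eligible_parents(toy_labels)
```

Each selected parent class then yields one PSN set made of all PSNs that belong to it.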

Thus, in total, we analyze 1+3+9+6=19 CATH PSN sets. For further details on the number of PSNs and the number of different protein structural classes in each of the PSN sets, see Supplementary Tables S1-S3. We follow the same procedure for the SCOP PSN data and obtain 1+5+6+4=16 SCOP PSN sets. For more details, see Supplementary Section S1 and Supplementary Tables S1-S3.

Given the CATH and SCOP PSN sets, we group them into four PSN set groups, corresponding to the four hierarchy levels of CATH and SCOP: group 1 (all 1+1=2 first-level PSN sets), group 2 (all 3+5=8 second-level PSN sets), group 3 (all 9+6=15 third-level PSN sets), and group 4 (all 6+4=10 fourth-level PSN sets).

Second, in addition to the 35 CATH and SCOP PSN sets from the GRAFENE study as described above, we use a different dataset because of the following reason. Typically, high sequence similarity of proteins indicates their high structural similarity. Hence, given a set of proteins in which proteins in the same structural class have high sequence similarity (typically ), a “simple” protein sequence comparison might be sufficient to perform PSC [30]. So, we aim to evaluate how well our considered protein features (that are based on different aspects of a protein structure) can identify proteins in the same structural category when all of the proteins (within and across structural categories) show low () sequence similarity.

In order to do this, we download the dataset called Astral from the SCOPe 2.04 database [31]. This dataset has 14,666 protein domains, where each domain pair is at most sequence similar to each other. Each protein domain in this dataset is annotated by a label (i.e., protein structural class) assigned by the protein domain categorization database SCOP, where the label indicates the protein family to which the domain belongs. We create a PSN corresponding to each of the protein domains as described above. Then, we follow the same criteria as in the GRAFENE study [5] (also described above) to only keep “meaningful” PSNs. This results in 1,677 PSNs belonging to 33 different protein structural classes. We name this set of 1,677 PSNs as Astral.

Taken together, in our study, we use 36 PSN sets (35 CATH and SCOP PSN sets and one Astral PSN set) that contain 9,440 protein domains annotated by the CATH database and 12,820 protein domains (union of the above SCOP-related 11,352 protein domains and the Astral-related 1,677 protein domains) annotated by the SCOP database.

2.2 Our evaluation framework

2.2.1 Protein features

For each of the protein domains, we extract 13 types of protein features that are based on either sequence, 3D structure, contact map, non-graphlet network, graphlet network, or weighted network (Table 1).

| Category | Name | Brief description |
|---|---|---|
| Sequence | AAComposition | Relative frequency of the 20 amino acid types in a protein |
| Sequence | SVMfold | Integration of PSSM, SS, and HMM profiles of a protein |
| 3D structure | GIT | Gauss integral measurements of the backbone of a protein |
| Contact map | CSM | Counts of the number of contacts in different contact maps of a protein |
| Non-graphlet network | Existing-all | Integration of seven network properties of a PSN |
| Graphlet network | Graphlet-3-4 | Counts of graphlets of node size three to four in a PSN |
| Graphlet network | Graphlet-3-5 | Counts of graphlets of node size three to five in a PSN |
| Graphlet network | NormGraphlet-3-4 | Normalized Graphlet-3-4 of a PSN |
| Graphlet network | NormGraphlet-3-5 | Normalized Graphlet-3-5 of a PSN |
| Graphlet network | OrderedGraphlet-3 | Counts of ordered graphlets of node size three in a PSN |
| Graphlet network | OrderedGraphlet-3-4 | Counts of ordered graphlets of node size three to four in a PSN |
| Graphlet network | NormOrderedGraphlet-3 | Normalized OrderedGraphlet-3 of a PSN |
| Graphlet network | NormOrderedGraphlet-3-4 | Normalized OrderedGraphlet-3-4 of a PSN |
| Integrated | GIT+OrderedGraphlet-3-4 | Concatenation of GIT and OrderedGraphlet-3-4 |
| Integrated | Concatenate-all | Concatenation of all of the above individual features except SVMfold |
| Weighted network | Distance matrix | Pairwise Euclidean distances between each pair of amino acids of a protein |
Table 1: Summary of all protein features that we use in this study. We use the same classifier (logistic regression-based) for each of the features (all except distance matrix), in order to evaluate their performance. For the distance matrix feature (colored in gray), we use deep learning.

Sequence-based features. We use a baseline sequence feature, AAComposition. Given a protein sequence, AAComposition measures the relative frequency of the 20 types of amino acids: for each amino acid type $a$, it measures the number of occurrences of $a$ in the sequence divided by the total number of amino acids in the sequence.
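As a small illustration of AAComposition as described above, the following sketch computes the 20 relative frequencies from a one-letter sequence string. The ordering of the 20 amino acid letters and the function name `aa_composition` are assumptions made for this example.

```python
# One-letter codes of the 20 standard amino acids (alphabetical order is an
# arbitrary choice made here; any fixed order works).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Return a 20-dimensional vector: occurrences of each amino acid type
    divided by the total sequence length."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

# Toy sequence (hypothetical, far shorter than a real protein domain).
toy_feature = aa_composition("AACD")
```

Because every entry is a fraction of the sequence length, the resulting vector is independent of protein size, which makes it directly comparable across domains.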

Additionally, we use a recent state-of-the-art sequence method called SVMfold. Given a protein sequence, SVMfold computes the position-specific scoring matrix (PSSM) [7], the three-state secondary structure (SS) profile [8], and the HMM profile [9] of the protein sequence, and integrates these three features to obtain a single feature representation of the protein [6].

3D structural feature. We use a recent 3D-structural feature, GIT. Given a protein structure, GIT measures how often the α-carbon trace of the protein forms different kinds of patterns in the 3D space. To measure the number of different patterns, GIT computes 31 different Gauss integrals and uses them as the feature representation of a protein [17].

Contact map-based feature. We use a recent contact map-based feature, CSM. Given a protein structure, CSM first computes 151 contact maps, which are based on 151 distance cutoffs (ranging from 0 Å to 30 Å with the interval of 0.2 Å). Then, CSM measures the number of amino acid pairs that are in contact in each of the 151 contact maps as a protein feature [19].
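The idea of scanning distance cutoffs and counting contacts at each one can be sketched as follows. This is not the CSM implementation of [19]: it assumes one representative point per amino acid, counts each unordered pair once, and uses a cutoff grid of 0 Å to 30 Å in 0.2 Å steps, which is an assumption chosen here because it yields 151 cutoffs. The function name `contact_count_profile` is hypothetical.

```python
import math

def contact_count_profile(coords, cutoffs):
    """For each distance cutoff, count the residue pairs whose distance is
    within that cutoff. Returns one count per cutoff."""
    pair_distances = [
        math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    ]
    return [sum(1 for d in pair_distances if d <= c) for c in cutoffs]

# Assumed cutoff grid: 0.0, 0.2, ..., 30.0 Å (151 values).
cutoffs = [k / 5 for k in range(151)]

# Toy example: three hypothetical residues on a line (coordinates in Å).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_profile = contact_count_profile(toy_coords, cutoffs)
```

The profile is non-decreasing in the cutoff, and its last entry equals the total number of residue pairs whenever the largest cutoff exceeds the largest pairwise distance.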

Non-graphlet network-based feature. Here, we use a feature that was shown to out-perform many other non-graphlet network-based features in an unsupervised protein comparison task [5]. We denote this feature as Existing-all. Given a PSN, Existing-all calculates and integrates seven network features: average degree, average distance, maximum distance, average closeness centrality, average clustering coefficient, intra-hub connectivity, and assortativity.

Graphlet network-based features. We use eight such features.

Graphlet counts. We use two graphlet-based protein features, i.e., Graphlet-3-4 and Graphlet-3-5. Given a PSN, Graphlet-3-4 and Graphlet-3-5 count the number of 3-4-node and 3-5-node graphlets, respectively. In particular, in the Graphlet-3-4 or Graphlet-3-5 feature vector, position $i$ represents the count of graphlets of type $i$ [5].

Normalized graphlet counts. Since PSNs can be of very different sizes, we use two recent protein features that are based on normalized graphlet counts and that thus account for network size differences [5]. These features are NormGraphlet-3-4 and NormGraphlet-3-5; they are normalized equivalents of Graphlet-3-4 and Graphlet-3-5, respectively. In particular, given a PSN, in both NormGraphlet-3-4 and NormGraphlet-3-5 feature vectors, position $i$ represents the total count of graphlets of type $i$ divided by the sum of the counts of all graphlet types.
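To make graphlet counting and normalization concrete, here is a minimal sketch restricted to the two 3-node graphlets (the open 3-node path and the triangle). It is not the Graphlet-3-4 or Graphlet-3-5 feature itself, which also covers larger graphlets; the function names and the toy network are hypothetical.

```python
def three_node_graphlet_counts(adj):
    """adj: dict {node: set of neighbors} of an undirected, unweighted network.
    Returns [number of induced open 3-node paths, number of triangles]."""
    triangles = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Each common neighbor of an edge's endpoints closes a triangle.
                triangles += len(adj[u] & adj[v])
    triangles //= 3  # every triangle was counted once per each of its 3 edges
    # Pairs of neighbors around each node: open paths plus 3 per triangle.
    wedges = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    open_paths = wedges - 3 * triangles
    return [open_paths, triangles]

def normalize_counts(counts):
    """Divide each graphlet count by the sum of all graphlet counts."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Toy network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_counts = three_node_graphlet_counts(toy_adj)
toy_normalized = normalize_counts(toy_counts)
```

The normalized vector sums to one regardless of network size, which is the property that lets PSNs of very different sizes be compared.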

Ordered graphlet counts. Graphlets capture 3D structural but not sequence information. To integrate the two, ordered graphlets were proposed [24]. These are graphlets whose nodes acquire a relative ordering based on positions of the amino acids in the sequence. Two ordered graphlet features exist: OrderedGraphlet-3 and OrderedGraphlet-3-4 [24, 5]. For a PSN, in OrderedGraphlet-3 and OrderedGraphlet-3-4 feature vectors, position $i$ is the total count of ordered graphlets of type $i$.

In addition, we use two features that are based on normalized counts of ordered graphlets [5]: NormOrderedGraphlet-3 and NormOrderedGraphlet-3-4; these are normalized equivalents of OrderedGraphlet-3 and OrderedGraphlet-3-4, respectively. For a PSN, in NormOrderedGraphlet-3 and NormOrderedGraphlet-3-4 feature vectors, position $i$ is the total count of ordered graphlets of type $i$ divided by the total count of all ordered graphlet types.

Principal component analysis (PCA)-transformed features. For each of the 13 features described above (eight graphlet features, one non-graphlet network feature, one contact map feature, one 3D-structural feature, and two sequence features), we generate the corresponding 13 new features using PCA. Recently, PCA transformation of protein features, in order to better capture their (dis)similarity, was proposed [5]. Here, we perform the same PCA transformation. For a given PSN set, for each of the above 13 protein features, we apply PCA to obtain new PCA-transformed features. Specifically, we pick the first $k$ principal components, where $k$ is at least two and otherwise as low as possible such that the PCA transformation retains at least 90% of the variation in the data set.
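The rule for choosing the number of principal components can be sketched as below. The sketch assumes that the per-component variances (e.g., eigenvalues of the covariance matrix, sorted in decreasing order) have already been computed by some PCA routine; the function name `choose_num_components` and the toy variances are hypothetical.

```python
def choose_num_components(explained_variance, min_components=2, min_retained=0.90):
    """explained_variance: variances of the principal components in decreasing
    order. Returns the smallest k >= min_components whose first k components
    retain at least `min_retained` of the total variance."""
    total = sum(explained_variance)
    cumulative = 0.0
    for k, variance in enumerate(explained_variance, start=1):
        cumulative += variance
        if k >= min_components and cumulative / total >= min_retained:
            return k
    # Fewer components than min_components, or rounding kept us below the target.
    return len(explained_variance)

# Toy variance spectra (made-up numbers).
k_spread = choose_num_components([5.0, 3.0, 1.5, 0.5])  # 50%, 80%, 95% -> k = 3
k_peaked = choose_num_components([9.5, 0.3, 0.2])       # 95% at k = 1, but k >= 2
```

The second call shows why the lower bound of two matters: a single dominant component would otherwise reduce the feature to one dimension.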

Integrated features. We propose four new features that integrate some of the above existing protein features in different ways. We concatenate the selected features to each other in order to integrate them. Formally, given $m$ features of sizes $s_i$ ($i$ = 1, 2, …, $m$), we concatenate them to form a feature of size $\sum_{i=1}^{m} s_i$.

We integrate the best of the pre-PCA non-graphlet features (GIT) and the best of the pre-PCA graphlet features (OrderedGraphlet-3-4) (see Section 3) into a new combined feature called GIT+OrderedGraphlet-3-4. Additionally, we integrate all of the pre-PCA features (except SVMfold) into a new combined feature called Concatenate-all. We do not use SVMfold because of its high running time (see Section 3).

Similar to the case of pre-PCA features as above, we combine the GIT post-PCA and the OrderedGraphlet-3-4 post-PCA protein features as a new post-PCA combined feature called GIT+OrderedGraphlet-3-4*. Additionally, we combine all of the post-PCA protein features (except SVMfold) as a new post-PCA combined feature called Concatenate-all*.

Weighted network-based feature. We use a weighted adjacency matrix, or distance matrix [32], of a 3D protein structure as a weighted PSN-based feature representation. In particular, given a protein of length $n$, we define a weighted adjacency matrix $D$ of size $n \times n$, in which each position $(i, j)$ contains the minimum 3D spatial distance between the amino acids $i$ and $j$, where the minimum is taken over all pairwise distances between any heavy atoms of $i$ and $j$.
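A minimal sketch of the weighted adjacency (distance) matrix described above follows. It assumes that each amino acid is given as a list of heavy-atom coordinates and that the diagonal is set to 0; the function name `min_distance_matrix` and the toy coordinates are hypothetical.

```python
import math

def min_distance_matrix(residues):
    """residues: list of lists of (x, y, z) heavy-atom coordinates, one list per
    amino acid. Returns an n x n matrix whose entry (i, j) is the minimum
    distance over all heavy-atom pairs of residues i and j."""
    n = len(residues)
    matrix = [[0.0] * n for _ in range(n)]  # assumption: zero diagonal
    for i in range(n):
        for j in range(i + 1, n):
            d = min(math.dist(a, b) for a in residues[i] for b in residues[j])
            matrix[i][j] = matrix[j][i] = d  # distances are symmetric
    return matrix

# Toy example: the first residue has two atoms, the others one (coordinates in Å).
toy_residues = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(4.0, 0.0, 0.0)],
    [(4.0, 3.0, 0.0)],
]
toy_matrix = min_distance_matrix(toy_residues)
```

Thresholding this matrix at 4 Å would recover the unweighted PSN, so the weighted representation strictly contains the information used by the unweighted one.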

Taken together, in our study, we use 31 different protein features (13 individual pre-PCA features, 13 individual post-PCA features, two integrated pre-PCA features, two integrated post-PCA features, and a weighted network-based feature).

2.2.2 The logistic regression (LR) framework

Given a PSN set, we train an LR classifier corresponding to each of the 30 out of the 31 (all except the weighted network-based feature) pre- or post-PCA protein features (see above). Hence, for each of the PSN sets, we get 30 different trained LR classifiers. In each of the trained classifiers, the input is a feature representation of a protein and output is the structural class to which the protein belongs. Since PSC is a multi-class problem, we use the one-vs-rest scheme to train an LR classifier. Due to space constraints, we provide further details about our LR framework in Supplementary Section S2.

We use 10-fold cross-validation to evaluate the performance of our LR classifiers. Given a PSN set, we first divide it into 10 equal-sized subsets, such that each subset contains the same proportion of different protein structural classes (i.e., labels) as present in the initial PSN set. Then, using one of the subsets as the test set and the union of the remaining nine subsets as the training set, we measure the percentage of proteins that are classified into their correct protein structural classes in the test set. We do this for each of the 10 subsets. Then, we take the average of the 10 percentages (i.e., accuracy values) that correspond to the 10 runs.
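The evaluation procedure above can be sketched independently of the classifier. In this sketch the stratified split is done by dealing the members of each class round-robin into the folds (no shuffling), and the classifier is passed in as two callables; `stratified_folds`, `cross_validated_accuracy`, `train_fn`, and `predict_fn` are hypothetical names, and the majority-class "classifier" in the usage example merely stands in for LR.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=10):
    """Split sample indices into n_folds so that each fold has (nearly) the
    same class proportions as the full set."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

def cross_validated_accuracy(features, labels, train_fn, predict_fn, n_folds=10):
    """Average, over the folds, of the percentage of correctly classified test samples."""
    accuracies = []
    for test_idx in stratified_folds(labels, n_folds):
        if not test_idx:
            continue  # can happen only when there are fewer samples than folds
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i] for i in test_idx)
        accuracies.append(100.0 * correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Usage with a stand-in classifier that always predicts the majority training class.
toy_labels = ["a"] * 20 + ["b"] * 30
toy_features = [[0.0]] * len(toy_labels)
majority_train = lambda X, y: Counter(y).most_common(1)[0][0]
majority_predict = lambda model, x: model
toy_accuracy = cross_validated_accuracy(toy_features, toy_labels,
                                        majority_train, majority_predict)
```

With 20 samples of one class and 30 of the other, every fold holds two and three samples of the respective classes, so the stand-in classifier scores the same accuracy in every fold.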

2.2.3 The deep learning (DL) framework

In the second part of our study, we design a DL framework, in order to learn features of 3D protein structures using weighted protein structure networks. For each of the 36 PSN data sets, we train a deep neural network, where we use distance matrix-representations of proteins as input. Our DL framework consists of one input layer, seven hidden layers, and an output layer. Due to space constraints, we provide further details about our DL framework in Supplementary Section S3.

We evaluate our DL architecture using 10-fold cross-validation as described above. Given a PSN set, we follow the same procedure as for the LR framework to obtain performance accuracy for the given PSN set.

Note that recently a related deep learning method was proposed that uses 3D-structural information for protein function prediction [33]. Given a protein, the method extracts two types of structural information: the torsional angles for each of the amino acids and the pairwise spatial distances between the α-carbons of the amino acids of a protein. Given this structural information, the method uses a deep convolutional neural network framework along with a support vector machine to perform the protein function prediction. The method was applied to classify enzymes (i.e., proteins) into functional categories. Since we only became aware of this very recent method towards the end of our work, we could not include it in our evaluation. However, note that this method is not based on PSNs and the focus of the study was on the task of protein function prediction rather than on our task of PSC.

3 Results and discussion

Throughout this section, unless stated otherwise, we analyze the 35 considered PSN sets that span all four levels (groups) of CATH and SCOP hierarchies, plus the Astral PSN set, totaling 36 PSN sets. For each considered feature, we report its accuracy as well as running time.

In Section 3.1, we compare the different graphlet features (Section 2) to identify the best one(s) for further analyses. In Section 3.2, we identify the best of the pre- or the post-PCA versions for each of the considered features. In Section 3.3, we evaluate how well the best of the graphlet feature(s) perform in comparison to the existing baseline or state-of-the-art protein features that we study (considering for each feature the best of its pre- and post-PCA versions). Here, we leave out of consideration the existing SVMfold sequence approach [6], because we were unable to apply this approach to all 36 considered PSN sets due to its extremely high time complexity. Instead, we consider SVMfold later on, in a smaller-scope analysis of two of the 36 PSN sets on which we were able to run SVMfold (see below). In Section 3.4, we evaluate whether integration of the different features improves upon each of the individual features.

In Section 3.5, we compare the performance of the best graphlet-based PSC approaches that deal with unweighted PSNs to the performance of simple weighted PSN-based feature classification via deep learning. In Section 3.6, we analyze two representative PSN sets on which SVMfold could be run, in order to compare our proposed approaches to this state-of-the-art existing sequence-based PSC approach.

3.1 Comparison of graphlet features

When we compare all graphlet features under the LR classifier, OrderedGraphlet-3-4 is the most accurate of all pre-PCA graphlet features, while NormOrderedGraphlet-3-4 is the most accurate of all post-PCA graphlet features (Fig. 1). So, for further analyses we keep these two best-performing graphlet features.

OrderedGraphlet-3-4, i.e., adding sequence-based node (amino acid) order, improves upon its regular (non-ordered) counterpart. This result is in alignment with our past work on unsupervised protein comparison [5]. Unlike in our past unsupervised study, in our current study, graphlet feature normalization does not always improve upon non-normalized features, and sometimes it actually worsens accuracy.

Figure 1: Accuracy of the 16 pre- and post-PCA graphlet features under the LR classifier, for each of the four hierarchy levels (groups) of the CATH dataset, averaged over all PSN sets belonging to the given group (vertical lines are standard deviations), plus the Astral PSN set. Results are qualitatively similar for the four groups of the SCOP dataset as well (Supplementary Fig. S1).

3.2 Selection of the best of the pre- or the post-PCA features

Here, we consider the following features under the LR classifier: all non-graphlet features except SVMfold, the two top performing graphlet features from Section 3.1, and all integrated features (i.e., Concatenate-all, and GIT+OrderedGraphlet-3-4) (Table 1). Note that since we could not apply SVMfold to all of the 36 PSN sets that we use, we exclude SVMfold from this analysis. When we compare pre- and post-PCA versions of each of the considered features, we find that pre-PCA performs the best for GIT, CSM, OrderedGraphlet-3-4, and GIT+OrderedGraphlet-3-4, while post-PCA performs the best for AAComposition, Existing-all, NormOrderedGraphlet-3-4, and Concatenate-all (Fig. 2).

So, PCA helps about half of the time. Thus, henceforth, for each feature, we use the best of its pre- and post-PCA versions. Also, for a given feature, if we use its post-PCA version, “*” is shown next to the feature’s name.

Figure 2: Accuracy of pre- and post-PCA versions of all non-graphlet features except SVMfold, the two top performing graphlet features from Section 3.1, and all integrated features (i.e., Concatenate-all, and GIT+OrderedGraphlet-3-4) under the LR classification framework (Table 1), for PSN group 4 of CATH. Results for the other groups of CATH, all groups of SCOP, and the Astral PSN set are qualitatively similar (Supplementary Figs. S2, S3, and S4, respectively). Results are averaged over all PSN sets in the given group (horizontal and vertical lines are standard deviations).

3.3 Graphlet feature(s) outperform other protein features

First, we compare the best of our graphlet features to a baseline sequence (AAComposition) feature to see how our graphlet features, which are PSN-based, compare against a naive sequence-based feature. We find that both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4* significantly (in terms of p-value) outperform AAComposition in terms of accuracy and are comparable in terms of running time (Figs. 3, 4, and 5).

Second, we compare the best of our graphlet features with a recent 3D-structural (GIT) feature to see how our graphlet features, which intuitively capture the 3D structure of a protein and are PSN-based, compare against a 3D-structural protein feature that is not PSN-based. We find that on average GIT shows marginally superior performance in terms of accuracy over both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4*. However, while GIT shows only marginally superior performance to the graphlet features, GIT is only applicable to proteins, while graphlets are general-purpose network features and are applicable to many other complex systems that can be modeled as networks. Even in the field of modeling protein structures, GIT has the following limitation. GIT cannot process a protein structure that is missing more than three α-carbons in its structurally resolved form, while graphlet-based features have no such limitation. This is a problem because protein structures are usually determined using experimental procedures that often, if not always, fail to determine the whole protein structure [34]. Note that this is not a problem in our analysis because we only include those protein structures that could be processed by each of the approaches that we consider. Furthermore, notice that given a PSN set, we measure the performance of a given approach as the percentage of correctly classified protein structures over all of the protein structural classes, without looking into how the given approach performs with respect to each of the structural classes individually (Section 2.2). So, although on average GIT shows comparable performance to our individual graphlet features, it would be interesting to see whether our individual graphlet features are more suitable for certain protein structural classes as compared to GIT.
We find that of all 228 protein classes over all 36 PSN sets, the best of our graphlet approaches, OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4*, show better performance than GIT in 26 and 15 protein classes, respectively. This means that for some of the protein structural classes our graphlet-based protein features can identify protein structures more correctly than GIT. So, we expect that our integrated features, which include GIT as well as at least one of the graphlet features, would improve upon each of the individual graphlet features and GIT. This is exactly what we observe (see below). In terms of running time, GIT and our individual graphlet features are comparable (Figs. 3 and 4).
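The per-class comparison described above could be carried out along the following lines. This is a hedged sketch: the text does not state exactly how per-class performance is measured, so the code assumes it is the percentage of correctly classified proteins within each true class, and the labels and predictions are made up for illustration.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Percentage of correctly classified samples within each true class."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: 100.0 * correct[label] / totals[label] for label in totals}


def classes_won(true_labels, predictions_a, predictions_b):
    """Classes in which approach A is strictly more accurate than approach B."""
    acc_a = per_class_accuracy(true_labels, predictions_a)
    acc_b = per_class_accuracy(true_labels, predictions_b)
    return sorted(label for label in acc_a if acc_a[label] > acc_b[label])


# Toy labels and predictions (made up for illustration only).
truth = ["c1", "c1", "c1", "c2", "c2", "c3", "c3"]
graphlet_pred = ["c1", "c1", "c2", "c2", "c2", "c3", "c1"]
git_pred = ["c1", "c1", "c1", "c2", "c3", "c3", "c3"]
print(classes_won(truth, graphlet_pred, git_pred))  # prints ['c2']
```

In the toy example, the second approach is more accurate overall, yet the first approach still wins on class "c2". This is the kind of class-specific advantage that motivates integrating the two feature types.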

Figure 3: Accuracy versus running time of the approaches from Fig. 2 plus deep learning (DL), for group 4 of CATH. Results are qualitatively similar for all other groups of CATH and SCOP (Supplementary Figs. S5 and S6). For each method except DL, the best of its pre- and post-PCA versions is chosen (DL does not have this option). If the latter is selected, “*” is shown next to the given feature’s name.
Figure 4: Accuracy versus running time of the approaches from Fig. 2 plus deep learning (DL), for the Astral PSN set. For each method except DL, the best of its pre- and post-PCA versions is chosen (DL does not have this option). If the latter is selected, “*” is shown next to the given feature’s name.

Third, we compare the best of our graphlet features with a recent contact map-based (CSM) feature to see how our graphlet features, which are based more heavily on graph-theoretic concepts, compare against CSM, which relies on a “simple” concept of contact maps. We find that both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4* significantly (in terms of p-value) outperform CSM in terms of both accuracy and running time (Figs. 3, 4, and 5).

Fourth, we compare the best of our graphlet features with a baseline network (Existing-all) feature to see how our graphlet features, which are more comprehensive network measures, compare against Existing-all, which relies on naive network measures. We find that both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4* significantly (in terms of p-value) outperform Existing-all in terms of accuracy and are comparable in terms of running time (Figs. 3, 4, and 5).

These results indicate that, as expected, graphlet-based features perform better than any of the considered baseline features, while showing better or comparable performance in terms of accuracy (possibly providing complementary information) compared to the considered state-of-the-art contact map-based and 3D-structural features.

3.4 Feature integration improves PSC accuracy

We expect the different categories (i.e., sequence, 3D-structural, contact-map, or PSN-based) of features to capture different aspects of a protein structure. Thus, integrating these features may help capture complementary structural information. Hence, we integrate GIT, the best of the non-graphlet features, and OrderedGraphlet-3-4, the best of the graphlet features, to form a new feature called GIT+OrderedGraphlet-3-4. GIT captures the raw 3D-structure of a protein and OrderedGraphlet-3-4 captures both the PSN structure and the protein sequence structure using ordered graphlets. Hence, we expect that GIT+OrderedGraphlet-3-4 will improve upon most, if not all, of the individual features. Indeed, we find that GIT+OrderedGraphlet-3-4 improves upon each of the individual features in terms of PSC accuracy and is comparable to the individual features in terms of running time (Figs. 3 and 4).

Note that we integrate two of the best performing features and do not use other possible combinations of features because of the following reason. Given all of the features that we use (Table 1), there are more than

possibilities in which we can combine those features to form a new feature. Hence, evaluating all of the combinations is not feasible. However, one heuristic is to integrate all of the features into a single combined feature, and hence capture as much of the different protein structure information as possible. We do this in the following manner.

We integrate all individual features that we consider under the LR classifier (all except SVMfold) to form a single feature called Concatenate-all, in the hope that Concatenate-all will improve upon each of the individual features (Table 1). Because of the high running time complexity of SVMfold, we could not apply it to all 36 PSN sets and hence we do not use SVMfold as part of our integrated feature. Our results show that Concatenate-all* yields a significant (in terms of p-value) improvement in accuracy compared to each of the individual protein features, although at the expense of higher running times, as expected (Figs. 3, 4, and 5).
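A minimal sketch of this kind of feature integration follows, assuming (as the name Concatenate-all suggests) that the per-protein feature vectors are simply joined end to end. The function name, protein identifiers, and feature values are hypothetical. Keeping only proteins present in every feature table mirrors the choice stated earlier of analyzing only structures that every approach can process.

```python
def concatenate_features(*feature_tables):
    """Join per-protein feature vectors from several feature types end to end."""
    shared_ids = set(feature_tables[0])
    for table in feature_tables[1:]:
        shared_ids &= set(table)
    combined = {}
    for protein_id in sorted(shared_ids):
        vector = []
        for table in feature_tables:
            vector.extend(table[protein_id])
        combined[protein_id] = vector
    return combined


# Hypothetical protein IDs and feature values, for illustration only.
git_features = {"d1": [0.2, 0.7], "d2": [0.1, 0.9]}
graphlet_features = {"d1": [3.0, 1.0, 4.0], "d2": [2.0, 0.0, 5.0], "d3": [1.0, 1.0, 1.0]}
integrated = concatenate_features(git_features, graphlet_features)
print(integrated["d1"])  # prints [0.2, 0.7, 3.0, 1.0, 4.0]
```

Because the combined vector grows with every added feature type, any subsequent dimensionality reduction and classification steps take longer. This is consistent with the higher running times reported for Concatenate-all*.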

Figure 5: Statistical significance of the accuracy difference of the approaches from Fig. 3. For each of the 36 PSN sets, we measure raw accuracy values for each of the 10 approaches. Hence, for each approach, there are 36 raw accuracy values (corresponding to the 36 PSN sets). For each pair of approaches, we compare the two given approaches’ 36 raw accuracy values using a paired t-test. In the figure, every cell (i, j) indicates the statistical significance (in terms of p-value) of approach i being superior to approach j.
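The paired comparison behind Figure 5 can be illustrated with a small sketch that computes the paired t statistic. Note the assumptions: the function name is hypothetical, and the inputs are the nine group-level accuracies listed in Table 2 rather than the 36 raw per-PSN-set values used in the paper, so the result does not reproduce Figure 5. Converting the statistic to a p-value additionally requires the Student's t distribution, which a statistics library (for example, SciPy's `ttest_rel`) provides.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples (A minus B)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0.0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(variance / n), n - 1


# Group-level accuracies copied from Table 2 (nine values per approach).
ordered_graphlet = [91.64, 78.42, 89.54, 93.85, 81.40, 79.15, 84.25, 91.67, 70.45]
csm = [41.79, 58.52, 51.70, 68.56, 72.61, 56.74, 82.19, 88.39, 44.39]
t_value, dof = paired_t_statistic(ordered_graphlet, csm)
print(round(t_value, 2), dof)
```

A positive statistic indicates that the first approach is more accurate on average across the paired data sets. Swapping the two arguments flips its sign.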

3.5 Weighted network-based DL classification performs comparably to unweighted graphlet classification

Our proposed DL classifier performs quite well in terms of accuracy (Figs. 3 and 4, and Table 2). Specifically, it is significantly (p-value < 0.05) superior to AAComposition*, CSM, and Existing-all* (Fig. 5). However, the performance of the DL classifier is significantly (p-value < 0.05) lower than that of OrderedGraphlet-3-4, GIT, GIT+OrderedGraphlet-3-4, and Concatenate-all* (Fig. 5). Yet, the performance of the DL classifier is comparable to that of one of the two top performing graphlet features, NormOrderedGraphlet-3-4* (Figs. 3 and 4).

| Approach | CATH Group 1 | CATH Group 2 | CATH Group 3 | CATH Group 4 | SCOP Group 1 | SCOP Group 2 | SCOP Group 3 | SCOP Group 4 | Astral |
|---|---|---|---|---|---|---|---|---|---|
| OrderedGraphlet-3-4 | 91.64 | 78.42 | 89.54 | 93.85 | 81.40 | 79.15 | 84.25 | 91.67 | 70.45 |
| NormOrderedGraphlet-3-4* | 89.03 | 71.95 | 87.17 | 92.87 | 77.59 | 71.67 | 82.20 | 89.16 | 65.35 |
| Existing-all* | 82.40 | 62.33 | 59.44 | 72.24 | 63.82 | 45.31 | 71.31 | 74.58 | 30.32 |
| AAComposition* | 63.05 | 61.33 | 72.23 | 83.38 | 52.85 | 55.38 | 82.26 | 78.96 | 39.16 |
| CSM | 41.79 | 58.52 | 51.70 | 68.56 | 72.61 | 56.74 | 82.19 | 88.39 | 44.39 |
| GIT | 87.86 | 78.51 | 91.48 | 92.83 | 81.92 | 89.21 | 92.39 | 96.72 | 81.94 |
| GIT+OrderedGraphlet-3-4 | 94.61 | 87.43 | 95.81 | 97.49 | 84.99 | 91.03 | 91.17 | 95.59 | 79.03 |
| Concatenate-all* | 93.89 | 85.95 | 96.34 | 98.32 | 85.31 | 91.05 | 93.82 | 96.58 | 81.29 |
| Deep Learning | 85.72 | 82.94 | 81.63 | 90.85 | 67.67 | 62.27 | 84.36 | 89.96 | 48.84 |

Table 2: Accuracy (in %) of the approaches from Fig. 3 for each CATH and SCOP group, plus the Astral dataset.

Importantly, unlike the other individual (LR) classifiers that make use of highly sophisticated unweighted network information such as graphlet features, the DL framework utilizes only as simple as possible weighted network information (i.e., weighted adjacency matrix of a network) as its input. This points to a promise of future algorithmic developments for dealing with weighted networks, perhaps even designing weighted graphlet features.

3.6 Our features versus SVMfold

SVMfold has high running time because it needs to extract three sets of very comprehensive features from protein sequence information. This complex information retrieval process needs to be performed for each protein in the considered PSN set, which is not feasible when analyzing large PSN sets containing many proteins (such as those at the higher levels of CATH/SCOP hierarchies) or many PSN sets. Hence, we can compare our approaches to the state-of-the-art SVMfold approach only for two representative PSN sets out of all 36 PSN sets.

Specifically, we choose CATH-3.20.20 and CATH-3.40.50 from group 4 of the CATH data as the representative PSN sets, for the following reasons. These two PSN sets correspond to the fourth level of the CATH hierarchy, i.e., as specific structural classes as possible, which are the most relevant for applied biochemistry scientists. Also, of all fourth-level PSN sets, CATH-3.20.20 is one of the PSN sets in which at least one of our top performing graphlet approaches gives low accuracy (), which gives SVMfold the best-case advantage over our approaches, and CATH-3.40.50 is one of the PSN sets in which both of our top performing graphlet approaches give high accuracy (), which gives our approaches the best-case advantage over SVMfold.

Overall, our best performing graphlet features, OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4*, and our integrated features, GIT+OrderedGraphlet-3-4 and Concatenate-all*, are comparable (within ) to SVMfold in terms of accuracy (individual graphlet approaches and GIT+OrderedGraphlet-3-4 on CATH-3.40.50, and Concatenate-all* on both CATH-3.20.20 and CATH-3.40.50) at a fraction of SVMfold’s running time (Table 3).

| Approach | Accuracy, CATH-3.20.20 | Accuracy, CATH-3.40.50 | Running time (min), CATH-3.20.20 | Running time (min), CATH-3.40.50 |
|---|---|---|---|---|
| OrderedGraphlet-3-4 | 82.07 | 97.86 | 18.77 | 4.95 |
| NormOrderedGraphlet-3-4* | 76.90 | 96.07 | 26.61 | 7.02 |
| Existing-all* | 51.90 | 60.36 | 6.42 | 1.65 |
| AAComposition* | 79.31 | 81.07 | 0.06 | 0.04 |
| CSM | 76.03 | 80.71 | 53.74 | 15.32 |
| GIT | 85.86 | 99.64 | 2.1 | 0.36 |
| GIT+OrderedGraphlet-3-4 | 92.07 | 97.86 | 20.83 | 5.31 |
| Concatenate-all* | 94.66 | 98.93 | 90.70 | 25.05 |
| Deep Learning | 83.79 | 92.87 | 5 | 3.07 |
| SVMfold* | 99.31 | 100 | 79,365.37 | 29,859.46 |

Table 3: Accuracy and running times (in minutes) of the approaches from Fig. 3 plus SVMfold, for the CATH-3.20.20 and CATH-3.40.50 PSN sets. Due to SVMfold’s long running time, we could not evaluate it on additional PSN sets.
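To put the running times of Table 3 in perspective, the short sketch below computes how many times longer SVMfold runs than some of our approaches. The dictionary layout and the function name `speedup_over` are illustrative assumptions; the numbers are copied from Table 3.

```python
# Running times in minutes, copied from Table 3.
running_time = {
    "OrderedGraphlet-3-4": {"CATH-3.20.20": 18.77, "CATH-3.40.50": 4.95},
    "GIT+OrderedGraphlet-3-4": {"CATH-3.20.20": 20.83, "CATH-3.40.50": 5.31},
    "Concatenate-all*": {"CATH-3.20.20": 90.70, "CATH-3.40.50": 25.05},
    "SVMfold*": {"CATH-3.20.20": 79365.37, "CATH-3.40.50": 29859.46},
}


def speedup_over(baseline, approach, psn_set):
    """How many times longer the baseline runs than the given approach."""
    return running_time[baseline][psn_set] / running_time[approach][psn_set]


for name in running_time:
    if name != "SVMfold*":
        ratios = [
            speedup_over("SVMfold*", name, psn_set)
            for psn_set in ("CATH-3.20.20", "CATH-3.40.50")
        ]
        print(name, [round(r) for r in ratios])
```

For the approaches listed here, the ratios range from several hundred to several thousand, which is what "a fraction of SVMfold's running time" amounts to for these two PSN sets.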

4 Conclusion

We propose a PSC framework using network-based features. We evaluate the proposed approaches against other state-of-the-art methods that use protein features based on various aspects of a protein structure, including sequence, 3D structure, and contact map information. In a comprehensive evaluation, we demonstrate that our proposed network-based graphlet protein features are superior to most of the other baseline and state-of-the-art features that we evaluate. Importantly, we show that integrating different protein features improves PSC accuracy, possibly by capturing complementary structural information. Further, our proposed DL framework, which automatically learns appropriate features from simple weighted adjacency matrices, yields comparable accuracy to many of the sophisticated features that we use. This points to a promising future for algorithms that will rely on weighted network-based attributes of protein 3D structures.

Authors’ contributions

Conceived the study: KN AR TM. Collected and processed data: KN. Designed the methodology: AR MG TM. Designed the experiments: KN AR MG TN. Performed the experiments: KN AR MG. Analyzed the results: KN AR MG TM. Wrote the paper: KN AR MG TM. Read and approved the paper: KN AR MG PJA TM. Supervision: PJA TM.

Competing interests

The authors have no competing interests.

Funding

This work is funded by the National Institutes of Health (NIH) 1R01GM120733 and the National Science Foundation (NSF) CNS-1629914 grants.

References

  • [1] Kasabov, N. K. Springer Handbook of Bio-/Neuro-Informatics (Springer, 2013), 1 edn.
  • [2] Jain, P., Garibaldi, J. M. & Hirst, J. D. Supervised machine learning algorithms for protein structure classification. Computational Biology and Chemistry 33, 216–223 (2009).
  • [3] Greene, L. H. et al. The cath domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic acids research 35, D291–D297 (2006).
  • [4] Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. Scop: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology 247, 536–540 (1995).
  • [5] Faisal, F. E. et al. Grafene: Graphlet-based alignment-free network approach integrates 3d structural and sequence (residue order) data to improve protein structural comparison. Scientific Reports 7, 14890 (2017).
  • [6] Xia, J., Peng, Z., Qi, D., Mu, H. & Yang, J. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics 33, 863–870 (2016).
  • [7] Altschul, S. F. et al. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research 25, 3389–3402 (1997).
  • [8] Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. Journal of molecular biology 292, 195–202 (1999).
  • [9] Remmert, M., Biegert, A., Hauser, A. & Söding, J. Hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment. Nature methods 9, 173 (2012).
  • [10] Krissinel, E. On the relationship between sequence and structure similarities in proteomics. Bioinformatics 23, 717–723 (2007).
  • [11] Kosloff, M. & Kolodny, R. Sequence-similar, structure-dissimilar protein pairs in the pdb. Proteins: Structure, Function, and Bioinformatics 71, 891–902 (2008).
  • [12] Cui, C. & Liu, Z. Classification of 3d protein based on structure information feature. In BioMedical Engineering and Informatics, 2008. BMEI 2008. International Conference on, vol. 1, 98–101 (IEEE, 2008).
  • [13] Kalajdziski, S., Mirceva, G., Trivodaliev, K. & Davcev, D. Protein classification by matching 3d structures. In Frontiers in the Convergence of Bioscience and Information Technologies, 2007. FBIT 2007, 147–152 (IEEE, 2007).
  • [14] Jo, T., Hou, J., Eickholt, J. & Cheng, J. Improving protein fold recognition by deep learning networks. Scientific reports 5, 17573 (2015).
  • [15] Wang, J., Li, Y., Zhang, Y., Tang, N. & Wang, C. Class conditional distance metric for 3d protein structure classification. In Bioinformatics and Biomedical Engineering,(iCBBE) 2011 5th International Conference on, 1–4 (IEEE, 2011).
  • [16] Zhi, D., Shatsky, M. & Brenner, S. E. Alignment-free local structural search by writhe decomposition. Bioinformatics 26, 1176–1184 (2010).
  • [17] Harder, T., Borg, M., Boomsma, W., Røgen, P. & Hamelryck, T. Fast large-scale clustering of protein structures using gauss integrals. Bioinformatics 28, 510–515 (2012).
  • [18] Godzik, A., Kolinski, A. & Skolnick, J. Topology fingerprint approach to the inverse protein folding problem. Journal of Molecular Biology 227, 227 – 238 (1992).
  • [19] Pires, D. E. et al. Cutoff scanning matrix (csm): structural classification and function prediction by protein inter-residue distance patterns. BMC Genomics 12, S12 (2011).
  • [20] Pržulj, N., Corneil, D. G. & Jurisica, I. Modeling interactome: scale-free or geometric? Bioinformatics 20, 3508–3515 (2004).
  • [21] Holm, L. & Rosenström, P. Dali server: conservation mapping in 3D. Nucleic Acids Research 38, W545–W549 (2010).
  • [22] Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research 33, 2302–09 (2005).
  • [23] Vacic, V., Iakoucheva, L. M., Lonardi, S. & Radivojac, P. Graphlet kernels for prediction of functional residues in protein structures. Journal of Computational Biology 17, 55–72 (2010).
  • [24] Malod-Dognin, N. & Pržulj, N. GR-Align: fast and flexible alignment of protein 3D structures using graphlet degree similarity. Bioinformatics 30, 1259–1265 (2014).
  • [25] Lin, C. et al. Hierarchical classification of protein folds using a novel ensemble classifier. PloS one 8, e56499 (2013).
  • [26] Vipsita, S., Shee, B. K. & Rath, S. K. An efficient technique for protein classification using feature extraction by artificial neural networks. In India Conference (INDICON), 2010 Annual IEEE, 1–5 (IEEE, 2010).
  • [27] Melvin, I., Weston, J., Leslie, C. S. & Noble, W. S. Combining classifiers for improved classification of proteins from sequence or structure. Bmc Bioinformatics 9, 389 (2008).
  • [28] Wei, L., Liao, M., Gao, X. & Zou, Q. Enhanced protein fold prediction method through a novel feature extraction technique. IEEE transactions on nanobioscience 14, 649–659 (2015).
  • [29] Dai, H.-L. Imbalanced protein data classification using ensemble ftm-svm. IEEE transactions on nanobioscience 14, 350–359 (2015).
  • [30] Rost, B. Twilight zone of protein sequence alignments. Protein Engineering, Design and Selection 12, 85–94 (1999).
  • [31] Fox, N. K., Brenner, S. E. & Chandonia, J.-M. Scope: Structural classification of proteins-extended, integrating scop and astral data and classification of new structures. Nucleic Acids Research 42, D304–D309 (2014).
  • [32] Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. Journal of molecular biology 233, 123–138 (1993).
  • [33] Zacharaki, E. I. Prediction of protein function using a deep convolutional neural network ensemble. PeerJ Computer Science 3, e124 (2017).
  • [34] Lander, G. C., Saibil, H. R. & Nogales, E. Go hybrid: Em, crystallography, and beyond. Current opinion in structural biology 22, 627–635 (2012).