Multitask Protein Function Prediction Through Task Dissimilarity

11/03/2016
by   Marco Frasca, et al.

Automated protein function prediction is a challenging problem with distinctive features, such as the hierarchical organization of protein functions and the scarcity of annotated proteins for most biological functions. We propose a multitask learning algorithm addressing both issues. Unlike standard multitask algorithms, which use task (protein function) similarity information as a bias to speed up learning, we show that dissimilarity information enforces separation of rare class labels from frequent class labels, and for this reason is better suited for solving unbalanced protein function prediction problems. We support our claim by showing that a multitask extension of the label propagation algorithm empirically works best when the task relatedness information is represented using a dissimilarity matrix as opposed to a similarity matrix. Moreover, the experimental comparison carried out on three model organisms shows that our method has a more stable performance in both "protein-centric" and "function-centric" evaluation settings.


1 Introduction

The constant increase in the volume and variety of publicly available genomic and proteomic data is a characteristic trait of modern biomedical sciences. A fundamental problem in this area is the assignment of functions to biological macromolecules, especially proteins. Indeed, the accurate annotation of protein function would also have great biomedical and pharmaceutical implications, since several human diseases have genetic causes. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted scope have led to an increasing role for automated function prediction (AFP). AFP is characterized by unbalanced functional classes with rare positive instances. Moreover, since only positive membership in functional classes is usually assessed, negative instances are not uniquely defined, and different approaches for choosing them have been proposed [1, 2, 3]. Other peculiarities of AFP include: (1) the need to integrate several heterogeneous sources of genomic, proteomic, and transcriptomic data in order to achieve more accurate predictions [4, 5]; (2) the presence of multiple labels and dependencies among class labels; (3) the hierarchical structure of functional classes (a directed acyclic graph for the Gene Ontology, GO [6], and a forest of trees for the FunCat taxonomy [7]) with different levels of specificity.

Recently, two international challenges for the Critical Assessment of Functional Annotation (CAFA [8] and CAFA2 [9]) were organized to evaluate computational methods that automatically assign protein functions. In particular, CAFA2 emphasized the need for multilabel or structured-output learning algorithms for predicting a set of terms, or a subgraph of the GO ontology, for a given protein. In this work we mainly focus on this problem, whose solution, however, also requires paying attention to the other aspects of AFP.

Several approaches to the prediction of protein functions were proposed in the literature, including sequence-based [10, 11, 12] and network-based methods [13, 14, 15], structured output algorithms based on kernels [16, 17, 3] and hierarchical ensemble methods [18, 19, 20]. In particular, the availability of large-scale networks, in which nodes are genes/proteins and edges their functional pairwise relationships, has promoted the development of several machine learning methods where novel annotations are inferred by exploiting the topology of the resulting biomolecular network. Initially, network-based approaches relied on the so-called guilt-by-association (GBA) rule, which makes predictions assuming that interacting proteins are likely to share similar functions [21, 22, 23]. Indirect neighbours were also exploited to modify the notion of pairwise similarity among nodes by accounting for pairs of nodes connected through intermediate ones [24, 25]. Protein functions can also be propagated through the network with an iterative process until convergence [26, 27], by tuning the amount of propagation allowed in the graph through Markov random walks [28, 29], by evaluating the functional flow through the nodes [30], by exploiting kernelized score functions [31], and by modelling protein memberships through Markov Random Fields [32] and Gaussian Random Fields [33, 34]. Furthermore, methods based on the convergence of classical [35, 36] and multi-category Hopfield networks [37] were recently proposed to specifically tackle the class imbalance.

Although protein functions are clearly dependent (see, e.g., the GO functions, where parent terms include all the proteins of their children), most AFP methods described above predict biological functions independently from each other. Multitask methods, on the other hand, take advantage of existing dependencies by transferring information between related tasks, which typically leads to learning faster than algorithms trained independently on each task.

In this paper we investigate an alternative approach to multitask learning based on exploiting task dissimilarities rather than similarities. In particular, we consider two multitask extensions of a known label propagation algorithm [26]: the first extension follows a standard multitask approach based on task similarities; the second extension learns instead from task dissimilarities. Both approaches can be naturally applied to the multilabel prediction of proteins. The prediction tasks we consider are the GO protein functions of fly, human, and bacteria model organisms. We compute different measures of similarity/dissimilarity between GO terms, taking into account both GO structure and protein annotations. We show that the approach learning from task dissimilarities greatly helps in unbalanced tasks (by helping instances labeled with the rare class labels to be correctly classified), and does not hurt in the more balanced cases. This is a crucial point in protein function prediction, since the terms that best describe protein functions (i.e., the most specific ones) are the most unbalanced: proteins annotated with these terms are very rare. On the other hand, learning from similar tasks tends to be more effective in balanced settings. Note that the proposed multitask extensions of label propagation do not increase the overall running time of the algorithm, allowing its application to large datasets. Finally, we compare our methods with the state-of-the-art methodologies for AFP by considering both “term-centric” and “protein-centric” evaluation settings.

The paper is organized as follows. In Section 2 we formally introduce the problem and in Section 3 we describe the proposed multitask label propagation methodology. Finally, Section 4 is dedicated to the experimental validation of the method.

2 Automated Protein Function Prediction

The Automated protein Function Prediction (AFP) problem can be formalized as a semi-supervised learning problem on a weighted and undirected graph $G = (V, E, W)$, where $V = \{1, \dots, n\}$ is the set of vertices, $E$ is the set of edges, and $W$ is the symmetric $n \times n$ weight matrix, where $W_{ij}$ is the weight on the edge between vertices $i$ and $j$ (we assume $W_{ii} = 0$ and $W_{ij} \ge 0$ for all $i, j$).

A set of $m$ binary classification tasks on $G$ is defined by $m$ labelings $y^{(1)}, \dots, y^{(m)} \in \{-1, 1\}^n$ of the nodes in $V$, where $y^{(k)}_i$ is the label of node $i$ for task $k$. For any subset $S \subseteq V$ and any vector $v \in \mathbb{R}^n$, we use $v_S$ to denote the vector obtained from $v$ by retaining only the coordinates in $S$.

The multitask prediction problem on the graph $G$ is then defined as follows. Given a set $S \subset V$ of training vertices and the complement set $U = V \setminus S$ of test vertices, the learner must predict the test labels $y^{(k)}_U$ for each task $k = 1, \dots, m$ given the training labels $y^{(k)}_S$ for the same tasks.

3 Methods

We first describe the standard label propagation algorithm [26, 38, 39] for single-task classification on graphs. This will be later extended to the multitask setting.

3.1 Label Propagation (LP)

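In the single-task setting, a standard notion of regularity of a labeling $y \in \{-1, 1\}^n$ on a graph $G$ is the weighted cutsize induced by $y$ and defined as follows:

$$\Phi(y) = \sum_{(i,j) \in E} W_{ij}\, \mathbb{I}\{y_i \neq y_j\} \tag{1}$$

The weighted cutsize can also be expressed as a quadratic form

$$\Phi(y) = \frac{1}{4}\, y^\top L\, y .$$

The matrix $L = D - W$ is the Laplacian of $G$, where $D$ is the diagonal matrix with entries $D_{ii} = \sum_j W_{ij}$. The Label Propagation algorithm minimizes the above quadratic form over real-valued (rather than binary) labels. More precisely, LP finds the unique solution $f$ of

$$\min_{f \in \mathbb{R}^n} f^\top L\, f \quad \text{subject to} \quad f_i = y_i \ \text{ for all } i \in S \tag{2}$$

The solution $f$ of (2) is smooth on $G$. Namely, if two vertices $i, j$ are connected with a large weight $W_{ij}$, then $f_i$ is close to $f_j$. Indeed, the components of $f$ satisfy the harmonic property [26]

$$f_i = \frac{1}{D_{ii}} \sum_{j} W_{ij} f_j \qquad \text{for all } i \in U .$$

The vector $f_U$ can also be written in closed form as

$$f_U = (D_{UU} - W_{UU})^{-1}\, W_{US}\, y_S \tag{3}$$

where

$$W = \begin{pmatrix} W_{SS} & W_{SU} \\ W_{US} & W_{UU} \end{pmatrix}$$

is the weight matrix partitioned in blocks to emphasize the labeled and unlabeled part of the graph (similarly for the matrix $D$). As the components of $f_U$ given by (3) are not in $\{-1, 1\}$, the final labeling produced by LP is obtained by thresholding each component $f_i$ for $i \in U$.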
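The closed form of label propagation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `label_propagation`, the dense-matrix representation, and the toy path graph are all hypothetical, and the threshold at zero (via `np.sign`) is one possible choice for the final thresholding step.

```python
import numpy as np

def label_propagation(W, y_train, train_idx):
    """Harmonic label propagation: clamp training labels, solve for the rest.

    W         : (n, n) symmetric, non-negative weight matrix with zero diagonal
    y_train   : labels of the training vertices, in the order of train_idx
                (a vector, or an (|S|, m) matrix with one column per task)
    train_idx : indices of the labeled vertices (the set S)
    """
    W = np.asarray(W, dtype=float)
    train_idx = np.asarray(train_idx)
    test_idx = np.setdiff1d(np.arange(W.shape[0]), train_idx)  # the set U
    laplacian = np.diag(W.sum(axis=1)) - W                     # L = D - W
    L_uu = laplacian[np.ix_(test_idx, test_idx)]               # D_UU - W_UU
    W_us = W[np.ix_(test_idx, train_idx)]                      # W_US
    # Solve (D_UU - W_UU) f_U = W_US y_S instead of forming the inverse.
    f_test = np.linalg.solve(L_uu, W_us @ np.asarray(y_train, dtype=float))
    return test_idx, f_test

# Toy path graph 0 - 1 - 2 - 3 with the two endpoints labeled +1 and -1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
test_idx, f_test = label_propagation(W, y_train=[1, -1], train_idx=[0, 3])
labels = np.sign(f_test)  # assumed thresholding at zero
```

On this path graph each unlabeled vertex ends up at the weighted average of its neighbours, which is exactly the harmonic property: the scores decrease linearly from the positive endpoint to the negative one.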

3.2 Multitask label propagation (MTLP)

It is fairly easy to use similarity or dissimilarity information between tasks in order to generalize LP to multitask learning, while preserving the regularity of every task in the sense of (1).

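We start by considering multitask LP based on similarity information. Suppose a symmetric $m \times m$ matrix $A$ is given, where each entry $A_{kr} \ge 0$ quantifies the relatedness between tasks $k$ and $r$. Let $M_A$ be the $m \times m$ matrix where $M_A = I_m + L_A$, $I_m$ is the $m \times m$ identity matrix, and $L_A$ is the Laplacian of $A$. The matrix $M_A$ is symmetric and positive definite, since $M_A$ is diagonally dominant with positive diagonals, and thus invertible. Denote by $Y$ the $n \times m$ label matrix whose $k$-th column is the vector $y^{(k)}$, and by $F$ the $n \times m$ matrix whose $k$-th column is the vector $f^{(k)}$.

When learning multiple related tasks, a widely used approach is requiring that similar tasks be assigned similar labelings. To this end, we introduce the linear map $\phi_A : \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$, defined as follows:

$$\phi_A(Y) = Y M_A^{-1} \tag{4}$$

It can be shown that the map $\phi_A$ acts on a multitask labeling matrix $Y$ by bringing closer (in Euclidean distance) the labelings (columns of $Y$) corresponding to tasks that are similar according to $A$.

By means of $\phi_A$, the exploitation of task similarities can be encoded into the learning problem (2) as follows:

$$\min_{F \in \mathbb{R}^{n \times m}} \sum_{k=1}^{m} \big(f^{(k)}\big)^\top L\, f^{(k)} \quad \text{subject to} \quad F_S = \widetilde{Y}_S \tag{5}$$

where $\widetilde{Y} = \phi_A(Y)$. The solution to (5) is

$$F_U = (D_{UU} - W_{UU})^{-1}\, W_{US}\, \widetilde{Y}_S$$

where $\widetilde{Y}_S$ is the submatrix of $\widetilde{Y}$ including only the rows indexed by $S$, and $F_U$ is the submatrix of $F$ including only the rows indexed by $U$. By observing that $\widetilde{Y}_S = Y_S M_A^{-1}$, we have

$$F_U = (D_{UU} - W_{UU})^{-1}\, W_{US}\, Y_S\, M_A^{-1} = F^*_U\, M_A^{-1}$$

where $F^*$ is the solution of (5) with constraints $f^{(k)}_i = y^{(k)}_i$ for $i \in S$ and $k = 1, \dots, m$. The equality shows that it is equivalent to apply the task feature map (4) before or after performing label propagation. This ensures that the multitask mapping does not increase the label propagation complexity.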
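The claim that the task map can be applied before or after label propagation follows from linearity, and it can be checked numerically. The sketch below assumes the similarity map is right-multiplication of the label matrix by the inverse of "identity plus Laplacian of the task similarity matrix", as described in the text; the names `graph_laplacian`, `propagate`, `task_sim`, and `M_inv`, and the random toy data, are hypothetical.

```python
import numpy as np

def graph_laplacian(W):
    """Laplacian D - W of a symmetric non-negative matrix."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def propagate(W, Y_train, train_idx):
    """Label propagation for all tasks at once (one column per task)."""
    test_idx = np.setdiff1d(np.arange(W.shape[0]), train_idx)
    L = graph_laplacian(W)
    L_uu = L[np.ix_(test_idx, test_idx)]
    W_us = W[np.ix_(test_idx, train_idx)]
    return np.linalg.solve(L_uu, W_us @ Y_train)

rng = np.random.default_rng(0)
n, m = 6, 3

# Random dense symmetric graph over n instances (toy data).
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

# Random +/-1 label matrix and a random symmetric task similarity matrix.
Y = np.where(rng.random((n, m)) < 0.3, 1.0, -1.0)
task_sim = rng.random((m, m))
task_sim = (task_sim + task_sim.T) / 2
np.fill_diagonal(task_sim, 0.0)

M_inv = np.linalg.inv(np.eye(m) + graph_laplacian(task_sim))
train_idx = np.array([0, 1, 2, 3])

# Map the training labels first, then propagate ...
F_map_then_lp = propagate(W, Y[train_idx] @ M_inv, train_idx)
# ... or propagate the raw labels and map the propagated scores afterwards.
F_lp_then_map = propagate(W, Y[train_idx], train_idx) @ M_inv
```

Both orders give the same test scores up to floating-point error, so the only extra cost of the multitask step is one product with an `m × m` matrix.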

As we show next, this solution does not perform well on unbalanced classification problems, where some class (typically the positive class) is largely underrepresented. We propose here an alternative approach, which exploits the prior information about task relatedness in an “inverse” manner. Specifically, we propose a multitask label propagation algorithm which learns multiple tasks by requiring that dissimilar tasks be assigned dissimilar labelings. As we see in the experiments, this approach turns out to work particularly well on unbalanced classification problems.

The first component of this method is a dissimilarity matrix $B$, where $B_{kr} \ge 0$ is a measure of dissimilarity between tasks $k$ and $r$ (we discuss in Section 3.2.2 possible choices for the matrices $A$ and $B$).

Given the matrix $B$, we consider the multitask map $\phi_B : \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$, defined as

$$\phi_B(Y) = Y M_B \tag{6}$$

where $M_B = I_m + L_B$ and $L_B$ is the Laplacian matrix of $B$. Unlike the inverse transformation (4), the map $\phi_B$ moves the columns of matrix $Y$ farther away from each other in the corresponding $n$-dimensional space (in the sense of the Euclidean distance). We formally show that in Section 3.2.1. Using $\phi_B$ instead of $\phi_A$ in (5), we obtain the following optimization problem:

$$\min_{F \in \mathbb{R}^{n \times m}} \sum_{k=1}^{m} \big(f^{(k)}\big)^\top L\, f^{(k)} \quad \text{subject to} \quad F_S = \overline{Y}_S \tag{7}$$

with $\overline{Y} = \phi_B(Y)$. Similarly to (5), the solution of (7) is

$$F_U = (D_{UU} - W_{UU})^{-1}\, W_{US}\, Y_S\, M_B = F^*_U\, M_B$$

where $F^*$ is the solution of (7) with constraints $f^{(k)}_i = y^{(k)}_i$ for $i \in S$ and $k = 1, \dots, m$. Just like in the previous case, the equality shows that it is equivalent to apply the task feature map (6) before or after performing label propagation.

We call MTLP-inv the similarity-based method (5) and MTLP the dissimilarity-based method (7). In the next section we show some interesting properties of the map $\phi_B$ which make MTLP suitable for unbalanced classification problems.

3.2.1 Analysis of the multitask map

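Given $Y \in \{-1, 1\}^{n \times m}$, let $Y_{i\cdot}$ and $Y_{\cdot k}$ be, respectively, the $i$-th row and the $k$-th column of the matrix $Y$. Let also $P_i$ be the set of tasks for which the instance $i$ is positive, and $N_i$ the set of tasks for which the instance $i$ is negative. We introduce the following notation: for each $k \in \{1, \dots, m\}$

$$b_k^{P_i} = \sum_{j \in P_i} B_{jk}$$

and

$$b_k^{N_i} = \sum_{j \in N_i} B_{jk} .$$

The next result shows that the action of the linear map $\phi_B$ on the label matrix $Y$ is to change the value of each label without altering the sign. The label of an instance $i$ in task $k$ is made roughly proportional to the weighted sum of tasks in $\{1, \dots, m\}$ that are dissimilar to $k$ and have a different label for instance $i$ (see also Corollary 1).

Fact 1

Given $Y \in \{-1, 1\}^{n \times m}$, the task interaction matrix $B$, and the map $\phi_B$ such that $\phi_B(Y) = Y M_B$, where $M_B = I_m + L_B$, then for all $i \in \{1, \dots, n\}$ and $k \in \{1, \dots, m\}$ it holds

$$\big(\phi_B(Y)\big)_{ik} = \begin{cases} 1 + 2\, b_k^{N_i} & \text{if } Y_{ik} = 1, \\ -1 - 2\, b_k^{P_i} & \text{if } Y_{ik} = -1. \end{cases}$$

Proof:

By definition, $\big(\phi_B(Y)\big)_{ik} = \sum_{j=1}^{m} Y_{ij}\, (M_B)_{jk}$. We distinguish the following two cases.

Case 1. $k \in P_i$. In this case we have $Y_{ik} = 1$, since by definition $Y_{ij} = 1$ for any $j \in P_i$. Moreover, since $B_{kk} = 0$, we have $(M_B)_{kk} = 1 + b_k^{P_i} + b_k^{N_i}$ (by the definition of Laplacian), and accordingly

$$\big(\phi_B(Y)\big)_{ik} = 1 + b_k^{P_i} + b_k^{N_i} - b_k^{P_i} + b_k^{N_i} = 1 + 2\, b_k^{N_i} .$$

Case 2. $k \in N_i$. In this case, it holds $Y_{ik} = -1$, whereas $(M_B)_{kk} = 1 + b_k^{P_i} + b_k^{N_i}$ as before. It follows

$$\big(\phi_B(Y)\big)_{ik} = -1 - b_k^{P_i} - b_k^{N_i} - b_k^{P_i} + b_k^{N_i} = -1 - 2\, b_k^{P_i} .$$

The property is proven by observing that $Y_{ik} = 1$ implies $k \in P_i$ and $Y_{ik} = -1$ implies $k \in N_i$.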
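The statement of Fact 1 can be checked numerically on random data. The sketch below assumes the dissimilarity map is right-multiplication by "identity plus Laplacian of the task dissimilarity matrix", and that this matrix is symmetric, non-negative, and has a zero diagonal. The helper `task_map_matrix` and the closed-form expression in `expected` (label times one plus twice the total dissimilarity to tasks that disagree on that instance) are an illustrative reading of the result, not code from the paper.

```python
import numpy as np

def task_map_matrix(B):
    """M = I + Laplacian(B) for a symmetric, non-negative, zero-diagonal B."""
    B = np.asarray(B, dtype=float)
    return np.eye(B.shape[0]) + np.diag(B.sum(axis=1)) - B

rng = np.random.default_rng(1)
n, m = 5, 4

# Random +/-1 label matrix (rows: instances, columns: tasks).
Y = np.where(rng.random((n, m)) < 0.3, 1.0, -1.0)

# Random symmetric dissimilarity matrix with zero diagonal.
B = rng.random((m, m))
B = (B + B.T) / 2
np.fill_diagonal(B, 0.0)

mapped = Y @ task_map_matrix(B)

# disagree[i, j, k] is 1 when tasks j and k label instance i differently.
disagree = (Y[:, :, None] != Y[:, None, :]).astype(float)
# For entry (i, k): sum the dissimilarities B[j, k] over disagreeing tasks j.
expected = Y * (1.0 + 2.0 * np.einsum("ijk,jk->ik", disagree, B))
```

Every mapped entry keeps the sign of the original label and grows in magnitude with the dissimilarity mass of the tasks that assign the opposite label to the same instance.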

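Using Fact 1 we can show that the map $\phi_B$ tends to increase the distance between the labelings $Y_{\cdot k}$ and $Y_{\cdot r}$, for any pair of distinct tasks $k, r$. Indeed, we can prove the following.

Fact 2

Given $Y \in \{-1, 1\}^{n \times m}$, the task interaction matrix $B$, and the map $\phi_B$ such that $\phi_B(Y) = Y M_B$, where $M_B = I_m + L_B$. Then it holds

$$\big\| \phi_B(Y)_{\cdot k} - \phi_B(Y)_{\cdot r} \big\| \;\ge\; \big\| Y_{\cdot k} - Y_{\cdot r} \big\|$$

for every $k, r \in \{1, \dots, m\}$, where $\|\cdot\|$ is the Euclidean norm.

Proof:

We prove this property by showing that $\big| \phi_B(Y)_{ik} - \phi_B(Y)_{ir} \big| \ge \big| Y_{ik} - Y_{ir} \big|$ for all $i \in \{1, \dots, n\}$. We distinguish the following four cases:

  1. $k, r \in P_i$. In this case $|Y_{ik} - Y_{ir}| = 0$, and by Fact 1, $\big| \phi_B(Y)_{ik} - \phi_B(Y)_{ir} \big| = 2\, \big| b_k^{N_i} - b_r^{N_i} \big| \ge 0$.

  2. $k, r \in N_i$. Even in this case $|Y_{ik} - Y_{ir}| = 0$, whereas $\big| \phi_B(Y)_{ik} - \phi_B(Y)_{ir} \big| = 2\, \big| b_k^{P_i} - b_r^{P_i} \big| \ge 0$.

  3. $k \in P_i$, $r \in N_i$. In this case, $|Y_{ik} - Y_{ir}| = 2$, and $\phi_B(Y)_{ik} - \phi_B(Y)_{ir} = 2 + 2\, b_k^{N_i} + 2\, b_r^{P_i}$. Since both $b_k^{N_i} \ge 0$ and $b_r^{P_i} \ge 0$, it follows $\big| \phi_B(Y)_{ik} - \phi_B(Y)_{ir} \big| \ge 2$.

  4. $k \in N_i$, $r \in P_i$. Again $|Y_{ik} - Y_{ir}| = 2$, and $\phi_B(Y)_{ir} - \phi_B(Y)_{ik} = 2 + 2\, b_r^{N_i} + 2\, b_k^{P_i}$. As $b_r^{N_i} \ge 0$ and $b_k^{P_i} \ge 0$ we have, like the previous case, $\big| \phi_B(Y)_{ik} - \phi_B(Y)_{ir} \big| \ge 2$.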
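The separation property of Fact 2 can also be illustrated numerically: after the dissimilarity map, no pair of task labelings (columns of the label matrix) is closer than before. This is a sketch on random toy data under the same assumptions as above (symmetric, non-negative, zero-diagonal dissimilarity matrix); the helper `column_distances` is hypothetical.

```python
import numpy as np

def column_distances(X):
    """Euclidean distance between every pair of columns of X."""
    diff = X[:, :, None] - X[:, None, :]      # diff[i, k, r] = X[i, k] - X[i, r]
    return np.sqrt((diff ** 2).sum(axis=0))   # (m, m) matrix of distances

rng = np.random.default_rng(2)
n, m = 8, 5

# Unbalanced +/-1 labels: roughly 20% positives.
Y = np.where(rng.random((n, m)) < 0.2, 1.0, -1.0)

# Random symmetric dissimilarity matrix with zero diagonal.
B = rng.random((m, m))
B = (B + B.T) / 2
np.fill_diagonal(B, 0.0)

M = np.eye(m) + np.diag(B.sum(axis=1)) - B    # identity plus Laplacian of B

before = column_distances(Y)
after = column_distances(Y @ M)
```

Comparing `after` with `before` entry by entry shows that every pairwise distance between task labelings is preserved or enlarged, which is the entrywise argument of the proof summed over instances.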

The map not only increases the distance between the instance-indexed label vector for two distinct tasks (as we just showed), but it also increases the distance between the task-indexed label vector for two distinct instances. Indeed, since is positive semidefinite, it is easy to show that when the transformation increases the distance between the labelings and , for any pair of distinct instances .

We now focus our discussion on another important feature of the algorithm, which makes our multitask label propagation appropriate for tasks with very unbalanced labelings, that is, when most entries of each column in the label matrix $Y$ are $-1$. In this case, the rows of $Y$ also contain mostly negative entries. Accordingly, by Fact 1, we can compensate for the preponderance of negatives by applying the map $\phi_B$. We show that with an example.

Consider the task interaction matrix $B$ such that $B_{kr} = 1$ for all $k \neq r$. That is, all tasks are strongly dissimilar to each other. Then

$$M_B = I_m + L_B = (m + 1)\, I_m - \mathbf{1}\mathbf{1}^\top \tag{8}$$

that is, $M_B$ has diagonal entries equal to $m$ and off-diagonal entries equal to $-1$.

By Fact 1, it is straightforward to prove the following.

Corollary 1

Fix $Y \in \{-1, 1\}^{n \times m}$ and the map $\phi_B$ such that $\phi_B(Y) = Y M_B$, where $M_B$ is defined as in (8). Then, for all $i \in \{1, \dots, n\}$ and $k \in \{1, \dots, m\}$ it holds that

$$\big(\phi_B(Y)\big)_{ik} = \begin{cases} 1 + 2\, |N_i| & \text{if } Y_{ik} = 1, \\ -1 - 2\, |P_i| & \text{if } Y_{ik} = -1, \end{cases}$$

where $|P_i|$ and $|N_i|$ are the numbers of tasks for which instance $i$ is positive and negative, respectively.

Corollary 1 shows that, when $|N_i| > |P_i|$ (that is, the multitask labeling for vertex $i$ is unbalanced towards negatives), the map $\phi_B$ assigns to positives ($Y_{ik} = 1$) an absolute value higher than that assigned to negatives ($Y_{ik} = -1$). An analogous behaviour characterizes our method when a generic matrix $B$ is considered, as stated in Fact 1. This simple property allows the rare positive labels to propagate in the graph. This is unlike the standard LP algorithm, where positive vertices are easily overwhelmed by the negative vertices during the label propagation process. The toy example in Figure 1 shows that the application of the map $\phi_B$, where $M_B$ is defined as in (8), improves the final classification of vertices. These observations are empirically confirmed in Section 4.
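A small worked instance makes the effect of Corollary 1 concrete. The sketch assumes four tasks that are all maximally dissimilar (every off-diagonal dissimilarity equal to one) and two hypothetical instances: one with a single positive label among four tasks, and one with a balanced labeling.

```python
import numpy as np

m = 4
B = np.ones((m, m)) - np.eye(m)               # all tasks maximally dissimilar
M = np.eye(m) + np.diag(B.sum(axis=1)) - B    # identity plus Laplacian of B

Y = np.array([[1.0, -1.0, -1.0, -1.0],        # one positive, three negatives
              [1.0,  1.0, -1.0, -1.0]])       # two positives, two negatives
mapped = Y @ M
# Row 0: the positive entry becomes 1 + 2*3 = 7, each negative -1 - 2*1 = -3.
# Row 1: positives become 1 + 2*2 = 5, negatives become -1 - 2*2 = -5.
```

In the unbalanced row the single positive label receives more than twice the magnitude of each negative label, while in the balanced row positives and negatives are scaled symmetrically. This is the mechanism that lets rare positives survive the propagation step.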

Fig. 1: Toy example with four vertices , labeled for three tasks according to the matrix . The test point is instance in all the tasks, and we apply LP and MTLP to predict it. For tasks and both methods correctly associate with a negative label. However, in the third task, only MTLP correctly predicts a positive label for .

3.2.2 Task similarities

While MTLP and MTLP-inv are designed to work with any task matrix, similarity and dissimilarity measures are typically tailored to specific domains. Different tasks may share different types of similarities, or may be organized in a hierarchy with a specific structure —such as a tree or a directed acyclic graph— where the positive instances of the children tasks are subsets of the positive instances of their parent tasks. In the case of a hierarchy, different approaches for computing the task matrix are possible: considering only the structure of the hierarchy [40, 41], or combining the hierarchical information with the information content of the tasks [42].

In this work we consider two dissimilarity measures ( and ) and three similarity measures (). The similarity measures and were introduced by Jiang [43] and Lin [44], respectively. Both measures are derived from the dissimilarity measure , whose definition requires a hierarchy over the tasks. The dissimilarity is computed directly from the similarity , which does not require any hierarchical information.

When tasks are organized in a hierarchy, we denote by $\mathrm{anc}(i)$ the set of ancestor tasks of task $i$ in the hierarchy. Moreover, we use $p_i$ to denote the frequency of positive instances for task $i$. Since a positive instance for a task $i$ is also positive for any $j \in \mathrm{anc}(i)$, it holds that $p_j \ge p_i$. Finally, we denote by $\mathrm{ca}(i, j)$ the common ancestor of tasks $i$ and $j$ whose frequency is the lowest among all common ancestors of $i$ and $j$.

Let be the information content of task . We start by recalling the hierarchical dissimilarity measure introduced in [44],

This is the sum of the information content of and minus the information content of their closest common ancestor . Note that is always positive, as . The two hierarchical similarity measures associated with are defined as follows.

Jiang similarity measure:

Lin similarity measure:

Our third similarity measure does not rely on a hierarchy of tasks. Let $V^+_i$ be the set of instances that are positive for the task $i$.

Information content measure:

$$s(i, j) = \frac{|V^+_i \cap V^+_j|}{|V^+_i \cup V^+_j|}$$

This is the ratio between the number of examples that are positive for both tasks and the number of examples that are positive for at least one task. The higher the number of shared positive examples, the higher the similarity (up to $1$). When two tasks do not share any positive example, their similarity is zero. In a hierarchy of tasks, tasks with many positive examples are usually closer to the root (less specific). In this case the denominator of $s(i, j)$ tends to reduce the similarity between the two tasks as opposed to the case in which tasks have a small number of positive annotations. Indeed, sharing annotations between two specific tasks (closer to leaves) is more informative than sharing annotations between two more general tasks (closer to the root).
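The annotation-overlap similarity described above (shared positives divided by positives of at least one task) can be computed for all task pairs at once from the label matrix. The following is a minimal sketch; the function name `overlap_similarity`, the convention of returning zero when neither task has any positive, and the toy labels are assumptions made for illustration.

```python
import numpy as np

def overlap_similarity(Y):
    """Pairwise task similarity: |shared positives| / |positives of either task|.

    Y is an (n_instances, n_tasks) matrix with +1 for positive labels.
    Pairs with no positive in either task get similarity 0 (an assumption).
    """
    P = (np.asarray(Y) > 0).astype(float)          # indicator of positive labels
    shared = P.T @ P                               # shared[i, j] = |V+_i ∩ V+_j|
    sizes = P.sum(axis=0)                          # |V+_i| for each task
    either = sizes[:, None] + sizes[None, :] - shared   # |V+_i ∪ V+_j|
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(either > 0, shared / either, 0.0)

# Toy labels: task 0 is positive on instances {0, 1}, task 1 on {1, 2},
# task 2 on {3}.
Y = np.array([[ 1, -1, -1],
              [ 1,  1, -1],
              [-1,  1, -1],
              [-1, -1,  1]])
S = overlap_similarity(Y)
```

Tasks 0 and 1 share one of three annotated instances, so their similarity is one third, while task 2 shares nothing with the others and gets similarity zero.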

In the experiments, we compare learning with similarities and against learning with the dissimilarity . We also compare learning with against . For each one of the similarity/dissimilarity measures defined above, we set and (where necessary, values are normalized so that all matrix entries lie in the range ).

4 Results and Discussion

In this section we evaluate our multitask algorithms on the prediction of the bio-molecular functions of proteins belonging to some considered model organisms. We start by describing the experimental setting. Then we compare the performance of our algorithms against that of state-of-the-art methods.

4.1 Experimental setting

4.1.1 Data

We considered three different experiments to predict the protein functions of three model organisms: Drosophila melanogaster (fly), Homo sapiens (human) and Escherichia coli (bacteria). Gene networks for model organisms have been downloaded from the GeneMANIA website (www.genemania.org), and selected in order to cover different types of data, including co-expression, genetic interactions, shared domains, and physical interactions. The selected networks are described in Tables I, II, and III.

Type | Source | Nodes
Co-expression | Baradaran-Heravi et al. [45] | 8857
Co-expression | Busser et al. [46] | 8857
Co-expression | Colombani et al. [47] | 8857
Co-expression | Lundberg et al. [48] | 8857
Genetic interactions | BioGRID [49] | 929
Genetic interactions | Yu et al. [50] | 1414
Physical interactions | Guruharsha et al. A [51] | 1866
Physical interactions | Guruharsha et al. B [51] | 3833
Physical interactions | BioGRID [49] | 558
Shared protein domains | InterPro [52] | 5627
TABLE I: Fly networks.

Type | Source | Nodes
Co-expression | Bahr et al. [53] | 7611
Co-expression | Balgobind et al. [54] | 17522
Co-expression | Bigler et al. [55] | 17522
Co-expression | Botling et al. [56] | 17522
Co-expression | Clarke et al. [57] | 17458
Co-expression | Vallat et al. [58] | 17521
Common biological pathways | PATHWAYCOMMONS [59] | 2133
Common biological pathways | Wu et al. [60] | 5319
Physical interactions | BioGRID [49] | 15800
Physical interactions | iRref-GRID [61] | 9403
Physical interactions | iRref-HPRD [61] | 9403
Physical interactions | iRref-OPHID [61] | 9403
Physical interactions | IREF SMALL-SCALE-STUDIES [61] | 9036
Shared protein domains | InterPro [52] | 15800
Shared protein domains | Pfam [62] | 15251
TABLE II: Human networks.

Type | Source | Nodes
Co-expression | Graham et al. [63] | 3959
Co-expression | Robbins-Manke et al. [64] | 3912
Genetic interactions | Babu et al. [65] | 715
Genetic interactions | Butland et al. [66] | 3497
Physical interactions | Hu et al. [67] | 1537
Physical interactions | IREF-Dip [61] | 633
Physical interactions | Y2H - PPI | 1063
Shared protein domains | InterPro [52] | 3005
Shared protein domains | Pfam [62] | 2726
TABLE III: Bacteria networks.

For every organism, networks were integrated through an unweighted sum on the union of genes in the individual networks. No preprocessing was applied to the individual networks, whereas each network, denoted by the corresponding connection matrix $W$, was normalized as follows:

$$\widehat{W} = D^{-1/2}\, W\, D^{-1/2}$$

where $D$ is the diagonal matrix with diagonal entries $D_{ii} = \sum_j W_{ij}$.
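The integration step (normalize each network, then take an unweighted sum over the union of genes) can be sketched as follows. The exact normalization formula is not recoverable from the text here, so the symmetric degree normalization used below is an assumption, as are the function names `normalize_network` and `integrate_networks`, the `(genes, W)` input format, and the gene identifiers in the toy example.

```python
import numpy as np

def normalize_network(W):
    """Symmetric degree normalization D^{-1/2} W D^{-1/2} (assumed form)."""
    W = np.asarray(W, dtype=float)
    degree = W.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degree[nonzero])  # isolated nodes stay 0
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]

def integrate_networks(networks):
    """Unweighted sum of normalized networks on the union of their genes.

    networks: list of (genes, W) pairs, where genes lists the identifiers of
    the rows/columns of the square matrix W (no duplicates within a network).
    """
    all_genes = sorted(set().union(*(genes for genes, _ in networks)))
    position = {gene: i for i, gene in enumerate(all_genes)}
    total = np.zeros((len(all_genes), len(all_genes)))
    for genes, W in networks:
        idx = np.array([position[g] for g in genes])
        total[np.ix_(idx, idx)] += normalize_network(W)
    return all_genes, total

# Two toy networks over partially overlapping gene sets.
net1 = (["a", "b"], [[0, 2], [2, 0]])
net2 = (["b", "c"], [[0, 4], [4, 0]])
genes, W_total = integrate_networks([net1, net2])
```

Genes missing from a network simply contribute no edges from that network, so the integrated matrix is defined on the union of all genes without any imputation.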

Protein functions were downloaded from the Gene Ontology. This ontology is structured as a directed acyclic graph with different levels of specificity and contains three branches: Biological Process (BP), Molecular Functions (MF), and Cellular Components (CC). We considered the experimental annotations in the releases 07.03.16, 16.03.16, and 17.10.16 respectively for fly, human and bacteria organisms. We performed a dedicated experiment for every branch.

For predicting the most specific terms in the ontology (i.e., those best describing protein functions), and in order to consider terms with a minimum amount of prior information, we selected all the GO terms with positive annotated genes, obtaining ( BP, MF, CC), ( BP, MF, CC), and ( BP, MF, CC) terms for fly, human, and bacteria, respectively. We considered two groups of GO terms according to their specificity: GO terms with - and - annotated proteins, for a total of categories for every GO branch. In the end, we obtained a total of fly, human, and bacteria genes which have at least one GO positive annotation in the considered GO release. The obtained tasks are therefore severely unbalanced toward negatives.

4.1.2 Evaluation metrics

In order to evaluate the generalization performance of the compared methods, we applied a $k$-fold cross-validation experimental setting and adopted the Area Under the Precision-Recall Curve (AUPRC) as the “per term” ranking measure. AUPRC is indeed more informative in unbalanced settings than the classical area under the ROC curve [68]. Furthermore, following the recent CAFA2 international challenge, we also considered a “protein-centric evaluation” to assess performance accuracy in predicting all ontological terms associated with a given protein sequence [9]. In this scenario, the multiple-label F-score is used as the performance measure. More precisely, if we indicate as $TP_j(t)$, $FN_j(t)$, and $FP_j(t)$ respectively the number of true positives, false negatives, and false positives for the protein $j$ at threshold $t$, we can define the “per-protein” multiple-label precision and recall at a given threshold $t$ as:

$$\mathrm{Prec}(t) = \frac{1}{N} \sum_{j=1}^{N} \frac{TP_j(t)}{TP_j(t) + FP_j(t)}, \qquad \mathrm{Rec}(t) = \frac{1}{N} \sum_{j=1}^{N} \frac{TP_j(t)}{TP_j(t) + FN_j(t)}$$

where $N$ is the number of proteins. In other words, $\mathrm{Prec}(t)$ (resp., $\mathrm{Rec}(t)$) is the average multilabel precision (resp., recall) across proteins. The multilabel F-measure depends on $t$ and, according to the CAFA2 experimental setting, the maximum achievable F-score ($F_{\max}$) is adopted as the main multilabel “per-protein” metric:

$$F_{\max} = \max_{t} \frac{2\, \mathrm{Prec}(t)\, \mathrm{Rec}(t)}{\mathrm{Prec}(t) + \mathrm{Rec}(t)} \tag{9}$$
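The protein-centric metric can be sketched as a sweep over thresholds. This is an illustrative implementation under stated assumptions: the function name `f_max`, the default threshold grid, and the convention that a protein with an undefined ratio (no predicted or no true terms) contributes zero are all choices made here. The official CAFA evaluation handles proteins without predictions differently, so the numbers need not match the CAFA tooling.

```python
import numpy as np

def f_max(scores, labels, thresholds=None):
    """Maximum over thresholds of the F-score of per-protein averaged
    precision and recall.

    scores : (n_proteins, n_terms) prediction scores
    labels : (n_proteins, n_terms) binary ground truth (1 = annotated)
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)   # assumed grid
    n = scores.shape[0]
    best = 0.0
    for t in thresholds:
        pred = scores >= t
        tp = (pred & labels).sum(axis=1).astype(float)
        fp = (pred & ~labels).sum(axis=1)
        fn = (~pred & labels).sum(axis=1)
        # Undefined per-protein ratios are counted as 0 (an assumption).
        prec = np.divide(tp, tp + fp, out=np.zeros(n), where=(tp + fp) > 0).mean()
        rec = np.divide(tp, tp + fn, out=np.zeros(n), where=(tp + fn) > 0).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best

# Two proteins, two terms; only the first term is a true annotation.
scores = [[0.9, 0.1],
          [0.8, 0.7]]
labels = [[1, 0],
          [1, 0]]
best_f = f_max(scores, labels)                       # best threshold in (0.7, 0.8]
f_at_half = f_max(scores, labels, thresholds=[0.5])  # one false positive survives
```

With the default grid a threshold between 0.7 and 0.8 separates the true term from the false one for both proteins, while at the single threshold 0.5 the second protein keeps a false positive, which lowers the averaged precision to 0.75.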

4.2 Results

4.2.1 Evaluating GO semantic similarities

This section investigates the impact of the task similarity/dissimilarity measures described in Section 3.2.2 on the performance of the proposed multitask label propagation algorithms. Table IV shows the obtained results. In this experiment we set (the choice of parameter is discussed in Section 4.2.5). When MTLP-inv uses the similarity measures and MTLP uses , MTLP outperforms MTLP-inv in both AUPRC and . Nevertheless, the GO term similarity is much more informative for MTLP-inv, which achieves in this case results competitive with MTLP (whose performance instead is nearly indistinguishable when using or ), and in some cases even better. The difference in favor of MTLP seems to increase with the data imbalance: on human data set, the most unbalanced, we observe the highest gap in favor of MTLP; whereas on the Bacteria data set, the least unbalanced, the gap is reduced and —in some cases like for the MF terms— MTLP-inv significantly outperforms MTLP in terms of average AUPRC. In terms of , however, MTLP is always the top method.

Overall, these results suggest that MTLP should be preferred when the proportion of positives is drastically smaller than that of negatives. When data are more balanced, MTLP-inv better exploits the similarities among tasks and, at least in terms of AUPRC, is a valid option. In terms of multilabel accuracy, MTLP is always better than MTLP-inv. Finally, it is worth noting that both methods outperform LP in terms of AUPRC (see Section 4.2.3 for LP results), whereas in terms of $F_{\max}$ only MTLP achieves better results than LP.

METHODS | BP All | BP - | BP - | BP Fmax | MF All | MF - | MF - | MF Fmax | CC All | CC - | CC - | CC Fmax
FLY
MTLP | 0.140 | 0.133 | 0.153 | 0.247 | 0.333 | 0.322 | 0.355 | 0.411 | 0.262 | 0.265 | 0.253 | 0.354
MTLP | 0.140 | 0.133 | 0.153 | 0.246 | 0.333 | 0.322 | 0.355 | 0.410 | 0.262 | 0.265 | 0.253 | 0.357
MTLP-inv | 0.020 | 0.013 | 0.031 | 0.183 | 0.198 | 0.179 | 0.238 | 0.374 | 0.150 | 0.138 | 0.181 | 0.306
MTLP-inv | 0.020 | 0.014 | 0.031 | 0.170 | 0.192 | 0.172 | 0.235 | 0.351 | 0.101 | 0.082 | 0.147 | 0.259
MTLP-inv | 0.135 | 0.129 | 0.146 | 0.244 | 0.328 | 0.318 | 0.352 | 0.381 | 0.261 | 0.265 | 0.251 | 0.333
HUMAN
MTLP | 0.144 | 0.133 | 0.165 | 0.273 | 0.248 | 0.247 | 0.250 | 0.383 | 0.224 | 0.259 | 0.156 | 0.317
MTLP | 0.145 | 0.134 | 0.165 | 0.275 | 0.249 | 0.248 | 0.250 | 0.385 | 0.224 | 0.259 | 0.156 | 0.318
MTLP-inv | 0.008 | 0.005 | 0.014 | 0.200 | 0.093 | 0.083 | 0.152 | 0.330 | 0.105 | 0.113 | 0.090 | 0.274
MTLP-inv | 0.008 | 0.005 | 0.012 | 0.182 | 0.059 | 0.050 | 0.079 | 0.294 | 0.066 | 0.064 | 0.068 | 0.223
MTLP-inv | 0.139 | 0.129 | 0.159 | 0.244 | 0.243 | 0.241 | 0.244 | 0.355 | 0.220 | 0.256 | 0.160 | 0.299
BACTERIA
MTLP | 0.119 | 0.107 | 0.169 | 0.210 | 0.173 | 0.157 | 0.238 | 0.269 | 0.122 | 0.105 | 0.220 | 0.348
MTLP | 0.119 | 0.107 | 0.168 | 0.212 | 0.173 | 0.157 | 0.238 | 0.276 | 0.122 | 0.105 | 0.219 | 0.348
MTLP-inv | 0.069 | 0.056 | 0.123 | 0.181 | 0.106 | 0.092 | 0.165 | 0.235 | 0.101 | 0.086 | 0.187 | 0.246
MTLP-inv | 0.053 | 0.043 | 0.094 | 0.109 | 0.057 | 0.045 | 0.107 | 0.117 | 0.106 | 0.089 | 0.207 | 0.281
MTLP-inv | 0.121 | 0.108 | 0.176 | 0.189 | 0.181 | 0.165 | 0.247 | 0.247 | 0.123 | 0.107 | 0.212 | 0.289
TABLE IV: Comparison according to average AUPRC and multilabel F-measure (Fmax) between MTLP and MTLP-inv using the semantic similarity measures described in Section 3.2.2. For each GO branch, the first three columns report average AUPRC and the fourth reports Fmax. Column All is the average across all GO terms, column - is the average across GO terms with at most positive genes, and column - is the average across terms with more than positives. Best results are in boldface. Results are underlined when the difference between MTLP and MTLP-inv is statistically significant (Wilcoxon signed rank test, -).

In order to investigate the reasons why, unlike MTLP-inv, MTLP performance slightly varies with the task dissimilarity measure, we run MTLP on the fly organism and CC tasks by randomly generating the matrix . We generated matrices with different sparsity (from to , with steps of ) and with different ranges of weight values. Specifically, we uniformly selected weights in the interval , with ranging from to , by steps of . In Figure 2, we show the heatmap of the average AUPRC obtained in each experiment.

Fig. 2: Average AUPRC values achieved by MTLP method of fly data and CC GO terms when the matrix is randomly generated. Values of are reported on the columns, whereas row labels show the proportion of nonzero entries in the generated matrix. The lighter the color, the larger the corresponding AUPRC value.

As expected, the results are considerably worse than those obtained when considering real dissimilarity matrices (see Table IV). There is a small AUPRC variation from the different random data, with higher AUPRC when the dissimilarity matrix is denser and with larger entries (the former seems to affect the results more than the latter). This is consistent with Fact 1, since the lower the weight and/or the sparser the matrix, the closer MTLP is to LP. Finally, on randomly generated dissimilarity matrices MTLP performs even worse than LP, as we can see from Figure 4.

Fig. 3: Average AUPRC performance across all GO terms (All), across GO terms with at most positive instances (), and across terms with more than positives ().

4.2.2 Grouping GO terms for multitask mapping

Following the approach proposed in [4], in addition to the strategy grouping GO terms by branch (i) adopted in the previous section, we have examined an alternative way for grouping the terms to be considered in the multitask maps (4) and (6) when running MTLP-inv and MTLP algorithms, respectively. Specifically, we grouped GO terms not just by GO branch (BP, MF, and CC), but also by taking into account the number of annotated proteins (ii), obtaining groups: BP with - ( terms) and - ( terms) annotations, MF - ( terms), - ( terms), and CC - ( terms), - ( terms). The corresponding results on fly data are reported in Figure 3. AUPRC results show negligible differences between strategies (i) and (ii), for both MTLP and MTLP-inv. More clear is the difference in terms of Fmax, with opposite behaviour between MTLP and MTLP-inv: MTLP has worse performance in all GO branches; MTLP-inv instead tends to perform better (see for instance MF results). Indeed, black lines (grouping strategy (ii)) in correspondence of Fmax are always between grey lines (grouping strategy (i)). However, the best results are still achieved by MTLP when grouping terms by GO branch, and accordingly we consider this strategy in the rest of the paper.

4.2.3 Prediction of GO functions for fly, human, and bacteria organisms

MTLP () was compared with state-of-the-art graph-based methodologies applied to the prediction of protein functions. We considered: LP, the label propagation algorithm described in Section 3.1; COSNetM [15], an extension of a node classifier designed for unbalanced settings [36]; RW, the classical -step random walk algorithm [69]; GBA, a method based on the guilt-by-association assumption [23]; MS-kNN, one of the best methods in the recent CAFA2 challenge, which applies the kNN algorithm to each network independently and then combines the obtained predictions [70].

In order to deal with label imbalance in LP, we applied a label normalization step before running label propagation. This step normalizes the labels of each GO term so that positive and negative labels sum to . In our experiments, this variant of LP performs much better than the vanilla LP algorithm. For the RW algorithm we set the limit on the number of iterations to , since higher values did not improve the performance while increasing the computational burden. Finally, we set to the parameter for the kNN algorithm, as a result of a tuning process on training data.
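The label normalization step can be illustrated with a short sketch. It follows one possible reading of the description above: the labels of each class are rescaled so that positives and negatives carry the same total mass. The function name and the parameter `total` are hypothetical, and `total` merely stands in for the target value, which is not shown in the text.

```python
def normalize_labels(labels, total=1.0):
    """Rescale the labels of one GO term before label propagation.

    `labels` holds +1 (positive), -1 (negative), or 0 (unlabeled).
    After rescaling, positive labels sum to +total and negative
    labels sum to -total; `total` is an assumed placeholder value.
    """
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = sum(1 for y in labels if y < 0)
    normalized = []
    for y in labels:
        if y > 0:
            normalized.append(total / n_pos)
        elif y < 0:
            normalized.append(-total / n_neg)
        else:
            normalized.append(0.0)  # unlabeled vertices stay at zero
    return normalized


if __name__ == "__main__":
    # One positive among four labeled proteins, plus one unlabeled protein.
    print(normalize_labels([1, -1, -1, -1, 0]))
```

With this scaling a rare positive label weighs as much as all negatives together, which is the kind of rebalancing that helps LP on terms with few annotations.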

Fig. 4: Average AUPRC performance across all GO terms (All), across GO terms with at most positive instances (), and across terms with more than positives ().
Fig. 5: Average multi-label F-measure performance across all GO terms.

In Figures 4 and 5 we show the obtained results in terms of AUPRC and , on BP and CC terms respectively (on MF terms the methods showed a similar behaviour). Interestingly, MTLP always achieves the highest AUPRC averaged over all tasks (All), with statistically significant improvements over the second-best method (-), except for bacteria data and for BP terms on fly data. Compared with LP, the improvement is always significant, except for CC terms on bacteria data. COSNetM is the second-best method on the human and fly data sets, while on bacteria LP (CC) and RW (BP) rank second. Furthermore, and more importantly, MTLP improvements are more noticeable on the most unbalanced terms, which are those best characterizing the biological functions of genes. The GBA, MS-kNN and RW methods seem to suffer in the strongly unbalanced setting, and perform worse than LP, with the exception of RW on the bacteria data set. The good performance of COSNetM in this unbalanced setting is likely due to its cost-sensitive strategy, which requires learning two model parameters. This extra learning step increases its computation time. Indeed, COSNetM takes on average around seconds on a Linux machine with an Intel Xeon(R) CPU at 3.60GHz and 32 GB of RAM to perform an entire cross-validation cycle for one task on fly data, whereas both LP and MTLP take on average slightly less than one second. This confirms our observation that applying the map after label propagation does not increase the algorithm complexity, and just slightly increases the execution time for computing .
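The function-centric evaluation used in Figure 4 can be sketched as follows: compute one AUPRC per GO term, then average over all terms and separately over rare and frequent terms. This is a minimal illustration, not the evaluation code behind the reported numbers. AUPRC is approximated here by average precision without tie handling, and `threshold` is a hypothetical parameter standing in for the cut-off on the number of positives, whose value is not shown in the captions.

```python
def auprc(scores, labels):
    """Average precision of one term: labels are 0/1, higher score = more positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        return float("nan")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


def grouped_average(per_term_auprc, n_positives, threshold):
    """Average AUPRC over all terms, rare terms, and frequent terms."""
    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    rare = [a for a, n in zip(per_term_auprc, n_positives) if n <= threshold]
    frequent = [a for a, n in zip(per_term_auprc, n_positives) if n > threshold]
    return {
        "all": mean(per_term_auprc),
        "rare": mean(rare),
        "frequent": mean(frequent),
    }


if __name__ == "__main__":
    term_a = auprc([0.9, 0.8, 0.1], [1, 0, 1])
    term_b = auprc([0.7, 0.2, 0.1], [1, 0, 0])
    print(grouped_average([term_a, term_b], n_positives=[2, 1], threshold=1))
```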

In terms of Fmax, MTLP again obtains the best results, with LP as the second-best method (except for BP terms on fly data). This shows that our method achieves good predictive capabilities both when predicting single GO terms and when predicting a GO multilabel for single proteins. On the other hand, the compared methods tend to be competitive in only one scenario. For instance, RW performs poorly in terms of , whereas MS-kNN, unlike in terms of AUPRC, achieves good Fmax results: on BP (fly data) it is the best method after MTLP. Even COSNetM, which is the second-best method in terms of AUPRC, ranks only third or fourth.
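For the protein-centric setting, the sketch below computes a multilabel Fmax, assuming the CAFA-style definition: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and the best harmonic mean over thresholds is returned. The exact definition used in the paper is given in an earlier section, so this should be read as an assumption. The function name, the data layout, and the threshold grid are hypothetical.

```python
def fmax(scores, truth, thresholds=None):
    """Protein-centric Fmax (CAFA-style, assumed definition).

    scores: one dict per protein, mapping GO term -> predicted score in [0, 1]
    truth:  one set per protein with its annotated GO terms
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]  # assumed grid
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein_scores, true_terms in zip(scores, truth):
            if not true_terms:
                continue  # proteins without annotations are skipped
            predicted = {term for term, s in protein_scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions or not recalls:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    predictions = [{"GO:a": 0.9, "GO:b": 0.2}, {"GO:a": 0.1, "GO:b": 0.8}]
    annotations = [{"GO:a"}, {"GO:b"}]
    print(fmax(predictions, annotations))
```

Because this measure thresholds scores jointly across terms for each protein, a method can rank well per term (high AUPRC) and still score poorly here, which matches the contrast between the two evaluation settings described above.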

4.2.4 Evaluating different powers of the Laplacian matrix

A further experiment was carried out to analyze how MTLP performance changes when using the map for , instead of . We empirically tested on the fly organism different values of , fixing the parameter and using the measure. The results are shown in Figure 6. We considered . Except for BP terms, where the map performs slightly better than , all choices of lead to worse results. In particular, the performance strongly decays for .

Fig. 6: Average AUPRC values achieved by MTLP on fly data with different values of the parameter .

4.2.5 Impact of parameter

Large values of the parameter, introduced in Section 3.2, tend to reduce the multitask contribution encoded in , since is diagonally dominant and the absolute labels assigned to positive and negative vertices by the map tend to be almost the same (see Fact 1). Hence, this parameter allows us to “regulate” the method, to some extent, between multitask and singletask label propagation. We experimentally tuned on fly and human data from to with step size . It turns out that the difference is negligible, with the results reported in Table IV corresponding to . This is expected, since is much larger than in the considered experiments. For this reason, we also performed another experiment in which we selected a smaller subset of terms in the BP branch (a similar trend is observed for the MF and CC branches). Specifically, we ran our algorithm on a subset of terms for the fly organism, varying in the specified range. The results are shown in Table V. Confirming our observations, MTLP is more sensitive to values in this setting, and the overall trend is that the average AUPRC tends to decrease when becomes larger (similarly to ). This is not surprising: as we explained, with large values of MTLP behaves more like LP, whose results are lower in this setting.

        All     -       -
0.25    0.158   0.150   0.182
0.5     0.157   0.151   0.177
0.75    0.145   0.134   0.178
1       0.144   0.133   0.178
1.25    0.140   0.130   0.175
1.5     0.139   0.129   0.174
TABLE V: AUPRC of the MTLP method (, task similarity measure averaged across selected MF GO terms for human data by varying the parameter . Column All is the average across all tasks, column - is the average across terms with at most annotations, and column - is the average across terms with more than positives.

5 Conclusions

We have shown that task relatedness information represented through task dissimilarity is better suited for label propagation in unbalanced protein function prediction than task similarity. The proposed multitask label propagation algorithm compared favourably with the state-of-the-art methodologies for protein function prediction on three model organisms. Although we gained some intuition and collected empirical evidence, we are still investigating the multitask problems where our approach is most effective. Specifically, it would be useful to study whether dissimilarity information helps when coupled with multitask algorithms different from label propagation, for example linear learning algorithms such as SVM or Perceptron. Laplacian spectral theory is also likely to help us shed some further light on the properties of our method.

Acknowledgments

The authors would like to thank the reviewers of BIOKDD16 for useful comments on an earlier draft of this paper.

References

  • [1] N. Youngs, D. Penfold-Brown, K. Drew, D. Shasha, and R. Bonneau, “Parametric bayesian priors and better choice of negative examples improve protein function prediction,” Bioinformatics, vol. 29, no. 9, pp. 1190–1198, 2013.
  • [2] M. Frasca and D. Malchiodi, “Selection of negative examples for node label prediction through fuzzy clustering techniques,” in Advances in Neural Networks: Computational Intelligence for ICT, S. Bassis, A. Esposito, C. F. Morabito, and E. Pasero, Eds.   Cham: Springer International Publishing, 2016, pp. 67–76. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-33747-0_7
  • [3] S. Mostafavi and Q. Morris, “Using the gene ontology hierarchy when predicting gene function,” in Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI-09).   Corvallis, Oregon: AUAI Press, 2009, pp. 419–427.
  • [4] S. Mostafavi and Q. Morris, “Fast integration of heterogeneous data sources for predicting gene function with limited annotation,” Bioinformatics, vol. 26, no. 14, pp. 1759–1765, 2010.
  • [5] M. Frasca, A. Bertoni et al., “UNIPred: unbalance-aware Network Integration and Prediction of protein functions,” Journal of Computational Biology, vol. 22, no. 12, pp. 1057–1074, 2015.
  • [6] The Gene Ontology Consortium, “Gene ontology: tool for the unification of biology,” Nature Genet., vol. 25, pp. 25–29, 2000.
  • [7] A. Ruepp, A. Zollner, D. Maier, K. Albermann, J. Hani, M. Mokrejs, I. Tetko, U. Guldener, G. Mannhaupt, M. Munsterkotter, and H. Mewes, “The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes,” Nucleic Acids Research, vol. 32, no. 18, pp. 5539–5545, 2004.
  • [8] P. Radivojac et al., “A large-scale evaluation of computational protein function prediction,” Nature Methods, vol. 10, no. 3, pp. 221–227, 2013.
  • [9] Y. Jiang, T. R. Oron et al., “An expanded evaluation of protein function prediction methods shows an improvement in accuracy,” Genome Biology, vol. 17, no. 1, p. 184, 2016. [Online]. Available: http://dx.doi.org/10.1186/s13059-016-1037-6
  • [10] D. Martin, M. Berriman, and G. Barton, “Gotcha: a new method for prediction of protein function assessed by the annotation of seven genomes.” BMC Bioinformatics, vol. 5, p. 178, 2004.
  • [11] T. Hawkins, M. Chitale, S. Luban et al., “Pfp: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data.” Proteins, vol. 74, no. 3, pp. 566–82, 2009.
  • [12] A. Juncker, L. Jensen, A. Perleoni, A. Bernsel, M. Tress, P. Bork, G. von Heijne, A. Valencia, A. Ouzounis, R. Casadio, and S. Brunak, “Sequence-based feature prediction and annotation of proteins,” Genome Biology, vol. 10:206, 2009.
  • [13] A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani, “Global protein function prediction from protein-protein interaction networks,” Nature Biotechnology, vol. 21, pp. 697–700, 2003.
  • [14] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Mol. Sys. Biol., vol. 8, no. 88, 2007.
  • [15] M. Frasca, “Automated gene function prediction through gene multifunctionality in biological networks,” Neurocomputing, vol. 162, pp. 48 – 56, 2015.
  • [16] A. Sokolov and A. Ben-Hur, “Hierarchical classification of Gene Ontology terms using the GOstruct method,” Journal of Bioinformatics and Computational Biology, vol. 8, no. 2, pp. 357–376, 2010.
  • [17] A. Sokolov, C. Funk, K. Graim, K. Verspoor, and A. Ben-Hur, “Combining heterogeneous data sources for accurate functional annotation of proteins,” BMC Bioinformatics, vol. 14, no. Suppl 3:S10, 2013.
  • [18] G. Obozinski, G. Lanckriet, C. Grant, J. M., and W. Noble, “Consistent probabilistic output for protein function prediction,” Genome Biology, vol. 9, no. S6, 2008.
  • [19] Y. Guan, C. Myers, D. Hess, Z. Barutcuoglu, A. Caudy, and O. Troyanskaya, “Predicting gene function in a hierarchical context with an ensemble of classifiers,” Genome Biology, vol. 9, no. S2, 2008.
  • [20] G. Valentini, “Hierarchical Ensemble Methods for Protein Function Prediction,” ISRN Bioinformatics, vol. 2014, no. Article ID 901419, pp. 1–34, 2014.
  • [21] E. Marcotte, M. Pellegrini, M. Thompson, T. Yeates, and D. Eisenberg, “A combined algorithm for genome-wide prediction of protein function,” Nature, vol. 402, pp. 83–86, 1999.
  • [22] S. Oliver, “Guilt-by-association goes global,” Nature, vol. 403, pp. 601–603, 2000.
  • [23] B. Schwikowski, P. Uetz, and S. Fields, “A network of protein-protein interactions in yeast,” Nature Biotechnology, vol. 18, no. 12, pp. 1257–1261, Dec. 2000.
  • [24] Y. Li and J. Patra, “Integration of multiple data sources to prioritize candidate genes using discounted rating systems,” BMC Bioinformatics, vol. 11, no. Suppl I:S20, 2010.
  • [25] P. Bogdanov and A. Singh, “Molecular function prediction using neighborhood features,” IEEE ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 208–217, 2011.
  • [26] X. Zhu et al., “Semi-supervised learning with gaussian fields and harmonic functions,” in Proc. of the 20th Int. Conf. on Machine Learning, Washington DC, USA, 2003.
  • [27] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.
  • [28] M. Szummer and T. Jaakkola, “Partially labeled classification with markov random walks,” in NIPS 2001, vol. 14, Whistler BC, Canada, 2001.
  • [29] A. Azran, “The rendezvous algorithm: Multi-class semi-supervised learning with Markov random walks,” in Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
  • [30] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, and M. Singh, “Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps,” Bioinformatics, vol. 21, no. S1, pp. 302–310, 2005.
  • [31] G. Valentini, G. Armano, M. Frasca, J. Lin, M. Mesiti, and M. Re, “RANKS: a flexible tool for node label ranking and classification in biological networks,” Bioinformatics, 2016, in press. Accepted on 22 April 2016.
  • [32] M. Deng, T. Chen, and F. Sun, “An integrated probabilistic model for functional prediction of proteins,” J. Comput. Biol., vol. 11, pp. 463–475, 2004.
  • [33] K. Tsuda, H. Shin, and B. Scholkopf, “Fast protein classification with multiple networks,” Bioinformatics, vol. 21, no. Suppl 2, pp. ii59–ii65, 2005.
  • [34] S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, and Q. Morris, “GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function,” Genome Biology, vol. 9, no. S4, 2008.
  • [35] U. Karaoz et al., “Whole-genome annotation by using evidence integration in functional-linkage networks,” Proc. Natl Acad. Sci. USA, vol. 101, pp. 2888–2893, 2004.
  • [36] A. Bertoni, M. Frasca, and G. Valentini, COSNet: A Cost Sensitive Neural Network for Semi-supervised Learning in Graphs.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 219–234. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-23780-5_24
  • [37] M. Frasca, S. Bassis, and G. Valentini, “Learning node labels with multi-category hopfield networks,” Neural Computing and Applications, vol. 27, no. 6, pp. 1677–1692, 2016. [Online]. Available: http://dx.doi.org/10.1007/s00521-015-1965-1
  • [38] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, pp. 1373–1396, 2002.
  • [39] B. Kveton, M. Valko, A. Rahimi, and L. Huang, “Semi-supervised learning with max-margin graph cuts,” in AISTATS, ser. JMLR Proceedings, Y. W. Teh and D. M. Titterington, Eds., vol. 9.   JMLR.org, 2010, pp. 421–428. [Online]. Available: http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#KvetonVRH10
  • [40] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proceedings of the 32nd annual meeting on Association for Computational Linguistics.   Morristown, NJ, USA: Association for Computational Linguistics, 1994, pp. 133–138. [Online]. Available: http://portal.acm.org/citation.cfm?id=981751
  • [41] C. Leacock and M. Chodorow, “Combining local context and wordnet similarity for word sense identification,” in MIT Press, C. Fellbaum, Ed., Cambridge, Massachusetts, 1998, pp. 265–283.
  • [42] L. Meng, R. Huang, and J. Gu, “A review of semantic similarity measures in wordnet,” International Journal of Hybrid Information Technology, vol. 6, no. 1, pp. 1–12, 2013.
  • [43] J. J. Jiang and D. W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy,” in International Conference Research on Computational Linguistics (ROCLING X), Sep. 1997, pp. 9008+. [Online]. Available: http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=1997cmp.lg....9008J
  • [44] D. Lin, “An information-theoretic definition of similarity,” in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML ’98.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 296–304. [Online]. Available: http://dl.acm.org/citation.cfm?id=645527.657297
  • [45] A. Baradaran-Heravi, K. S. Cho, B. Tolhuis et al., “Penetrance of biallelic SMARCAL1 mutations is associated with environmental and genetic disturbances of gene expression,” Human Molecular Genetics, vol. 21, no. 11, pp. 2572–2587, Jun. 2012.
  • [46] B. W. Busser, L. Shokri, S. A. Jeager et al., “Molecular mechanism underlying the regulatory specificity of a Drosophila homeodomain protein that specifies myoblast identity.” Development (Cambridge, England), vol. 139, no. 6, pp. 1164–1174, Mar. 2012.
  • [47] J. Colombani, D. S. Andersen, and P. Léopold, “Secreted peptide dilp8 coordinates drosophila tissue growth with developmental timing,” Science, vol. 336, no. 6081, pp. 582–585, 2012.
  • [48] L. E. Lundberg, M. Fiqueiredo, P. Stenberg et al., “Buffering and proteolysis are induced by segmental monosomy in Drosophila melanogaster,” Nucleic Acids Research, Mar. 2012.
  • [49] C. Stark, B. joe Breitkreutz, T. Reguly et al., “Biogrid: a general repository for interaction datasets.” Nucleic Acids Research, no. Database-Issue, pp. 535–539, 2006.
  • [50] J. Yu, S. Pacifico, G. Liu et al., “DroID: the Drosophila Interactions Database, a comprehensive resource for annotated gene and protein interactions,” BMC Genomics, vol. 9, no. 1, pp. 461+, Oct. 2008.
  • [51] K. G. Guruharsha, J. Rual, B. Zhai et al., “A Protein Complex Network of Drosophila melanogaster,” Cell, vol. 147, no. 3, pp. 690–703, Oct. 2011.
  • [52] R. Apweiler, T. K. Attwood, A. Bairoch et al., “The InterPro database, an integrated documentation resource for protein families, domains and functional sites,” Nucleic Acids Research, vol. 29, no. 1, pp. 37–40, Jan. 2001.
  • [53] T. M. Bahr, G. J. Hughes et al., “Peripheral Blood Mononuclear Cell Gene Expression in Chronic Obstructive Pulmonary Disease,” American Journal of Respiratory Cell and Molecular Biology, vol. 49, no. 2, pp. 316–323, 2013.
  • [54] B. V. Balgobind, M. M. Van den Heuvel-Eibrink et al., “Evaluation of gene expression signatures predictive of cytogenetic and molecular subtypes of pediatric acute myeloid leukemia,” Haematologica, vol. 96, no. 2, pp. 221–230, 2011.
  • [55] J. Bigler, H. A. Rand et al., “Cross-study homogeneity of psoriasis gene expression in skin across a large expression range,” PLoS ONE, vol. 8, no. 1, pp. 1–15, 01 2013.
  • [56] J. Botling, K. Edlund, M. Lohr, B. Hellwig, L. Holmberg, M. Lambe, A. Berglund, S. Ekman, M. Bergqvist, F. Pontén, A. König, O. Fernandes, M. Karlsson, G. Helenius, C. Karlsson, J. Rahnenführer, J. G. Hengstler, and P. Micke, “Biomarker discovery in non–small cell lung cancer: Integrating gene expression profiling, meta-analysis, and tissue microarray validation,” Clinical Cancer Research, vol. 19, no. 1, pp. 194–204, 2013. [Online]. Available: http://clincancerres.aacrjournals.org/content/19/1/194.abstract
  • [57] C. Clarke, S. F. Madden et al., “Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis,” Carcinogenesis, vol. 34, no. 10, pp. 2300–2308, 2013. [Online]. Available: http://carcin.oxfordjournals.org/content/34/10/2300.abstract
  • [58] L. Vallat, C. A. Kemper et al., “Reverse-engineering the genetic circuitry of a cancer cell with predicted intervention in chronic lymphocytic leukemia,” Proceedings of the National Academy of Sciences, vol. 110, no. 2, pp. 459–464, 2013. [Online]. Available: http://www.pnas.org/content/110/2/459.abstract
  • [59] E. G. Cerami, B. E. Gross et al., “Pathway commons, a web resource for biological pathway data,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D685–D690, 2011.
  • [60] G. Wu, X. Feng, and L. Stein, “A human functional protein interaction network and its application to cancer data analysis,” Genome Biology, vol. 11, no. 5, pp. 1–23, 2010.
  • [61] S. Razick, G. Magklaras, and I. M. Donaldson, “irefindex: A consolidated protein interaction database with provenance,” BMC Bioinformatics, vol. 9, no. 1, pp. 1–19, 2008. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-9-405
  • [62] R. D. Finn et al., “The pfam protein families database: towards a more sustainable future,” Nucleic Acids Research, vol. 44, no. D1, pp. D279–D285, 2016. [Online]. Available: http://nar.oxfordjournals.org/content/44/D1/D279.abstract
  • [63] A. I. Graham, G. Sanguinetti, N. Bramall, C. W. McLeod, and R. K. Poole, “Dynamics of a starvation-to-surfeit shift: a transcriptomic and modelling analysis of the bacterial response to zinc reveals transient behaviour of the fur and soxs regulators,” Microbiology, vol. 158, no. 1, pp. 284–292, 2012.
  • [64] J. L. Robbins-Manke, Z. Z. Zdraveski, M. Marinus, and J. M. Essigmann, “Analysis of global gene expression and double-strand-break formation in DNA adenine methyltransferase- and mismatch repair-deficient Escherichia coli,” Journal of Bacteriology, vol. 187, no. 20, pp. 7027–7037, October 2005. [Online]. Available: http://europepmc.org/articles/PMC1251628
  • [65] M. Babu et al., “Genetic interaction maps in escherichia coli reveal functional crosstalk among cell envelope biogenesis pathways,” PLoS Genet, vol. 7, no. 11, pp. 1–15, 11 2011.
  • [66] G. Butland et al., “esga: E. coli synthetic genetic array analysis,” Nat Meth, vol. 5, no. 3, pp. 789–795, jan 2008.
  • [67] P. Hu, S. C. Janga et al., “Global Functional Atlas of Escherichia coli Encompassing Previously Uncharacterized Proteins,” PLoS Biol, vol. 7, no. 4, p. e1000096, Apr. 2009. [Online]. Available: http://dx.doi.org/10.1371/journal.pbio.1000096
  • [68] T. Saito and M. Rehmsmeier, “The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets,” PLoS ONE, vol. 10, no. 3, pp. 1–21, 03 2015.
  • [69] L. Lovász, “Random walks on graphs: A survey,” in Combinatorics, Paul Erdős is Eighty, D. Miklós, V. T. Sós, and T. Szőnyi, Eds.   Budapest: János Bolyai Mathematical Society, 1996, vol. 2, pp. 353–398.
  • [70] L. Lan, N. Djuric, Y. Guo, and V. S., “MS-kNN: protein function prediction by integrating multiple data sources,” BMC Bioinformatics, vol. 14, no. Suppl 3:S8, 2013.