1 Introduction
The constant increase in the volume and variety of publicly available genomic and proteomic data is a characteristic trait of modern biomedical sciences. A fundamental problem in this area is the assignment of functions to biological macromolecules, especially proteins. Accurate annotation of protein function would also have great biomedical and pharmaceutical implications, since several human diseases have genetic causes. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted scope have led to an increasing role for automated function prediction (AFP). AFP is characterized by unbalanced functional classes with rare positive instances. Moreover, since usually only positive membership to functional classes is assessed, negative instances are not uniquely defined, and different approaches to choose them have been proposed [1, 2, 3]. Other peculiarities of AFP include: (1) the need to integrate several heterogeneous sources of genomic, proteomic, and transcriptomic data in order to achieve more accurate predictions [4, 5]; (2) the presence of multiple labels and dependencies among class labels; (3) the hierarchical structure of functional classes (a directed acyclic graph for the Gene Ontology GO [6], a forest of trees for the FunCat taxonomy [7]) with different levels of specificity.
Recently, two international challenges for the Critical Assessment of Functional Annotation (CAFA [8] and CAFA2 [9]) were organized to evaluate computational methods that automatically assign protein functions. In particular, CAFA2 emphasized the need for multi-label or structured-output learning algorithms that predict a set of terms, or a subgraph of the GO ontology, for a given protein. In this work we mainly focus on this problem, whose solution however also requires attention to the other aspects of AFP.
Several approaches to the prediction of protein functions have been proposed in the literature, including sequence-based [10, 11, 12] and network-based methods [13, 14, 15], structured-output algorithms based on kernels [16, 17, 3], and hierarchical ensemble methods [18, 19, 20]. In particular, the availability of large-scale networks, in which nodes are genes/proteins and edges their functional pairwise relationships, has promoted the development of several machine learning methods where novel annotations are inferred by exploiting the topology of the resulting biomolecular network. Initially, network-based approaches relied on the so-called guilt-by-association (GBA) rule, which makes predictions assuming that interacting proteins are likely to share similar functions [21, 22, 23]. Indirect neighbours were also exploited to modify the notion of pairwise similarity among nodes by accounting for pairs of nodes connected through intermediate ones [24, 25]. Protein functions can also be propagated through the network with an iterative process until convergence [26, 27], by tuning the amount of propagation allowed in the graph through Markov random walks [28, 29], by evaluating the functional flow through the nodes [30], by exploiting kernelized score functions [31], and by modelling protein memberships through Markov random fields [32] and Gaussian random fields [33, 34]. Furthermore, methods based on the convergence of classical [35, 36] and multi-category Hopfield networks [37] were recently proposed to specifically tackle the class imbalance.

Although protein functions are clearly dependent (see, e.g., the GO functions, where parent terms include all the proteins of their children), most AFP methods described above predict biological functions independently from each other. Multi-task methods, on the other hand, take advantage of existing dependencies by transferring information between related tasks, which typically leads to faster learning than algorithms trained independently on each task.
In this paper we investigate an alternative approach to multi-task learning based on exploiting task dissimilarities rather than similarities. In particular, we consider two multi-task extensions of a known label propagation algorithm [26]: the first extension follows a standard multi-task approach based on task similarities; the second learns instead from task dissimilarities. Both approaches can be naturally applied to the multi-label prediction of protein functions. The prediction tasks we consider are the GO protein functions of the fly, human, and bacteria model organisms. We compute different measures of similarity/dissimilarity between GO terms, taking into account both the GO structure and the protein annotations. We show that the approach learning from task dissimilarities greatly helps in unbalanced tasks (by helping instances labeled with the rare class to be correctly classified), and does not hurt in the more balanced cases. This is a crucial point in protein function prediction, since the terms that best describe protein functions, i.e., the most specific ones, are the most unbalanced (proteins annotated with these terms are very rare). On the other hand, learning from similar tasks tends to be more effective in balanced settings. Note that the proposed multi-task extensions of label propagation do not increase the overall running time of the algorithm, allowing its application to large-sized datasets. Finally, we compare our methods with state-of-the-art methodologies for AFP by considering both "term-centric" and "protein-centric" evaluation settings.
2 Automated Protein Function Prediction
The Automated protein Function Prediction (AFP) problem can be formalized as a semi-supervised learning problem on a weighted and undirected graph $G = (V, E, W)$, where $V = \{1, \dots, n\}$ is the set of vertices, $E$ is the set of edges, and $W = (w_{i,j})$ is the $n \times n$ symmetric weight matrix, where $w_{i,j}$ is the weight on the edge between vertices $i$ and $j$ (we assume $w_{i,j} \ge 0$ and $w_{i,i} = 0$ for all $i, j \in V$). A set of $m$ binary classification tasks on $G$ is defined by $m$ labelings $\boldsymbol{y}^1, \dots, \boldsymbol{y}^m \in \{-1,+1\}^n$ of the nodes in $V$, where $y_i^t$ is the label of node $i$ for task $t$. For any subset $S \subseteq V$ and any vector $\boldsymbol{f} \in \mathbb{R}^n$, we use $\boldsymbol{f}_S$ to denote the vector obtained from $\boldsymbol{f}$ by retaining only the coordinates in $S$.

The multi-task prediction problem on the graph $G$ is then defined as follows. Given a set $L \subseteq V$ of training vertices and the complement set $U = V \setminus L$ of test vertices, the learner must predict the test labels $\boldsymbol{y}_U^t$ for each task $t$ given the training labels $\boldsymbol{y}_L^t$ for the same tasks.
3 Methods
We first describe the standard label propagation algorithm [26, 38, 39] for single-task classification on graphs. This will later be extended to the multi-task setting.
3.1 Label Propagation (LP)
In the single-task setting, a standard notion of regularity of a labeling $\boldsymbol{y} \in \{-1,+1\}^n$ on a graph $G$ is the weighted cut-size $\Phi_G(\boldsymbol{y})$ induced by $\boldsymbol{y}$ and defined as follows:

$$\Phi_G(\boldsymbol{y}) = \sum_{(i,j) \in E \,:\, y_i \neq y_j} w_{i,j} \qquad (1)$$

The weighted cut-size can also be expressed as a quadratic form,

$$\Phi_G(\boldsymbol{y}) = \tfrac{1}{4}\, \boldsymbol{y}^\top L\, \boldsymbol{y}$$

The matrix $L = D - W$ is the Laplacian of $G$, where $D$ is the diagonal matrix with entries $d_i = \sum_j w_{i,j}$. The Label Propagation algorithm minimizes the above quadratic form over real-valued (rather than binary) labels. More precisely, LP finds the unique solution of

$$\hat{\boldsymbol{f}} = \operatorname*{argmin}_{\boldsymbol{f} \in \mathbb{R}^n \,:\, \boldsymbol{f}_L = \boldsymbol{y}_L} \boldsymbol{f}^\top L\, \boldsymbol{f} \qquad (2)$$

The solution of (2) is smooth on $G$. Namely, if two vertices $i$ and $j$ are connected with a large weight $w_{i,j}$, then $\hat{f}_i$ is close to $\hat{f}_j$. Indeed, the components of $\hat{\boldsymbol{f}}$ satisfy the harmonic property [26]

$$\hat{f}_i = \frac{1}{d_i} \sum_j w_{i,j}\, \hat{f}_j \qquad \text{for all } i \in U$$

The vector $\hat{\boldsymbol{f}}_U$ can also be written in closed form as

$$\hat{\boldsymbol{f}}_U = (D_{UU} - W_{UU})^{-1}\, W_{UL}\, \boldsymbol{y}_L \qquad (3)$$

where

$$W = \begin{pmatrix} W_{LL} & W_{LU} \\ W_{UL} & W_{UU} \end{pmatrix}$$

is the weight matrix partitioned in blocks to emphasize the labeled and unlabeled parts of the graph (similarly for the matrix $D$). As the components of $\hat{\boldsymbol{f}}_U$ given by (3) are not in $\{-1,+1\}$, the final labeling produced by LP is obtained by thresholding each component $\hat{f}_i$ for $i \in U$.
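For concreteness, the closed form (3) amounts to solving one small linear system. The sketch below is our own minimal implementation, not the authors' code; the function name and the boolean-mask interface are our choices.

```python
import numpy as np

def label_propagation(W, y, labeled):
    """Label propagation via the closed form (3): the test-vertex scores
    solve (D_UU - W_UU) f_U = W_UL y_L, i.e. the harmonic property.
    W: (n, n) symmetric non-negative weight matrix; y: labels in {-1, +1}
    (entries outside `labeled` are ignored); labeled: boolean training mask."""
    U = ~labeled
    D = np.diag(W.sum(axis=1))
    lap = D - W                                # graph Laplacian L = D - W
    f = y.astype(float).copy()
    A = lap[np.ix_(U, U)]                      # = D_UU - W_UU
    b = W[np.ix_(U, labeled)] @ y[labeled]     # = W_UL y_L
    f[U] = np.linalg.solve(A, b)
    return f                                   # threshold sign(f) on U for labels
```

On a three-vertex path with unit weights and end labels +1 and -1, the middle vertex gets score 0, exactly as the harmonic property prescribes.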
3.2 Multi-task label propagation (MTLP)
It is fairly easy to use similarity or dissimilarity information between tasks in order to generalize LP to multi-task learning, while preserving the regularity of every task in the sense of (1).
We start by considering multi-task LP based on similarity information. Suppose a symmetric matrix $S \in \mathbb{R}^{m \times m}$ is given, where each entry $s_{t,j} \ge 0$ quantifies the relatedness between tasks $t$ and $j$. Let $C_S = \varepsilon I + L_S$, where $\varepsilon > 0$, $I$ is the $m \times m$ identity matrix, and $L_S$ is the Laplacian of $S$. The matrix $C_S$ is symmetric and positive definite, since it is diagonally dominant with positive diagonal, and thus invertible. Denote by $Y$ the $n \times m$ label matrix whose $t$-th column is the vector $\boldsymbol{y}^t$, and by $F$ the $n \times m$ matrix whose $t$-th column is the vector $\boldsymbol{f}^t$.

When learning multiple related tasks, a widely used approach is to require that similar tasks be assigned similar labelings. To this end, we introduce the linear map $M_S : \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$, defined as follows:

$$M_S(Y) = Y\, C_S^{-1} \qquad (4)$$

It can be shown that the map $M_S$ acts on a multi-task labeling matrix $Y$ by bringing closer (in Euclidean distance) the labelings (columns of $Y$) corresponding to tasks that are similar according to $S$.

By means of $M_S$, the exploitation of task similarities can be encoded into the learning problem (2) as follows:

$$\hat{F} = \operatorname*{argmin}_{F \in \mathbb{R}^{n \times m} \,:\, F_L = \widetilde{Y}_L} \ \sum_{t=1}^{m} (\boldsymbol{f}^t)^\top L\, \boldsymbol{f}^t \qquad (5)$$

where $\widetilde{Y} = M_S(Y)$. The solution to (5) is

$$\hat{F}_U = (D_{UU} - W_{UU})^{-1}\, W_{UL}\, \widetilde{Y}_L$$

where $\hat{F}_U$ is the submatrix of $\hat{F}$ including only the rows indexed by $U$, and $\widetilde{Y}_L$ is the submatrix of $\widetilde{Y}$ including only the rows indexed by $L$. By observing that $\widetilde{Y}_L = Y_L C_S^{-1}$, we have

$$\hat{F} = M_S(\bar{F})$$

where $\bar{F}$ is the solution of (5) with constraints $f_i^t = y_i^t$ for $i \in L$ and $t = 1, \dots, m$. The equality shows that it is equivalent to apply the task feature map (4) before or after performing label propagation. This ensures that the multi-task mapping does not increase the label propagation complexity.
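As a sketch (under our reconstruction of the map, with names of our own choosing), the similarity-based transformation right-multiplies the label matrix by the inverse of ε I plus the Laplacian of the task-similarity matrix, pulling the columns of similar tasks towards each other:

```python
import numpy as np

def mtlp_inv_map(Y, S, eps=1.0):
    """Similarity-based multi-task map (4): Y -> Y (eps*I + L_S)^{-1}.
    Y: (n, m) label matrix, one column per task; S: (m, m) symmetric
    non-negative task-similarity matrix; eps > 0 makes eps*I + L_S
    strictly diagonally dominant, hence positive definite and invertible."""
    L_S = np.diag(S.sum(axis=1)) - S           # Laplacian of S
    C = eps * np.eye(S.shape[0]) + L_S
    return Y @ np.linalg.inv(C)
```

Because the map is a right multiplication and the LP solution is linear in the training labels, applying it before or after label propagation yields the same result, which is the equivalence noted above.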
As we show next, this solution does not perform well on unbalanced classification problems, where some class (typically the positive class) is largely underrepresented. We propose here an alternative approach, which exploits the prior information about task relatedness in an "inverse" manner. Specifically, we propose a multi-task label propagation algorithm which learns multiple tasks by requiring that dissimilar tasks be assigned dissimilar labelings. As we see in the experiments, this approach turns out to work particularly well on unbalanced classification problems.
The first component of this method is a dissimilarity matrix $\Delta \in \mathbb{R}^{m \times m}$, where $\delta_{t,j} \ge 0$ is a measure of dissimilarity between tasks $t$ and $j$ (we discuss in Section 3.2.2 possible choices for the matrices $S$ and $\Delta$).

Given the matrix $\Delta$, we consider the multi-task map $M_\Delta : \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$, defined as

$$M_\Delta(Y) = Y\, C_\Delta \qquad (6)$$

where $C_\Delta = \varepsilon I + L_\Delta$, $\varepsilon > 0$, and $L_\Delta$ is the Laplacian matrix of $\Delta$. Unlike the inverse transformation (4), the map $M_\Delta$ moves the columns of the matrix $Y$ farther away from each other in the corresponding $n$-dimensional space (in the sense of the Euclidean distance). We formally show this in Section 3.2.1. Using $M_\Delta$ instead of $M_S$ in (5), we obtain the following optimization problem:

$$\hat{F} = \operatorname*{argmin}_{F \in \mathbb{R}^{n \times m} \,:\, F_L = \widetilde{Y}_L} \ \sum_{t=1}^{m} (\boldsymbol{f}^t)^\top L\, \boldsymbol{f}^t \qquad (7)$$

with $\widetilde{Y} = M_\Delta(Y)$. Similarly to (5), the solution of (7) is

$$\hat{F} = M_\Delta(\bar{F})$$

where $\bar{F}$ is the solution of (7) with constraints $f_i^t = y_i^t$ for $i \in L$ and $t = 1, \dots, m$. Just like in the previous case, the equality shows that it is equivalent to apply the task feature map (6) before or after performing label propagation.
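Under the same reconstruction, the dissimilarity-based map (6) is a single matrix product, so it adds essentially nothing to the cost of label propagation. The function name is ours; Fact 1 below characterizes its exact effect on each entry.

```python
import numpy as np

def mtlp_map(Y, Delta, eps=1.0):
    """Dissimilarity-based multi-task map (6): Y -> Y (eps*I + L_Delta).
    Each transformed label keeps its sign; its magnitude grows with the
    total dissimilarity to the tasks labeling the same instance oppositely."""
    L_D = np.diag(Delta.sum(axis=1)) - Delta   # Laplacian of Delta
    return Y @ (eps * np.eye(Delta.shape[0]) + L_D)
```

For one instance labeled (+1, -1, -1) under dissimilarities delta_12 = 2, delta_13 = 1, delta_23 = 0 and eps = 1, the transformed labels are (7, -5, -3): every sign is preserved and the rare positive receives the largest magnitude.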
We call MTLPinv the similarity-based method (5) and MTLP the dissimilarity-based method (7). In the next section we show some interesting properties of the map $M_\Delta$ which make MTLP suitable for unbalanced classification problems.
3.2.1 Analysis of the multi-task map
Given $Y \in \{-1,+1\}^{n \times m}$, let $Y_{i \cdot}$ and $\boldsymbol{y}^t$ be, respectively, the $i$-th row and the $t$-th column of the matrix $Y$. Let also $P_i = \{t : y_i^t = +1\}$ be the set of tasks for which the instance $i$ is positive, and $N_i = \{t : y_i^t = -1\}$ the set of tasks for which the instance $i$ is negative. We introduce the following notation: for each $i$ and $t$,

$$\delta_t^+(i) = \sum_{j \in P_i} \delta_{t,j} \qquad \text{and} \qquad \delta_t^-(i) = \sum_{j \in N_i} \delta_{t,j}$$

The next result shows that the action of the linear map $M_\Delta$ on the label matrix $Y$ is to change the value of each label without altering the sign. The label of an instance $i$ in task $t$ is made roughly proportional to the weighted sum of the tasks in $\Delta$ that are dissimilar to $t$ and have a different label for instance $i$ (see also Corollary 1).

Fact 1

Given $Y \in \{-1,+1\}^{n \times m}$, the task interaction matrix $\Delta$, and the map $M_\Delta$ such that $\hat{Y} = M_\Delta(Y) = Y(\varepsilon I + L_\Delta)$, where $\varepsilon > 0$, then for all $i$ and $t$ it holds

$$\hat{y}_i^t = \begin{cases} \varepsilon + 2\,\delta_t^-(i) & \text{if } y_i^t = +1 \\ -\big(\varepsilon + 2\,\delta_t^+(i)\big) & \text{if } y_i^t = -1 \end{cases}$$

In particular, $\hat{y}_i^t$ has the same sign as $y_i^t$.

Proof:

By definition, $\hat{y}_i^t = \varepsilon\, y_i^t + (Y L_\Delta)_{i,t}$, and by the definition of the Laplacian $(Y L_\Delta)_{i,t} = \sum_j \delta_{t,j}\,(y_i^t - y_i^j)$. We distinguish the following two cases.

Case 1. $y_i^t = +1$. In this case $y_i^t - y_i^j = 0$ for any $j \in P_i$, whereas $y_i^t - y_i^j = 2$ for any $j \in N_i$. Accordingly, $\hat{y}_i^t = \varepsilon + 2\,\delta_t^-(i)$.

Case 2. $y_i^t = -1$. In this case $y_i^t - y_i^j = 0$ for any $j \in N_i$, whereas $y_i^t - y_i^j = -2$ for any $j \in P_i$. It follows that $\hat{y}_i^t = -\big(\varepsilon + 2\,\delta_t^+(i)\big)$.

The sign preservation is proven by observing that $\delta_{t,j} \ge 0$ implies $\varepsilon + 2\,\delta_t^-(i) > 0$ and $-\big(\varepsilon + 2\,\delta_t^+(i)\big) < 0$.
Using Fact 1 we can show that the map $M_\Delta$ tends to increase the distance between the labelings $\boldsymbol{y}^t$ and $\boldsymbol{y}^s$, for any pair of distinct tasks $t, s$. Indeed, we can prove the following.

Fact 2

Given $Y \in \{-1,+1\}^{n \times m}$, the task interaction matrix $\Delta$, and the map $M_\Delta$ such that $\hat{Y} = M_\Delta(Y) = Y(\varepsilon I + L_\Delta)$, where $\varepsilon \ge 1$. Then for every pair of distinct tasks $t, s$ it holds

$$\|\hat{\boldsymbol{y}}^t - \hat{\boldsymbol{y}}^s\| \ \ge\ \|\boldsymbol{y}^t - \boldsymbol{y}^s\|$$

where $\|\cdot\|$ is the Euclidean norm.

Proof:

We prove this property by showing that $|\hat{y}_i^t - \hat{y}_i^s| \ge |y_i^t - y_i^s|$ for all $i$. We distinguish the following four cases:

$y_i^t = y_i^s = +1$. In this case $|y_i^t - y_i^s| = 0$, and trivially $|\hat{y}_i^t - \hat{y}_i^s| \ge 0$.

$y_i^t = y_i^s = -1$. Even in this case $|y_i^t - y_i^s| = 0$, whereas $|\hat{y}_i^t - \hat{y}_i^s| \ge 0$.

$y_i^t = +1$, $y_i^s = -1$. In this case $|y_i^t - y_i^s| = 2$, and by Fact 1, $\hat{y}_i^t - \hat{y}_i^s = 2\varepsilon + 2\,\delta_t^-(i) + 2\,\delta_s^+(i)$. Since both $\delta_t^-(i) \ge 0$ and $\delta_s^+(i) \ge 0$, it follows that $|\hat{y}_i^t - \hat{y}_i^s| \ge 2\varepsilon \ge 2$.

$y_i^t = -1$, $y_i^s = +1$. Again $|y_i^t - y_i^s| = 2$, and $\hat{y}_i^s - \hat{y}_i^t = 2\varepsilon + 2\,\delta_s^-(i) + 2\,\delta_t^+(i)$. As $\delta_s^-(i) \ge 0$ and $\delta_t^+(i) \ge 0$ we have, as in the previous case, $|\hat{y}_i^t - \hat{y}_i^s| \ge 2\varepsilon \ge 2$.
The map $M_\Delta$ not only increases the distance between the labelings of two distinct tasks (as we just showed), but it also increases the distance between the task-indexed label vectors of two distinct instances. Indeed, since $L_\Delta$ is positive semidefinite, it is easy to show that when $\varepsilon \ge 1$ the transformation increases the distance between the rows $Y_{i \cdot}$ and $Y_{j \cdot}$, for any pair of distinct instances $i, j$.

We now focus our discussion on another important feature of the algorithm, which makes our multi-task label propagation appropriate for tasks with very unbalanced labelings, specifically when most entries of each column in the label matrix $Y$ are $-1$. In this case, the rows of $Y$ also contain mostly negative entries. Accordingly, by Fact 1, we can compensate the preponderance of negatives by applying the map $M_\Delta$. We show this with an example.

Consider the task interaction matrix $\Delta$ such that $\delta_{t,j} = 1$ for all $t \neq j$. That is, all tasks are strongly dissimilar to each other. Then

$$\varepsilon I + L_\Delta = (\varepsilon + m)\, I - \mathbf{1}\mathbf{1}^\top \qquad (8)$$
By Fact 1, it is straightforward to prove the following.

Corollary 1

Fix $Y \in \{-1,+1\}^{n \times m}$ and the map $M_\Delta$ such that $\hat{Y} = Y(\varepsilon I + L_\Delta)$, where $\varepsilon I + L_\Delta$ is defined as in (8). Then, for all $i$ and $t$ it holds that

$$\hat{y}_i^t = \begin{cases} \varepsilon + 2\,|N_i| & \text{if } y_i^t = +1 \\ -\big(\varepsilon + 2\,|P_i|\big) & \text{if } y_i^t = -1 \end{cases}$$

Corollary 1 shows that, when $|N_i| > |P_i|$ (that is, the multi-task labeling for vertex $i$ is unbalanced towards negatives), the map assigns to positives ($y_i^t = +1$) an absolute value higher than that assigned to negatives ($y_i^t = -1$). An analogous behaviour characterizes our method when a generic matrix $\Delta$ is considered, as stated in Fact 1. This simple property allows the rare positive labels to propagate in the graph. This is unlike the standard LP algorithm, where positive vertices are easily overwhelmed by the negative vertices during the label propagation process. The toy example in Figure 1 shows that applying the map $M_\Delta$, with $\varepsilon I + L_\Delta$ defined as in (8), improves the final classification of vertices. These observations are empirically confirmed in Section 4.
3.2.2 Task similarities
While MTLP and MTLPinv are designed to work with any task matrix, similarity and dissimilarity measures are typically tailored to specific domains. Different tasks may share different types of similarities, or may be organized in a hierarchy with a specific structure, such as a tree or a directed acyclic graph, where the positive instances of the children tasks are subsets of the positive instances of their parent tasks. In the case of a hierarchy, different approaches for computing the task matrix are possible: considering only the structure of the hierarchy [40, 41], or combining the hierarchical information with the information content of the tasks [42].
In this work we consider two dissimilarity measures ($\Delta_H$ and $\Delta_{IC}$) and three similarity measures ($S_J$, $S_L$, and $S_{IC}$). The similarity measures $S_J$ and $S_L$ were introduced by Jiang [43] and Lin [44], respectively. Both measures are derived from the hierarchical dissimilarity measure $\Delta_H$, whose definition requires a hierarchy over the tasks. The dissimilarity $\Delta_{IC}$ is computed directly from the similarity $S_{IC}$, which does not require any hierarchical information.

When tasks are organized in a hierarchy, we denote by $\mathrm{anc}(t)$ the set of ancestor tasks of task $t$ in the hierarchy. Moreover, we use $p_t$ to denote the frequency of positive instances for task $t$. Since a positive instance for a task $t$ is also positive for any $s \in \mathrm{anc}(t)$, it holds that $p_s \ge p_t$. Finally, we denote by $a(t,s)$ the common ancestor of tasks $t$ and $s$ whose frequency is the lowest among all common ancestors of $t$ and $s$.

Let $IC(t) = -\log p_t$ be the information content of task $t$. We start by recalling the hierarchical dissimilarity measure introduced in [44],

$$\Delta_H(t,s) = IC(t) + IC(s) - 2\, IC\big(a(t,s)\big)$$

This is the sum of the information content of $t$ and $s$ minus twice the information content of their closest common ancestor $a(t,s)$. Note that $\Delta_H(t,s)$ is always nonnegative, as $p_{a(t,s)} \ge \max\{p_t, p_s\}$. The two hierarchical similarity measures associated with $\Delta_H$ are defined as follows.

Jiang similarity measure:

$$S_J(t,s) = \frac{1}{1 + \Delta_H(t,s)}$$

Lin similarity measure:

$$S_L(t,s) = \frac{2\, IC\big(a(t,s)\big)}{IC(t) + IC(s)}$$

Our third similarity measure does not rely on a hierarchy of tasks. Let $A_t$ be the set of instances that are positive for the task $t$.

Information content measure:

$$S_{IC}(t,s) = \frac{|A_t \cap A_s|}{|A_t \cup A_s|}$$

This is the ratio between the number of examples that are positive for both tasks and the number of examples that are positive for at least one task. The higher the number of shared positive examples, the higher the similarity (up to $1$). When two tasks do not share any positive example, their similarity is zero. In a hierarchy of tasks, tasks with many positive examples are usually closer to the root (less specific). In this case the denominator of $S_{IC}$ tends to reduce the similarity between the two tasks, as opposed to the case in which the tasks have a small number of positive annotations. Indeed, sharing annotations between two specific tasks (closer to the leaves) is more informative than sharing annotations between two more general tasks (closer to the root).

In the experiments, we compare learning with the similarities $S_J$ and $S_L$ against learning with the dissimilarity $\Delta_H$. We also compare learning with $S_{IC}$ against $\Delta_{IC}$, where we set $\Delta_{IC} = 1 - S_{IC}$ (where necessary, values are normalized so that all matrix entries lie in the range $[0,1]$).
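The measures above can be sketched as follows. This assumes the standard forms of the Lin and Jiang-Conrath quantities and a Jaccard-style overlap for the hierarchy-free measure, since the exact normalizations are not spelled out here; all function names are ours.

```python
import math

def info_content(p):
    """IC(t) = -log p(t), with p(t) the frequency of positives for task t."""
    return -math.log(p)

def jiang_dissimilarity(p_t, p_s, p_anc):
    """Jiang-Conrath style distance: IC(t) + IC(s) - 2*IC(a(t, s))."""
    return info_content(p_t) + info_content(p_s) - 2 * info_content(p_anc)

def lin_similarity(p_t, p_s, p_anc):
    """Lin similarity: 2*IC(a(t, s)) / (IC(t) + IC(s))."""
    return 2 * info_content(p_anc) / (info_content(p_t) + info_content(p_s))

def overlap_similarity(pos_t, pos_s):
    """Hierarchy-free measure: shared positives over positives of either task."""
    pos_t, pos_s = set(pos_t), set(pos_s)
    union = pos_t | pos_s
    return len(pos_t & pos_s) / len(union) if union else 0.0
```

As a sanity check, two rare terms sharing a rare ancestor come out more similar under Lin's measure than the same terms under a very frequent (generic) ancestor, matching the intuition in the text.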
4 Results and Discussion
In this section we evaluate our multi-task algorithms on the prediction of the biomolecular functions of proteins belonging to three model organisms. We start by describing the experimental setting. Then we compare the performance of our algorithms against that of state-of-the-art methods.
4.1 Experimental setting
4.1.1 Data
We considered three different experiments to predict the protein functions of three model organisms: Drosophila melanogaster (fly), Homo sapiens (human), and Escherichia coli (bacteria). Gene networks for the model organisms were downloaded from the GeneMANIA website (www.genemania.org) and selected in order to cover different types of data, including coexpression, genetic interactions, shared domains, and physical interactions. The selected networks are described in Tables I, II, and III.
Table I. Fly networks.
Type  Source  Nodes
Coexpression  Baradaran-Heravi et al. [45]  8857
Coexpression  Busser et al. [46]  8857
Coexpression  Colombani et al. [47]  8857
Coexpression  Lundberg et al. [48]  8857
Genetic interactions  BioGRID [49]  929
Genetic interactions  Yu et al. [50]  1414
Physical interactions  Guruharsha et al. A [51]  1866
Physical interactions  Guruharsha et al. B [51]  3833
Physical interactions  BioGRID [49]  558
Shared protein domains  InterPro [52]  5627
Table II. Human networks.
Type  Source  Nodes
Coexpression  Bahr et al. [53]  7611
Coexpression  Balgobind et al. [54]  17522
Coexpression  Bigler et al. [55]  17522
Coexpression  Botling et al. [56]  17522
Coexpression  Clarke et al. [57]  17458
Coexpression  Vallat et al. [58]  17521
Common biological pathways  PATHWAYCOMMONS [59]  2133
Common biological pathways  Wu et al. [60]  5319
Physical interactions  BioGRID [49]  15800
Physical interactions  iRef-GRID [61]  9403
Physical interactions  iRef-HPRD [61]  9403
Physical interactions  iRef-OPHID [61]  9403
Physical interactions  iRef small-scale studies [61]  9036
Shared protein domains  InterPro [52]  15800
Shared protein domains  Pfam [62]  15251
Table III. Bacteria networks.
Type  Source  Nodes
Coexpression  Graham et al. [63]  3959
Coexpression  Robbins-Manke et al. [64]  3912
Genetic interactions  Babu et al. [65]  715
Genetic interactions  Butland et al. [66]  3497
Physical interactions  Hu et al. [67]  1537
Physical interactions  iRef-DIP [61]  633
Physical interactions  Y2H PPI  1063
Shared protein domains  InterPro [52]  3005
Shared protein domains  Pfam [62]  2726
For every organism, the networks were integrated through an unweighted sum over the union of the genes in the individual networks. No preprocessing was applied to the individual networks, whereas each network, denoted by the corresponding connection matrix $W$, was normalized as follows:

$$\bar{W} = D^{-1/2}\, W\, D^{-1/2}$$

where $D$ is the diagonal matrix with diagonal entries $d_i = \sum_j w_{i,j}$.
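The exact normalization formula is garbled in our copy of the text; a common choice consistent with the description (and with GeneMANIA-style pipelines) is the symmetric normalization sketched below, which we assume here.

```python
import numpy as np

def normalize(W):
    """Symmetric normalization: W_bar = D^{-1/2} W D^{-1/2},
    with d_i the sum of the weights incident to vertex i.
    Rows/columns of isolated vertices (d_i = 0) are left at zero."""
    d = W.sum(axis=1)
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    return W * np.outer(inv_sqrt, inv_sqrt)
```

After this step all entries lie in [0, 1] and strongly connected hubs no longer dominate the propagation, which is the usual motivation for normalizing before summing the networks.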
Protein functions were downloaded from the Gene Ontology. This ontology is structured as a directed acyclic graph with different levels of specificity and contains three branches: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). We considered the experimental annotations in the releases 07.03.16, 16.03.16, and 17.10.16 for the fly, human, and bacteria organisms, respectively. We performed a dedicated experiment for every branch.
To predict the most specific terms in the ontology (i.e., those best describing protein functions), while still considering terms with a minimum amount of prior information, we selected all the GO terms with a minimum number of positively annotated genes, obtaining ( BP, MF, CC), ( BP, MF, CC), and ( BP, MF, CC) terms for fly, human, and bacteria, respectively. We considered two groups of GO terms according to their specificity, defined by the number of annotated proteins, for a total of categories for every GO branch. In the end, we obtained a total of fly, human, and bacteria genes which have at least one positive GO annotation in the considered GO release. The obtained tasks are therefore severely unbalanced toward negatives.
4.1.2 Evaluation metrics
In order to evaluate the generalization performance of the compared methods, we applied a fold cross-validation experimental setting and adopted the Area Under the Precision-Recall Curve (AUPRC) as the "per-term" ranking measure. AUPRC is indeed more informative in unbalanced settings than the classical area under the ROC curve [68]. Furthermore, following the recent CAFA2 international challenge, we also considered a "protein-centric" evaluation to assess performance accuracy in predicting all the ontological terms associated with a given protein sequence [9]. In this scenario, the multi-label F-score is used as the performance measure. More precisely, if we denote by $TP_j(\tau)$, $FN_j(\tau)$, and $FP_j(\tau)$ respectively the number of true positives, false negatives, and false positives for the protein $j$ at threshold $\tau$, we can define the "per-protein" multi-label precision $P(\tau)$ and recall $R(\tau)$ at a given threshold $\tau$ as:

$$P(\tau) = \frac{1}{n} \sum_{j=1}^{n} \frac{TP_j(\tau)}{TP_j(\tau) + FP_j(\tau)} \qquad\qquad R(\tau) = \frac{1}{n} \sum_{j=1}^{n} \frac{TP_j(\tau)}{TP_j(\tau) + FN_j(\tau)}$$

where $n$ is the number of proteins. In other words, $P(\tau)$ (resp., $R(\tau)$) is the average multi-label precision (resp., recall) across proteins. The multi-label F-measure $F(\tau)$ depends on $P(\tau)$ and $R(\tau)$, and following the CAFA2 experimental setting, the maximum achievable F-score (Fmax) is adopted as the main multi-label "per-protein" metric:

$$F_{\max} = \max_{\tau} \ \frac{2\, P(\tau)\, R(\tau)}{P(\tau) + R(\tau)} \qquad (9)$$
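A simplified sketch of the protein-centric Fmax computation follows (our own code; note that the official CAFA2 definition additionally averages precision only over proteins with at least one predicted term at the given threshold, which we omit here for brevity):

```python
import numpy as np

def fmax(Y_true, scores, thresholds):
    """Protein-centric Fmax: per-protein precision and recall are averaged
    over proteins at each threshold, and the best harmonic mean is returned.
    Y_true: (n, m) 0/1 annotation matrix; scores: (n, m) prediction scores."""
    best = 0.0
    for tau in thresholds:
        pred = scores >= tau
        tp = (pred & (Y_true == 1)).sum(axis=1)
        fp = (pred & (Y_true == 0)).sum(axis=1)
        fn = (~pred & (Y_true == 1)).sum(axis=1)
        prec = np.mean(tp / np.maximum(tp + fp, 1))   # per-protein precision
        rec = np.mean(tp / np.maximum(tp + fn, 1))    # per-protein recall
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```

A perfect predictor attains Fmax = 1, while a trivial all-positive predictor attains recall 1 but low precision, which is why Fmax is informative in this multi-label setting.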
4.2 Results
4.2.1 Evaluating GO semantic similarities
This section investigates the impact of the task similarity/dissimilarity measures described in Section 3.2.2 on the performance of the proposed multi-task label propagation algorithms. Table IV shows the obtained results. In this experiment we set $\varepsilon$ to a fixed value (its choice is discussed in Section 4.2.5). When MTLPinv uses the hierarchical similarity measures and MTLP uses the corresponding dissimilarity, MTLP outperforms MTLPinv in both AUPRC and Fmax. Nevertheless, the GO term similarity $S_{IC}$ is much more informative for MTLPinv, which in this case achieves results competitive with MTLP (whose performance instead is nearly indistinguishable when using $\Delta_H$ or $\Delta_{IC}$), and in some cases even better. The difference in favor of MTLP seems to increase with the data imbalance: on the human data set, the most unbalanced, we observe the highest gap in favor of MTLP; whereas on the bacteria data set, the least unbalanced, the gap is reduced and, in some cases such as the MF terms, MTLPinv significantly outperforms MTLP in terms of average AUPRC. In terms of Fmax, however, MTLP is always the top method.
Overall, these results suggest that MTLP should be preferred when the proportion of positives is drastically smaller than that of negatives. When data are more balanced, MTLPinv better exploits the similarities among tasks and, at least in terms of AUPRC, is a valid option. In terms of multi-label accuracy, MTLP is always better than MTLPinv. Finally, it is worth noting that both methods outperform LP in terms of AUPRC (see Section 4.2.3 for the LP results), whereas in terms of Fmax only MTLP achieves better results than LP.
METHODS  BP  MF  CC
For each branch, the columns report the average AUPRC over all terms (All) and over the two term-specificity groups, followed by Fmax. The two MTLP rows and the three MTLPinv rows correspond to the two dissimilarity and the three similarity measures of Section 3.2.2.
FLY  
MTLP  0.140  0.133  0.153  0.247  0.333  0.322  0.355  0.411  0.262  0.265  0.253  0.354 
MTLP  0.140  0.133  0.153  0.246  0.333  0.322  0.355  0.410  0.262  0.265  0.253  0.357 
MTLPinv  0.020  0.013  0.031  0.183  0.198  0.179  0.238  0.374  0.150  0.138  0.181  0.306 
MTLPinv  0.020  0.014  0.031  0.170  0.192  0.172  0.235  0.351  0.101  0.082  0.147  0.259 
MTLPinv  0.135  0.129  0.146  0.244  0.328  0.318  0.352  0.381  0.261  0.265  0.251  0.333 
HUMAN  
MTLP  0.144  0.133  0.165  0.273  0.248  0.247  0.250  0.383  0.224  0.259  0.156  0.317 
MTLP  0.145  0.134  0.165  0.275  0.249  0.248  0.250  0.385  0.224  0.259  0.156  0.318 
MTLPinv  0.008  0.005  0.014  0.200  0.093  0.083  0.152  0.330  0.105  0.113  0.090  0.274 
MTLPinv  0.008  0.005  0.012  0.182  0.059  0.050  0.079  0.294  0.066  0.064  0.068  0.223 
MTLPinv  0.139  0.129  0.159  0.244  0.243  0.241  0.244  0.355  0.220  0.256  0.160  0.299 
BACTERIA  
MTLP  0.119  0.107  0.169  0.210  0.173  0.157  0.238  0.269  0.122  0.105  0.220  0.348 
MTLP  0.119  0.107  0.168  0.212  0.173  0.157  0.238  0.276  0.122  0.105  0.219  0.348 
MTLPinv  0.069  0.056  0.123  0.181  0.106  0.092  0.165  0.235  0.101  0.086  0.187  0.246 
MTLPinv  0.053  0.043  0.094  0.109  0.057  0.045  0.107  0.117  0.106  0.089  0.207  0.281 
MTLPinv  0.121  0.108  0.176  0.189  0.181  0.165  0.247  0.247  0.123  0.107  0.212  0.289 
In order to investigate why, unlike MTLPinv, the MTLP performance varies only slightly with the task dissimilarity measure, we ran MTLP on the fly organism and the CC tasks by randomly generating the matrix $\Delta$. We generated matrices with different levels of sparsity and with different ranges of weight values, uniformly selecting the weights within a fixed interval. In Figure 2, we show the heatmap of the average AUPRC obtained in each experiment.
As expected, the results are considerably worse than those obtained when considering the real dissimilarity matrices (see Table IV). There is a small AUPRC variation across the different random data, with higher AUPRC when the dissimilarity matrix is denser and has larger entries (the former seems to affect the results more than the latter). This is consistent with Fact 1, since the lower the weights and/or the sparser the matrix, the closer MTLP is to LP. Finally, on randomly generated dissimilarity matrices MTLP performs even worse than LP, as we can see from Figure 4.
4.2.2 Grouping GO terms for multi-task mapping
Following the approach proposed in [4], in addition to the strategy grouping GO terms by branch (i) adopted in the previous section, we examined an alternative way of grouping the terms to be considered in the multi-task maps (4) and (6) when running the MTLPinv and MTLP algorithms, respectively. Specifically, we grouped GO terms not just by GO branch (BP, MF, and CC), but also by taking into account the number of annotated proteins (ii), obtaining two annotation-size groups for each branch. The corresponding results on the fly data are reported in Figure 3. The AUPRC results show negligible differences between strategies (i) and (ii), for both MTLP and MTLPinv. Clearer is the difference in terms of Fmax, with opposite behaviour between MTLP and MTLPinv: MTLP has worse performance in all GO branches, whereas MTLPinv tends to perform better (see for instance the MF results). Indeed, the black lines (grouping strategy (ii)) in correspondence of Fmax always lie between the grey lines (grouping strategy (i)). However, the best results are still achieved by MTLP when grouping terms by GO branch, and accordingly we adopt this strategy in the rest of the paper.
4.2.3 Prediction of GO functions for fly, human, and bacteria organisms
MTLP was compared with state-of-the-art graph-based methodologies applied to the prediction of protein functions. We considered: LP, the label propagation algorithm described in Section 3.1; COSNetM [15], an extension of a node classifier designed for unbalanced settings [36]; RW, the classical random walk algorithm [69]; GBA, a method based on the guilt-by-association assumption [23]; and MSkNN, one of the best methods in the recent CAFA2 challenge, which applies the kNN algorithm to each network independently and then combines the obtained predictions [70].
In order to deal with label imbalance in LP, we applied a label normalization step before running label propagation. This step normalizes the labels of each GO term so that the positive and negative labels sum to zero. In our experiments, this variant of LP performs much better than the vanilla LP algorithm. For the RW algorithm we set a limit on the number of iterations, since higher values did not improve the performance while increasing the computational burden. Finally, we set the parameter $k$ for the kNN algorithm as a result of a tuning process on the training data.
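One plausible implementation of this normalization step (the precise rescaling is not given in the text, so this is an assumption of ours) divides each label by the size of its class, which makes the positives and negatives of each GO term sum to zero:

```python
import numpy as np

def balance_labels(y):
    """Rescale {-1, +1} labels so that positives and negatives sum to zero:
    each positive becomes 1/#positives, each negative -1/#negatives."""
    y = y.astype(float).copy()
    n_pos, n_neg = (y > 0).sum(), (y < 0).sum()
    if n_pos:
        y[y > 0] /= n_pos
    if n_neg:
        y[y < 0] /= n_neg
    return y
```

With this rescaling the few positive labels carry as much total mass as the many negative ones, so the propagated scores are not dominated by the negative class.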
In Figures 4 and 5 we show the obtained results in terms of AUPRC and Fmax, on the BP and CC terms respectively (on the MF terms the methods showed a similar behaviour). Interestingly, MTLP always achieves the highest AUPRC averaged over all tasks (All), with statistically significant improvements over the second-best method, except on the bacteria data and for the BP terms on the fly data. When comparing with the LP method, the improvement is always significant, except for CC (bacteria data). COSNetM is the second-best method on the human and fly data sets, while on bacteria LP (CC) and RW (BP) rank second. Furthermore, and more importantly, the MTLP improvements are more noticeable on the most unbalanced terms, which are those best characterizing the biological functions of genes. The GBA, MSkNN, and RW methods seem to suffer from the strongly unbalanced setting, and perform worse than LP, with the exception of RW on the bacteria data set. The good performance of COSNetM in this unbalanced setting is likely due to its cost-sensitive strategy, which requires learning two model parameters. This extra learning step increases its computation time. Indeed, COSNetM takes on average around seconds on a Linux machine with an Intel Xeon(R) CPU 3.60GHz and 32 GB RAM to perform an entire cross-validation cycle for one task on the fly data, whereas both LP and MTLP take on average slightly less than one second. This confirms our observation that applying the map $M_\Delta$ after label propagation does not increase the algorithm complexity, and only slightly increases the execution time for computing the mapped labels.
Even in terms of Fmax, MTLP obtains the best results, with LP the second-best method (except on BP, fly data). This shows that our method can achieve good predictive capabilities both when predicting single GO terms and when predicting a GO multi-label for single proteins. The compared methods, on the other hand, tend to be competitive in only one of the two scenarios. For instance, RW performs poorly in terms of Fmax, whereas, unlike for AUPRC, MSkNN achieves good Fmax results: on BP (fly data) it is the best method after MTLP. Even COSNetM, which is the second-best method in terms of AUPRC, only achieves the third or fourth best rank.
4.2.4 Evaluating different powers of the Laplacian matrix
A further experiment was carried out to analyze how the MTLP performance changes when using the map with the $k$-th power of the Laplacian term, for $k > 1$, instead of $k = 1$. We empirically tested different values of $k$ on the fly organism, fixing the parameter $\varepsilon$ and using a single dissimilarity measure. The results are shown in Figure 6. Except for the BP terms, where one larger power performs slightly better than $k = 1$, all choices of $k > 1$ lead to worse results. In particular, the performance strongly decays for large values of $k$.
4.2.5 Impact of the parameter ε
Large values of the parameter $\varepsilon$, introduced in Section 3.2, tend to reduce the multi-task contribution encoded in $L_\Delta$, since $\varepsilon I + L_\Delta$ is diagonally dominant and the absolute labels assigned to positive and negative vertices by the map tend to become almost the same (see Fact 1). Hence, $\varepsilon$ allows us to "regulate" to some extent the method between multi-task and single-task label propagation. We experimentally tuned $\varepsilon$ on the fly and human data over a grid of values. It turns out there is a negligible difference, with the results reported in Table IV. This is expected, since the number of tasks $m$ is much larger than $\varepsilon$ in the considered experiments. For this reason, we also performed another experiment in which we selected a smaller subset of terms in the BP branch (a similar trend is observed for the MF and CC branches). Specifically, we ran our algorithm on this subset of terms for the fly organism, varying $\varepsilon$ in the same range. The results are shown in Table V. Confirming our observations, MTLP is more sensitive to the value of $\varepsilon$ in this setting, and the overall trend is that the average AUPRC tends to decrease as $\varepsilon$ becomes larger. This is not surprising: as we explained, with large values of $\varepsilon$ MTLP behaves more like LP, whose results are lower in this setting.
Table V: Average AUPRC on the selected subset of BP terms (fly organism) for increasing values of the parameter.

       All
0.25   0.158   0.150   0.182
0.5    0.157   0.151   0.177
0.75   0.145   0.134   0.178
1      0.144   0.133   0.178
1.25   0.140   0.130   0.175
1.5    0.139   0.129   0.174
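The regulating effect just described can be illustrated with a small sketch. Purely for illustration, assume the multitask coupling matrix is the task-graph Laplacian plus the parameter times the identity (the exact matrix used by MTLP is the one defined in Section 3.2 and may differ, and the name `eta` is hypothetical); as the parameter grows, the matrix becomes increasingly diagonally dominant, so the relative weight of the cross-task entries, and hence the multitask contribution, fades:

```python
import numpy as np

# Toy 3-task Laplacian: task 0 is linked to tasks 1 and 2 (illustrative).
L = np.array([[ 2., -1., -1.],
              [-1.,  1.,  0.],
              [-1.,  0.,  1.]])

def coupling(L, eta):
    """Hypothetical multitask matrix: Laplacian plus a diagonal boost eta."""
    return L + eta * np.eye(L.shape[0])

def off_diag_ratio(M):
    """Relative mass of the cross-task (off-diagonal) entries."""
    off = np.abs(M).sum() - np.abs(np.diag(M)).sum()
    return off / np.abs(np.diag(M)).sum()

# Larger eta -> stronger diagonal dominance -> weaker multitask coupling,
# i.e. the method degenerates towards single-task label propagation.
ratios = [off_diag_ratio(coupling(L, eta)) for eta in (0.25, 1.0, 5.0, 50.0)]
print([round(r, 3) for r in ratios])
```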
5 Conclusions
We have shown that task-relatedness information represented through task dissimilarity is better suited for label propagation in unbalanced protein function prediction than task similarity. The proposed multitask label propagation algorithm compared favourably with state-of-the-art methodologies for protein function prediction on three model organisms. Although we gained some intuition and collected empirical evidence, we are still investigating the multitask problems on which our approach is most effective. Specifically, it would be useful to study whether dissimilarity information also helps when coupled with multitask algorithms other than label propagation, such as linear learners like the SVM or the Perceptron. Laplacian spectral theory is also likely to help shed further light on the properties of our method.
Acknowledgments
The authors would like to thank the reviewers of BIOKDD16 for useful comments on an earlier draft of this paper.
References
 [1] N. Youngs, D. Penfold-Brown, K. Drew, D. Shasha, and R. Bonneau, “Parametric Bayesian priors and better choice of negative examples improve protein function prediction,” Bioinformatics, vol. 29, no. 9, pp. 1190–1198, 2013.
 [2] M. Frasca and D. Malchiodi, “Selection of negative examples for node label prediction through fuzzy clustering techniques,” in Advances in Neural Networks: Computational Intelligence for ICT, S. Bassis, A. Esposito, C. F. Morabito, and E. Pasero, Eds. Cham: Springer International Publishing, 2016, pp. 67–76. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-33747-0_7
 [3] S. Mostafavi and Q. Morris, “Using the gene ontology hierarchy when predicting gene function,” in Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI-09). Corvallis, Oregon: AUAI Press, 2009, pp. 419–427.
 [4] S. Mostafavi and Q. Morris, “Fast integration of heterogeneous data sources for predicting gene function with limited annotation,” Bioinformatics, vol. 26, no. 14, pp. 1759–1765, 2010.
 [5] M. Frasca, A. Bertoni et al., “UNIPred: unbalance-aware Network Integration and Prediction of protein functions,” Journal of Computational Biology, vol. 22, no. 12, pp. 1057–1074, 2015.
 [6] The Gene Ontology Consortium, “Gene ontology: tool for the unification of biology,” Nature Genet., vol. 25, pp. 25–29, 2000.
 [7] A. Ruepp, A. Zollner, D. Maier, K. Albermann, J. Hani, M. Mokrejs, I. Tetko, U. Güldener, G. Mannhaupt, M. Münsterkötter, and H. Mewes, “The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes,” Nucleic Acids Research, vol. 32, no. 18, pp. 5539–5545, 2004.
 [8] P. Radivojac et al., “A largescale evaluation of computational protein function prediction,” Nature Methods, vol. 10, no. 3, pp. 221–227, 2013.
 [9] Y. Jiang, T. R. Oron et al., “An expanded evaluation of protein function prediction methods shows an improvement in accuracy,” Genome Biology, vol. 17, no. 1, p. 184, 2016. [Online]. Available: http://dx.doi.org/10.1186/s13059-016-1037-6
 [10] D. Martin, M. Berriman, and G. Barton, “GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes,” BMC Bioinformatics, vol. 5, p. 178, 2004.
 [11] T. Hawkins, M. Chitale, S. Luban et al., “PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data,” Proteins, vol. 74, no. 3, pp. 566–82, 2009.
 [12] A. Juncker, L. Jensen, A. Pierleoni, A. Bernsel, M. Tress, P. Bork, G. von Heijne, A. Valencia, A. Ouzounis, R. Casadio, and S. Brunak, “Sequence-based feature prediction and annotation of proteins,” Genome Biology, vol. 10, p. 206, 2009.
 [13] A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani, “Global protein function prediction from protein-protein interaction networks,” Nature Biotechnology, vol. 21, pp. 697–700, 2003.
 [14] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Mol. Sys. Biol., vol. 3, no. 88, 2007.
 [15] M. Frasca, “Automated gene function prediction through gene multifunctionality in biological networks,” Neurocomputing, vol. 162, pp. 48 – 56, 2015.
 [16] A. Sokolov and A. Ben-Hur, “Hierarchical classification of Gene Ontology terms using the GOstruct method,” Journal of Bioinformatics and Computational Biology, vol. 8, no. 2, pp. 357–376, 2010.
 [17] A. Sokolov, C. Funk, K. Graim, K. Verspoor, and A. Ben-Hur, “Combining heterogeneous data sources for accurate functional annotation of proteins,” BMC Bioinformatics, vol. 14, no. Suppl 3:S10, 2013.
 [18] G. Obozinski, G. Lanckriet, C. Grant, M. I. Jordan, and W. Noble, “Consistent probabilistic output for protein function prediction,” Genome Biology, vol. 9, no. S6, 2008.
 [19] Y. Guan, C. Myers, D. Hess, Z. Barutcuoglu, A. Caudy, and O. Troyanskaya, “Predicting gene function in a hierarchical context with an ensemble of classifiers,” Genome Biology, vol. 9, no. S2, 2008.
 [20] G. Valentini, “Hierarchical Ensemble Methods for Protein Function Prediction,” ISRN Bioinformatics, vol. 2014, no. Article ID 901419, pp. 1–34, 2014.
 [21] E. Marcotte, M. Pellegrini, M. Thompson, T. Yeates, and D. Eisenberg, “A combined algorithm for genomewide prediction of protein function,” Nature, vol. 402, pp. 83–86, 1999.
 [22] S. Oliver, “Guiltbyassociation goes global,” Nature, vol. 403, pp. 601–603, 2000.
 [23] B. Schwikowski, P. Uetz, and S. Fields, “A network of protein-protein interactions in yeast,” Nature Biotechnology, vol. 18, no. 12, pp. 1257–1261, Dec. 2000.
 [24] Y. Li and J. Patra, “Integration of multiple data sources to prioritize candidate genes using discounted rating systems,” BMC Bioinformatics, vol. 11, no. Suppl I:S20, 2010.
 [25] P. Bogdanov and A. Singh, “Molecular function prediction using neighborhood features,” IEEE ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 208–217, 2011.
 [26] X. Zhu et al., “Semi-supervised learning with Gaussian fields and harmonic functions,” in Proc. of the 20th Int. Conf. on Machine Learning, Washington, DC, USA, 2003.
 [27] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.
 [28] M. Szummer and T. Jaakkola, “Partially labeled classification with Markov random walks,” in NIPS 2001, vol. 14, Whistler, BC, Canada, 2001.
 [29] A. Azran, “The rendezvous algorithm: Multi-class semi-supervised learning with Markov random walks,” in Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
 [30] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, and M. Singh, “Wholeproteome prediction of protein function via graphtheoretic analysis of interaction maps,” Bioinformatics, vol. 21, no. S1, pp. 302–310, 2005.
 [31] G. Valentini, G. Armano, M. Frasca, J. Lin, M. Mesiti, and M. Re, “RANKS: a flexible tool for node label ranking and classification in biological networks,” Bioinformatics, 2016, in press. Accepted on 22 April 2016.
 [32] M. Deng, T. Chen, and F. Sun, “An integrated probabilistic model for functional prediction of proteins,” J. Comput. Biol., vol. 11, pp. 463–475, 2004.
 [33] K. Tsuda, H. Shin, and B. Schölkopf, “Fast protein classification with multiple networks,” Bioinformatics, vol. 21, no. Suppl 2, pp. ii59–ii65, 2005.
 [34] S. Mostafavi, D. Ray, D. WardeFarley, C. Grouios, and Q. Morris, “GeneMANIA: a realtime multiple association network integration algorithm for predicting gene function,” Genome Biology, vol. 9, no. S4, 2008.
 [35] U. Karaoz et al., “Wholegenome annotation by using evidence integration in functionallinkage networks,” Proc. Natl Acad. Sci. USA, vol. 101, pp. 2888–2893, 2004.
 [36] A. Bertoni, M. Frasca, and G. Valentini, COSNet: A Cost Sensitive Neural Network for Semi-supervised Learning in Graphs. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 219–234. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-23780-5_24
 [37] M. Frasca, S. Bassis, and G. Valentini, “Learning node labels with multi-category Hopfield networks,” Neural Computing and Applications, vol. 27, no. 6, pp. 1677–1692, 2016. [Online]. Available: http://dx.doi.org/10.1007/s00521-015-1965-1
 [38] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, pp. 1373–1396, 2002.
 [39] B. Kveton, M. Valko, A. Rahimi, and L. Huang, “Semi-supervised learning with max-margin graph cuts,” in AISTATS, ser. JMLR Proceedings, Y. W. Teh and D. M. Titterington, Eds., vol. 9. JMLR.org, 2010, pp. 421–428. [Online]. Available: http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#KvetonVRH10
 [40] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proceedings of the 32nd annual meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 1994, pp. 133–138. [Online]. Available: http://portal.acm.org/citation.cfm?id=981751
 [41] C. Leacock and M. Chodorow, “Combining local context and WordNet similarity for word sense identification,” in WordNet: An Electronic Lexical Database, C. Fellbaum, Ed. Cambridge, MA: MIT Press, 1998, pp. 265–283.
 [42] L. Meng, R. Huang, and J. Gu, “A review of semantic similarity measures in WordNet,” International Journal of Hybrid Information Technology, vol. 6, no. 1, pp. 1–12, 2013.
 [43] J. J. Jiang and D. W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy,” in International Conference Research on Computational Linguistics (ROCLING X), Sep. 1997, pp. 9008+. [Online]. Available: http://adsabs.harvard.edu/cgibin/nphbib_query?bibcode=1997cmp.lg....9008J
 [44] D. Lin, “An informationtheoretic definition of similarity,” in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 296–304. [Online]. Available: http://dl.acm.org/citation.cfm?id=645527.657297
 [45] A. Baradaran-Heravi, K. S. Cho, B. Tolhuis et al., “Penetrance of biallelic SMARCAL1 mutations is associated with environmental and genetic disturbances of gene expression,” Human Molecular Genetics, vol. 21, no. 11, pp. 2572–2587, Jun. 2012.
 [46] B. W. Busser, L. Shokri, S. A. Jeager et al., “Molecular mechanism underlying the regulatory specificity of a Drosophila homeodomain protein that specifies myoblast identity.” Development (Cambridge, England), vol. 139, no. 6, pp. 1164–1174, Mar. 2012.
 [47] J. Colombani, D. S. Andersen, and P. Léopold, “Secreted peptide dilp8 coordinates drosophila tissue growth with developmental timing,” Science, vol. 336, no. 6081, pp. 582–585, 2012.
 [48] L. E. Lundberg, M. Figueiredo, P. Stenberg et al., “Buffering and proteolysis are induced by segmental monosomy in Drosophila melanogaster,” Nucleic Acids Research, Mar. 2012.
 [49] C. Stark, B.-J. Breitkreutz, T. Reguly et al., “BioGRID: a general repository for interaction datasets,” Nucleic Acids Research, no. Database Issue, pp. 535–539, 2006.
 [50] J. Yu, S. Pacifico, G. Liu et al., “DroID: the Drosophila Interactions Database, a comprehensive resource for annotated gene and protein interactions,” BMC Genomics, vol. 9, no. 1, pp. 461+, Oct. 2008.
 [51] K. G. Guruharsha, J. Rual, B. Zhai et al., “A Protein Complex Network of Drosophila melanogaster,” Cell, vol. 147, no. 3, pp. 690–703, Oct. 2011.
 [52] R. Apweiler, T. K. Attwood, A. Bairoch et al., “The InterPro database, an integrated documentation resource for protein families, domains and functional sites,” Nucleic Acids Research, vol. 29, no. 1, pp. 37–40, Jan. 2001.
 [53] T. M. Bahr, G. J. Hughes et al., “Peripheral Blood Mononuclear Cell Gene Expression in Chronic Obstructive Pulmonary Disease,” American Journal of Respiratory Cell and Molecular Biology, vol. 49, no. 2, pp. 316–323, 2013.
 [54] B. V. Balgobind, M. M. Van den HeuvelEibrink et al., “Evaluation of gene expression signatures predictive of cytogenetic and molecular subtypes of pediatric acute myeloid leukemia,” Haematologica, vol. 96, no. 2, pp. 221–230, 2011.
 [55] J. Bigler, H. A. Rand et al., “Crossstudy homogeneity of psoriasis gene expression in skin across a large expression range,” PLoS ONE, vol. 8, no. 1, pp. 1–15, 01 2013.
 [56] J. Botling, K. Edlund, M. Lohr, B. Hellwig, L. Holmberg, M. Lambe, A. Berglund, S. Ekman, M. Bergqvist, F. Pontén, A. König, O. Fernandes, M. Karlsson, G. Helenius, C. Karlsson, J. Rahnenführer, J. G. Hengstler, and P. Micke, “Biomarker discovery in non–small cell lung cancer: Integrating gene expression profiling, metaanalysis, and tissue microarray validation,” Clinical Cancer Research, vol. 19, no. 1, pp. 194–204, 2013. [Online]. Available: http://clincancerres.aacrjournals.org/content/19/1/194.abstract
 [57] C. Clarke, S. F. Madden et al., “Correlating transcriptional networks to breast cancer survival: a largescale coexpression analysis,” Carcinogenesis, vol. 34, no. 10, pp. 2300–2308, 2013. [Online]. Available: http://carcin.oxfordjournals.org/content/34/10/2300.abstract
 [58] L. Vallat, C. A. Kemper et al., “Reverseengineering the genetic circuitry of a cancer cell with predicted intervention in chronic lymphocytic leukemia,” Proceedings of the National Academy of Sciences, vol. 110, no. 2, pp. 459–464, 2013. [Online]. Available: http://www.pnas.org/content/110/2/459.abstract
 [59] E. G. Cerami, B. E. Gross et al., “Pathway commons, a web resource for biological pathway data,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D685–D690, 2011.
 [60] G. Wu, X. Feng, and L. Stein, “A human functional protein interaction network and its application to cancer data analysis,” Genome Biology, vol. 11, no. 5, pp. 1–23, 2010.
 [61] S. Razick, G. Magklaras, and I. M. Donaldson, “iRefIndex: A consolidated protein interaction database with provenance,” BMC Bioinformatics, vol. 9, no. 1, pp. 1–19, 2008. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-9-405
 [62] R. D. Finn et al., “The Pfam protein families database: towards a more sustainable future,” Nucleic Acids Research, vol. 44, no. D1, pp. D279–D285, 2016. [Online]. Available: http://nar.oxfordjournals.org/content/44/D1/D279.abstract
 [63] A. I. Graham, G. Sanguinetti, N. Bramall, C. W. McLeod, and R. K. Poole, “Dynamics of a starvation-to-surfeit shift: a transcriptomic and modelling analysis of the bacterial response to zinc reveals transient behaviour of the Fur and SoxS regulators,” Microbiology, vol. 158, no. 1, pp. 284–292, 2012.
 [64] J. L. Robbins-Manke, Z. Z. Zdraveski, M. Marinus, and J. M. Essigmann, “Analysis of global gene expression and double-strand-break formation in DNA adenine methyltransferase and mismatch repair-deficient Escherichia coli,” Journal of Bacteriology, vol. 187, no. 20, pp. 7027–7037, October 2005. [Online]. Available: http://europepmc.org/articles/PMC1251628
 [65] M. Babu et al., “Genetic interaction maps in Escherichia coli reveal functional crosstalk among cell envelope biogenesis pathways,” PLoS Genet, vol. 7, no. 11, pp. 1–15, Nov. 2011.
 [66] G. Butland et al., “eSGA: E. coli synthetic genetic array analysis,” Nat Meth, vol. 5, no. 3, pp. 789–795, Jan. 2008.
 [67] P. Hu, S. C. Janga et al., “Global Functional Atlas of Escherichia coli Encompassing Previously Uncharacterized Proteins,” PLoS Biol, vol. 7, no. 4, p. e1000096, Apr. 2009. [Online]. Available: http://dx.doi.org/10.1371/journal.pbio.1000096
 [68] T. Saito and M. Rehmsmeier, “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,” PLoS ONE, vol. 10, no. 3, pp. 1–21, Mar. 2015.
 [69] L. Lovász, “Random walks on graphs: A survey,” in Combinatorics, Paul Erdős is Eighty, D. Miklós, V. T. Sós, and T. Szőnyi, Eds. Budapest: János Bolyai Mathematical Society, 1996, vol. 2, pp. 353–398.
 [70] L. Lan, N. Djuric, Y. Guo, and S. Vucetic, “MS-kNN: protein function prediction by integrating multiple data sources,” BMC Bioinformatics, vol. 14, no. Suppl 3:S8, 2013.