Ranking rare-class items in highly unbalanced two-class datasets is a common problem in many applications. Some examples are as follows: identifying fraudulent activities in credit card transactions (Bhusari and Patil, 2011; Bolton and Hand, 2002), intrusion detection in internet surveillance (Lippmann and Cunningham, 2000), detection of terrorist attacks or threats (Fienberg and Shmueli, 2005; Peeters, 2006), finding information on the world wide web (Gordon and Pathak, 1999), and finding drug-candidate biomolecules in a chemical library (Tomal et al., 2016).
The main application in this paper is the ranking of homologous proteins. Two or more proteins are homologous in the sense that they share a common evolutionary origin, and knowing homology status helps scientists infer the evolutionary history of proteins (Koonin and Galperin, 2003). Detection of homologous proteins is of high importance in bioinformatics (Söding, 2005) and has widespread applications in the prediction of a protein's function, 3D structure, and evolution (Bork and Koonin, 1998; Kinch et al., 2003; Henn-Sax et al., 2001). Evolutionarily related homologous proteins have similar amino-acid sequences and three-dimensional structures. Thus, homology modeling refers to comparing proteins based on their amino-acid sequences and three-dimensional structures to predict whether the candidate proteins are homologous to the native protein.
Let $y$ be a random variable representing the class status of a case, taking value $1$ for the rare class and $0$ for the uninteresting class. We assume that class $1$ is rare, that is, $\Pr(y = 1)$ is small. We build our model using training data for which the class status and the feature variables are known. The trained model is then used to predict the unknown class status of test data for which only the feature variables are known.
In the case of homology modeling, the variables represent similarity scores between candidate proteins and the native (target) protein. The feature variables exploit a three-dimensional description of the native protein and sequence-related information of the candidate protein. The sequence-related characteristics of the candidate protein are: the amino acid frequency profile, the predicted secondary structure, and the predicted solvent-exposed surface area. The characteristics of the native protein are: the raw sequence and profile, the actual secondary structure, and the actual exposed surface area. The comparisons of these characteristics are converted into various scores (raw scores, reverse scores, E-values, and other derived scores), which serve as the feature variables. Details are in Vallat et al. (2009).
Several methods have already been applied to protein homology prediction: hidden Markov models (Karplus et al., 1998), support vector machines and neural networks (Hochreiter et al., 2007), Markov random fields (Ma et al., 2014), and more. The cited methods are non-ensemble in nature: each trains a single model.
Ensemble methods that combine multiple models are generally considered more powerful, in terms of predictive performance, than non-ensemble methods (Dietterich, 2000). For example, the overall winner in the protein homology section of the knowledge discovery and data mining (KDD) cup competition was the Weka group (Pfahringer, 2004). Weka (http://www.cs.waikato.ac.nz/ml/weka/; accessed November 04, 2016) is an open platform which offers a collection of machine learning algorithms for data mining tasks. The winning group tried a large number of algorithms and selected the top three classifiers: a boosting ensemble (Freund and Schapire, 1997) of unpruned decision trees; a linear support vector machine (Cortes and Vapnik, 1995) with a logistic model fitted to the raw outputs for improved probabilities; and random rules, a variant of random forests (Breiman, 2001). The top three classifiers, two of which are already ensemble methods, were aggregated to obtain the winning ensemble.
The history of injecting diversity into models to build powerful ensembles goes back to the 1990s. The most popular method for injecting diversity into models is random perturbation of the training data (Melville and Mooney, 2003; Breiman, 1996, 2001). Another popular method is random perturbation and/or clustering of the feature variables (Breiman, 2001; Wang, 2005; Gupta et al., 2014; Tomal et al., 2015). In contrast, injecting diversity into ensembles through diverse evaluation metrics is new. In this article, we inject diversity into our ensemble through both diverse models and diverse metrics.
We propose an ensemble of models of the form
$$\hat{\pi}_{\text{EM}}(\mathbf{x}) = \frac{1}{p} \sum_{i=1}^{p} \hat{\pi}_i(\mathbf{x}_i),$$
where each model $\hat{\pi}_i$ is trained on a subset $\mathbf{x}_i$ of the feature variables. Any two subsets of feature variables $\mathbf{x}_i$ and $\mathbf{x}_j$ ($i \neq j$) are disjoint and diverse from each other. The number of models $p$ to be ensembled is data dependent and determined adaptively. We propose an algorithm that groups the variables into subsets in such a way that variables in the same group appear to work better together in one model, while variables in different groups appear to work better in separate models. The subsets of variables are aggregated into an ensemble by fitting a model to each subset and averaging across the subsets. We therefore call a subset of feature variables that work well together in a single model, rather than in separate models, a phalanx, and the ensemble of these models is called an ensemble of phalanxes (EPX). Our work here improves the method proposed by Tomal et al. (2015).
In the type of applications we have in mind, the response variable is very sparse, so it is not a good idea to evaluate a model by minimizing misclassification error. For example, in the case of homology status of candidate proteins, only a tiny fraction of the candidate proteins are homologous to the native protein; a naive classifier that calls all the candidate proteins "non-homologous" would achieve a very small misclassification error while being useless for ranking. Instead, we evaluate a classification model by its ability to rank the test cases so that the rare cases appear at the top of the list. In highly unbalanced class problems, the ranking performance of a model is better reflected by a hit curve. Consider a test set of $n$ cases, of which $n_1$ are of class 1, ranked using their probability of being of class 1, and suppose a shortlist of the top $t$ test cases contains $h_t$ cases of class 1. Then the plot of $h_t$ against $t$ is called a hit curve. Figure 1(a) shows three hit curves. One hit curve is uniformly better than another if its hit count is at least as large at every $t$; we therefore want hit curves towards the top-left corner, with the most rapid early rise. The diagonal hit curve, on the other hand, shows performance similar to random ranking.
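As an illustration, a hit curve can be computed directly from predicted scores and true labels. The following Python sketch is ours, not from the paper; it returns the number of hits $h_t$ for every shortlist size $t$:

```python
import numpy as np

def hit_curve(scores, labels):
    """Number of class-1 cases (hits) among the top-t ranked cases,
    for every shortlist size t = 1, ..., n."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by score, descending
    sorted_labels = np.asarray(labels)[order]
    return np.cumsum(sorted_labels)  # h_t as a function of t

# Toy example: 6 cases, 2 of class 1.
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7]
labels = [1, 0, 0, 0, 0, 1]
print(hit_curve(scores, labels).tolist())  # [1, 1, 2, 2, 2, 2]
```

A curve whose cumulative hit counts dominate another's at every $t$ is uniformly better, matching the description above.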
A hit curve can effectively capture the ranking performance of a method, but hit curves are difficult to optimize as they do not provide a single performance value. Hit curves may also criss-cross each other, making comparison even more difficult: see, for example, Figure 1(b), where two hit curves cross each other at several locations.
There are several derived metrics that evaluate the ranking performance of a method by summarizing the hit curve in a single numeric value. These metrics make comparisons of ranking performance among several methods easier. However, some of the metrics focus on the beginning of the hit curve and others on its tail; in other words, there is considerable diversity among the evaluation metrics as well. In this article, we exploit this diversity in metrics to construct an ensemble of models and metrics as follows:
$$\hat{\pi}_{\text{EMM}}(\mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} \hat{\pi}_{\text{EM}}^{(m)}(\mathbf{x}),$$
where $\hat{\pi}_{\text{EM}}^{(m)}$ is the ensemble of models optimized using the $m$th metric. Here $M$, the number of diverse evaluation metrics to ensemble, is application dependent. In this article, the ensemble of models and metrics is obtained by aggregating two diverse ensembles of models: one obtained by optimizing an evaluation metric called average precision (see Section 3.1) and the other by optimizing another evaluation metric called rank last (see Section 3.2).
The significance of using average precision and rank last lies in their application to protein homology. In these data, some candidate proteins are close homologs and some are distant homologs of the native proteins. The ensemble of models optimized using average precision (APR) ranks close homologs well but pays less attention to distant or remote homologs. Yet it is often of interest to detect remote homologs, as doing so can facilitate the assignment of putative function to uncharacterized proteins, improving genomic function annotation (Söding, 2005; Eddy, 1998, 2011; Kaushik et al., 2016). Optimizing rank last (RKL) to build an ensemble of models helps us achieve this latter goal. Aggregating two ensembles of models, one based on APR and the other based on RKL, helps us achieve both goals: the performance of the aggregated ensemble of models and metrics is robust in detecting both close and distant homologous proteins.
Another metric, TOP1 (Section 3.3), could also be used to evaluate the ranking performance of the ensembles. This metric, however, is not used to optimize the ensemble of models, for the following reasons. First, TOP1 achieves its largest possible value when one of the homologous proteins is ranked in the top position, and is completely ignorant of distant homologs; it pays no attention to any homologous protein other than the one ranked in the top position. Thus, a method might not need many predictor variables to do well on this evaluation metric; for example, in a particular run of the ensemble of models optimized using TOP1, only one subset containing a small fraction of the variables was selected. This phenomenon is intuitively opposite to the ensemble of models, which requires many complementary subsets of predictor variables to aggregate. Second, the behavior of the metric TOP1 is not complementary to the metric average precision: it will be shown in the results section that one can achieve a decent value of TOP1 by maximizing average precision.
The intent of this paragraph is to make our argument for complementary metrics clearer. Table 1 shows the behavior of APR, RKL, and TOP1 using a toy example, in which we consider a block of candidate proteins of which exactly one is homologous to the native protein. Homologous and non-homologous proteins are denoted by 1 and 0, respectively. The rows of the table show five possible orderings of the proteins, with the homologous protein found at progressively later positions; the values of the three evaluation metrics APR, RKL, and TOP1 are shown in the last three columns. In the first few orderings, the homologous protein is easy to rank high in the list. (The ranks are normalized so that small values stand for ranking in the top positions and large values for ranking near the bottom.) In this cohort of orderings, APR reflects ranking performance better than RKL: between consecutive orderings, the change in APR is substantial whereas the change in RKL is small relative to its range. Note that the maximum possible APR is 1, while the maximum possible RKL is the block size. TOP1, whose maximum value is 1, drops straight from 1 to 0 as soon as the homologous protein leaves the top position, and does not distinguish any of the later orderings from one another.
In the last cases, the homologous protein is hard to rank high in the list: even the best of these orderings places the homologous protein far down the list, with a large normalized rank. Here, RKL reflects the difference in ranking performances better than APR: the change in RKL is substantial, whereas the change in APR is tiny. In contrast, TOP1 cannot detect the change in ranking performance at all. This example shows that APR and RKL are good choices to capture complementary aspects of ranking performance.
Unlike the approach of the winning procedures of the KDD cup competition, where the sole purpose was to achieve top predictive performance on each single metric separately, we have developed methodologies for building robust predictive ensembles by carefully inducing diversity through complementary subsets of feature variables (phalanxes) and evaluation metrics. While most participants in the KDD cup (including the winners) used separate sets of probabilities for each metric, we use a single set of probabilities because we are seeking a robust model that nearly maximizes the entire hit curve.
The rest of the article is organized as follows. Section 2 describes the protein homology data and the feature variables. Section 3 defines three evaluation metrics specific to ranking. Section 4 describes the base learner, the logistic regression model, which is the main ingredient of our ensemble. The details of the algorithm of subset/phalanx formation and its optimization through complementary evaluation metrics are presented in Section 5. Section 6 showcases the results of the ensemble of models and metrics and compares them to the winners of the KDD cup competition and other state-of-the-art ensembles. Finally, Section 7 summarizes the results and draws conclusions.
2 Data: Protein Homology
The protein homology data are downloaded from the knowledge discovery and data mining (KDD) cup competition website (http://osmot.cs.cornell.edu/kddcup/datasets.html; accessed November 06, 2016). Registration is required to download the data, submit predictions, and get results.
In the data file, the columns represent variables and the rows represent proteins. The first column is BLOCK ID, where each block relates to one native protein, and each row within a block relates to a candidate protein which is tested for homology against that native protein. Hundreds of candidate proteins are tested against each native protein. The variable BLOCK ID is less important in the sense that it is not used for model building; however, it is required to compute the evaluation metrics (Section 3).
Two important types of variables in these data are the response variable and the feature variables. The response variable, representing homology status, scores 1 if a candidate protein is homologous to the native protein and 0 otherwise. The feature variables represent structural similarity or match between a candidate protein with known amino acid sequence and a native protein with known structural template. The feature variables include: global sequence alignment, local sequence alignment, global threading, local threading, PSI-BLAST, global secondary structure alignment, sequence-to-profile matching (global), local exposed surface area alignment, and so forth. For details of the feature variables, please see the appendix of Vallat et al. (2008). There are 74 feature variables in total.
Many native proteins are considered in this protein homology data: a total of 303 native proteins provide the BLOCK ID numbers. The blocks are randomly assigned into two sets of data: training and test. The training set contains 153 native proteins (i.e., blocks) and the test set contains 150 native proteins (i.e., blocks). The homology status is known for the training set and unknown for the test set. The training and test sets contain 145,751 and 139,658 candidate proteins, respectively.
The five-number summaries (minimum, quartiles, median, and maximum) of the block sizes are similar for the training and test sets, except that the test set contains a few smaller blocks. Prediction for these smaller test blocks is not a major issue here, as the training blocks are larger than those test blocks. We consider that the training blocks contain sufficient information to build a model that can predict the test proteins with reasonable accuracy.
The blocks in the training set contain few homologous proteins compared to non-homologous proteins: the within-block proportions of homologous proteins are uniformly small, and most training blocks contain only a handful of homologous proteins among hundreds of candidates. It is a highly unbalanced classification problem, and appropriate evaluation metrics are considered in Section 3 to tackle this issue.
Some of the feature variables in this protein homology data are useful and some are not. Figure 2 shows two examples. Panel (a) shows the density plots of one feature variable for the homologous proteins (the solid line) and for the non-homologous proteins (the dashed line) in the training data; this variable appears to differentiate the homologous and non-homologous proteins fairly well. There is also evidence of less-informative feature variables: panel (b) shows the corresponding density plots for another feature variable, which does not differentiate the homologous and non-homologous proteins well. The presence of such less-informative feature variables motivates the variable selection performed in this article.
3 Evaluation Metrics
We evaluate a model by its ability to rank the homologous proteins ahead of the non-homologous proteins. By ranking we mean sequencing candidate proteins using their probabilities of homogeneity to the native protein. The candidate protein with the largest probability of homogeneity is ranked first, followed by the protein with the second largest probability and so on.
Three metrics, specific to ranking rare homologous proteins, are used to evaluate the prediction performance of a model. Since the protein homology data come in blocks, each of the three metrics is computed in each block independently, and the average performance across the blocks is used as the final value of the metric. Thus, in order to perform well, a model has to rank the rare homologous proteins well across many blocks. The definitions of the three evaluation metrics are given below.
3.1 Average Precision
Average precision (APR) is a variant of the metric average hit rate (AHR) (Zhu, 2004), and the application of APR is common in information retrieval. Suppose the candidate proteins in a block are ranked using their probabilities of homogeneity, and consider a shortlist of the top $t$ candidate proteins, of which $h_t$ are homologous to the native protein. Then
$$P(t) = \frac{h_t}{t}$$
is called the precision (or hit rate) computed at $t$, the number of top-ranked candidate proteins. Naturally, $P(t)$ is expected to be as large as possible at every $t$. Let $t_1, t_2, \ldots, t_m$ be the ranking positions of the $m$ homologous proteins in that block. Then APR is defined as
$$\mathrm{APR} = \frac{1}{m} \sum_{j=1}^{m} P(t_j).$$
APR averages "precisions" at the ranking positions of the homologous proteins, and a larger APR implies better predictive ranking of the homologous proteins. APR reaches its maximum value of 1 if all of the homologous proteins are ranked ahead of the non-homologous proteins. If a model ranks more homologous proteins ahead of the non-homologous ones, APR rewards the model with a bigger number. The lowest value of average precision depends on the number of candidate proteins and the proportion of homologous proteins in a block. Our goal is to maximize APR.
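For concreteness, the definition of APR can be coded in a few lines. This is our own minimal sketch, not the KDD Cup scoring code:

```python
import numpy as np

def average_precision(scores, labels):
    """APR: average of the precision h_t / t evaluated at the ranking
    position t of each truly homologous protein (label 1)."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by score, descending
    y = np.asarray(labels)[order]
    precisions = np.cumsum(y) / np.arange(1, len(y) + 1)  # P(t) for t = 1..n
    return precisions[y == 1].mean()

# Perfect ranking of the homologous proteins gives APR = 1.
print(average_precision([0.9, 0.8, 0.3, 0.2, 0.1], [1, 1, 0, 0, 0]))  # 1.0
```

A block's APR would then be averaged across blocks to give the final value of the metric.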
APR gives large weight to the homologous proteins ranked at the start of a hit curve; the weight shrinks at a very fast rate for homologous proteins ranked towards the back of the hit curve. Hence, APR focuses on the start of a hit curve.
3.2 Rank Last
The metric rank last (RKL) gives the rank of the last homologous protein; it measures how far down the last true homologous protein falls among the sorted candidate proteins. Ties are treated conservatively: if there is a tie, the last protein in the tied group determines rank last, so ties are not helpful. An RKL of 1 means that the last true homologous protein is sorted into the top position; this is excellent, but can only happen in a block containing a single homologous protein. The maximum possible value of RKL is the size of the block, i.e., the number of candidate proteins in that block. Our goal is to minimize RKL. In summary, RKL is the rank of the homologous protein found at the very back of a hit curve.
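A minimal sketch of RKL with the conservative tie rule described above (our own code, not the official scorer):

```python
import numpy as np

def rank_last(scores, labels):
    """RKL: rank of the worst-ranked homologous protein. A tied protein is
    conservatively given the bottom rank of its tie group, i.e., its rank is
    the number of candidates scoring at least as high as it does."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    return max(int((scores >= s).sum()) for s in scores[labels == 1])

print(rank_last([0.9, 0.9, 0.5], [1, 0, 0]))  # 2: tied with a non-homolog
```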
3.3 TOP1
After sorting the candidate proteins, if the top-ranked candidate is found homologous to the native protein, TOP1 scores 1, and otherwise 0. TOP1 is calculated conservatively when there are ties: if multiple proteins are tied for rank 1, all of them must be homologous to score a 1; if any of the tied proteins is non-homologous, TOP1 scores 0. It is never beneficial to have ties. Our goal is to maximize TOP1, although it is difficult to score well on TOP1 when a block contains few homologous proteins, and easier when it contains many. In summary, TOP1 evaluates a hit curve only through the top-ranked protein.
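TOP1 with the conservative tie rule can be sketched similarly (again, our own code):

```python
import numpy as np

def top1(scores, labels):
    """TOP1: 1 only if every candidate tied for the top score is homologous."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    return int(labels[scores == scores.max()].all())

print(top1([0.9, 0.9, 0.1], [1, 0, 0]))  # 0: a non-homolog ties for the top
```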
The codes for computing the metrics can be downloaded from the KDD Cup website (http://osmot.cs.cornell.edu/kddcup/metrics.html; accessed November 04, 2016).
4 Logistic Regression Model
Let $y$ be the response variable homology status and $\mathbf{x} = (x_1, x_2, \ldots, x_D)^T$ be the vector of feature variables. Let $\pi(\mathbf{x})$ denote the probability of homogeneity of a candidate protein,
$$\pi(\mathbf{x}) = \Pr(y = 1 \mid \mathbf{x}).$$
The logistic regression model, which describes the relationship between the response and the feature variables, is defined as
$$\log\left(\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D,$$
where $\pi(\mathbf{x}) = E(y \mid \mathbf{x})$ stands for the expectation of the response variable conditioned on the vector of feature variables $\mathbf{x}$. The model parameters $\beta_0$ and $\beta_1, \beta_2, \ldots, \beta_D$ are called regression coefficients.
Considering Bernoulli as the distribution of $y$ with mean $\pi(\mathbf{x})$, the model parameters are estimated using the maximum likelihood method applied to the training data. The glm function in R is used to fit the model. The feature variables corresponding to the test data are then sent down the fitted model to obtain probabilities of homogeneity, which are uploaded to the KDD Cup website to get the numbers for APR, RKL, and TOP1.
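The paper fits the LRM with R's glm; an equivalent sketch in Python is below. The synthetic data and scikit-learn stand in for the real pipeline, so everything here is illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for training/test feature variables and homology status.
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(size=200) > 0.5).astype(int)
X_test = rng.normal(size=(50, 3))

# A very large C makes the ridge penalty negligible, approximating the
# plain maximum-likelihood fit produced by glm.
lrm = LogisticRegression(C=1e6, max_iter=1000).fit(X_train, y_train)
prob_homology = lrm.predict_proba(X_test)[:, 1]  # probability of class 1
```

The vector `prob_homology` plays the role of the test probability of homogeneity described above.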
The reasons for using a logistic regression model in this application are the following. First, the research group comprised of Ruben Zamar, William Welch, Guohua Yan, Hui Shen, Ying Lin, Weiliang Qiu, and Fei Yuan (http://stat.ubc.ca/~will/ddd/kdd_result.html; accessed November 04, 2016) of the University of British Columbia did relatively well in the KDD cup competition in predicting protein homology. That group used a logistic regression model for prediction, and so do we in this article. Second, fitting a logistic regression model is computationally much cheaper than fitting other models such as random forests.
5 Algorithm of Subset/Phalanx Formation
This section describes the algorithm of phalanx formation (APF) using the logistic regression model (LRM). The APF has three basic steps: (i) filtering predictor variables, in which the predictor variables are filtered down to a smaller set; (ii) merging predictors into phalanxes, in which the post-filtered predictors are clustered hierarchically into candidate phalanxes; and (iii) filtering candidate phalanxes, in which the candidate phalanxes are filtered down to the final phalanxes.
The APF is an updated version of the algorithm proposed by Tomal et al. (2015). The old algorithm requires repeated fits of random forests; random forests is itself fairly computationally demanding, and repeated fits of RF increase the computational time substantially. In the new APF, a popular and widely used logistic regression model is used both to form the subsets and to build the ensemble. The goals are threefold: improving the performance of the ensemble of models in terms of predictive ranking, reducing the computational burden of the algorithm, and showing that the ensemble of models is adaptable to a base learner other than random forests.
Besides, the new algorithm optimizes two complementary evaluation metrics, average precision and rank last, independently to form the phalanxes. This shows that the algorithm can be adapted to optimize diverse evaluation metrics. The optimization of the APF by maximizing APR will be explained in detail, and the optimization by minimizing RKL will be introduced briefly.
1. Filtering predictor variables. The vector of predictor variables is denoted by $\mathbf{x} = (x_1, x_2, \ldots, x_D)^T$. A predictor variable is filtered out if it is: weak by itself, weak when put together with other variables, and weak when ensembled with other variables.
To detect whether the performance of a predictor variable is weak, the reference distribution of APR is determined using random permutation. In step one, the order of the candidate proteins is recorded within each block. In step two, the response variable is randomly permuted. In step three, the metric APR is computed in each block and averaged over the blocks. The reference distribution of APR is then obtained by repeating the second and third steps many times. Finally, an upper quantile and the median of the reference distribution of APR are recorded.
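The permutation reference distribution can be sketched as follows. The metric passed in is a placeholder (here, precision at the top position), the within-block permutation is our assumption, and all names are ours:

```python
import numpy as np

def permutation_reference(labels_by_block, metric, n_perm=1000, seed=1):
    """Reference distribution of a block-averaged metric under random ranking:
    randomly permute the response within each block, recompute the metric per
    block, average over blocks, and repeat n_perm times."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_perm):
        per_block = [metric(rng.permutation(y)) for y in labels_by_block]
        values.append(float(np.mean(per_block)))
    return np.asarray(values)

# Toy blocks of 0/1 homology labels; metric = precision at the top position.
blocks = [np.array([1, 0, 0, 0]), np.array([1, 1, 0, 0, 0, 0])]
ref = permutation_reference(blocks, metric=lambda y: float(y[0]), n_perm=500)
upper, med = np.quantile(ref, 0.95), np.median(ref)  # recorded cut-offs
```

An observed metric value well above `upper` then indicates a predictor stronger than random ranking.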
We record the indexes of a random $K$-fold cross-validation (CV) defined at the block level. This cross-validation is used to evaluate every model during phalanx formation. Note that the old algorithm uses sub-optimal out-of-bag (OOB) samples in random forests to evaluate models during phalanx formation; the OOB samples were used there to avoid the computational expense of fitting random forests repeatedly under cross-validation.
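Cross-validation at the block level keeps all candidate proteins of a native protein together in one fold. A sketch using scikit-learn's GroupKFold (toy data and names are ours):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 12 candidate proteins in 4 blocks (one block per native protein).
block_id = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

# Each block falls entirely inside a single fold, so no block is split
# between training and validation.
for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups=block_id):
    assert set(block_id[test_idx]).isdisjoint(set(block_id[train_idx]))
```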
For each predictor variable $x_i$, fit an LRM and obtain the cross-validated probability of homogeneity vector $\hat{\pi}_i$. The strength of the $i$th predictor variable is reflected through the metric $a_i = \mathrm{APR}(\hat{\pi}_i)$: the higher the value of $a_i$, the stronger the $i$th predictor. Note that the old algorithm uses random forests as the base learner; as the LRM is computationally fast, a near-optimal cross-validated estimate of the evaluation metric is now feasible.
To determine whether a predictor variable $x_i$ is strong when put together in a model with another variable $x_j$, fit an LRM using both $x_i$ and $x_j$ and obtain the CV probability of homogeneity vector $\hat{\pi}_{ij}$. The joint strength of $x_i$ and $x_j$ is reflected through the evaluation metric $a_{ij} = \mathrm{APR}(\hat{\pi}_{ij})$: the larger the value of $a_{ij}$, the stronger the marginal strength of $x_i$ when put together in a model with $x_j$.
The predictor variable $x_i$ is ensembled with another predictor variable $x_j$ by averaging the individual probability of homogeneity vectors and computing the evaluation metric $\bar{a}_{ij} = \mathrm{APR}\{(\hat{\pi}_i + \hat{\pi}_j)/2\}$: the larger the value of $\bar{a}_{ij}$, the stronger the marginal strength of $x_i$ when ensembled with $x_j$.
Combining all aspects, the $i$th predictor variable is considered weak/noise and filtered out if its individual strength $a_i$, its joint strengths $a_{ij}$, and its ensembling strengths $\bar{a}_{ij}$ (over the other predictors $x_j$) all fail to rise sufficiently above an upper quantile of the reference distribution of APR.
While optimizing the APF by minimizing RKL, the $i$th predictor variable is considered weak/noise and filtered out if the corresponding individual, joint, and ensembling values of RKL all fail to fall sufficiently below a lower quantile of the reference distribution of RKL.
After filtering, the vector of surviving predictor variables is passed to the next step, merging predictors into subsets/phalanxes. This step performs non-aggressive filtering of the predictor variables; in other words, it ensures that every useful predictor variable gets its due chance to be used in the final ensemble of models or phalanxes.
2. Merging predictors into subsets/phalanxes. Assign each surviving predictor variable to its own group $g_i$; at the start, each group contains exactly one predictor variable.
The APF uses the LRM as the base learner, and the performance of an LRM might degrade from its optimum in the presence of unimportant predictors. In order to cluster only the important predictors into phalanxes, and thus avoid mixing in unimportant predictors, the following criterion is minimized,
$$c_{ij} = \frac{\max\left\{a(g_i),\, a(g_j),\, \bar{a}(g_i, g_j)\right\}}{a(g_i \cup g_j)},$$
over all possible pairs $(i, j)$, where $a(g)$ is the cross-validated APR of an LRM fitted to the variables in group $g$, and $\bar{a}(g_i, g_j)$ is the APR of the ensemble (averaged probabilities) of the two groups' models. If $c_{ij} < 1$, the joint performance of the groups of predictors $g_i$ and $g_j$ is better than their ensemble performance,
$$a(g_i \cup g_j) > \bar{a}(g_i, g_j), \quad (4)$$
and their individual performances,
$$a(g_i \cup g_j) > \max\{a(g_i), a(g_j)\}. \quad (5)$$
When both (4) and (5) hold, $c_{ij}$ is less than $1$ and the groups $g_i$ and $g_j$ are merged together. After each merge, the number of groups is reduced by one, and one of the new groups is the union of the two old groups. The algorithm continues until $c_{ij} \ge 1$ for all pairs $(i, j)$, suggesting that any further merge would degrade either the individual performances or the ensembling performance. Note that the old algorithm in Tomal et al. (2015) might appear suboptimal, as it minimizes a criterion that justifies only (4) and ignores (5).
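The merging step can be sketched as a generic greedy loop; `joint_score` and `ensemble_score` stand in for the cross-validated APR computations, and the stopping rule is our paraphrase of the criterion above:

```python
def merge_phalanxes(groups, joint_score, ensemble_score):
    """Greedy hierarchical merging: repeatedly merge the pair of variable
    groups whose joint model most outperforms both their ensemble and their
    individual models (higher score is better, e.g. cross-validated APR);
    stop when no merge improves on both."""
    groups = [set(g) for g in groups]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                joint = joint_score(groups[i] | groups[j])
                bar = max(joint_score(groups[i]), joint_score(groups[j]),
                          ensemble_score(groups[i], groups[j]))
                if joint > bar and (best is None or joint - bar > best[0]):
                    best = (joint - bar, i, j)
        if best is None:
            break  # every merge would hurt an individual or ensemble score
        _, i, j = best
        groups[i] |= groups[j]
        del groups[j]
    return groups
```

With a toy scoring function that rewards only the pair {1, 2} for appearing together, the loop merges exactly that pair and then stops.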
While optimizing the APF by minimizing RKL, the analogous criterion,
$$c'_{ij} = \frac{\min\left\{r(g_i),\, r(g_j),\, \bar{r}(g_i, g_j)\right\}}{r(g_i \cup g_j)},$$
where $r(\cdot)$ and $\bar{r}(\cdot, \cdot)$ denote the corresponding cross-validated values of RKL, is maximized over all pairs $(i, j)$. Here, the two groups of variables $g_i$ and $g_j$ are merged if $c'_{ij} > 1$.
Denote the resulting groups as candidate phalanxes, and pass the vector of candidate phalanxes to the filtering-candidate-phalanxes step.
3. Filtering candidate phalanxes. The APF identifies candidate phalanxes that help all the other phalanxes in the final ensemble; the candidate phalanxes that do not help the others are filtered out. To detect weak phalanxes to filter, the following criterion is minimized,
$$d_{ij} = \frac{\bar{a}(g_i, g_j)}{\max\{a(g_i), a(g_j)\}},$$
over all possible pairs of candidate phalanxes $(i, j)$. If $d_{ij} < 1$, the ensembling performance of the $i$th and $j$th candidate phalanxes is weaker than that of the top-performing individual phalanx; equality of $\bar{a}(g_i, g_j)$ and $\max\{a(g_i), a(g_j)\}$ shows that the ensemble of the two does not improve on the individual performances. In these cases, the criterion is less than or equal to one ($d_{ij} \le 1$) and the weaker phalanx is filtered out: if $a(g_i) \ge a(g_j)$, the $j$th phalanx is filtered out; otherwise, the $i$th phalanx is filtered out. After filtering a phalanx, the number of candidate phalanxes is reduced by one. The filtering of phalanxes then proceeds hierarchically until $d_{ij} > 1$ for all remaining pairs or only one phalanx remains. Such filtering always keeps the strongest candidate phalanx, and all phalanxes in the final set help all the others.
The old algorithm keeps a candidate phalanx in the final ensemble if the phalanx is strong by itself, or strong in an ensemble with at least one other phalanx. The old algorithm is therefore vulnerable to including a phalanx that is strong individually but harmful to the other phalanxes in the ensemble; likewise, a retained phalanx can be helpful to some phalanxes and harmful to others. These weaknesses of the old algorithm are addressed in the new APF.
While optimizing the algorithm by minimizing RKL, the analogous criterion, defined with RKL in place of APR, is maximized over all pairs $(i, j)$; here, a candidate phalanx is filtered out when the ensemble of the pair fails to improve on the better individual phalanx. Finally, the APF outputs the set of final phalanxes $g_1, g_2, \ldots, g_p$.
5.1 Ensemble of Models/Phalanxes
Let $\hat{\pi}_i$ be the probability of homogeneity vector obtained from an LRM applied to the phalanx $g_i$. The probability of homogeneity vector for the ensemble of models (EM) is obtained by averaging the individual probability vectors:
$$\hat{\pi}_{\text{EM}} = \frac{1}{p} \sum_{i=1}^{p} \hat{\pi}_i. \quad (6)$$
This probability vector is used to compute evaluation metrics for EM.
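The ensemble probability above amounts to a simple average of the per-phalanx probability vectors; a two-phalanx sketch with hypothetical numbers:

```python
import numpy as np

def ensemble_probability(phalanx_probs):
    """Average the per-phalanx probability of homogeneity vectors to get
    the ensemble-of-models (EM) probability vector."""
    return np.mean(np.vstack(phalanx_probs), axis=0)

# Two hypothetical phalanx models scoring the same four test proteins.
p1 = np.array([0.9, 0.2, 0.6, 0.1])
p2 = np.array([0.7, 0.4, 0.8, 0.1])
print(ensemble_probability([p1, p2]))  # [0.8 0.3 0.7 0.1]
```

A protein ranked poorly by one phalanx but highly by another still receives a reasonably high averaged probability, which is how the diversity among phalanxes pays off.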
6 Results
To construct the phalanxes, the algorithm is run on the protein homology training data. Table 2 shows the total number of variables, post-filtered variables, candidate phalanxes, and post-filtered phalanxes. The algorithm is independently optimized twice, first by maximizing APR and second by minimizing RKL.
While optimizing the algorithm with respect to APR, none of the variables was filtered out. The post-filtered variables were then clustered into candidate phalanxes, of which three passed the filtering, so the surviving variables were distributed across three final phalanxes.
When optimized using RKL, the algorithm again filtered none of the variables. The post-filtered variables were then clustered into candidate phalanxes, of which two survived the filtering, so the surviving variables were distributed across two final phalanxes.
Given the phalanxes, the ensemble of models (EM) is obtained as presented in equation (6). The results of EM relating to APR are presented first, followed by the results relating to RKL.
6.1 Ensemble of Models - Optimized on APR
The second column of Table 3 shows the cross-validated ($K$-fold) training average precisions for the three phalanxes and for the ensemble of models/phalanxes (EM-APR); the cross-validation is defined at the block level of the training data. As larger values of average precision imply better predictive ranking, the ensemble of models (EM-APR) shows an improvement in average precision over each of the three individual models/phalanxes.
Figure 3 shows the scatter plots of phalanx 2 versus phalanx 3 based on -fold cross-validated training probabilities of homogeneity. The reason for choosing phalanx 2 and phalanx 3 is that their scatter plots reflect diversity most clearly. Panels (a) and (b) are for the homologous and non-homologous proteins, respectively, against the target native proteins. Note that proteins with larger probabilities of homogeneity are ranked higher in the sequence.
Both phalanxes and are able to rank a good proportion of the homologous proteins: the first and second top-right corners of panel (a) contain and homologous proteins of all of the homologous proteins in the training data, respectively. The ranking accuracies of the two phalanxes are very high: the first and second top-right corners of panel (a) contain and homologous proteins of all of the proteins within each corner, respectively. It is also observed that some of the homologous proteins are hard to rank high (remote homologs): the bottom-left corner of panel (a) contains homologous proteins of all the homologous proteins in the training data. Within this cell there are and homologous (panel (a)) and non-homologous (panel (b)) proteins, respectively. The two phalanxes are diverse, which is reflected in the location of homologous proteins in the top-left and bottom-right corners. Similar results are obtained in the other scatter plots (phalanx versus phalanx and phalanx versus phalanx ). Such diversity among phalanxes helps to improve the overall performance of the ensemble: if one phalanx fails to rank a homologous protein well, the other phalanx ranks that protein high up, and so does their ensemble.
The logistic regression models corresponding to the phalanxes are applied to the test proteins, and the probability of homogeneity vectors are recorded. These vectors are averaged over the phalanxes to obtain a probability of homogeneity vector for the ensemble of models/phalanxes (EM-APR). The vectors of probabilities are submitted to the KDD Cup website to obtain numbers for APR. The column of Table 3 shows average precisions of , , and for phalanxes , , and EM-APR, respectively. It is evident that the ensemble secured a very good average precision of and outperformed the individual phalanxes.
It is the diversity and strength of the individual phalanxes that helped to obtain such a good ensemble EM-APR. The metrics APR and RKL measure ranking performance at opposite ends of the ranked list. The next subsection is intended to show that the algorithm of subset formation generalizes to other metrics, such as RKL, where the algorithm is optimized by minimizing the metric.
6.2 Ensemble of Models - Optimized on RKL
Optimizing the algorithm of subset formation by minimizing rank last (RKL) resulted in two phalanxes. Ten-fold cross-validated training RKLs for the two phalanxes and for their ensemble (EM-RKL) are shown in column of Table 4. The RKLs for phalanxes , and EM-RKL are , and , respectively. As smaller values of RKL imply better predictive ranking, the ensemble improves over the individual phalanxes.
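RKL is simply the rank position of the lowest-ranked homologous protein in the sorted list; a perfect ranker places all homologs first, so RKL equals the number of homologs in the block. A minimal Python sketch with hypothetical scores and labels:

```python
def rank_last(scores, labels):
    """Rank (1-based, after sorting scores in decreasing order) of the last homolog."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return max(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)

# Hypothetical block: the two homologs land at ranks 1 and 2, so RKL = 2.
rkl = rank_last([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
```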
Figure 4 shows the scatter plots of cross-validated probabilities of homogeneity for phalanx 1 and phalanx 2. Panels (a) and (b) are for the homologous and non-homologous proteins, respectively, against the target native proteins. The first top-right, second top-right, and bottom-left corners of panel (a) contain , , and homologous proteins, respectively, of all of the homologous proteins in the training data. They contain , , and homologous proteins, respectively, of all the proteins within each cell. The numbers for the first and second top-right corners show that the combined strength of phalanx 1 and phalanx 2 is good, as these corners contain more homologous proteins than their counterparts in Figure 3. Note that the bottom-left corner contains a smaller proportion of homologous proteins than its counterpart in Figure 3. This shows that EM-RKL is good at handling bottom-ranked (i.e., remote) homologous proteins. However, EM-RKL also pulls non-homologous proteins high up the list (panel (b)).
Phalanxes and reflect a moderate level of diversity, as the bottom-right and top-left corners of panel (a) contain a handful of homologous proteins. We speculate that the strengths of phalanxes and may compensate for this moderate diversity and still build a strong ensemble.
The third column of Table 4 shows test results for phalanxes , and their ensemble EM-RKL. Phalanxes , and the ensemble EM-RKL produced RKLs of , and , respectively. The results show that the ensemble EM-RKL outperforms the individual models/phalanxes in terms of RKL.
6.3 Ensemble of Models and Metrics
We first show that the ensembles of models EM-APR and EM-RKL are diverse from each other, and then construct the ensemble of models and metrics EMM-APR&RKL as follows
This probability vector is used to compute evaluation metrics for EMM-APR&RKL.
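The aggregation above is again an elementwise average, now taken over the two ensembles rather than over phalanxes. A minimal sketch with hypothetical EM-APR and EM-RKL probability vectors (illustrative values only):

```python
import numpy as np

# Hypothetical probability-of-homogeneity vectors from the two ensembles.
p_em_apr = np.array([0.95, 0.40, 0.10, 0.70])
p_em_rkl = np.array([0.90, 0.60, 0.30, 0.80])

# Ensemble of models and metrics (EMM-APR&RKL): average the two ensembles.
p_emm = (p_em_apr + p_em_rkl) / 2
```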
The ensembles of models optimizing APR and RKL both provide good ranking performance for the test proteins. The reason for such good performance is that the ensembles aggregate over strong and diverse phalanxes. Here, we show that the two ensembles EM-APR and EM-RKL are diverse from each other too, and that ensembling across them improves ranking performance for the test proteins as well.
Columns to of Table 7 show the -fold cross-validated training performances of EM-APR, EM-RKL and their ensemble EMM-APR&RKL. The second column shows APRs of , and for EM-APR, EM-RKL and EMM-APR&RKL, respectively. As larger values of average precision imply better predictive ranking, the aggregated ensemble EMM-APR&RKL improves over the individual ensembles. The third column shows TOP1s of , and for EM-APR, EM-RKL and EMM-APR&RKL, respectively. As with average precision, the aggregated ensemble EMM-APR&RKL improves over the individual ensembles in terms of TOP1 as well. Smaller values of rank last imply better predictive ranking. The corresponding RKLs of EM-APR, EM-RKL and EMM-APR&RKL are , and , respectively. The aggregated ensemble EMM-APR&RKL shows improvement over EM-APR and EM-RKL.
Figure 5 shows scatter plots of probabilities of homogeneity for EM-APR and EM-RKL based on -fold cross-validation of the training blocks. Panels (a) and (b) are for the homologous and non-homologous proteins, respectively, against the target native proteins. The disagreement between EM-APR and EM-RKL is visible in this plot: EM-RKL assigns larger probabilities of homogeneity to the proteins than EM-APR. This disagreement may be considered as diversity in probabilities between EM-APR and EM-RKL. Table 5 makes the argument clearer.
Table 5 shows the number of wins of EM-APR and EM-RKL over each other in various ranking scenarios. The ranking scenarios are constructed from the ranks of the aggregated ensemble EMM-APR&RKL. Note that EMM-APR&RKL is the average of the two ensembles of models EM-APR and EM-RKL. The ranks are normalized to and in such a way that small ranks correspond to the top positions and large ranks to the bottom positions. The first column shows ranges of ranks based on three quartiles: , and . The winning numbers in the four scenarios are highlighted in dark gray. In the first scenario , where the homologous proteins are very easy to rank high, EM-APR wins over EM-RKL by a big margin. In the second scenario , where the homologous proteins are relatively easy to rank high, EM-APR wins over EM-RKL. In the third scenario , where the homologous proteins are relatively hard to rank high, EM-RKL wins over EM-APR. In the fourth scenario , where the homologous proteins are very hard to rank high, EM-RKL wins over EM-APR by a big margin. In a nutshell, EM-APR ranks easy-to-rank homologous proteins better than EM-RKL, and vice versa. Hence, the two ensembles of models EM-APR and EM-RKL are diverse from each other.
We now examine the robustness of the ranking performance of the aggregated ensemble EMM-APR&RKL. This robustness depends on the complementary behavior of EM-APR and EM-RKL in ranking homologous proteins. We first claim that when the homologous proteins are easy to rank at the top of the list, optimization through APR is preferable; when some of the homologous proteins are hard to rank, optimization through RKL is preferable. Four examples are provided below.
Table 6 shows cross-validated APR, TOP1 and RKL for EM-APR, EM-RKL and EMM-APR&RKL for the training blocks 95, 216, 96 and 238. In each block, the top performance among EM-APR and EM-RKL is marked by dark gray.
The homologous proteins in the two blocks and are relatively easy to rank high, as the largest normalized rank by any method is . In other words, the proteins in those two blocks are close homologs of their respective native proteins. In those blocks, EM-APR performs better than EM-RKL. The other two blocks, and , show difficulties in ranking homologous proteins: the smallest normalized rank by any method is . In other words, some of the proteins are distant homologs of their respective native proteins. In the latter two blocks, EM-RKL outperforms EM-APR.
1. Ranking a close homolog in Block . A small change in the ranks of the homologous proteins in the top positions causes a large change in APR, and an ensemble that optimizes APR provides better predictive ranking. The block contains candidate proteins, of which is homologous to the native protein. The ensembles EM-APR and EM-RKL rank this homologous protein in the first and eleventh positions, respectively. As a result, the APRs of EM-APR and EM-RKL are and , respectively. The difference between the two APRs is . On the other hand, the two RKLs are 1 and 11, respectively, and the difference between them is only .
2. Ranking close homologs in Block 216. The block contains candidate proteins, of which are homologous to the native protein. The ensemble EM-APR ranks the three homologous proteins in the first, second and fourth positions, whereas the other ensemble EM-RKL ranks them in the first, second and sixth positions. As a result of such ranking, the APRs for EM-APR and EM-RKL are and , respectively. The difference between the two APRs is . On the other hand, the RKLs for EM-APR and EM-RKL are 4 and 6, respectively, and the difference between them is only . The change in ranks of the homologous proteins is better reflected by APR than by RKL, so an ensemble that optimizes APR is recommended.
3. Ranking a distant homolog in Block 96. The block contains candidate proteins, of which are homologous to the native protein. The ensemble EM-APR ranks the two homologous proteins in the th and th positions, whereas EM-RKL ranks them in the th and th positions. Here, the homologous proteins ranked at the top and bottom of the list are close and distant homologs, respectively. The ranking performances of EM-APR and EM-RKL in terms of APR are and , respectively. The difference in ranking performance through APR is only . The RKLs of EM-APR and EM-RKL are and , respectively, and the difference in ranking performance is . One of the homologous proteins was hard to rank, and EM-RKL ranked it better than EM-APR, as the change in ranking performance was better reflected by RKL.
4. Ranking distant homologs in Block 238. This block contains candidate proteins, of which are homologous to the native protein. A handful of the proteins in this block are ranked high by both methods. However, some of the homologous proteins are not easy to rank high. That is, there are some close homologs and some distant homologs. For example, EM-APR and EM-RKL rank the last homologous protein in the th and th positions, respectively. The changes in ranking performance of EM-APR and EM-RKL are better reflected by RKL than by APR , and hence the ensemble of phalanxes that optimizes RKL provides better predictive ranking.
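The arithmetic behind the four examples can be sketched directly: a block with a single homolog at rank r has APR equal to 1/r, so moving a top-ranked homolog changes APR drastically while RKL barely moves, and moving a bottom-ranked homolog does the opposite. The ranks below are hypothetical, chosen only to mimic the two regimes:

```python
def apr_from_ranks(ranks):
    """APR of a block given the (1-based) ranks of its homologous proteins."""
    ranks = sorted(ranks)
    return sum((k + 1) / r for k, r in enumerate(ranks)) / len(ranks)

# Close-homolog regime: one homolog moves from rank 1 to rank 11.
d_apr_close = apr_from_ranks([1]) - apr_from_ranks([11])  # large change in APR
d_rkl_close = 11 - 1                                      # small change in RKL

# Distant-homolog regime: the last of two homologs moves from rank 400 to 850.
d_apr_far = apr_from_ranks([2, 400]) - apr_from_ranks([2, 850])  # tiny change in APR
d_rkl_far = 850 - 400                                            # large change in RKL
```

APR reacts strongly to movement near the top of the list and RKL to movement near the bottom, matching the block-level behavior of EM-APR and EM-RKL.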
The two metrics APR and RKL are complementary: the former focuses on the top of the list and the latter on the bottom. If a block contains homologous proteins that are ranked near the top of the sequence, we recommend optimizing the APF by maximizing APR. On the other hand, if a block contains some homologous proteins that are ranked far down the sequence, we recommend optimizing the APF by minimizing RKL. In the first and second situations, EM-APR and EM-RKL, respectively, would rank better than the other. However, we do not know the difficulty level of ranking homologous proteins in the test blocks in advance. How, then, can one robustify the performance of EM-APR and EM-RKL? Our proposal is to aggregate over EM-APR and EM-RKL.
Table 6 shows cross-validated APR, TOP1 and RKL for EMM-APR&RKL as well. For the easy-to-rank blocks and , the performance of EMM-APR&RKL is exactly the same as the top performer EM-APR. For the hard-to-rank blocks and , the performance of EMM-APR&RKL is closer to the top performer EM-RKL. As we have both hard-to-rank and easy-to-rank blocks in this protein homology data, aggregation of EM-APR and EM-RKL is recommended.
The results of EM-APR, EM-RKL and EMM-APR&RKL for the test proteins are produced in columns to of Table 7. The fifth column shows APRs of , and for EM-APR, EM-RKL and EMM-APR&RKL, respectively. The ensemble EMM-APR&RKL improves over the individual ensembles. The sixth column shows their TOP1s of , and , respectively: EMM-APR&RKL retains the result of the best-performing individual ensemble. The seventh column shows the results in terms of RKLs of , and : EMM-APR&RKL outperforms the top-performing individual ensemble. The results on the -fold cross-validated training data resemble those on the test data: averaging ensembles across APR and RKL improves performance over the individual ensembles.
The test performances of EM-APR, EM-RKL and EMM-APR&RKL are compared with the winners of the knowledge discovery and data mining (KDD) Cup competition in Table 8. Four research groups (Weka, ICT.AC.CN, MEDai/Al Insight and PG445 UniDo) won the protein homology section of the competition, with Weka the overall winner among the four. Table 8 includes results for all four winners: the first part of the table shows results for EM-APR, EM-RKL and EMM-APR&RKL, and the central part shows results for the winners of the competition.
The results are also ranked. For example, the performances of the methods specific to APR are ranked, and the numbers are produced in the third column. The top-performing method (EMM-APR&RKL) is ranked first, followed by the second-best method (ICT.AC.CN), and so on. The ranks of the methods corresponding to TOP1 and RKL are produced independently in columns and , respectively. For TOP1, the best three methods (EMM-APR&RKL, EM-APR and MEDai/Al Insight) are tied and receive an average rank of (the average of , and ). The th column shows the average rank of each method across columns , and .
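The tie handling described here is the standard averaged-rank convention. A small Python sketch with hypothetical metric values (three methods tied at the best TOP1):

```python
def tied_ranks(values, larger_is_better=True):
    """Rank values 1..n (1 = best); tied entries get the average of their positions."""
    order = sorted(values, reverse=larger_is_better)
    return [
        sum(i + 1 for i, w in enumerate(order) if w == v)
        / sum(1 for w in order if w == v)
        for v in values
    ]

# Three methods tied at 1.0 share positions 1, 2 and 3, each receiving (1+2+3)/3 = 2.
ranks = tied_ranks([1.0, 1.0, 1.0, 0.9, 0.8])
```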
As the metrics change, the performances of the methods vary. Among the four winners of the KDD Cup competition, the top performing methods in terms of APR, TOP1 and RKL were ICT.AC.CN, MEDai/Al Insight and PG445 UniDo, respectively. Surprisingly, the top performing method in terms of RKL (PG445 UniDo) was found at the bottom in terms of APR and TOP1.
The smaller the overall rank in column , the better the method. The lowest average rank corresponds to EMM-APR&RKL, followed by Weka and MEDai/Al Insight, both with rank . The method EMM-APR&RKL achieves the largest APR () and is tied for the largest TOP1 (). This method ranks in terms of RKL among all seven methods. While the performances of EM-APR and EM-RKL are marginally behind the winners of the KDD Cup competition, the ensemble of the two methods outperforms the four winners.
The bottom part of Table 8 shows results from random forests (RF; Breiman, 2001) and from the ensemble of phalanxes based on random forests (EPX-RF; Tomal et al., 2015). The ensemble of models and metrics based on the logistic regression model provides better results than both RF and EPX-RF.
Note that people can still submit new predictions to the KDD Cup and get results back. Hence, it is important to report that, as of November 04, 2016, no new submissions are better than, or even equal to, our results in terms of APR () and TOP1 ().
6.4 Computational Gain of EM-LRM over EPX-RF
The computational time for the ensemble of models based on the logistic regression model (EM-LRM) is much smaller than that for the ensemble of phalanxes using random forests (EPX-RF). A total of processors of the bugaboo machine (http://www.westgrid.ca/support/quickstart/bugaboo, accessed November 06, 2016) in the Western Canada Research Grid (WestGrid) computing network were used in parallel, with gigabyte of memory allocated to each processor. The computation times to train and test EPX-RF and EM-LRM on the protein homology data were 1569 and 80 minutes, respectively; that is, EM-LRM ran roughly 20 times faster than EPX-RF.
Through parallel computation, the computing time of EM-LRM was brought down to a level similar to that of a random forest. For the RF, the task of training and testing was completed using processor with gigabyte of memory; the total running time was minutes. Note that one can easily assign more processors to further reduce the computation time of EM-LRM.
7 Summary and Conclusion
An ensemble is an aggregated collection of models. To build a powerful ensemble, the constituent models need to be strong and diverse. In this article, the term strong refers to the predictive ranking ability of a model. The proposed algorithm of subset/phalanx formation clusters a number of predictor variables into a subset, and the variables in a subset make the model strong in terms of predictive ranking. On the other hand, two or more models are diverse in the sense that they predict different sets of homologous proteins. The algorithm keeps putting predictor variables into subsets as long as the predictors in a subset help each other in a model, and generally ends up with more than one subset of predictor variables. The subsets are diverse from each other and predict different sets of homologous proteins well. In a nutshell, the ensemble of models is equipped with methods to search for good variables and put them together into a subset to increase its strength, and to cluster variables into different subsets to induce diversity between the models.
One nice property of the ensemble of models is the filtering of weak variables. A variable is considered weak if it shows poor predictive performance individually in a model, jointly when paired with another variable in a model, and jointly when ensembled with another variable in different models. The algorithm also filters out unhelpful subsets/phalanxes of variables. A pair of subsets is unhelpful if ensembling them yields weaker predictive ranking than the better of the two individual subsets/phalanxes. A subset of variables is filtered out if it appears unhelpful with any other subset used in the final ensemble. The merging/clustering step of the algorithm of phalanx formation is thus sandwiched between the filtering of weak variables and the filtering of weak subsets of variables.
A less-highlighted property of our ensemble is model selection. Regular model selection methods, such as forward stepwise selection, backward elimination and regularization (Hastie et al., 2009), select a subset of variables and build one model. In contrast, our ensemble selects a collection of models and aggregates them. Each of those models can be considered an alternative to the single model selected by a regular model selection method. Furthermore, the constituent models in our ensemble predict diverse collections of homologous proteins, which is similar to examining a problem from different vantage points to gain a clearer perspective of a situation.
The old ensemble of phalanxes uses random forests as its base learner. That ensemble was also applied to the protein homology data, and its results were better than those of the base learner, random forests. Our ensemble, which uses the logistic regression model, outperforms the old ensemble, and its results are comparable to those of the winners of the KDD Cup competition. These findings are encouraging and favor using any suitable base learner, such as classification trees/recursive partitioning (Breiman et al., 1984) or neural networks (McCulloch and Pitts, 1943; Widrow and Hoff, 1960; Rosenblatt, 1962), whichever is reasonable for an application.
Our ensemble of models is independently optimized by maximizing average precision and by minimizing rank last. This exemplifies the fact that the algorithm of subset formation is flexible with respect to the evaluation metric. The choice of evaluation metric will depend on the specific application and on the merits of the chosen metric in quantifying the overall performance of the ensemble. These facts drive us towards a new goal: extending the ensemble of phalanxes to linear regression and multi-class classification, optimizing evaluation metrics such as mean-squared loss and misclassification error.
To induce diversity between two ensembles of models, the algorithm was optimized using two complementary evaluation metrics, APR and RKL. The metric APR was useful for ranking close homologs, and RKL was useful for ranking distant homologs. The aggregated ensemble provided more robust ranking performance than its constituents EM-APR and EM-RKL. It would be interesting to examine whether such improvement can also be achieved by aggregating ensembles of models built on different base learners. At this point it is not clear how diversity would be induced through different base learners, but this appears to be an interesting research question.
We thank Professor Ron Elber, University of Texas at Austin, Texas, USA, for helping us understand the feature variables of the protein homology data. We acknowledge support for this project from the Natural Sciences and Engineering Research Council of Canada (NSERC grants RGPIN-2014-04962 and RGPIN-2014-05227). The WestGrid computational facilities of Compute Canada are also gratefully acknowledged.
- Bhusari, V. and Patil, S. (2011) Application of hidden Markov model in credit card fraud detection. International Journal of Distributed and Parallel Systems (IJDPS), 2, 203–211.
- Bolton, R. J. and Hand, D. J. (2002) Statistical fraud detection: A review. Statistical Science, 17, 235–249.
- Bork, P. and Koonin, E. V. (1998) Predicting functions from protein sequences - where are the bottlenecks? Nature Genetics, 18, 313–318.
- Breiman, L. (1996) Bagging predictors. Machine Learning, 24, 123–140.
- Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.
- Breiman, L., Friedman, J., Stone, C. and Olshen, R. A. (1984) Classification and Regression Trees. Chapman and Hall/CRC.
- Cortes, C. and Vapnik, V. (1995) Support-vector networks. Machine Learning, 20, 273–297.
- Dietterich, T. G. (2000) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139–157.
- Eddy, S. R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
- Eddy, S. R. (2011) Accelerated profile HMM searches. PLOS Computational Biology, 7, 1–16.
- Fienberg, S. E. and Shmueli, G. (2005) Statistical issues and challenges associated with rapid detection of bio-terrorist attacks. Statistics in Medicine, 24, 513–529.
- Freund, Y. and Schapire, R. E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–193.
- Gordon, M. and Pathak, P. (1999) Finding information on the world wide web: The retrieval effectiveness of search engines. Information Processing and Management: An International Journal, 35, 141–180.
- Gupta, R., Audhkhasi, K. and Narayanan, S. (2014) Training ensemble of diverse classifiers on feature subsets. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2951–2955.
- Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edn.
- Henn-Sax, M., Höcker, B., Wilmanns, M. and Sterner, R. (2001) Divergent evolution of -barrel enzymes. Biological Chemistry, 382, 1315–1320.
- Hochreiter, S., Heusel, M. and Obermayer, K. (2007) Fast model-based protein homology detection without alignment. Bioinformatics, 23, 1728–1736.
- Karplus, K., Barrett, C. and Hughey, R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846–856.
- Kaushik, S., Nair, A. G., Mutt, E., Subramanian, H. P. and Sowdhamini, R. (2016) Rapid and enhanced remote homology detection by cascading hidden Markov model searches in sequence space. Bioinformatics, 32, 338–344.
- Kinch, L. N., Wrabl, J. O., Sri Krishna, S., Majumdar, I., Sadreyev, R. I., Qi, Y., Pei, J., Cheng, H. and Grishin, N. V. (2003) CASP5 assessment of fold recognition target predictions. Proteins: Structure, Function, and Bioinformatics, 53, 395–409.
- Koonin, E. V. and Galperin, M. Y. (2003) SEQUENCE - EVOLUTION - FUNCTION: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic.
- Lippmann, R. P. and Cunningham, R. K. (2000) Improving intrusion detection performance using keyword selection and neural network. Computer Networks, 34, 597–603.
- Ma, J., Wang, S., Wang, Z. and Xu, J. (2014) MRFalign: Protein homology detection through alignment of Markov random fields. PLoS Computational Biology, 10, 1–12.
- McCulloch, W. S. and Pitts, W. (1943) A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5, 115–133.
- Melville, P. and Mooney, R. J. (2003) Constructing diverse classifier ensembles using artificial training examples. In Proceedings of IJCAI-2003, 505–510.
- Peeters, J. (2006) Method and apparatus for wide area surveillance of a terrorist or personal threat. URL: https://www.google.com/patents/US7109859. US Patent 7,109,859.
- Pfahringer, B. (2004) The Weka solution to the 2004 KDD Cup. ACM SIGKDD Explorations Newsletter, 6, 117–119.
- Rosenblatt, F. (1962) Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, Washington, D.C.
- Söding, J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951–960.
- Tomal, J. H., Welch, W. J. and Zamar, R. H. (2015) Ensembling classification models based on phalanxes of variables with applications in drug discovery. Annals of Applied Statistics, 9, 69–93.
- Tomal, J. H., Welch, W. J. and Zamar, R. H. (2016) Exploiting multiple descriptor sets in QSAR studies. Journal of Chemical Information and Modeling, 56, 501–509.
- Vallat, B. K., Pillardy, J. and Elber, R. (2008) A template-finding algorithm and a comprehensive benchmark for homology modeling of proteins. Proteins, 72, 910–928.
- Vallat, B. K., Pillardy, J., Májek, P., Meller, J., Blom, T., Cao, B. and Elber, R. (2009) Building and assessing atomic models of proteins from structural templates: Learning and benchmarks. Proteins, 76, 930–945.
- Wang, Y. M. (2005) Statistical Methods for High Throughput Screening Drug Discovery Data. Ph.D. thesis, University of Waterloo, Waterloo, ON, Canada.
- Widrow, B. and Hoff, M. (1960) Adaptive switching circuits. IRE WESCON Convention Record, 4, 96–104. Reprinted in Andersen and Rosenfeld (1988).
- Zhu, M. (2004) Recall, precision and average precision. Tech. rep., Department of Statistics and Actuarial Science, University of Waterloo, ON, Canada.