On the bias of H-scores for comparing biclusters, and how to correct it

07/24/2019, by Jacopo Di Iorio et al., Penn State University and Politecnico di Milano

In the last two decades, several biclustering methods have been developed as new unsupervised learning techniques to simultaneously cluster rows and columns of a data matrix. These algorithms play a central role in contemporary machine learning and in many applications, e.g. in computational biology and bioinformatics. The H-score is the evaluation score underlying the seminal biclustering algorithm by Cheng and Church, as well as many other subsequent biclustering methods. In this paper, we characterize a potentially troublesome bias in this score that can distort biclustering results. We prove, both analytically and by simulation, that the average H-score increases with the number of rows/columns in a bicluster. This makes the H-score, and hence all algorithms based on it, biased towards small biclusters. Based on our analytical proof, we are able to provide a straightforward way to correct this bias, allowing users to accurately compare biclusters.


1 Introduction

The H-score (or Mean Squared Residue score, MSR) underlies Cheng and Church's biclustering algorithm (2000), one of the best-known and most widely employed algorithms in bioinformatics and computational biology, as well as many subsequent algorithms (e.g., FLOC, Yang et al. 2005, and Huang et al. 2011). Cheng and Church's algorithm has ~2400 citations to date, 597 since 2015, and 179 in 2018-19 alone. It was the first to be applied to gene microarray data, and it is one of the main tools available in biclustering packages (e.g., the "biclust" R library) as well as in gene expression data analysis packages (e.g., IRIS-EDA, Monier et al. 2019). In addition, it is widely used as a benchmark: almost all published biclustering algorithms include a comparison with it. The role of the H-score in a biclustering algorithm is to allow validation and comparison of biclusters, which may have different numbers of rows and columns. Our findings document a bias that can distort biclustering results. We prove, both analytically and by simulation, that the average H-score increases with the number of rows/columns in a bicluster, even in the "ideal" (and simplest) case of a single bicluster generated by an additive model plus white noise. This biases the H-score, and hence all H-score based algorithms, towards small biclusters. Importantly, our analytical proof provides a straightforward way to correct this bias.

2 H-scores as a measure of bicluster coherence

Cheng and Church (2000) were the first to introduce biclustering as a way to identify (possibly overlapping) subsets of genes and/or conditions showing high similarity in a gene expression data matrix. The H-score they proposed to measure (dis)similarity is defined as "the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance". Unlike measures employed by traditional clustering algorithms, the H-score is not a function of pairs of genes or conditions, but rather a quantification of the coherence of all genes and conditions within a bicluster. Let A = (a_{ij}) be an n × m data matrix. The H-score of the submatrix A_{IJ} identified by the pair of index subsets (I, J), with I ⊆ {1, …, n} and J ⊆ {1, …, m}, is defined as

H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2,

where a_{iJ}, a_{Ij} and a_{IJ} are the means of row i, column j, and of the whole submatrix A_{IJ}, respectively.
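For concreteness, the definition above can be computed in a few lines. The following sketch is our own helper, not code from the paper:

```python
import numpy as np

def h_score(A, rows, cols):
    """H-score (mean squared residue) of the submatrix A[rows, cols],
    following the definition above; `rows` and `cols` play the roles
    of the index subsets I and J. (Helper name is ours.)"""
    S = A[np.ix_(rows, cols)]
    residuals = (S - S.mean(axis=1, keepdims=True)   # a_iJ: row means
                   - S.mean(axis=0, keepdims=True)   # a_Ij: column means
                   + S.mean())                       # a_IJ: overall mean
    return float((residuals ** 2).mean())

# A perfectly additive submatrix scores 0:
A = np.add.outer(np.array([1.0, 2.0, 3.0]), np.array([0.0, 5.0, 7.0, 9.0]))
print(h_score(A, [0, 1, 2], [0, 1, 2, 3]))  # 0.0
```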

An optimal bicluster is a submatrix with the lowest possible H-score (Madeira and Oliveira, 2004). This corresponds to a bicluster perfectly defined by the additive model a_{ij} = μ + α_i + β_j, where μ is the mean of the bicluster, and α_i and β_j are additive adjustments for rows and columns, respectively. As an example, in the case of gene expression data, such a bicluster is a group of genes with expression levels that tend to fluctuate in unison across a group of conditions. In general, an additive error term ε_{ij} is also present, leading to H(I,J) > 0 for the model a_{ij} = μ + α_i + β_j + ε_{ij}.

The algorithm proposed by Cheng and Church (2000) starts from the entire matrix A, iteratively deletes the rows and/or columns that contribute most to the H-score, and stops when the current submatrix has H(I,J) ≤ δ, a given threshold. This identifies a so-called δ-bicluster. To find additional δ-biclusters, the procedure is repeated after replacing the entries of the prior δ-bicluster(s) with random numbers. Following Cheng and Church (2000), many other biclustering algorithms based on the H-score have appeared in the literature. For example, Yang et al. (2005) proposed FLOC, a probabilistic algorithm that simultaneously identifies a set of (possibly overlapping) biclusters with low H-scores. The procedure iteratively reduces the H-scores of randomly initialized biclusters until the overall biclustering quality stops improving. Angiulli et al. (2008) proposed an algorithm based on a greedy technique combined with a local search strategy to escape poor local minima; it also employs the H-score, together with the row (gene) variance and the size of the bicluster. The Reactive GRASP Biclustering algorithm (RGRASP-B, Dharan and Nair, 2009) also uses the H-score to evaluate bicluster quality, and the algorithm in Bryan et al. (2006) uses a modified version of it. The algorithms cited here are only a small subset of those that rely on the H-score to validate and evaluate results (see, e.g., Pontes et al., 2015, for an extensive review).
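The greedy deletion step just described can be sketched as follows. This is a deliberately simplified version (the full Cheng-Church algorithm also includes multiple-node deletion and node-addition phases; function and variable names are ours):

```python
import numpy as np

def delta_bicluster(A, delta):
    """Greedy single-node deletion sketch: repeatedly drop the single
    row or column contributing most to the H-score, stopping once
    H(I, J) <= delta."""
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    while len(rows) > 2 and len(cols) > 2:
        S = A[np.ix_(rows, cols)]
        R = S - S.mean(axis=1, keepdims=True) - S.mean(axis=0, keepdims=True) + S.mean()
        if (R ** 2).mean() <= delta:          # current H-score below threshold
            break
        row_scores = (R ** 2).mean(axis=1)    # each row's mean squared residue
        col_scores = (R ** 2).mean(axis=0)    # each column's mean squared residue
        if row_scores.max() >= col_scores.max():
            rows.pop(int(row_scores.argmax()))
        else:
            cols.pop(int(col_scores.argmax()))
    return rows, cols

# On a perfectly additive matrix the full matrix is already a delta-bicluster:
A = np.add.outer(np.arange(5.0), np.arange(6.0))
print(delta_bicluster(A, 1e-9))  # ([0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5])
```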

Figure 3: (a) H-score as a function of the number of rows (columns), in a single bicluster of 10 columns (rows) generated by an additive model plus white noise. Red dots indicate average H-scores. (b) Ratio \bar{H}(n+1)/\bar{H}(n) between average H-scores, according to simulation (black dots) and Theorem 1 (red crosses).

3 Bias of H-scores for bicluster comparison

Consider a bicluster generated by the additive model a_{ij} = μ + α_i + β_j + ε_{ij}. As mentioned above, the error term ε_{ij} renders H(I,J) > 0. However, in addition to the amount of noise, the H-score also depends on size, i.e. on the number of rows and columns in the bicluster. Figure 3(a) shows that the average H-score \bar{H}(n) (\bar{H}(m)) of all possible submatrices of the bicluster having a fixed number of columns (rows) and n rows (m columns) increases with n (m). In particular, the relationship between \bar{H}(n) and \bar{H}(n+1) is expressed by the following Theorem, whose proof can be found in the Appendix. An analogous result holds for \bar{H}(m) and \bar{H}(m+1).

Theorem 1.

Let A_{IJ} be an N × M bicluster generated by the additive model a_{ij} = μ + α_i + β_j + ε_{ij}. Let \bar{H}(n) be the average H-score of all the possible sub-matrices of A_{IJ} having n rows and a fixed number of columns M. Then, for 2 ≤ n ≤ N−1,

\frac{\bar{H}(n+1)}{\bar{H}(n)} = \frac{n^2}{n^2-1}.   (1)

Analogously, we have

\frac{\bar{H}(m+1)}{\bar{H}(m)} = \frac{m^2}{m^2-1},   (2)

with \bar{H}(m) the average H-score of all the possible sub-matrices of A_{IJ} having a fixed number of rows N and m columns, and 2 ≤ m ≤ M−1.

Focusing on rows (an identical reasoning holds for columns), we thus have that the ratio \bar{H}(n+1)/\bar{H}(n) is fully determined by the number of rows n. From Theorem 1 it also follows that

\frac{\bar{H}(n)}{\bar{H}(2)} = \prod_{k=2}^{n-1} \frac{k^2}{k^2-1},   (3)

and therefore that knowing \bar{H}(2) is sufficient to compute \bar{H}(n) for every n. Since n^2/(n^2-1) > 1 for every n ≥ 2, the sequence \bar{H}(n) is increasing; for n → ∞ we obtain

\lim_{n\to\infty} \frac{\bar{H}(n)}{\bar{H}(2)} = \prod_{k=2}^{\infty} \frac{k^2}{k^2-1}.   (4)

The infinite product in (4) converges, and in particular \lim_{n\to\infty} \bar{H}(n) = 2\bar{H}(2). This is due to the fact that the partial products telescope, \prod_{k=2}^{n-1} k^2/(k^2-1) = 2(n-1)/n → 2. As a consequence, the H-score bias is at most a factor of 2, and it becomes small for comparisons between biclusters of large size (Table 1). We also observe that the value \bar{H}(2) only depends on the variance of the error term ε_{ij}, and not on the values of μ, α_i and β_j, nor on the distribution of ε_{ij} (see Table 1, Figures 8-13).
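The telescoping product behind Equations (3)-(4) is easy to check numerically:

```python
from math import prod

# The cumulative bias factor prod_{k=2}^{n-1} k^2/(k^2-1) equals
# 2(n-1)/n, which increases toward 2 as n grows.
for n in [3, 5, 10, 100, 1000]:
    p = prod(k**2 / (k**2 - 1) for k in range(2, n))
    assert abs(p - 2 * (n - 1) / n) < 1e-9
    print(n, round(p, 6))
```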

Table 1: Average H-scores and ratios for different models (with/without row and column differential terms), errors (Gaussian, Uniform) and bicluster sizes.
Figure 8: (a), (c) H-score as a function of the number of rows (columns), in a single bicluster of 10 columns (rows). Red dots indicate average H-scores. (b), (d) Ratio \bar{H}(n+1)/\bar{H}(n) between average H-scores, according to simulation (black dots) and Theorem 1 (red crosses). The bicluster is generated by an additive model plus noise, in two error scenarios: (a)-(b) Gaussian errors; (c)-(d) uniform errors.
Figure 13: (a), (c) H-score as a function of the number of rows (columns), in a single bicluster of 10 columns (rows). Red dots indicate average H-scores. (b), (d) Ratio \bar{H}(n+1)/\bar{H}(n) between average H-scores, according to simulation (black dots) and Theorem 1 (red crosses). The bicluster is generated by an additive model plus noise, in two error scenarios: (a)-(b) Gaussian errors; (c)-(d) uniform errors.

4 Recommendations

Our results show that employing the H-score to compare biclusters with different numbers of rows or columns can lead to biased conclusions. While this bias is small and likely inconsequential for large biclusters, it can be substantial and rather misleading for small biclusters. Suppose one is comparing biclusters with n_1 and n_2 > n_1 rows; Equation (3) shows that this bias can be straightforwardly corrected by normalizing the H-score ratio by the factor \prod_{k=n_1}^{n_2-1} k^2/(k^2-1) (an identical reasoning holds for columns). Notably, this correction should also be employed to adjust the H-score thresholds δ when finding δ-biclusters. Considering the seminal role and ubiquity of H-scores in biclustering algorithms, and the importance of biclustering algorithms in bioinformatics and computational biology, we believe this bias should be taken into serious consideration. The correction we propose is simple to implement, and could help shape the conclusions and insights provided by a broad range of applications.
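In practice the correction amounts to dividing by a cumulative factor. A minimal sketch, assuming the row-wise case (function names are ours):

```python
def bias_factor(n1, n2):
    """Cumulative H-score bias between biclusters with n1 and n2 rows
    (2 <= n1 <= n2), i.e. prod_{k=n1}^{n2-1} k^2/(k^2-1); by telescoping
    this equals n1*(n2-1)/(n2*(n1-1))."""
    assert 2 <= n1 <= n2
    f = 1.0
    for k in range(n1, n2):
        f *= k**2 / (k**2 - 1)
    return f

def corrected_h(h, n, n_ref=2):
    """Rescale the H-score of an n-row bicluster to the n_ref-row scale,
    removing the size bias before comparison."""
    return h / bias_factor(n_ref, n)

# Comparing a 2-row bicluster with an arbitrarily large one inflates
# the average H-score by at most a factor of 2:
print(bias_factor(2, 10))     # ~1.8
print(bias_factor(2, 10**6))  # ~1.999998
```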

Funding

This work was partially funded by the Eberly College of Science, the Institute for Cyberscience and the Huck Institutes of the Life Sciences (Penn State University); NSF award DMS-1407639; and Tobacco Settlement and CURE funds of the PA Department of Health (the Dept. specifically disclaims responsibility for any analyses, interpretations or conclusions).

References

  • Angiulli et al. (2008) Angiulli, F., E. Cesario, and C. Pizzuti (2008). Random walk biclustering for microarray data. Information Sciences 178(6), 1479–1497.
  • Bryan et al. (2006) Bryan, K., P. Cunningham, and N. Bolshakova (2006). Application of simulated annealing to the biclustering of gene expression data. IEEE Transactions on Information Technology in Biomedicine 10(3), 519–525.
  • Cheng and Church (2000) Cheng, Y. and G. M. Church (2000). Biclustering of expression data. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, CA, pp. 93–103.
  • Dharan and Nair (2009) Dharan, S. and A. S. Nair (2009). Biclustering of gene expression data using reactive greedy randomized adaptive search procedure. BMC Bioinformatics 10(1), S27.
  • Huang et al. (2011) Huang, Q., D. Tao, X. Li, and A. Liew (2011). Parallelized evolutionary learning for detection of biclusters in gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9(2), 560–570.
  • Madeira and Oliveira (2004) Madeira, S. C. and A. L. Oliveira (2004). Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1(1), 24–45.
  • Monier, B., A. McDermaid, C. Wang, J. Zhao, A. Miller, A. Fennell, and Q. Ma (2019). IRIS-EDA: An integrated RNA-Seq interpretation system for gene expression data analysis. PLOS Computational Biology 15(2), 1–15.
  • Pontes, B., R. Giráldez, and J. S. Aguilar-Ruiz (2015). Biclustering on expression data: A review. Journal of Biomedical Informatics 57, 163–180.
  • Yang, J., H. Wang, W. Wang, and P. S. Yu (2005). An improved biclustering method for analyzing gene expression profiles. International Journal on Artificial Intelligence Tools 14(05), 771–789.

Appendix: Proof of Theorem 1

Proof.

Given a bicluster A_{IJ}, where I is a set of rows and J is a set of columns, the mean squared residue score is defined as

H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2,

where we have

a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \quad a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}, \quad a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij}.

H(I,J) can then be rewritten in the following way:

H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} r_{ij}^2, \quad \text{with } r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}.

Without loss of generality, we have I = {1, …, N} and J = {1, …, M}, so

H(I,J) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} r_{ij}^2.

Let us notice that r_{ij} is a weighted sum of all the elements in the bicluster: the coefficient of a_{ij} itself is (1 - 1/N)(1 - 1/M); the coefficient of a_{il} with l ≠ j (same row) is -(1/M)(1 - 1/N); the coefficient of a_{kj} with k ≠ i (same column) is -(1/N)(1 - 1/M); and the coefficient of a_{kl} with k ≠ i and l ≠ j is 1/(NM). Then r_{ij}^2 is a weighted sum of all the squared elements, and of their double products. Therefore H(I,J) is also a weighted sum of all the squared elements in the bicluster, and of their double products.

Let us calculate c_s, the coefficient referring to a generic squared element a_{ij}^2 in H(I,J). When computing r_{ij}^2, we obtain the coefficient (1 - 1/N)^2(1 - 1/M)^2 corresponding to a_{ij}^2. From r_{il}^2 with l ≠ j (same row), we have (1/M^2)(1 - 1/N)^2, hence \frac{M-1}{M^2}(1 - 1/N)^2 in total. Similarly, from r_{kj}^2 with k ≠ i (same column), we have (1/N^2)(1 - 1/M)^2, hence \frac{N-1}{N^2}(1 - 1/M)^2 in total. Finally, from r_{kl}^2 with k ≠ i and l ≠ j (elements outside the row i and the column j), we have 1/(N^2M^2), hence \frac{(N-1)(M-1)}{N^2M^2} in total. As a consequence, the coefficient referring to the generic squared element in H(I,J) is

c_s = \frac{1}{NM}\left[(1 - 1/N)^2(1 - 1/M)^2 + \frac{M-1}{M^2}(1 - 1/N)^2 + \frac{N-1}{N^2}(1 - 1/M)^2 + \frac{(N-1)(M-1)}{N^2M^2}\right] = \frac{(N-1)(M-1)}{N^2M^2}.

Now let us focus on the double products in H(I,J). There are three kinds of double products, according to the positions of the two elements involved: the case in which the two elements belong to the same row, the case in which they belong to the same column, and the case in which they belong to different rows and columns.

Case 1: same row. Let c_r be the coefficient corresponding to the double product of two elements a_{ij} and a_{il} belonging to the same row i and different columns j ≠ l. From r_{ij}^2 and r_{il}^2 we have -\frac{2}{M}(1 - 1/N)^2(1 - 1/M) each. From r_{ik}^2 (same row) with k ≠ j, l we get \frac{2}{M^2}(1 - 1/N)^2, leading to \frac{2(M-2)}{M^2}(1 - 1/N)^2 in total. From r_{kj}^2 (same column of the first element) and r_{kl}^2 (same column of the second element) with k ≠ i we get -\frac{2}{N^2M}(1 - 1/M), leading to -\frac{4(N-1)}{N^2M}(1 - 1/M) in total. Finally, from all the other r_{km}^2 with k ≠ i and m ≠ j, l we have \frac{2}{N^2M^2}, for a total of \frac{2(N-1)(M-2)}{N^2M^2}. Hence the coefficient is

c_r = \frac{1}{NM}\left[-\frac{4}{M}(1 - 1/N)^2(1 - 1/M) + \frac{2(M-2)}{M^2}(1 - 1/N)^2 - \frac{4(N-1)}{N^2M}(1 - 1/M) + \frac{2(N-1)(M-2)}{N^2M^2}\right] = -\frac{2(N-1)}{N^2M^2}.

Case 2: same column. Considering the symmetric nature of the H-score formulation, the same calculations explained in the case of elements belonging to the same row work for the case of double products of elements belonging to the same column. Hence the coefficient referring to the double product of two elements a_{ij} and a_{kj} belonging to the same column j and different rows i ≠ k is

c_c = -\frac{2(M-1)}{N^2M^2}.

Case 3: different rows and columns. Let c_d be the coefficient of the double product of two elements a_{ij} and a_{kl} which belong to different rows i ≠ k and different columns j ≠ l. From r_{ij}^2 and r_{kl}^2 we have \frac{2}{NM}(1 - 1/N)(1 - 1/M) each. From r_{il}^2 and r_{kj}^2 we also get \frac{2}{NM}(1 - 1/N)(1 - 1/M) each. From r_{im}^2 and r_{km}^2 with m ≠ j, l (same row as one of the two elements) we have -\frac{2}{NM^2}(1 - 1/N), leading to -\frac{4(M-2)}{NM^2}(1 - 1/N) in total; similarly, from r_{pj}^2 and r_{pl}^2 with p ≠ i, k (same column as one of the two elements) we have -\frac{2}{N^2M}(1 - 1/M), leading to -\frac{4(N-2)}{N^2M}(1 - 1/M) in total. Finally, from all the other r_{pm}^2 with p ≠ i, k and m ≠ j, l we have \frac{2}{N^2M^2}, for a total of \frac{2(N-2)(M-2)}{N^2M^2}. Hence the coefficient is

c_d = \frac{1}{NM}\left[\frac{8}{NM}(1 - 1/N)(1 - 1/M) - \frac{4(M-2)}{NM^2}(1 - 1/N) - \frac{4(N-2)}{N^2M}(1 - 1/M) + \frac{2(N-2)(M-2)}{N^2M^2}\right] = \frac{2}{N^2M^2}.
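Since H(I,J) is a homogeneous quadratic form in the entries of the bicluster, the four coefficients can be checked numerically by evaluating the score on indicator matrices. A sketch of this sanity check (helper names are ours):

```python
import numpy as np

def h(S):
    """H-score of a full matrix S."""
    R = S - S.mean(axis=1, keepdims=True) - S.mean(axis=0, keepdims=True) + S.mean()
    return (R ** 2).mean()

N, M = 5, 7
# E(i, j): indicator matrix with a single 1 at position (i, j)
E = lambda i, j: np.eye(1, N * M, i * M + j).reshape(N, M)

c_s = h(E(0, 0))                      # squared-element coefficient
c_r = h(E(0, 0) + E(0, 1)) - 2 * c_s  # same row
c_c = h(E(0, 0) + E(1, 0)) - 2 * c_s  # same column
c_d = h(E(0, 0) + E(1, 1)) - 2 * c_s  # different row and column

assert np.isclose(c_s, (N - 1) * (M - 1) / (N * M) ** 2)
assert np.isclose(c_r, -2 * (N - 1) / (N * M) ** 2)
assert np.isclose(c_c, -2 * (M - 1) / (N * M) ** 2)
assert np.isclose(c_d, 2 / (N * M) ** 2)
```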

Let H_t(n) be the H-score of a submatrix of A_{IJ} composed of n rows and all the M columns. In a bicluster of N rows there are exactly \binom{N}{n} submatrices with n rows. Let \bar{H}(n) be their average H-score:

\bar{H}(n) = \frac{1}{\binom{N}{n}} \sum_{t=1}^{\binom{N}{n}} H_t(n),

where H_t(n) is the H-score of the t-th submatrix having n rows and M columns. Since each H_t(n) can be written as a weighted sum of the squared elements belonging to the bicluster and of their double products, so can \bar{H}(n).

Let us start by computing C_s(n), the coefficient referring to a generic squared element a_{ij}^2 in \bar{H}(n). Within each submatrix having n rows and M columns, this coefficient equals c_s with N replaced by n, i.e. \frac{(n-1)(M-1)}{n^2M^2}. It is useful to notice that the term a_{ij}^2 is present in H_t(n) if and only if the t-th submatrix contains the row i. Of the \binom{N}{n} different submatrices, only \binom{N-1}{n-1} present the row i. Therefore the coefficient is:

C_s(n) = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} \cdot \frac{(n-1)(M-1)}{n^2M^2} = \frac{n}{N} \cdot \frac{(n-1)(M-1)}{n^2M^2} = \frac{(n-1)(M-1)}{nNM^2}.

Now let C_r(n) be the coefficient in \bar{H}(n) referring to the double product of two elements a_{ij} and a_{il} belonging to the same row i and different columns j ≠ l. Considering the fact that a_{ij} and a_{il} belong to the same row i, there are exactly \binom{N-1}{n-1} submatrices composed of n rows containing the row i. Therefore the coefficient is:

C_r(n) = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} \cdot \left(-\frac{2(n-1)}{n^2M^2}\right) = -\frac{2(n-1)}{nNM^2}.

Now let C_c(n) be the coefficient in \bar{H}(n) referring to the double product of two elements a_{ij} and a_{kj} belonging to the same column j and different rows i ≠ k. Since a_{ij} and a_{kj} do not belong to the same row, there are exactly \binom{N-2}{n-2} submatrices composed of n rows containing both row i and row k, and \frac{\binom{N-2}{n-2}}{\binom{N}{n}} = \frac{n(n-1)}{N(N-1)}. Hence the coefficient is:

C_c(n) = \frac{\binom{N-2}{n-2}}{\binom{N}{n}} \cdot \left(-\frac{2(M-1)}{n^2M^2}\right) = -\frac{2(n-1)(M-1)}{nN(N-1)M^2}.

Finally, let C_d(n) be the coefficient in \bar{H}(n) referring to the double product of two elements a_{ij} and a_{kl} which belong to different rows i ≠ k and different columns j ≠ l. Since a_{ij} and a_{kl} do not belong to the same row, there are again exactly \binom{N-2}{n-2} submatrices composed of n rows presenting both row i and row k. Therefore the coefficient is:

C_d(n) = \frac{\binom{N-2}{n-2}}{\binom{N}{n}} \cdot \frac{2}{n^2M^2} = \frac{2(n-1)}{nN(N-1)M^2}.

Being interested in the relationship between \bar{H}(n) and \bar{H}(n+1), we compute C_s(n+1) (i.e. the coefficient of the generic squared element in \bar{H}(n+1)), C_r(n+1) (i.e. the coefficient referring to the double product of two elements belonging to the same row in \bar{H}(n+1)), C_c(n+1) (i.e. the coefficient referring to the double product of two elements belonging to the same column in \bar{H}(n+1)), and C_d(n+1) (i.e. the coefficient referring to the double product of two elements which belong to different rows and columns in \bar{H}(n+1)). We have:

C_s(n+1) = \frac{n(M-1)}{(n+1)NM^2}, \quad C_r(n+1) = -\frac{2n}{(n+1)NM^2}, \quad C_c(n+1) = -\frac{2n(M-1)}{(n+1)N(N-1)M^2}, \quad C_d(n+1) = \frac{2n}{(n+1)N(N-1)M^2}.

Then we have:

\frac{C_s(n+1)}{C_s(n)} = \frac{C_r(n+1)}{C_r(n)} = \frac{C_c(n+1)}{C_c(n)} = \frac{C_d(n+1)}{C_d(n)} = \frac{n^2}{n^2-1}.

Now let us consider \bar{H}(n) and \bar{H}(n+1) with 2 ≤ n ≤ N-1. Both are weighted sums of the squared elements of the bicluster and of their double products. Being the ratio of each pair of corresponding coefficients in \bar{H}(n+1) and \bar{H}(n) equal to \frac{n^2}{n^2-1}, then:

\frac{\bar{H}(n+1)}{\bar{H}(n)} = \frac{n^2}{n^2-1}.

Analogously, considering the average H-score \bar{H}(m) of all the submatrices of A_{IJ} composed of N rows and m columns, we obtain:

\frac{\bar{H}(m+1)}{\bar{H}(m)} = \frac{m^2}{m^2-1}. ∎
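Because every coefficient in the average H-score for n+1 rows is exactly n^2/(n^2-1) times the corresponding coefficient for n rows, the ratio in Theorem 1 holds as an algebraic identity for any data matrix, not merely in expectation. This can be verified by brute-force enumeration of submatrices (helper names are ours):

```python
import numpy as np
from itertools import combinations

def h(S):
    """H-score of a full matrix S."""
    R = S - S.mean(axis=1, keepdims=True) - S.mean(axis=0, keepdims=True) + S.mean()
    return (R ** 2).mean()

def h_bar(A, n):
    """Average H-score over all submatrices with n rows and all columns."""
    N = A.shape[0]
    return np.mean([h(A[list(rows), :]) for rows in combinations(range(N), n)])

# The identity holds exactly even for an arbitrary random matrix:
rng = np.random.default_rng(42)
A = rng.normal(size=(7, 5))
for n in range(2, 7):
    assert np.isclose(h_bar(A, n + 1) / h_bar(A, n), n**2 / (n**2 - 1))
```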