# Biclustering Via Sparse Clustering

In many situations it is desirable to identify clusters that differ with respect to only a subset of features. Such clusters may represent homogeneous subgroups of patients with a disease, such as cancer or chronic pain. We define a bicluster to be a submatrix U of a larger data matrix X such that the features and observations in U differ from those not contained in U. For example, the observations in U could have different means or variances with respect to the features in U. We propose a general framework for biclustering based on the sparse clustering method of Witten and Tibshirani (2010). We develop a method for identifying features that belong to biclusters. This framework can be used to identify biclusters that differ with respect to the means of the features, the variance of the features, or more general differences. We apply these methods to several simulated and real-world data sets and compare the results of our method with several previously published methods. The results of our method compare favorably with existing methods with respect to both predictive accuracy and computing time.

## Authors

• 14 publications
• 4 publications
• 23 publications
• 4 publications
• ### A robust and sparse K-means clustering algorithm

In many situations where the interest lies in identifying clusters one m...
01/29/2012 ∙ by Yumi Kondo, et al. ∙ 0

• ### Semi-supervised clustering methods

Cluster analysis methods seek to partition a data set into homogeneous s...
07/01/2013 ∙ by Eric Bair, et al. ∙ 0

• ### A semi-supervised sparse K-Means algorithm

We consider the problem of data clustering with unidentified feature qua...
03/16/2020 ∙ by Avgoustinos Vouros, et al. ∙ 0

• ### Identification of relevant subtypes via preweighted sparse clustering

Cluster analysis methods are used to identify homogeneous subgroups in a...
04/13/2013 ∙ by Sheila Gaynor, et al. ∙ 0

• ### A Simple Approach to Sparse Clustering

Consider the problem of sparse clustering, where it is assumed that only...
02/23/2016 ∙ by Ery Arias-Castro, et al. ∙ 0

• ### Scams in modern societies: how does China differ from the world?

We study a set of high-profile scams that were well engineered and have ...
12/29/2020 ∙ by Jeff Yan, et al. ∙ 0

• ### Sparse Proteomics Analysis - A compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data

Background: High-throughput proteomics techniques, such as mass spectrom...
06/11/2015 ∙ by Tim Conrad, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Unsupervised exploratory methods play an important role in the analysis of high-dimension low sample size (HDLSS) data, such as microarray gene expression data. Such data sets can be expressed in the form of a matrix X

, where each row corresponds to one observation each column corresponds to a feature. Unsupervised learning is a powerful tool for discovering interpretable structures within HDLSS data without reference to external information. In particular, clustering methods partition observations into subgroups based on their overall feature patterns. In many situations, these underlying subgroups may differ with respect to only a subset of the features. Such subgroups could be overlooked if one clusters using all the features.

Biclustering methods may be useful in situations where clusters are formed by only a subset of the features. Biclustering aims to identify sub-matrices U within the original data matrix X. The results may be visualized as two-dimensional signal blocks (after reordering the rows and columns) containing only a subset of the observations and features. For example, in a gene expression data set collected from cancer patients, there may exist a subset of genes whose expression levels differ among patients with a more aggressive form of cancer. Identifying such a bicluster may aid in the treatment of cancer patients.

We define biclusters as sub-matrices U of the original data matrix X such that the observations within U are different from the observations not contained in U with respect to the features in U. In other words, the choice of features influences which observations form the biclusters. In general, we can view clustering as a one-dimensional partitioning method that partitions only the set of observations. Biclustering, on the other hand, is a two-dimensional partitioning method that identifies partitions with respect to both features and observations. However, given a set of features, the problem of biclustering reduces to the problem of partitioning the observations with respect to this set of features, a problem which can be solved using conventional clustering methods. Thus, one may identify biclusters by identifying the features that define the biclusters and then clustering with respect to these features. In recent years several methods have been proposed for identifying features that define such clusters. We will show how the “sparse clustering” method of witten2010framework may be used to identify biclusters under this framework. The proposed method can be used to detect biclusters with heterogeneous means and/or variances as well as more complex differences. We compare our algorithms with some other existing biclustering approaches by applying the methods to a series of simulation studies and biomedical data sets.

## 2 Methods

### 2.1 Sparse Clustering

The standard -means clustering algorithm partitions a data set into sub-categories by maximizing the between cluster sum of squares (BCSS). The BCSS is calculated by taking the sum of the BCSS’s for each individual feature. This implies that all features are equally important. However, in many situations the clusters differ with respect to only a fraction of the features. In such situations, giving equal weight to all features when clustering may produce inaccurate results. This is especially true for HDLSS problems. To overcome this problem, witten2010framework proposed a novel clustering method which they called “sparse clustering.” Under sparse clustering, each feature is given a nonnegative weight , and the following weighted version of the BCSS is maximized:

 (1)

Here represents observation for feature of the data matrix X and if and only if observation belongs to cluster . is a distance metric between any pair of observations in X with respect to feature . For -means clustering, we take .

witten2010framework describe an iterative procedure for maximizing (1):

1. Initially let .

2. Maximize (1) with respect to by applying the standard -means algorithm with the appropriate weights. In other words, apply the -means algorithm where the dissimilarity between observations and is defined to be .

3. Maximize (1) with respect to the ’s by letting

 wj=S(bj,Δ)∥S(bj,Δ)∥2 (2)

Here is the (unweighted) between cluster sum of squares for feature and is a soft-threshold operator. is chosen so that ( if ). See witten2010framework for the justification for (2).

4. Iterate steps 2 and 2 until the algorithm converges.

Note that (2) implies that as decreases, the number of nonzero ’s decreases. Thus, for sufficiently small values of , only a subset of the features contribute to the cluster assignments, so this method is useful in situations where the clusters differ with respect to only a subset of the features.

A variant of this procedure can be used to perform sparse hierarchical clustering. In sparse hierarchical clustering, each feature is once again given a weight and the cluster hierarchy is constructed using these weighted features. The value of the weights again depends on a tuning parameter, and some weights are forced to 0 when the tuning parameter is sufficiently small. See witten2010framework for details.

### 2.2 Biclustering Via Sparse Clustering

As described earlier, the objective of biclustering is to identify submatrices U of a data matrix X such that the observations containing in U differ from the observations not contained in U with respect to the features contained in U. One possible strategy to identify such biclusters is to apply 2-means sparse clustering. One could define the observations of U to be the observations in the smaller cluster identified by the procedure and the features in U to be the features with nonzero weights.

The list of features with nonzero weights depends on the tuning parameter , so this approach to biclustering requires one to choose the correct value of this tuning parameter. One possible approach for choosing is described in witten2010framework, but in our experience it tends to give nonzero weights to too many features. Thus, we propose an alternative method for identifying the features that belong to the bicluster. First, note that if sparse clustering is applied with , then no soft thresholding will be performed on the weights and all weights be nonzero. The motivation for our method is the following: Suppose that sparse 2-means clustering is applied with . Let denote the order statistics of the weights produced by the sparse clustering procedure, and let

denoted the expected values of these order statistics under the null hypothesis that no bicluster exists. If this null hypothesis is true, then we would expect that

for all . However, if features form a bicluster, then we would expect that for and otherwise. See Figure 1 for an illustration.

Thus, our proposed biclustering method is described below:

1. Apply the 2-means sparse clustering algorithm with to obtain clusters and and weights .

2. Perform a Kolmogorov-Smirnov test of the null hypothesis that the distribution of is the same as the expected distribution of the weights under the null hypothesis of no clusters.

3. If the test in Step 2 fails to reject the null hypothesis, then terminate the procedure and report that no biclusters were identified.

4. If the test in Step 2 rejects the null hypothesis, then let

 m=argmaxj(w(p−j+1)−w(p−j+1)0)−(w(p−j)−w(p−j)0) (3)

Intuitively, we are choosing an such that observation is “above the line” in Figure 1 and observation is “below the line.”

5. Return a bicluster containing the features with the largest weights and the observations belonging to either or (whichever is smaller).

We recommend that the data matrix be normalized such that all features have mean 0 and standard deviation 1 before applying the procedure. We will call this procedure “SC-Biclust” (an abbreviation for biclustering based on sparse clustering).

Let denote the between cluster sum of squares for feature . Suppose that the mean of the observations in is and the mean of the observations in is . Then it is easy to verify that

 E(bj)=1+np(1−p)(μ1,j−μ2,j)2, (4)

where

is the probability that a given observation belongs to

. This implies that if , which would be the case of feature does not belong to the bicluster. However, if , then will increase as increases. Thus, assuming that for at least one (which will always be the case when a bicluster exists), (2) implies that as increases for all such that (i.e., all that do not belong to the bicluster). This indicates that the criteria (3) is consistent for selecting the features that belong to the bicluster, assuming that and

are correctly identified and the conditions of the law of large numbers are satisfied.

One may wish to identify secondary biclusters in a data set after identifying a primary bicluster. One simple approach to identify such secondary biclusters is described below:

1. Identify a primary bicluster as described above.

2. Define a matrix as follows:

 x′ij={xijif xij∉U1xij−¯XU1,j+¯XU′1,jif xij∈U1 (5)

Here denotes the sample mean of the th feature of and denotes the sample mean of the th feature of the elements of that are not in .

3. Apply the biclustering algorithm to the matrix .

The above procedure may be repeated as many times as desired to identify multiple biclusters in the same data sets (although the procedure should be terminated if it fails to reject the null hypothesis that no biclusters exist in Step 2).

### 2.3 Estimating the Null Distribution of the Weights

This method requires one to know the expected order statistics of the weights under the null hypothesis that no clusters exist. If this distribution is unknown, it may be approximated as follows:

1. Apply the 2-means sparse clustering algorithm with to obtain clusters and , as before.

2. Fix and and permute the rows of to calculate weights .

3. Repeat Step 2 times.

4. Approximate as , where represents the th order statistic of the weights from the th iteration of Step 2.

This procedure will provide an estimate of the expected values of the order statistics of the weights, but it is very expensive computationally for large data sets. It would be desirable to develop a faster alternative. Fortunately, if the sparse clustering procedure is modified slightly, the exact distribution of the weights can be calculated under mild assumptions.

First, note that the criterion in (1) can be written as , where is the between cluster sum of squares for feature . If we modify the procedure to minimize rather than , then (2) implies that the optimal ’s are given by

 wj=√bj√∑kbk (6)

assuming (implying that in (2)). Now under the null hypothesis that no clusters exist, there is no difference in the means of the observations in and for all features, implying that for all . Thus, (6) implies that has a distribution if the ’s are independent. Thus, if we use this criterion to select the clusters, we can test the null hypothesis that no bicluster exists by performing a Kolmogorov-Smirnov test of the null hypothesis that the ’s have a distribution. Similarly, in (3), , where . Although there is no simple closed form expression for , it can be easily approximated numerically. We will use this method to approximate the null distribution of the weights in all subsequent examples unless otherwise noted.

### 2.4 Variance Biclustering and Other Variations

Note that sparse 2-means clustering is only used in the initial step of our biclustering procedure. In principle any clustering procedure that produces two clusters could be used in place of sparse 2-means clustering. Sparse 2-means clustering is an obvious choice to identify putative biclusters since it is designed to identify clusters that differ with respect to only a subset of the features. However, in some situations it may be desirable to use a different clustering procedure to identify the putative biclusters.

One important application where it may be useful to use an alternative clustering procedure is variance biclustering. The biclustering method described in Section 2.2 is designed to identify biclusters whose mean differs from the mean of the observations that do not belong to the bicluster. In some situations, however, one may wish to identify biclusters that have unusually high (or low) variance compared to observations that are not in the bicluster. For example, when analyzing DNA methylation data, biclusters that exhibit high variance may reveal possible functional regions in the genome.

To identify variance biclusters, we propose the following simple modification of 2-means clustering in order to identify clusters whose variances differ from one another:

1. Initially assign each observation to either cluster 1 or cluster 2.

2. For , move observation from cluster 1 to cluster 2 (or from cluster 2 to cluster 1) if

 p∑j=1log(|s2j,C1−s2j,C2|+1) (7)

is increased after moving the observation to the other cluster. Here represents the standard deviation of feature for the observations in cluster .

3. Repeat Step 7 until the procedure converges.

Note that we did not specify how the initial cluster assignments in step 1 were performed. The simplest approach is to simply assign each observation to a cluster randomly. An alternative approach is to calculate the variance of the data for each observation across the features. The observations are then partitioned based on their variances: half of the observations with the largest variances are initially assigned to cluster 1 and the other half of the observations (with the smallest variances) are initially assigned to cluster 2. Our preliminary work suggests that both approaches produce comparable results but the latter approach tends to be faster, so we will use this approach in all subsequent examples.

Also, note that this procedure can be easily modified to consider feature weights by replacing (7) with

 p∑j=1wjlog(|s2j,C1−s2j,C2|+1) (8)

A sparse version of this algorithm (motivated by the sparse clustering algorithm) is also possible, as described below:

1. Initially let .

2. Maximize (8) with respect to and by applying the above procedure with the appropriate weights.

3. Maximize (8) with respect to the ’s by letting

 wj=S(bj,Δ)∥S(bj,Δ)∥2 (9)

where .

4. Iterate steps 2 and 9 until the algorithm converges.

By replacing 2-means sparse clustering with the procedure described above, the biclustering algorithm described in Section 2.2 can be used to identify variance biclusters. If one wishes to identify secondary variance biclusters, one may define a matrix as follows:

 x′ij=⎧⎨⎩xijif xij∉U1xijσU′1,jσU1,jif xij∈U1 (10)

where denotes the standard deviation of the th feature of and denotes the standard deviation of the th feature of the elements of that are not in .

Note that this procedure requires an estimate of the null distribution of the ’s. This null distribution may be estimated by permuting the rows of as described in Section 2.3. Alternatively, one can take advantage of the fact that and for all under the null hypothesis of no variance biclusters, where and are the number of observations in and , respectively. The null distribution of the ’s (and hence the

’s) can be estimated by simulating chi-square random variables and calculating the

’s for each set of simulated values. We will use this method to approximate the null distribution in all examples in this manuscript, since the permutation-based approach is much slower.

Other variations of this biclustering procedure are possible. For example, rather than using sparse 2-means clustering to identify the putative biclusters in the first step of the procedure, one could use some form of hierarchical clustering and then partition the cluster hierarchy into two clusters. We will provide a simulated example below where applying hierarchical clustering with single linkage to identify the biclusters produces better results than sparse 2-means clustering.

### 2.5 Existing Biclustering Methods

A variety of biclustering methods have been proposed. One simple and commonly used approach is to independently apply hierarchical clustering to both the rows and columns of a data set [eisen1998cluster]. Several improvements of this simple approach have been proposed [getz2000coupled, weigelt2005molecular]. Other biclustering methods directly search for submatrices U such that the mean of the observations in U is higher than the mean of the observations not in U. The “Plaid” method of lazzeroni2002plaid approximates a data matrix as a sum of submatrices whose entries follow two-way ANOVA models. At each step of the procedure, the algorithm searches for a submatrix that maximizes the reduction in the overall sum of squares. Similarly, the “Large Average Submatrix” (LAS) method of shabalin2009finding assumes that the data matrix can be expressed as a sum of constant submatrices plus Gaussian noise. These submatrices are identified using an iterative search procedure. Also, the “sparse biclustering” method of tan2013sparse assumes that the observations belong to unknown and non-overlapping classes, and the features belong to unknown and non-overlapping classes. The mean value of all the features in each class is assumed to be the same. Class labels are obtained by maximizing the log likelihood, and sparsity is obtained by imposing an penalty on the log likelihood.

Other methods for identifying biclusters utilize the singular value decomposition (SVD) of the data matrix. lee2010biclustering propose a method that searches for a low-rank “checkerboard-structured” approximation for a data matrix by calculating a weighted form of the SVD. An adaptive lasso penalty

is applied to the weights, forcing both the left and right singular vectors to be sparse. The nonzero entries in the resulting (sparse) singular vectors correspond to the observations and features forming the bicluster. chen2013biclustering develop a generalization of this method called “Heterogeneous Sparse Singular Value Decomposition” (HSSVD). HSSVD approximates the data as the sum of a “mean layer” and a “variance layer” (plus random noise) and identifies biclusters in these two layers. The inclusion of a “variance layer” allows one to identify variance biclusters as well as mean biclusters.

While these methods have been useful for many problems, they have certain shortcomings. As we will demonstrate below, they may fail to identify biclusters in simple simulations. Also, with the exception of the HSSVD method, these existing methods can only identify biclusters whose means differ from the observations not in the bicluster. (HSSVD can also identify biclusters whose variances differ.) However, biclustering methods based on the SVD have other shortcomings. These methods can identify the presence of biclusters but cannot determine which observations and features belong to the bicluster without using arbitrary cutoffs.

### 2.6 Evaluating the Reproducibility of Biclusters

We propose an intuitive method to evaluate the reproducibility of the biclusters identified by each method. We randomly partition the original data matrix into two submatrices and , each of which contains half of the observations. Denote the primary bicluster identified within as , and let and be the primary biclusters within and , respectively. We treat as the reference or the “correct” bicluster, and record four rates: 1) The percentage of observations that are misclassified (i.e. the percentage of observations that are either in / but not or in but not in /); 2) The percentage of false negatives (i.e. the average percentage of features in that are not in /); 3) The percentage of false positives (i.e. the average number of features in / that are not in ); and 4) The percentage of features that are misclassified (i.e. features that are identified as significant on sub-matrix but not , or vice versa). We repeated the procedure 10 times on each simulated data set and averaged over the 10 iterations.

### 2.7 Computational Details

In the later sections, we will compare our proposed biclustering algorithm with several existing methods, specifically Plaid, LAS, SSVD, and HSSVD. The Plaid algorithm was implemented in the R package “biclust.” The default setting was used, and the data sets were feature/column scaled before running through the algorithm. The LAS algorithm is available at https://genome.unc.edu/las/. The default settings were used, including the data transformation step if recommended by the method. The SSVD functions are available at http://www.unc.edu/haipeng/, and the HSSVD functions can be found at http://impact.unc.edu/impact7/HSSVD. Again, the default settings were used for both methods. Note that this implementation of HSSVD does not use a pre-specified rank. For easy visualization of these two SVD based methods, we transformed the resulting singular vectors/matrices to be 0/ based on the signs of the entries making the plots. When comparing the prediction accuracy and reproducibility of these methods, we further dichotomized the results as 0/1, since we only care about whether an object or feature is inside the sub-matrix . The sparse biclustering algorithm was implemented using the “sparseBC” R package with the default settings. We used and to force the sparse biclustering algorithm to identify only a single bicluster so that its results can be compared with other methods that identify one bicluster at a time. Our proposed method was implemented using a modified version of the “sparcl” R package. All calculations were performed using a single core of a 2.66 GHz Intel Core 2 Quad processor on a Linux-based system.

## 3 Results

### 3.1 Simulation Studies

We first evaluated the performance of our method on a variety of simulated data sets and compared its performance with the biclustering methods described in Section 2.5. The methods were compared with respect to computing time, prediction accuracy, and reproducibility (defined in Section 2.6). To evaluate the prediction accuracy when identifying a single bicluster, we compared three rates: observation misclassification rate, feature false positive rate (FPR), and feature false negative rate (FNR).

For the sequential biclustering simulations (simulations 3 and 5), the identification of the current bicluster depends on all the biclusters identified previously and there is no “correct” sequence for identification. Thus, we recorded the prediction accuracy in a different manner. Specifically, when there existed two biclusters, the reasonable result would be the identification of either bicluster 1 or 2, or a larger bicluster that covers both the signal blocks, which will be referred to as “bicluster 1+2.” For each simulation, we determined which of the three biclusters was identified by each method. For each method, we recorded the percentage of simulations when each of the three possible biclusters was identified. Also, instead of comparing the mismatch rates for observations and features separately, we recorded the FPR and FNR of the entries. The reproducibility analysis described in Section 2.6 was only performed on simulation studies 1, 2, and 4 for computational reasons.

Some methods failed to produce “valid” biclusters for some of the simulated data sets. We defined a “valid” bicluster to be a bicluster consisting of at least two observations and two features. For each simulation, the number of invalid results over the 100 simulations was tabulated, and invalid results were not used when calculating the average accuracy of each method. Finally, we examined the number of biclusters identified by each method for simulations 1 and 3 and compared it to the number of biclusters that truly exist.

#### 3.1.1 Primary Bicluster Identification

In this study, each simulated data set contained four non-overlapping bicluster signals generated from normal distributions, and the comparison was focused on identifying the primary bicluster. Each simulated data set comprised a 100

200 matrix with independent entries where each column represents a feature and each row represents an observation. The background entries followed a standard normal distribution with mean 0 and standard deviation 1. We denote the distribution as , where represents a normal random variable with mean and standard deviation . The four non-overlapping rectangular shaped biclusters were constructed in the following manner: bicluster 1, consisting of observations 1-20 and features 1-20 (denoted as [1-20, 1-20]) added a layer to the background, bicluster 2 [16-30, 51-80] added a layer to the background, bicluster 3 [51-90, 61-130] added a layer to the background, and bicluster 4 [66-100, 151-200] added a layer to the background. Bicluster 3 was the primary bicluster, since it was the largest bicluster and had the largest mean difference from the background, so we expected the algorithms to detect this bicluster as the first layer. Figure 2 shows the biclustering results from one of the simulations. Under the given data structure, the Plaid algorithm failed to identify any biclusters for all the simulations. Each simulated data set was partitioned as described in Section 2.6 to evaluate the reproducibility of the biclusters.

#### 3.1.2 Departure from Normality

In this study, we simulated data sets with four non-overlapping bicluster signals similar to the data sets that were simulated in Section 3.1.1

. The main difference is that the data were generated from Cauchy distributions with infinite moments. Each simulated data set comprised a 100

200 matrix with independent entries. The background entries followed a Cauchy distribution with location shift and scale . We denote the distribution as , where represents a Cauchy random variable with location shift and scale . The four non-overlapping rectangular shaped biclusters were constructed in the following manner: bicluster 1 [1-20, 1-20] added a layer to the background, bicluster 2 [16-30, 51-80] added a layer to the background, bicluster 3 [51-90, 71-110] added a layer to the background, and bicluster 4 [71-100, 156-200] added a layer to the background. Bicluster 3 was the primary bicluster, and we expected the algorithms to detect this bicluster as the first layer. Figure 3 compares the biclustering results of each method for one of the simulations. Each simulated data set was partitioned as described in Section 2.6 to evaluate the reproducibility of the biclusters.

#### 3.1.3 Sequential Biclusters with Overlap

In this study, we simulated data sets with overlap between two biclusters. Each simulated data set comprised of two layers, each of which was a 100 200 matrix with independent entries. The background data (i.e., observations that do not belong to the bicluster) were . The first layer contained a bicluster [1-40, 1-40] generated from , and the second layer contained a bicluster [21-60, 21-60] generated from . The final data set was the sum of the two layers. Note that observations 21-40 and features 21-40 are contained in both biclusters. Figure 4 shows the biclustering results from one of the simulations. Reproducibility of the biclusters was not evaluated for this simulation scenario.

#### 3.1.4 Non-Spherical Biclusters

Most existing biclustering methods seek to maximize the Euclidean distance between the center of the putative bicluster and the remaining data values. This assumes that the biclusters are approximately “spherical” (in the appropriate number of dimensions). Although this assumption is reasonable in many situations, it can cause these methods to fail if the assumption is violated. See Figure 5 for an example of non-spherical clusters in the case of two dimensions. Hierarchical clustering (with single linkage) will do a better job of identifying clusters similar to the clusters in Figure 5 than -means clustering (which also assumes that the clusters are spherical). One strength of SC-Biclust is the fact that it can use clustering methods other than 2-means clustering to identify biclusters (see Section 2.4). Thus, it is reasonable to expect that SC-Biclust (with single linkage hierarchical clustering) will outperform competing biclustering methods when the biclusters are non-spherical.

The purpose of this study was to provide an example where SC-Biclust using hierarchical clustering can identify biclusters that existing biclustering methods would fail to identify. Each 1200 75 data set was simulated as follows. For :

 Xi,2j =−2I(i≤500)+5sin(θi+πI(i>500))+ϵi Xi,2j−1 =5I(i≤500)+5cos(θi+πI(i>500))+ϵi

Here the ’s are iid and the ’s are iid . For all , the ’s are . Figure 5 shows the biclustering results from one of the simulations. Each simulated data set was partitioned as described in Section 2.6 to evaluate the reproducibility of the biclusters.

#### 3.1.5 Variance Biclustering

Another limitation of most existing biclustering methods is that they are only capable of detecting biclusters whose mean values differ from the data points not in the bicluster. In some situations, however, one may wish to identify biclusters with higher (or lower) variance than the data points not contained in the bicluster. As described in Section 2.4, SC-Biclust can be modified to identify biclusters with heterogeneous variance. The goal of this simulation is to evaluate the ability of SC-Biclust to identify such biclusters. We simulated data sets with two non-overlapping biclusters with heterogeneous variances. Each simulated data set consisted of a 150 500 matrix with independent entries. The background entries were all . The first bicluster [1-30, 1-200] was generated as , and the second bicluster [31-50, 201-400] was generated as . Figure 6 shows the biclustering results from one of the simulations. The prediction accuracy of the methods were evaluated in the same way as the third simulation scenario, and no reproducibility was assessed.

#### 3.1.6 Simulation Results

We simulated 100 data sets with the same structure for each simulation scenario. Table 1 shows the average computing time for each method for each simulation scenario. Tables 2 and 3 show the prediction accuracy and the number of valid biclusters, Table 4 shows the reproducibility results for simulations 1, 2, and 4, and Table 5 shows the stopping rule comparison for simulations 1 and 3.

SC-Biclust performed very well in the first simulation scenario. No observations were misclassified across all 100 simulations and the proportion of features that were misclassified was also very low. The reproducibility of the biclusters identified by SC-Biclust was also very good. The sparse biclustering method also produced good results, except for the relatively high feature misclassification rate in the reproducibility analysis. SSVD, HSSVD, and LAS tended to include spurious features in the bicluster (as evidenced by their higher FPR), and Plaid did not select any features for this simulation scenario. The results of the second simulation scenario were similar. Although the accuracy of SC-Biclust was lower when the assumption of normality was violated, it produced a noticeably lower error rate than competing methods (with the exception of LAS). More importantly, SC-Biclust produced valid biclusters in all 100 simulations whereas SSVD, Plaid, and sparse biclustering frequently failed to identify valid biclusters. Sparse biclustering tended to produce good results when it identified biclusters in the data, but it failed to detect any biclusters in 87 of the 100 simulations. LAS identified valid biclusters in all 100 simulations with comparable accuracy to SC-Biclust.

In the third simulation scenario, SC-Biclust identified both biclusters with perfect accuracy in all the simulations. LAS also identified the first bicluster with high accuracy but it tended to include many spurious entries when identifying the second bicluster. SSVD and HSSVD tended to identify bicluster 1+2 (combining the two biclusters into one), and the performance of Plaid was poor. The sparse biclustering method identified single biclusters, but with very high false negative rates.

SC-Biclust had a much lower proportion of misclassified observations in the fourth simulation scenario and excellent reproducibility. This is not surprising, since the other biclustering methods assume that the biclusters are spherical, and this assumption is violated for this simulation. However, these results illustrate that SC-Biclust can be used to identify biclusters in situations where existing methods will fail.

In the fifth simulation scenario, SC-Biclust identified the first variance bicluster with high accuracy. It usually detected the second variance bicluster as well, although many of the entries were false negatives. HSSVD tended to identify bicluster 1+2, with higher FNR and FPR than SC-Biclust. The other methods performed poorly, which is not surprising, since they are not designed to identify variance biclusters.

In terms of computing time, SC-Biclust was generally significantly faster than HSSVD and LAS and slightly faster than sparse biclustering but slightly slower than SSVD and Plaid. However, in simulation 3, SC-Biclust was significantly faster than all methods other than Plaid, and SC-Biclust was faster than all methods other than SSVD on simulation 5. On simulation 4, SSVD was noticeably slower than SC-Biclust (as well as Plaid and sparse biclustering). Note that these simulations use HSSVD without pre-specified rank, which increases the computing time of this method.

In simulation 1, SC-Biclust correctly determined that 4 biclusters are present in the data in 44% of the simulations. It incorrectly identified a 5th bicluster in 54% of simulations and identified a (non-existent) 6th bicluster in 2% of simulations. HSSVD correctly identified 4 biclusters in 54% of simulations. The remaining methods (namely LAS and sparse biclustering) consistently identified too many biclusters. SSVD was not included in this comparison since it always returns a single layer, and Plaid was not included since it did not return any valid results for this simulation scenario. In simulation 3, SC-Biclust correctly determined that 2 biclusters are present in the data in 99% of the simulations. HSSVD correctly determined that 2 biclusters were present in 63% of simulations, and Plaid determined that 2 biclusters were present in 30% of simulations. Again, LAS and sparse biclustering always overestimated the number of biclusters, and SSVD was not included in the comparison.

### 3.2 Analysis of OPPERA data

OPPERA is a prospective cohort study on Temporomandibular Disorders (TMD), which are a set of painful conditions that affect the jaw muscles, the jaw joint, or both. Both TMD-free participants and chronic TMD patients were enrolled in the study. Each study participant completed a quarterly questionnaire, and participants who showed signs of first-onset TMD returned to the clinic for a formal examination. The median follow up period was 2.8 years. The data set contained 185 chronic TMD patients and 3258 initially TMD-free individuals, 260 of whom developed TMD before the end of the study. Among the TMD-free individuals, 521 did not complete any follow up questionnaires and were excluded from the analysis. The remaining 2737 were used for survival analysis in the later sections, where development of first-onset TMD is the event of interest. For a more detailed description of the OPPERA study, see slade2011study or bair2013study.

Three sets of possible risk factors for TMD were measured in OPPERA, including autonomic measurements like blood pressure and heart rate (44 total variables), psychosocial measurements like depression and anxiety (39 total variables), and quantitative sensory testing (QST) measurements (33 total variables) that evaluate participants’ sensitivity to experimental pain. See fillingim2011potential, greenspan2011pain, and maixner2011potential for more detailed descriptions of these variables.

The SC-Biclust algorithm identified 3 significant biclusters within the OPPERA data set. The first bicluster contained 30 measures of autonomic function, the second bicluster contained 29 measures of psychological distress, and the third bicluster contained 6 measures of pain sensitivity. There were no overlap in the features selected in the three biclusters. Thus, the biclusters identified by SC-Biclust were consistent with the known structure of the data set. The biclusters identified by the other methods did not correspond to the three different types of measurements known to exist in this data set. See Table

6 for a summary of the results.

Membership in the biclusters identified by each method of interest was evaluated as a possible risk factor for both chronic TMD and first-onset TMD. (Subjects with chronic TMD were excluded from the analysis for first-onset TMD.) The association between each bicluster and chronic TMD is shown in Table 7, and the association between each bicluster and first-onset TMD is shown in Table 8. Kaplan-Meier plots for first-onset TMD for selected biclusters are shown in Figure 7. All three biclusters identified by SC-Biclust were associated with chronic TMD. The second bicluster was also associated with first-onset TMD. The second and third biclusters identified by LAS were associated with chronic TMD, and the second bicluster was also associated with first-onset TMD. The remaining biclusters were associated with neither chronic TMD nor first-onset TMD. SC-Biclust was faster than HSSVD and LAS but slower than SSVD and Plaid. The sparse biclustering algorithm failed to detect any biclusters.

### 3.3 Analysis of a breast cancer gene expression data set

The data set used in this section contains gene expression measurements on 4751 genes from a total number of 78 breast cancer subjects. The survival time of each subject is also available. See van2002gene for a more detailed description of this data set.

The first bicluster identified by the SC-Biclust algorithm contains 16 subjects and 8 features. The first bicluster identified by the LAS algorithm contains 16 observations and 1421 features. Interestingly, the 16 observations identified by SC-Biclust and LAS are exactly the same. The primary bicluster identified by the sparse biclustering method contains 60 observations and 553 features. HSSVD method identified 8 mean bicluster layers and 3 variance bicluster layers, for which we will only study the primary mean layer. The Plaid method failed to identify any biclusters within the data set, and the SSVD method and the HSSVD variance identification did not produce valid biclusters. Detailed biclustering results are provided in Table 9.

We tested the null hypothesis of no association between each putative bicluster and survival using log rank tests. Table 9 and Figure 8 show the associations between survival and the biclusters identified by SC-Biclust, HSSVD (mean layer only), LAS, and sparse biclustering. The putative biclusters identified by SC-Biclust, LAS, and sparse biclustering were associated with survival, but the putative bicluster identified by HSSVD was not. The running time for SC-Biclust was also significantly lower than the running time of the other methods.

### 3.4 Analysis of methylation data

We applied SC-Biclust (and existing biclustering methods) to a methylation data set comparing cancer patients with normal patients. Methylation data was evaluated at 384 different cancer-specific differentially methylated regions (cDMRs) for 138 normal samples and 152 cancer samples. Details of the data set are described in hansen2011increased, who reported that the cancer samples had hypervariability in certain cDMRs compared to controls.

We first applied the SC-Biclust algorithm to identify two mean biclusters and then used the residual matrix for variance bicluster identification, as described in Section 2.4. We chose the top two variance biclusters for comparison with the other methods. The HSSVD method identified two layers of mean biclusters and six layers of variance biclusters. The Plaid method identified two biclusters. The sparse biclustering method failed to detect any biclusters. For the LAS method, we report the top three biclusters. Comparison of the biclustering results are summarized in Table 10. The first mean bicluster identified by SC-Biclust was strongly associated with cancer, as were both of the variance biclusters. Indeed, we can see that the two variance biclusters identified by the SC-Biclust algorithm contained cancer samples exclusively. The other biclustering methods also identified biclusters that were associated with cancer. It is interesting to compare the variance biclusters identified by SC-Biclust to the variance biclusters identified by HSSVD. SC-Biclust tends to identify small variance biclusters that contain only cancer patients whereas HSSVD tends to identify larger biclusters that are more heterogeneous. Note that under the 0/1 transformation, SSVD and the mean layers of HSSVD identified biclusters containing all of the observations. The running time for SC-Biclust was greater than the running time of LAS and Plaid but less than the running time of HSSVD.

## 4 Discussion

Biclustering is an unsupervised learning algorithm that is a powerful tool for studying HDLSS data. In this paper, we have proposed a general framework for biclustering based on sparse clustering. We have developed algorithms for heterogeneous mean and variance biclusters as well as more complex structures that can be identified using hierarchical clustering. The algorithms we described in this paper are special cases of this framework, and similar methods can be developed for other bicluster structures of interest.

The biclusters identified by SC-Biclust compared favorably with the biclusters identified by competing methods for both the simulated and real data sets. We believe that SC-Biclust has several other advantages compared to existing biclustering methods. First, unlike some other biclustering methods [lazzeroni2002plaid, tan2013sparse], SC-Biclust does not assume that all features in a bicluster have the same mean. This is a strong assumption that is likely to be violated for many data sets. Indeed, SC-Biclust does not even necessarily assume that the bicluster has different means than the observations not in the bicluster. In general, SC-Biclust can be applied given an arbitrary function whose value increases as the “difference” between the bicluster and the remaining observations increases and a method for maximizing this function with respect to the observations. For example, as noted earlier, SC-Biclust can be used to identify biclusters with heterogeneous variance.

Note that singular value decomposition-based methods such as HSSVD can also be used to denoise and approximate the data in addition to bicluster detection. For this reason, although SC-Biclust performs better in biclustering detection under the scenarios considered in the paper, SVD-based methods can still be useful as a preprocessing tool. The performance of SC-Biclust may be improved by applying a preprocessing method first such as HSSVD, but this is beyond the scope of this paper.

Second, SC-Biclust is noticeably faster than other biclustering methods, particularly HSSVD and LAS. This was particularly true when these methods were applied to high dimensional data. Thus, SC-Biclust may be useful for Big Data problems where other methods are too expensive computationally.

Finally, SC-Biclust provides a simple statistical test of the null hypothesis that no biclusters exist. In general the problem of determining if a bicluster (or any type of cluster) represents signal or noise is difficult. Existing biclustering methods use various stopping criteria to determine if a bicluster represents signal. However, as demonstrated earlier, they frequently fail to identify true biclusters or return putative “biclusters” that do not actually exist. The stopping criteria used by SC-Biclust was generally more accurate than these existing methods. Development of better stopping criteria for biclustering methods is an important area for future research.

One limitation of using the beta distribution to approximate the null distribution of the weights for SC-Biclust is the fact that it requires the assumption that the BCSS of the features are independent of one another. This is unlikely to be satisfied if the features themselves are correlated, which is likely to be true in most real world data sets of interest. Fortunately our experience suggests that the correlations of the BCSS’s tend to be modest even when the original features are strongly correlated with one another, so SC-Biclust tends to be robust against violations of this assumption. In particular, when multiple biclusters exist in a data set, this assumption is necessarily violated, which was the case for most of our simulation examples. Nevertheless SC-Biclust generally did not identify spurious biclusters in these simulations. However, if there is evidence of strong correlations among the BCSS’s, it may be preferable to approximate the null distribution of the weights using the nonparametric permutation method.

It is interesting to compare the results of SC-Biclust and HSSVD for variance biclustering. In the examples considered in this manuscript, SC-Biclust tended to identify smaller, more homogeneous biclusters whereas HSSVD tended to identify biclusters that were larger and more heterogeneous. It is unclear if this result is true in general or if it is merely an artifact of these particular data sets. In any event, one potential advantage of HSSVD for this problem is that SC-Biclust mail be less likely to detect “small” biclusters than HSSVD. It is possible that the method used by SC-Biclust to identify variance biclusters could be improved. The identification of variance biclusters is a relatively new topic and an important area for future research.

## References

• [1] [Bair et al.]Bair, Brownstein, Ohrbach, Greenspan, Dubner, Fillingim, Maixner, Smith, Diatchenko, Gonzalez et al.2013bair2013study Bair, E., Brownstein, N. C., Ohrbach, R., Greenspan, J. D., Dubner, R., Fillingim, R. B., Maixner, W., Smith, S. B., Diatchenko, L., Gonzalez, Y. et al. 2013, “Study protocol, sample characteristics, and loss to follow-up: The OPPERA prospective cohort study,” The Journal of Pain, 14(12), T2–T19.
• [2] [Chen et al.]Chen, Sullivan and Kosorok2013chen2013biclustering Chen, G., Sullivan, P. F., and Kosorok, M. R. 2013, “Biclustering with heterogeneous variance,” Proceedings of the National Academy of Sciences, 110(30), 12253–12258.
• [3]

[Eisen et al.]Eisen, Spellman, Brown and Botstein1998eisen1998cluster Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. 1998, “Cluster analysis and display of genome-wide expression patterns,”

Proceedings of the National Academy of Sciences, 95(25), 14863–14868.
• [4] [Fillingim et al.]Fillingim, Ohrbach, Greenspan, Knott, Dubner, Bair, Baraian, Slade and Maixner2011fillingim2011potential Fillingim, R. B., Ohrbach, R., Greenspan, J. D., Knott, C., Dubner, R., Bair, E., Baraian, C., Slade, G. D., and Maixner, W. 2011, “Potential psychosocial risk factors for chronic TMD: descriptive data and empirically identified domains from the OPPERA case-control study,” The Journal of Pain, 12(11), T46–T60.
• [5] [Getz et al.]Getz, Levine and Domany2000getz2000coupled Getz, G., Levine, E., and Domany, E. 2000, “Coupled two-way clustering analysis of gene microarray data,” Proceedings of the National Academy of Sciences, 97(22), 12079–12084.
• [6] [Greenspan et al.]Greenspan, Slade, Bair, Dubner, Fillingim, Ohrbach, Knott, Mulkey, Rothwell and Maixner2011greenspan2011pain Greenspan, J. D., Slade, G. D., Bair, E., Dubner, R., Fillingim, R. B., Ohrbach, R., Knott, C., Mulkey, F., Rothwell, R., and Maixner, W. 2011, “Pain sensitivity risk factors for chronic TMD: descriptive data and empirically identified domains from the OPPERA case control study,” The Journal of Pain, 12(11), T61–T74.
• [7] [Hansen et al.]Hansen, Timp, Bravo, Sabunciyan, Langmead, McDonald, Wen, Wu, Liu, Diep et al.2011hansen2011increased Hansen, K. D., Timp, W., Bravo, H. C., Sabunciyan, S., Langmead, B., McDonald, O. G., Wen, B., Wu, H., Liu, Y., Diep, D. et al. 2011, “Increased methylation variation in epigenetic domains across cancer types,” Nature genetics, 43(8), 768–775.
• [8] Lazzeroni and Owen2002lazzeroni2002plaid Lazzeroni, L., and Owen, A. 2002, “Plaid models for gene expression data,” Statistica sinica, 12(1), 61–86.
• [9] [Lee et al.]Lee, Shen, Huang and Marron2010lee2010biclustering Lee, M., Shen, H., Huang, J. Z., and Marron, J. 2010, “Biclustering via sparse singular value decomposition,” Biometrics, 66(4), 1087–1095.
• [10] [Maixner et al.]Maixner, Greenspan, Dubner, Bair, Mulkey, Miller, Knott, Slade, Ohrbach, Diatchenko et al.2011maixner2011potential Maixner, W., Greenspan, J. D., Dubner, R., Bair, E., Mulkey, F., Miller, V., Knott, C., Slade, G. D., Ohrbach, R., Diatchenko, L. et al. 2011, “Potential autonomic risk factors for chronic TMD: descriptive data and empirically identified domains from the OPPERA case-control study,” The Journal of Pain, 12(11), T75–T91.
• [11] [Shabalin et al.]Shabalin, Weigman, Perou and Nobel2009shabalin2009finding Shabalin, A. A., Weigman, V. J., Perou, C. M., and Nobel, A. B. 2009, “Finding large average submatrices in high dimensional data,” The Annals of Applied Statistics, pp. 985–1012.
• [12] [Slade et al.]Slade, Bair, By, Mulkey, Baraian, Rothwell, Reynolds, Miller, Gonzalez, Gordon et al.2011slade2011study Slade, G. D., Bair, E., By, K., Mulkey, F., Baraian, C., Rothwell, R., Reynolds, M., Miller, V., Gonzalez, Y., Gordon, S. et al. 2011, “Study methods, recruitment, sociodemographic findings, and demographic representativeness in the OPPERA study,” The Journal of Pain, 12(11), T12–T26.
• [13] Tan and Witten2013tan2013sparse Tan, K. M., and Witten, D. M. 2013, “Sparse biclustering of transposable data,” Journal of Computational and Graphical Statistics, . In press.
• [14] [van’t Veer et al.]van’t Veer, Dai, Van De Vijver, He, Hart, Mao, Peterse, van der Kooy, Marton, Witteveen, Shreiber, Kerkhoven, Roberts, Linsley, Bernards and Friend2002van2002gene van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Shreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. 2002, “Gene expression profiling predicts clinical outcome of breast cancer,” Nature, 415(6871), 530–536.
• [15] [Weigelt et al.]Weigelt, Hu, He, Livasy, Carey, Ewend, Glas, Perou and van’t Veer2005weigelt2005molecular Weigelt, B., Hu, Z., He, X., Livasy, C., Carey, L. A., Ewend, M. G., Glas, A. M., Perou, C. M., and van’t Veer, L. J. 2005, “Molecular portraits and 70-gene prognosis signature are preserved throughout the metastatic process of breast cancer,” Cancer research, 65(20), 9155–9158.
• [16] Witten and Tibshirani2010witten2010framework Witten, D. M., and Tibshirani, R. 2010, “A framework for feature selection in clustering,” Journal of the American Statistical Association, 105(490).
• [17] Zou2006zou2006adaptive Zou, H. 2006, “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, 101(476), 1418–1429.
• [18]