# Convex Biclustering

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees, features that arguably are lacking in current alternative algorithms. We demonstrate the advantages of our approach, which include stably and reproducibly identifying biclusterings, on simulated and real microarray data.


## 1 Introduction

In the biclustering problem, we seek to simultaneously group observations (columns) and features (rows) in a data matrix. Such data is sometimes described as two-way, or transposable, to put the rows and columns on equal footing and to emphasize the desire to uncover structure in both the row and column variables. Biclustering is used for visualization and exploratory analysis in a wide array of domains. For example, in text mining, biclustering can identify subgroups of documents with similar properties with respect to a subgroup of words (Dhillon, 2001). In collaborative filtering, it can be used to identify subgroups of customers with similar preferences for a subset of products (Hofmann and Puzicha, 1999). Comprehensive reviews of biclustering methods can be found in (Madeira and Oliveira, 2004; Tanay, Sharan and Shamir, 2005; Busygin, Prokopyev and Pardalos, 2008).

In this work, we focus on biclustering to identify patterns in high dimensional cancer genomic data. While a cancer, such as breast cancer, may present clinically as a homogenous disease, it typically consists of several distinct subtypes at the molecular level. A fundamental goal of cancer research is the identification of subtypes of cancerous tumors that have similar molecular profiles and the genes that characterize each of the subtypes. Identifying these patterns is the first step towards developing personalized treatment strategies targeted to a patient’s particular cancer subtype.

Subtype discovery can be posed as a biclustering problem in which gene expression data is partitioned into a checkerboard-like pattern that highlights the associations between groups of patients and groups of genes that distinguish those patients. Biclustering has had some notable successes in subtype discovery. For instance, biclustering breast cancer data has identified sets of genes whose expression levels segregated patients into five subtypes with distinct survival outcomes (Sørlie et al., 2001). These subtypes have been reproduced in numerous studies (Sørlie et al., 2003). Encouraged by these results, scientists have searched for molecular subtypes in other cancers, such as ovarian cancer (Tothill et al., 2008). Unfortunately, the findings in many of these additional studies have not been as reproducible as those identified by Sørlie et al. (2001). The failure to reproduce these other results may reflect a genuine absence of biologically meaningful groupings. But another possibility may be related to issues inherent in the computational methods currently used to identify biclusters.

While numerous methods for biclustering genomic data have been proposed (Madeira and Oliveira, 2004; Busygin, Prokopyev and Pardalos, 2008), the most popular approach to biclustering cancer genomics data is the clustered dendrogram. This method performs hierarchical clustering (Hastie, Tibshirani and Friedman, 2009) on the patients (columns) as well as on the genes (rows). The matrix is then re-ordered according to these separate row and column dendrograms and visualized as a heatmap with dendrograms plotted alongside the row and column axes. Figure 1(a) illustrates an example of a clustered dendrogram of expression data from a lung cancer study. The data consist of the expression levels of 500 genes across 56 individuals, a subset of the data studied in Lee et al. (2010). Subjects belong to one of four subgroups: Normal, Carcinoid, Colon, or Small Cell.

This simple strategy seems to reasonably recover the clinical diagnoses and identify sets of genes whose dysregulation characterizes the subtypes.

As an algorithmic procedure, however, the clustered dendrogram has two characteristics that make it less than ideal for generating reproducible results. Dendrograms are constructed by greedily fusing observations (features) to decrease some criterion. Consequently the algorithm may return a biclustering that is only locally optimal with respect to the criterion. Since solutions may vary depending on how the algorithm is initialized, such procedures tend to be run from multiple initializations, but even then there is no guarantee that a global optimum will be reached. The algorithm is also not stable in the sense that small perturbations in the data can lead to large changes in the clustering assignments.

More sophisticated methods have been proposed for biclustering, some based on the singular value decomposition (SVD) of the data matrix (Lazzeroni and Owen, 2002; Bergmann, Ihmels and Barkai, 2003; Turner, Bailey and Krzanowski, 2005; Witten, Tibshirani and Hastie, 2009; Lee et al., 2010; Sill et al., 2011), and others based on graph cuts (Dhillon, 2001; Kluger et al., 2003). Some approaches are similar in spirit to the clustered dendrogram and directly cluster the rows and columns (Coifman and Gavish, 2011; Tan and Witten, 2014). While these methods may provide worthwhile improvements in empirical performance, none of them address the two fundamental issues that dog the clustered dendrogram. Moreover, scientists may shy away from using many of these methods, since their outputs are typically not as easy to visualize as the simple clustered dendrogram. From a reproducible research perspective, a biclustering method should (i) give the same, ideally unique, answer regardless of how the algorithm is initialized, and (ii) be stable with respect to perturbations in the data.

In this paper, we pose biclustering as a convex optimization problem and introduce a novel Convex BiclusteRing Algorithm (COBRA) for iteratively solving it. COBRA outputs results that retain the simple interpretability and visualization of the clustered dendrogram and also possesses several key advantages over existing techniques: (i) Stability and Uniqueness: COBRA produces the unique global minimizer to a convex program and this minimizer is continuous in the data. This means that COBRA always maps the data to a single biclustering assignment, and this solution is stable. (ii) Simplicity: COBRA employs a single tuning parameter that controls the number of biclusters. (iii) Data Adaptivity: COBRA admits a simple and principled data adaptive procedure for choosing the tuning parameter that involves solving a convex matrix completion problem.

Returning to our motivating lung cancer example, Figure 1(b) illustrates the results of COBRA with the tuning parameter selected according to our data adaptive procedure. After reordering the columns and rows via the seriation package (Hahsler, Hornik and Buchta, 2008), with the ‘TSP’ option, a clearer biclustering pattern emerges.

## 2 A Convex Formulation of Biclustering

Our goal is to identify the groups of rows and groups of columns in a data matrix that are associated with each other. As seen in Figure 1(b), when rows and columns are reordered according to their groupings, a checkerboard pattern emerges, namely the elements of the matrix partitions defined by row and column groups tend to display a uniform intensity.

We now describe a probabilistic model that can generate the observed checkerboard pattern. Our data, X ∈ ℝ^{p×n}, consists of n samples drawn from a p-dimensional feature space. Suppose that the latent checkerboard structure is defined by R feature groups and C observation groups. If the ij-th entry x_ij of X belongs to the bicluster defined by the r-th feature group and the c-th observation group, then we assume that the observed value is given by x_ij = μ₀ + μ_rc + ε_ij, where μ₀ is a baseline or grand mean shared by all entries of the data matrix, μ_rc is the mean of the bicluster defined by the r-th row partition and c-th column partition, and the ε_ij are iid N(0, σ²) for some σ² > 0. To ensure identifiability of the mean parameters, we assume that μ₀ = 0, which can be achieved by removing the grand sample mean from the data matrix X.

This biclustering model corresponds to a checkerboard mean model (Madeira and Oliveira, 2004). This model is most similar to that assumed by Tan and Witten (2014) who propose methods to estimate a checkerboard-like structure with some of the bicluster mean entries being sparse. The checkerboard model is exhaustive in that each matrix element is assigned to one bicluster. This is in contrast to other biclustering models that identify potentially overlapping row and column subsets that are not exhaustive; these are typically estimated using SVD-like methods (Lazzeroni and Owen, 2002; Bergmann, Ihmels and Barkai, 2003; Turner, Bailey and Krzanowski, 2005; Witten, Tibshirani and Hastie, 2009; Lee et al., 2010; Sill et al., 2011) or methods to find hot-spots (Shabalin et al., 2009).

Estimating the checkerboard model parameters consists of finding the partitions and the mean values of each partition. Estimating μ_rc, given feature and observation clusterings, is trivial. Let R_r and C_c denote the indices of the r-th row partition and the c-th column partition. The maximum likelihood estimate of μ_rc is simply the sample mean of the entries of X over the indices defined by R_r and C_c, namely

 μ̂_rc = (1 / (|R_r| |C_c|)) ∑_{i ∈ R_r} ∑_{j ∈ C_c} x_ij.
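This estimate is straightforward to compute. Below is a minimal numpy sketch (the function and variable names are ours, for illustration, not from the paper's software) that recovers the bicluster sample means from given row and column labels:

```python
import numpy as np

def bicluster_means(X, row_labels, col_labels):
    """Given row/column partition labels, return the matrix of bicluster
    sample means: mu[r, c] is the average of X over rows in group r and
    columns in group c, i.e. the maximum likelihood estimate under the
    checkerboard model."""
    R = row_labels.max() + 1
    C = col_labels.max() + 1
    mu = np.empty((R, C))
    for r in range(R):
        for c in range(C):
            mu[r, c] = X[np.ix_(row_labels == r, col_labels == c)].mean()
    return mu

# A tiny 4x4 checkerboard: rows {0,1} vs {2,3}, columns {0,1} vs {2,3}.
X = np.array([[1., 1., 5., 5.],
              [1., 1., 5., 5.],
              [3., 3., 7., 7.],
              [3., 3., 7., 7.]])
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 0, 1, 1])
mu = bicluster_means(X, rows, cols)  # recovers [[1, 5], [3, 7]]
```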

In contrast, estimating the row and column partitions is a combinatorially hard problem and characterizes the main objective of biclustering. This task is akin to best subset selection in regression problems (Hastie, Tibshirani and Friedman, 2009). However, just as the best subset selection problem has been successfully attacked by solving a convex surrogate problem, namely the Lasso (Tibshirani, 1996), we will develop a convex relaxation of the combinatorially hard problem of selecting the row and column partitions.

We propose to identify the partitions by minimizing the following convex criterion

 F_γ(U) = ½∥X − U∥²_F + γ J(U), where J(U) = Ω_W(U) + Ω_W̃(Uᵀ), (2.1)

with

 Ω_W(U) = ∑_{i<j} w_ij ∥U_·i − U_·j∥₂ and Ω_W̃(Uᵀ) = ∑_{i<j} w̃_ij ∥U_i· − U_j·∥₂,

where U_·i (U_i·) denotes the i-th column (row) of the matrix U.

We have posed the biclustering problem as a penalized regression problem, where the matrix U is our estimate of the means matrix. The quadratic term quantifies how well U approximates X. The regularization term J(U) penalizes deviations away from a checkerboard pattern. The parameter γ ≥ 0 tunes the tradeoff between the two terms. The entries w_ij and w̃_ij are non-negative weights that will be explained shortly.
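To make the criterion concrete, here is a direct, unoptimized numpy transcription of the objective (the function names are ours, for illustration only):

```python
import numpy as np

def omega(M, W):
    """Omega_W(M): weighted sum of l2 norms of column differences of M,
    with W[i, j] weighting the column pair (i, j), i < j."""
    n = M.shape[1]
    return sum(W[i, j] * np.linalg.norm(M[:, i] - M[:, j])
               for i in range(n) for j in range(i + 1, n))

def objective(X, U, W, Wt, gamma):
    """F_gamma(U) = 0.5 * ||X - U||_F^2 + gamma * J(U), where
    J(U) = Omega_W(U) + Omega_Wt(U^T) fuses columns and rows."""
    J = omega(U, W) + omega(U.T, Wt)
    return 0.5 * np.linalg.norm(X - U) ** 2 + gamma * J

# The penalty vanishes exactly on constant matrices (cf. Proposition 3.3):
W = np.ones((3, 3))
Wt = np.ones((2, 2))
U_const = 4.0 * np.ones((2, 3))
U_other = np.arange(6.0).reshape(2, 3)
```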

The penalty term J(U) is closely related to other sparsity inducing penalties. When only the rows or only the columns are being clustered, minimizing the objective function in (2.1) corresponds to solving a convex clustering problem (Pelckmans et al., 2005; Hocking et al., 2011; Lindsten, Ohlsson and Ljung, 2011) under an ℓ₂-norm fusion penalty. The convex clustering problem in turn can be seen as a generalization of the Fused Lasso (Tibshirani et al., 2005). When the ℓ₁-norm is used in place of the ℓ₂-norm in J(U), we recover a special case of the general Fused Lasso (Tibshirani and Taylor, 2011). Other norms can also be employed in our framework; see for example Chi and Lange (2015). In this paper, we restrict ourselves to the ℓ₂-norm since it is rotationally invariant. In general, we do not want a procedure whose biclustering output may non-trivially change when the coordinate representation of the data is trivially changed.

To understand how the regularization term steers solutions toward a checkerboard pattern, consider the effects of Ω_W(U) and Ω_W̃(Uᵀ) separately. Suppose w̃_ij = 0 for all i and j. The i-th column U_·i of the matrix U can be viewed as a cluster center, or centroid, attached to the i-th column x_·i of the data matrix X. When γ = 0, the minimum is attained when U = X, and each column occupies a unique column cluster. As γ increases, the cluster centroids are shrunk together and in fact begin to coalesce. Two columns i and j are assigned to the same column partition if U_·i = U_·j. We will prove later that for sufficiently large γ, all columns coalesce into a single cluster. Similarly, suppose w_ij = 0 for all i and j, and view the rows of U as the cluster centroids attached to the rows of X. As γ increases, the row centroids will begin to coalesce, and we likewise say that the i-th and j-th rows of X belong to the same row partition if their centroid estimates U_i· and U_j· coincide.

When J(U) includes both Ω_W(U) and Ω_W̃(Uᵀ), rows and columns of U are simultaneously shrunk towards each other as the parameter γ increases. The penalized estimates exhibit the desired checkerboard structure as seen in Figure 1(b). Note this shrinkage procedure is fundamentally different from methods like the clustered dendrogram, which independently cluster the rows and columns. By coupling row and column clustering, our formulation explicitly seeks out a solution with a checkerboard mean structure.

We now address choosing the weights w and w̃. A judicious choice of the weights enables us to (i) employ a single regularization parameter γ, (ii) obtain more parsimonious clusterings, and (iii) speed up the convergence of key subroutines employed by COBRA. With these goals in mind, we recommend weights having the following properties: (i) The column weights should sum to 1/√p and the row weights should sum to 1/√n. (ii) The column weight w_ij should be inversely proportional to the distance ∥x_·i − x_·j∥₂ between the i-th and j-th columns. The row weights should be assigned analogously. (iii) The weights should be sparse, namely consist mostly of zeros.

We now discuss the rationale behind our weight recommendations. The key to ensuring that a single tuning parameter γ suffices for identifying the checkerboard pattern is keeping the two penalty terms Ω_W(U) and Ω_W̃(Uᵀ) on the same scale. If this does not hold, then either column or row clusterings will dominate. Consequently, since the columns are in ℝ^p and the rows are in ℝ^n, we choose the column weights to sum to 1/√p and the row weights to sum to 1/√n. Using weights that are inversely proportional to the distances between points more aggressively shrinks rows and columns that are more similar to each other, and less aggressively shrinks rows and columns that are less similar. Finally, sparse weights expedite computation. COBRA solves a sequence of convex clustering problems. The algorithm we employ for solving the convex clustering subproblem scales in storage and computational operations as O(ν), where ν is the number of non-zero weights. Shrinking all pairs of columns (rows) by taking all of the weights w_ij (w̃_ij) to be positive not only increases computational costs but also typically produces inferior clusterings to sparser ones, as seen in Chi and Lange (2015). We employ the sparse Gaussian kernel weights described in Chi and Lange (2015), which satisfy the properties outlined above. Additional discussion on this choice of weights is given in Web Appendix A.
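The construction below sketches column weights with the three properties. It is a simplified stand-in for the sparse Gaussian kernel weights of Chi and Lange (2015): the bandwidth `phi`, the k-nearest-neighbor sparsity rule, and the choice to normalize the total weight to 1 are illustrative assumptions here, not the paper's exact recipe.

```python
import numpy as np

def sparse_gaussian_weights(X, k=3, phi=0.5):
    """Illustrative column weights: a Gaussian kernel of pairwise column
    distances (property ii), zeroed outside a k-nearest-neighbor graph
    (property iii), then rescaled to a fixed total (property i)."""
    n = X.shape[1]
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # pairwise column distances
    W = np.exp(-phi * D ** 2)
    np.fill_diagonal(W, 0.0)
    keep = np.zeros_like(W, dtype=bool)
    for i in range(n):
        keep[i, np.argsort(D[i])[1:k + 1]] = True  # k nearest neighbors of column i
    W *= (keep | keep.T)     # symmetrize the sparsity pattern
    W = np.triu(W, 1)        # use each unordered pair (i, j), i < j, once
    return W / W.sum()       # rescale the total; see the text for the exact scaling

rng = np.random.default_rng(0)
W = sparse_gaussian_weights(rng.normal(size=(5, 12)))
```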

## 3 Properties of the Convex Formulation and COBRA’s Solution

The solution to minimizing (2.1) has several attractive properties as a function of the data X, the regularization parameter γ, and the weights w and w̃, some of which can be exploited to expedite its numerical computation. We emphasize that these results are inherent to the minimization of the objective function (2.1). They hold regardless of the algorithm used to find the minimum point, since they are a consequence of casting the biclustering problem as a convex program. Proofs of all propositions can be found in Web Appendix B. First, minimizing (2.1) is a well-posed optimization problem.

###### Proposition 3.1.

The function F_γ defined in (2.1) has a unique global minimizer.

Furthermore, since F_γ is convex, the only local minimum is the global minimum, and any algorithm that converges to a stationary point of F_γ will converge to the global minimizer. The next result will have consequences for numerical computation and is also the foundation underpinning COBRA’s stability.

###### Proposition 3.2.

The minimizer U of (2.1) is jointly continuous in the data X, the parameter γ, and the weights w and w̃.

Continuity of U in the regularization parameter γ suggests employing warm-starts, or using the solution at one γ as the starting point for a problem with a slightly larger γ, because small changes in γ will result in small changes in U. Continuity of U in X tells us that the solution varies smoothly with perturbations in the data, our main stability result. Recall that the i-th and j-th columns (rows) are assigned to the same cluster if U_·i = U_·j (U_i· = U_j·). Since U varies smoothly in the data, so do the differences U_·i − U_·j. While assigning entries to biclusters is an inherently discrete operation, we will see in Section 7 that this continuity result manifests itself in assignments that are robust to perturbations in the data. Continuity of U in the weights also gives us stability guarantees and assurances that COBRA’s biclustering results should be locally insensitive to changes in the weights.

The parameter γ tunes the complexity of COBRA’s solution. We next show that COBRA’s most complicated (small γ) and simplest (large γ) solutions coincide with the clustered dendrogram’s most complicated and simplest solutions. Clearly when γ = 0, the solution is just the data, namely U = X. To get some intuition on how the solution behaves as γ increases, observe that the penalty J(U) is a semi-norm. Moreover, under suitable conditions on the weights, spelled out in the connectedness assumption below, J(U) is zero if and only if U is a constant matrix.

###### Assumption 3.1.

For any pair of columns (rows), indexed by i and j with i < j, there exists a sequence of indices connecting them along which the corresponding weights w (w̃) are positive.

###### Proposition 3.3.

Under connectedness, J(U) = 0 if and only if U = c𝟏𝟏ᵀ for some c ∈ ℝ.

This result suggests that as γ increases, the solution to the biclustering problem converges to the solution of the following constrained optimization problem:

 min_U ½∥X − U∥²_F subject to U = c𝟏𝟏ᵀ for some c ∈ ℝ,

the solution to which is just the global mean matrix x̄𝟏𝟏ᵀ, whose entries are all identically the average value x̄ of X over all its entries. The next result formalizes our intuition that the centroids eventually coalesce to the grand mean as γ becomes sufficiently large.

###### Proposition 3.4.

Under connectedness, F_γ is minimized by the grand mean matrix x̄𝟏𝟏ᵀ for γ sufficiently large.

Thus, as γ increases from 0, the centroids matrix U traces a continuous solution path that starts from np biclusters, consisting of U = X, to a single bicluster, where u_ij = x̄ for all i and j.

## 4 Estimation of Biclusters with COBRA

Having characterized our estimator of the checkerboard means as the minimizer of (2.1), we now turn to the task of computing it. From here on, we fix the data X and the weights w and w̃ and consider the biclustering solution as a function of the parameter γ, denoting the solution as U_γ. The penalty term J(U) in (2.1) makes minimization challenging since it is non-smooth and not separable over any block partitioning of U. Coordinate descent (Friedman et al., 2007; Wu and Lange, 2008) is an effective solution when the non-smooth penalty term is separable over some block partitioning of the variables, which is unfortunately not the case for (2.1). Another popular iterative method for minimizing non-smooth convex functions is the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). While an ADMM algorithm is feasible, we take a more direct approach with the Dykstra-like proximal algorithm (DLPA) proposed by Bauschke and Combettes (2008) because it yields a simple meta-algorithm that can take advantage of fast solvers for the convex clustering problem.

DLPA generalizes a classic algorithm for fitting restricted least squares regression problems (Dykstra, 1983) and solves minimization problems of the form

 min_U ½∥X − U∥²_F + f(U) + g(U), (4.1)

where f and g are lower-semicontinuous, convex functions. The biclustering problem is clearly an instance of (4.1). Setting f(U) = γΩ_W(U) and g(U) = γΩ_W̃(Uᵀ) in (4.1) gives us the pseudocode for COBRA shown in Algorithm 1. The operation prox_{γΩ_W}(Z) is the proximal mapping of the function γΩ_W and is defined to be

 prox_{γΩ_W}(Z) = argmin_V [ ½∥Z − V∥²_F + γΩ_W(V) ].

Each proximal mapping in Algorithm 1 corresponds to solving a convex clustering problem.
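The proximal mapping of γΩ_W has no closed form and must itself be computed iteratively, but the concept is easy to illustrate with a penalty whose prox is available analytically. The example below (our own, for intuition only) is the prox of γ∥·∥_F, which block-soft-thresholds its argument:

```python
import numpy as np

def prox_frob(Z, gamma):
    """prox of gamma * ||.||_F: argmin_V 0.5 * ||Z - V||_F^2 + gamma * ||V||_F.
    The minimizer shrinks Z toward zero and collapses to exactly zero once
    gamma exceeds ||Z||_F -- the same shrink-then-fuse behavior that
    Omega_W induces on pairs of centroids."""
    norm = np.linalg.norm(Z)
    if norm <= gamma:
        return np.zeros_like(Z)
    return (1.0 - gamma / norm) * Z

Z = np.array([[3.0, 0.0], [0.0, 4.0]])   # ||Z||_F = 5
V = prox_frob(Z, 2.0)                     # shrinks the norm from 5 to 3
```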

COBRA is very intuitive. At the m-th iteration it maintains two estimates of the means matrix, along with two auxiliary matrices that encode the discrepancies between these estimates. We alternate between clustering the rows of one discrepancy-corrected estimate and the columns of the other. The following result guarantees that both sequences of estimates converge to the desired solution.

###### Proposition 4.1.

The COBRA iterates in Algorithm 1 converge to the unique global minimizer of the convex biclustering objective (2.1).

Proposition 4.1 not only ensures the algorithmic convergence of Algorithm 1, but it also provides a natural stopping rule: we stop iterating once the discrepancy between the two estimates falls below some tolerance τ. A proof of Proposition 4.1, as well as additional technical details and discussion on DLPA and COBRA, can be found in Web Appendix C.
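COBRA's outer loop is the Dykstra-like proximal algorithm with the two convex clustering proxes plugged in. The sketch below shows the DLPA iteration itself. Since the clustering proxes are not closed-form, we verify it on a stand-in problem (f = g = (λ/2)∥·∥²_F, our choice for testing only) whose minimizer X/(1 + 2λ) is known analytically:

```python
import numpy as np

def dlpa(X, prox_f, prox_g, tol=1e-10, max_iter=1000):
    """Dykstra-like proximal algorithm (Bauschke and Combettes, 2008) for
    min_U 0.5 * ||X - U||_F^2 + f(U) + g(U).  In COBRA, prox_f and prox_g
    would be the row- and column-wise convex clustering proximal maps."""
    Y = X.copy()
    P = np.zeros_like(X)   # discrepancy variables
    Q = np.zeros_like(X)
    for _ in range(max_iter):
        Z = prox_f(Y + P)
        P = Y + P - Z
        Y = prox_g(Z + Q)
        Q = Z + Q - Y
        if np.linalg.norm(Y - Z) < tol:   # the two estimates have (numerically) met
            break
    return Y

lam = 1.0
prox_quad = lambda Z: Z / (1.0 + lam)   # closed-form prox of (lam/2)||.||_F^2
X = np.arange(6.0).reshape(2, 3)
U = dlpa(X, prox_quad, prox_quad)       # should approach X / 3
```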

The advantage of using DLPA is that COBRA is agnostic to the actual algorithm used to solve the proximal mapping. This is advantageous since we cannot compute prox_{γΩ_W} analytically. In this paper we use the alternating minimization algorithm (AMA) introduced in Chi and Lange (2015) to solve the convex clustering problem. The algorithm performs projected gradient ascent on the Lagrangian dual problem. Its main advantage is that it requires computational work and storage that is linear in the size of the data matrix, when we use the sparse Gaussian kernel weights described in Web Appendix A. A second advantage is that hard clustering assignments are trivially obtained from variables employed in the splitting method. Nonetheless, the DLPA framework makes it trivial to swap in more efficient solvers that may become available in the future.

## 5 Model Selection

Estimating the number of clusters or biclusters in a data set is a major challenge. With many existing biclustering methods, this is further exacerbated by the many tuning parameters that must be selected and the fact that biclustering assignments do not always change smoothly with the number of biclusters. For example, the sparse SVD method (Lee et al., 2010) requires three tuning parameters: two parameters controlling the sparsity of the left and right singular vectors of the sparse SVD and one controlling its rank. Furthermore, selecting the number of biclusters and other tuning parameters can be a major computational burden for large data sets. For example, the sparse biclustering method (Tan and Witten, 2014) uses cross-validation to select the number of row and column partitions. This can be time consuming if a large range of possible numbers of row and column partitions is explored. In contrast, COBRA has a single parameter γ that controls both the number of biclusters and the bicluster assignments; moreover, the number of biclusters and the assignments vary smoothly with γ.

### 5.1 Hold-Out Validation

We present a simple but principled approach to selecting γ in a data-driven manner by posing the model selection problem as another convex program. We randomly select a hold-out set of elements in the data matrix and assess the quality of a model by how well it predicts the hold-out set. This idea was first proposed by Wold (1978) for model selection in principal component analysis and has been used more recently to select tuning parameters for matrix completion problems (Mazumder, Hastie and Tibshirani, 2010). Denote the hold-out index pairs by Θ, and let |Θ| denote the cardinality of the set Θ. We may select a relatively small fraction of the elements, say 10%, for validation. Denote the projection operator onto the set of indices Θ by P_Θ: the ij-th entry of P_Θ(X) is x_ij if (i, j) ∈ Θ and is zero otherwise. We then solve the following convex optimization problem

 min_U F̃_γ(U) := ½∥P_Θᶜ(X) − P_Θᶜ(U)∥²_F + γJ(U) (5.1)

for a sequence of values of γ. Recall that we denote the minimizer of F̃_γ by U_γ. We choose the γ that minimizes the prediction error over the hold-out set Θ, namely ∥P_Θ(X) − P_Θ(U_γ)∥²_F.
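The selection rule is easy to express in code. In the sketch below, `solver` stands in for a fitting routine on the observed entries; we plug in a trivial shrink-toward-the-mean solver purely so the example runs end to end (it is not COBRA):

```python
import numpy as np

def select_gamma(X, mask, solver, gammas):
    """Pick the gamma minimizing squared prediction error on the hold-out
    entries Theta (mask == True).  solver(X_obs, gamma) sees a matrix with
    the held-out entries zeroed, mimicking P_{Theta^c}(X)."""
    X_obs = np.where(mask, 0.0, X)
    errors = [np.sum((X - solver(X_obs, g))[mask] ** 2) for g in gammas]
    return gammas[int(np.argmin(errors))]

# toy solver: blend the observed matrix with its grand mean (illustration only)
toy = lambda Xo, g: (1.0 - g) * Xo + g * Xo.mean()

X = 5.0 * np.ones((4, 4))                # constant signal
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = mask[1, 2] = True           # hold out two entries
best = select_gamma(X, mask, toy, [0.0, 0.5, 1.0])
```

Here full shrinkage to the grand mean of the observed entries predicts the held-out constant entries best, so the largest γ in the grid wins.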

### 5.2 Solving the Hold-Out Problem

The problem defined in (5.1) can be seen as a convex matrix completion problem. Algorithm 2 summarizes a simple procedure for reducing the problem of minimizing F̃_γ to solving a sequence of complete biclustering problems (2.1). The solution from the previous iteration is used to fill in the missing entries in the current iteration; COBRA is then applied to the completed data matrix. This approach is identical to the soft-impute approach of Mazumder, Hastie and Tibshirani (2010) for solving the matrix completion problem, with our fusion penalty J(U) in place of the nuclear norm penalty. The similarity is not a coincidence, as both procedures are instances of a majorization-minimization (MM) algorithm (Lange, Hunter and Yang, 2000) which apply the same majorization to the smooth quadratic term. We defer details on this connection to Web Appendix D. Algorithm 2 has the following convergence guarantee.
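The impute-then-refit pattern can be sketched in a few lines; here `fit` stands in for COBRA applied to a complete matrix (the stand-in below simply shrinks toward the grand mean so the example is self-contained):

```python
import numpy as np

def impute_and_fit(X, mask, fit, n_iter=100):
    """Alternate between (1) filling the held-out entries (mask == True)
    with the current fitted values and (2) refitting on the completed
    matrix -- the soft-impute pattern of Mazumder et al. (2010), with our
    fitting step in place of singular value thresholding."""
    U = np.where(mask, X[~mask].mean(), X)   # warm start: grand mean of observed cells
    for _ in range(n_iter):
        U = fit(np.where(mask, U, X))        # impute, then solve the complete problem
    return U

fit = lambda Z: 0.5 * (Z + Z.mean())         # stand-in smoother, illustration only
X = np.arange(9.0).reshape(3, 3)
mask = np.zeros((3, 3), dtype=bool)
mask[2, 2] = True                            # one held-out cell
U = impute_and_fit(X, mask, fit)
```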

###### Proposition 5.1.

The limit points of the sequence of iterates of Algorithm 2 are solutions to (5.1).

Thus, we have turned the model selection problem of selecting both the number of biclusters and bicluster assignments into a principled convex program with strong convergence guarantees.

## 6 Simulation Studies

We compare COBRA with two other biclustering methods that also assume an underlying checkerboard mean structure. The first is the clustered dendrogram; hard biclustering assignments are made for the clustered dendrogram using the widely used dynamic tree cutting algorithm (Langfelder, Zhang and Horvath, 2008) implemented in the R package dynamicTreeCut. The second is the sparse biclustering method (Tan and Witten, 2014) implemented in the R package sparseBC. All parameters were selected according to the methods provided in these R packages.

Assessing the quality of a clustering is almost as hard as the clustering problem itself, as evidenced by the plethora of quantitative measures for comparing how similar two clusterings are. In this paper, we use the following three measures: the Rand index (RI), the adjusted Rand index (ARI), and the variation of information (VI). We included the RI (Rand, 1971), since it is one of the most widely used criteria for comparing partitions; it maps a pair of partitions to a number between 0 and 1, where 1 indicates perfect agreement between two partitions. Despite its popularity, the RI has some limitations (see Web Appendix E). Consequently, we also use the ARI (Hubert and Arabie, 1985), which was engineered to address deficiencies in the RI. Like the RI, the ARI takes a maximum value of 1 if the clustering solution is identical to the true structure, and it takes a value close to 0 if the clustering result is obtained from random partitioning. Finally, we compared clustering results using the VI (Meilă, 2007). Unlike the RI and ARI, the VI is a metric. Consequently, under the VI we can speak rigorously about a neighborhood around a given clustering; in fact, the nearest neighbors of a clustering under the VI metric are the clusterings obtained by splitting or merging small clusters in it. While not as popular as the RI and ARI, the VI is perhaps the most appropriate measure for comparing two hierarchical clusterings. As a metric, the VI takes a minimum value of 0 when there is perfect agreement between two partitions. Definitions of these criteria are given in Web Appendix E.
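For reference, the RI has a compact definition in code (a minimal implementation of our own; the ARI and VI definitions are in Web Appendix E):

```python
from itertools import combinations

def rand_index(a, b):
    """Rand index between two partitions given as label sequences: the
    fraction of item pairs on which the partitions agree, i.e. the pair
    is grouped together in both or separated in both."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# identical partitions score 1 even when the label names differ
ri_same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ri_diff = rand_index([0, 1, 0, 1], [0, 0, 1, 1])
```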

To simulate data with a checkerboard mean pattern, we consider the partition of the data matrix into the set of biclusters induced by taking the cross-product of the row and column groups. For all experiments, we used a single fixed weight assignment rule (see Web Appendix A); therefore COBRA selects a single tuning parameter γ by validation. We perform computations in serial on a multi-core computer with 24 3.3 GHz Intel Xeon processors and 189 GB of RAM. Note that run times may vary depending on the input parameters chosen. For example, COBRA estimates can be computed to higher accuracy at greater computational cost, and sparse biclustering can explore a wider range of candidate row and column clusters at greater computational cost. To be fair, we did not cherry-pick these parameters but picked reasonable values and used them throughout our numerical experiments. For example, in sparse biclustering when there are 8 true row clusters and 8 true column clusters, we set the range of row and column clusters to be 1 to 12.

Finally, we also considered two variants of COBRA that employ standard refinements on the Lasso: the adaptive Lasso (Zou, 2006) and the thresholded Lasso (Meinshausen and Yu, 2009). These refinements address the well known issue that the Lasso tends to select too many variables. Thus, we anticipate that COBRA estimates may identify too many biclusters. Consequently, we also compared the performance of an adaptive COBRA and thresholded COBRA in our study. Details on these refinements are in Web Appendix G.

We compare COBRA, the clustered dendrogram, and the sparse biclustering (spBC) algorithm of Tan and Witten (2014) on their abilities to recover a checkerboard pattern. Again, for the clustered dendrogram, row and column assignments are made with the dynamic tree cutting (DCT) method (Langfelder, Zhang and Horvath, 2008). We simulate a data matrix with a checkerboard bicluster structure, where the noise terms ε_ij are iid N(0, σ²) and each bicluster mean μ_rc is drawn uniformly from 25 equally spaced values between −6 and 6. We consider a high signal-to-noise ratio (SNR) situation, where the minimum difference among bicluster means (0.5) is comparable to the noise, and a low SNR one, where the minimum difference is dominated by the noise. To assess the performance as the number of column and row clusters is varied, we generated data using 16, 32, and 64 biclusters, corresponding to 2, 4, and 8 row groups crossed with 8 column groups, respectively. Since typical clusters will not be equal in size, rows and columns are assigned to each of the groups randomly according to a non-uniform distribution: the probability that a row is assigned to the r-th group is inversely proportional to r, and columns are assigned to groups analogously.
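The simulation design can be sketched as follows (the function name and sampling helpers are ours; the design constants match the text):

```python
import numpy as np

def simulate_checkerboard(p, n, R, C, sigma, rng):
    """Draw a p x n matrix with a checkerboard mean: row (column) group r
    is chosen with probability proportional to 1/r, each bicluster mean is
    drawn uniformly from the 25 equally spaced values in [-6, 6], and iid
    N(0, sigma^2) noise is added."""
    pr = 1.0 / np.arange(1, R + 1)
    pc = 1.0 / np.arange(1, C + 1)
    rows = rng.choice(R, size=p, p=pr / pr.sum())
    cols = rng.choice(C, size=n, p=pc / pc.sum())
    mu = rng.choice(np.linspace(-6, 6, 25), size=(R, C))
    X = mu[np.ix_(rows, cols)] + sigma * rng.normal(size=(p, n))
    return X, rows, cols

X, rows, cols = simulate_checkerboard(40, 60, 4, 8, 1.0, np.random.default_rng(1))
```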

We computed the RI, ARI, and VI between the true biclusters and those obtained by the COBRA variants, DCT, and spBC. Table 1 reports the average RI, ARI, and VI over 50 replicates, as well as the average number of recovered biclusters, the true number of biclusters, and run times. In the high SNR scenario, all methods do well across all measures. In the low SNR scenario, the COBRA variants often perform as well as or better than the other two methods. While DCT is significantly faster than all other methods, it also performs the worst at recovering the true biclusters in the low SNR scenario. The run times for COBRA indicate that it is computationally competitive with alternative biclustering solutions.
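For reference, the Rand index used in these comparisons can be computed from a contingency table of the two labelings; for biclusterings, each cell of the matrix is labeled by its (row group, column group) pair. The following is a generic sketch, not the authors' implementation:

```python
import numpy as np

def rand_index(a, b):
    """Rand index between two flat label vectors a and b."""
    a, b = np.asarray(a), np.asarray(b)
    n = a.size
    # Map arbitrary labels to 0-based codes and tally co-occurrences.
    ia = np.unique(a, return_inverse=True)[1]
    ib = np.unique(b, return_inverse=True)[1]
    cont = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(cont, (ia, ib), 1)
    pairs = lambda m: (m * (m - 1) / 2).sum()
    total = n * (n - 1) / 2
    together = pairs(cont)                           # pairs joined in both labelings
    ra, rb = pairs(cont.sum(axis=1)), pairs(cont.sum(axis=0))
    # Agreements = pairs together in both + pairs apart in both.
    return (total + 2 * together - ra - rb) / total
```

The ARI applies a chance correction to the same contingency counts, and the VI is an information-theoretic distance computed from the same table.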

In closing our discussion on simulations, we reiterate that COBRA is designed to recover checkerboard patterns. While checkerboard patterns feature prominently in a range of applications, we also acknowledge that they are not universal. Nonetheless, by examining both the estimated biclusters and bicluster means, COBRA can potentially identify the correct biclusters even when the checkerboard assumption is violated. We discuss how COBRA can accomplish this in more detail with a case study in Web Appendix F.

## 7 Application to Genomics

To illustrate COBRA in action on a real example, we revisit the lung cancer data studied by Lee et al. (2010). We have selected the 500 genes with the greatest variance from the original collection of 12,625 genes. Subjects belong to one of four subgroups: they are either normal subjects (Normal) or have been diagnosed with one of three types of cancer: pulmonary carcinoid tumors (Carcinoid), colon metastases (Colon), or small cell carcinoma (Small Cell).

We first illustrate how the solution evolves as $\gamma$ varies. Figure 2 shows snapshots of the COBRA solution path for this data set as the parameter $\gamma$ increases. The path captures the whole range of behavior between under-smoothed estimates of the mean structure (small $\gamma$), where each cell is assigned its own bicluster, and over-smoothed estimates (large $\gamma$), where all cells belong to a single bicluster. In between these extremes, we see rows and columns “fusing” together as $\gamma$ increases. Thus we have visual confirmation that minimizing (.4) over a range of $\gamma$ yields a convex formulation of the clustered dendrogram.

While generating the entire solution path enables us to visualize the hierarchical relationships between biclusterings for different $\gamma$, we may ultimately require a hard biclustering assignment. By applying the validation procedure described in Section 5, we arrive at the smoothed mean estimate shown previously in Figure (b).

We next conduct a simulation experiment based on the lung cancer data to test the stability and reproducibility of biclustering methods, qualities critical for real scientific analyses. To expedite computation in these experiments, we restrict our attention to the 150 genes with the highest variance. We first apply the biclustering methods to the original data to obtain baseline biclusterings. We then add iid $N(0, \sigma^2)$ noise to create a perturbed data set on which to apply the same set of methods. We compute the RI, ARI, and VI between the baseline clustering and the one obtained on the perturbed data. Table 4 shows the average RI, ARI, and VI over 50 replicates as well as run times. For all values of $\sigma$, we see that the two COBRA variants tend to produce the most stable and reproducible results. The fact that the ARI scores are poor for plain COBRA while the RI and VI scores are good indicates that COBRA tends to shrink the same sets of rows and columns together even if it fails to fuse them together consistently. Again, the run time results indicate that COBRA is computationally competitive. For completeness, results from an identical stability study for methods that do not assume a checkerboard pattern can be found in Web Appendix H.

## 8 Discussion

Our proposed method for biclustering, COBRA, can be considered a principled reformulation of the clustered dendrogram. Unlike the clustered dendrogram, COBRA returns a unique global minimizer of a goodness-of-fit criterion, but like the clustered dendrogram, COBRA is simple to interpret. COBRA also sports two key improvements over existing biclustering methods. First, it is more stable: COBRA biclustering assignments on perturbations of the data agree noticeably more frequently than those of existing biclustering algorithms. Second, it admits an effective and efficient model selection procedure for selecting the number of biclusters that reduces the problem to solving a sequence of convex biclustering problems. The upshot of these two qualities is that COBRA produces results that are both simple to interpret and reproducible.

The simplicity of our means model is also its greatest weakness, since we consider only checkerboard patterns, namely we assign each observation to exactly one bicluster and do not consider overlapping biclusters (Cheng and Church, 2000; Lazzeroni and Owen, 2002; Shabalin et al., 2009). Nonetheless, while models that allow for overlapping biclusters might be more flexible, they are also harder to interpret.

While our simulation studies demonstrated the effectiveness of COBRA, there is room for improvement. We highlight an intriguing suggestion made during the review of this article. In many real-world applications there is no “true” fixed number of biclusters. Instead, the underlying latent structure may be a continuum of biclusters at different scales of row and column aggregation. Indeed, COBRA has the potential to estimate a multiscale model of the data. When the weights are uniform, all columns (rows) are averaged together. When the weights are positive only among nearest neighbors, only nearest neighboring columns (rows) are averaged together. Thus, by tuning the weights, we can obtain smoothed estimates of the data at different scales of resolution.

The ability to smooth estimates at different scales suggests a connection to computational harmonic analysis. Indeed, Coifman and Gavish (2011) explore the biclustering problem through a wavelet representation of the data matrix. They also seek a representation that is smooth with respect to partitions of the row and column graphs that specify the similarity among the observations and features. A checkerboard mean structure at different scales can be obtained via operations in the wavelet domain, namely by thresholding wavelet coefficients corresponding to different scales. We are currently exploring how to adapt Coifman and Gavish (2011)’s strategy to solve a sequence of COBRA problems at different scales in order to recover a continuum of biclusters.

An R package, called cvxbiclustr, implementing COBRA is available on CRAN.

## Acknowledgements

EC acknowledges support from CIA Postdoctoral Fellowship 2012-12062800003. GA acknowledges support from NSF DMS 1209017, 1264058, and 1317602. RB acknowledges support from ARO MURI W911NF-09-1-0383 and AFOSR grant FA9550-14-1-0088.

## Web Appendix A. Column and Row Weights

Recall that our goal is to minimize the following convex criterion

 F_\gamma(U) \;=\; \frac{1}{2}\|X - U\|_F^2 + \gamma\big[\Omega_W(U) + \Omega_{\widetilde{W}}(U^T)\big] \;=\; \frac{1}{2}\|X - U\|_F^2 + \gamma J(U), \qquad (.4)

where $\Omega_W(U) = \sum_{i<j} w_{ij}\|U_{\cdot i} - U_{\cdot j}\|_2$, $\Omega_{\widetilde{W}}(U^T) = \sum_{i<j} \widetilde{w}_{ij}\|U_{i\cdot} - U_{j\cdot}\|_2$, and $U_{\cdot i}$ ($U_{i\cdot}$) denotes the $i$th column (row) of the matrix $U$. In this work, we use the sparse Gaussian kernel weights proposed in Chi and Lange (2015) for the weights $w_{ij}$ and $\widetilde{w}_{ij}$ that define the terms $\Omega_W(U)$ and $\Omega_{\widetilde{W}}(U^T)$.

We construct the weights in two steps. We describe these steps for computing the column weights; the row weights are computed analogously. We start by computing pre-weights between the $i$th and $j$th columns as the product of two factors, $\hat{w}_{ij} = \iota^k_{ij}\,\exp(-\phi\|X_{\cdot i} - X_{\cdot j}\|_2^2)$. The first factor $\iota^k_{ij}$ is 1 if $j$ is among $i$'s $k$-nearest-neighbors or vice versa and 0 otherwise; it controls the sparsity of the weights. The second factor is a Gaussian kernel that puts greater pressure on similar columns to fuse and less pressure on dissimilar columns to fuse. The nonnegative constant $\phi$ controls the rate at which the pressure to fuse is applied as a function of the distance between columns; the value $\phi = 0$ corresponds to uniform weights. The pre-weights are then normalized to sum to $1/\sqrt{p}$.
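A minimal sketch of this two-step construction follows. The function name is ours, the normalization target of $1/\sqrt{p}$ reflects our reading of the text, and the default values of `k` and `phi` are placeholders rather than the paper's settings.

```python
import numpy as np

def column_weights(X, k=3, phi=0.5):
    """Sparse Gaussian kernel weights between the columns of X.

    Pre-weight: (k-NN indicator, symmetrized) * exp(-phi * squared distance),
    then normalized so the weights over i < j sum to 1/sqrt(p).
    """
    p = X.shape[1]
    # Pairwise squared Euclidean distances between columns.
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    # k-nearest-neighbor mask: 1 if j is among i's k nearest, or vice versa.
    order = np.argsort(d2, axis=1)
    knn = np.zeros((p, p), dtype=bool)
    for i in range(p):
        knn[i, order[i, 1:k + 1]] = True   # position 0 is the column itself
    knn = knn | knn.T
    # Gaussian kernel factor, zeroed outside the neighbor mask.
    W = np.where(knn, np.exp(-phi * d2), 0.0)
    W[np.diag_indices(p)] = 0.0
    # Normalize so sum over i < j equals 1/sqrt(p).
    return W / (W.sum() / 2) / np.sqrt(p)

W = column_weights(np.random.default_rng(3).normal(size=(10, 8)), k=3, phi=0.5)
```

Row weights would be obtained by applying the same function to the transposed matrix, with $1/\sqrt{n}$ as the normalization target.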

For all experiments in the paper, we used the same values of $k$ and $\phi$ for both the row and column weights.

We briefly discuss the rationale behind our weight choice here and refer readers to Chi and Lange (2015) for a more detailed exposition. Chi and Lange (2015) give several examples showing that restricting positive weights to nearest neighbors enhances both computational efficiency and clustering quality. In their examples, they showed that if dense Gaussian kernel weights were used, cluster centroids shrank towards each other as the tuning parameter increased, but no fusions occurred along the path save a single simultaneous fusion of all cluster centroids for a sufficiently large parameter value. Thus, while the two factors defining the weights act similarly, sensible fusions along the solution path could be achieved only by using them together. This is best illustrated by the half-moons example in Chi and Lange (2015).

## Web Appendix B. Proofs of Solution Properties

In this appendix, we give proofs of propositions in Section 3 of our paper.

###### Proposition C.1 (Existence and Uniqueness).

The function $F_\gamma$ defined in (1) has a unique global minimizer.

We first recall a few definitions and concepts useful in optimization (Lange, 2013). A function is coercive if all its sublevel sets are compact. A function $f$ is convex if $f(\alpha u + (1-\alpha)v) \le \alpha f(u) + (1-\alpha)f(v)$ for all $u$ and $v$ in its domain and all $\alpha \in (0,1)$. A function is strictly convex if the inequality is strict whenever $u \ne v$.

###### Proof.

The existence and uniqueness of a global minimizer are immediate consequences of the coerciveness and strict convexity of $F_\gamma$: the quadratic term $\frac{1}{2}\|X - U\|_F^2$ is strictly convex and coercive, and the penalty $\gamma J(U)$ is convex and nonnegative. ∎

###### Proposition C.2 (Continuity).

The solution of (1) is jointly continuous in the data $X$, the regularization parameter $\gamma$, and the weights.

###### Proof.

Without loss of generality, we can absorb the regularization parameter $\gamma$ into the weights $w$. Thus, it suffices to check that the solution is continuous in the variable $\zeta = (X, w)$. It is easy to verify that the following function is jointly continuous in $U$ and $\zeta$,

 f(U, \zeta) \;=\; \frac{1}{2}\|X - U\|_F^2 + J_w(U), \qquad (C.5)

where

 J_w(U) \;=\; \frac{1}{\sqrt{p}}\,\Omega_W(U) + \frac{1}{\sqrt{n}}\,\Omega_{\widetilde{W}}(U^T) \qquad (C.6)

is a convex function of $U$ that is continuous in $\zeta$. Let

 U^\star(\zeta) \;=\; \operatorname*{arg\,min}_U f(U, \zeta). \qquad (C.7)

We proceed with a proof by contradiction. Suppose $U^\star$ is not continuous at a point $\zeta$. Then there exists an $\epsilon > 0$ and a sequence $\zeta^{(m)}$ converging to $\zeta$ such that $\|U^{(m)} - U^\star(\zeta)\|_F \ge \epsilon$ for all $m$, where

 U^{(m)} \;=\; \operatorname*{arg\,min}_U f(U, \zeta^{(m)}). \qquad (C.8)

Note that since $f(U, \zeta)$ is strongly convex in $U$, the minimizers $U^{(m)}$ and $U^\star(\zeta)$ exist and are unique. Without loss of generality we can assume $\|\zeta^{(m)} - \zeta\|_F \le 1$ for all $m$. This fact will be used later in proving the boundedness of the sequence $U^{(m)}$.

Fix an arbitrary point $\widetilde{U}$. If $U^{(m)}$ is a bounded sequence, then we can pass to a convergent subsequence with limit $\bar{U}$. Note that $f(U^{(m)}, \zeta^{(m)}) \le f(\widetilde{U}, \zeta^{(m)})$ for all $m$. Since $f$ is jointly continuous in $U$ and $\zeta$, taking limits gives us the inequality

 f(\bar{U}, \zeta) \;\le\; f(\widetilde{U}, \zeta). \qquad (C.9)

Since $\widetilde{U}$ was selected arbitrarily, it follows that $\bar{U} = U^\star(\zeta)$, which contradicts $\|U^{(m)} - U^\star(\zeta)\|_F \ge \epsilon$. It only remains for us to show that the sequence $U^{(m)}$ is bounded.

Consider the function

 g(U) \;=\; \sup_{\widetilde{\zeta}:\, \|\widetilde{\zeta} - \zeta\|_F \le 1} \frac{1}{2}\|\widetilde{X} - U\|_F^2 + J_{\widetilde{w}}(U). \qquad (C.10)

Note that $g$ is convex, since it is the point-wise supremum of a collection of convex functions. Since each $f(U, \widetilde{\zeta})$ is strongly convex in $U$, it follows that $g$ is also strongly convex and therefore has a unique global minimizer $U^*$ such that $g(U^*) < \infty$. It also follows that

 f(U^{(m)}, \zeta^{(m)}) \;\le\; f(U^*, \zeta^{(m)}) \;\le\; g(U^*) \qquad (C.11)

for all $m$. By the reverse triangle inequality it follows that

 \frac{1}{2}\left(\|U^{(m)}\|_F - \|X^{(m)}\|_F\right)^2 \;\le\; \frac{1}{2}\|U^{(m)} - X^{(m)}\|_F^2 \;\le\; f(U^{(m)}, \zeta^{(m)}). \qquad (C.12)

Combining the inequalities in (C.11) and (C.12), we arrive at the conclusion that

 \frac{1}{2}\left(\|U^{(m)}\|_F - \|X^{(m)}\|_F\right)^2 \;\le\; g(U^*), \qquad (C.13)

for all $m$. Suppose the sequence $U^{(m)}$ is unbounded, namely $\|U^{(m)}\|_F \to \infty$. Since $X^{(m)}$ converges to $X$, the left-hand side must then diverge, contradicting (C.13). Thus the sequence $U^{(m)}$ is bounded. ∎

###### Proposition C.3 (Zeroes of the fusion penalty).

Under Assumption 1, $J(U) = 0$ if and only if $U = \mu \mathbf{1}\mathbf{1}^T$ for some $\mu \in \mathbb{R}$.

###### Proof.

We first show that

 \Omega_W(U) \;=\; \sum_{i<j} w_{ij}\|U_{\cdot i} - U_{\cdot j}\|_2 \qquad (C.14)

is zero if and only if $U_{\cdot i} = U_{\cdot j}$ for all $i$ and $j$, namely all the columns of $U$ are the same. Clearly if the columns of $U$ are the same, then $\Omega_W(U)$ is zero. Suppose that $\Omega_W(U)$ is zero. Then it must be that $w_{ij}\|U_{\cdot i} - U_{\cdot j}\|_2 = 0$ for every pair $i < j$. Consider a pair $(i,j)$ such that $w_{ij} = 0$. By Assumption 1, there exists a path $i \to k \to \cdots \to l \to j$ along which the weights are positive. Let $w$ denote the smallest weight along this path, namely $w = \min\{w_{ik}, \ldots, w_{lj}\} > 0$. By the triangle inequality,

 \|U_{\cdot i} - U_{\cdot j}\|_2 \;\le\; \|U_{\cdot i} - U_{\cdot k}\|_2 + \cdots + \|U_{\cdot l} - U_{\cdot j}\|_2. \qquad (C.15)

We can then conclude that

 w\|U_{\cdot i} - U_{\cdot j}\|_2 \;\le\; \Omega_W(U) = 0. \qquad (C.16)

It follows that $U_{\cdot i} = U_{\cdot j}$, since $w$ is positive. By a similar argument it follows that

 \Omega_{\widetilde{W}}(U^T) \;=\; \sum_{i<j} \widetilde{w}_{ij}\|U_{i\cdot} - U_{j\cdot}\|_2 \qquad (C.17)

is zero if and only if $U_{i\cdot} = U_{j\cdot}$ for all $i$ and $j$, or in other words the rows of $U$ are all the same. Thus, $J(U) = 0$ if and only if $U$ is a constant matrix. ∎

###### Proposition C.4 (Coalescence).

Under Assumption 1, $F_\gamma$ is minimized by the grand mean matrix $\bar{U} = \bar{x}\mathbf{1}\mathbf{1}^T$, where $\bar{x}$ is the average of the entries of $X$, for $\gamma$ sufficiently large.

###### Proof.

We will show that there is a $\gamma_{\max}$ such that for all $\gamma \ge \gamma_{\max}$, the grand mean matrix $\bar{U}$ is the unique global minimizer of the primal objective (.4). We will certify that $\bar{U}$ is the solution to the primal problem by showing that the optimal value of a dual problem, which lower bounds the primal, equals $F_\gamma(\bar{U})$.

Throughout the proof, we will work with the vectorization of matrices, namely the vector obtained by stacking the columns of a matrix on top of each other. We denote the vectorization of a matrix by its corresponding bold lower case letter, namely $\mathbf{u} = \mathrm{vec}(U)$. Thus, we will construct a dual to the following representation of the primal problem,

 \operatorname*{minimize}_{u}\; F_\gamma(u) \;=\; \frac{1}{2}\|x - u\|_2^2 + \gamma J(u). \qquad (C.18)

In order to rewrite the penalty in terms of the vector $u$, we use the identity $\mathrm{vec}(BXC) = (C^T \otimes B)\,\mathrm{vec}(X)$, where $\otimes$ denotes the Kronecker product between two matrices. Thus,

 J(u) \;=\; \sum_{i<j} w_{ij}\|A_{ij}u\|_2 + \sum_{i<j} \widetilde{w}_{ij}\|\widetilde{A}_{ij}u\|_2, \qquad (C.19)

where

 A_{ij} \;=\; (e_i - e_j)^T \otimes I, \qquad (C.20)
 \widetilde{A}_{ij} \;=\; I \otimes (e_i - e_j)^T, \qquad (C.21)

and $e_i$ is the $i$th standard basis vector. To keep things notationally simpler, we have absorbed the normalizations by $1/\sqrt{p}$ and $1/\sqrt{n}$ into the weights $w$ and $\widetilde{w}$.
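The vectorization identity behind (C.20) and (C.21) can be checked numerically; this is an illustrative verification, with small dimensions chosen arbitrarily.

```python
import numpy as np

n, p, i, j = 3, 4, 0, 2
rng = np.random.default_rng(1)
U = rng.normal(size=(n, p))
u = U.reshape(-1, order="F")                        # vec(U): stack columns
e = np.eye(p)
# A_ij = (e_i - e_j)^T (x) I_n extracts a column difference from vec(U).
A_ij = np.kron((e[i] - e[j])[None, :], np.eye(n))
assert np.allclose(A_ij @ u, U[:, i] - U[:, j])
# A~_ij = I_p (x) (e_i - e_j)^T extracts a row difference from vec(U).
f = np.eye(n)
At_ij = np.kron(np.eye(p), (f[i] - f[j])[None, :])
assert np.allclose(At_ij @ u, U[i, :] - U[j, :])
```

Note that for the column difference $e_i, e_j \in \mathbb{R}^p$, while for the row difference they live in $\mathbb{R}^n$, matching the two Kronecker orderings.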

We first introduce some notation in order to write the relevant dual problem to the primal problem (C.18). Note that the column weights can be identified with a column graph of $p$ nodes, where there is an edge between the $i$th and $j$th nodes if and only if $w_{ij} > 0$. The row weights can also be identified with an analogous row graph of $n$ nodes. Let $\mathcal{E}_c$ and $\mathcal{E}_r$ denote the sets of edges in the column and row graphs, and let $E_c$ and $E_r$ denote their respective cardinalities. The edge-incidence matrix $\Phi_c$ of the column graph encodes its connectivity and is defined as

 \phi_{c,li} \;=\; \begin{cases} 1 & \text{if node $i$ is the head of edge $l$,} \\ -1 & \text{if node $i$ is the tail of edge $l$,} \\ 0 & \text{otherwise.} \end{cases} \qquad (C.22)

The row edge-incidence matrix $\Phi_r$ is defined similarly.
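The definition in (C.22) can be sketched directly; the example graph below is arbitrary, and the rank assertion anticipates the incidence-matrix fact used later in the proof.

```python
import numpy as np

def incidence(edges, m):
    """Edge-incidence matrix: row l has +1 at the head and -1 at the tail of edge l."""
    Phi = np.zeros((len(edges), m))
    for l, (head, tail) in enumerate(edges):
        Phi[l, head], Phi[l, tail] = 1.0, -1.0
    return Phi

# A small connected graph on 4 nodes (a path plus one chord).
Phi = incidence([(0, 1), (1, 2), (2, 3), (0, 2)], m=4)
assert np.allclose(Phi @ np.ones(4), 0)    # constant vectors lie in the kernel
assert np.linalg.matrix_rank(Phi) == 3     # rank m - 1 for a connected graph
```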

We begin deriving the dual problem by recalling that norms possess a variational representation in terms of their dual norms, namely

 \|y\| \;=\; \max_{\|z\|_\dagger \le 1} \langle z, y \rangle, \qquad (C.23)

where $\|\cdot\|_\dagger$ is the dual norm of $\|\cdot\|$. Using this fact and working through some tedious algebra, we can rewrite the penalty term in (C.18) compactly as

 \gamma J(u) \;=\; \max_{v \in C_\gamma} \langle A^T v, u \rangle. \qquad (C.24)

The vector $v$ is the concatenation of several vectors, namely

 v \;=\; (v_1^T, \ldots, v_{E_c}^T, \widetilde{v}_1^T, \ldots, \widetilde{v}_{E_r}^T)^T, \qquad (C.25)

where $v_l \in \mathbb{R}^n$ and $\widetilde{v}_l \in \mathbb{R}^p$. The matrix $A$ can be expressed in terms of the row and column edge-incidence matrices, namely

 A \;=\; \begin{pmatrix} \Phi_c \otimes I \\ I \otimes \Phi_r \end{pmatrix}. \qquad (C.26)

Finally, the constraints on the vector $v$ are encoded in the set $C_\gamma = \{v : \|v_l\|_2 \le \gamma w_l \text{ for all } l \in \mathcal{E}_c \text{ and } \|\widetilde{v}_l\|_2 \le \gamma \widetilde{w}_l \text{ for all } l \in \mathcal{E}_r\}$, where $w_l$ ($\widetilde{w}_l$) denotes the weight attached to the $l$th column (row) edge.

Thus, the primal problem (C.18) can be expressed as the following saddle point problem

 \max_{v \in C_\gamma} \min_u\; \frac{1}{2}\|x - u\|_2^2 + \langle A^T v, u \rangle. \qquad (C.27)

By performing the minimization with respect to $u$, whose solution is $u = x - A^T v$, we obtain a dual maximization problem that provides a lower bound on the primal objective,

 \max_{v \in C_\gamma}\; -\frac{1}{2}\|A^T v\|_2^2 + \langle v, Ax \rangle. \qquad (C.28)

For $\gamma$ sufficiently large, the solution to the dual maximization problem coincides with the solution to the unconstrained maximization problem

 \max_{v}\; -\frac{1}{2}\|A^T v\|_2^2 + \langle v, Ax \rangle, \qquad (C.29)

whose solution is $v = (AA^T)^\dagger A x$. Plugging $v$ into the dual objective gives an optimal value of

 \frac{1}{2}\|A^T (AA^T)^\dagger A x\|_2^2, \qquad (C.30)

which we rewrite as

 \frac{1}{2}\|x - [I - A^T (AA^T)^\dagger A]x\|_2^2. \qquad (C.31)

Note that $I - A^T(AA^T)^\dagger A$ is the projection onto the orthogonal complement of the column space of $A^T$, which is equivalent to the null space or kernel of $A$, denoted Ker($A$). We will show shortly that Ker($A$) is the span of the all-ones vector $\mathbf{1}$. Therefore, $[I - A^T(AA^T)^\dagger A]x = \bar{x}\mathbf{1} = \bar{u}$, and the optimal dual value is $\frac{1}{2}\|x - \bar{u}\|_2^2 = F_\gamma(\bar{u})$, since $J(\bar{u}) = 0$.
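The chain of identities from (C.28) to (C.31) can be verified numerically for a generic matrix standing in for the structured $A$; this is an illustrative sanity check, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 10))   # a generic stand-in for the structured matrix A
x = rng.normal(size=10)
# Maximizer of the unconstrained dual (C.29).
v = np.linalg.pinv(A @ A.T) @ A @ x
dual = -0.5 * np.linalg.norm(A.T @ v) ** 2 + v @ (A @ x)
# Projection onto the column space of A^T.
P = A.T @ np.linalg.pinv(A @ A.T) @ A
assert np.isclose(dual, 0.5 * np.linalg.norm(P @ x) ** 2)                      # (C.30)
assert np.isclose(dual, 0.5 * np.linalg.norm(x - (np.eye(10) - P) @ x) ** 2)   # (C.31)
```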

Before showing that Ker($A$) is the span of $\mathbf{1}$, we note that the smallest $\gamma$ such that $v = (AA^T)^\dagger A x$ belongs to $C_\gamma$ is an upper bound on $\gamma_{\max}$.

We now argue that Ker($A$) is the span of $\mathbf{1}$. We rely on the following fact: if $\Phi$ is an incidence matrix of a connected graph with $m$ vertices, then the rank of $\Phi$ is $m - 1$ (see Theorem 7.2 in Chapter 7 of Deo (1974)). According to Assumption 1 in the paper, the column and row graphs are connected; it follows that $\Phi_c$ has rank $p - 1$ and $\Phi_r$ has rank $n - 1$. It follows then that Ker($\Phi_c$) and Ker($\Phi_r$) have dimension one. Furthermore, since each row of $\Phi_c$ and $\Phi_r$ has one $1$ and one $-1$, it follows that Ker($\Phi_c$) $= \mathrm{span}\{\mathbf{1}\}$, and likewise Ker($\Phi_r$) $= \mathrm{span}\{\mathbf{1}\}$. A vector $u \in$ Ker($A$) if and only if $u \in$ Ker($\Phi_c \otimes I$) $\cap$ Ker($I \otimes \Phi_r$).

Recall that if the singular values of a matrix $B$ are $\{\sigma_i\}$ and the singular values of a matrix $C$ are $\{\tau_j\}$, then the singular values of their Kronecker product $B \otimes C$ are $\{\sigma_i \tau_j\}$. It follows then that the rank of $B \otimes C$ is the product of the ranks of $B$ and $C$.
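A quick numerical confirmation of this rank identity, on arbitrary matrices with forced rank deficiencies:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 5)); B[3] = B[0] + B[1]   # force rank(B) = 3
C = rng.normal(size=(3, 3)); C[2] = 2 * C[0]      # force rank(C) = 2
rB, rC = np.linalg.matrix_rank(B), np.linalg.matrix_rank(C)
assert (rB, rC) == (3, 2)
# Singular values of B (x) C are the products sigma_i * tau_j,
# so the number of nonzero singular values multiplies: rank = rB * rC.
assert np.linalg.matrix_rank(np.kron(B, C)) == rB * rC
```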

The above rank property of Kronecker products implies that the dimension of Ker($\Phi_c \otimes I$) equals $np - n(p-1) = n$ and the dimension of Ker($I \otimes \Phi_r$) equals $np - p(n-1) = p$. It is easy to see then that the linearly independent set of vectors $\{\mathbf{1} \otimes e_i\}$, where $\mathbf{1} \in \mathbb{R}^p$ and $e_i \in \mathbb{R}^n$, forms a basis for Ker($\Phi_c \otimes I$). Likewise, the linearly independent set of vectors $\{e_j \otimes \mathbf{1}\}$, where $e_j \in \mathbb{R}^p$ and $\mathbf{1} \in \mathbb{R}^n$, forms a basis for Ker($I \otimes \Phi_r$).

Take an element from Ker($\Phi_c \otimes I$), namely $u = \mathbf{1} \otimes a$, where $a = \sum_i \alpha_i e_i \in \mathbb{R}^n$. We will show that in order for $u \in$ Ker($A$), $a$ must be a multiple of $\mathbf{1}$. Consider the relevant matrix-vector product

 (I \otimes \Phi_r)(\mathbf{1} \otimes a) \;=\; (I \otimes \Phi_r)\,\mathrm{vec}(a\mathbf{1}^T) \;=\; \mathrm{vec}(\Phi_r a \mathbf{1}^T),