Observing a data matrix , the problem of submatrix localization, a.k.a., biclustering or coclustering, consists in localizing one or several submatrices whose entries are ‘unusually large’, or significant in some other prescribed way. Such submatrices may be of special interest, since the unusual entries may indicate some potential association or relationship between the corresponding variables. One important application is in the analysis of gene expression data (Cheng and Church, 2000; Tanay et al., 2002), where the row and column indices stand for the genes and conditions, respectively. See (Madeira and Oliveira, 2004; Charrad and Ahmed, 2011) for a survey and (Prelić et al., 2006) for a comparison of existing methods.
1.1 Submatrix localization
We denote by and the number of rows and columns of the data matrix . We assume for simplicity that there is only one submatrix, of size
, to be localized. We further assume that the entries are independent and normally distributed, and with same (known) variance, as well as homogeneous both inside and outside the anomalous submatrix, or more formally,
where the are IID standard normal distribution, while are the index sets defining anomalous submatrix. Note that and . Here we use the symbol to represent integer set , while is the cardinality of discrete set . The parameter , which is assumed to be strictly positive in this case, quantifies the per-element signal strength within the submatrix.
This parametric setup is used, for example, in (Butucea and Ingster, 2013; Kolar et al., 2011; Butucea et al., 2015). (Butucea and Ingster, 2013) focuses on the simpler problem of detecting the presence of an anomalous submatrix and considers the case where the submatrix size is unknown, while (Kolar et al., 2011; Butucea et al., 2015) focus on localizing the submatrix with full knowledge of its size . In all these papers, submatrices are ‘scanned’ for anomaly. In detail, the scan statistic, with known, is defined as
Localizing the submatrix by the scan statistic is simply returning the row and column index sets achieving the maximum, namely:
This estimator has the ‘minimax’ property under the above assumptions.
Theorem 1 (Butucea et al. (2015) Theorems 2.1, 2.2).
Condition (4) provides a lower bound of the signal strength for the existence of a strongly consistent estimator with prior knowledge of the anomaly size .
We consider here the case when the submatrix size is not known. Our strategy is straightforward: we consider a variant of the scan statistic, different from the one defined in (2), which scans submatrices of different sizes.
Contribution 1 (Multiscale scan statistic and its property).
We propose a multiscale scan statistic as an estimator under the parametric model defined in Section1.1. We prove that under some regularity conditions, this statistic exactly recovers the submatrix with high probability when the signal strength is of the same order of magnitude as that implied by Theorem 1 under the setting where the submatrix size is known a priori.
Scanning, even when the submatrix size is known, is computationally difficult. It is in fact known to be NP-hard (Cheng and Church, 2000, Thm 1). To tackle the problem of computing the scan statistic defined in (2), (Shabalin et al., 2009) develops an alternate maximization algorithm called LAS (for Large Average Submatrix). Based on the LAS algorithm, we develop two algorithms in order to approach our proposed statistic.
We develop iterative algorithms to solve the problem of finding the proposed multiscale scan statistic. One algorithm adapts the idea of (Shabalin et al., 2009) and is a hill-climbing type optimization algorithm. The other is a variant based on the golden section search.
1.2 More related work
There is an active research line focusing on different aspects of submatrix localization. (Arias-Castro and Liu, 2017; Butucea and Ingster, 2013) concentrate on the detection of the existence of such an anomalous submatrix. Besides (Butucea et al., 2015; Kolar et al., 2011), discussed above, (Hajek et al., 2017a) also considers the minimal signal strength needed for accurate localization under symmetric data structure, while (Hajek et al., 2017a, b) develop upper and lower bounds for the existence of weakly consistent estimators. A convex optimization framework for biclustering is proposed and associated algorithms are developed in (Chi et al., 2016). A selective inference framework for quantifying the information contained within a selected submatrix is proposed by (Lee et al., 2015).
The computation issue is attracting an increasing amount of attention from researchers in the field, addressing the tradeoff between statistical power and computational tractability (Ma and Wu, 2015; Chen and Xu, 2016; Cai et al., 2017). In this context, a variety of computationally tractable (i.e., running in polynomial time) methods have been proposed and analyzed, such as methods relying on a semidefinite relaxation (Chen and Xu, 2016), spectral methods (Cai et al., 2017)
, methods relying on message passing heuristics(Hajek et al., 2017b), and more (Brault and Channarond, 2016; Tan and Witten, 2014).
Another closely related research area is network analysis, for example, stochastic blockmodels (Holland et al., 1983). Similar to submatrix localization, there are works that study the problem of detection (Arias-Castro and Verzelen, 2014; Verzelen and Arias-Castro, 2015; Zhang and Zhou, 2016), some that study the task of partitioning the graph (Mossel et al., 2016), for which spectral and SDP methods have been proposed (Chen and Xu, 2016; Abbe et al., 2016; Chaudhuri et al., 2012; McSherry, 2001).
The remaining of this paper is organized as follows. In Section 2 we introduce the multiscale scan statistic. Section 3 introduces the main algorithms for approximating the proposed statistic: a hill-climbing search algorithm and a golden section search algorithm. Some theory is established The theoretical properties of the (exact) scan statistic are stated in Section 4. Section 5 presents the result of some numerical simulations.
2 The multiscale scan statistic
The scan statistic as defined in (2) requires knowledge of the size of the submatrix, which we have denoted . When this is unknown, we use a criterium for comparing sums over submatrices of different sizes. For example, if two potential submatrix sizes and are investigated, we need to determine which one of and is more ‘significant’.
When comparing two normalized scan statistics, especially when the sizes of the scan statistics differ significantly, we need to consider the effect of background noise into the comparison. Inspired by the multiscale testing procedure proposed in (Butucea and Ingster, 2013), we define our multiscale scan statistic as
Naturally, the corresponding estimator for localizing the anomalous submatrix will be a maximizer of (5), namely
3 Two algorithms
3.1 The LAS algorithm
The computation of (7) seems to require the scanning of all, or most, submatrices, and there are too many submatrices for this to be possible. (The total number of submatrices is equal to .) This fact makes the direct computation of the multiscale scan statistic computationally intractable. We also note that the computation of (7) is harder than that of (2), and the latter is proved to be NP-hard by (Cheng and Church, 2000). It seems therefore necessary to resort to approximations.
When the true anomaly size is known, (Shabalin et al., 2009) proposed an iterative hill-climbing algorithm, called LAS, to approximate the statistic in (5); see Algorithm 1. As a hill-climbing algorithm, it may be trapped in local maxima, so that in practice the algorithm is run on several (random) initializations (Butucea and Ingster, 2013; Arias-Castro and Liu, 2017).
3.2 The adaptive LAS algorithm
We modify Algorithm 1 in order to approximate the multiscale scan statistic. In principle, we could run Algorithm 1 on each submatrix size , but although computationally feasible, it remains computationally demanding. Instead, we adopt a more greedy approach. We still rely on alternatively maximizing the objective, over the column index set, then the row index set, etc, until convergence, but the objective is that of (5). See Algorithm 2 for details. In practice, the algorithm is run on multiple random initializations as well.
Note that the initial submatrix size input in Algorithm 2 does not necessarily represent any prior knowledge of the anomaly size, although it can be informed by such prior knowledge.
3.3 A golden section search variant
We propose a variant based on the (two-dimensional) golden section search, a well-known method for finding the maximum of a unimodal function (see (Chang, 2009) for a general survey of the algorithm in dimensions). Define
so that the multiscale scan statistic is the maximum value of :
We simply apply the two-dimensional golden section search (with some modifications for discreteness) to . Details are provided in Algorithm 3.
The hope is that, if the signal strength is reasonably large and our initial bound on the submatrix size, below, is sufficiently accurate, is (with high probability) unimodal on with maximizer . We are not able to support this with a theoretical result, but have performed some numerical experiments, some of them reported in Section 5.3, that seem to support this.
In the algorithm, instead of computing , we instead apply Algorithm 1 to provide an approximation. In the description, is the golden ratio constant, namely, .
We note that the algorithm needs to stop the loop when the search frame is or smaller, for otherwise the loop would not break because of the discrete nature in the index sets (the search frame will not shrink). When this happens, the algorithm switches to an exhaustive search. The initial submatrix size is ideally an upper bound on the actual submatrix size. Its choice can be informed by prior knowledge.
4 Theoretical property
4.1 Gaussian entries
Theorem 2 (Exact recovery).
We note that . By comparing the terms defining with the terms defining (first defined in Theorem 1), it is easy to see that .
The first two terms in the maximum (9) are associated with the row and column structures of the anomaly and data matrix, while the last part is associated with the existence of weak consistent estimators. In fact, (Hajek et al., 2017a, b) show that when are known, the scan statistic (2) is weakly consistent if
(There is weak consistence if , where denotes the set symmetric difference.)
Denote , which identifies the submatrix. Key to the proof is to bound the probability that, for some other submatrix of size ,
The union of such events, as runs through all matrices, is exactly the event where the multiscale scan fails at exact recovery.
We will start by defining some quantities which quantify the closeness between the candidate matrix and true anomaly . For the submatrices in , which is the collection of all submatrices with dimensions , define their affinity in size as
For a specific submatrix with size , such that is a submatrix, define the affinity of coverage as
The proof strategy is that, with probability converging to , we can sequentially rule out the submatrices with low affinity in size and low affinity in coverage. We then analyze those submatrices left behind, which have size and coverage comparable with the true anomaly.
In fact, the introduction of the regularizer in the expression of the multiscale scan statistic is there to eliminate the submatrices with low affinity in size or in coverage. While when the affinities are not to small (bounded from zero by a positive number), the number of candidate matrices is controlled. We measure the probability of (12) when and only differ by one column or one row, and show that this is the largest failure probability among all candidates that have affinities bounded from below. ∎
4.2 An extension to an exponential family
In practice the Gaussian assumption is not satisfied in many situations. For example the entries may only take integer values (as with counting data) or binary data (as with presence-absence matrices in Ecology (Gotelli, 2000)). Therefore it is of importance to extend the result to a distribution family that covers several common data types. Following (Butucea and Ingster, 2013; Arias-Castro et al., 2018; Arias-Castro and Liu, 2017), we consider a one parameter exponential family. Let denote a distribution on the real line with mean zero and variance . Denote
as the moment generating function of, meaning , and suppose that for all in . Here is defined as and could be equal to infinity. For , define
which is a density with respect to .
By selecting appropriately, the distribution family becomes the normal location family, the Poisson family, or the Rademacher family. Note that with fixed, the distribution family is stochastically monotone in (Lehmann and Romano (2005), Lemma 3.4.2), which is to say, for and with and fixed ,
This fact enables us to model the submatrix localization problem with controlling the signal strength. With increasing, the anomaly is more ’anomalous’, making the localization problem relatively easier to solve. The localization problem can now be formalized with as the role of noise, and entries in the anomaly are distributed as . Formally,
Theorem 3 (Lower bound under exponential family assumption).
Proof Sketch of Theorem 3.
The proof is following most of the arguments in the proof of Theorem 1 in (Kolar et al., 2011). The result focuses on the normal location family, but extends to other one-parameter exponential families. We aim to prove the following:
is the Kullback Leibler divergence between two distribution, andare the distributions of data when and , with , respectively. Replace (15) in the proof of Theorem 1 in Kolar et al. (2011) with the above inequality, and the result follows. ∎
This result also partially answers the open problem raised by (Butucea et al., 2015) asking for a lower bound under a general exponential family.
Before we head into evaluating the performance of our proposed estimator (7) applied under the exponential family assumption, we need to re-define the adjusting quantity in (6). Fixing a constant , we re-define as
Theorem 4 (Exact recovery under exponential family).
Assume the model (13), and suppose that (3) and (16) hold. Take and such that and . Suppose we scan over submatrices of size such that and , and that the anomalous submatrix is among these, meaning that and . Suppose is as in Theorem 2. If
with high probability as , this scan returns the anomalous submatrix.
There is a normal approximation underneath which drives the asymptotic behavior of the statistic to be close to that under the normal model. This explains why the result is similar under a general exponential family as it is under the normal family. In fact, the proof of Theorem 4 is otherwise essentially the same as that of Theorem 2.
5 Numerical experiments
5.1 Performance of Proposed Algorithms
We performed some simulation experiments to evaluate the performance of our proposed algorithms. We generate a matrix with independent entries according to (1) as well as (15), and perform the proposed algorithms (Adaptive LAS and Golden Section Search LAS) on the generated data. By permutation invariance, we simply choose and . We then measure the accuracy of an estimator as follows
We compare the proposed algorithms with two computationally tractable algorithms with proved consistency.
Spectral algorithm of (Cai et al., 2017)
. The algorithm computes the singular value decomposition of the data matrix, then applies
-means algorithm on the first left and right singular vectors with number of clusters set to, and clusters the row and column indices according to corresponding singular vector’s clustering results.
Greatest Marginal Gap method of (Brault and Channarond, 2016). The algorithm calculates row and column sums of the data matrix , denoted as and , then sorts the row and column sums, obtaining and . The algorithm clusters rows based on , yielding two sets and . The clustering of columns is analogous.
We consider a balanced setting where is taken to be , and also an unbalanced setting where is taken to be . For each fixed signal strength, the data is independently generated times and the error counts defined by (19) for all four the algorithms are recorded. Three one-parameter exponential models (normal, Poisson, Rademacher) are investigated to illustrate the performance of the algorithms under different distributions. See Figure 1 and Figure 2.
5.1.1 Signal strength
The signal strength is quantified by the parameter of the entries inside the submatrix. Denote the following quantity :
5.1.2 Simulation result
The two proposed algorithms perform similarly under the balanced design, across all three data types. See Figure 1. Adaptive Hill-Climbing has a slightly weaker performance when the signal is weak (around to ), but the two algorithms’ error rates shrink to zero at the same signal strength, showing that the two approximate algorithms return the same result when signal strength goes beyond some threshold, and the result would exactly recover the planted anomaly.
As Figure 2 shows, the two algorithms perform differently under the imbalanced design when the signal is weak, with the Golden Section Search performing better. This is due to the fact that the search space of GSS is smaller, as well as the existence of local maximums when the design is imbalanced (see Figure 4 for an example). However, when considering the successful rate of exact recovery, the signal strength, above which the algorithm’s error rate shrinks to zero is still similar for both algorithms.
It is worth mentioning that in both cases, our proposed methods outperform the other two computationally tractable methods. In the balanced case, both Adaptive Hill-Climbing and Golden Section Search beat the spectral method by a small margin in error rate, while the error rate of Greatest Marginal Gap method is much worse in the examined signal strength region. In the imbalanced case, Adaptive Hill-Climbing has similar performance with spectral method in the weak signal region, while the GSS again outperforms both, showing that the multiscale scan statistic has better discovery power compared to spectral method. Again, the error rate of Greatest Marginal Gap is much larger.
5.2 Computing time
Here we present the computing time for executing the programs described in the previous section. The simulated data is from model (1), with corresponding sizes and signal strength . We generate data under this model times, and each time record the computation time for each of the competing methods.
The spectral method was computationally much more demanding than the other methods. As expected, the clustering algorithm based on the largest marginal gap is the fastest, however our algorithms are not too far (while much more accurate).
5.3 Unimodality issue in Algorithm 3
The golden section search presented in Algorithm 3 requires the function defined in (8) to be unimodal, in order to successfully discover the global maximal. In general, checking the unimodality of a function is often hard and here we present some numerical experiment illustrating the unimodality of the target function (8).
We use two simulated Gaussian datasets. For each combination of , we set the level of signal to , which is just above the signal level such that the search algorithm makes few to zero mistakes. Then for every pair of such that , we calculate the function value according to (8). The scan statistic is calculated by Algorithm 1, and is approximated by . Here we use the fact that when . The simulation is otherwise set as in the previous section, in that we examine both a balanced and an imbalanced setting. In the balanced setup, and . In the imbalanced setup, and .
6 Conclusion and discussion
In this paper we propose a new multiscale scan statistic for localizing the anomalous submatrix inside a large noisy matrix, which does not require prior knowledge on the anomalous submatrix size. We show the signal strength needed for strong consistency, and design two algorithms with good approximating accuracy and computing speed. There are, however, some problems for future work and discussions.
Minimaxity of the multiscale scan statistic: (Butucea et al., 2015) showed a sharp minimax signal bound, as we described in Theorem 1. Can our estimator based on the multiscale scan statistic reach the same minimax bound (which relies on knowledge of the submatrix size)? While we are unable to reduce the constant in Theorem 2 to the bound, we conjecture that our estimator is essentially minimax. Another measurement of accuracy of estimators is weak consistency. (Hajek et al., 2017b, a) shows that (2) is minimax in the weak consistency sense. When is the multiscale scan statistic weak consistent and is it minimax?
Computationally tractable algorithms: There is a growing literature about using computationally tractable methods such as semidefinite programming to approximate NP-hard problems. (for example, (Chen and Xu, 2016) on using SDP on finding the maximal likelihood estimator). Can such an algorithm be designed to approximate with guarantied accuracy the multiscale scan?
Tight minimax bound for exponential family case: As (Butucea et al., 2015) mentioned, minimax theory on the case of exponential family is still open. We give a bound based on (Kolar et al., 2011), but this bound is not tight. Also, although the scan statistic under the exponential family setup is proved to be minimax in testing by (Butucea and Ingster, 2013), the analysis of its localization performance is still open.
The authors thank the reviewers for the helpful feedback, and Jiaqi Guo for help in simulation studies.
- Abbe et al. (2016) Abbe, E., A. S. Bandeira, and G. Hall (2016). Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62(1), 471–487.
- Arias-Castro et al. (2018) Arias-Castro, E., R. M. Castro, E. Tánczos, and M. Wang (2018). Distribution-free detection of structured anomalies: Permutation and rank-based scans. Journal of the American Statistical Association 113(522), 789–801.
Arias-Castro, E. and Y. Liu (2017).
Distribution-free detection of a submatrix.
Journal of Multivariate Analysis156, 29–38.
- Arias-Castro and Verzelen (2014) Arias-Castro, E. and N. Verzelen (2014). Community detection in dense random networks. The Annals of Statistics 42(3), 940–969.
- Brault and Channarond (2016) Brault, V. and A. Channarond (2016). Fast and consistent algorithm for the latent block model. arXiv preprint arXiv:1610.09005.
- Butucea and Ingster (2013) Butucea, C. and Y. I. Ingster (2013). Detection of a sparse submatrix of a high-dimensional noisy matrix. Bernoulli 19(5B), 2652–2688.
- Butucea et al. (2015) Butucea, C., Y. I. Ingster, and I. A. Suslina (2015). Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix. ESAIM: Probability and Statistics 19, 115–134.
- Cai et al. (2017) Cai, T. T., T. Liang, A. Rakhlin, et al. (2017). Computational and statistical boundaries for submatrix localization in a large noisy matrix. The Annals of Statistics 45(4), 1403–1430.
- Chang (2009) Chang, Y.-C. (2009). N-dimension golden section search: Its variants and limitations. In 2009 2nd International Conference on Biomedical Engineering and Informatics, pp. 1–6. IEEE.
Charrad, M. and M. B. Ahmed (2011).
Simultaneous clustering: A survey.
International Conference on Pattern Recognition and Machine Intelligence, pp. 370–375. Springer.
- Chaudhuri et al. (2012) Chaudhuri, K., F. Chung, and A. Tsiatas (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. In Conference on Learning Theory, pp. 1–23.
Chen, Y. and J. Xu (2016).
Statistical-computational tradeoffs in planted problems and submatrix
localization with a growing number of clusters and submatrices.
Journal of Machine Learning Research17(27), 1–57.
- Cheng and Church (2000) Cheng, Y. and G. M. Church (2000). Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 93–103. AAAI Press.
- Chi et al. (2016) Chi, E. C., G. I. Allen, and R. G. Baraniuk (2016). Convex biclustering. Biometrics.
- Gotelli (2000) Gotelli, N. J. (2000). Null model analysis of species co-occurrence patterns. Ecology 81(9), 2606–2621.
- Hajek et al. (2017a) Hajek, B., Y. Wu, and J. Xu (2017a). Information limits for recovering a hidden community. IEEE Transactions on Information Theory 63(8), 4729–4745.
- Hajek et al. (2017b) Hajek, B., Y. Wu, and J. Xu (2017b). Submatrix localization via message passing. The Journal of Machine Learning Research 18(1), 6817–6868.
- Holland et al. (1983) Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). Stochastic blockmodels: First steps. Social networks 5(2), 109–137.
- Kolar et al. (2011) Kolar, M., S. Balakrishnan, A. Rinaldo, and A. Singh (2011). Minimax localization of structural information in large noisy matrices. In Advances in Neural Information Processing Systems, pp. 909–917.
- Lee et al. (2015) Lee, J. D., Y. Sun, and J. E. Taylor (2015). Evaluating the statistical significance of biclusters. In Advances in Neural Information Processing Systems, pp. 1324–1332.
- Lehmann and Romano (2005) Lehmann, E. L. and J. P. Romano (2005). Testing Statistical Hypotheses (3rd ed.). Spring. New York: Springer.
- Ma and Wu (2015) Ma, Z. and Y. Wu (2015). Computational barriers in minimax submatrix detection. The Annals of Statistics 43(3), 1089–1116.
- Madeira and Oliveira (2004) Madeira, S. C. and A. L. Oliveira (2004). Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 1(1), 24–45.
- McSherry (2001) McSherry, F. (2001). Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE symposium on Foundations of Computer Science, pp. 529. IEEE Computer Society.
- Mossel et al. (2016) Mossel, E., J. Neeman, A. Sly, et al. (2016). Consistency thresholds for the planted bisection model. Electronic Journal of Probability 21.
- Prelić et al. (2006) Prelić, A., S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler (2006). A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122–1129.
et al. (2009)
Shabalin, A. A., V. J. Weigman, C. M. Perou, and A. B. Nobel (2009).
Finding large average submatrices in high dimensional data.The Annals of Applied Statistics 3(3), 985–1012.
- Tan and Witten (2014) Tan, K. M. and D. M. Witten (2014). Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics 23(4), 985–1008.
- Tanay et al. (2002) Tanay, A., R. Sharan, and R. Shamir (2002). Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(suppl 1), S136–S144.
- Verzelen and Arias-Castro (2015) Verzelen, N. and E. Arias-Castro (2015). Community detection in sparse random networks. The Annals of Applied Probability 25(6), 3465–3510.
- Zhang and Zhou (2016) Zhang, A. Y. and H. H. Zhou (2016). Minimax rates of community detection in stochastic block models. The Annals of Statistics 44(5), 2252–2280.
Appendix A Simulation details for reproducibility
In the appendix we list some detail setups of our simulation in order to provide essential information for reproducing the simulation results in the paper. All the simulation are done with statistical software R version 3.5.2. The graph is generated by graphing package ggplot2. The random number generator are set with seed by the set.seed function for each single graph. For the further reproduction purposes, all the simulation codes are available at https://github.com/nozoeli/adaptiveBiclustering.
For Algorithm 2, the initial input . The initial row index set is uniformly randomly chosen from . The algorithm is repeated times with I.I.D. chosen , and the result producing the largest multiscale scan statistic is returned.
For Algorithm 3, the input . When using Algorithm 1 to calculate , the initial row index is chosen with rows with the largest row sums across all columns (e.g. input , the initial row index set is made of the indices with top largest row sums), and Algorithm 1 is run once to calculate the scan statistic inside .
a.2 Simulation Setup for Figure 3
The setup of Algorithm 2 during this part of simulation is , with the rest setup the same as the previous section.
The setup of Algorithm 3 is the same as the previous section.
The function used in spectral method is as follows. The singular vector decomposition is using the R function svd, and the -means algorithm is calling R function kmeans with default setup and number of clusters .
a.3 Simulation Setup for Figure 4
The calculation of scan statistic during this part of simulation is repeating Algorithm 1 times with I.I.D. initial row indexes uniformly randomly chosen from , and return the result with the largest entry sum.