In this work, we study the problem of finding structure in multiway data, or tensors, via clustering. Tensors appear frequently in modern scientific and business applications involving complex heterogeneous data. For example, data in a neurogenomics study of brain development consist of a 3-way array of expression level measurements indexed by gene, space, and time (Liu et al., 2017). Other examples of 3-way data arrays consisting of matrices collected over time include email communications (sender, recipient, time) (Papalexakis et al., 2013), online chatroom communications (user, keyword, time) (Acar et al., 2006), bike rentals (source station, destination station, time) (Guigourès et al., 2015), and internet network traffic (source IP, destination IP, time) (Sun et al., 2006). The rise in tensor data has created new challenges both in making predictions, such as in recommender systems (Zheng et al., 2016; Symeonidis, 2016; Symeonidis and Zioupos, 2016; Frolov and Oseledets, 2017; Bi et al.), and in inferring latent structure in multiway data (Acar and Yener, 2009; Anandkumar et al., 2014; Cichocki et al., 2015; Sidiropoulos et al., 2017).
As tensors become increasingly common, the need for a reliable co-clustering method grows more urgent. Prevalent clustering methods, however, mainly focus on vector or matrix-variate data. The goal of vector clustering is to identify subgroups within the vector-variate observations (Ma and Zhong, 2008; Shen and Huang, 2010; Shen et al., 2012; Wang et al., 2013). Biclustering is the extension of clustering to two-way data, where both the observations (rows) and the features (columns) of a data matrix are simultaneously grouped together (Hartigan, 1972; Madeira and Oliveira, 2004; Busygin et al., 2008). In spite of their prevalence, these approaches are not directly applicable to the cluster analysis of general-order tensors. Existing methods for co-clustering general multiway arrays, on the other hand, employ one of three strategies: (i) extensions of spectral clustering to tensors (Wu et al., 2016b), (ii) directly clustering the subarrays along each dimension, or way, of the tensor using k-means or variants thereof (Jegelka et al., 2009), and (iii) low rank tensor decompositions (Sun et al., 2009; Papalexakis et al., 2013; Zhao et al., 2016). While these existing approaches may demonstrate good empirical performance, they have limitations. For instance, the spectral co-clustering method proposed by Wu et al. (2016b) is limited to nonnegative tensors, and the CoTec method proposed by Jegelka et al. (2009), like k-means, requires specifying the number of clusters along each dimension as a tuning parameter. Most importantly, none of the existing methods provide statistical guarantees for recovering an underlying co-clustering structure. There is a conspicuous gap between statistical guarantees and computational efficiency for existing tensor clustering solutions, owing to the non-convex formulations of the works mentioned above.
In this paper, we propose a Convex Co-clustering (CoCo) procedure that solves a convex formulation of the problem of co-clustering a general multiway array. Our proposed CoCo estimator affords the following advantages over existing tensor co-clustering methods.
Under modest assumptions on the data generating process, the CoCo estimator is guaranteed to recover an underlying co-clustering structure with high probability. In particular, we establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon: As the dimensions of the array increase, the CoCo estimator is still consistent even if the number of underlying co-clusters diverges at a rate that approaches the rate at which the number of elements in the tensor sample grows. More importantly, an underlying co-clustering structure can be consistently recovered with even a single tensor sample, which is a typical case in real applications. This phenomenon does not exist in vector or matrix-variate cluster analysis.
The CoCo estimator possesses stability guarantees. In particular, the CoCo estimator is Lipschitz continuous in the data and jointly continuous in the data and its tuning parameter. We emphasize that Lipschitz continuity in the data guarantees that perturbations in the data lead to graceful and commensurate variations in the cluster assignments, and the continuity in the tuning parameter can be leveraged to expedite computation through warm starts.
The CoCo estimator can be iteratively computed with convergence guarantees via an accelerated first order method with storage and per-iteration cost that is linear in the size of the data.
In short, the CoCo estimator comes with (i) statistical guarantees, (ii) practically relevant stability guarantees at all sample sizes, and (iii) a computationally and storage efficient algorithm. The theoretical properties of our CoCo estimator are supported by extensive simulation studies. To demonstrate its business impact, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to help advertising planning.
Our work is related to, but also clearly distinct from, a number of recent developments in cluster analysis. The first related line of research tackles convex clustering (Hocking et al., 2011; Zhu et al., 2014; Chi and Lange, 2015; Chen et al., 2015; Tan and Witten, 2015; Wang et al.; Radchenko and Mukherjee, 2017) and convex biclustering (Chi et al., 2017). These existing methods are not directly applicable to general-order tensors, however. Importantly, our CoCo estimator enjoys a unique “blessing of dimensionality” phenomenon that has not been established in the aforementioned approaches. Moreover, the CoCo estimator is similar in spirit to a recent series of works approximating a noisy observed array with an array that is smooth with respect to some latent organization associated with each dimension of the array (Gavish and Coifman, 2012; Ankenman, 2014; Mishne et al., 2016; Yair et al., 2017). Our proposed CoCo procedure seeks an approximating array that is smooth with respect to a latent clustering along each dimension of the array. As we will see shortly, focusing our attention in this work on the co-clustering model paves the way to the discovery and explicit characterization of new and interesting fundamental behavior in finding intrinsic organization within tensors.
The rest of the paper is organized as follows. In Section 2, we review standard facts and results about tensors that we will use. In Section 3, we introduce our convex formulation of the co-clustering problem. In Section 4, we establish the stability properties and prediction error bounds of the CoCo estimator. In Section 5, we describe the algorithm used to compute the CoCo estimator. In Section 6, we give guidance on how to set and select tuning parameters used in the CoCo estimator in practice. In Section 7, we present simulation results. In Section 8, we discuss the results of applying the CoCo estimator to co-cluster a real data tensor from online advertising. In Section 9, we close with a discussion. The Appendix contains a brief review of the two main tensor decompositions that are discussed in this paper, all technical proofs, as well as additional experiments.
We adopt the terminology and notation used by Kolda and Bader (2009). We call the number of ways or modes of a tensor its order. Vectors are tensors of order one and are denoted by boldface lowercase letters. Matrices are tensors of order two and are denoted by boldface capital letters. Tensors of higher order, namely order three and greater, we denote by boldface Euler script letters. Thus, the order of a tensor equals the number of ways of the corresponding data array. We denote scalars by lowercase letters. Elements of vectors, matrices, and higher-order tensors are denoted by the corresponding lowercase letter with one subscript per mode.
We can extract a subarray of a tensor by fixing a subset of its indices. For example, by fixing the first index of a matrix we extract a row of the matrix, and by fixing the second index we extract a column. We use a colon to indicate all elements of a mode. Fibers are the subarrays of a tensor obtained by fixing all but one of its indices. In the case of a matrix, a mode-1 fiber is a matrix column and a mode-2 fiber is a matrix row. Slices are the two-dimensional subarrays of a tensor obtained by fixing all but two indices. For example, a third-order tensor has three sets of slices, obtained by fixing the first, second, or third index respectively.
2.2 Basic Tensor Operations
It is often convenient to reorder the elements of a multiway array into a matrix or a vector. Reordering a tensor’s elements into a matrix is referred to as matricization, while reordering its elements into a vector is referred to as vectorization. There are many ways to reorder a tensor into a matrix or vector. In this paper, we use a canonical mode-d matricization, in which the mode-d fibers of a tensor become the columns of the resulting matrix. Recall that the column-major vectorization of a matrix stacks the columns of the matrix on top of each other. In this paper, we take the vectorization of a tensor to be the column-major vectorization of its mode-1 matricization. As a shorthand, when the context leaves no ambiguity, we denote the vectorization of a tensor by its boldface lowercase version.
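As a concrete illustration, the mode-d matricization and the corresponding vectorization can be sketched in a few lines of NumPy. The helper names `unfold` and `vec` are ours, not the paper's; the column ordering follows the Kolda and Bader convention (remaining indices cycled in column-major order):

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` matricization: the mode fibers become the columns.

    Columns are ordered by cycling the remaining indices in
    column-major (Fortran) order, as in Kolda and Bader (2009).
    """
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order="F")

def vec(X):
    """Column-major vectorization of the mode-1 matricization of X."""
    return X.reshape(-1, order="F")

X = np.arange(24, dtype=float).reshape(2, 3, 4)
print(unfold(X, 1).shape)                                  # (3, 8)
print(np.isclose(np.linalg.norm(X), np.linalg.norm(vec(X))))  # True
```

The last line checks the claim below that the Frobenius norm of a tensor equals the Euclidean norm of its vectorization.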
The Frobenius norm of a tensor is the natural generalization of the Frobenius norm of a matrix, namely it is the square root of the sum of the squares of all its elements. The Frobenius norm of a tensor is therefore equal to the ℓ2-norm of the vectorization of the tensor.
The d-mode (matrix) product of a tensor with a matrix of conformable size multiplies every mode-d fiber of the tensor by the matrix. The vectorization of the d-mode product can be expressed as a Kronecker product acting on the vectorization of the tensor,

vec(X ×_d A) = (I ⊗ ⋯ ⊗ I ⊗ A ⊗ I ⊗ ⋯ ⊗ I) vec(X),   (1)

where A occupies the d-th Kronecker slot counting from the right, each I is an identity matrix of the appropriate size, and ⊗ denotes the Kronecker product between two matrices. The identity given in (1) generalizes the well known formula for the column-major vectorization of a product of two matrices, namely vec(AB) = (I ⊗ A) vec(B).
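The d-mode product and the Kronecker vectorization identity (1) can be verified numerically. The sketch below (helper names are ours) folds and unfolds with column-major ordering, so the matrix acting on mode d sits in the d-th Kronecker slot counting from the right:

```python
import numpy as np
from functools import reduce

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order="F")

def fold(M, mode, shape):
    shape = list(shape)
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest, order="F"), 0, mode)

def mode_product(X, A, mode):
    """d-mode product: multiply every mode-`mode` fiber of X by A."""
    shape = list(X.shape)
    shape[mode] = A.shape[0]
    return fold(A @ unfold(X, mode), mode, shape)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))
A = rng.standard_normal((5, 3))   # acts on the second mode (0-indexed mode 1)
Y = mode_product(X, A, 1)

# Identity (1): vec(X x_2 A) = (I_4 kron A kron I_2) vec(X)
op = reduce(np.kron, [np.eye(4), A, np.eye(2)])
print(np.allclose(Y.reshape(-1, order="F"), op @ X.reshape(-1, order="F")))  # True
```

Explicitly forming the Kronecker operator is only for checking the identity; in computation one always applies the d-mode product directly.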
3 A Convex Formulation of Co-clustering
We first consider a convex formulation of the co-clustering problem when the data is a 3-way tensor before discussing the natural generalization to higher-order tensors. Our basic assumption is that the observed data tensor is a noisy realization of an underlying tensor that exhibits a checkerbox structure modulo some unknown reordering along each of its modes. Specifically, suppose that there are a fixed number of clusters along each of modes 1, 2, and 3. If an entry of the observed tensor belongs to the co-cluster defined by a given mode-1 group, mode-2 group, and mode-3 group, then we assume that it is given by the mean of that co-cluster plus a noise term.
We will specify a joint distribution on the noise terms later in Section 4.2 in order to derive prediction bounds. Thus, we model the observed tensor as the sum of a mean tensor, whose elements are expanded from a tensor of co-cluster means, and a noise tensor. We can write this expansion explicitly by introducing a membership matrix for each mode, whose (i, k)th element is one if and only if the ith mode-d slice belongs to the kth mode-d cluster. We require that each row of the membership matrix sum to one, to ensure that each of the mode-d slices belongs to exactly one of the mode-d clusters. Then the observed tensor is obtained by expanding the co-cluster means tensor along each mode by these membership matrices and adding the noise tensor; we refer to this model as (2).
Figure 1 illustrates an underlying mean tensor after permuting the slices along each of the modes to reveal a checkerbox structure.
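A small simulation of this checkerbox model is easy to write down. In the sketch below the membership matrices and co-cluster means are generated at random; all variable names are illustrative, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = (6, 8, 4)   # tensor dimensions
K = (2, 3, 2)   # number of clusters along each mode

# Random membership matrices: each row has exactly one 1.
M = [np.eye(K[d])[rng.integers(0, K[d], size=n[d])] for d in range(3)]
C = rng.standard_normal(K)            # tensor of co-cluster means

# Expand the means tensor: U[i,j,k] = C[a,b,c] when slice i is in
# mode-1 group a, slice j is in mode-2 group b, and slice k in group c.
U = np.einsum("abc,ia,jb,kc->ijk", C, M[0], M[1], M[2])
X = U + 0.1 * rng.standard_normal(n)  # observed tensor = mean + noise

labels = [m.argmax(axis=1) for m in M]
print(U[2, 3, 1] == C[labels[0][2], labels[1][3], labels[2][1]])  # True
```

The `einsum` call is exactly the three-fold mode product of the means tensor with the membership matrices.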
The co-clustering model in (2) is the 3-way analogue of the checkerboard mean model often employed in biclustering data matrices (Madeira and Oliveira, 2004; Tan and Witten, 2014; Chi et al., 2017). Moreover, the tensor of co-cluster means corresponds to the tensor of cluster “centers” in the tensor clustering work by Jegelka et al. (2009). The model is complete and exclusive in that each tensor element is assigned to exactly one co-cluster. This is in contrast to models that allow potentially overlapping co-clusters (Lazzeroni and Owen, 2002; Bergmann et al., 2003; Turner et al., 2005; Huang et al., 2008; Witten et al., 2009; Lee et al., 2010; Sill et al., 2011; Papalexakis et al., 2013; Bhar et al., 2015).
Estimating the model in (2) consists of finding (i) the partitions along each mode and (ii) the mean values of each of the co-clusters. Estimating the co-cluster means given the mode clustering assignments is trivial: if the noise terms are iid Gaussian with a common variance, then the maximum likelihood estimate of each co-cluster mean is simply the sample mean of the entries of the observed tensor over the indices belonging to that co-cluster.
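Given cluster labels along each mode, the maximum likelihood co-cluster means are thus blockwise sample averages, as in this illustrative sketch (function name ours):

```python
import numpy as np

def cocluster_means(X, labels, K):
    """Blockwise sample means of a 3-way tensor.

    labels[d] assigns each mode-d slice to one of K[d] groups.
    """
    C = np.zeros(K)
    for a in range(K[0]):
        for b in range(K[1]):
            for c in range(K[2]):
                block = X[np.ix_(np.flatnonzero(labels[0] == a),
                                 np.flatnonzero(labels[1] == b),
                                 np.flatnonzero(labels[2] == c))]
                C[a, b, c] = block.mean()
    return C

# On a noiseless checkerbox tensor the true means are recovered exactly.
labels = [np.array([0, 0, 1]), np.array([0, 1, 1]), np.array([0, 1, 0])]
means = np.arange(8.0).reshape(2, 2, 2)
U = means[np.ix_(labels[0], labels[1], labels[2])]   # expand the means
print(np.allclose(cocluster_means(U, labels, (2, 2, 2)), means))  # True
```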
Finding the partitions, on the other hand, is a combinatorially hard problem. In recent years, however, many combinatorially hard problems that initially appear computationally intractable have been successfully attacked by solving a convex relaxation of the original combinatorial optimization problem. Perhaps the most celebrated convex relaxation is the lasso (Tibshirani, 1996), which simultaneously performs variable selection and parameter estimation for fitting sparse regression models by minimizing a non-smooth convex criterion.
In light of the lasso’s success, we propose to simultaneously identify the partitions along the modes of the observed tensor and estimate the co-cluster means by minimizing the following convex objective function:
By seeking the minimizer of (3), we have cast co-clustering as a signal approximation problem, modeled as a penalized regression, to estimate the true co-cluster means tensor. In the following discussion, we suppress the dependence of the estimator on the tuning parameter when there is no confusion. The quadratic term in (3) quantifies how well the estimate approximates the data, while the regularization term in (3) penalizes deviations away from a checkerbox pattern. The nonnegative regularization parameter tunes the relative emphasis on these two terms. The remaining parameters are nonnegative weights whose purpose will be discussed shortly.
To appreciate how the regularization term steers the minimizer of (3) towards a checkerbox pattern, consider the effect of one of the penalty terms in isolation, say the mode-1 term. When the regularization parameter is zero, the minimum of (3) is attained when the estimate equals the data. As the parameter increases, the mode-1 slices will shrink towards each other and in fact coalesce, due to the non-differentiability of the Frobenius norm at zero. In other words, as the parameter gets larger, the pairwise differences of the mode-1 slices of the estimate become increasingly sparse. Sparsity in these pairwise differences leads to a natural partitioning assignment: two mode-1 slices are assigned to the same mode-1 partition if their difference is zero. Under mild regularity conditions, which we spell out in Section 4, for a sufficiently large parameter all mode-1 slices will be identical and therefore belong to a single cluster. Similar behavior holds for the mode-2 and mode-3 penalty terms.
When the regularization term includes all three penalty terms, pairs of mode-1, mode-2, and mode-3 slices are simultaneously shrunk towards each other and coalesce as the regularization parameter increases. By coupling clustering along each of the modes simultaneously, our formulation explicitly seeks out a solution with a checkerbox mean structure. Moreover, we will show in Section 4 that the procedure produces an entire solution path of checkerbox co-clustering estimates that varies continuously in the regularization parameter. The solution path spans a range of models from the least smoothed model, where the estimate equals the data and each tensor element occupies its own co-cluster, to the most smoothed model, where all the elements of the estimate are identical and all tensor elements belong to a single co-cluster.
The nonnegative weights fine-tune the shrinkage of the slices along each mode. If one pairwise weight is larger than another, there is more pressure for the corresponding pair of slices to fuse as the regularization parameter increases. Thus, each weight quantifies the similarity between a pair of mode-d slices: a very large weight indicates that the two slices are very similar, while a very small weight indicates that they are very dissimilar. These pairwise similarities motivate a graphical view of clustering. For each mode, define a similarity graph in which each slice is a node and an edge connects a pair of slices if and only if the corresponding weight is positive. Figure 2 shows an example of a mode-1 similarity graph, corresponding to a tensor with seven mode-1 slices and positive weights that define the edge set
Given the connectivity of the graph, as the regularization parameter increases, the slices within one connected component will be shrunk towards each other, while the slices within the other connected component are shrunk towards each other. Since a pairwise penalty term contributes nothing when its weight is zero, we can express the penalty terms for each mode as a sum over the edges of its similarity graph.
The graph in Figure 2 makes it readily apparent that the convex objective in (3) separates over the connected components of the similarity graph for the mode-d slices. Consequently, one can solve for the optimal estimate component by component. Without loss of generality, we assume that the weights are such that all the similarity graphs are connected. Further discussion of the weights, along with practical recommendations for specifying them, appears in Section 6.1.
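One common recipe from the convex clustering literature (e.g. Chi and Lange, 2015) builds a sparse similarity graph by keeping only k-nearest-neighbor edges and weighting each surviving edge by a Gaussian kernel of the distance between slices. A hedged sketch — the function and its default parameters are our own choices, not prescriptions from the paper:

```python
import numpy as np

def knn_gaussian_weights(X, mode, k=3, phi=0.5):
    """Sparse similarity weights between the mode-`mode` slices of X.

    w_ij = exp(-phi * ||slice_i - slice_j||_F^2) if j is among the k
    nearest neighbors of i (or vice versa), and 0 otherwise.
    """
    slices = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    diffs = slices[:, None, :] - slices[None, :, :]
    D2 = np.einsum("ijk,ijk->ij", diffs, diffs)   # squared distances
    n = len(slices)
    keep = np.zeros((n, n), dtype=bool)
    order = np.argsort(D2, axis=1)
    for i in range(n):
        keep[i, order[i, 1:k + 1]] = True         # skip self at position 0
    keep |= keep.T                                # symmetrize the edge set
    W = np.where(keep, np.exp(-phi * D2), 0.0)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(2)
W = knn_gaussian_weights(rng.standard_normal((7, 4, 5)), mode=0, k=2)
print(W.shape, np.allclose(W, W.T))   # (7, 7) True
```

The resulting weight matrix is symmetric, nonnegative, and sparse, so the penalty for that mode sums over relatively few edges.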
Having familiarized ourselves with convex co-clustering for a 3-way array, we now present the natural extension of (3) for clustering the fibers of a general higher-order tensor along all its modes. In what follows, standard basis vectors of the appropriate dimension are used to index the fibers. The objective function of our convex co-clustering for a general higher-order tensor is as follows.
The difference between the convex triclustering objective (3) and the general convex co-clustering objective (4) is in the penalty terms. Previously in (3) we penalized the differences between pairs of slices, whereas in (4) we penalize the differences between pairs of mode-d subarrays.
Note that the function defined in (4) has a unique global minimizer. This follows immediately from the fact that the objective is strongly convex. The unique global minimizer of (4) is our proposed CoCo estimator.
At times it will be more convenient to work with vectors rather than tensors. By applying the identity in (1), we can rewrite the objective function in (4) in terms of the vectorizations of the data and the estimate as follows
where the matrix appearing in the penalty is Kronecker-structured, built from identity matrices of the appropriate sizes. We will refer to the unique global minimizer of (5) as the vectorized version of our CoCo estimator.
The fusion penalties are a composition of the group lasso (Yuan and Lin, 2006) and the fused lasso (Tibshirani et al., 2005), a special case of the generalized lasso (Tibshirani and Taylor, 2011). When only a single mode is being clustered and only one of the penalty terms is employed, we recover the objective function of the convex clustering problem (Pelckmans et al., 2005; She, 2010; Lindsten et al., 2011; Hocking et al., 2011; Sharpnack et al., 2012; Zhu et al., 2014; Chi and Lange, 2015; Radchenko and Mukherjee, 2017). Most prior work on convex clustering employs an element-wise ℓ1-norm penalty on pairwise differences, as in the original fused lasso; however, ℓ2-norm and ℓ∞-norm penalties have also been considered (Hocking et al., 2011; Chi and Lange, 2015). When the tensor is a matrix and the rows and columns are being simultaneously clustered, we recover the objective function of the convex biclustering problem (Chi et al., 2017). In general, the fusion penalties shrink solutions towards vector-valued functions that are piecewise constant over the mode-d similarity graphs defined by the weights. Viewed this way, our approach simultaneously performs the network lasso (Hallac et al., 2015) on several similarity graphs.
Given the co-clustering structure assumed in (2), one may wonder how much is gained by explicitly seeking a co-clustering over clustering along each mode independently. In other words, why not solve independent convex clustering problems, one per mode? To provide some intuition on why co-clustering should be preferred over independently clustering each mode, consider the following problem. Imagine trying to cluster a small number of long row vectors drawn from a two-component mixture of Gaussians.
This is a challenging clustering problem due to the disproportionately small number of observations compared to the number of features. If, however, we were told that the features themselves were clustered into two groups, our fortunes would be reversed: we would have an abundance of observations compared to the number of effective features. Even if we lack a clear-cut clustering structure in the features, this example suggests that leveraging similarity structure along the columns can expedite identifying similarity structure along the rows, and vice versa. Indeed, if there is an underlying checkerbox mean tensor, we may expect that simultaneously clustering along each mode should make the task of clustering along any one given mode easier. Our prediction error result presented in Section 4.2 in fact supports this suspicion (see Remark 4.2).
We first discuss how the CoCo estimator behaves as a function of the data tensor, the tuning parameter, and the weights. We will then present its statistical properties under mild conditions on the data generating process. We highlight that these properties hold regardless of the algorithm used to minimize (4), as they are intrinsic to its convex formulation. All proofs are given in Appendix B and Appendix C.
4.1 Stability Properties
The CoCo estimator varies smoothly with respect to the data, the tuning parameter, and the weights.
The minimizer of (4) is jointly continuous in the data, the tuning parameter, and the weights. In practice, we will typically fix the weights and compute the CoCo estimator over a grid of penalization parameters in order to select a final CoCo estimator from among the computed candidate estimators of varying levels of smoothness. Since (4) does not admit a closed form minimizer, we resort to iterative algorithms for computing the CoCo estimator. Continuity of the estimator in the tuning parameter can be leveraged to expedite computation through warm starts, namely using the solution at one parameter value as the initial guess when iteratively computing the solution at a slightly larger or smaller value. Due to this continuity, small changes in the tuning parameter result in small changes in the estimator. Empirically the use of warm starts can lead to a non-trivial reduction in computation time (Chi and Lange, 2015). From this continuity we also see that convex co-clustering performs continuous co-clustering just as the lasso (Tibshirani, 1996) performs continuous variable selection.
The penalization parameter tunes the complexity of the CoCo estimator. Clearly, when the parameter is zero, the CoCo estimator is just the data. The key to understanding the CoCo estimator’s behavior as the parameter increases is to recognize that the penalty functions are semi-norms. Under suitable conditions on the weights, given in the connectedness assumption below, the mode-d penalty vanishes if and only if the mode-d subarrays of its argument are identical.
For any pair of mode-d subarrays, there exists a sequence of indices connecting them along which the weights are positive.
Under Assumption 4.1, the mode-d penalty vanishes if and only if all mode-d subarrays of its argument coincide.
Proposition 4.1 suggests that if connectedness holds for all modes, then as the penalization parameter increases, the CoCo estimator converges to the solution of the following constrained optimization problem:
the solution to which is just the global mean tensor, whose entries are all identically the average value of the data over all its entries. The next result formalizes our intuition that as the penalization parameter increases, the CoCo estimator eventually coincides with this grand mean.
Suppose connectedness holds for every mode; then the objective is minimized by the grand mean for all sufficiently large values of the penalization parameter. Thus, as the parameter increases from 0, the CoCo estimator traces a continuous solution path that starts from the co-clustering in which each tensor element occupies its own co-cluster and ends at a single co-cluster containing all tensor elements.
For a fixed value of the penalization parameter, we can derive an explicit bound on the sensitivity of the CoCo estimator to perturbations in the data. The minimizer of (4) is a nonexpansive, or 1-Lipschitz, function of the data tensor, namely
Nonexpansivity of the estimator in the data provides an attractive stability result. Since the estimator varies smoothly with the data, small perturbations in the data are guaranteed not to lead to large variability in the estimator, or consequently large variability in the cluster assignments. In a special case of our method, Chi et al. (2017) showed empirically that the co-clustering assignments made by the 2-way version of the CoCo estimator were noticeably less sensitive to perturbations in the data than those made by several existing biclustering algorithms.
4.2 Statistical Properties
We next provide a finite sample bound for the prediction error of the CoCo estimator. For simplicity, we consider the case of uniform weights within each mode in (5). Such a uniform-weight assumption has also been imposed in the analysis of the vector version of convex clustering (Tan and Witten, 2015).
In order to derive the estimation error of the CoCo estimator, we first introduce a definition for the noise and two regularity conditions.
[Vu and Wang (2015)] We say a random vector is concentrated if there are constants such that, for any convex, 1-Lipschitz function and any threshold, the probability that the function deviates from its mean by more than the threshold decays exponentially.
Concentration in this sense is more general than Gaussianity or sub-Gaussianity, and it allows dependence among the coordinates. Vu and Wang (2015) provided a few examples of concentrated random vectors. For instance, if the coordinates of a vector are iid standard Gaussian, then the vector is 1-concentrated. If the coordinates are independent and bounded, the vector is also concentrated, as are vectors whose coordinates come from a random walk with certain mixing properties.
Assumption 4.2 (Model)
We assume the true cluster center tensor has a checkerbox structure, with a fixed number of distinct values along each mode (the number of clusters along that mode), and that each entry is bounded in magnitude by a constant. Define the true mean tensor as the true parameter expanded from the cluster centers, namely
where the expansion is by binary mode-d cluster membership matrices whose rows each sum to one. We assume the samples belonging to each co-cluster satisfy
Furthermore, we assume the vectorized noise is a concentrated random vector, in the sense of the definition above, with mean zero.
The checkerbox means model provides the underlying cluster structure of the tensor data. As a special case, the model for a 2-way tensor reduces to the model assumption underlying convex biclustering (Chi et al., 2017). In contrast to the independent sub-Gaussian condition assumed in the vector version of convex clustering (Tan and Witten, 2015), our error condition is much weaker, since we allow for non-sub-Gaussian distributions as well as dependence among the coordinates of the noise.
Assumption 4.3 (Tuning)
The tuning parameter satisfies
for some constant .
Suppose that the model and tuning assumptions hold. Then the estimation error of the CoCo estimator in (5) with uniform weights satisfies the following bound
with high probability, where the bound involves a positive constant and the true number of clusters along each mode.
Theorem 4.2 provides a finite sample error bound for the proposed CoCo tensor estimator. Our theoretical bound allows the number of clusters in each mode to diverge, which reflects a typical large-scale clustering scenario in big tensor data. A notable consequence of Theorem 4.2 is that, for a higher-order tensor with at least 3 modes, the CoCo estimator can achieve estimation consistency along all the modes even when we only have one tensor sample. This property is uniquely enjoyed by co-clustering of tensor data with at least 3 modes, and has not been previously established in the existing literature on vector clustering or biclustering. To see this, suppose the mode dimensions are all of the same order and the numbers of clusters are all of the same order. Then, up to a log term, consistency holds so long as the number of clusters in each mode diverges slightly more slowly than a rate determined by the mode dimensions. Remarkably, as we have more modes in the tensor data, this constraint on the rate of divergence of the number of clusters gets weaker. In short, we reap a unique and surprisingly welcome “blessing of dimensionality” phenomenon in the tensor co-clustering problem.
Next we discuss the connections of our bound (7) with prior results in the literature. An intermediate step in the proof of Theorem 4.2 bounds the estimation error in each mode separately. In clustering along the rows of a data matrix, our rate matches that established for the vector version of convex clustering (Tan and Witten, 2015), up to a log term. This log term is due to the fact that Tan and Witten (2015) consider iid sub-Gaussian errors, while we consider general concentrated errors. In practice, the iid assumption on the noise could be restrictive. Consequently, our theoretical analysis is built upon a new concentration inequality for quadratic forms recently developed in Vu and Wang (2015). In addition, our rate reveals an interesting theoretical property of the convex biclustering method proposed by Chi et al. (2017): in the matrix case, the estimation errors along the rows and columns of the data matrix cannot both converge to zero simultaneously. This indicates a disadvantage of matricizing a data tensor before co-clustering.
5 Estimation Algorithm
We next discuss a simple first order method for computing the solution to the convex co-clustering problem. The proposed algorithm generalizes the variable splitting approach introduced for the convex clustering problem in Chi and Lange (2015) to the CoCo problem. The key observation is that the Lagrangian dual of an equivalent formulation of the convex co-clustering problem is a constrained least squares problem that can be iteratively solved using the classic projected gradient algorithm.
5.1 A Lagrangian Dual of the CoCo Problem
Recall that we seek to minimize the objective function in (5)
Note that we have enumerated the edge indices to simplify the notation in the following derivation.
We perform variable splitting and introduce dummy variables, one for each pairwise difference appearing in the penalty; stacking their vectorizations on top of each other yields a single dummy vector. We now solve the equivalent equality constrained minimization
where the constraint matrix is built from the oriented edge-vertex incidence matrices of the mode-d similarity graphs: each incidence matrix has one row per edge (i, j), with a +1 in column i, a −1 in column j, and zeros elsewhere.
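Constructing such an oriented incidence matrix for a given edge list is straightforward; this sketch (names ours) also checks that each row of the product with a matrix of stacked subarrays gives the difference of the two incident subarrays:

```python
import numpy as np

def incidence_matrix(edges, n):
    """Oriented edge-vertex incidence matrix: row e has +1 at i, -1 at j."""
    Phi = np.zeros((len(edges), n))
    for e, (i, j) in enumerate(edges):
        Phi[e, i], Phi[e, j] = 1.0, -1.0
    return Phi

edges = [(0, 1), (1, 2), (0, 3)]
Phi = incidence_matrix(edges, 4)
U = np.arange(12.0).reshape(4, 3)   # rows = vectorized mode-d subarrays
print(np.allclose(Phi @ U, [U[i] - U[j] for i, j in edges]))  # True
```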
We introduce dual variables corresponding to the equality constraints, one for each dummy variable, and stack their vectorizations into a single dual vector. The Lagrangian dual objective is given by
where the indicator function of a closed convex set is the function that vanishes on the set and is infinite on its complement. Details on the derivation of the dual objective are provided in Appendix D.
Maximizing the dual objective is equivalent to solving the following constrained least squares problem:
We can recover the primal solution via the relationship:
where the dual vector is a solution to the dual problem (8). The dual problem (8) has at least one solution by the Weierstrass extreme value theorem, but the solution may not be unique, since the constraint matrix has a non-trivial kernel. Nonetheless, our CoCo estimator is still unique, since the recovered primal solution is the same for any solution to the dual problem (8).
We numerically solve the constrained least squares problem in (8) with the projected gradient algorithm, which alternates between taking a gradient step and projecting onto the constraint set. Algorithm 1 provides pseudocode for the projected gradient algorithm, which has several good features. The projected gradient algorithm is guaranteed to converge to a global minimizer of (8). Its per-iteration and storage costs, using the weight choices described in Section 6.1, are both linear in the size of the data. For a modest additional computational and storage cost, we can accelerate the projected gradient method, for example with FISTA (Beck and Teboulle, 2009) or SpaRSA (Wright et al., 2009). In our experiments, we use a version of the latter, namely FASTA (Goldstein et al., 2014, 2015). Additional details on the derivation of the algorithmic updates, convergence guarantees, computational and storage costs, as well as stopping rules can be found in Appendix E.
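To make the dual strategy concrete, here is a minimal self-contained sketch for the simplest special case: convex clustering of scalars over a complete graph with an ℓ1 fusion penalty. The dual is a box-constrained least squares problem solved by projected gradient, and the primal is recovered as u = x − Φᵀv. This is our own simplification for illustration, not the paper's full algorithm:

```python
import numpy as np

def convex_cluster_1d(x, gamma, n_iter=500):
    """Convex clustering of scalars via projected gradient on the dual.

    Primal: min_u 0.5*||x - u||^2 + gamma * sum_{i<j} |u_i - u_j|
    Dual:   min_v 0.5*||x - Phi'v||^2  s.t.  |v_e| <= gamma per edge.
    """
    n = len(x)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    Phi = np.zeros((len(edges), n))
    for e, (i, j) in enumerate(edges):
        Phi[e, i], Phi[e, j] = 1.0, -1.0
    v = np.zeros(len(edges))
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2        # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = Phi @ (Phi.T @ v - x)                # gradient step ...
        v = np.clip(v - step * grad, -gamma, gamma) # ... then project onto the box
    return x - Phi.T @ v                            # recover the primal

x = np.array([0.0, 0.1, 5.0, 5.1])
print(convex_cluster_1d(x, gamma=10.0))   # ≈ the grand mean in every entry
```

For a large enough penalty the box constraint is inactive, and the recovered primal collapses to the grand mean of the data, matching the solution path behavior described in Section 4.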
6 Practical Issues
In this section, we address considerations for using the method in practice, namely how to set the weights and how to choose the tuning parameter $\gamma$.
6.1 Specifying the Weights
The first major practical consideration is how to choose the weights $w^{(d)}_{ij}$ in the penalty terms. In Section 4.2, we assumed uniform weights to establish a prediction error bound, which revealed a surprising and beneficial “blessing of dimensionality” phenomenon. Although this simplifying assumption gave clarity and insight into how the co-clustering problem gets easier as the number of modes increases, in practice choosing non-uniform weights can substantially improve the quality of the clustering results. In the context of convex clustering, Chen et al. (2015) and Chi and Lange (2015) provided empirical evidence that uniform weights rarely produce exact sparsity in the pairwise differences of the smoothed estimates except when there is strong separation between groups. Similar phenomena were observed in earlier work on the related clustered lasso (She, 2010). Several related works (She, 2010; Chen et al., 2015; Chi and Lange, 2015) recommend the weight assignment strategy described below. In addition, the use of sparse weights can lead to non-trivial improvements in both computation time and clustering quality (Chi and Lange, 2015; Chi et al., 2017). Finally, non-uniform weights provide a connection between convex clustering and other penalized regression-based clustering methods that use folded-concave penalties (Pan et al., 2013; Marchetti and Zhou, 2014; Wu et al., 2016a). We defer further discussion of this connection to Appendix F.
In contrast to the convex clustering literature (She, 2010; Chen et al., 2015; Chi et al., 2017), which constructed the weights from the original data matrix, we instead construct them from a denoised version of the data tensor. In our preliminary experiments, we found that a low-rank approximation of the data tensor via the Tucker decomposition led to a marked improvement in co-clustering performance. Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. In our simulation studies described in Section 7, we use two different methods for choosing the rank as a robustness check to ensure that our CoCo estimator's performance does not depend crucially on the rank selection method. Details on these two methods can be found in Appendix F. While we found the Tucker decomposition to work well in practice, we suspect that other methods of denoising the tensor may work just as well or could be more effective. We leave exploring alternatives to the Tucker decomposition to future work.
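As an illustration of the denoising step, a truncated higher-order SVD (HOSVD) is one standard way to compute a low-rank Tucker approximation. The sketch below assumes the multilinear ranks are given; it is not the rank-selection procedure used in the paper, and the helper names are our own.

```python
import numpy as np

def unfold(T, mode):
    """Mode-d matricization: rows index the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    moved = np.moveaxis(T, mode, 0)
    out = np.tensordot(M, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd_denoise(X, ranks):
    """Truncated HOSVD: project X onto the leading subspace of each mode."""
    factors = []
    for d in range(X.ndim):
        U, _, _ = np.linalg.svd(unfold(X, d), full_matrices=False)
        factors.append(U[:, :ranks[d]])
    core = X
    for d, U in enumerate(factors):
        core = mode_multiply(core, U.T, d)   # compress to the core tensor
    Xhat = core
    for d, U in enumerate(factors):
        Xhat = mode_multiply(Xhat, U, d)     # expand back to the full size
    return Xhat
```

For a tensor that is exactly low rank, the truncated HOSVD reproduces it; for noisy data it acts as a multilinear smoother whose strength is controlled by the chosen ranks.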
Once the tensor is denoised, the weights are computed in two steps. We first calculate pre-weights between the $i$th and $j$th mode-$d$ subarrays as
\[
\tilde{w}^{(d)}_{ij} \;=\; \iota^{(d)}_{ij}\, \exp\!\big(-\phi\, \big\|\widetilde{x}^{(d)}_i - \widetilde{x}^{(d)}_j\big\|_2^2\big), \tag{9}
\]
where $\widetilde{x}^{(d)}_i$ denotes the vectorized $i$th mode-$d$ subarray of the denoised tensor.
The first term in equation (9), $\iota^{(d)}_{ij}$, is an indicator function that equals 1 if the $i$th slice is among the $j$th slice's $k$-nearest neighbors (or vice versa) and 0 otherwise. The purpose of this term is to control the sparsity of the weights. The corresponding tuning parameter $k$ influences the connectivity of the mode-$d$ similarity graph. One can explore different levels of granularity in the clustering by varying $k$ (Chen et al., 2015). As a default, one can use the smallest $k$ such that the similarity graph is still connected. Note that it is not necessary to calculate the exact $k$-nearest neighbors, which scales quadratically in the number of fibers in the mode. A fast approximation to the $k$-nearest neighbors is sufficient for the sake of inducing sparsity in the weights.
The second term in equation (9) is the Gaussian kernel, which takes larger values for pairs of mode-$d$ subarrays that are more similar to each other. Intuitively, the weights should be inversely proportional to the distance between the $i$th and $j$th mode-$d$ subarrays (Chen et al., 2015; Chi et al., 2017). The inverse of the nonnegative parameter $\phi$ is a measure of scale. In practice, we can set it to the median Euclidean distance between mode-$d$ subarrays that are $k$-nearest neighbors of each other. A value of $\phi = 0$ corresponds to uniform weights. Note that with minor modification, we can allow the inverse scale parameter to be pair dependent, as described in Zelnik-Manor and Perona (2005).
To obtain the mode-$d$ weights $w^{(d)}_{ij}$, we normalize the mode-$d$ pre-weights so that the weights within each mode sum to a common constant. The normalization step puts the penalty terms on the same scale and ensures that clustering along any single mode will not dominate the entire co-clustering as $\gamma$ increases.
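The two steps above can be sketched as follows. The helper name `mode_weights`, the median-distance heuristic for $1/\phi$, and normalizing each mode's weights to sum to one are illustrative choices, not the paper's exact defaults.

```python
import numpy as np

def mode_weights(Xd, k, normalize_to=1.0):
    """Sparse Gaussian-kernel weights for one mode.

    Xd: (n, p) matrix whose rows are the vectorized mode-d subarrays of the
    denoised tensor. Returns an (n, n) symmetric weight matrix.
    """
    n = Xd.shape[0]
    D2 = np.sum((Xd[:, None, :] - Xd[None, :, :]) ** 2, axis=2)  # squared dists
    order = np.argsort(D2, axis=1)
    knn = np.zeros((n, n), dtype=bool)
    rows = np.arange(n)[:, None]
    knn[rows, order[:, 1:k + 1]] = True        # k nearest, skipping self
    knn = knn | knn.T                          # "or vice versa" symmetrization
    d2_nn = D2[knn]
    # median heuristic: 1/phi = median distance among k-nearest-neighbor pairs
    phi = 1.0 / np.median(np.sqrt(d2_nn)) if d2_nn.size else 0.0
    W = np.where(knn, np.exp(-phi * D2), 0.0)  # indicator times Gaussian kernel
    return normalize_to * W / W.sum()          # put modes on a common scale
```

Pairs outside the symmetrized $k$-nearest-neighbor graph receive weight exactly zero, which is what makes the penalty sparse and the solver cheap.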
6.2 Selecting the Tuning Parameter
The second major practical consideration is how to choose the tuning parameter $\gamma$ to produce a final co-clustering result. Since co-clustering is an exploratory method, it may be suitable for a user to manually inspect a sequence of CoCo estimators over a range of $\gamma$ values and use domain knowledge tied to a specific application to select a co-clustering assignment of the desired complexity. Since this approach is time consuming and requires expert knowledge, an automated, data-driven procedure for selecting $\gamma$ is desirable. Cross-validation (Stone, 1974; Geisser, 1975) and stability selection (Meinshausen and Bühlmann, 2010) are popular techniques for tuning parameter selection, but since both methods are based on resampling, they are unattractive in the tensor setting due to their computational burden. We instead turn to the extended Bayesian Information Criterion (eBIC) proposed by Chen and Chen (2008, 2012), as it does not rely on resampling and thus is not as computationally costly as cross-validation or stability selection.
For the CoCo estimator $\hat{u}_\gamma$, the criterion takes the form
\[
\mathrm{eBIC}(\gamma) \;=\; \log\!\left(\frac{\mathrm{RSS}(\gamma)}{n}\right) \;+\; \frac{\log n}{n}\,\mathrm{df}(\gamma),
\]
where $\mathrm{RSS}(\gamma) = \|x - \hat{u}_\gamma\|_2^2$ is the residual sum of squares, $n$ is the number of tensor elements, and $\mathrm{df}(\gamma)$ is the degrees of freedom for a particular value of $\gamma$. We use the number of co-clusters in the CoCo estimator as an estimate of $\mathrm{df}(\gamma)$, which is consistent with the spirit of degrees of freedom since each co-cluster mean is an estimated parameter. This criterion balances model fit against model complexity, and similar versions have been commonly employed for tuning parameter selection in tensor data analysis (Zhou et al., 2013; Sun et al., 2017).
The eBIC is calculated on a grid of values $\Gamma = \{\gamma_1, \ldots, \gamma_M\}$, and we select the optimal value, denoted $\hat{\gamma}$, which corresponds to the smallest value of the eBIC over the grid, namely
\[
\hat{\gamma} \;=\; \operatorname*{arg\,min}_{\gamma \in \Gamma}\; \mathrm{eBIC}(\gamma).
\]
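Operationally, the grid search reduces to evaluating the criterion at each candidate and taking the argmin. In the sketch below, the BIC-type penalty $\log(n)\,\mathrm{df}/n$ and the function name `ebic_select` are stand-ins; the residual sums of squares and degrees of freedom (numbers of co-clusters) are assumed to come from fits at each grid point.

```python
import numpy as np

def ebic_select(gammas, rss, df, n):
    """Select gamma minimizing a BIC-type criterion: log(RSS/n) + df*log(n)/n.

    rss[m] and df[m] are the residual sum of squares and estimated degrees of
    freedom of the fit at gammas[m]; n is the number of tensor elements.
    """
    crit = np.log(np.asarray(rss, dtype=float) / n) \
         + np.asarray(df, dtype=float) * np.log(n) / n
    best = int(np.argmin(crit))
    return gammas[best], crit
```

A fit with many co-clusters must lower the residual sum of squares enough to pay for its complexity penalty; otherwise a coarser co-clustering is selected.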
7 Simulation Studies
To investigate the performance of the CoCo estimator in identifying co-clusters in tensor data, we first explore some simulated examples. We compare our CoCo estimator to a $k$-means based approach that is similar in spirit to various tensor generalizations of the spectral clustering method common in the tensor clustering literature (Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Wu et al., 2016b). We refer to this method as CPD+$k$-means, since it first performs a rank-$R$ CP decomposition on the $D$-way data tensor to reduce the dimensionality of the problem, and then independently applies $k$-means clustering to the rows of each factor matrix of the resulting CP decomposition. The $k$-means algorithm has also been used to cluster the factor matrices resulting from a Tucker decomposition (Acar et al., 2006; Sun et al., 2006; Kolda and Sun, 2008; Sun et al., 2009; Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Cao et al., 2015; Oh et al., 2017). We also considered this method in initial experiments, but its performance was inferior to that of CPD+$k$-means, so we only report results using the CP decomposition.
Both our CoCo estimator and CPD+$k$-means have tuning parameters that need to be set. For the rank of the CP decomposition, we use the tuning procedure in Sun et al. (2017) to select the rank automatically. A CP decomposition is then performed using the chosen rank, and its factor matrices are the input to the $k$-means algorithm. A well known drawback of $k$-means is that the number of clusters $k$ needs to be specified a priori. Several methods for selecting $k$ have been proposed in the literature, and we use the “gap statistic” developed by Tibshirani et al. (2001) to select an optimal $k$ from a set of candidate values. Since CoCo estimates an entire solution path of mode-$d$ clustering results, ranging from $n_d$ clusters down to a single cluster along mode $d$, we consider a rather large set of possible $k$ values to make the methods more comparable.
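The clustering step of the baseline can be sketched as follows, assuming the CP factor matrices and per-mode cluster counts are already available. A basic Lloyd's algorithm stands in here for a production $k$-means with gap-statistic selection; the helper names are our own.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each row to its nearest center, then recompute the centers
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def cpd_kmeans(factors, ks):
    """CPD+k-means: independently cluster the rows of each CP factor matrix."""
    return [kmeans(F, k) for F, k in zip(factors, ks)]
```

Note that the modes are clustered independently of one another, which is precisely the structural difference from CoCo, where the fusion penalties couple all modes in a single objective.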
As described in Section 6.1, we employ a Tucker approximation to the data tensor in constructing the weights. In computing the Tucker decomposition, we used one of two methods for selecting the rank. In the plots within this section, TD1 denotes the results where the Tucker rank was chosen using the SCORE algorithm (Yokota et al., 2017), while TD2 denotes results where the rank was chosen using a heuristic. A detailed discussion of these two methods is given in Appendix F.
To assess the quality of the clustering performance, we consider two measures commonly used in the clustering literature: (i) the Adjusted Rand Index (ARI) and (ii) the Variation of Information (VI). The ARI (Hubert and Arabie, 1985) varies between -1 and 1, where 1 indicates a perfect match between two clustering assignments, whereas a value close to zero indicates that the two clustering assignments match about as well as might be expected if they were both randomly generated. Negative values indicate that there is less agreement between clusterings than expected from random partitions. The VI is a metric and therefore takes a minimum value of 0 when there is perfect agreement between two clustering assignments (Meilă, 2007). The VI is particularly well suited for assessing the similarity between hierarchical clustering assignments: the nearest neighbors of a clustering with respect to the VI metric are clusterings that result from the splitting or merging of its small clusters.
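For reference, the ARI can be computed directly from the contingency table of two label vectors; the sketch below follows the standard Hubert–Arabie formula.

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs, elementwise: x choose 2."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two cluster label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    ua, ia = np.unique(a, return_inverse=True)
    ub, ib = np.unique(b, return_inverse=True)
    cont = np.zeros((len(ua), len(ub)))
    np.add.at(cont, (ia, ib), 1)                 # contingency table
    sum_comb = comb2(cont).sum()                 # pairs together in both
    sum_a = comb2(cont.sum(axis=1)).sum()        # pairs together in a
    sum_b = comb2(cont.sum(axis=0)).sum()        # pairs together in b
    expected = sum_a * sum_b / comb2(len(a))     # chance agreement
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_comb - expected) / (max_index - expected)
```

Because the index is adjusted for chance, relabeling the clusters has no effect, and a perfect match scores exactly 1 regardless of the label names.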
The results presented in this section report the average CoCo estimator performance across 200 simulated replicates. All simulations were performed in Matlab using the Tensor Toolbox (Bader et al., 2015). All the following plots, except the heatmaps in Figure 11, were made using the open source R package ggplot2 (Wickham, 2009).
7.1 Cubical Tensors, Checkerbox Pattern
For the first and main simulation setting, we study clustering data in a cubical tensor generated by a basic checkerbox mean model. Each entry in the observed data tensor is generated according to the underlying model (2) with independent errors. Unless specified otherwise, there are two true clusters along each mode, for a total of eight underlying co-clusters.
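A checkerbox mean tensor with additive noise can be simulated as below. The contiguous block structure, the cluster-mean scale, and the helper name `checkerbox_tensor` are illustrative choices, not the paper's exact simulation protocol.

```python
import numpy as np

def checkerbox_tensor(shape, clusters_per_mode, noise_sd, seed=0):
    """Generate a tensor whose mean follows a checkerbox co-cluster pattern.

    Each mode's indices are split into contiguous clusters of roughly equal
    size; every co-cluster gets its own mean, and i.i.d. Gaussian noise is
    added on top. Returns (observed tensor, mean tensor, mode labels).
    """
    rng = np.random.default_rng(seed)
    labels = [np.minimum(np.arange(n) * k // n, k - 1)
              for n, k in zip(shape, clusters_per_mode)]
    means = rng.normal(scale=2.0, size=clusters_per_mode)  # one mean per co-cluster
    U = means[np.ix_(*labels)]                             # broadcast to full tensor
    return U + rng.normal(scale=noise_sd, size=shape), U, labels
```

With two clusters per mode on a 3-way tensor, the mean tensor takes at most eight distinct values, one per underlying co-cluster, matching the default setting described above.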