Data in the form of multidimensional array, or tensor, are now frequently arising in a diverse range of scientific and business applications (Zhu et al., 2007; Liu et al., 2013; Zhou et al., 2013; Ding and Cook, 2015). Particularly, for a large class of tensor data, time is one of the tensor modes, and this class is often termed as dynamic tensor. Examples of dynamic tensor data are becoming ubiquitous and its analysis are receiving increasing attention. For instance, in online advertising, Bruce et al. (2016) studied consumer engagement on advertisements over time to capture the temporal dynamics of users behavior. The resulting data is a three-mode tensor of user by advertisement by time. In molecular biology, Seigal et al. (2016)
studied time-course measurements of the activation levels of multiple pathways from genetically diverse breast cancer cell lines after exposure to numerous growth factors with different dose. The data is a five-mode tensor of cell line by growth factor by pathway by dose by time. In neurogenomics,Liu et al. (2017) modeled spatial temporal patterns of gene expression during brain development, and the data is a three-mode tensor of gene by brain region by time. In Section 6, we illustrate our method on a brain dynamic connectivity analysis, where the goal is to understand interaction of distinct brain regions and their dynamic pattern over time. One form of the data in this context is a three-mode tensor of region by region by time.
. Directly applying a clustering algorithm to the vectorized tensor data is a simple solution, but it often suffers from poor clustering accuracy, and induces heavy and sometimes intractable computation. There have been a number of proposals for clustering of tensor data, or the two-mode special case, matrix data. One class of such methods focused on biclustering that simultaneously group rows (observations) and columns (features) of the data matrix(Huang et al., 2009; Lee et al., 2010; Chi and Lange, 2015; Chi and Allen, 2017)
. The proposed solutions were based on sparse singular value decomposition, or a reformulation of biclustering as a penalized regression with some convex penalties. However, the dynamic property remained largely untapped in those solutions. The second class of approaches directly targeted biclustering of time-varying matrix data(Hocking et al., 2011; Ji et al., 2012; Li et al., 2015). Nevertheless, those solutions were designed specifically for matrix-valued data. Extension to a general-order tensor is far from trivial. The third class tackled clustering of tensor data through some forms of penalization (Cao et al., 2013; Wu et al., 2016). But none of those approaches incorporated the dynamic information in the data and would inevitably lead to loss in clustering accuracy. Moreover, there is no statistical guarantee provided in the performance of these tensor clustering algorithms. Tensor decomposition is a crucial component in handling tensor-valued data, and is to play a central role in our proposed tensor clustering solution as well. There have been a number of recent proposals of convex relaxation of tensor decomposition through various norms (Romera-Paredes and Pontil, 2013; Yuan and Zhang, 2017, 2016; Zhang, 2018). However, all those methods focused on low-rank tensor recovery, and none incorporated any sparsity or fusion structure in the decomposition. Recently, Sun et al. (2017) proposed a low-rank decomposition with a truncation operator for hard thresholding. Although sparsity was considered, they did not consider fusion structure for a dynamic tensor. Ignoring this structure, as we show later, would induce large estimation errors in both tensor decomposition and subsequent clustering. In summary, there exists no clustering solution with statistical guarantee that handles a general-order dynamic tensor and incorporates both sparsity and fusion structures.
In this article, we aim to bridge this gap by proposing a dynamic tensor clustering method, which takes into account both sparsity and fusion structures, and enjoys strong statistical guarantee as well as high computational efficiency. Our proposal makes multiple contributions to the clustering literature. First, our clustering method is built upon a newly proposed structured tensor factorization approach, which encourages both sparsity and smoothness in the decomposed components, and in turn captures the dynamic nature of the tensor data. We show how structured tensor factorization can be used to infer the clustering structure. Interestingly, we find that tensor Gaussian mixture model can be viewed as a special case of our clustering method. Second, our proposal is computationally efficient. This is partly achieved by substantial dimension reduction resulting from the imposed structure; in the illustrative example in Section6, the number of free parameters was reduced from about two millions to one thousand. Our optimization algorithm can be decomposed as an unconstrained tensor decomposition step followed by a constrained optimization step. We show that, the overall computational complexity of our constrained solution is comparable to that of the unconstrained one. Third, and probably most importantly, we establish rigorous theoretical guarantee for our proposed dynamic tensor clustering solution. Specifically, we first establish a non-asymptotic error bound for the estimator from the proposed structured tensor factorization. Based on this error bound, we then obtain the rate of convergence of the estimated cluster centers from our dynamic tensor clustering solution, and prove that the estimated clusters recover the true cluster structures with high probability. It is also noteworthy that we allow the number of clusters to grow with the sample size. Such consistency results are new in the tensor clustering literature. From a technical perspective, the fusion structure we consider introduces some additional challenges in the theoretical analysis, since the resulting truncated fusion operator is non-convex and the observed tensor is usually noisy with an unknown error distribution. To address such challenges, we develop a set of non-asymptotic techniques to carefully evaluate the estimation error in each iteration of our alternating updating algorithm. We also utilize a series of large deviation bounds to show that our estimation error replies on the error term only through its sparse spectral norm, which largely relieves the uncertainty from the unknown error distribution. Last but not least, although our algorithm mostly focuses on clustering along a single mode of tensor data, the same approach can be easily applied to co-clustering along multiple tensor modes. This is different from classical clustering methods, where an extension from clustering to bi-clustering generally requires different optimization formulations (see, e.g., Chi and Allen, 2017). In contrast, our clustering method naturally incorporates single and multi-mode clustering without requiring any additional modification.
The rest of the article is organized as follows. Section 2 introduces the proposed dynamic tensor clustering method, and Section 3 presents its solution through structured tensor factorization. Section 4 establishes the estimation error bound of the structured tensor factorization and the consistency properties of dynamic tensor clustering. Section 5 presents the simulations, and Section 6 illustrates with a brain dynamic functional connectivity analysis. The Supplementary Materials collect all technical proofs and additional simulations.
2.1 Clustering via tensor factorization
Given copies of -way tensors, , our goal is to uncover the underlying cluster structures of these samples. That is, we seek the true cluster assignment,
where is the number of clusters and . Here, for ease of presentation, we assume an equal number of samples per cluster. blackIn Section S.9 of the Supplementary Materials, we report some numerical results with unequal cluster sizes.
To cluster those tensor samples, we first stack them into a -way tensor, . We comment that, in principle, one can cluster along any single or multiple modes of . Without loss of generality, we focus our discussion on clustering along the last mode of , and only briefly comment on the scenario that clusters along multiple modes. This formulation covers a variety of scenarios encountered in our illustrative example of brain dynamic connectivity analysis. For instance, in one scenario, , , and the goal is to cluster individuals, each with a tensor that represents the brain connectivity pattern among brain regions over sliding time windows. In another scenario, , , and the goal is to cluster sliding windows for a single subject. In the third scenario, , , where is the number of subjects in group , and the goal becomes clustering moving windows for all subjects in that group.
Our key idea is to consider a structured decomposition of , then apply a usual clustering algorithm, e.g., -means, to the matrix from the decomposition that corresponds to the last mode to obtain the cluster assignment. Assume that the tensor is observed with noise,
where is an error tensor, and is the true tensor with a rank- CANDECOMP/PARAFAC (CP) decomposition structure (Kolda and Bader, 2009),
where , , , , denotes the vector norm and is the vector outer product. For ease of notation, we define . blackWe have chosen the CP decomposition due to its relatively simple formulation and its competitive empirical performance. It has been widely used for link prediction (Dunlavy et al., 2011), community detection (Anandkumar et al., 2014a), recommendation systems (Bi et al., 2017)
, and convolutional neural network speeding-up(Lebedev et al., 2015).
Given the structure in (2), it is straightforward to see that, the cluster structure of samples along the last mode of the tensor is fully determined by the matrix that stacks the decomposition components, . We denote this matrix as , which can be written as
where , , indicates the cluster assignment. Figure 1 shows a schematic illustration of our tensor clustering proposal. We comment that, if the goal is to cluster along more than one tensor mode, one only needs to apply a clustering algorithm to the matrix formed by each of those modes separately.
Accordingly, the true cluster means of the tensor samples can be written as,
The structure in (4) reveals the key underlying assumption of our tensor clustering solution. That is, we assume each cluster mean is a linear combination of the outer product of rank-1 basis tensors, and all the cluster means share the same basis tensors. We recognize that this assumption introduces an additional constraint. However, it leads to substantial dimension reduction, which in turn enables efficient estimation and inference in subsequent analysis. As we show in Section 2.3, the tensor Gaussian mixture model can be viewed as a special case of our clustering structure. Moreover, as our numerical study has found, this imposed structure provides a reasonable approximation in real data applications.
For comparison, we consider the alternative solution that applies clustering directly on the vectorized version of tensor data. It does not require (4), and the corresponding number of free parameters is in the order of . In the example in Section 6, , and that amounts to parameters. Imposing (4), however, would reduce the number of free parameters to ; again, for the aforementioned example, and that amounts to parameters. Such a substantial reduction in dimensionality is crucial for both computation and inference in tensor data analysis. Moreover, we use a simple simulation example to demonstrate that, our clustering method, which assumes (4) and thus exploits the underlying structure of the tensor data, can not only reduce the dimensionality and the computational cost, but also improve the clustering accuracy. Specifically, we follow the example in Section 5.2 to generate tensor samples of dimension from 4 clusters, with samples 1 to 25, 26 to 50, 51 to 75, 76 to 100 belonging to clusters 1 to 4, respectively. Figure 2 shows the heatmap of the vectorized data (left panel), which is of dimension , and the heatmap of the data with reduced rank (right panel), which is of dimension . It is clearly seen from this plot that, our clustering method based on the reduced data under (4) is able to fully recover the four underlying clusters, while the clustering method based on the vectorized data cannot.
2.2 Sparsity and fusion structures
Motivated from the brain dynamic functional connectivity analysis, in addition to the CP low-rank structure (2), we also impose the sparsity and smoothness fusion structures in tensor decomposition to capture the sparsity and dynamic properties of the tensor samples. Specifically, we impose the following structures in the parameter space,
where , denotes the vector norm, and , whose th row has and on its th and th positions and zero elsewhere. Combining these two structures with the CP decomposition of in (2), we consider
Here for simplicity, we assume the maximum of the sparsity parameter and the fusion parameter are the same across different rank , and we denote and . This can be easily extended to a more general case where these parameters vary with . To encourage such sparse and fused components, we propose to solve the following penalized optimization problem,
for some cardinality parameters . Here denotes the vector norm, i.e., the number of nonzero entries, and denotes the tensor Frobenius norm, which is defined as for a tensor . The optimization formulation (5) encourages sparsity in the individual components via a direct constraint, and encourages smoothness via a fused lasso penalty. blackWe make a few remarks. First, one can easily choose which mode to impose which constraints, by modifying the penalty functions in (5) accordingly. See Section 3.2 for some specific examples in brain dynamic connectivity analysis where different constraints along different tensor modes are imposed. Second, our problem is related to a recently proposed tensor decomposition method with generalized lasso penalties in Madrid-Padilla and Scott (2017). However, the two proposals differ in several ways. We use the
truncation to achieve sparsity, whereas they used the lasso penalty. It is known that former yields an unbiased estimator, while the latter leads to a biased one in high-dimensional estimation(Shen et al., 2012b; Zhu et al., 2014)
. Moreover, our rate of convergence of the parameter estimators are established under a general error tensor, whereas theirs required the error tensor to be Gaussian. Third, our method also differs from the structured sparse principal components analysis ofJenatton et al. (2010)
. The latter seeks the factors that maximumly explain the variance of the data while respecting some structural constraints, and it assumes the group structure is known a priori. By contrast, our method aims to identify a low-rank representation of the tensor data, and does not require any prior structure information but learns the group structure adaptively given the data.
2.3 A special case: tensor Gaussian mixture model
In model (1), no distributional assumption is imposed on the error tensor . In this section, we show that, if one further assumes that is a standard Gaussian tensor, then our method reduces to a tensor version of Gaussian mixture model.
An -way tensor
is said to follow a tensor normal distribution(Kolda and Bader, 2009) with mean and covariance matrices, , denoted as , if and only if , where , vec denotes the tensor vectorization, and denotes the matrix Kronecker product. Following the usual definition of Gaussian mixture model, we say is drawn from a tensor Gaussian mixture model, if its density function is of the form, , where is the mixture weight, and . We next consider two special cases of our general tensor clustering model (1).
First, when the error tensor in (1) is a standard Gaussian tensor, our clustering model is equivalent to assuming the tensor-valued samples , follow the above tensor Gaussian mixture model. That is,
where is a identity matrix. Thus the tensor samples are drawn from a tensor Gaussian mixture model with the, .
Second, when the error tensor in (1) is a general Gaussian tensor, our model is equivalent to assuming the samples follow
where is a general covariance matrix, .
blackIn our tensor clustering solution, we have chosen not to estimate those covariance matrices. This may lose some estimation efficiency, but it greatly simplifies both the computation and the theoretical analysis. Meanwhile, the methodology we develop can be extended to incorporate general covariances in a straightforward fashion, where the tensor cluster means and the covariance matrices can be estimated using a high-dimensional EM algorithm (Hao et al., 2018), in which the M-step solves a penalized weighted least squares. We do not pursue this line of research in this article.
3.1 Optimization algorithm
We first introduce some operators for achieving the sparsity and fusion structures of a given dense vector. We then present our optimization algorithm.
The first operator is a truncation operator to obtain the sparse structure. For a vector and a scaler , we define as
where refers to the set of indices of corresponding to its largest absolute values. The second is a fusion operator. For a vector and a fusion parameter , the fused vector is obtained via the fused lasso (Tibshirani et al., 2005); i.e.,
An efficient ADMM-based algorithm for this fused lasso has been developed in Zhu (2017). The third operator is a combination of the truncation and fusion operators. For and parameters , , we define as
Lastly, we denote as the normalization operator on a vector .
We propose a structured tensor factorization procedure in Algorithm 1. It consists of three major steps. Specifically, the step in (7) essentially obtains an unconstrained tensor decomposition, and is done through the classical tensor power method (Anandkumar et al., 2014b). Here, for a tensor and a set of vectors , the multilinear combination of the tensor entries is defined as . It is then followed by our new Truncatefuse step in (8) to generate a sparse and fused component. Finally, the step in (9) normalizes the component to ensure the unit-norm. These three steps form a full update cycle of one decomposition component. The algorithm then updates all components in an alternating fashion, until some termination condition is satisfied. In our implementation, the algorithm terminates if the total number of iterations exceeds 20, or .
In terms of computational complexity, we note that, the complexity of operation in (7) is , while the complexity in (8) consists of for the truncation operation and for the fusion operation (Tibshirani and Taylor, 2011). blackTherefore, the total complexity of Algorithm 1 is , by noting that . It is interesting to note that, when the tensor order and the dimension along each tensor mode is of a similar value, the complexity of the sparsity and fusion operation in (8) is to be dominated by that in (7). Consequently, our addition of the sparsity and fusion structures does not substantially increase the overall complexity of the tensor decomposition.
Based upon the structured tensor factorization, we next summarize in Algorithm 2 our proposed dynamic tensor clustering procedure illustrated in Figure 1. It is noted that Step 2 of Algorithm 2 utilizes the structured tensor factorization to obtain a reduced data
, whose columns consist of all information of the original samples that are relevant to clustering. This avoids the curse of dimensionality through substantial dimension reduction. We also note that Step 2 provides a reduced data along each mode of the original tensor, and hence it is straightforward to achieve co-clustering of any single or multiple tensor modes. This is different from the classical clustering methods, where extension from clustering to bi-clustering generally requires different optimization formulations(see, e.g., Chi and Allen, 2017).
3.2 Application example: brain dynamic connectivity analysis
Our proposed dynamic tensor clustering approach applies to many different applications involving dynamic tensor. Here we consider one specific application, brain dynamic connectivity analysis. Brain functional connectivity describes interaction and synchronization of distinct brain regions, and is characterized by a network, with nodes representing regions, and links measuring pairwise dependency between regions. This dependency is frequently quantified by Pearson correlation coefficient, and the resulting connectivity network is a region by region correlation matrix (Fornito et al., 2013). Traditionally, it is assumed that connectivity networks do not change over time when a subject is resting during the scan. However, there is growing evidence suggesting the contrary that networks are not static but dynamic over the time course of scan (Hutchison et al., 2013). To capture such dynamic changes, a common practice is to introduce sliding and overlapping time windows, compute the correlation network within each window, then apply a clustering method to identify distinct states of connectivity patterns over time (Allen et al., 2014).
There are several scenarios within this context, which share some similar characteristics. The first scenario is when we aim to cluster individual subjects, each represented by a tensor that describes the brain connectivity pattern among brain regions over sliding and overlapping time windows. Here denotes the sample size, the number of brain regions, and the total number of moving windows. In this case, and . It is natural to encourage sparsity in the first two modes of , and thus to improve interpretability and identification of connectivities among important brain regions. Meanwhile, it is equally intuitive to encourage smoothness along the time mode of , since the connectivity patterns among the adjacent and overlapping time windows are indeed highly correlated and similar. Toward that end, we propose the following structured tensor factorization based clustering solution, by considering the minimization problem,
where and are the cardinality and fusion parameters, respectively. This optimization problem can be solved via Algorithm 1. Given that the connectivity matrix is symmetric, i.e., is symmetric in its first two modes, we can easily incorporate such a symmetry constraint by setting in Algorithm 1.
Another scenario is to cluster sliding windows for a single subject, so to examine if the connectivity pattern is dynamic or not over time. Here and . In this case, we consider the following minimization problem,
If one is to examine the dynamic connectivity pattern for subjects simultaneously, then . Both problems can be solved by dynamic tensor clustering in Algorithm 2.
Our clustering algorithm involves a number of tuning parameters. To facilitate the computation, we propose a two-step tuning procedure. We first choose the rank , the sparsity parameters , and the fusion parameters , by minimizing
refers to the total degrees of freedom, and is estimated by, in which the degrees of freedom of an individual component is defined as the number of unique non-zero elements in . For simplicity, we set and . The selection criterion (10) balances model fitting and model complexity, and the criterion of a similar form has been widely used in model selection (Wang, 2015). Next, we tune the number of clusters by employing a well established gap statistic (Tibshirani et al., 2001). The gap statistic selects the best as the minimal one such that , where
is the standard error corresponding to theth gap statistic. In the literature, stability-based methods (Wang, 2010; Wang and Fang, 2013) have also been proposed that can consistently select the number of clusters. We choose the gap statistic due to its computational simplicity.
4.1 Theory with
We first develop the theory for the rank case in this section, then for the general rank case in the next section. We begin with the derivation of the rate of convergence of the general structured tensor factorization estimator from Algorithm 1. We then establish the clustering consistency of our proposed dynamic tensor clustering in Algorithm 2. The difficulties in the technical analysis lie in the non-convexity of the tensor decomposition and the incorporation of the sparsity and fusion structures.
Recall that we observe the tensor with noise as specified in (1). To quantify the noise level of the error tensor, we define the sparse spectral norm of as,
where , . It quantifies the perturbation error in a sparse scenario.
Consider the structure in . Assume the true decomposition components are sparse and smooth in that and , for . Furthermore, assume the initialization satisfies , , with .
We recognize that the initialization error bound , since the components are normalized to have a unit norm. This condition on the initial error is very weak, by noting that we only require the initialization to be slightly better than a naive estimator whose entries are all zeros so to avoid . As such, the proposed random initialization in our algorithm satisfies this condition with high probability. The next theorem establishes the convergence rate of the structured tensor factorization under the rank .
With being a constant strictly less than 1, we see a clear tradeoff between the signal level and the error tensor according to the condition . Besides, the derived upper bound in reveals an interesting interaction of the initial error , the signal level , and the error tensor . Apparently, the error bound can be reduced by lowering or , or increasing . For a fixed signal level , more noisy samples, i.e., a larger , would require a more accurate initialization in order to obtain the same error bound. Moreover, this error bound is a monotonic function of the smoothness parameter . In contrast to the non-smoothed tensor factorization with , our structured tensor factorization is able to greatly reduce the error bound, since is usually much smaller than in the dynamic setup.
Based on the general rate derived in Theorem 1, the next result shows that Algorithm 1 generates a contracted estimator. It is thus guaranteed that the estimator converges to the truth as the number of iterations increases.
Assume the conditions in Theorem 1, and the error tensor satisfies that
If , , then the update in our algorithm satisfies with high probability.
Corollary 1 implies that the distance of the estimator to the truth from each iteration is at most half of the one from the previous iteration. Simple algebra implies that the estimator in the th iteration of the algorithm satisfies , which converges to zero as increases. Here the assumption on is imposed to ensure that the contraction rate of the estimator is 1/2, and it can be relaxed for a slower contraction rate.
It is also noteworthy that the above results do not require specification of the error tensor distribution. Next we derive the explicit form of the estimation error when is a Gaussian tensor. blackIn the following, means , and means are of the same order up to a logarithm term, i.e., for some constant .
Assume the conditions in Theorem 1, and assume is a Gaussian tensor. Then we have for some constant . In addition, if the signal strength satisfies , we have the update in one iteration of our algorithm satisfies
Next we establish the consistency of our dynamic tensor clustering Algorithm 2 under the tensor Gaussian mixture model. Consider a collection of -way tensor samples, , from a tensor Gaussian mixture model, with rank-1 centers, , and equal prior probability , for . As before, we assume an equal number of samples in each cluster. We denote the component from the last mode as,
Define the true cluster assignments as . blackRecall that the estimated cluster assignments are obtained from Algorithm 2. Then we show that our proposed dynamic tensor clustering estimator is consistent, in that the estimated cluster centers from our dynamic tensor clustering converge to the truth consistently, and that the estimated cluster assignments recover the true cluster structures with high probability.
Assume and Assumption 1 holds. If , and the signal strength satisfies that , then the estimator satisfies that
Moreover, if for some constant , we have for any with high probability.
Compared to Corollary 2 for the general structured tensor factorization, Theorem 2 requires a stronger condition on the signal strength in order to ensure the estimation error of cluster centers converges to zero at a desirable rate. Given this rate and an additional condition on the minimal gap between clusters, we are able to ensure that the estimated clusters recover the true clusters with high probability. It is also worth mentioning that our theory allows the number of clusters to diverge polynomially with the sample size . blackBesides, we allow and () to diverge to infinity, as long as the signal strength condition is satisfied.
4.2 Theory with a general rank
We next extend the theory of structured tensor factorization and dynamic clustering to the general rank case. For this case, we need some additional assumptions on the structure of tensor factorization, the initialization, and the noise level.
We first introduce the concept of incoherence, which is to quantify the correlation between the decomposed components.
Denote the initialization error for some . Denote and . Define . The next theorem establishes the convergence rate of the structured tensor factorization under a general rank .