Rapid advances in sensing and data acquisition technologies are increasingly resulting in individual data samples or signals structured by multiple modes. Examples include hyperspectral video (four modes; two spatial, one temporal, and one spectral), colored depth video (five modes; two spatial, one temporal, one spectral, and one depth), and four-dimensional tomography (four modes; three spatial and one temporal). Such data form multiway arrays and are called tensor data [2, 3].
Typical feature extraction approaches that handle tensor data tend to collapse or vectorize the tensor into a long one-dimensional vector and apply existing processing methods for one-dimensional data. Such approaches ignore the structure and inter-mode correlations in tensor data. More recently, several works instead assume a structure on the tensor of interest through tensor decompositions such as the CANDECOMP/PARAFAC (CP) decomposition, Tucker decomposition, and PARATUCK decomposition to obtain meaningful representations of tensor data. Because these decompositions involve fewer parameters, or degrees of freedom, in the model, inference algorithms that exploit such decompositions often perform better than those that assume the tensors to be unstructured. Moreover, algorithms utilizing tensor decompositions tend to be more efficient in terms of storage and computational costs: the cost of storing the decomposition can be substantially lower, and numerical methods can exploit the structure by solving simpler subproblems.
In this work, we focus on the problem of finding sparse representations of tensors that admit a Tucker decomposition. More specifically, we analyze the dictionary learning (DL) problem for tensor data. The traditional DL problem for vector-valued data involves constructing an overcomplete basis (dictionary) such that each data sample can be represented by only a few columns (atoms) of that basis. To account for the Tucker structure of tensor data, we require that the dictionary underlying the vectorized versions of tensor data samples be Kronecker structured (KS). That is, it is composed of coordinate dictionaries that independently transform various modes of the tensor data. Such dictionaries have successfully been used for tensor data representation in applications such as hyperspectral imaging, video acquisition, distributed sensing, magnetic resonance imaging, and the tensor completion problem (multidimensional inpainting) [7, 8]. To provide some insights into the usefulness of KS dictionaries for tensor data, consider the hypothetical problem of finding sparse representations of hyperspectral images. Traditional DL methods require each image to be rearranged into a one-dimensional vector of length and then learn an unstructured dictionary that has a total of unknown parameters, where . In contrast, KS DL only requires learning three coordinate dictionaries of dimensions , , and , where , and . This gives rise to a total of unknown parameters in KS DL, which is significantly smaller than . While such “parameter counting” points to the usefulness of KS DL for tensor data, a fundamental question remains open in the literature: what are the theoretical limits on the learning of KS dictionaries underlying th-order tensor data? To answer this question, we examine the KS-DL objective function and find sufficient conditions on the number of samples (or sample complexity) for successful local identification of coordinate dictionaries underlying the KS dictionary. To the best of our knowledge, this is the first work presenting such identification results for the KS-DL problem.
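The parameter-counting argument above can be made concrete with a small sketch. The dimensions used here are hypothetical (they are not taken from the paper), and the helper names `unstructured_params` and `ks_params` are introduced only for illustration. The sketch assumes that an unstructured dictionary has one free parameter per matrix entry and that a KS dictionary has one per entry of each coordinate dictionary.

```python
from math import prod

def unstructured_params(m_dims, p_dims):
    """Entries of a single (prod m_k) x (prod p_k) unstructured dictionary."""
    return prod(m_dims) * prod(p_dims)

def ks_params(m_dims, p_dims):
    """Total entries of the coordinate dictionaries, each of size m_k x p_k."""
    return sum(m * p for m, p in zip(m_dims, p_dims))

# Hypothetical third-order example: two spatial modes and one spectral mode.
m_dims = (8, 8, 4)     # assumed sample dimensions per mode
p_dims = (16, 16, 8)   # assumed number of atoms per coordinate dictionary

n_unstructured = unstructured_params(m_dims, p_dims)
n_ks = ks_params(m_dims, p_dims)
print(n_unstructured, n_ks)
```

For these assumed sizes the unstructured dictionary has 524288 entries while the three coordinate dictionaries together have 288, which is the kind of gap the parameter-counting argument refers to.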
I-A Our Contributions
We derive sufficient conditions on the true coordinate dictionaries, coefficient and noise distributions, regularization parameter, and the number of data samples such that the KS-DL objective function has a local minimum within a small neighborhood of the true coordinate dictionaries with high probability. Specifically, suppose the observations are generated from a true dictionary consisting of the Kronecker product of coordinate dictionaries, , where and . Our results imply that samples are sufficient (with high probability) to recover the underlying coordinate dictionaries up to the given estimation errors .
I-B Relationship to Prior Work
Among existing works on structured DL that have focused exclusively on the Tucker model for tensor data, several have only empirically established the superiority of KS DL in various settings for 2nd- and 3rd-order tensor data [9, 10, 8, 11, 12, 13].
In the case of unstructured dictionaries, several works do provide analytical results for the dictionary identifiability problem [14, 15, 16, 17, 18, 19, 20, 21]. These results, which differ from each other in terms of the distance metric used, cannot be trivially extended to the KS-DL problem. In this work, we focus on the Frobenius norm as the distance metric. Gribonval et al. and Jung et al. also consider this metric, with the latter work providing minimax lower bounds for dictionary reconstruction error. In particular, Jung et al. show that the number of samples needed for reliable reconstruction (up to a prescribed mean squared error ) of an dictionary within its local neighborhood must be at least on the order of . Gribonval et al. derive a competing upper bound for the sample complexity of the DL problem and show that samples are sufficient to guarantee (with high probability) the existence of a local minimum of the DL cost function within the neighborhood of the true dictionary. In our previous works, we have obtained lower bounds on the minimax risk of KS DL for 2nd-order and th-order tensors [23, 24], and have shown that the number of samples necessary for reconstruction of the true KS dictionary within its local neighborhood up to a given estimation error scales with the sum of the product of the dimensions of the coordinate dictionaries, i.e., . Compared to this sample complexity lower bound, our upper bound is larger by a factor .
In terms of the analytical approach, although we follow the same general proof strategy as the vectorized case of Gribonval et al., our extension poses several technical challenges. These include: (i) expanding the asymptotic objective function into a summation in which individual terms depend on coordinate dictionary recovery errors, (ii) translating identification conditions on the KS dictionary to conditions on its coordinate dictionaries, and (iii) connecting the asymptotic objective function to the empirical objective function using concentration of measure arguments; this uses the coordinate-wise Lipschitz continuity property of the KS-DL objective function with respect to the coordinate dictionaries. To address these challenges, we require additional assumptions on the generative model. These include: (i) the true dictionary and the recovered dictionary belong to the class of KS dictionaries, and (ii) dictionary coefficient tensors follow the separable sparsity model that requires nonzero coefficients to be grouped in blocks [25, 24].
I-C Notational Convention and Preliminaries
Underlined bold upper-case, bold upper-case and lower-case letters are used to denote tensors, matrices and vectors, respectively, while non-bold lower-case letters denote scalars. For a tensor , its -th element is denoted as . The -th element of vector is denoted by and the -th element of matrix is denoted as . The -th column of is denoted by and denotes the matrix consisting of the columns of with indices . We use for the cardinality of the set . Sometimes we use matrices indexed by numbers, such as , in which case a second index (e.g., ) is used to denote its columns. We use to denote the vectorized version of matrix , which is a column vector obtained by stacking the columns of on top of one another. We use to denote the vector comprised of the diagonal elements of and to denote the diagonal matrix, whose diagonal elements are comprised of elements of . The elements of the sign vector of , denoted as , are equal to , for , and for , where denotes the index of any element of . We also use to denote the vector with elements (used similarly for other trigonometric functions). Norms are given by subscripts, so , , and are the , , and norms of , while and are the spectral and Frobenius norms of , respectively. We use to denote and to denote .
We write for the Kronecker product of two matrices and , where the result is an matrix and we have . We also use . We define , , and for full rank matrix . In the body, we sometimes also use .
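As an illustration of the Kronecker product notation, the following NumPy sketch checks a few standard facts numerically: the size of the product, the multiplicativity of the spectral and Frobenius norms, and the mixed-product property. The matrices and variable names are arbitrary choices made for this example; the identities themselves are textbook properties and are not claimed to be the exact relations stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((2, 5))

# The Kronecker product of a 3 x 4 and a 2 x 5 matrix is (3*2) x (4*5).
K = np.kron(A, B)
kron_shape = K.shape

# Spectral and Frobenius norms are multiplicative under the Kronecker product.
spec_ok = np.isclose(np.linalg.norm(K, 2),
                     np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
fro_ok = np.isclose(np.linalg.norm(K, "fro"),
                    np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro"))

# Mixed-product property: (A kron B)(C kron E) = (AC) kron (BE).
C = rng.standard_normal((4, 2))
E = rng.standard_normal((5, 3))
mixed_ok = np.allclose(np.kron(A, B) @ np.kron(C, E), np.kron(A @ C, B @ E))

print(kron_shape, spec_ok, fro_ok, mixed_ok)
```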
For matrices and of appropriate dimensions, we define their distance to be . For belonging to some set , we define
Note that while represents the surface of a sphere, we use the term “sphere” for simplicity. We use the standard “big-O” (Knuth) notation for asymptotic scaling.
I-C1 Tensor Operations and Tucker Decomposition for Tensors
A tensor is a multidimensional array where the order of the tensor is defined as the number of dimensions in the array.
Tensor Unfolding: A tensor of order can be expressed as a matrix by reordering its elements to form a matrix. This reordering is called unfolding: the mode- unfolding matrix of a tensor is a matrix, which we denote by . Each column of consists of the vector formed by fixing all indices of except the one in the th-order. The -rank of a tensor is defined by ; trivially, .
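The unfolding operation and the resulting notion of rank can be sketched in NumPy as follows. The column ordering used here (earlier modes vary fastest) is one common convention and is an assumption of this sketch; the function names `unfold`, `fold`, and `n_rank` are illustrative only.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: rows are indexed by mode n, columns by all other modes."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def fold(M, n, shape):
    """Inverse of unfold for a tensor with the given shape."""
    moved = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(np.reshape(M, moved, order="F"), 0, n)

def n_rank(X, n):
    """Rank of the mode-n unfolding; it can never exceed X.shape[n]."""
    return np.linalg.matrix_rank(unfold(X, n))

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))

# Each column of the mode-1 unfolding is a fiber obtained by fixing the other indices.
first_fiber_ok = np.allclose(unfold(X, 1)[:, 0], X[0, :, 0])

# A rank-one tensor (outer product of three vectors) has n-rank 1 in every mode.
a, b, c = rng.standard_normal(3), rng.standard_normal(4), rng.standard_normal(5)
R = np.einsum("i,j,k->ijk", a, b, c)
rank_one_n_ranks = [n_rank(R, n) for n in range(3)]
print(unfold(X, 1).shape, first_fiber_ok, rank_one_n_ranks)
```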
Tensor Multiplication: The mode- matrix product of the tensor and a matrix , denoted by , is a tensor of size whose elements are The mode- matrix product of and and the matrix multiplication of and are related :
Tucker Decomposition: The Tucker decomposition decomposes a tensor into a core tensor multiplied by a matrix along each mode [5, 3]. We take advantage of the Tucker model since we can relate the Tucker decomposition to the Kronecker representation of tensors . For a tensor of order , if holds for all then, according to the Tucker model, can be decomposed into:
Since the Kronecker product satisfies , (3) is equivalent to
where and .
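The link between the Tucker model and the Kronecker representation can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the sizes, the random factors, and the helper names `mode_n_product` and `tucker` are hypothetical. It also shows that the ordering of the Kronecker factors depends on the vectorization convention: with column-major vectorization the factors appear in reverse order, while with row-major vectorization (NumPy's default) they appear in natural order.

```python
import numpy as np
from functools import reduce

def unfold(X, n):
    """Mode-n unfolding with mode n as the row index."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def mode_n_product(X, A, n):
    """Mode-n product: multiply every mode-n fiber of X by the matrix A."""
    # tensordot places the new axis first; move it back to position n.
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)

def tucker(core, factors):
    """Core tensor multiplied by one factor matrix along each mode."""
    Y = core
    for n, D in enumerate(factors):
        Y = mode_n_product(Y, D, n)
    return Y

rng = np.random.default_rng(1)
p_dims, m_dims = (4, 3, 5), (2, 3, 2)   # assumed core and output sizes
S = rng.standard_normal(p_dims)
Ds = [rng.standard_normal((m, p)) for m, p in zip(m_dims, p_dims)]
Y = tucker(S, Ds)

# Mode-n product versus matrix multiplication of the unfolding.
unfold_ok = np.allclose(unfold(mode_n_product(S, Ds[0], 0), 0), Ds[0] @ unfold(S, 0))

# Row-major vec: factors in natural order; column-major vec: reversed order.
kron_fwd = reduce(np.kron, Ds)
kron_rev = reduce(np.kron, Ds[::-1])
row_major_ok = np.allclose(Y.reshape(-1), kron_fwd @ S.reshape(-1))
col_major_ok = np.allclose(Y.reshape(-1, order="F"), kron_rev @ S.reshape(-1, order="F"))
print(unfold_ok, row_major_ok, col_major_ok)
```

Either convention yields a Kronecker-structured dictionary acting on the vectorized core tensor; only the order in which the coordinate dictionaries appear changes.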
I-C2 Definitions for Matrices
We use the following definitions for a matrix with unit-norm columns: denotes the restricted isometry property () constant of order for . We define the worst-case coherence of as . We also define the order- cumulative coherence of as
Note that for , the cumulative coherence is equivalent to the worst-case coherence and . For , where ’s have unit-norm columns, [28, Corollary 3.6] and it can be shown that (the proof of (6) is provided in Appendix C):
The rest of the paper is organized as follows. We formulate the KS-DL problem in Section II. In Section III, we provide analysis for asymptotic recovery of coordinate dictionaries composing the KS dictionary and in Section IV, we present sample complexity results for identification of coordinate dictionaries that are based on the results of Section III. Finally, we conclude the paper in Section V. In order to keep the main exposition simple, proofs of the lemmas and propositions are relegated to appendices.
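The coherence quantities introduced in Section I-C2 can be computed directly for small dictionaries. The sketch below assumes the standard definitions: worst-case coherence as the largest absolute inner product between distinct unit-norm columns, and order-s cumulative coherence as the largest sum of s such inner products against a single column. It also checks numerically the known fact that the worst-case coherence of a Kronecker product of unit-norm-column matrices equals the largest coherence among its factors. Function names and dimensions are illustrative assumptions.

```python
import numpy as np

def normalize_columns(D):
    """Scale each column to unit Euclidean norm."""
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def offdiag_abs_gram(D):
    """Absolute Gram matrix with the diagonal zeroed out."""
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return G

def worst_case_coherence(D):
    return offdiag_abs_gram(D).max()

def cumulative_coherence(D, s):
    """Largest sum of the s biggest off-diagonal absolute correlations in a column."""
    G = offdiag_abs_gram(D)
    top = -np.sort(-G, axis=0)[:s]   # s largest entries of each column
    return top.sum(axis=0).max()

rng = np.random.default_rng(2)
D1 = normalize_columns(rng.standard_normal((6, 8)))
D2 = normalize_columns(rng.standard_normal((5, 7)))
D = np.kron(D1, D2)   # columns of D also have unit norm

mu1, mu2, mu = (worst_case_coherence(M) for M in (D1, D2, D))
print(mu1, mu2, mu)
```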
II System Model
We assume the observations are th-order tensors . Given generating coordinate dictionaries , coefficient tensor , and noise tensor , we can write using (4) as (we have reindexed ’s in (4) for ease of notation)
where denotes the sparse generating coefficient vector, denotes the underlying KS dictionary, and denotes the underlying noise vector. Here, for , and . (Note that the ’s are compact sets on their respective oblique manifolds of matrices with unit-norm columns.) We use for in the following for simplicity of notation. We assume we are given noisy tensor observations, which are then stacked in a matrix . To state the problem formally, we first make the following assumptions on distributions of and for each tensor observation.
Coefficient distribution: We assume the coefficient tensor follows the random “separable sparsity” model. That is, is sparse and the support of nonzero entries of is structured and random. Specifically, we sample elements uniformly at random from , . Then, the random support of is and is associated with
via lexicographic indexing, where , and the support of ’s are assumed to be independent and identically distributed (i.i.d.). This model requires nonzero entries of the coefficient tensors to be grouped in blocks and the sparsity level associated with each coordinate dictionary to be small. (In contrast, for coefficients following the random non-separable sparsity model, the support of the nonzero entries of the coefficient vector is assumed to be uniformly distributed over.)
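The separable sparsity model described above can be sketched as follows: a support set is drawn independently and uniformly at random for each mode, and the support of the vectorized coefficient tensor is the Cartesian product of these sets under lexicographic (row-major) indexing. The dimensions, per-mode sparsity levels, and the function name `sample_separable_support` are hypothetical choices for this illustration.

```python
import numpy as np

def sample_separable_support(p_dims, s_dims, rng):
    """Draw s_k indices without replacement per mode; return per-mode sets and flat support."""
    J_modes = [np.sort(rng.choice(p, size=s, replace=False))
               for p, s in zip(p_dims, s_dims)]
    grids = np.meshgrid(*J_modes, indexing="ij")
    # Lexicographic (row-major) index of every element of J_1 x ... x J_N.
    flat = np.ravel_multi_index(tuple(g.ravel() for g in grids), p_dims)
    return J_modes, np.sort(flat)

rng = np.random.default_rng(3)
p_dims, s_dims = (6, 5, 4), (2, 2, 1)   # assumed sizes and sparsity levels
J_modes, J = sample_separable_support(p_dims, s_dims, rng)

# Reshaping the support indicator back into a tensor reveals the block structure.
mask = np.zeros(int(np.prod(p_dims)), dtype=bool)
mask[J] = True
mask_tensor = mask.reshape(p_dims)

block = np.zeros(p_dims, dtype=bool)
block[np.ix_(*J_modes)] = True
print(len(J), np.array_equal(mask_tensor, block))
```

The total number of nonzero positions is the product of the per-mode sparsity levels, and the nonzeros occupy a sub-block of the coefficient tensor rather than arbitrary positions.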
We now make the same assumptions for the distribution of as assumptions A and B in Gribonval et al. These include: (i) , (ii) , where , (iii) , (iv) magnitude of is bounded, i.e., almost surely, and (v) nonzero entries of have a minimum magnitude, i.e., almost surely. Finally, we define as a measure of the flatness of (, with when all nonzero coefficients are equal).
Noise distribution: We make the following assumptions on the distribution of noise, which is assumed i.i.d. across data samples: (i) , (ii) , and (iii) magnitude of is bounded, i.e., almost surely.
Our goal in this paper is to recover the underlying coordinate dictionaries, , from noisy realizations of tensor data. To solve this problem, we take the empirical risk minimization approach and define
where is a regularization parameter. In theory, we can recover the coordinate dictionaries by solving the following regularized optimization program:
More specifically, given desired errors , we want a local minimum of (9) to be attained by coordinate dictionaries . That is, there exists a set such that . (We focus on the local recovery of coordinate dictionaries (i.e., ) due to ambiguities in the general DL problem. This ambiguity is a result of the fact that dictionaries are invariant to permutation and sign flips of dictionary columns, resulting in equivalence classes of dictionaries. Some works in the literature on conventional DL overcome this issue by defining distance metrics that capture the distance between these equivalence classes [16, 15, 17].) To address this problem, we first minimize the statistical risk:
Then, we connect to using concentration of measure arguments and obtain the number of samples sufficient for local recovery of the coordinate dictionaries. Such a result ensures that any KS-DL algorithm that is guaranteed to converge to a local minimum, and which is initialized close enough to the true KS dictionary, will converge to a solution close to the generating coordinate dictionaries (as opposed to the generating KS dictionary, which is guaranteed by analysis of the vector-valued setup).
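The empirical risk of Section II can be sketched numerically. This sketch assumes the usual lasso-type per-sample cost from the vectorized DL literature, namely the infimum over coefficient vectors of half the squared residual plus the regularization parameter times the l1 norm, averaged over the samples; the exact expression should be taken from the paper's equations. The inner minimization is approximated with a plain iterative soft-thresholding (ISTA) loop, so the returned value is an upper estimate of the infimum. All names, sizes, and the solver choice are assumptions of this example.

```python
import numpy as np
from functools import reduce

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_value(y, D, lam, n_iter=500):
    """Approximate min_x 0.5*||y - D x||^2 + lam*||x||_1 with ISTA started at zero."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - step * (D.T @ (D @ x - y)), step * lam)
    return 0.5 * np.sum((y - D @ x) ** 2) + lam * np.sum(np.abs(x))

def empirical_risk(Y, coord_dicts, lam):
    """Average per-sample cost for the KS dictionary built from coordinate dictionaries."""
    D = reduce(np.kron, coord_dicts)
    return float(np.mean([lasso_value(y, D, lam) for y in Y.T]))

def normalize_columns(D):
    return D / np.linalg.norm(D, axis=0, keepdims=True)

rng = np.random.default_rng(4)
D1 = normalize_columns(rng.standard_normal((4, 6)))
D2 = normalize_columns(rng.standard_normal((3, 5)))
D_true = np.kron(D1, D2)

# Hypothetical data: 20 samples, each a 2-sparse combination of atoms plus small noise.
n_samples = 20
X = np.zeros((D_true.shape[1], n_samples))
for l in range(n_samples):
    X[rng.choice(D_true.shape[1], size=2, replace=False), l] = rng.standard_normal(2)
Y = D_true @ X + 0.01 * rng.standard_normal((D_true.shape[0], n_samples))

risk = empirical_risk(Y, [D1, D2], lam=0.1)
print(risk)
```

Because ISTA with this step size does not increase the objective, the computed risk is never larger than the value obtained with all-zero coefficients, which gives a simple sanity check.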
III Asymptotic Identifiability Results
Then, the map admits a local minimum such that , , for any as long as
Theorem 1 captures how the existence of a local minimum for the statistical risk minimization problem depends on various properties of the coordinate dictionaries and demonstrates that there exists a local minimum of that is in local neighborhoods of the coordinate dictionaries. This ensures asymptotic recovery of coordinate dictionaries within some local neighborhood of the true coordinate dictionaries, as opposed to KS dictionary recovery for vectorized observations [20, Theorem 1].
We now explicitly compare conditions in Theorem 1 with the corresponding ones for vectorized observations [20, Theorem 1]. Given that the coefficients are drawn from the separable sparsity model, the sparsity constraints for the coordinate dictionaries in (11) translate into
Therefore, we have . Using the fact that , this translates into sparsity order . Next, the left hand side of the condition in (1) is less than 1. Moreover, from properties of the Frobenius norm, it is easy to show that The fact that and the assumption imply that the right hand side of (1) is lower bounded by . Therefore, Theorem 1 applies to coordinate dictionaries with dimensions and subsequently, KS dictionaries with . Both the sparsity order and dictionary dimensions are in line with the scaling results for vectorized data .
III-B Proof Outline
For given radii , the spheres are non-empty. This follows from the construction of dictionary classes, ’s. Moreover, the mapping is continuous with respect to the Frobenius norm on all . Hence, it is also continuous on compact constraint sets ’s. We derive conditions on the coefficients, underlying coordinate dictionaries, , regularization parameter, and ’s such that
This, along with the compactness of closed balls and the continuity of the mapping , implies the existence of a local minimum of achieved by in open balls, ’s, .
To find conditions that ensure , we take the following steps: given coefficients that follow the separable sparsity model, we can decompose any , as
where for . (The separable sparsity distribution model implies sampling without replacement from columns of .) Given a generating , we obtain by solving with respect to , conditioned on the fact that . This eliminates the dependency of on by finding a closed-form expression for given , which we denote as . Defining
we expand using (19) and separate the terms that depend on each radius to obtain conditions for sparsity levels , and coordinate dictionaries such that . Finally, we derive conditions on , coordinate dictionary coherences and ’s that ensure and .
The key assumption in the proof of Theorem 1 is expanding according to (19). This is a consequence of the separable sparsity model for dictionary coefficients. For a detailed discussion on the differences between the separable sparsity model and the random sparsity model for tensors, we refer the readers to our earlier work.
Although some of the forthcoming lemmas needed for the proof of Theorem 1 impose conditions on ’s as well as true coordinate dictionaries ’s, we later translate these conditions exclusively in terms of ’s and ’s.
The proof of Theorem 1 relies on the following propositions and lemmas. The proofs of these are provided in Appendix A.
Suppose the following inequalities hold for :
any collection of , and for all , we have :
In addition, if
then . Thus, for all .
Let where for , and be a support set generated by the separable sparsity model. Then any , can be decomposed as , where and , for . Also, the following relations hold for this model (the equations follow from basic properties of the Kronecker product):
where and are defined in Section I-C.
Given and , the difference
where without loss of generality, each is equal to either or , for .
We drop the index from for ease of notation throughout the rest of the paper.
Let be an arbitrary sign vector and be its support. Define (the quantity is not equal to conditioned on and the expression is only used for notation)
If is invertible for , then minimizes , where
and . Thus, can be expressed in closed form as:
Assume for and let be equal to either or . For
For any satisfying of order , given and , the following relations hold:
Lemma 6 (Lemma 4 ).
Let ’s be coordinate dictionaries such that . Then for any , exists and
and for any such that :
Lemma 7 (Lemma 6 ).
Given any , there exist