The classification of high-dimensional signals arises in a variety of image processing settings: object and digit recognition [1, 2], speaker identification [3, 4], tumor classification [5, 6], and more. A standard technique is to find a low-dimensional representation of the signal, such as a subspace or union of subspaces on which the signal approximately lies. However, many signals, such as dynamic scene videos or tomographic images
, are inherently multi-dimensional, involving dimensions of space and/or time. To use standard techniques, one vectorizes the signal, which discards spatial structure that could otherwise be leveraged to improve representation fidelity, reconstruction error, or classification performance.
In order to exploit multi-dimensional signal structure, researchers have proposed tensor-based
dictionary learning techniques, in which the signal of interest is a matrix or a higher-order tensor and the dictionary defining the (union of) subspace model is a tensor. A simple tensor-based model is the Kronecker-structured (K-S) model, in which a two-dimensional signal is represented by a coefficient matrix and two matrix dictionaries that pre- and post-multiply the coefficient matrix, respectively. Vectorizing this model leads to a dictionary that is the Kronecker product of two smaller dictionaries; hence the K-S model is a specialization of subspace models. This model is applied to spatio-temporal data in
, low-complexity methods for estimating K-S covariance matrices are developed in , and the sample complexity of K-S models is shown to be smaller than that of standard union-of-subspace models in .
As standard union-of-subspace models have proven successful for classification tasks [12, 13, 14], a natural question is the classification performance of K-S subspace models. In this paper, we address this question from an information-theoretic perspective and develop an algorithm for learning discriminative K-S dictionaries. We consider a signal model in which each signal class is associated with a subspace whose basis is the Kronecker product of two smaller dictionaries; equivalently, we suppose that each signal class has a matrix normal distribution, where the row and column covariances are approximately low rank. Here the covariance of each signal class follows a specific structure: it is exactly the Kronecker product of two lower-dimensional covariance matrices [15, 16, 17]. In this sense, signals are drawn from a matrix Gaussian mixture model (GMM), similar to , where each K-S subspace is associated with a mixture component.
To find the underlying low-dimensional representation of signals, dictionary learning methods are widely used [19, 20, 21]. The underlying signal is compactly represented by a few large coefficients in an overcomplete dictionary. In the standard dictionary learning setting, a 1-D signal $\mathbf{y}$ is represented using a sparse coefficient vector $\mathbf{x}$, where an overcomplete dictionary $\mathbf{D}$ is learned by minimization problems similar to
$$\min_{\mathbf{D},\mathbf{x}} \|\mathbf{y} - \mathbf{D}\mathbf{x}\|_F^2 + \lambda \|\mathbf{x}\|_1,$$
where $\|\cdot\|_F$ denotes the Frobenius norm, $\|\cdot\|_1$ denotes the $\ell_1$-norm, and $\lambda$ denotes the strength of the sparsity prior. Well-established methods for dictionary learning in this framework include K-SVD and the method of optimal directions . These methods are targeted at dictionaries that faithfully represent the signal, and do not specifically consider classification.
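As a concrete illustration of this type of minimization, the following sketch alternates between sparse coding via ISTA and a least-squares dictionary update. It is a minimal illustrative example, not an implementation of K-SVD or the method of optimal directions; all function names and parameter choices are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of the l1-norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def learn_dictionary(Y, n_atoms, lam=0.1, n_iter=50, ista_steps=30, seed=0):
    """Alternating minimization for min_{D,X} ||Y - D X||_F^2 + lam * ||X||_1."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
    X = np.zeros((n_atoms, n))
    for _ in range(n_iter):
        # Sparse coding step: ISTA on X with D fixed.
        L = np.linalg.norm(D, 2) ** 2         # squared spectral norm of D
        for _ in range(ista_steps):
            grad = D.T @ (D @ X - Y)          # half-gradient of the quadratic term
            X = soft_threshold(X - grad / L, lam / (2 * L))
        # Dictionary update: least squares on D with X fixed.
        D = Y @ np.linalg.pinv(X)
        norms = np.linalg.norm(D, axis=0) + 1e-12
        D /= norms
        X *= norms[:, None]                   # keep the product D @ X unchanged
    return D, X
```

In practice the sparse coding step is often solved by OMP or LARS instead of ISTA; the alternating structure is the same.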
Several discriminative dictionary learning methods jointly learn a linear classifier and an overcomplete dictionary that is shared in common among the classes. Signals are then classified in the feature space induced by the dictionary. By contrast, [25, 13, 26, 27] propose methods for learning class-specific dictionaries, either by promoting incoherence among dictionaries or learning class-specific features. Signals are then classified by choosing the dictionary that minimizes the reconstruction error.
The above methods consider one-dimensional signals; multidimensional signals must first be vectorized, which may sacrifice structural information that could improve signal representation or classification. To preserve signal structure, extends K-SVD to tensor dictionaries, and [29, 30, 31, 6] employ a variety of tensor decompositions to learn dictionaries tailored to multidimensional structure. These methods boast improved performance over traditional methods on a variety of signal processing tasks, including image reconstruction, image denoising and inpainting, video denoising, and speaker classification.
Similar to , we first study the classification performance limits of K-S models in terms of diversity order and classification capacity, characterizing the performance in the limit of high SNR and large signal dimension, respectively. Further, we derive a tight upper bound on the misclassification probability in terms of the pairwise geometry of the individual row and column subspaces, where the row and column subspaces correspond to the two matrix dictionaries that pre- and post-multiply the coefficient matrix, respectively. We use principal angles between the subspaces as a measure to describe the geometry of subspaces [33, 34].
Finally, to learn discriminative dictionaries, we propose a new method, termed Kronecker-Structured Learning of Discriminative Dictionaries (K-SLD), that exploits the multidimensional structure of the signal. K-SLD learns two subspace dictionaries per class: one to represent the columns of the signal, and one to represent the rows. Inspired by , we choose dictionaries that both represent each class individually and can be concatenated to form an overcomplete dictionary to represent signals generally. K-SLD is fast and learns compact data models with many fewer parameters than standard dictionary learning methods. We evaluate the performance of K-SLD on the Extended YaleB and UCI EEG databases. The resulting dictionaries improve classification performance by up to 5% when training sets are small, improve reconstruction performance across the board, and result in dictionaries with no more than 5% of the storage requirements of existing subspace models.
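To make the two-dictionary model concrete, the following is a minimal sketch, not the K-SLD algorithm itself, that fits a column dictionary and a row dictionary per class by alternating least squares, and classifies a test signal by minimum reconstruction error. The fitting procedure and all function names are illustrative assumptions.

```python
import numpy as np

def fit_ks_class(Ys, p, q, n_iter=30, seed=0):
    """Alternating least squares for one class: min sum_i ||Y_i - A X_i B^T||_F^2.

    Ys: list of m x n signal matrices; A (m x p) spans the columns, B (n x q) the rows.
    This is a hypothetical sketch, not the K-SLD algorithm.
    """
    rng = np.random.default_rng(seed)
    m, n = Ys[0].shape
    A = np.linalg.qr(rng.standard_normal((m, p)))[0]   # orthonormal init
    B = np.linalg.qr(rng.standard_normal((n, q)))[0]
    for _ in range(n_iter):
        # Coefficients: X_i = A^+ Y_i (B^+)^T with A and B fixed.
        Xs = [np.linalg.pinv(A) @ Y @ np.linalg.pinv(B).T for Y in Ys]
        # Column dictionary: Y_i ~ A (X_i B^T), solved jointly by least squares.
        A = np.hstack(Ys) @ np.linalg.pinv(np.hstack([X @ B.T for X in Xs]))
        # Row dictionary: Y_i^T ~ B (X_i^T A^T).
        B = np.hstack([Y.T for Y in Ys]) @ np.linalg.pinv(
            np.hstack([X.T @ A.T for X in Xs]))
    return A, B

def classify(Y, models):
    """Assign Y to the class whose (A, B) pair gives the smallest residual."""
    errs = []
    for A, B in models:
        X = np.linalg.pinv(A) @ Y @ np.linalg.pinv(B).T
        errs.append(np.linalg.norm(Y - A @ X @ B.T))
    return int(np.argmin(errs))
```

Each class model stores only an m x p and an n x q matrix, versus an mn x pq matrix for an unstructured subspace model, which illustrates the storage savings claimed above.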
In Section II, we describe the K-S classification model in detail. In Section III, we derive the diversity order for K-S classification problems, showing the exponent of the probability of error as the SNR goes to infinity. This analysis depends on a novel expression, presented in Lemma 3, for the rank of sums of Kronecker products of tall matrices. In Section IV, we provide high-SNR approximations to the classification capacity. In Section V, we propose a discriminative K-S dictionary learning algorithm which balances the learning of class-specific, Kronecker-structured subspaces against the learning of a general overcomplete dictionary that allows for the representation of arbitrary signals. In Section VI, we show that the empirical classification performance of K-S models agrees with the diversity analysis and evaluate the performance of the proposed discriminative algorithm on the Extended YaleB face recognition dataset and an EEG dataset correlating EEG signals with individuals' alcoholism.
II Problem Definition
II-A Kronecker-Structured Signal Model
To formalize the classification problem, let the signal of interest be a matrix whose entries are distributed according to one of the class-conditional densities . Each class-conditional density corresponds to a Kronecker-structured model described by the pair of matrices and . The matrix describes the subspace on which the columns of approximately lie, and describes the subspace on which the rows of approximately lie. More precisely, if belongs to class , it has the form
has i.i.d. zero-mean Gaussian entries with variance, and has i.i.d. zero-mean Gaussian entries with unit variance. We can also express in vectorized form:
for coefficient vector , and noise vector , where , , and where is the usual Kronecker product. Then, the class-conditional density of is
In other words, the vectorized signal lies near a subspace with a Kronecker structure that encodes the row and column subspaces of .
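The equivalence between the matrix model and its vectorized form rests on the identity vec(A X B^T) = (B ⊗ A) vec(X) for column-major vectorization, which can be checked numerically (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))    # column-subspace dictionary
B = rng.standard_normal((5, 3))    # row-subspace dictionary
X = rng.standard_normal((2, 3))    # coefficient matrix

Y = A @ X @ B.T                    # matrix form of the K-S model (noiseless)

# Column-major (Fortran-order) vectorization: vec(Y) = (B kron A) vec(X).
vec = lambda M: M.reshape(-1, order="F")
lhs = vec(Y)
rhs = np.kron(B, A) @ vec(X)
assert np.allclose(lhs, rhs)
```

Note the order of the Kronecker factors: with column-major vectorization the row dictionary appears on the left of the Kronecker product.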
In the sequel, we will characterize the performance limits over ensembles of classification problems of this form. To this end, we parameterize the set of class-conditional densities via
which contains the set of matrices indicating the row and column subspaces given signal and subspace dimensions . We can represent an -ary classification problem by a tuple , where each is the pair of matrices . Let , for , denote the class conditional densities parametrized by . For a classification problem defined by , we can define the average misclassification probability:
where is the output of the maximum-likelihood classifier over the class-conditional densities described by . In this paper, we provide two asymptotic analyses of . First, we consider the diversity order, which characterizes the slope of for a particular as . Second, we consider the classification capacity, which characterizes the asymptotic error performance averaged over as go to infinity. For the latter case, we define a prior distribution over the matrix pairs in each class:
where is the th element of matrix and is the th element of matrix . Note that the column and row subspaces described by and
are uniformly distributed over the Grassmann manifold because the matrix elements are i.i.d. Gaussian; however, the resulting K-S subspaces are not uniformly distributed.
II-B Diversity Order
For a fixed classification problem , the diversity order characterizes the decay of the misclassification probability as the noise power goes to zero. By analogy with the definition of the diversity order in wireless communications , we consider the asymptotic slope of on a logarithmic scale as the noise power vanishes, that is, as the mismatch between data and model becomes vanishingly small. Formally, the diversity order is defined as
In Section III, we characterize exactly the diversity order for almost every .
II-C Classification Capacity
The classification capacity characterizes the number of unique subspaces that can be discerned as , , and go to infinity. That is, we derive bounds on how fast the number of classes can grow as a function of the signal dimension while ensuring that the misclassification probability decays to zero almost surely. Here, we define a variable and let it go to infinity. As it grows to infinity, we let the dimensions , , and scale linearly as follows:
for and . We let the number of classes grow exponentially in as:
for some , which we call the classification rate. We say that the classification rate is achievable if . For fixed signal dimension ratios and , we define as the supremum over all achievable classification rates, and we call (sometimes abbreviated by ) the classification capacity.
We can bound the classification capacity by the mutual information between the signal vector and the matrix pair that characterizes each Kronecker-structured class.
The classification capacity satisfies:
where the mutual information is computed with respect to .
To prove lower bounds on the diversity order and classification capacity, we will need the following lemma, which gives the well-known Bhattacharyya bound on the probability of error of a maximum-likelihood classifier that chooses between two Gaussian hypotheses.
Lemma 2.
Consider a signal distributed according to or with equal priors. Then, define
Supposing maximum likelihood classification, the misclassification probability is bounded by
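For zero-mean Gaussian hypotheses with equal priors, this bound takes the form P_err ≤ (1/2) exp(−D_B), where D_B = (1/2) ln[ det((Σ0+Σ1)/2) / sqrt(det Σ0 · det Σ1) ] is the Bhattacharyya distance. The following sanity check, with covariances chosen arbitrarily, compares the bound against the Monte Carlo error of the maximum-likelihood classifier:

```python
import numpy as np

def bhattacharyya_bound(S0, S1):
    """P_err <= 0.5 * exp(-D_B) for equal-prior zero-mean Gaussians N(0,S0), N(0,S1)."""
    _, ld_avg = np.linalg.slogdet((S0 + S1) / 2)
    _, ld0 = np.linalg.slogdet(S0)
    _, ld1 = np.linalg.slogdet(S1)
    D_B = 0.5 * (ld_avg - 0.5 * (ld0 + ld1))
    return 0.5 * np.exp(-D_B)

# Monte Carlo comparison against the maximum-likelihood classifier.
rng = np.random.default_rng(1)
S0, S1 = np.diag([1.0, 1.0]), np.diag([4.0, 0.25])

def loglik(x, S):
    return -0.5 * (x @ np.linalg.solve(S, x) + np.linalg.slogdet(S)[1])

n, errs = 20000, 0
for _ in range(n):
    c = rng.integers(2)
    x = rng.multivariate_normal(np.zeros(2), S0 if c == 0 else S1)
    guess = 0 if loglik(x, S0) >= loglik(x, S1) else 1
    errs += (guess != c)
assert errs / n <= bhattacharyya_bound(S0, S1)
```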
II-D Subspace Geometry
We characterize the subspace geometry in terms of principal angles. Principal angles are defined as the canonical angles between elements of two subspaces, and they induce a distance metric on the Grassmann manifold. If the principal angles between two subspaces are large, the subspaces are far apart and easily discerned.
Consider two linear subspaces and of , each of the same dimension. The principal angles between these two subspaces are defined recursively as follows:
where the first principal angle is the smallest angle between all pairs of unit vectors drawn from the first and the second subspaces.
The principal angles can be computed directly by computing the singular value decomposition (SVD) of , where and are orthonormal bases for the subspaces and , respectively.
where the cosines of the principal angles, , are the singular values of .
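A minimal implementation of this computation, using QR factorizations to obtain orthonormal bases, is as follows:

```python
import numpy as np

def principal_angles(S1, S2):
    """Principal angles (radians) between subspaces spanned by the columns of S1, S2."""
    U1, _ = np.linalg.qr(S1)          # orthonormal basis for the first subspace
    U2, _ = np.linalg.qr(S2)          # orthonormal basis for the second subspace
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Example: two planes in R^3 sharing one direction.
S1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # xy-plane
S2 = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])   # xz-plane
angles = principal_angles(S1, S2)
assert np.isclose(angles[0], 0.0)            # the shared x-direction
assert np.isclose(angles[1], np.pi / 2)      # y vs z are orthogonal
```

The clipping guards against singular values slightly above 1 due to floating-point roundoff; SciPy offers the same computation as `scipy.linalg.subspace_angles`.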
In this problem, suppose and are orthonormal bases for the subspaces and on which the columns of the signal approximately lie, and and are orthonormal bases for the subspaces and on which the rows of the signal approximately lie. Then we define the orthonormal bases and for the Kronecker-structured subspaces and , respectively. The cosines of the principal angles between and are the singular values of , as follows:
where the cosines of the principal angles between the two Kronecker subspaces are given by the Kronecker product of the cosines of the principal angles between the two row subspaces and between the two column subspaces, that is, .
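This factorization of the principal-angle cosines can be verified numerically: the singular values of the product of Kronecker-structured bases equal, as a multiset, the Kronecker product of the individual cosines (subspace dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
orth = lambda M: np.linalg.qr(M)[0]   # orthonormal basis via QR

UA1, UA2 = orth(rng.standard_normal((5, 2))), orth(rng.standard_normal((5, 2)))
UB1, UB2 = orth(rng.standard_normal((4, 2))), orth(rng.standard_normal((4, 2)))

# Cosines of principal angles between the two Kronecker-structured subspaces...
cos_ks = np.linalg.svd(np.kron(UA1, UB1).T @ np.kron(UA2, UB2), compute_uv=False)
# ...equal the Kronecker product of the factor-subspace cosines,
# since (UA1 kron UB1)^T (UA2 kron UB2) = (UA1^T UA2) kron (UB1^T UB2).
cos_prod = np.kron(np.linalg.svd(UA1.T @ UA2, compute_uv=False),
                   np.linalg.svd(UB1.T @ UB2, compute_uv=False))
assert np.allclose(np.sort(cos_ks), np.sort(cos_prod))
```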
III Diversity Order
As mentioned in Section II, the diversity order measures how quickly the misclassification probability decays with the noise power for a fixed number of discernible subspaces. By careful analysis using the Bhattacharyya bound, we derive an exact expression for the diversity order for almost every classification problem (with respect to the Lebesgue measure). First, we state an expression that holds in general.
For a classification problem described by the tuple such that and for every , the diversity order is , where
and where denotes the matrix rank.
Applying the Bhattacharyya bound, the probability of a pairwise error between two Kronecker-structured classes and with covariances
is bounded by
Using the well-known Kronecker product identities and we can write the matrix as
It is trivial that , thus
denote the nonzero eigenvalues of and , respectively, and let denote the nonzero eigenvalues of and denote its rank. Then, we can write the pairwise bound in (17).
Using Weyl's monotonicity theorem, and for every . Therefore,
From this we can write
Next, we bound the overall misclassification probability via the union bound. Computing the pairwise error probability for each pair of subspaces and invoking the union bound over all pairs, we obtain:
Taking the logarithm of both sides, we obtain:
Finally,  shows that the Bhattacharyya bound is exponentially tight as the pairwise error decays to zero. Furthermore, the union bound is exponentially tight. Therefore, the above inequality holds with equality, and ∎
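The Weyl monotonicity step invoked in the proof above states that adding a positive semidefinite matrix cannot decrease any eigenvalue of a symmetric matrix. A quick numerical check (matrix sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric test matrix
C = rng.standard_normal((n, 3)); B = C @ C.T          # PSD matrix of rank <= 3

eig_A = np.sort(np.linalg.eigvalsh(A))
eig_AB = np.sort(np.linalg.eigvalsh(A + B))
# Weyl's monotonicity: lambda_k(A + B) >= lambda_k(A) for every k when B is PSD.
assert np.all(eig_AB >= eig_A - 1e-10)
```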
For almost every classification problem, the rank has the same value, as we show in the next lemma.
For almost every classification problem , the matrices have rank
where denotes the positive part of a number.
Using standard matrix properties (e.g., ), we can write
Almost every matrix has full rank, so almost everywhere we can rewrite (28) as
Next, we study the three possible cases for (29).
Case 1: and . Here,
Case 2: and . Here,
Case 3: and . Here,
where the first and second equalities for each case hold almost everywhere, and the third equality for each case follows from Lemma 4. Combining the three cases yields the claim. ∎
For almost every classification problem , the diversity order is
III-A Diversity Order Gap
The diversity order characterizes the slope of the error probability: the higher the diversity order, the faster the misclassification probability decays. Since Kronecker-structured subspaces come from a restricted set of subspaces, their error performance can be worse. Therefore, to assess the efficiency of Kronecker subspaces, we characterize the diversity order gap as the difference between the slopes of the misclassification probability for K-S subspaces and for standard subspaces. This diversity order gap is a function of the signal dimensions, that is, and . We derive the signal-dimension regimes where the diversity order gap is significant and/or zero.
Diversity order for K-S subspaces:
For the standard subspace model in (3), the signal of interest and coefficient vector , where and . From , for standard subspaces of the same dimensions the diversity order is . This can be rewritten in terms of the Kronecker signal dimensions.
Diversity order for standard subspace:
We observe that the diversity order for K-S models is never greater than the diversity order of the standard subspace model, for any value of . However, in some regimes the diversity order of the K-S model is strictly smaller than that of standard subspaces, while in others the two are equal.
if then :
if then :
where is the diversity order gap. For any other region no diversity order gap exists, that is, . The details are provided in Appendix A.
The high-SNR classification performance of K-S subspaces is the same as general subspaces when the subspace dimensions are small, even though K-S subspaces are structured, involve fewer parameters, and are easier to train.
III-B Misclassification Probability in Terms of Row and Column Subspace Geometry
We derive a tighter high-SNR approximation of the probability of error in terms of the principal angles between the K-S subspaces, and also in terms of the principal angles between the individual row and column subspaces. Using the eigenvalue decompositions of the covariance of the row subspace and of the column subspace , where are the orthonormal bases of the row and column subspaces, respectively, and are the corresponding eigenvalues, we can write the signal covariance as:
Similarly, . From , the Kronecker product of two orthonormal matrices is an orthonormal matrix; thus are orthonormal bases and the diagonal elements of are the eigenvalues. From equation (27), the rank of the sum of two Kronecker products is written as:
The intersection of two K-S subspaces determines this rank and hence plays an important role in bounding the misclassification probability from above. According to , one can write the covariances of K-S subspaces in terms of subspace intersections as follows:
Here corresponds to the K-S subspace intersection, and correspond to the set differences and , respectively. The term accounts for the overlap between the subspaces: the smaller the overlap, the easier it is to discern the classes. On the other hand, indicates complete overlap between the subspaces, in which case it becomes hard to discriminate between the classes.
As , the misclassification probability in terms of principal angle between individual row and column subspaces is upper bounded as
where denotes the pseudo-determinant.
The proof is given in Appendix C. ∎
In the case of no overlap between the subspaces, that is, , both and ; since the misclassification probability is inversely related to the product of all the principal-angle terms, the misclassification error is negligibly small. On the other hand, with subspace overlap, and take some positive value, there exist some non-trivial principal angles which affect the classification performance, and it becomes very hard to distinguish between the subspaces.
IV Classification Capacity
In this section, we derive upper and lower bounds on the classification capacity that hold approximately for large . Detailed analysis can be found in the long version of the paper.
The classification capacity is upper bounded by