I Introduction
The classification of high-dimensional signals arises in a variety of image processing settings: object and digit recognition [1, 2], speaker identification [3, 4], tumor classification [5, 6], and more. A standard technique is to find a low-dimensional representation of the signal, such as a subspace or union of subspaces on which the signal approximately lies. However, many signals, such as dynamic scene videos [7] or tomographic images [8], are inherently multidimensional, involving dimensions of space and/or time. To use standard techniques, one vectorizes the signal, which discards spatial structure that could be leveraged to improve representation fidelity, reconstruction error, or classification performance.
In order to exploit multidimensional signal structure, researchers have proposed tensor-based dictionary learning techniques, in which the signal of interest is a matrix or a higher-order tensor and the dictionary defining the (union of) subspace model is a tensor. A simple tensor-based model is the Kronecker-structured (KS) model, in which a two-dimensional signal is represented by a coefficient matrix and two matrix dictionaries that pre- and post-multiply the coefficient matrix, respectively. Vectorizing this model leads to a dictionary that is the Kronecker product of two smaller dictionaries; hence the KS model is a specialization of subspace models. This model is applied to spatio-temporal data in [9], low-complexity methods for estimating KS covariance matrices are developed in [10], and it is shown in [11] that the sample complexity of KS models is smaller than that of standard union-of-subspace models.
As standard union-of-subspace models have proven successful for classification tasks [12, 13, 14], a natural question is the classification performance of KS subspace models. In this paper, we address this question from an information-theoretic perspective and develop an algorithm for learning discriminative KS dictionaries. We consider a signal model in which each signal class is associated with a subspace whose basis is the Kronecker product of two smaller dictionaries; equivalently, we suppose that each signal class has a matrix normal distribution, where the row and column covariances are approximately low rank. Here the covariance of each signal class has a specific structure: it is exactly the Kronecker product of two lower-dimensional covariance matrices [15, 16, 17]. In this sense, signals are drawn from a matrix Gaussian mixture model (GMM), similar to [18], where each KS subspace is associated with a mixture component.
To find the underlying low-dimensional representation of signals, dictionary learning methods are widely used [19, 20, 21]. The underlying signal is compactly represented by a few large coefficients in an overcomplete dictionary. In a standard dictionary learning setting, a 1D signal $\mathbf{y}$ is represented using a sparse coefficient vector $\mathbf{x}$, where an overcomplete dictionary $\mathbf{D}$ is learned by minimization problems similar to
(1) $\min_{\mathbf{D}, \mathbf{X}} \|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 + \lambda \|\mathbf{X}\|_1$
where the columns of $\mathbf{Y}$ and $\mathbf{X}$ collect the training signals and their coefficient vectors, $\|\cdot\|_F$ denotes the Frobenius norm, $\|\cdot\|_1$ denotes the $\ell_1$ norm, and $\lambda$ denotes the strength of the sparsity prior. Well-established methods for dictionary learning in this framework include K-SVD [22] and the method of optimal directions [23]. These methods are targeted at dictionaries that faithfully represent the signal, and do not specifically consider classification.
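As a rough illustration of the kind of objective such methods minimize, the sketch below evaluates a Frobenius-norm data-fit term plus an $\ell_1$ penalty and runs a plain iterative soft-thresholding (ISTA) sparse-coding step for a fixed dictionary. This is not K-SVD or the method of optimal directions; the function names `objective` and `ista_sparse_code`, the random data, and the value of `lam` are assumptions made only for this example.

```python
import numpy as np

def objective(Y, D, X, lam):
    # Squared Frobenius data-fit term plus l1 sparsity penalty.
    return np.linalg.norm(Y - D @ X, "fro") ** 2 + lam * np.abs(X).sum()

def ista_sparse_code(Y, D, lam, n_iter=50):
    # Hypothetical helper: proximal-gradient (ISTA) updates of X for a fixed D.
    X = np.zeros((D.shape[1], Y.shape[1]))
    # The gradient of the data-fit term is Lipschitz with constant 2 * ||D||_2^2.
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    history = [objective(Y, D, X, lam)]
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ X - Y)
        Z = X - step * grad
        # Soft-thresholding is the proximal operator of the l1 penalty.
        X = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)
        history.append(objective(Y, D, X, lam))
    return X, history

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 40))   # overcomplete: more atoms than dimensions
D /= np.linalg.norm(D, axis=0)      # unit-norm atoms
Y = rng.standard_normal((20, 15))   # 15 illustrative training signals
X_hat, history = ista_sparse_code(Y, D, lam=0.5)
print("objective: first %.3f, last %.3f" % (history[0], history[-1]))
```

A full dictionary learning method would alternate a step like this with an update of the dictionary itself; only the sparse-coding half is sketched here.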
Methods for incorporating discriminative ability into dictionary learning have been proposed, such as discriminative K-SVD (D-KSVD) [12] and label-consistent K-SVD (LC-KSVD) [24], which jointly learn a linear classifier and an overcomplete dictionary that is shared among the classes. Signals are then classified in the feature space induced by the dictionary. By contrast, [25, 13, 26, 27] propose methods for learning class-specific dictionaries, either by promoting incoherence among dictionaries or by learning class-specific features. Signals are then classified by choosing the dictionary that minimizes the reconstruction error.
The above methods consider one-dimensional signals; multidimensional signals must first be vectorized, which may sacrifice structural information about the signal that could improve signal representation or classification. To preserve signal structure, [28] extends K-SVD to tensor dictionaries, and [29, 30, 31, 6] employ a variety of tensor decompositions to learn dictionaries tailored to multidimensional structure. These methods report improved performance over traditional methods on a variety of signal processing tasks, including image reconstruction, image denoising and inpainting, video denoising, and speaker classification.
Similar to [32], we first study the classification performance limits of KS models in terms of diversity order and classification capacity, characterizing the performance in the limit of high SNR and large signal dimension, respectively. Further, we derive a tight upper bound on the misclassification probability in terms of the pairwise geometry of the individual row and column subspaces, where the row and column subspaces correspond to the two matrix dictionaries that pre- and post-multiply the coefficient matrix, respectively. We use the principal angles between subspaces as a measure of their geometry [33, 34].
Finally, to learn discriminative dictionaries, we propose a new method, termed Kronecker-Structured Learning of Discriminative Dictionaries (KSLD), that exploits the multidimensional structure of the signal. KSLD learns two subspace dictionaries per class: one to represent the columns of the signal, and one to represent the rows. Inspired by [26], we choose dictionaries that both represent each class individually and can be concatenated to form an overcomplete dictionary to represent signals generally. KSLD is fast and learns compact data models with many fewer parameters than standard dictionary learning methods. We evaluate the performance of KSLD on the Extended YaleB and UCI EEG databases. The resulting dictionaries improve classification performance by up to 5% when training sets are small, improve reconstruction performance across the board, and require no more than 5% of the storage of existing subspace models.
In Section II, we describe the KS classification model in detail. In Section III, we derive the diversity order for KS classification problems, showing the exponent of the probability of error as the SNR goes to infinity. This analysis depends on a novel expression, presented in Lemma 3, for the rank of sums of Kronecker products of tall matrices. In Section IV, we provide high-SNR approximations to the classification capacity. In Section V, we propose a discriminative KS dictionary learning algorithm that balances the learning of class-specific, Kronecker-structured subspaces against the learning of a general overcomplete dictionary that allows for the representation of general signals. In Section VI, we show that the empirical classification performance of KS models agrees with the diversity analysis, and we evaluate the performance of the proposed discriminative algorithm on the Extended YaleB face recognition dataset and on an EEG dataset that correlates EEG signals with an individual's alcoholism.
II Problem Definition
II-A Kronecker-structured Signal Model
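To formalize the classification problem, let the signal of interest $\mathbf{Y} \in \mathbb{R}^{m_1 \times m_2}$ be a matrix whose entries are distributed according to one of $L$ class-conditional densities $p_l(\mathbf{Y})$. Each class-conditional density corresponds to a Kronecker-structured model described by the pair of matrices $\mathbf{A}_l \in \mathbb{R}^{m_1 \times n_1}$ and $\mathbf{B}_l \in \mathbb{R}^{m_2 \times n_2}$. The matrix $\mathbf{A}_l$ describes the subspace on which the columns of $\mathbf{Y}$ approximately lie, and $\mathbf{B}_l$ describes the subspace on which the rows of $\mathbf{Y}$ approximately lie. More precisely, if $\mathbf{Y}$ belongs to class $l$, it has the form
(2) $\mathbf{Y} = \mathbf{A}_l \mathbf{X} \mathbf{B}_l^T + \mathbf{Z}$
where $\mathbf{Z} \in \mathbb{R}^{m_1 \times m_2}$ has i.i.d. zero-mean Gaussian entries with variance $\sigma^2$, and $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2}$ has i.i.d. zero-mean Gaussian entries with unit variance. We can also express $\mathbf{Y}$ in vectorized form:
(3) $\mathbf{y} = (\mathbf{B}_l \otimes \mathbf{A}_l)\mathbf{x} + \mathbf{z}$
for coefficient vector $\mathbf{x} = \mathrm{vec}(\mathbf{X}) \in \mathbb{R}^{n}$, and noise vector $\mathbf{z} = \mathrm{vec}(\mathbf{Z}) \in \mathbb{R}^{m}$, where $m = m_1 m_2$, $n = n_1 n_2$, and where $\otimes$ is the usual Kronecker product. Then, the class-conditional density of $\mathbf{y}$ is
(4) $p_l(\mathbf{y}) = \mathcal{N}\left(\mathbf{0}, (\mathbf{B}_l \otimes \mathbf{A}_l)(\mathbf{B}_l \otimes \mathbf{A}_l)^T + \sigma^2 \mathbf{I}\right)$
In other words, the vectorized signal lies near a subspace with a Kronecker structure that encodes the row and column subspaces of $\mathbf{Y}$.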
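The equivalence between the matrix form and the vectorized form of the KS model rests on the standard identity vec(A X Bᵀ) = (B ⊗ A) vec(X), with column-major vectorization. The following sketch checks that identity numerically, along with the resulting Kronecker structure of the signal covariance. The variable names and dimensions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 5, 2, 4, 3            # illustrative signal and subspace dimensions
A = rng.standard_normal((m1, n1))      # column-subspace dictionary
B = rng.standard_normal((m2, n2))      # row-subspace dictionary
X = rng.standard_normal((n1, n2))      # coefficient matrix

def vec(M):
    # Column-major (Fortran-order) vectorization, matching the Kronecker identity.
    return M.reshape(-1, order="F")

# Matrix form of the noiseless signal, then vectorized.
y_matrix_form = vec(A @ X @ B.T)
# Vectorized form with the Kronecker-structured dictionary.
y_kron_form = np.kron(B, A) @ vec(X)

# Signal covariance: (B kron A)(B kron A)^T equals (B B^T) kron (A A^T).
D = np.kron(B, A)
cov_kron = D @ D.T
cov_factored = np.kron(B @ B.T, A @ A.T)

print(np.allclose(y_matrix_form, y_kron_form))
print(np.allclose(cov_kron, cov_factored))
```

Note that the ordering B ⊗ A (rather than A ⊗ B) follows from stacking columns; row-major vectorization would swap the factors.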
In the sequel, we will characterize the performance limits over ensembles of classification problems of this form. To this end, we parameterize the set of class-conditional densities via
(5) 
which contains the set of matrices indicating the row and column subspaces given signal and subspace dimensions . We can represent an $L$-ary classification problem by a tuple , where each is the pair of matrices . Let , for , denote the class-conditional densities parametrized by . For a classification problem defined by , we can define the average misclassification probability:
(6) 
where is the output of the maximum-likelihood classifier over the class-conditional densities described by . In this paper, we provide two asymptotic analyses of . First, we consider the diversity order, which characterizes the slope of for a particular as . Second, we consider the classification capacity, which characterizes the asymptotic error performance averaged over as go to infinity. For the latter case, we define a prior distribution over the matrix pairs in each class:
(7) 
where is the th element of matrix and is the th element of matrix . Note that the column and row subspaces described by and are uniformly distributed over the Grassmann manifold because the matrix elements are i.i.d. Gaussian; however, the resulting KS subspaces are not uniformly distributed.
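To make the maximum-likelihood classifier and the average misclassification probability concrete, the sketch below estimates the error rate by Monte Carlo simulation for a two-class problem whose dictionaries have i.i.d. Gaussian entries. It assumes zero-mean Gaussian classes with covariance (B ⊗ A)(B ⊗ A)ᵀ + σ²I, equal priors, and small illustrative dimensions; the helper names `ks_covariance` and `ml_error_rate` are hypothetical.

```python
import numpy as np

def ks_covariance(A, B, sigma2):
    # Covariance of the vectorized signal: KS signal part plus white noise.
    D = np.kron(B, A)
    return D @ D.T + sigma2 * np.eye(D.shape[0])

def ml_error_rate(pairs, sigma2, n_trials, rng):
    # Hypothetical helper: empirical error of the ML classifier, equal priors.
    covs = [ks_covariance(A, B, sigma2) for A, B in pairs]
    precisions = [np.linalg.inv(S) for S in covs]
    logdets = [np.linalg.slogdet(S)[1] for S in covs]
    errors = 0
    for _ in range(n_trials):
        label = int(rng.integers(len(pairs)))
        A, B = pairs[label]
        X = rng.standard_normal((A.shape[1], B.shape[1]))
        Z = np.sqrt(sigma2) * rng.standard_normal((A.shape[0], B.shape[0]))
        y = (A @ X @ B.T + Z).reshape(-1, order="F")
        # Zero-mean Gaussian log-likelihood, up to a constant shared by all classes.
        scores = [-0.5 * (ld + y @ P @ y) for P, ld in zip(precisions, logdets)]
        errors += int(np.argmax(scores) != label)
    return errors / n_trials

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 4, 2, 4, 2
pairs = [(rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
         for _ in range(2)]
noise_levels = [1.0, 1e-2, 1e-4]
rates = [ml_error_rate(pairs, s2, 2000, rng) for s2 in noise_levels]
for s2, r in zip(noise_levels, rates):
    print("sigma^2 = %g: empirical error rate %.4f" % (s2, r))
```

The empirical error should shrink as the noise variance decreases; how fast it shrinks on a logarithmic scale is what the diversity order quantifies.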
II-B Diversity Order
For a fixed classification problem , the diversity order characterizes the decay of the misclassification probability as the noise power goes to zero. By analogy with the definition of the diversity order in wireless communications [35], we consider the asymptotic slope of on a logarithmic scale as , that is, as the mismatch between data and model becomes vanishingly small. Formally, the diversity order is defined as
(8) 
In Section III, we characterize exactly the diversity order for almost every .
II-C Classification Capacity
The classification capacity characterizes the number of unique subspaces that can be discerned as , , and go to infinity. That is, we derive bounds on how fast the number of classes can grow as a function of signal dimension while ensuring the misclassification probability decays to zero almost surely. Here, we define a variable and let it go to infinity. (Footnote 1: Note that is different from , where is the variable we let go to infinity and .) As grows to infinity, we let the dimensions , , and scale linearly with as follows:
(9) 
for and . We let the number of classes grow exponentially in as:
(10) 
for some , which we call the classification rate. We say that the classification rate is achievable if . For fixed signal dimension ratios and , we define as the supremum over all achievable classification rates, and we call (sometimes abbreviated by ) the classification capacity.
We can bound the classification capacity by the mutual information between the signal vector and the matrix pair that characterizes each Kronecker-structured class.
Lemma 1.
The classification capacity satisfies:
(11) 
where the mutual information is computed with respect to .
To prove lower bounds on the diversity order and classification capacity, we will need the following lemma, which gives the well-known Bhattacharyya bound on the probability of error of a maximum-likelihood classifier that chooses between two Gaussian hypotheses.
Lemma 2 ([36]).
Consider a signal distributed according to or with equal priors. Then, define
(12) 
Supposing maximum likelihood classification, the misclassification probability is bounded by
(13) 
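For intuition, the Bhattacharyya bound for two zero-mean Gaussian hypotheses with equal priors can be evaluated directly from the two covariances. The sketch below uses the textbook form, P_e ≤ ½ exp(−D_B) with D_B = ½ ln det((Σ₁+Σ₂)/2) − ¼ ln det Σ₁ − ¼ ln det Σ₂; this is assumed to match the lemma up to notation. The function name, dimensions, and noise grid are illustrative assumptions.

```python
import numpy as np

def bhattacharyya_bound(S1, S2):
    # Upper bound on the ML error for N(0, S1) versus N(0, S2), equal priors.
    _, logdet_avg = np.linalg.slogdet(0.5 * (S1 + S2))
    _, logdet_1 = np.linalg.slogdet(S1)
    _, logdet_2 = np.linalg.slogdet(S2)
    distance = 0.5 * logdet_avg - 0.25 * (logdet_1 + logdet_2)
    return 0.5 * np.exp(-distance)

def ks_cov(D, sigma2):
    # KS signal covariance plus white noise of variance sigma2.
    return D @ D.T + sigma2 * np.eye(D.shape[0])

rng = np.random.default_rng(1)
m1, n1, m2, n2 = 4, 2, 4, 2
D = [np.kron(rng.standard_normal((m2, n2)), rng.standard_normal((m1, n1)))
     for _ in range(2)]

sigma2_grid = [1.0, 1e-2, 1e-4, 1e-6, 1e-8]
bounds = [bhattacharyya_bound(ks_cov(D[0], s), ks_cov(D[1], s)) for s in sigma2_grid]

# Slope of log(bound) against log(sigma^2) at the low-noise end of the grid.
slope = (np.log(bounds[-1]) - np.log(bounds[-2])) / (
    np.log(sigma2_grid[-1]) - np.log(sigma2_grid[-2]))

# Identical hypotheses cannot be distinguished: the bound equals 1/2.
same_class_bound = bhattacharyya_bound(ks_cov(D[0], 1.0), ks_cov(D[0], 1.0))
print(bounds, slope, same_class_bound)
```

The low-noise slope printed here is the decay exponent of the bound for this particular random draw. Counting eigenvalues that scale with σ² suggests it approaches (r − n)/2, where r is the rank of the summed signal covariances and n = n₁n₂; that reading is an inference from the bound, not a statement taken from the text.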
II-D Subspace Geometry
We characterize the subspace geometry in terms of principal angles. Principal angles are the canonical angles between elements of two subspaces, and they induce a distance metric on the Grassmann manifold. If the principal angles between two subspaces are large, the subspaces are far apart and easily discernible.
Consider two linear subspaces and of , each of the same dimension. The principal angles between these two subspaces are defined recursively as follows:
where and the first principal angle is the smallest angle between all pairs of unit vectors in the first and the second subspaces [37].
The principal angles can be computed directly from the singular value decomposition (SVD) of , where and are orthonormal bases for the subspaces and , respectively.
where the cosines of the principal angles, , are the singular values of .
In this problem, suppose and are orthonormal bases for the subspaces and on which the columns of the signal approximately lie, and and are orthonormal bases for the subspaces and on which the rows of the signal approximately lie. Then we define the orthonormal bases and for the Kronecker-structured subspaces and , respectively. The cosines of the principal angles between and are the singular values of , as follows:
where the cosines of the principal angles between two Kronecker subspaces are given by the Kronecker product of the cosines of the principal angles between the two row subspaces and between the two column subspaces, that is, .
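The SVD-based computation of principal angles, and the product structure it induces for Kronecker subspaces, can be checked numerically. The sketch below builds orthonormal bases with a QR factorization, computes the principal-angle cosines for the column subspaces, the row subspaces, and the KS subspaces, and compares the last set with all pairwise products of the first two. The names `orthonormal_basis` and `principal_cosines` and the dimensions are assumptions for illustration.

```python
import numpy as np

def orthonormal_basis(M):
    # Reduced QR gives an orthonormal basis for the column space of M.
    Q, _ = np.linalg.qr(M)
    return Q

def principal_cosines(U, V):
    # Singular values of U^T V are the cosines of the principal angles
    # between span(U) and span(V), returned in descending order.
    return np.linalg.svd(U.T @ V, compute_uv=False)

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 6, 2, 5, 3
U1, U2 = (orthonormal_basis(rng.standard_normal((m1, n1))) for _ in range(2))
V1, V2 = (orthonormal_basis(rng.standard_normal((m2, n2))) for _ in range(2))

cos_col = principal_cosines(U1, U2)   # between the two column subspaces
cos_row = principal_cosines(V1, V2)   # between the two row subspaces

# The Kronecker product of orthonormal bases is an orthonormal basis
# of the corresponding KS subspace.
cos_ks = principal_cosines(np.kron(V1, U1), np.kron(V2, U2))

# All pairwise products of row and column cosines, sorted to match the SVD order.
cos_products = np.sort(np.kron(cos_row, cos_col))[::-1]
angles_ks = np.arccos(np.clip(cos_ks, 0.0, 1.0))

print(np.allclose(cos_ks, cos_products))
```

The agreement follows from the mixed-product property: (V₁ ⊗ U₁)ᵀ(V₂ ⊗ U₂) = (V₁ᵀV₂) ⊗ (U₁ᵀU₂), whose singular values are the pairwise products of the factors' singular values.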
III Diversity Order
As mentioned in Section II, the diversity order measures how quickly the misclassification probability decays with the noise power for a fixed number of discernible subspaces. By careful analysis using the Bhattacharyya bound, we derive an exact expression for the diversity order for almost every classification problem (Footnote 2: with respect to the Lebesgue measure over ). First, we state an expression that holds in general.
Theorem 1.
For a classification problem described by the tuple such that and for every , the diversity order is , where
(14) 
and where denotes the matrix rank.
Proof:
Applying the Bhattacharyya bound, the probability of a pairwise error between two Kronecker-structured classes and with covariances
is bounded by
(15) 
where
Using the well-known Kronecker product identities and we can write the matrix as
(16) 
It is trivial that , thus
Let and denote the nonzero eigenvalues of and , respectively, and let denote the nonzero eigenvalues of and denote its rank. Then, we can write the pairwise bound as in (17).
(17)
(18) 
By construction,
Using Weyl’s monotonicity theorem and for every , Therefore,
From this we can write
(19)  
(20)  
(21) 
Next, we bound via the union bound. Computing the pairwise error probability for every pair of subspaces and invoking the union bound over all pairs, we obtain:
Taking the logarithm of both sides, we obtain:
(22) 
Putting this and (21) into the definition of the diversity order from (8), we obtain
(23)  
(24)  
(25) 
Finally, [36] shows that the Bhattacharyya bound is exponentially tight as the pairwise error decays to zero. Furthermore, the union bound is exponentially tight. Therefore, the above inequality holds with equality, and ∎
For almost every classification problem, the rank has the same value, as we show in the next lemma.
Lemma 3.
For almost every classification problem , the matrices have rank
(26) 
where denotes the positive part of a number.
Proof:
Using standard matrix properties (e.g., [38]), we can write
(27) 
Applying Lemma 4 from Appendix B, we obtain
(28) 
Almost every matrix has full rank, so almost everywhere, and we can rewrite (28) as
(29) 
Next, we study the three possible cases for (29).
Case 1: and . Here,
Case 2: and . Here,
Case 3: and . Here,
where the first and second equalities for each case hold almost everywhere, and the third equality for each case follows from Lemma 4. Combining the three cases yields the claim. ∎
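A quick numerical experiment illustrates that the rank in question takes a single generic value for random dictionaries. The sketch below compares the numerical rank of the stacked matrix [B₁ ⊗ A₁, B₂ ⊗ A₂], which equals the rank of the sum of the two signal covariances, against a dimension-counting expression. That expression, 2n₁n₂ − (2n₁ − m₁)⁺(2n₂ − m₂)⁺ for n₁ ≤ m₁ and n₂ ≤ m₂, is an assumption of this example: it comes from the fact that two KS subspaces intersect in the Kronecker product of the column-subspace and row-subspace intersections, and it is not claimed to reproduce the lemma's statement verbatim. The helper names are hypothetical.

```python
import numpy as np

def generic_ks_rank(m1, n1, m2, n2):
    # Assumed dimension count (valid for n1 <= m1, n2 <= m2): two generic n1-dim
    # subspaces of R^m1 intersect in (2*n1 - m1)^+ dimensions, likewise for the
    # row subspaces, and the KS intersection is the product of the two.
    pos = lambda t: max(t, 0)
    return 2 * n1 * n2 - pos(2 * n1 - m1) * pos(2 * n2 - m2)

def empirical_ks_rank(m1, n1, m2, n2, rng):
    # Draw two random KS dictionaries and compute ranks numerically.
    D1 = np.kron(rng.standard_normal((m2, n2)), rng.standard_normal((m1, n1)))
    D2 = np.kron(rng.standard_normal((m2, n2)), rng.standard_normal((m1, n1)))
    stacked_rank = np.linalg.matrix_rank(np.hstack([D1, D2]))
    cov_sum_rank = np.linalg.matrix_rank(D1 @ D1.T + D2 @ D2.T)
    return stacked_rank, cov_sum_rank

rng = np.random.default_rng(0)
configs = [(6, 2, 6, 2), (4, 3, 4, 1), (5, 3, 4, 3), (3, 2, 3, 2)]
results = []
for cfg in configs:
    stacked_rank, cov_sum_rank = empirical_ks_rank(*cfg, rng)
    results.append((cfg, stacked_rank, cov_sum_rank, generic_ks_rank(*cfg)))
    print(cfg, stacked_rank, cov_sum_rank, generic_ks_rank(*cfg))
```

The configurations cover a case with no subspace overlap, a case where only one factor overlaps, and cases where both factors overlap so that the KS subspaces share a nontrivial intersection.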
Corollary 1.
For almost every classification problem , the diversity order is
(30) 
III-A Diversity Order Gap
The diversity order characterizes the slope of the error probability: the higher the diversity order, the faster the misclassification probability decays. Since Kronecker-structured subspaces come from a restricted set of subspaces, their error performance can be worse. Therefore, to assess the efficiency of Kronecker subspaces, we characterize the diversity order gap as the difference between the slopes of the misclassification probability for KS subspaces and for standard subspaces. This diversity order gap is a function of the signal dimensions, that is, and . We identify the signal dimension regimes in which the diversity order gap is significant and those in which it is zero.
Diversity order for KS subspaces:
(31) 
For the standard subspace model in (3), the signal of interest and coefficient vector , where and . From [32], for standard subspaces of the same dimensions, the diversity order is . This can be written in terms of the Kronecker signal dimensions.
Diversity order for standard subspaces:
(32) 
We observe that the diversity order for KS models is never greater than the diversity order of standard subspaces, for any value of . In some regimes it is strictly smaller, while in others the two coincide.
When and

if then :

if then :
where is the diversity order gap. For any other region no diversity order gap exists, that is, . The details are provided in Appendix A.
The high-SNR classification performance of KS subspaces is the same as general subspaces when the subspace dimensions are small, even though KS subspaces are structured, involve fewer parameters, and are easier to train.
III-B Misclassification Probability in Terms of Row and Column Subspace Geometry
We derive a more accurate and tight high-SNR approximation of the probability of error in terms of the principal angles between the KS subspaces, and also in terms of the principal angles between the individual row and column subspaces. Using the eigenvalue decompositions of the covariance of the row subspace and of the column subspace , where are orthonormal bases of the row and column subspaces, respectively, and are the eigenvalues of the row and column subspaces, we can write the signal covariance as:
Similarly, . From [39], the Kronecker product of two orthonormal matrices is an orthonormal matrix; thus are orthonormal bases and the diagonal elements of are the eigenvalues. From equation (27), the rank of the sum of two Kronecker products is written as:
(33) 
The intersection of the two KS subspaces determines this rank and hence plays an important role in bounding the misclassification probability from above. According to [33], one can write the covariances of KS subspaces in terms of subspace intersections as follows:
(34)  
(35) 
Here corresponds to the KS subspace intersection, and correspond to the set differences and , respectively. Here accounts for the overlap between the subspaces: the smaller the overlap between the subspaces, the easier it is to discern the classes. On the other hand, means complete overlap between the subspaces, in which case it becomes hard to discriminate between the classes.
Theorem 2.
As , the misclassification probability in terms of the principal angles between the individual row and column subspaces is upper bounded as
(36) 
where
(37) 
, and denotes the pseudo-determinant.
Proof:
Appendix C. ∎
In the case of no overlap between the subspaces, that is, , both and . Since the misclassification probability is inversely related to the product of all principal angles, the misclassification error becomes negligibly small. On the other hand, with subspace overlap , and has some positive value, there exist nontrivial principal angles that affect the classification performance, and it becomes very hard to distinguish between the subspaces.
IV Classification Capacity
In this section, we derive upper and lower bounds on the classification capacity that hold approximately for large . Detailed analysis can be found in the long version of the paper.
Theorem 3.
The classification capacity is upper bounded by
and