1 Introduction
Dictionary learning (DL) combined with sparse representation (SR) has become popular for many computer vision tasks. Many DL algorithms, e.g., K-SVD [2], were originally developed for unsupervised learning tasks. Recently, supervised DL algorithms that exploit the class-label information of the training samples have been proposed for classification tasks; they include D-KSVD [3] and LC-KSVD [4], to name a few. However, DL for high-dimensional data is computationally expensive. To circumvent this issue, dimensionality reduction (DR) techniques are used, which reduce the computational cost and highlight the low-dimensional discriminative features of the data.
In general, DR is applied first to the data samples, and the dimensionality-reduced data are then used for DL. The separately pre-learned DR projection matrix, however, does not fully exploit the latent structure of the data or preserve the features best suited for DL [5]. To address this issue, Feng et al. [6] propose integrating DL and DR to improve discriminative classification performance, where a constraint similar to Fisher linear discriminant analysis is imposed on the coefficient matrix. Similarly, Yang et al. [7] propose learning the projection matrix and a class-specific dictionary jointly. Liu et al. [8] report an integrated learning method with a nonnegative projection matrix. Foroughi et al. [9] discuss specific constraints on the coefficient matrix and on the projection matrix.
In many computer vision tasks, the data of interest often reside on a manifold, which is a generalization of the Euclidean space. A particular manifold of interest is the manifold of symmetric positive definite (SPD) matrices, which has been used widely in many applications. For example, region covariance matrices (RCMs), which are symmetric positive definite, give good performance in texture classification and face recognition tasks [10, 11]. The diagonal elements of an RCM represent the variances of the component features, and the off-diagonal elements indicate the respective correlations among them. Therefore, the RCM can represent multiple features in a natural way. It should be noted that the SPD matrices form a
Riemannian manifold, which allows us to understand the geometry of the space [12]. Cherian and Sra [13] exploit this manifold structure to propose a Riemannian DL and sparse coding (SC) algorithm. Separately, Riemannian DR techniques have been proposed in several works [14, 15, 16, 17].

In this paper, our main contribution is to learn the DL and DR jointly in the Riemannian framework. We propose RJDRDL, an algorithm for jointly learning the projection matrix for DR and a discriminative dictionary on the SPD manifold for classification tasks. The joint learning captures the interaction between the DR and DL procedures by connecting them in a unified framework. The model is formulated as an objective function over a sparse coefficient matrix and a Cartesian product manifold consisting of the Stiefel manifold and multiple SPD manifolds. Optimization over the Cartesian product manifold is cast as an optimization problem on Riemannian manifolds [18]. Optimization over the sparse coefficient matrix, on the other hand, is a convex program.
This paper is organized as follows. Section 2 briefly introduces the SPD manifold and Riemannian DL. Section 3 details the proposed RJDRDL algorithm. Our initial results on the MNIST image classification task in Section 4 show that RJDRDL outperforms state-of-the-art algorithms in the domain.
2 SPD manifold and Riemannian DL
This section briefly explains the geometry of the SPD manifold and then introduces Riemannian DL. Hereinafter, we denote scalars with lowercase letters (e.g., $a$), vectors with bold lowercase letters (e.g., $\mathbf{a}$), and matrices with boldface capitals (e.g., $\mathbf{A}$). We call a multidimensional or multi-order array a tensor, which is denoted by a calligraphic letter (e.g., $\mathcal{A}$).

2.1 Geometry of the SPD manifold [12]
A manifold $\mathcal{M}$ of dimension $d$ is a topological space that locally resembles the Euclidean space $\mathbb{R}^d$ in a neighborhood of each point $X \in \mathcal{M}$. All the tangent vectors at $X$ form a vector space, called the tangent space of $\mathcal{M}$ at $X$ and denoted $T_X\mathcal{M}$. When endowed with a smoothly varying metric, i.e., an inner product $\langle \cdot, \cdot \rangle_X$ between vectors in the tangent space at each $X$, the manifold is called a Riemannian manifold. The space of $d \times d$ SPD matrices, denoted $\mathcal{S}_{++}^d$, is a Riemannian manifold, called the SPD manifold, when endowed with an appropriate Riemannian metric. The tangent space at any point of $\mathcal{S}_{++}^d$ is identifiable with the set of $d \times d$ symmetric matrices $\mathcal{S}^d$.
One particular choice of Riemannian metric on the SPD manifold is the affine-invariant Riemannian metric (AIRM) [19, 12]. If $P$ is an element of $\mathcal{S}_{++}^d$ and $\xi, \eta \in T_P\mathcal{S}_{++}^d$, the AIRM is defined as
$$\langle \xi, \eta \rangle_P = \mathrm{tr}(P^{-1}\xi P^{-1}\eta).$$
The metric is invariant under the affine action of any invertible matrix $G$, i.e., $\langle G\xi G^T, G\eta G^T \rangle_{GPG^T} = \langle \xi, \eta \rangle_P$. The Riemannian metric provides a way to compute the distance between two points on the manifold. Because the SPD manifold with the AIRM has a unique shortest path, called the geodesic, between every two points [12, Section 6], the geodesic distance between $X, Y \in \mathcal{S}_{++}^d$ is given as
$$d^2(X, Y) = \|\mathrm{Log}(X^{-1/2} Y X^{-1/2})\|_F^2,$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\mathrm{Log}(\cdot)$ denotes the matrix logarithm.
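The geodesic distance above reduces to a few lines of linear algebra. The sketch below (an illustration, not the paper's implementation) computes $X^{-1/2}$ and the matrix logarithm via eigendecompositions, which is numerically safe because both arguments are symmetric positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def spd_logm(M):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = eigh(M)
    return V @ np.diag(np.log(w)) @ V.T

def airm_distance(X, Y):
    """AIRM geodesic distance d(X, Y) = ||Log(X^{-1/2} Y X^{-1/2})||_F."""
    w, V = eigh(X)
    X_isqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T   # X^{-1/2}
    return np.linalg.norm(spd_logm(X_isqrt @ Y @ X_isqrt), 'fro')
```

The affine invariance can be checked directly: for any invertible $G$, `airm_distance(G @ X @ G.T, G @ Y @ G.T)` agrees with `airm_distance(X, Y)` up to floating-point error.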
2.2 Riemannian DL (RDL)
Let $\mathcal{X} = \{X_1, \ldots, X_N\}$ be the input training sample set, where $X_n \in \mathcal{S}_{++}^d$ denotes the $n$-th sample, which is an SPD matrix. The dictionary to be learned is denoted as $\mathcal{B} = \{B_1, \ldots, B_M\}$, where each $B_m \in \mathcal{S}_{++}^d$ is an atom of the dictionary. It should be noted that $\mathcal{X}$ and $\mathcal{B}$ are third-order tensors. We also denote a sparse coefficient vector as $\alpha_n \in \mathbb{R}^M$, which forms the coefficient matrix $A = [\alpha_1, \ldots, \alpha_N]$, to represent a query SPD matrix using the dictionary $\mathcal{B}$. It should also be emphasized that $\alpha_n$ is required to be nonnegative to ensure that the resultant combination with the dictionary is positive definite. Therefore, we specifically represent a sparse conic combination of the dictionary and the coefficient vector as $\mathcal{B}\alpha := \sum_{m=1}^{M} \alpha_m B_m$ for $\alpha \geq 0$. Finally, the problem formulation is defined as
$$\min_{\mathcal{B},\, A \geq 0} \; \sum_{n=1}^{N} d^2(X_n, \mathcal{B}\alpha_n) + \Omega(A) + \Theta(\mathcal{B}),$$
where $\Omega(A)$ and $\Theta(\mathcal{B})$ respectively represent the regularizers on the coefficient vectors and the dictionary [13]. To optimize this nonconvex problem, an alternating minimization algorithm is used for the DL and SC subproblems.
3 RJDRDL on SPD manifolds
3.1 Problem formulation of RJDRDL
Let $\mathcal{X} = \{\mathcal{X}_1, \ldots, \mathcal{X}_C\}$ be the set of SPD matrices of size $d \times d$ accompanied with class labels, where $\mathcal{X}_c$ denotes the training samples of the $c$-th class. $\mathcal{X}_c$ is further composed of individual samples as $\mathcal{X}_c = \{X_{c,1}, \ldots, X_{c,N_c}\}$, where $X_{c,i} \in \mathcal{S}_{++}^d$ and $N_c$ is the number of samples of the $c$-th class in the training set, i.e., $N = \sum_{c=1}^{C} N_c$. Both $\mathcal{X}$ and $\mathcal{X}_c$ are third-order tensors. The dictionary is denoted as $\mathcal{B} = \{\mathcal{B}_1, \ldots, \mathcal{B}_C\}$, where $\mathcal{B}_c$ is the class-specific sub-dictionary associated with the $c$-th class. $\mathcal{B}_c$ is also composed as $\mathcal{B}_c = \{B_{c,1}, \ldots, B_{c,M_c}\}$, where $M_c$ is the number of atoms of the $c$-th class sub-dictionary, and $M = \sum_{c=1}^{C} M_c$.
As described earlier, the proposed RJDRDL algorithm learns not only the dictionary $\mathcal{B}$, but also the projection matrix $U \in \mathbb{R}^{d \times r}$ ($r < d$), which projects the $d$-dimensional data onto an $r$-dimensional data space. More specifically, $X$ is mapped into $U^T X U$. Here, we need only full-rankness of $U$ to guarantee that $U^T X U$ is an SPD matrix. Equivalently, we can enforce an orthogonality constraint on $U$, i.e., $U^T U = I_r$. The space of such orthonormal matrices is called the Stiefel manifold $\mathrm{St}(r, d)$.
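As a quick sanity check of this construction, the following sketch (illustrative code, not part of the paper) draws a random point on $\mathrm{St}(r, d)$ via a QR factorization and verifies that $U^T X U$ remains SPD.

```python
import numpy as np

def random_stiefel(d, r, rng):
    """A random point on the Stiefel manifold St(r, d), i.e. U^T U = I_r."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return Q

def project_spd(U, X):
    """Map the d x d SPD matrix X to the r x r matrix U^T X U."""
    return U.T @ X @ U

rng = np.random.default_rng(0)
d, r = 10, 3
A = rng.standard_normal((d, d))
X = A @ A.T + np.eye(d)              # SPD by construction
U = random_stiefel(d, r, rng)
Xr = project_spd(U, X)
# Full column rank of U guarantees that U^T X U is again SPD:
print(np.all(np.linalg.eigvalsh(Xr) > 0))  # prints True
```

Orthonormality of $U$ is not needed for positive definiteness, but restricting to the Stiefel manifold removes the scale ambiguity of $U$ and enables Riemannian optimization over a compact search space.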
Considering that the model parameters are $(U, \mathcal{B}) \in \mathcal{M}$ and $A$, where $\mathcal{M}$ denotes the product manifold $\mathrm{St}(r, d) \times (\mathcal{S}_{++}^r)^M$, our proposed formulation is
$$\min_{(U, \mathcal{B}) \in \mathcal{M},\, A \geq 0} \; F(U, \mathcal{B}, A) + \lambda_1 G_A(A) + \lambda_2 G_U(U) + \lambda_3 \|A\|_1, \qquad (1)$$
where $F(U, \mathcal{B}, A)$ is the discriminative reconstruction error, and $G_A(A)$ and $G_U(U)$ represent the graph-based constraints on the coefficient and the projection matrices, respectively. $\|A\|_1$ imposes sparsity on $A$. The $\lambda$s are nonnegative regularization parameters. $F$, $G_A$, and $G_U$ are described below.
Discriminative reconstruction error term $F(U, \mathcal{B}, A)$: The dictionary $\mathcal{B}$ is expected to approximate the dimensionality-reduced samples from all classes, whose error is represented as $d^2(U^T X_{c,i} U, \mathcal{B}\alpha_{c,i})$, where $d$ is the Riemannian geodesic distance on the SPD manifold. In addition, to impose more discriminative power on $\mathcal{B}$, the $c$-th sub-dictionary $\mathcal{B}_c$ is expected to approximate the dimensionality-reduced training samples associated with the $c$-th class. Here, let $\alpha^j$ be the sub-vector of $\alpha$ that corresponds to the $j$-th sub-dictionary, i.e., $\alpha = [\alpha^1; \ldots; \alpha^C]$ with $\alpha^j \in \mathbb{R}^{M_j}$. The corresponding error, $d^2(U^T X_{c,i} U, \mathcal{B}_c \alpha_{c,i}^c)$, should be small. The sub-vectors corresponding to the other classes should be nearly zero, such that $\sum_{j \neq c} \|\alpha_{c,i}^j\|_2^2$ is small. Consequently, we obtain the cost function for $F$ as
$$F(U, \mathcal{B}, A) = \sum_{c=1}^{C} \sum_{i=1}^{N_c} \Big[ d^2(U^T X_{c,i} U, \mathcal{B}\alpha_{c,i}) + d^2(U^T X_{c,i} U, \mathcal{B}_c \alpha_{c,i}^c) + \tau \sum_{j \neq c} \|\alpha_{c,i}^j\|_2^2 \Big], \qquad (2)$$
where $\tau$ is the regularization parameter.
Graph-based coefficient term $G_A(A)$: We enforce $A$ to be more discriminative; therefore, we seek to constrain the intra-class coefficients to be mutually similar and the inter-class ones to be highly dissimilar. To this end, we first construct a geometry-aware intrinsic graph for intra-class compactness and a penalty graph for inter-class discrimination. For two points $X_i$ and $X_j$,
$$w_{ij} = \begin{cases} 1 & \text{if } X_j \in N_w(X_i), \\ 0 & \text{otherwise,} \end{cases} \qquad b_{ij} = \begin{cases} 1 & \text{if } X_j \in N_b(X_i), \\ 0 & \text{otherwise,} \end{cases}$$
where $N_w(X)$ is the set of $k_w$ nearest intra-class neighbors of $X$ in terms of the geodesic distance. Similarly, $N_b(X)$ is the set of $k_b$ nearest inter-class neighbors of $X$. Considering the distance of pairs of coding coefficient vectors $\alpha_i$ and $\alpha_j$ as an indicator of discrimination capability, the final graph-based coefficient term is defined as
$$G_A(A) = \frac{1}{2} \sum_{i,j} \|\alpha_i - \alpha_j\|_2^2 \, l_{ij},$$
where $l_{ij} = w_{ij} - b_{ij}$ [14]. This term enforces minimization of the difference of two coding coefficients if they are from the same class, whereas the difference is maximized if they are from different classes.
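The two graphs and the resulting term can be sketched as follows. For illustration, a precomputed pairwise distance matrix `D` stands in for the geodesic distances, and the standard identity $\frac{1}{2}\sum_{i,j}\|\alpha_i-\alpha_j\|_2^2\, l_{ij} = \mathrm{tr}(A L A^T)$, with $L$ the Laplacian of the signed graph $l = w - b$, is used (the function and variable names here are ours).

```python
import numpy as np

def knn_graphs(D, labels, k_w=2, k_b=2):
    """Binary intra-class graph w and inter-class penalty graph b built
    from a pairwise distance matrix D (e.g., geodesic distances)."""
    n = len(labels)
    w = np.zeros((n, n)); b = np.zeros((n, n))
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        diff = [j for j in range(n) if labels[j] != labels[i]]
        for j in sorted(same, key=lambda j: D[i, j])[:k_w]:
            w[i, j] = w[j, i] = 1.0
        for j in sorted(diff, key=lambda j: D[i, j])[:k_b]:
            b[i, j] = b[j, i] = 1.0
    return w, b

def graph_coefficient_term(A, w, b):
    """G_A(A) = 1/2 sum_ij ||a_i - a_j||^2 (w_ij - b_ij) = tr(A L A^T),
    where the columns of A are the coding coefficient vectors."""
    l = w - b
    L = np.diag(l.sum(axis=1)) - l        # Laplacian of the signed graph
    return np.trace(A @ L @ A.T)
```

The trace form makes the term a simple quadratic in $A$, which keeps the SC subproblem convex.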
Graph-based projection term $G_U(U)$: We also learn a projection matrix $U$ that preserves class information and maps the training samples to a low-dimensional discriminative space. Consequently, $G_U(U)$ is defined as
$$G_U(U) = \sum_{i,j} d^2(U^T X_i U, U^T X_j U)\, a_{ij},$$
where the affinity matrix $a_{ij} = w_{ij} - b_{ij}$ assigns different weights to the Riemannian distances between different points, e.g., the distance between $U^T X_i U$ and $U^T X_j U$ is assigned the weight $a_{ij}$.

3.2 Optimization of RJDRDL
The objective function of (1) is divided into two subproblems, which are solved in an alternating fashion. We discuss both subproblems below.
DL subproblem on the product manifold: We consider the DL subproblem of (1) by optimizing the projection matrix $U$ and the tensor-formed dictionary $\mathcal{B}$, keeping $A$ fixed to its current estimate $A_k$. Consequently, the problem can be reformulated as
$$\min_{(U, \mathcal{B}) \in \mathcal{M}} \; F(U, \mathcal{B}, A_k) + \lambda_2 G_U(U).$$
We exploit the Riemannian optimization framework on the Cartesian product manifold (consisting of the Stiefel manifold and multiple SPD manifolds). In particular, we use the Riemannian conjugate gradient (RCG) method to solve the DL subproblem. The Riemannian algorithms are theoretically guaranteed to converge to a stationary point; the convergence analysis follows from [20, 21]. To this end, we require an expression for the Riemannian gradient. According to [13], under the AIRM, the Riemannian gradient with respect to a dictionary atom $B_m$ is obtained as $\mathrm{grad} f(B_m) = B_m\, \mathrm{sym}(\nabla f(B_m))\, B_m$, where $\nabla f(B_m)$ is the Euclidean gradient of $f$ with respect to $B_m$ and $\mathrm{sym}(Z) = (Z + Z^T)/2$.
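The Euclidean-to-Riemannian gradient conversion on an SPD factor can be sketched as below. The toy cost $f(P) = \frac{1}{2}\|P - S\|_F^2$ is ours, chosen only to check that the converted gradient is a symmetric matrix (a valid tangent vector) and vanishes at the minimizer.

```python
import numpy as np

def sym(Z):
    """Symmetric part of a square matrix."""
    return 0.5 * (Z + Z.T)

def egrad2rgrad(P, egrad):
    """AIRM Riemannian gradient from the Euclidean one:
    grad f(P) = P sym(egrad) P  (cf. [13])."""
    return P @ sym(egrad) @ P

# Toy cost f(P) = 0.5 * ||P - S||_F^2, whose Euclidean gradient is P - S.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
S = A @ A.T + np.eye(3)              # SPD target
P = S + 0.1 * np.eye(3)              # a nearby SPD point
rgrad = egrad2rgrad(P, P - S)
```

In practice, toolboxes such as Manopt perform exactly this conversion internally, so only the Euclidean gradient of the cost needs to be supplied.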
SC subproblem: We consider the SC subproblem of (1) by solving for $A$, keeping $U$ and $\mathcal{B}$ fixed to $U_k$ and $\mathcal{B}_k$, respectively. The problem, therefore, can be reformulated as
$$\min_{A \geq 0} \; F(U_k, \mathcal{B}_k, A) + \lambda_1 G_A(A) + \lambda_3 \|A\|_1,$$
where the dimensionality-reduced sample $U_k^T X_{c,i} U_k$ is denoted as $\tilde{X}_{c,i}$ for simplicity. Here, we calculate each column of $A$, i.e., $\alpha_{c,i}$, sequentially by fixing the other coefficients.
It should be emphasized that the above problem is convex and is solved with a gradient projection algorithm. Specifically, we use the spectral projected gradient (SPG) solver of Birgin et al., as in [13].
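A projected-gradient iteration of the kind SPG refines can be sketched as follows. For brevity, a Euclidean least-squares data fit below stands in for the geodesic-distance term, so this illustrates the solver style rather than the paper's exact subproblem. Note that for $\alpha \geq 0$, $\|\alpha\|_1$ reduces to the smooth linear term $\sum_m \alpha_m$.

```python
import numpy as np

def nonneg_sparse_code(D, x, lam=0.1, iters=500):
    """min_{alpha >= 0} 0.5 ||x - D alpha||^2 + lam * sum(alpha),
    solved by gradient steps followed by projection onto the
    nonnegative orthant."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D.T @ D, 2)     # 1 / Lipschitz constant
    for _ in range(iters):
        grad = D.T @ (D @ alpha - x) + lam
        alpha = np.maximum(alpha - step * grad, 0.0)   # projection
    return alpha
```

SPG itself accelerates this basic iteration with Barzilai–Borwein step sizes and a nonmonotone line search.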
Classification scheme: We apply the learned projection matrix $U$ and dictionary $\mathcal{B}$ to a query test sample $X_t$ to estimate its class label. For this purpose, the test sample is first projected into the low-dimensional space by $U$, i.e., $\tilde{X}_t = U^T X_t U$. Subsequently, it is coded over $\mathcal{B}$ by solving
$$\hat{\alpha} = \arg\min_{\alpha \geq 0} \; d^2(\tilde{X}_t, \mathcal{B}\alpha) + \lambda \|\alpha\|_1,$$
where $\hat{\alpha} = [\hat{\alpha}^1; \ldots; \hat{\alpha}^C]$ and $\hat{\alpha}^c$ is the sub-vector corresponding to the sub-dictionary $\mathcal{B}_c$. The residual for the $c$-th class is calculated as
$$e_c = d^2(\tilde{X}_t, \mathcal{B}_c \hat{\alpha}^c) + \theta \|\hat{\alpha} - m_c\|_2^2,$$
where $\theta$ is a weight to balance the two terms and $m_c$ is the mean vector of the learned coding coefficients of the $c$-th class, i.e., $m_c = \frac{1}{N_c}\sum_{i=1}^{N_c} \alpha_{c,i}$. We adopt the distance between $\hat{\alpha}$ and the mean vector of the learned coding coefficients of the corresponding $c$-th class because it gives better classification results, as shown in [22]. Finally, the identity of the test sample is determined by selecting the class label with the minimum residual $e_c$.
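The residual-based decision rule can be sketched as follows, with a Euclidean reconstruction error standing in for the geodesic distance (names such as `sub_dicts` and `class_means` are ours, for illustration).

```python
import numpy as np

def classify(alpha, x_proj, sub_dicts, class_means, theta=0.5):
    """Pick the class c minimizing
    e_c = ||x - D_c alpha^c||^2 + theta * ||alpha - m_c||^2."""
    errs, start = [], 0
    for Dc, mc in zip(sub_dicts, class_means):
        k = Dc.shape[1]
        alpha_c = alpha[start:start + k]; start += k
        rec = np.linalg.norm(x_proj - Dc @ alpha_c) ** 2
        errs.append(rec + theta * np.linalg.norm(alpha - mc) ** 2)
    return int(np.argmin(errs))
```

With `theta=0`, the rule degenerates to the plain class-wise reconstruction residual of standard SRC; the mean-coefficient term is what exploits the learned class structure of $A$.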
4 Numerical experiments
In this section, we show the effectiveness of the proposed RJDRDL algorithm against state-of-the-art classification algorithms on SPD matrices.
The comparison methods are the following. NN-AIRM is the AIRM-based nearest neighbor (NN) classifier, and NN-Stein is the Stein-metric-based NN classifier. The Stein metric is a symmetrized type of Bregman divergence and is defined as
$$S(A, B) = \log\det\Big(\frac{A + B}{2}\Big) - \frac{1}{2}\log\det(AB),$$
where $A, B \in \mathcal{S}_{++}^d$ [23]. DR-NN-AIRM is the AIRM-based NN classifier applied to the dimensionality-reduced training samples, which are obtained by the Riemannian DR (RDR) algorithm [14]. DR-NN-Stein is the same algorithm, but the distance metric is the Stein metric. RSRC-AIRM and RSRC-Stein are the sparse representation classifiers (SRCs) based on the AIRM and the Stein metric, respectively. RKSRC stands for the kernel-based SRC with the Stein metric. RDL is the Riemannian DL with the SRC classifier [13]. RDRDL-AIRM and RDRDL-Stein are the DL with the SRC classifier applied after the RDR algorithm.

We implement our proposed algorithm in Matlab. The DL subproblem on the product manifold makes use of the Matlab toolbox Manopt [24]. The Matlab codes of RDL, RDR, and RKSRC are downloaded from the respective authors' homepages.
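The Stein divergence is cheap to evaluate via log-determinants, which is one reason Stein-based baselines are fast; a minimal sketch:

```python
import numpy as np

def stein_divergence(A, B):
    """S(A, B) = logdet((A + B)/2) - 0.5 * logdet(A B), for SPD A, B."""
    _, ld_mid = np.linalg.slogdet(0.5 * (A + B))
    _, ld_a = np.linalg.slogdet(A)
    _, ld_b = np.linalg.slogdet(B)
    return ld_mid - 0.5 * (ld_a + ld_b)
```

Unlike the AIRM distance, it needs no matrix square roots or logarithms, only determinants.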
We use the MNIST dataset (http://yann.lecun.com/exdb/mnist/), which consists of handwritten digits 0–9, with 60,000 training images and 10,000 test images. For this dataset, we generate RCMs [10] computed at each pixel location from a feature vector of intensity and derivative features. Then, three RCMs, one from the entire image, one from the left half, and one from the right half, are concatenated diagonally, which produces one block-diagonal RCM for each image. We execute 10 runs with randomly selected test samples and with 5 and 10 training samples per class. The dictionary size is equal to the number of training samples; therefore, the case of 5 samples per class represents an extreme situation. The regularization parameters of the proposed algorithm, the neighborhood sizes in $G_A(A)$ and $G_U(U)$, and the reduced dimension $r$ are set based on cross-validation. We initialize $U$ from the DR method [14] using a single sample per class.
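The construction of a region covariance matrix can be sketched as follows. The exact per-pixel features used in the experiments are not spelled out above, so the intensity-plus-gradient features here are only an assumed, illustrative choice; a small ridge keeps the matrix strictly SPD.

```python
import numpy as np

def region_covariance(img, eps=1e-6):
    """RCM of an image region: covariance of per-pixel feature vectors.
    Feature choice here (an assumption): intensity and absolute first
    derivatives. A ridge eps*I makes the matrix strictly SPD."""
    gy, gx = np.gradient(img.astype(float))
    F = np.stack([img.ravel().astype(float),
                  np.abs(gx).ravel(), np.abs(gy).ravel()])   # 3 x n_pixels
    mu = F.mean(axis=1, keepdims=True)
    C = (F - mu) @ (F - mu).T / (F.shape[1] - 1)
    return C + eps * np.eye(F.shape[0])
```

Three such matrices (whole image, left half, right half) placed on a block diagonal, e.g. via `scipy.linalg.block_diag`, give one descriptor per image as described above.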
The results of the classification accuracy are presented in Table 1. The table shows the superior performance of the proposed RJDRDL against the state-of-the-art algorithms. It should be noted that RDRDL (with both the Stein and AIRM metrics) gives poor performance, implying that a separately pre-learned DR projection matrix might not be optimal for the subsequent DL.
Table 1: Classification accuracy (average ± standard deviation).

Algorithm          | Dictionary size 5 | Dictionary size 10
NN-AIRM            |                   |
NN-Stein           |                   |
DR-NN-AIRM         |                   |
DR-NN-Stein        |                   |
RSRC-AIRM          |                   |
RSRC-Stein         |                   |
RKSRC              |                   |
RDL                |                   |
RDRDL-AIRM         |                   |
RDRDL-Stein        |                   |
RJDRDL (proposed)  |                   |
5 Conclusions
We have presented a Riemannian joint framework, RJDRDL, for performing dimensionality reduction along with discriminative dictionary learning on the set of SPD matrices for classification tasks. We formulate the joint learning as an objective function with a reconstruction error term and with constraints on the projection matrix, the dictionary, and the sparse coefficient codes. Our numerical experiments demonstrate the benefit of jointly performing DL and DR. In particular, RJDRDL outperforms existing state-of-the-art algorithms on the MNIST image classification task.
Extending the framework to other metrics on the SPD manifold (e.g., the Stein metric or the log-Euclidean metric) is a topic of future research, as is a competitive numerical implementation with extensive evaluations on other real-world datasets.
Acknowledgements
H. Kasai was partially supported by JSPS KAKENHI Grant Numbers JP16K00031 and JP17H01732.
References
[1] H. Kasai and B. Mishra. Riemannian joint dimensionality reduction and dictionary learning on symmetric positive definite manifold. In EUSIPCO, 2018.
 [2] M. Aharon, M. Elad, and A. Bruckstein. KSVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Sig. Proc., 54(11):4311–4322, 2006.
[3] Q. Zhang and B. Li. Discriminative K-SVD for dictionary learning in face recognition. In CVPR, 2010.
 [4] Z. Jiang, Z. Lin, and L.S. Davis. Learning a discriminative dictionary for sparse coding via label consistent KSVD. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2651–2664, 2013.
 [5] H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa. Sparse embedding: A framework for sparsity promoting dimensionality reduction. In ECCV, pages 414–427, 2012.
 [6] Z. Feng, L. Yang, M. Zhang, Y. Liu, and D. Zhang. Joint discriminative dimensionality reduction and dictionary learning for face recognition. Pattern Recognition, 46(8):2134–2143, 2013.
[7] B. Q. Yang, C.-C. Gu, K.-J. Wu, T. Zhang, and X.-P. Guan. Simultaneous dimensionality reduction and dictionary learning for sparse representation based classification. Multimedia Tools and Applications, 76(6):8969–8990, 2016.
 [8] W. Liu, Z. Yu, Y. Wen, R. Lin, and M. Yang. Jointly learning nonnegative projection and dictionary with discriminative graph constraints for classification. In BMVC, 2016.
 [9] H. Foroughi, N. Ray, and H. Zhang. Object classification with joint projection and lowrank dictionary learning. IEEE Trans. on Image Process., 27(2):806–821, 2018.
 [10] Y. Pang, Y. Yuan, and X. Li. Gaborbased region covariance matrices for face recognition. IEEE Trans. Circuits Syst. Video Technol., 18(7):989–993, 2008.
 [11] O. Tuzel, F. Porikli, and P. Meer. Region covariance: a fast descriptor for detection and classification. In ECCV, 2006.
 [12] R. Bhatia. Positive definite matrices. Princeton series in applied mathematics. Princeton University Press, 2007.
 [13] A. Cherian and S. Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Trans. Neural Netw. Learn. Syst., 2016.
[14] M. Harandi, M. Salzmann, and R. Hartley. Dimensionality reduction on SPD manifolds: The emergence of geometry-aware methods. IEEE Trans. Pattern Anal. Mach. Intell., 2017.
[15] Z. Huang and L. V. Gool. A Riemannian network for SPD matrix learning. In AAAI, 2017.
[16] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In ICML, 2015.
[17] Z. Huang, R. Wang, X. Li, W. Liu, S. Shan, L. V. Gool, and X. Chen. Geometry-aware similarity learning on SPD manifolds for visual recognition. IEEE Trans. Circuits Syst. Video Technol., 2017.
 [18] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[19] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. Int. Journal of Computer Vision, 66(1):41–66, 2006.
 [20] H. Sato and T. Iwai. A new, globally convergent Riemannian conjugate gradient method. Optimization, 64(4):1011–1031, 2015.
 [21] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim., 22(2):596–627, 2012.
 [22] M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher discrimination dictionary learning for sparse representation. In ICCV, 2011.

[23] S. Sra. A new metric on the manifold of kernel matrices with application to matrix geometric means. In NIPS, 2012.
[24] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt: a Matlab toolbox for optimization on manifolds. JMLR, 15(1):1455–1459, 2014.