1 Introduction
Symmetric Positive Definite (SPD) matrices arise naturally in several computer vision applications, such as covariances when modeling data using Gaussians, as kernel matrices for high-dimensional embedding, as points in diffusion MRI [36], and as structure tensors in image processing [5]. Furthermore, SPD matrices in the form of Region CoVariance Descriptors (RCoVDs) [43] offer an easy way to compute a representation that fuses multiple modalities (e.g., color, gradients, filter responses, etc.) in a cohesive and compact format. In various mainstream vision applications, including tracking, re-identification, and object, texture, and activity recognition, SPD matrices have repeatedly advanced the state-of-the-art solutions [6, 44, 17]. SPD matrices are even used as second-order pooling operators for enhancing the performance of popular deep learning architectures [23, 21].
SPD matrices, due to their positive definiteness property, form a cone in the Euclidean space. However, analyzing these matrices through their Riemannian geometry (or the associated Lie algebra) helps avoid unlikely/unrealistic solutions, thereby improving the outcomes. For example, in diffusion MRI [36, 3], it has been shown that the Riemannian structure (which comes with an affine invariant metric) is immensely useful for accurate modeling. A similar observation is made for RCoVDs [9, 19, 45]. This has resulted in the exploration of various geometries and similarity measures for SPD matrices, viewing them from disparate perspectives. A few notable such measures are: (i) the affine invariant Riemannian metric (AIRM) using the natural Riemannian geometry [36], (ii) the Jeffreys KL divergence (KLDM) using relative entropy [34], (iii) the Jensen-Bregman log-det divergence using information geometry [9], and (iv) the Burg matrix divergence [27], among several others [11].
Each of the aforementioned measures has distinct mathematical properties and as such performs differently for a given problem. However, and to some extent surprisingly, all of them can be obtained as functions acting on the generalized eigenvalues of their inputs. Recently, Cichocki et al. [11] showed that all these measures can be interpreted in a unifying setup using the alpha-beta log-det divergence (ABLD), and that each measure can be derived as a distinct parametrization of this divergence. For example, one could obtain JBLD from ABLD (up to a scaling factor) using α = β = 1/2, and AIRM as the limit α, β → 0. With such an interesting discovery, it is natural to ask if the parameters α and β can be learned for a given task in a data-driven way. This not only answers which measure is the right choice for a given problem, but also allows for deriving new measures that are not among the popular ones listed above.
In this paper, we make the first attempt at learning an α-β log-det divergence on SPD matrices for computer vision applications, dubbed Information Divergence and Dictionary Learning (IDDL). We cast the learning problem in a discriminative ridge regression setup where the goal is to learn α and β that maximize the classification accuracy for a given task. Being vigilant to the computational complexity of the resulting solution, we propose to embed SPD matrices using a dictionary in our metric learning framework. Our proposal enables us to learn the embedding (or more accurately the dictionary that identifies the embedding), along with the proper choice of the metric (i.e., the parameters α and β of the ABLD) and a classifier, jointly. The output of our IDDL is a vector, each entry of which computes a potentially distinct ABLD to a distinct dictionary atom.
To achieve our goal, we propose an efficient formulation that benefits from recent advances in optimization over Riemannian manifolds to minimize a nonconvex and constrained objective. We provide extensive experiments using IDDL on a variety of computer vision applications, namely (i) action recognition, (ii) texture recognition, (iii) 3D shape recognition, and (iv) cancerous tissue recognition. We also provide insights into our learning scheme through extensive experiments on the parameters of the ABLD, and ablation studies under various performance settings. Our results demonstrate that our scheme achieves state-of-the-art accuracies against competing techniques, including recent sparse coding, Riemannian metric learning, and kernel coding schemes.
2 Related Work
The α-β log-det divergence is a matrix generalization of the well-known α-β divergence [10] that computes the (a)symmetric (dis)similarity between two finite positive measures (data densities). As the name implies, the α-β divergence is a unification of the so-called α-family of divergences [2] (which includes popular measures such as the KL-divergence, the Jensen-Shannon divergence, and the chi-square divergence) and the β-family [4] (including the squared Euclidean distance and the Itakura-Saito distance). Compared against several standard measures for computing similarities, both α- and β-divergences are known to lead to solutions that are robust to outliers and additive noise [29], thereby improving application accuracy. They have been used in several statistical learning applications, including nonnegative matrix factorization [12, 25, 13], nearest neighbor embedding [20], and blind-source separation [33].
A class of methods with similarities to our formulation are metric learning schemes on SPD matrices. One popular technique is the manifold-to-manifold embedding of large SPD matrices into a lower-dimensional SPD space in a discriminative setting [19]. Log-Euclidean metric learning has also been proposed for this embedding in [22, 40]. While we also learn a metric in a discriminative setup, ours is different in that we learn an information divergence. In Thiyam et al. [42], ABLD is proposed as a replacement for the symmetric KL divergence, to better characterize the learning of a decision hyperplane for BCI applications. In contrast, we propose to embed the data matrices as vectors, each dimension of these vectors learning a different ABLD, thus leading to a richer representation of the input matrix. (Automatic selection of the parameters of the α-β divergence is investigated in [38, 14]; however, these works deal with scalar density functions in a maximum-likelihood setup and do not consider the optimization of α and β jointly.)
Vectorial embedding of SPD matrices has been investigated using disparate formulations for computer vision applications. As alluded to earlier, the log-Euclidean projection [3] is a common way to achieve this, where an SPD matrix is isomorphically mapped to the Euclidean space of symmetric matrices using the matrix logarithm. Popular sparse coding schemes have been extended to SPD matrices in [8, 39, 46] using SPD dictionaries, where the resulting sparse vector is assumed Euclidean. Another popular way to handle the nonlinear geometry of SPD matrices is to resort to kernel schemes by embedding the matrices in an infinite-dimensional Hilbert space which is assumed to be linear [18, 31, 17]. In all these methods, the underlying similarity measure is fixed and is usually chosen to be one among the popular log-det divergences or the log-Euclidean metric.
In contrast to all these methods, and to the best of our knowledge, this is the first time that a joint dictionary learning and information divergence learning framework is proposed for SPD matrices in computer vision. In the sequel, we first introduce the α-β log-det divergence and explore its properties in the next section. This precedes the exposition of our discriminative metric learning framework for learning the divergence, and efficient ways of solving our formulation.
Notations:
Following standard notation, we use upper case for matrices (such as X), lower bold case for vectors (e.g., x), and lower case for scalars. Further, S^d_{++} is used to denote the cone of d × d SPD matrices. We use B to denote a 3D tensor, each slice of which is an SPD matrix of size d × d. Further, we use I_d to denote the identity matrix, logm for the matrix logarithm, and Diag for the diagonalization operator.
3 Background
In this section, we will setup the mathematical preliminaries necessary to elucidate our contributions. We will visit the logdet divergence, its connections to other popular divergences, and its mathematical properties.
ABLD parameters (α, β)  Divergence
α, β → 0  Squared Affine Invariant Riemannian Metric [36] (the limit equals half the squared AIRM)
α = β = 1/2  Jensen-Bregman Log-det Divergence [9] (up to a scaling factor)
α = 1, β → 0  Jeffreys KL Divergence (using the symmetrization of ABLD) [34]
α = 1, β → 0  Burg Matrix Divergence [27]
3.1 Log Determinant Divergence
Definition 1 (Abld [11])
For X, Y ∈ S^d_{++}, the α-β log-det divergence is defined as:

D^(α,β)(X ‖ Y) = (1 / (αβ)) log det ( (α (XY^{-1})^{β} + β (XY^{-1})^{-α}) / (α + β) ),  (1)

subject to α ≠ 0, β ≠ 0, and α + β ≠ 0.  (2)
It can be shown that ABLD depends only on the generalized eigenvalues of X and Y [11]. Suppose λ_i denotes the i-th eigenvalue of XY^{-1}. Then, under the constraints defined in (2), we can rewrite (1) as:

D^(α,β)(X ‖ Y) = (1 / (αβ)) Σ_{i=1}^{d} log ( (α λ_i^{β} + β λ_i^{-α}) / (α + β) ).  (3)
This formulation will come in handy when deriving the gradient updates for α and β in the sequel. As alluded to earlier, a hallmark of the ABLD is that it unifies several popular distance measures on SPD matrices that one commonly encounters in computer vision applications. In Table 1, we list some of these popular measures and the respective values of α and β.
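The eigenvalue form is straightforward to evaluate numerically. The following sketch (our illustration, not the authors' implementation) computes ABLD via generalized eigenvalues and checks two special cases from Table 1: α = β = 1/2 recovers JBLD up to a factor of 4, and α = β → 0 approaches half the squared AIRM.

```python
import numpy as np
from scipy.linalg import eigh

def abld(X, Y, alpha, beta):
    """ABLD via the generalized eigenvalues of the pair (X, Y), as in Eq. (3)."""
    lam = eigh(X, Y, eigvals_only=True)        # solves X v = lam * Y v
    return np.sum(np.log((alpha * lam**beta + beta * lam**(-alpha))
                         / (alpha + beta))) / (alpha * beta)

rng = np.random.default_rng(0)
def rand_spd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)             # well-conditioned SPD matrix

X, Y = rand_spd(4), rand_spd(4)
lam = eigh(X, Y, eigvals_only=True)

airm_sq = np.sum(np.log(lam) ** 2)             # squared AIRM
jbld = (np.log(np.linalg.det((X + Y) / 2))
        - 0.5 * np.log(np.linalg.det(X) * np.linalg.det(Y)))

print(np.isclose(abld(X, Y, 0.5, 0.5), 4 * jbld))                    # True
print(np.isclose(abld(X, Y, 1e-4, 1e-4), 0.5 * airm_sq, rtol=1e-4))  # True
```

Taking α = β = 1e-4 as a proxy for the limit is a numerical convenience; the exact limit is derived in Section 4.4.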
3.2 ABLD Properties
Avoiding Degeneracy: An important observation regarding the design of optimization algorithms on ABLD is that the quantity inside the log det term has to be positive definite; conditions on α and β for which this holds are specified by the following theorem.
Theorem 1 ([11])
For X, Y ∈ S^d_{++}, if λ_i is the i-th eigenvalue of XY^{-1}, then the argument of the log det in (1) is positive definite only if

λ_i^{α+β} > −β/α, when α > 0 and β < 0,  (4)
λ_i^{α+β} < −β/α, when α < 0 and β > 0.  (5)
Since the λ_i's depend on the input matrices, over which we have no control, we constrain α and β to have the same sign, thereby preventing the quantity inside the log det from becoming indefinite. We make this assumption in our formulations in Section 4.
Smoothness in α, β: Assuming α and β have the same sign, ABLD is smooth everywhere with respect to α and β except at the origin (α = β = 0), thus allowing us to develop Newton-type algorithms on them. Due to the discontinuity at the origin, we ought to design algorithms specifically addressing this particular case.
Affine Invariance: It can be easily shown that

D^(α,β)(A X A^T ‖ A Y A^T) = D^(α,β)(X ‖ Y)  (6)

for any invertible matrix A. This is an important property that makes this divergence useful in a variety of applications, such as diffusion MRI [36].
Dual Symmetry: This property allows us to extend results derived for the case of α to the one on β later:

D^(α,β)(X ‖ Y) = D^(β,α)(Y ‖ X).  (7)
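Both properties are easy to verify numerically. The sketch below (ours, for illustration only) checks affine invariance (6) and dual symmetry (7) on random SPD matrices, using the generalized-eigenvalue form of the divergence.

```python
import numpy as np
from scipy.linalg import eigh

def abld(X, Y, alpha, beta):
    lam = eigh(X, Y, eigvals_only=True)   # generalized eigenvalues of (X, Y)
    return np.sum(np.log((alpha * lam**beta + beta * lam**(-alpha))
                         / (alpha + beta))) / (alpha * beta)

rng = np.random.default_rng(1)
def rand_spd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

X, Y = rand_spd(5), rand_spd(5)
A = rng.standard_normal((5, 5))           # a generic (invertible) matrix
a, b = 0.7, 0.3

# affine invariance, Eq. (6)
print(np.isclose(abld(A @ X @ A.T, A @ Y @ A.T, a, b), abld(X, Y, a, b)))
# dual symmetry, Eq. (7)
print(np.isclose(abld(X, Y, a, b), abld(Y, X, b, a)))
```

Both checks hold because congruence by A leaves the generalized eigenvalues unchanged, and swapping the arguments inverts them.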
Before concluding this part, we briefly introduce the concept of optimization on Riemannian manifolds, and in particular the method of Riemannian Conjugate Gradient descent (RCG).
3.3 Optimization on Riemannian Manifolds
As will be shown in § 4, we need to solve a nonconvex constrained optimization problem of the form

min_{X ∈ M} f(X),  (8)

where M denotes a Riemannian manifold.
Classical optimization methods generally turn a constrained problem into a sequence of unconstrained problems for which unconstrained techniques can be applied. In contrast, in this paper we make use of the optimization on Riemannian manifolds to minimize (8). This is motivated by recent advances in Riemannian optimization techniques where benefits of exploiting geometry over standard constrained optimization are shown [1]. As a consequence, these techniques have become increasingly popular in diverse application domains [8, 17].
A detailed discussion of Riemannian optimization goes beyond the scope of this paper, and we refer the interested reader to [1]. However, the knowledge of some basic concepts will be useful in the remainder of this paper. As such, here, we briefly consider the case of Riemannian Conjugate Gradient method (RCG), our choice when the empirical study of this work is considered. First we formally define the SPD manifold.
Definition 2 (The SPD Manifold)
The set of d × d real SPD matrices, endowed with the Affine Invariant Riemannian Metric (AIRM) [36], forms the SPD manifold S^d_{++}:

⟨ξ, η⟩_X := Tr(X^{-1} ξ X^{-1} η), for X ∈ S^d_{++} and ξ, η ∈ T_X S^d_{++}.  (9)
To minimize (8), RCG starts from an initial solution X_0 and improves its solution using the update rule

X_{t+1} = R_{X_t}(η_t ξ_t),  (10)

where ξ_t identifies a search direction, η_t a step size, and R a retraction. The retraction serves to identify the new solution along the geodesic defined by the search direction ξ_t. In RCG, it is guaranteed that the new solution obtained by Eq. (10) is on S^d_{++} and has a lower objective. The search direction is obtained by

ξ_{t+1} = −grad f(X_{t+1}) + μ_t T_{X_t → X_{t+1}}(ξ_t).  (11)

Here, μ_t can be thought of as a variable learning rate, obtained via techniques such as Fletcher-Reeves [1]. Furthermore, grad f(X) is the Riemannian gradient of the objective function f at X, and T_{X_t → X_{t+1}}(ξ_t) denotes the parallel transport of ξ_t from X_t to X_{t+1}. In Table 2, we define the mathematical entities required to perform RCG on the SPD manifold. Note that computing the standard Euclidean gradient of the function, denoted by ∇f(X), is the only requirement to perform RCG on S^d_{++}.
Riemannian gradient  grad f(X) = X sym(∇f(X)) X, with sym(M) = (M + M^T)/2
Retraction  R_X(ξ) = X^{1/2} expm(X^{-1/2} ξ X^{-1/2}) X^{1/2}
Parallel transport  T_{X→Y}(ξ) = E ξ E^T, with E = (Y X^{-1})^{1/2}
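As a minimal illustration of these ingredients (our sketch, not the authors' code), the following runs plain Riemannian gradient descent, i.e., RCG without the conjugate-direction and transport terms of (11), on the toy objective f(X) = tr(X) − log det X, whose minimizer over the SPD cone is the identity. The step size 0.1 is an arbitrary choice.

```python
import numpy as np
from scipy.linalg import expm

def sym(M):
    return 0.5 * (M + M.T)

def spd_sqrt(X):
    w, V = np.linalg.eigh(X)
    return (V * np.sqrt(w)) @ V.T

def rgrad(X, egrad):
    # Riemannian gradient from the Euclidean one (Table 2): X sym(egrad) X
    return X @ sym(egrad) @ X

def retract(X, xi):
    # exponential-map retraction: X^{1/2} expm(X^{-1/2} xi X^{-1/2}) X^{1/2}
    Xh = spd_sqrt(X)
    Xhi = np.linalg.inv(Xh)
    return Xh @ expm(Xhi @ xi @ Xhi) @ Xh

# toy objective f(X) = tr(X) - log det X, Euclidean gradient I - X^{-1}
X = np.diag([4.0, 0.5, 2.0])
for _ in range(100):
    egrad = np.eye(3) - np.linalg.inv(X)
    X = retract(X, -0.1 * rgrad(X, egrad))
print(np.round(X, 2))  # close to the identity
```

The retraction keeps every iterate strictly positive definite by construction, which is the point of optimizing on the manifold rather than projecting after a Euclidean step.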
4 Proposed Method
In this section, we first introduce the most general form of our joint IDDL formulation, and follow it up by providing simplifications and derivations for specific cases (such as for α, β → 0).
4.1 Information Divergence & Dictionary Learning
Suppose we are given a set of SPD matrices X_1, …, X_N ∈ S^d_{++} along with their associated labels y_1, …, y_N. Our goal is threefold: (i) learn a dictionary B with K atoms, a product of SPD manifolds, (ii) learn an ABLD on each dictionary atom to best represent the given data for the task of classification, and (iii) learn a discriminative objective function on the encoded SPD matrices (in terms of B and the respective ABLDs) for the purpose of classification. These goals are formally captured in the IDDL objective proposed below. Let the j-th dictionary atom in B be B_j; then,

min_{B, α, β ≥ 0, W}  Σ_{i=1}^{N} ℓ_W(φ(X_i), y_i),  with φ_j(X_i) = D^(α_j,β_j)(X_i ‖ B_j), j = 1, …, K,  (12)

where the vector φ(X_i) denotes the encoding of X_i in terms of the dictionary, and φ_j(X_i) is the j-th dimension of this encoding. The function ℓ, parameterized by W, learns a classifier on φ according to the provided class labels. While there are several choices for ℓ (e.g., the max-margin hinge loss), we resort to a simple ridge regression objective in this paper. Thus, our ℓ is defined as follows: suppose h_i is a one-off encoding of the class label y_i (i.e., h_i has a one at index y_i and zeros everywhere else); then

ℓ_W(φ(X_i), y_i) = ‖W φ(X_i) − h_i‖² + γ ‖W‖_F²,  (13)

where W is the matrix of classifier parameters and γ is a regularization parameter. Note that a separate (α_j, β_j) for each dictionary atom is the most general form of our formulation. In our experiments, we explore simplified cases in which these parameters are shared across the atoms.
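To make the encoding concrete, the following sketch (ours; the atoms and the per-atom parameters are arbitrary stand-ins, not learned values) maps an SPD matrix to its IDDL vector, the j-th coordinate being an ABLD to the j-th atom.

```python
import numpy as np
from scipy.linalg import eigh

def abld(X, B, alpha, beta):
    lam = eigh(X, B, eigvals_only=True)
    return np.sum(np.log((alpha * lam**beta + beta * lam**(-alpha))
                         / (alpha + beta))) / (alpha * beta)

def encode(X, atoms, alphas, betas):
    # phi_j(X) = ABLD^{(alpha_j, beta_j)}(X || B_j), one coordinate per atom
    return np.array([abld(X, Bj, aj, bj)
                     for Bj, aj, bj in zip(atoms, alphas, betas)])

rng = np.random.default_rng(2)
def rand_spd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

atoms = [rand_spd(5) for _ in range(4)]   # K = 4 hypothetical dictionary atoms
alphas = betas = np.full(4, 0.5)          # a JBLD-like setting on every atom
phi = encode(rand_spd(5), atoms, alphas, betas)
print(phi.shape)                          # (4,)
```

The classifier then acts on phi; in the joint formulation, the atoms and the per-atom (α_j, β_j) are learned rather than fixed as here.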
4.2 Efficient Optimization
In this section, we propose efficient ways to solve the IDDL objective in (12). We propose to use a block-coordinate descent (BCD) scheme for optimization, in which each variable is updated alternately while fixing the others. Going by the recent trends in Riemannian optimization for SPD matrices [8, 17], we use the Riemannian conjugate gradient (RCG) algorithm [1] for optimizing over each variable. As our objective is nonconvex in its variables (except for W), convergence of the BCD iterations to a global minimum is not guaranteed. In Alg. 1, we detail the meta-steps in our optimization scheme. We initialize the dictionary atoms and the divergence parameters as described in Section 6.3. Following that, we update the atoms, the divergence parameters, and the classifier parameters in an alternating manner, that is, updating one variable while fixing all others.
Recall from Section 3.3 that an essential ingredient in RCG is the efficient computation of the Euclidean gradients of the objective with respect to the variables. In the following, we derive expressions for these gradients. Note that we assume the dictionary atoms B_j to be on an SPD manifold. Also, w.l.o.g., we assume α and β belong to the nonnegative orthant of the Euclidean space (for the reasons given in Section 3).
4.2.1 Gradients wrt B
As is clear from our formulation, only the j-th dimension of φ(X_i) involves B_j. To simplify the notations, let us define

θ_i = W^T (W φ(X_i) − h_i),  (14)

and let θ_i^j be its j-th dimension. Then we have (see the supplementary material for the details),

∇_{B_j} Σ_i ℓ_W = 2 Σ_{i=1}^{N} θ_i^j ∇_{B_j} D^(α_j,β_j)(X_i ‖ B_j).  (15)

Substituting the ABLD (1) into (15) and rearranging the terms (the constant (α_j + β_j) factor has zero gradient), we have:

∇_{B_j} Σ_i ℓ_W = (2 / (α_j β_j)) Σ_{i=1}^{N} θ_i^j ∇_{B_j} log det ( α_j (X_i B_j^{-1})^{β_j} + β_j (X_i B_j^{-1})^{-α_j} ).  (16)
Let and . Further, let . Then, the term inside the gradient in (16) simplifies to:
(17) 
Theorem 2
Let . Furthermore assume . We have
Proof. To simplify the notation, note that the eigenvalues (and hence the ABLD) of X B^{-1} are the same as those of the symmetric matrix B^{-1/2} X B^{-1/2}; since the latter additionally keeps the iterates symmetric during gradient descent, we use this form. Thus,
(18) 
Using a Taylor series expansion (strictly speaking, the series expansions that we use in the proof require a norm condition on the argument, which can be achieved via rescaling; empirically, however, this requirement was not found to be necessary on any of the datasets we use),
(19) 
(20) 
Using Maclaurin series in the middle term, we thus get
As such, the gradient is:
(21) 
Combining (21) with (16), we have the expression for the gradient with respect to the dictionary atom B_j.
Remark 1
Computing the gradient in (21) for large datasets may become overwhelming. Let U T U^T be the Schur decomposition of the matrix involved (which is faster to compute than the eigenvalue decomposition [16]). With this substitution, the gradient in (21) can be rewritten as:
(22)
Compared to (21), this simplification reduces the number of matrix multiplications from 5 to 3, and the number of matrix inversions from 2 to 1.
4.2.2 Gradients wrt α and β
For the gradients with respect to α, we will use the form of ABLD given in (3), where λ_i is the i-th generalized eigenvalue of X and the dictionary atom B_j. Using the chain rule with the notation defined in (14), it suffices to compute:

∂D^(α,β)(X ‖ B_j) / ∂α = −(1/α) D^(α,β)(X ‖ B_j) + (1 / (αβ)) Σ_{i=1}^{d} [ (λ_i^{β} − β λ_i^{-α} log λ_i) / (α λ_i^{β} + β λ_i^{-α}) − 1 / (α + β) ].  (23)

The gradients wrt β can then be derived from (23) using the dual symmetry property described in (7).
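Since (3) is an explicit function of the generalized eigenvalues, the derivative with respect to α can be validated numerically. The sketch below (ours; the analytic expression is our own differentiation of (3)) compares it against a central finite difference.

```python
import numpy as np
from scipy.linalg import eigh

def abld(X, Y, alpha, beta):
    lam = eigh(X, Y, eigvals_only=True)
    return np.sum(np.log((alpha * lam**beta + beta * lam**(-alpha))
                         / (alpha + beta))) / (alpha * beta)

def dabld_dalpha(X, Y, alpha, beta):
    # analytic derivative of Eq. (3) w.r.t. alpha
    lam = eigh(X, Y, eigvals_only=True)
    inner = ((lam**beta - beta * lam**(-alpha) * np.log(lam))
             / (alpha * lam**beta + beta * lam**(-alpha))
             - 1.0 / (alpha + beta))
    return -abld(X, Y, alpha, beta) / alpha + np.sum(inner) / (alpha * beta)

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4)); X = M @ M.T + 4 * np.eye(4)
M = rng.standard_normal((4, 4)); Y = M @ M.T + 4 * np.eye(4)

a, b, eps = 0.8, 0.4, 1e-6
fd = (abld(X, Y, a + eps, b) - abld(X, Y, a - eps, b)) / (2 * eps)
print(np.isclose(dabld_dalpha(X, Y, a, b), fd, rtol=1e-4, atol=1e-8))
```

Such a finite-difference check is a cheap safeguard when implementing the BCD updates over α and β.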
4.3 Closed Form for W
When fixing B, α, and β, the objective reduces to the standard ridge regression formulation in W, which can be solved in closed form as:

W = H Φ^T (Φ Φ^T + γ I)^{-1},  (24)

where the matrices Φ and H have φ(X_i) and h_i along their i-th columns, for i = 1, …, N.
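A small sketch (ours) of this ridge regression closed form, with random vectors standing in for the encodings φ(X_i); the check verifies that the gradient of the regularized loss vanishes at the returned W.

```python
import numpy as np

def ridge_w(Phi, H, gamma):
    """W = H Phi^T (Phi Phi^T + gamma I)^{-1}, minimizing
    ||W Phi - H||_F^2 + gamma ||W||_F^2."""
    K = Phi.shape[0]
    return H @ Phi.T @ np.linalg.inv(Phi @ Phi.T + gamma * np.eye(K))

rng = np.random.default_rng(3)
Phi = rng.standard_normal((6, 40))          # K = 6 encoding dims, N = 40 samples
H = np.eye(3)[rng.integers(0, 3, 40)].T     # one-off label vectors as columns
W = ridge_w(Phi, H, gamma=0.1)

grad = 2 * (W @ Phi - H) @ Phi.T + 2 * 0.1 * W   # gradient of the loss at W
print(np.abs(grad).max() < 1e-8)                 # True: W is the minimizer
```

Only a K × K matrix is inverted, so this step is cheap relative to the Riemannian updates.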
4.4 The Solution When α, β → 0
As alluded to earlier, ABLD is non-smooth at the origin, and there we need to resort to the limit of the divergence, which happens to be (half) the squared natural Riemannian metric (AIRM). That is,

lim_{α,β→0} D^(α,β)(X ‖ Y) = (1/2) ‖logm(Y^{-1/2} X Y^{-1/2})‖_F².  (25)

Using the same ridge regression cost ℓ defined in (13), and θ defined in (14), the gradient with respect to B_j is:

∇_{B_j} Σ_i ℓ_W = 2 Σ_{i=1}^{N} θ_i^j logm(X_i^{-1} B_j) B_j^{-1}.  (26)

Note that a simplification similar to (22) is also possible for (26).
5 Computational Complexity
We note that some of the terms in the gradients derived above can be computed offline, and thus we omit them from our analysis. Using the simplifications depicted in (22) and the Schur decomposition, the gradient computation for each dictionary atom is dominated by a small number of matrix multiplications and a single matrix inversion. The gradient formulation in (23) for α and β additionally requires only the generalized eigenvalues. Computing the closed form for W in (24) involves inverting a K × K matrix. At test time, given that we have learned the dictionary and the parameters of the divergence, encoding a data matrix amounts to one generalized eigenvalue computation per dictionary atom, which is similar in complexity to recent sparse coding schemes such as [8].
6 Experiments
In this section, we evaluate the performance of the IDDL scheme on eight computer vision datasets, which are known to benefit from SPD-based descriptors. To this end, we use the following datasets: (i) the JHMDB action recognition dataset [24], (ii) the HMDB action recognition dataset [26], (iii) the KTH-TIPS2 dataset [32], (iv) the Brodatz textures [35], (v) the Virus dataset [28], (vi) the SHREC 3D shape dataset [30], (vii) the Myometrium cancer dataset [41], and (viii) the Breast cancer dataset [41]. Below, we provide details about all the studied datasets and the way SPD descriptors are obtained on them. We use the standard evaluation schemes reported previously on these datasets. In some cases, we use our own implementations of popular methods, but strictly follow the recommended settings.
6.1 Datasets
HMDB and JHMDB datasets:
These are two popular action recognition benchmarks. The HMDB dataset consists of 51 action classes associated with 6766 video sequences, while JHMDB is a subset of HMDB with 955 sequences in 21 action classes. To generate SPD matrices on these datasets, we use the scheme proposed in [7], where we compute RBF kernel descriptors on the output of per-frame CNN class predictions (fc8) for each stream (RGB and optical flow) separately, and then fuse these two SPD matrices into a single block-diagonal matrix per sequence. For the two-stream model, we use a VGG-16 model trained on optical flow and RGB frames separately, as described in [37].
SHREC 3D Object Recognition Dataset:
KTHTIPS2 dataset and Brodatz Textures:
These are popular texture recognition datasets. The KTH-TIPS2 dataset consists of 4752 images from 11 material classes under varying conditions of illumination, pose, and scale. Covariance descriptors are generated from this dataset following the procedure in [18]. We use the standard four-split cross-validation for our evaluations on this dataset. As for the Brodatz dataset, we use the relative pixel coordinates, image intensity, and image gradients to form region covariance descriptors from 100 texture classes. Our dataset consists of 31000 SPD matrices, and we follow the procedure in [8] for our evaluation, using an 80:20 train/test split.
Virus Dataset:
It consists of 1500 images of 15 different virus types. Similar to KTH-TIPS2, we use the procedure in [18] to generate covariance descriptors from this dataset, and follow their evaluation scheme using three splits.
Cancer Datasets.
Apart from these standard SPD datasets, we also report performances on two cancer recognition datasets from [41], kindly shared with us by the authors. We use images from two types of cancers, namely (i) Breast cancer, a binary problem (tissue is either cancerous or not) with about 3500 samples, and (ii) Myometrium cancer, with 3320 samples. We use the covariance-kernel descriptors described in [41]. We follow the 80:20 rule for evaluation on these datasets as well.
6.2 Experimental Setup
Since we present experiments on a variety of datasets and under various configurations, we summarize our main experiments first. We conduct three sets of experiments, namely (i) comparisons of IDDL against other popular measures on SPD matrices, (ii) comparisons among various configurations of IDDL, and (iii) comparisons against state-of-the-art approaches on the above datasets. For those datasets that do not have prescribed cross-validation splits, we repeat the experiments at least 5 times and average the performance scores.
6.3 Parameter Initialization
In all the experiments, we initialized the parameters of IDDL (e.g., the initial dictionary) in a principled way. We initialized the dictionary atoms by applying log-Euclidean K-Means; i.e., we compute the log-Euclidean map of the SPD data, run Euclidean K-Means on these mapped points, and re-map the K-Means centroids to the SPD manifold via an exponential map. To initialize α and β, we recommend a grid search with the dictionary atoms fixed as above. As an alternative to the grid search, we empirically observed that a good choice is to start with the Burg divergence (i.e., α = 1, β → 0). The regularization parameter γ was chosen using cross-validation.
6.4 Comparisons to Variants of IDDL
In this section, we analyze various aspects of the performance of IDDL. Generally speaking, our IDDL formulation is generic and customizable. For example, even though we formulated the problem using a separate ABLD on each dictionary atom, it does not hurt to learn the same divergence over all atoms in some applications. To this end, we test the performance of three scenarios, namely (i) using a scalar α and β shared across all the dictionary atoms (which we call IDDL-S), (ii) a vector α and β, where we assume α = β, but each dictionary atom can potentially have a distinct parameter pair (we call this case IDDL-V), and (iii) the most generic case, in which α and β are vectors and need not be equal, which we refer to as IDDL-N.
In Figure 2, we compare all these configurations on six of the datasets. We also include the specific cases of the Burg divergence (α = 1, β → 0) and AIRM (α, β → 0) for comparison (using the dictionary learning scheme proposed in Section 4.2.1). Our experiments show that IDDL-N and IDDL-V consistently perform well on most of the datasets. This is of course not very surprising, given the generality of IDDL compared to the other measures.
6.5 Comparisons to Standard Measures
In this experiment, we compare IDDL (see Figure 2) to the standard similarity measures on SPD matrices, including the log-Euclidean metric [3], AIRM [36], and JBLD [9]. We report 1-NN classification performance on these baselines. In Table 4, we report the performance of these schemes. As a rule of thumb (and also supported empirically by cross-validation studies on our datasets), we chose the number of dictionary atoms based on the number of classes; increasing the size of the dictionary did not seem to help in most cases. We also report a discriminative baseline by training a linear SVM on the log-Euclidean mapped SPD matrices. The results reported in Table 4 clearly demonstrate the advantage of IDDL against the baselines on most of the datasets, where the benefits can exceed 10% in some cases (such as JHMDB and Virus).
6.6 Comparisons to the State of the Art
We compare IDDL to the following popular methods that share similarities with our scheme, namely (i) Log-Euclidean Metric Learning (LEML) [22], (ii) kernelized sparse coding using the log-Euclidean metric [18], (iii) kernelized sparse coding using JBLD, (iv) kernelized locality-constrained coding [17], and (v) Riemannian dictionary learning and sparse coding (RSPDL) [8]. Our results are reported in Table 3. Again, we observe that IDDL performs the best among all the competing schemes, clearly demonstrating the advantage of learning the divergence and the dictionary. Note that the comparisons are established by considering the same number of atoms for all schemes and fine-tuning the parameters of each algorithm (e.g., the bandwidth of the RBF kernel) using a validation subset of the training set. As for LEML, we increased the number of pairwise constraints until the performance hit a plateau.
Dataset — Classifier  LE 1NN  AIRM 1NN  JBLD 1NN  SVMLE  IDDL  Variant 
JHMDB  52.99%  51.87%  52.24%  54.48%  68.3%  V 
HMDB  29.30%  43.3%  46.3%  41.7%  55.50%  N 
VIRUS  66.67%  67.89%  68.11%  68.00%  78.39%  N 
BRODATZ  80.10%  80.50%  80.50%  86.80%  74.10%  N 
KTH TIPS  72.05%  72.83%  72.87%  75.59%  79.37%  V 
3D Object  97.4%  98.2%  95.6%  98.9%  96.08%  Burg 
Breast Cancer  87.42%  80.00%  84.00%  87.71%  90.46%  Burg 
Myometrium Cancer  80.87%  84.18%  93.20%  93.22%  94.66%  Burg 
7 Ablative Study
In this section, we study the influence of each of the components in our algorithm. In Figure 3, we plot a heatmap of the classification accuracy against changing α and β on the KTH-TIPS2 and Virus datasets. We fixed the size of the dictionaries to 22 for KTH-TIPS2 and 30 for Virus. The plots reveal that the performance varies for different parameter settings, which (along with the results in Table 4) substantiates that learning these parameters is a way to improve performance. It should be noted that for our SVM-based experiments, we used a linear SVM on the log-Euclidean mapped SPD matrices.
In Figure 4, we plot the convergence of our objective against the iterations. We also depict the BCD objective as contributed by the dictionary learning updates and the parameter learning; we use IDDL-V for this experiment. As is clear, most of the decrease in the objective happens when the dictionary is learned, which is not surprising given that it has the largest number of variables to learn. For most datasets, we observe that RCG converges in about 200-300 iterations.
7.1 Running Time Experiments
In Figure 5, we plot the running time of one iteration of RCG against the dimensionality of the matrices and the number of dictionary atoms. While our dictionary update appears quadratic in the matrix dimension, it scales linearly with the dictionary size, and usually converges in a couple of seconds.
7.2 Evaluation of Joint Learning
In Table 5, we evaluate the usefulness of learning the information divergence against learning the dictionary on the Virus dataset. For this experiment, we evaluated three scenarios: (i) fixing the dictionary to its initialization (using K-Means) and learning the α, β parameters using the IDDL-S variant, (ii) fixing α and β to their initialization (found via grid search) while learning the dictionary, and (iii) learning both the dictionary and the parameters jointly. As the results in Table 5 show, jointly learning all the parameters yields better results, thus justifying our IDDL formulation.
Atoms — Method  IDDL (fixed dictionary)  IDDL (fixed α, β)  IDDL-N

15  78.33%  61.67%  77.33% 
45  80.33%  70.00%  83.67% 
75  81.67%  76.00%  82.33% 
7.3 Trajectories of α and β
In this experiment, we demonstrate the BCD trajectories of α and β for the IDDL-S and IDDL-N variants of our algorithm on the Virus dataset. Specifically, in Figure 6, we show how the values of α and β vary as the BCD iterations progress. In this experiment, we used 15 dictionary atoms. All experiments used the same initializations for the variants. We also plot the corresponding objective and training accuracies. For IDDL-N, recall that the parameters are vectors, and thus we plot the average α and β, respectively. We also plot the trajectories for various initializations of these parameters.
For IDDL-S, it appears that different initializations lead to disparate points of convergence. However, for all points of convergence, the objective converges to very similar values (and so does the training accuracy), suggesting that there are multiple local minima that lead to similar empirical results. We also find that initializing with the Burg divergence demonstrates slightly better convergence than other possibilities, which we observed for other datasets too. For IDDL-N, we found that the means of the parameters remained more or less constant, although the exact values varied (not shown).
8 Limitations of IDDL
The main limitation of our approach is the nonconvexity of our objective, which precludes a formal analysis of convergence. A further limitation is that the gradient expressions involve matrix inversions and may need careful regularization to avoid numerical instability. We also note that the AB divergence has a discontinuity at the origin, which needs to be accounted for when learning the parameters.
Further, from our experimental analysis, there appears to be no single variant of IDDL (among IDDL-S, IDDL-V, IDDL-N, IDDL-A, and IDDL-B) that consistently performs the best across all datasets. However, given the possibility of learning α and β, we believe the most general variant, IDDL-N, might be the best choice for a new application, as it can plausibly learn any of the alternatives.
9 Conclusions
In this paper, we proposed a novel framework unifying the problems of dictionary learning and information divergence learning on SPD matrices, two problems that have so far been investigated separately. We leveraged recent advances in information geometry for this purpose, namely the α-β log-det divergence. We formulated an objective for jointly learning the divergence and the dictionary, and showed that it can be solved efficiently using optimization methods on Riemannian manifolds. Experiments on eight computer vision datasets demonstrate the superior performance of our approach against alternatives.
Acknowledgments
This material is based upon work supported by the National Science Foundation through grants #CNS0934327, #CNS1039741, #SMA1028076, #CNS1338042, #CNS1439728, #OISE1551059, and #CNS1514626. Dr. Cherian is funded by the Australian Research Council Centre of Excellence for Robotic Vision (#CE140100016).
References
 [1] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
 [2] S.-I. Amari and H. Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
 [3] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 56(2):411–421, 2006.

 [4] A. Basu, I. R. Harris, N. L. Hjort, and M. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
 [5] T. Brox, J. Weickert, B. Burgeth, and P. Mrázek. Nonlinear structure tensors. Image and Vision Computing, 24(1):41–55, 2006.
 [6] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
 [7] A. Cherian, P. Koniusz, and S. Gould. Higher-order pooling of CNN features via kernel linearization for action recognition. In WACV, 2017.
 [8] A. Cherian and S. Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Trans. on Neural Networks and Learning Systems, 2016.
 [9] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Jensen-Bregman log-det divergence with application to efficient similarity search for covariance matrices. PAMI, 35(9):2161–2174, 2013.
 [10] A. Cichocki and S.-i. Amari. Families of alpha-, beta-, and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
 [11] A. Cichocki, S. Cruces, and S.-i. Amari. Log-determinant divergences revisited: Alpha-beta and gamma log-det divergences. Entropy, 17(5):2988–3034, 2015.
 [12] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.
 [13] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In NIPS, 2005.
 [14] O. Dikmen, Z. Yang, and E. Oja. Learning the information divergence. PAMI, 37(7):1442–1454, 2015.
 [15] D. Fehr. Covariance based point cloud descriptors for object detection and classification. PhD thesis, University Of Minnesota, 2013.
 [16] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.
 [17] M. Harandi and M. Salzmann. Riemannian coding and dictionary learning: Kernels to the rescue. In CVPR, 2015.
 [18] M. Harandi, M. Salzmann, and F. Porikli. Bregman divergences for infinite dimensional covariance matrices. In CVPR, 2014.
 [19] M. T. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: Geometry-aware dimensionality reduction for SPD matrices. In ECCV, 2014.
 [20] G. Hinton and S. Roweis. Stochastic neighbor embedding. In NIPS, 2002.
 [21] Z. Huang and L. Van Gool. A Riemannian network for SPD matrix learning. CoRR arXiv:1608.04233, 2016.
 [22] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In ICML, 2015.
 [23] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix backpropagation for deep networks with structured layers. In ICCV, 2015.
 [24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
 [25] R. Kompass. A generalized divergence measure for nonnegative matrix factorization. Neural computation, 19(3):780–791, 2007.
 [26] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
 [27] B. Kulis, M. Sustik, and I. Dhillon. Learning low-rank kernel matrices. In ICML, 2006.
 [28] G. Kylberg, M. Uppström, K. Hedlund, G. Borgefors, and I. Sintorn. Segmentation of virus particle candidates in transmission electron microscopy images. Journal of microscopy, 245(2):140–147, 2012.
 [29] J. Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. Conference on Computational Learning Theory, 1999.
 [30] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, 2011.
 [31] P. Li, Q. Wang, W. Zuo, and L. Zhang. Log-Euclidean kernels for sparse representation and dictionary learning. In ICCV, 2013.
 [32] P. Mallikarjuna, A. T. Targhi, M. Fritz, E. Hayman, B. Caputo, and J.-O. Eklundh. The KTH-TIPS2 database, 2006.
 [33] M. Mihoko and S. Eguchi. Robust blind source separation by beta divergence. Neural computation, 14(8):1859–1886, 2002.
 [34] M. Moakher and P. G. Batchelor. Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields, pages 285–298. Springer, 2006.
 [35] T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern recognition, 29(1):51–59, 1996.
 [36] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. IJCV, 66(1):41–66, 2006.
 [37] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
 [38] U. Şimşekli, A. T. Cemgil, and B. Ermiş. Learning mixed divergences in coupled matrix and tensor factorization models. In ICASSP, 2015.
 [39] R. Sivalingam, D. Boley, V. Morellas, and N. Papanikolopoulos. Tensor sparse coding for region covariances. In ECCV, 2010.
 [40] R. Sivalingam, V. Morellas, D. Boley, and N. Papanikolopoulos. Metric learning for semi-supervised clustering of region covariance descriptors. In ICDSC, 2009.
 [41] P. Stanitsas, A. Cherian, X. Li, A. Truskinovsky, V. Morellas, and N. Papanikolopoulos. Evaluation of feature descriptors for cancerous tissue recognition. In ICPR, 2016.
 [42] D. B. Thiyam, S. Cruces, J. Olias, and A. Cichocki. Optimization of alphabeta logdet divergences and their application in the spatial filtering of two class motor imagery movements. Entropy, 19(3):89, 2017.
 [43] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In ECCV, 2006.
 [44] L. Wang, J. Zhang, L. Zhou, C. Tang, and W. Li. Beyond covariance: Feature representation with nonlinear kernel matrices. In ICCV, 2015.
 [45] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR, 2012.
 [46] Y. Xie, J. Ho, and B. Vemuri. On a nonlinear generalization of sparse coding and dictionary learning. In ICML, 2013.