# Learning Discriminative Alpha-Beta-divergence for Positive Definite Matrices (Extended Version)

Symmetric positive definite (SPD) matrices are useful for capturing second-order statistics of visual data. To compare two SPD matrices, several measures are available, such as the affine-invariant Riemannian metric, Jeffreys divergence, Jensen-Bregman logdet divergence, etc.; however, their behaviors may be application dependent, raising the need of manual selection to achieve the best possible performance. Further and as a result of their overwhelming complexity for large-scale problems, computing pairwise similarities by clever embedding of SPD matrices is often preferred to direct use of the aforementioned measures. In this paper, we propose a discriminative metric learning framework, Information Divergence and Dictionary Learning (IDDL), that not only learns application specific measures on SPD matrices automatically, but also embeds them as vectors using a learned dictionary. To learn the similarity measures (which could potentially be distinct for every dictionary atom), we use the recently introduced alpha-beta-logdet divergence, which is known to unify the measures listed above. We propose a novel IDDL objective, that learns the parameters of the divergence and the dictionary atoms jointly in a discriminative setup and is solved efficiently using Riemannian optimization. We showcase extensive experiments on eight computer vision datasets, demonstrating state-of-the-art performances.


## 1 Introduction

Symmetric Positive Definite (SPD) matrices arise naturally in several computer vision applications, such as covariances when modeling data using Gaussians, as kernel matrices for high-dimensional embedding, as points in diffusion MRI [36], and as structure tensors in image processing [5]. Furthermore, SPD matrices in the form of Region CoVariance Descriptors (RCoVDs) [43] offer an easy way to compute a representation that fuses multiple modalities (e.g., color, gradients, filter responses, etc.) in a cohesive and compact format. SPD matrices have advanced the state of the art in various mainstream vision applications, including tracking, re-identification, and object, texture, and activity recognition [6, 44, 17]. They are even used as second-order pooling operators for enhancing the performance of popular deep learning architectures [23, 21].

SPD matrices, due to their positive definiteness property, form a cone in the Euclidean space. However, analyzing these matrices through their Riemannian geometry (or the associated Lie algebra) helps avoid unlikely or unrealistic solutions, thereby improving the outcomes. For example, in diffusion MRI [36, 3], it has been shown that the Riemannian structure (which comes with an affine invariant metric) is immensely useful for accurate modeling. A similar observation is made for RCoVDs [9, 19, 45]. This has resulted in the exploration of various geometries and similarity measures for SPD matrices, viewing them from disparate perspectives. A few notable such measures are: (i) the affine invariant Riemannian metric (AIRM) using the natural Riemannian geometry [36], (ii) the Jeffreys KL divergence (KLDM) using relative entropy [34], (iii) the Jensen-Bregman logdet divergence (JBLD) using information geometry [9], and (iv) the Burg matrix divergence [27], among several others [11].

Each of the aforementioned measures has distinct mathematical properties and as such performs differently for a given problem. However, and to some extent surprisingly, all of them can be obtained as functions acting on the generalized eigenvalues of their inputs. Recently, Cichocki et al. [11] showed that all these measures can be interpreted in a unifying setup using the αβ-logdet divergence (ABLD), and that each measure can be derived as a distinct parametrization of this divergence. For example, one could get JBLD from ABLD (up to a scaling factor) using $\alpha = \beta = 1/2$, and AIRM as the limit of $\alpha, \beta \to 0$. With such an interesting discovery, it is natural to ask if the parameters $\alpha$ and $\beta$ can be learned for a given task in a data-driven way. This not only answers which measure is the right choice for a given problem, but also allows for deriving new measures that are not among the popular ones listed above.

In this paper, we make the first attempt at learning an αβ-logdet divergence on SPD matrices for computer vision applications, dubbed Information Divergence and Dictionary Learning (IDDL). We cast the learning problem in a discriminative ridge regression setup where the goal is to learn $\alpha$ and $\beta$ that maximize the classification accuracy for a given task.

Being vigilant to the computational complexity of the resulting solution, we propose to embed SPD matrices using a dictionary in our metric learning framework. Our proposal enables us to learn the embedding (or more accurately, the dictionary that identifies the embedding), along with the proper choice of the metric (i.e., the parameters $\alpha$ and $\beta$ of the ABLD) and a classifier jointly. The output of our IDDL is a vector, each entry of which computes a potentially distinct ABLD to a distinct dictionary atom.

To achieve our goal, we propose an efficient formulation that benefits from recent advances in optimization over Riemannian manifolds to minimize a non-convex and constrained objective. We provide extensive experiments using IDDL on a variety of computer vision applications, namely (i) action recognition, (ii) texture recognition, (iii) 3D shape recognition, and (iv) cancerous tissue recognition. We also provide insights into our learning scheme through extensive experiments on the parameters of the ABLD, and ablation studies under various performance settings. Our results demonstrate that our scheme achieves state-of-the-art accuracies against competing techniques, including the recent sparse coding, Riemannian metric learning, and kernel coding schemes.

## 2 Related Work

The αβ-logdet divergence is a matrix generalization of the well-known αβ-divergence [10] that computes the (a)symmetric (dis)similarity between two finite positive measures (data densities). As the name implies, the αβ-divergence is a unification of the so-called α-family of divergences [2] (that includes popular measures such as the KL-divergence, Jensen-Shannon divergence, and the chi-square divergence) and the β-family [4] (including the squared Euclidean distance and the Itakura-Saito distance). Against several standard measures for computing similarities, both α- and β-divergences are known to lead to solutions that are robust to outliers and additive noise [29], thereby improving application accuracy. They have been used in several statistical learning applications including non-negative matrix factorization [12, 25, 13], nearest neighbor embedding [20], and blind-source separation [33].

A class of methods with similarities to our formulation are metric learning schemes on SPD matrices. One popular technique is the manifold-to-manifold embedding of large SPD matrices into a tiny SPD space in a discriminative setting [19]. Log-Euclidean metric learning has also been proposed for this embedding in [22, 40]. While we also learn a metric in a discriminative setup, ours is different in that we learn an information divergence. In Thiyam et al. [42], ABLD is proposed as a replacement for the symmetric KL divergence to better characterize the learning of a decision hyperplane for BCI applications. In contrast, we propose to embed the data matrices as vectors, each dimension of these vectors learning a different ABLD, thus leading to a richer representation of the input matrix. (Automatic selection of the parameters of the αβ-divergence is investigated in [38, 14]. However, they deal with scalar density functions in a maximum-likelihood setup and do not consider the optimization of $\alpha$ and $\beta$ jointly.)

Vectorial embedding of SPD matrices has been investigated using disparate formulations for computer vision applications. As alluded to earlier, the log-Euclidean projection [3] is a common way to achieve this, where an SPD matrix is isomorphically mapped to the Euclidean space of symmetric matrices using the matrix logarithm. Popular sparse coding schemes have been extended to SPD matrices in [8, 39, 46] using SPD dictionaries, where the resulting sparse vector is assumed Euclidean. Another popular way to handle the non-linear geometry of SPD matrices is to resort to kernel schemes by embedding the matrices in an infinite dimensional Hilbert space which is assumed to be linear [18, 31, 17]. In all these methods, the underlying similarity measure is fixed and is usually chosen to be one among the popular αβ-logdet divergences or the log-Euclidean metric.

In contrast to all these methods, to the best of our knowledge, this is the first time that a joint dictionary learning and information divergence learning framework is proposed for SPD matrices in computer vision. In the next section, we introduce the αβ-logdet divergence and explore its properties. We then present our discriminative metric learning framework for learning the divergence, along with efficient ways of solving our formulation.

##### Notations:

Following standard notations, we use upper case for matrices (such as $X$), lower-bold case for vectors (such as $\mathbf{x}$), and lower case for scalars (such as $x$). Further, $\mathcal{S}_{++}^{d}$ is used to denote the cone of $d \times d$ SPD matrices. We also work with 3D tensors, each slice of which is an SPD matrix of size $d \times d$. Further, we use $I_d$ to denote the $d \times d$ identity matrix, $\operatorname{Log}$ for the matrix logarithm, and $\operatorname{diag}$ for the diagonalization operator.

## 3 Background

In this section, we set up the mathematical preliminaries necessary to elucidate our contributions. We visit the αβ-log-det divergence, its connections to other popular divergences, and its mathematical properties.

### 3.1 αβ-Log Determinant Divergence

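###### Definition 1 (ABLD [11])

For $X, Y \in \mathcal{S}_{++}^{d}$, the αβ-log-det divergence is defined as:

$$D^{(\alpha,\beta)}(X\|Y) = \frac{1}{\alpha\beta}\log\det\left[\frac{\alpha\,(XY^{-1})^{\beta} + \beta\,(XY^{-1})^{-\alpha}}{\alpha+\beta}\right], \tag{1}$$

$$\alpha \neq 0,\ \beta \neq 0,\ \text{and}\ \alpha+\beta \neq 0. \tag{2}$$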

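It can be shown that ABLD depends only on the generalized eigenvalues of $X$ and $Y$ [11]. Suppose $\lambda_i$ denotes the $i$-th eigenvalue of $XY^{-1}$. Then, under the constraints defined in (2), we can rewrite (1) as:

$$D^{(\alpha,\beta)}(X\|Y) = \frac{1}{\alpha\beta}\sum_{i=1}^{d}\log\frac{\alpha\lambda_i^{\beta} + \beta\lambda_i^{-\alpha}}{\alpha+\beta}. \tag{3}$$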

This formulation will come in handy when deriving the gradient updates for $\alpha$ and $\beta$ in the sequel. As alluded to earlier, a hallmark of the ABLD is that it unifies several popular distance measures on SPD matrices that one commonly encounters in computer vision applications. In Table 1, we list some of the popular measures in computer vision and the respective values of $\alpha$ and $\beta$.
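As a rough illustration of the eigenvalue form in (3), the following NumPy sketch evaluates the ABLD and checks the JBLD special case ($\alpha = \beta = 1/2$, up to the scaling factor 4). The function names (`generalized_eigenvalues`, `abld`, `jbld`, `random_spd`) and the random test matrices are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def generalized_eigenvalues(X, Y):
    """Eigenvalues of X Y^{-1} for SPD X and Y, via a Cholesky factor of Y."""
    L = np.linalg.cholesky(Y)          # Y = L L^T
    L_inv = np.linalg.inv(L)
    M = L_inv @ X @ L_inv.T            # symmetric and similar to X Y^{-1}
    return np.linalg.eigvalsh((M + M.T) / 2)

def abld(X, Y, alpha, beta):
    """Eigenvalue form of the ABLD; assumes alpha and beta share the same sign."""
    lam = generalized_eigenvalues(X, Y)
    ratio = (alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta)
    return np.sum(np.log(ratio)) / (alpha * beta)

def jbld(X, Y):
    """Jensen-Bregman logdet divergence, used here only as a reference value."""
    logdet = lambda A: np.linalg.slogdet(A)[1]
    return logdet((X + Y) / 2) - 0.5 * (logdet(X) + logdet(Y))

def random_spd(d, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + np.eye(d)

rng = np.random.default_rng(0)
X, Y = random_spd(4, rng), random_spd(4, rng)
print(abld(X, Y, 0.5, 0.5), 4 * jbld(X, Y))   # the two values should agree
```

Working with the symmetric matrix $L^{-1} X L^{-T}$ rather than $XY^{-1}$ directly keeps the eigenvalues real in floating point, which is a design choice of this sketch.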

### 3.2 ABLD Properties

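Avoiding Degeneracy: An important observation regarding the design of optimization algorithms on ABLD is that the quantity inside the $\log\det$ term has to be positive definite; conditions on $\alpha$ and $\beta$ for which this holds are specified by the following theorem.

###### Theorem 1 ([11])

For $X, Y \in \mathcal{S}_{++}^{d}$, if $\lambda_i$ is the $i$-th eigenvalue of $XY^{-1}$, then $D^{(\alpha,\beta)}(X\|Y) \geq 0$ only if

$$\lambda_i > \left|\frac{\beta}{\alpha}\right|^{\frac{1}{\alpha+\beta}}, \quad \text{for } \alpha>0 \text{ and } \beta<0, \text{ or} \tag{4}$$

$$\lambda_i < \left|\frac{\beta}{\alpha}\right|^{\frac{1}{\alpha+\beta}}, \quad \text{for } \alpha<0 \text{ and } \beta>0, \quad \forall i=1,2,\cdots,d. \tag{5}$$

Since the $\lambda_i$ depend on the input matrices, over which we have no control, we constrain $\alpha$ and $\beta$ to have the same sign, thereby preventing the quantity inside the $\log\det$ from becoming indefinite. We make this assumption in our formulations in Section 4.

Smoothness of $\alpha$, $\beta$: Assuming $\alpha$ and $\beta$ have the same sign, ABLD is smooth everywhere with respect to $\alpha$ and $\beta$ except at the origin ($\alpha = \beta = 0$), thus allowing us to develop Newton-type algorithms on them. Due to the discontinuity at the origin, we need to design algorithms that specifically address this particular case.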

Affine Invariance: It can be easily shown that

$$D^{(\alpha,\beta)}(X\|Y) = D^{(\alpha,\beta)}(AXA^T\|AYA^T), \tag{6}$$

for any invertible matrix $A$. This is an important property that makes this divergence useful in a variety of applications, such as diffusion MRI [36].

Dual Symmetry: This property allows us to extend results derived for the case of $\alpha$ to the one on $\beta$ later.

$$D^{(\alpha,\beta)}(X\|Y) = D^{(\beta,\alpha)}(Y\|X). \tag{7}$$
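The two properties in (6) and (7) can be checked numerically. The sketch below does so on random SPD matrices, assuming the eigenvalue form of the divergence from (3); the helper names, the parameter values 0.7 and 1.3, and the random matrix `A` are illustrative choices only.

```python
import numpy as np

def abld(X, Y, alpha, beta):
    """ABLD through the eigenvalues of X Y^{-1} (same-sign alpha, beta assumed)."""
    L_inv = np.linalg.inv(np.linalg.cholesky(Y))
    M = L_inv @ X @ L_inv.T
    lam = np.linalg.eigvalsh((M + M.T) / 2)
    ratio = (alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta)
    return np.sum(np.log(ratio)) / (alpha * beta)

def random_spd(d, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + np.eye(d)

rng = np.random.default_rng(1)
X, Y = random_spd(4, rng), random_spd(4, rng)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # invertible for this seed

base = abld(X, Y, 0.7, 1.3)
affine = abld(A @ X @ A.T, A @ Y @ A.T, 0.7, 1.3)   # congruence by A
dual = abld(Y, X, 1.3, 0.7)                          # swap arguments and parameters
print(base, affine, dual)   # all three should coincide up to rounding
```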

Before concluding this part, we briefly introduce the concept of optimization on Riemannian manifolds and in particular the method of Riemannian Conjugate Gradient descent (RCG).

### 3.3 Optimization on Riemannian Manifolds

As will be shown in § 4, we need to solve a non-convex constrained optimization problem in the form

$$\text{minimize } L(B) \quad \text{s.t.} \quad B \in \mathcal{S}_{++}^{d}. \tag{8}$$

Classical optimization methods generally turn a constrained problem into a sequence of unconstrained problems for which unconstrained techniques can be applied. In contrast, in this paper we make use of the optimization on Riemannian manifolds to minimize (8). This is motivated by recent advances in Riemannian optimization techniques where benefits of exploiting geometry over standard constrained optimization are shown [1]. As a consequence, these techniques have become increasingly popular in diverse application domains [8, 17].

A detailed discussion of Riemannian optimization goes beyond the scope of this paper, and we refer the interested reader to [1]. However, the knowledge of some basic concepts will be useful in the remainder of this paper. As such, here, we briefly consider the case of Riemannian Conjugate Gradient method (RCG), our choice when the empirical study of this work is considered. First we formally define the SPD manifold.

###### Definition 2 (The SPD Manifold)

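The set of ($d \times d$) dimensional real SPD matrices endowed with the Affine Invariant Riemannian Metric (AIRM) [36] forms the SPD manifold $\mathcal{S}_{++}^{d}$:

$$\mathcal{S}_{++}^{d} \triangleq \{X \in \mathbb{R}^{d\times d} : v^T X v > 0,\ \forall v \in \mathbb{R}^{d} - \{0_d\}\}. \tag{9}$$

To minimize (8), RCG starts from an initial solution $B^{(0)}$ and improves its solution using the update rule

$$B^{(t+1)} = \tau_{B^{(t)}}\big(P^{(t)}\big), \tag{10}$$

where $P^{(t)}$ identifies a search direction and $\tau_{B}(\cdot)$ is a retraction. The retraction serves to identify the new solution along the geodesic defined by the search direction $P^{(t)}$. In RCG, it is guaranteed that the new solution obtained by Eq. (10) is on $\mathcal{S}_{++}^{d}$ and has a lower objective. The search direction is obtained by

$$P^{(t)} = -\operatorname{grad} L\big(B^{(t)}\big) + \eta^{(t)}\,\pi\big(P^{(t-1)}, B^{(t-1)}, B^{(t)}\big). \tag{11}$$

Here, $\eta^{(t)}$ can be thought of as a variable learning rate, obtained via techniques such as Fletcher-Reeves [1]. Furthermore, $\operatorname{grad} L(B)$ is the Riemannian gradient of the objective function at $B$, and $\pi(P, B^{(t-1)}, B^{(t)})$ denotes the parallel transport of $P$ from the tangent space at $B^{(t-1)}$ to the tangent space at $B^{(t)}$. In Table 2, we define the mathematical entities required to perform RCG on the SPD manifold. Note that computing the standard Euclidean gradient of the function $L$, denoted by $\nabla_B L$, is the only requirement to perform RCG on $\mathcal{S}_{++}^{d}$.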
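To make the roles of the Euclidean gradient, the Riemannian gradient, and the retraction concrete, here is a minimal sketch of plain Riemannian steepest descent on the SPD manifold under the AIRM. It is deliberately simpler than RCG: it omits the conjugate direction and parallel transport of (11) and uses a fixed step size. The toy objective $L(B) = \operatorname{Tr}(B) - \log\det B$ (minimized at $B = I$), the step size, and all function names are assumptions made for this example.

```python
import numpy as np

def sym_fun(A, f):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * f(w)) @ V.T

def riemannian_gradient(B, egrad):
    """AIRM Riemannian gradient from the Euclidean gradient: B sym(egrad) B."""
    return B @ ((egrad + egrad.T) / 2) @ B

def exp_map(B, xi):
    """Exponential map at B under the AIRM, used here as the retraction."""
    B_half = sym_fun(B, np.sqrt)
    B_ihalf = sym_fun(B, lambda w: 1.0 / np.sqrt(w))
    return B_half @ sym_fun(B_ihalf @ xi @ B_ihalf, np.exp) @ B_half

def toy_objective(B):
    # L(B) = Tr(B) - log det(B), minimised over SPD matrices at B = I.
    return np.trace(B) - np.linalg.slogdet(B)[1]

def toy_egrad(B):
    # Euclidean gradient of the toy objective.
    return np.eye(len(B)) - np.linalg.inv(B)

def riemannian_descent(B0, step=0.3, iters=200):
    B = B0.copy()
    for _ in range(iters):
        xi = -step * riemannian_gradient(B, toy_egrad(B))
        B = exp_map(B, xi)     # every iterate stays SPD by construction
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B0 = A @ A.T + np.eye(3)
B_star = riemannian_descent(B0)
print(np.round(B_star, 6))     # should be close to the identity
```

Only `toy_egrad` is problem-specific, which mirrors the remark above that the Euclidean gradient is the only ingredient one needs to supply.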

## 4 Proposed Method

In this section, we first introduce the most general form of our joint IDDL formulation and follow it up by providing simplifications and derivations for specific cases (such as particular settings of $\alpha$ and $\beta$).

### 4.1 Information Divergence & Dictionary Learning

Suppose we are given a set of SPD matrices $\{X_i\}_{i=1}^{N}$, $X_i \in \mathcal{S}_{++}^{d}$, along with their associated labels $\{y_i\}_{i=1}^{N}$. Our goal is three-fold: (i) learn a dictionary $B$ whose atoms are SPD matrices, so that $B$ lies on a product of SPD manifolds, (ii) learn an ABLD on each dictionary atom to best represent the given data for the task of classification, and (iii) learn a discriminative objective function on the encoded SPD matrices (in terms of $B$ and the respective ABLDs) for the purpose of classification. These goals are formally captured in the IDDL objective proposed below. Let the $k$-th dictionary atom in $B$ be $B_k$; then,

$$\text{IDDL} := \min_{B>0,\ \alpha>0,\ \beta>0,\ W}\ \sum_{i=1}^{N} f(v_i, y_i; W) \quad \text{subject to} \quad v_i^k = D^{(\alpha_k,\beta_k)}(X_i\|B_k), \tag{12}$$

where the vector $v_i$ denotes the encoding of $X_i$ in terms of the dictionary, and $v_i^k$ is the $k$-th dimension of this encoding. The function $f$, parameterized by $W$, learns a classifier on $v_i$ according to the provided class labels $y_i$. While there are several choices for $f$ (e.g., max-margin hinge-loss), we resort to a simple ridge regression objective in this paper. Thus, our $f$ is defined as follows: suppose $h_i$ is a one-hot encoding of the class labels (i.e., $h_i^{y_i} = 1$, and zero everywhere else), then

$$f(v_i, y_i; W) = \frac{1}{2}\|h_i - W v_i\|^2 + \gamma\|W\|_F^2, \tag{13}$$

where $W$ is the matrix of classifier weights and $\gamma$ is a regularization parameter. Note that a separate $(\alpha_k, \beta_k)$ for each dictionary atom is the most general form of our formulation. In our experiments, we explore simplified cases when these parameters are shared across the atoms.
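The encoding in (12) and the per-sample loss in (13) can be sketched as follows. This is only an illustration under assumed shapes: `atoms` is a list of SPD dictionary atoms, `alphas` and `betas` hold one parameter pair per atom, `W` has one row per class, and the names `encode` and `ridge_loss` are hypothetical.

```python
import numpy as np

def abld(X, Y, alpha, beta):
    """ABLD through the eigenvalues of X Y^{-1} (same-sign alpha, beta assumed)."""
    L_inv = np.linalg.inv(np.linalg.cholesky(Y))
    M = L_inv @ X @ L_inv.T
    lam = np.linalg.eigvalsh((M + M.T) / 2)
    ratio = (alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta)
    return np.sum(np.log(ratio)) / (alpha * beta)

def encode(X, atoms, alphas, betas):
    """v^k = D^{(alpha_k, beta_k)}(X || B_k): one divergence per dictionary atom."""
    return np.array([abld(X, B, a, b) for B, a, b in zip(atoms, alphas, betas)])

def ridge_loss(v, label, W, gamma):
    """0.5 * ||h - W v||^2 + gamma * ||W||_F^2 with h the one-hot label vector."""
    h = np.zeros(W.shape[0])
    h[label] = 1.0
    return 0.5 * np.sum((h - W @ v) ** 2) + gamma * np.sum(W ** 2)

def random_spd(d, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + np.eye(d)

rng = np.random.default_rng(0)
atoms = [random_spd(4, rng) for _ in range(3)]       # 3 atoms of size 4 x 4
alphas, betas = [0.5, 1.0, 2.0], [0.5, 1.0, 0.3]     # one pair per atom
v = encode(atoms[0], atoms, alphas, betas)           # first entry is ~0
W = np.zeros((2, 3))                                 # 2 classes, 3 atoms
print(v, ridge_loss(v, 1, W, gamma=0.1))
```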

### 4.2 Efficient Optimization

In this section, we propose efficient ways to solve the IDDL objective in (12). We propose to use a block-coordinate descent (BCD) scheme for optimization, in which each variable is updated alternately while fixing the others. Following recent trends in Riemannian optimization for SPD matrices [8, 17], we use the Riemannian conjugate gradient (RCG) algorithm [1] for optimizing over each variable. As our objective is non-convex in its variables (except for $W$), convergence of the BCD iterations to a global minimum is not guaranteed. In Alg. 1, we detail the meta-steps in our optimization scheme. We initialize the dictionary atoms and the divergence parameters as described in Section 6.3. Following that, we update the atoms, the divergence parameters, and the classifier parameters in an alternating manner; that is, we update one variable while fixing all others.

Recall from Section 3.3 that an essential ingredient in RCG is the efficient computation of the Euclidean gradients of the objective with respect to the variables. In the following, we derive expressions for these gradients. Note that we assume the dictionary atoms (i.e., $B_k$) to be on an SPD manifold. Also, w.l.o.g., we assume $\alpha_k$ and $\beta_k$ belong to the non-negative orthant of the Euclidean space (for the reasons given in Section 3).

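#### 4.2.1 Gradients wrt Bk

As is clear from our formulation, only the $k$-th dimension of $v_i$ involves $B_k$. To simplify the notation, let us assume

$$\zeta_i = -(h_i - W v_i)^T W, \tag{14}$$

and let $\zeta_i^k$ be its $k$-th dimension. Then we have (see the supplementary material for the details),

$$\nabla_{B_k} f := \zeta_i^k\, \nabla_{B_k}\big(D^{(\alpha_k,\beta_k)}(X_i\|B_k)\big). \tag{15}$$

Substituting for ABLD in (15) and rearranging the terms, we have:

$$\nabla_{B_k} f = \frac{1}{\alpha_k\beta_k}\nabla_{B_k}\log\det\left[\frac{\alpha_k}{\beta_k}\big(X_i^{-1}B_k\big)^{\alpha_k+\beta_k} + I_d\right] - \frac{1}{\beta_k}B_k^{-1}. \tag{16}$$

Let $r_k = \alpha_k/\beta_k$ and $\theta_k = \alpha_k + \beta_k$. Further, let $Z_i = X_i^{-1}$. Then, the term inside the gradient in (16) simplifies to:

$$g(B_k; Z_i, r_k, \theta_k) = \log\det\left[r_k (Z_i B_k)^{\theta_k} + I_d\right]. \tag{17}$$
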
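###### Theorem 2

For $g$ as defined in (17), with $Z_i$ and $B_k$ SPD, the gradient $\nabla_{B_k} g$ has the closed form given in (21) below.

Proof  To simplify the notation, let us write $p = r_k$, $q = \theta_k$, $B = B_k$, and $S = Z_i^{\frac{1}{2}}$. Note that the eigenvalues (and hence the $\log\det$) of $Z_i B$ are the same as those of $SBS$; however, the latter is symmetric and thus keeps the iterates symmetric during gradient descent, so we will use this form. Thus,

$$\log\det\big(p(SBS)^{q} + I_d\big) = \operatorname{Tr}\big(\log\big(p(SBS)^{q} + I_d\big)\big). \tag{18}$$

Using the Taylor series expansion (strictly speaking, the series expansions that we use in the proof assume that $\|p(SBS)^{q}\| < 1$, which can be achieved via rescaling; empirically, however, this requirement has not been needed on any of the datasets that we use),

$$(18) = \operatorname{Tr}\left(p(SBS)^{q} - \frac{p^2 (SBS)^{2q}}{2} + \frac{p^3 (SBS)^{3q}}{3} - \cdots\right). \tag{19}$$

$$\nabla_B\, (19) \Rightarrow pq\,S(SBS)^{q-1}S - q p^2\, S(SBS)^{2q-1}S + \cdots = pq\,S(SBS)^{-1}(SBS)^{q}\left[I_d - p(SBS)^{q} + \cdots\right]S. \tag{20}$$

Using the Maclaurin series in the middle term, we thus get

$$\nabla_{B_k} g = r_k\theta_k\, B_k^{-1} Z_i^{-\frac{1}{2}} \big(Z_i^{\frac{1}{2}} B_k Z_i^{\frac{1}{2}}\big)^{\theta_k} \Big(I_d + r_k \big(Z_i^{\frac{1}{2}} B_k Z_i^{\frac{1}{2}}\big)^{\theta_k}\Big)^{-1} Z_i^{\frac{1}{2}}. \tag{21}$$

Combining (21) with (16), we have the expression for the gradient with respect to $B_k$.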
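A finite-difference check is one way to gain confidence in a closed-form gradient such as (21). The sketch below compares the directional derivative of $g$ in (17) along a random symmetric direction with $\operatorname{Tr}(\nabla_{B} g \cdot E)$. The values of `r` and `theta`, the step `h`, and the helper names are assumptions for this example.

```python
import numpy as np

def spd_power(A, t):
    """A^t for a symmetric positive definite A, through its eigendecomposition."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * w**t) @ V.T

def g(B, Z, r, theta):
    """log det(r (Z B)^theta + I), evaluated on the symmetric form Z^{1/2} B Z^{1/2}."""
    S = spd_power(Z, 0.5)
    M = S @ B @ S
    w = np.linalg.eigvalsh((M + M.T) / 2)
    return np.sum(np.log1p(r * w**theta))

def grad_g(B, Z, r, theta):
    """Closed-form gradient following the structure of Eq. (21)."""
    S = spd_power(Z, 0.5)
    S_inv = spd_power(Z, -0.5)
    M_theta = spd_power(S @ B @ S, theta)
    inner = M_theta @ np.linalg.inv(np.eye(len(B)) + r * M_theta)
    return r * theta * np.linalg.inv(B) @ S_inv @ inner @ S

def random_spd(d, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + np.eye(d)

rng = np.random.default_rng(0)
B, Z = random_spd(4, rng), random_spd(4, rng)
r, theta, h = 0.8, 1.5, 1e-6

E = rng.standard_normal((4, 4))
E = (E + E.T) / 2                                   # symmetric perturbation
G = grad_g(B, Z, r, theta)
numeric = (g(B + h * E, Z, r, theta) - g(B - h * E, Z, r, theta)) / (2 * h)
analytic = np.trace(G @ E)
print(numeric, analytic)                            # should agree closely
```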

###### Remark 1

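Computing $\big(Z_i^{\frac{1}{2}} B_k Z_i^{\frac{1}{2}}\big)^{\theta_k}$ for large datasets may become overwhelming. Let $Z_i^{\frac{1}{2}} B_k Z_i^{\frac{1}{2}} = U_i \Delta_i U_i^T$ be the Schur decomposition (which is faster than the eigenvalue decomposition [16]). With $\delta_i = \operatorname{diag}(\Delta_i)$, the gradient in (21) can be rewritten as:

$$\nabla_{B_k} g = r_k\theta_k\, B_k^{-1} \big(Z_i^{-\frac{1}{2}} U_i\big)\left[\operatorname{diag}\left(\frac{\delta_i^{\theta_k}}{1 + r_k\delta_i^{\theta_k}}\right)\right]\big(Z_i^{-\frac{1}{2}} U_i\big)^{-1}. \tag{22}$$

Compared to (21), this simplification reduces the number of matrix multiplications from 5 to 3 and the number of matrix inversions from 2 to 1.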

#### 4.2.2 Gradients wrt αk and βk

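For gradients with respect to $\alpha_k$, we will use the form of ABLD given in (3), where $\lambda_{ijk}$ is assumed to be the $j$-th generalized eigenvalue of $X_i$ and dictionary atom $B_k$. Using the notations defined in (14), the gradient has the form:

$$\nabla_{\alpha_k} f = \zeta_i^k \sum_{j=1}^{d} \nabla_{\alpha_k}\left[\frac{1}{\alpha_k\beta_k}\log\frac{\alpha_k\lambda_{ijk}^{\beta_k} + \beta_k\lambda_{ijk}^{-\alpha_k}}{\alpha_k+\beta_k}\right] = \frac{\zeta_i^k}{\alpha_k^2\beta_k}\sum_{j=1}^{d}\left\{\frac{\alpha_k\lambda_{ijk}^{\beta_k} - \alpha_k\beta_k\lambda_{ijk}^{-\alpha_k}\log\lambda_{ijk}}{\alpha_k\lambda_{ijk}^{\beta_k} + \beta_k\lambda_{ijk}^{-\alpha_k}} - \frac{\alpha_k}{\alpha_k+\beta_k} - \log\frac{\alpha_k\lambda_{ijk}^{\beta_k} + \beta_k\lambda_{ijk}^{-\alpha_k}}{\alpha_k+\beta_k}\right\}. \tag{23}$$

The gradients with respect to $\beta_k$ can be derived from (23) using the dual symmetry property described in (7).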
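The expression in (23) and the dual-symmetry route to the $\beta_k$ gradient can be sanity-checked with finite differences. The sketch below drops the factor $\zeta_i^k$ and works directly on a vector of eigenvalues; the eigenvalues, parameter values, and function names are made up for illustration.

```python
import numpy as np

def abld_from_eigs(lam, a, b):
    """Eigenvalue form of the ABLD for a vector of generalized eigenvalues."""
    return np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)

def d_abld_d_alpha(lam, a, b):
    """Derivative with respect to alpha, following the braces of Eq. (23)."""
    s = a * lam**b + b * lam**(-a)
    terms = ((a * lam**b - a * b * lam**(-a) * np.log(lam)) / s
             - a / (a + b)
             - np.log(s / (a + b)))
    return np.sum(terms) / (a**2 * b)

def d_abld_d_beta(lam, a, b):
    """Derivative with respect to beta via dual symmetry: swap the roles of
    alpha and beta and invert the eigenvalues."""
    return d_abld_d_alpha(1.0 / lam, b, a)

lam = np.array([0.5, 1.7, 3.0])      # illustrative generalized eigenvalues
a, b, h = 0.8, 1.4, 1e-6

num_a = (abld_from_eigs(lam, a + h, b) - abld_from_eigs(lam, a - h, b)) / (2 * h)
num_b = (abld_from_eigs(lam, a, b + h) - abld_from_eigs(lam, a, b - h)) / (2 * h)
print(num_a, d_abld_d_alpha(lam, a, b))
print(num_b, d_abld_d_beta(lam, a, b))
```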

### 4.3 Closed Form for W

When fixing $B$, $\alpha$, and $\beta$, the objective reduces to the standard ridge regression formulation in $W$, which can be solved in closed form as:

$$W^{*} = HV^T\big(VV^T + \gamma I_d\big)^{-1}, \tag{24}$$

where the matrices $V$ and $H$ have $v_i$ and $h_i$ along their $i$-th column, for $i = 1, 2, \cdots, N$.
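A small sketch of the closed form in (24) is given below. It assumes `V` stacks the encodings column-wise (atoms by samples) and `H` stacks the one-hot labels (classes by samples), and it uses an identity whose size matches the number of rows of `V`. The check at the end verifies stationarity for the objective $\frac{1}{2}\|H - WV\|_F^2 + \frac{\gamma}{2}\|W\|_F^2$, which is the scaling under which (24) is exact; this scaling is an assumption of the example.

```python
import numpy as np

def ridge_closed_form(H, V, gamma):
    """W* = H V^T (V V^T + gamma I)^{-1}, following Eq. (24)."""
    n_atoms = V.shape[0]
    return H @ V.T @ np.linalg.inv(V @ V.T + gamma * np.eye(n_atoms))

rng = np.random.default_rng(0)
n_atoms, n_classes, n_samples, gamma = 5, 3, 40, 0.1

V = rng.random((n_atoms, n_samples))                 # stand-in encodings
labels = rng.integers(0, n_classes, size=n_samples)
H = np.eye(n_classes)[:, labels]                     # one-hot labels as columns

W = ridge_closed_form(H, V, gamma)

# Gradient of 0.5*||H - W V||_F^2 + 0.5*gamma*||W||_F^2 at the solution.
stationarity = -(H - W @ V) @ V.T + gamma * W
print(np.abs(stationarity).max())                    # should be near zero
```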

### 4.4 The Solution When α,β→0

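As alluded to earlier, ABLD is non-smooth at the origin and we need to resort to the limit of the divergence, which happens to be the natural Riemannian metric (AIRM). That is,

$$D^{(0,0)}(X_i\|B_k) = \left\|\operatorname{Log}\big(X_i^{-\frac{1}{2}} B_k X_i^{-\frac{1}{2}}\big)\right\|_F^2. \tag{25}$$

Using the same ridge regression cost for $f$ defined in (13), and using $\zeta_i^k$ defined in (14), we have the gradient with respect to $B_k$ under $D^{(0,0)}$ as:

$$\nabla_{B_k} f = 2\zeta_i^k\, X_i^{-\frac{1}{2}} \operatorname{Log}[P_{ik}]\, P_{ik}^{-1} X_i^{-\frac{1}{2}}, \tag{26}$$

where $P_{ik} = X_i^{-\frac{1}{2}} B_k X_i^{-\frac{1}{2}}$. Note that a simplification similar to (22) is also possible for (26).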
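The limiting behaviour can be observed numerically by evaluating the eigenvalue form of the ABLD at small $\alpha = \beta$. In the sketch below, the value approaches the squared AIRM distance up to a constant factor: with the $\frac{1}{\alpha\beta}$ normalization used in the eigenvalue form, the limit is one half of $\sum_j \log^2 \lambda_j$. The constant does not change how atoms are ranked. Function names and the value `eps` are assumptions for this example.

```python
import numpy as np

def generalized_eigenvalues(X, Y):
    """Eigenvalues of X Y^{-1} for SPD X and Y, via a Cholesky factor of Y."""
    L_inv = np.linalg.inv(np.linalg.cholesky(Y))
    M = L_inv @ X @ L_inv.T
    return np.linalg.eigvalsh((M + M.T) / 2)

def abld(X, Y, alpha, beta):
    lam = generalized_eigenvalues(X, Y)
    ratio = (alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta)
    return np.sum(np.log(ratio)) / (alpha * beta)

def airm_squared(X, Y):
    """Squared AIRM distance: sum of squared log generalized eigenvalues."""
    return np.sum(np.log(generalized_eigenvalues(X, Y)) ** 2)

def random_spd(d, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + np.eye(d)

rng = np.random.default_rng(0)
X, B = random_spd(4, rng), random_spd(4, rng)
eps = 1e-3
near_origin = abld(X, B, eps, eps)
print(near_origin, 0.5 * airm_squared(X, B))   # should nearly coincide
```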

## 5 Computational Complexity

We note that some of the terms in the gradients derived above could be computed offline (such as ), and thus we omit those terms from our analysis. Using the simplifications depicted in (22) and Schur decomposition, gradient computation for each takes flops. Using the gradient formulation in (23) for and , we need flops. Computations of the closed form for in (24) takes . At test time, given that we have learned the dictionary and the parameters of the divergence, encoding a data matrix requires flops, which is similar in complexity to the recent sparse coding schemes such as [8].

## 6 Experiments

In this section, we evaluate the performance of the IDDL scheme on eight computer vision datasets, which are known to benefit from SPD-based descriptors. To this end, we use the following datasets, namely (i) the JHMDB action recognition dataset [24], (ii) the HMDB action recognition dataset [26], (iii) the KTH-TIPS2 dataset [32], (iv) Brodatz textures [35], (v) the Virus dataset [28], (vi) the SHREC 3D shape dataset [30], (vii) the Myometrium cancer dataset [41], and (viii) the Breast cancer dataset [41]. Below, we provide details about all the studied datasets and the way SPD descriptors are obtained on them. We use the standard evaluation schemes reported previously on these datasets. In some cases, we use our own implementations of popular methods, strictly following the recommended settings.

### 6.1 Datasets

##### HMDB and JHMDB datasets:

These are two popular action recognition benchmarks. The HMDB dataset consists of 51 action classes associated with 6766 video sequences, while JHMDB is a subset of HMDB with 955 sequences in 21 action classes. To generate SPD matrices on these datasets, we use the scheme proposed in [7], where we compute RBF kernel descriptors on the output of per-frame CNN class predictions (fc8) for each stream (RGB and optical flow) separately, and fuse these two SPD matrices into a single block-diagonal matrix per sequence. For the two-stream model, we use a VGG16 model trained on optical flow and RGB frames separately as described in [37]. Thus, our descriptors are of size $102 \times 102$ for HMDB and $42 \times 42$ for JHMDB.

##### SHREC 3D Object Recognition Dataset:

It consists of 15000 RGBD covariance descriptors generated from the SHREC dataset [30] by following [15]. SHREC consists of 51 3D object classes. The descriptors are of size . Similar to [8], we randomly picked 80% of the dataset for training and used the remaining for testing.

##### KTH-TIPS2 dataset and Brodatz Textures:

These are popular texture recognition datasets. The KTH-TIPS dataset consists of 4752 images from 11 material classes under varying conditions of illumination, pose, and scale. Covariance descriptors of size are generated from this dataset following the procedure in [18]. We use the standard 4-split cross-validation for our evaluations on this dataset. As for the Brodatz dataset, we use the relative pixel coordinates, image intensity, and image gradients to form region covariance descriptors from 100 texture classes. Our dataset consists of 31000 SPD matrices, and we follow the procedure in [8] for our evaluation using an 80:20 rule as used in the RGBD dataset above.

##### Virus Dataset:

It consists of 1500 images of 15 different virus types. Similar to the KTH-TIPS, we use the procedure in [18] to generate covariance descriptors from this dataset and follow their evaluation scheme using three-splits.

##### Cancer Datasets.

Apart from these standard SPD datasets, we also report performances on two cancer recognition datasets from [41] kindly shared with us by the authors. We use images from two types of cancers, namely (i) Breast cancer, consisting of binary classes (tissue is either cancerous or not) consisting of about 3500 samples, and (ii) Myometrium cancer, consisting of 3320 samples; we use covariance-kernel descriptors as described in [41] which are of size . We follow the 80:20 rule for evaluation on this dataset as well.

### 6.2 Experimental Setup

Since we present experiments on a variety of datasets and under various configurations, we summarize our main experiments first. There are three sets of experiments we conduct, namely (i) comparison of IDDL against other popular measures on SPD matrices, (ii) comparisons among various configurations of IDDL, and (iii) comparisons against state of the art approaches on the above datasets. For those datasets that do not have prescribed cross-validation splits, we repeat the experiments at least 5 times and average the performance scores.

### 6.3 Parameter Initialization

In all the experiments, we initialized the parameters of IDDL (e.g., the initial dictionary) in a principled way. We initialized the dictionary atoms by applying log-Euclidean K-Means; i.e., we compute the log-Euclidean map of the SPD data, compute Euclidean K-Means on these mapped points, and remap the K-Means centroids to the SPD manifold via an exponential map. To initialize $\alpha$ and $\beta$, we recommend a grid search with the dictionary atoms fixed as above. As an alternative to the grid search, we empirically observed that a good choice is to start with the Burg divergence. The regularization parameter $\gamma$ was chosen using cross-validation.
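The log-Euclidean K-Means initialization described above can be sketched in a few lines of NumPy: map each SPD matrix through the matrix logarithm, run Euclidean K-Means (a plain Lloyd iteration here) on the flattened logs, and map the centroids back with the matrix exponential. The function names, the fixed iteration count, and the random initial centres are assumptions for this example and not the authors' implementation.

```python
import numpy as np

def sym_fun(A, f):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * f(w)) @ V.T

def log_euclidean_kmeans(Xs, k, iters=20, seed=0):
    """Return k SPD atoms obtained from K-Means in the log-Euclidean domain."""
    rng = np.random.default_rng(seed)
    d = Xs[0].shape[0]
    flat = np.stack([sym_fun(X, np.log).ravel() for X in Xs])    # (N, d*d)
    centers = flat[rng.choice(len(Xs), size=k, replace=False)]   # copy of k rows
    for _ in range(iters):
        dist2 = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist2.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):            # keep the old centre if a cluster empties
                centers[j] = flat[assign == j].mean(axis=0)
    # Map the Euclidean centroids back to the SPD manifold.
    return [sym_fun(c.reshape(d, d), np.exp) for c in centers]

def random_spd(d, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + np.eye(d)

rng = np.random.default_rng(0)
data = [random_spd(4, rng) for _ in range(30)]
atoms = log_euclidean_kmeans(data, k=3)
print(len(atoms), [np.linalg.eigvalsh(B).min() > 0 for B in atoms])
```

Because the centroids are averages of symmetric matrices, their matrix exponentials are SPD, so the returned atoms are valid starting points for the dictionary.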

### 6.4 Comparisons to Variants of IDDL

In this section, we analyze various aspects of the performance of IDDL. Generally speaking, our IDDL formulation is generic and customizable. For example, even though we formulated the problem as using a separate ABLD on each dictionary atom, it does not hurt to learn the same divergence over all atoms in some applications. To this end, we test the performance of three scenarios, namely (i) using a scalar $\alpha$ and $\beta$ that is shared across all the dictionary atoms (which we call IDDL-S), (ii) a vector $\alpha$ and $\beta$, where we assume $\alpha = \beta$, but each dictionary atom can potentially have a distinct parameter pair (we call this case IDDL-V), and (iii) the most generic case where $\alpha$ and $\beta$ are vectors that need not be equal, which we refer to as IDDL-N.

In Figure 2, we compare all these configurations on six of the datasets. We also include specific cases such as the Burg divergence and the AIRM case ($\alpha, \beta \to 0$) for comparison (using the dictionary learning scheme proposed in Section 4.2.1). Our experiments show that IDDL-N and IDDL-V consistently perform well on most of the datasets. This is of course not very surprising given the generality of IDDL compared to the other measures.

### 6.5 Comparisons to Standard Measures

In this experiment, we compare IDDL (see Figure 2) to the standard similarity measures on SPD matrices, including the log-Euclidean metric [3], AIRM [36], and JBLD [9]. We report 1-NN classification performance for these baselines in Table 4. As a rule of thumb (also supported empirically by cross-validation studies on our datasets), we chose the number of dictionary atoms in proportion to the number of classes. Increasing the size of the dictionary does not seem to help in most cases. We also report a discriminative baseline by training a linear SVM on the log-Euclidean mapped SPD matrices. The results reported in Table 4 clearly demonstrate the advantage of IDDL against the baselines on most of the datasets, where the benefits can exceed 10% in some cases (such as JHMDB and Virus).

### 6.6 Comparisons to the State of the Art

We compare IDDL to the following popular methods that share similarities with our scheme, namely (i) Log-Euclidean Metric Learning (LEML) [22], (ii) kernelized sparse coding [18] that uses the log-Euclidean metric for sparse coding SPD matrices, (iii) kernelized sparse coding using JBLD, (iv) kernelized locality constrained coding [17], and (v) Riemannian dictionary learning and sparse coding (RSPDL) [8]. Our results are reported in Table 3. Again, we observe that IDDL performs the best amongst all the competing schemes, clearly demonstrating the advantage of learning the divergence and the dictionary. Note that comparisons are established by considering the same number of atoms for all schemes and fine-tuning the parameters of each algorithm (e.g., the bandwidth of the RBF kernel in the kernelized schemes) using a validation subset of the training set. As for LEML, we increased the number of pairwise constraints until the performance hit a plateau.

## 7 Ablative Study

In this section, we study the influence of each of the components in our algorithm. In Figure 3, we plot a heatmap of the classification accuracy against changing $\alpha$ and $\beta$ on the KTH-TIPS2 and Virus datasets. We fixed the size of the dictionaries to 22 for KTH-TIPS2 and 30 for the Virus dataset. The plots reveal that the performance varies for different parameter settings, which (along with the results in Table 4) substantiates that learning these parameters is a way to improve performance. It should be noted that for our SVM-based experiments, we used a linear SVM on the log-Euclidean mapped SPD matrices.

In Figure 4, we plot the convergence of our objective against iterations. We also depict the BCD objective as contributed by the dictionary learning updates and the parameter learning; we use IDDL-V for this experiment. As is clear, most of the decrease in the objective happens when the dictionary is learned, which is not surprising given that it has the largest number of variables to learn. For most datasets, we observe that RCG converges in about 200-300 iterations.

### 7.1 Running Time Experiments

In Figure 5, we plot the running time for one iteration of RCG against the number of dimensions of the matrices and the number of dictionary atoms. While the cost of our dictionary updates appears quadratic in the number of dimensions, it scales linearly with the dictionary size, and the updates usually converge in a couple of seconds.

### 7.2 Evaluation of Joint Learning

In Table 5, we evaluate the usefulness of learning the information divergence against learning the dictionary on the Virus dataset. For this experiment, we evaluated three scenarios: (i) fixing the dictionary to the initialization (using K-Means) and learning the parameters using the IDDL-S variant, (ii) fixing $\alpha$ and $\beta$ to the initialization obtained by grid search while learning the dictionary, and (iii) learning both the dictionary and the parameters jointly. As the results in Table 5 show, jointly learning the dictionary and the parameters yields better results, thus justifying our IDDL formulation.

### 7.3 Trajectories of α,β

In this experiment, we demonstrate the BCD trajectories of α and β for the IDDL-S and IDDL-N variants of our algorithm on the Virus dataset. Specifically, in Figure 6, we show how the values of α and β vary as the BCD iterations progress. In this experiment, we used 15 dictionary atoms. All experiments used the same initializations for the invariants. We also plot the corresponding objective and training accuracies. For IDDL-N, recall that the parameters are vectors, and thus we plot the average α and β, respectively. We also plot the trajectories for various initializations of these parameters.

For IDDL-S, it appears that different initializations lead to disparate points of convergence. However, for all points of convergence, the objective value at convergence is very similar (and so is the training accuracy), suggesting that there are multiple local minima that lead to similar empirical results. We also find that initializing with demonstrates slightly better convergence than other possibilities, which we observed for other datasets too. For IDDL-N, we found that the mean of the parameters remained more or less constant, although the exact values varied by (not shown).

## 8 Limitations of IDDL

The main limitation of our approach is the non-convexity of our objective, which precludes a formal analysis of the convergence. A further limitation is that the gradient expressions involve matrix inversions and may need careful regularization to avoid numerical instability. We also note that the AB divergence has a discontinuity at the origin, which needs to be accounted for when learning the parameters.
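To make the numerical concerns above concrete, here is a minimal sketch of evaluating the αβ-logdet divergence in the form given by Cichocki et al. [11], $D^{(\alpha,\beta)}(X\,\|\,Y) = \frac{1}{\alpha\beta}\sum_i \log\frac{\alpha\lambda_i^{\beta} + \beta\lambda_i^{-\alpha}}{\alpha+\beta}$, where $\lambda_i$ are the eigenvalues of $XY^{-1}$. The function name `ab_logdet_divergence`, the diagonal regularizer `eps`, and the choice to raise an error near the singular parameter values (rather than switch to the limiting forms) are assumptions made for illustration only; the sketch also assumes the arguments of the logarithm stay positive, as they do for α, β > 0.

```python
import numpy as np

def ab_logdet_divergence(X, Y, alpha, beta, eps=1e-8):
    """Sketch of the alpha-beta log-det divergence between SPD matrices X and Y."""
    # The closed form is singular when alpha, beta, or alpha + beta is zero;
    # this sketch simply refuses those cases (an assumed design choice).
    if min(abs(alpha), abs(beta), abs(alpha + beta)) < eps:
        raise ValueError("alpha, beta, and alpha + beta must be non-zero")
    n = X.shape[0]
    # Small diagonal loading to keep inversions well-conditioned (assumed regularizer).
    Xr = X + eps * np.eye(n)
    Yr = Y + eps * np.eye(n)
    # Eigenvalues of Y^{-1/2} X Y^{-1/2} coincide with those of X Y^{-1}.
    w, V = np.linalg.eigh(Yr)
    Y_inv_sqrt = (V / np.sqrt(w)) @ V.T
    lam = np.linalg.eigvalsh(Y_inv_sqrt @ Xr @ Y_inv_sqrt)
    terms = (alpha * lam ** beta + beta * lam ** (-alpha)) / (alpha + beta)
    return np.sum(np.log(terms)) / (alpha * beta)

# Usage: the divergence of a matrix from itself is zero.
X = np.array([[2.0, 0.3], [0.3, 1.0]])
Y = np.array([[1.5, -0.2], [-0.2, 0.8]])
d_same = ab_logdet_divergence(X, X, 0.5, 0.5)
d_xy = ab_logdet_divergence(X, Y, 0.5, 0.5)
```

Working with the symmetric matrix $Y^{-1/2} X Y^{-1/2}$ instead of $XY^{-1}$ keeps the eigenvalue computation real and stable. For α = β = 1/2 the expression reduces to four times the Jensen-Bregman logdet divergence, which gives a convenient sanity check.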

Further, our experimental analysis suggests that no single variant of IDDL (amongst IDDL-S, IDDL-V, IDDL-N, IDDL-A, and IDDL-B) consistently performs the best on all datasets. However, given the possibility of learning α and β, the most general variant, IDDL-N, might be the best choice for any application, as it can plausibly learn the alternatives.

## 9 Conclusions

In this paper, we proposed a novel framework unifying the problems of dictionary learning and information divergence learning on SPD matrices, two problems that have been investigated separately so far. For this purpose, we leveraged recent advances in information geometry, namely the αβ-logdet divergence. We formulated an objective for jointly learning the divergence and the dictionary and showed that it can be solved efficiently using optimization methods on Riemannian manifolds. Experiments on eight computer vision datasets demonstrate superior performance of our approach against alternatives.

## Acknowledgments

This material is based upon work supported by the National Science Foundation through grants #CNS-0934327, #CNS-1039741, #SMA-1028076, #CNS-1338042, #CNS-1439728, #OISE-1551059, and #CNS-1514626. Dr. Cherian is funded by the Australian Research Council Centre of Excellence for Robotic Vision (#CE140100016).

## References

• [1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
• [2] S.-i. Amari and H. Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
• [3] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Log-euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic resonance in medicine, 56(2):411–421, 2006.
• [4] A. Basu, I. R. Harris, N. L. Hjort, and M. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
• [5] T. Brox, J. Weickert, B. Burgeth, and P. Mrázek. Nonlinear structure tensors. Image and Vision Computing, 24(1):41–55, 2006.
• [6] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
• [7] A. Cherian, P. Koniusz, and S. Gould. Higher-order pooling of CNN features via kernel linearization for action recognition. In WACV, 2017.
• [8] A. Cherian and S. Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Trans. on Neural Networks and Learning Systems, 2016.
• [9] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Jensen-bregman logdet divergence with application to efficient similarity search for covariance matrices. PAMI, 35(9):2161–2174, 2013.
• [10] A. Cichocki and S.-i. Amari. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
• [11] A. Cichocki, S. Cruces, and S.-i. Amari. Log-determinant divergences revisited: Alpha-beta and gamma log-det divergences. Entropy, 17(5):2988–3034, 2015.
• [12] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari. Non-negative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.
• [13] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with bregman divergences. In NIPS, 2005.
• [14] O. Dikmen, Z. Yang, and E. Oja. Learning the information divergence. PAMI, 37(7):1442–1454, 2015.
• [15] D. Fehr. Covariance based point cloud descriptors for object detection and classification. PhD thesis, University Of Minnesota, 2013.
• [16] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.
• [17] M. Harandi and M. Salzmann. Riemannian coding and dictionary learning: Kernels to the rescue. In CVPR, 2015.
• [18] M. Harandi, M. Salzmann, and F. Porikli. Bregman divergences for infinite dimensional covariance matrices. In CVPR, 2014.
• [19] M. T. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: Geometry-aware dimensionality reduction for spd matrices. In ECCV, 2014.
• [20] G. Hinton and S. Roweis. Stochastic neighbor embedding. In NIPS, 2002.
• [21] Z. Huang and L. Van Gool. A Riemannian network for SPD matrix learning. CoRR arXiv:1608.04233, 2016.
• [22] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In ICML, 2015.
• [23] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix backpropagation for deep networks with structured layers. In ICCV, 2015.
• [24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
• [25] R. Kompass. A generalized divergence measure for nonnegative matrix factorization. Neural computation, 19(3):780–791, 2007.
• [26] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In ICCV, 2011.
• [27] B. Kulis, M. Sustik, and I. Dhillon. Learning low-rank kernel matrices. In ICML, 2006.
• [28] G. Kylberg, M. Uppström, K. Hedlund, G. Borgefors, and I. Sintorn. Segmentation of virus particle candidates in transmission electron microscopy images. Journal of microscopy, 245(2):140–147, 2012.
• [29] J. Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. conf. on Computational learning theory, 1999.
• [30] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, 2011.
• [31] P. Li, Q. Wang, W. Zuo, and L. Zhang. Log-Euclidean kernels for sparse representation and dictionary learning. In ICCV, 2013.
• [32] P. Mallikarjuna, A. T. Targhi, M. Fritz, E. Hayman, B. Caputo, and J.-O. Eklundh. The KTH-TIPS2 database, 2006.
• [33] M. Mihoko and S. Eguchi. Robust blind source separation by beta divergence. Neural computation, 14(8):1859–1886, 2002.
• [34] M. Moakher and P. G. Batchelor. Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields, pages 285–298. Springer, 2006.
• [35] T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern recognition, 29(1):51–59, 1996.
• [36] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. IJCV, 66(1):41–66, 2006.
• [37] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
• [38] U. Şimşekli, A. T. Cemgil, and B. Ermiş. Learning mixed divergences in coupled matrix and tensor factorization models. In ICASSP, 2015.
• [39] R. Sivalingam, D. Boley, V. Morellas, and N. Papanikolopoulos. Tensor sparse coding for region covariances. In ECCV, 2010.
• [40] R. Sivalingam, V. Morellas, D. Boley, and N. Papanikolopoulos. Metric learning for semi-supervised clustering of region covariance descriptors. In ICDSC, 2009.
• [41] P. Stanitsas, A. Cherian, X. Li, A. Truskinovsky, V. Morellas, and N. Papanikolopoulos. Evaluation of feature descriptors for cancerous tissue recognition. In ICPR, 2016.
• [42] D. B. Thiyam, S. Cruces, J. Olias, and A. Cichocki. Optimization of alpha-beta log-det divergences and their application in the spatial filtering of two class motor imagery movements. Entropy, 19(3):89, 2017.
• [43] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In ECCV, 2006.
• [44] L. Wang, J. Zhang, L. Zhou, C. Tang, and W. Li. Beyond covariance: Feature representation with nonlinear kernel matrices. In ICCV, 2015.
• [45] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR, 2012.
• [46] Y. Xie, J. Ho, and B. Vemuri. On a nonlinear generalization of sparse coding and dictionary learning. In ICML, 2013.