Generalized Twin Gaussian Processes using Sharma-Mittal Divergence

by   Mohamed Elhoseiny, et al.
Rutgers University

There has been growing interest in mutual information measures due to their wide range of applications in Machine Learning and Computer Vision. In this paper, we present a generalized structured regression framework based on Sharma-Mittal divergence, a relative entropy measure that is introduced to the Machine Learning community in this work. Sharma-Mittal (SM) divergence is a generalized mutual information measure that includes the widely used Rényi, Tsallis, Bhattacharyya, and Kullback-Leibler (KL) relative entropies as special cases. Specifically, we study Sharma-Mittal divergence as a cost function in the context of Twin Gaussian Processes (TGP) (Bo and Sminchisescu 2010), which generalizes over the KL divergence without computational penalty. We show interesting properties of Sharma-Mittal TGP (SMTGP) through a theoretical analysis, which covers missing insights in the traditional TGP formulation; we develop this theory for SM divergence, of which KL divergence is a special case. Experimentally, we evaluated the proposed SMTGP framework on several datasets. The results show that SMTGP reaches better predictions than KL-based TGP, since it offers a bigger class of models through its parameters, which we learn from the data.




1 Introduction

Since the 1950s, a lot of work has been done to measure information and probabilistic metrics. Claude Shannon (Shannon 2001) proposed a powerful framework to mathematically quantify information, which has been the foundation of information theory and of developments in communication, networking, and many Computer Science applications. Many problems in Physics and Computer Science require a reliable measure of information divergence, which has motivated many mathematicians, physicists, and computer scientists to study different divergence measures. For instance, Rényi (Rényi 1960), Tsallis (Tsallis 1988) and Kullback-Leibler (Gray 1990) divergences have been applied in many Computer Science applications. They have been effectively used in machine learning for many tasks including subspace analysis (Learned-Miller and Fisher-III 2003; Póczos and Lõrincz 2005; Van Hulle 2008; Szabó et al 2007), facial expression recognition (Shan et al 2005), texture classification (Hero et al 2001), image registration (Kybic 2006), clustering (Aghagolzadeh et al 2007), non-negative matrix factorization (Wang and Zhang 2013) and 3D pose estimation (Bo and Sminchisescu 2010).

In the Machine Learning community, many attempts have been made to understand information and connect it to uncertainty. Many of the proposed terminologies turn out to be different views of the same measure. For instance, Bregman Information (Banerjee et al 2005), Statistical Information (DeGroot 1962), Csiszár-Morimoto f-divergence, and the gap between the expectations in Jensen’s inequality (i.e., the Jensen gap) (Jensen 1906) turn out to be equivalent to the maximum reduction in uncertainty for convex functions, in contrast with the prior probability distribution (Reid and Williamson 2011).

A lot of work has been proposed to unify divergence functions (Amari and Nagaoka 2000; Reid and Williamson 2011; Zhang 2007; 2004). Cichocki and Ichi Amari (2010) explicitly considered the relationships between Alpha-divergence (Cichocki et al 2008), Beta-divergence (Kompass 2007) and Gamma-divergence (Cichocki and Ichi Amari 2010); each of them is a single-parameter divergence measure. Then, Cichocki et al (2011) introduced a two-parameter family. Here, however, we study a different two-parameter divergence measure (Sharma 1975), investigated in the Physics community, which is of interest to the Machine Learning community.

Akturk et al (2007), physicists, studied an entropy measure called Sharma-Mittal in thermostatistics in 2007, which was originally introduced by Sharma et al (Sharma 1975). (This work was proposed four years before Cichocki et al (2011) and, as far as we know, it was not considered as prior work in the Machine Learning community.) Sharma-Mittal (SM) divergence has two parameters ($\alpha$ and $\beta$), detailed later in Section 2. Akturk et al (2007) discussed that SM entropy generalizes both Tsallis ($\beta \to \alpha$) and Rényi entropy ($\beta \to 1$) in the limiting cases of its two parameters; this was originally shown by Masi (2005). In addition, it can be shown that SM entropy converges to Shannon entropy as $\alpha, \beta \to 1$. Aktürk et al also suggested a physical meaning of SM entropy, which is the free energy difference between the equilibrium and the off-equilibrium distribution. In 2008, SM entropy was also investigated in multidimensional harmonic oscillator systems (Aktürk et al 2008). Similarly, SM relative entropy (mutual information) generalizes each of the Rényi, Tsallis and KL mutual information divergences. This work in the physics domain motivated us to investigate SM divergence in the Machine Learning domain.

A closed-form expression for SM divergence between two Gaussian distributions was recently proposed (Nielsen and Nock 2012), which motivated us to study this measure in a structured regression setting. In this paper, we present a generalized framework for structured regression utilizing a family of divergence measures that includes SM divergence, Rényi divergence, Tsallis divergence and KL divergence. In particular, we study SM divergence within the context of Twin Gaussian Processes (TGP), a state-of-the-art structured-output regression method. Bo and Sminchisescu (2010) proposed TGP as a structured prediction approach based on estimating the KL divergence from the input to the output Gaussian Processes (which is why it is called Twin Gaussian Processes), denoted by KLTGP. Since KL divergence is not symmetric, Bo and Sminchisescu (2010) also studied TGP based on the KL divergence from the output to the input data, denoted by IKLTGP (Inverse KLTGP). In this work, we present a generalization of TGP using the SM divergence, denoted by SMTGP. Since SM divergence is a two-parameter family, we study the effect of these parameters and how they are related to the distribution of the data. In the context of TGP, we show that these two parameters, $\alpha$ and $\beta$, could be interpreted as distribution bias and divergence order in the context of structured learning. We also highlight the probabilistic causality direction of the SM objective function (this is mainly detailed in Section 4). More specifically, this paper makes six contributions:

  1. The first presentation of SM divergence in the Machine Learning community.

  2. A generalized version of TGP based on SM divergence to predict structured outputs; see Subsection 3.2.

  3. A simplification of the SM divergence closed-form expression in (Nielsen and Nock 2012) for multivariate Gaussian distributions, which reduced both the cost function evaluation and the gradient computation used in our prediction framework (this simplification could also be useful outside the TGP context, when computing SM divergence between two multivariate Gaussian distributions); see Subsections 3.3 and 3.4.

  4. A theoretical analysis of TGP under SM divergence in Section 4.

  5. A certainty measure that could be associated with each structured output prediction, argued in Subsection 4.2.

  6. An experimental demonstration that SM divergence improves on KL divergence under TGP prediction by correctly tuning $\alpha$ and $\beta$ through cross validation on two toy examples and three real datasets; see Section 5.

The rest of this paper is organized as follows: Section 2 presents background on SM divergence and its available closed-form expression for multivariate Gaussians. Section 3 presents the optimization problem used in our framework and the derived analytic gradients. Section 4 presents our theoretical analysis of TGP under our framework from a spectral perspective. Section 5 presents our experimental validation. Finally, Section 6 discusses and concludes our work.

2 Sharma-Mittal Divergence

This section provides background on SM divergence and its closed form for the multivariate Gaussian distribution.

2.1 SM Family Divergence Measures

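The SM divergence, $D_{\alpha,\beta}(p:q)$, between two distributions $p(t)$ and $q(t)$ is defined as (Sharma 1975)

$$D_{\alpha,\beta}(p:q) = \frac{1}{\beta-1}\left[\left(\int p(t)^{\alpha}\, q(t)^{1-\alpha}\, dt\right)^{\frac{1-\beta}{1-\alpha}} - 1\right], \qquad \forall \alpha > 0,\ \alpha \neq 1,\ \beta \neq 1. \tag{1}$$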


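It was shown in (Akturk et al 2007) that most of the widely used divergence measures are special cases of SM divergence. Each of the Rényi, Tsallis and KL divergences can be defined as a limiting case of the SM divergence $D_{\alpha,\beta}(p:q)$ as follows:

$$\begin{aligned}
R_{\alpha}(p:q) &= \lim_{\beta \to 1} D_{\alpha,\beta}(p:q) = \frac{1}{\alpha-1}\ln\left(\int p(t)^{\alpha}\, q(t)^{1-\alpha}\, dt\right),\\
T_{\alpha}(p:q) &= D_{\alpha,\alpha}(p:q) = \frac{1}{\alpha-1}\left(\int p(t)^{\alpha}\, q(t)^{1-\alpha}\, dt - 1\right),\\
KL(p:q) &= \lim_{\alpha \to 1,\ \beta \to 1} D_{\alpha,\beta}(p:q) = \int p(t)\ln\frac{p(t)}{q(t)}\, dt,
\end{aligned} \tag{2}$$

where $R_{\alpha}(p:q)$, $T_{\alpha}(p:q)$ and $KL(p:q)$ denote the Rényi, Tsallis and KL divergences, respectively. We also found that the Bhattacharyya divergence (Kailath 1967), denoted by $B(p:q)$, is a limit case of the SM and Rényi divergences as follows

$$B(p:q) = -\ln\int p(t)^{1/2}\, q(t)^{1/2}\, dt = \tfrac{1}{2}\, R_{1/2}(p:q) = \tfrac{1}{2}\lim_{\beta \to 1,\ \alpha \to 1/2} D_{\alpha,\beta}(p:q).$$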
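The limiting cases can be checked numerically. The following is a minimal sketch, assuming the standard integral form of the SM divergence, $\frac{1}{\beta-1}\big[(\int p^{\alpha} q^{1-\alpha}\,dt)^{\frac{1-\beta}{1-\alpha}} - 1\big]$, and two hypothetical 1-D Gaussian densities on a grid; the function names and the test densities are illustrative and not part of the original work.

```python
import numpy as np

def overlap_integral(p, q, alpha, dt):
    """Riemann-sum approximation of the integral of p^alpha * q^(1 - alpha)."""
    return float(np.sum(p ** alpha * q ** (1.0 - alpha)) * dt)

def sharma_mittal(p, q, alpha, beta, dt):
    """SM divergence for alpha > 0, alpha != 1, beta != 1 (assumed integral form)."""
    integral = overlap_integral(p, q, alpha, dt)
    return (integral ** ((1.0 - beta) / (1.0 - alpha)) - 1.0) / (beta - 1.0)

def renyi(p, q, alpha, dt):
    """Renyi divergence, the beta -> 1 limit of SM."""
    return np.log(overlap_integral(p, q, alpha, dt)) / (alpha - 1.0)

def tsallis(p, q, alpha, dt):
    """Tsallis divergence, the beta = alpha case of SM."""
    return (overlap_integral(p, q, alpha, dt) - 1.0) / (alpha - 1.0)

def kl(p, q, dt):
    """KL divergence, the alpha, beta -> 1 limit of SM."""
    return float(np.sum(p * np.log(p / q)) * dt)

def gaussian_pdf(t, mean, std):
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical test densities: p = N(0, 1), q = N(1, 1.5^2).
t = np.linspace(-12.0, 12.0, 24001)
dt = t[1] - t[0]
p = gaussian_pdf(t, 0.0, 1.0)
q = gaussian_pdf(t, 1.0, 1.5)

print("SM (beta near 1) vs Renyi :", sharma_mittal(p, q, 0.7, 1.0 + 1e-6, dt), renyi(p, q, 0.7, dt))
print("SM (beta = alpha) vs Tsallis:", sharma_mittal(p, q, 0.7, 0.7, dt), tsallis(p, q, 0.7, dt))
print("SM (both near 1) vs KL    :", sharma_mittal(p, q, 1.0 - 1e-4, 1.0 + 1e-4, dt), kl(p, q, dt))
print("Bhattacharyya vs Renyi/2  :", -np.log(overlap_integral(p, q, 0.5, dt)), 0.5 * renyi(p, q, 0.5, dt))
```

Each printed pair should agree closely, which illustrates that a single two-parameter cost covers all four measures; the grid resolution and the offsets from the limits are arbitrary choices.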

While SM is a two-parameter generalized entropy measure originally introduced by Sharma (1975), it is worth mentioning that two-parameter families of divergence functions have been proposed in the machine learning community since 2011 (Cichocki et al 2011; Zhang 2013). It is shown in (Cichocki and Ichi Amari 2010) that the Tsallis entropy is connected to the Alpha-divergence (Cichocki et al 2008) and Beta-divergence (Kompass 2007), while the Rényi entropy is related to the Gamma-divergences (Cichocki and Ichi Amari 2010). (Alpha- and Beta-divergences should not be confused with the $\alpha$ and $\beta$ parameters of Sharma-Mittal divergence.) The Tsallis and Rényi relative entropies are two different generalizations of the standard Boltzmann-Gibbs entropy (or Shannon information). However, we focus here on SM divergence for three reasons: (1) it generalizes over a considerable family of functions suitable for structured regression problems; (2) it opens the way for future consideration of this measure in works that study entropy and divergence functions; (3) SM divergence has a closed-form expression, recently proposed for multivariate Gaussian distributions (Nielsen and Nock 2012), which is interesting to study.

Another motivation of this work is to study how the two parameters of the SM divergence, as a generalized entropy measure, affect the performance of the structured regression problem. Here we show an analogy in the physics domain that motivates our study. As indicated by Masi (2005) in the physics domain, it is important to understand that Tsallis and Rényi entropies are two different generalizations along two different paths. Tsallis generalizes to non-extensive systems (in Physics, entropy is considered extensive if its value depends on the amount of material present; Tsallis is a non-extensive entropy), while Rényi generalizes to quasi-linear means (Rényi entropy can be interpreted as an averaging of a quasi-arithmetic function; Akturk et al 2007). SM entropy generalizes to non-extensive sets and non-linear means, having the Tsallis and Rényi measures as limiting cases. Hence, in the TGP regression setting, this indicates resolving the trade-off of controlling the direction of bias towards one of the distributions (i.e., the input and output distributions) by changing $\alpha$. It also allows a higher-order divergence measure by changing $\beta$. Another motivation from Physics is that SM entropy is the only entropy that gives rise to a thermostatistics based on escort mean values (useful theoretical tools, used in thermostatistics, for describing basic properties of a probability density function; Tsallis et al 2009) and admitting a partition function (Frank and Plastino 2002).

2.2 SM-divergence Closed-Form Expression for Multivariate Gaussians

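In order to solve optimization problems efficiently over relative entropy, it is critical to have a closed-form formula for the optimized function, which is SM relative entropy in our framework. Prediction over Gaussian Processes (Rasmussen and Williams 2005) is performed in practice over a multivariate Gaussian distribution. Hence, we are interested in a closed-form formula for the SM relative entropy of distribution $\mathcal{N}_q$ from $\mathcal{N}_p$, such that $\mathcal{N}_p = \mathcal{N}(\mu_p, \Sigma_p)$ and $\mathcal{N}_q = \mathcal{N}(\mu_q, \Sigma_q)$. Nielsen and Nock (2012) proposed a closed-form expression for SM divergence, which can be written as follows

$$D_{\alpha,\beta}(\mathcal{N}_p:\mathcal{N}_q) = \frac{1}{\beta-1}\left[\left(\frac{|\Sigma_p|^{\alpha}\,|\Sigma_q|^{1-\alpha}}{\left|\left(\alpha\Sigma_p^{-1} + (1-\alpha)\Sigma_q^{-1}\right)^{-1}\right|}\right)^{-\frac{1-\beta}{2(1-\alpha)}} e^{-\frac{\alpha(1-\beta)}{2}\,\Delta\mu^{T}\left(\alpha\Sigma_q + (1-\alpha)\Sigma_p\right)^{-1}\Delta\mu} - 1\right] \tag{3}$$

where $0 < \alpha < 1$, $\Delta\mu = \mu_q - \mu_p$, $\alpha\Sigma_p^{-1} + (1-\alpha)\Sigma_q^{-1}$ is a positive definite matrix, and $|\cdot|$ denotes the matrix determinant. The following section builds on this SM closed-form expression to predict structured output under TGP, which leads to an analytic gradient of the SMTGP cost function with cubic computational complexity. We then present a simplified version of the closed-form expression in Equation 3, which results in an equivalent SMTGP analytic gradient of quadratic complexity.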
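A closed form of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' implementation: it uses the standard Gaussian identity $\ln\int p^{\alpha} q^{1-\alpha} = -\tfrac{1}{2}\big(\alpha\ln|\Sigma_p| + (1-\alpha)\ln|\Sigma_q| + \ln|\alpha\Sigma_p^{-1} + (1-\alpha)\Sigma_q^{-1}|\big) - \tfrac{\alpha(1-\alpha)}{2}\Delta\mu^{T}(\alpha\Sigma_q + (1-\alpha)\Sigma_p)^{-1}\Delta\mu$ for $0 < \alpha < 1$, and the function names and example parameters are hypothetical.

```python
import numpy as np

def sm_divergence_gaussians(mu_p, cov_p, mu_q, cov_q, alpha, beta):
    """SM divergence between N(mu_p, cov_p) and N(mu_q, cov_q).

    Assumes 0 < alpha < 1 and beta != 1; works in the log domain for stability.
    """
    mu_p = np.asarray(mu_p, dtype=float)
    mu_q = np.asarray(mu_q, dtype=float)
    cov_p = np.asarray(cov_p, dtype=float)
    cov_q = np.asarray(cov_q, dtype=float)
    delta = mu_q - mu_p

    # alpha * cov_p^{-1} + (1 - alpha) * cov_q^{-1}, positive definite for 0 < alpha < 1.
    precision_mix = alpha * np.linalg.inv(cov_p) + (1.0 - alpha) * np.linalg.inv(cov_q)
    logdet_p = np.linalg.slogdet(cov_p)[1]
    logdet_q = np.linalg.slogdet(cov_q)[1]
    logdet_mix = np.linalg.slogdet(precision_mix)[1]

    # Quadratic form in the mean difference; it is zero when both means coincide.
    cov_mix = alpha * cov_q + (1.0 - alpha) * cov_p
    quad = delta @ np.linalg.solve(cov_mix, delta)

    log_integral = (-0.5 * (alpha * logdet_p + (1.0 - alpha) * logdet_q + logdet_mix)
                    - 0.5 * alpha * (1.0 - alpha) * quad)
    exponent = (1.0 - beta) / (1.0 - alpha)
    return np.expm1(exponent * log_integral) / (beta - 1.0)

def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(N_p || N_q), used here only as a reference for the limit."""
    mu_p = np.asarray(mu_p, dtype=float)
    mu_q = np.asarray(mu_q, dtype=float)
    cov_p = np.asarray(cov_p, dtype=float)
    cov_q = np.asarray(cov_q, dtype=float)
    delta = mu_q - mu_p
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    quad = delta @ np.linalg.solve(cov_q, delta)
    logdet_ratio = np.linalg.slogdet(cov_q)[1] - np.linalg.slogdet(cov_p)[1]
    return 0.5 * (trace_term + quad - len(mu_p) + logdet_ratio)

# Hypothetical 2-D example.
mu_a, cov_a = np.zeros(2), np.array([[1.0, 0.2], [0.2, 0.8]])
mu_b, cov_b = np.array([0.5, -0.3]), np.array([[1.5, -0.1], [-0.1, 1.2]])

print("SM(alpha=0.6, beta=1.4):", sm_divergence_gaussians(mu_a, cov_a, mu_b, cov_b, 0.6, 1.4))
print("SM near the KL limit   :", sm_divergence_gaussians(mu_a, cov_a, mu_b, cov_b, 1.0 - 1e-4, 1.0 + 1e-4))
print("KL reference           :", kl_gaussians(mu_a, cov_a, mu_b, cov_b))
```

Working with log-determinants and `expm1` avoids forming very small or very large determinants directly, which matters once the covariance matrices are large kernel matrices.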

3 Sharma-Mittal TGP

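In prediction problems, we expect that similar inputs produce similar predictions. This notion was adopted in (Bo and Sminchisescu 2010; Yamada et al 2012) to predict structured output based on the KL divergence between two Gaussian Processes. This section presents TGP for structured regression by minimizing SM relative entropy. We follow that with our theoretical analysis of TGPs in Section 4. We begin by introducing some notation. Let the joint distributions of the input and the output be defined as follows

$$p(X, x) = \mathcal{N}_X(0, K_{X\cup x}), \qquad p(Y, y) = \mathcal{N}_Y(0, K_{Y\cup y}), \qquad K_{X\cup x} = \begin{bmatrix} K_X & k_X^x \\ (k_X^x)^{T} & k_X(x,x) \end{bmatrix}, \qquad K_{Y\cup y} = \begin{bmatrix} K_Y & k_Y^y \\ (k_Y^y)^{T} & k_Y(y,y) \end{bmatrix} \tag{4}$$

where $x$ is a new input test point, whose unknown outcome is $y$, and the training set is given by $X$ and $Y$, which are $N \times d_x$ and $N \times d_y$ matrices, respectively. $K_X$ is an $N \times N$ matrix with $(K_X)_{ij} = k_X(x_i, x_j)$, such that $k_X(x_i, x_j)$ is the similarity kernel between $x_i$ and $x_j$. $k_X^x$ is an $N \times 1$ column vector with $(k_X^x)_i = k_X(x_i, x)$. Similarly, $K_Y$ is an $N \times N$ matrix with $(K_Y)_{ij} = k_Y(y_i, y_j)$, such that $k_Y(y_i, y_j)$ is the similarity kernel between $y_i$ and $y_j$, and $k_Y^y$ is an $N \times 1$ column vector with $(k_Y^y)_i = k_Y(y_i, y)$. By applying Gaussian-RBF kernel functions, the similarity kernels for inputs and outputs take the form $k_X(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\rho_x^2}\right) + \lambda_X \delta_{ij}$ and $k_Y(y_i, y_j) = \exp\left(-\frac{\|y_i - y_j\|^2}{2\rho_y^2}\right) + \lambda_Y \delta_{ij}$, respectively, where $\rho_x$ and $\rho_y$ are the corresponding kernel bandwidths, $\lambda_X$ and $\lambda_Y$ are regularization parameters to avoid overfitting and to handle noise in the data, and $\delta_{ij} = 1$ if $i = j$, $0$ otherwise.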
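The kernel quantities described above can be assembled as in the following sketch. It assumes a Gaussian-RBF kernel of the form $\exp(-\|a-b\|^2/(2\rho^2))$ plus a regularization term on the diagonal; the exact bandwidth parameterization, the function names and the random data are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel_matrix(points, bandwidth, reg):
    """N x N kernel matrix with entries exp(-||p_i - p_j||^2 / (2 bandwidth^2)) + reg * delta_ij."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2)) + reg * np.eye(len(points))

def rbf_kernel_vector(points, query, bandwidth):
    """Length-N vector of kernel values between each training point and the query point."""
    sq_dists = np.sum((points - query) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def extended_kernel_matrix(points, query, bandwidth, reg):
    """(N+1) x (N+1) kernel matrix of the training points extended by one query point."""
    K = rbf_kernel_matrix(points, bandwidth, reg)
    k = rbf_kernel_vector(points, query, bandwidth)
    k_self = 1.0 + reg  # kernel of the query with itself, including the delta term
    return np.block([[K, k[:, None]], [k[None, :], np.array([[k_self]])]])

# Hypothetical training inputs and test point.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
x = rng.normal(size=3)

K_ext = extended_kernel_matrix(X, x, bandwidth=1.5, reg=1e-3)
print(K_ext.shape)
print(np.linalg.eigvalsh(K_ext).min() > 0.0)
```

The same helpers apply unchanged to the outputs; only the training block depends on the training data alone, so it can be computed (and inverted) once before any test point is seen.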

3.1 KLTGP and IKLTGP Prediction

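Bo and Sminchisescu (2010) first proposed TGP, which minimizes the Kullback-Leibler divergence between the marginal GPs of inputs and outputs, focusing on the human pose estimation problem. The estimated pose using TGP is given as the solution of the following optimization problem (Bo and Sminchisescu 2010)

$$\hat{y}(x) = \arg\min_{y}\left[ L_{KL}(x, y) = k_Y(y,y) - 2\,(k_Y^y)^{T} u_x - \eta_x \log\left(k_Y(y,y) - (k_Y^y)^{T} K_Y^{-1} k_Y^y\right)\right] \tag{5}$$

where $u_x = K_X^{-1} k_X^x$, $\eta_x = k_X(x,x) - (k_X^x)^{T} u_x$. Here $k_X^x$ and $k_Y^y$ are the vectors of kernel values between the test point (or candidate output) and the training inputs (or outputs). The analytical gradient of this cost function is defined as follows (Bo and Sminchisescu 2010)

$$\frac{\partial L_{KL}(x,y)}{\partial y^{(d)}} = \frac{\partial k_Y(y,y)}{\partial y^{(d)}} - 2\,u_x^{T}\frac{\partial k_Y^y}{\partial y^{(d)}} - \eta_x\,\frac{\dfrac{\partial k_Y(y,y)}{\partial y^{(d)}} - 2\,(k_Y^y)^{T} K_Y^{-1}\dfrac{\partial k_Y^y}{\partial y^{(d)}}}{k_Y(y,y) - (k_Y^y)^{T} K_Y^{-1} k_Y^y} \tag{6}$$

where $d$ is the dimension index of the output $y$. For Gaussian kernels with bandwidth $\rho_y$, we have

$$\frac{\partial k_Y(y,y)}{\partial y^{(d)}} = 0, \qquad \frac{\partial k_Y(y, y_i)}{\partial y^{(d)}} = -\frac{1}{\rho_y^{2}}\left(y^{(d)} - y_i^{(d)}\right) k_Y(y, y_i).$$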

The optimization problem can be solved using a second-order BFGS quasi-Newton optimizer with cubic polynomial line search for optimal step size selection. Since KL divergence is not symmetric, Bo and Sminchisescu (2010) also studied the inverse KL divergence between the output and the input distribution under TGP; we denote this model as IKLTGP. Equations 7 and 8 show the IKLTGP cost function and its corresponding gradient (we derived this gradient since it was not provided in (Bo and Sminchisescu 2010)).


From Equations 6 and 8, it is not hard to see that the gradients of KLTGP and IKLTGP can be computed in quadratic complexity, given that $K_X^{-1}$ and $K_Y^{-1}$ are precomputed once during training and stored, as they depend only on the training data. This quadratic complexity of the KLTGP gradient sets a benchmark for us: computing the gradient for SMTGP in $O(N^2)$. We address this benchmark in our framework, as detailed in the following subsections.
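The quadratic cost per evaluation can be made concrete with a small sketch. It assumes the KLTGP objective takes the form commonly quoted from Bo and Sminchisescu (2010), $k_Y(y,y) - 2\,(k_Y^y)^{T}u_x - \eta_x\log\big(k_Y(y,y) - (k_Y^y)^{T}K_Y^{-1}k_Y^y\big)$ with $u_x = K_X^{-1}k_X^x$ and $\eta_x = k_X(x,x) - (k_X^x)^{T}u_x$, and a Gaussian-RBF kernel with a regularized diagonal. The function names, data and hyperparameter values are hypothetical.

```python
import numpy as np

def rbf_matrix(points, bandwidth, reg):
    diffs = points[:, None, :] - points[None, :, :]
    return np.exp(-np.sum(diffs ** 2, axis=2) / (2.0 * bandwidth ** 2)) + reg * np.eye(len(points))

def rbf_vector(points, query, bandwidth):
    return np.exp(-np.sum((points - query) ** 2, axis=1) / (2.0 * bandwidth ** 2))

def kltgp_cost_and_grad(y, Y, ky_inv, u_x, eta_x, rho_y, lam_y):
    """Assumed KLTGP cost and its gradient with respect to the candidate output y.

    With ky_inv precomputed, the dominant step is one matrix-vector product, O(N^2).
    """
    k_y = rbf_vector(Y, y, rho_y)            # kernel values between y and training outputs
    k_yy = 1.0 + lam_y                       # k_Y(y, y), constant for the RBF kernel
    w = ky_inv @ k_y                         # O(N^2)
    eta_y = k_yy - k_y @ w
    cost = k_yy - 2.0 * (k_y @ u_x) - eta_x * np.log(eta_y)
    # Jacobian of k_y with respect to y (N x d); the derivative of k_Y(y, y) is zero.
    jac = -(y - Y) / rho_y ** 2 * k_y[:, None]
    grad = -2.0 * (jac.T @ u_x) + (2.0 * eta_x / eta_y) * (jac.T @ w)
    return cost, grad

# Hypothetical training data and hyperparameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
Y = np.tanh(X[:, :2])
rho_x, rho_y, lam_x, lam_y = 1.5, 1.0, 0.1, 0.1

# Training-time precomputation (cubic, done once).
kx_inv = np.linalg.inv(rbf_matrix(X, rho_x, lam_x))
ky_inv = np.linalg.inv(rbf_matrix(Y, rho_y, lam_y))

# Test-time quantities for one new input.
x_new = rng.normal(size=3)
k_x = rbf_vector(X, x_new, rho_x)
u_x = kx_inv @ k_x
eta_x = (1.0 + lam_x) - k_x @ u_x

y0 = Y.mean(axis=0)                          # a simple starting point for an optimizer
cost0, grad0 = kltgp_cost_and_grad(y0, Y, ky_inv, u_x, eta_x, rho_y, lam_y)
print(cost0, grad0)
```

A cost and gradient pair of this shape can be handed to any quasi-Newton optimizer; the sketch stops at a single evaluation so that the per-iteration work is easy to inspect.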

3.2 SMTGP Prediction

By applying the closed form in Equation 3, the SM divergence between $p(X, x)$ and $p(Y, y)$ takes the following form


From matrix algebra, . Similarly, . Hence, Equation 9 could be rewritten as follows


is a positive constant, since and are positive definite matrices. Hence, it could be removed from the optimization problem. Same argument holds for , so could be also removed from the cost function. Having removed these constants, the prediction function reduces to minimizing the following expression


It is worth mentioning that is quadratic to compute, given that is precomputed during the training; see Appendix A.

To avoid numerical instability problems in Equation 11, introduced by the determinant of a large matrix, we optimized the logarithm of the cost function instead of the cost function itself. We derived the gradient of this logarithm by applying matrix calculus directly to the logarithm of Equation 11, presented below; the derivation steps are detailed in Appendix B
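The instability caused by the determinant of a large matrix is easy to reproduce. The sketch below uses a hypothetical 400 x 400 positive definite matrix (not a matrix from the paper) whose determinant underflows in double precision, while its log-determinant is computed without difficulty.

```python
import numpy as np

# Hypothetical kernel-like matrix: positive definite, with many small eigenvalues.
n = 400
K = 0.01 * np.eye(n) + 1e-4 * np.ones((n, n))

det_direct = np.linalg.det(K)                       # underflows to 0.0 in double precision
sign, logdet = np.linalg.slogdet(K)                 # stable sign and log|K|
chol = np.linalg.cholesky(K)
logdet_chol = 2.0 * np.sum(np.log(np.diag(chol)))   # log|K| from the Cholesky factor

print("direct determinant   :", det_direct)
print("slogdet (sign, value):", sign, logdet)
print("Cholesky log-det     :", logdet_chol)
```

Optimizing the logarithm of a cost that contains such a determinant lets the log-determinant be used directly, so the vanishing product of eigenvalues is never formed.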


is computed by solving the following linear system of equations , is the first elements in , which is a vector of elements. The computational complexity of the gradient in Equation 12 is cubic at test time, due to solving this system. On the other hand, the gradient for KLTGP is quadratic. This problem motivated us to investigate the cost function to achieve a quadratic complexity of the gradient computation for SMTGP.

3.3 Quadratic SMTGP Prediction

We start by simplifying the closed-form expression introduced in (Nielsen and Nock 2012), which led to the cubic-complexity gradient computation in Subsection 3.2.

Lemma 3.1.

SM-divergence between two N-dimensional multivariate Gaussians $\mathcal{N}_p = \mathcal{N}(\mu_p, \Sigma_p)$ and $\mathcal{N}_q = \mathcal{N}(\mu_q, \Sigma_q)$ can be written as


Under the TGP setting, the exponential term in Equation 3 reduces to 1, since $\Delta\mu = 0$ (i.e., both Gaussians have zero mean). Then, the SM divergence could be simplified as follows:


We denote the original closed-form expression as , while the simplified form . After applying the simplified SM expression in Lemma 3.1 to measure the divergence between and , the new cost function becomes in the following form


where , . Since , , and are multiplicative positive constants that do not depend on , they can be dropped from the cost function. Also, is an additive constant that can be ignored under optimization. After ignoring these multiplicative positive constants and the added constant, the improved SMTGP cost function reduces to


In contrast to the cost function in Equation 11, the cost function in Equation 16 does not involve the determinant of a large matrix. Hence, we predict the output by directly minimizing the cost in Equation 16; there is no need to optimize over its logarithm because there is no numerical stability problem. Since the cost function has two factors that depend on $y$, we follow the product rule: if $g(y) = c\, f(y)\, r(y)$, where $c$ is a constant and $f(y)$ and $r(y)$ are functions, then $\frac{\partial g}{\partial y} = c\left(f(y)\frac{\partial r}{\partial y} + r(y)\frac{\partial f}{\partial y}\right)$, which explains the two terms of the derived gradient below.


The computational complexity of the cost function in Equation 16 and the gradient in Equation 17 is quadratic at test time (i.e., $O(N^2)$) in the number of training points. Since the quantities that depend only on the training points are precomputed at training time, our hypothesis about the quadratic computational complexity of the improved SMTGP prediction function and gradient holds: the remaining computations are $O(N^2)$. This indicates the advantage of using our closed-form expression for SM divergence in Lemma 3.1 over the closed form proposed in (Nielsen and Nock 2012), which leads to cubic complexity. Although both expressions are equivalent, it is straightforward to compute the gradient in quadratic complexity only from the simplified expression.

3.4 Advantage of the simplified closed form over the original outside the SMTGP context

The previous subsection shows that the computational complexity of SMTGP prediction was decreased significantly using our at test time to be quadratic, compared to cubic complexity for . Out of the TGP context, we show here another general advantage of using our proposed closed-form expression to generally compute SM-divergence between two Gaussian distributions and . is times faster to compute than under condition. This is since needs operations which is much less than operations needed to compute (i.e., requires less matrix operations); see Appendix C for the proof. We conclude this section by a general form of Lemma 3.1 in Equation 18, where . This equation was achieved by refactorizing the exponential term and using matrix identities.


In case , is times faster than computing . This is since needs operations in this case which is less than operations needed to compute under ; see Appendix C. This indicates that the simplifications, we provided in this work, could be used to generally speedup the computation of SM divergence between two Gaussian Distributions, beyond the context of TGPs.

4 Theoretical Analysis

In order to understand the role of the $\alpha$ and $\beta$ parameters of SMTGP, we performed an eigen analysis of the cost function in Equation 15. Generally speaking, the basic notion of TGP prediction is to extend the dimensionality of the divergence measure from $N$ training examples to $N+1$ examples, which involves the test point $x$ and the unknown output $y$. Hence, we start by discussing the extension of a general Gaussian Process from $K_Z$ (e.g., $K_X$ and $K_Y$) to $K_{Z\cup z}$ (e.g., $K_{X\cup x}$ and $K_{Y\cup y}$), where $Z$ is any domain and $z$ is the point that extends $K_Z$ to $K_{Z\cup z}$, detailed in Subsection 4.1. Based on this discussion, we derive two lemmas to address some properties of the SMTGP prediction in Subsection 4.2, which lead to a probabilistic interpretation that we provide in Subsection 4.3.

4.1 A Gaussian Process from $N$ to $N+1$ points

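In this section, we use a superscript to disambiguate between kernel matrices of size $N \times N$ and $(N+1) \times (N+1)$, i.e., $K^{N}$ and $K^{N+1}$. Let $f(z) \sim \mathcal{GP}(0, k(z, z'))$ be a Gaussian process on an arbitrary domain $Z$. Let $GP^{N} = \mathcal{N}(0, K^{N})$ be the marginalization of the given Gaussian process over the $N$ training points (i.e., $\{z_i\}_{i=1}^{N}$). Let $GP^{N+1} = \mathcal{N}(0, K^{N+1})$ be the marginalization of the Gaussian process over $N+1$ points after adding the $(N+1)$-th point $z$ (this is linked to extending $K_X$ to $K_{X\cup x}$ and $K_Y$ to $K_{Y\cup y}$ by $x$ and $y$, respectively). The kernel matrix $K^{N+1}$ is written in terms of $K^{N}$ as follows

$$K^{N+1} = \begin{bmatrix} K^{N} & v \\ v^{T} & k(z, z) \end{bmatrix} \tag{19}$$

where $v = [k(z_1, z), \dots, k(z_N, z)]^{T}$. The matrix determinant of $K^{N+1}$ is related to that of $K^{N}$ by

$$|K^{N+1}| = \eta\, |K^{N}|, \qquad \eta = k(z, z) - v^{T} (K^{N})^{-1} v \tag{20}$$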
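The block structure of the extended kernel matrix and the resulting determinant relation can be verified numerically. The sketch below assumes the standard Schur-complement identity $|K^{N+1}| = |K^{N}|\,\big(k(z,z) - v^{T}(K^{N})^{-1}v\big)$, where $v$ holds the kernel values between the training points and the added point; the kernel, the data and the variable names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Pairwise Gaussian-RBF kernel values between the rows of a and the rows of b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Hypothetical points: N = 8 training points and one added point.
rng = np.random.default_rng(2)
Z = rng.normal(size=(8, 2))
z_new = rng.normal(size=(1, 2))
reg = 1e-2

K_n = rbf_kernel(Z, Z, 1.0) + reg * np.eye(len(Z))    # N x N kernel matrix
v = rbf_kernel(Z, z_new, 1.0)[:, 0]                    # kernel vector to the new point
k_self = 1.0 + reg                                     # kernel of the new point with itself
K_n1 = np.block([[K_n, v[:, None]], [v[None, :], np.array([[k_self]])]])

# Schur complement: the factor by which the determinant grows.
eta = k_self - v @ np.linalg.solve(K_n, v)
logdet_n = np.linalg.slogdet(K_n)[1]
logdet_n1 = np.linalg.slogdet(K_n1)[1]

# The determinant is also the product of the eigenvalues (variances along eigenvectors).
eig_n1 = np.linalg.eigvalsh(K_n1)

print("eta                 :", eta)
print("det ratio           :", np.exp(logdet_n1 - logdet_n))
print("sum of log-eigenvals:", np.sum(np.log(eig_n1)), "vs log-det:", logdet_n1)
```

Because the ratio is bounded above by the self-similarity of the added point and shrinks as that point becomes better explained by the training points, it behaves like a residual variance of the new point given the existing ones.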


Since the multivariate Gaussian distribution is a special case of the elliptical distributions, the eigenvalues of any covariance matrix (e.g., $K^{N}$) are interpreted as the variances of the distribution in the directions of the corresponding eigenvectors. Hence, the determinant of the matrix (e.g., $|K^{N}|$) generalizes the notion of variance to multiple dimensions as the volume of this elliptical distribution, which is oriented by the eigenvectors. From this notion, one could interpret the ratio $\eta = |K^{N+1}| / |K^{N}|$ as the factor by which the variance (uncertainty) of the marginalized Gaussian process is scaled when the new data point $z$ is introduced. Looking closely at $\eta$, we can notice