1 Introduction
Since the 1950s, extensive work has been done on measuring information and probabilistic metrics. Claude Shannon (Shannon 2001) proposed a powerful framework to mathematically quantify information, which has been the foundation of information theory and of developments in communication, networking, and many Computer Science applications. Many problems in Physics and Computer Science require a reliable measure of information divergence, which has motivated many mathematicians, physicists, and computer scientists to study different divergence measures. For instance, the Rényi (Rényi 1960), Tsallis (Tsallis 1988) and Kullback-Leibler (Gray 1990) divergences have been applied in many Computer Science applications. They have been effectively used in machine learning for many tasks including subspace analysis (Learned-Miller and Fisher III 2003; Póczos and Lõrincz 2005; Van Hulle 2008; Szabó et al 2007), facial expression recognition (Shan et al 2005), texture classification (Hero et al 2001), image registration (Kybic 2006), clustering (Aghagolzadeh et al 2007), non-negative matrix factorization (Wang and Zhang 2013) and 3D pose estimation (Bo and Sminchisescu 2010).
In the Machine Learning community, many attempts have been made to understand information and connect it to uncertainty. Many of the proposed terminologies turn out to be different views of the same measure. For instance, Bregman Information (Banerjee et al 2005), Statistical Information (DeGroot 1962), the Csiszár-Morimoto f-divergence, and the gap between the expectations in Jensen's inequality (i.e., the Jensen gap) (Jensen 1906) turn out to be equivalent to the maximum reduction in uncertainty for convex functions, in contrast with the prior probability distribution (Reid and Williamson 2011).
Much work has been proposed to unify divergence functions (Amari and Nagaoka 2000; Reid and Williamson 2011; Zhang 2007; 2004). Cichocki and Ichi Amari (2010) explicitly considered the relationships between the Alpha-divergence (Cichocki et al 2008), Beta-divergence (Kompass 2007) and Gamma-divergence (Cichocki and Ichi Amari 2010); each of them is a single-parameter divergence measure. Then, Cichocki et al (2011) introduced a two-parameter family. Here, however, we study a two-parameter divergence measure (Sharma 1975), investigated in the Physics community, which is of interest to the Machine Learning community.
Akturk et al (2007), physicists, studied an entropy measure called Sharma-Mittal in thermostatistics, which was originally introduced by Sharma (1975). (As far as we know, this work appeared four years before Cichocki et al (2011) and has not been considered as prior work in the Machine Learning community.) Sharma-Mittal (SM) divergence has two parameters ($\alpha$ and $\beta$), detailed later in Section 2. Akturk et al (2007) discussed that SM entropy generalizes both Tsallis ($\beta \to \alpha$) and Rényi entropy ($\beta \to 1$) in the limiting cases of its two parameters; this was originally shown by Masi (2005). In addition, it can be shown that SM entropy converges to Shannon entropy as $\alpha, \beta \to 1$. Aktürk et al also suggested a physical meaning of SM entropy, which is the free energy difference between the equilibrium and the off-equilibrium distribution. In 2008, SM entropy was also investigated in multidimensional harmonic oscillator systems (Aktürk et al 2008). Similarly, SM relative entropy (mutual information) generalizes each of the Rényi, Tsallis and KL mutual information divergences. This work in the physics domain motivated us to investigate SM divergence in the Machine Learning domain.
A closed-form expression for SM divergence between two Gaussian distributions was recently proposed (Nielsen and Nock 2012), which motivated us to study this measure in the structured regression setting. In this paper, we present a generalized framework for structured regression utilizing a family of divergence measures that includes SM divergence, Rényi divergence, Tsallis divergence and KL divergence. In particular, we study SM divergence within the context of Twin Gaussian Processes (TGP), a state-of-the-art structured-output regression method. Bo and Sminchisescu (2010) proposed TGP as a structured prediction approach based on estimating the KL divergence from the input to the output Gaussian Processes (hence the name Twin Gaussian Processes), denoted by KL-TGP. Since KL divergence is not symmetric, Bo and Sminchisescu (2010) also studied TGP based on KL divergence from the output to the input data, denoted by IKL-TGP (Inverse KL-TGP). In this work, we present a generalization of TGP using the SM divergence, denoted by SM-TGP. Since SM divergence is a two-parameter family, we study the effect of these parameters and how they are related to the distribution of the data. In the context of TGP, we show that these two parameters, $\alpha$ and $\beta$, can be interpreted as distribution bias and divergence order in the context of structured learning. We also highlight the probabilistic causality direction of the SM objective function (detailed mainly in Section 4). More specifically, this paper makes six contributions:

The first presentation of SM divergence in the Machine Learning community.

A generalized version of TGP based on SM divergence to predict structured outputs; see Subsection 3.2.

A simplification of the SM divergence closed-form expression in (Nielsen and Nock 2012) for multivariate Gaussian distributions, which reduces the cost of both the cost function evaluation and the gradient computation used in our prediction framework; see Subsections 3.3 and 3.4. (This simplification could also be useful outside the TGP context, when computing SM divergence between two multivariate Gaussian distributions.)

A theoretical analysis of TGP under SM divergence in Section 4.

A certainty measure that can be associated with each structured output prediction, discussed in Subsection 4.2.

An experimental demonstration that SM divergence improves on KL divergence under TGP prediction by correctly tuning $\alpha$ and $\beta$ through cross validation on two toy examples and three real datasets; see Section 5.
The rest of this paper is organized as follows: Section 2 presents background on SM divergence and its available closed-form expression for multivariate Gaussians. Section 3 presents the optimization problem used in our framework and the derived analytic gradients. Section 4 presents our theoretical analysis of TGP under our framework from a spectral perspective. Section 5 presents our experimental validation. Finally, Section 6 discusses and concludes our work.
2 Sharma-Mittal Divergence
This section presents background on SM divergence and its closed form for the multivariate Gaussian distribution.
2.1 SM Family Divergence Measures
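The SM divergence, $D_{\alpha,\beta}(p:q)$, between two distributions $p(t)$ and $q(t)$ is defined as (Sharma 1975)
$$D_{\alpha,\beta}(p:q) = \frac{1}{\beta-1}\left[\left(\int_{-\infty}^{\infty} p(t)^{\alpha}\, q(t)^{1-\alpha}\, dt\right)^{\frac{1-\beta}{1-\alpha}} - 1\right], \quad \forall \alpha > 0,\ \alpha \neq 1,\ \beta \neq 1. \tag{1}$$
It was shown in (Akturk et al 2007) that most of the widely used divergence measures are special cases of SM divergence. Each of the Rényi, Tsallis and KL divergences can be defined as limiting cases of SM divergence as follows:
$$\begin{aligned} R_{\alpha}(p:q) &= \lim_{\beta \to 1} D_{\alpha,\beta}(p:q) = \frac{1}{\alpha-1} \ln\left(\int_{-\infty}^{\infty} p(t)^{\alpha}\, q(t)^{1-\alpha}\, dt\right), \\ T_{\alpha}(p:q) &= \lim_{\beta \to \alpha} D_{\alpha,\beta}(p:q) = \frac{1}{\alpha-1}\left(\int_{-\infty}^{\infty} p(t)^{\alpha}\, q(t)^{1-\alpha}\, dt - 1\right), \\ KL(p:q) &= \lim_{\beta \to 1,\, \alpha \to 1} D_{\alpha,\beta}(p:q) = \int_{-\infty}^{\infty} p(t) \ln\frac{p(t)}{q(t)}\, dt, \end{aligned} \tag{2}$$
where $R_{\alpha}(p:q)$, $T_{\alpha}(p:q)$ and $KL(p:q)$ denote the Rényi, Tsallis and KL divergences, respectively. We also found that the Bhattacharyya divergence (Kailath 1967), denoted by $B(p:q)$, is a limit case of the SM and Rényi divergences as follows
$$B(p:q) = \frac{1}{2} \lim_{\beta \to 1,\, \alpha \to 0.5} D_{\alpha,\beta}(p:q) = \frac{1}{2} \lim_{\alpha \to 0.5} R_{\alpha}(p:q) = -\ln\left(\int_{-\infty}^{\infty} p(t)^{0.5}\, q(t)^{0.5}\, dt\right).$$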
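The definition of the SM divergence and its limiting cases can be checked numerically. The following is a minimal Python sketch, assuming discrete distributions so that the integral becomes a sum; the function names and the two example distributions are illustrative assumptions and do not come from the original work.

```python
import math

def alpha_integral(p, q, alpha):
    """Discrete analogue of the integral of p^alpha * q^(1 - alpha)."""
    return sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))

def sm_divergence(p, q, alpha, beta):
    """Sharma-Mittal divergence, valid for alpha > 0, alpha != 1, beta != 1."""
    s = alpha_integral(p, q, alpha)
    return (s ** ((1.0 - beta) / (1.0 - alpha)) - 1.0) / (beta - 1.0)

def renyi_divergence(p, q, alpha):
    """Renyi divergence: the beta -> 1 limit of the SM divergence."""
    return math.log(alpha_integral(p, q, alpha)) / (alpha - 1.0)

def tsallis_divergence(p, q, alpha):
    """Tsallis divergence: the SM divergence evaluated at beta = alpha."""
    return (alpha_integral(p, q, alpha) - 1.0) / (alpha - 1.0)

def kl_divergence(p, q):
    """KL divergence: the alpha -> 1, beta -> 1 limit of the SM divergence."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def bhattacharyya_distance(p, q):
    """Standard Bhattacharyya distance, -ln(sum(sqrt(p * q)))."""
    return -math.log(alpha_integral(p, q, 0.5))

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

near_renyi = sm_divergence(p, q, 0.7, 1.0 + 1e-6)        # beta close to 1
at_tsallis = sm_divergence(p, q, 0.7, 0.7)               # beta equal to alpha
near_kl = sm_divergence(p, q, 1.0 - 1e-4, 1.0 + 1e-4)    # both close to 1

print(near_renyi, renyi_divergence(p, q, 0.7))
print(at_tsallis, tsallis_divergence(p, q, 0.7))
print(near_kl, kl_divergence(p, q))
print(0.5 * renyi_divergence(p, q, 0.5), bhattacharyya_distance(p, q))
```

Each printed pair should agree up to the approximation error of taking a parameter close to, rather than exactly at, its limit.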
While SM is a two-parameter generalized entropy measure originally introduced by Sharma (1975), it is worth mentioning that two-parameter families of divergence functions have been proposed in the machine learning community since 2011 (Cichocki et al 2011; Zhang 2013). It is shown in (Cichocki and Ichi Amari 2010) that the Tsallis entropy is connected to the Alpha-divergence (Cichocki et al 2008) and the Beta-divergence (Kompass 2007), while the Rényi entropy is related to the Gamma-divergences (Cichocki and Ichi Amari 2010). (The Alpha- and Beta-divergences should not be confused with the $\alpha$ and $\beta$ parameters of the Sharma-Mittal divergence.) The Tsallis and Rényi relative entropies are two different generalizations of the standard Boltzmann-Gibbs entropy (or Shannon information). However, we focus here on SM divergence for three reasons: (1) it generalizes over a considerable family of functions suitable for structured regression problems; (2) it may be considered in future works that study entropy and divergence functions; (3) it has a closed-form expression, recently proposed for multivariate Gaussian distributions (Nielsen and Nock 2012), which is interesting to study.
Another motivation of this work is to study how the two parameters of the SM divergence, as a generalized entropy measure, affect the performance of the structured regression problem. Here we show an analogy in the physics domain that motivates our study. As indicated by Masi (2005) in the physics domain, it is important to understand that the Tsallis and Rényi entropies are two different generalizations along two different paths. Tsallis generalizes to non-extensive systems (in Physics, entropy is considered extensive if its value depends on the amount of material present; Tsallis is a non-extensive entropy), while Rényi generalizes to quasi-linear means (i.e., Rényi entropy can be interpreted as an averaging of a quasi-arithmetic function; Akturk et al (2007)). SM entropy generalizes to non-extensive sets and non-linear means, having the Tsallis and Rényi measures as limiting cases. Hence, in the TGP regression setting, this indicates resolving the trade-off of controlling the direction of bias towards one of the distributions (i.e., the input and output distributions) by changing $\alpha$. It also allows higher-order divergence measures by changing $\beta$. Another motivation from Physics is that SM entropy is the only entropy that gives rise to a thermostatistics based on escort mean values (useful theoretical tools, used in thermostatistics, for describing basic properties of some probability density function; Tsallis et al 2009) and admitting of a partition function (Frank and Plastino 2002).
2.2 SM Divergence Closed-Form Expression for Multivariate Gaussians
In order to solve optimization problems efficiently over relative entropy, it is critical to have a closed-form formula for the optimized function, which is the SM relative entropy in our framework. Prediction over Gaussian Processes (Rasmussen and Williams 2005) is performed in practice on a multivariate Gaussian distribution. Hence, we are interested in finding a closed-form formula for SM relative entropy of distribution from , such that , and . Nielsen and Nock (2012) proposed a closed-form expression for SM divergence as follows
(3) 
where , , is a positive definite matrix, and denotes the matrix determinant. The following section builds on this SM closed-form expression to predict structured output under TGP, which leads to an analytic gradient of the SM-TGP cost function with cubic computational complexity. We then present a simplified form of the closed-form expression in Equation 3, which results in an equivalent SM-TGP analytic gradient of quadratic complexity.
3 Sharma-Mittal TGP
In prediction problems, we expect that similar inputs produce similar predictions. This notion was adopted in (Bo and Sminchisescu 2010; Yamada et al 2012) to predict structured output based on KL divergence between two Gaussian Processes. This section presents TGP for structured regression by minimizing SM relative entropy. We follow that with our theoretical analysis of TGPs in Section 4. We begin by introducing some notation. Let the joint distributions of the input and the output be defined as follows
(4) 
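where $x$ is a new input test point, whose unknown outcome is $y$, and the training set is given by the matrices $X$ and $Y$. $K_X$ is an $N \times N$ matrix with $(K_X)_{ij} = k_X(x_i, x_j)$, such that $k_X(x_i, x_j)$ is the similarity kernel between $x_i$ and $x_j$. $K_X^x$ is an $N \times 1$ column vector with $(K_X^x)_i = k_X(x_i, x)$. Similarly, $K_Y$ is an $N \times N$ matrix with $(K_Y)_{ij} = k_Y(y_i, y_j)$, such that $k_Y(y_i, y_j)$ is the similarity kernel between $y_i$ and $y_j$, and $K_Y^y$ is an $N \times 1$ column vector with $(K_Y^y)_i = k_Y(y_i, y)$. By applying Gaussian-RBF kernel functions, the similarity kernels for inputs and outputs take the form $k_X(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\rho_x^2}\right) + \lambda_X \delta_{ij}$ and $k_Y(y_i, y_j) = \exp\left(-\frac{\|y_i - y_j\|^2}{2\rho_y^2}\right) + \lambda_Y \delta_{ij}$, respectively, where $\rho_x$ and $\rho_y$ are the corresponding kernel bandwidths, $\lambda_X$ and $\lambda_Y$ are regularization parameters to avoid overfitting and to handle noise in the data, and $\delta_{ij} = 1$ if $i = j$, $0$ otherwise.
3.1 KL-TGP and IKL-TGP Prediction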
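The kernel matrices and vectors introduced at the start of Section 3 can be sketched as follows. This is a minimal pure-Python illustration that assumes the common regularized Gaussian RBF form exp(-||a - b||^2 / (2 rho^2)) + lambda * delta_ij; the function names, the toy training points, and the bandwidth and regularization values are hypothetical.

```python
import math

def rbf_kernel(a, b, bandwidth):
    """Gaussian RBF similarity between two points (without regularization)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def kernel_matrix(points, bandwidth, regularization):
    """N x N kernel matrix; the regularization term is added on the diagonal only."""
    n = len(points)
    return [
        [rbf_kernel(points[i], points[j], bandwidth)
         + (regularization if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

def kernel_vector(points, new_point, bandwidth):
    """N x 1 vector of similarities between the training points and a new point."""
    return [rbf_kernel(point, new_point, bandwidth) for point in points]

# Hypothetical toy training inputs and a test input.
train_inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
test_input = [0.5, 0.5]

K_X = kernel_matrix(train_inputs, bandwidth=1.0, regularization=0.1)
k_x = kernel_vector(train_inputs, test_input, bandwidth=1.0)
```

The same two helpers would apply unchanged to the outputs, with their own bandwidth and regularization values.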
Bo and Sminchisescu (2010) first proposed TGP, which minimizes the Kullback-Leibler divergence between the marginal GPs of inputs and outputs, focusing on the Human Pose Estimation problem. The estimated pose using TGP is given as the solution of the following optimization problem (Bo and Sminchisescu 2010)
(5) 
where , . The analytical gradient of this cost function is defined as follows (Bo and Sminchisescu 2010)
(6) 
where is the dimension index of the output . For Gaussian kernels, we have
The optimization problem can be solved using a second-order BFGS quasi-Newton optimizer with cubic polynomial line search for optimal step size selection. Since KL divergence is not symmetric, Bo and Sminchisescu (2010) also studied the inverse KL divergence between the output and the input distribution under TGP; we denote this model as IKL-TGP. Equations 7 and 8 show the IKL-TGP cost function and its corresponding gradient (we derived the latter, since it was not provided in (Bo and Sminchisescu 2010)).
(7) 
(8) 
From Equations 6 and 8, it is not hard to see that the gradients of KL-TGP and IKL-TGP can be computed in quadratic complexity, given that $K_X^{-1}$ and $K_Y^{-1}$ are precomputed once during training and stored, as they depend only on the training data. This quadratic complexity of the KL-TGP gradient sets a benchmark for computing the gradient of SM-TGP in $O(N^2)$. Hence, we address this benchmark in our framework, as detailed in the following subsections.
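The role of precomputation in the quadratic test-time cost can be illustrated as follows. This is a hedged pure-Python sketch, assuming that the stored quantities are the inverses of the training kernel matrices; `invert` and `quadratic_form` are hypothetical helper names, and the 2 x 2 matrix is a toy example rather than data from the paper.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting; cubic cost, run once at training time."""
    n = len(matrix)
    # Augment the matrix with the identity and reduce the left half to the identity.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def quadratic_form(inverse, k):
    """Evaluate k^T * inverse * k with one matrix-vector product; quadratic cost."""
    return sum(ki * sum(a * kj for a, kj in zip(row, k))
               for ki, row in zip(k, inverse))

# Training time: invert the (toy) kernel matrix once and store the result.
K = [[2.0, 0.5], [0.5, 1.0]]
K_inv = invert(K)

# Test time: each new kernel vector only needs quadratic-cost products.
k_new = [1.0, 2.0]
value = quadratic_form(K_inv, k_new)
```

Solving a fresh linear system for every test point would instead repeat the cubic-cost step at test time, which is the situation the benchmark above is meant to avoid.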
3.2 SM-TGP Prediction
Applying the closed form in Equation 3, the SM divergence between and takes the following form
(9) 
From matrix algebra, . Similarly, . Hence, Equation 9 could be rewritten as follows
(10) 
is a positive constant, since and are positive definite matrices. Hence, it can be removed from the optimization problem. The same argument holds for , so can also be removed from the cost function. Having removed these constants, the prediction function reduces to minimizing the following expression
(11) 
It is worth mentioning that is quadratic to compute, given that is precomputed during the training; see Appendix A.
To avoid numerical instability in Equation 11 (introduced by the determinant of the large matrix ), we optimized instead of . We derived the gradient of by applying matrix calculus directly to the logarithm of Equation 11, as presented below; the derivation steps are detailed in Appendix B
(12) 
is computed by solving the following linear system of equations , is the first elements in , which is a vector of elements. The computational complexity of the gradient in Equation 12 is cubic at test time, due to solving this system. On the other hand, the gradient for KL-TGP is quadratic. This problem motivated us to investigate the cost function in order to achieve a quadratic complexity of the gradient computation for SM-TGP.
3.3 Quadratic SM-TGP Prediction
We start by simplifying the closed-form expression introduced in (Nielsen and Nock 2012), which led to the cubic-complexity gradient computation.
Lemma 3.1.
SM divergence between two N-dimensional multivariate Gaussians and can be written as
(13) 
Proof.
Under the TGP setting, the exponential term in Equation 3 reduces to 1, since (i.e. ). Then, can be simplified as follows:
(14) 
∎
We denote the original closed-form expression as , and the simplified form as . After applying the simplified SM expression in Lemma 3.1 to measure the divergence between and , the new cost function takes the following form
(15) 
where , . Since , , and are multiplicative positive constants that do not depend on , they can be dropped from the cost function. Also, is an additive constant that can be ignored under optimization. After ignoring these multiplicative positive constants and the additive constant, the improved SM-TGP cost function reduces to
(16) 
In contrast to in Equation 11, does not involve the determinant of a large matrix. Hence, we predict the output by directly minimizing in Equation 16 (there is no need to optimize over the logarithm of , because there is no numerical stability problem). Since the cost function has two factors that depend on , we follow the rule that if $g(x) = c\, f(x)\, r(x)$, where $c$ is a constant and $f(x)$ and $r(x)$ are functions, then $g'(x) = c\, (f'(x)\, r(x) + f(x)\, r'(x))$, which explains the two terms of the derived gradient below, where , ,
(17) 
The computational complexity of the cost function in Equation 16 and the gradient in Equation 17 is quadratic at test time (i.e., $O(N^2)$) in the number of training points. Since and depend only on the training points, they are precomputed at training time. Hence, our hypothesis about the quadratic computational complexity of the improved SM-TGP prediction function and gradient holds, since the remaining computations are $O(N^2)$. This indicates the advantage of using our closed-form expression for SM divergence in Lemma 3.1 over the closed form proposed in (Nielsen and Nock 2012), which has cubic complexity. Although both expressions are equivalent, it is straightforward to compute the gradient in quadratic complexity from the expression.
3.4 Advantage of against out of SMTGP context
The previous subsection shows that the computational complexity of SM-TGP prediction at test time was reduced significantly, to quadratic, using our , compared to cubic complexity for . Outside the TGP context, we show here another general advantage of using our proposed closed-form expression to compute SM divergence between two Gaussian distributions and . is times faster to compute than under condition. This is because needs operations, which is much less than the operations needed to compute (i.e., requires fewer matrix operations); see Appendix C for the proof. We conclude this section with a general form of Lemma 3.1 in Equation 18, where . This equation was obtained by refactoring the exponential term and using matrix identities.
(18) 
In case , is times faster than computing . This is because needs operations in this case, which is less than the operations needed to compute under ; see Appendix C. This indicates that the simplifications we provide in this work could be used to speed up the computation of SM divergence between two Gaussian distributions in general, beyond the context of TGPs.
4 Theoretical Analysis
In order to understand the role of the $\alpha$ and $\beta$ parameters of SM-TGP, we performed an eigen analysis of the cost function in Equation 15. Generally speaking, the basic notion of TGP prediction is to extend the dimensionality of the divergence measure from $N$ training examples to $N+1$ examples, which involves the test point $x$ and the unknown output $y$. Hence, we start by discussing the extension of a general Gaussian Process from (e.g. and ) to (e.g. and ), where is any domain and is the point that extends to , detailed in Subsection 4.1. Based on this discussion, we will derive two lemmas to address some properties of the SM-TGP prediction in Subsection 4.2, which will lead to a probabilistic interpretation that we provide in Subsection 4.3.
4.1 A Gaussian Process from $N$ to $N+1$ points
In this section, we will use a superscript to disambiguate between the kernel matrix of size $N \times N$ and $(N+1) \times (N+1)$, i.e. and . Let be a Gaussian process on an arbitrary domain . Let be the marginalization of the given Gaussian process over the $N$ training points (i.e. ). Let be the marginalization of over the $N+1$ points obtained after adding the point (i.e. ). (This is linked to extending to and to by and , respectively.) The kernel matrix is written in terms of as follows
(19) 
where . The matrix determinant of is related to by
(20) 
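The determinant relation for a kernel matrix extended by one point can be checked numerically. The sketch below is a hedged pure-Python illustration that assumes the relation is the standard Schur complement identity, det([[K, k], [k^T, c]]) = det(K) * (c - k^T K^{-1} k); all function names and the toy values are hypothetical and are not taken from the paper.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def det(matrix):
    """Determinant via Gaussian elimination, tracking the sign of row swaps."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
        result *= a[col][col]
    return result

def extend(kernel, cross, self_similarity):
    """Build the (N+1) x (N+1) matrix [[K, k], [k^T, c]] from the N x N matrix K."""
    rows = [list(row) + [cross[i]] for i, row in enumerate(kernel)]
    return rows + [list(cross) + [self_similarity]]

def variance_ratio(kernel, cross, self_similarity):
    """Schur complement c - k^T K^{-1} k: the factor that scales the determinant."""
    weights = solve(kernel, cross)
    return self_similarity - sum(ci * wi for ci, wi in zip(cross, weights))

# Toy positive definite kernel matrix, cross-similarities, and self-similarity.
K_n = [[2.0, 0.5], [0.5, 1.0]]
k_new = [0.3, 0.4]
c_new = 1.5

K_n_plus_1 = extend(K_n, k_new, c_new)
ratio = variance_ratio(K_n, k_new, c_new)
print(det(K_n_plus_1), det(K_n) * ratio)
```

Under this assumption, the two printed values should agree up to floating-point error, and `ratio` plays the role of the factor by which the determinant (the generalized variance) changes when the new point is added.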
Since the multivariate Gaussian distribution is a special case of the elliptical distributions, the eigenvalues of any covariance matrix (e.g. ) are interpreted as the variances of the distribution in the directions of the corresponding eigenvectors. Hence, the determinant of the matrix (e.g. ) generalizes the notion of variance to multiple dimensions as the volume of this elliptical distribution, which is oriented by the eigenvectors. From this notion, one could interpret as the ratio by which the variance (uncertainty) of the marginalized Gaussian process is scaled by the new data point . Looking closely at , we can notice (1)