Prediction in Riemannian metrics derived from divergence functions

by Henryk Gzyl

Divergence functions are interesting discrepancy measures: even though they are not true distances, we can use them to measure how separated two points are. Curiously enough, when they are applied to random variables, they lead to a notion of best predictor that coincides with the usual best predictor in the Euclidean distance. Given a divergence function, we can derive from it a Riemannian metric, which leads to a distance in which means and best predictors do not coincide with their Euclidean counterparts. It is the purpose of this note to study the Riemannian metric derived from a divergence function, as well as its use in prediction theory.






1 Introduction and Preliminaries

In [10], Bregman introduced an iterative procedure to find points in an intersection of convex sets. At each step, the next point in the sequence is obtained by minimizing an objective function that can be described as the vertical distance from the graph of the function to the tangent plane through the previous point. If $\mathcal{D}$ is a convex set in some $\mathbb{R}^n$ and $\phi:\mathcal{D}\to\mathbb{R}$ is a strictly convex, continuously differentiable function, the divergence function that it defines is specified by

$$\delta_\phi^2(\boldsymbol{x},\boldsymbol{y}) = 2\big(\phi(\boldsymbol{x}) - \phi(\boldsymbol{y}) - \langle \nabla\phi(\boldsymbol{y}),\, \boldsymbol{x}-\boldsymbol{y}\rangle\big). \tag{1.1}$$


In Bregman’s work, $\phi$ was taken to be the squared Euclidean norm.
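In the separable case the divergence is a sum of one-dimensional Bregman building blocks, which makes it straightforward to compute numerically. The following is a minimal sketch (the function names are ours, and we adopt the normalization with a factor 2, under which the quadratic case reproduces the squared Euclidean distance):

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """delta_phi^2(x, y) = 2 * (phi(x) - phi(y) - <grad phi(y), x - y>).

    The factor 2 is a normalization convention: with it, phi(x) = ||x||^2 / 2
    yields exactly the squared Euclidean distance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 2.0 * (phi(x) - phi(y) - np.dot(grad_phi(y), x - y))

# quadratic case: recovers || x - y ||^2
quad = bregman_divergence(lambda v: 0.5 * v @ v, lambda v: v, [1.0, 2.0], [0.0, 0.0])

# entropy case phi(x) = sum x ln x - x: positive but not symmetric
ent_xy = bregman_divergence(lambda v: np.sum(v * np.log(v) - v), np.log, [1.0], [2.0])
ent_yx = bregman_divergence(lambda v: np.sum(v * np.log(v) - v), np.log, [2.0], [1.0])
```

By strict convexity the value is nonnegative and vanishes only when the arguments coincide; the asymmetry is visible in the entropy case.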

The concept was eventually extended, even to the infinite dimensional case, and now plays an important role in many applications: for example, in clustering, classification analysis and machine learning, as in Banerjee et al. [3], Boissonnat et al. [9], Banerjee et al. [4] and Fisher [16]. It plays a role in optimization theory, as in Baushke and Borwein [5], Baushke and Lewis [6], Baushke and Combettes [8], Censor and Reich [13], Baushke et al. [7] and Censor and Zaknoon [14]; in the solution of operator equations, as in Butnariu and Resmerita [11]; in approximation theory in Banach spaces, as in Baushke and Combettes [8] or Li et al. [19]; and in applications of geometry to statistics and information theory, as in Amari and Nagaoka [2], Csiszár [15], Amari and Cichocki [1], Calin and Udriste [12] or Nielsen [22]. These are just a small sample of the many references to applications of Bregman functions, and the list grows rapidly.

It is a well known, and easy to verify, fact that

$$\delta_\phi^2(\boldsymbol{x},\boldsymbol{y}) \geq 0, \qquad \text{with } \delta_\phi^2(\boldsymbol{x},\boldsymbol{y}) = 0 \text{ if and only if } \boldsymbol{x} = \boldsymbol{y}.$$

Thus our choice of notation is consistent. But since $\delta_\phi^2$ is neither symmetric nor satisfies the triangular inequality, it cannot be a distance on $\mathcal{D}$. Let now $(\Omega,\mathcal{F},P)$ be a probability space such that $\mathcal{F}$ is complete (contains all sets of zero measure). By $L^1(\mathcal{F})$ and $L^2(\mathcal{F})$ we shall denote the usual classes of integrable or square integrable functions, identified up to sets of measure zero. The notion of divergence can be extended to random variables as follows.

Definition 1.1

Let $\boldsymbol{X}, \boldsymbol{Y}$ be $\mathcal{D}$-valued random variables such that $\phi(\boldsymbol{X})$, $\phi(\boldsymbol{Y})$ and $\langle\nabla\phi(\boldsymbol{Y}),\boldsymbol{X}-\boldsymbol{Y}\rangle$ are in $L^1(\mathcal{F})$. The divergence between $\boldsymbol{X}$ and $\boldsymbol{Y}$ is defined by

$$\delta_\phi^2(\boldsymbol{X},\boldsymbol{Y}) = 2\,E\big[\phi(\boldsymbol{X}) - \phi(\boldsymbol{Y}) - \langle\nabla\phi(\boldsymbol{Y}),\boldsymbol{X}-\boldsymbol{Y}\rangle\big].$$

Clearly, $\delta_\phi^2(\boldsymbol{X},\boldsymbol{Y})$ is neither symmetric nor satisfies the triangle inequality. But, as above, we also have $\delta_\phi^2(\boldsymbol{X},\boldsymbol{Y}) \geq 0$, with equality if and only if $\boldsymbol{X}=\boldsymbol{Y}$ almost surely, so we can think of it as a pseudo distance, cost or penalty function on the class of $\mathcal{D}$-valued random variables.

The motivation for this work comes from two directions. On the one hand, there is the fact that for Bregman divergences there is a notion of best predictor, and this best predictor happens to be the usual conditional expectation. To put it in symbols

Theorem 1.1

Let $\boldsymbol{X}\in L^2(\mathcal{F})$ be $\mathcal{D}$-valued and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Then the solution to the problem

$$\min\big\{\delta_\phi^2(\boldsymbol{X},\boldsymbol{Y}) : \boldsymbol{Y} \in L^2(\mathcal{G})\big\}$$

is given by $\boldsymbol{Y} = E[\boldsymbol{X}\,|\,\mathcal{G}]$.

For the proof the reader can consult Banerjee et al. [3] or Fisher [16]. The other thread comes from Gzyl [17], where a geometry on the convex cone of strictly positive vectors is considered. That geometry happens to be derivable from a divergence function, and it leads to a host of curious variations on the theme of best predictor, estimation, laws of large numbers and central limit theorems. The geometry considered there is that induced by the logarithmic distance, which makes the cone a Tits-Bruhat space; this happens to be a special commutative version of the theory explained in Lang [18], Lawson and Lim [20], Moakher [21] and Schwartzman [24].

We should mention that the use of differential geometric methods in [2] or [12], and the many references cited therein, is different from the one described below. They consider geometric structures either on the class of probabilities on a finite set, or on the space of parameters characterizing a (usually exponential) family of distributions. Here we analyze how the geometry of the set in which the random variables take values determines the nature of the standard estimation and prediction process.

From now on we shall suppose that $\mathcal{D} = M^n$, where $M$ is a bounded or unbounded interval in $\mathbb{R}$. We shall denote by $\phi: M \to \mathbb{R}$ a strictly convex, three times continuously differentiable function, and define

$$\phi(\boldsymbol{x}) = \sum_{i=1}^{n} \phi(x_i), \tag{1.2}$$

overloading the symbol $\phi$ for the separable function on $\mathcal{D}$ that it induces.

1.1 Some standard examples

In the next table we list five standard examples. The list could be much longer, but these examples were chosen because in some of the cases the distance between random variables associated to the divergence bounds their divergence from above, whereas in the others it is bounded by the divergence from above. The examples are displayed in Table 1.

Case 1: $\phi(x) = x^2/2$ on $M = \mathbb{R}$, with $\phi''(x) = 1$
Case 2: $\phi(x) = e^{x}$ on $M = \mathbb{R}$, with $\phi''(x) = e^{x}$
Case 3: $\phi(x) = e^{-x}$ on $M = \mathbb{R}$, with $\phi''(x) = e^{-x}$
Case 4: $\phi(x) = x\ln x - x$ on $M = (0,\infty)$, with $\phi''(x) = 1/x$
Case 5: $\phi(x) = -\ln x$ on $M = (0,\infty)$, with $\phi''(x) = 1/x^2$

Table 1: Standard convex functions used to generate Bregman divergences

1.2 Organization of the paper

We have established enough notation to describe the contents of the paper. In Section 2 we start from the divergence function on $\mathcal{D}$ and derive a metric tensor $g_{i,j}$ from it. We then solve the geodesic equations to compute the geodesic distance between any two points, and we compare it with the divergence between the two points. We shall see that there are cases in which one of them dominates the other for any pair of points.

The Riemannian distance between points in $\mathcal{D}$ induces a distance between random variables taking values there. In Section 3 we come to the main theme of this work, that is, the computation of best predictors when the distance between random variables is measured in the induced Riemannian distance. We shall call such best predictors the $\phi$-mean and the $\phi$-conditional expectation, and denote them by $E_\phi[\boldsymbol{X}]$ and $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}]$, respectively. In order to compare these to the best predictor in divergence, we use the prediction error as a comparison criterion. It is at this point that the comparison results established in Section 2 come in.

In Section 4 we take up the issue of sample estimation and its properties. We shall see that the standard results hold for the $\phi$-conditional expectation as well. That is, we shall see that the estimators of the $\phi$-mean and of the $\phi$-variance are unbiased and converge to their true values as the size of the sample becomes infinitely large. In Section 5 we shall consider the arithmetic properties of the $\phi$-conditional expectation when there is a commutative group structure on $\mathcal{D}$. In Section 6 we collect a few final comments, and in Appendix 7 we present one more derivation of the geodesic equations.

2 Riemannian metric induced by a Bregman divergence

The direct connection between $\phi$-divergences and Riemannian metrics stems from the fact that a strictly convex, at least twice differentiable function has a positive definite Hessian matrix. Even more, the metric derived from a “separable” $\phi$ is diagonal, that is,

$$g_{i,j}(\boldsymbol{x}) = \phi''(x_i)\,\delta_{i,j}. \tag{2.1}$$
Here we use for the standard Kronecker delta and we shall not distinguish between covariant and contravariant coordinates. This may make the description of standard symbols in differential geometry a bit more awkward.

All these examples have an interesting feature in common. The convex function defining the Bregman divergence is three times continuously differentiable, and defines a Riemannian metric in its domain through its second derivative. The equations for the geodesics in this metric are separated. It is actually easy to see that, for each $i$, the equation for the component $x_i(t)$ defining the geodesic which at time $t=0$ starts from $a_i$ and ends at $b_i$ at time $t=1$ is the solution to

$$\ddot{x}_i + \frac{1}{2}\,\frac{\phi'''(x_i)}{\phi''(x_i)}\,\dot{x}_i^2 = 0, \qquad x_i(0)=a_i,\quad x_i(1)=b_i. \tag{2.2}$$
Even though this equation is easy to integrate, we show how to integrate it in a short appendix at the end. Now denote by $\Phi$ a primitive of $\sqrt{\phi''}$, that is,

$$\Phi(x) = \int^{x} \sqrt{\phi''(s)}\,ds. \tag{2.3}$$

Since $\phi''$ is strictly positive by assumption, $\Phi$ is strictly increasing, hence invertible. If we put $\Psi = \Phi^{-1}$ for the compositional inverse of $\Phi$, we can write the solution to (2.2) as

$$x_i(t) = \Psi\big(\Phi(a_i) + c_i\,t\big). \tag{2.4}$$


The $c_i$ are integration constants which, using the condition $x_i(1)=b_i$, turn out to be $c_i = \Phi(b_i)-\Phi(a_i)$. Notice now that the distance between $\boldsymbol{a}$ and $\boldsymbol{b}$ is given by

$$d_\phi(\boldsymbol{a},\boldsymbol{b}) = \int_0^1 \Big(\sum_{i=1}^{n} \phi''\big(x_i(t)\big)\,\dot{x}_i(t)^2\Big)^{1/2} dt. \tag{2.5}$$

It takes a simple computation to verify that

$$d_\phi(\boldsymbol{a},\boldsymbol{b}) = \Big(\sum_{i=1}^{n} \big(\Phi(a_i)-\Phi(b_i)\big)^2\Big)^{1/2} = \big\|\Phi(\boldsymbol{a})-\Phi(\boldsymbol{b})\big\|. \tag{2.6}$$
So as not to introduce more notation, we shall use the symbol $\Phi$ as well for the map $\mathcal{D}\to\mathbb{R}^n$ defined componentwise by $\Phi(\boldsymbol{x})_i = \Phi(x_i)$. Notice that $\Phi$ is an isometry between $\mathcal{D}$ and its image in $\mathbb{R}^n$, when the distance in the former is $d_\phi$ and in the latter is the Euclidean distance. Therefore geometric properties in $\mathbb{R}^n$ have a counterpart in $\mathcal{D}$.

Observe as well that the special form of (2.4) and (2.6) allows us to represent the middle point between $\boldsymbol{a}$ and $\boldsymbol{b}$ easily. As a matter of fact, we have

Lemma 2.1

With the notations introduced above, observe that if we put $\boldsymbol{z} = \Psi\big((\Phi(\boldsymbol{a})+\Phi(\boldsymbol{b}))/2\big)$ (componentwise), then $d_\phi(\boldsymbol{a},\boldsymbol{z}) = d_\phi(\boldsymbol{z},\boldsymbol{b}) = \tfrac{1}{2}\,d_\phi(\boldsymbol{a},\boldsymbol{b})$.
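Because the geodesic distance is the Euclidean distance after the change of coordinates $\Phi$ (cf. (2.6)), both the distance and the midpoint of Lemma 2.1 reduce to Euclidean operations. A sketch for the logarithmic case $\phi(x) = -\ln x$, where $\Phi = \ln$ and $\Psi = \exp$ (the helper names are ours):

```python
import numpy as np

# Logarithmic case phi(x) = -ln x on (0, inf): phi''(x) = 1/x^2, so Phi = log, Psi = exp.
Phi, Psi = np.log, np.exp

def geodesic_distance(a, b):
    """d_phi(a, b) = || Phi(a) - Phi(b) ||, cf. (2.6)."""
    return float(np.linalg.norm(Phi(np.asarray(a, float)) - Phi(np.asarray(b, float))))

def midpoint(a, b):
    """Geodesic midpoint z = Psi((Phi(a) + Phi(b)) / 2), cf. Lemma 2.1."""
    return Psi(0.5 * (Phi(np.asarray(a, float)) + Phi(np.asarray(b, float))))

a, b = np.array([1.0, 4.0]), np.array([4.0, 1.0])
z = midpoint(a, b)  # componentwise geometric mean of a and b
```

The midpoint is equidistant from both endpoints and lies at half the total distance, as Lemma 2.1 states.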

2.1 Comparison of Bregman and Geodesic distances

Here we shall examine the relationship between the $\phi$-divergence and the distance induced by (2.6). Observe, to begin with, that for any three times continuously differentiable $\phi$ we have $\phi'(b)-\phi'(a) = \int_a^b \phi''(s)\,ds$. Applying this once more under the integral sign, and rearranging a bit, we obtain (say, for $a < b$)

$$\delta_\phi^2(b,a) = 2\big(\phi(b)-\phi(a)-\phi'(a)(b-a)\big) = 2\int_a^b (b-s)\,\phi''(s)\,ds. \tag{2.7}$$

Notice that the left hand side is the building block of the $\phi$-divergence. To make the distance (2.6) appear on the right hand side of (2.7), we rewrite it as follows. Use the fact that $\phi''(s) = \big(\Phi'(s)\big)^2$ and invoke the previous identity applied to $\Phi$ to obtain

$$\delta_\phi^2(b,a) = 2\int_a^b (b-s)\,\Phi'(s)\,\Phi'(s)\,ds.$$

Notice now that

$$\big(\Phi(b)-\Phi(a)\big)^2 = 2\int_a^b \big(\Phi(b)-\Phi(s)\big)\,\Phi'(s)\,ds, \qquad \Phi(b)-\Phi(s) = \int_s^b \Phi'(u)\,du.$$

With this, it is clear that the comparison between $\delta_\phi^2(b,a)$ and $\big(\Phi(b)-\Phi(a)\big)^2$ reduces to the comparison between $(b-s)\,\Phi'(s)$ and $\int_s^b \Phi'(u)\,du$, which is determined by the monotonicity of $\Phi'$.

We can use the previous comments to complete the proof of the following result.

Theorem 2.1

With the notations introduced above, suppose furthermore that $\phi'''$ (and therefore $\Phi''$) has a constant sign. Then, for all $\boldsymbol{a},\boldsymbol{b}\in\mathcal{D}$,

$$\text{if } \phi''' \geq 0: \quad \delta_\phi^2(\boldsymbol{b},\boldsymbol{a}) \leq d_\phi^2(\boldsymbol{a},\boldsymbol{b}), \tag{2.8}$$

$$\text{if } \phi''' \leq 0: \quad d_\phi^2(\boldsymbol{a},\boldsymbol{b}) \leq \delta_\phi^2(\boldsymbol{b},\boldsymbol{a}). \tag{2.9}$$
This means that, for example, when (2.9) holds, a minimizer with respect to the geodesic distance yields a smaller approximation error than the corresponding minimizer with respect to the divergence. The inequalities in Theorem 2.1 lead to the following result.

Theorem 2.2

Let $K$ be a set of points in $\mathcal{D}$, let $\boldsymbol{a}\in\mathcal{D}$, and let $\hat{\boldsymbol{b}}_\delta$ and $\hat{\boldsymbol{b}}_d$ respectively denote the points of $K$ closest to $\boldsymbol{a}$ in $\phi$-divergence and in geodesic distance. Then, for example, when (2.9) holds,

$$d_\phi^2(\boldsymbol{a},\hat{\boldsymbol{b}}_d) \leq \delta_\phi^2(\hat{\boldsymbol{b}}_\delta,\boldsymbol{a}).$$

Proof If (2.9) holds, then $d_\phi^2(\boldsymbol{a},\boldsymbol{b}) \leq \delta_\phi^2(\boldsymbol{b},\boldsymbol{a})$ for any $\boldsymbol{b}\in K$. Therefore, to begin with, since $\hat{\boldsymbol{b}}_\delta$ minimizes the right hand side, we have $d_\phi^2(\boldsymbol{a},\hat{\boldsymbol{b}}_\delta) \leq \delta_\phi^2(\hat{\boldsymbol{b}}_\delta,\boldsymbol{a})$. Now, minimizing with respect to $\boldsymbol{b}$ on the left hand side of this inequality, we obtain the desired result.

That is, the approximation error is smaller for the minimizer computed with the geodesic distance than for that computed with the divergence. We postpone the explicit computation of such minimizers to Section 4, where we show how to compute sample estimators.

Comment Note that we can think of (2.7) as a way to construct a convex function starting from its second derivative. What the previous result asserts is that if we start from a positive but strictly decreasing second derivative, we generate a divergence satisfying (2.9), whereas if we start from a positive and strictly increasing second derivative, we generate a divergence satisfying (2.8). This is why we included the second and third examples: even though they would seem to be related by a simple reflection at the origin, their predictive properties are different.

Note that when $\phi'''$ is identically zero, as in the first example of the list in Table 1, the two distances coincide. This example is the first case treated in the examples described below. The other examples are standard examples used to define Bregman divergences.

Note as well that when $\phi''' < 0$ the derived distance has smaller prediction error than the prediction error in divergence, whereas when $\phi''' > 0$ the prediction error in divergence is smaller than the prediction error in its derived distance. And we already noted that for $\phi''' = 0$ both coincide. But comparing the $d_\phi$-metric with the Euclidean metric does not seem an easy task.

2.2 Examples of distances related to a Bregman divergence

2.2.1 Case 1: $\phi(x) = x^2/2$

In this case $\phi''(x) = 1$ and $\Phi(x) = x$. The geodesics are the straight lines in $\mathbb{R}^n$ and the induced distance is the standard Euclidean distance $d_\phi(\boldsymbol{a},\boldsymbol{b}) = \|\boldsymbol{a}-\boldsymbol{b}\|$.

2.2.2 Case 2: $\phi(x) = e^{x}$

Now $\phi''(x) = e^{x}$ and $\Phi(x) = 2e^{x/2}$. The solution to the geodesic equation (2.2) is given by $\Phi(x_i(t)) = \Phi(a_i) + t\big(\Phi(b_i)-\Phi(a_i)\big)$, and therefore $x_i(t) = 2\ln\big(e^{a_i/2} + t(e^{b_i/2}-e^{a_i/2})\big)$. The geodesic distance between $\boldsymbol{a}$ and $\boldsymbol{b}$ is given by $d_\phi(\boldsymbol{a},\boldsymbol{b}) = 2\big\|e^{\boldsymbol{a}/2}-e^{\boldsymbol{b}/2}\big\|$, all operations being understood componentwise.

2.2.3 Case 3: $\phi(x) = e^{-x}$

Now $\phi''(x) = e^{-x}$ but $\phi'''(x) = -e^{-x} < 0$. The solution to the geodesic equation (2.2) is given by $\Phi(x_i(t)) = \Phi(a_i) + t\big(\Phi(b_i)-\Phi(a_i)\big)$ with $\Phi(x) = -2e^{-x/2}$, and therefore $x_i(t) = -2\ln\big(e^{-a_i/2} + t(e^{-b_i/2}-e^{-a_i/2})\big)$. The geodesic distance between $\boldsymbol{a}$ and $\boldsymbol{b}$ is given by $d_\phi(\boldsymbol{a},\boldsymbol{b}) = 2\big\|e^{-\boldsymbol{a}/2}-e^{-\boldsymbol{b}/2}\big\|$.

2.2.4 Case 4: $\phi(x) = x\ln x - x$

This time our domain is $(0,\infty)^n$, and $\phi''(x) = 1/x$ whereas $\phi'''(x) = -1/x^2 < 0$. The solution to the geodesic equation (2.2) is given by $x_i(t) = \big(\sqrt{a_i} + c_i t\big)^2$, where $c_i = \sqrt{b_i}-\sqrt{a_i}$. Therefore, the geodesic distance between $\boldsymbol{a}$ and $\boldsymbol{b}$ is

$$d_\phi(\boldsymbol{a},\boldsymbol{b}) = 2\big\|\sqrt{\boldsymbol{a}}-\sqrt{\boldsymbol{b}}\big\|.$$

This looks similar to the Hellinger distance used in probability theory; see Pollard [23].


2.2.5 Case 5: $\phi(x) = -\ln x$

To finish, we shall consider another example on $(0,\infty)^n$. Now $\phi''(x) = 1/x^2$ and $\Phi(x) = \ln x$. The geodesics turn out to be given by $x_i(t) = a_i e^{c_i t}$, where $c_i = \ln(b_i/a_i)$, which yields the representation $x_i(t) = a_i^{1-t} b_i^{t}$. Recall that all operations are to be understood componentwise ($\mathbb{R}^n$-vectors are functions on $\{1,\dots,n\}$). The distance between $\boldsymbol{a}$ and $\boldsymbol{b}$ is now given by

$$d_\phi(\boldsymbol{a},\boldsymbol{b}) = \big\|\ln\boldsymbol{a} - \ln\boldsymbol{b}\big\|.$$
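The closed forms above can be checked numerically: in each case the Riemannian length of the segment joining two points equals $|\Phi(b)-\Phi(a)|$, with $\Phi$ a primitive of $\sqrt{\phi''}$. In the sketch below the names are ours, and the assignment of generating functions to the five cases is our reading of Table 1:

```python
import numpy as np

# For each case of Table 1 we list (phi''(x), Phi(x)), where Phi is a primitive
# of sqrt(phi''); the dictionary keys are the generating convex functions phi.
cases = {
    "x^2/2":      (lambda x: np.ones_like(x), lambda x: x),
    "e^x":        (lambda x: np.exp(x),       lambda x: 2.0 * np.exp(x / 2.0)),
    "e^{-x}":     (lambda x: np.exp(-x),      lambda x: -2.0 * np.exp(-x / 2.0)),
    "x ln x - x": (lambda x: 1.0 / x,         lambda x: 2.0 * np.sqrt(x)),
    "-ln x":      (lambda x: 1.0 / x**2,      lambda x: np.log(x)),
}

def arc_length(phi2, a, b, n=200_001):
    """Riemannian length of the segment joining a and b: the integral of
    sqrt(phi''(x(t))) |x'(t)| dt; in one dimension this equals the geodesic
    distance, since length does not depend on the parametrization."""
    t = np.linspace(0.0, 1.0, n)
    x = a + t * (b - a)
    f = np.sqrt(phi2(x)) * abs(b - a)
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * (t[1] - t[0]))  # trapezoid rule

a, b = 1.0, 2.0  # a and b lie in every domain used above
checks = {k: (arc_length(p2, a, b), abs(P(b) - P(a))) for k, (p2, P) in cases.items()}
```

For each case the numerically integrated length agrees with the closed-form distance to the quadrature accuracy.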

2.3 The semi-parallelogram law of the geodesic distances

As a consequence of Lemma 2.1 and the way the geodesic distances are related to the Euclidean distance through a bijection, we have the following result:

Theorem 2.3

With the notations introduced in the examples listed above, the sets $M^n$ with the corresponding geodesic distances satisfy the semi-parallelogram law. That is, in all the cases considered, for any $\boldsymbol{x}_1, \boldsymbol{x}_2$ there exists a $\boldsymbol{z}$, obtained as in Lemma 2.1, such that for any $\boldsymbol{x}$ we have

$$d_\phi^2(\boldsymbol{x}_1,\boldsymbol{x}_2) + 4\,d_\phi^2(\boldsymbol{x},\boldsymbol{z}) \leq 2\,d_\phi^2(\boldsymbol{x},\boldsymbol{x}_1) + 2\,d_\phi^2(\boldsymbol{x},\boldsymbol{x}_2).$$

That is, for separable Bregman divergences, the induced Riemannian geometry is a Tits-Bruhat geometry. The semi-parallelogram property is handy in proofs of uniqueness.

3 $\phi$-conditional expectations related to Riemannian metrics derived from a Bregman divergence

As we do not have a distinguished point in $\mathcal{D}$ which is the identity with respect to a commutative operation on $\mathcal{D}$, in order to define a squared norm for $\mathcal{D}$-valued random variables we begin by introducing the following notation.

Definition 3.1

We shall say that a $\mathcal{D}$-valued random variable $\boldsymbol{X}$ is integrable or square integrable, and write $\boldsymbol{X}\in L^i_\phi(\mathcal{D})$ (for $i=1,2$), whenever

$$E\big[d_\phi(\boldsymbol{X},\boldsymbol{a})^i\big] < \infty$$

for some $\boldsymbol{a}\in\mathcal{D}$. It is clear from the triangular inequality that this definition is independent of $\boldsymbol{a}$.

But more important is the following simple result.

Lemma 3.1

With the notations introduced above, from (2.6) it follows that $\boldsymbol{X}\in L^2_\phi(\mathcal{D})$ is equivalent to $\Phi(\boldsymbol{X})\in L^2(\mathcal{F})$.

With identity (2.6) in mind, it is clear that the distance on $\mathcal{D}$ extends to a distance between random variables by

$$d_\phi(\boldsymbol{X},\boldsymbol{Y})^2 = E\Big[\big\|\Phi(\boldsymbol{X})-\Phi(\boldsymbol{Y})\big\|^2\Big].$$
Now that we have this definition in place, the extension of Theorem 2.1 to this case can be stated as follows.

Theorem 3.1

For any pair of $\mathcal{D}$-valued random variables $\boldsymbol{X},\boldsymbol{Y}$ such that the quantities written below are finite, we have:

$$\text{if } \phi''' \geq 0: \quad \delta_\phi^2(\boldsymbol{X},\boldsymbol{Y}) \leq d_\phi^2(\boldsymbol{X},\boldsymbol{Y}), \tag{3.2}$$

$$\text{if } \phi''' \leq 0: \quad d_\phi^2(\boldsymbol{X},\boldsymbol{Y}) \leq \delta_\phi^2(\boldsymbol{X},\boldsymbol{Y}). \tag{3.3}$$
We can now move on to the determination of best predictors in the $d_\phi$ distance.

Theorem 3.2

Let $(\Omega,\mathcal{F},P)$ be a probability space and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Let $\boldsymbol{X}$ be a $\mathcal{D}$-valued random variable such that $\Phi(\boldsymbol{X})$ is square integrable. Then

$$E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = \Phi^{-1}\big(E[\Phi(\boldsymbol{X})\,|\,\mathcal{G}]\big).$$

Keep in mind that both $\Phi$ and its inverse act componentwise. This theorem has a curious corollary, to wit:

Corollary 3.1 (Intertwining)

With the notations in the statement of the last theorem, we have

$$\Phi\big(E_\phi[\boldsymbol{X}\,|\,\mathcal{G}]\big) = E\big[\Phi(\boldsymbol{X})\,\big|\,\mathcal{G}\big].$$
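When the conditioning $\sigma$-algebra is generated by a discrete label, conditional expectation is just an average over each atom, so Theorem 3.2 can be sketched directly. The following illustrates the logarithmic metric; the function and variable names are ours:

```python
import numpy as np

def phi_cond_exp(x, labels, Phi=np.log, Psi=np.exp):
    """E_phi[X | G] = Psi(E[Phi(X) | G]) for the sigma-algebra generated by
    `labels`; conditioning on a discrete label is averaging over each atom."""
    x, labels = np.asarray(x, float), np.asarray(labels)
    out = np.empty_like(x)
    for g in np.unique(labels):
        atom = labels == g
        out[atom] = Psi(np.mean(Phi(x[atom])))  # constant on each atom of G
    return out

x = np.array([1.0, 4.0, 9.0, 9.0])
labels = np.array([0, 0, 1, 1])
pred = phi_cond_exp(x, labels)  # geometric mean on each atom
```

The intertwining property of Corollary 3.1 is visible here: taking logarithms of the output gives the ordinary conditional expectation of $\ln \boldsymbol{X}$.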

3.1 Comparison of prediction errors

As a corollary of Theorem 3.1, we can compare the prediction errors in the $d_\phi$-metric and in divergence.

Theorem 3.3

With the notations of Theorem 3.1, when $\phi''' \leq 0$ we have

$$d_\phi^2\big(\boldsymbol{X},\, E_\phi[\boldsymbol{X}\,|\,\mathcal{G}]\big) \leq \delta_\phi^2\big(\boldsymbol{X},\, E[\boldsymbol{X}\,|\,\mathcal{G}]\big),$$

and the reverse inequality holds when $\phi''' \geq 0$.

The proof is simple. For the first case, say, begin with (3.3): the inequality holds for any $\mathcal{G}$-measurable $\boldsymbol{Y}$ with the appropriate integrability, and the right hand side is minimized by replacing $\boldsymbol{Y}$ with $E[\boldsymbol{X}\,|\,\mathcal{G}]$. Now minimize the left hand side of the last inequality with respect to $\boldsymbol{Y}$ to obtain the desired conclusion.

3.2 Examples of conditional expectations

Even though the contents of the next table are obvious, they are worth recording. There we display the appearance of the conditional expectations of a $\mathcal{D}$-valued random variable in the metrics derived from the divergences listed in Table 1.

$\phi(x) = x^2/2$, domain $\mathbb{R}^n$: $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = E[\boldsymbol{X}\,|\,\mathcal{G}]$
$\phi(x) = e^{x}$, domain $\mathbb{R}^n$: $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = 2\ln E\big[e^{\boldsymbol{X}/2}\,\big|\,\mathcal{G}\big]$
$\phi(x) = e^{-x}$, domain $\mathbb{R}^n$: $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = -2\ln E\big[e^{-\boldsymbol{X}/2}\,\big|\,\mathcal{G}\big]$
$\phi(x) = x\ln x - x$, domain $(0,\infty)^n$: $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = \big(E[\sqrt{\boldsymbol{X}}\,|\,\mathcal{G}]\big)^2$
$\phi(x) = -\ln x$, domain $(0,\infty)^n$: $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = \exp\big(E[\ln\boldsymbol{X}\,|\,\mathcal{G}]\big)$

Table 2: Conditional expected values in the metric derived from each divergence

The only other information that we have about $\mathcal{D}$ in this context is that it is a convex set in $\mathbb{R}^n$. But we do not know if it is closed with respect to any group operation. In this regard, see Section 5.1. Thus the only properties of the conditional expectations that we can verify at this point are those that depend only on its definition, and on the corresponding property of the ordinary conditional expectation with respect to $\mathcal{G}$.

Theorem 3.4

With the notations introduced in the previous result, and assuming that all variables mentioned are $d_\phi$-integrable, we have:
1) Let $\mathcal{G} = \{\emptyset,\Omega\}$ be the trivial $\sigma$-algebra; then $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = E_\phi[\boldsymbol{X}]$.
2) Let $\mathcal{G}_1 \subset \mathcal{G}_2$ be two sub-$\sigma$-algebras of $\mathcal{F}$; then $E_\phi\big[E_\phi[\boldsymbol{X}\,|\,\mathcal{G}_2]\,\big|\,\mathcal{G}_1\big] = E_\phi[\boldsymbol{X}\,|\,\mathcal{G}_1]$. 3) If $\boldsymbol{X}$ is $\mathcal{G}$-measurable, then $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = \boldsymbol{X}$.
As both $\Phi$ and $\Phi^{-1}$ are defined componentwise, and are increasing, we can also verify the monotonicity properties of the conditional expectations.
4) Let $\boldsymbol{X} \leq \boldsymbol{Y}$ componentwise; then $E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] \leq E_\phi[\boldsymbol{Y}\,|\,\mathcal{G}]$.
We do not necessarily have a zero vector in $\mathcal{D}$, but a monotone convergence property may be stated as:
5) Let $\{\boldsymbol{X}_n\}$ be a sequence of $\mathcal{D}$-valued random variables increasing to $\boldsymbol{X}$, and suppose that there exists a $\mathcal{G}$-measurable $\boldsymbol{Y}$ with $\boldsymbol{Y} \leq \boldsymbol{X}_n$ for all $n$, and that $\boldsymbol{X}$ is $d_\phi$-integrable. Then $E_\phi[\boldsymbol{X}_n\,|\,\mathcal{G}] \uparrow E_\phi[\boldsymbol{X}\,|\,\mathcal{G}]$.

3.3 A simple application

Let us consider the following two strictly positive random variables (that is, $S_1 > 0$ and $S_2 > 0$):

$$S_1 = S_0\,e^{X}, \qquad S_2 = S_0\,e^{X+Y},$$

where $X$ and $Y$ are two Gaussian, $\rho$-correlated random variables, with $X \sim N(\mu_1,\sigma_1^2)$ and $Y \sim N(\mu_2,\sigma_2^2)$. It is a textbook exercise to verify that

$$E[Y\,|\,X] = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}\,(X-\mu_1), \qquad \mathrm{Var}(Y\,|\,X) = (1-\rho^2)\,\sigma_2^2.$$

If we consider the logarithmic distance on $(0,\infty)$, an application of the results in the previous section, taking into account that $S_1$ and $X$ generate the same $\sigma$-algebra (call it $\mathcal{G}$), yields

$$E_\phi[S_2\,|\,\mathcal{G}] = e^{E[\ln S_2\,|\,\mathcal{G}]} = S_1\,e^{m(X)},$$

where we put $m(X) = E[Y\,|\,X]$. For comparison, note that the predictor in the Euclidean distance is given by

$$E[S_2\,|\,\mathcal{G}] = S_1\,E\big[e^{Y}\,\big|\,X\big] = S_1\,e^{m(X) + \frac{1}{2}(1-\rho^2)\sigma_2^2}.$$

According to Theorem 3.3, the former is better than the latter because its variance is smaller.

A possible interpretation of this example goes as follows. We might think of $S_0$, $S_1$ and $S_2$ as the price of an asset today, tomorrow and the day after tomorrow, and $X$ and $Y$ might be thought of as the daily logarithmic returns. We want a predictor of the price of the asset two days from now, given that we observe the price tomorrow. Then $E[S_2\,|\,\mathcal{G}]$ gives us the standard estimator, whereas $E_\phi[S_2\,|\,\mathcal{G}]$ gives us the estimator in logarithmic distance.
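The comparison can be checked by simulation. The sketch below uses our own parameter choices ($S_0 = 1$ and illustrative means, volatilities and correlation) and measures both predictors' errors in the logarithmic metric:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
mu1, s1, mu2, s2, rho = 0.0, 0.2, 0.0, 0.3, 0.5   # illustrative parameters

# rho-correlated Gaussian log-returns X and Y
X = mu1 + s1 * rng.standard_normal(n)
Y = mu2 + s2 * (rho * (X - mu1) / s1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n))

S1, S2 = np.exp(X), np.exp(X + Y)        # prices tomorrow and the day after (S0 = 1)
m = mu2 + rho * (s2 / s1) * (X - mu1)    # E[Y | X], the textbook Gaussian formula

pred_log = S1 * np.exp(m)                                  # predictor in logarithmic distance
pred_l2 = S1 * np.exp(m + 0.5 * (1.0 - rho**2) * s2**2)    # Euclidean (L2) predictor

# prediction errors measured in the logarithmic metric
err_log = np.mean(np.log(S2 / pred_log) ** 2)
err_l2 = np.mean(np.log(S2 / pred_l2) ** 2)
```

Here err_log estimates the conditional variance $(1-\rho^2)\sigma_2^2$, while err_l2 exceeds it by the square of the convexity correction, in line with Theorem 3.3.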

4 Sample estimation in the Riemannian metric derived from a Bregman divergence

In this section we address the issue of sample estimation of the expected values in the $d_\phi$ metric. That is, how to estimate

$$E_\phi[\boldsymbol{X}] = \Phi^{-1}\big(E[\Phi(\boldsymbol{X})]\big)$$

when all that we have is a sample $\{\boldsymbol{x}_1,\dots,\boldsymbol{x}_N\}$ of $\boldsymbol{X}$. The sample estimator is defined to be the point that minimizes the aggregate distance (“cost” function)

$$C(\boldsymbol{x}) = \sum_{k=1}^{N} d_\phi^2(\boldsymbol{x}_k,\boldsymbol{x}) \tag{4.1}$$

when $\boldsymbol{x}$ ranges over $\mathcal{D}$. Clearly, for the geodesic distance computed in (2.6) the minimizer is easy to compute. Again, as $\Phi$ and $\Phi^{-1}$ are bijections, we have

$$\hat{\boldsymbol{x}}_N = \Phi^{-1}\Big(\frac{1}{N}\sum_{k=1}^{N}\Phi(\boldsymbol{x}_k)\Big). \tag{4.2}$$

Recall that this identity is to be understood componentwise, that is, both sides are vectors in $\mathbb{R}^n$. Certainly, $\phi$-mean (of the set $\{\boldsymbol{x}_1,\dots,\boldsymbol{x}_N\}$) is a good name for $\hat{\boldsymbol{x}}_N$. Given the special form (4.1) of the cost function, it is clear that (4.2) defines an unbiased estimator of the $\phi$-mean. At this point we mention that we leave it as an exercise for the reader to use the semi-parallelogram law to verify the uniqueness of the minimizer of the distance to a set of points given by (4.2).
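In coordinates, (4.2) says that the sample $\phi$-mean is the ordinary average taken after applying $\Phi$ and then mapped back with $\Phi^{-1}$. A short sketch (names ours), shown for two of the metrics above:

```python
import numpy as np

def phi_mean(sample, Phi, Psi):
    """Sample phi-mean, cf. (4.2): Psi applied to the componentwise average of
    Phi over the sample (rows of `sample` are observations)."""
    return Psi(np.mean(Phi(np.asarray(sample, float)), axis=0))

sample = np.array([[1.0, 2.0],
                   [4.0, 8.0]])

# logarithmic metric: the phi-mean is the componentwise geometric mean
m_log = phi_mean(sample, np.log, np.exp)

# Hellinger-type metric (Case 4 above): Phi(x) = 2 sqrt(x), Psi(y) = (y / 2)^2
m_sqrt = phi_mean(sample, lambda x: 2.0 * np.sqrt(x), lambda y: (y / 2.0) ** 2)
```

Different metrics produce different sample means from the same data, which is the point of the comparison results of Section 2.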

The real worth of (4.2) shows up in the proof of the law of large numbers. But first, we need to note that the error incurred when estimating $E_\phi[\boldsymbol{X}]$, that is, the $\phi$-variance of $\boldsymbol{X}$, is

$$\sigma_\phi^2(\boldsymbol{X}) = E\big[d_\phi^2\big(\boldsymbol{X}, E_\phi[\boldsymbol{X}]\big)\big] = E\Big[\big\|\Phi(\boldsymbol{X}) - E[\Phi(\boldsymbol{X})]\big\|^2\Big].$$

In this case, as with the standard proof of the weak law of large numbers, we have

Theorem 4.1

Suppose that $\{\boldsymbol{X}_k : k \geq 1\}$ is an i.i.d. sequence of $\mathcal{D}$-valued random variables that have finite $\phi$-variance. Then the estimator (4.2) converges to $E_\phi[\boldsymbol{X}_1]$ in probability as $N \to \infty$.

Proceeding as in the case of Euclidean geometry, we have

Theorem 4.2

With the same notations and assumptions as in the previous result, define the estimator of the $\phi$-variance by

$$\hat{\sigma}_{\phi,N}^2 = \frac{1}{N-1}\sum_{k=1}^{N}\Big\|\Phi(\boldsymbol{x}_k) - \frac{1}{N}\sum_{j=1}^{N}\Phi(\boldsymbol{x}_j)\Big\|^2.$$

Then $\hat{\sigma}_{\phi,N}^2$ is an unbiased estimator of the $\phi$-variance $\sigma_\phi^2(\boldsymbol{X}_1)$.

Comment: Observe that the estimator of the $\phi$-variance is a positive, real random variable, so its expected value is the standard expected value.

5 Arithmetic properties of the expectation operation

When there is a commutative group operation on $\mathcal{D}$ that leaves the metric invariant, the best predictors have additional properties. The two standard examples that we have in mind are $\mathcal{D}=\mathbb{R}^n$ with the group operation being the standard addition of vectors, and $\mathcal{D}=(0,\infty)^n$ with the group operation being the componentwise multiplication. For definiteness, let us denote the group operation by $\boldsymbol{x}\circ\boldsymbol{y}$ and the inverse of $\boldsymbol{x}$ with respect to that operation by $\boldsymbol{x}^{-1}$, and let $\boldsymbol{e}$ denote the identity for that operation. That the distance is invariant (or that the group operation acts by isometries) means that

$$d_\phi(\boldsymbol{a}\circ\boldsymbol{x},\,\boldsymbol{a}\circ\boldsymbol{y}) = d_\phi(\boldsymbol{x},\boldsymbol{y}) \quad \text{for all } \boldsymbol{a},\boldsymbol{x},\boldsymbol{y}\in\mathcal{D}.$$

Some simple consequences of this fact are the following. To begin with, we can define a norm derived from the distance by $\|\boldsymbol{x}\|_\phi = d_\phi(\boldsymbol{e},\boldsymbol{x})$. We leave it up to the reader to verify that, in this notation, the triangle inequality for $d_\phi$ becomes $\|\boldsymbol{x}\circ\boldsymbol{y}\|_\phi \leq \|\boldsymbol{x}\|_\phi + \|\boldsymbol{y}\|_\phi$, and that this implies that $\big|\,\|\boldsymbol{x}\|_\phi - \|\boldsymbol{y}\|_\phi\,\big| \leq d_\phi(\boldsymbol{x},\boldsymbol{y})$.

Let us now examine two examples of the situation described above. For the first example in Table 1, in which the conditional expectation in divergence and in the distance derived from it coincide, we know that the conditional expectation is linear. In the last example in Table 1, the analogue of multiplication by a scalar is the (componentwise) exponentiation. In this case, we saw that the conditional expectation of a strictly positive random variable with respect to a $\sigma$-algebra $\mathcal{G}$ is

$$E_\phi[\boldsymbol{X}\,|\,\mathcal{G}] = \exp\big(E[\ln\boldsymbol{X}\,|\,\mathcal{G}]\big).$$

It is easy to verify, and it is proved in [17], that

Theorem 5.1

Let $\boldsymbol{X}$ and $\boldsymbol{Y}$ be two $(0,\infty)^n$-valued random variables which are integrable in the logarithmic metric, and let $a$ and $b$ be two real numbers; then

$$E_\phi\big[\boldsymbol{X}^{a}\,\boldsymbol{Y}^{b}\,\big|\,\mathcal{G}\big] = \big(E_\phi[\boldsymbol{X}\,|\,\mathcal{G}]\big)^{a}\,\big(E_\phi[\boldsymbol{Y}\,|\,\mathcal{G}]\big)^{b}.$$
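For the logarithmic metric this multiplicativity is an immediate consequence of the intertwining with $\ln$ and $\exp$. Numerically (with hypothetical lognormal samples, and taking the trivial $\sigma$-algebra so that the conditional expectations are plain $\phi$-means):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.lognormal(0.0, 1.0, 1000)   # hypothetical strictly positive samples
Y = rng.lognormal(0.5, 0.3, 1000)
a_exp, b_exp = 2.0, -1.0            # the two real exponents of Theorem 5.1

E_phi = lambda Z: np.exp(np.mean(np.log(Z)))   # phi-mean in the logarithmic metric

lhs = E_phi(X**a_exp * Y**b_exp)
rhs = E_phi(X)**a_exp * E_phi(Y)**b_exp        # multiplicativity, cf. Theorem 5.1
```

The two sides agree up to floating-point rounding, since both reduce to the same linear combination of the averaged logarithms.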

6 Concluding comments

6.1 General comments about prediction

A predictive procedure involves several aspects: To begin with, we have to specify the nature of the set in which the random variables of interest take values and the class of predictors that we are interested in. Next comes the criterion, cost function or error function used to quantify the “betterness” of a predictor, and finally, we need some way to decide on the uniqueness of the best predictor.

We mentioned at the outset that, using the notion of divergence function, there exists a notion of best predictor for random variables taking values in convex subsets of some $\mathbb{R}^n$, which, somewhat surprisingly, coincides with the standard least squares best predictor. The fact that a notion of best predictor exists in the Riemannian metric on $\mathcal{D}$ derived from a divergence function suggests the possibility of extending the notion of best predictor to Tits-Bruhat spaces. These are complete metric spaces whose metric satisfies the semi-parallelogram law stated in Theorem 2.3. Using the completeness of the space, the notion of “mean” of a finite set, as the point that minimizes the sum of the squares of the distances to the points of the set, or that of best predictor, are easy to establish. And using the semi-parallelogram law, the uniqueness of the best predictor can be established.

The best predictors can be seen to have some of the properties of conditional expectations, except those that depend on the underlying vector space structure of $\mathcal{D}$, like Jensen’s inequality and the “linearity” of the best predictor.

6.2 Other remarks

In some cases it is interesting to consider the Legendre-Fenchel duals of the convex function generating the divergence; see [4], [9] or [22] for example. The Bregman divergences induce a dually flat space, and conversely, we can associate a canonical Bregman divergence to a dually flat space. (Thanks to Frank Nielsen for this remark.) The derived metric in this case is the (algebraic) inverse of the original metric, and it generates the same distance; see [1] for this. Therefore the same comparison results hold true in this case as well.

As remarked at the end of Section 2.1, to compare the derived metrics to the standard Euclidean metric, and therefore to compare the prediction errors (or the $\phi$-variance to the standard variance of a random variable), does not seem to be an easy task. This is a pending issue to be settled.

We saw that the set in which the random variables of interest take values may be equipped with more than one distance. The results presented above open up the door to the following conceptual (or methodological) question: which is the correct distance to be used to make predictions about $\mathcal{D}$-valued random variables?

Another pending issue corresponds to the general case in which $\phi$ is not of the type (1.2). In this case, by suitable localization we might reproduce the results of Section 2 locally. The problem is to paste together the local representations of the geodesics and the rest of the constructions.

We saw as well that when there is no algebraic structure upon $\mathcal{D}$, some properties of the estimators are related only to the metric properties of the space, while when there is a commutative operation on $\mathcal{D}$, the best estimators have further algebraic properties. In reference to the examples in Section 2, an interesting question is which metrics admit a commutative group operation that leaves them invariant.

7 Appendix: Integration of the geodesic equations

Consider (2.2), that is,

$$\ddot{x}_i + \frac{1}{2}\,\frac{\phi'''(x_i)}{\phi''(x_i)}\,\dot{x}_i^2 = 0.$$

This is the Euler-Lagrange equation of the Lagrangian $L(\boldsymbol{x},\dot{\boldsymbol{x}}) = \sum_i \phi''(x_i)\,\dot{x}_i^2$, where we put $\dot{x}_i = dx_i/dt$. Notice now that if we make the change of variables $y_i = \Phi(x_i)$, where $\Phi' = \sqrt{\phi''}$, in the new coordinates we can write the Lagrangian function as $L = \sum_i \dot{y}_i^2$. In these new coordinates the geodesics are straight lines, $y_i(t) = y_i(0) + c_i t$.

If at $t=0$ the geodesic starts at $a_i$ (or at $\Phi(a_i)$) and at $t=1$ it is at $b_i$ (or at $\Phi(b_i)$), we obtain $c_i = \Phi(b_i)-\Phi(a_i)$, and therefore

$$x_i(t) = \Phi^{-1}\Big(\Phi(a_i) + t\big(\Phi(b_i)-\Phi(a_i)\big)\Big).$$

Acknowledgment I want to thank Frank Nielsen for his comments and suggestions on the first draft of this note.


  • [1] Amari, S and Cichocki, A. (2010). Information theory of divergence functions, Bull. of the Polish Acad. Sci., 58, 183-195.
  • [2] Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry, Transl. of Mathem. Monographs, 191, Oxford Univ. Press, New York.
  • [3] Banerjee, A., Guo, X. and Wang, H. (2005). On the optimality of conditional expectation as Bregman predictor, IEEE Transactions on Information Theory, 51, 2664 - 2669
  • [4] Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S. and Modha, D.S. (2008). A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation, The Jour. of Machine Learning Research, 8, 1919-1986.
  • [5] Baushke, H.H. and Borwein, J.M. (1997). Legendre Functions and the Method of Random Bregman Projections, Journal of Convex Analysis, 4, 27-67.
  • [6] Baushke, H.H and Lewis, A. (1998). Dykstras algorithm with Bregman projections: A convergence proof, Optimization, 48, 409-427
  • [7] Baushke, H.H., Borwein, J.M and Combettes, P.L. (2003). Bregman monotone optimization algorithms, SIAM J. Control Optim., 42, 596-636.
  • [8] Baushke, H.H and Combettes, P.L. (2003). Construction of best Bregman approximation in reflective Banach spaces, Proc. Amer. Math. Soc., 131, 3757-3766
  • [9] Boissonnat, J-D., Nielsen, F. and Nock, R. (2010). Bregman Voronoi Diagrams: Properties, algorithms and applications, Discrete Comput. Geom., 44, 281-307.
  • [10] Bregman, L. (1967). The relaxation method of finding common points of convex sets and its application to the solution of problems in convex programming, USSR Comp. Math. and Math. Phys., 7, 200-217.
  • [11] Butnariu, D. and Resmerita, E. (2005). Bregman distances, totally convex functions, and a method for solving operator equations, Abstract and Applied Analysis, 2006, Article ID 84919, 1-39.
  • [12] Calin, O. and Udriste, C. (2014). Geometric Modeling in Probability and Statistics, Springer Internl. Pub., Switzerland.
  • [13] Censor, Y and Reich, S. (1998). The Dykstra algorithm with Bregman projections. Comm. Applied Analysis, 2, 407-419.
  • [14] Censor, Y. and Zaknoon, M. (2018). Algorithms and convergence results of projection methods for inconsistent feasibility problems: A review, arXiv:1802.07529v3 [math.OC].
  • [15] Csiszár, I. (2008). Axiomatic characterization of information measures, Entropy, 10, 261-273.
  • [16] Fisher, A. (2010). Quantization and clustering with Bregman divergences, Journal of Multivariate Analysis, 101, 2207-2221.
  • [17] Gzyl, H. (2017) Prediction in logarithmic distance. Available at
  • [18] Lang, S. Math talks for undergraduates, Springer, New York, (1999).
  • [19] Li, C., Song, W. and Yao, J-C. (2010). The Bregman distance, approximate compactness and convexity of Chebyshev sets in Banach spaces, Journal of Approximation Theory, 162, 1128-1149.
  • [20] Lawson, J.D. and Lim, Y. (2001). The Geometric mean, matrices, metrics and more, Amer. Math. Monthly, 108, 797-812.
  • [21] Moakher, M. (2005). A differential geometric approach to the geometric mean of symmetric positive definite matrices, SIAM J. Matrix Anal. Appl., 26, 735-747.
  • [22] Nielsen, F. (2018). An elementary introduction to information geometry. Available at
  • [23] Pollard, D. (2002). A user’s guide to measure theoretic probability, Cambridge Univ. Press., Cambridge.
  • [24] Schwartzman, A. (2015). Lognormal distributions and geometric averages of symmetric positive definite matrices, Int. Stat. Rev., 84, 456-486.