1 Introduction and Preliminaries
In [10], Bregman introduced an iterative procedure to find points in an intersection of convex sets. At each step, the next point in the sequence is obtained by minimizing an objective function that can be described as the vertical distance between the graph of the function and the tangent plane through the previous point. If $K$ is a convex set in some $\mathbb{R}^n$ and $\Phi:K\to\mathbb{R}$ is a strictly convex, continuously differentiable function, the divergence function that it defines is specified by
(1.1) $\delta_\Phi(x,y)=\Phi(x)-\Phi(y)-\langle\nabla\Phi(y),\,x-y\rangle.$
In Bregman’s work, $\Phi$ was taken to be the squared Euclidean norm $\Phi(x)=\|x\|^2$.
The concept was eventually extended, even to the infinite dimensional case, and now plays an important role in many applications: for example, in clustering, classification analysis and machine learning as in Banerjee et al. [3], Boissonnat et al. [9], Banerjee et al. [4] and Fisher [16]. It plays a role in optimization theory as in Bauschke and Borwein [5], Bauschke and Lewis [6], Bauschke and Combettes [8], Censor and Reich [13], Bauschke et al. [7] and Censor and Zaknoon [14]; in the solution of operator equations as in Butnariu and Resmerita [11]; in approximation theory in Banach spaces as in Bauschke and Combettes [8] or Li et al. [19]; and in applications of geometry to statistics and information theory as in Amari and Nagaoka [2], Csiszár [15], Amari and Cichocki [1], Calin and Udrişte [12] or Nielsen [22]. These are just a small sample of the many references to applications of Bregman functions, and the list grows rapidly. It is a well known, and easy to verify, fact that $\delta_\Phi(x,y)\ge 0$, with equality if and only if $x=y$.
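As an illustration (the helper `bregman` and the test points below are ours, not from the paper), the divergence of the squared Euclidean norm reduces to the squared distance, while other choices of $\Phi$ make the asymmetry visible:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # delta_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Bregman's original choice: the squared Euclidean norm.
sq = bregman(lambda v: np.dot(v, v), lambda v: 2.0 * v, x, y)
# For this Phi the divergence equals ||x - y||^2 (= 13 here).

# A different strictly convex Phi shows the asymmetry.
phi_exp = lambda v: np.sum(np.exp(v))
grad_exp = np.exp
d_xy = bregman(phi_exp, grad_exp, x, y)
d_yx = bregman(phi_exp, grad_exp, y, x)
# d_xy and d_yx are both nonnegative but differ, so delta is not a distance.
```
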
Thus our choice of notation is consistent. But as $\delta_\Phi$ is not symmetric, nor satisfies the triangle inequality, it cannot be a distance on $K$. Let now $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space such that $\mathcal{F}$ is complete (contains all sets of zero measure). By $L^1(\mathbb{P})$ and $L^2(\mathbb{P})$ we shall denote the usual classes of integrable or square integrable functions, identified up to sets of measure zero. The notion of divergence can be extended to random variables as follows.
Definition 1.1
Let $X,Y$ be $K$-valued random variables such that $\Phi(X)$, $\Phi(Y)$ and $\langle\nabla\Phi(Y),X-Y\rangle$ are in $L^1(\mathbb{P})$. The divergence between $X$ and $Y$ is defined by $\delta_\Phi(X,Y)=E\big[\Phi(X)-\Phi(Y)-\langle\nabla\Phi(Y),\,X-Y\rangle\big].$
Clearly, $\delta_\Phi(X,Y)$ is neither symmetric nor satisfies the triangle inequality. But, as above, we also have $\delta_\Phi(X,Y)\ge 0$, with equality if and only if $X=Y$ almost surely, so we can think of it as a pseudo-distance, cost or penalty function on the class of such random variables.
The motivation for this work comes from two directions. On the one hand, there is the fact that for Bregman divergences there is a notion of best predictor, and this best predictor happens to be the usual conditional expectation. To put it in symbols
Theorem 1.1
Let $X\in L^2(\mathbb{P})$ be $K$-valued and let $\mathcal{G}\subset\mathcal{F}$ be a sub-$\sigma$-algebra. Then the solution to the problem
$\min\big\{E[\delta_\Phi(X,Y)] : Y\ \mathcal{G}\text{-measurable}\big\}$
is given by $Y=E[X\,|\,\mathcal{G}]$.
For the proof the reader can consult Banerjee et al. [3] or Fisher [16]. The other thread comes from Gzyl [17], where a geometry on the convex cone of strictly positive random variables is considered. That geometry happens to be derivable from a divergence function, and it leads to a host of curious variations on the theme of best predictor, estimation, laws of large numbers and central limit theorems. The geometry considered there is that induced by the logarithmic distance, which makes the cone a Tits–Bruhat space, a special commutative version of the theory explained in Lang [18], Lawson and Lim [20], Moakher [21] and Schwartzman [24]. We should mention that the use of differential geometric methods in [2], or [12] and the many references cited therein, is different from the one described below. They consider geometric structure either on the class of probabilities on a finite set, or on the space of parameters characterizing a (usually exponential) family of distributions. Here we analyze how the geometry on the set in which the random variables take values determines the nature of the standard estimation and prediction process.
From now on we shall suppose that $K=I^n$, where $I$ is a bounded or unbounded interval in $\mathbb{R}$. We shall denote by $\phi:I\to\mathbb{R}$ a strictly convex, three times continuously differentiable function, and define
(1.2) $\Phi(x)=\sum_{i=1}^n\phi(x_i).$
1.1 Some standard examples
In the next table we list five standard examples. The list could be much longer, but these examples were chosen because in some of the cases the distance between random variables associated to the divergence bounds their divergence from above, whereas in the others it is bounded by the divergence from above. The examples are displayed in Table 1.
Domain  $\phi(x)$  $\delta_\phi(x,y)$
$\mathbb{R}$  $x^2/2$  $(x-y)^2/2$
$\mathbb{R}$  $e^{x}$  $e^x-e^y-e^y(x-y)$
$\mathbb{R}$  $e^{-x}$  $e^{-x}-e^{-y}+e^{-y}(x-y)$
$(0,\infty)$  $x\ln x$  $x\ln(x/y)-x+y$
$(0,\infty)$  $-\ln x$  $x/y-\ln(x/y)-1$
Table 1: The five standard examples (one-dimensional building blocks).
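To make the table concrete, here is a small sketch (the labels and sample points are our own) computing, for each example, the map $h$ — a primitive of $\sqrt{\phi''}$, introduced formally in Section 2 — and the induced one-dimensional distance $d(p,q)=|h(q)-h(p)|$:

```python
import numpy as np

# For each example, h is a primitive of sqrt(phi''); the induced
# one-dimensional distance is d(p, q) = |h(q) - h(p)| (see Section 2).
h_maps = {
    "x^2/2":  lambda x: x,                       # phi'' = 1
    "e^x":    lambda x: 2.0 * np.exp(x / 2.0),   # phi'' = e^x
    "e^{-x}": lambda x: -2.0 * np.exp(-x / 2.0), # phi'' = e^{-x}
    "x ln x": lambda x: 2.0 * np.sqrt(x),        # phi'' = 1/x
    "-ln x":  lambda x: np.log(x),               # phi'' = 1/x^2
}

p, q = 1.0, 4.0
dist = {name: abs(h(q) - h(p)) for name, h in h_maps.items()}
# e.g. dist["-ln x"] is the logarithmic distance |ln q - ln p|
```
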
1.2 Organization of the paper
We have established enough notation to describe the contents of the paper. In Section 2 we start from the divergence function on $K$ and derive a metric tensor from it. We then solve the geodesic equations to compute the geodesic distance between any two points, and we compare it with the divergence between the two points. We shall see that there are cases in which one of them dominates the other for any pair of points. The Riemannian distance between points in $K$ induces a distance between random variables taking values there. In Section 3 we come to the main theme of this work, that is, the computation of best predictors when the distance between random variables is measured in the induced Riemannian distance. We shall call such best predictors the metric mean and the metric conditional expectation. In order to compare these to the best predictor in divergence, we use the prediction error as a comparison criterion. It is at this point that the comparison results established in Section 2 come in.
In Section 4 we take up the issue of sample estimation and its properties. We shall see that the standard results hold for the metric conditional expectation as well: the estimators of the mean and of the variance are unbiased and converge to their true values as the size of the sample becomes infinitely large. In Section 5 we shall consider the arithmetic properties of the metric conditional expectation when there is a commutative group structure on $K$. In Section 6 we collect a few final comments, and in Appendix 7 we present one more derivation of the geodesic equations.
2 Riemannian metric induced by a Bregman divergence
The direct connection between divergences and Riemannian metrics stems from the fact that a strictly convex, at least twice differentiable function has a positive definite Hessian matrix. Even more, the metric derived from a “separable” $\Phi$ as in (1.2) is diagonal, that is
(2.1) $g_{i,j}(x)=\dfrac{\partial^2\Phi}{\partial x_i\partial x_j}(x)=\phi''(x_i)\,\delta_{i,j}.$
Here we use $\delta_{i,j}$ for the standard Kronecker delta, and we shall not distinguish between covariant and contravariant coordinates. This may make the description of standard symbols in differential geometry a bit more awkward.
All these examples have an interesting feature in common. The convex function defining the Bregman divergence is three times continuously differentiable, and defines a Riemannian metric in its domain by (2.1). The equations for the geodesics in this metric separate. It is actually easy to see that, for each $i$, the function $x_i(t)$ defining the geodesic which at time $t=0$ starts from $p_i$ and ends at $q_i$ at time $t=1$ is the solution to
(2.2) $\ddot{x}_i+\dfrac{\phi'''(x_i)}{2\phi''(x_i)}\,\dot{x}_i^2=0,\qquad x_i(0)=p_i,\quad x_i(1)=q_i.$
This equation is in fact easy to integrate; we show how in a short appendix at the end. Now denote by $h$ a primitive of $\sqrt{\phi''}$, that is
(2.3) $h(x)=\displaystyle\int^x\sqrt{\phi''(t)}\,dt.$
As $\phi''$ is strictly positive by assumption, $h'=\sqrt{\phi''}>0$, and $h$ is invertible because it is strictly increasing. If we put $h^{-1}$ for the compositional inverse of $h$, we can write the solution to (2.2) as
(2.4) $x_i(t)=h^{-1}(a_it+b_i).$
The $a_i,b_i$ are integration constants which, using the conditions $x_i(0)=p_i$ and $x_i(1)=q_i$, turn out to be $b_i=h(p_i)$ and $a_i=h(q_i)-h(p_i)$. Notice now that the distance between $p$ and $q$ along the geodesic is given by
(2.5) $d(p,q)=\displaystyle\int_0^1\Big(\sum_{i=1}^n\phi''(x_i(t))\,\dot{x}_i(t)^2\Big)^{1/2}dt.$
It takes a simple computation to verify that
(2.6) $d(p,q)^2=\sum_{i=1}^n\big(h(q_i)-h(p_i)\big)^2=\|h(q)-h(p)\|^2.$
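As a sanity check (not part of the paper), one can verify numerically that the representation $x(t)=h^{-1}\big(h(p)+t(h(q)-h(p))\big)$ solves the geodesic equation (2.2) for the example $\phi(x)=-\ln x$, where $h(x)=\ln x$:

```python
import numpy as np

# Numerical check that x(t) = h^{-1}( h(p) + t (h(q) - h(p)) ) solves
# x'' + (phi'''/(2 phi'')) (x')^2 = 0 for phi(x) = -ln x, where
# h(x) = ln x is a primitive of sqrt(phi''(x)) = 1/x.
p, q = 1.0, 4.0
x = lambda t: np.exp(np.log(p) + t * (np.log(q) - np.log(p)))

eps = 1e-5
for t in (0.25, 0.5, 0.75):
    xd = (x(t + eps) - x(t - eps)) / (2.0 * eps)           # x'(t)
    xdd = (x(t + eps) - 2.0 * x(t) + x(t - eps)) / eps**2  # x''(t)
    # here phi'''/(2 phi'') = (-2/x^3) / (2/x^2) = -1/x
    assert abs(xdd - xd**2 / x(t)) < 1e-4
```
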
In order not to introduce more notation, we shall use the symbol $h$ as well for the map $K\to\mathbb{R}^n$ defined componentwise by $h(x)=(h(x_1),\dots,h(x_n))$. Notice that $h$ is an isometry between $K$ and its image in $\mathbb{R}^n$ when the distance in the former is $d$ and in the latter is the Euclidean distance. Therefore geometric properties in $\mathbb{R}^n$ have a counterpart in $K$.
Observe as well that the special form of (2.4) and (2.6) allows us to represent the middle point between $p$ and $q$ easily. As a matter of fact, we have
Lemma 2.1
With the notations introduced above, if we put $m=h^{-1}\big(\tfrac12(h(p)+h(q))\big)$, then $d(p,m)=d(m,q)=\tfrac12\,d(p,q)$.
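A quick numerical sketch of the midpoint construction for the logarithmic metric of the last example (the numbers are our own): the midpoint is the componentwise geometric mean.

```python
import numpy as np

# Geodesic midpoint m = h^{-1}((h(p) + h(q)) / 2).  For the logarithmic
# metric (h = ln on (0, infinity)) this is the geometric mean of p and q.
h, hinv = np.log, np.exp

p, q = 2.0, 8.0
m = hinv((h(p) + h(q)) / 2.0)      # = sqrt(p * q) = 4
d = lambda a, b: abs(h(a) - h(b))  # one-dimensional distance
# m is equidistant from p and q, and d(p, m) is half of d(p, q)
```
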
2.1 Comparison of Bregman and Geodesic distances
Here we shall examine the relationship between the divergence and the distance induced by it. Observe, to begin with, that for any three times continuously differentiable function we have $\phi(x)-\phi(y)=\int_y^x\phi'(s)\,ds$. Applying this once more under the integral sign, and rearranging a bit, we obtain
(2.7) $\phi(x)-\phi(y)-\phi'(y)(x-y)=\displaystyle\int_y^x\!\!\int_y^s\phi''(u)\,du\,ds.$
Notice that the left hand side is the building block of the divergence. To make the distance (2.6) appear on the right hand side of (2.7), we rewrite it as follows. Use the fact that $\phi''=(h')^2$ and invoke the previous identity applied to $h$ to obtain
Notice now that
With this, it is clear that
We can use the previous comments to complete the proof of the following result.
Theorem 2.1
With the notations introduced above, suppose furthermore that $\phi'''$ (and therefore $h''$) has a constant sign. Then
(2.8)  
(2.9) 
This means that, for example, in the first case, a minimizer with respect to the geodesic distance yields a smaller approximation error than the corresponding minimizer with respect to the divergence. The inequalities in Theorem 2.1 lead to the following result.
Theorem 2.2
Let a set of points in $K$ be given, and denote the points in $K$ closest to that set in divergence and in geodesic distance, respectively. Then, for example, when (2.9) holds,
Proof. If (2.9) holds, then for any Therefore, to begin with, since minimizes the right hand side, we have for any Now, minimizing with respect to on the left hand side of this inequality, we obtain the desired result.
That is, the approximation error is smaller for the minimizer computed with the geodesic distance than that computed for the divergence. We postpone the explicit computation of to Section 4, when we show how to compute sample estimators.
Comment Note that we can think of (2.7) as a way to construct a convex function starting from its second derivative. What the previous result asserts is that if we start from a positive but strictly decreasing second derivative, we generate a divergence satisfying (2.9), whereas if we start from a positive and strictly increasing one, we generate a divergence satisfying (2.8). This is why we included the second and third examples: even though they would seem to be related by a simple reflection at the origin, their predictive properties are different.
Note that when $\phi'''$ is identically zero, as in the first example of the list in Table 1, the two distances coincide. This example is the first case treated in the examples described below. The other examples are standard examples used to define Bregman divergences.
Note as well that when $\phi''$ is strictly decreasing, the derived distance has a smaller prediction error than the prediction error in divergence, whereas when $\phi''$ is strictly increasing, the prediction error in divergence is smaller than the prediction error in its derived distance. And we already noted that for constant $\phi''$ both coincide. But to compare the metric $d$ with the Euclidean metric does not seem an easy task.
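As a numerical illustration (the functions and evaluation points are arbitrary choices of ours, not from the paper), one can compute the one-dimensional divergence and the squared induced distance side by side for the two exponential examples and inspect how they compare:

```python
import numpy as np

# One-dimensional divergence delta_phi(x, y) and squared induced
# distance d(x, y)^2 for phi = e^x (h = 2 e^{x/2}) and
# phi = e^{-x} (h = -2 e^{-x/2}).
def delta(phi, dphi, x, y):
    return phi(x) - phi(y) - dphi(y) * (x - y)

x, y = 0.0, 1.0

# phi'' increasing: phi = e^x
div_inc = delta(np.exp, np.exp, x, y)
d2_inc = (2.0 * np.exp(y / 2.0) - 2.0 * np.exp(x / 2.0)) ** 2

# phi'' decreasing: phi = e^{-x}
div_dec = delta(lambda s: np.exp(-s), lambda s: -np.exp(-s), x, y)
d2_dec = (2.0 * np.exp(-y / 2.0) - 2.0 * np.exp(-x / 2.0)) ** 2
```
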
2.2 Examples of distances related to a Bregman divergence
2.2.1 Case 1: $\phi(x)=x^2/2$
In this case $\phi''(x)=1$ and $h(x)=x$. The geodesics are the straight lines in $\mathbb{R}^n$ and the induced distance is the standard Euclidean distance $d(p,q)=\|p-q\|$.
2.2.2 Case 2: $\phi(x)=e^{x}$
Now $\phi''(x)=e^{x}$ and $h(x)=2e^{x/2}$. The solution to the geodesic equation (2.2) is given by $x_i(t)=2\ln\big((a_it+b_i)/2\big)$, with $b_i=2e^{p_i/2}$ and $a_i=2\big(e^{q_i/2}-e^{p_i/2}\big)$. The geodesic distance between $p$ and $q$ is given by $d(p,q)^2=4\sum_{i=1}^n\big(e^{q_i/2}-e^{p_i/2}\big)^2.$
2.2.3 Case 3: $\phi(x)=e^{-x}$
Now $\phi''(x)=e^{-x}$ but $h(x)=-2e^{-x/2}$. The solution to the geodesic equation (2.2) is given by $x_i(t)=-2\ln\big(-(a_it+b_i)/2\big)$, with $b_i=-2e^{-p_i/2}$ and $a_i=2\big(e^{-p_i/2}-e^{-q_i/2}\big)$. The geodesic distance between $p$ and $q$ is given by $d(p,q)^2=4\sum_{i=1}^n\big(e^{-p_i/2}-e^{-q_i/2}\big)^2.$
2.2.4 Case 4: $\phi(x)=x\ln x$
This time our domain is $(0,\infty)^n$, and $\phi''(x)=1/x$ whereas $h(x)=2\sqrt{x}$. The solution to the geodesic equation (2.2) is given by $x_i(t)=\big((a_it+b_i)/2\big)^2$, where $b_i=2\sqrt{p_i}$ and $a_i=2\big(\sqrt{q_i}-\sqrt{p_i}\big)$. Therefore, the geodesic distance between $p$ and $q$ is $d(p,q)^2=4\sum_{i=1}^n\big(\sqrt{q_i}-\sqrt{p_i}\big)^2.$
This looks similar to the Hellinger distance used in probability theory; see Pollard [23].
2.2.5 Case 5: $\phi(x)=-\ln x$
To finish, we shall consider another example on $(0,\infty)^n$. Now $\phi''(x)=1/x^2$ and $h(x)=\ln x$. The geodesics turn out to be given by $x_i(t)=e^{a_it+b_i}$, where $b_i=\ln p_i$ and $a_i=\ln(q_i/p_i)$, which yields the representation $x(t)=p^{1-t}q^{t}$. Recall that all operations are to be understood componentwise (vectors are functions on $\{1,\dots,n\}$). The distance between $p$ and $q$ is now given by $d(p,q)^2=\sum_{i=1}^n\big(\ln(q_i/p_i)\big)^2.$
2.3 The semiparallelogram law of the geodesic distances
As a consequence of Lemma 2.1 and the way the geodesic distances are related to the Euclidean distance through a bijection, we have the following result:
Theorem 2.3
With the notations introduced in the examples listed above, the sets $K$ with the corresponding geodesic distances satisfy the semiparallelogram law. That is, in all the cases considered, for any $x_1,x_2\in K$ there exists a $z\in K$, obtained as in Lemma 2.1, such that for any $x_3\in K$ we have
$d(x_1,x_2)^2+4\,d(z,x_3)^2\le 2\,d(x_1,x_3)^2+2\,d(x_2,x_3)^2.$
That is, for separable Bregman divergences, the induced Riemannian geometry is a Tits–Bruhat geometry. The semiparallelogram property is handy in proofs of uniqueness.
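Because $h$ maps $(K,d)$ isometrically onto a subset of Euclidean space, the semiparallelogram law can be checked numerically — for these metrics it in fact holds with equality, as in the Euclidean parallelogram identity. A sketch for the logarithmic distance, with arbitrarily chosen points:

```python
import numpy as np

# Semiparallelogram law for the logarithmic distance on (0, inf)^n:
# d(x1,x2)^2 + 4 d(z,x3)^2 <= 2 d(x1,x3)^2 + 2 d(x2,x3)^2,
# with z the geodesic midpoint of x1 and x2 (Lemma 2.1).
h, hinv = np.log, np.exp
d2 = lambda a, b: float(np.sum((h(a) - h(b)) ** 2))

x1 = np.array([1.0, 2.0])
x2 = np.array([4.0, 0.5])
x3 = np.array([3.0, 3.0])
z = hinv((h(x1) + h(x2)) / 2.0)   # midpoint as in Lemma 2.1

lhs = d2(x1, x2) + 4.0 * d2(z, x3)
rhs = 2.0 * d2(x1, x3) + 2.0 * d2(x2, x3)
# lhs == rhs here: h is an isometry onto Euclidean space, where the
# parallelogram identity holds with equality.
```
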
3 Conditional expectations related to Riemannian metrics derived from a Bregman divergence
As we do not have a distinguished point in $K$ which is the identity with respect to a commutative operation on $K$, in order to define a squared norm for $K$-valued random variables we begin by introducing the following notation.
Definition 3.1
We shall say that a $K$-valued random variable $X$ is integrable or square integrable, and write $X\in L^i_d(\mathbb{P})$ (for $i=1,2$), whenever
$E\big[d(X,\xi)^i\big]<\infty$
for some $\xi\in K$. It is clear from the triangle inequality that this definition is independent of $\xi$.
But more important is the following simple result.
Lemma 3.1
With the notations introduced above, from (2.6) it follows that $X\in L^2_d(\mathbb{P})$ is equivalent to $h(X)\in L^2(\mathbb{P})$.
With identity (2.6) in mind, it is clear that the distance on $K$ extends to a distance between random variables by
(3.1) $d(X,Y)^2=E\big[\|h(X)-h(Y)\|^2\big].$
Now that we have this definition in place, the extension of Theorem 2.1 to this case can be stated as follows.
Theorem 3.1
For any pair of $K$-valued random variables $X,Y$ such that the quantities written below are finite, we have
(3.2)  
(3.3)  
We can now move on to the determination of best predictors in the distance.
Theorem 3.2
Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Let $X$ be a $K$-valued random variable such that $h(X)$ is square integrable. Then
$E_d[X\,|\,\mathcal{G}]=h^{-1}\big(E[h(X)\,|\,\mathcal{G}]\big).$
Keep in mind that both $h$ and its inverse act componentwise. This theorem has a curious corollary, to wit:
Corollary 3.1 (Intertwining)
With the notations in the statement of the last theorem, we have $h\big(E_d[X\,|\,\mathcal{G}]\big)=E\big[h(X)\,|\,\mathcal{G}\big]$; that is, $h$ intertwines the two conditional expectations.
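A toy illustration of the theorem (the finite sample space and labels are our own, not from the paper): for $h=\ln$, the metric conditional expectation is the per-group geometric mean.

```python
import numpy as np

# Toy version of the conditional expectation in the logarithmic metric:
# E_d[X | G] = h^{-1}( E[ h(X) | G ] ) with h = ln, on a finite sample
# space where G is generated by a discrete label g.
g = np.array([0, 0, 1, 1])            # labels generating the sigma-algebra
X = np.array([1.0, 4.0, 3.0, 27.0])   # strictly positive random variable

cond = {}
for label in np.unique(g):
    block = X[g == label]
    cond[label] = np.exp(np.mean(np.log(block)))  # h^{-1} of E[h(X) | G]
# each group's value is the geometric mean of X over that group
```
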
3.1 Comparison of prediction errors
As a corollary of Theorem 3.1 we can compare the prediction errors in the metric $d$ and in divergence.
Theorem 3.3
With the notations of Theorem 3.1, we have
(3.4)  
The proof is simple. For the first case, say, begin with (3.3): since the right hand side decreases when the predictor is replaced by the conditional expectation, the inequality holds for any predictor with the appropriate integrability. Now, minimizing the left hand side of the last inequality over predictors yields the desired conclusion.
3.2 Examples of conditional expectations
Even though the contents of the next table are obvious, they are worth recording. There we display the appearance of the conditional expectations of a $K$-valued random variable $X$ in the metrics derived from the divergences listed in Table 1.
Domain  $\phi(x)$  Conditional Expectation
$\mathbb{R}$  $x^2/2$  $E[X\,|\,\mathcal{G}]$
$\mathbb{R}$  $e^{x}$  $2\ln E\big[e^{X/2}\,\big|\,\mathcal{G}\big]$
$\mathbb{R}$  $e^{-x}$  $-2\ln E\big[e^{-X/2}\,\big|\,\mathcal{G}\big]$
$(0,\infty)$  $x\ln x$  $\big(E[\sqrt{X}\,|\,\mathcal{G}]\big)^2$
$(0,\infty)$  $-\ln x$  $\exp\big(E[\ln X\,|\,\mathcal{G}]\big)$
The only other information that we have about $K$ in this context is that it is a convex set in $\mathbb{R}^n$. But we do not know if it is closed with respect to any group operation; in this regard, see Section 5.1. Thus the only properties of the conditional expectations that we can verify at this point are those that depend only on the definition, and on the corresponding property of the ordinary conditional expectation.
Theorem 3.4
With the notations introduced in the previous result, and assuming that all variables mentioned are integrable, we have:
1) Let $\mathcal{G}$ be the trivial $\sigma$-algebra; then $E_d[X\,|\,\mathcal{G}]=E_d[X]$.
2) Let $\mathcal{G}_1\subset\mathcal{G}_2$ be two sub-$\sigma$-algebras of $\mathcal{F}$; then $E_d\big[E_d[X\,|\,\mathcal{G}_2]\,\big|\,\mathcal{G}_1\big]=E_d[X\,|\,\mathcal{G}_1]$.
3) If $X$ is $\mathcal{G}$-measurable, then $E_d[X\,|\,\mathcal{G}]=X$.
As both $h$ and $h^{-1}$ are defined componentwise and are increasing, we can also verify the monotonicity property of the conditional expectations:
4) Let $X\le Y$ componentwise; then $E_d[X\,|\,\mathcal{G}]\le E_d[Y\,|\,\mathcal{G}]$.
We do not necessarily have a vector in but a monotone convergence property may be stated as
5) Let be a sequence in increasing to and suppose that there exist measurable and Then
3.3 A simple application
Let us consider the following two strictly positive (that is, $(0,\infty)$-valued) random variables:
where $W_1$ and $W_2$ are two Gaussian, correlated random variables. It is a textbook exercise to verify that
If we consider the logarithmic distance on $(0,\infty)$, an application of the results in the previous section, taking into account that the two variables generate the same $\sigma$-algebra (call it $\mathcal{G}$), shows that
where we put For comparison, note that the predictor in the Euclidean distance is given by
According to Theorem 3.3, the previous one is better than the last because its variance is smaller.
A possible interpretation of this example goes as follows. We might think of the variables as the prices of an asset today, tomorrow and the day after tomorrow, and of the exponents as the daily logarithmic returns. We want a predictor of the price of the asset two days from now, given that we observe the price tomorrow. The two formulas above give us, respectively, the estimator in logarithmic distance and the standard estimator.
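A hedged Monte Carlo sketch in the spirit of this example (the parameter values and sample size are illustrative choices of ours): for a lognormal variable $e^{\sigma Z}$ with $Z$ standard normal, the mean in logarithmic distance is $e^{E[\sigma Z]}=1$, while the Euclidean mean is $e^{\sigma^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.5
Z = rng.standard_normal(200_000)
X = np.exp(sigma * Z)              # strictly positive, lognormal

# Mean in the logarithmic distance: h = ln, so E_d[X] = exp(E[ln X]) = 1.
mean_log = np.exp(np.mean(np.log(X)))

# Ordinary (Euclidean) mean; for a lognormal it equals exp(sigma^2 / 2).
mean_euclid = np.mean(X)
```

The two notions of mean disagree: the logarithmic one tracks the median of the lognormal distribution rather than its arithmetic average.
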
4 Sample estimation in the Riemannian metric derived from a Bregman divergence
In this section we address the issue of sample estimation of the expected values in the metric $d$. That is, how to estimate
(4.1) $E_d[X]=h^{-1}\big(E[h(X)]\big)$
when all that we have is a sample $x_1,\dots,x_n$ of $X$. The sample estimator is defined to be the point that minimizes the aggregate distance (“cost” function)
$x\mapsto\sum_{j=1}^n d(x,x_j)^2$
when $x$ ranges over $K$. Clearly, for the geodesic distance computed in (2.6) the minimizer is easy to compute. Again, as $h$ and $h^{-1}$ are bijections, we have
(4.2) $\bar{x}_n=h^{-1}\Big(\dfrac{1}{n}\sum_{j=1}^n h(x_j)\Big).$
Recall that this identity is to be understood componentwise, that is, both sides are vectors in $K$.
Certainly, mean of the set $\{x_1,\dots,x_n\}$ is a good name for $\bar{x}_n$. Given the special form (4.1) for $E_d[X]$, it is clear that (4.2) defines an unbiased estimator of the mean. At this point we mention that we leave it as an exercise for the reader to use the semiparallelogram law to verify the uniqueness of the minimizer of the distance to a set of points given by (4.2). But the worth of (4.2) shows in the proof of the law of large numbers. First, we need to note that the error in estimating $X$ by its expected value, that is, the variance of $X$ in the metric $d$, is
(4.3) $\sigma_d^2(X)=E\big[d\big(X,E_d[X]\big)^2\big]=E\big[\|h(X)-E[h(X)]\|^2\big].$
In this case, as with the standard proof of the weak law of large numbers we have
Theorem 4.1
Suppose that $\{X_j : j\ge 1\}$ is a sequence of i.i.d. $K$-valued random variables that have finite variance. Then $\bar{X}_n\to E_d[X_1]$ in probability.
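The following simulation (our own, with $h=\ln$ and illustrative parameters) sketches the estimator (4.2) and the convergence claimed in the theorem:

```python
import numpy as np

# Estimator (4.2) in the logarithmic metric on (0, infinity): the metric
# sample mean is the geometric mean, xbar_n = h^{-1}((1/n) sum h(X_j)).
rng = np.random.default_rng(7)

def metric_mean(sample):
    return np.exp(np.mean(np.log(sample)))

mu = 0.3                                  # E_d[X] = exp(mu) when ln X ~ N(mu, 1)
errors = []
for n in (100, 10_000):
    sample = np.exp(mu + rng.standard_normal(n))
    # error measured in the metric d itself: |ln xbar_n - ln E_d[X]|
    errors.append(abs(np.log(metric_mean(sample)) - mu))
# the error is of order 1/sqrt(n), as in the Euclidean law of large numbers
```
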
Proceeding as in the case of Euclidean geometry, we have
Theorem 4.2
With the same notations and assumptions as in the previous result, define the estimator of the variance by
$S_n^2=\dfrac{1}{n-1}\sum_{j=1}^n\big\|h(X_j)-h(\bar{X}_n)\big\|^2.$
Then $S_n^2$ is an unbiased estimator of $\sigma_d^2(X)$.
Comment: Observe that $S_n^2$ is a positive, real random variable, so its expected value is the standard expected value.
5 Arithmetic properties of the expectation operation
When there is a commutative group operation on $K$ that leaves the metric invariant, the best predictors have additional properties. The two standard examples that we have in mind are $K=\mathbb{R}^n$ with the group operation being the standard addition of vectors, and $K=(0,\infty)^n$ with the group operation being componentwise multiplication. For definiteness, let us denote the group operation by $x\circ y$, and the inverse of $x$ with respect to that operation by $x^{-1}$. Let $e$ denote the identity for that operation. That the distance is invariant (or that the group operation acts by isometries) means that
(5.1) $d(a\circ x,\,a\circ y)=d(x,y)\qquad\text{for all }a,x,y\in K.$
Some simple consequences of this fact are the following. To begin with, we can define a norm derived from the distance by $\|x\|_d=d(e,x)$. We leave it up to the reader to verify that in this notation the triangle inequality for $d$ becomes $\|x\circ y\|_d\le\|x\|_d+\|y\|_d$.
Let us now examine two examples of the situation described above. For the first example in Table 1, in which the conditional expectation in divergence and in the distance derived from it coincide, we know that the conditional expectation is linear. In the last example in Table 1, the analogue of multiplication by a scalar is (componentwise) exponentiation. In this case, we saw that the conditional expectation of a strictly positive random variable $X$ with respect to a $\sigma$-algebra $\mathcal{G}$ is $E_d[X\,|\,\mathcal{G}]=\exp\big(E[\ln X\,|\,\mathcal{G}]\big)$.
It is easy to verify, and it is proved in [17], that
Theorem 5.1
Let $X$ and $Y$ be two $(0,\infty)^n$-valued random variables which are integrable in the logarithmic metric, and let $a$ and $b$ be two real numbers. Then
$E_d\big[X^aY^b\,\big|\,\mathcal{G}\big]=E_d[X\,|\,\mathcal{G}]^a\,E_d[Y\,|\,\mathcal{G}]^b.$
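On a finite sample space the multiplicative identity of Theorem 5.1 can be checked exactly, since $\exp\big(E[a\ln X+b\ln Y]\big)$ factors; the arrays and exponents below are arbitrary:

```python
import numpy as np

# Metric expectation in the logarithmic metric (uniform weights):
# E_log(V) = exp(mean(ln V)).  Then E_log(X^a * Y^b) factors exactly,
# because exp(mean(a ln X + b ln Y)) = E_log(X)^a * E_log(Y)^b.
E_log = lambda v: np.exp(np.mean(np.log(v)))

X = np.array([1.0, 4.0, 2.0])
Y = np.array([3.0, 0.5, 6.0])
a, b = 2.0, -1.0

lhs = E_log(X**a * Y**b)
rhs = E_log(X)**a * E_log(Y)**b
# lhs and rhs agree up to floating-point roundoff
```
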
6 Concluding comments
6.1 General comments about prediction
A predictive procedure involves several aspects: To begin with, we have to specify the nature of the set in which the random variables of interest take values and the class of predictors that we are interested in. Next comes the criterion, cost function or error function used to quantify the “betterness” of a predictor, and finally, we need some way to decide on the uniqueness of the best predictor.
We mentioned at the outset that, using the notion of divergence function, there exists a notion of best predictor for random variables taking values in convex subsets of some $\mathbb{R}^n$ which, somewhat surprisingly, coincides with the standard least squares best predictor. The fact that in the Riemannian metric on $K$ derived from a divergence function a notion of best predictor exists suggests the possibility of extending the notion of best predictor to Tits–Bruhat spaces. These are complete metric spaces whose metric satisfies the semiparallelogram law stated in Theorem 2.3. Using the completeness of the space, the notion of “mean” of a finite set, as the point that minimizes the sum of the squares of the distances to the points of the set, or that of best predictor, are easy to establish. And using the semiparallelogram law, the uniqueness of the best predictor can be established.
The best predictors can be seen to have some of the properties of conditional expectations, except those that depend on the underlying vector space structure of $K$, like Jensen’s inequality and the “linearity” of the best predictor.
6.2 Other remarks
In some cases it is interesting to consider the Legendre–Fenchel duals of the convex function generating the divergence; see [4], [9] or [22] for example. The Bregman divergences induce a dually flat space, and conversely, we can associate a canonical Bregman divergence to a dually flat space (thanks to Frank Nielsen for this remark). The derived metric in this case is the (algebraic) inverse of the original metric, and it generates the same distance, see [1] for this. Therefore the same comparison results hold true in this case as well.
As remarked at the end of Section 2.1, to compare the derived metrics to the standard Euclidean metric, and therefore to compare the prediction errors (or the variance $\sigma_d^2$) to the standard variance of a random variable, does not seem to be an easy task. This is a pending issue to be settled.
We saw that the set in which the random variables of interest take values may be equipped with more than one distance. The results presented above open up the door to the following conceptual (or methodological) question: which is the correct distance to be used to make predictions about $K$-valued random variables?
Another pending issue corresponds to the general case in which $\Phi$ is not of the type (1.2). In this case, by suitable localization we might reproduce the results of Section 2 locally. The problem is to paste together the representation of the geodesics and the rest using the local representations.
We saw as well that when there is no algebraic structure upon $K$, some properties of the estimators are related only to the metric properties of the space, while when there is a commutative operation on $K$, the best estimators have further algebraic properties. In reference to the examples in Section 2, an interesting question is which metrics admit a commutative group operation that leaves them invariant.
7 Appendix: Integration of the geodesic equations
Notice now that if we make the change of variables $u_i=h(x_i)$, where $h'=\sqrt{\phi''}$, in the new coordinates we can write the Lagrangian function as $L=\sum_i\dot{u}_i^2$. In these new coordinates the geodesics are straight lines $u_i(t)=a_it+b_i$.
If at $t=0$ the geodesic starts at $p$ (or $u(0)=h(p)$) and at $t=1$ it is at $q$ (or $u(1)=h(q)$), we obtain $b_i=h(p_i)$ and $a_i=h(q_i)-h(p_i)$.
Acknowledgment I want to thank Frank Nielsen for his comments and suggestions on the first draft of this note.
References
 [1] Amari, S. and Cichocki, A. (2010). Information geometry of divergence functions, Bull. of the Polish Acad. Sci., 58, 183–195.
 [2] Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry, Transl. of Mathem. Monographs, 191, Oxford Univ. Press, New York.
 [3] Banerjee, A., Guo, X. and Wang, H. (2005). On the optimality of conditional expectation as a Bregman predictor, IEEE Transactions on Information Theory, 51, 2664–2669.
 [4] Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S. and Modha, D.S. (2008). A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation, The Jour. of Machine Learning Research, 8, 1919–1986.
 [5] Bauschke, H.H. and Borwein, J.M. (1997). Legendre Functions and the Method of Random Bregman Projections, Journal of Convex Analysis, 4, 27–67.
 [6] Bauschke, H.H. and Lewis, A. (1998). Dykstra's algorithm with Bregman projections: A convergence proof, Optimization, 48, 409–427.
 [7] Bauschke, H.H., Borwein, J.M. and Combettes, P.L. (2003). Bregman monotone optimization algorithms, SIAM J. Control Optim., 42, 596–636.
 [8] Bauschke, H.H. and Combettes, P.L. (2003). Construction of best Bregman approximations in reflexive Banach spaces, Proc. Amer. Math. Soc., 131, 3757–3766.
 [9] Boissonnat, J.-D., Nielsen, F. and Nock, R. (2010). Bregman Voronoi Diagrams: Properties, algorithms and applications, Discrete Comput. Geom., 44, 281–307.
 [10] Bregman, L. (1967). The relaxation method of finding common points of convex sets and its application to the solution of problems in convex programming, Comp. Math. Phys., USSR, 7, 200–217.
 [11] Butnariu, D. and Resmerita, E. (2006). Bregman distances, totally convex functions, and a method for solving operator equations in Banach spaces, Abstract and Applied Analysis, 2006, Article ID 84919, 1–39.
 [12] Calin, O. and Udrişte, C. (2010). Geometric Modeling in Probability and Statistics, Springer Internl. Pub., Switzerland.
 [13] Censor, Y. and Reich, S. (1998). The Dykstra algorithm with Bregman projections, Comm. Applied Analysis, 2, 407–419.
 [14] Censor, Y. and Zaknoon, M. (2018). Algorithms and convergence results of projection methods for inconsistent feasibility problems: A review, arXiv:1802.07529v3 [math.OC].
 [15] Csiszár, I. (2008). Axiomatic characterization of information measures, Entropy, 10, 261–273.

 [16] Fisher, A. (20XX). Quantization and clustering with Bregman divergences, Journal of Multivariate Analysis, 101, 2207–2221.
 [17] Gzyl, H. (2017). Prediction in logarithmic distance. Available at http://arxiv.org/abs/1703.08696.
 [18] Lang, S. (1999). Math talks for undergraduates, Springer, New York.
 [19] Li, C., Song, W. and Yao, J.-C. (2010). The Bregman distance, approximate compactness and convexity of Chebyshev sets in Banach spaces, Journal of Approximation Theory, 162, 1128–1149.
 [20] Lawson, J.D. and Lim, Y. (2001). The Geometric mean, matrices, metrics and more, Amer. Math. Monthly, 108, 797–812.
 [21] Moakher, M. (2005). A differential geometric approach to the geometric mean of symmetric positive definite matrices, SIAM J. Matrix Anal. & Appl., 26, 735–747.
 [22] Nielsen, F. (2018). An elementary introduction to information geometry. Available at https://arxiv.org/abs/1808.08271.
 [23] Pollard, D. (2002). A user's guide to measure theoretic probability, Cambridge Univ. Press, Cambridge.
 [24] Schwartzman, A. (2015). Lognormal distribution and geometric averages of positive definite matrices, Int. Stat. Rev., 84, 456–486.