1 Introduction and Preliminaries
1.1 Riemannian geometry induced by a Bregman divergence
In [19] it was shown how to put a Riemannian structure on a convex open subset of derived from a Bregman divergence
with realvalued generator function . There where is an interval (bounded or unbounded) in the real line, and the separable Bregman generator with is a three times continuously differentiable function. This set up was chosen for computational convenience, and . In this case the Riemannian metric on is given by .
The Riemannian distance in induced by that metric is given by
(1.1) 
where is a an increasing continuously differentiable function defined as the primitive of (also called antiderivative). Its compositional inverse will be denoted by (with ) and we shall use the same letter for the extensions of the functions to , that is, we shall write to denote the transpose of and the same for . We term distance the RiemannBregman distance.
The equation of the Riemannian geodesic between and was reported in [19] as , where the are the constants of integration such that and . That is, the geodesics are expressed as
(1.2) 
Notice that by introducing the Legendre convex conjugate
(1.3)  
(1.4) 
with (and ), we can write the dual Riemannian metric as , and we have the following identity: .
It follows that we can express the BregmanRiemannian distance either by using the primal affine coordinate system, or by using the dual affine coordinate system (with and ):
Lemma 1.1 (Dual RiemannBregman distances)
We have , and
(1.5) 
Example 1.1
Consider (extended Shannon negentropy). We have and . It follows that , where is a constant. The Legendre conjugate is and , where is a constant. We have , and . We check that since .
Figure 1 illustrates the Legendre duality for defining the RiemannBregman distances in the dual coordinate systems induced by the Legendre transformation.
Remark 1.1
In information geometry [2, 13], a smooth dissimilarity measure on a manifold induces a dual structure defined by a pair of affine connection
coupled to the metric tensor
. The Riemannian geometry is a selfdual informationgeometric structure with the dual connections coinciding with the LeviCivita connection . Bregman divergences yield duallyflat informationgeometric spaces (meaning flat and flat) although the RiemannBregman geometry is usual curved [30] (meaning curved).1.2 Examples of RiemannBregman distances
Let us now list a few examples. These will be used below to exemplify the geometric objects defined by the distances and to exemplify the clustering algorithms.
1.2.1 Case 1:
In this case and The induced distance is the standard squared Euclidean distance
1.2.2 Case 2:
Now The geodesic distance between and is given by
1.2.3 Case 3:
The geodesic distance between and is given by
1.2.4 Case 4: (Shannon negentropy)
This time our domain is and , and . The geodesic distance between and is
This looks similar to the squared Hellinger distance used in probability theory. See Pollard’s
[31]1.2.5 Case 5: (Burg negentropy)
To finish, we shall consider another example on . We have , and . Now, the distance between and is now given by
To continue, let us mention that the thrust in [19]
was to examine the prediction process when considering random variables taking values in metric space
that is to examine the concepts of best predictor of a random variable taking values in when the prediction error is measured in the distance. The special relation of the distance (1.1) to the Euclidean distance makes the following result clear:Theorem 1.1 (Separable RiemannBregman centroid)
Given a collection of points in the point the realizes the minimum of
is unique and given by
(1.6) 
Observe that formula Eq. 1.6 coincides with the leftsided Bregman centroid [28]. However, leftsided Bregman means and RiemannBregman means will differ in the assignment steps because they rely on different dissimilarity measures.
We shall make extensive use of this definition when computing the centers of the clusters in the various metrics. This is done in Section 4 below, where we start from a given data set and organize it in clusters according to different clustering procedures for each of the distances, and compute the centers of the resulting clusters each distance as in (1.6).
In Section 2 we consider some standard geometric objects, like Balls, Voronoi cells, in the different distances, whereas in Section 3 we consider the problem of fixed rate quantization for valued random variables .
2 Balls and Voronoi cells in the derived metrics
2.1 Balls
The examples listed show that is a nonlinear isometry. This allows us to describe balls in as
(2.1) 
Here, denotes the ball in of radius centered at . The space of balls can also be defined using a potential function following [10].
2.2 Voronoi diagrams
Let be some subset The Voronoi cell in of is defined by
(2.2) 
And as above, if we denote by the Voronoi cell of in the Euclidean metric, it is clear that
(2.3) 
A few graphical examples are shown in Figure 2.
Figure 3 displays four types of Voronoi diagrams induced by a Bregman generator; The pictures illustrate the fact that the symmetrized Bregman Voronoi diagram is different from the (symmetric metric) RiemannBregman Voronoi diagram.
3 Fixed rate quantization
Let be provided with a metric derived from a divergence. Let be a given integer and be a given set, called the codebook and its elements are called the
code vectors
. A fixed rate quantizer (with codebook ) is a measurable mapping The cardinality of is called the quantization rate. Two quantizers have the same rate if their codebooks have the same cardinalityFor a valued random variable defined on a probability space the distortion of in quantizing is given by the expected reconstruction square error:
(3.1) 
If we are interested in only one valued random variable we may replace by think of as the Borel sets of think of as the identity mapping, and think of as a probability measure on Also, our notation is slightly different from that of Linder [24] in that we put a square into the definition of the square error. Since is constant on the “level” sets for the former expectation can be written as
Here we denote by the subalgebra of generated by the partition of induced by the quantizer When holds for all blocks of the partition, the conditional expectation takes values
(3.2) 
Notice that given there is a quantizer having the same rate and realizing an error smaller than The existence and uniqueness of such quantizer is given by the following result in [19].
Theorem 3.1
Let be a subalgebra of Let be such that (Lebesgue space of functions that are square integrable). Then there exists a unique measurable random variable, denoted by satisfying
The result mentioned a few lines above can be stated as
Theorem 3.2
Let be a quantizer of rate Suppose that the blocks of the partition of determined by are such that for all Then
(3.3) 
is the quantizer with codebook whose elements are which makes the error smaller over all codebooks which determine the same partition of
An interesting way to associate sets to a codebook of rate is by means of Voronoi cells. are the Voronoi cells determined by their intersections are polyhedra of dimension lower than (faces), and the probability does not change them, that is if whenever If we put for the interior of the Voronoi cells, they generate a subalgebra of such that differs from by a set of probability
In the notation of Theorem 3.2 we have
Corollary 3.1
If the algebra is generated buy the interiors of the Voronoi cells determined by a codebook then
The proof is clear: The integral is a weighted average of points in a convex set.
4 RiemannBregman Clustering
In this section we shall consider a dimensional data set in which several clusters (indicated) by different colors in the figures displayed below are apparent. We shall suppose that the base space in which the random variable takes values comes equipped with one of the five distances listed in Section 1.2.
We shall suppose that the points belong one of the possible manifold on which a RiemannBregman metric has been defined, and the object of the exercise is to identify the clusters. To assign the data points into clusters, we shall apply the
means, the EM (ExpectationMaximization), the HCPC (Hierarchical Clustering Principal Component) and the HAC (Hierarchical Agglomerative Clustering) techniques. Once the clusters are identified, we determine their centers by invoking Theorem (
1.1 and (1.6). In the four examples described below, we shall see how the choice of the metric determines the clusters.The points were generated all in in order to be able to visually compare the result of the clustering procedures in the distances listed on Section 1.2.
Strategy:
Step 1 Generate a cloud in
Step 2 Apply the mapping and form the cloud in
Step 3 Apply the clustering algorithms. If are the clusters of the new cloud, then the clusters in the original coordinates are .
Step 4 Plot the “centers” in the original cloud.
Notice that in all examples, the mappings and are invertible, and are applied componentwise.
The aim of the exercise is to see how the choice of the metric determines the clusters.
In particular, we observe that although the centroid update rule coincide with the leftsided Bregman clustering [28], the assignment steps of the leftsided Bregman clustering and the RiemannBregman clustering are different.
4.1 The data set
For all the the numerical experiments considered below, the “original” data is composed of samples of four pairs of independent normal variables with parameters specified in Table 1 shown next.
Cluster  

1  0.5  0.5  
8  6.5  
2  0.4  0.45  
9  7.5  
3  0.35  0.35  
8.5  9  
4  0.6  0.6  
8  10 
Thus we obtain four point clouds with some overlap. The number of points in the four original groups is, respectively, These numbers will be used as reference to check upon the performance of each clustering technique when the distance between points is measured in each of the five different RiemannBregman metrics.
4.2 Clustering by means
The kmeans clustering algorithm is an iterative methodology that assigns data points to a cluster, in such a way that the sum of the squared distances between the data points and the cluster’s centroid (defined as the arithmetic mean) is minimal. The less variation we have within clusters, the more homogeneous the data points are within the same cluster.
In Figure 4 we present the result of the application of the means procedure to the original data set depicted in panel (a)a. The caption of each of the panels describes the function that has been applied to the data before applying the clustering procedure. Once the clusters are determined, the incluster mean is computed and the whole set is mapped back to the original space.
We add at this point that if one suspects (or knows) that the data sets are in different scales, it might be convenient to standardize the data. This is important since clustering algorithms are based on distance comparison procedures. In [5], Banerjee et al. highlight the bijection between regular exponential families and a class of Bregman divergences called regular Bregman divergences. They demonstrate experimentally that the means clustering perform best in practice on mixture of exponential families (with cumulant function ) when the Bregman divergence is chosen for the dual convex conjugate .
The centers of the original data set and those of the clusters are listed in Table 2.
Cluster  1  2  3  4 

original  (7.93, 6.63)  (9.01, 7.51)  (8.5, 9.08)  (8.01, 10.01) 
(7.97, 6.6)  (9.04, 7.53)  (8.5, 9.08)  (7.93, 10.15)  
(7.97, 6.58)  (9.02, 7.53)  (8.52, 9.09)  (7.9, 10.12)  
(7.96, 6.56)  (9.02, 7.52)  (8.55, 9.1)  (7.87, 10.07)  
(8.27, 7.06)  (9.24, 7.6)  (8.38, 9.18)  (8.05, 10.32)  
(7.76, 6.19)  (8.6, 7)  (9.05, 7.74)  (8.29, 9.49) 
From the table it is clear that the Euclidean distance and the logarithmic distance perform better than the others in capturing the original centers of the data. Actually, the centers of the original data are the minimizers to the original data set in the Euclidean distance and are the empirical means of the normals used to generate the data.
As a further test of the clustering algorithm we provide a “headcount” of points in each cluster and compare the numbers with the number of original points in each cluster. The result is displayed in Table 3.
Cluster  1  2  3  4  Accuracy 

Original  100  300  200  150  1 
109  286  226  129  0.953  
106  289  225  130  0.959  
103  292  221  134  0.968  
214  196  240  100  0.768  
46  146  211  347  0.388 
4.3 Clustering with the EM algorithm
The expectation maximization (EM) algorithm [17]
is an iterative method to estimate the parameters in a statistical model, in which the model depends on unobserved variables. The EM iteration alternates between performing an expectation (E) step and a maximization (M) step. In the first step a likelihood function is created and in the second it is maximized. See
[26] for details.In the EM methodology an initial guess is made for the model’s parameters and a distribution is created, this is the Estep or Expected distribution, then the parameters are updated in the (Mstep) maximizing the expectation computed in the Estep. Both steps are repeated until the convergence is clear (distribution that doesn’t change from the Estep to the Mstep)
In the upper left panel of Figure 5 we display the original clusters (the same as above) for the sake of easy visual comparison, as well as the clusters obtained by applying the EM clustering methodology.
Form Table 5 it becomes clear how the distance between points changes plus the type of clustering algorithm determine the centers of the clusters.
Cluster  1  2  3  4 

Original  (7.93, 6.63)  (9.01, 7.51)  (8.5, 9.08)  (8.01, 10.01) 
(7.82, 6.52)  (8.99, 7.49)  (8.47, 9.2)  (7.79, 10.29)  
(7.72, 6.41)  (8.94, 7.44)  (8.26, 9.32)  (8.49, 10.53)  
(7.95, 6.55)  (9.01, 7.53)  (8.55, 9.07)  (7.95, 10.03)  
(7.79, 6.59)  (8.88, 7.37)  (9.55, 7.67)  (8.31, 9.34)  
(7.92, 6.6)  (9.01, 7.53)  (8.52, 9.1)  (7.93, 10.09) 
In Table (5
), we list the size of the original clusters along with the size of the clusters obtained in the different distances. Also, in the last column of the table we list the accuracy of the procedure, defined as the number of cases correctly classified. Obviously, with the original data all the cases are correct, but note that since the distance between the points changes with the metric, it is natural that the accuracy of the procedure would vary.
Cluster  1  2  3  4  Accuracy 

Original  100  300  200  150  1 
85  315  259  91  0.901  
64  333  307  46  0.809  
101  298  200  151  0.996  
83  217  69  381  0.577  
101  303  208  138  0.977 
5 Clustering using the HCPC algorithm
The hierarchical clustering on principal components (HCPC) algorithm is a combination of methods. First it uses principal component analysis (PCA) to reduce possible hidden high dimensionality in the data. Then it uses hierarchical and partitional clustering to describe similarity between the points. See
[20] for details and further references.In Figure 6 we display the results of the clustering procedure. The panels are as displayed as above. In the upper leftmost one, appears the original data, and the remaining panels we display the results of the clustering process. The resulting clusters are in different colors.
In Table 6 we list the centers of the HCPCclusters in the different distances.
Cluster  1  2  3  4 

original  (7.93, 6.63)  (9.01, 7.51)  (8.5, 9.08)  (8.01, 10.01) 
(7.9, 6.71)  (9.06, 7.49)  (8.54, 9.26)  (7.67, 9.93)  
(7.89, 6.71)  (9.06, 7.48)  (8.55, 9.26)  (7.67, 9.91)  
(7.87, 6.67)  (9.05, 7.47)  (8.55, 9.25)  (7.66, 9.9)  
(7.96, 7.52)  (9.38, 7.71)  (8.77, 8.04)  (8.04, 9.98)  
(7.85, 6.5)  (9.02, 7.46)  (8.51, 9.29)  (7.54, 9.79) 
In Table 7, we can observe the size of the original clusters along with the size obtained with the transformations. Also, this table contains a column that is the accuracy, this value is defined as the number of cases correctly classified. Obviously, with the original data all the cases are correct, for that reason we have a accuracy of 1, but for the other transformations we fail.
Cluster  1  2  3  4  Accuracy 

original  100  300  200  150  1 
106  290  247  107  0.929  
105  288  250  107  0.927  
100  289  256  105  0.925  
166  135  276  173  0.749  
83  297  285  85  0.864 
6 The Hierarchical Agglomerative Clustering (HAC) process
The Hierarchical Agglomerative Clustering (HAC) is designed to build a hierarchy of clusters using a bottomup or agglomerative strategy. Here each data point is treated as a single cluster and then a distance (or similarity) between each of the clusters is computed in order to merge two or more into one (e.g., single linkage or average linkage), that is, the two most “similar” clusters are merged at each step. See [21] and [30] for details.
In Figure 7 we display the results of the hierarchical clustering algorithm. The panels are organized as above.
In Table 8 we list the centers associated to the resulting clusters. As usual, they are the arithmetic means of the clusters provided by the hierarchical clustering algorithm.
Cluster  1  2  3  4 

original  (7.93, 6.63)  (9.01, 7.51)  (8.5, 9.08)  (8.01, 10.01) 
(8.16, 6.71)  (9.06, 7.61)  (8.28, 9.26)  (8.34, 10.15)  
(8.38, 6.92)  (9.22, 7.68)  (8.41, 9.31)  (7.38, 10.29)  
(7.86, 6.69)  (9.04, 7.44)  (8.56, 9.09)  (7.86, 9.98)  
(8.72, 7.31)  (8.32, 9.42)  (8.07, 10.84)  (8, 11.59)  
(7.99, 5.89)  (8.67, 7.1)  (8.57, 9.01)  (7.23, 10.14) 
In Table 9, we list the size of the original clusters along with the sizes of the clusters in the different metrics. We also list the accuracy of the classifier.
Comment At this point we mention that in all examples the accuracy can be provided because we new the true clusters before hand. Therefore, this accuracy measure is an indirect quality test of the different clustering methods. Other measures of cluster quality can be considered (e.g., the adjusted Rand index or the Normalized Mutual Information [5]).
Cluster  1  2  3  4  Accuracy 

original  100  300  200  150  1 
144  256  261  89  0.860  
215  170  323  42  0.683  
94  291  226  139  0.957  
407  325  13  5  0.140  
19  263  435  33  0.579 
7 Concluding comments
In this note, we considered the Riemannian metric distance induced by a separable Bregman divergence, and studied the fixed rate quantization problem. We concisely described the RiemannBregman Voronoi diagrams that can be obtained from an ordinary Euclidean Voronoi diagram after a monotone isometric embedding. Then we reported on the partitionbased (means), soft (EM) and hierarchical (HCPC/HCA) clusterings with respect to these RiemannBregman distances. As was to be expected, the resulting clusters depend on the clustering method as well as on the metric used to measure the distance between points.
References
 [1] Amari, S and Cichocki, A. (2010). Information theory of divergence functions, Bull. of the Polish Acad. Sci., 58, 183195.
 [2] Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry, Transl. of Mathem. Monographs, 191, Oxford Univ. Press, New York.
 [3] Banerjee, A., Guo, X. and Wang, H. (2005). On the optimality of conditional expectation as Bregman predictor, IEEE Transactions on Information Theory, 51, 2664  2669

[4]
Banerjee, A. Dhillon, I., Ghosh, I. Merugu, S and Modhi, D.s. (2008). A Generalized Maximum Entropy Approach to Bregman Coclustering and Matrix Approximation
The Journal of Machine Learning Research,
8, 19191986.  [5] Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005) Clustering with Bregman divergences. Journal of machine learning research 6: 17051749.
 [6] Baushke, H.H. and Borweinn, J.M. (1997). Legendre Functions and the Method of Random Bregman Projections, Journal of Convex Analysis, 4, 2767.
 [7] Baushke, H.H and Lewis, A. (1998). Dykstras algorithm with Bregman projections: A convergence proof, Optimization, 48, 409427
 [8] Baushke, H.H., Borwein, J.M and Combettes, P.L. (2003). Bregman monotone optimization algorithms, SIAM J. Control Optim., 42, 596636.
 [9] Baushke, H.H and Combettes, P.L. (2003). Construction of best Bregman approximation in reflective Banach spaces, Proc. Amer. Math. Soc., 131, 37573766
 [10] Boissonnat, JD., Nielsen, F and Nock, F. (2010). Bregman Voronoi Diagrams, Discrete Comput. Geom., 44, 281307.
 [11] Bregman, L. (1967). The relaxation method of finding common points of convex sets and its application to the solution of problems in convex programming, Comp. Math. Phys., USSR, 7, 2100217.
 [12] Butnariu, D. and Resmerita, E. (2005). Bregman distances, totally convex functions, and a method for solving operator equations, Abstract and Applied Analysis, 2006, Article ID 84919, 139.
 [13] Calin, O. and Urdiste, C. Geometric Modeling in Probability and Statistics, Springer Internl. Pub., Switzerland, (2010).
 [14] Censor, Y and Reich, S. (1998). The Dykstra algorithm with Bregman projections. Comm. Applied Analysis, 2, 407419.
 [15] Censor, Y. and Zaknoon, M. (2018). Algorithms and convergence results of projection methods for inconsistent feasibility problems: A review, arXiv:1802.07529v3 [math.OC].
 [16] Csiszár, I. (2008). Axiomatic characterization of information measures, Entropy, 10, 261273.
 [17] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), 138.

[18]
Fisher, A. (2010). Quantization and clustering with Bregman divergences
, Journal of Multivariate Analysis,
101, 22072221.  [19] Gzyl, H. (2018). Prediction in Riemannian metrics derived from divergence functions, http://arxiv.org/abs/1808.01638.
 [20] Husson, F, Josse, J. and Pagès J. (2010). ——Principal Component Methods  Hierarchical Clustering  Partitional Clustering: Why Would We Need to Choose for Visualizing Data?” Technical Rept, Mathem. and Stats Dept., Univ Rennes.

[21]
Kassambara, A. (2017). Practical Guide To Cluster Analysis in R. CreateSpace: North Charleston, SC, USA.
 [22] Lang, S. Math talks for undergraduates, Springer, New York, (1999).
 [23] Li, C., Song, W. and Yao, JC. (2010). The Bregman distance, approximate compactness and convexity of Chebyshev sets in Banach spaces,Journal of Approximation Theory 162, 11281149.
 [24] Linder, T. (2001). Learning theoretic methods in vector quantization, in Principles of Nonparametric Learning, I. Gyorfi, ED., CISM Lecture Notes, Springer, New York.

[25]
Lawson, J.D. and Lim, Y. (2001)
The Geometric mean, matrices, metrics and more
, Amer. Math.,Monthly, 108, 797812.  [26] McLachlan, G.J. and Peel, D. (2000). Finite Mixture Models, Wiley, New York.
 [27] Moahker, M. (2005) A differential geometric approach to the geometric mean of symmetric positive definite matrices, SIAM. J. Matrix Anal. & Appl., 26, 735747.
 [28] Nielsen, F., and Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE transactions on Information Theory 55.6: 28822904.

[29]
Nielsen, F. (2016). Hierarchical Clustering in
Introduction to HPC with MPI for Data Science
Springer, Berlin.  [30] Nielsen, F. (2018). An elementary introduction to information geometry. Available at https://arxiv.org/abs/1808.08271
 [31] Pollard, D. (2002). A user’s guide to measure theoretic probability, Cambridge Univ. Press., Cambridge.
 [32] Schwartzmazn, A. (2015) Lognormal distribution and geometric averages of positive definite matrices, Int. Stat. Rev., 84, 456486.
Comments
There are no comments yet.