Geometry and clustering with metrics derived from separable Bregman divergences

Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings. We investigate fixed rate quantization and its codebook Voronoi diagrams, and report on experimental performances of partition-based, hierarchical, and soft clustering algorithms with respect to these Riemann-Bregman distances.



There are no comments yet.


page 1

page 2

page 3

page 4


Hodge Theory on Metric Spaces

Hodge theory is a beautiful synthesis of geometry, topology, and analysi...

On the Metric Distortion of Embedding Persistence Diagrams into separable Hilbert spaces

Persistence diagrams are important descriptors in Topological Data Analy...

Every metric space is separable in function realizability

We first show that in the function realizability topos every metric spac...

Embeddings of Persistence Diagrams into Hilbert Spaces

Since persistence diagrams do not admit an inner product structure, a ma...

Prediction in Riemannian metrics derived from divergence functions

Divergence functions are interesting discrepancy measures. Even though t...

Geometric clustering in normed planes

Given two sets of points A and B in a normed plane, we prove that there ...

A survey of mass partitions

Mass partition problems describe the partitions we can induce on a famil...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction and Preliminaries

1.1 Riemannian geometry induced by a Bregman divergence

In [19] it was shown how to put a Riemannian structure on a convex open subset of derived from a Bregman divergence

with real-valued generator function . There where is an interval (bounded or unbounded) in the real line, and the separable Bregman generator with is a three times continuously differentiable function. This set up was chosen for computational convenience, and . In this case the Riemannian metric on is given by .

The Riemannian distance in induced by that metric is given by


where is a an increasing continuously differentiable function defined as the primitive of (also called antiderivative). Its compositional inverse will be denoted by (with ) and we shall use the same letter for the extensions of the functions to , that is, we shall write to denote the transpose of and the same for . We term distance the Riemann-Bregman distance.

The equation of the Riemannian geodesic between and was reported in [19] as , where the are the constants of integration such that and . That is, the geodesics are expressed as


Notice that by introducing the Legendre convex conjugate


with (and ), we can write the dual Riemannian metric as , and we have the following identity: .

It follows that we can express the Bregman-Riemannian distance either by using the primal affine -coordinate system, or by using the dual affine -coordinate system (with and ):

Lemma 1.1 (Dual Riemann-Bregman distances)

We have , and

Example 1.1

Consider (extended Shannon negentropy). We have and . It follows that , where is a constant. The Legendre conjugate is and , where is a constant. We have , and . We check that since .

Figure 1 illustrates the Legendre duality for defining the Riemann-Bregman distances in the dual coordinate systems induced by the Legendre transformation.

Figure 1: The Riemann-Bregman distances expressed in the dual coordinate system induced by the Legendre transformation.
Remark 1.1

In information geometry [2, 13], a smooth dissimilarity measure on a manifold induces a dual structure defined by a pair of affine connection

coupled to the metric tensor

. The Riemannian geometry is a self-dual information-geometric structure with the dual connections coinciding with the Levi-Civita connection . Bregman divergences yield dually-flat information-geometric spaces (meaning -flat and -flat) although the Riemann-Bregman geometry is usual curved [30] (meaning -curved).

1.2 Examples of Riemann-Bregman distances

Let us now list a few examples. These will be used below to exemplify the geometric objects defined by the distances and to exemplify the clustering algorithms.

1.2.1 Case 1:

In this case and The induced distance is the standard squared Euclidean distance

1.2.2 Case 2:

Now The geodesic distance between and is given by

1.2.3 Case 3:

The geodesic distance between and is given by

1.2.4 Case 4: (Shannon negentropy)

This time our domain is and , and . The geodesic distance between and is

This looks similar to the squared Hellinger distance used in probability theory. See Pollard’s


1.2.5 Case 5: (Burg negentropy)

To finish, we shall consider another example on . We have , and . Now, the distance between and is now given by

To continue, let us mention that the thrust in [19]

was to examine the prediction process when considering random variables taking values in metric space

that is to examine the concepts of best predictor of a random variable taking values in when the prediction error is measured in the distance. The special relation of the distance (1.1) to the Euclidean distance makes the following result clear:

Theorem 1.1 (Separable Riemann-Bregman centroid)

Given a collection of points in the point the realizes the minimum of

is unique and given by


Observe that formula Eq. 1.6 coincides with the left-sided Bregman centroid [28]. However, left-sided Bregman -means and Riemann-Bregman -means will differ in the assignment steps because they rely on different dissimilarity measures.

We shall make extensive use of this definition when computing the centers of the clusters in the various metrics. This is done in Section 4 below, where we start from a given data set and organize it in clusters according to different clustering procedures for each of the distances, and compute the centers of the resulting clusters each distance as in (1.6).

In Section 2 we consider some standard geometric objects, like Balls, Voronoi cells, in the different distances, whereas in Section 3 we consider the problem of fixed rate quantization for valued random variables .

2 Balls and Voronoi cells in the derived metrics

2.1 Balls

The examples listed show that is a non-linear isometry. This allows us to describe balls in as


Here, denotes the ball in of radius centered at . The space of balls can also be defined using a potential function following [10].

2.2 Voronoi diagrams

Let be some subset The Voronoi cell in of is defined by


And as above, if we denote by the Voronoi cell of in the Euclidean metric, it is clear that


A few graphical examples are shown in Figure 2.

Figure 2: Voronoi diagrams induced by the Riemann-Bregman distances

Figure 3 displays four types of Voronoi diagrams induced by a Bregman generator; The pictures illustrate the fact that the symmetrized Bregman Voronoi diagram is different from the (symmetric metric) Riemann-Bregman Voronoi diagram.

(a) (Shannon)

(b) (Burg)
Figure 3: Four types of Voronoi diagrams induced by a Bregman generator: The left-sided Bregman Voronoi diagram (blue), the right-sided Bregman Voronoi diagram (red), the symmetrized Bregman Voronoi diagram (green), and the Riemann-Bregman Voronoi diagram (purple).

3 Fixed rate quantization

Let be provided with a metric derived from a divergence. Let be a given integer and be a given set, called the codebook and its elements are called the

code vectors

. A fixed rate quantizer (with codebook ) is a measurable mapping The cardinality of is called the quantization rate. Two quantizers have the same rate if their codebooks have the same cardinality

For a valued random variable defined on a probability space the distortion of in quantizing is given by the expected reconstruction square error:


If we are interested in only one valued random variable we may replace by think of as the Borel sets of think of as the identity mapping, and think of as a probability measure on Also, our notation is slightly different from that of Linder [24] in that we put a square into the definition of the square error. Since is constant on the “level” sets for the former expectation can be written as

Here we denote by the subalgebra of generated by the partition of induced by the quantizer When holds for all blocks of the partition, the conditional expectation takes values


Notice that given there is a quantizer having the same rate and realizing an error smaller than The existence and uniqueness of such quantizer is given by the following result in [19].

Theorem 3.1

Let be a subalgebra of Let be such that (Lebesgue space of functions that are square integrable). Then there exists a unique measurable random variable, denoted by satisfying

The result mentioned a few lines above can be stated as

Theorem 3.2

Let be a quantizer of rate Suppose that the blocks of the partition of determined by are such that for all Then


is the quantizer with codebook whose elements are which makes the error smaller over all codebooks which determine the same partition of

An interesting way to associate sets to a codebook of rate is by means of Voronoi cells. are the Voronoi cells determined by their intersections are polyhedra of dimension lower than (faces), and the probability does not change them, that is if whenever If we put for the interior of the Voronoi cells, they generate a sub-algebra of such that differs from by a set of probability

In the notation of Theorem 3.2 we have

Corollary 3.1

If the algebra is generated buy the interiors of the Voronoi cells determined by a codebook then

The proof is clear: The integral is a weighted average of points in a convex set.

4 Riemann-Bregman Clustering

In this section we shall consider a dimensional data set in which several clusters (indicated) by different colors in the figures displayed below are apparent. We shall suppose that the base space in which the random variable takes values comes equipped with one of the five distances listed in Section 1.2.

We shall suppose that the points belong one of the possible manifold on which a Riemann-Bregman metric has been defined, and the object of the exercise is to identify the clusters. To assign the data points into clusters, we shall apply the

-means, the EM (Expectation-Maximization), the HCPC (Hierarchical Clustering Principal Component) and the HAC (Hierarchical Agglomerative Clustering) techniques. Once the clusters are identified, we determine their centers by invoking Theorem (

1.1 and (1.6). In the four examples described below, we shall see how the choice of the metric determines the clusters.

The points were generated all in in order to be able to visually compare the result of the clustering procedures in the distances listed on Section 1.2.

Step 1 Generate a cloud in
Step 2 Apply the mapping and form the cloud in
Step 3 Apply the clustering algorithms. If are the clusters of the new cloud, then the clusters in the original coordinates are .
Step 4 Plot the “centers” in the original cloud.

Notice that in all examples, the mappings and are invertible, and are applied componentwise. The aim of the exercise is to see how the choice of the metric determines the clusters.
In particular, we observe that although the centroid update rule coincide with the left-sided Bregman clustering [28], the assignment steps of the left-sided Bregman clustering and the Riemann-Bregman clustering are different.

4.1 The data set

For all the the numerical experiments considered below, the “original” data is composed of samples of four pairs of independent normal variables with parameters specified in Table 1 shown next.

1 0.5 0.5
8 6.5
2 0.4 0.45
9 7.5
3 0.35 0.35
8.5 9
4 0.6 0.6
8 10
Table 1: Centers of -means clusters for a groundtruth synthetic dataset.

Thus we obtain four point clouds with some overlap. The number of points in the four original groups is, respectively, These numbers will be used as reference to check upon the performance of each clustering technique when the distance between points is measured in each of the five different Riemann-Bregman metrics.

4.2 Clustering by -means

The k-means clustering algorithm is an iterative methodology that assigns data points to a cluster, in such a way that the sum of the squared distances between the data points and the cluster’s centroid (defined as the arithmetic mean) is minimal. The less variation we have within clusters, the more homogeneous the data points are within the same cluster.

In Figure 4 we present the result of the application of the means procedure to the original data set depicted in panel (a)a. The caption of each of the panels describes the function that has been applied to the data before applying the clustering procedure. Once the clusters are determined, the in-cluster mean is computed and the whole set is mapped back to the original space.

We add at this point that if one suspects (or knows) that the data sets are in different scales, it might be convenient to standardize the data. This is important since clustering algorithms are based on distance comparison procedures. In [5], Banerjee et al. highlight the bijection between regular exponential families and a class of Bregman divergences called regular Bregman divergences. They demonstrate experimentally that the -means clustering perform best in practice on mixture of exponential families (with cumulant function ) when the Bregman divergence is chosen for the dual convex conjugate .

The centers of the original data set and those of the clusters are listed in Table 2.

(a) Data set
Figure 4: Clustering by -means
Cluster 1 2 3 4
original (7.93, 6.63) (9.01, 7.51) (8.5, 9.08) (8.01, 10.01)
(7.97, 6.6) (9.04, 7.53) (8.5, 9.08) (7.93, 10.15)
(7.97, 6.58) (9.02, 7.53) (8.52, 9.09) (7.9, 10.12)
(7.96, 6.56) (9.02, 7.52) (8.55, 9.1) (7.87, 10.07)
(8.27, 7.06) (9.24, 7.6) (8.38, 9.18) (8.05, 10.32)
(7.76, 6.19) (8.6, 7) (9.05, 7.74) (8.29, 9.49)
Table 2: centers of -means clusters obtained with different methodologies

From the table it is clear that the Euclidean distance and the logarithmic distance perform better than the others in capturing the original centers of the data. Actually, the centers of the original data are the minimizers to the original data set in the Euclidean distance and are the empirical means of the normals used to generate the data.

As a further test of the clustering algorithm we provide a “headcount” of points in each cluster and compare the numbers with the number of original points in each cluster. The result is displayed in Table 3.

Cluster 1 2 3 4 Accuracy
Original 100 300 200 150 1
109 286 226 129 0.953
106 289 225 130 0.959
103 292 221 134 0.968
214 196 240 100 0.768
46 146 211 347 0.388
Table 3: Size of the -means clusters and accuracy obtained with different transformations

4.3 Clustering with the EM algorithm

The expectation maximization (EM) algorithm [17]

is an iterative method to estimate the parameters in a statistical model, in which the model depends on unobserved variables. The EM iteration alternates between performing an expectation (E) step and a maximization (M) step. In the first step a likelihood function is created and in the second it is maximized. See

[26] for details.

In the EM methodology an initial guess is made for the model’s parameters and a distribution is created, this is the E-step or Expected distribution, then the parameters are updated in the (M-step) maximizing the expectation computed in the E-step. Both steps are repeated until the convergence is clear (distribution that doesn’t change from the E-step to the M-step)

In the upper left panel of Figure 5 we display the original clusters (the same as above) for the sake of easy visual comparison, as well as the clusters obtained by applying the EM clustering methodology.

(a) Data set
Figure 5: Clustering by the EM algorithm

Form Table 5 it becomes clear how the distance between points changes plus the type of clustering algorithm determine the centers of the clusters.

Cluster 1 2 3 4
Original (7.93, 6.63) (9.01, 7.51) (8.5, 9.08) (8.01, 10.01)
(7.82, 6.52) (8.99, 7.49) (8.47, 9.2) (7.79, 10.29)
(7.72, 6.41) (8.94, 7.44) (8.26, 9.32) (8.49, 10.53)
(7.95, 6.55) (9.01, 7.53) (8.55, 9.07) (7.95, 10.03)
(7.79, 6.59) (8.88, 7.37) (9.55, 7.67) (8.31, 9.34)
(7.92, 6.6) (9.01, 7.53) (8.52, 9.1) (7.93, 10.09)
Table 4: centers of EM clusters in the different metrics

In Table (5

), we list the size of the original clusters along with the size of the clusters obtained in the different distances. Also, in the last column of the table we list the accuracy of the procedure, defined as the number of cases correctly classified. Obviously, with the original data all the cases are correct, but note that since the distance between the points changes with the metric, it is natural that the accuracy of the procedure would vary.

Cluster 1 2 3 4 Accuracy
Original 100 300 200 150 1
85 315 259 91 0.901
64 333 307 46 0.809
101 298 200 151 0.996
83 217 69 381 0.577
101 303 208 138 0.977
Table 5: Size of the EM clusters and accuracy for the different metrics

5 Clustering using the HCPC algorithm

The hierarchical clustering on principal components (HCPC) algorithm is a combination of methods. First it uses principal component analysis (PCA) to reduce possible hidden high dimensionality in the data. Then it uses hierarchical and partitional clustering to describe similarity between the points. See

[20] for details and further references.

In Figure 6 we display the results of the clustering procedure. The panels are as displayed as above. In the upper leftmost one, appears the original data, and the remaining panels we display the results of the clustering process. The resulting clusters are in different colors.

(a) Data set
Figure 6: Clustering by the HCPC algorithm

In Table 6 we list the centers of the HCPC-clusters in the different distances.

Cluster 1 2 3 4
original (7.93, 6.63) (9.01, 7.51) (8.5, 9.08) (8.01, 10.01)
(7.9, 6.71) (9.06, 7.49) (8.54, 9.26) (7.67, 9.93)
(7.89, 6.71) (9.06, 7.48) (8.55, 9.26) (7.67, 9.91)
(7.87, 6.67) (9.05, 7.47) (8.55, 9.25) (7.66, 9.9)
(7.96, 7.52) (9.38, 7.71) (8.77, 8.04) (8.04, 9.98)
(7.85, 6.5) (9.02, 7.46) (8.51, 9.29) (7.54, 9.79)
Table 6: centers of the HCPC clusters in the different metrics

In Table 7, we can observe the size of the original clusters along with the size obtained with the transformations. Also, this table contains a column that is the accuracy, this value is defined as the number of cases correctly classified. Obviously, with the original data all the cases are correct, for that reason we have a accuracy of 1, but for the other transformations we fail.

Cluster 1 2 3 4 Accuracy
original 100 300 200 150 1
106 290 247 107 0.929
105 288 250 107 0.927
100 289 256 105 0.925
166 135 276 173 0.749
83 297 285 85 0.864
Table 7: Size of the HCPC clusters and accuracy obtained with different transformations

6 The Hierarchical Agglomerative Clustering (HAC) process

The Hierarchical Agglomerative Clustering (HAC) is designed to build a hierarchy of clusters using a bottom-up or agglomerative strategy. Here each data point is treated as a single cluster and then a distance (or similarity) between each of the clusters is computed in order to merge two or more into one (e.g., single linkage or average linkage), that is, the two most “similar” clusters are merged at each step. See [21] and [30] for details.

In Figure 7 we display the results of the hierarchical clustering algorithm. The panels are organized as above.

(a) Data set
Figure 7: Clustering by the hierarchical clustering algorithm

In Table 8 we list the centers associated to the resulting clusters. As usual, they are the arithmetic means of the clusters provided by the hierarchical clustering algorithm.

Cluster 1 2 3 4
original (7.93, 6.63) (9.01, 7.51) (8.5, 9.08) (8.01, 10.01)
(8.16, 6.71) (9.06, 7.61) (8.28, 9.26) (8.34, 10.15)
(8.38, 6.92) (9.22, 7.68) (8.41, 9.31) (7.38, 10.29)
(7.86, 6.69) (9.04, 7.44) (8.56, 9.09) (7.86, 9.98)
(8.72, 7.31) (8.32, 9.42) (8.07, 10.84) (8, 11.59)
(7.99, 5.89) (8.67, 7.1) (8.57, 9.01) (7.23, 10.14)
Table 8: centers of the clusters provided by the hierarchical clustering

In Table 9, we list the size of the original clusters along with the sizes of the clusters in the different metrics. We also list the accuracy of the classifier.

Comment At this point we mention that in all examples the accuracy can be provided because we new the true clusters before hand. Therefore, this accuracy measure is an indirect quality test of the different clustering methods. Other measures of cluster quality can be considered (e.g., the adjusted Rand index or the Normalized Mutual Information [5]).

Cluster 1 2 3 4 Accuracy
original 100 300 200 150 1
144 256 261 89 0.860
215 170 323 42 0.683
94 291 226 139 0.957
407 325 13 5 0.140
19 263 435 33 0.579
Table 9: Size of the Hierarchical clustering and accuracy obtained with different transformations

7 Concluding comments

In this note, we considered the Riemannian metric distance induced by a separable Bregman divergence, and studied the fixed rate quantization problem. We concisely described the Riemann-Bregman Voronoi diagrams that can be obtained from an ordinary Euclidean Voronoi diagram after a monotone isometric embedding. Then we reported on the partition-based (-means), soft (EM) and hierarchical (HCPC/HCA) clusterings with respect to these Riemann-Bregman distances. As was to be expected, the resulting clusters depend on the clustering method as well as on the metric used to measure the distance between points.


  • [1] Amari, S and Cichocki, A. (2010). Information theory of divergence functions, Bull. of the Polish Acad. Sci., 58, 183-195.
  • [2] Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry, Transl. of Mathem. Monographs, 191, Oxford Univ. Press, New York.
  • [3] Banerjee, A., Guo, X. and Wang, H. (2005). On the optimality of conditional expectation as Bregman predictor, IEEE Transactions on Information Theory, 51, 2664 - 2669
  • [4] Banerjee, A. Dhillon, I., Ghosh, I. Merugu, S and Modhi, D.s. (2008). A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

    The Journal of Machine Learning Research,

    8, 1919-1986.
  • [5] Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005) Clustering with Bregman divergences. Journal of machine learning research 6: 1705-1749.
  • [6] Baushke, H.H. and Borweinn, J.M. (1997). Legendre Functions and the Method of Random Bregman Projections, Journal of Convex Analysis, 4, 27-67.
  • [7] Baushke, H.H and Lewis, A. (1998). Dykstras algorithm with Bregman projections: A convergence proof, Optimization, 48, 409-427
  • [8] Baushke, H.H., Borwein, J.M and Combettes, P.L. (2003). Bregman monotone optimization algorithms, SIAM J. Control Optim., 42, 596-636.
  • [9] Baushke, H.H and Combettes, P.L. (2003). Construction of best Bregman approximation in reflective Banach spaces, Proc. Amer. Math. Soc., 131, 3757-3766
  • [10] Boissonnat, J-D., Nielsen, F and Nock, F. (2010). Bregman Voronoi Diagrams, Discrete Comput. Geom., 44, 281-307.
  • [11] Bregman, L. (1967). The relaxation method of finding common points of convex sets and its application to the solution of problems in convex programming, Comp. Math. Phys., USSR, 7, 2100-217.
  • [12] Butnariu, D. and Resmerita, E. (2005). Bregman distances, totally convex functions, and a method for solving operator equations, Abstract and Applied Analysis, 2006, Article ID 84919, 1-39.
  • [13] Calin, O. and Urdiste, C. Geometric Modeling in Probability and Statistics, Springer Internl. Pub., Switzerland, (2010).
  • [14] Censor, Y and Reich, S. (1998). The Dykstra algorithm with Bregman projections. Comm. Applied Analysis, 2, 407-419.
  • [15] Censor, Y. and Zaknoon, M. (2018). Algorithms and convergence results of projection methods for inconsistent feasibility problems: A review, arXiv:1802.07529v3 [math.OC].
  • [16] Csiszár, I. (2008). Axiomatic characterization of information measures, Entropy, 10, 261-273.
  • [17] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), 1-38.
  • [18] Fisher, A. (2010). Quantization and clustering with Bregman divergences

    , Journal of Multivariate Analysis,

    101, 2207-2221.
  • [19] Gzyl, H. (2018). Prediction in Riemannian metrics derived from divergence functions,
  • [20] Husson, F, Josse, J. and Pagès J. (2010). ——Principal Component Methods - Hierarchical Clustering - Partitional Clustering: Why Would We Need to Choose for Visualizing Data?” Technical Rept, Mathem. and Stats Dept., Univ Rennes.
  • [21]

    Kassambara, A. (2017). Practical Guide To Cluster Analysis in R. CreateSpace: North Charleston, SC, USA.

  • [22] Lang, S. Math talks for undergraduates, Springer, New York, (1999).
  • [23] Li, C., Song, W. and Yao, J-C. (2010). The Bregman distance, approximate compactness and convexity of Chebyshev sets in Banach spaces,Journal of Approximation Theory 162, 1128-1149.
  • [24] Linder, T. (2001). Learning theoretic methods in vector quantization, in Principles of Nonparametric Learning, I. Gyorfi, ED., CISM Lecture Notes, Springer, New York.
  • [25] Lawson, J.D. and Lim, Y. (2001)

    The Geometric mean, matrices, metrics and more

    , Amer. Math.,Monthly, 108, 797-812.
  • [26] McLachlan, G.J. and Peel, D. (2000). Finite Mixture Models, Wiley, New York.
  • [27] Moahker, M. (2005) A differential geometric approach to the geometric mean of symmetric positive definite matrices, SIAM. J. Matrix Anal. & Appl., 26, 735-747.
  • [28] Nielsen, F., and Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE transactions on Information Theory 55.6: 2882-2904.
  • [29] Nielsen, F. (2016). Hierarchical Clustering in

    Introduction to HPC with MPI for Data Science

    Springer, Berlin.
  • [30] Nielsen, F. (2018). An elementary introduction to information geometry. Available at
  • [31] Pollard, D. (2002). A user’s guide to measure theoretic probability, Cambridge Univ. Press., Cambridge.
  • [32] Schwartzmazn, A. (2015) Lognormal distribution and geometric averages of positive definite matrices, Int. Stat. Rev., 84, 456-486.