# Geometry and clustering with metrics derived from separable Bregman divergences

Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings. We investigate fixed rate quantization and its codebook Voronoi diagrams, and report on experimental performances of partition-based, hierarchical, and soft clustering algorithms with respect to these Riemann-Bregman distances.

## Authors

• 1 publication
• 3 publications
• 70 publications
12/01/2009

### Hodge Theory on Metric Spaces

Hodge theory is a beautiful synthesis of geometry, topology, and analysi...
06/19/2018

### On the Metric Distortion of Embedding Persistence Diagrams into separable Hilbert spaces

Persistence diagrams are important descriptors in Topological Data Analy...
04/02/2018

### Every metric space is separable in function realizability

We first show that in the function realizability topos every metric spac...
05/11/2019

### Embeddings of Persistence Diagrams into Hilbert Spaces

Since persistence diagrams do not admit an inner product structure, a ma...
08/05/2018

### Prediction in Riemannian metrics derived from divergence functions

Divergence functions are interesting discrepancy measures. Even though t...
09/13/2017

### Geometric clustering in normed planes

Given two sets of points A and B in a normed plane, we prove that there ...
10/01/2020

### A survey of mass partitions

Mass partition problems describe the partitions we can induce on a famil...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction and Preliminaries

### 1.1 Riemannian geometry induced by a Bregman divergence

In [19] it was shown how to put a Riemannian structure on a convex open subset of derived from a Bregman divergence

 δΦ(x,x′)=Φ(x)−Φ(x′)−(x−x′)⊤∇Φ(x′),

with real-valued generator function . There where is an interval (bounded or unbounded) in the real line, and the separable Bregman generator with is a three times continuously differentiable function. This set up was chosen for computational convenience, and . In this case the Riemannian metric on is given by .

The Riemannian distance in induced by that metric is given by

 dϕ(x,x′)2=K∑j=1(h(xj)−h(x′j))2, (1.1)

where is a an increasing continuously differentiable function defined as the primitive of (also called antiderivative). Its compositional inverse will be denoted by (with ) and we shall use the same letter for the extensions of the functions to , that is, we shall write to denote the transpose of and the same for . We term distance the Riemann-Bregman distance.

The equation of the Riemannian geodesic between and was reported in [19] as , where the are the constants of integration such that and . That is, the geodesics are expressed as

 γi(t)=h−1((1−t)h(xi)+th(x′i)),t∈[0,1]. (1.2)

Notice that by introducing the Legendre convex conjugate

 ϕ∗(y) = supx∈J{yx−ϕ(x)}, (1.3) = yϕ′−1(y)−ϕ(ϕ′−1(y)), (1.4)

with (and ), we can write the dual Riemannian metric as , and we have the following identity: .

It follows that we can express the Bregman-Riemannian distance either by using the primal affine -coordinate system, or by using the dual affine -coordinate system (with and ):

###### Lemma 1.1 (Dual Riemann-Bregman distances)

We have , and

 dϕ(x,x′)=dϕ∗(y,y′)=dϕ∗(∇Φ(x),∇Φ(x′)). (1.5)
###### Example 1.1

Consider (extended Shannon negentropy). We have and . It follows that , where is a constant. The Legendre conjugate is and , where is a constant. We have , and . We check that since .

Figure 1 illustrates the Legendre duality for defining the Riemann-Bregman distances in the dual coordinate systems induced by the Legendre transformation.

###### Remark 1.1

In information geometry [2, 13], a smooth dissimilarity measure on a manifold induces a dual structure defined by a pair of affine connection

coupled to the metric tensor

. The Riemannian geometry is a self-dual information-geometric structure with the dual connections coinciding with the Levi-Civita connection . Bregman divergences yield dually-flat information-geometric spaces (meaning -flat and -flat) although the Riemann-Bregman geometry is usual curved [30] (meaning -curved).

### 1.2 Examples of Riemann-Bregman distances

Let us now list a few examples. These will be used below to exemplify the geometric objects defined by the distances and to exemplify the clustering algorithms.

#### 1.2.1 Case 1: ϕ=x2/2

In this case and The induced distance is the standard squared Euclidean distance

 dϕ(x,y)2=K∑i=1(xi−yi)2.

#### 1.2.2 Case 2: ϕ(x)=ex

Now The geodesic distance between and is given by

 dϕ(x,y)2=K∑i=1(eyi/2−exi/2)2.

#### 1.2.3 Case 3: ϕ(x)=e−x

The geodesic distance between and is given by

 dϕ(x,y)2=K∑i=1(e−yi/2−e−xi/2)2.

#### 1.2.4 Case 4: ϕ(x)=xlnx (Shannon negentropy)

This time our domain is and , and . The geodesic distance between and is

 dϕ(x,y)2=2K∑i=1(√yi−√xi)2.

This looks similar to the squared Hellinger distance used in probability theory. See Pollard’s

[31]

#### 1.2.5 Case 5: ϕ(x)=−lnx (Burg negentropy)

To finish, we shall consider another example on . We have , and . Now, the distance between and is now given by

 dϕ(x,y)2=K∑i=1(lnyi−lnxi)2.

To continue, let us mention that the thrust in [19]

was to examine the prediction process when considering random variables taking values in metric space

that is to examine the concepts of best predictor of a random variable taking values in when the prediction error is measured in the distance. The special relation of the distance (1.1) to the Euclidean distance makes the following result clear:

###### Theorem 1.1 (Separable Riemann-Bregman centroid)

Given a collection of points in the point the realizes the minimum of

 N∑n=1dϕ(xn,ξ)2%overξ∈M

is unique and given by

 ^x=H(1NN∑n=1h(xn)). (1.6)

Observe that formula Eq. 1.6 coincides with the left-sided Bregman centroid [28]. However, left-sided Bregman -means and Riemann-Bregman -means will differ in the assignment steps because they rely on different dissimilarity measures.

We shall make extensive use of this definition when computing the centers of the clusters in the various metrics. This is done in Section 4 below, where we start from a given data set and organize it in clusters according to different clustering procedures for each of the distances, and compute the centers of the resulting clusters each distance as in (1.6).

In Section 2 we consider some standard geometric objects, like Balls, Voronoi cells, in the different distances, whereas in Section 3 we consider the problem of fixed rate quantization for valued random variables .

## 2 Balls and Voronoi cells in the derived metrics

### 2.1 Balls

The examples listed show that is a non-linear isometry. This allows us to describe balls in as

 Bϕ(x,r)={y∈M:dϕ(y,x)≤r}=h−1(B(h(x),r)). (2.1)

Here, denotes the ball in of radius centered at . The space of balls can also be defined using a potential function following [10].

### 2.2 Voronoi diagrams

Let be some subset The Voronoi cell in of is defined by

 vorϕ(yi)={x∈M : dϕ(x,yi)≤dϕ(x,yj),∀yj∈M}. (2.2)

And as above, if we denote by the Voronoi cell of in the Euclidean metric, it is clear that

 vorϕ(yi)=h−1(vor(h(yi))). (2.3)

A few graphical examples are shown in Figure 2.

Figure 3 displays four types of Voronoi diagrams induced by a Bregman generator; The pictures illustrate the fact that the symmetrized Bregman Voronoi diagram is different from the (symmetric metric) Riemann-Bregman Voronoi diagram.

## 3 Fixed rate quantization

Let be provided with a metric derived from a divergence. Let be a given integer and be a given set, called the codebook and its elements are called the

code vectors

. A fixed rate quantizer (with codebook ) is a measurable mapping The cardinality of is called the quantization rate. Two quantizers have the same rate if their codebooks have the same cardinality

For a valued random variable defined on a probability space the distortion of in quantizing is given by the expected reconstruction square error:

 Dϕ(X,q(X))2=E[dϕ(X,q(X))2]. (3.1)

If we are interested in only one valued random variable we may replace by think of as the Borel sets of think of as the identity mapping, and think of as a probability measure on Also, our notation is slightly different from that of Linder [24] in that we put a square into the definition of the square error. Since is constant on the “level” sets for the former expectation can be written as

 Dϕ(X,q(X))2=E[E[dϕ(X,q(X))2|σ(q)]].

Here we denote by the subalgebra of generated by the partition of induced by the quantizer When holds for all blocks of the partition, the conditional expectation takes values

 E[dϕ(X,q(X))2|σ(q)]=1P(Si)∫Sidϕ(X,yi)2dPon the setSi,i=1,...,N. (3.2)

Notice that given there is a quantizer having the same rate and realizing an error smaller than The existence and uniqueness of such quantizer is given by the following result in [19].

###### Theorem 3.1

Let be a subalgebra of Let be such that (Lebesgue space of functions that are square integrable). Then there exists a unique measurable random variable, denoted by satisfying

 Eϕ[X|G]=argmin{Dϕ(X,Y)2|Y∈G,h(Y)∈L2}

The result mentioned a few lines above can be stated as

###### Theorem 3.2

Let be a quantizer of rate Suppose that the blocks of the partition of determined by are such that for all Then

 q∗(X)=Eϕ[X|G]=N∑i=11P(Si)Eϕ[X;Si]ISi(X):=N∑i=1yiISi(X). (3.3)

is the quantizer with codebook whose elements are which makes the error smaller over all codebooks which determine the same partition of

An interesting way to associate sets to a codebook of rate is by means of Voronoi cells. are the Voronoi cells determined by their intersections are polyhedra of dimension lower than (faces), and the probability does not change them, that is if whenever If we put for the interior of the Voronoi cells, they generate a sub-algebra of such that differs from by a set of probability

In the notation of Theorem 3.2 we have

###### Corollary 3.1

If the algebra is generated buy the interiors of the Voronoi cells determined by a codebook then

 y∗i=1P(Si)Eϕ[X;Si]=1P(Bi)Eϕ[X;Bi]∈Bi⊂Si,∀i=1,...,N.

The proof is clear: The integral is a weighted average of points in a convex set.

## 4 Riemann-Bregman Clustering

In this section we shall consider a dimensional data set in which several clusters (indicated) by different colors in the figures displayed below are apparent. We shall suppose that the base space in which the random variable takes values comes equipped with one of the five distances listed in Section 1.2.

We shall suppose that the points belong one of the possible manifold on which a Riemann-Bregman metric has been defined, and the object of the exercise is to identify the clusters. To assign the data points into clusters, we shall apply the

-means, the EM (Expectation-Maximization), the HCPC (Hierarchical Clustering Principal Component) and the HAC (Hierarchical Agglomerative Clustering) techniques. Once the clusters are identified, we determine their centers by invoking Theorem (

1.1 and (1.6). In the four examples described below, we shall see how the choice of the metric determines the clusters.

The points were generated all in in order to be able to visually compare the result of the clustering procedures in the distances listed on Section 1.2.

Strategy:
Step 1 Generate a cloud in
Step 2 Apply the mapping and form the cloud in
Step 3 Apply the clustering algorithms. If are the clusters of the new cloud, then the clusters in the original coordinates are .
Step 4 Plot the “centers” in the original cloud.

Notice that in all examples, the mappings and are invertible, and are applied componentwise. The aim of the exercise is to see how the choice of the metric determines the clusters.
In particular, we observe that although the centroid update rule coincide with the left-sided Bregman clustering [28], the assignment steps of the left-sided Bregman clustering and the Riemann-Bregman clustering are different.

### 4.1 The data set

For all the the numerical experiments considered below, the “original” data is composed of samples of four pairs of independent normal variables with parameters specified in Table 1 shown next.

Thus we obtain four point clouds with some overlap. The number of points in the four original groups is, respectively, These numbers will be used as reference to check upon the performance of each clustering technique when the distance between points is measured in each of the five different Riemann-Bregman metrics.

### 4.2 Clustering by k-means

The k-means clustering algorithm is an iterative methodology that assigns data points to a cluster, in such a way that the sum of the squared distances between the data points and the cluster’s centroid (defined as the arithmetic mean) is minimal. The less variation we have within clusters, the more homogeneous the data points are within the same cluster.

In Figure 4 we present the result of the application of the means procedure to the original data set depicted in panel (a)a. The caption of each of the panels describes the function that has been applied to the data before applying the clustering procedure. Once the clusters are determined, the in-cluster mean is computed and the whole set is mapped back to the original space.

We add at this point that if one suspects (or knows) that the data sets are in different scales, it might be convenient to standardize the data. This is important since clustering algorithms are based on distance comparison procedures. In [5], Banerjee et al. highlight the bijection between regular exponential families and a class of Bregman divergences called regular Bregman divergences. They demonstrate experimentally that the -means clustering perform best in practice on mixture of exponential families (with cumulant function ) when the Bregman divergence is chosen for the dual convex conjugate .

The centers of the original data set and those of the clusters are listed in Table 2.

From the table it is clear that the Euclidean distance and the logarithmic distance perform better than the others in capturing the original centers of the data. Actually, the centers of the original data are the minimizers to the original data set in the Euclidean distance and are the empirical means of the normals used to generate the data.

As a further test of the clustering algorithm we provide a “headcount” of points in each cluster and compare the numbers with the number of original points in each cluster. The result is displayed in Table 3.

### 4.3 Clustering with the EM algorithm

The expectation maximization (EM) algorithm [17]

is an iterative method to estimate the parameters in a statistical model, in which the model depends on unobserved variables. The EM iteration alternates between performing an expectation (E) step and a maximization (M) step. In the first step a likelihood function is created and in the second it is maximized. See

[26] for details.

In the EM methodology an initial guess is made for the model’s parameters and a distribution is created, this is the E-step or Expected distribution, then the parameters are updated in the (M-step) maximizing the expectation computed in the E-step. Both steps are repeated until the convergence is clear (distribution that doesn’t change from the E-step to the M-step)

In the upper left panel of Figure 5 we display the original clusters (the same as above) for the sake of easy visual comparison, as well as the clusters obtained by applying the EM clustering methodology.

Form Table 5 it becomes clear how the distance between points changes plus the type of clustering algorithm determine the centers of the clusters.

In Table (5

), we list the size of the original clusters along with the size of the clusters obtained in the different distances. Also, in the last column of the table we list the accuracy of the procedure, defined as the number of cases correctly classified. Obviously, with the original data all the cases are correct, but note that since the distance between the points changes with the metric, it is natural that the accuracy of the procedure would vary.

## 5 Clustering using the HCPC algorithm

The hierarchical clustering on principal components (HCPC) algorithm is a combination of methods. First it uses principal component analysis (PCA) to reduce possible hidden high dimensionality in the data. Then it uses hierarchical and partitional clustering to describe similarity between the points. See

[20] for details and further references.

In Figure 6 we display the results of the clustering procedure. The panels are as displayed as above. In the upper leftmost one, appears the original data, and the remaining panels we display the results of the clustering process. The resulting clusters are in different colors.

In Table 6 we list the centers of the HCPC-clusters in the different distances.

In Table 7, we can observe the size of the original clusters along with the size obtained with the transformations. Also, this table contains a column that is the accuracy, this value is defined as the number of cases correctly classified. Obviously, with the original data all the cases are correct, for that reason we have a accuracy of 1, but for the other transformations we fail.

## 6 The Hierarchical Agglomerative Clustering (HAC) process

The Hierarchical Agglomerative Clustering (HAC) is designed to build a hierarchy of clusters using a bottom-up or agglomerative strategy. Here each data point is treated as a single cluster and then a distance (or similarity) between each of the clusters is computed in order to merge two or more into one (e.g., single linkage or average linkage), that is, the two most “similar” clusters are merged at each step. See [21] and [30] for details.

In Figure 7 we display the results of the hierarchical clustering algorithm. The panels are organized as above.

In Table 8 we list the centers associated to the resulting clusters. As usual, they are the arithmetic means of the clusters provided by the hierarchical clustering algorithm.

In Table 9, we list the size of the original clusters along with the sizes of the clusters in the different metrics. We also list the accuracy of the classifier.

Comment At this point we mention that in all examples the accuracy can be provided because we new the true clusters before hand. Therefore, this accuracy measure is an indirect quality test of the different clustering methods. Other measures of cluster quality can be considered (e.g., the adjusted Rand index or the Normalized Mutual Information [5]).

In this note, we considered the Riemannian metric distance induced by a separable Bregman divergence, and studied the fixed rate quantization problem. We concisely described the Riemann-Bregman Voronoi diagrams that can be obtained from an ordinary Euclidean Voronoi diagram after a monotone isometric embedding. Then we reported on the partition-based (-means), soft (EM) and hierarchical (HCPC/HCA) clusterings with respect to these Riemann-Bregman distances. As was to be expected, the resulting clusters depend on the clustering method as well as on the metric used to measure the distance between points.

## References

• [1] Amari, S and Cichocki, A. (2010). Information theory of divergence functions, Bull. of the Polish Acad. Sci., 58, 183-195.
• [2] Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry, Transl. of Mathem. Monographs, 191, Oxford Univ. Press, New York.
• [3] Banerjee, A., Guo, X. and Wang, H. (2005). On the optimality of conditional expectation as Bregman predictor, IEEE Transactions on Information Theory, 51, 2664 - 2669
• [4] Banerjee, A. Dhillon, I., Ghosh, I. Merugu, S and Modhi, D.s. (2008). A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

The Journal of Machine Learning Research,

8, 1919-1986.
• [5] Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005) Clustering with Bregman divergences. Journal of machine learning research 6: 1705-1749.
• [6] Baushke, H.H. and Borweinn, J.M. (1997). Legendre Functions and the Method of Random Bregman Projections, Journal of Convex Analysis, 4, 27-67.
• [7] Baushke, H.H and Lewis, A. (1998). Dykstras algorithm with Bregman projections: A convergence proof, Optimization, 48, 409-427
• [8] Baushke, H.H., Borwein, J.M and Combettes, P.L. (2003). Bregman monotone optimization algorithms, SIAM J. Control Optim., 42, 596-636.
• [9] Baushke, H.H and Combettes, P.L. (2003). Construction of best Bregman approximation in reflective Banach spaces, Proc. Amer. Math. Soc., 131, 3757-3766
• [10] Boissonnat, J-D., Nielsen, F and Nock, F. (2010). Bregman Voronoi Diagrams, Discrete Comput. Geom., 44, 281-307.
• [11] Bregman, L. (1967). The relaxation method of finding common points of convex sets and its application to the solution of problems in convex programming, Comp. Math. Phys., USSR, 7, 2100-217.
• [12] Butnariu, D. and Resmerita, E. (2005). Bregman distances, totally convex functions, and a method for solving operator equations, Abstract and Applied Analysis, 2006, Article ID 84919, 1-39.
• [13] Calin, O. and Urdiste, C. Geometric Modeling in Probability and Statistics, Springer Internl. Pub., Switzerland, (2010).
• [14] Censor, Y and Reich, S. (1998). The Dykstra algorithm with Bregman projections. Comm. Applied Analysis, 2, 407-419.
• [15] Censor, Y. and Zaknoon, M. (2018). Algorithms and convergence results of projection methods for inconsistent feasibility problems: A review, arXiv:1802.07529v3 [math.OC].
• [16] Csiszár, I. (2008). Axiomatic characterization of information measures, Entropy, 10, 261-273.
• [17] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), 1-38.
• [18] Fisher, A. (2010). Quantization and clustering with Bregman divergences

, Journal of Multivariate Analysis,

101, 2207-2221.
• [19] Gzyl, H. (2018). Prediction in Riemannian metrics derived from divergence functions, http://arxiv.org/abs/1808.01638.
• [20] Husson, F, Josse, J. and Pagès J. (2010). ——Principal Component Methods - Hierarchical Clustering - Partitional Clustering: Why Would We Need to Choose for Visualizing Data?” Technical Rept, Mathem. and Stats Dept., Univ Rennes.
• [21]

Kassambara, A. (2017). Practical Guide To Cluster Analysis in R. CreateSpace: North Charleston, SC, USA.

• [22] Lang, S. Math talks for undergraduates, Springer, New York, (1999).
• [23] Li, C., Song, W. and Yao, J-C. (2010). The Bregman distance, approximate compactness and convexity of Chebyshev sets in Banach spaces,Journal of Approximation Theory 162, 1128-1149.
• [24] Linder, T. (2001). Learning theoretic methods in vector quantization, in Principles of Nonparametric Learning, I. Gyorfi, ED., CISM Lecture Notes, Springer, New York.
• [25] Lawson, J.D. and Lim, Y. (2001)

The Geometric mean, matrices, metrics and more

, Amer. Math.,Monthly, 108, 797-812.
• [26] McLachlan, G.J. and Peel, D. (2000). Finite Mixture Models, Wiley, New York.
• [27] Moahker, M. (2005) A differential geometric approach to the geometric mean of symmetric positive definite matrices, SIAM. J. Matrix Anal. & Appl., 26, 735-747.
• [28] Nielsen, F., and Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE transactions on Information Theory 55.6: 2882-2904.
• [29] Nielsen, F. (2016). Hierarchical Clustering in

Introduction to HPC with MPI for Data Science

Springer, Berlin.
• [30] Nielsen, F. (2018). An elementary introduction to information geometry. Available at https://arxiv.org/abs/1808.08271
• [31] Pollard, D. (2002). A user’s guide to measure theoretic probability, Cambridge Univ. Press., Cambridge.
• [32] Schwartzmazn, A. (2015) Lognormal distribution and geometric averages of positive definite matrices, Int. Stat. Rev., 84, 456-486.