Log In Sign Up

Efficient Information Theoretic Clustering on Discrete Lattices

by   Christian Bauckhage, et al.

We consider the problem of clustering data that reside on discrete, low dimensional lattices. Canonical examples for this setting are found in image segmentation and key point extraction. Our solution is based on a recent approach to information theoretic clustering where clusters result from an iterative procedure that minimizes a divergence measure. We replace costly processing steps in the original algorithm by means of convolutions. These allow for highly efficient implementations and thus significantly reduce runtime. This paper therefore bridges a gap between machine learning and signal processing.


page 3

page 5

page 7


Undecidability of Underfitting in Learning Algorithms

Using recent machine learning results that present an information-theore...

An Information-Theoretic Perspective on Overfitting and Underfitting

We present an information-theoretic framework for understanding overfitt...

Bayesian estimation of information-theoretic metrics for sparsely sampled distributions

Estimating the Shannon entropy of a discrete distribution from which we ...

Information-Theoretic Analysis of Minimax Excess Risk

Two main concepts studied in machine learning theory are generalization ...

On the optimality of kernels for high-dimensional clustering

This paper studies the optimality of kernel methods in high-dimensional ...

Co-Clustering via Information-Theoretic Markov Aggregation

We present an information-theoretic cost function for co-clustering, i.e...

Nonparametric Extrema Analysis in Time Series for Envelope Extraction, Peak Detection and Clustering

In this paper, we propose a nonparametric approach that can be used in e...

1 Motivation and Background

Clustering is an important tool of the trade in data mining and pattern recognition. Numerous algorithms have been proposed and an extended review of their characteristics and merits would be beyond the scope of this paper. However, we note that many algorithms used to cluster numerical data tacitly assume the data to be drawn from distributions that are continuous.

Alas, many data are explicitly discrete. Consider, for instance, the pixels of a standard digital image. Owing to modern recording technologies, domain and range of such images are discrete and we may distinguish three elementary cases:

(i) In a binary (black and white) image, each pixel is switched either on or off. Pixels that are switched off need not be represented; pixels that are switched on are characterized by their two-dimensional grid coordinate .

(ii) In an intensity (gray value) image, pixels are located on a two-dimensional grid and assume discrete gray values; they are thus fully characterized by a three-dimensional grid coordinate .

(iii) In a color (RGB) image, each pixel has a fixed location on a two-dimensional grid and maps to a discrete three-dimensional color vector; color pixels are thus fully characterized by a five-dimensional grid coordinate


Digital image therefore correspond to sets of data points that reside on low dimensional lattices. If such data need to be clustered into coherent groups, it is common practice to resort to algorithms that perform continuous optimization and thus will likely produce cluster centers that are located in between lattice points. In most practical applications, this is perfectly acceptable and our concern in this paper is therefore not with precision but with efficiency.

In this paper, we adapt an algorithm for continuous, information theoretic clustering to data on discrete lattices. Our basic idea is to connect recent machine learning techniques with well established methods for digital signal processing.

We show that recasting essential steps of information theoretic clustering in terms of efficient convolution operations significantly reduces its runtime and enables us to quickly compute cluster centers in image data. To simplify our discussion, we focus on applications in binary image processing and provide qualitative examples to illustrate behavior and performance of the algorithm. Quantitative experiments on a large data base of binary images reveal that our accelerated algorithm is two orders of magnitude faster than the original one. Moreover, we demonstrate that weighted clustering is straightforward in our framework.

Next, we review information theoretic clustering and discuss its properties. Then, we derive our accelerated version and present experimental results that underline its favorable performance. Finally, we discuss how the proposed algorithm immediately allows for efficient weighted clustering and conclude this paper with a summary and an outlook to practical applications of our approach.

2 Information Theoretic Clustering

The particular approach to information theoretic clustering (ITC) we consider in this paper was introduced in [11, 12, 13]. Given a set

of data points drawn from a continuous domain, it applies an entropy-based measure to determine clusters in the data. Similar to -means clustering, ITC starts from a random initialization of the cluster centers

and iteratively updates these codebook vectors in order to minimize the entropy measure under consideration.

(a) a set of 2D data points
(b) pdf of
(c) initial pdf of
(d) optimized pdf of
Figure 1: Information theoretic clustering of a binary image: (a) a binary shape can be understood as a set

of 2D data points; (b) visualization of a Parzen estimate of the probability density function

of ; (c) a randomly initialized set of codebook vectors (30 in this case) typically has a pdf that diverges from ; (d) an iterative minimization the Cauchy-Schwartz divergence of and moves the codebook vectors towards modes of .

2.1 Iterative Entropy Minimization

ITC assumes that two probability density functions and are available that characterize the sets and , respectively. Given these densities, ITC optimizes the placement of codebook vectors by means of minimizing the Cauchy-Schwartz divergence


between and . Note that the last term of does not depend on . The entropies in its first and second term are given by


and are known as Renyi’s cross entropy between and and Renyi’s entropy of , respectively [14].

The fundamental idea in [11, 12, 13] is to model the densities and using their Parzen estimates


where and

are Gaussian kernels of variance

and , respectively.

Applying the Gaussian product theorem and using properties of the so called overlap integral, and can be written as


where and .

Therefore, differentiating with respect to , equating to zero, and rearranging the resulting terms yields the following fix point update rule for every codebook vector in


where the constant .

2.2 Properties of ITC

In [12], it is shown that the popular mean shift procedure [7, 8] is a special case of ITC. The cross entropy term in the Cauchy-Schwartz divergence causes ITC to be a mode seeking algorithm. The quantity , however, can be understood as the potential energy of a repellent force between the codebook vectors [12]. After convergence of the algorithm, the codebook vectors will therefore be located close to local modes but because of the repellent force they will not have collapsed into only a few modes.

Figure 1 illustrates how ITC determines a codebook of 30 vectors from a binary shape image. Every foreground pixel of the original image in the leftmost panel is understood as a 2D vector ; together they form the data set . Using (2) to compute an estimate of the density of yields the result shown in the second panel. The third panel demonstrates that a random initialization of a set of codebook vectors typically yields a density function that diverges from the density of the data points. Iterative updates of the codebook vectors using (4) will minimize this divergence. The rightmost panel of the figure shows how, upon termination of the ITC algorithm, the elements in have moved towards local modes of and that the two densities and are now in close agreement. Since the approach prevents codebook vectors from collapsing into a few modes, the final distribution is found to trace out local principal curves of the binary shape.

For data points and codebook vectors, every iteration of ITC requires efforts of . However, in contrast to related clustering algorithms of similar complexity such, for instance, -means, there is a large constant that factors into the overall runtime of ITC. Each of the updates in (4) requires the computation of distances and distances and each distance computation entails the evaluation of an exponential function in a Gaussian kernel.

With respect to modern practical applications, for instance in large scale image processing, the algorithm will therefore be too slow to be useful. It would neither allow for an efficient processing of a few very large images nor would it apply to the processing of large amounts of simple binary images of only moderate size.

3 ITC on Discrete Lattices

Addressing the efficiency concerns raised in the previous section, we now derive an alternative update rule for information theoretic clustering. Our key contribution at this point is to model the Parzen density estimates in (2) and (3) using methods from signal processing. Mathematically, our result is equivalent to the original formulation but it is cast in a form that allows for avoiding repeated computations of distances and exponentials. Especially for data that reside on discrete lattices our new formulation of ITC enables highly efficient implementations based on discrete filter masks and therefore considerably accelerate the procedure. To ease our discussion, we will again focus on examples from binary image processing but the methods discussed below immediately generalize to higher dimensional lattices, too.

3.1 Entropy Minimization Using Convolutions

Recall that the convolution of two functions and is defined as

where integration is over the whole, possibly higher dimensional domain.

To begin our derivation, we note that the convolution of a function with a shifted delta impulse shifts the function

A Parzen density estimate using a sum of Gaussians each centered at data points is therefore equivalent to a sum of shifted delta impulses convolved with a Gaussian

With respect to applications in binary image processing, this observation is pivotal because every active binary pixel can be understood as a discrete delta impulse located at some coordinate on a regular 2D lattice. In a slight abuse of notation, we may therefore write the density function of a set of pixels in (2) as

Convolution with a Gaussian is a standard and well understood operation in low-level signal processing. In digital image processing, it typically realized using discrete 2D filter masks. In cases where is small, the Gaussian convolution kernel can be separated into two 1D convolution kernels. Successively applying corresponding pre-computed 1D filters of small spatial support will then significantly reduce computation time. In cases where is large, it is more efficient to resort to multiplication in the Fourier domain or to apply recursive filters [9, 17]. The density of a set of binary pixels can therefore be computed very efficiently; although it will still require efforts proportional to the number of pixels, the overall effort per pixel will become very small.

Note that the set of codebook vectors , too, can be understood as a collection of delta impulses on a discrete lattice. With respect to binary image processing, this is to say that can be understood as an image containing foreground pixels. Accordingly, its density estimate

can be computed just as efficiently. On a discrete lattice such as an image pixel array, the two entropy integrals and are therefore readily available.

It thus remains to derive a modified update rule for the codebook vectors that assumes the role of (4). To this end, we have another look at (3) and note that


Using this, we differentiate the divergence in (1) with respect to , equate to zero, rearrange the resulting terms, and obtain the following update rule:


where .

Note, that in contrast to (4), every Gaussian kernel in (6) is of the form . That is, they are now centered at , are all of the same variance, and do not specifically depend on the or the other anymore.

On a discrete lattice, this allows for further acceleration. Following common practice in digital signal processing, we can approximate the continuous Gaussian by means of a pre-computed discrete filter mask of finite support. Instead of evaluating the integrals in (6) over the whole lattice, it then suffices to consider a neighborhood of whose size depends on . That is, on a discrete lattice, it is valid to assume the approximation

and similar expressions hold for the other integrals in (6).

Concluding our derivation up to this point, we observe that, in contrast to the original update algorithm for ITC, our version for discrete lattices (i) avoids the computation of Euclidean distances and (ii) does not necessitate explicit evaluation of exponentials. On a discrete lattice both these steps can be replaced by highly efficient convolution operations using pre-computed filter masks.

3.2 ITC for Binary Images

In this subsection, we discuss implementation details of our accelerated ITC algorithm. To simplify our discussion, we focus once again on binary images and show how they may be processed efficiently.

For binary image processing, our efficient version of the original ITC algorithm draws on an isomorphism between a binary image matrix and a finite set of vectors . That is, in an implementation of our algorithm as well as in our discussion, we capitalize on the isomorphism

so that we may use and interchangeably. Of course, a similar equivalence also holds for the set of codebook vectors so that we may refer to them using either or .

1:  randomly initialize , for instance by sampling from the foreground pixels of
2:  compute the pdf using
3:  repeat
4:     compute the pdf using
5:     compute the entropy
6:     compute the entropy
7:     for  do
8:        compute using (6)
9:     end for
10:     update
11:  until  or
Algorithm 1 Efficient ITC for binary image quantization

Given these prerequisites, Algorithm 1 summarizes how to use ITC in order to cluster binary images into coherent segments. The procedure begins with randomly sampling codebook vectors from the pixels of the shape image . Discrete approximations and of the two probability density functions and are (pre-)computed using convolutions.

Also, since the algorithm is tailored to data on discrete lattices, the integrals in and are replaced by finite sums over the elements of and . With these ingredients, the codebook vectors are updated according to equation (6). These steps are iterated until the Cauchy-Schwartz divergence between and falls below a user defined threshold or the updates become very small. In practice, we found that it typically requires less than 20 iterations for the procedure to converges to a stable solution.

4 Performance Evaluation

Next, we present and discuss results obtained from experimenting with our accelerated version of information theoretic clustering and the original algorithm according to [11, 12, 13]. In addition, we present results obtained from applying -means clustering to the pixels of binary images in order to provide a well known baseline for comparison.

Figure 2 provides an example of the placement of as well as of codebook vectors on a binary shape. The results shown here are representative for what we found in our experiments. The original version of ITC and our adapted algorithm produce almost identical results. This was to be expected, since mathematically the two are equivalent. The minor differences visible in the figure can be attributed to the fact, that the original ITC algorithm performs a continuous optimization, while our version places the codebook vectors on a discrete grid of coordinates using filter masks of finite support.

(a) -means
(b) original ITC
(c) accelerated ITC
Figure 2: Examples of the placement of and codebook vectors on a binary image of 26.996 foreground pixels. For , -means and ITC appear to produce similar results. For , however, the mode seeking property of ITC becomes apparent. While -means produces an equidistant covering of the shape, ITC places the cluster centers along its principal curve. Also, while -means cluster centers do not necessarily reside on the shape, those resulting from ITC always do. Finally, the slight differences between the two ITC approaches are due to our use of Gaussian filters of rather small spatial support in the accelerated version. Larger filter masks would increase precision at the cost of increased runtime.

Interestingly, for a larger number of clusters the codebook obtained from -means clustering is rather similar to the ones obtained from ITC. However, -means clustering considers variance minimization for optimization and places cluster centers at local means. It therefore tends to produce blob-like clusters of comparable sizes with equidistant centers. This becomes especially apparent, if a smaller number of clusters are to be determined. In such cases, ITC yields results that are noticeably different. This difference is further stressed in Fig. 3 which, in addition to the the different cluster centers produced by -means and ITC, also shows the corresponding segmentations of the shape.

(a) -means
(b) accelerated ITC
Figure 3: . Examples of clusters or shape segments that result from computing codebook vectors using either -means or accelerated ITC.
Figure 4: Left panel: average computing time per iteration w.r.t. shape size. The plots indicate how accelerated ITC, original ITC, and

-means perform in extracting 100 codebook vectors from shapes of growing size. Right panel: average computing time per iteration w.r.t. number of codebook vectors extracted from a shape of about 10.000 pixels. Original ITC and k-means show linear growth (note the logarithmic scale); for accelerated ITC, our way of choosing local support regions (see text) even decreases runtime.

Figure 4 summarizes quantitative results obtained from runtime experiments with the MPEG-7 shape data set. All our experiments on runtime behavior of the different algorithms were carried out on an Intel Core i3 processor. The variance parameter in our accelerated ITC algorithm was set dynamically, taking into account the number of pixels in a shape and the number of codebook vectors to be produced. By choosing , a smaller codebook will lead to larger discrete filter masks applied in computing ITC according to (6). Note that in order to benefit from the idea of using localized filter masks of finite support, it may in fact be preferable to consider convolution masks of smaller diameter. The parameter was set to . Finally, all the results we present here were obtained form averaging over 10 trials with different random initializations of .

Both, the left and right panel of Figures 4 underline the runtime improvement of accelerated ITC over the original algorithm. For larger choices of and , the time per iteration of our accelerated algorithm is two orders of magnitude smaller than the time required by the original version. The declining time per iteration of accelerated ITC for a growing number of clusters is a consequence of the above choice of and its impact on the size of filter masks.

5 Weighted ITC

In addition to its favorable runtime characteristics, our adapted ITC algorithm also easily extends towards efficient weighted clustering. Our contribution in this section is to replace the simple Parzen estimate of the density of the data

by a more flexible Gaussian mixture model. As we shall see next, for data on a discrete lattice this does not cause any noticeable overhead for our algorithm.

If we extend the model for in equation (2) to the more general form

we immediately recognize yet another convolution

We also note that in our adapted version of the ITC algorithm there is only one place where is computed explicitly. This happens in the second preparatory step just outside of the loop that iteratively updates the codebook vectors.

With respect to the quantization of binary images, we can therefore introduce a weighting scheme for the pixels in at almost no costs. For instance, we may apply efficient distance transforms [6] to the foreground pixels of to assign higher weights to the pixels in the interiors of a shape. If we denote this transform by , the second step in our algorithm simply becomes .

(a) binary image
(b) chamfer transform
(c) density
(d) 30 cluster centers
Figure 5: Example of weighted, accelerated information theoretic clustering of the pixels of a binary image: (a) the original binary image ; (b) the chamfer transform yields a gray value image where a pixel’s shade of gray indicates its distance to the shape boundary; (c) the correspondingly weighted pdf of ; (d) codebook vectors resulting from weighted ITC; using a distance transform for weighting, causes the cluster centers to move closer to local principal axes of the shape than in the unweighted case in Fig. 2.

Figures 5 and 6 show examples of the effect of assigning higher weights to pixels in the interior of a shape. In contrast to the normal variant of ITC, the weighted version causes the cluster centers to be concentrated closer to the shape interiors. Since the weighting scheme only impacts the mode seeking component of the algorithm but not the term that causes cluster centers to repel each other, the resulting codebook vectors trace local principal curves of the shape considered here.

Figure 6: . Examples of clusters or shape segments after computing codebook vectors using weighted ITC based on the distance transform illustrated in Fig. 5.

6 Summary and Outlook

This paper introduced an efficient algorithm for clustering data that reside on a discrete lattice. We demonstrated how costly operations in information theoretic clustering can be be replaced by convolutions. Since convolutions on low dimensional lattices can be computed efficiently, our modified algorithm provides a powerful tool for the quantization of image pixels or voxels.

We illustrated the behavior of the algorithm by means of binary shape images. In experiments carried out to assess the runtime behavior of the modified procedure, we found it to be two orders of magnitude faster than the original algorithm but nevertheless to yield results of similar quality.

Finally, we demonstrated that for data on a lattice, it is straightforward to introduce weighting schemes into the adapted algorithm. Again, given the example of binary images, we showed the effect of applying distance transforms for the purpose of weighting. With this adaptation, the cluster centers produced by the algorithm trace local principal curves or latent features of a shape.

Binary images are a straightforward application domain for our algorithm because their pixels form a distribution of 2D data points. Gray value- and color images, too, can be understood as distributions on low-dimensional lattices. Yet, in practice it would introduce considerable memory overhead to represent them this way. At first sight, the practical applicability of the method proposed in this paper may therefore seem limited. However, this is not the case! On the contrary, modern imaging technology such as tomography or hyper-spectral imaging produces data that meet the prerequisites for our method.

Hyper-spectral images record a whole spectrum of wavelengths for every spatial coordinate of an image. They can thus be thought of as three-dimensional lattices where the data at each grid point is physically related to its neighbors. In this sense, a hyper-spectral image forms a discrete three-dimensional distribution of measurements and accelerated ITC provides a useful tool to uncover latent structures in these data.

In fact, due to their enormous data volumes, hyper-spectral images pose considerable challenges for statistical analysis as traditional methods do simply not apply anymore [10, 16]. Yet, there is a lot to gain. For example, in areas such as biology or phytomedicine, hyper-spectral data has been shown to provide new insights into physiological processes within plants [1, 2, 3] and using accelerated ICT promises further interesting results in this demanding big data domain.

Another interesting, though lesser known application area, lies in the field of game AI [4, 5, 15]. Many modern computer games that are set in simulated 3D worlds require the player to act according to his or her current location. For the programming of convincing artificial game agents this poses the problem of equipping them with a sense of where they are. However, due to the growing size and complexity of virtual, manual annotations that capture all essential areas and aspects may soon become infeasible. Accelerated ICT therefore provides a data-driven approach to the semantic clustering of in-game navigation meshes and first results already underline that it indeed uncovers key locations of in-game maps that make sense from a human player’s point of view.

It is in areas like these, where the method introduced in this paper is of considerable practical benefit and where bridges between advanced machine learning and highly efficient signal processing are in dire need.


  • [1] A. Ballvora, C. Römer, M. Wahabzada, U. Rascher, C. Thurau, C. Bauckhage, K. Kersting, L. Plümer, and J. Leon. Deep Phenotyping of Early Plant Response to Abiotic Stress Using Non-invasive Approaches in Barley. In Proc. Int. Barley Genetics Symposium, 2012.
  • [2] C. Bauckhage and K. Kersting. Data Mining and Pattern Recognition in Agriculture. KI – Künstliche Intelligenz, 27(4):313–324, 2013.
  • [3] C. Bauckhage, K. Kersting, and A. Schmidt. Agriculture’s Technological Makeover. IEEE Pervasive Computing, 11(2):4–7, 2012.
  • [4] C. Bauckhage, M. Roth, and V. Hafner.

    Where am I? – On Providing Gamebots with a Sense of Location Using Spectral Clustering of Waypoints.

    In Proc. Optimizing Player Satisfaction in Computer and Physical Games, 2006.
  • [5] C. Bauckhage and C. Thurau. Towards a Fair ’n Square Aimbot – Using Mixtures of Experts to Learn Context Aware Weapon Handling. In Proc. GAME-ON, 2004.
  • [6] G. Borgefors. An Improved Version of the Chamfer Matching Algorithm. In Proc. ICPR, 1984.
  • [7] Y. Cheng. Mean Shift, Mode Seeking, and Clustering. IEEE Trans. PAMI, 17(8), 1995.
  • [8] D. Comaniciu and P. Meer. Mean Shift: A Robust Approach towards Feature Space Analysis. IEEE Trans. PAMI, 24(5), 2002.
  • [9] R. Deriche. Recursively Implementing the Gaussian and its Derivatives. In Proc. ICIP, 1992.
  • [10] K.Kersting, Z. Xu, M. Wahabzada, C. Bauckhage, C. Thurau, C. Römer, A. Ballvora, U. Rascher, J. Leon, and L. Plümer. Pre-symptomatic Prediction of Plant Drought Stress using Dirichlet-Aggregation Regression on Hyperspectral Images. In Proc. AAAI, 2012.
  • [11] T. Lehn-Schioler, A. Hedge, D. Erdogmus, and J. Principe. Vector Quantization Using Information Theoretic Concepts. Natural Computing, 4(1), 2005.
  • [12] S. Rao, A. de Medeiros Martins, and J. Principe. Mean Shift: An Information Theoretic Perspective. Pattern Recognition Letters, 30(3), 2009.
  • [13] S. Rao, S. Han, and J. Principe. Information Theoretic Vector Quantization with Fixed Point Updates. In Proc. IJCNN, 2007.
  • [14] A. Renyi. On Measures of Information and Entropy. In Proc. Berkeley Symp. on Mathematics, Statistics, and Probability, 1961.
  • [15] R. Sifa and C. Bauckhage. Archetypical Motion: Supervised Game Behavior Learning with Archetypal Analysis. In Proc. CIG, 2013.
  • [16] C. Thurau, K. Kersting, M. Wahabzada, and C. Bauckhage. Descriptive Matrix Factorization for Sustainability: Adopting the Principle of Opposites. Data Mining and Knowledge Discovery, 24(2):325–354, 2012.
  • [17] L. van Vliet, I. Young, and P. Verbeek. Recursive Gaussian Derivative Filters. In Proc. ICPR, 1998.