1 Motivation and Background
Clustering is an important tool of the trade in data mining and pattern recognition. Numerous algorithms have been proposed and an extended review of their characteristics and merits would be beyond the scope of this paper. However, we note that many algorithms used to cluster numerical data tacitly assume the data to be drawn from distributions that are continuous.
Alas, many data are explicitly discrete. Consider, for instance, the pixels of a standard digital image. Owing to modern recording technologies, domain and range of such images are discrete and we may distinguish three elementary cases:
(i) In a binary (black and white) image, each pixel is switched either on or off. Pixels that are switched off need not be represented; pixels that are switched on are characterized by their twodimensional grid coordinate .
(ii) In an intensity (gray value) image, pixels are located on a twodimensional grid and assume discrete gray values; they are thus fully characterized by a threedimensional grid coordinate .
(iii) In a color (RGB) image, each pixel has a fixed location on a twodimensional grid and maps to a discrete threedimensional color vector; color pixels are thus fully characterized by a fivedimensional grid coordinate
.Digital image therefore correspond to sets of data points that reside on low dimensional lattices. If such data need to be clustered into coherent groups, it is common practice to resort to algorithms that perform continuous optimization and thus will likely produce cluster centers that are located in between lattice points. In most practical applications, this is perfectly acceptable and our concern in this paper is therefore not with precision but with efficiency.
In this paper, we adapt an algorithm for continuous, information theoretic clustering to data on discrete lattices. Our basic idea is to connect recent machine learning techniques with well established methods for digital signal processing.
We show that recasting essential steps of information theoretic clustering in terms of efficient convolution operations significantly reduces its runtime and enables us to quickly compute cluster centers in image data. To simplify our discussion, we focus on applications in binary image processing and provide qualitative examples to illustrate behavior and performance of the algorithm. Quantitative experiments on a large data base of binary images reveal that our accelerated algorithm is two orders of magnitude faster than the original one. Moreover, we demonstrate that weighted clustering is straightforward in our framework.
Next, we review information theoretic clustering and discuss its properties. Then, we derive our accelerated version and present experimental results that underline its favorable performance. Finally, we discuss how the proposed algorithm immediately allows for efficient weighted clustering and conclude this paper with a summary and an outlook to practical applications of our approach.
2 Information Theoretic Clustering
The particular approach to information theoretic clustering (ITC) we consider in this paper was introduced in [11, 12, 13]. Given a set
of data points drawn from a continuous domain, it applies an entropybased measure to determine clusters in the data. Similar to means clustering, ITC starts from a random initialization of the cluster centers
and iteratively updates these codebook vectors in order to minimize the entropy measure under consideration.
of 2D data points; (b) visualization of a Parzen estimate of the probability density function
of ; (c) a randomly initialized set of codebook vectors (30 in this case) typically has a pdf that diverges from ; (d) an iterative minimization the CauchySchwartz divergence of and moves the codebook vectors towards modes of .2.1 Iterative Entropy Minimization
ITC assumes that two probability density functions and are available that characterize the sets and , respectively. Given these densities, ITC optimizes the placement of codebook vectors by means of minimizing the CauchySchwartz divergence
(1) 
between and . Note that the last term of does not depend on . The entropies in its first and second term are given by
and  
and are known as Renyi’s cross entropy between and and Renyi’s entropy of , respectively [14].
The fundamental idea in [11, 12, 13] is to model the densities and using their Parzen estimates
(2)  
(3) 
where and
are Gaussian kernels of variance
and , respectively.Applying the Gaussian product theorem and using properties of the so called overlap integral, and can be written as
and  
where and .
Therefore, differentiating with respect to , equating to zero, and rearranging the resulting terms yields the following fix point update rule for every codebook vector in
(4) 
where the constant .
2.2 Properties of ITC
In [12], it is shown that the popular mean shift procedure [7, 8] is a special case of ITC. The cross entropy term in the CauchySchwartz divergence causes ITC to be a mode seeking algorithm. The quantity , however, can be understood as the potential energy of a repellent force between the codebook vectors [12]. After convergence of the algorithm, the codebook vectors will therefore be located close to local modes but because of the repellent force they will not have collapsed into only a few modes.
Figure 1 illustrates how ITC determines a codebook of 30 vectors from a binary shape image. Every foreground pixel of the original image in the leftmost panel is understood as a 2D vector ; together they form the data set . Using (2) to compute an estimate of the density of yields the result shown in the second panel. The third panel demonstrates that a random initialization of a set of codebook vectors typically yields a density function that diverges from the density of the data points. Iterative updates of the codebook vectors using (4) will minimize this divergence. The rightmost panel of the figure shows how, upon termination of the ITC algorithm, the elements in have moved towards local modes of and that the two densities and are now in close agreement. Since the approach prevents codebook vectors from collapsing into a few modes, the final distribution is found to trace out local principal curves of the binary shape.
For data points and codebook vectors, every iteration of ITC requires efforts of . However, in contrast to related clustering algorithms of similar complexity such, for instance, means, there is a large constant that factors into the overall runtime of ITC. Each of the updates in (4) requires the computation of distances and distances and each distance computation entails the evaluation of an exponential function in a Gaussian kernel.
With respect to modern practical applications, for instance in large scale image processing, the algorithm will therefore be too slow to be useful. It would neither allow for an efficient processing of a few very large images nor would it apply to the processing of large amounts of simple binary images of only moderate size.
3 ITC on Discrete Lattices
Addressing the efficiency concerns raised in the previous section, we now derive an alternative update rule for information theoretic clustering. Our key contribution at this point is to model the Parzen density estimates in (2) and (3) using methods from signal processing. Mathematically, our result is equivalent to the original formulation but it is cast in a form that allows for avoiding repeated computations of distances and exponentials. Especially for data that reside on discrete lattices our new formulation of ITC enables highly efficient implementations based on discrete filter masks and therefore considerably accelerate the procedure. To ease our discussion, we will again focus on examples from binary image processing but the methods discussed below immediately generalize to higher dimensional lattices, too.
3.1 Entropy Minimization Using Convolutions
Recall that the convolution of two functions and is defined as
where integration is over the whole, possibly higher dimensional domain.
To begin our derivation, we note that the convolution of a function with a shifted delta impulse shifts the function
A Parzen density estimate using a sum of Gaussians each centered at data points is therefore equivalent to a sum of shifted delta impulses convolved with a Gaussian
With respect to applications in binary image processing, this observation is pivotal because every active binary pixel can be understood as a discrete delta impulse located at some coordinate on a regular 2D lattice. In a slight abuse of notation, we may therefore write the density function of a set of pixels in (2) as
Convolution with a Gaussian is a standard and well understood operation in lowlevel signal processing. In digital image processing, it typically realized using discrete 2D filter masks. In cases where is small, the Gaussian convolution kernel can be separated into two 1D convolution kernels. Successively applying corresponding precomputed 1D filters of small spatial support will then significantly reduce computation time. In cases where is large, it is more efficient to resort to multiplication in the Fourier domain or to apply recursive filters [9, 17]. The density of a set of binary pixels can therefore be computed very efficiently; although it will still require efforts proportional to the number of pixels, the overall effort per pixel will become very small.
Note that the set of codebook vectors , too, can be understood as a collection of delta impulses on a discrete lattice. With respect to binary image processing, this is to say that can be understood as an image containing foreground pixels. Accordingly, its density estimate
can be computed just as efficiently. On a discrete lattice such as an image pixel array, the two entropy integrals and are therefore readily available.
It thus remains to derive a modified update rule for the codebook vectors that assumes the role of (4). To this end, we have another look at (3) and note that
(5) 
Using this, we differentiate the divergence in (1) with respect to , equate to zero, rearrange the resulting terms, and obtain the following update rule:
(6) 
where .
Note, that in contrast to (4), every Gaussian kernel in (6) is of the form . That is, they are now centered at , are all of the same variance, and do not specifically depend on the or the other anymore.
On a discrete lattice, this allows for further acceleration. Following common practice in digital signal processing, we can approximate the continuous Gaussian by means of a precomputed discrete filter mask of finite support. Instead of evaluating the integrals in (6) over the whole lattice, it then suffices to consider a neighborhood of whose size depends on . That is, on a discrete lattice, it is valid to assume the approximation
and similar expressions hold for the other integrals in (6).
Concluding our derivation up to this point, we observe that, in contrast to the original update algorithm for ITC, our version for discrete lattices (i) avoids the computation of Euclidean distances and (ii) does not necessitate explicit evaluation of exponentials. On a discrete lattice both these steps can be replaced by highly efficient convolution operations using precomputed filter masks.
3.2 ITC for Binary Images
In this subsection, we discuss implementation details of our accelerated ITC algorithm. To simplify our discussion, we focus once again on binary images and show how they may be processed efficiently.
For binary image processing, our efficient version of the original ITC algorithm draws on an isomorphism between a binary image matrix and a finite set of vectors . That is, in an implementation of our algorithm as well as in our discussion, we capitalize on the isomorphism
so that we may use and interchangeably. Of course, a similar equivalence also holds for the set of codebook vectors so that we may refer to them using either or .
Given these prerequisites, Algorithm 1 summarizes how to use ITC in order to cluster binary images into coherent segments. The procedure begins with randomly sampling codebook vectors from the pixels of the shape image . Discrete approximations and of the two probability density functions and are (pre)computed using convolutions.
Also, since the algorithm is tailored to data on discrete lattices, the integrals in and are replaced by finite sums over the elements of and . With these ingredients, the codebook vectors are updated according to equation (6). These steps are iterated until the CauchySchwartz divergence between and falls below a user defined threshold or the updates become very small. In practice, we found that it typically requires less than 20 iterations for the procedure to converges to a stable solution.
4 Performance Evaluation
Next, we present and discuss results obtained from experimenting with our accelerated version of information theoretic clustering and the original algorithm according to [11, 12, 13]. In addition, we present results obtained from applying means clustering to the pixels of binary images in order to provide a well known baseline for comparison.
Figure 2 provides an example of the placement of as well as of codebook vectors on a binary shape. The results shown here are representative for what we found in our experiments. The original version of ITC and our adapted algorithm produce almost identical results. This was to be expected, since mathematically the two are equivalent. The minor differences visible in the figure can be attributed to the fact, that the original ITC algorithm performs a continuous optimization, while our version places the codebook vectors on a discrete grid of coordinates using filter masks of finite support.



Interestingly, for a larger number of clusters the codebook obtained from means clustering is rather similar to the ones obtained from ITC. However, means clustering considers variance minimization for optimization and places cluster centers at local means. It therefore tends to produce bloblike clusters of comparable sizes with equidistant centers. This becomes especially apparent, if a smaller number of clusters are to be determined. In such cases, ITC yields results that are noticeably different. This difference is further stressed in Fig. 3 which, in addition to the the different cluster centers produced by means and ITC, also shows the corresponding segmentations of the shape.


means perform in extracting 100 codebook vectors from shapes of growing size. Right panel: average computing time per iteration w.r.t. number of codebook vectors extracted from a shape of about 10.000 pixels. Original ITC and kmeans show linear growth (note the logarithmic scale); for accelerated ITC, our way of choosing local support regions (see text) even decreases runtime.
Figure 4 summarizes quantitative results obtained from runtime experiments with the MPEG7 shape data set. All our experiments on runtime behavior of the different algorithms were carried out on an Intel Core i3 processor. The variance parameter in our accelerated ITC algorithm was set dynamically, taking into account the number of pixels in a shape and the number of codebook vectors to be produced. By choosing , a smaller codebook will lead to larger discrete filter masks applied in computing ITC according to (6). Note that in order to benefit from the idea of using localized filter masks of finite support, it may in fact be preferable to consider convolution masks of smaller diameter. The parameter was set to . Finally, all the results we present here were obtained form averaging over 10 trials with different random initializations of .
Both, the left and right panel of Figures 4 underline the runtime improvement of accelerated ITC over the original algorithm. For larger choices of and , the time per iteration of our accelerated algorithm is two orders of magnitude smaller than the time required by the original version. The declining time per iteration of accelerated ITC for a growing number of clusters is a consequence of the above choice of and its impact on the size of filter masks.
5 Weighted ITC
In addition to its favorable runtime characteristics, our adapted ITC algorithm also easily extends towards efficient weighted clustering. Our contribution in this section is to replace the simple Parzen estimate of the density of the data
by a more flexible Gaussian mixture model. As we shall see next, for data on a discrete lattice this does not cause any noticeable overhead for our algorithm.
If we extend the model for in equation (2) to the more general form
we immediately recognize yet another convolution
We also note that in our adapted version of the ITC algorithm there is only one place where is computed explicitly. This happens in the second preparatory step just outside of the loop that iteratively updates the codebook vectors.
With respect to the quantization of binary images, we can therefore introduce a weighting scheme for the pixels in at almost no costs. For instance, we may apply efficient distance transforms [6] to the foreground pixels of to assign higher weights to the pixels in the interiors of a shape. If we denote this transform by , the second step in our algorithm simply becomes .
Figures 5 and 6 show examples of the effect of assigning higher weights to pixels in the interior of a shape. In contrast to the normal variant of ITC, the weighted version causes the cluster centers to be concentrated closer to the shape interiors. Since the weighting scheme only impacts the mode seeking component of the algorithm but not the term that causes cluster centers to repel each other, the resulting codebook vectors trace local principal curves of the shape considered here.
6 Summary and Outlook
This paper introduced an efficient algorithm for clustering data that reside on a discrete lattice. We demonstrated how costly operations in information theoretic clustering can be be replaced by convolutions. Since convolutions on low dimensional lattices can be computed efficiently, our modified algorithm provides a powerful tool for the quantization of image pixels or voxels.
We illustrated the behavior of the algorithm by means of binary shape images. In experiments carried out to assess the runtime behavior of the modified procedure, we found it to be two orders of magnitude faster than the original algorithm but nevertheless to yield results of similar quality.
Finally, we demonstrated that for data on a lattice, it is straightforward to introduce weighting schemes into the adapted algorithm. Again, given the example of binary images, we showed the effect of applying distance transforms for the purpose of weighting. With this adaptation, the cluster centers produced by the algorithm trace local principal curves or latent features of a shape.
Binary images are a straightforward application domain for our algorithm because their pixels form a distribution of 2D data points. Gray value and color images, too, can be understood as distributions on lowdimensional lattices. Yet, in practice it would introduce considerable memory overhead to represent them this way. At first sight, the practical applicability of the method proposed in this paper may therefore seem limited. However, this is not the case! On the contrary, modern imaging technology such as tomography or hyperspectral imaging produces data that meet the prerequisites for our method.
Hyperspectral images record a whole spectrum of wavelengths for every spatial coordinate of an image. They can thus be thought of as threedimensional lattices where the data at each grid point is physically related to its neighbors. In this sense, a hyperspectral image forms a discrete threedimensional distribution of measurements and accelerated ITC provides a useful tool to uncover latent structures in these data.
In fact, due to their enormous data volumes, hyperspectral images pose considerable challenges for statistical analysis as traditional methods do simply not apply anymore [10, 16]. Yet, there is a lot to gain. For example, in areas such as biology or phytomedicine, hyperspectral data has been shown to provide new insights into physiological processes within plants [1, 2, 3] and using accelerated ICT promises further interesting results in this demanding big data domain.
Another interesting, though lesser known application area, lies in the field of game AI [4, 5, 15]. Many modern computer games that are set in simulated 3D worlds require the player to act according to his or her current location. For the programming of convincing artificial game agents this poses the problem of equipping them with a sense of where they are. However, due to the growing size and complexity of virtual, manual annotations that capture all essential areas and aspects may soon become infeasible. Accelerated ICT therefore provides a datadriven approach to the semantic clustering of ingame navigation meshes and first results already underline that it indeed uncovers key locations of ingame maps that make sense from a human player’s point of view.
It is in areas like these, where the method introduced in this paper is of considerable practical benefit and where bridges between advanced machine learning and highly efficient signal processing are in dire need.
References
 [1] A. Ballvora, C. Römer, M. Wahabzada, U. Rascher, C. Thurau, C. Bauckhage, K. Kersting, L. Plümer, and J. Leon. Deep Phenotyping of Early Plant Response to Abiotic Stress Using Noninvasive Approaches in Barley. In Proc. Int. Barley Genetics Symposium, 2012.
 [2] C. Bauckhage and K. Kersting. Data Mining and Pattern Recognition in Agriculture. KI – Künstliche Intelligenz, 27(4):313–324, 2013.
 [3] C. Bauckhage, K. Kersting, and A. Schmidt. Agriculture’s Technological Makeover. IEEE Pervasive Computing, 11(2):4–7, 2012.

[4]
C. Bauckhage, M. Roth, and V. Hafner.
Where am I? – On Providing Gamebots with a Sense of Location Using Spectral Clustering of Waypoints.
In Proc. Optimizing Player Satisfaction in Computer and Physical Games, 2006.  [5] C. Bauckhage and C. Thurau. Towards a Fair ’n Square Aimbot – Using Mixtures of Experts to Learn Context Aware Weapon Handling. In Proc. GAMEON, 2004.
 [6] G. Borgefors. An Improved Version of the Chamfer Matching Algorithm. In Proc. ICPR, 1984.
 [7] Y. Cheng. Mean Shift, Mode Seeking, and Clustering. IEEE Trans. PAMI, 17(8), 1995.
 [8] D. Comaniciu and P. Meer. Mean Shift: A Robust Approach towards Feature Space Analysis. IEEE Trans. PAMI, 24(5), 2002.
 [9] R. Deriche. Recursively Implementing the Gaussian and its Derivatives. In Proc. ICIP, 1992.
 [10] K.Kersting, Z. Xu, M. Wahabzada, C. Bauckhage, C. Thurau, C. Römer, A. Ballvora, U. Rascher, J. Leon, and L. Plümer. Presymptomatic Prediction of Plant Drought Stress using DirichletAggregation Regression on Hyperspectral Images. In Proc. AAAI, 2012.
 [11] T. LehnSchioler, A. Hedge, D. Erdogmus, and J. Principe. Vector Quantization Using Information Theoretic Concepts. Natural Computing, 4(1), 2005.
 [12] S. Rao, A. de Medeiros Martins, and J. Principe. Mean Shift: An Information Theoretic Perspective. Pattern Recognition Letters, 30(3), 2009.
 [13] S. Rao, S. Han, and J. Principe. Information Theoretic Vector Quantization with Fixed Point Updates. In Proc. IJCNN, 2007.
 [14] A. Renyi. On Measures of Information and Entropy. In Proc. Berkeley Symp. on Mathematics, Statistics, and Probability, 1961.
 [15] R. Sifa and C. Bauckhage. Archetypical Motion: Supervised Game Behavior Learning with Archetypal Analysis. In Proc. CIG, 2013.
 [16] C. Thurau, K. Kersting, M. Wahabzada, and C. Bauckhage. Descriptive Matrix Factorization for Sustainability: Adopting the Principle of Opposites. Data Mining and Knowledge Discovery, 24(2):325–354, 2012.
 [17] L. van Vliet, I. Young, and P. Verbeek. Recursive Gaussian Derivative Filters. In Proc. ICPR, 1998.