1 Introduction
Suppose that we are given a set of data points $x_1, \dots, x_n$ and a number of clusters $k$. We would like to assort these data points “optimally” into $k$ clusters. For each point $x_i$, let $c_i$ denote the center of gravity of the cluster $x_i$ is assigned to. We call the vectors $c_i$ centroids. The standard k-means method [2] attempts to produce a clustering of the data points such that the sum of squared centroid distances $\sum_{i=1}^{n} \|x_i - c_i\|^2$ is minimized.
The main difficulty of this method is that it requires the data points to be elements of a Euclidean space, since we need to average the data points somehow. A generalization of k-means, called relational k-means, has been proposed [3] to address this issue. This generalized k-means variant does not require the data points to be vectors.
Instead, the data points can form an abstract set $X$ with a completely arbitrary distance function $d : X \times X \to \mathbb{R}$. We only require that $d$ is symmetric, and that $d(x, x) = 0$ for all $x \in X$. Note that $d$ need not even satisfy the triangle inequality.
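For instance, the squared Euclidean distance on the real line satisfies both requirements yet violates the triangle inequality, so it is an admissible distance function here. A minimal illustration:

```python
# Squared Euclidean distance on the real line: symmetric, zero on the
# diagonal, but NOT a metric -- the triangle inequality fails.
def d(x, y):
    return (x - y) ** 2

assert d(3.0, 3.0) == 0            # d(x, x) = 0
assert d(1.0, 4.0) == d(4.0, 1.0)  # symmetry
# Triangle inequality fails: d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2.
assert d(0, 2) > d(0, 1) + d(1, 2)
```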
2 Relational k-means
First, we provide a brief outline of the algorithm. Let $D$ denote the squared distance matrix, that is, $D_{ij} = d(x_i, x_j)^2$. The algorithm starts with some initial clustering and improves it by repeatedly performing an iteration step. The algorithm stops if the last iteration did not decrease the value of the clustering (defined below). Of course, if the iteration increased the value of the clustering, the algorithm reverts the last iteration.
Now we describe the algorithm in detail. At any time during execution, let $C_1, \dots, C_k$ denote the clusters. For each data point $x_i$, let $a_i$ denote the index of the cluster $x_i$ is assigned to. Let $e_i$ denote the $i$th standard basis vector in $\mathbb{R}^n$, and, for a cluster $C$, let $w_C = \frac{1}{|C|} \sum_{x_j \in C} e_j$. Let us call the quantity $\mathrm{sqcd}(x_i, C) = -\frac{1}{2} (e_i - w_C)^\top D \, (e_i - w_C)$ the squared centroid distance corresponding to the point $x_i$ and the cluster $C$.
In [3] it is observed that, if the distance function $d$ is derived from a Euclidean representation of the data points, then $\mathrm{sqcd}(x_i, C)$ equals the squared distance of $x_i$ and the center of gravity of $C$. Thus $\mathrm{sqcd}$ is indeed an extension of the classical notion of squared centroid distances.
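This consistency can be checked numerically. The following sketch (illustrative Python with numpy, not the paper's C# code; the function name `sqcd` is ours) evaluates the quadratic-form definition and compares it against the ordinary squared distance to the center of gravity in a Euclidean setting:

```python
import numpy as np

def sqcd(D, members, i):
    """Squared centroid distance of point i from the cluster whose point
    indices are `members`, computed as -1/2 (e_i - w_C)^T D (e_i - w_C)."""
    n = len(D)
    w = np.zeros(n)
    w[members] = 1.0 / len(members)   # w_C: normalized indicator vector
    v = np.zeros(n)
    v[i] = 1.0                        # e_i
    v -= w
    return -0.5 * v @ D @ v

# Euclidean sanity check: sqcd must equal the squared distance
# to the cluster's center of gravity.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                          # 6 points in the plane
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distance matrix
members = [0, 2, 3]
centroid = X[members].mean(axis=0)
assert abs(sqcd(D, members, 5) - ((X[5] - centroid) ** 2).sum()) < 1e-9
```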
Define the value of a clustering as $\sum_{i=1}^{n} \mathrm{sqcd}(x_i, C_{a_i})$. We say that a clustering is better than another one if and only if its value is less than the value of the other clustering.
The relational k-means algorithm takes an initial clustering (e.g. a random one) and improves it through repeated applications of an iteration step which reclusters the data points. The iteration step simply reassigns each data point to the cluster which minimized its squared centroid distance in the previous clustering. If the value of the clustering does not decrease through the reassignment, then the reassignment is undone and the algorithm stops.
In the non-Euclidean case there might be scenarios where an iteration actually worsens the clustering. Should this peculiar behavior be undesirable, it can be avoided by “stretching” the distance matrix and thus making it Euclidean.
Stretching means replacing $D$ with $D + \beta(J - I)$, where $J$ is the matrix whose entries are all 1’s, $I$ is the identity matrix, and $\beta$ is the smallest real number for which $D + \beta(J - I)$ is a Euclidean squared distance matrix, i.e. it equals the squared distance matrix of some vectors. It can easily be deduced that such a $\beta$ exists. This method is called the spread transformation (see [1]). The algorithm is thus as follows:
1. Start with some clustering, e.g. a random one.
2. Calculate the value of the current clustering and store it in $v_{\mathrm{old}}$.
3. For each data point, calculate and store the squared centroid distances.
4. Reassign each data point to the cluster that yielded the smallest squared centroid distance in the previous step.
5. Calculate the value of the current clustering and store it in $v_{\mathrm{new}}$.
6. If $v_{\mathrm{new}} \ge v_{\mathrm{old}}$, then undo the previous reassignment and stop.
7. Go to step 2.
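The spread transformation described above can be computed explicitly via the classical double-centering criterion: $D$ is a Euclidean squared distance matrix if and only if $B = -\frac{1}{2} H D H$ is positive semidefinite, where $H = I - J/n$, and adding $\beta(J - I)$ to $D$ shifts the spectrum of $B$ on the subspace orthogonal to the all-ones vector by $\beta/2$. A numpy sketch (our code and naming, not part of the paper's implementation):

```python
import numpy as np

def spread(D):
    """Return D + beta*(J - I) with the smallest beta >= 0 that makes the
    result a Euclidean squared distance matrix.  beta = -2 * (smallest
    eigenvalue of B) when B = -1/2 H D H has a negative eigenvalue."""
    n = len(D)
    J = np.ones((n, n))
    H = np.eye(n) - J / n                       # centering projection
    B = -0.5 * H @ D @ H
    beta = max(0.0, -2.0 * np.linalg.eigvalsh(B)[0])  # eigvalsh: ascending
    return D + beta * (J - np.eye(n))

# A symmetric, zero-diagonal, but non-Euclidean squared distance matrix:
D = np.array([[0., 1., 16.], [1., 0., 1.], [16., 1., 0.]])
S = spread(D)
```

After stretching, `S` is still symmetric with a zero diagonal, but its doubly centered form is positive semidefinite, so it admits a Euclidean representation.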
3 Time complexity
The algorithm is clearly finite, because it gradually decreases the value of the current clustering and the number of different clusterings is finite. Each iteration step can easily be implemented in $O(n^3)$ time: for each data point, we need to calculate $k$ quadratic forms, which can be done in $O(n^2)$ time. This is unfortunately too slow for practical applications.
However, this can be improved down to $O(n^2)$. The squared centroid distances can be transformed as follows (using $D_{ii} = 0$):
$$\mathrm{sqcd}(x_i, C) = -\tfrac{1}{2}(e_i - w_C)^\top D \, (e_i - w_C) = -\tfrac{1}{2} w_C^\top D \, w_C + (D w_C)_i.$$
Here the first summand is independent of $i$ and thus needs to be calculated for each cluster only once per iteration. On the other hand, the second summand can be calculated in $O(|C|)$ time. To sum up, the amount of arithmetic operations per iteration is at most a constant times $n^2$.
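An iteration step based on this decomposition can be sketched as follows (an illustrative numpy sketch of one pass, not the paper's C# code; for clarity it uses dense matrix products and assumes every cluster is non-empty):

```python
import numpy as np

def iterate(D, assign, k):
    """One relational k-means iteration using
    sqcd(x_i, C) = (D w_C)_i - 1/2 w_C^T D w_C.
    Returns the new assignment and the value of the *input* clustering."""
    n = len(D)
    W = np.zeros((n, k))
    W[np.arange(n), assign] = 1.0
    W /= W.sum(axis=0)                   # columns of W are the vectors w_C
    DW = D @ W                           # (D w_C)_i for every i and C
    half = 0.5 * np.einsum('ic,ic->c', W, DW)   # 1/2 w_C^T D w_C per cluster
    sq = DW - half                       # n-by-k table of sqcd values
    value = sq[np.arange(n), assign].sum()
    return sq.argmin(axis=1), value

# Two well-separated blobs; start from an almost-correct clustering.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
assign = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
assign, v1 = iterate(D, assign, 2)       # one pass fixes the misplaced points
```

Exploiting the sparsity of the columns of $W$ (each point appears in one cluster) brings the dense products down to the $O(n^2)$ bound derived above.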
The current implementation makes several attempts to find a better clustering. In each attempt, the full relational k-means algorithm is run, starting from a new random clustering. Every attempt has the possibility to produce a clustering which is better than the best one among the previous attempts. If the so-far best clustering has not been improved in the last $N$ attempts (where $N$ is a parameter), then it is assumed that the clustering which is currently the best is not too far from the optimum, and the program execution stops.
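Schematically, the restart loop looks like this (a Python sketch; `run_attempt` stands for one complete relational k-means run from a fresh random clustering, and the names are ours):

```python
def best_clustering(run_attempt, max_bad_streak=20):
    """Keep running full k-means attempts; stop once `max_bad_streak`
    consecutive attempts fail to improve on the best value seen so far.
    `run_attempt` returns a (value, clustering) pair."""
    best_value, best = float('inf'), None
    bad = 0
    while bad < max_bad_streak:
        value, clustering = run_attempt()
        if value < best_value:
            best_value, best, bad = value, clustering, 0
        else:
            bad += 1
    return best_value, best

# Deterministic demonstration with stubbed attempt values:
vals = iter([5, 3, 4, 4, 2] + [9] * 10)
v, c = best_clustering(lambda: (next(vals), 'clustering'), max_bad_streak=10)
```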
Attempts do not depend on each other’s results and do not modify shared resources (apart from a shared random generator). Our implementation uses Parallel.For for launching multiple attempts at once. The number of parallel threads can be customized via a command-line switch, and by default equals the number of logical processors. This results in near 100% CPU utilization. Launching fewer threads than that leaves some CPU time for other processes.
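The same pattern can be mirrored in Python with `concurrent.futures` (an illustrative sketch of the Parallel.For idea, not the actual C# code; for brevity it runs a fixed number of attempts instead of the bad-luck-streak rule):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_attempts(run_attempt, n_attempts, n_workers):
    """Run independent attempts concurrently and return the best
    (value, clustering) pair.  Attempts share no mutable state."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda _: run_attempt(), range(n_attempts))
        return min(results, key=lambda r: r[0])

v, c = parallel_attempts(lambda: (0.5, 'clustering'), n_attempts=8, n_workers=4)
```

For CPU-bound pure-Python work a process pool would be the appropriate choice, since threads in CPython do not execute bytecode in parallel; the C# implementation has no such restriction.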
4 Test results and conclusion
We implemented the above algorithm in C# and ran the program on a set of more than 1000 proteins with a Levenshtein-like distance function. The value of $N$ (the maximum allowed number of failed attempts, i.e. the “bad luck streak”) was 20, and the value of $k$ (the number of clusters) was 10. The testing was done on an average dual-core laptop computer and finished in 30-60 seconds. This demonstrates that relational k-means can be implemented in a way efficient enough to be applied to real-world datasets.
Since the program is almost fully parallelized, we could expect it to finish in under 8 seconds for the same dataset on a 16-core machine. Note that the total runtime is proportional to the number of attempts made, which is highly variable due to the random nature of the algorithm.
A C++ implementation could further reduce execution time; according to our estimate, it could make the program roughly twice as fast.
5 Attachments
5.1 Program source code in C#
Program.cs
5.2 Example input
The first lines of the input contain the names of the objects which need to be clustered. Then a line containing two slashes (//) follows. After that, the matrix containing the pairwise distances is listed in semicolon-separated CSV format. Fractional distances must be input with the dot character (.) as decimal separator.
testinput.txt
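A minimal reader for this format might look like the following (our Python sketch; the example file contents are hypothetical):

```python
def parse_input(text):
    """Parse the input format described above: object names, one per line,
    then a line containing '//', then the pairwise distance matrix in
    semicolon-separated rows with '.' as decimal separator."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    sep = lines.index('//')
    names = lines[:sep]
    matrix = [[float(x) for x in ln.split(';')] for ln in lines[sep + 1:]]
    assert all(len(row) == len(names) for row in matrix)
    return names, matrix

example = """\
proteinA
proteinB
proteinC
//
0;1.5;2
1.5;0;1
2;1;0
"""
names, M = parse_input(example)
```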
References
[1] Richard J. Hathaway and James C. Bezdek. NERF c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognition, 27(3):429–437, 1994.

[2] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press, 1967.

[3] B. Szalkai. Generalizing k-means for an arbitrary distance matrix. ArXiv e-prints, http://arxiv.org/abs/1303.6001, March 2013.