Fast Approximate K-Means via Cluster Closures

by   Jingdong Wang, et al.
Peking University

K-means, a simple and effective clustering algorithm, is one of the most widely used algorithms in the multimedia and computer vision communities. Traditional k-means is an iterative algorithm: in each iteration new cluster centers are computed and each data point is re-assigned to its nearest center. The cluster re-assignment step becomes prohibitively expensive when the numbers of data points and cluster centers are large. In this paper, we propose a novel approximate k-means algorithm to greatly reduce the computational complexity of the assignment step. Our approach is motivated by the observation that most active points, i.e., points changing their cluster assignments at each iteration, are located on or near cluster boundaries. The idea is to efficiently identify those active points by pre-assembling the data into groups of neighboring points using multiple random spatial partition trees, and to use the neighborhood information to construct a closure for each cluster, in such a way that only a small number of cluster candidates need to be considered when assigning a data point to its nearest cluster. Using complexity analysis, image data clustering, and applications to image retrieval, we show that our approach outperforms state-of-the-art approximate k-means algorithms in terms of clustering quality and efficiency.




1 Introduction

k-means MacQueen67 has been widely used in multimedia, computer vision and machine learning for clustering and vector quantization. In large-scale image retrieval, it is advantageous to learn a large codebook containing one million or more entries NisterS06 ; PhilbinCISZ07 ; SivicZ03 , which requires clustering tens or even hundreds of millions of high-dimensional feature descriptors into one million or more clusters. Another emerging application of large-scale clustering is to organize a large corpus of web images for various purposes such as web image browsing/exploring WangJH11 .

The standard k-means algorithm, Lloyd’s algorithm Forgy65 ; Lloyd82 ; MacQueen67 , is an iterative refinement approach that greedily minimizes the sum of squared distances between each point and its assigned cluster center. It consists of two alternating steps, the assignment step and the update step. The assignment step finds the nearest cluster for each point by checking the distance between the point and each cluster center; the update step re-computes the cluster centers based on the current assignments. When clustering $n$ points into $k$ clusters, the assignment step costs $O(nk)$ distance computations. For applications with large $nk$, the assignment step in exact k-means becomes prohibitively expensive. Therefore many approximate solutions, such as hierarchical k-means (HKM) NisterS06 and approximate k-means (AKM) PhilbinCISZ07 , have been developed.

In this paper, we introduce a novel and effective approximate k-means algorithm (a conference version appeared in WangWKZL12 ). Our approach is motivated by the observation that active points, defined as the points whose cluster assignments change in each iteration, are often located at or near the boundaries of different clusters. The idea is to identify those active points at or near cluster boundaries to improve both the efficiency and the accuracy of the assignment step of the k-means algorithm. We generate a neighborhood set for each data point by pre-assembling the data points using multiple random partition trees VermaKD09 . A cluster closure is then formed by expanding each point in the cluster into its neighborhood set, as illustrated in Figure 2. When assigning a point $x$ to its nearest cluster, we only need to consider those clusters that contain $x$ in their closures. Typically a point belongs to a small number of cluster closures, thus the number of candidate clusters is greatly reduced in the assignment step.

We evaluate our algorithm by complexity analysis, the performance on clustering real data sets, and the performance of image retrieval applications with codebooks learned by clustering. Our proposed algorithm achieves significant improvements compared to the state-of-the-art, in both accuracy and running time. When clustering a real data set of -dimensional GIST features into clusters, our algorithm converges more than faster than the state-of-the-art algorithms. In the image retrieval application on a standard dataset, our algorithm learns a codebook with visual words that outperforms the codebooks with visual words learned by other state-of-the-art algorithms – even our codebook with visual words is superior over other codebooks with visual words.

2 Literature review

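Given a set of $n$ points $\{x_1, x_2, \dots, x_n\}$, where each point is a $d$-dimensional vector, k-means clustering aims to partition these $n$ points into $k$ ($k \leq n$) groups, $\mathcal{G} = \{G_1, G_2, \dots, G_k\}$, by minimizing the within-cluster sum of squared distortions (WCSSD):

$$J(\mathcal{C}, \mathcal{G}) = \sum_{j=1}^{k} \sum_{x_i \in G_j} \| x_i - c_j \|_2^2 \qquad (1)$$

where $c_j$ is the center of cluster $G_j$, $\mathcal{C} = \{c_1, \dots, c_k\}$, and $\mathcal{G} = \{G_1, \dots, G_k\}$. In the following, we use group and cluster interchangeably.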

2.1 Lloyd algorithm

Minimizing the objective function in Equation 1 is NP-hard in many cases MahajanNV09 . Thus, various heuristic algorithms are used in practice, and k-means (or Lloyd’s algorithm) Forgy65 ; Lloyd82 ; MacQueen67 is the most commonly used algorithm. It starts from a set of $k$ cluster centers (obtained from priors or random initialization) $c_1^{(1)}, \dots, c_k^{(1)}$, and then proceeds by alternating the following two steps:

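  • Assignment step: Given the current set of $k$ cluster centers, $\mathcal{C}^{(t)} = \{c_1^{(t)}, \dots, c_k^{(t)}\}$, assign each point $x_i$ to the cluster whose center is the closest to $x_i$: $z_i^{(t+1)} = \arg\min_{j \in \{1, \dots, k\}} \| x_i - c_j^{(t)} \|_2^2 \qquad (2)$

  • Update step: Update the points in each cluster, $G_j^{(t+1)} = \{ x_i \mid z_i^{(t+1)} = j \}$, and compute the new center for each cluster, $c_j^{(t+1)} = \frac{1}{|G_j^{(t+1)}|} \sum_{x_i \in G_j^{(t+1)}} x_i$.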

The computational complexity of the above assignment step and update step is $O(ndk)$ and $O(nd)$, respectively. Various speedup algorithms have been developed that make the complexity of the assignment step sublinear (e.g., logarithmic) with respect to $n$ (the number of data points), $k$ (the number of clusters), and $d$ (the dimension of a data point). In the following, we present a short review, mainly on handling large $n$ and $k$.
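The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation evaluated in this paper; the function names (`lloyd_kmeans`, `wcssd`) and the choice to keep the previous center when a cluster becomes empty are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations: exact assignment followed by a mean update."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initialization
    assignment = [0] * len(points)
    dim = len(points[0])
    for _ in range(iterations):
        # Assignment step: every point is compared with all k centers.
        for i, x in enumerate(points):
            assignment[i] = min(range(k), key=lambda j: squared_distance(x, centers[j]))
        # Update step: each center becomes the mean of its assigned points.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for x, j in zip(points, assignment):
            counts[j] += 1
            for t in range(dim):
                sums[j][t] += x[t]
        for j in range(k):
            if counts[j] > 0:  # assumption: an empty cluster keeps its old center
                centers[j] = [s / counts[j] for s in sums[j]]
    return centers, assignment


def wcssd(points, centers, assignment):
    """Within-cluster sum of squared distortions (the objective in Equation 1)."""
    return sum(squared_distance(x, centers[j]) for x, j in zip(points, assignment))
```

The inner `min` over all `k` centers is the part that the approximate methods reviewed below try to avoid.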

2.2 Handling large data

Distance computation elimination.  Various approaches have been proposed to speed up exact k-means. An accelerated algorithm Elkan03 uses the triangle inequality and keeps track of lower and upper bounds on the distances between points and centers to avoid unnecessary distance calculations. However, it requires extra storage, rendering it impractical for a large number of clusters.

Subsampling.  An alternative way to speed up k-means is to sub-sample the data points. One way is to run k-means over the sub-sampled data points, and then to directly assign the remaining points to the clusters. An extension of this solution is to optionally add the remaining points incrementally, and to rerun k-means to get a finer clustering. The former scheme is not applicable in many applications. As pointed out in PhilbinCISZ07 , it results in less accurate clustering and lower performance in image retrieval applications. The Coremeans algorithm FrahlingS08 uses the latter scheme. It begins with a coreset and incrementally increases the size of the coreset. As pointed out in FrahlingS08 , Coremeans works well only for a small number of clusters. Consequently, those methods are not suitable for large-scale clustering problems, especially for problems with a large number of clusters.

Data organization.  The approach in KanungoMNPSW02 presents a filtering algorithm. It begins by storing the data points in a k-d tree and maintains, for each node of the tree, a subset of candidate centers. The candidates for each node are pruned, or filtered, as they propagate to the children, which reduces the computation time by avoiding comparing each center with all the points. But as that paper points out, it works well only when the number of clusters is small.

In the community of document processing, Canopy clustering McCallumNU00 , which is closely related to our approach, first divides the data points into many overlapping subsets (called canopies), and clustering is performed by measuring exact distances only between points that occur within a common canopy. This eliminates a lot of unnecessary distance computations.

2.3 Handling large clusters

Hierarchical k-means.  Hierarchical k-means (HKM) uses a clustering tree instead of flat k-means NisterS06 to reduce the number of clusters in each assignment step. It first clusters the points into a small number (e.g., 10) of clusters, then recursively divides each cluster until a certain depth is reached. The leaves in the resulting clustering tree are considered to be the final clusters. (With 10 clusters per node and a depth of 6, one obtains $10^6$, i.e., one million clusters.)

Suppose that the data points associated with each node of the hierarchical tree are divided into a few (e.g., a constant number $\bar{k}$, much smaller than $k$) subsets (clusters). In each recursion, each point can only be assigned to one of the $\bar{k}$ clusters, and the depth of the recursions is $O(\log k)$. The computational cost is $O(n \log k)$ (ignoring the small constant number $\bar{k}$).
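As a small worked example of this cost argument, the snippet below counts the center comparisons needed to quantize one point with a flat codebook versus a clustering tree. The branching factor of 10 is the example value from the text; the depth of 6 and the helper names are illustrative assumptions.

```python
def flat_comparisons(num_clusters):
    """Flat k-means compares a point with every cluster center."""
    return num_clusters


def hierarchical_comparisons(branching, depth):
    """A clustering tree compares a point with `branching` centers per level."""
    return branching * depth


branching, depth = 10, 6
num_leaves = branching ** depth  # number of final clusters
print(num_leaves, flat_comparisons(num_leaves), hierarchical_comparisons(branching, depth))
# prints: 1000000 1000000 60
```

The tree needs 60 comparisons per point instead of one million, which is where the logarithmic dependence on the number of clusters comes from.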

Approximate k-means.  In PhilbinCISZ07 approximate nearest neighbor (ANN) search replaces the exact nearest neighbor (NN) search in the assignment step when searching for the nearest cluster center for each point. In particular, the current cluster centers in each k-means iteration are organized by a forest of k-d trees to perform an accelerated approximate NN search. The cost of the assignment step is reduced to $O(Mn \log k)$, with $M$ being the number of accessed nearest cluster candidates in the k-d trees. Refined-AKM (RAKM) Philbin10a further improves the convergence speed by enforcing constraints of non-increasing objective values during the iterations. Both AKM and RAKM require a considerable overhead of constructing k-d trees in each k-means iteration, thus a trade-off between the speed and the accuracy of the nearest neighbor search has to be made.

2.4 Others

There are other complementary works on improving k-means clustering. In Sculley10 , the update step is sped up by transforming a batch update into a mini-batch update. The high-dimensionality issue has also been addressed by using dimension reduction, e.g., random projections BoutsidisZ10 ; FernB03a and product quantization JegouDS11 .

Object discovery and mining from spatially related images is one topic that is related to image clustering ChumM10 ; LiWZLF08 ; PhilbinZ08 ; RaguramWFL11 ; SimonSS07 , which also aims to cluster the images so that each group contains the same object. This is a potential application of our scalable k-means algorithm. In WangWZTGL12 ; WangWZTGL13 , we introduce an algorithm for clustering spatially-related images based on the neighborhood graph. The idea of constructing the neighborhood graph is to adopt multiple spatial partition trees, which is similar to the idea of this paper.

3 k-means with cluster closures

In this section, we first introduce the proposed approach, then give the analysis and discussions, and finally present the implementation details.


Figure 1: The distribution of the distance ratio. It shows that most active points have smaller distance ratio and lie near some cluster boundaries


Figure 2: Illustration of uniting neighborhoods to obtain the closure. The black dash line indicates the closure of cluster


Figure 3: The coverage of the active points by the closure w.r.t. the neighborhood size. A neighborhood of size 50 has about coverage

3.1 Approach

Active points.  k-means clustering partitions the data space into Voronoi cells: each cell is a cluster and the cluster center is the center of the cell. In the assignment step, each point is assigned to its nearest cluster center. We call points that change cluster assignments in an iteration active points. In other words, an active point $x$ changes cluster membership from the $i$-th cluster to the $j$-th cluster because $d(x, c_j) < d(x, c_i)$, where $d(\cdot, \cdot)$ is the distance function.

We observe that active points are close to the boundary between $G_i$ and $G_j$. To verify this, we define the distance ratio for an active point $x$ as: $r(x) = 1 - d(x, c_j) / d(x, c_i)$. The distance ratio $r(x)$ is in the range of $(0, 1]$, since we only compute the distance ratio for active points. Smaller values of $r(x)$ mean that $x$ is closer to the cluster boundaries. Figure 1 shows the distribution of distance ratios when clustering GIST features from the Tiny image data set (described in 4.1) to clusters. We can see that most active points have small distance ratios, e.g. more than of the active points have a distance ratio less than (shown in the red area), and thus lie near to cluster boundaries.
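The following sketch shows how active points and their distance ratios could be measured for one iteration. It assumes the ratio takes the form 1 − d(x, c_new) / d(x, c_old), which matches the stated range and the property that smaller values lie closer to the boundary; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def active_points_with_ratio(points, old_assignment, centers):
    """Return (index, old cluster, new cluster, ratio) for every active point.

    Assumed ratio: 1 - d(x, c_new) / d(x, c_old). It is near 0 when the two
    centers are almost equally far away, i.e. x is near the cell boundary.
    """
    result = []
    for i, x in enumerate(points):
        old = old_assignment[i]
        new = min(range(len(centers)), key=lambda j: euclidean(x, centers[j]))
        d_old = euclidean(x, centers[old])
        d_new = euclidean(x, centers[new])
        if new != old and d_new < d_old:  # strictly closer center found
            result.append((i, old, new, 1.0 - d_new / d_old))
    return result
```

A histogram of the fourth tuple entry over all active points gives the kind of distribution summarized in Figure 1.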

During the assignment step, we only need to identify the active points and change their cluster memberships. The above observation that active points lie close to cell boundaries suggests a novel approach to speed up the assignment step by identifying active points around cell boundaries.

Cluster closures.  Assume for now that we have identified the neighborhood of a given point $x$, a set of points containing $x$’s neighboring points and $x$ itself, denoted by $\mathcal{N}_x$. We define the closure of a cluster $G_j$ as:

$$\bar{G}_j = \bigcup_{x \in G_j} \mathcal{N}_x \qquad (3)$$

Figure 2 illustrates the relationship between the cluster, the neighborhood points, and the closure.

If active points are on the cluster boundaries, as we have observed, then by increasing the neighborhood size $|\mathcal{N}_x|$, the group closure $\bar{G}_j$ will be accordingly expanded to cover more active points that will be assigned to this group in the assignment step. Figure 3 shows the recall (of an active point being covered by the closure of its newly assigned cluster) vs. the neighborhood size over the Tiny image data set described in Section 4.1. Similar results are also observed in other data sets. As we can see, with a neighborhood size as small as 50, about of the active points are covered by the closures of the clusters to which these active points will be re-assigned.

We now turn to the question of how to efficiently compute the neighborhood $\mathcal{N}_x$ of a given point $x$ used in Equation 3. We propose an ensemble approach using multiple random spatial partitions. A single approximate neighborhood for each point can be derived from a random partition (RP) tree VermaKD09 , and the final neighborhood is assembled by combining the results from multiple random spatial partitions. Suppose that a leaf node of a single RP tree contains a set of points $V$; we consider all the points in $V$ to be mutually neighboring. Thus the neighborhood of a point $x$ in the set $V$ can be straightforwardly computed as $\mathcal{N}_x = V$.

Since RP trees are efficient to construct, the above neighborhood computation is also efficient. While the group closure from one single RP tree may miss some active points, using multiple RP trees effectively handles this problem. We simply unite the neighborhoods of $x$ from all the $L$ RP trees:

$$\mathcal{N}_x = \bigcup_{l=1}^{L} V_x^{l}$$

Here $V_x^{l}$ is the set of points in the leaf from the $l$-th RP tree that contains $x$. Note that a point $x$ may belong to multiple group closures. Also note that the neighborhood of a given point is computed only once.
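The neighborhood and closure construction can be illustrated with a short sketch. It is a simplification under stated assumptions: each tree splits at the median along a random Gaussian direction instead of the sampled principal direction described in Section 3.4, and all function names are hypothetical.

```python
import random


def rp_tree_leaves(points, indices, leaf_size, rng):
    """Leaves (lists of point indices) of one balanced random partition tree."""
    if len(indices) <= leaf_size:
        return [indices]
    dim = len(points[0])
    # Assumption: a random direction stands in for the sampled principal direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    ordered = sorted(indices, key=lambda i: sum(w * v for w, v in zip(direction, points[i])))
    half = len(ordered) // 2  # median split keeps the tree balanced
    return (rp_tree_leaves(points, ordered[:half], leaf_size, rng)
            + rp_tree_leaves(points, ordered[half:], leaf_size, rng))


def build_neighborhoods(points, num_trees, leaf_size, seed=0):
    """Unite, for every point, the leaves that contain it across all trees."""
    rng = random.Random(seed)
    neighborhoods = [{i} for i in range(len(points))]
    for _ in range(num_trees):
        for leaf in rp_tree_leaves(points, list(range(len(points))), leaf_size, rng):
            for i in leaf:
                neighborhoods[i].update(leaf)
    return neighborhoods


def build_closures(assignment, neighborhoods, k):
    """Closure of cluster j: union of the neighborhoods of its member points."""
    closures = [set() for _ in range(k)]
    for i, j in enumerate(assignment):
        closures[j].update(neighborhoods[i])
    return closures
```

The neighborhoods depend only on the data, so `build_neighborhoods` runs once, while `build_closures` is re-evaluated whenever the assignments change.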

Fast assignment.  With the group closures computed from Equation 3, the assignment step can be done by verifying whether a point belonging to the closure $\bar{G}_j$ should indeed be assigned to the cluster $G_j$:

  • Initialization step: Initialize the distance array $D[1{:}n]$ by assigning a positive infinity value to each entry.

  • Closure-based assignment:
          For each cluster closure $\bar{G}_j$:
                               For each point $x_s \in \bar{G}_j$:
                                    If $\| x_{i_s} - c_j^{(t)} \|_2^2 < D[i_s]$, set $z_{i_s}^{(t+1)} = j$ and $D[i_s] = \| x_{i_s} - c_j^{(t)} \|_2^2$.

    Here $c_j^{(t)}$ is the cluster center of $G_j$ at the $t$-th iteration, $i_s$ is the global index for $x_s$ and $s$ is the index into $\bar{G}_j$ for point $x_{i_s}$.
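The two steps above can be sketched as follows. The closures are assumed to be given as sets of global point indices (one set per cluster), and the function name and return values are illustrative choices, not the authors' code.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def closure_assignment(points, centers, closures, previous_assignment):
    """One closure-based assignment step.

    closures[j] holds the global indices of the points in the closure of
    cluster j. Only these (point, center) pairs are ever compared.
    """
    best = [math.inf] * len(points)       # initialization step: distance array
    assignment = list(previous_assignment)
    for j, closure in enumerate(closures):
        for i in closure:
            dist = squared_distance(points[i], centers[j])
            if dist < best[i]:             # closer candidate center found
                best[i] = dist
                assignment[i] = j
    return assignment, best
```

Because every point lies in its own neighborhood, it always appears in the closure of its current cluster, so each entry of `best` ends up finite.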

In the assignment step, we only need to compute the distance from the center of a cluster to each point in the cluster closure. A point typically belongs to a small number of cluster closures. Thus, instead of computing the distances from a point $x$ to all cluster centers as in exact k-means, or constructing k-d trees of all cluster centers at each iteration to find the approximate nearest cluster center, we only need to compute the distance from $x$ to a small number of cluster centers whose cluster closures contain $x$, resulting in a significant reduction in computational cost. Moreover, the fact that active points are close to cluster boundaries is the worst case for k-d trees to find the nearest neighbor. On the contrary, such a fact is advantageous for our algorithm.

3.2 Analysis

Convergence.  The following shows that our algorithm always converges. Since the objective function is lower-bounded, the convergence can be guaranteed if the objective value does not increase at each iterative step.

Theorem 3.1 (Non-increase)

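The value of the objective function does not increase at each iterative step, i.e.,

$$J(\mathcal{C}^{(t+1)}, \mathcal{G}^{(t+1)}) \leq J(\mathcal{C}^{(t)}, \mathcal{G}^{(t)}) \qquad (4)$$

Proof. In the assignment step for the $(t+1)$-th iteration, the closures $\bar{G}_j^{(t)}$ computed from the $t$-th iteration provide the cluster candidates. A point $x_i$ would change its cluster membership only if it finds a closer cluster center, thus we have $\| x_i - c_{z_i^{(t+1)}}^{(t)} \|_2^2 \leq \| x_i - c_{z_i^{(t)}}^{(t)} \|_2^2$, and Equation 4 holds for the assignment step.

In the update step, the cluster center will then be updated based on the new point assignments. We now show that this update will not increase the within-cluster sum of squared distortions, or in a more general form:

$$\sum_{x_i \in G_j^{(t+1)}} \| x_i - c_j^{(t+1)} \|_2^2 \leq \sum_{x_i \in G_j^{(t+1)}} \| x_i - c \|_2^2 \qquad (5)$$

where $c_j^{(t+1)}$ is the $j$-th updated cluster center, $c_j^{(t+1)} = \frac{1}{|G_j^{(t+1)}|} \sum_{x_i \in G_j^{(t+1)}} x_i$, and $c$ is an arbitrary point in the data space. Equation 5 can be verified by the following:

$$\begin{aligned} \sum_{x_i \in G_j^{(t+1)}} \| x_i - c \|_2^2 &= \sum_{x_i \in G_j^{(t+1)}} \| (x_i - c_j^{(t+1)}) + (c_j^{(t+1)} - c) \|_2^2 \\ &= \sum_{x_i \in G_j^{(t+1)}} \| x_i - c_j^{(t+1)} \|_2^2 + 2 (c_j^{(t+1)} - c)^\top \sum_{x_i \in G_j^{(t+1)}} (x_i - c_j^{(t+1)}) + |G_j^{(t+1)}| \, \| c_j^{(t+1)} - c \|_2^2 \\ &= \sum_{x_i \in G_j^{(t+1)}} \| x_i - c_j^{(t+1)} \|_2^2 + |G_j^{(t+1)}| \, \| c_j^{(t+1)} - c \|_2^2 \\ &\geq \sum_{x_i \in G_j^{(t+1)}} \| x_i - c_j^{(t+1)} \|_2^2 , \end{aligned}$$

where the cross term vanishes because $c_j^{(t+1)}$ is the mean of the points in $G_j^{(t+1)}$. Thus Equation 4 holds for the update step. ∎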
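The update-step argument rests on a standard identity: the sum of squared distances to an arbitrary point equals the sum of squared distances to the mean plus the cluster size times the squared distance between the mean and that point. A small numerical check of this identity, with hypothetical helper names and made-up sample points, is:

```python
def mean_point(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[t] for p in points) / n for t in range(len(points[0]))]


def sum_squared_distances(points, c):
    """Sum of squared Euclidean distances from every point to c."""
    return sum(sum((pt - ct) ** 2 for pt, ct in zip(p, c)) for p in points)


cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]   # illustrative data
center = mean_point(cluster)                      # the updated center (the mean)
other = (3.0, -1.0)                               # an arbitrary competing point

lhs = sum_squared_distances(cluster, other)
gap = len(cluster) * sum((a - b) ** 2 for a, b in zip(center, other))
rhs = sum_squared_distances(cluster, center) + gap
```

Here `lhs` and `rhs` agree, and since `gap` is non-negative, replacing any center by the mean can never increase the within-cluster distortion.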

Accuracy.  Our algorithm obtains the same result as the exact Lloyd’s algorithm if the closures of the clusters are large enough, in such a way that all the points that would have been assigned to the $j$-th cluster by Lloyd’s algorithm belong to the cluster closure $\bar{G}_j$. However, it should be noted that this condition is sufficient but not necessary. In practice, even with a small neighborhood, our approach often obtains results similar to those of the exact Lloyd’s algorithm. The reason is that the missing points, which should have been assigned to a cluster at the current iteration but are missed, are close to the cluster boundary and thus likely to appear in the closures of the new clusters updated by the current iteration. As a result, these missing points are very likely to be correctly assigned in the next iteration (“correctly” here means in agreement with the assignments that Lloyd’s algorithm would produce).

Complexity.  Consider a point $x$ and its neighborhood $\mathcal{N}_x$; the possible groups that may absorb $x$ are $\mathcal{Z}_x = \{ z(y) \mid y \in \mathcal{N}_x \}$, where $z(y)$ is the cluster membership of point $y$. As a result, we have $|\mathcal{Z}_x| \leq |\mathcal{N}_x|$. In our implementation, we use balanced random bi-partition trees, with each leaf node containing $m$ points ($m$ is a small number). Suppose we use $L$ random partition trees. Then the neighborhood size of a point will not be larger than $Lm$. As a result, the complexity of the closure-based assignment step is $O(nLmd)$.

For the complexity of constructing trees, our approach constructs an RP tree in $O(n \log n)$ and AKM costs $O(k \log k)$ to build a k-d tree. However, our approach only needs a small number (typically in our clustering experiments) of trees through all iterations, but AKM requires constructing a number (e.g., in PhilbinCISZ07 ) of trees in each iteration, which makes its total cost more expensive.

3.3 Discussion

We compare our approach with the three most relevant algorithms: Canopy clustering, approximate k-means, and hierarchical k-means.

Versus Canopy clustering.  Canopy clustering suffers from the cost of canopy creation, which is high for visual features. More importantly, it is non-trivial (1) to define a meaningful and efficient approximate distance function for visual data, and (2) to tune the parameters for computing the canopies, both of which are crucial to the effectiveness and efficiency of Canopy clustering. In contrast, our approach is simpler and more efficient because random partitions can be created with a cost of only $O(n \log n)$. Moreover, our method can adaptively update cluster member candidates, in contrast to the static canopies in McCallumNU00 .

Versus AKM.  The advantages of the proposed approach over AKM are summarized as follows. First, the computational complexity of assigning a new cluster to a point in our approach is only $O(1)$, while the complexity is $O(\log k)$ for AKM or RAKM. Second, we only need to organize the data points once, as the data points do not change during the iterations, in contrast to AKM or RAKM, which need to construct the k-d trees at each iteration as the cluster centers change from iteration to iteration. Last, it is shown that active points (points near cluster boundaries) present the worst case for the ANN search used in AKM to return their accurate nearest neighbors. In contrast, our approach is able to identify active points efficiently and makes more accurate cluster assignments for active points, without this shortcoming of AKM.

Versus HKM.  As shown before, HKM takes less time than AKM and our approach. However, its clustering accuracy is not as good as that of AKM and our approach. This is because when assigning a point to a cluster (e.g., quantizing a feature descriptor) in HKM, an error may be committed at a higher level of the tree, leading to a sub-optimal cluster assignment and thus sub-optimal quantization.

3.4 Implementation details

The random partition tree used for creating cluster closures is a binary tree structure that is formed by recursively splitting the space and aims to organize the data points in a hierarchical manner. Each node of the tree is associated with a region in the space, called a cell. These cells define a hierarchical decomposition of the space. The root node is associated with the whole set of data points. Each internal node is associated with the subset of data points that lie in the cell of the node. It has two child nodes, which correspond to two disjoint subsets of those data points. A leaf node may be associated with a subset of data points or contain only a single point. In the implementation, we use a random principal direction to form the partition hyperplane that splits the data points into two subsets. The principal directions are obtained by using principal component analysis (PCA). To generate random principal directions, rather than computing the principal direction from the whole subset of points, we compute the principal direction over points randomly sampled from each subset. In our implementation, the principal direction is computed by the Lanczos algorithm Lanczos50 .
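A single split of such a tree can be sketched as below. Two simplifications are assumed and should not be read as the authors' implementation: plain power iteration replaces the Lanczos algorithm for finding the leading principal direction, and the split point is the median of the projections. The function names and the `sample_size` parameter are hypothetical.

```python
import math
import random


def principal_direction(points, sample_size, rng, iterations=50):
    """Leading principal direction of a random sample, via power iteration.

    Assumption: power iteration is used here as a simple stand-in for Lanczos.
    """
    sample = rng.sample(points, min(sample_size, len(points)))
    dim = len(sample[0])
    mean = [sum(p[t] for p in sample) / len(sample) for t in range(dim)]
    centered = [[p[t] - mean[t] for t in range(dim)] for p in sample]
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        # w = (X^T X) v, accumulated as sum_i (x_i . v) x_i
        w = [0.0] * dim
        for x in centered:
            proj = sum(a * b for a, b in zip(x, v))
            for t in range(dim):
                w[t] += proj * x[t]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break  # degenerate sample (e.g. all points identical)
        v = [a / norm for a in w]
    return v


def split_by_direction(points, indices, direction):
    """Split the indexed points into two halves at the median projection."""
    ordered = sorted(indices, key=lambda i: sum(w * c for w, c in zip(direction, points[i])))
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]
```

Sampling a different subset at every node is what makes the resulting trees differ from one another, so that their leaves yield complementary neighborhoods.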

We use an adaptive scheme that incrementally creates random partitions to automatically expand the group closures on demand. At the beginning of our algorithm, we create only one random partition tree. After each iteration, we compute the reduction rate of the within-cluster sum of squared distortions. If the reduction rate in successive iterations is smaller than a predefined threshold, a new random partition tree is added to expand the points’ neighborhoods and thus the group closures. We compare the adaptive neighborhood scheme to a static one that computes all the neighborhoods at the beginning (called static neighborhoods). As shown in Figure 4, the adaptive neighborhood scheme performs better in all the iterations and hence is adopted in the later comparison experiments.
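The trigger for adding a tree can be expressed as a small predicate. The sketch assumes that the reduction rate is the relative decrease of WCSSD between two successive iterations; the exact definition and the threshold value are not specified here, so both are illustrative.

```python
def should_add_tree(wcssd_history, threshold):
    """Decide whether to grow the neighborhoods with one more RP tree.

    Assumption: the reduction rate is (previous - current) / previous,
    computed from the last two recorded WCSSD values.
    """
    if len(wcssd_history) < 2:
        return False
    previous, current = wcssd_history[-2], wcssd_history[-1]
    if previous <= 0:
        return False  # nothing left to reduce
    reduction_rate = (previous - current) / previous
    return reduction_rate < threshold
```

For example, `should_add_tree([100.0, 99.9], 0.01)` is true because the objective dropped by only 0.1%, whereas a drop from 100 to 80 would leave the current set of trees unchanged.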

The closure-based assignment step can be implemented in another equivalent way. For each point $x$, we first identify the candidate centers by checking the cluster memberships $\mathcal{Z}_x$ of the points within the neighborhood of $x$. Here $\mathcal{Z}_x = \{ z(y) \mid y \in \mathcal{N}_x \}$, and $z(y)$ is the cluster membership of point $y$. Then the best cluster candidate for $x$ can be found by checking the clusters $\{ c_j \mid j \in \mathcal{Z}_x \}$. In this equivalent implementation, the assignments are computed independently and can be naturally parallelized. The update step computes the mean of each cluster independently, which can be naturally parallelized as well. Thus, our algorithm can be easily parallelized. We show the clustering performance with the parallel implementation (using multiple threads on multi-core CPUs) in Figure 5.
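A minimal sketch of this point-centric formulation follows. Neighborhoods are assumed to be sets of point indices that are symmetric (if y is a neighbor of x then x is a neighbor of y), as is the case for leaf-based neighborhoods; the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def candidate_clusters(i, neighborhoods, assignment):
    """Cluster memberships of the points in the neighborhood of point i."""
    return {assignment[y] for y in neighborhoods[i]}


def assign_point(i, points, centers, neighborhoods, assignment):
    """Nearest center among the candidates only (ties go to the lower index)."""
    candidates = candidate_clusters(i, neighborhoods, assignment)
    return min(candidates, key=lambda j: (squared_distance(points[i], centers[j]), j))


def assign_all(points, centers, neighborhoods, assignment):
    """Each point reads only the previous assignment, so the calls are independent."""
    return [assign_point(i, points, centers, neighborhoods, assignment)
            for i in range(len(points))]
```

Since `assign_point` never writes shared state, the loop in `assign_all` could be distributed over worker threads or processes without synchronization.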


Figure 4: Clustering performance with adaptive vs. static neighborhoods


Figure 5: Clustering performance with different numbers of threads
Figure 6: Clustering performance in terms of within-cluster sum of squared distortions (WCSSD) vs. time. The first row are the results of clustering 1M SIFT dataset into 0.5K, 2K and 10K clusters, respectively. The second row are results on 1M tiny image dataset
Figure 7: Clustering performance in terms of normalized mutual information (NMI) vs. time, on the dataset of (a) 200K tiny images, (b) 500K tiny images, and (c) 200K shopping images

4 Experiments

4.1 Data sets

SIFT.  The SIFT features are collected from the Caltech 101 data set FeiFP04 . We extract maximally stable extremal regions for each image, and compute a -dimensional SIFT feature for each region. We randomly sample million features to form this data set.

Tiny images.  We generate three data sets sampled from the tiny images TorralbaFF08 : tiny images, tiny images, and tiny images. The tiny images are randomly sampled without using category (tag) information. We sample () tags from the tiny images and sample about () images for each tag, forming () images. We use a -dimensional GIST feature to represent each image.

Shopping images.  We collect about shopping images from the Internet. Each image is associated with a tag to indicate its category. We sample tags and sample images for each tag to form the image set. We use a -dimensional HOG feature to represent each image.

Oxford .  This data set PhilbinCISZ07 consists of high resolution images of Oxford landmarks. The collection has been manually annotated to generate a comprehensive ground truth for different landmarks, each represented by possible queries. This gives a set of queries over which an object retrieval system can be evaluated. The images, the SIFT features, and the ground truth labeling of this data set are publicly available. This data set and the next data set will be used to demonstrate the application of our approach to object retrieval.

Ukbench .  This data set is from the Recognition Benchmark introduced in NisterS06 . It consists of images split into four-image groups, each of the same scene/object taken at different viewpoints. The data set, the SIFT descriptors, and the ground truth are publicly available.

4.2 Evaluation metric

We use two metrics to evaluate the performance of various clustering algorithms, the within-cluster sum of squared distortions (WCSSD) which is the objective value defined by Equation 1, and the normalized mutual information (NMI) which is widely used for clustering evaluation. NMI requires the ground truth of cluster assignments for points in the data set. Given a clustering result , NMI is defined by , where is the mutual information of and and is the entropy.
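For readers who want to reproduce the NMI metric, the sketch below computes it from two label lists. It assumes the common normalization by the geometric mean of the two entropies, I(X;Y) / sqrt(H(X) H(Y)); other normalizations exist, and returning 0 when either labeling has zero entropy is a convention chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same points."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(a, b):
    """Assumed normalization: I(a; b) / sqrt(H(a) * H(b))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # convention for a degenerate (single-cluster) labeling
    return mutual_information(a, b) / math.sqrt(h_a * h_b)
```

NMI is invariant to relabeling of the clusters, so a clustering that matches the ground truth up to a permutation of cluster indices scores 1.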

In object retrieval, image feature descriptors are quantized into visual words using codebooks. A codebook of high quality will result in fewer quantization errors and more repeatable quantization results, thus leading to better retrieval performance. We apply various clustering algorithms to construct visual codebooks for object retrieval. By fixing all the other components and parameters in our retrieval system except the codebook, the retrieval performance becomes an indicator of the quality of the codebook. For the Oxford dataset, we follow PhilbinCISZ07 and use mean average precision (mAP) to evaluate the retrieved images. For the ukbench dataset, the retrieval performance is measured by the average number of relevant images in the top retrieved images, ranging from to .

4.3 Clustering performance comparison

We compare our proposed clustering algorithm with four approximate k-means algorithms, namely hierarchical k-means (HKM), approximate k-means (AKM), refined approximate k-means (RAKM) and the Canopy algorithm. The exact Lloyd’s algorithm is much less efficient and prohibitively costly for large data sets, so we do not report its results. We use the implementation of HKM available from MujaL09 , and the public release of AKM. The RAKM implementation is modified from this AKM release. For the Canopy algorithm, we conduct principal component analysis over the features to project them to a lower-dimensional subspace to achieve a fast canopy construction. For a fair comparison, we initialize the cluster assignment by a random partition tree in all algorithms except HKM. The time costs for constructing trees or other initialization are all included in the comparisons. All algorithms are run on a 2.66GHz desktop PC using a single thread.

Figure 6 shows the clustering performance in terms of WCSSD vs. time. The experiments are performed on two data sets, the -dimensional SIFT data set and the -dimensional tiny image data set, respectively. The results are shown for different numbers of clusters, ranging from to . Our approach consistently outperforms the other four approximate k-means algorithms: it converges faster to a smaller objective value.

Figure 7 shows the clustering results in terms of NMI vs. time. We use three labeled datasets, the tiny images, the tiny images and the shopping images. Consistent with the WCSSD comparison results, our proposed algorithm is superior to the other four clustering algorithms.

We also show the qualitative clustering results of our algorithm. Figure 8 shows some examples of the clustering results over the shopping images. Figure 9 shows some examples of the clustering results over the tiny images. The first clusters are examples of similar objects, the second clusters are examples of similar texture images, and the last cluster is an example of similar sceneries.

Figure 8: Clustering results over the shopping images: each cluster example is represented by two rows of images which are randomly picked out from the cluster
Figure 9: Clustering results over the tiny images: each cluster example is represented by two rows of images which are randomly picked out from the cluster.

4.4 Empirical analysis

We conduct empirical studies to understand why our proposed algorithm has superior performance. In particular we compare our proposed approach with AKM PhilbinCISZ07 and RAKM Philbin10a in terms of the accuracy and the time cost of cluster assignment, using the task of clustering the Tiny image data set into clusters. To put the methods on the same ground, the number of candidate clusters for each point in the assignment step is set to be the same. For (R)AKM, the number of candidate clusters is simply the number of points accessed in the k-d trees when searching for a nearest neighbor. For our proposed algorithm, we partition the data points with RP trees such that the average number of candidate clusters is the same as the number of accessed points in the k-d trees. Figure 10(a) compares the accuracy of cluster assignment when varying the number of candidate clusters. We can see that our approach has a much higher accuracy in all cases, which has a positive impact on the iterative clustering algorithm and makes it converge faster. Figure 10(b) compares the time of performing one iteration when varying the number of candidate clusters for each point. We can see that our algorithm is much faster than (R)AKM in all cases, e.g., taking only about half the time of (R)AKM when . This is as expected since finding the best cluster costs $O(1)$ for our algorithm but $O(\log k)$ for the k-d trees used in (R)AKM.

Figure 10: Comparison of accuracy and time in the assignment step when clustering the Tiny image data set into clusters. (a) accuracy vs. the number of cluster candidates; (b) time for one iteration vs. the number of cluster candidates
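The assignment accuracy discussed above can be made concrete with a small sketch. It assumes that accuracy means the fraction of points whose nearest center within the restricted candidate set coincides with the true nearest center over all centers; the function names and the toy data are illustrative and do not come from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers, candidate_ids=None):
    """Index of the nearest center, optionally restricted to candidate_ids."""
    ids = range(len(centers)) if candidate_ids is None else candidate_ids
    return min(ids, key=lambda j: squared_distance(point, centers[j]))


def assignment_accuracy(points, centers, candidates):
    """Fraction of points whose restricted assignment equals the exact one."""
    hits = 0
    for point, cand in zip(points, candidates):
        if nearest_center(point, centers, cand) == nearest_center(point, centers):
            hits += 1
    return hits / len(points)


# Toy data (hypothetical): three centers, four points, hand-picked candidates.
centers = [(0, 0), (10, 0), (0, 10)]
points = [(1, 1), (9, 1), (1, 9), (6, 0)]
candidates = [[0, 1], [0, 2], [2], [1, 2]]
accuracy = assignment_accuracy(points, centers, candidates)
```

In this toy case the second point misses its true center because center 1 is absent from its candidate list, which is exactly the kind of error a better candidate set avoids.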

We perform another empirical study to investigate the bucket size parameter in the RP tree, using the task of clustering the SIFT dataset into clusters. Figure 11 shows the results in terms of WCSSD vs. the number of iterations, with bucket sizes set to , , , , respectively. A larger bucket size leads to a larger WCSSD reduction in each iteration, because it effectively increases the neighborhood size for each data point. Figure 11 also shows the results in terms of WCSSD vs. time. We observe that at the beginning, bucket sizes of and perform even better than the bucket size of . Eventually, however, the performance of the various bucket sizes is similar. The difference between the two views in Figure 11 is expected, as a larger bucket size leads to a better cluster assignment at each iteration but increases the time cost of one iteration. In our comparison experiments, a bucket size of is adopted.
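For reference, the quantity tracked in this study can be sketched as follows, assuming WCSSD denotes the within-cluster sum of squared distortions, i.e. the standard k-means objective. The helper names and the toy values are illustrative only.

```python
def wcssd(points, centers, labels):
    """Sum over all points of the squared distance to the assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


def update_centers(points, labels, num_clusters):
    """Mean of the points in each cluster (empty clusters are returned as None)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for point, label in zip(points, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += point[d]
    return [
        tuple(s / counts[k] for s in sums[k]) if counts[k] else None
        for k in range(num_clusters)
    ]


# Toy data (hypothetical): recomputing the means never increases WCSSD.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
before = wcssd(points, [(0.0, 0.0), (10.0, 0.0)], labels)
after = wcssd(points, update_centers(points, labels, 2), labels)
```

Plotting this value against the iteration count and against wall-clock time gives the two views compared in the bucket-size study.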

4.5 Evaluation using object retrieval

We compare the quality of codebooks built by HKM, (R)AKM, and our approach, using the performance of object retrieval. AKM and RAKM perform almost the same when the number of accessed candidate centers is large enough, so we only present results from AKM.

We perform the experiments on the UKBench dataset which has local features, and on the Oxford dataset which has local features. Following PhilbinCISZ07, we run the clustering algorithms to build the codebooks and test only the filtering stage of the retrieval system, i.e., retrieval is performed using the inverted file (including the tf-idf weighting).
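The filtering stage can be illustrated with a minimal inverted file over quantized visual words. This is a sketch under simplifying assumptions, not the evaluated system: images are scored by an unnormalized dot product of tf-idf vectors, whereas real retrieval pipelines typically normalize the vectors and use a specific distance. All names and the toy image ids are hypothetical, and every image is assumed to contain at least one visual word.

```python
import math
from collections import Counter, defaultdict


def build_inverted_file(images):
    """images: dict mapping image id -> non-empty list of visual word ids."""
    index = defaultdict(dict)  # word id -> {image id: term frequency}
    for image_id, words in images.items():
        for word, count in Counter(words).items():
            index[word][image_id] = count / len(words)
    num_images = len(images)
    # Inverse document frequency: words seen in fewer images weigh more.
    idf = {word: math.log(num_images / len(post)) for word, post in index.items()}
    return index, idf


def search(index, idf, query_words):
    """Rank images by the dot product of tf-idf vectors, touching only
    the posting lists of words that occur in the query."""
    scores = defaultdict(float)
    for word, count in Counter(query_words).items():
        if word not in index:
            continue
        query_weight = (count / len(query_words)) * idf[word]
        for image_id, tf in index[word].items():
            scores[image_id] += query_weight * tf * idf[word]
    return sorted(scores.items(), key=lambda item: -item[1])


# Toy database (hypothetical): word ids stand in for codebook entries.
images = {"a": [1, 1, 2], "b": [2, 3], "c": [3, 3, 4]}
index, idf = build_inverted_file(images)
ranked = search(index, idf, [1, 2])
```

Because only the posting lists of the query's words are visited, images that share no visual word with the query are never scored, which is what makes the inverted file suitable as a fast filter.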

The results over the UKBench dataset are obtained by constructing a codebook and using the distance metric. The results of HKM and AKM are taken from NisterS06 and PhilbinCISZ07, respectively. From Table 1, we see that for the same codebook size, our method outperforms the other approaches. In addition, we conduct the experiment over subsets of various sizes, which means that only the images in the subset are used as queries and the search range is also constrained to the subset. The performance comparison is given in Figure 12, from which we can see that our approach consistently achieves superior performance.

Figure 11: Clustering performance vs. the bucket size of an RP tree

  Method Scoring levels Average Top
  HKM
  HKM
  HKM
  HKM
  AKM
  Ours

Table 1: A comparison of our approach to HKM and AKM on the UKBench data set using a -word codebook


Figure 12: A comparison of our approach to HKM and AKM on the UKBench data set with various subset sizes

The performance comparison using the Oxford dataset is shown in Table 2. We show the results of using the bag-of-words (BoW) representation with a codebook and using spatial re-ranking PhilbinCISZ07. Our approach achieves the best performance, outperforming AKM in both the BoW representation and spatial re-ranking. We also compare the performance of our approach to AKM and HKM using different codebook sizes, as shown in Table 3. Our approach outperforms the other approaches across the different codebook sizes. Unlike AKM, which achieves its best performance with a -word codebook, our approach obtains its best performance with a -word codebook, indicating that our approach produces a higher quality codebook.
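The mAP scores reported for the Oxford dataset are means of per-query average precision. A minimal sketch of that metric is given below, assuming the common rank-based definition (mean of the precision values at the ranks of the relevant images). The official Oxford evaluation additionally handles "junk" images and may compute the area under the precision-recall curve differently, so this is an illustration rather than the exact protocol; the names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at the rank of every relevant item.
    Relevant items that are never retrieved contribute zero."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(results, ground_truth):
    """results and ground_truth map each query id to a ranked list and
    to the set of relevant ids, respectively."""
    scores = [average_precision(results[q], ground_truth[q]) for q in results]
    return sum(scores) / len(scores)


# Toy rankings (hypothetical ids).
results = {"q1": ["a", "x", "b"], "q2": ["x", "a"]}
ground_truth = {"q1": {"a", "b"}, "q2": {"a"}}
score = mean_average_precision(results, ground_truth)
```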

Last, we show some visual examples of the retrieval results in Figure 13. The first image in each row is the query, followed by the top results.

Figure 13: Examples of the retrieval results on the Oxford5k dataset: the first image in each row is the query image and the following images are the top results.

5 Conclusions

There are three factors that contribute to the superior performance of our proposed approach: (1) We only need to consider active points that change their cluster assignments in the assignment step of the k-means algorithm; (2) Most active points are located at or near cluster boundaries; (3) We can efficiently identify active points by pre-assembling data points using multiple random partition trees. The result is a simple, easily parallelizable, and surprisingly efficient k-means clustering algorithm. It outperforms the state of the art on clustering large-scale real datasets and learning codebooks for image retrieval.
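The three factors can be illustrated with a simplified sketch. It assumes that each random partition tree splits the data at the median of a random projection until the leaves hold at most `bucket_size` points, and that a point's candidate clusters are the clusters of all points sharing a leaf with it in any tree. This is a reading of the idea for illustration, not the authors' implementation; all names are hypothetical.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rp_tree_groups(points, ids, bucket_size, rng):
    """Recursively split ids at the median of a random projection and
    return the leaf groups (each holds at most bucket_size ids)."""
    if len(ids) <= bucket_size:
        return [ids]
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    ordered = sorted(
        ids, key=lambda i: sum(d * x for d, x in zip(direction, points[i]))
    )
    mid = len(ordered) // 2
    return (rp_tree_groups(points, ordered[:mid], bucket_size, rng)
            + rp_tree_groups(points, ordered[mid:], bucket_size, rng))


def candidate_clusters(groups_per_tree, labels):
    """For each point, the clusters of all points that share a leaf with it.
    A point whose leaves lie entirely inside its own cluster keeps a single
    candidate, so it cannot change assignment (it is not active)."""
    candidates = [{label} for label in labels]
    for groups in groups_per_tree:
        for group in groups:
            clusters_here = {labels[i] for i in group}
            for i in group:
                candidates[i] |= clusters_here
    return candidates


def assign(points, centers, labels, groups_per_tree):
    """One approximate assignment step restricted to candidate clusters."""
    candidates = candidate_clusters(groups_per_tree, labels)
    return [
        min(sorted(cand), key=lambda j: sqdist(point, centers[j]))
        for point, cand in zip(points, candidates)
    ]


# Toy data (hypothetical): point 2 starts in the wrong cluster.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = [(0.3, 0.3), (10.3, 10.3)]
labels = [0, 0, 1, 1, 1, 1]
rng = random.Random(0)
trees = [rp_tree_groups(points, list(range(len(points))), 3, rng) for _ in range(3)]
new_labels = assign(points, centers, labels, trees)
```

Only points near a boundary see more than one candidate cluster, so the distance computations concentrate on the points that can actually move. Because the trees are built once before clustering starts, each point's loop is independent of the others, which is what makes the step easy to parallelize.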

  Method Scoring level mAP (BoW) mAP (Spatial)
  AKM 0.647
  Our approach
Table 2: A comparison of our approach with HKM and AKM on the Oxford data set with a -word codebook
  Vocabulary size HKM AKM AKM spatial Ours Ours spatial
Table 3: Performance comparison of our approach, HKM, and AKM using different codebook sizes on the Oxford data set


  • (1) Boutsidis, C., Zouzias, A., Drineas, P.: Random projections for k-means clustering. CoRR abs/1011.4632 (2010)
  • (2) Chum, O., Matas, J.: Large-scale discovery of spatially related images. IEEE PAMI 32(2), 371–377 (2010)
  • (3) Elkan, C.: Using the triangle inequality to accelerate k-means. In: ICML (2003)
  • (4) Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: CVPR Workshop on Generative-Model Based Vision (2004)
  • (5) Fern, X.Z., Brodley, C.E.: Random projection for high dimensional data clustering: A cluster ensemble approach. In: ICML, pp. 186–193 (2003)
  • (6) Forgy, E.W.: Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 21, 768–780 (1965)
  • (7) Frahling, G., Sohler, C.: A fast k-means implementation using coresets. Int. J. Comput. Geometry Appl. 18(6), 605–625 (2008)
  • (8) Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE PAMI 33(1), 117–128 (2011)
  • (9) Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: Analysis and implementation. IEEE PAMI 24(7), 881–892 (2002)
  • (10) Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards 45(4), 255–282 (1950)
  • (11) Li, X., Wu, C., Zach, C., Lazebnik, S., Frahm, J.M.: Modeling and recognition of landmark image collections using iconic scene graphs. In: ECCV (2008)
  • (12) Lloyd, S.P.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982)
  • (13) MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297 (1967)
  • (14) Mahajan, M., Nimbhorkar, P., Varadarajan, K.R.: The planar k-means problem is NP-hard. In: WALCOM, pp. 274–285 (2009)
  • (15) McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: KDD (2000)
  • (16) Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISSAPP (1), pp. 331–340 (2009)
  • (17) Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: CVPR (2006)
  • (18) Philbin, J.: Scalable object retrieval in very large image collections. Ph.D. thesis, University of Oxford (2010)
  • (19) Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
  • (20) Philbin, J., Zisserman, A.: Object mining using a matching graph on very large image collections. In: ICVGIP, pp. 738–745 (2008)
  • (21) Raguram, R., Wu, C., Frahm, J.M., Lazebnik, S.: Modeling and recognition of landmark image collections using iconic scene graphs. IJCV 95(3), 213–239 (2011)
  • (22) Sculley, D.: Web-scale k-means clustering. In: WWW (2010)
  • (23) Simon, I., Snavely, N., Seitz, S.M.: Scene summarization for online image collections. In: ICCV, pp. 1–8 (2007)
  • (24) Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: ICCV, pp. 1470–1477 (2003)
  • (25) Torralba, A.B., Fergus, R., Freeman, W.T.: 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE PAMI 30(11), 1958–1970 (2008)
  • (26) Verma, N., Kpotufe, S., Dasgupta, S.: Which spatial partition trees are adaptive to intrinsic dimension? In: Proc. 25th Conf. on Uncertainty in Artificial Intelligence (2009)
  • (27) Wang, J., Jia, L., Hua, X.S.: Interactive browsing via diversified visual summarization for image search results. Multimedia Syst. 17(5), 379–391 (2011)
  • (28) Wang, J., Wang, J., Ke, Q., Zeng, G., Li, S.: Fast approximate k-means via cluster closures. In: CVPR, pp. 3037–3044 (2012)
  • (29) Wang, J., Wang, J., Zeng, G., Tu, Z., Gan, R., Li, S.: Scalable k-nn graph construction for visual descriptors. In: CVPR, pp. 1106–1113 (2012)
  • (30) Wang, J., Wang, J., Zeng, G., Tu, Z., Gan, R., Li, S.: Scalable k-nn graph construction. CoRR abs/1307.7852 (2013)