We propose a semi-supervised approach, based on hashing methods, to clustering large-scale datasets. In particular, our goal is to explore the underlying data distribution by clustering the data points and differentiating the classes. Given a small set of labeled data drawn from only a subset of the classes, we want to recover the data distribution over the complete set of classes. For example, given labels for two classes, can we separate these two classes well and at the same time discover the existence of a third class? Doing so requires using the information in the labeled data to find a metric transformation that splits the two classes well; after this transformation, we can discover that a third class exists.
Suppose there is a handwritten digit recognition task and the dataset contains the digits ‘2’, ‘7’ and ‘4’. A general agglomerative clustering might end up with two clusters, with ‘2’ and ‘7’ in one cluster and ‘4’ in the other, because of the similarity of their shapes. However, when a small labeled set of classes ‘2’ and ‘7’ is given, we can learn the degree of granularity needed for similarity comparison. By using a data transformation that maximally separates ‘2’ and ‘7’ into two clusters, we are able to identify the existence of another cluster, the digit ‘4’.
Because agglomerative clustering is computationally inefficient, a major contribution of this paper is to introduce a machine-learned hashing method, kernelized locality-sensitive hashing (KLSH), into agglomerative clustering. This makes clustering efficient on large-scale datasets.
Our paper is structured as follows. We provide background and related work in Section 2. Section 3 presents our algorithms for distance metric learning and KLSH clustering. Section 4 describes the experiments with a discussion of the results, followed by conclusions in Section 5.
2 Related Work
There has been much previous work on cluster seeding, which addresses the limitation that iterative clustering techniques (e.g. K-Means and Expectation Maximization (EM)) are sensitive to the choice of initial starting points (seeds). The problem is how to select seed points in the absence of prior knowledge. Kaufman and Rousseeuw propose an elaborate mechanism: the first seed is the instance that is most central in the data; the remaining representatives are selected by choosing instances that promise to be closer to more of the remaining instances. Peña et al. empirically compare four initialization methods for the K-Means algorithm and show that the random and Kaufman initializations outperform the other two, since they make K-Means less dependent on the initial choice of seeds. In K-Means++ (Arthur and Vassilvitskii), the random starting points are chosen with specific probabilities: a point $x$ is chosen as a seed with probability proportional to $x$’s contribution to the overall potential (defined as the sum of squared distances between each point and its closest center). By augmenting K-Means with this simple, randomized seeding technique, K-Means++ is $O(\log K)$-competitive with the optimal clustering. Bradley and Fayyad propose refining the initial seeds by taking into account the modes of the underlying distribution. The refined initial seeds enable the iterative algorithm to converge to a better local minimum.
Semi-supervised learning can also be seen as unsupervised learning guided by constraints. Noting that clustering depends heavily on the distance metric, while a particular algorithm merely executes the rules that the metric defines, Xing et al. point out the need for a systematic way to learn a distance metric for clustering from labeled data. Their approach poses metric learning as a convex optimization problem.
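The K-Means++ seeding rule mentioned in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: the function names `squared_distance` and `kmeanspp_seeds` and the toy points are assumptions made for the example, and the first seed is assumed to be drawn uniformly at random.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng):
    """Pick k seeds; each new seed is drawn with probability proportional
    to its squared distance to the closest seed chosen so far."""
    seeds = [rng.choice(points)]  # assumption: first seed is uniform at random
    while len(seeds) < k:
        # Each point's contribution to the potential under the current seeds.
        weights = [min(squared_distance(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

# Toy data (hypothetical): three well-separated pairs of points.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
seeds = kmeanspp_seeds(points, 3, random.Random(0))
```

Points that are already seeds have weight zero, so they are never drawn again; the sketch assumes there are at least `k` distinct points.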
When data sizes grow exponentially, hashing is a technique that is especially well suited to large-scale problems. Andoni and Indyk describe the Locality-Sensitive Hashing (LSH) method, an efficient algorithm for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g. images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The technique is of significant interest in a wide variety of areas of unsupervised learning. Hierarchical clustering tries to solve a similar problem from another perspective: by iteratively finding nearest neighbors, it groups data into clusters. Kernelized LSH was later proposed by Kulis and Grauman for fast image search. It generalizes LSH to accommodate arbitrary kernel functions, making it possible to preserve the algorithm’s sub-linear time similarity search guarantees for a wide class of useful similarity functions.
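To make the preprocess-then-query idea concrete, the sketch below uses random-hyperplane LSH, a simpler relative of KLSH that hashes input vectors directly instead of working through a kernel. It is an illustration under stated assumptions, not the method of the cited papers: all function names, the number of bits, and the toy data are hypothetical.

```python
import random

def make_hyperplanes(num_bits, dim, rng):
    """Draw one random Gaussian hyperplane normal per hash bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(x, hyperplanes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    return tuple(int(sum(w * v for w, v in zip(h, x)) >= 0.0) for h in hyperplanes)

def build_table(data, hyperplanes):
    """Preprocessing step: group dataset indices by hash code."""
    table = {}
    for idx, x in enumerate(data):
        table.setdefault(hash_code(x, hyperplanes), []).append(idx)
    return table

def query(q, data, table, hyperplanes):
    """Search only the bucket that q hashes to; return the closest index or None."""
    bucket = table.get(hash_code(q, hyperplanes), [])
    if not bucket:
        return None
    return min(bucket, key=lambda i: sum((a - b) ** 2 for a, b in zip(data[i], q)))

rng = random.Random(0)
data = [(1.0, 0.2), (0.9, 0.3), (-1.0, 0.1), (-0.8, -0.4), (0.1, 1.0)]
planes = make_hyperplanes(num_bits=4, dim=2, rng=rng)
table = build_table(data, planes)
nearest = query(data[2], data, table, planes)  # a dataset point finds itself
```

The query inspects a single bucket instead of the whole dataset, which is where the speedup comes from; the price is that a true neighbor that falls into a different bucket is missed.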
In this section, we describe our methods for solving a large-scale semi-supervised learning problem, first introducing distance metric learning and then our fast agglomerative clustering method based on kernelized locality-sensitive hashing (KLSH).
3.1 Distance Metric Learning
Suppose that a few of the data points given to us are labeled, so that for these points we know for certain which pairs belong to the same class and which to different classes. We encode this in a similarity matrix $S$ and a dissimilarity matrix $D$, respectively. For entry $S_{ij}$ of the similarity matrix, $S_{ij} = 1$ if data points $x_i$ and $x_j$ are in the same class and $S_{ij} = 0$ otherwise. The dissimilarity matrix $D$ is defined similarly. Based on $S$ and $D$, we try to learn a distance metric $d_A(x, y) = \sqrt{(x - y)^T A (x - y)}$, where $x$, $y$ are two data points and $A$ is a positive semi-definite matrix of distance parameters. The idea is to minimize the distance between similar points while keeping dissimilar points apart.
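The pair matrices and the parameterized distance can be sketched as follows. The names `S`, `D`, `A`, `pair_matrices`, and `metric_distance` are chosen for this illustration, the distance is assumed to take the Mahalanobis form $\sqrt{(x - y)^T A (x - y)}$, and leaving the diagonal and all pairs that involve an unlabeled point at zero is an assumption.

```python
import math

def pair_matrices(labels):
    """labels[i] is a class label, or None for an unlabeled point.
    Returns 0/1 matrices (S, D) for same-class and different-class pairs."""
    n = len(labels)
    S = [[0] * n for _ in range(n)]
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: skip the diagonal and any pair with a missing label.
            if i == j or labels[i] is None or labels[j] is None:
                continue
            if labels[i] == labels[j]:
                S[i][j] = 1
            else:
                D[i][j] = 1
    return S, D

def metric_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)) for a positive semi-definite A (nested lists)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    quad = sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(max(quad, 0.0))  # guard against tiny negative round-off

S, D = pair_matrices(["2", "2", "7", None])
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 0.0]]  # hypothetical metric that ignores feature 2
d_euclid = metric_distance((0.0, 0.0), (3.0, 4.0), identity)
d_learned = metric_distance((0.0, 0.0), (3.0, 4.0), stretched)
```

With the identity matrix the distance reduces to the Euclidean distance; a learned matrix reweights the directions that matter for separating the labeled classes.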
The resulting optimization problem can be solved efficiently using a constrained Newton’s method.
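As a point of reference, the metric-learning formulation of Xing et al. can be written as below, using notation assumed for this sketch: $S_{ij}$ and $D_{ij}$ are 0/1 indicators of similar and dissimilar pairs, and $d_A(x, y) = \sqrt{(x - y)^T A (x - y)}$. Whether the objective used in this paper matches it term for term is an assumption.

```latex
\min_{A} \; \sum_{i,j} S_{ij}\, d_A(x_i, x_j)^2
\quad \text{subject to} \quad
\sum_{i,j} D_{ij}\, d_A(x_i, x_j) \ge 1, \qquad A \succeq 0 .

% For a diagonal A, Xing et al. minimize an equivalent form
% (up to a rescaling of A) with Newton-Raphson steps:
g(A) = \sum_{i,j} S_{ij}\, d_A(x_i, x_j)^2
       - \log\!\Big( \sum_{i,j} D_{ij}\, d_A(x_i, x_j) \Big)
```

The first sum pulls similar pairs together, while the constraint (or the logarithmic term) prevents the trivial solution $A = 0$ that would collapse all points.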
This semi-supervised step learns a distance metric that is used to transform the data before the main agglomerative clustering.
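The link between a learned metric and a data transformation can be checked numerically. If the metric matrix factors as $A = L^T L$, the learned distance between $x$ and $y$ equals the ordinary Euclidean distance between $Lx$ and $Ly$. The sketch below assumes the simplest case, a diagonal metric, where the transformation rescales each feature by the square root of its weight; the weights and points are hypothetical.

```python
import math

def transform(x, a_diag):
    """Rescale each feature by sqrt(a_k) for a diagonal metric diag(a_diag)."""
    return [math.sqrt(a) * v for a, v in zip(a_diag, x)]

def euclidean(x, y):
    """Plain Euclidean distance."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def diag_metric_distance(x, y, a_diag):
    """Learned distance under a diagonal metric with non-negative weights."""
    return math.sqrt(sum(a * (p - q) ** 2 for a, p, q in zip(a_diag, x, y)))

a_diag = [4.0, 0.25]  # hypothetical learned weights
x, y = [1.0, 2.0], [3.0, 6.0]
before = diag_metric_distance(x, y, a_diag)
after = euclidean(transform(x, a_diag), transform(y, a_diag))
```

Because the two values agree, any downstream step that expects plain Euclidean inputs can be run unchanged on the transformed data.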
3.2 Clustering with KLSH
The curse of dimensionality is a well-known problem for learning on large-scale datasets. It refers to the fact that the cost of computation grows exponentially with the number of data dimensions, and the cost is compounded as the number of data instances increases. This problem directly affects clustering approaches that are based on density estimation in the input space.
For instance, in K-Means or general agglomerative clustering, the cost of iteratively estimating new centroid locations and reassigning data instances to clusters places a significant burden on performance. This is especially true for high-dimensional datasets, where frequently computing inter-instance distances is highly expensive.
Our proposed semi-supervised clustering algorithm using kernelized locality-sensitive hashing (KLSH), given in Algorithm 1, aims to solve the large-scale agglomerative clustering problem. It first learns a distance metric from a small set of labeled data (step 1 in Algorithm 1). The second step builds a KLSH table that maps the data into hash bits. In the rest of the procedure, agglomerative clustering is performed. Instead of explicitly computing inter-instance distances, the clustering operates on the KLSH-hashed data points by measuring their Hamming distance. Because kernelized locality-sensitive hashing preserves neighborhoods with high probability, the Hamming distance is a reasonable substitute for the exact inter-instance distances.
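The clustering stage can be illustrated with a small sketch that merges hash codes by Hamming distance. It is not Algorithm 1 itself: it assumes the hash codes are already computed and given as bit strings, uses single linkage (the linkage criterion is an assumption), stops at a target number of clusters, and uses a naive search for the closest pair. The names `hamming` and `agglomerate` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def agglomerate(codes, num_clusters):
    """Single-linkage agglomerative clustering of hash codes.
    Returns one cluster id per input code."""
    # Start from one cluster per distinct hash code (bucket), not per point.
    clusters = [[c] for c in sorted(set(codes))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(hamming(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    # Map every data point to the cluster that holds its hash code.
    code_to_cluster = {c: k for k, members in enumerate(clusters) for c in members}
    return [code_to_cluster[c] for c in codes]

codes = ["0000", "0001", "0000", "1110", "1111", "1111"]
labels = agglomerate(codes, 2)
# labels == [0, 0, 0, 1, 1, 1]
```

Points that share a hash code are merged for free, so the expensive pairwise search runs over occupied buckets rather than over individual data points.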
Our experiment is based on the MNIST dataset of handwritten digits. We evaluate our KLSH agglomerative clustering algorithm via a comparison to K-Means.
We obtained handwritten digits from the MNIST data repository. There are 10 classes of rasterized images (corresponding to digits from ‘0’ to ‘9’). We used up to 50,000 data points for experiments.
4.2 Experiment Setup
We ran experiments using both K-Means and KLSH agglomerative clustering, with and without distance metric learning. The hash string length, the number of classes, and the number of data points were varied one at a time. We report precision, recall, and computation time. All experiments were run on a machine with 8-core 2.8 GHz Intel processors and 8 GB of RAM.
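Precision and recall for a clustering can be defined in more than one way. One common choice, assumed here purely for illustration and not confirmed as the definition used in these experiments, counts pairs of points: a pair placed in the same cluster is a true positive if both points share a class and a false positive otherwise.

```python
from itertools import combinations

def pairwise_precision_recall(true_labels, cluster_ids):
    """Pair-counting precision and recall of a clustering against class labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical outcome in which digits '2' and '7' were merged into one cluster.
true_labels = [2, 2, 7, 7, 4, 4]
cluster_ids = [0, 0, 0, 0, 1, 1]
precision, recall = pairwise_precision_recall(true_labels, cluster_ids)
```

In this toy case recall is 1.0 because no same-class pair is split, while precision is 3/7 because the merged cluster contains four mixed pairs.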
4.3 Results and Analysis
First, in Table 1, we observed that KLSH agglomerative clustering can achieve the same level of precision as K-Means for a fraction of the computational cost. The downside is that recall is reduced by a factor of 2. The decrease in recall is caused by the fact that KLSH cannot recover all of the points in the nearest neighborhood. Adding distance metric learning has noticeable benefits for the performance of KLSH agglomerative clustering.
In Table 2, we analyzed the effect of increasing the number of classes (while fixing the number of data points) on precision, recall, and computation time. Precision remains constant while recall decreases. The computational cost remains relatively independent of the number of clusters.
In Table 3, we analyzed the effect of hash string length on clustering validity. Increasing the hash string length increases both precision and recall.
The hash string length also adjusts the tradeoff between efficiency and effectiveness. Note that even if we use a $b$-bit binary hash code, there are still $2^b$ possible outcomes. If the hashing splits the data well, the number of entries in the hash table will still be very large. This increases the accuracy of the clustering results but also leads to a higher computation cost during agglomerative clustering.
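The cost side of this tradeoff can be made concrete with a short count. A code of `num_bits` bits has `2 ** num_bits` possible values, and the number of occupied table entries determines how many pairwise Hamming comparisons a naive agglomerative pass needs at the start. The helper name `bucket_stats` and the example codes are hypothetical.

```python
def bucket_stats(codes):
    """Return (occupied buckets, possible buckets, initial pairwise comparisons)."""
    num_bits = len(codes[0])
    occupied = len(set(codes))
    possible = 2 ** num_bits
    pairs = occupied * (occupied - 1) // 2
    return occupied, possible, pairs

stats = bucket_stats(["0000", "0001", "0000", "1110"])
# stats == (3, 16, 3)
```

Longer codes raise the ceiling on occupied buckets, so the comparison count moves toward what exact pairwise clustering would need.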
According to the results, clustering with KLSH performs best when the dataset is large and the number of true clusters is small. Compared to K-Means, it offers a large improvement in speed. When the number of true clusters is not large, it achieves high performance in both speed and accuracy. In particular, at the lower levels of the linkage tree, clustering with bias (distance metric learning) can immediately cluster similar data instances correctly.
| # Inst. | K-Means | K-Means w/ DL | Aggl. KLSH | Aggl. KLSH w/ DL |
General hierarchical clustering methods do not scale well to large datasets because the number of inter-instance distance calculations grows quadratically with the number of instances. Kernelized locality-sensitive hashing (KLSH) preserves neighborhoods with high probability, so it is a reasonable substitute for the exact inter-instance distances. Our proposed KLSH agglomerative clustering alleviates the problem by computing Hamming distances on reduced-size hash codes, which makes the clustering computation efficient. Incorporating distance metric learning marginally improves precision and recall.
-  L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
-  José Manuel Peña, José Antonio Lozano, and Pedro Larrañaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027–1040, 1999.
-  David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA, 2007.
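-  Paul S. Bradley and Usama M. Fayyad. Refining initial points for K-Means clustering. In Jude W. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, 1998.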
-  Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, 2002.
-  Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, January 2008.
-  Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In IEEE International Conference on Computer Vision (ICCV), 2009.