Kernel Treelets, a clustering method based on Treelets
A new method for hierarchical clustering is presented. It combines treelets, a particular multiscale decomposition of data, with a projection on a reproducing kernel Hilbert space. The proposed approach, called kernel treelets (KT), effectively substitutes the correlation coefficient matrix used in treelets with a symmetric, positive semi-definite matrix efficiently constructed from a kernel function. Unlike most clustering methods, which require data sets to be numeric, KT can be applied to more general data and yields a multi-resolution sequence of bases on the data directly in feature space. The effectiveness and potential of KT in clustering analysis are illustrated with some examples.
Treelets, introduced by Lee, Nadler, and Wasserman [1, 2], is a method to produce a multiscale, hierarchical decomposition of unordered data. The central premise of Treelets is to exploit sparsity and capture intrinsic localized structures with only a few features, represented in terms of an orthonormal basis. The hierarchical tree constructed by the treelet algorithm provides a scale-based partition of the data that can be used for classification, especially for cluster analysis.
Cluster analysis, also called clustering, is concerned with finding a partition of a set such that its corresponding equivalence classes capture the similarity of its elements. The Treelet approach is an example of hierarchical clustering (HC), a family of methods that provide a nested, multiscale clustering. The typical complexity of HC methods is $O(n^3)$ (where $n$ denotes the number of data points in the dataset), but Treelets, like single linkage HC and complete linkage HC, can be done in $O(n^2)$ operations. Most of these clustering methods are applicable only to numerical datasets. However, many modern datasets do not have a clear representation in $\mathbb{R}^p$ due, for example, to missing data, length differences, and non-numeric attributes. A typical solution to this problem involves finding a projection from each observation to $\mathbb{R}^p$, as is the case, for example, in text vectorization, array alignment, and missing-data imputation. These particular projections pose considerable challenges and might raise the bias of the model if false assumptions are made.
In this paper we propose an HC method that combines Treelets with a projection on a feature space that is a Reproducing Kernel Hilbert Space (RKHS). We call this method Kernel Treelets (KT). It effectively substitutes the correlation coefficient matrix, used by the original treelet method as a measure of similarity among variables, with a symmetric, positive semi-definite matrix constructed from a (Mercer) kernel function. The intuition behind this approach is that inner products provide a measure of similarity, and a projection into an RKHS, done via the so-called kernel trick [10, 11], is a natural and efficient way to construct appropriate similarity matrices for a wide variety of data sets, including those mentioned above. We present some examples that demonstrate the potential of KT as an effective tool for clustering analysis.
In this section we provide a brief description of the Treelet algorithm [1, 2] and the kernel method. Treelets are based on the repeated application of two-dimensional (Jacobi) rotations to a matrix measuring the similarity of variables, so we start by reviewing Jacobi (also called Givens) rotations.
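A Jacobi rotation matrix $J$ is an orthogonal matrix with at most 4 entries different from the identity or, more generally, a rotation operator on a 2-dimensional subspace generated by two coordinate axes. For a given symmetric matrix $M$ and entry $(i, j)$, a rotation matrix $J$ is constructed so that
$$(J^T M J)_{ij} = (J^T M J)_{ji} = 0.$$
The construction of $J$ is equivalent to finding the cosine ($c$) and sine ($s$) of the angle of rotation, which satisfy
$$\begin{pmatrix} c & s \\ -s & c \end{pmatrix}^T \begin{pmatrix} M_{ii} & M_{ij} \\ M_{ji} & M_{jj} \end{pmatrix} \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \text{ is diagonal}$$
subject to the constraint $c^2 + s^2 = 1$. The matrix $J$ is then given by
$$J_{ii} = J_{jj} = c, \quad J_{ij} = s, \quad J_{ji} = -s.$$
For other entries $(k, l)$, $J_{kl} = \delta_{kl}$.
A numerically stable way of solving this problem is as follows:
Assume $M_{ij} \neq 0$, and compute
$$\tau = \frac{M_{jj} - M_{ii}}{2 M_{ij}}.$$
Let $\sigma$ be 1 if $\tau \geq 0$ and -1 otherwise, then we define
$$t = \frac{\sigma}{|\tau| + \sqrt{1 + \tau^2}}.$$
From this we can calculate $c = 1 / \sqrt{1 + t^2}$ and $s = ct$.
The complexity of storing a Givens rotation matrix is $O(1)$, and a Jacobi rotation over a matrix $M \in \mathbb{R}^{p \times p}$ uses $O(1)$ space with time complexity $O(p)$.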
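As an illustration, the stable computation of the rotation can be sketched in Python with NumPy. This is a minimal sketch rather than the authors' implementation: the function names `jacobi_cs` and `jacobi_matrix` are hypothetical, and the formulas assume the standard Jacobi eigenvalue convention, $\tau = (M_{jj} - M_{ii}) / (2 M_{ij})$ and $t = \operatorname{sgn}(\tau) / (|\tau| + \sqrt{1 + \tau^2})$.

```python
import numpy as np

def jacobi_cs(M, i, j):
    """Return (c, s) such that (J^T M J)[i, j] == 0 for symmetric M.

    Assumed convention: the 2x2 block of J is [[c, s], [-s, c]].
    """
    if M[i, j] == 0.0:
        return 1.0, 0.0  # nothing to rotate
    tau = (M[j, j] - M[i, i]) / (2.0 * M[i, j])
    sgn = 1.0 if tau >= 0 else -1.0
    # Smaller-magnitude root of t^2 + 2*tau*t - 1 = 0 (avoids cancellation).
    t = sgn / (abs(tau) + np.hypot(1.0, tau))
    c = 1.0 / np.hypot(1.0, t)
    return c, c * t

def jacobi_matrix(p, i, j, c, s):
    """Dense p x p rotation matrix; only four entries differ from the identity."""
    J = np.eye(p)
    J[i, i] = J[j, j] = c
    J[i, j] = s
    J[j, i] = -s
    return J

# Usage: zero out entry (0, 1) of a small symmetric matrix.
M = np.array([[2.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])
c, s = jacobi_cs(M, 0, 1)
J = jacobi_matrix(3, 0, 1, c, s)
R = J.T @ M @ J
```

The dense matrix `J` is built only for readability; in practice just `c`, `s`, `i`, and `j` need to be stored, and only rows and columns `i` and `j` of the matrix change.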
The Treelets algorithm [1, 2] was designed to construct a multiscale basis and a corresponding hierarchical clustering over the attributes of some datasets in , to exploit sparsity. In its most efficient implementation  it is an algorithm. The algorithm starts with a regularization, hyper-parameter and computing a (empirical) covariance matrix . The initial scaling indices are defined as the set . With base case and , each step and for can be constructed inductively as follows:
Construct matrix of the same shape as entry-wise:
Find the two indices such that
Calculate Jacobi rotation matrix for and matrix .
Without loss of generality, and is interchangeable, so we require that , and record and .
The Jacobi rotations produce a Treelets basis for each . The sequence of matrices provides a basis for , defined as
So for every vector , there is a th basis representation . Furthermore, there is a compressed th basis representation obtained by dropping insignificant () non-scaling indices of . That is, if we define to be the
th column of the identity matrix, the compressedth basis representation is given by
Treelets is also a hierarchical clustering method over the attributes. The hierarchical clustering structure is stored in . We start with trivial clustering where each element is in its own cluster and labeled by itself. For each , we merge clusters labeled and and label it . This is feasible because each step the set of all cluster labels is exactly . This operation gives a hierarchical tree for clustering use on the attributes.
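The merge loop described above can be sketched as follows. This is an unoptimized illustration under stated assumptions, not the efficient implementation of Lee et al.: the similarity is assumed to be the absolute correlation coefficient plus `lam` times the absolute covariance, the variable with the larger variance after rotation is assumed to be kept as the scaling index, and the names `treelet_merges` and `cut_tree` are hypothetical.

```python
import numpy as np

def treelet_merges(A0, lam=0.0):
    """Return the list of merges (kept_label, dropped_label), one per level."""
    A = np.array(A0, dtype=float)
    active = list(range(A.shape[0]))  # current scaling indices
    merges = []
    while len(active) > 1:
        # Similarity restricted to the scaling indices (assumed form).
        Aa = A[np.ix_(active, active)]
        d = np.sqrt(np.maximum(np.diag(Aa), 1e-300))
        M = np.abs(Aa) / np.outer(d, d) + lam * np.abs(Aa)
        np.fill_diagonal(M, -np.inf)
        r, q = np.unravel_index(np.argmax(M), M.shape)
        a, b = active[r], active[q]
        if A[a, b] != 0.0:
            # Jacobi rotation that decorrelates variables a and b.
            tau = (A[b, b] - A[a, a]) / (2.0 * A[a, b])
            t = (1.0 if tau >= 0 else -1.0) / (abs(tau) + np.hypot(1.0, tau))
            c = 1.0 / np.hypot(1.0, t)
            G = np.array([[c, c * t], [-c * t, c]])
            A[:, [a, b]] = A[:, [a, b]] @ G    # A <- A J
            A[[a, b], :] = G.T @ A[[a, b], :]  # A <- J^T A
        if A[a, a] < A[b, b]:
            a, b = b, a  # keep the higher-variance variable as scaling index
        merges.append((a, b))
        active.remove(b)
    return merges

def cut_tree(merges, n, n_clusters):
    """Apply the first n - n_clusters merges and return one label per variable."""
    label = np.arange(n)
    for a, b in merges[: n - n_clusters]:
        label[label == b] = a
    return label

# Usage: two groups of correlated variables.
A = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.8],
              [0.0, 0.0, 0.8, 1.0]])
merges = treelet_merges(A)
labels = cut_tree(merges, 4, 2)
```

Recomputing the similarity matrix at every level keeps the sketch short at the price of extra work per step; an efficient implementation would update only the affected rows and columns.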
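A kernel over some set $\mathcal{X}$ is defined as a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. A symmetric and positive semi-definite (SPSD) kernel has the properties:
$$k(x, y) = k(y, x) \quad \text{for all } x, y \in \mathcal{X},$$
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) \geq 0 \quad \text{for all } n, \ x_1, \dots, x_n \in \mathcal{X}, \ c_1, \dots, c_n \in \mathbb{R}.$$
If $\mathcal{X} = \{x_1, \dots, x_n\}$ is finite, then $k$ is SPSD if and only if $K = [k(x_i, x_j)]_{i,j}$ is an SPSD matrix. More generally, $k$ is SPSD if and only if there exists a function $\phi : \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ denotes a Hilbert space, such that for all $x, y \in \mathcal{X}$,
$$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}.$$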
The space here is called a reproducing kernel Hilbert space (RKHS). The following are two common examples of SPSD kernels:
Radial basis function (RBF) kernel
A kernel for a set can be restricted to a subset , and SPSD property is preserved during restriction. If the task is clustering over a finite set, the selected kernel needs only be SPSD on the set of all samples, which is generally finite, and we only need to check that the kernel matrix is SPSD. If we need to extend the clustering outcome to other data, e.g. clustering boosted classification, then has to include the whole data space as a subset.
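For a finite sample, the check that the kernel matrix is SPSD can be done numerically. The following is a minimal sketch: the helper name `is_spsd` and the tolerance are illustrative assumptions, and a small negative eigenvalue caused by rounding is treated as zero.

```python
import numpy as np

def is_spsd(K, tol=1e-10):
    """Check symmetry and positive semi-definiteness of a kernel matrix."""
    K = np.asarray(K, dtype=float)
    if K.ndim != 2 or K.shape[0] != K.shape[1] or not np.allclose(K, K.T):
        return False
    scale = max(1.0, float(np.abs(K).max()))
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    return bool(np.linalg.eigvalsh((K + K.T) / 2.0).min() >= -tol * scale)

# Usage: an RBF kernel matrix on a small random sample passes the check,
# while a symmetric matrix with a negative eigenvalue does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-sq_dist)
K_bad = np.array([[0.0, 1.0], [1.0, 0.0]])
```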
The K-nearest neighbors algorithm is a multi-class classification algorithm. By specifying the number of neighbors and a metric, the algorithm can, given a test point, predict its label by the majority vote of the closest elements of the training data under that metric. If an inner product is specified instead of a distance, we can compute the distance between two points $x$ and $y$ in the following way:
$$d(x, y) = \sqrt{\langle x, x \rangle - 2 \langle x, y \rangle + \langle y, y \rangle}.$$
If the metric is kernelized,
$$d(x, y) = \sqrt{k(x, x) - 2 k(x, y) + k(y, y)}.$$
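The kernel-induced distance can be computed from kernel evaluations alone. The sketch below assumes the standard identity $d(x, y)^2 = k(x, x) - 2k(x, y) + k(y, y)$; the function name `kernel_distance` is hypothetical. With the linear kernel it reduces to the Euclidean distance, which gives a simple sanity check.

```python
import numpy as np

def kernel_distance(kxx, kyy, Kxy):
    """Pairwise distances induced by a kernel.

    kxx: (n,) values k(x_i, x_i); kyy: (m,) values k(y_j, y_j);
    Kxy: (n, m) values k(x_i, y_j).
    """
    d2 = kxx[:, None] - 2.0 * Kxy + kyy[None, :]
    # Clip tiny negative values produced by floating-point rounding.
    return np.sqrt(np.maximum(d2, 0.0))

# Usage: with the linear kernel k(x, y) = <x, y>, the induced distance
# is the ordinary Euclidean distance.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T
D = kernel_distance(np.diag(K), np.diag(K), K)
D_euclid = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```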
Support Vector Machine (SVM) is a classification method that finds optimal separating hyper-planes. Kernel SVM is a classification method for nonlinear problems that performs SVM in the RKHS generated by the kernel. When we only apply KT to a small sample, we may use kernel SVM with the same kernel to assign labels to data outside of this sample. This can be viewed as clustering attributes with treelets and using SVM to assign labels to other attributes in the RKHS.
The task of KT is to find a clustering for some set $X$ given an SPSD kernel $k$ measuring the similarity among variables. We combine Treelets with kernels by replacing the covariance matrix with a kernel matrix and applying the remaining steps of the Treelets algorithm. The exact steps are as follows:
First, we draw a sample $S$ of size $n$ from the uniform distribution on $X$ for some sample size $n$. If more information about $X$ is given, it may be possible to draw a sample that better represents $X$ with a smaller sample size.
Then, we calculate the kernel matrix $K = [k(s_i, s_j)]_{i,j}$ for $s_i, s_j \in S$. $K$ is an SPSD matrix because $k$ is SPSD, and thus we can apply the Treelets algorithm with hyper-parameter $\lambda$ using $K$ instead of the (empirical) covariance matrix. $\lambda$ can be set to 0 or tuned experimentally as in Treelets. In this step, the Treelets method provides a hierarchical clustering tree of the columns of $K$, each of which corresponds to an observation in $S$.
If $S = X$, we are finished after the step above. Otherwise, we need to cluster the elements in $X \setminus S$ based on the clusters we have for the elements in $S$. We use kernel SVM to complete this task. Given $S$ and its corresponding cluster labels, we train the kernel SVM with the same kernel $k$, and then apply it to predict the cluster labels of $X \setminus S$. K-nearest neighbors with the distance induced by the kernel $k$ is an alternative to kernel SVM.
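The three steps can be sketched as a small driver. Everything named here is an illustrative assumption: `rbf_kernel`, `kt_driver`, and `gamma` are hypothetical names, the hierarchical clustering of the sample is abstracted as a callable `cluster_sample` (standing in for Treelets applied to the kernel matrix followed by a cut of the tree), and the extension step uses the nearest-neighbor alternative with the kernel-induced distance instead of kernel SVM.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix between the rows of X and Y (gamma is illustrative)."""
    d2 = (X ** 2).sum(1)[:, None] - 2.0 * (X @ Y.T) + (Y ** 2).sum(1)[None, :]
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kt_driver(X, kernel, cluster_sample, sample_size, seed=0):
    """Sample, cluster the sample from its kernel matrix, extend the labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: uniform sample without replacement.
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    S = X[idx]
    # Step 2: kernel matrix on the sample, then hierarchical clustering.
    K = kernel(S, S)
    sample_labels = np.asarray(cluster_sample(K))
    labels = np.empty(n, dtype=sample_labels.dtype)
    labels[idx] = sample_labels
    # Step 3: label the remaining points by their nearest sample point
    # under the distance induced by the kernel.
    rest = np.setdiff1d(np.arange(n), idx)
    if rest.size:
        Kxs = kernel(X[rest], S)
        kxx = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X[rest]])
        d2 = kxx[:, None] - 2.0 * Kxs + np.diag(K)[None, :]
        labels[rest] = sample_labels[np.argmin(d2, axis=1)]
    return labels

# Usage with a stand-in clustering rule: split the sample by similarity
# to its first element (a placeholder for Treelets on K).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels = kt_driver(X, rbf_kernel, lambda K: (K[0] < 0.5).astype(int), 20)
```

Passing the clustering rule as a callable keeps the sampling and extension logic independent of how the tree on the sample is built and cut.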
We now prove that the kernel projection is equivalent to working with a symmetric positive semi-definite matrix defined by the inner product in the RKHS and evaluated through the kernel. We also suggest a definition of a clustering setting and clustering equivalence that allows us to connect the results of the clustering analysis for the original set with those of the transformed, projected set.
For every finite dataset and an SPSD kernel , there exists an orthonormal Hilbert basis in the RKHS such that
where and is symmetric and positive semi-definite.
We apply Gram-Schmidt orthogonalization process to the maximal linearly independent subset of
and get a set of orthonormal vectors, where
We may extend this set to an orthonormal Hilbert basis . Then , is 0 for all entries after and consequently after , so there exists such that
is a square matrix, we may compute its singular value decomposition
We can now define a new orthonormal Hilbert basis through the change of basis matrix . Let for all , then
The projected data in basis is and the matrix
is symmetric and positive semi-definite. ∎
If we denote such that for all ,
or in other words, is the first components of in the basis . Then for all ,
that is . From the lemma, we have that is symmetric and positive semi-definite and
A clustering setting is a pair where is a finite ordered dataset and is a measurement on the dataset . We define an equivalence on clustering settings such that if and only if . For any measurement-based clustering method, using measurement on provides exactly the same clustering outcome on the labels as using measurement on . An example of clustering equivalence is that if kernel corresponds to projection , then there is , and therefore .
For a dataset and a kernel , we already know that there is a clustering equivalence . From the corollary of lemma 1, there is , which provides the equivalence . As is symmetric, . As a conclusion, , which implies that a clustering method measured with the inner product on dataset provides a clustering of measured with kernel . Therefore, Treelets on without centering provides a hierarchical clustering of the attributes of based on the attribute inner product (covariance matrix), which is a hierarchical clustering of based on the inner product. According to clustering setting equivalences, this hierarchical clustering is equivalent to a hierarchical clustering of based on kernel . Furthermore, a property of Treelets is that does not necessarily need to be computed. The "covariance matrix" of without centering has an easier computation method:
So we may avoid the costly spectral decomposition to compute and define of Treelets as
The complexity of this algorithm is , where is the complexity of applying the kernel function to a pair of data points and if the data is numeric. In this model, the choice of kernel determines the expected outcome of the prediction and the choice of sample determines the variability of the outcome. A small sample size speeds up the algorithm at the cost of possibly generating false clusterings from unrepresentative samples, while a large sample size slows down the algorithm and also produces numerical issues: data is more likely to be close to orthogonal as the dimension of the projected space grows, and the Treelets method is forced to stop if all remaining components are almost orthogonal. The optimal sample size depends on the floating-point accuracy and the computation time allowed, and should be as large as possible without exceeding the time limit and the accuracy limit.
We implemented KT and the following examples in Python with the packages NumPy and Scikit-learn, and plots were generated with Matplotlib. The Treelets part of our implementation is not optimized, so its runtime complexity in the following examples is higher than that of the implementation designed by Lee et al. The hyper-parameter $\lambda$ is set to 0 for all the experiments below.
To illustrate how KT works as a hierarchical clustering method, we use an example from scikit-learn which consists of 6 datasets, each of which has 1500 two-dimensional data points (i.e. $n = 1500$ and $p = 2$), so we can visualize each dataset and each cluster by plotting each observation as a point in the plane. Each of the first five datasets consists of data drawn from multiple shapes with an error in distance. The sixth dataset consists of a uniform random sample, to show how clustering methods work on uniformly distributed data. Figure 1 shows how KT with different kernels works on these datasets compared to the performance of some other clustering methods. The number of clusters and the hyper-parameters are tuned for each method, and the sample size is set to 1000 for each KT method. Each row of this image represents a dataset and each column represents a clustering method. The method that each column represents and its runtime on each dataset are recorded in Table 1.
Table 1: Runtime of each clustering method on each dataset.
| Method | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 |
|---|---|---|---|---|---|---|
| 0 - KTrbf | 2.003 | 2.063 | 2.325 | 2.094 | 2.819 | 1.967 |
| 1 - KTlinear | 1.585 | 1.613 | 1.402 | 1.73 | 2.341 | 1.469 |
| 2 - KTpoly | 3.956 | 6.08 | 6.878 | 9.582 | 9.836 | 4.526 |
| 3 - MiniBatchKMeans | 0.006 | 0.018 | 0.009 | 0.01 | 0.007 | 0.009 |
| 4 - MeanShift | 0.047 | 0.032 | 0.063 | 0.057 | 0.032 | 0.05 |
| 5 - SpectralClustering | 0.642 | 1.011 | 0.13 | 0.352 | 0.257 | 0.208 |
| 6 - Ward | 0.114 | 0.098 | 0.513 | 0.245 | 0.111 | 0.087 |
| 7 - AgglomerativeClustering | 0.085 | 0.102 | 0.374 | 0.196 | 0.103 | 0.078 |
| 8 - DBSCAN | 0.015 | 0.014 | 0.015 | 0.012 | 0.067 | 0.012 |
| 9 - GaussianMixture | 0.005 | 0.005 | 0.008 | 0.012 | 0.004 | 0.009 |
In this experiment, KT with the RBF kernel is the only method whose clustering is closest to human intuition for all of the first five datasets. The sixth dataset is a uniform distribution, on which we can see how KT is affected by the relative lack of density in some areas due to sampling. Its high performance on the first five datasets is expected, as these datasets are to some extent Euclidean distance-based, which corresponds to the assumptions of RBF kernels. Fig. 2 shows how the number of sample points affects the clustering result. Each column represents KT using the RBF kernel with a different sample size. The kernel hyper-parameter is tuned for one sample size and is used for all other sample sizes. Notice that, as KT1500 uses the full sample, it does not trigger kernel SVM, whereas KT1499 does. Their number of clusters and runtime are recorded in Table 2. From here we can see that more sample data implies a longer runtime and a more stable outcome. The minimum optimal numbers of samples required for the first 5 datasets are 1000, 100, 1000, 200, and 50, respectively, which shows that different datasets require different numbers of samples to explain their shape. Furthermore, the fourth dataset shows that the optimal hyper-parameter depends on the number of samples. The RBF kernel can be considered as a weighted average of distance and connectivity, where a larger hyper-parameter means a higher weight on distance. For the same hyper-parameter, as the sample size gets larger, the clustering result becomes more distance-based rather than connectivity-based, demonstrating that the optimal hyper-parameter values for those sample sizes are actually smaller.
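Table 2: Runtime of KT with the RBF kernel for each sample size on each dataset.
| Method | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 |
|---|---|---|---|---|---|---|
| 0 - KT50 | 0.011 | 0.012 | 0.013 | 0.011 | 0.011 | 0.01 |
| 1 - KT100 | 0.035 | 0.044 | 0.039 | 0.045 | 0.033 | 0.028 |
| 2 - KT200 | 0.109 | 0.099 | 0.128 | 0.12 | 0.121 | 0.132 |
| 3 - KT300 | 0.225 | 0.217 | 0.242 | 0.269 | 0.259 | 0.235 |
| 4 - KT500 | 0.551 | 0.568 | 0.62 | 0.569 | 0.652 | 0.536 |
| 5 - KT800 | 1.315 | 1.513 | 1.534 | 1.378 | 1.699 | 1.295 |
| 6 - KT1000 | 2.016 | 2.055 | 2.336 | 2.098 | 2.782 | 1.941 |
| 7 - KT1200 | 2.88 | 2.94 | 3.242 | 3.004 | 4.146 | 2.77 |
| 8 - KT1499 | 4.438 | 4.532 | 5.4 | 4.713 | 6.788 | 4.341 |
| 9 - KT1500 | 4.472 | 4.69 | 5.398 | 4.807 | 6.782 | 4.274 |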
To illustrate how KT works in network analysis, we use an example from the Stanford Network Analysis Project. This is a dataset consisting of 'circles' (or 'friends lists') from Facebook. Its vertices are surveyed individuals, and every two of them are connected by an edge if they are friends and not connected otherwise. The edges are undirected and unweighted, and the total number of edges is 88234. We use KT to cluster this dataset with the full sample size. Denote the set of vertices of the graph as $V$, and define a kernel function $k$ such that
$$k(v_i, v_j) = \begin{cases} 1045 & \text{if } i = j, \\ 1 & \text{if } v_i \text{ and } v_j \text{ are connected}, \\ 0 & \text{otherwise}. \end{cases}$$
The number 1045 is computed and chosen as the largest degree of all vertices. Notice that $k$ is an SPSD kernel on $V$ because its kernel matrix $K$ is symmetric and diagonally dominant with a positive diagonal, as
$$K_{ii} = 1045 \geq \deg(v_i) = \sum_{j \neq i} |K_{ij}|.$$
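The diagonal-dominance argument can be checked on a toy graph. The sketch below assumes that the kernel matrix equals the adjacency matrix plus the largest vertex degree on the diagonal, which is how the construction reads here; the function name `graph_kernel` and the small edge list are illustrative and are not the Facebook data.

```python
import numpy as np

def graph_kernel(edges, n):
    """Kernel matrix: 1 for connected pairs, max degree on the diagonal."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0  # undirected, unweighted edges
    max_deg = A.sum(axis=1).max()
    return A + max_deg * np.eye(n), max_deg

# Usage: vertex 0 has the largest degree (3).
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
K, max_deg = graph_kernel(edges, 4)

# Each diagonal entry dominates the absolute sum of the rest of its row,
# so every Gershgorin disc lies in the non-negative half-line.
off_diag_sums = np.abs(K).sum(axis=1) - np.diag(K)
min_eig = np.linalg.eigvalsh(K).min()
```

For a large sparse graph, the same matrix would normally be stored in a sparse format; the dense array is used here only to keep the example short.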
To estimate the performance of KT as a multi-scale clustering method on this dataset, we use the following evaluation. For each cluster partition in the hierarchy, we compute its matching matrix and the corresponding true positive rate and false positive rate. The matching matrix, a type of confusion matrix, is a 2 by 2 matrix recording the number of true positives, true negatives, false positives, and false negatives for pairwise associations. The true positive rate measures the proportion of pairs of nodes that are in the same cluster given that the two nodes are connected, and the false positive rate measures the proportion of pairs of nodes that are in the same cluster given that the two nodes are not connected. Each pair of true positive rate and false positive rate produces a point on the plane, and interpolating the points of all clustering results in the hierarchy (in order) produces the receiver operating characteristic (ROC) curve; the numerical integral of this curve over the interval $[0, 1]$ is known as the area under the curve (AUC). Figure 3 demonstrates the performance of KT on the dataset: KT provides good clusterings, as indicated by its high AUC.
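The pairwise evaluation can be sketched as follows. The function names `pairwise_rates` and `roc_auc` are hypothetical, the ROC curve is assumed to be closed with the end points (0, 0) and (1, 1), and the area is computed with the trapezoidal rule; the exact interpolation used for the figures may differ.

```python
import numpy as np

def pairwise_rates(labels, positive):
    """True and false positive rates over all unordered pairs.

    labels: (n,) cluster labels; positive: (n, n) boolean matrix marking
    pairs that should be together (e.g. the adjacency matrix of the graph).
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    upper = np.triu_indices(len(labels), k=1)  # each pair counted once
    same = same[upper]
    pos = np.asarray(positive, dtype=bool)[upper]
    tp = np.sum(same & pos)
    fn = np.sum(~same & pos)
    fp = np.sum(same & ~pos)
    tn = np.sum(~same & ~pos)
    return tp / (tp + fn), fp / (fp + tn)

def roc_auc(rates):
    """Trapezoidal area under the points (fpr, tpr), one per partition."""
    pts = {(0.0, 0.0), (1.0, 1.0)}
    pts.update((float(fpr), float(tpr)) for tpr, fpr in rates)
    pts = sorted(pts)
    fpr = np.array([p[0] for p in pts])
    tpr = np.array([p[1] for p in pts])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Usage: a path graph 0-1-2-3 clustered as {0, 1} and {2, 3}.
adj = np.zeros((4, 4), dtype=bool)
for u, v in [(0, 1), (1, 2), (2, 3)]:
    adj[u, v] = adj[v, u] = True
tpr, fpr = pairwise_rates([0, 0, 1, 1], adj)
auc = roc_auc([(tpr, fpr)])
```

In a full evaluation, `pairwise_rates` would be called once per level of the hierarchy and all resulting pairs passed to `roc_auc` together.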
To illustrate how KT works on a dataset with missing information, we use the Mice Protein Expression (MPE) dataset from the UCI Machine Learning Repository as an example. This dataset consists of 1080 observations for 8 classes of mice, each containing the expression levels of 77 different proteins, with some of the entries not available. We use KT to cluster this dataset. First, we normalize the attributes so that each of them has empirical mean 0 and standard deviation 1. Then we define an RBF kernel for the dataset with missing data such that for all observations,
where the index set in the formula consists of the indices that are available (not missing) in both observations. We check that this set is nonempty for every pair of observations so that the kernel is well-defined. The number 32 is a parameter tuned experimentally. We compare the predicted clusters and the true labels according to pairwise scores. Fig. 4 shows how KT performs compared to KMeans clustering. We measure the true positive rate as the proportion of pairs of records that are in the same cluster given that they are from mice of the same type, and the false positive rate as the proportion of pairs of records that are in the same cluster given that they are from mice of different types. As in the network example, we draw the ROC curve and calculate its AUC. For comparison, we also use KMeans with several numbers of clusters. The AUC of KT is much higher than the AUC of KMeans, demonstrating that KT is a much better clustering method for this dataset than KMeans.
In this paper we describe a novel approach, kernel treelets (KT), for hierarchical clustering. The method relies on applying the treelet algorithm to a matrix measuring similarities among variables in a feature space that is a reproducing kernel Hilbert space. We show with some examples that KT is as useful as other hierarchical clustering methods and is especially competitive for datasets without a numerical matrix representation and/or with missing data. The KT approach also shows significant potential for semi-supervised learning tasks and as a pre-processing or post-processing step in deep learning. Work in these directions is underway.
Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.