As huge amounts of data are collected perpetually from communication, imaging, and mobile devices, medical and e-commerce platforms as well as social-networking sites, this is undoubtedly an era of data deluge . Such “big data” however, come with “big challenges.” The sheer volume and dimensionality of data make it often impossible to run analytics and traditional inference methods using stand-alone processors, e.g., [3, 17]. Consequently, “workhorse” learning tools have to be re-examined in the face of today’s high-cardinality sets possibly comprising high-dimensional data.
Clustering (a.k.a. unsupervised classification) refers to categorizing into groups unlabeled data encountered in the widespread applications of mining, compression, and learning tasks . Among numerous clustering algorithms, -means is the most prominent one thanks to its simplicity . It thrives on “tight” groups of feature vectors, data points or objects that can be separated via (hyper)planes. Its scope is further broadened by the so-termed probabilistic and kernel
-means, with an instantiation of the latter being equivalent to spectral clustering – the popular tool for graph-partitioning that can cope even with nonlinearly separable data.
A key question with regards to clustering data sets of cardinality containing -dimensional vectors with and/or huge, is: How can one select the most “informative” vectors and/or dimensions so as to reduce their number for efficient computations, yet retain as much of their cluster-discrimination capability? This paper develops such an approach for big data
-means clustering. Albeit distinct, the inspiration comes from random sampling and consensus (RANSAC) methods, which were introduced for outlier-resilient regression tasks encountered in computer vision[14, 30, 25, 35, 29, 8].
, and neural networks. Unfortunately, most available selection schemes do not scale well with the number of features , particularly in the big data regime where
is massive. Recent approaches to dimensionality reduction and clustering include subspace clustering, where minimization problems requiring singular value decompositions (SVDs) are solved per iteration to determine in parallel a low-dimensional latent space and corresponding sparse coefficients for efficient clustering. Low-dimensional subspace representations are also pursued in the context of kernel -means [7, Alg. 2], where either an SVD on a sub-sampled kernel matrix, or, the solution of a quadratic optimization task is required per iteration to cluster efficiently large-scale data.
Randomized schemes select features with non-uniform probabilities that depend on the so-termed “leverage scores” of the data matrix[19, 20]. Unfortunately, their complexity is loaded by the leverage scores computation, which, similar to , requires SVD computations – a cumbersome (if not impossible) task when and/or . Recent computationally efficient alternatives for feature selection and clustering rely on random projections (RPs) [19, 20, 9, 5]. Specifically for RP-based clustering , the data matrix is left multiplied by a data-agnostic (fat) RP matrix to reduce its dimension (); see also  where RPs are employed to accelerate the kernel -means algorithm. Clustering is performed afterwards on the reduced -dimensional vectors. With its universality and quantifiable performance granted, this “one-RP-matrix-fits-all” approach is not as flexible to adapt to the data-specific attributes (e.g., low rank) that is typically present in big data.
This paper’s approach aspires to not only account for structure, but also offer a gamut of novel randomized algorithms trading off complexity for clustering performance by developing a family of what we term sketching and validation (SkeVa) algorithms. The SkeVa family includes two algorithms based on efficient intermediate -means clustering steps. The first is a batch method, while the second is sequential thus offering computational efficiency and suitability for streaming modes of operation. The third member of the SkeVa family is kernel-based and enables big data clustering of even nonlinearly separable data sets. Finally, the fourth one bypasses the need for intermediate -means clustering thus trading off performance for complexity. Extensive numerical validations on synthetic and real data-sets highlight the potential of the proposed methods, and demonstrate their competitive performance on clustering massive populations of high-dimensional data vs. state-of-the-art RP alternatives.
Notation. Boldface uppercase (lowercase) letters indicate matrices (column vectors). Calligraphic uppercase letters denote sets ( stands for the empty set), and expresses the cardinality of . Operators and stand for the - and -norm of a vector, respectively, while denotes transposition and the all-one vector.
Consider the data matrix with and/or being potentially massive. Data belong to a known number of clusters (). Each cluster is represented by its centroid that can be e.g., the (sample) mean of the vectors in . Accordingly, each datum can be modeled as , where ; the sparse vector comprises the data-cluster association entries satisfying , where when , while , otherwise; and the noise captures ’s deviation from the corresponding centroid(s).
For hard clustering, the said associations are binary (), and in the celebrated hard -means algorithm they are identified based on the Euclidean () distance between and its closest centroid. Specifically, given and , per iteration , the -means algorithm iteratively updates data-cluster associations and cluster centroids as follows; see e.g., .
|[-a] Update data-cluster associations: For ,|
|[-b] Update cluster centroids: For ,|
Although there may be multiple assignments solving (1b), each is assigned to a single cluster. To initialize (1b), one can randomly pick from . The iterative algorithm (1) solves a challenging NP-hard problem, and albeit its success, -means guarantees convergence only to a local minimum at complexity , with denoting the number of iterations needed for convergence, which depends on initialization [4, § 9.1].
As (1b) and (1c) minimize an -norm squared, hard -means is sensitive to outliers. Variants exhibiting robustness to outliers adopt non-Euclidean distances (a.k.a. dissimilarity metrics) , such as the -norm. In addition, candidate “centroids” can be selected per iteration from the data themselves; that is, in (1c). Notwithstanding for this so-termed -medoids algorithm, one just needs the distances to carry out the minimizations in (1b) and (1c). The latter in turn allows to even represent non-vectorial objects (a.k.a. qualitative data), so long as the aforementioned (non-)Euclidean distances can become available otherwise; e.g., in the form of correlations [4, § 9.1].
Besides various distances and centroid options, hard -means in (1) can be generalized in three additional directions: (i) Using nonlinear functions , with being a potentially infinite-dimensional space, data can be transformed to , where clustering can be facilitated (cf. Sec. 3.3); (ii) via non-binary , soft clustering can allow for multiple associations per datum, and thus for a probabilistic assignment of data to clusters; and (iii) additional constraints (e.g., sparsity) can be incorporated in the coefficients through appropriate regularizers .
All these generalizations can be unified by replacing (1) per iteration , with
|[-a] Update data-cluster associations: ,|
|[-b] Update cluster centroids:|
In Sec. 3.3, function will be implicitly used to map nonlinearly separable data to linearly separable (possibly infinite dimensional) data , whose distances can be obtained through a pre-selected (so-termed kernel) function [4, Chap. 6]. The regularizer can enforce prior knowledge on the data-cluster association vectors.
To confirm that indeed (1) is subsumed by (2), let ; choose as the squared Euclidean distance in ; and set , if , whereas otherwise, with being the th -dimensional canonical vector. Then (2b) becomes , which further simplifies to the minimum distance rule of (1b). Moreover, it can be readily verified that (2c) separates across s to yield the centroid of (1c) per cluster.
To recognize how (2) captures also soft clustering, consider that cluster is selected with probability
, and its data are drawn from a probability density function (pdf)parameterized by ; i.e., . If is Gaussian, then denotes its mean and covariance matrix . With and allowing for multiple cluster associations, the likelihood per datum is given by the mixture pdf: , which for independently drawn data yields the joint log-likelihood ()
If , , and in (2), then soft -means iterations (2b) and (2c) maximize (3) with respect to (w.r.t.) and . An alternative popular maximizer of the likelihood in (3) is the expectation-maximization algorithm; see e.g., [4, § 9.3].
If the number of clusters
is unknown, it can also be estimated by regularizing the log-likelihood in (3) with terms penalizing complexity as in e.g., minimum description length criteria .
Although hard -means is the clustering module common to all numerical tests in Sec. V, the novel big data algorithms of Secs. III and IV apply to all schemes subsumed by (2).
3 The SkeVa Family
Our novel algorithms based on random sketching and validation are introduced in this section. Relative to existing clustering schemes, their merits are pronounced when and/or take prohibitively large values for the unified iterations (2b) and (2c) to remain computationally feasible.
3.1 Batch algorithm
For specificity, the SkeVa K-means algorithm will be developed first for , followed by its variant for .
Using a repeated trial-and-error approach, SkeVa K-means discovers a few dimensions (features) that yield high-accuracy clustering. The key idea is that upon sketching a small enough subset of dimensions (trial or sketching phase), a hypotheses test can be formed by augmenting the original subset with a second small subset (up to affordable dimensions) to validate whether the first subset nominally represents the full -dimensional data (error phase). Such a trial-and-error procedure is repeated for a number of realizations, after which the features that have achieved the “best” clustering accuracy results are those determining the final clusters on the whole set of dimensions.
Starting with the trial-phase per realization , dimensions (rows) of are randomly drawn (uniformly) to obtain . With small enough, -means is run on to obtain clusters and corresponding centroids [cf. (1b) and (1c)]. These sketching and clustering steps comprise the (random) sketching phase.
Moving on to the error-phase of the procedure, the quality of the -dimensional clustering is assessed next using what we term validation phase. This starts by re-drawing -dimensional data , generally different from those selected in draw . Associating each with the cluster belongs to, the centroids corresponding to the extra dimensions are formed as [cf. (1c)]
Let and denote respectively the concatenated data and centroids from draws and , and likewise for the data and centroid matrices and . Measuring distances and using again the minimum distance rule data-cluster associations and clusters are obtained for the “augmented data.” If per datum the data-cluster association in the space of dimensions coincides with that in the space of dimensions, then is in the validation set (VS) ; that is,
Quality of clustering per draw is then assessed using a monotonically increasing rank function of the set . Based on this function, a -dimensional trial is preferred over another -dimensional trial if .
The sketching and validation phases are repeated for a prescribed number of realizations . At last, the -dimensional sketching yields the final clusters, namely ; see Alg. 1.
With regards to selecting , a straightforward choice is the VS cardinality, that is , which can be thought as the empirical probability of correct clustering. Alternatively, a measure of cluster separability, used extensively in pattern recognition, is Fisher’s discriminant ratio , which in the present context becomes
is the unbiased sample variance of cluster:
The larger the , the more separable clusters are. Obtaining is computationally light since distances in (7) have been calculated during the validation phase of the algorithm. The only additional burden is computing the numerator in (6) in complexity. Based on FDR, a second choice for is
Instead of , the exponent is to avoid pathological cases where approaches , e.g., when all points in a cluster are very concentrated so that .
Alg. 1 incurs overall complexity , where is an upper bound on the number of iterations needed for -means to converge in step 5, plus required in step 8 of Alg. 1. Parameters , , and are selected depending on the available computational resources; should be such that running the computations of Alg. 1 on -dimensional vectors can be affordable by the processing unit used. A probabilistic argument for choosing can be determined as elaborated next.
Using parameters that can be obtained in practice, it is possible to relate the number of random draws with the reliability of SkeVa-based clustering, along the lines of analyzing RANSAC .
To this end, let denote the probability of having out of SkeVa realizations at least one “good draw” of “informative” dimensions, meaning one for which -means yields data-cluster associations close to those found by -means on the full set of dimensions. Parameter is a function of the underlying cluster characteristics. It can be selected by the user and reflects one’s level of SkeVa-based big data clustering reliability, e.g., . The probability of having all “bad draws” after SkeVa repetitions is clearly . Moreover, let denote the probability that a randomly drawn row of is “informative.” In other words, quantifies prior information on the number of rows (out of ) that carry high discriminative information for -means. For instance, can be practically defined by the leverage scores of , which typically rank the importance of rows of in large-scale data analytics [5, 20]. An estimate of the leverage scores across the rows of expresses as the percentage of informative rows. Alternatively, if is the data generation mechanism per cluster , with , then the th entry of is . Thus, rows of are realizations of a Gaussian random vector. If these rows are clustered in groups, can capture the probability of having an “informative” row located within a confidence region around its associated centroid that contains a high percentage of its pdf mass. Per SkeVa realization, the probability of drawing “non-informative” rows can be approximated by . Due to the independence of SkeVa realizations, , which implies that . Clearly, as increases there is (a growing) nonzero probability that the correct clusters will be revealed. On the other hand, it must be acknowledged that if is not sufficient, clustering performance will suffer commensurately.
It is interesting to note that as in , only depends implicitly on , since is a percentage, and does not depend on the validation metric or pertinent thresholds and bounds.
3.2 Sequential algorithm
Drawing a batch of features (rows of ) during the validation phase of Alg. 1 to assess the discriminating ability of the features drawn in the sketching phase may be computationally, especially if is relatively large. The computation of all distances in step 8 of Alg. 1 can be prohibitive if becomes large. This motivates a sequential augmentation of dimensions, where features are added one at a time, and computations are performed on only a single row of per feature augmentation, till the upper-bound is reached. Apparently, such an approach adds flexibility and effects computational savings since sequential augmentation of dimensions does not need to be carried out till is reached, but it may be terminated early on if a prescribed criterion is met. These considerations prompted the development of Alg. 2.
The sketching phase of Alg. 2 remains the same as in Alg. 1. In the validation phase, and for each dimension in the additional ones, are obtained as in Alg. 1 [cf. (4)], and likewise for . If is smaller than the current maximum value in memory, the -dimensional clustering is discarded, and a new draw is taken. This can be seen as a “bail-out” test, to reject possibly “bad clusterings” in time, without having to perform the augmentation using all dimensions.
Experiments corroborate that it is not necessary to augment all dimensions (cf. Fig. 1), but using a small subset of them provides satisfactory accuracy while reducing complexity. An alternative route is to stop augmentation once the “gradient” of , meaning finite differences across augmented dimensions, drops below a prescribed ; that is, . The sequential approach is summarized in Alg. 2, and has complexity strictly smaller than .
Using in the place of whenever , or equivalently, replacing with , both the batch and sequential schemes developed for can be implemented verbatim for . This variant of SkeVa K-means will be detailed in the next subsection for nonlinearly separable clusters.
3.3 Big data kernel clustering
A prominent approach to clustering or classifying nonlinearly separable data is throughkernels; see e.g., . Vectors are mapped to that live in a higher (possibly infinite-) dimensional space , where inner products defining distances in , using the induced norm , are given by a pre-selected (reproducing) kernel function ; that is, . An example of such a kernel is the Gaussian one: .
For simplicity in exposition, our novel kernel-based (Ke)SkeVa K-means approach to big data clustering will be developed for the hard -means. Extensions to kernel-based soft SkeVa K-means follow naturally, and are outlined in Appendix A. Similar to (1), the kernel-based hard -means proceeds as follows. For ,
|[-a] Update data-cluster associations: For ,|
|[-b] Update cluster centroids: For ,|
where Euclidean norms in the standard form of -means have been replaced by . As the potentially infinite-size cannot be stored in memory, step (8c) is implicit. In fact, only and the data-cluster associations suffice to run (8). To illustrate this, substitute from (8c) into (8b) to write
Having established that distances involved in SkeVa K-means are expressible in terms of the chosen kernel , the resulting iterative scheme is listed as Alg. 3. After randomly selecting an affordable subset , comprising columns of per realization , and similar to the trial-and-error step in line 5 of Alg. 1, the (kernel) -means of (8) is applied to . The validation phase of KeSkeVa K-means is initialized in line 6, where a second subset comprising columns from . The distances between data and centroids involved in step 7 of Alg. 3 are also obtained through kernel evaluations [cf. (10)]. This KeSkeVa that operates on the number of data-points rather than dimensions follows along the line of RANSAC  but with two major differences: (i) Instead of robust parameter regression, it is tailored for big data clustering; and (ii) rather than consensus it deals with affordably small validation sets across possibly huge data-sets.
Given the “implicit centroids” obtained as in (11), data are mapped to clusters which are different from . To assess this difference, the distance between and is computed, and columns of are re-grouped in clusters as
Recall that distances are again obtained through kernel evaluations [cf. (10)].
The process of generating clusters and centroids in KeSkeVa K-means can be summarized as follows: (i) Group randomly drawn data into clusters with centroids ; (ii) draw additional data-points, augment clusters , and compute new centroids ; (iii) given , find clusters as in (12). Since in general, data belonging to do not necessarily belong to , and vice versa; while data common to and , that is data that have not changed “cluster membership” during the validation phase, comprise the validation set
Among realizations, trial with the highest cardinality is identified in Alg. 3, and data are finally associated with clusters . The overall complexity of Alg. 3 is in step 5, when are not stored in memory and kernel evaluations have to be performed for all employed data per realization, plus in step 7. If are stored in memory, then Alg. 3 incurs complexity , which is quadratic only in the small cardinality .
One remark is now in order.
Similar to all kernel-based approaches, a critical issue is selecting the proper kernel – more a matter of art and prior information about the data. Nonetheless, practical so-termed multi-kernel approaches adopt a dictionary of kernels from which one or a few are selected to run -means on . It will be interesting to investigate whether such multi-kernel approaches can be adapted to our SkeVa-based operation to alleviate the dependence on a single kernel function.
4 Divergence-Based SkeVa
A limitation of Algs. 1 and 2 is the trial-and-error strategy which requires clustering per random draw of samples. This section introduces a method to surmount such a need and select a small number of data or dimensions on which only a single clustering step is applied at the last stage of the algorithm. As the “quality” per draw is assessed without clustering, this approach trades off accuracy for reduced complexity.
4.1 Large-scale data sets
First, the case where random draws are performed on the data will be examined, followed by draws across the dimensions. Any randomly drawn data from will be assumed centered around their sample mean.
Since intermediate clustering will not be applied to assess the quality of a random draw, a metric is needed to quantify how well the randomly drawn samples represent clusters. To this end, motivated by the pdf mixture model [cf. (3)], consider the following pdf estimate formed using the randomly selected data
where stands for a user-defined kernel function,
parameterized by , and satisfying
for the estimate in (14) to qualify as a pdf. Here, the Gaussian kernel is considered with .
Having linked random samples with pdf estimates, the assessment whether a draw represents well the whole population translates to how successful this draw is in estimating the actual data pdf via (14). A random sample where all selected data form a single cluster is clearly not a good representative of . For example, if the selected points are gathered around (recall that drawn data are centered around their sample mean), then the resulting pdf estimate (