1.1 Ordinary Center Based Clustering
The goal of clustering is to partition an unlabeled dataset into disjoint sets or clusters such that each cluster only consists of similar members of the dataset [19, 20, 13]. Of particular interest to this work are center or centroid based clustering methods, as described in the following. Let be a dataset of -dimensional vectors, whose elements are drawn according to a random vector . In classical -means clustering [27, 32, 26], one is interested in finding the optimal cluster centers that minimize the average distortion
Distinct clusters can then be identified via the Voronoi cells (ties are broken arbitrarily). Several variations of the basic formulation in (1) have been studied. For example, the squared-error distortion measure between the cluster center and the dataset sample in (1) can be replaced with the th power distortion measure [5], a quadratic distortion measure , where is a positive semi-definite matrix [14, 23], or a Bregman divergence [2, 11].
Finding a globally optimal solution to (1) is known to be NP-hard [1]. Nevertheless, locally optimal solutions can be found using the -means algorithm or its extensions such as the generalized Lloyd algorithm [26, 24]. Moreover, vector quantization theory [17, 15] provides a precise description of the structure of optimal solutions and the corresponding minimum average distortions in the asymptotic regime [35, 25].
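The alternating procedure mentioned above can be sketched as follows. This is a minimal illustration under our own assumptions (squared-error distortion, initialization by sampling data points, a fixed iteration budget); the names `lloyd_kmeans` and `average_distortion` are hypothetical and do not come from the cited works. It only finds a locally optimal codebook, consistent with the NP-hardness of the global problem.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lloyd_kmeans(points, n_centers, n_iter=50, seed=0):
    """Plain Lloyd iterations: assign to the nearest center, then re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_centers)  # assumed initialization: random data points
    for _ in range(n_iter):
        # Assignment step: Voronoi cell of each center (ties go to the lowest index).
        cells = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            cells[j].append(p)
        # Update step: each center moves to the mean of its cell.
        new_centers = []
        for c, cell in zip(centers, cells):
            if cell:
                dim = len(c)
                new_centers.append(
                    tuple(sum(p[i] for p in cell) / len(cell) for i in range(dim))
                )
            else:
                new_centers.append(c)  # an empty cell keeps its old center
        if new_centers == centers:
            break  # fixed point reached
        centers = new_centers
    return centers

def average_distortion(points, centers):
    """Average squared error to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)
```

With a single center the procedure returns the centroid of the data; with more centers the result depends on the initialization, so several restarts are commonly used in practice.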
1.2 Problem Statement
In this paper, we will study the following generalization of (1): Consider the dataset
containing observations of the
-valued random matrix. We wish to minimize
where are some weights. For and , we recover the original -means problem (1). We provide a complete asymptotic analysis of the minimizers of (3) and the corresponding loss values for the special cases for any , and for any and .
1.3 Application Scenarios
An example scenario where the cost function (3) becomes relevant is as follows: Consider a physical process that generates a dataset , and suppose that our goal is to cluster . In practice, we may only access a noisy version of through, for example, sensor measurements. For any given sample index , given that there are sensors in total, Sensor can provide a noisy version of the true data sample . In such a scenario, one only has the noisy dataset in (2) available and cannot access the true dataset . We thus wish to find a clustering of that is as faithful to the clustering of as possible. The minimization of (3) provides a possible solution to this problem. In fact, (3) exploits the information that each measurement is a noisy version of a common true data sample. This is why a common cluster center is used for all noisy versions of the data sample. The weights may be used to control the relative importance of individual sensor measurements: Sensors known to be less noisy may be assigned a larger weight.
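To make the structure of this cost concrete, the following sketch evaluates an objective of the kind described above: each sample consists of several noisy observations, all of which are charged against one common center. The exact normalization and notation are assumptions on our part (an average over samples of the minimum over centers of a weighted sum of powers of Euclidean distances), and the function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def sample_cost(sample, center, weights, r):
    """Weighted sum of r-th power distances from all observations to one center."""
    return sum(w * dist(x, center) ** r for w, x in zip(weights, sample))

def multi_observation_cost(samples, centers, weights, r=2.0):
    """Average over samples of the cost of the best common center (assumed form)."""
    total = 0.0
    for sample in samples:
        total += min(sample_cost(sample, c, weights, r) for c in centers)
    return total / len(samples)
```

With a single observation per sample, unit weight, and `r=2.0`, the function reduces to the ordinary average squared-error distortion.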
It is also worth mentioning that a straightforward alternative approach for approximating from may be to directly apply the -means algorithm to . In this case, the problem can be considered to be a special case of multi-modal or multi-view clustering [6, 4], where each noisy measurement corresponds to one individual view. We numerically demonstrate that the approximation performance with this approach, measured in terms of the adjusted Rand index (ARI) [31] or the adjusted mutual information (AMI) [34], is inferior to our approach that relies on minimizing (3). The optimal way to approximate a clustering through its noisy measurements warrants a separate investigation and will be left as future work. We refer the reader to [10, 8] for other problem formulations that involve clustering noisy data.
We also note that the objective function (3) can be interpreted in the context of facility location optimization, at least for the special case (and ). In fact, many facility location optimization problems can be formulated as clustering or quantization problems [30, 9]. For our scenario, consider packages at locations , which are to be processed at one of the facilities , and then conveyed to their destinations , respectively. The cost of conveying a package from one location to another can be modeled to be proportional to the th power of the distance between the two locations [28, 29, 7, 22]. Thus, the total cost of conveying the th package through the facility at is given by . The minimum average cost of conveying all packages is then given by (3) for the special case with . Minimizing (3) corresponds to optimizing the facility locations. Other applications of vector quantization to sensor or facility location optimization can be found in [18, 21].
1.4 Organization of the Paper
The rest of this paper is organized as follows: In Section 2, we introduce some well-known results from quantization theory. In Sections 3 and 4, we analyze the cases of squared-error distortions and arbitrary powers of errors, respectively. In Section 5, we provide numerical results over real and artificial datasets. Finally, in Section 6, we draw our main conclusions. Proofs of certain theoretical claims and more numerical experiments can be found in the extended version of the paper.
Throughout the paper, we will present our analytical results for the case of having observations from the dataset . In particular, for the simple -means scenario in (1), letting amounts to replacing the empirical sum with the integral
where is the quantizer codebook, and
represents the probability density function (PDF) of. We omit the domain of integration when it is clear from the context.
For , it is easily shown that the minimizer of (4) is the centroid . On the other hand, finding the exact minimizers of (4) is intractable for a general number of centers . Nevertheless, vector quantization theory yields a precise characterization in the asymptotic regime. In fact, we have the asymptotic result [3, 35]
where is a constant that depends only on the dimension , and
is the -norm of the density . The sequence of quantizer codebooks that achieve the performance in (5) has the following property: There exists a continuous function such that at the cube , optimal quantizer codebooks contain points as . Hence, can be thought of as a “point density function” and obeys the normalization . For the squared-error distortion, the optimal point density function depends on the input distribution through
Equivalently, we say that is proportional to . The question is now how the quantization points are to be deployed optimally inside each cube . Since the underlying density is approximately uniform on , the question is equivalent to finding the structure of an optimal quantizer for a uniform distribution. For one and two dimensions, the optimal quantizers are known to be the uniform and the hexagonal lattice quantizers, respectively. (Thus, in two dimensions, the points should follow a hexagonal lattice on the square in an optimal quantizer.) We thus have and , corresponding to the normalized second moments of the interval and the regular hexagon, respectively. For a set with , the normalized second moment is defined as
For , the optimal quantizer structure and have remained unknown. We conclude by noting that the arguments involving infinitesimals and point density functions can be formalized; see [16].
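As a sanity check on the two constants mentioned above, the sketch below computes normalized second moments exactly for an interval and for polygons centered at the origin. It assumes the standard definition, namely the integral of the squared distance to the centroid divided by the dimension times the volume raised to the power 1 + 2/d; the function names are hypothetical.

```python
import math

def interval_normalized_second_moment(length=1.0):
    """Normalized second moment of an interval (d = 1), centered at its midpoint."""
    moment = length ** 3 / 12.0  # integral of x^2 over [-length/2, length/2]
    return moment / (1 * length ** 3)

def polygon_normalized_second_moment(vertices):
    """Normalized second moment (d = 2) of a polygon whose centroid is the origin.

    Vertices must be listed counter-clockwise. The polygon is split into
    triangles (origin, p, q); for each one, the integral of ||x||^2 equals
    (area / 6) * (|p|^2 + |q|^2 + p.q).
    """
    area = 0.0
    moment = 0.0
    n = len(vertices)
    for k in range(n):
        (px, py), (qx, qy) = vertices[k], vertices[(k + 1) % n]
        cross = px * qy - qx * py  # twice the signed triangle area
        area += cross / 2.0
        moment += cross / 12.0 * (px * px + py * py + qx * qx + qy * qy + px * qx + py * qy)
    d = 2
    return moment / (d * area ** (1.0 + 2.0 / d))

# Regular hexagon with unit circumradius, centered at the origin.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]
# Unit square centered at the origin, for comparison.
square = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]
```

Under this definition the interval gives 1/12, the regular hexagon gives 5/(36√3) ≈ 0.0802, and the square gives 1/12 ≈ 0.0833, which illustrates why the hexagonal lattice is preferable to the square lattice in two dimensions.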
3 Squared-Error Distortions
Let us now consider our problem of quantizing multiple sources to a common cluster center, as formulated in (3). Focusing on the case of having observations from the dataset , we replace the empirical sum in (3) with the integral
where represents a realization of the random matrix , and is the PDF of .
We first consider the case of squared-error distortions , which allow a simple characterization of the optimal average distortions. In fact, for , the integrand in (9) can be rewritten as
where . The equivalence (10) can easily be verified by expanding the squared Euclidean norms on both sides via the formula , where and
are arbitrary vectors. Let us now define a new random variable
and, with regard to the last two terms in (10), the expected value
We observe that the integral in (13) is merely the average squared-error distortion of a quantizer given a source with density . Therefore, when , optimal quantization of multiple sources to a common cluster center is equivalent to the optimal quantization of the single source with the usual squared-error distortion measure. It follows that the results of Section 2 are directly applicable, and we have the following theorem.
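The reduction to a single source can be checked numerically. The sketch below verifies, in our own notation, the identity that a weighted sum of squared distances to a common center equals the total weight times the squared distance from the weighted average to that center, plus a residual that does not depend on the center. The normalization used in (10) may differ from the one assumed here, and all names are hypothetical.

```python
def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(vi * vi for vi in v)

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def weighted_average(sample, weights):
    """Weighted average of the observations of one sample."""
    total_w = sum(weights)
    dim = len(sample[0])
    return tuple(
        sum(w * x[k] for w, x in zip(weights, sample)) / total_w for k in range(dim)
    )

def direct_cost(sample, weights, c):
    """Weighted sum of squared distances from every observation to the center c."""
    return sum(w * sq_dist(x, c) for w, x in zip(weights, sample))

def decomposed_cost(sample, weights, c):
    """Same cost, split into a center-dependent term and a center-free residual."""
    total_w = sum(weights)
    z = weighted_average(sample, weights)
    residual = sum(w * sq_norm(x) for w, x in zip(weights, sample)) - total_w * sq_norm(z)
    return total_w * sq_dist(z, c) + residual
```

Because the residual is independent of the center, the center that is best for a sample is simply the one nearest to the weighted average of its observations.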
Theorem 1. Let . As , we have
Moreover, the optimal point density function that achieves (14) is proportional to .
We have thus precisely characterized the asymptotic average distortion for the case . Note that when , which corresponds to ordinary quantization with squared-error distortion, the average distortion decays to zero as the number of quantizer centers grows to infinity. Theorem 1 demonstrates that when , the average distortion converges to , which is in general non-zero. The reason is that when , a single quantizer center is used to reproduce multiple sources, which makes zero distortion impossible to achieve whenever the sources are not identical.
4 Distortions with Arbitrary Powers of Errors
We now consider the achievable performance for a general . We also consider the case . Without loss of generality, let . In this case, the objective function in (9) takes the form
The main difficulty for is that and an algebraic manipulation of the form (10) is not available. Nevertheless, it turns out that an analysis in the high-resolution regime is still feasible. First, we need the following basic lemma.
Lemma 1. Let . The global minimizer of is , where . The corresponding global minimum is .
Proof. Note that is convex in so that it has a global minimum. Observe that this global minimum should be located on the line that connects and . In fact, suppose that the global minimizer does not belong to . We can project to to come up with a new point that satisfies and . This implies and thus contradicts the optimality of . Given that should be on , it can be written in the form for some . We have
Let us now calculate
According to (18), the function is decreasing for and increasing for . The global minimum is thus achieved for . Substituting this optimum value for to (17), we obtain the same expression for as in the statement of the lemma. This concludes the proof. ∎
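The argument of the proof can be reproduced numerically under an assumed parametrization. The sketch below takes the two-point cost to be the r-th power of the distance to the first point plus a weight `lam` times the r-th power of the distance to the second point, with r > 1, and writes the candidate minimizer as the first point plus a fraction t of the way to the second. Setting the derivative along the segment to zero then gives t = a / (1 + a) with a = lam^(1/(r-1)). The symbols `lam`, `t`, and `a` are our own and need not match the notation of the lemma.

```python
import math

def two_point_cost(u, x, y, lam, r):
    """Assumed two-point cost: ||x - u||^r + lam * ||y - u||^r."""
    return math.dist(u, x) ** r + lam * math.dist(u, y) ** r

def closed_form_minimizer(x, y, lam, r):
    """Minimizer on the segment from x to y and the corresponding minimum (r > 1)."""
    a = lam ** (1.0 / (r - 1.0))
    t = a / (1.0 + a)  # fraction of the way from x to y
    u = tuple(xi + t * (yi - xi) for xi, yi in zip(x, y))
    value = lam * math.dist(x, y) ** r / (1.0 + a) ** (r - 1.0)
    return u, value
```

For equal weights and r = 2 the minimizer is the midpoint and the minimum is half the squared distance between the two points, as expected.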
According to Lemma 1, given one data point at , and the other at , the minimum cost can be achieved by using a cluster center at . Then, given a hypothetically infinite number of cluster centers, one can achieve the optimal performance by placing the centers at every possible location imaginable. On the other hand, given only finitely many centers, one has no choice but to be content with choosing the center that is close to the optimal location . In such a scenario, it makes sense to analyze the behavior of the function near the optimal value
where is a small vector in magnitude. The motivation behind restricting to be small is to be able to invoke a high-resolution analysis. We have
Now, let and denote the gradient and the Hessian of a multivariate function evaluated at , respectively. We have the generic multivariate Taylor series expansion
In order to expand (21) for small , we need to find the Taylor expansion of the function . For this purpose, for any vector , we let . The gradient and the Hessian of can then be calculated as
is the identity matrix and. Substituting (23) and (24) to (22), we obtain
Using this expansion in (21) leads to
where . Using the equivalence , we have
According to (5), the last term decays as . For the second term, we apply a change of variables and to obtain
Note that for any , and , the matrix is positive semi-definite. This implies that the matrix is positive semi-definite for every .
The function defines an input-weighted quadratic distortion measure. The structure and distortion of the optimal quantizers corresponding to such distortion measures have been studied in [23]. As discussed above, the matrix is positive semi-definite for every so that the results of [23] are applicable. In particular, we have
and the optimal point density that achieves the asymptotic performance in (31) is
Applying these results to our specific problem leads to the following theorem.
For ordinary center-based quantization with the th power distortion measure, the average distortion decays as ; see [35]. It is interesting to note that, when one instead considers a sum of th powers of distortions, as in (15), the average distortion decays as , independently of .
Let us now discuss certain special cases of the conclusions of Theorem 2 above.
The second equality follows from a change of variables , and the last equality follows once we view the PDF of as a convolution. Substituting the derived equalities, the conclusions of Theorems 1 and 2 become identical whenever . ∎
Example 2. Let , and suppose and are independent and uniform random variables on . In order to calculate , we find the region where the joint PDF of and in (30) is non-zero. In other words, we solve for in the conditions and . After some straightforward manipulations, we obtain the equivalent set of inequalities
According to Theorem 2, the optimal point density at is proportional to the cube root of . The normalizing constant can be calculated to be
The integral in (40) cannot be expressed in terms of elementary functions, but can easily be evaluated numerically. Also, for the special case of , the integral vanishes so that we have, simply
Also, when and are independent and uniform on , the random variable has PDF . Therefore,
A closed-form expression for the optimal asymptotic distortion can then be obtained by substituting (40) and (42) into Theorem 2. One only needs to numerically evaluate the integral in (40). In particular, for , i.e. when the samples from and are weighted equally, we obtain
The case can be handled in a similar manner. This concludes our example. ∎
It is not clear how to extend the analysis to sources. Note however that the numerical design of the quantizer (i.e. a minimization of (9)) in the general case is always possible via the following variant of the generalized Lloyd algorithm: First, one initializes some arbitrary , and then iterates between the two steps of calculating the generalized Voronoi cells
and the generalized centers
It is easily seen that this algorithm results in a non-increasing average distortion at every iteration and thus it converges in a cost-function sense. Moreover, the center calculation (45) can be accomplished in a computationally-efficient manner as it is convex for any . We will use this algorithm to validate our analytical results in the next section over various datasets.
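A one-dimensional version of this alternating procedure can be sketched as follows. The setting is an assumption chosen for brevity: scalar observations, a per-sample cost equal to a weighted sum of r-th powers of absolute errors to a common center, and a center update carried out by ternary search, which is valid because the cell cost is convex in the center for r ≥ 1. All function names are hypothetical.

```python
def sample_cost(sample, c, weights, r):
    """Weighted sum of r-th power errors from all observations of one sample to c."""
    return sum(w * abs(x - c) ** r for w, x in zip(weights, sample))

def best_center(cell, weights, r, tol=1e-10):
    """Minimize the convex one-dimensional cell cost by ternary search."""
    lo = min(min(s) for s in cell)
    hi = max(max(s) for s in cell)

    def cost(c):
        return sum(sample_cost(s, c, weights, r) for s in cell)

    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(m1) <= cost(m2):
            hi = m2  # by convexity, a minimizer lies in [lo, m2]
        else:
            lo = m1  # by convexity, a minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)

def generalized_lloyd(samples, centers, weights, r, n_iter=50):
    """Alternate between generalized Voronoi cells and generalized centers."""
    centers = list(centers)
    for _ in range(n_iter):
        cells = [[] for _ in centers]
        for s in samples:
            j = min(
                range(len(centers)),
                key=lambda k: sample_cost(s, centers[k], weights, r),
            )
            cells[j].append(s)
        # Empty cells keep their previous center.
        centers = [
            best_center(cell, weights, r) if cell else c
            for c, cell in zip(centers, cells)
        ]
    return centers

def average_cost(samples, centers, weights, r):
    """Empirical average of the cost of the best common center per sample."""
    return sum(
        min(sample_cost(s, c, weights, r) for c in centers) for s in samples
    ) / len(samples)
```

For r = 2 the center update has a closed form (the weighted mean of all observations in the cell), so the search is only needed for other exponents. In higher dimensions the same structure applies with a general-purpose convex solver in place of the ternary search.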
5 Numerical Results
In this section, we provide numerical experiments to show the performance of our algorithms over various datasets and to verify our analytical results.
5.1 Performance Results for Clustering Through Noisy Observations
First, we show that our proposed approach of clustering noisy observation vectors to a common center is superior to naively clustering a concatenated version of noisy observation vectors. We follow the same notation as in Section 1.3, but consider now a real dataset with a practical noisy observation scenario. Specifically, let be the Iris dataset [12], consisting of four-dimensional vectors, where the components of the vectors correspond to the sepal length, sepal width, petal length, and petal width of each flower. In practice, one may not be able to access the true , but only its noisy version: For example, multiple drones may measure a given iris from a distance over the air, resulting in multiple noisy observations of a sample in . Mathematically, let be uniformly distributed on . We assume that the observations are given by the random variables , where are independent noise random variables. Through the samples , our goal is to obtain a clustering that is as close to the clustering of as possible.
On the one hand, one can consider the ordinary -means clustering of , which we refer to as the “ordinary clustering” scenario. Our method instead relies on minimizing (3) for the special case . In Fig. 2, we compare the clusterings obtained using the two approaches in terms of their similarity with the -means clustering of the original dataset . The horizontal axis is the number of cluster centers, and the vertical axis represents the similarity measure. We consider both the ARI and AMI similarity measures. A Monte Carlo similarity measure average is accurate within with confidence. We also consider four observations , and assume that are zero-mean Gaussian random variables with variance . First, we can observe that our approach outperforms ordinary clustering uniformly for all scenarios and for both similarity measures. The improvement is particularly notable when there is more noise in the observations or when the number of cluster centers is large. In particular, for cluster centers, which is the ground truth for the number of clusters of the dataset, the improvement of our method over ordinary clustering in terms of the ARI measure is , , and for and , respectively.
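For readers who wish to reproduce this kind of comparison, the adjusted Rand index between two label assignments can be computed as in the sketch below. It implements the usual chance-corrected form based on pair counts in the contingency table; the function name is hypothetical, and at least two samples are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected Rand index between two clusterings of the same samples."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))  # contingency table entries
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    expected = sum_a * sum_b / comb(n, 2)  # expected index under random labeling
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are trivial
    return (sum_pairs - expected) / (max_index - expected)
```

The index equals 1 for clusterings that agree up to a relabeling of the clusters and is close to 0 (possibly negative) for unrelated clusterings.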
The advantage of our method over ordinary clustering carries over to non-Gaussian noise models, different numbers of observations, and different datasets. In Fig. 3, we show the results for the case where each is uniformly distributed on for for . We can similarly observe that our approach that relies on quantizing to a common cluster center provides a better performance.
In Fig. 4, we show the results for the UCI Wine dataset [33], which consists of -dimensional vectors. Since the components of the vectors in this dataset have vastly different variances, we have preprocessed the dataset such that each component has unit variance. Except for the case of cluster centers, our quantization approach outperforms the ordinary -means algorithm for all scenarios.
5.2 Validation of High Resolution Analysis
We now provide numerical experiments that verify the high resolution results provided by Theorems 1 and 2. We consider the same scenario as in Example 2 for different values of and . In Fig. 5, we compare the cluster centers obtained using our generalized Lloyd algorithm (labeled “Numerical”) with those provided by Theorem 2 (labeled “Analytical”) for the cases in Fig. 4(a) and in Fig. 4(b). The horizontal axis represents , and the vertical axis represents the cluster center or the quantization point locations. Note that Theorem 2 provides the optimal quantizer point density function, not the individual quantization points or cluster centers. We may use, however, inverse transform sampling to obtain a sequence of quantization points that will be faithful to the quantizer point density function. Namely, if the desired point density function is , we use the quantization points , where is the cumulative point density function.
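The inverse transform step can be sketched as follows for a density on an interval. The specific placement rule used here, evaluating the inverse cumulative point density at the midpoints (2i − 1)/(2n) for i = 1, …, n, is an assumption on our part and may differ from the rule used to produce the figures; the function name and the grid-based numerical inversion are likewise hypothetical.

```python
def points_from_density(density, n_points, a=0.0, b=1.0, grid=20000):
    """Place n_points on [a, b] so that their concentration follows `density`.

    `density` may be unnormalized; it must be non-negative with a positive integral.
    """
    h = (b - a) / grid
    xs = [a + k * h for k in range(grid + 1)]
    # Cumulative integral of the density by the trapezoidal rule.
    cum = [0.0]
    for k in range(grid):
        cum.append(cum[-1] + 0.5 * h * (density(xs[k]) + density(xs[k + 1])))
    total = cum[-1]
    points = []
    k = 0
    for i in range(1, n_points + 1):
        target = total * (2 * i - 1) / (2 * n_points)  # assumed midpoint rule
        while cum[k + 1] < target:
            k += 1
        # Linear interpolation of the cumulative function inside [xs[k], xs[k + 1]].
        frac = (target - cum[k]) / (cum[k + 1] - cum[k])
        points.append(xs[k] + frac * h)
    return points
```

For a constant density this yields the uniform quantizer with points at (2i − 1)/(2n), and for a density proportional to x on the unit interval it yields points at the square roots of those values.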
Although the results of Theorem 2 are only valid asymptotically, we can observe that they still provide a very accurate description of the optimal quantizers for numbers of cluster centers as low as . There is only a slight mismatch for centers that are close to the boundaries and . A similar phenomenon can be observed for the case and in Fig. 4(b), but the amount of mismatch is lower. Also, the theory can precisely predict very subtle changes in the optimal quantization points, e.g. the movement of the optimal location for the third quantization point from as grows from to . In Fig. 6, we show the average distortion performances corresponding to the cluster centers in Fig. 5. Again, quantization theory can precisely predict the average distortion performance even when is as small as . The difference between the analytical and the numerical results becomes indistinguishable for the case of cluster centers.
For , the proposed algorithms correspond to a simple scheme where one quantizes a weighted average of the noisy observations. The case is especially useful for handling datasets with outlier observations (e.g. many noiseless observations and few noisy observations). Numerical simulations corresponding to these cases will be provided elsewhere.
6 Conclusions
We have considered quantizing an -dimensional sample, which is obtained by concatenating vectors from datasets of -dimensional vectors, to a -dimensional cluster center. An application area of such a quantization strategy is when one wishes to cluster a dataset through noisy observations of each of its members. We have found analytical formulae for the average distortion performance and the optimal quantizer structure in the asymptotic regime where the number of cluster centers is large. We have shown that our clustering approach outperforms the naive approach that relies on quantizing the -dimensional noisy observation vectors to -dimensional centers.
References
[1] (2009) NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75 (2), pp. 245–248.
[2] (2005) Clustering with Bregman divergences. Journal of Machine Learning Research 6 (Oct), pp. 1705–1749.
[3] (1948) Spectra of quantized signals. The Bell System Technical Journal 27 (3), pp. 446–472.
[4] (2004) Multi-view clustering. In Fourth IEEE International Conference on Data Mining (ICDM’04), pp. 19–26.
[5] (1982) Multidimensional asymptotic quantization theory with rth power distortion measures. IEEE Transactions on Information Theory 28 (2), pp. 239–247.
[6] (2017) A survey on multi-view clustering. arXiv preprint arXiv:1712.06246.
[7] (2013) (1+ε)-approximation for facility location in data streams. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1710–1728.
[8] (1991) Characterization and detection of noise in clustering. Pattern Recognition Letters 12 (11), pp. 657–664.
[9] (2010) Multiple criteria facility location problems: a survey. Applied Mathematical Modelling 34 (7), pp. 1689–1709.
[10] (1990) A study of vector quantization for noisy channels. IEEE Transactions on Information Theory 36 (4), pp. 799–809.
[11] Quantization and clustering with Bregman divergences. Journal of Multivariate Analysis 101 (9), pp. 2207–2221.
[12] (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), pp. 179–188.
[13] (2007) Data clustering: theory, algorithms, and applications. Vol. 20, SIAM.
[14] (1995) Theoretical analysis of the high-rate vector quantization of LPC parameters. IEEE Transactions on Speech and Audio Processing 3 (5), pp. 367–381.
[15] (2012) Vector quantization and signal compression. Vol. 159, Springer Science & Business Media.
[16] Foundations of quantization for probability distributions. Springer.
[17] (1998) Quantization. IEEE Transactions on Information Theory 44 (6), pp. 2325–2383.
[18] (2017) Energy efficiency in two-tiered wireless sensor networks. In IEEE International Conference on Communications (ICC).
[19] (1999) Data clustering: a review. ACM Computing Surveys (CSUR) 31 (3), pp. 264–323.
[20] Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31 (8), pp. 651–666.
[21] (2017) Performance gains of optimal antenna deployment for massive MIMO systems. In IEEE Global Communications Conference (GLOBECOM).
[22] (2018) Power-efficient deployment of UAVs as relays. IEEE Signal Processing Advances in Wireless Communications (SPAWC).
[23] (1999) Asymptotic performance of vector quantizers with a perceptual distortion measure. IEEE Transactions on Information Theory 45 (4), pp. 1082–1091.
[24] (1980) An algorithm for vector quantizer design. IEEE Transactions on Communications 28 (1), pp. 84–95.
[25] (2016) Clustering with Bregman divergences: an asymptotic analysis. In Advances in Neural Information Processing Systems, pp. 2351–2359.
[26] (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–137.
[27] Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297.
[28] (2017) Clustering through continuous facility location problems. Theoretical Computer Science 657, pp. 137–145.
[29] (2008) A continuous facility location problem and its application to a clustering problem. pp. 1826–1831.
[30] (1997) Locational optimization problems solved through Voronoi diagrams. European Journal of Operational Research 98 (3), pp. 445–456.
[31] (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 (336), pp. 846–850.
[32] (1956) Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des Sciences, Classe III IV (12), pp. 801–804.
[33] (1990) PARVUS: an extendable package of programs for data exploration, classification and correlation, M. Forina, R. Leardi, C. Armanino and S. Lanteri, Elsevier, Amsterdam, 1988. Journal of Chemometrics 4 (2), pp. 191–193.
[34] (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. The Journal of Machine Learning Research 11, pp. 2837–2854.
[35] (1982) Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transactions on Information Theory 28 (2), pp. 139–149.