Learning Manifolds with K-Means and K-Flats

09/05/2012 · Guillermo D. Canas, et al. · MIT

We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-flats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-flats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-flats, both the results and the mathematical tools are new.

1 Introduction

Our study is broadly motivated by questions in high-dimensional learning. As is well known, learning in high dimensions is feasible only if the data distribution satisfies suitable prior assumptions. One such assumption is that the data distribution lies on, or is close to, a low-dimensional set embedded in a high-dimensional space, for instance a low-dimensional manifold. This latter assumption has proved to be useful in practice, as well as amenable to theoretical analysis, and it has led to a significant amount of recent work. Starting from [29, 40, 7], this set of ideas, broadly referred to as manifold learning, has been applied to a variety of problems, from supervised [42] and semi-supervised learning [8], to clustering [45] and dimensionality reduction [7], to name a few.

Interestingly, the problem of learning the manifold itself has received less attention: given samples from a d-manifold $\mathcal{M}$ embedded in some ambient space, the problem is to learn a set that approximates $\mathcal{M}$ in a suitable sense. This problem has been considered in computational geometry, but in a setting in which the manifold is typically a hyper-surface in a low-dimensional space (e.g. $\mathbb{R}^3$), and the data are typically not sampled probabilistically; see for instance [32, 30]. The problem of learning a manifold is also related to that of estimating the support of a distribution (see [17, 18] for recent surveys). In this context, some of the distances considered to measure approximation quality are the Hausdorff distance and the so-called excess mass distance.

The reconstruction framework that we consider is related to the work of [1, 38], as well as to the framework proposed in [37], in which a manifold is approximated by a set, with performance measured by an expected distance to this set. This setting is similar to the problem of dictionary learning (see for instance [36], and extensive references therein), in which a dictionary is found by minimizing a similar reconstruction error, perhaps with additional constraints on an associated encoding of the data. Crucially, while the dictionary is learned on the empirical data, the quantity of interest is the expected reconstruction error, which is the focus of this work.

We analyze this problem by focusing on two important and widely used algorithms, namely k-means and k-flats. The k-means algorithm can be seen to define a piecewise constant approximation of $\mathcal{M}$. Indeed, it induces a Voronoi decomposition of $\mathcal{M}$, in which each Voronoi region is effectively approximated by a fixed mean. Given this, a natural extension is to consider higher-order approximations, such as those induced by discrete collections of $d$-dimensional affine spaces (k-flats), with possibly better resulting performance. Since $\mathcal{M}$ is a $d$-manifold, the k-flats approximation naturally resembles the way in which a manifold is locally approximated by its tangent bundle.

Our analysis extends previous results for k-means to the case in which the data-generating distribution is supported on a manifold, and provides analogous results for k-flats. We note that the k-means algorithm has been widely studied, and thus much of our analysis in this case involves the combination of known facts to obtain novel results. The analysis of k-flats, however, requires developing substantially new mathematical tools.

The rest of the paper is organized as follows. In section 2, we describe the formal setting and the algorithms that we study. We begin our analysis by discussing the reconstruction properties of k-means in section 3. In section 4, we present and discuss our main results, whose proofs are postponed to the appendices.

2 Learning Manifolds

Let $\mathcal{H}$ be a Hilbert space with inner product $\langle\cdot,\cdot\rangle$, endowed with a Borel probability measure $\rho$ supported over a compact, smooth $d$-manifold $\mathcal{M}$. We assume the data to be given by a training set $X_n = \{x_1, \dots, x_n\}$, in the form of $n$ samples drawn identically and independently with respect to $\rho$.
Our goal is to learn a set $S$ that approximates the manifold well. The approximation (learning) error is measured by the expected reconstruction error

$\mathcal{E}(S) \;=\; \mathbb{E}_{x\sim\rho}\, d^2(x, S), \qquad (1)$

where the distance from a point $x$ to a set $S \subseteq \mathcal{H}$ is $d(x, S) = \inf_{x' \in S} \|x - x'\|$, with $\|\cdot\|$ the norm induced by the inner product. This is the same reconstruction measure that has been the recent focus of [37, 5, 38].

It is easy to see that any set $S \supseteq \mathcal{M}$ will have zero risk, with $\mathcal{M}$ being the “smallest” such set (with respect to set containment). In other words, the above error measure does not introduce an explicit penalty on the “size” of $S$: enlarging any given $S$ can never increase the learning error.
With this observation in mind, we study specific learning algorithms that, given the data, produce a set belonging to some restricted hypothesis space (e.g. sets of size $k$ for k-means), which effectively introduces a constraint on the size of the sets. Finally, note that the risk of Equation 1 is non-negative and, if the hypothesis space is sufficiently rich, the risk of an unsupervised algorithm may converge to zero under suitable conditions.
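As a concrete illustration of this error measure, the following minimal sketch (Python with NumPy; the function names are ours, not the paper's) estimates the reconstruction error of a candidate set $S$, given as a finite collection of points, from a finite sample:

```python
import numpy as np

def squared_distances_to_set(X, S):
    """Squared distance from each row of X to the nearest point in the finite set S.

    X: (n, D) array of samples, S: (k, D) array of candidate points.
    """
    # Pairwise squared distances, shape (n, k), then minimum over the set.
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1)

def empirical_reconstruction_error(X, S):
    """Monte Carlo / empirical estimate of E[d^2(x, S)] from samples X."""
    return squared_distances_to_set(X, S).mean()
```

Evaluated on held-out samples, this is an unbiased estimate of the expected reconstruction error of Equation 1 for a fixed $S$; evaluated on the training samples, it is the empirical risk that k-means minimizes (Equation 2 below).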

2.1 Using K-Means and K-Flats for Piecewise Manifold Approximation

In this work, we focus on two specific algorithms, namely k-means [34, 33] and k-flats [12]. Although typically discussed in the Euclidean case, their definition can be easily extended to a Hilbert space setting. The study of manifolds embedded in a Hilbert space is of special interest when considering non-linear (kernel) versions of the algorithms [20]. More generally, this setting can be seen as a limit case when dealing with high-dimensional data. Naturally, the more classical setting of an absolutely continuous distribution over a $d$-dimensional Euclidean space is simply a particular case, in which $\mathcal{H} = \mathbb{R}^d$, and $\mathcal{M}$ is a domain with positive Lebesgue measure.

K-Means. Let $\mathcal{S}_k$ be the class of sets of size $k$ in $\mathcal{H}$. Given a training set and a choice of $k$, k-means is defined by the minimization over $S \in \mathcal{S}_k$ of the empirical reconstruction error

$\mathcal{E}_n(S) \;=\; \frac{1}{n} \sum_{i=1}^{n} d^2(x_i, S), \qquad (2)$

where, for any fixed set $S$, $\mathcal{E}_n(S)$ is an unbiased empirical estimate of $\mathcal{E}(S)$, so that k-means can be seen to be performing a form of empirical risk minimization [13, 9, 37, 10].

A minimizer of Equation 2 over $\mathcal{S}_k$ is a discrete set $S_{n,k}$ of $k$ means, which induces a Dirichlet-Voronoi tiling of $\mathcal{H}$: a collection of regions, each closest to a common mean [4] (in our notation, the subscript $n$ denotes the dependence of $S_{n,k}$ on the sample, while $k$ refers to its size). By virtue of being a minimizing set, each mean must occupy the center of mass of the samples in its Voronoi region. These two facts imply that it is possible to compute a local minimum of the empirical risk by using a greedy coordinate-descent relaxation, namely Lloyd’s algorithm [33]. Furthermore, given a finite sample of size $n$, the number of locally-minimizing sets is also finite since (by the center-of-mass condition) it cannot exceed the number of possible partitions of the $n$ samples into $k$ groups, and therefore the global minimum must be attainable. Even though Lloyd’s algorithm provides no guarantees of closeness to the global minimizer, in practice it is possible to use a randomized approximation algorithm, such as kmeans++ [3], which provides guarantees of approximation to the global minimum in expectation with respect to the randomization.
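The following is a minimal sketch of this procedure (k-means++ seeding followed by Lloyd's coordinate-descent iterations), in Python with NumPy; it is our illustration, not the authors' implementation, and it omits restarts and convergence checks:

```python
import numpy as np

def kmeans_pp_seed(X, k, rng):
    """k-means++ seeding: pick means one by one, each new sample chosen with
    probability proportional to its squared distance to the means chosen so far."""
    means = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.array(means)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        means.append(X[rng.choice(len(X), p=probs)])
    return np.array(means)

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's relaxation: alternate nearest-mean assignment and center-of-mass updates."""
    rng = np.random.default_rng(seed)
    means = kmeans_pp_seed(X, k, rng)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > 0:          # keep empty clusters at their previous mean
                means[j] = pts.mean(axis=0)
    return means, labels
```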

K-Flats. Let $\mathcal{F}_k$ be the class of collections of $k$ flats (affine spaces) of dimension $d$. For any value of $k$, k-flats, analogously to k-means, aims at finding the set $F_{n,k}$ that minimizes the empirical reconstruction (2) over $\mathcal{F}_k$. By an argument similar to the one used for k-means, a global minimizer must be attainable, and a Lloyd-type relaxation converges to a local minimum. Note that, in this case, given a Voronoi partition of the samples into regions closest to each $d$-flat, new optimizing flats for that partition can be computed by a $d$-truncated PCA solution on the samples falling in each region.
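Analogously, here is a minimal sketch of the Lloyd-type relaxation for k-flats just described (our illustration, with hypothetical helper names): each flat is stored as an offset point plus an orthonormal basis of its direction, samples are assigned to the nearest flat, and each flat is refit by a rank-$d$ truncated PCA (via SVD) of the samples assigned to it.

```python
import numpy as np

def dist2_to_flat(X, mean, basis):
    """Squared distance from rows of X to the affine flat {mean + basis-span}."""
    Y = X - mean                      # center on the flat's offset
    proj = Y @ basis.T @ basis        # component lying in the flat's direction
    return ((Y - proj) ** 2).sum(axis=1)

def fit_flat(X, d):
    """Best-fit d-flat of a point cloud: centroid + top-d right singular vectors."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d]

def k_flats(X, k, d, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # initialize each flat from a random subset of d + 1 samples
    flats = [fit_flat(X[rng.choice(len(X), size=d + 1, replace=False)], d) for _ in range(k)]
    for _ in range(n_iter):
        D2 = np.stack([dist2_to_flat(X, m, B) for m, B in flats], axis=1)
        labels = D2.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > d:          # need enough points to fit a d-flat
                flats[j] = fit_flat(pts, d)
    return flats, labels
```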

2.2 Learning a Manifold with K-means and K-flats

In practice, k-means is often interpreted as a clustering algorithm, with clusters defined by the Voronoi diagram of the set of means $S_{n,k}$. In this interpretation, Equation 2 is simply rewritten by summing over the Voronoi regions, and adding all pairwise distances between samples in each region (the intra-cluster distances). For instance, this point of view is considered in [14], where k-means is studied from an information-theoretic perspective. K-means can also be interpreted as performing vector quantization, where the goal is to minimize the encoding error associated to a nearest-neighbor quantizer [23]. Interestingly, in the limit of increasing sample size, this problem coincides, in a precise sense [39], with the problem of optimal quantization of probability distributions (see for instance the excellent monograph [24]).

When the data-generating distribution is supported on a manifold $\mathcal{M}$, k-means can be seen to be approximating points on the manifold by a discrete set of means. Analogously to the Euclidean setting, this induces a Voronoi decomposition of $\mathcal{M}$, in which each Voronoi region is effectively approximated by a fixed mean (in this sense k-means produces a piecewise constant approximation of $\mathcal{M}$). As in the Euclidean setting, the limit of this problem with increasing sample size is precisely the problem of optimal quantization of distributions on manifolds, which is the subject of significant recent work in the field of optimal quantization [26, 27].

In this paper, we take the above view of k-means as defining a (piecewise constant) approximation of the manifold supporting the data distribution. In particular, we are interested in the behavior of the expected reconstruction error $\mathcal{E}(S_{n,k})$, for varying $k$ and $n$. This perspective has an interesting relation with dictionary learning, in which one is interested in finding a dictionary, and an associated representation, that allows one to approximately reconstruct a finite set of data points/signals. In this interpretation, the set of $k$ means can be seen as a dictionary of size $k$ that produces a maximally sparse representation (the k-means encoding); see for example [36] and references therein. Crucially, while the dictionary is learned on the available empirical data, the quantity of interest is the expected reconstruction error, and the question of characterizing the performance with respect to this latter quantity naturally arises.
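To make the dictionary-learning analogy concrete, the following sketch (ours, not the paper's notation) writes the k-means encoding of a point as a 1-sparse coefficient vector over the dictionary of means, so that reconstruction is a dictionary-times-code product:

```python
import numpy as np

def kmeans_encode(x, means):
    """1-sparse code of x over the dictionary of means: a one-hot indicator
    of the nearest mean."""
    j = np.argmin(((means - x) ** 2).sum(axis=1))
    code = np.zeros(len(means))
    code[j] = 1.0
    return code

def kmeans_decode(code, means):
    """Reconstruction is the dictionary (the means) applied to the sparse code."""
    return means.T @ code
```

The squared reconstruction error of this encode/decode pair is exactly $d^2(x, S)$ of Equation 1; richer dictionaries (e.g. k-flats) relax the one-hot, constant-coefficient constraint on the code.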

Since k-means produces a piecewise constant approximation of the data, a natural idea is to consider higher orders of approximation, such as approximation by discrete collections of $d$-dimensional affine spaces (k-flats), with possibly better performance. Since $\mathcal{M}$ is a $d$-manifold, the approximation induced by k-flats may more naturally resemble the way in which a manifold is locally approximated by its tangent bundle. We provide in Sec. 4.2 a partial answer to the question of whether this results in better performance.

3 Reconstruction Properties of k-Means

Since we are interested in the behavior of the expected reconstruction error (1) of k-means and k-flats for varying $k$ and $n$, before analyzing this behavior we review what is currently known about this problem from previous work. While k-flats is a relatively new algorithm whose behavior is not yet well understood, several properties of k-means are currently known.

Recall that k-means finds a discrete set $S_{n,k}$ of size $k$ that best approximates the samples in the sense of (2). Clearly, as $k$ increases, the empirical reconstruction error cannot increase, and typically decreases. However, we are ultimately interested in the expected reconstruction error, and therefore would like to understand the behavior of $\mathcal{E}(S_{n,k})$ with varying $k$.

In the context of optimal quantization, the behavior of the expected reconstruction error has been considered for an approximating set $S^*_k$ obtained by minimizing the expected reconstruction error itself over the hypothesis space $\mathcal{S}_k$. The set $S^*_k$ can thus be interpreted as the output of a population, or infinite-sample, version of k-means. In this case, it is possible to show that $\mathcal{E}(S^*_k)$ is a non-increasing function of $k$ and, in fact, to derive explicit rates. For example, in the case $\mathcal{H} = \mathbb{R}^d$, and under fairly general technical assumptions, it is possible to show that $\mathcal{E}(S^*_k) = \Theta(k^{-2/d})$, where the constants depend on $d$ and on the distribution [24].

In machine learning, the properties of k-means have been studied, for fixed $k$, by considering the excess reconstruction error $\mathcal{E}(S_{n,k}) - \mathcal{E}(S^*_k)$. In particular, this quantity has been studied for $\mathcal{H} = \mathbb{R}^d$ and shown, with high probability and up to logarithmic factors, to decay at an explicit rate in $n$ [37]. The case where $\mathcal{H}$ is a Hilbert space has been considered in [37, 10], where an upper bound of order $k/\sqrt{n}$ is proven to hold with high probability. The more general setting where the data space is a metric space has been studied in [9].

Figure 1: (Left: sphere dataset; right: MNIST dataset.) We consider the behavior of k-means for data sets obtained by sampling uniformly a low-dimensional sphere embedded in a higher-dimensional Euclidean space (left). For each value of $k$, k-means (with k-means++ seeding) is run several times, and the best solution is kept. The reconstruction performance on a (large) hold-out set is reported as a function of $k$. The results for four different training set cardinalities are reported: for a small number of points, the reconstruction error decreases sharply for small $k$ and then increases, while it is simply decreasing for larger data sets. A similar experiment, yielding similar results, is performed on subsets of the MNIST (http://yann.lecun.com/exdb/mnist) database (right). In this case the data may be thought of as concentrated around a low-dimensional manifold; for example, [28] report an average intrinsic dimension for each digit that is far smaller than the ambient dimension.
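The following sketch reproduces the qualitative protocol of this experiment under assumptions of our choosing (the ambient dimension, sphere dimension, sample sizes, grid of $k$, and number of restarts below are arbitrary, since the exact values of the figure are not reproduced here); it reuses the `lloyd_kmeans` and `empirical_reconstruction_error` sketches from Section 2.

```python
import numpy as np

def sample_sphere(n, d_sphere, D, rng):
    """n points uniform on a d_sphere-dimensional unit sphere, embedded in R^D."""
    Y = rng.standard_normal((n, d_sphere + 1))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    X = np.zeros((n, D))
    X[:, : d_sphere + 1] = Y          # isometric embedding into the first coordinates
    return X

rng = np.random.default_rng(0)
D, d_sphere = 100, 2                                  # illustrative choices
X_test = sample_sphere(5000, d_sphere, D, rng)        # large hold-out set
for n in (50, 200, 1000, 5000):                       # training set cardinalities
    X_train = sample_sphere(n, d_sphere, D, rng)
    for k in (1, 2, 5, 10, 20, 50):
        # keep the best of several k-means++ restarts, as in the figure
        runs = [lloyd_kmeans(X_train, k, seed=s) for s in range(5)]
        means, _ = min(runs, key=lambda r: empirical_reconstruction_error(X_train, r[0]))
        print(n, k, empirical_reconstruction_error(X_test, means))
```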

When analyzing the behavior of $\mathcal{E}(S_{n,k})$, and in the particular case that $\mathcal{H} = \mathbb{R}^d$, the above results can be combined to obtain, with high probability, a bound of the form

(3)

up to logarithmic factors, where the constant does not depend on $k$ or $n$ (a complete derivation is given in the Appendix). The above inequality suggests a somewhat surprising effect: the expected reconstruction properties of k-means may be described by a trade-off between a statistical error (which grows with $k$ and decreases with $n$) and a geometric approximation error (which decreases with $k$).
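To see how such a bound would be balanced, assume (this is our assumption about the missing orders, not a statement of the paper's bound) a statistical term of order $\sqrt{k/n}$ and an approximation term of order $k^{-2/d}$. Equating the two terms gives

$$\sqrt{\tfrac{k}{n}} \;=\; k^{-2/d} \;\;\Longleftrightarrow\;\; k^{\frac{1}{2}+\frac{2}{d}} = \sqrt{n} \;\;\Longleftrightarrow\;\; k \;\propto\; n^{\frac{d}{d+4}},$$

which, plugged back in, yields an overall rate of order $n^{-\frac{2}{d+4}}$: increasing $k$ beyond the balancing point makes the statistical term dominate, which is the trade-off referred to above.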

The existence of such a trade-off between the approximation and the statistical errors may itself not be entirely obvious; see the discussion in [5]. For instance, in the k-means problem, it is intuitive that, as more means are inserted, the expected distance from a random sample to the means should decrease, and one might expect a similar behavior for the expected reconstruction error. This observation naturally raises the question of whether and when this trade-off really exists, or whether it is simply an artifact of looseness in the bounds. In particular, one could ask how tight the bound (3) is.

While the bound on $\mathcal{E}(S^*_k)$ is known to be tight for sufficiently large $k$ [24], the remaining terms are derived by controlling the supremum of an empirical process,

$\sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right|, \qquad (4)$

and it is unknown whether available bounds for it are tight [37]. Indeed, it is not clear how close the distortion redundancy $\mathcal{E}(S_{n,k}) - \mathcal{E}(S^*_k)$ is to its known lower bound (in expectation) [5]. More importantly, we are not aware of a lower bound for $\mathcal{E}(S_{n,k})$ itself. Indeed, as pointed out in [5], “The exact dependence of the minimax distortion redundancy on k and d is still a challenging open problem”.

Finally, we note that, whenever a trade-off can be shown to hold, it may be used to justify a heuristic for choosing $k$ empirically as the value that minimizes the reconstruction error on a hold-out set.

In Figure 1 we perform some simple numerical simulations showing that the trade-off indeed occurs in certain regimes. The following example provides a situation where a trade-off can be easily shown to occur.

Figure 2: The optimal k-means solutions (red), computed from samples drawn uniformly on the sphere (blue), for the two configurations (a) and (b) discussed in Example 1, together with the corresponding expected squared distances to a random point.
Example 1.

Consider a setup in which two samples $x_1, x_2$ are drawn from a uniform distribution on the unit sphere in high dimension (the argument holds whenever the number of samples is much smaller than the dimension). By concentration of measure, with high probability the samples are nearly orthogonal, $\langle x_1, x_2 \rangle \approx 0$, while a third sample $x$ drawn uniformly on the sphere will also very likely be nearly orthogonal to both [31]. The k-means solution with $k = 2$ on this dataset is clearly the set $\{x_1, x_2\}$ (Fig 2(a)). However, a new sample $x$ will, with very high probability, be nearly orthogonal to both means (Fig 2(b)), and hence nearly as far from the set $\{x_1, x_2\}$ as it is from either sample. In this case, it is better to place a single mean closer to the origin than to place two means at the sample locations. This example is sufficiently simple that the exact k-means solution is known, but the effect can be observed in more complex settings.
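A short computation makes the comparison explicit, under the idealization that $x_1$, $x_2$, and a new test point $x$ are exactly pairwise-orthogonal unit vectors (in high dimension they are only nearly so, which changes the values below by a vanishing amount). For the two means placed at the samples,

$$d^2\big(x,\{x_1,x_2\}\big)\;=\;\min_{i}\|x-x_i\|^2\;=\;2-2\max_i\langle x,x_i\rangle\;=\;2,$$

while for the single mean $m=(x_1+x_2)/2$ (the optimal $k=1$ solution on the training set), $\|m\|^2=\tfrac12$ and

$$d^2\big(x,\{m\}\big)\;=\;\|x\|^2-2\langle x,m\rangle+\|m\|^2\;=\;1+\tfrac12\;=\;\tfrac32\;<\;2,$$

so, in expectation over a new sample, the single mean outperforms the two means at the training points.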

4 Main Results

Contributions. Our work extends previous results in two different directions:

  1. We provide an analysis of k-means for the case in which the data-generating distribution is supported on a manifold embedded in a Hilbert space. In particular, in this setting: 1) we derive new results on the approximation error, and 2) new sample complexity results (learning rates) arising from the choice of $k$ obtained by optimizing the resulting bound. We also analyze the case in which a solution is obtained from an approximation algorithm, such as k-means++ [3], in order to include this computational error in the bounds.

  2. We generalize the above results from k-means to k-flats, deriving learning rates obtained from new bounds on both the statistical and the approximation errors. To the best of our knowledge, these results provide the first theoretical analysis of k-flats in either sense.

We note that the k-means algorithm has been widely studied in the past, and much of our analysis in this case involves the combination of known facts to obtain novel results. In the case of k-flats, however, there is currently no known analysis, and we provide novel results as well as new bounds for each of the error terms involved.

Throughout this section we make the following technical assumption:

Assumption 1.

$\mathcal{M}$ is a smooth d-manifold with metric of class , contained in the unit ball in $\mathcal{H}$, and with volume measure denoted by $\mathcal{I}$. The probability measure $\rho$ is absolutely continuous with respect to $\mathcal{I}$, with density $p$.

4.1 Learning Rates for k-Means

The first result considers the idealized case where we have access to an exact solution for k-means.

Theorem 1.

Under Assumption 1, if is a solution of k-means then, for , there are constants and dependent only on , and sufficiently large such that, by setting

(5)

and , it is

(6)

for all , where and grows sublinearly with .

Remark 1.

Note that the distinction between distributions with a density over $\mathcal{M}$ and singular distributions is important. The bound of Equation (6) holds only when the absolutely continuous part of $\rho$ over $\mathcal{M}$ is non-vanishing. The case in which the distribution is singular over $\mathcal{M}$ requires a different analysis, and may result in faster convergence rates.

The following result considers the case where the k-means++ algorithm is used to compute the estimator.

Theorem 2.

Under Assumption 1, if is the solution of k-means++ , then for , there are constants and that depend only on , and a sufficiently large such that, by setting

(7)

and , it is

(8)

for all , where the expectation is with respect to the random choice in the algorithm, and , , and grows sublinearly with .

Remark 2.

In the particular case in which $\mathcal{M}$ is a $d$-dimensional domain contained in the unit ball of $\mathbb{R}^d$, we may further bound the distribution-dependent part of Equations 6 and 8. Using Hölder’s inequality, one obtains

(9)

where $\lambda$ is the Lebesgue measure in $\mathbb{R}^d$, and $\omega_d$ is the volume of the $d$-dimensional unit ball.
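The Hölder step can be sketched as follows, under the assumption (standard in quantization bounds of the type of Equation 16, though not stated explicitly here) that the distribution-dependent quantity in Equations 6 and 8 is $\big(\int_{\mathcal M} p^{d/(d+2)}\, d\mathcal I\big)^{(d+2)/d}$. Applying Hölder's inequality with exponents $\tfrac{d+2}{d}$ and $\tfrac{d+2}{2}$ to the density $p$ and the constant function $1$,

$$\int_{\mathcal M} p^{\frac{d}{d+2}}\, d\mathcal I \;\le\; \Big(\int_{\mathcal M} p\, d\mathcal I\Big)^{\frac{d}{d+2}} \,\mathcal I(\mathcal M)^{\frac{2}{d+2}} \;=\; \mathcal I(\mathcal M)^{\frac{2}{d+2}},$$

so the distribution-dependent quantity is bounded by $\mathcal I(\mathcal M)^{2/d}$, a purely geometric quantity; when $\mathcal M$ lies in the unit ball of $\mathbb R^d$, this is in turn controlled by the Lebesgue measure of the unit ball, matching the form of Equation 9.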

It is clear from the proof of Theorem 1 that, in this case, we may choose $k$ independently of the density $p$, to obtain a bound holding with the same probability (and similarly for Theorem 2, except for an additional term), where the constant only depends on the dimension.

Remark 3.

Note that, according to the above theorems, choosing $k$ requires knowledge of properties of the distribution underlying the data, such as the intrinsic dimension of the support. In fact, following the ideas in [43], Section 6.3-5, it is easy to prove that choosing $k$ to minimize the reconstruction error on a hold-out set allows one to achieve the same learning rates (up to a logarithmic factor), adaptively, in the sense that knowledge of properties of $\rho$ is not needed.
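A sketch of the hold-out heuristic described in this remark (our illustration; `fit` and `error` stand for any fitting routine and error estimate, e.g. the `lloyd_kmeans` and `empirical_reconstruction_error` sketches above):

```python
import numpy as np

def choose_k_by_holdout(X_train, X_holdout, k_grid, fit, error):
    """Pick the k in k_grid whose fitted model minimizes the hold-out
    reconstruction error; no knowledge of the intrinsic dimension is needed."""
    best_k, best_err, best_model = None, np.inf, None
    for k in k_grid:
        model = fit(X_train, k)            # e.g. lambda X, k: lloyd_kmeans(X, k)[0]
        err = error(X_holdout, model)      # e.g. empirical_reconstruction_error
        if err < best_err:
            best_k, best_err, best_model = k, err, model
    return best_k, best_model
```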

4.2 Learning Rates for k-Flats

To study k-flats, we need to slightly strengthen Assumption 1 by adding to it the following:

Assumption 2.

Assume the manifold to have metric of class , and finite second fundamental form II [22].

One reason for the higher-smoothness assumption is that k-flats uses higher order approximation, whose analysis requires a higher order of differentiability.
We begin by providing a result for k-flats on hypersurfaces (codimension one), and next extend it to manifolds in more general spaces.

Theorem 3.

Let $\mathcal{M}$ be a hypersurface (codimension one). Under Assumptions 1 and 2, if $F_{n,k}$ is a solution of k-flats, then there is a constant that depends only on $d$, and a sufficiently large $n$, such that, by setting

(10)

and , then for all it is

(11)

where is the total root curvature of , is the measure associated with the (positive) second fundamental form, and is the Gaussian curvature on .

In the more general case of a $d$-manifold (with metric as in Assumption 2) embedded in a separable Hilbert space $\mathcal{H}$, we cannot make any assumption on the codimension of $\mathcal{M}$ (the dimension of the orthogonal complement to the tangent space at each point). In particular, the second fundamental form II, which is an extrinsic quantity describing how the tangent spaces bend locally, is, at every $p \in \mathcal{M}$, a map (in this case of the smoothness class given by Assumption 2) from the tangent space $T_p\mathcal{M}$ to its orthogonal complement (in the notation of [22, p. 128]). Crucially, in this case, we may no longer assume the dimension of the orthogonal complement to be finite.
Denote by $\|\mathrm{II}_p\|$ the operator norm of the second fundamental form at $p$. We have:

Theorem 4.

Under Assumptions 1,2, if is a solution to the k-flats problem, then there is a constant that depends only on , and sufficiently large such that, by setting

(12)

and , then for all it is

(13)

where

Note that the better k-flats bounds stem from the higher approximation power of $d$-dimensional flats over points. Although this greatly complicates the setup and the proofs, as well as the analysis of the constants, the resulting bounds are of a faster order than the corresponding k-means bounds.

4.3 Discussion

In all the results, the final performance does not depend on the dimensionality of the embedding space (which may in fact be infinite), but only on the intrinsic dimension of the space on which the data-generating distribution is defined. The key to these results is an approximation construction in which the Voronoi regions on the manifold (the points closest to a given mean or flat) are guaranteed to have vanishing diameter in the limit of $k$ going to infinity. Under our construction, a hypersurface is approximated efficiently by tracking the variation of its tangent spaces by means of the second fundamental form. Where this form vanishes, the Voronoi regions of an approximation are not guaranteed to have vanishing diameter as $k$ goes to infinity, unless certain care is taken in the analysis.

An important point of interest is that the approximations are controlled by averaged quantities, such as the total root curvature (k-flats for surfaces of codimension one), the total curvature (k-flats in arbitrary codimensions), and a norm of the probability density (k-means), all of which are integrated over the domain where the distribution is defined. Note that quantities of this type have been linked to provably tight approximations in certain cases, such as for convex manifolds [25, 16], in contrast with worst-case methods that place a constraint on the maximum curvature, or on the minimum injectivity radius (for instance [1, 38]). Intuitively, it is easy to see that a constraint on an average quantity may be arbitrarily less restrictive than one on its maximum. A small difficult region (e.g. of very high curvature) may cause the bounds of the latter to degrade substantially, while the results presented here would not be adversely affected so long as the region is small.

Additionally, care has been taken throughout to analyze the behavior of the constants. In particular, there are no constants in the analysis that grow exponentially with the dimension, and in fact, many have polynomial, or slower growth. We believe this to be an important point, since this ensures that the asymptotic bounds do not hide an additional exponential dependence on the dimension.

References

  • [1] William K Allard, Guangliang Chen, and Mauro Maggioni. Multiscale geometric methods for data sets ii: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 1:1–38, 2011.
  • [2] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn., 75:245–248, May 2009.
  • [3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA, 2007. SIAM.
  • [4] Franz Aurenhammer. Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Comput. Surv., 23:345–405, September 1991.
  • [5] Peter L. Bartlett, Tamas Linder, and Gabor Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44:1802–1813, 1998.
  • [6] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  • [7] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
  • [8] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, 2006.
  • [9] Shai Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Mach. Learn., 66(2-3):243–257, March 2007.
  • [10] Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.
  • [11] Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Mach. Learn., 66:259–294, March 2007.
  • [12] P. S. Bradley and O. L. Mangasarian. k-plane clustering. J. of Global Optimization, 16:23–32, January 2000.
  • [13] Joachim M. Buhmann. Empirical risk approximation: An induction principle for unsupervised learning. Technical report, University of Bonn, 1998.
  • [14] Joachim M. Buhmann. Information theoretic model validation for clustering. In International Symposium on Information Theory, Austin Texas. IEEE, 2010. (in press).
  • [15] E V Chernaya. On the optimization of weighted cubature formulae on certain classes of continuous functions. East J. Approx, 1995.
  • [16] Kenneth L. Clarkson. Building triangulations using ε-nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC ’06, pages 326–335, New York, NY, USA, 2006. ACM.
  • [17] A. Cuevas and R. Fraiman. Set estimation. In New perspectives in stochastic geometry, pages 374–397. Oxford Univ. Press, Oxford, 2010.
  • [18] A. Cuevas and A. Rodríguez-Casal. Set estimation: an overview and some recent developments. In Recent advances and trends in nonparametric statistics, pages 251–264. Elsevier B. V., Amsterdam, 2003.
  • [19] Sanjoy Dasgupta and Yoav Freund. Random projection trees for vector quantization. IEEE Trans. Inf. Theor., 55:3229–3242, July 2009.
  • [20] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 551–556, New York, NY, USA, 2004. ACM.
  • [21] J. Dieudonne. Foundations of Modern Analysis. Pure and Applied Mathematics. Hesperides Press, 2008.
  • [22] M.P. DoCarmo. Riemannian geometry. Theory and Applications Series. Birkhäuser, 1992.
  • [23] Allen Gersho and Robert M. Gray. Vector quantization and signal compression. Kluwer Academic Publishers, Norwell, MA, USA, 1991.
  • [24] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2000.
  • [25] P. M. Gruber. Asymptotic estimates for best and stepwise approximation of convex bodies i. Forum Mathematicum, 15:281–297, 1993.
  • [26] Peter M. Gruber. Optimum quantization and its applications. Adv. Math, 186:2004, 2002.
  • [27] P.M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007.
  • [28] Matthias Hein and Jean-Yves Audibert. Intrinsic dimensionality estimation of submanifolds in R^d. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 289–296, 2005.
  • [29] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • [30] Ravikrishna Kolluri, Jonathan Richard Shewchuk, and James F. O’Brien. Spectral surface reconstruction from noisy point clouds. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, SGP ’04, pages 11–21, New York, NY, USA, 2004. ACM.
  • [31] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.
  • [32] David Levin. Mesh-independent surface interpolation. In Brunnett, Hamann, and Mueller, editors, Geometric Modeling for Scientific Visualization, pages 37–49. Springer-Verlag, 2003.
  • [33] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.
  • [34] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
  • [35] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. In Proceedings of the 3rd International Workshop on Algorithms and Computation, WALCOM ’09, pages 274–285, Berlin, Heidelberg, 2009. Springer-Verlag.
  • [36] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 689–696, 2009.
  • [37] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, Nov. 2010.
  • [38] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems 23, pages 1786–1794. MIT Press, 2010.
  • [39] David Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.
  • [40] ST Roweis and LK Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
  • [41] David Slepian. The one-sided barrier problem for Gaussian noise. Bell System Tech. J., 41:463–501, 1962.
  • [42] Florian Steinke, Matthias Hein, and Bernhard Schölkopf. Nonparametric regression between general Riemannian manifolds. SIAM J. Imaging Sci., 3(3):527–563, 2010.
  • [43] I. Steinwart and A. Christmann. Support vector machines. Information Science and Statistics. Springer, New York, 2008.
  • [44] G. Fejes Tóth. Sur la représentation d’une population infinie par un nombre fini d’éléments. Acta Math. Acad. Sci. Hungaricae, 1959.
  • [45] Ulrike von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395–416, 2007.
  • [46] Paul L. Zador. Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transactions on Information Theory, 28(2):139–148, 1982.

Appendix A Methodology and Derivation of Results

Although both k-means and k-flats optimize the same empirical risk, the performance measure we are interested in is that of Equation 1. Denoting by $S_{n,k}$ the empirical minimizer, by $\mathcal{E}(S^*_k)$ the best attainable performance over $\mathcal{S}_k$, and by $S^*_k$ a set for which this best performance is attained, we may bound the expected error from above as follows:

$\mathcal{E}(S_{n,k}) \;\le\; \mathcal{E}_n(S_{n,k}) + \sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right| \;\le\; \mathcal{E}_n(S^*_k) + \sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right| \qquad (14)$
$\le\; 2 \sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right| + \mathcal{E}(S^*_k), \qquad (15)$

where the second inequality in (14) holds since $\mathcal{E}_n(S_{n,k}) \le \mathcal{E}_n(S^*_k)$ by the definition of $S_{n,k}$. The same error decomposition can be considered for k-flats, by replacing $\mathcal{S}_k$ with $\mathcal{F}_k$ and $S_{n,k}$ with the k-flats minimizer $F_{n,k}$.

Equation 14 decomposes the total learning error into two terms: a uniform (over all sets in the class $\mathcal{S}_k$) bound on the difference between the empirical and true error measures, and an approximation error term. The uniform statistical error bound depends on the samples, and thus holds only with a certain probability.

In this setting, the approximation error will typically tend to zero as the class becomes larger (as $k$ increases). Note that this is true, for instance, if $\mathcal{S}_k$ is the class of discrete sets of size $k$, as in the k-means problem.

The right-hand side of Equation 14 is, through its dependence on the samples, a random variable. We will thus set out to find probabilistic bounds on the performance, as a function of the number $n$ of samples and the size $k$ of the approximation. By choosing the approximation size parameter $k$ to minimize these bounds, we obtain performance bounds as a function of the sample size.

Appendix B K-Means

We use the above decomposition to derive sample complexity bounds for the performance of the k-means algorithm. To derive explicit bounds on the different error terms, we combine some previous results and some new observations in a novel way.

Approximation error. The error $\mathcal{E}(S^*_k)$ is related to the problem of optimal quantization. The classical optimal quantization problem is quite well understood, going back to the fundamental work of [46, 44] on optimal quantization for data transmission, and, more recently, to the work of [24, 27, 26, 15]. In particular, it is known that, for distributions on $\mathbb{R}^d$ with finite moment of order $2 + \delta$, for some $\delta > 0$, it is [24]

$\lim_{k \to \infty} k^{2/d}\, \mathcal{E}(S^*_k) \;=\; C_d \left( \int_{\mathbb{R}^d} p_a^{\frac{d}{d+2}}\, d\lambda \right)^{\frac{d+2}{d}}, \qquad (16)$

where $\lambda$ is the Lebesgue measure, $p_a$ is the density of the absolutely continuous part of the distribution (according to its Lebesgue decomposition), and $C_d$ is a constant that depends only on the dimension. Therefore, the approximation error decays at least as fast as $k^{-2/d}$.

We note that, by setting $\rho$ to be the uniform distribution over the unit cube, the distribution-dependent factor in Equation 16 equals one, and thus, by making use of Zador’s asymptotic formula [46], and combining it with a result of Böröczky (see [27], p. 491), we may track the growth of the constant with the dimension for the $r$-th order quantization problem. In particular, this shows that the constant only depends on the dimension and, in our case ($r = 2$), has only linear growth in $d$, a fact that will be used in the sequel.

The approximation error of k-means is related to the problem of optimal quantization on manifolds, for which some results are known [26]. Denoting by $\mathcal{E}_{\mathcal{M}}(S^*_k)$ the best approximation error attainable among sets of $k$ means contained in $\mathcal{M}$, Theorem 5 in Appendix C implies in this case that

(17)

where $\rho$ is absolutely continuous over $\mathcal{M}$ and, by replacing $\mathcal{M}$ with a $d$-dimensional domain in $\mathbb{R}^d$, it is clear that the constant is the same as above.

Since restricting the means to lie on $\mathcal{M}$ cannot decrease the approximation error, it is $\mathcal{E}(S^*_k) \le \mathcal{E}_{\mathcal{M}}(S^*_k)$, and therefore the right-hand side of Equation 17 provides an (asymptotic) upper bound on $\mathcal{E}(S^*_k)$.

For the statistical error we use available bounds.
Statistical error. The statistical error of Equation 14, which uniformly bounds the difference between the empirical and expected errors, has been widely studied in the recent literature [37, 38, 5]. In particular, it has been shown that, for a distribution over the unit ball of $\mathcal{H}$, it is

(18)

with probability $1 - \delta$ [37]. Clearly, this implies convergence almost surely as $n \to \infty$, although this latter result was proven earlier in [39], under the less restrictive condition that $\rho$ have finite second moment.
By bringing together the above results, we obtain the bound in Theorem 1 on the performance of k-means, whose proof is given below.

Further, we can consider the error incurred by the actual optimization algorithm used to compute the k-means solution.
Computational error. In practice, the k-means problem is NP-hard [2, 19, 35], and the original Lloyd relaxation algorithm provides no guarantees of closeness to the global minimum of Equation 2. However, practical approximations, such as the k-means++ algorithm [3], exist. When using k-means++, means are inserted one by one at samples selected with probability proportional to their squared distance to the set of previously inserted means. This randomized seeding has been shown by [3] to output a set whose cost is, in expectation, within an $O(\log k)$ factor of the optimum. Once again, by combining these results, we obtain Theorem 2, whose proof is also given below.

We use the results discussed in Section A to obtain the proof of Theorem 1 as follows.

Proof.

Letting , then with probability , it is

(19)

where the parameter

(20)

has been chosen to balance the summands in the third line of Equation 19. ∎

The proof of Theorem 2 follows a similar argument.

Proof.

In the case of Theorem 2, the additional multiplicative term corresponding to the computational error incurred by the k-means++ algorithm does not affect the choice of the parameter $k$, since both summands in the third line of Equation 19 are multiplied by it in this case. Therefore, we may simply use the same choice of $k$ as in Equation 20 to obtain

(21)

with probability , where the expectation is with respect to the random choice in the algorithm. From this the bound of Theorem 2 follows. ∎

Appendix C K-Flats

Here we state a series of lemmas that we prove in the next section. For the k-flats problem, we begin by introducing a uniform bound on the difference between the empirical risk (Equation 2) and the expected risk (Equation 1).

Lemma 1.

If $\mathcal{F}_k$ is the class of sets of $k$ $d$-dimensional affine spaces then, with probability $1 - \delta$ on the sampling of the training set, it is

By combining the above result with approximation error bounds, we may produce performance bounds on the expected risk for the k-flats problem, with an appropriate choice of the parameter $k$. We distinguish between the codimension-one hypersurface case, and the more general case of a smooth manifold embedded in a Hilbert space. We begin with an approximation error bound for hypersurfaces in Euclidean space.

Lemma 2.

Assume given a smooth hypersurface $\mathcal{M} \subset \mathbb{R}^{d+1}$ with metric as in Assumption 2. If $\mathcal{F}_k$ is the class of sets of $k$ $d$-dimensional affine spaces, and $F^*_k$ is the minimizer of Equation 1 over $\mathcal{F}_k$, then there is a constant that depends on $d$ only, such that

where the bound involves the total root curvature of $\mathcal{M}$ and the measure associated with the (positive) second fundamental form, and the constant grows at most polynomially with $d$.

For the more general problem of approximating a smooth manifold in a separable Hilbert space, we begin by recalling from Section 4 the definition of the second fundamental form II and of its operator norm $\|\mathrm{II}_p\|$ at a point $p \in \mathcal{M}$. We then have:

Lemma 3.

Assume given a $d$-manifold $\mathcal{M}$, with metric as in Assumption 2, embedded in a separable Hilbert space $\mathcal{H}$. If $\mathcal{F}_k$ is the class of sets of $k$ $d$-dimensional affine spaces, and $F^*_k$ is the minimizer of Equation 1 over $\mathcal{F}_k$, then there is a constant that depends on $d$ only, such that

where the bound involves the operator norm of the second fundamental form, $\mathcal{I}$ is the volume measure over $\mathcal{M}$, and the constant grows at most polynomially with $d$.

We combine these two results into Theorems 3 and 4, whose derivation is in Appendix B.

C.1 Proofs

We begin proving the bound on the statistical error given in Lemma 1.

Proof.

We begin by finding uniform upper bounds on the difference between Equations 1 and 2 for the class $\mathcal{F}_k$ of sets of $k$ $d$-dimensional affine spaces. To do this, we will first bound the Rademacher complexity of the class.

Let two Gaussian processes indexed by $\mathcal{F}_k$ be defined by

(22)

where each element of $\mathcal{F}_k$ is the union of $k$ $d$-dimensional subspaces, each represented through the orthogonal projection onto it, and where the Gaussian sequences involved are independent, with zero mean and unit variance.

Noticing that, for any orthogonal projection $P$, it is $\|Px\|^2 = \langle P, x \otimes x \rangle_{HS}$ (see for instance [11], Sec. 2.1), where $\langle \cdot, \cdot \rangle_{HS}$ is the Hilbert-Schmidt inner product, we may verify that:

(23)

Since it is,

(24)

we may bound the Gaussian complexity as follows:

(25)

where the first inequality follows from Equation 23 and Slepian’s Lemma [41], and the second from Equation 24.
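For reference, the comparison inequality used at this step is, in the form needed here (commonly cited as Slepian's lemma; the version comparing expected suprema is also known as the Sudakov–Fernique inequality): if $(X_t)_{t\in T}$ and $(Y_t)_{t\in T}$ are centered Gaussian processes with

$$\mathbb{E}\,(X_s - X_t)^2 \;\le\; \mathbb{E}\,(Y_s - Y_t)^2 \qquad \text{for all } s, t \in T,$$

then $\mathbb{E}\sup_{t\in T} X_t \le \mathbb{E}\sup_{t\in T} Y_t$.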

Therefore the Rademacher complexity is bounded by

(26)

Finally, by Theorem 8 of [6], it is:

(27)

as desired. ∎

C.2 Approximation Error

In order to prove approximation bounds for the k-flats problem, we begin by first considering the simpler setting of a smooth $d$-manifold of codimension one in Euclidean space, and later extend the analysis to the general case.

Approximation Error: Codimension One

Assume that the ambient space is $\mathbb{R}^{d+1}$ with the natural metric, and that $\mathcal{M}$ is a compact, smooth $d$-manifold with metric of class . Since $\mathcal{M}$ is of codimension one, the second fundamental form at each point is a map from the tangent space to the reals. Assume given and . At every point $p \in \mathcal{M}$, define the metric , where

  • I and II are, respectively, the first and second fundamental forms on  [22].

  • is the convexified second fundamental form, whose eigenvalues are those of II but in absolute value. If the second fundamental form II is written in coordinates (with respect to an orthonormal basis of the tangent space) as , with orthonormal, and diagonal, then is in coordinates. Because is continuous and positive semi-definite, it has an associated measure (with respect to the volume measure.)

  • is chosen such that . Note that such always exists since:

    • implies , and

    • can be made arbitrarily large by increasing .

    and therefore there is some intermediate value of that satisfies the constraint.

In particular, from condition c), it is clear that the metric is everywhere positive definite.

Let and be the measures over $\mathcal{M}$, associated with I and . Since, by its definition, is absolutely continuous with respect to I, then so must be. Therefore, we may define to be the density of with respect to .

Consider the discrete set of size that minimizes the quantity

(28)

among all sets of points on . is the (fourth-order) quantization error over , with metric , and with respect to a weight function . Note that, in the definition of , it is crucial that the measure (), and distance () match, in the sense that is the geodesic distance with respect to the metric , whose associated measure is .

The following theorem, adapted from [26], characterizes the relation between and the quantization error on a Riemannian manifold.

Theorem 5.

[[26]] Given a smooth compact Riemannian -manifold with metric of class , and a continuous function , then

(29)

as