Clustering, the task of partitioning data into groups, is one of the fundamental tasks in machine learning [BIS06, NJW01, EKS+96, KK10, SW18]. The data that practitioners seek to cluster is commonly modeled as i.i.d. samples from a probability density function, an assumption foundational in theory, statistics, and AI [BIS06, GS16, BC17, MV10, GLR18]. A clustering algorithm on data drawn from a probability density function should ideally converge to a partitioning of the probability density function, as the number of i.i.d. samples grows large [VBB08, TS15, GS16].
In this paper, we outline a new strategy for clustering, and make progress on implementing it. We propose the following two questions to help generate new clustering algorithms:
How can we partition probability density functions? How can we do this so that two data points drawn from the same part of the partition are likely to be similar, and two data points drawn from different parts of the partition are likely to be dissimilar?
What clustering algorithms converge to such a partition, as the number of samples from the density function grows large?
In this paper, we address the first point, and make partial progress on the second. We focus on the special case of -way partitioning, which can be seen as finding a good cut on the probability density function. First, we propose a new notion of sparse (or isoperimetric) cuts on density functions. We call this an -sparse cut, for real parameters and . Next, we propose a new notion of spectral sweep cuts on probability densities, called a -spectral sweep cut, for real parameters and . We show that a -spectral sweep cut provably approximates an -sparse cut when and . In particular, is such a setting. Our result holds for any -Lipschitz probability density function on , for any . Based on past success applying sparse graph cuts to dividing data into two pieces in the machine learning setting [GS06, BN04], we believe our similarly defined -sparse cuts may have similar ideal behavior when it comes to partitioning our density function.
To our knowledge, this is the first spectral method of cutting probability density functions that has any theoretical guarantee on the cut quality. The key mathematical contribution of this paper is a new Cheeger and Buser inequality for probability density functions, which we use to prove that -spectral sweep cuts approximate -sparse cuts on probability density functions for the aforementioned settings of . These inequalities are inspired by the Cheeger and Buser inequalities on graphs and manifolds [AM84, CHE70, BUS82], which have received considerable attention in graph algorithms and machine learning [CHU97, ST04, OSV+08, OV11, KW16, BN04]. These new inequalities do not directly follow from either the graph or manifold Cheeger inequalities, something we detail in Section 1.4. We note that our Cheeger and Buser inequalities for probability density functions require a careful definition of eigenvalue and sparse/isoperimetric cut: existing definitions lead to false inequalities.
Finally, our paper will present a discrete -way clustering algorithm that we suspect converges to the -spectral sweep cut as the number of data points grows large. Our algorithm bears similarity to classic spectral clustering methods, although we note that classical spectral clustering does not have any theoretical guarantees on the cluster quality. We note that we do not prove convergence of our discrete clustering method to the -spectral sweep cut, and leave this for future work.
In this subsection, we define sparsity, eigenvalues/Rayleigh quotients, and sweep cuts.
Let be a probability density function with domain , and let be a subset of .
The -sparsity of the cut defined by is the dimensional integral of on the cut, divided by the dimensional integral of on the side of the cut where this integral is smaller.
The -Rayleigh quotient of with respect to is:
A -principal eigenvalue of is , where:
Define a -principal eigenfunction of to be a function such that .
Now we define a sweep cut for a given function with respect to a positive-valued function supported on :
Let be two real numbers, and be any function from to . Let be any function from , and let be the cut defined by the set .
The sweep-cut algorithm for with respect to returns the cut of minimum sparsity, where this sparsity is measured with respect to .
When is a -principal eigenfunction, the sweep cut is called a -spectral sweep cut of .
A function is -Lipschitz if for all
A function is -integrable if is well defined and finite. Throughout this paper, we assume is always -integrable.
1.2 Past Work
Spectral Clustering and Sweep Cut Algorithms on Data
The spectral clustering algorithms of Shi and Malik [SM97] and those of Ng, Jordan, and Weiss [NJW01] are some of the most popular clustering algorithms on data (over 10,000 citations). If we want to split data points into two clusters, their algorithm works as follows: for data points, compute an matrix on the data, and compute the principal eigenvector of the matrix. Then, find a threshold value such that all points where are considered to be on one side of the cut, and all other points where are on the other. Often, the matrix is a Laplacian matrix of some graph built from the data [VON07].
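The procedure just described can be sketched in a few lines. This is only an illustration of the classical recipe, not the algorithm proposed in this paper. The Gaussian similarity kernel, the `bandwidth` parameter, the use of the unnormalized Laplacian, the choice of the first nontrivial eigenvector, and the fixed threshold of zero are all assumptions made for the sketch; the function name `two_way_spectral_cut` is hypothetical.

```python
import numpy as np

def two_way_spectral_cut(points, bandwidth=2.0):
    """Split points into two groups by thresholding a Laplacian eigenvector.

    Assumptions (not from the paper): Gaussian similarities, unnormalized
    Laplacian, threshold fixed at zero.
    """
    # Pairwise squared Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Gaussian similarity matrix with zero diagonal (no self-loops).
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(weights, 0.0)
    # Unnormalized graph Laplacian L = D - W.
    laplacian = np.diag(weights.sum(axis=1)) - weights
    # eigh returns eigenvalues in ascending order, so column 1 is the
    # eigenvector of the smallest nontrivial eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    eigenvector = eigvecs[:, 1]
    # Points above the threshold go to one side of the cut.
    return eigenvector > 0.0

# Two well-separated synthetic blobs, 20 points each.
rng = np.random.default_rng(0)
left = rng.normal(loc=-3.0, scale=0.3, size=(20, 2))
right = rng.normal(loc=3.0, scale=0.3, size=(20, 2))
points = np.vstack([left, right])
labels = two_way_spectral_cut(points)
```

The sign of an eigenvector is arbitrary, so which blob receives the label `True` is not meaningful; only the induced partition is.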
Von Luxburg, Belkin, and Bousquet [VBB08] proved that if the data is modeled as i.i.d. samples from a probability density , the matrix is a Laplacian matrix with certain structural assumptions, and certain regularity assumptions on hold, then classical spectral clustering algorithms converge to a -spectral sweep cut on (we note that these authors used different terminology to describe this result, as their papers did not define -spectral sweep cuts). These results were refined in [RBV10, GS15, TS15]. We note that there are no sparsity guarantees known for a -spectral sweep cut, and we show -Lipschitz examples where this spectral sweep cut leads to undesirable behavior.
Cheeger Inequality and Sparse Cuts in Graphs
In 1984, Alon and Milman discovered a graph Cheeger inequality [AM84], which showed that a graph spectral sweep cut is approximately sparse. For a formal definition of sparse (or isoperimetric) cuts in a graph theoretic setting, see [AM84]. The graph Cheeger inequality has guided decades of algorithmic and theoretical research on graph partitioning, random walks, and spectral graph theory in general [CHU97, KW16, OSV+08, OV11, ST04].
The graph Cheeger inequality implies that partitioning a graph based on the principal graph eigenvector (via spectral-sweep methods) will find a provably sparse cut [CHU97, OV11, OSV+08, LRT+12, LGT14a]. It also implies that a slowly mixing random walk is likely to yield good information about a sparse graph cut; this intuition was leveraged in Spielman and Teng’s seminal nearly-linear time algorithm on graph partitioning in [ST04]. Cheeger’s inequality for graphs has been an inspiration for decades of spectral graph theory research (for more information, see [CHU97]).
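To make the graph statement concrete, the following sketch computes a spectral sweep cut on a small graph and checks it against the standard normalized-Laplacian form of the graph Cheeger inequality, lambda_2 / 2 <= phi <= sqrt(2 * lambda_2), where phi denotes conductance. The barbell test graph, the use of conductance as the sparsity measure, and the helper names `conductance` and `spectral_sweep_cut` are choices made for this illustration; they are not the density-based definitions introduced in this paper.

```python
import numpy as np

def conductance(weights, mask):
    """Cut weight divided by the smaller of the two side volumes."""
    degrees = weights.sum(axis=1)
    cut = weights[mask][:, ~mask].sum()
    return cut / min(degrees[mask].sum(), degrees[~mask].sum())

def spectral_sweep_cut(weights):
    """Return (lambda_2, best conductance, best mask) over all sweep prefixes."""
    degrees = weights.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Normalized Laplacian I - D^{-1/2} W D^{-1/2}.
    norm_lap = np.eye(len(degrees)) - d_inv_sqrt[:, None] * weights * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(norm_lap)
    lambda2 = eigvals[1]
    # Sort vertices by the rescaled second eigenvector and try every prefix.
    order = np.argsort(d_inv_sqrt * eigvecs[:, 1])
    best_phi, best_mask = np.inf, None
    for k in range(1, len(order)):
        mask = np.zeros(len(order), dtype=bool)
        mask[order[:k]] = True
        phi = conductance(weights, mask)
        if phi < best_phi:
            best_phi, best_mask = phi, mask
    return lambda2, best_phi, best_mask

# Barbell graph: two 5-cliques joined by a single edge.
n = 5
weights = np.zeros((2 * n, 2 * n))
weights[:n, :n] = 1.0
weights[n:, n:] = 1.0
np.fill_diagonal(weights, 0.0)
weights[n - 1, n] = weights[n, n - 1] = 1.0
lambda2, phi, mask = spectral_sweep_cut(weights)
```

On this graph the sweep recovers the single bridging edge, which separates the two cliques.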
Sparse cuts have been researched extensively in graph theory [LR88, ARV04, CGR05, AP09, MAD10, LRT+12, KW16]. Sparse cuts, and the related notion of balanced cuts, have been used to generate fast multicommodity flow algorithms [LR88, ARV04]. These cuts are deeply related to expander decomposition [ST04, WUL17, SW19], which in turn has proven immensely useful in algorithm design. For a brief overview of the application of sparse and balanced cuts in computer science theory, see the introduction of [MAD10]. For a survey of uses of expander decomposition, see the introduction of [SW19].
We note that expander decompositions are based on the idea that sparse cuts form natural partitions of the graph into clustered components. This intuition is leveraged in machine learning [BN04].
Cheeger’s Inequality and Sparse Cuts for Manifolds
The Cheeger inequality on a manifold states that the fundamental eigenvalue of the Laplace-Beltrami operator is bounded below by the square of the sparsity of the sparsest (isoperimetric) cut, divided by (see [CHE70] for details). In 1982, Buser proved an upper bound for provided that the manifold has lower bounded Ricci curvature [BUS82]. In contrast to the graph case, this inequality is false without the curvature assumption. This inequality has been widely used in the theory of differential geometry [VSC08, HK00]. Intuition based on manifold Cheeger and manifold Buser has been used to great effect in semi-supervised learning and image processing [BN04, GS06].
The Cheeger-Buser Inequality and Sparse Cuts for Convex Bodies and Density Functions
The Cheeger-Buser inequality, and the related notion of sparse cuts, have seen renewed interest in the computer science literature [LS90, LV18a, MIL09]. Past researchers have used ideas inspired by sparse cuts and the Cheeger-Buser inequality to recover fast mixing time lemmas for random walks for log-concave density functions supported on convex polytopes [LS90, DFK91, KLS95, LV18a, GM83]. These lemmas have in turn been used to find fast sampling algorithms from such density functions [LV18a], and more.
One prominent use of sparse cuts in convex geometry is the celebrated Kannan-Lovász-Simonovits Conjecture (KLS Conjecture). This conjecture asserts that, for a convex body, the sparsity of the sparsest hyperplane cut is a dimensionless constant approximation of the sparsity of the optimal sparse cut [KLS95, LV18b]. Generally, these results all implicitly use the settings for their definition of sparse cuts, eigenvalues, and eigenvectors. They operate in the setting where there is significant structure on the probability density, such as log-concavity [LV18a]. We note that the Cheeger-Buser inequality fails for this setting of when the probability density is Lipschitz but not log-concave. For a survey on uses of Cheeger’s inequality in convex polytope theory and log-concave density function sampling, see [LV18b].
Discrete Machine Learning from Continuous Methods
Our overall approach to clustering follows the line of work that generates discrete machine learning methods by analyzing their behavior in the limit. This approach was used fruitfully by Su, Boyd, and Candes [SBC14] and Wibisono, Wilson, and Jordan [WWJ16] to generate faster gradient descent variants based on continuous time differential equations. For more information, refer to their respective papers.
Our paper has three core contributions:
A natural method for cutting probability density functions, based on a new notion of sparse cuts on density functions.
A Cheeger and Buser inequality for Lipschitz probability density functions, and
A clustering algorithm operating on samples, which heuristically approximates a spectral sweep cut on the density function when the number of samples grows large.
We emphasize that our primary contributions are points 1 and 2, which are formally stated in Theorems 1.4 and 1.5 respectively. Our clustering algorithm on samples, which is designed to approximate the -spectral sweep cut on the density function as the number of samples grows large, is of secondary importance.
We now state our two main theorems.
Spectral Sweep Cuts give Sparse Cuts:
Let be an -Lipschitz probability density function, and let .
The -spectral sweep cut of has sparsity satisfying:
Here, refers to the optimal sparsity of a cut on .
In words, the spectral sweep cut of the eigenvector gives a provably good approximation to the sparsest cut, as long as and . Proving this result is a straightforward application of two new inequalities we present, which we will refer to as the Cheeger and Buser inequalities for probability density functions.
Probability Density Cheeger and Buser:
Let be an -Lipschitz density function. Let .
Let be the infimum -sparsity of a cut through , and let be the -principal eigenvalue of . Then:
The first inequality is Probability Density Cheeger, and the second inequality is Probability Density Buser.
Note that we don’t need to have a total mass of for any of our proofs. The overall probability mass of can be arbitrary.
We note that Theorem 1.5 has a partial converse:
If or , then the Cheeger-Buser inequality in Theorem 1.5 does not hold.
In particular, if or , no Cheeger-Buser inequality can hold for any . These settings of and encompass most past work on spectral cuts and Cheeger-Buser inequalities on probability density functions.
Finally, we give a discrete algorithm 1,3-SpectralClustering for clustering data points into two clusters. We conjecture, but do not prove, that 1,3-SpectralClustering converges to the -spectral sweep cut of the probability density function as the number of samples grows large.
We note that this resembles the unnormalized spectral clustering based on the work of Shi and Malik [SM97] and Ng, Jordan, and Weiss [NJW01, TS15]. The major difference is that we build our Laplacian from the matrix . Past work on spectral clustering builds the Laplacian from the matrices or [VON07, TS15].
1.4 Differences between our work and past work
Our work differs from past work in the following key ways:
Our work differs from past practical work on spectral sweep cuts [GS15, TS15, SM97, NJW01], as those methods perform what we call a -sweep cut. These sweep cuts have no theoretical guarantees, much less a guarantee on their sparsity. Lemma 1.6 shows that no Cheeger and Buser inequality can simultaneously hold for any setting of when .
We will further show that using a -sweep cut can lead to undesirable cuts of -Lipschitz probability densities, with poor sparsity guarantees.
We note that probability density Cheeger-Buser is not easily implied by graph or manifold Cheeger-Buser. For a lengthier discussion on this, see Appendix A.
For our work, the probability density is not required to be bounded away from . This is a sharp departure from many existing results: past results on partitioning probability densities required a positive lower bound on [VBB08, GS15]. The strongest results in fields like linear elliptic partial differential equations depend on being bounded away from [WHI17].
Our work is the first spectral sweep cut algorithm that guarantees a sparse cut on Lipschitz densities , without requiring strong parametric assumptions on .
1.4.1 Technical Contribution
The key technical contribution of our proof is proving Buser’s inequality on Lipschitz probability densities via mollification [SOB38, FRI44] with disks of varying radius. To our knowledge, this paper is the first time mollification with disks of varying radius has been used. We emphasize that the most difficult part of our paper is proving the Buser inequality.
Mollification has a long history in mathematics dating back to Sergei Sobolev’s celebrated proof of the Sobolev embedding theorem [SOB38]. It is one of the key tools in numerical analysis, partial differential equations, fluid mechanics, and functional analysis [FRI44, LW01, SS09, MON03], and analogs of mollification have been used in computational complexity settings [DKN10]. Informally speaking, mollification is used to create a series of smooth functions approximating a non-smooth function, by convolving the original function with a smooth function supported on a disk. Notably, an approach using convolution is used by Buser in [BUS82] to prove the original Buser’s inequality, albeit with an intricate pre-processing step on any given cut.
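As a purely illustrative sketch of the classical, fixed-radius case, the code below mollifies the indicator function of a half-line in one dimension by convolving it with a smooth bump. The grid, the radius of 0.2, the particular bump exp(-1 / (1 - t^2)), and the discrete row normalization are all assumptions made for the example, and the names `bump` and `mollify` are hypothetical.

```python
import numpy as np

def bump(t):
    """Smooth bump supported on (-1, 1): exp(-1 / (1 - t^2)) inside, 0 outside."""
    out = np.zeros_like(t, dtype=float)
    inside = np.abs(t) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

def mollify(values, grid, radius):
    """Local average of sampled values over a window of fixed radius."""
    offsets = grid[:, None] - grid[None, :]          # x - y for every pair
    kernel = bump(offsets / radius)
    kernel /= kernel.sum(axis=1, keepdims=True)      # unit discrete mass per row
    return kernel @ values

grid = np.linspace(-1.0, 1.0, 401)
indicator = (grid > 0.0).astype(float)               # step function at x = 0
smooth = mollify(indicator, grid, radius=0.2)
```

The mollified function agrees with the indicator away from the jump and replaces the jump by a smooth, monotone transition whose width is twice the radius.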
To prove Buser’s inequality on Lipschitz probability density functions , we will show that given a cut with low -sparsity, we can find a function with low -Rayleigh quotient. We build by starting with the indicator function for cut (which is on one side of the cut and on the other). Next, we mollify this function with disks of varying radius. In particular, for each point in the domain of , we spread out the point mass over a disk of radius proportional to , where L is the Lipschitz constant of . The resulting function obtained by ‘spreading out’ will have low -Rayleigh quotient.
For all past uses of mollification, the disks on which the smooth convolving function is supported (we call this the mollification disk) have the same radius throughout the manifold. The use of a uniform radius disk is critical for most uses and proofs in mollification. Our contribution is to allow the disks to vary in radius across our density. This variation in radius allows us to deal with functions that approach (and explains the importance of the density being Lipschitz). No mollification disks centered anywhere in our probability density will intersect the -set of the density. This overcomes significant hurdles in many results for functional analysis and PDEs, as many past significant results related to partial differential equations rely on having a positive lower bound on the density [WHI17, GS15].
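The effect of letting the radius vary can be illustrated in one dimension. The sketch below is not the construction used in the proof. It assumes a radius of the form theta * rho(x) / L with 0 < theta < 1, which is our reading of the construction for the purposes of this example, and it uses the hypothetical density rho(x) = |x| on [-1, 1] (Lipschitz constant L = 1), which vanishes at the origin. The names `bump`, `mollify_variable_radius`, and `radius_at` are likewise hypothetical.

```python
import numpy as np

def bump(t):
    """Smooth bump supported on (-1, 1)."""
    out = np.zeros_like(t, dtype=float)
    inside = np.abs(t) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

def mollify_variable_radius(values, grid, radius_at):
    """Local average at each x over a window whose radius depends on x."""
    radii = radius_at(grid)                          # one radius per center
    offsets = grid[:, None] - grid[None, :]
    kernel = bump(offsets / radii[:, None])
    kernel /= kernel.sum(axis=1, keepdims=True)      # unit discrete mass per row
    return kernel @ values

# Assumed example density rho(x) = |x| on [-1, 1]; it integrates to 1.
theta, lipschitz = 0.5, 1.0
radius_at = lambda x: theta * np.abs(x) / lipschitz  # assumed radius rule

grid = np.linspace(-1.0, 1.0, 400)                   # even count: 0 is not a grid point
radii = radius_at(grid)
indicator = (grid > 0.5).astype(float)               # cut placed at x = 0.5
smooth = mollify_variable_radius(indicator, grid, radius_at)

# Distance from each center to the zero set {0}, minus the window radius.
margin = np.abs(grid) - radii
```

Under these assumptions `margin` is strictly positive everywhere, so no averaging window reaches the point where the density vanishes; the windows simply shrink as the density approaches zero.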
Proving our Buser inequality using mollification by disks of varying radius requires a fairly delicate proof with many pages of calculus. Our key technical lemma is a bound on the norm of a mollified function when the mollification disks have varying radii, which can be found in Section 3.3.
2 Paper Organization
We prove the Buser inequality in Section 3, via a rather extensive series of calculus computations. Our proof relies on a key technical lemma, which is presented in Section 3.3. This is by far the most difficult part of our proof.
We prove the Cheeger inequality in Section 4. The proof in this section implies that the sparsity of the spectral sweep cut of a probability density function is provably close to the principal eigenvalue of . We note that this inequality does not depend on the Lipschitz nature of the probability density function.
In Section 6, we go over example -D distributions that show that either Cheeger or Buser inequality must fail for past definitions of sparsity and eigenfunctions. We will prove Lemma 1.6 in this section.
In Section 7, we show an example Lipschitz probability density where the spectral sweep cut has bad sparsity for any , and will lead to an undesirable cut (from a clustering point of view) on this density function. This is important since the spectral clustering algorithm of Ng et al. [NJW01] is known to converge to a spectral sweep cut on the underlying probability density function, as the number of samples grows large [TS15].
Finally, we state conclusions and open problems in Section 8.
In the appendix, we note that the Cheeger and Buser inequalities for probability densities are not easily implied by graph or manifold Cheeger-Buser. We also provide a simplified version of Cheeger’s and Buser’s inequality for probability densities, in the -dimensional case. This may make easier reading for those unfamiliar with technical multivariable mollification.
3 Buser Inequality for Probability Density Functions
The key idea to proving Buser’s inequality is as follows: given , and a cut where , we will build a function whose -Rayleigh quotient is close to the sparsity of .
Roughly speaking, is built by convolving at point with a unit-weight disk with radius proportional to . Thus, we are convolving with a disk, where the radius of the disk varies between points , and the radius is directly proportional to .
3.1 Weighted Buser-type Inequality
We now prove our weighted Buser-type inequality, from Theorem 1.5. We state our result in terms of general .
Let be an -Lipschitz function, be a -principal eigenvalue, and the isoperimetric cut.
We note that when setting , the above expression simplifies into:
3.2 Proof Strategy: Mollification by Disks of Radius Proportional to
To prove Theorem 3.1 we construct an approximation of for which the numerator and denominator of the Rayleigh quotient, , approximate respectively the numerator and denominator of this expression. Specifically, will be constructed as a mollification of . Recall the following two equivalent definitions of a mollification. They are equivalent by the change of variables .
with a parameter to be chosen and a smooth radially symmetric function supported in the unit open ball with unit mass . When is constant it follows from the Tonelli theorem that ; when is not constant the following lemma shows that the latter still bounds the former.
3.3 Key Technical Lemma: Bounding norm of a function with the norm of its mollification
The following is our primary technical lemma, which roughly bounds the norm of a mollified function by the norm of the original . Here, the mollification radius is determined by a function .
Let be Lipschitz continuous with Lipschitz constant for almost every . Let be smooth, , and . Then
(of Lemma 3.2) An application of Tonelli’s theorem shows
Fix and consider the change of variables . The Jacobian of this mapping is which by Sylvester’s determinant theorem has determinant . It follows that
and the lemma follows since . ∎
(Here, denotes the dot product between and .)
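The determinant step can be sanity-checked numerically. The sketch below assumes the Jacobian has the form of the identity plus an outer product of two vectors, which is what the appeal to Sylvester's determinant theorem and the dot-product remark suggest; in that rank-one case the theorem reads det(I + u v^T) = 1 + v . u. The vectors `u` and `v` are random placeholders, not the quantities appearing in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
u = rng.normal(size=dim)   # placeholder vector
v = rng.normal(size=dim)   # placeholder vector

# Rank-one update of the identity, the assumed shape of the Jacobian.
jacobian_like = np.eye(dim) + np.outer(u, v)

lhs = np.linalg.det(jacobian_like)   # determinant computed directly
rhs = 1.0 + np.dot(v, u)             # value predicted by the rank-one identity
```

The two values agree up to floating-point error, which is why the determinant of such a Jacobian reduces to one plus a single dot product.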
We present the following simple corollary, which is the primary way our proof makes use of Lemma 3.2.
For any Lipschitz continuous function with Lipschitz constant and any with , we have:
This corollary will be used to bound the numerator of our Rayleigh quotient. Note that the expression
We present another simple corollary whose proof is equally straightforward. This corollary will be used to bound the denominator, and is a small generalization of Corollary 3.3. We write down both corollaries anyhow, since this will make it easier to interpret our bounds on the Rayleigh quotient.
For any Lipschitz continuous function with Lipschitz constant , any , and any with , we have:
Now we are ready to prove our main Theorem, which is the Buser inequality for probability densities stated in Theorem 3.1.
(of Theorem 3.1)
Fix with and let
be the characteristic function of. Setting to be the weighted average of ,
Since and it follows that
In the calculations below, we omit the limiting argument with smooth approximations of outlined at the beginning of this section, which justifies formulas involving , and for readability we frequently write and for and .
Next, let be the mollification of (an extension of) given by equation (1). Then is a local average of so , and . Letting denote the Lipschitz constant of , the parameter will be chosen less than so that Lemma 3.2 is applicable with constant .
The remainder of the proof upper bounds the numerator of the Rayleigh quotient for by and lower bounds the denominator by . The conclusion of the theorem then follows from equation (5).
3.4 Upper Bounding the Numerator
To bound the norm in the numerator of the Rayleigh quotient by the norm in the numerator of the expression for it is necessary to obtain a uniform bound on .
Let be any function, and let be defined as in Equation 1. Let be an -Lipschitz function.
In order to prove this lemma, we first need to get a handle on , which is the gradient of after mollification by .
We take the second representation of in equation (1) to get
which is a consequence of the multivariable chain rule. Here, refers to the outer product of and .
Multiplying by gives:
Now, we can bound the above equation by carefully bounding each part. We note:
where the last step follows by a simple change of variable. Here, we note that
is a vector, and the integral is over, which is how we eliminated from the expression.
Next, we examine the term:
Here, we aim to bound the operator norm of this matrix. Here, we note that
and thus, when the latter equation holds, we can say:
Since , we now have:
where is the Lipschitz constant for . We note that Section 3.7 shows that
Now we turn our attention to the first term, which is:
We note that
by our definition of (which was defined when we defined ). Combining this with , we get:
This allows us to bound :
where we make use of the fact that This completes our proof. ∎
Next, we want an bound on .
Let be any function, and let be defined as in Equation 1. Let be an -Lipschitz function, and let .
First, we take the gradient of the first representation of in equation (1). Using the chain rule gives us an alternate form for :
The ratio in the integrand is bounded using the Lipschitz assumption on (and ),
where represents the matrix norm of . This is because , and , and every time , and thus
For any -Lipschitz distribution , any function , and any such that :
We note that in the case where , and if is a step function, the expression would simplify to:
3.5 Lower Bound on the Denominator
Let and be the –weighted averages of and and let denote the space with this weight. Our core lemma is a bound on in terms of and weighted norms of and respectively.
Let be an -Lipschitz function, and let be such that . Let be an indicator function of a set with finite -perimeter. Let be defined as in Equation 1, and be defined as . Then:
Note that when , as is true when , the inequality in Lemma 3.8 becomes:
The key to this proof is to upper bound the quantity with the expression appearing in Corollary 3.4. We will do so by a series of inequalities, application of the fundamental theorem of calculus, and more.
Using the property that subtracting the average from a function reduces the norm it follows that
If then , so a lower bound for the denominator of the Rayleigh quotient
where the identity from Equation 4, and the bound , were used in the last step.
It remains to estimate the difference . To do this, we use the multivariable fundamental theorem of calculus to write
where the first and second equalities came from application of the multivariable fundamental theorem of calculus, and the last equation is straightforward. This tells us that: