1 Introduction
Clustering, the task of partitioning data into groups, is one of the fundamental tasks in machine learning
[BIS06, NJW01, EKS+96, KK10, SW18]. The data that practitioners seek to cluster is commonly modeled as i.i.d. samples from a probability density function, an assumption foundational in theory, statistics, and AI [BIS06, GS16, BC17, MV10, GLR18]. A clustering algorithm on data drawn from a probability density function should ideally converge to a partitioning of the probability density function, as the number of i.i.d. samples grows large [VBB08, TS15, GS16]. In this paper, we outline a new strategy for clustering, and make progress on implementing it. We propose the following two questions to help generate new clustering algorithms:

How can we partition probability density functions? How can we do this so that two data points drawn from the same part of the partition are likely to be similar, and two data points drawn from different parts of the partition are likely to be dissimilar?

What clustering algorithms converge to such a partition, as the number of samples from the density function grows large?
In this paper, we address the first point, and make partial progress on the second. We focus on the special case of two-way partitioning, which can be seen as finding a good cut on the probability density function. First, we propose a new notion of sparse (or isoperimetric) cuts on density functions. We call this a sparse cut, for real parameters and . Next, we propose a new notion of spectral sweep cuts on probability densities, called a spectral sweep cut, for real parameters and . We show that a spectral sweep cut provably approximates a sparse cut when and . In particular, is such a setting. Our result holds for any Lipschitz probability density function on , for any . Based on past success applying sparse graph cuts to dividing data into two pieces in the machine learning setting [GS06, BN04], we believe our similarly defined sparse cuts may behave similarly well when it comes to partitioning our density function.
To our knowledge, this is the first spectral method of cutting probability density functions that has any theoretical guarantee on the cut quality. The key mathematical contribution of this paper is a new Cheeger and Buser inequality for probability density functions, which we use to prove that spectral sweep cuts approximate sparse cuts on probability density functions for the aforementioned settings of . These inequalities are inspired by the Cheeger and Buser inequalities on graphs and manifolds [AM84, CHE70, BUS82], which have received considerable attention in graph algorithms and machine learning [CHU97, ST04, OSV+08, OV11, KW16, BN04]. These new inequalities do not directly follow from either the graph or manifold Cheeger inequalities, something we detail in Section 1.4. We note that our Cheeger and Buser inequalities for probability density functions require a careful definition of eigenvalue and sparse/isoperimetric cut: existing definitions lead to false inequalities.
Finally, our paper will present a discrete two-way clustering algorithm that we suspect converges to the spectral sweep cut as the number of data points grows large. Our algorithm bears similarity to classic spectral clustering methods, although we note that classical spectral clustering does not have any theoretical guarantees on the cluster quality. We note that we do not prove convergence of our discrete clustering method to the spectral sweep cut, and leave this for future work.
1.1 Definitions
In this subsection, we define sparsity, eigenvalues/Rayleigh quotients, and sweep cuts.
Definition 1.1.
Let be a probability density function with domain , and let be a subset of .
The sparsity of the cut defined by is the dimensional integral of on the cut, divided by the dimensional integral of on the side of the cut where this integral is smaller.
Definition 1.2.
The Rayleigh quotient of with respect to is:
A principal eigenvalue of is , where:
Define a principal eigenfunction of to be a function such that .
Now we define a sweep cut for a given function with respect to a positive-valued function supported on :
Definition 1.3.
Let be two real numbers, and be any function from to . Let be any function from , and let be the cut defined by the set .
The sweep cut algorithm for with respect to returns the cut of minimum sparsity, where this sparsity is measured with respect to .
When is a principal eigenfunction, the sweep cut is called a spectral sweep cut of .
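To make the sweep-cut recipe of Definition 1.3 concrete, here is a minimal numerical sketch in one dimension. The sparsity convention used below (the density at the cut point raised to a power alpha, divided by the smaller side's integral of the density raised to beta, with (alpha, beta) = (1, 3)) is my assumption, suggested by the paper's later "1,3" naming; the authoritative definitions are Definitions 1.1-1.3. The sweep function f is taken to be increasing, so each cut boundary is a single point.

```python
import numpy as np

def sweep_cut_1d(rho, f, xs, alpha=1.0, beta=3.0):
    """Sweep cut of a 1-D density rho sampled on the grid xs.

    The cut at threshold t is {x : f(x) <= t}.  Assumed sparsity
    convention: rho at the cut point raised to alpha, divided by the
    smaller side's integral of rho**beta.  Assumes f is increasing,
    so the boundary of each cut is the single point xs[i].
    """
    dx = xs[1] - xs[0]
    w = rho ** beta
    best_t, best_sparsity = None, np.inf
    for i in np.argsort(f)[1:-1]:        # skip degenerate end cuts
        left = f <= f[i]
        denom = min(w[left].sum(), w[~left].sum()) * dx
        if denom <= 0:
            continue
        sparsity = rho[i] ** alpha / denom
        if sparsity < best_sparsity:
            best_t, best_sparsity = f[i], sparsity
    return best_t, best_sparsity

# Bimodal Lipschitz density with a low-density valley at x = 0.5;
# sweep the coordinate function f(x) = x.
xs = np.linspace(0.0, 1.0, 2001)
rho = 0.05 + np.exp(-((xs - 0.25) / 0.08) ** 2) \
           + np.exp(-((xs - 0.75) / 0.08) ** 2)
t_best, _ = sweep_cut_1d(rho, xs, xs)    # should land in the valley
```

Here the minimizing threshold sits where the density is small and the two sides carry comparable mass, which matches the intuition that sparse cuts should pass through low-density regions.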
Additional Definitions:
A function is Lipschitz if for all .
A function is integrable if is well defined and finite. Throughout this paper, we assume is always integrable.
1.2 Past Work
Spectral Clustering and Sweep Cut Algorithms on Data
The spectral clustering algorithms of Shi and Malik [SM97] and those of Ng, Jordan, and Weiss [NJW01] are some of the most popular clustering algorithms on data (over 10,000 citations). If we want to split data points into two clusters, their algorithm works as follows: for data points, compute an matrix
on the data, and compute the principal eigenvector
of the matrix. Then, find a threshold value such that all points where are considered to be on one side of the cut, and all other points where are on the other. Often, the matrix is a Laplacian matrix of some graph built from the data [VON07]. Von Luxburg, Belkin, and Bousquet [VBB08] proved that if the data is modeled as i.i.d. samples from a probability density , the matrix is a Laplacian matrix with certain structural assumptions, and certain regularity assumptions on hold, then classical spectral clustering algorithms converge to a spectral sweep cut on (we note that these authors used different terminology to describe this result, as their papers did not define spectral sweep cuts). These results were refined in [RBV10, GS15, TS15]. We note that there are no sparsity guarantees known for such a spectral sweep cut, and we show Lipschitz examples where this spectral sweep cut leads to undesirable behavior.
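A minimal runnable version of this pipeline (an illustrative sketch in the spirit of [SM97, NJW01, VON07], not a faithful reproduction of either algorithm; the Gaussian similarity, the unnormalized Laplacian, and the conductance objective are my choices) looks like:

```python
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    """Two-way spectral clustering sketch: Gaussian similarities,
    unnormalized Laplacian L = D - W, then a threshold sweep on the
    second-smallest eigenvector, minimizing graph conductance."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    L = np.diag(deg) - W
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    v = vecs[:, 1]                       # second-smallest eigenvector
    best_cut, best_cond = None, np.inf
    for t in np.sort(v)[:-1]:            # sweep thresholds between points
        S = v <= t
        cut = W[S][:, ~S].sum()          # weight crossing the cut
        vol = min(deg[S].sum(), deg[~S].sum())
        if vol > 0 and cut / vol < best_cond:
            best_cond, best_cut = cut / vol, S.copy()
    return best_cut

# Two well-separated blobs; the sweep cut should separate them.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.3, size=(20, 2))
B = rng.normal(5.0, 0.3, size=(20, 2))
labels = spectral_bipartition(np.vstack([A, B]))
```

The threshold sweep is exactly the "find a threshold value" step described above, with conductance standing in for whatever cut objective a given implementation uses.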
Cheeger Inequality and Sparse Cuts in Graphs
In 1984, Alon and Milman discovered a graph Cheeger inequality [AM84], which showed that a graph spectral sweep cut is approximately sparse. For a formal definition of sparse (or isoperimetric) cuts in a graph theoretic setting, see [AM84]. The graph Cheeger inequality has guided decades of algorithmic and theoretical research on graph partitioning, random walks, and spectral graph theory in general [CHU97, KW16, OSV+08, OV11, ST04].
The graph Cheeger inequality implies that partitioning a graph based on the principal graph eigenvector (via spectral sweep methods) will find a provably sparse cut [CHU97, OV11, OSV+08, LRT+12, LGT14a]. It also implies that a slowly mixing random walk is likely to yield good information about a sparse graph cut; this intuition was leveraged in Spielman and Teng's seminal nearly-linear time algorithm for graph partitioning in [ST04]. Cheeger's inequality for graphs has been an inspiration for decades of spectral graph theory research (for more information, see [CHU97]).
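For orientation, the graph Cheeger inequality of [AM84] is commonly stated as follows, where $\lambda_2$ is the second eigenvalue of the normalized Laplacian and $\phi(G)$ is the conductance; this is the standard statement from the graph literature, recalled here rather than taken from the present paper:

```latex
\frac{\lambda_2}{2} \;\le\; \phi(G) \;\le\; \sqrt{2\lambda_2},
\qquad
\phi(G) \;=\; \min_{\emptyset \neq S \subsetneq V}
\frac{w\big(E(S,\bar S)\big)}{\min\{\operatorname{vol}(S),\, \operatorname{vol}(\bar S)\}}.
```

The left inequality is the "easy" direction; the right inequality is what guarantees that a sweep cut of the second eigenvector is approximately sparse.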
Sparse cuts have been researched extensively in graph theory [LR88, ARV04, CGR05, AP09, MAD10, LRT+12, KW16]. Sparse cuts, and the related notion of balanced cuts, have been used to generate fast multicommodity flow algorithms [LR88, ARV04]. These cuts are deeply related to expander decomposition [ST04, WUL17, SW19], which in turn has proven immensely useful in algorithm design. For a brief overview of the application of sparse and balanced cuts in computer science theory, see the introduction of [MAD10]. For a survey of uses of expander decomposition, see the introduction of [SW19].
We note that expander decompositions are based on the idea that sparse cuts form natural partitions of the graph into clustered components. This intuition is leveraged in machine learning [BN04].

Cheeger's Inequality and Sparse Cuts for Manifolds
The Cheeger inequality on a manifold states that the fundamental eigenvalue of the Laplace-Beltrami operator is bounded below by the square of the sparse or isoperimetric cut, divided by 4 (see [CHE70] for details). In 1982, Buser proved an upper bound for provided that the manifold has lower-bounded Ricci curvature [BUS82]. In contrast to the graph case, this inequality is false without the curvature assumption. This inequality has been widely used in the theory of differential geometry [VSC08, HK00]. Intuition based on manifold Cheeger and manifold Buser has been used to great effect in semi-supervised learning and image processing
[BN04, GS06].

The Cheeger-Buser Inequality and Sparse Cuts for Convex Bodies and Density Functions
The Cheeger-Buser inequality, and the related notion of sparse cuts, have seen renewed interest in the computer science literature [LS90, LV18a, MIL09]. Past researchers have used ideas inspired by sparse cuts and the Cheeger-Buser inequality to recover fast mixing-time lemmas for random walks on log-concave density functions supported on convex polytopes [LS90, DFK91, KLS95, LV18a, GM83]. These lemmas have in turn been used to find fast algorithms for sampling from such density functions [LV18a], and more.
One prominent use of sparse cuts in convex geometry is the celebrated Kannan-Lovász-Simonovits (KLS) Conjecture. This conjecture asserts that, for a convex body, the sparsity of the sparsest hyperplane cut is a dimensionless constant approximation of the sparsity of the optimal sparse cut [KLS95, LV18b]. Generally, these results all implicitly use the settings for their definition of sparse cuts, eigenvalues, and eigenvectors. They operate in the setting where there is significant structure on the probability density, such as being a log-concave distribution [LV18a]. We note that the Cheeger-Buser inequality fails for this setting of when the probability density is Lipschitz but not log-concave. For a survey of uses of Cheeger's inequality in convex polytope theory and log-concave density function sampling, see [LV18b].

Discrete Machine Learning from Continuous Methods
Our overall approach to clustering follows a line of work that generates discrete machine learning methods by analyzing their behavior in the continuous limit. This approach was used fruitfully by Su, Boyd, and Candès [SBC14] and by Wibisono, Wilson, and Jordan [WWJ16] to generate faster gradient descent variants based on continuous-time differential equations. For more information, refer to their respective papers.
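For context, the continuous-time object in [SBC14] is, as best I recall (stated here for orientation, not as part of this paper's results), the second-order ODE whose discretization recovers Nesterov's accelerated gradient method:

```latex
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\big(X(t)\big) = 0,
\qquad X(0) = x_0, \quad \dot{X}(0) = 0.
```

Analyzing this limiting ODE yields convergence rates and new discrete variants, which is the same continuous-to-discrete strategy this paper applies to clustering.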
1.3 Contributions
Our paper has three core contributions:

A natural method for cutting probability density functions, based on a new notion of sparse cuts on density functions.

A Cheeger and Buser inequality for Lipschitz probability density functions, and

A clustering algorithm operating on samples, which heuristically approximates a spectral sweep cut on the density function when the number of samples grows large.
We emphasize that our primary contributions are points 1 and 2, which are formally stated in Theorems 1.4 and 1.5 respectively. Our clustering algorithm on samples, which is designed to approximate the spectral sweep cut on the density function as the number of samples grows large, is of secondary importance.
We now state our two main theorems.
Theorem 1.4.
Spectral Sweep Cuts give Sparse Cuts:
Let be an Lipschitz probability density function, and let .
The spectral sweep cut of has sparsity satisfying:
Here, refers to the optimal sparsity of a cut on .
In words, the spectral sweep cut of the eigenvector gives a provably good approximation to the sparsest cut, as long as and . Proving this result is a straightforward application of two new inequalities we present, which we will refer to as the Cheeger and Buser inequalities for probability density functions.
Theorem 1.5.
Probability Density Cheeger and Buser:
Let be an Lipschitz density function. Let .
Let be the infimum sparsity of a cut through , and let be the principal eigenvalue of . Then:
and
The first inequality is Probability Density Cheeger, and the second inequality is Probability Density Buser.
Note that we don’t need to have a total mass of 1 for any of our proofs. The overall probability mass of can be arbitrary.
We note that Theorem 1.5 has a partial converse:
Lemma 1.6.
If or , then the Cheeger-Buser inequality in Theorem 1.5 does not hold.
In particular, if or , no Cheeger-Buser inequality can hold for any . These settings of and encompass most past work on spectral cuts and Cheeger-Buser inequalities on probability density functions.
Finally, we give a discrete algorithm, (1,3)-SpectralClustering, for clustering data points into two clusters. We conjecture, but do not prove, that (1,3)-SpectralClustering converges to the spectral sweep cut of the probability density function as the number of samples grows large.
We note that this resembles the unnormalized spectral clustering based on the work of Shi and Malik [SM97] and Ng, Jordan, and Weiss [NJW01, TS15]. The major difference is that we build our Laplacian from the matrix . Past work on spectral clustering builds the Laplacian from the matrices or [VON07, TS15].
1.4 Differences between our work and past work
Our work differs from past work in the following key ways:

Our work differs from past practical work on spectral sweep cuts [GS15, TS15, SM97, NJW01], as those methods perform what we call a sweep cut. These sweep cuts have no theoretical guarantees, much less a guarantee on their sparsity. Lemma 1.6 shows that no Cheeger and Buser inequality can simultaneously hold for any setting of when .
We will further show that using a sweep cut can lead to undesirable cuts of Lipschitz probability densities, with poor sparsity guarantees.

We note that probability density CheegerBuser is not easily implied by graph or manifold CheegerBuser. For a lengthier discussion on this, see Appendix A.

For our work, the probability density is not required to be bounded away from 0. This is a sharp departure from many existing results: past results on partitioning probability densities required a positive lower bound on [VBB08, GS15]. The strongest results in fields like linear elliptic partial differential equations depend on being bounded away from 0 [WHI17].
Our work is the first spectral sweep cut algorithm that guarantees a sparse cut on Lipschitz densities , without requiring strong parametric assumptions on .
1.4.1 Technical Contribution
The key technical contribution of our proof is proving Buser's inequality on Lipschitz probability densities via mollification [SOB38, FRI44] with disks of varying radius. This paper is the first time mollification with disks of varying radius has been used. We emphasize that the most difficult part of our paper is proving the Buser inequality.
Mollification has a long history in mathematics dating back to Sergei Sobolev’s celebrated proof of the Sobolev embedding theorem [SOB38]. It is one of the key tools in numerical analysis, partial differential equations, fluid mechanics, and functional analysis [FRI44, LW01, SS09, MON03], and analogs of mollification have been used in computational complexity settings [DKN10]. Informally speaking, mollification is used to create a series of smooth functions approximating a nonsmooth function, by convolving the original function with a smooth function supported on a disk. Notably, an approach using convolution is used by Buser in [BUS82] to prove the original Buser’s inequality, albeit with an intricate preprocessing step on any given cut.
To prove Buser’s inequality on Lipschitz probability density functions , we will show that given a cut with low sparsity, we can find a function with low Rayleigh quotient. We build by starting with the indicator function for the cut (which is 1 on one side of the cut and 0 on the other). Next, we mollify this function with disks of varying radius. In particular, for each point in the domain of , we spread out the point mass over a disk of radius proportional to , where L is the Lipschitz constant of . The resulting function obtained by ‘spreading out’ will have low Rayleigh quotient.
For all past uses of mollification, the disks on which the smooth convolving function is supported (we call these the mollification disks) have the same radius throughout the manifold. The use of a uniform-radius disk is critical for most uses and proofs in mollification. Our contribution is to allow the disks to vary in radius across our density. This variation in radius allows us to deal with densities that approach 0 (and explains the importance of the density being Lipschitz). No mollification disk centered anywhere in our probability density will intersect the zero set of the density. This overcomes significant hurdles in many results in functional analysis and PDEs, as many significant past results related to partial differential equations rely on having a positive lower bound on the density [WHI17, GS15].
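The geometric fact driving this argument is that, for an L-Lipschitz density, any zero x0 satisfies |x - x0| >= rho(x)/L, so a disk centered at x with radius proportional to rho(x)/L (the proportionality factor 1/2 below is an illustrative choice, not the paper's constant) cannot reach the zero set. A small numerical check in one dimension:

```python
import numpy as np

# An L-Lipschitz density vanishing at x = 0 (here L = 1).
rho = lambda x: np.abs(x)
L = 1.0

xs = np.linspace(-2.0, 2.0, 401)
xs = xs[rho(xs) > 0]                 # centers with positive density

# Mollification disk at x: radius r(x) = rho(x) / (2L), an illustrative
# choice of the "radius proportional to rho(x)/L" rule.
r = rho(xs) / (2 * L)

# Lipschitz bound: |x - 0| >= rho(x)/L > r(x), so no interval
# [x - r(x), x + r(x)] should contain the zero of rho at x = 0.
contains_zero = (xs - r <= 0.0) & (0.0 <= xs + r)
```

Every disk stays strictly inside the region where the density is positive, which is exactly the property the varying-radius construction is designed to guarantee.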
Proving our Buser inequality using mollification by disks of varying radius requires a fairly delicate proof with many pages of calculus. Our key technical lemma is a bound on the norm of a mollified function when the mollification disks have varying radii, which can be found in Section 3.3.
2 Paper Organization
We prove the Buser inequality in Section 3, via a rather extensive series of calculus computations. Our proof relies on a key technical lemma, which is presented in Section 3.3. This is by far the most difficult part of our proof.
We prove the Cheeger inequality in Section 4. The proof in this section implies that the sparsity of the spectral sweep cut of a probability density function is provably close to the principal eigenvalue of . We note that this inequality does not depend on the Lipschitz nature of the probability density function.
In Section 5, we prove Theorem 1.4, which shows that a spectral sweep cut has sparsity which provably approximates the optimal sparsity.
In Section 6, we go over example distributions that show that either the Cheeger or the Buser inequality must fail for past definitions of sparsity and eigenfunctions. We will prove Lemma 1.6 in this section.
In Section 7, we show an example Lipschitz probability density where the spectral sweep cut has bad sparsity for any , and will lead to an undesirable cut (from a clustering point of view) on this density function. This is important since the spectral clustering algorithm of Ng et al. [NJW01] is known to converge to a spectral sweep cut on the underlying probability density function, as the number of samples grows large [TS15].
Finally, we state conclusions and open problems in Section 8.
In the appendix, we note that the Cheeger and Buser inequalities for probability densities are not easily implied by graph or manifold Cheeger-Buser. We also provide a simplified version of Cheeger's and Buser's inequalities for probability densities in the one-dimensional case. This may make easier reading for those unfamiliar with technical multivariable mollification.
3 Buser Inequality for Probability Density Functions
The key idea to proving Buser’s inequality is as follows: given , and a cut where , we will build a function whose Rayleigh quotient is close to the sparsity of .
Roughly speaking, is built by convolving at point with a unit-weight disk with radius proportional to . Thus, we are convolving with a disk, where the radius of the disk varies between points , and the radius is directly proportional to .
3.1 Weighted Busertype Inequality
We now prove our weighted Busertype inequality, from Theorem 1.5. We state our result in terms of general .
Theorem 3.1.
Let be an Lipschitz function, be a principal eigenvalue, and the isoperimetric cut.
Then:
We note that when setting , the above expression simplifies into:
3.2 Proof Strategy: Mollification by Disks of Radius Proportional to
To prove Theorem 3.1 we construct an approximation of for which the numerator and denominator of the Rayleigh quotient, , approximate respectively the numerator and denominator of this expression. Specifically, will be constructed as a mollification of . Recall the following two equivalent definitions of a mollification; they are equivalent by the change of variables .
(1) 
with a parameter to be chosen and a smooth radially symmetric function supported in the unit open ball with unit mass . When is constant, it follows from Tonelli's theorem that ; when is not constant, the following lemma shows that the latter still bounds the former.
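Based on the surrounding description, equation (1) plausibly takes the standard mollification form below; the bump $\eta$, the radius function $\delta$, and the dimension $d$ are my labels for the elided symbols, so this is a reconstruction rather than the paper's verbatim equation:

```latex
f_\delta(x) \;=\; \int_{B_1} f\big(x + \delta(x)\,z\big)\,\eta(z)\,dz
\;=\; \frac{1}{\delta(x)^{d}} \int_{\mathbb{R}^d}
      f(y)\,\eta\!\left(\frac{y - x}{\delta(x)}\right) dy,
```

with $\eta$ smooth, radially symmetric, supported in the open unit ball, and of unit mass; the two forms agree pointwise via the change of variables $y = x + \delta(x)\,z$, as the text notes.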
3.3 Key Technical Lemma: Bounding norm of a function with the norm of its mollification
The following is our primary technical lemma, which roughly bounds the norm of a mollified function by the norm of the original . Here, the mollification radius is determined by a function .
Lemma 3.2.
Let be Lipschitz continuous with Lipschitz constant for almost every . Let be smooth, , and . Then
Proof.
(of Lemma 3.2) An application of Tonelli’s theorem shows
(2) 
Fix and consider the change of variables . The Jacobian of this mapping is which by Sylvester’s determinant theorem has determinant . It follows that
(3) 
and the lemma follows since . ∎
(Here, denotes the dot product between and .)
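Sylvester's determinant identity invoked above, det(I + a b^T) = 1 + b^T a, can be checked numerically on a generic instance (the vectors below are arbitrary, and not tied to the paper's specific Jacobian):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
a = rng.normal(size=d)
b = rng.normal(size=d)

# det(I + a b^T) computed directly ...
lhs = np.linalg.det(np.eye(d) + np.outer(a, b))
# ... versus Sylvester's determinant identity 1 + b . a
rhs = 1.0 + b @ a
```

In the proof, the rank-one perturbation of the identity comes from differentiating the varying-radius change of variables, and the identity reduces its determinant to a scalar expression.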
We present the following simple corollaries, which are the primary way our proof makes use of Lemma 3.2.
Corollary 3.3.
For any Lipschitz continuous function with Lipschitz constant and any with , we have:
This corollary will be used to bound the numerator of our Rayleigh quotient. Note that the expression
is close to when . This is the guiding intuition behind how Corollary 3.3 and Lemma 3.2 will be used, and will be formalized later in our proof of Theorem 3.1.
We present another simple corollary whose proof is equally straightforward. This corollary will be used to bound the denominator, and is a small generalization of Corollary 3.3. We write down both corollaries anyhow, since this will make it easier to interpret our bounds on the Rayleigh quotient.
Corollary 3.4.
For any Lipschitz continuous function with Lipschitz constant , any , and any with , we have:
Now we are ready to prove our main Theorem, which is the Buser inequality for probability densities stated in Theorem 3.1.
Proof.
(of Theorem 3.1)
Fix with and let be the characteristic function of . Setting to be the weighted average of , and
(4) 
Since and it follows that
(5) 
In the calculations below we omit the limiting argument with smooth approximations of outlined at the beginning of this section, which justifies formulas involving , and for readability we frequently write and for and .
Next, let be the mollification of (an extension of) given by equation (1). Then is a local average of , so , and . Letting denote the Lipschitz constant of , the parameter will be chosen less than so that Lemma 3.2 is applicable with constant .
The remainder of the proof constructs an upper bound on the numerator of the Rayleigh quotient for by , and a lower bound on the denominator by . The conclusion of the theorem then follows from equation (5).
3.4 Upper Bounding the Numerator
To bound the norm in the numerator of the Rayleigh quotient by the norm in the numerator of the expression for , it is necessary to obtain a uniform bound on .
Lemma 3.5.
Let be any function, and let be defined as in Equation 1. Let be an Lipschitz function.
(6) 
Proof.
In order to prove this lemma, we first need to get a handle on , which is the gradient of after mollification by .
We take the second representation of in equation (1) to get
(7) 
which is a consequence of the multivariable chain rule. Here,
refers to the outer product of and . Multiplying by gives:
(8) 
Now, we can bound the above equation by carefully bounding each part. We note:
(9)  
(10)  
(11) 
where the last step follows by a simple change of variables. Here, we note that is a vector, and the integral is over , which is how we eliminated from the expression. Next, we examine the term:
(12) 
Here, we aim to bound the operator norm of this matrix. We note that when , and thus, when the latter equation holds, we can say:
Since , we now have:
(13) 
We combine Equation 13 with Equation 9 to show:
(14)  
(15) 
where is the Lipschitz constant for . We note that Section 3.7 shows that
(16) 
and therefore:
(17)  
(18) 
Now we turn our attention to the first term, which is:
(19) 
We note that
by our definition of (which was defined when we defined ). Combining this with , we get:
(20)  
(21) 
Therefore,
(22) 
where the first inequality comes from combining Equations 18 and 21.
This allows us to bound :
(23)  
(24)  
(25)  
(26) 
where we make use of the fact that . This completes our proof. ∎
Next, we want an bound on .
Lemma 3.6.
Let be any function, and let be defined as in Equation 1. Let be an Lipschitz function, and let .
Then:
(27) 
Proof.
First, we take the gradient of the first representation of in equation (1). Using the chain rule gives us an alternate form for :
(28) 
so
(29) 
The ratio in the integrand is bounded using the Lipschitz assumption on (and ),
(30) 
Note that
(31) 
where represents the matrix norm of . This is because , and , and every time , and thus
Lemma 3.7.
For any Lipschitz distribution , any function , and any such that :
(32) 
Proof.
We note that in the case where , and if is a step function, the expression would simplify to:
3.5 Lower Bound on the Denominator
Let and be the –weighted averages of and , and let denote the space with this weight. Our core lemma is a bound on in terms of and weighted norms of and respectively.
Lemma 3.8.
Let be an Lipschitz function, and let be such that . Let be an indicator function of a set with finite perimeter. Let be defined as in Equation 1, and be defined as . Then:
(34) 
Note that when , as is true when , the inequality in Lemma 3.8 becomes:
The estimate in Lemma 3.8 will be combined with the estimate in Lemma 3.7 to prove Theorem 3.1 in Section 3.6.
Proof.
The key to this proof is to upper bound the quantity with the expression appearing in Corollary 3.4. We will do so by a series of inequalities, application of the fundamental theorem of calculus, and more.
Using the property that subtracting the average from a function reduces the norm, it follows that
If then , so a lower bound for the denominator of the Rayleigh quotient is
(35)  
where the identity from Equation 4, and the bound , were used in the last step.
It remains to estimate the difference . To do this, we use the multivariable fundamental theorem of calculus to write
where the first and second equalities come from applying the multivariable fundamental theorem of calculus, and the last equation is straightforward. This tells us that: