New Nearly-Optimal Coreset for Kernel Density Estimation

07/15/2020
by Wai Ming Tai, et al.

Given a point set PβŠ‚β„^d, kernel density estimation for Gaussian kernel is defined as 𝒒_P(x) = 1/|P|βˆ‘_p∈ Pe^-β€– x-p β€–^2 for any xβˆˆβ„^d. We study how to construct a small subset Q of P such that the kernel density estimation of P can be approximated by the kernel density estimation of Q. This subset Q is called coreset. The primary technique in this work is to construct Β± 1 coloring on the point set P by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyk's Theorem. When d>1 is constant, our construction gives a coreset of size O(1/Ρ√(loglog1/Ξ΅)) as opposed to the best-known result of O(1/Ρ√(log1/Ξ΅)). It is the first to give a breakthrough on the barrier of √(log) factor even when d=2.


1 Introduction

Kernel density estimation (KDE) is a common object in data analysis, and it is a non-parametric way to estimate a probability distribution. Suppose a point set P ⊂ ℝ^d is given; KDE smooths out the discrete point set into a continuous function [37, 38]. More precisely, given a point set P and a kernel K, KDE is generally defined as the function 𝒒_P(x) = (1/|P|) ∑_{p∈P} K(x, p) for any x ∈ ℝ^d. One common example of a kernel is the Gaussian kernel, K(x, p) = e^{-‖x-p‖^2} for any x, p ∈ ℝ^d, and it is the main focus of this paper. A wide range of applications includes outlier detection [44], clustering [35], topological data analysis [34, 12], spatial anomaly detection [1, 21] and statistical hypothesis testing [20].
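For concreteness, a minimal sketch (in Python with NumPy; the function name and the sample data are illustrative, not from the paper) of evaluating the Gaussian-kernel KDE defined above:

import numpy as np

def gaussian_kde(P, x):
    # G_P(x) = (1/|P|) * sum_{p in P} exp(-||x - p||^2)
    sq_dists = np.sum((P - x) ** 2, axis=1)
    return np.mean(np.exp(-sq_dists))

# example: 1000 points in the plane, one query point
rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 2))
print(gaussian_kde(P, np.array([0.5, -0.25])))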

Generally speaking, techniques using kernels are called kernel methods, and kernel density estimation plays a central role among them. Kernel methods are prevalent in machine learning and statistics and often involve optimization problems. Optimization problems are generally hard in the sense that solving them usually has super-linear or even exponential dependence on the input's size in its running time. Therefore, reducing the size of the input is desirable. The most straightforward way to do so is to extract a small subset Q of the input P. This paper studies the construction of a subset Q such that 𝒒_Q approximates 𝒒_P.

Classically, statisticians are concerned with various notions of average error such as ℓ_1-error [17] or ℓ_2-error [37, 38]. However, multiple modern applications require ℓ_∞-error, such as preserving classification margins [36], density estimation [43], topology [34] and hypothesis tests on distributions [20]. Formally, we would like to solve the following problem.

Given a point set P ⊂ ℝ^d and ε > 0, construct a subset Q of P such that

max_{x∈ℝ^d} |𝒒_P(x) - 𝒒_Q(x)| ≤ ε.

Then, how small can the size of Q, |Q|, be?

We call this subset Q an ε-coreset.
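The guarantee ranges over all x ∈ ℝ^d, so it cannot be checked exactly; a minimal sketch is to estimate the error of a candidate Q on a dense finite set of queries. The sketch below reuses the gaussian_kde helper above, and the random sample Q is only a placeholder candidate, not the construction of this paper.

import numpy as np

def linf_error(P, Q, queries):
    # estimate max_x |G_P(x) - G_Q(x)| over a finite set of query points
    return max(abs(gaussian_kde(P, x) - gaussian_kde(Q, x)) for x in queries)

rng = np.random.default_rng(1)
P = rng.normal(size=(2000, 2))
Q = P[rng.choice(len(P), size=200, replace=False)]  # placeholder candidate coreset
queries = rng.uniform(-4.0, 4.0, size=(5000, 2))    # finite surrogate for "all x"
print(linf_error(P, Q, queries))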

1.1 Known Results on ε-coresets

We now discuss some previous results on the size of an ε-coreset. A summary is presented in Table 1.

Paper Coreset Size
Joshi et al. [22] any
Lopez-Paz et al. [26] any
Lacoste-Julien et al. [24] any
Joshi et al. [22] 1
Joshi et al. [22] sub-1/ε² constant
Phillips [31] constant
Phillips and Tai [32] constant
Phillips and Tai [33] any
Phillips [31] any
Phillips and Tai [32]
Phillips and Tai [33]
Our result O((1/ε)√(log log(1/ε))) constant
Table 1: Asymptotic ε-coreset sizes in terms of ε and d.

Joshi et al. [22] showed that random sampling yields an ε-coreset. They investigated the VC-dimension of the super-level sets of a kernel and showed that the sample size can be bounded in terms of it. In particular, the super-level sets of the Gaussian kernel form a family of balls in ℝ^d, which reduces the problem to bounding the sample size for the range space of balls.

Lopez-Paz et al. [26] later proved that the size of the coreset can be reduced to O(1/ε^2) by random sampling. They studied the reproducing kernel Hilbert space (RKHS) associated with a positive-definite kernel [3, 42, 40]. Note that the Gaussian kernel is a positive-definite kernel. In the RKHS, one can bound the ℓ_∞-error between the KDEs of two point sets P and Q by the kernel distance between P and Q. They showed that a sample of size O(1/ε^2) suffices to bound the kernel distance.
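For reference, a sketch of the kernel distance used in this argument, namely the RKHS distance between the mean embeddings of P and Q, written out directly (helper names are illustrative):

import numpy as np

def gaussian_gram(A, B):
    # Gram matrix with entries exp(-||a_i - b_j||^2)
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0))

def kernel_distance(P, Q):
    # ||mu_P - mu_Q||_H for the Gaussian kernel, where mu_P = (1/|P|) sum_p K(p, .)
    val = (gaussian_gram(P, P).mean() + gaussian_gram(Q, Q).mean()
           - 2.0 * gaussian_gram(P, Q).mean())
    return np.sqrt(max(val, 0.0))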

Other than random sampling, Lacoste-Julien et al. [24] showed that a greedy approach can also achieve a size of O(1/ε^2). They applied the Frank-Wolfe algorithm [15, 18] in the RKHS to bound the error in terms of the kernel distance.
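A minimal sketch of this greedy flavor, closer to kernel herding than to the exact Frank-Wolfe iteration of [24], reusing the kernel_distance helper above (quadratic-time and purely illustrative):

import numpy as np

def greedy_coreset(P, m):
    # repeatedly add the point of P that most reduces the kernel distance to P
    chosen = []
    for _ in range(m):
        best = min(range(len(P)),
                   key=lambda i: kernel_distance(P, np.array(chosen + [P[i]])))
        chosen.append(P[best])
    return np.array(chosen)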

Note that all of the above results have a factor of 1/ε^2. Joshi et al. [22] first showed that a sub-1/ε^2 bound can be obtained by reducing the problem to constructing an ε-approximation for the range space of balls [28]. An important case is d = 1, which gives a size of O(1/ε). They assume that d is constant.

Later, Phillips [31] improved the bound for constant d via geometric matching. Notably, for d = 2, their bound is nearly optimal and was the first nearly-linear (in 1/ε) result beyond the one-dimensional case.

Recently, Phillips and Tai [32] further improved the size of the coreset for constant d. Their approach is based on discrepancy. They exploited the fact that the Gaussian kernel is multiplicatively separable, which implies that the Gaussian kernel can be rewritten as a weighted average over a family of axis-parallel boxes in ℝ^d. Finally, they reduced the problem to Tusnády's problem [8, 2].

Shortly after that, Phillips and Tai [33] proved the nearly-optimal bound of O((1/ε)√(log(1/ε))) for constant d. They observed that the underlying structure of positive-definite kernels allows one to bound the norms of the relevant vectors and to apply the lemma in [29]. Recall that the Gaussian kernel is a positive-definite kernel.

Beyond upper bounds, there are also results on lower bounds for the size of an ε-coreset. Phillips [31] provided the first lower bound, proved by giving an example in which all points are far from each other. Under an additional assumption on d, Phillips and Tai [32] gave another example, in which the points form a simplex, and showed a stronger lower bound. Later, Phillips and Tai [33] combined the techniques of the above two results and showed a lower bound combining both.

There are other conditional bounds for this problem; we refer the reader to [33] for a more extensive review. Recently, Karnin and Liberty [23] claimed a smaller coreset bound for the Gaussian kernel. Their result is also conditional: they assume that the point set lies inside a ball of bounded radius, and the dependence on that radius in their bound is exponential. Also, the result is non-constructive. However, we are not convinced by their argument (through private communication, the authors acknowledged our comment): in Lemma 16 of [23], they claim that a trivial modification of the proof of Theorem 1 in [41] suffices to conclude their lemma, while it does not.

1.2 Our Result

Upper Lower
[22, 31]
constant new
any [33]
[4, 32]
Table 2: Size bounds of ε-coresets for the Gaussian kernel.

We bound the size of an ε-coreset via discrepancy theory. Roughly speaking, we construct a ±1 coloring on our point set such that its discrepancy is small. Then, we drop the points colored -1 and recursively construct a coloring on the points colored +1. Eventually, the remaining point set is the coreset we desire. A famous theorem in discrepancy theory is Banaszczyk's Theorem [5], which is the central piece of our proof. In constant-dimensional space, we carefully study the structure of the Gaussian kernel, which allows us to construct an ε-coreset of size O((1/ε)√(log log(1/ε))). A summary of the best-known results is shown in Table 2.

1.3 Related Works

In computational geometry, an ε-approximation is an approximation of a general set by a smaller subset. Given a set X and a collection R of subsets of X, a subset A ⊆ X is called an ε-approximation if ||A ∩ S|/|A| - |S|/|X|| ≤ ε for all S ∈ R. The pair (X, R) is called a set system (also known as a range space or hypergraph). One can rewrite the above guarantee as |(1/|A|) ∑_{a∈A} 1_S(a) - (1/|X|) ∑_{x∈X} 1_S(x)| ≤ ε, where 1_S is the characteristic function of the set S. If we replace this characteristic function by a kernel such as the Gaussian kernel, it is the same as our ε-coreset. There is a rich history of constructions of ε-approximations [13, 28]. One notable method is discrepancy theory, which is also our main technique, and a wide range of techniques has been employed in this field. In the early 1980s, Beck devised the technique of partial coloring [9], and a later refinement of this technique, called the entropy method, was introduced by Spencer [39]. The entropy method was first used to prove the famous "six standard deviations" theorem: given a set system of n points and n subsets, there is a coloring of discrepancy at most 6√n. In contrast, a random coloring gives a discrepancy of O(√(n log n)). However, the entropy method is non-constructive; it does not provide an efficient algorithm to find a coloring achieving such a bound. A more geometric example in discrepancy theory is Tusnády's problem: given a point set of size n in ℝ^d, assign a ±1 coloring to each point such that, for any axis-parallel box in ℝ^d, the discrepancy is minimized. One of the previous results on the ε-coreset problem reduces to Tusnády's problem.

On the topic of approximating KDE, the Fast Gauss Transform [19] is a method to preprocess the input point set so that computing the KDE at a query point is faster than the brute-force approach. The idea of this method is to expand the Gaussian kernel in Hermite polynomials and truncate the expansion. Assuming that the data set lies inside a bounded region, the query time of this method is poly-logarithmic for constant dimension d. Also, Charikar and Siminelakis [11] studied the problem of designing a data structure that preprocesses the input to answer a KDE query quickly. They used locality-sensitive hashing to build their data structure. However, the guarantee they obtain is a relative error, while ours is an additive error. More precisely, given a point set P, Charikar and Siminelakis designed a data structure such that, for any query x, the algorithm answers the value 𝒒_P(x) within relative error, and the query time of their data structure is sublinear in n.

2 Preliminary

Our approach for constructing a coreset relies on discrepancy theory, a technique similar to those used for range counting coresets [14, 30, 10]. We first introduce an equivalent problem (up to a constant factor) as follows.

Given a point set P, what is the smallest value of max_{x∈ℝ^d} |∑_{p∈P} σ(p) e^{-‖x-p‖^2}| over all σ ∈ {-1,+1}^P, the set of all possible colorings from P to {-1,+1}?

Now, one can intuitively view the equivalence in the following way. Suppose the coloring σ assigns +1 to exactly half of the points and let P_+ be the set of points assigned +1. Rewriting the objective,

𝒒_P(x) - 𝒒_{P_+}(x) = (1/|P|) ∑_{p∈P} e^{-‖x-p‖^2} - (2/|P|) ∑_{p∈P_+} e^{-‖x-p‖^2} = -(1/|P|) ∑_{p∈P} σ(p) e^{-‖x-p‖^2},

so a low-discrepancy coloring yields a half-sized subset whose KDE is close to that of P. We can then apply the halving technique [14, 30], which is to recursively invoke the coloring algorithm and retain the set of points assigned +1 until a subset of the desired size remains. Note that there is no guarantee that exactly half of the points are assigned +1 while the other half is assigned -1; however, we can handle this issue by a standard technique [28], or see our proof for details.
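A minimal sketch of the halving technique with a placeholder coloring subroutine (a random balanced coloring; the construction in this paper instead uses the Banaszczyk-based coloring of Algorithm 1 and stops once the accumulated error reaches ε):

import numpy as np

def halve_once(P, rng):
    # color P with +/-1 (placeholder: a random balanced coloring) and keep the +1 half
    signs = np.ones(len(P), dtype=int)
    signs[rng.permutation(len(P))[: len(P) // 2]] = -1
    return P[signs == 1]

def halving_coreset(P, target_size, seed=0):
    # recursively halve until at most target_size points remain
    rng = np.random.default_rng(seed)
    while len(P) > target_size:
        P = halve_once(P, rng)
    return P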

We also introduce the following notation. Given a point set P and a subset X ⊆ ℝ^d, we call the pair consisting of P and the family of functions { e^{-‖x-·‖^2} : x ∈ X } a kernel range space. Next, given a kernel range space and a coloring σ, we define the discrepancy w.r.t. σ as

disc_σ(P, X) = max_{x∈X} |∑_{p∈P} σ(p) e^{-‖x-p‖^2}|.

We finally define the discrepancy of a kernel range space as

disc(P, X) = min_{σ ∈ {-1,+1}^P} disc_σ(P, X),

where {-1,+1}^P is the set of all possible colorings from P to {-1,+1}. Karnin and Liberty [23] also introduced a similar notion.
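For concreteness, a sketch of evaluating the discrepancy of a given coloring on a finite query set (the definition above takes the maximum over all of X; names are illustrative):

import numpy as np

def coloring_discrepancy(P, signs, queries):
    # max over the queries of |sum_p signs[p] * exp(-||x - p||^2)|
    best = 0.0
    for x in queries:
        val = np.dot(signs, np.exp(-np.sum((P - x) ** 2, axis=1)))
        best = max(best, abs(val))
    return best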

An important result in discrepancy theory is Banaszczyk’s Theorem, which is stated as follows.

Theorem 1 (Banaszczyk’s Theorem [5]).

Given a convex body K ⊆ ℝ^m of Gaussian measure at least 1/2 and vectors v_1, …, v_n of Euclidean norm at most 1, there is a coloring σ ∈ {-1,+1}^n such that the vector ∑_{i=1}^n σ(i) v_i lies in cK. Here, c is an absolute constant and the Gaussian measure of a convex body K is defined as γ_m(K) = (2π)^{-m/2} ∫_K e^{-‖x‖^2/2} dx.

The proof of this theorem relies on properties from convex geometry and recursively enumerates all possible colorings of the vectors. Hence, the approach is non-constructive.

Since then, there had been no breakthrough on how to construct such a coloring efficiently, except for some partial results in special cases [6, 8, 25]. Recently, Dadush et al. [16] showed that Banaszczyk's Theorem is equivalent, up to the value of the absolute constant, to the following statement.

Theorem 2 (Equivalent form of Banaszczyk’s Theorem [16]).

Given vectors v_1, …, v_n of Euclidean norm at most 1, there is a probability distribution 𝒟 on {-1,+1}^n with the following guarantee: there are two absolute constants c, C > 0 such that, for any unit vector θ and any λ > 0,

Pr_{σ∼𝒟}[ |⟨θ, ∑_{i=1}^n σ(i) v_i⟩| ≥ λ ] ≤ C e^{-cλ^2}.

Here, {-1,+1}^n is the set of all possible colorings from {1, …, n} to {-1,+1}, and ∑_{i=1}^n σ(i) v_i is the random variable obtained by drawing σ from 𝒟.

A distribution satisfying the guarantee in the above statement is also known as a sub-Gaussian distribution. One key difference in Dadush et al.'s statement is that there is no convex body in it. The intuition behind the equivalence is that one can "approximate" any convex body by the intersection of a family of half-spaces. The question is then converted to how many half-spaces one needs for a good approximation of a convex body, or equivalently, how many events one needs in the union bound so that the failure probability is at most a constant in Dadush et al.'s statement. Dadush et al. devised an efficient algorithm to find such a coloring, but it is sub-optimal by an extra factor; namely, after scaling, the condition on the Euclidean norm of the input vectors is stronger than norm at most 1. Later, Bansal et al. [7] efficiently constructed a coloring satisfying the guarantee of Banaszczyk's Theorem. Therefore, Banaszczyk's Theorem is now constructive. Moreover, under a mild assumption, the running time is a polynomial whose exponent involves ω, the exponent of matrix multiplication.

The connection between Banaszczyk's Theorem and discrepancy theory can be seen in the paper of Matoušek et al. [29]. An important result in that paper is the following: given an m-by-n matrix A and an n-by-k matrix B, the discrepancy of the matrix AB, disc(AB), is bounded in terms of the largest Euclidean norm of the rows of A and the largest Euclidean norm of the columns of B. The sketch of the proof is to apply Banaszczyk's Theorem, taking the columns of B as the input vectors and using a union bound over the rows of A.

Finally, we introduce a useful lemma, the Markov brothers' inequality.

Lemma 3 (Markov Brother’s Inequality [27]).

Let p be a polynomial of degree n. Then,

max_{x∈[-1,1]} |p′(x)| ≤ n^2 · max_{x∈[-1,1]} |p(x)|.

Here, p′ is the derivative of p.
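As a quick numerical sanity check of this inequality, a sketch using NumPy's Chebyshev polynomials, for which the bound is attained at the endpoints of the interval:

import numpy as np
from numpy.polynomial import chebyshev as C

n = 7
xs = np.linspace(-1.0, 1.0, 100001)
Tn = C.Chebyshev.basis(n)             # degree-n Chebyshev polynomial T_n
lhs = np.max(np.abs(Tn.deriv()(xs)))  # max |p'(x)| on [-1, 1]
rhs = n**2 * np.max(np.abs(Tn(xs)))   # n^2 * max |p(x)| on [-1, 1]
print(lhs, "<=", rhs)                 # equality up to grid resolution for T_n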

3 Construction of the ε-coreset

We first assume that our point set lies inside a ball of bounded radius. The following lemma shows that bounding the discrepancy on a certain finite set implies that the discrepancy over the entire space is also bounded. This finite set is a bounded grid whose width is poly-logarithmic in the quantities involved. The advantage of doing this is to bound the number of events in the union bound when we apply Banaszczyk's Theorem.

Lemma 4.

Given ε > 0, suppose P is a point set of size n whose points satisfy the stated norm bound, and σ is a coloring on P. Then, we have

where the grid is the restriction of a lattice grid to a bounded region of the stated width. Here, the underlying grid is an infinite lattice grid.

In other words, this lemma says that, given a coloring σ, it suffices to bound the signed kernel sum at the grid points.

Proof.

By translation, we can assume that all points lie in the stated region around the origin. We first use the Taylor expansion to obtain the following expression.

(1)

Here, ⊗ denotes the Kronecker product. Namely, for any two vectors u and v, the vector u ⊗ v is indexed by pairs of integers (i, j) and its (i, j)-th entry is u_i v_j. Also, for any vector and any integer, the corresponding Kronecker power is defined by iterating this product.

Now, consider the polynomial obtained by truncating expression (1). We first observe that it is the same as expression (1) except that the sum is truncated at a fixed term. We analyze the remaining terms.

By the error bound for the Taylor expansion of the exponential function, we obtain the following estimate. Then,

The last step is due to the stated fact. Now, choosing the truncation degree appropriately, we have

(2)

First, suppose the query point is far from all points of P. Then each term is negligible, and the desired bound follows by the triangle inequality.

Now we can assume the remaining case. We construct a uniform grid of the stated width on the region. Note the bound on its size.

Let the chosen grid point be the closest grid point all of whose coordinates are at least as large as those of the query point. We have

for some on the segment between and . Here, is the gradient. Then,

where the quantity is as defined above. The last step is due to the Markov brothers' inequality (Lemma 3). By rearranging the terms and multiplying through, we have

Also, since, for each coordinate, .

In the last step, recall the definition above. The left-hand side is essentially the quantity we want to bound. We first analyze the leading term. Recall the earlier bound.

which means . Therefore, we have

and then, from (2),

or

∎

We are still assuming that our point set lies inside a ball of bounded radius. Now, we can apply Banaszczyk's Theorem to construct a coloring that achieves low kernel discrepancy. More importantly, the discrepancy we achieve has only a logarithmic and a double-logarithmic dependence on the quantities in the statement below.

Lemma 5.

Given ε > 0, suppose P is a point set of size n whose points satisfy the stated norm bound. Suppose the vectors are chosen as described, one for each point. Then, by taking these vectors as input, Banaszczyk's Theorem (Theorem 2) constructs a coloring on P that satisfies the two stated bounds with the stated probability.

Proof.

Let the grid be as given by Lemma 4. Also, consider the kernel matrix whose entries are the kernel values between pairs of points. Since the Gaussian kernel is positive-definite, this matrix is positive-definite. Therefore, we can decompose it into the product of a matrix and its transpose, and we take the columns of this factor, one per point. Without loss of generality, we can assume the stated ordering. Note the bound on the column norms.

By Banaszczyk’s Theorem (Theorem 2), it constructs a coloring such that, for any unit vector , we have

with the stated probability. Note that the norm of each column is bounded, which means the above statement is still true up to a constant factor. Let the test vector be the vector that is zero in every coordinate except the first. Specifically, we have, by taking this choice,

for any and, by taking ,

with the stated probability. We apply the union bound over all of these choices of test vectors. Recall the earlier bound. From Lemma 4, the size of the grid is bounded as stated. We can conclude the two desired bounds with the stated probability by choosing the failure parameter appropriately.

∎

Note that one can always make this assumption. Otherwise, we can partition the point set into groups that are sufficiently far from each other and construct the coloring for each group separately. It means that the above result is no worse than the previous result of [33].

The following algorithm (Algorithm 1) is a Las Vegas algorithm that constructs a coloring on the input point set P. We can now show how to construct a coloring whose discrepancy is low and independent of the assumption that the point set lies in a bounded ball. Recall that the underlying grid is an infinite lattice grid. The idea of Algorithm 1 is that we first decompose the entire space ℝ^d into an infinite collection of balls of bounded radius centered at lattice points. Then, we partition our input so that each point lies in some ball. For each non-empty ball, we run Banaszczyk's Theorem to construct a coloring with the desired discrepancy. Finally, we argue that any query point can only be influenced by a bounded number of balls, and therefore the final discrepancy increases by only a bounded factor.

input: a point set P

1:  initialize the point set of every ball to be empty
2:  for each p ∈ P do
3:      insert p into the ball whose center is the lattice point closest to p
4:  for each non-empty ball do
5:      construct a collection of vectors, one for each point in the ball, as in Lemma 5
6:      using these vectors as input, run Banaszczyk's Theorem (Theorem 2) to obtain a coloring on the ball
7:      check if the coloring achieves the discrepancy guaranteed in Lemma 5; repeat line 6 if not
8:      flip the colors of certain points so that half of the points in the ball are colored +1 and the rest of them are colored -1
9:  return the coloring that assigns to each point the color it received within its ball
Algorithm 1 Constructing a Coloring for the Gaussian Kernel in Constant-Dimensional Space
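A high-level sketch of the structure of Algorithm 1, with axis-aligned grid cells standing in for the balls around lattice points and a random coloring standing in for the Banaszczyk-based coloring of line 6 (all names and the cell width are illustrative):

import numpy as np
from collections import defaultdict

def color_point_set(P, cell_width=1.0, seed=0):
    # partition P into cells, color each non-empty cell, then rebalance within each cell
    rng = np.random.default_rng(seed)
    cells = defaultdict(list)
    for idx, p in enumerate(P):
        cells[tuple(np.floor(p / cell_width).astype(int))].append(idx)

    signs = np.zeros(len(P), dtype=int)
    for members in cells.values():
        # placeholder for line 6: the paper uses the constructive version of Banaszczyk's Theorem
        local = rng.choice([-1, 1], size=len(members))
        # line 8: flip colors so that (as close as possible to) half of the cell is +1
        order = np.argsort(local)
        local[order[: len(members) // 2]] = -1
        local[order[len(members) // 2:]] = 1
        signs[np.array(members)] = local
    return signs

The check in line 7 (repeating until the Lemma 5 guarantee holds) is omitted in this sketch.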
Lemma 6.

Suppose P is a point set of size n in ℝ^d, where d is constant. Then, Algorithm 1 efficiently constructs a coloring on P such that the stated discrepancy bound holds and half of the points in P are colored +1.

Proof.

For any lattice point, denote the ball of the stated radius centered at that lattice point.

Fix a query point. We can argue that it lies in one of these balls; if it does not, then each term in the sum is negligible and the bound holds trivially. So assume it lies in some ball, and let the corresponding lattice point be the one closest to the query point. By the triangle inequality, we have the stated distance bound.

Now, there are at most a bounded number of lattice points whose balls can influence the query point. Let this set of lattice points be as above. Then, we have

The second-to-last line is due to Lemma 5.

In line 8, suppose more points are assigned one color than the other. Choose points assigned the majority color arbitrarily and flip them so that the difference becomes zero. Denote the two color classes before the flip, and define the corresponding classes after the flip in the same way. For any query point,