1 Introduction
Kernel density estimation (KDE) is a common object in data analysis: it is a non-parametric way to estimate a probability distribution. Given a point set, KDE smooths the discrete point set into a continuous function [37, 38]. More precisely, given a point set and a kernel , KDE is generally defined as the function for any . One common example of a kernel is the Gaussian kernel, which is for any , and it is the main focus of this paper. A wide range of applications includes outlier detection [44], clustering [35], topological data analysis [34, 12], spatial anomaly detection [1, 21], and statistical hypothesis testing [20]. Generally speaking, techniques built on kernels are called kernel methods, in which kernel density estimation plays a central role. Kernel methods are prevalent in machine learning and statistics and often involve optimization problems. Optimization problems are generally hard in the sense that solving them usually takes time super-linear, or even exponential, in the input's size. Therefore, reducing the size of the input is desirable. The most straightforward way to do so is to extract a small subset of the input . This paper studies the construction of a subset such that approximates .

Classically, statisticians are concerned with various notions of average error, such as -error [17] or -error [37, 38]. However, multiple modern applications require -error, such as preserving classification margins [36], density estimation [43], topology [34], and hypothesis testing on distributions [20]. Formally, we would like to solve the following problem.
Given a point set and , we construct a subset of such that
Then, how small can the size of , , be?
We call this subset -coreset.
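As a toy illustration of this definition, the following Python sketch compares the KDE of a point set against the KDE of a candidate subset on a finite query grid. The bandwidth (the kernel is taken to be exp(-||x - p||^2)), the uniform weights, and the use of a random sample as the candidate subset are illustrative assumptions, not the construction of this paper.

```python
import numpy as np

def kde(points, x):
    """Gaussian KDE at a query x: the average of exp(-||x - p||^2) over the point set."""
    return np.mean(np.exp(-np.sum((points - x) ** 2, axis=1)))

rng = np.random.default_rng(0)
P = rng.normal(size=(2000, 2))                      # input point set in R^2
Q = P[rng.choice(len(P), size=200, replace=False)]  # candidate coreset: a random sample

# estimate sup_x |KDE_P(x) - KDE_Q(x)| on a finite query grid
xs = np.linspace(-3.0, 3.0, 25)
grid = [np.array([a, b]) for a in xs for b in xs]
err = max(abs(kde(P, x) - kde(Q, x)) for x in grid)
```

Here `err` is a grid-based proxy for the supremum over all queries; the question above asks how small the subset can be while keeping this quantity below a prescribed error.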
1.1 Known Results of -coreset
We now discuss some previous results on the size of the -coreset. A summary is presented in Table 1.
Paper | Coreset Size | |
---|---|---|
Joshi et al. [22] | any | |
Lopez-Paz et al. [26] | any | |
Lacoste-Julien et al. [24] | any | |
Joshi et al. [22] | 1 | |
Joshi et al. [22] | sub- | constant |
Phillips [31] | constant | |
Phillips and Tai [32] | constant | |
Phillips and Tai [33] | any | |
Phillips [31] | any | |
Phillips and Tai [32] | ||
Phillips and Tai [33] | ||
Our result | constant |
Joshi et al. [22] showed that random sampling can achieve a size of . They investigated the VC-dimension of the super-level sets of a kernel and showed that the sample size can be bounded in terms of it. In particular, the super-level sets of the Gaussian kernel form a family of balls in , which reduces the problem to bounding the sample size for the range space of balls.
Lopez-Paz et al. [26] later proved that the size of the coreset can be reduced to by random sampling. They studied the reproducing kernel Hilbert space (RKHS) associated with a positive-definite kernel [3, 42, 40]; note that the Gaussian kernel is a positive-definite kernel. In an RKHS, one can bound the -error between the KDEs of two point sets and by the kernel distance of and . They showed that a sample size of is sufficient to bound the kernel distance.
Other than random sampling, Lacoste-Julien et al. [24] showed that a greedy approach can also achieve a size of . They applied the Frank-Wolfe algorithm [15, 18] in the RKHS to bound the error via the kernel distance.
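This greedy idea can be sketched as kernel herding, a Frank-Wolfe-style selection in the RKHS: at each step, pick the point whose kernel column best matches the mean kernel embedding of the full set. The bandwidth, the simple 1/(t+1)-style step, and all parameter choices below are illustrative assumptions, not the exact algorithm of [24].

```python
import numpy as np

def herd(P, m):
    """Greedily pick m indices of P so that the average kernel column of the
    chosen points tracks the mean kernel embedding of P (kernel herding sketch)."""
    # Gram matrix of the Gaussian kernel: K_ij = exp(-||p_i - p_j||^2)
    K = np.exp(-((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))
    mu = K.mean(axis=1)        # mean embedding evaluated at every candidate point
    s = np.zeros(len(P))       # running sum of kernel columns of chosen points
    chosen = []
    for t in range(m):
        # Frank-Wolfe linear-maximization step: maximize mu - (current average)
        i = int(np.argmax(mu - s / (t + 1)))
        chosen.append(i)
        s += K[:, i]
    return chosen

rng = np.random.default_rng(0)
P = rng.normal(size=(500, 2))
idx = herd(P, 50)
```

The chosen points concentrate where the density is high while spreading out to match the mean embedding, which is what drives the kernel-distance bound.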
Note that all of the above results have a factor of . Joshi et al. [22] first showed that a sub- result can be obtained by reducing the problem to constructing a -approximation for the range space of balls [28]. An important case is , which gives a size of ; they assume that is constant.
Later, Phillips [31] improved the bound to for constant via geometric matching. Notably, for , the bound is , which is nearly optimal and was the first nearly-linear result for the case of .
Recently, Phillips and Tai [32] further improved the coreset size to for constant , based on a discrepancy approach. They exploited the fact that the Gaussian kernel is multiplicatively separable, which implies that the Gaussian kernel can be rewritten as a weighted average of a family of axis-parallel boxes in . Finally, they reduced the problem to Tusnády's problem [8, 2].
Shortly after that, Phillips and Tai [33] proved the nearly-optimal result of . They observed that the underlying structure of a positive-definite kernel allows one to bound the norm of the vectors and apply the lemma in [29]; recall that the Gaussian kernel is a positive-definite kernel.

Besides upper bounds, there are some results on the lower bound for the size of the -coreset. Phillips [31] provided the first lower bound for the size of the coreset: a lower bound of , via an example in which all points are far from each other. Assuming , Phillips and Tai [32] gave another example, in which the points form a simplex, and showed a lower bound of . Later, Phillips and Tai [33] combined the techniques of the above two results and showed a lower bound of .

There are other conditional bounds for this problem; we refer the reader to [33] for a more extensive review. Recently, Karnin and Liberty [23] claimed that there exists a coreset of size for the Gaussian kernel. Their result is also conditional: they assume that the point set lies inside a -ball of radius , and the dependence on in their result is exponential in . Also, the result is non-constructive. However, we are not convinced by their argument: in Lemma 16 of [23], they claimed that a trivial modification of the proof of Theorem 1 in [41] suffices to conclude their lemma, while it does not. (Through private communication, the authors acknowledged our comment.)
1.2 Our Result
Upper | Lower | ||
---|---|---|---|
[22, 31] | |||
constant | new | ||
any | [33] | ||
[4, 32] |
We bound the size of the -coreset via discrepancy theory. Roughly speaking, we construct a coloring on our point set such that its discrepancy is small. Then, we drop the points colored and recursively construct the coloring on the points colored . Eventually, the remaining point set is the coreset we desire. A famous theorem in discrepancy theory is Banaszczyk's Theorem [5], which is the central piece of our proof. In constant-dimensional space, we carefully study the structure of the Gaussian kernel, which allows us to construct an -coreset of size . A summary of the best-known results is shown in Table 2.
1.3 Related Works
In computational geometry, an -approximation is the approximation of a general set by a smaller subset. Given a set and a collection of subsets of , a subset is called an -approximation if for all . The pair is called a set system (also known as a range space or hypergraph). One can rewrite the above guarantee as , where is the characteristic function of the set . If we replace this characteristic function by a kernel such as the Gaussian kernel, we obtain exactly our -coreset. There is a rich history on the construction of -approximations [13, 28]. One notable method is discrepancy theory, which is also our main technique. A wide range of techniques has been employed in this field. In the early 1980s, Beck devised the technique of partial coloring [9]; a later refinement of this technique, called the entropy method, was introduced by Spencer [39]. The entropy method was first used to prove the famous "six standard deviations" theorem: given a set system of points and subsets, there is a coloring of discrepancy at most . In contrast, a random coloring gives a discrepancy of . However, the entropy method is non-constructive; namely, it does not provide an efficient algorithm to find a coloring achieving such a bound. A more geometric example in discrepancy theory is Tusnády's problem. It asks, given a point set of size in , to assign a coloring to each point such that the discrepancy over all axis-parallel boxes in is minimized. One previous result on the -coreset problem is obtained by a reduction to Tusnády's problem.

On the topic of approximating KDE, the Fast Gauss Transform [19] is a method that preprocesses the input point set so that the computation of KDE at a query point is faster than the brute-force approach. The idea is to expand the Gaussian kernel in Hermite polynomials and truncate the expansion. Assuming that the data set lies inside a bounded region, the query time of this method is poly-logarithmic in in constant dimension . Also, Charikar and Siminelakis [11] studied the problem of designing a data structure that preprocesses the input to answer a KDE query in a faster time. They used locality-sensitive hashing to build their data structure. However, the guarantee they obtained is a relative error, while ours is an additive error. More precisely, given a point set , Charikar and Siminelakis designed a data structure such that, for any query , the algorithm answers the value within a -relative error. Also, the query time of their data structure is sublinear in .
2 Preliminary
Our approach to constructing a coreset relies on discrepancy theory, a technique also used for range-counting coresets [14, 30, 10]. We first introduce an equivalent problem (up to a constant factor) as follows.
Given a point set , what is the smallest value of over all , the set of all possible colorings from to ?
Now, one can intuitively view the equivalence in the following way. If we rewrite the objective into:
where is the set of points assigned , then we can apply the halving technique [14, 30], which is to recursively invoke the coloring algorithm and retain the set of points assigned until a subset of the desired size remains. Note that there is no guarantee that half of the points are assigned while the other half are assigned . However, this issue can be handled by a standard technique [28]; see our proof for details.
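The halving scheme above can be sketched as follows. The random-sign coloring here is only a placeholder for the low-discrepancy coloring the actual algorithm constructs, and the sizes are illustrative.

```python
import numpy as np

def halve_to_size(P, target, color):
    """Recursively apply a coloring chi : P -> {-1, +1} and keep only the
    points colored +1, until at most `target` points remain."""
    while len(P) > target:
        chi = color(P)
        kept = P[chi == 1]
        if len(kept) == 0:   # degenerate coloring; stop rather than loop forever
            break
        P = kept
    return P

rng = np.random.default_rng(1)
P = rng.normal(size=(1024, 2))
# placeholder coloring: random signs (the paper uses a low-discrepancy coloring)
S = halve_to_size(P, 64, lambda Q: rng.choice([-1, 1], size=len(Q)))
```

Each halving round roughly doubles the weight of every surviving point, and the discrepancy incurred per round accumulates into the final coreset error.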
We also introduce the following notation. Given a point set and a subset , we call a kernel range space, where . Next, given a kernel range space and a coloring , we define the discrepancy w.r.t. as
We finally define the discrepancy of a kernel range space as
where is the set of all possible colorings from to . Karnin and Liberty [23] also introduced a similar notation.
An important result in discrepancy theory is Banaszczyk's Theorem, which is stated as follows.
Theorem 1 (Banaszczyk's Theorem [5]).
Given a convex body of Gaussian measure at least and vectors of Euclidean norm at most , there is a coloring such that the vector . Here, is an absolute constant, and the Gaussian measure of a convex body is defined as .
The proof of this theorem relies on properties of convex geometry and recursively enumerates all possible colorings on vectors. Hence, the approach is non-constructive.
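For intuition on the Gaussian-measure condition in the theorem, the following sketch estimates the Gaussian measure of a convex body by Monte Carlo. The choice of a centered cube and its side length are illustrative assumptions.

```python
import numpy as np

# Monte Carlo estimate of the Gaussian measure of the cube K = [-s, s]^n,
# i.e. gamma_n(K) = Pr[g in K] for g ~ N(0, I_n).
# Banaszczyk's Theorem applies to convex bodies of Gaussian measure at least 1/2.
rng = np.random.default_rng(0)
n, s = 5, 2.0
g = rng.normal(size=(200_000, n))
measure = np.mean(np.all(np.abs(g) <= s, axis=1))  # roughly Pr[|Z| <= 2]^5, about 0.79
```

For this cube the measure is well above 1/2, so the theorem would apply; shrinking the side length quickly drops the measure below the threshold.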
Since then, there was no breakthrough on how to construct the coloring efficiently, except for some partial results on special cases [6, 8, 25]. Recently, however, Dadush et al. [16] showed that Banaszczyk's Theorem is equivalent to the following statement up to a constant factor in the constant .
Theorem 2 (Equivalent form of Banaszczyk's Theorem [16]).
Given vectors of Euclidean norm at most , there is a probability distribution on with the following guarantee: there are two absolute constants such that, for any unit vector and ,
Here, is the set of all possible colorings from to , and is the random variable , where . The guarantee in the above statement is also known as a sub-Gaussian tail. One key difference in Dadush et al.'s statement is that no convex body appears in it. The intuition behind the equivalence is that one can "approximate" any convex body by the intersection of a family of half-spaces. The question is then converted to how many half-spaces one needs for a good approximation of a convex body or, equivalently, how many events in the union bound one needs in Dadush et al.'s statement so that the failure probability is at most a constant. Dadush et al. devised an efficient algorithm to find such a coloring, but it is sub-optimal by an extra factor; namely, by scaling, the condition on the Euclidean norm of the input vectors is instead of . Later, Bansal et al. [7] efficiently constructed a coloring that satisfies the guarantee of Banaszczyk's Theorem. Therefore, Banaszczyk's Theorem is now constructive. Moreover, assuming , the running time is , where is the exponent of matrix multiplication.

The connection between Banaszczyk's Theorem and discrepancy theory can be seen in the paper of Matoušek et al. [29]. An important result in that paper is: given a -by- matrix and a -by- matrix , the discrepancy of the matrix , , is bounded by , where is the largest Euclidean norm of the rows of and is the largest Euclidean norm of the columns of . The proof sketch for this result is to apply Banaszczyk's Theorem: take the columns of as the input and use the union bound on the rows of .
Finally, we introduce a useful lemma, the Markov Brothers' Inequality.
Lemma 3 (Markov Brothers' Inequality [27]).
Let be a polynomial of degree . Then,
Here, is the derivative of .
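A quick numerical sanity check of this inequality: for a degree-n polynomial p on [-1, 1], max |p'| is at most n^2 times max |p|, with equality attained by the Chebyshev polynomial T_n at the endpoints. The degree and grid below are illustrative choices.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Check the Markov Brothers' inequality max|p'| <= n^2 * max|p| on [-1, 1]
# for the extremal example p = T_n (the degree-n Chebyshev polynomial).
n = 7
Tn = Chebyshev.basis(n)
xs = np.linspace(-1.0, 1.0, 100_001)     # fine grid including the endpoints
max_p = np.max(np.abs(Tn(xs)))           # = 1 for T_n
max_dp = np.max(np.abs(Tn.deriv()(xs)))  # = n^2 = 49, attained at x = +-1
```

In the proof below, this lemma is what controls the gradient of the truncated Taylor polynomial between grid points.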
3 Construction of -coreset
We first assume that our point set lies inside a -ball of radius . The following lemma shows that bounding the discrepancy on a finite set implies that the discrepancy over the entire space is also bounded. This finite set is a bounded grid whose width depends poly-logarithmically on . The advantage of doing this is to bound the number of events in the union bound when we apply Banaszczyk's Theorem.
Lemma 4.
Given . Let be a point set of size such that for all , and let be a coloring on . Then, we have
where with . Here, is an infinite lattice grid.
In other words, this lemma says that, given a coloring , .
Proof.
By translation, we may assume that all lie in and . We first use the Taylor expansion to obtain the following expression.
(1) |
Here, denotes the Kronecker product. Namely, for any two vectors and , is a -dimensional vector indexed by two integers for and such that . Also, for any and any integer , and .
Now, consider the polynomial . We first observe that is the same as expression (1) but truncated at the -th term. We analyze the term for any .
By the error bound for the Taylor expansion of the exponential function, we have , where . Then,
The last step is due to the fact that . Now, taking , which means , we have
(2) |
Let be .
If , then
since and therefore each term .
That means , by the triangle inequality.
We may now assume . We construct a uniform grid of width on . Namely, . Note that .
Let be the closest grid point in to such that all coordinates of are larger than . We have
for some on the segment between and . Here, is the gradient. Then,
where . The last step is due to the Markov Brothers' inequality (Lemma 3). By rearranging the terms and multiplying by , we have
Also, since, for each coordinate, .
In the last step, recall that . The left-hand side is essentially . We first analyze the term . Recall that .
which means . Therefore, we have
and then, from (2),
or
∎
We are still assuming that our point set lies inside a -ball of radius . Now, we can apply Banaszczyk's Theorem to construct a coloring that achieves low kernel discrepancy, . More importantly, the discrepancy we achieve depends logarithmically on and double-logarithmically on .
Lemma 5.
Given . Let be a point set of size such that for all , and let be a coloring on . Let be such that for any . Then, by taking as input, Banaszczyk's Theorem (Theorem 2) constructs a coloring on such that and with probability .
Proof.
Let be as in Lemma 4, where and . Also, denote by the -by- matrix such that for any . Since the Gaussian kernel is positive-definite, is positive-definite. Therefore, we can decompose as for some matrix . Denote by the columns of for any . Without loss of generality, we can assume that when . Note that .
By Banaszczyk's Theorem (Theorem 2), we construct a coloring such that, for any unit vector , we have
with probability . Note that the norm of is , which means the above statement still holds up to a constant factor. Denote by the vector that is zero except that its first coordinate is . Specifically, by taking , we have
for any and, by taking ,
with probability . We apply the union bound over being and for . Recall that . From Lemma 4, the size of is . We conclude that and with probability by taking .
∎
Note that one can always assume that . Otherwise, we can partition into groups such that each group is at least away from the others and construct the coloring for each group separately. This means that the above result is no worse than the previous result of [33].
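The factorization step in the proof of Lemma 5 can be sketched numerically: the Gaussian Gram matrix is positive-definite, so it factors as a product whose columns are the unit-norm vectors fed into Banaszczyk's Theorem. The point set and its scale below are illustrative; the points are spread out so that the Gram matrix is numerically well-conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 3.0 * rng.normal(size=(50, 3))   # spread points out for numerical stability

# Gaussian Gram matrix K_ij = exp(-||p_i - p_j||^2) is positive-definite,
# so K = B^T B for some matrix B. The columns b_i of B satisfy
# ||b_i||^2 = K_ii = 1, which is the norm condition Banaszczyk's Theorem needs.
K = np.exp(-((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))
B = np.linalg.cholesky(K).T          # K = B^T B
```

Any factorization works in principle; Cholesky is used here only because it is readily available.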
The following algorithm (Algorithm 1) is a Las Vegas algorithm that constructs a coloring on the input point set . We can now show how to construct a coloring whose discrepancy is low and independent of . Recall that is an infinite lattice grid. The idea of Algorithm 1 is first to decompose the entire into an infinite collection of -balls of radius . Then, we partition the input so that each point lies in some -ball. For each non-empty -ball, we run Banaszczyk's Theorem to construct a coloring with the desired discrepancy. Finally, we argue that any point can only be influenced by -balls, and therefore there is an extra factor of in the final discrepancy.
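The decomposition step of Algorithm 1 can be sketched as follows. Grid cells stand in for the ball decomposition, a random-sign coloring stands in for the Banaszczyk step, and the cell radius is an illustrative parameter.

```python
import numpy as np

def color_by_cells(P, radius, color_cell):
    """Partition P into grid cells of side `radius` (a proxy for the ball
    decomposition) and color each non-empty cell independently."""
    cells = {}
    for i, key in enumerate(map(tuple, np.floor(P / radius).astype(int))):
        cells.setdefault(key, []).append(i)
    chi = np.empty(len(P), dtype=int)
    for idx in cells.values():
        chi[idx] = color_cell(P[idx])   # per-cell low-discrepancy coloring
    return chi

rng = np.random.default_rng(0)
P = 10.0 * rng.normal(size=(300, 2))
chi = color_by_cells(P, 5.0, lambda Q: rng.choice([-1, 1], size=len(Q)))
```

Since the Gaussian kernel decays rapidly, a query point is only meaningfully affected by the few cells near it, which is what bounds the extra factor in the final discrepancy.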
Lemma 6.
Let be a point set of size , where is constant. Then, Algorithm 1 efficiently constructs a coloring on such that and half of the points in are colored .
Proof.
For any and , denote by the set . That is, is a -ball of radius centered at .
Let be . We argue that lies in for some . If , then since each term . Assume that for some , and denote by such a . Let be the closest point to . By the triangle inequality, we have .
Now, there are at most points of that lie inside . Let be the set of such . Then, we have
The second-to-last line is due to Lemma 5.
In line 8, suppose there are more than . Choose points assigned arbitrarily and flip them to so that the difference becomes zero. Denote and . Also, and are defined in the same way after flipping the values. For any ,