Coresets for Data Discretization and Sine Wave Fitting

03/06/2022
by   Alaa Maalouf, et al.

In the monitoring problem, the input is an unbounded stream P=p_1,p_2,⋯ of integers in [N]:={1,⋯,N} that are obtained from a sensor (such as GPS or the heartbeat of a human). The goal (e.g., for anomaly detection) is to approximate the n points received so far in P by a sine of a single frequency, i.e., to compute min_{c∈C} cost(P,c)+λ(c), where cost(P,c)=∑_{i=1}^n sin^2((2π/N) p_i c), C⊆[N] is a feasible set of solutions, and λ is a given regularization function. For any approximation error ε>0, we prove that every set P of n integers has a weighted subset S⊆P (sometimes called a core-set) of cardinality |S|∈ O(log(N)^O(1)) that approximates cost(P,c) (for every c∈[N]) up to a multiplicative factor of 1±ε. Using known coreset techniques, this implies streaming algorithms using only O((log(N)log(n))^O(1)) memory. Our results hold for a large family of functions. Experimental results and open source code are provided.


1 Introduction and Motivation

Anomaly detection is a step in data mining that aims to identify unexpected data points, events, and/or observations in data sets. For example, we are given an unbounded stream of numbers obtained from a heartbeat sensor attached to a human (e.g., a hospital patient), and the goal is to detect inconsistent spikes in the heartbeat. This is crucial for proper examination of patients as well as valid evaluation of their health. Such data forms a wave that can be approximated by a sine wave. Fitting a sine wave to a large dataset of this form (heart-wave signals) yields an approximation of the distribution from which the data was generated, which in turn aids the detection of outliers or anomalies.

Figure 1: Sine fitting. Given a set P of integers (blue points on the x-axis) and a sine wave (the red signal), the cost of the Sine fitting problem with respect to this input is the sum of the vertical distances between the points of P (on the x-axis) and the sine signal (the sum of the lengths of the green lines). The goal is to find the sine signal that minimizes this sum.

Formally speaking, the anomaly detection problem can be stated as follows. Given a large positive integer N and a set P of n integers in [N], the objective is to fit a sine signal such that the sum of the vertical distances between each p ∈ P (on the x-axis) and its corresponding point on the signal is minimized; see Figure 1. Hence, we aim to solve the following problem, which we call the Sine fitting problem:

min_{c∈C} cost(P,c) + λ(c),    (1)

where cost(P,c) = ∑_{p∈P} sin^2((2π/N) p c), C ⊆ [N] is the set of feasible solutions, and λ is a regularization function that puts constraints on the solution.
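To make the objective concrete, the following Python sketch evaluates the Sine fitting cost of (1) for a single query and finds the best query by exhaustive search over C = [N]. It is a minimal illustration, not the algorithm proposed in this paper; the function names and the default (zero) regularization are our own assumptions.

import numpy as np

def sine_cost(P, c, N):
    # cost(P, c) = sum over p in P of sin^2((2*pi/N) * p * c)
    P = np.asarray(P, dtype=np.float64)
    return float(np.sum(np.sin(2.0 * np.pi * P * c / N) ** 2))

def fit_sine(P, N, lam=lambda c: 0.0):
    # Brute-force minimization of (1) over the feasible set C = [N] = {1, ..., N}.
    costs = [sine_cost(P, c, N) + lam(c) for c in range(1, N + 1)]
    return int(np.argmin(costs)) + 1  # the best frequency c

The exhaustive search takes O(nN) time, which is exactly the cost that the coreset and the distributed variant discussed later aim to reduce.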

The generalized form of the fitting problem above was first addressed by souders1994ieee and later generalized by ramos2008new, and several implementations have been suggested over the years (da2003new; chen2015improved; renczes2016efficient; renczes2021computationally).

In addition, the Sine fitting problem and its variants have gained attention in recent years for solving various problems, e.g., estimating the phase shift between two signals with very high accuracy (queiros2010cross), characterizing data acquisition channels and analog-to-digital converters (pintelon1996improved), and high-accuracy sampling measurements of complex voltage ratios of sinusoidal signals (augustyn2018improved).

Data discretization. In many applications, we aim to find a proper choice of floating-point grid. For example, we may be given points encoded with a large number of bits and wish to represent them on a coarser floating-point grid. A naive way to do so is to simply remove the most/least significant bits from each point. However, such an approach discards most of the underlying structure that these points form, which in turn leads to unnecessary data loss. Instead, arithmetic modulo or sine functions that incorporate cyclic properties are used, e.g., naumov2018periodic; nagel2020up; gholami2021survey. Such functions aim to retain as much information as possible when information loss is inevitable. This task serves well in the field of quantization (gholami2021survey), which is an active sub-field of deep learning.

To solve this problem, we first find the sine wave that fits the input data using the cost function in (1). Then each point in the input data is projected to its nearest point in the set of roots of the signal obtained from the sine fitting step; see Figure 2.
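The discretization step can be sketched in a few lines (a minimal illustration assuming the fitted frequency c was already computed, e.g., by the hypothetical fit_sine above): the roots of sin((2π/N)·x·c) are the points k·N/(2c) for integer k, so each input point is snapped to the nearest such root.

import numpy as np

def discretize(P, c, N):
    # The roots of sin((2*pi/N) * x * c) form the grid {k * N / (2c) : k integer};
    # project every input point onto its nearest root.
    P = np.asarray(P, dtype=np.float64)
    step = N / (2.0 * c)  # distance between consecutive roots
    return np.round(P / step) * step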

All of the applications above, e.g., monitoring, anomaly detection, and data discretization, reduce to an instance of the Sine fitting problem. Although these problems are important, solving them on large-scale data is not an easy task due to bounded computational power and memory. In addition, in the streaming (or distributed) setting, where points are received via a stream of data, fitting such functions requires new algorithms. To handle these challenges, we can use coresets.

1.1 Coresets

Coresets were first suggested as a data summarization technique in the context of computational geometry (agarwal2004approximating), and have received increasing attention over recent years (broder2014scalable; nearconvex; huang2021novel; cohen2021improving; huang2020coresets; mirzasoleiman2020coresets); for extensive surveys on coresets, we refer the reader to (feldman2020core; phillips2016coresets), and to jubran2019introduction; maalouf2021introduction for an introduction.

Informally speaking, a coreset is (usually) a small weighted subset of the original input set of points that approximates its loss for every feasible query c, up to a provable multiplicative error of 1±ε, where ε ∈ (0,1) is a given error parameter. Usually the goal is to have a coreset whose size is independent of, or near-logarithmic in, the size of the input (number of points), in order to be able to store data of the same structure (as the input) using small memory, and to obtain faster solutions (approximations) by running solvers on the coreset instead of on the original data. Furthermore, the accuracy of existing (fast) heuristics can be improved by running them many times on the coreset in the time it takes for a single run on the original (big) dataset. Finally, since coresets are designed to approximate the cost of every feasible query, they can be used to solve constrained optimization problems and to support streaming and distributed models; see details and more advantages of coresets in (feldman2020core).

In recent years, coresets have been applied to improve many algorithms from different fields, e.g., logistic regression (huggins2016coresets; munteanu2018coresets; karnin2019discrepancy; nearconvex), matrix approximation (feldman2013turning; maalouf2019fast; feldman2010coresets; sarlos2006improved; maalouf2021coresets), decision trees (jubran2021coresets), clustering (feldman2011scalable; gu2012coreset; lucic2015strong; bachem2018one; jubran2020sets; schmidt2019fair), ℓ_p-regression (cohen2015lp; dasgupta2009sampling; sohler2011subspace), SVM (har2007maximum; tsang2006generalized; tsang2005core; tsang2005very; tukan2021coresets), deep learning models (baykal2018data; maalouf2021unified; liebenwein2019provable; mussay2021data), etc.

Sensitivity sampling framework. A unified framework for computing coresets for a wide family of problems was suggested in (braverman2016new). It is based on non-uniform sampling, specifically sensitivity sampling. Intuitively, the sensitivity of a point p from the input set P is a number s(p) that corresponds to the importance of this point with respect to the other points and the specific cost function that we wish to approximate; see formal details in Theorem 2. The main purpose of defining sensitivities is that, with high probability, a non-uniform sample from P based on these sensitivities yields a coreset, where each point p is sampled i.i.d. with probability proportional to s(p) and assigned a (multiplicative) weight that is inversely proportional to s(p). The size of the coreset is then proportional to (i) the total sum of these sensitivities t = ∑_{p∈P} s(p), and (ii) the VC dimension of the problem at hand, which is (intuitively) a complexity measure. In recent years, many classical and hard machine learning problems (braverman2016new; sohler2018strong; maalouf2020tight) have been shown to have a total sensitivity (and VC dimension) that is near-logarithmic in, or even independent of, the input size n.
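As a concrete (but naive) illustration of this framework for the Sine fitting cost, the following sketch computes, for every point, its sensitivity by enumerating all queries c ∈ [N]. This takes O(nN) time and is given only to make the notion tangible; the function name is an assumption of ours, not the paper's code, and Section 4 discusses a distributed variant.

import numpy as np

def sine_sensitivities(P, N):
    # s(p) = max over c in [N] of sin^2((2*pi/N)*p*c) / sum over q in P of sin^2((2*pi/N)*q*c)
    P = np.asarray(P, dtype=np.float64)
    s = np.zeros(len(P))
    for c in range(1, N + 1):
        contrib = np.sin(2.0 * np.pi * P * c / N) ** 2  # per-point cost at query c
        total = contrib.sum()
        if total > 0:  # skip degenerate queries where the whole cost vanishes
            s = np.maximum(s, contrib / total)
    return s  # the total sensitivity is s.sum()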

Figure 2: Discretization. Given a set P of points (blue points), we find a sine wave (red signal) that fits the input data. Then each input point is projected to its nearest point in the set of roots of the signal.

1.2 Our Contribution

We summarize our contribution as follows.

  1. Theoretically, we prove that for every integer N ≥ 1 and every set P ⊆ [N] of n integers:

    1. The total sensitivity with respect to the Sine fitting problem is bounded by a term that is polylogarithmic in N, and the VC dimension is bounded by a term that is logarithmic in N; see Theorem 3.1 and Claim 3.2, respectively.

    2. For any approximation error ε ∈ (0,1), there exists a coreset whose size is polylogarithmic in N (see Theorem 3 for full details; the bound hides terms related to ε, the approximation factor, and δ, the probability of failure) with respect to the Sine fitting optimization problem.

  2. Experimental results on real world datasets and open source code (opencode) are provided.

2 Preliminaries

In this section we first give our notations that will be used throughout the paper. We then define the sensitivity of a point in the context of the Sine fitting problem (see Definition 2), and formally write how it can be used to construct a coreset (see Theorem 2). Finally we state the main goal of the paper.

Notations. Let ℤ₊ denote the set of all positive integers, let [n] := {1, ⋯, n} for every n ∈ ℤ₊, and for every x ∈ ℝ denote the rounding of x to its nearest integer by ⌊x⌉ (e.g., ⌊3.2⌉ = 3).

We now formally define the sensitivity of a point in the context of the Sine fitting problem.

[Sine fitting sensitivity] Let N be a positive integer, and let P ⊆ [N] be a set of n integers. For every p ∈ P, the sensitivity of p is defined as s(p) = max_{c∈[N]} sin^2((2π/N) p c) / ∑_{q∈P} sin^2((2π/N) q c).

The following theorem formally describes how to construct an -coreset via the sensitivity framework. We restate it from braverman2016new and modify it to be specific for our cost function.

Let N be a positive integer, and let P ⊆ [N] be a set of n integers. Let s : P → [0,1] be a function such that s(p) is an upper bound on the sensitivity of p (see Definition 2), and let t = ∑_{p∈P} s(p). Let d be the VC dimension of the Sine fitting problem; see Definition 3.2. Let ε, δ ∈ (0,1), and let S be a random sample of sufficiently many (on the order of (t/ε^2)(d log t + log(1/δ))) i.i.d. points from P, where every p ∈ P is sampled with probability s(p)/t. Let w(p) = t/(s(p)|S|) for every p ∈ S. Then, with probability at least 1−δ, for every c ∈ [N] we have |cost(P,c) − ∑_{p∈S} w(p) sin^2((2π/N) p c)| ≤ ε · cost(P,c).
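A minimal sketch of the sampling scheme in the theorem above (the sample size m is left as an input rather than computed from the stated bound, and all names are illustrative assumptions of ours):

import numpy as np

def sensitivity_coreset(P, s, m, seed=0):
    # Sample m points i.i.d. with probability s(p)/t and weight them by t/(m*s(p)).
    rng = np.random.default_rng(seed)
    P = np.asarray(P)
    s = np.asarray(s, dtype=np.float64)
    t = s.sum()  # total sensitivity
    idx = rng.choice(len(P), size=m, replace=True, p=s / t)
    return P[idx], t / (m * s[idx])

def coreset_cost(S, w, c, N):
    # Weighted coreset cost: sum over p in S of w(p) * sin^2((2*pi/N) * p * c)
    S = np.asarray(S, dtype=np.float64)
    return float(np.sum(w * np.sin(2.0 * np.pi * S * c / N) ** 2))

The theorem then states that, with probability at least 1−δ, coreset_cost approximates the full cost up to a factor of 1±ε simultaneously for every query c ∈ [N].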

Problem statement. Theorem 2 raises the following question: can we bound the total sensitivity and the VC dimension of the Sine fitting problem in order to obtain small coresets?

Note that the emphasis of this work is on the size of the coreset (the required memory) needed to approximate the Sine fitting cost function.

3 Coreset for Sine Fitting

In this section we state and prove our main result. For brevity purposes, some proofs of the technical results have been omitted from this manuscript; we refer the reader to the supplementary material for these proofs.

Note that since the regularization function λ in (1) does not depend on the input set P, a multiplicative approximation of the term cost(P,c) in (1) yields a multiplicative approximation of the whole objective in (1).

The following theorem summarizes our main result. [Main result: coreset for the Sine fitting problem] Let N be a positive integer, let P ⊆ [N] be a set of n integers, and let ε, δ ∈ (0,1). Then, we can compute a pair (S, w), where S ⊆ P and w : S → [0,∞), such that

  1. the size of S is polylogarithmic in N and logarithmic in 1/δ, and

  2. with probability at least 1−δ, for every c ∈ [N], |cost(P,c) − ∑_{p∈S} w(p) sin^2((2π/N) p c)| ≤ ε · cost(P,c).

To prove Theorem 3, we need to bound the total sensitivity (as done in Section 3.1) and the VC dimension (see Section 3.2) of the Sine fitting problem.

3.1 Bound On The Total Sensitivity

In this section we show that the total sensitivity of the Sine fitting problem is small and bounded. Formally speaking, let N be a positive integer, let P ⊆ [N] be a set of n integers, and for every p ∈ P let s(p) be its sensitivity as in Definition 2. Then the total sensitivity ∑_{p∈P} s(p) is bounded by a term that is polylogarithmic in N.

We prove Theorem 3.1 by combining multiple claims and lemmas. We first state the following as a tool to use the cyclic property of the sine function.

Let be a pair of positive integers. Then for every ,
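Under our reading of Claim 3.1 (stated here as an assumption), the cyclic property used throughout is that the contribution of a point p at a query c depends only on pc mod N:

sin((2π/N)·pc) = sin((2π/N)·(N⌊pc/N⌋ + (pc mod N))) = sin(2π⌊pc/N⌋ + (2π/N)·(pc mod N)) = sin((2π/N)·(pc mod N)),

and therefore sin^2((2π/N)·pc) = sin^2((2π/N)·(pc mod N)) for all positive integers p, c, and N.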

We now proceed to prove that one does not need to go over all the possible integers in [N] to compute a bound on the sensitivity of each p ∈ P; a smaller, compact subset of [N] is sufficient.

Let be a set of integer points. For every , let

Then for every ,

(2)
Proof.

Put , and let be an integer that maximizes the left hand side of (2) with respect to , i.e.,

(3)

If the claim trivially holds. Otherwise, we have , and we prove the claim using a case analysis over two cases. Case 1: Let be an integer, and let . We first observe that

(4)
where the second equality holds by properties of the ceiling function. We further observe that,
(5)
where the first equality holds by expanding using (4), and the last inclusion holds by the assumption of Case 1. Since is entirely included in , it holds that . Similarly, one can show that , which means that . We now proceed to show that the sensitivity can be bounded using some point in . Since for every , , it holds that
(6)
where the first equality holds by plugging , and into Claim 3.1, the first inequality holds since , the second equality holds by multiplying and dividing by , the second inequality follows from combining the fact that which is derived from (5) and the observation that for every , and the last equality holds by plugging , and into Claim 3.1.
In addition, it holds that for every
(7)
where the first inequality holds by combining the assumption of Case 1 and the observation that for every , the first equality holds by multiplying and dividing by , the second inequality holds by combining (5) with the observation that for every where in this context , and finally the last equality holds by plugging , , and into Claim 3.1.
Combining (3), (5), (6) and (7) yields that
where the last equality follows from combining (3) and .
Case 2: Let , and note that . For every ,
We observe that
Hence, the proof of Claim 3.1 in Case 2 follows by replacing with in Case 1. ∎

In what follows, we show that the sensitivity of each point is bounded from above by a term that is polylogarithmic in N and inversely proportional to the number of points that are not far from p in terms of arithmetic modulo.

Let be as in Lemma 3.1 for every , and let

(8)

Then for every ,

Proof.

Put , , and let .

First we observe that for every such that , it is implied that . By the cyclic property of , it holds that .

Combining the above with the fact that , yields that

where the second inequality follows from , and the last derivation holds since which follows from (8). ∎

The bound on the sensitivity of each point (from Lemma 3.1) still requires us to go over all possible queries in [N] to obtain the closest points in P to p. Instead of bounding the sensitivity of each point by a term that does not require evaluation over every query in [N], we bound the total sensitivity by a term that is independent of such an evaluation for every p ∈ P. This is done by reducing the problem to an instance of bounding the expected size of an independent set of vertices in a graph (see Claim 3.1). First, we use the following claim to obtain an independent set of size polylogarithmic in the number of vertices in any given directed graph.

Let be a directed graph with vertices. Let denote the out degree of the th vertex, for . Then there is an independent set of vertices in such that

Proof.

Let denote the set of vertices of , and let denote the set of edges of . Partition the vertices of into induced sub-graphs, where each vertex in the th sub-graph has out degree in for any non-negative integer . Let denote the sub-graph with the largest number of vertices. Pick a random sample of nodes from . The expected number of edges in the induced sub-graph of by is bounded by . Let

By Markov's inequality, with probability at least we have . Assume that this event indeed holds. Hence, the sub-graph of that is induced by is an independent set of with nodes. Since we have . Hence, . By the pigeonhole principle, there is such that can be bounded from below by . ∎

There is a set such that for every , and

Proof.

Let be defined as in the proof of Lemma 3.1. For every , let such that

Let denote the directed graph whose vertices are the integers in , and whose edges are

(9)

The out-degree of a vertex of is . By Claim 3.1, there is an independent set of such that . Since is an independent set, for every we have . ∎

The following claim serves to bound the size of the independent set, i.e., for every point in the set, . Let such that for every . Then

Finally, combining Lemma 3.1, Lemma 3.1, Claim 3.1, and Claim 3.1 yields Theorem 3.1, which presents a bound on the total sensitivity with respect to the cost function.

3.2 Bound on The VC Dimension

First, we define the VC dimension with respect to the Sine fitting problem. [VC-dimension (braverman2016new)] Let N be a positive integer, let P ⊆ [N] be a set of integers, and let , we define

for every and . The VC dimension of the Sine fitting problem is the size of the largest subset such that

[Bound on the VC dimension of the Sine fitting problem] Let be a pair of positive integers such that , and let be a set of points. Then the VC dimension of the Sine fitting problem with respect to and is .

Proof.

We note that the VC-dimension of the set of classifiers that output the sign of a sine wave parametrized by a single parameter (the angular frequency of the sine wave) is infinite. However, since our query space is bounded, i.e., every query is an integer in the range [N], the VC dimension is bounded as follows. First, let for every , and . We observe that for every and , . Hence, for every and it holds that where is defined as in Definition 3.2. Secondly, by the definition of , we have that for every pair of and where ,

This yields that for any , which consequently means that since is an integer, and each such would create a different set of subsets of . Thus we get that :

The claim then follows since the above inequality states that the VC dimension is bounded from above by . ∎
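One standard way to see such a logarithmic bound (a sketch under the assumption that the ranges in Definition 3.2 are parametrized by an integer query c ∈ [N] and a real threshold r; the details may differ from the argument above): if a subset G ⊆ P of size d is shattered, then each of its 2^d subsets must be realized as {p ∈ G : sin^2((2π/N)·pc) ≥ r} for some c ∈ [N] and r ≥ 0. For a fixed c, varying r yields at most |G|+1 distinct subsets of G (they are nested according to the sorted values of sin^2((2π/N)·pc)), and there are at most N choices of c. Hence 2^d ≤ N·(d+1), which gives d ∈ O(log N).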

4 Remarks and Extensions

In this section we briefly discuss several remarks and extensions of our work.

Parallel implementation. Computing the sensitivities of the n input points requires O(nN) time: we compute cost(P,c) for every query c ∈ [N], and then bound the sensitivity of every p ∈ P by iterating over all queries in [N] and taking the one that maximizes its term. In practice, this can be improved by a distributed algorithm. Notably, the cost of every query can be computed independently of all other queries in [N]; similarly, once the cost of every query has been computed, the sensitivity of each point can be computed independently of all other points. Algorithm 1 utilises these observations: it receives as input an integer N that indicates the query-set range, a set P, and an integer indicating the number of machines available for the computation. Algorithm 1 outputs a function that maps every p ∈ P to its sensitivity.

Input : An integer N, a set P ⊆ [N] of integers, and an integer indicating the number of available machines.
Output : A function s, where s(p) is the sensitivity of p for every p ∈ P.
1       a partition of [N] into disjoint subsets of queries, each containing at most  integers from [N]. {In some cases, the last set might be empty.} a partition of P into disjoint subsets of points, each containing at most  integers from P. {In some cases, the last set might be empty.} for every subset of queries, in a distributed manner do
2       for every  do
3             Set
4 for every subset of points, in a distributed manner do
5       for every  do
6             Set
return
Algorithm 1
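A minimal Python sketch of the two distributed passes that Algorithm 1 describes (our interpretation; multiprocessing stands in for the available machines, and all names are illustrative assumptions rather than the authors' code): the first pass computes the per-query costs over disjoint chunks of [N], and the second pass computes the per-point sensitivities over disjoint chunks of P.

import numpy as np
from multiprocessing import Pool

def _query_costs(args):
    # cost(P, c) for every query c in the given chunk of [N]
    P, N, queries = args
    return [float(np.sum(np.sin(2.0 * np.pi * P * c / N) ** 2)) for c in queries]

def _point_sensitivities(args):
    # sensitivity of every point in the given chunk of P, given all query costs
    points, N, costs = args  # costs[c-1] = cost(P, c)
    safe = np.where(costs > 1e-12, costs, np.inf)  # ignore degenerate queries
    out = []
    for p in points:
        contrib = np.sin(2.0 * np.pi * p * np.arange(1, N + 1) / N) ** 2
        out.append(float(np.max(contrib / safe)))
    return out

def distributed_sensitivities(P, N, machines=4):
    P = np.asarray(P, dtype=np.float64)
    with Pool(machines) as pool:
        # pass 1: per-query costs over disjoint chunks of [N]
        q_chunks = np.array_split(np.arange(1, N + 1), machines)
        costs = np.concatenate(pool.map(_query_costs, [(P, N, q) for q in q_chunks]))
        # pass 2: per-point sensitivities over disjoint chunks of P
        p_chunks = np.array_split(P, machines)
        sens = pool.map(_point_sensitivities, [(pts, N, costs) for pts in p_chunks])
    return np.concatenate(sens)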

Extension to high dimensional data.

Our results can be easily extended to the case where (i) the points (of P) lie on a polynomial grid of a given resolution and of any dimension, and (ii) they are further assumed to be contained inside a ball of a given radius. Note that such assumptions are common in the coreset literature, e.g., coresets for projective clustering (edwards2005no), the ReLU function (mussay2021data), and logistic regression (tolochinsky2018generic). The analysis with respect to the sensitivity can be directly extended, and the VC dimension can be bounded accordingly; both claims are detailed in Section B of the appendix.

Approximating the optimal solution via coresets. Let N be an integer, ε ∈ (0,1), and let (S, w) be a coreset for P as in Theorem 2. Let c* and c̃ be the optimal solutions on the input and on its coreset, respectively. Then cost(P, c̃) ≤ ((1+ε)/(1−ε)) · cost(P, c*).
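A minimal sketch of this remark, reusing the helpers sketched earlier (fit on the coreset, then evaluate the recovered query on the full input; the names are our own assumptions):

def solve_via_coreset(P, N, s, m):
    # Fit the sine on a sensitivity-sampled coreset and evaluate the recovered
    # query on the full data (uses sine_cost, sensitivity_coreset, coreset_cost above).
    S, w = sensitivity_coreset(P, s, m)
    c_tilde = min(range(1, N + 1), key=lambda c: coreset_cost(S, w, c, N))
    return c_tilde, sine_cost(P, c_tilde, N)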

5 Experimental Results

Figure 3: Optimal solution approximation error: The x axis is the size of the chosen subset, the y axis is the optimal solution approximation error. Datasets, from left to right, (i)-(1), (i)-(2), and (iii).
Figure 4: Maximum approximation error: The x axis is the size of the chosen subset, the y axis is the maximum approximation error across the whole set of queries. Datasets, from left to right, (i)-(1), (i)-(2), and (iii).

In what follows we evaluate our coreset against uniform sampling on real-world datasets.

Software/Hardware. Our algorithms were implemented in Python 3.6 (10.5555/1593511) using Numpy (oliphant2006guide). Tests were performed on GHz i-U ( cores total) machine with GB RAM.

5.1 Datasets And Applications

  (i) Air Quality Data Set (de2008field), which contains instances of hourly averaged responses from an array of metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. We used two attributes (each as a separate dataset) of hourly averaged measurements of (1) tungsten oxide, labeled by (i)-(1) in the figures, and (2) NO2 concentration, labeled by (i)-(2). Fitting the sine function on each of these attributes aids in understanding their underlying structure over time. This helps us find anomalies that are far enough from the fitted sine function; finding anomalies in this context could indicate a leakage of toxic gases. Hence, our aim is to monitor their behavior over time, while using low memory to store the data.

  (ii) Single Neuron Recordings (singlenueron), acquired from a cat's auditory-nerve fiber. The dataset has samples, and the goal of Sine fitting with respect to such data is to infer cyclic properties of neuron signals, which will aid in further understanding of the wave of a single neuron and its structure.

  (iii) Dog Heart recordings of heart ECG (dogHeart). The dataset has samples. We used the absolute values of each of the points corresponding to the "electrocardiogram" feature, which refers to the ECG wave of the dog's heart. The goal of Sine fitting on such data is to obtain the distribution of the heart-beat rates. This aids in detecting spikes, which could indicate health problems related to the dog's heart.

5.2 Reported Results

Approximation error. We iterate over different sample sizes, where at each sample size we generate two coresets: the first using uniform sampling and the second using sensitivity sampling. For every such coreset, we compute and report the following.

  (i) The optimal solution approximation error, i.e., we find the query that minimizes the cost on the coreset, and set the approximation error to the relative difference between its cost on the full data and the cost of the optimal query on the full data; see Figure 3.

  (ii) The maximum approximation error of the coreset over all queries in the query set, i.e., the largest relative deviation between the cost on the full data and the weighted cost on the coreset over every query in [N]; see Figure 4. Both quantities can be computed as in the sketch following this list.
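Both reported quantities can be computed with a few lines on top of the helpers sketched earlier (a minimal illustration; the exact normalization used in the paper's plots is an assumption on our part):

def optimal_solution_error(P, S, w, N):
    # relative gap between the cost of the coreset's minimizer and the true optimum,
    # both evaluated on the full data
    c_star = min(range(1, N + 1), key=lambda c: sine_cost(P, c, N))
    c_tilde = min(range(1, N + 1), key=lambda c: coreset_cost(S, w, c, N))
    opt = sine_cost(P, c_star, N)
    return abs(sine_cost(P, c_tilde, N) - opt) / max(opt, 1e-12)  # guard a zero optimum

def max_query_error(P, S, w, N):
    # worst-case relative deviation of the weighted coreset cost over all queries
    errs = [abs(sine_cost(P, c, N) - coreset_cost(S, w, c, N)) / sine_cost(P, c, N)
            for c in range(1, N + 1) if sine_cost(P, c, N) > 0]
    return max(errs)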

The results were averaged across trials. As can be seen in Figures 3 and 4, the coreset in this context (for the applications described in Section 5.1) encapsulates the structure of the dataset and approximates its behavior. Our coreset obtained consistently smaller approximation errors than uniform sampling in almost all experiments. Observe that our advantage on Dataset (iii) is much more significant than on the others, as this dataset admits a clear periodic underlying structure. Note that, in some cases, the coreset is able to capture the entirety of the underlying structure at small sample sizes much better than uniform sampling, due to its sensitivity-based sampling. This means that the optimal solution approximation error in practice can be zero; see the rightmost plot in Figure 3.

Figure 5: Sine fitting cost as a function of the given query. Dataset (ii) was used.
Figure 6: Sine fitting cost as a function of the given query. Dataset (iii) was used.

Approximating the Sine function's shape and the probability density function of the costs. In this experiment, we visualize the Sine fitting cost as in (1) on the entire dataset over every query in [N], as well as on our coreset. As depicted in Figures 5 and 6, the larger the coreset size, the smaller the deviation between the two functions. This shows that, in the context of Sine fitting, the coreset succeeds in retaining the structure of the data up to a provable approximation. In addition, due to the nature of our coreset construction scheme, we expect the distribution of the costs to be approximated as well; this can also be seen in Figures 5 and 6. Specifically, when the coreset size is small, the deviation (i.e., approximation error) between the cost of (1) on the coreset and the cost of (1) on the whole data is large (theoretically and practically) with respect to any query in [N]. As the coreset size increases, the approximation error decreases, as expected from the theory. This phenomenon is observed throughout our experiments, and is specifically visualized in Figures 5 and 6, where one can see that the alignment between the probability density functions with respect to the coreset and the whole data increases with the coreset size. Note that we used only a subset of the points from Dataset (iii) to generate the results presented in Figure 6.
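The visual comparison described above can be reproduced with a short script (a sketch assuming matplotlib and the helpers defined earlier; not the authors' plotting code):

import matplotlib.pyplot as plt

def plot_cost_curves(P, S, w, N):
    # cost of (1) on the full data and the weighted cost on the coreset, per query
    queries = range(1, N + 1)
    full = [sine_cost(P, c, N) for c in queries]
    core = [coreset_cost(S, w, c, N) for c in queries]
    plt.plot(queries, full, label="full data")
    plt.plot(queries, core, label="coreset (weighted)")
    plt.xlabel("query c")
    plt.ylabel("Sine fitting cost")
    plt.legend()
    plt.show()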

6 Conclusion, Novelty, and Future Work

Conclusion. In this paper, we proved that for every integer N ≥ 1 and a set P ⊆ [N] of n integers, we can compute a coreset of size polylogarithmic in N for the Sine fitting problem as in (1). Such a coreset approximates the Sine fitting cost for every query c ∈ [N] up to a multiplicative factor of 1±ε, allowing us to support streaming and distributed models. Furthermore, this result allows us to gain all the benefits of coresets (as explained in Section 1.1) while simultaneously maintaining the underlying structure that the input points form, as we showed in our experimental results.

Novelty. The proofs are novel in the sense that the techniques used are drawn from different fields that were not previously leveraged in the context of coresets, e.g., graph theory and trigonometry. Furthermore, to our knowledge, our paper is the first to use sensitivity to obtain a coreset for problems where the involved cost function is trigonometric, and more generally for functions with cyclic properties. We hope that it will help open the door for more coresets in this field.

Future work includes (i) suggesting a coreset for high dimensional input, (ii) computing and proving a lower bound on the time it takes to compute the coreset, (iii) extending our coreset construction to a generalized form of cost functions as in (souders1994ieee; ramos2008new), and (iv) discussing the applicability of such coresets in a larger context such as quantization (hong2022daq; zhou2018adaptive; park2017weighted) of deep neural networks, while merging it with other compression techniques such as pruning (liebenwein2019provable; baykal2018data) and low-rank decomposition (tukan2021no; maalouf2020deep; liebenwein2021compressing), and/or using it as a preprocessing step for other coreset construction algorithms that require discretization constraints on the input, e.g., (varadarajan2012near).

7 Acknowledgements

This work was partially supported by the Israel National Cyber Directorate via the BIU Center for Applied Research in Cyber Security.

References


Appendix A Proof of Technical Results

A.1 Proof of Claim 3.1

Proof.

Put and observe that

(10)

Thus,

(11)

where the first equality holds by (10).

Using trigonometric identities, we obtain that

(12)

Since , we have that

and

By combining the previous equalities with (11) and (12), Claim 3.1 follows. ∎

A.2 Proof of Claim 3.1

Proof.

Assume by contradiction that , and let be a subset of integers from . Since for every , we have

Observe that (i) the set has different subsets, and hence distinct pairs of subsets, and (ii) for any we have that . By (i), (ii), and the pigeonhole principle, there are two distinct sets such that

Put , and observe that

Therefore for every ,

(13)

Since by the assumption of the claim, there is such that for every (where ), either , or . Handling Case 1. Assume that this case holds; then by (13) we obtain that

This contradicts the assumption that . Handling Case 2. Combining the assumption of this case with (13), yields that
This is a contradiction to the assumption that . ∎

Appendix B Extension to High Dimensional Data

In this section we formally discuss the generalization of our results to constructing coresets for sine fitting of rational high dimensional data. First note that in such (high dimensional) settings, the objective of the Sine fitting problem becomes

where is the set of high dimensional input points and is the set of queries. Note that we still assume that both sets are finite and lie on a grid of resolution ; see the next paragraph for more details.

Assumptions.

To ensure the existence of coresets for the generalized form, we first generalize the assumptions of our results as follows: (i) the original set of queries is now generalized to be the set of all points with non-negative coordinates and of resolution . Formally speaking, let be a rational number that denotes the resolution, and let