Ranking and benchmarking framework for sampling algorithms on synthetic data streams

06/17/2020 ∙ by József Dániel Gáspár, et al.

In the fields of big data, AI, and stream processing, we work with large amounts of data from multiple sources. Due to memory and network limitations, we process data streams on distributed systems to alleviate computational and network loads. When data streams with non-uniform distributions are processed, we often observe overloaded partitions due to the use of simple hash partitioning. To tackle this imbalance, we can use dynamic partitioning algorithms that require a sampling algorithm to precisely estimate the underlying distribution of the data stream. There is no standardized way to test these algorithms. We offer an extensible ranking framework with benchmark and hyperparameter optimization capabilities and supply it with a data generator that can handle concept drifts. Our work includes a generator for dynamic micro-bursts that can be applied to any data stream. We provide algorithms that react to concept drifts and compare them against state-of-the-art algorithms using our framework.


1 Introduction

The number of computers is growing at an unprecedented rate, mainly due to the inclusion of microchips in everyday household items, embedded devices, and the widespread availability of personal computers and mobile devices. Our world is thus becoming more and more interconnected, creating vast networks of devices that produce data at a large scale.

We have to process this constantly generated data in a way that satisfies consumer needs, such as zero downtime and low latency. Algorithms over the incoming data stream cannot be computed locally on one central server due to the bitrate of the stream [Lam2012]; furthermore, requests can come in from all over the world. Therefore, using a distributed system is necessary to provide the power to meet these demands [Leonard1985]; however, partitioning our computation raises interesting challenges.

Partitioning is often done with simple hash functions [spark, nasir2015], which do not take the weights of the keys into account. This works well only when the incoming data stream follows a near-uniform distribution [nasir2015].

However, requests over the internet do not follow a uniform distribution; instead, they follow Zipf's law and can therefore be fitted with a Zipfian distribution [Adamic2002]. This means that the majority of internet traffic may be attributed to a small portion of the user base.
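For illustration, a minimal Python sketch of a Zipfian probability mass function over a finite key set (the key count and exponent below are arbitrary example values, not parameters taken from our tests):

def zipfian_pmf(num_keys, exponent):
    """Probability of rank k under a Zipfian distribution: p(k) is proportional to 1 / k**exponent."""
    weights = [1.0 / k ** exponent for k in range(1, num_keys + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# With exponent 1.0 and 5 keys, the most frequent key already accounts for about 44% of requests.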

Changing or non-uniform data distributions can cause overloaded partitions, and manually adjusting partitioners is infeasible due to the amount and frequency of incoming data [Zliobaite2012]. Unexpected and high loads in a distributed system can cause downtime, which entails significant losses in revenue. Examples include Amazon’s and Lowe’s downtime during Black Friday [business_insider_2018, cnn_business_2017].

At its core, partitioning is a special case of the NP-complete Bin Packing problem, the Multiprocessor Scheduling Problem [garey1979]. A trivial polynomial-time solution exists only when all tasks have the same length [garey1979].

For non-trivial cases, the problem requires knowledge of every key's weight, which is hard, if not nearly impossible, to obtain in a distributed system.

However, the problem can be alleviated by using the knowledge that the data follows Zipf's law: there exists a cutoff below which the incoming requests are so sparse that they will not influence any partitioning algorithm significantly. It therefore becomes possible to count only a subset of the elements with the highest frequencies, known as the frequent elements problem. Some of the earlier works researching the frequent elements problem were done by Alon et al. [Alon1999], Henzinger et al. [henzinger1998], and Charikar et al. [charikar2002].

Algorithms that can precisely and cheaply estimate the frequencies of such elements are essential. Stalling the data stream with heavy computations is infeasible, so solutions that can process each element of the stream in constant time, query the most frequent items in sublinear time, and use sublinear space are well sought after [Manku2002, metwally2005, demaine2002, misra1982, charikar2002, Cormode2005, Golab2003].

A major problem in data streaming is a changing distribution, also called concept drift, which requires algorithms that can detect the changes without explicit knowledge about them [Widmer1996]. Failing to do so causes their accuracy to degrade [Zliobaite2012]. However, overcorrection triggered by mild noise is also a problem, so balancing robustness and flexibility is key [Stanley2003].

Data bursting happens when data packets arrive at their destination more rapidly than intended by the transmitter [allman2005]. The phenomenon of micro-bursting is quite common in networks with window-based transport protocols, especially TCP [allman2005].

Distributed systems (and most of the internet) are built on TCP, but data bursts can occur even with non-window-based protocols when components are physically far apart and packets are transmitted over several hops. Algorithms that are not designed to handle periodic micro-bursting can show significant inaccuracies in their results.

In this paper, our main contributions are: a data generator that can handle concept drifts and data bursts; a mechanism to benchmark and rank sampling algorithms, and a baseline Oracle, which correctly estimates the ground truth based on our data generator; formalization of concepts and concept drifts and proof of the correctness of the Oracle; a hyperparameter optimizer for sampling algorithms; two novel sampling algorithms that can react to concept drifts and analysis of them in conjunction with state-of-the-art sampling algorithms using our framework.

2 Related work

2.1 Sampling algorithms

There are multiple algorithms that try to solve the frequent items problem with fundamentally different approaches. The common ground for all of these algorithms is that they have to carefully balance memory usage, run time, and precision.

These algorithms can be categorized into the following three categories: Counter-based, Sketch-based and Change respecting.

Counter-based algorithms rely on a counter to update the currently estimated frequencies of each key. Strategies exist to periodically flush or thin the counters to save memory. Three defining algorithms of this category are: Sticky Sampling [Manku2002], Lossy Counting [Manku2002] and Space Saving [misra1982, demaine2002, metwally2005].
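To make the category concrete, here is a minimal Python sketch of the Space Saving idea [metwally2005] (a textbook version for illustration, not the implementation used in our framework): at most k counters are kept, and an unmonitored key evicts the current minimum counter and inherits its count plus one.

class SpaceSaving:
    """Textbook Space Saving: keeps at most k (key, estimated count) pairs."""

    def __init__(self, k):
        self.k = k
        self.counters = {}  # key -> estimated count

    def record_key(self, key):
        if key in self.counters:
            self.counters[key] += 1
        elif len(self.counters) < self.k:
            self.counters[key] = 1
        else:
            # Evict the key with the minimum count; the new key inherits
            # that count as an (over-)estimate of its frequency.
            min_key = min(self.counters, key=self.counters.get)
            self.counters[key] = self.counters.pop(min_key) + 1

    def most_frequent(self, n):
        return sorted(self.counters.items(), key=lambda kv: -kv[1])[:n]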

Sketch-based algorithms keep fewer counters than their counter-based counterparts, often using a probabilistic approach to maintain frequency estimates in sub-linear space. However, they require expensive calculations when recording a key, so they may not fit into the tight budget of a data streaming application. Notable algorithms are Count Sketch [charikar2002] and Count-Min Sketch [Cormode2005].
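As an illustration, a textbook-style Python sketch of a Count-Min Sketch (the hashing below uses Python's built-in hash for brevity and is a simplification, not the pairwise-independent scheme analysed in [Cormode2005]):

import random

class CountMinSketch:
    """d rows of w counters; every key is counted in one cell per row."""

    def __init__(self, width, depth, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        self.seeds = [rng.getrandbits(32) for _ in range(depth)]

    def record_key(self, key):
        for row, s in zip(self.table, self.seeds):
            row[hash((s, key)) % self.width] += 1

    def estimate(self, key):
        # Every row over-estimates the true count, so the minimum of the
        # touched cells is the tightest available bound.
        return min(row[hash((s, key)) % self.width]
                   for row, s in zip(self.table, self.seeds))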

Change-respecting algorithms have a mechanism in place to detect and adapt to concept drifts, unlike static algorithms, which concentrate on either static data or a data stream with a non-changing distribution. We are especially interested in this category; the two main algorithms here are Landmark [Zhu2002] and Frequent [Golab2003].

2.2 Partitioning algorithms

Hash partitioning works by taking the hash of the data key modulo the number of partitions [spark]. This is a static partitioning algorithm that solves the Multiprocessor Scheduling Problem only for uniformly distributed keys.
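As a minimal sketch (illustrative, not Spark's actual partitioner):

def hash_partition(key, num_partitions):
    # Static partitioning: the target partition depends only on the key's hash,
    # not on how heavy the key is.
    return hash(key) % num_partitions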

In distributed systems, the number of partitions can change and migration costs cannot be ignored, as parts of the system could be physically far apart. Consistent hashing is therefore commonly used [karger1997, Gedik2014]. Its major problem is that it has difficulties with non-uniform distributions [Gedik2014].

Gedik proposed three algorithms that consider non-uniform distributions as well [Gedik2014]. Therefore, we decided to use Gedik’s algorithms in our tests. Furthermore, these algorithms are constructed in such a way that both balance and migration costs can be configured easily.

2.3 Concept Drifts

Widmer defines concept drift as the radical change in a target concept introduced by changes in a hidden context [Widmer1996]. Studying concept drift in data science is a fundamental building block of a good algorithm. Internet traffic is ever-changing due to minor and major real-world events, so drifts in the data are inevitable.

A lot of research has gone into studying concept drifts, but the emphasis has been placed on machine learning, with a classifier over a finite number of categories.

The term concept drift does not have a clear definition in the field of data streaming, or the existing definitions do not apply to the sampling problem. It is often used interchangeably with terms such as concept shift, changes of classification, fracture points, and fractures between data, since these all refer to the same basic idea [torres2012]. Therefore, when we mention concept drift or a drifting data stream, we refer to a change in the underlying distribution of the stream. For example, the cause of such a change could be a sporting event, a sale at an online store, or a sudden hardware failure at a major data centre.

Concept drift can be sudden, also known as abrupt or instantaneous, where a change takes place at once; or gradual, where a transition period exists between two concepts [Stanley2003, tsymbal2004problem]. Sudden concept drift may occur because of a critical failure at a major data centre, while a gradual drift can occur every evening as different parts of the world start their day while others go to sleep. Concept drift may also occur during any of our computations, which renders both static partitioning algorithms and static sampling algorithms ineffective.

2.4 MOA

The closest framework to our proposed one is MOA [moa]. It was built for AI development, and its main strengths revolve around this idea. It has concept drift generation capabilities, but it was designed with classification in mind.

For a machine learning setting over data streams, the authors of MOA formulate the following requirements [moa]:

  • Process an example at a time, and inspect it at most once

  • Use a limited amount of memory

  • Work in a limited amount of time

  • Be ready to predict at any time

These requirements are also integrated into our own.

MOA has no burst generation capabilities, which would allow it to create more realistic, real-world-like scenarios. It also does not offer an out-of-the-box solution for metadata generation; metadata would allow us to make experiments repeatable and could be used for accurate error calculations. Concept drifts are only loosely defined in the paper [moa]; a formal definition would allow us to craft a precise error calculation method and prove its correctness.

3 Definitions

3.1 Data streams

We define data streams via a finite Cartesian product over the set of keys. A key could be anything; we define the key set over the natural numbers.

Definition 3.1 (Data Stream).

Given a key-set and , let be a data stream.

Although random access to its items is not in the nature of a stream, for further definitions we will refer to the i-th element of the stream.

3.2 Concepts

Concepts are probability distributions with a start and an end index, between which the concept is considered active.

Definition 3.2 (Concept).

Let be the set of all concepts, let , if and is a discrete probability distribution over , we call this a concept.

Given a stream (), let be the set of all concepts for stream . We require exactly one concept to be active at every point of the stream.

And require the concept’s probability function to cover the range of the concept.

Given a stream (), and we study the following two types of :

  • Constant concept: discrete probability distribution over ,

  • Changing concept: if and discrete probability distributions over ,

The function can be defined in multiple ways; we provide further information on the specifics of how we use it in Section 4.1.

Definition 3.3 (Concept Drift).

Given a stream (), let be the set of all concepts for stream . concept drift occurs in if,

Definition 3.4 (Abrupt Concept Drift).

Given two consecutive concepts, , abrupt concept drifts occur at if .

Gradual concept drift occurs between the start and end of a changing concept.

Definition 3.5 (Gradual Concept Drift).

Given a stream () and a changing concept gradual concept drift occurs during

Figure 1: Concept drifts described by concepts

Certain combinations of the function and the range can be chosen for any changing concept so that an abrupt drift is effectively described. We deem this a misuse, because a changing concept describes a gradual drift, which should happen over time, not suddenly.

Given all the concepts that are acting on the stream, the true distribution can be defined at any location.

Definition 3.6 (Concept and True Distribution).

Given a stream (), its concepts . At any point the true distribution of the stream is determined by the underlying concept,

3.3 Sampling algorithm

Our problem is similar to the frequent items problem [Cormode2008]: beyond identifying the most frequent items above a threshold, we also need the frequency of those items.

Definition 3.7 (Sampling problem).

Given a key-set , and a data stream , let the resulting most frequent items with their frequency be with threshold defined as

Instead of a threshold, a limit can be given, resulting in at most that many items with the highest frequencies. This parameter is often called top-K in the literature [metwally2005].

Algorithms that solve the sampling problem have to work under heavy limitations: memory is limited, the run time has to be as low as possible, the stream can only be processed once, its elements arrive one by one, and the length of the stream is hidden or non-existent.

The results of these algorithms are normalized to produce relative frequencies; we will call these sample distributions.
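For example, the normalization step can be as simple as the following sketch (assuming, for illustration, that a sampler reports a key-to-count dictionary):

def sample_distribution(counters):
    """Turn reported (key -> count) pairs into relative frequencies."""
    total = sum(counters.values())
    return {key: count / total for key, count in counters.items()} if total else {}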

3.4 Oracle

Definition 3.8 (Oracle).

An oracle is a sampling algorithm that has access to the data stream's metadata, which contains the concepts and the function that was used to generate that data stream. It can trivially calculate the true distribution using the method described in Definition 3.6.

Using the oracle as a sampling algorithm in a real distributed system is of course impossible.

3.5 Error calculation

3.5.1 Distribution difference

Directly measuring the error of an algorithm in isolation is key to determining how fast and how accurately it reacts to drifts compared to other algorithms. We need a baseline distribution, provided by the Oracle, to which we can compare all of the other distributions.

There are multiple ways to compare probability distributions, but usually either the Kullback–Leibler divergence or the Hellinger distance is used [fink2009, Webb2016].

Given a target discrete distribution P and an approximate discrete distribution Q over the same key set, the Hellinger distance is the following formula [Hellinger1909]:

H(P, Q) = (1/√2) · √( Σ_k ( √p_k − √q_k )² )    (1)

The Hellinger distance is bounded between 0 and 1. It is a distance function (metric), and unlike the Kullback–Leibler divergence, it satisfies both symmetry and the triangle inequality.
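A minimal Python sketch of Equation (1) for distributions stored as key-to-probability dictionaries (for illustration only, not the framework's code):

from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions; missing keys count as probability 0."""
    keys = set(p) | set(q)
    return sqrt(sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys)) / sqrt(2)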

3.5.2 Load imbalance

The partitioning problem can be traced back to the Multiprocessor Scheduling Problem, originally defined by Garey et al. in 1979 [garey1979]. It is NP-complete, but pseudo-polynomial solutions exist for any fixed number of processors [garey1979].

When measuring the error of the whole system, we use the most common metric, the percent imbalance metric. It measures the performance lost to imbalanced load, or in other words, the performance that could be gained by balancing our partitions [pearce2012]. Pearce et al. (2012) [pearce2012] defined this imbalance as follows:

Definition 3.9 (Imbalance).

Let L_1, …, L_n be the loads on our partitions; then L_max is the maximum load on any process. Let L_mean be the mean load over all processes. The percent imbalance metric, λ, is:

λ = ( L_max / L_mean − 1 ) × 100%    (2)

Calculating the maximum of the loads is sufficient, because that will determine the run time of our whole computation.
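A short Python sketch of Equation (2), assuming the per-partition loads are already known:

def percent_imbalance(loads):
    """Percent imbalance metric of Equation (2) for a list of partition loads."""
    mean_load = sum(loads) / len(loads)
    return (max(loads) / mean_load - 1.0) * 100.0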

4 Our Framework

To test sampling algorithms we need a flexible, fast and deterministic testbed with low overhead.

We list our findings of what the requirements should be for a ranking and benchmarking framework for algorithms over data streams. We base these requirements on previous frameworks [moa] and our own experience while developing sampling algorithms.

  • Allow algorithms to only process the data stream at most once [moa].

  • Provide metrics about the algorithms.

  • Provide those metrics on-demand during the computation [moa].

  • Be fast with low overhead [moa].

  • Be deterministic.

  • Include a data generator that can

    • handle concept drifts;

    • allow pre-generation of data streams;

    • generate the metadata of pre-generated streams;

    • be capable of simulating micro-bursting during tests.

  • Offer hyperparameter optimization.

4.1 Data generator

4.1.1 Concept drift generation

It makes more sense to define concepts and then concept drifts as their consequence. However, for generating concept drifts it is more straightforward to define the points at which concept drifts are occurring and derive the concepts from those.

Let , is a drift if () (), where is the length, is the midpoint of the drift and are the two probability distributions of the two concepts that the drift is created by.

In further definitions let and .

We have exactly one concept active at any given point, so overlapping drifts should not be allowed either. Given and , the drifts for that data stream:

There also has to be an initial drift starting at index ,

For a , () stream we call a gradual drift, if , and .

For a , () stream we call an abrupt drift, if and .

Given a stream generated by drifts, each element of the stream is a random variable following a discrete distribution determined by the drift active at that point: within a drift's transition the distribution interpolates between the two concepts, and after the transition the drift's target distribution applies.

To make sure that these drift definitions are compliant with the concepts defined in Section 3.2, we prove that their expressive powers are equal (Theorem 1, Theorem 2). Therefore the same concept drifts can be described by them.

Theorem 1 (Drifts and concepts are equal in expressive power I.).

and : drifts, where and

Proof.

We construct such , and show that the construction is correct. Then for any (based on requirements for ). We show that for any concept, our constructed corresponding drift is correct by showing that . Please see Appendix A for the whole proof. ∎

Theorem 2 (Drifts and concepts are equal in expressive power II.).

and : concepts, where

Proof.

We begin by constructing such , , then for any , we show that for any drift, our constructed corresponding concept is correct by showing that . Please see Appendix B for the whole proof. ∎

Both the generator and the concepts depend on a function. To allow the most generic way of defining drifts, we have left it undefined up to this point. For performance reasons, we only approximate a linear interpolation with the following function, and in Theorem 3 we prove the correctness of this approximation.

Let the function, at a given point of a drift, return the drift's target distribution if a standard uniform random variable falls below the linear interpolation factor at that point, and the drift's source distribution otherwise.

We use this to avoid recalculating a new distribution for every generated item.
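The trick can be sketched as follows in Python (the function name, distribution format, and drift bounds are illustrative assumptions, not the generator's actual API):

import random

def sample_key(position, drift_start, drift_end, old_dist, new_dist, rng=random):
    """Sample one key at `position` of a gradual drift.

    old_dist and new_dist map keys to probabilities. Instead of materializing
    the interpolated distribution, one of the two concepts is chosen at random
    with probability equal to the interpolation factor, so the expected
    distribution equals the linear interpolation between them.
    """
    progress = (position - drift_start) / (drift_end - drift_start)
    dist = new_dist if rng.random() < progress else old_dist
    keys, probs = zip(*dist.items())
    return rng.choices(keys, weights=probs, k=1)[0]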

To obtain smooth transitions in the generator's implementation, we also require that the probability distribution at the final point of a drift matches the probability distribution at the first point of the following drift. To achieve this, we require the drifts to comply with the following restriction:

Let () be a data stream and its concepts.

Using the aforementioned function, it can be shown that this is a sufficient precondition.

4.1.2 Dynamic burst generation

When simulating bursts, we simulate faulty routers. Before every micro-batch begins, a burst has a chance to start. For the duration of the burst, each key has a certain probability of being held back. At the end of the burst, which could last a couple of micro-batches, the held-back keys are released back into the stream at once.

Bursts can only be applied to already existing streams; therefore, it is a prerequisite for a burst process to already have a data stream prepared. We use the phrase loading to describe getting the next item of this stream.

Bursts are defined with the following structure: . is a valid burst configuration, if .

The burst start probability describes the probability of a burst starting at any micro-batch. Once a burst has started, another one cannot start until it ends. Before any bursting takes place, the faulty keys are calculated in advance using the key burstiness probability, which gives the probability of each key being held up. This process generates a map (the Faulty Keys Map) in which we store these keys until the burst is over.

During the burst, the faulty keys are counted in the Faulty Keys Map. Whenever a key that is in the map would be loaded into the stream as the next item, it is instead put into the map. If a key is not faulty, stream loading continues as usual.

The burst duration is defined by the burst length in micro-batches, which is a random number between a minimum and a maximum duration. At the end of the burst, the held-up data in the Faulty Keys Map is released back into the stream.
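A Python sketch of this mechanism (the class and parameter names are illustrative assumptions, not the framework's burst configuration format):

import random
from collections import Counter

class BurstingStream:
    """Wraps an already generated key stream and simulates micro-bursts.

    burst_start_p: chance of a burst starting at a micro-batch boundary.
    key_burst_p:   chance that a given key is marked faulty for the burst.
    min_len, max_len: burst length in micro-batches.
    """

    def __init__(self, stream, key_set, burst_start_p, key_burst_p,
                 min_len, max_len, rng=random):
        self.stream = iter(stream)
        self.key_set = list(key_set)
        self.burst_start_p = burst_start_p
        self.key_burst_p = key_burst_p
        self.min_len, self.max_len = min_len, max_len
        self.rng = rng
        self.batches_left = 0     # remaining burst length in micro-batches
        self.faulty_keys = set()  # keys held back during the current burst
        self.held = Counter()     # the Faulty Keys Map

    def next_micro_batch(self, batch_size):
        if self.batches_left == 0 and self.rng.random() < self.burst_start_p:
            # A burst starts: pick its length and pre-compute the faulty keys.
            self.batches_left = self.rng.randint(self.min_len, self.max_len)
            self.faulty_keys = {k for k in self.key_set
                                if self.rng.random() < self.key_burst_p}
        batch = []
        for _ in range(batch_size):
            key = next(self.stream, None)
            if key is None:
                break
            if self.batches_left > 0 and key in self.faulty_keys:
                self.held[key] += 1   # hold the faulty key back
            else:
                batch.append(key)
        if self.batches_left > 0:
            self.batches_left -= 1
            if self.batches_left == 0:
                # Burst over: release every held-up key back into the stream.
                batch.extend(self.held.elements())
                self.held.clear()
                self.faulty_keys.clear()
        return batch

Note that, in this sketch, micro-batches shrink while keys are being held back and grow again when the burst ends and the held-up keys are released.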

4.2 Ranking

We use on-demand metric querying to achieve the ranking of sampling algorithms. We show two different ways to compare them: one method works with the direct sampling outputs, while the other measures them indirectly in a simulated environment through the efficiency of that simulated system.

4.2.1 Ranking by reported distribution difference

Sampling algorithms can be measured directly through their samples, using the distribution difference described in Section 3.5.1.

To achieve this, we run the algorithms in separate processes and provide them the same data stream. At the end of each micro-batch, the samples of the algorithms are saved and the sample distributions are calculated from them. We run an Oracle on the same data stream to provide the baseline, the true distribution, at the end of each micro-batch.
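A sketch of this measurement loop (assuming, for illustration, that every sampler and the Oracle expose record_key and sample_distribution methods, and using the hellinger function sketched in Section 3.5.1):

def evaluate(samplers, oracle, micro_batches):
    """Record each sampler's Hellinger distance from the true distribution after every micro-batch."""
    history = {name: [] for name in samplers}
    for batch in micro_batches:
        for key in batch:
            oracle.record_key(key)
            for sampler in samplers.values():
                sampler.record_key(key)
        true_dist = oracle.sample_distribution()
        for name, sampler in samplers.items():
            history[name].append(hellinger(sampler.sample_distribution(), true_dist))
    return history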

The Oracle is a good choice for this because it can correctly calculate the true distribution even during concept drifts (Theorem 3). Seeing the difference between a sample distribution and the true distribution increase during a concept drift means the sampling algorithm could not react to the drift fast enough. After this, the speed at which the difference shrinks tells us how fast it can detect and correct the drift. The Oracle's correctness is based on the synthetic data generator. We showed that the data generator can correctly generate concept drifts in Theorem 1 and Theorem 2; therefore, it is enough to show that the Oracle correctly calculates the true distribution based on the data generator. We have access to all of the drifts from the metadata that was used to generate the data stream.

Because there is always at least one drift present (the starting drift) and no two drifts can overlap at any given point, the distribution at every point of the stream is determined by exactly one drift.

Theorem 3 (Oracle is correct).

Given a stream generated by drifts, the Oracle's calculated distribution equals the true distribution at every point of the stream.

Proof.

There are two cases for each point of the stream. If the point falls within a drift's transition, the generator chooses between the drift's source and target distributions using a standard uniform random variable, so the expected distribution at that point equals the linearly interpolated true distribution. Otherwise, the generator draws directly from the active drift's target distribution, which equals the true distribution by definition. ∎

4.2.2 Ranking by load imbalance

Figure 2: Flow of data in a distributed computation

In real distributed systems, incoming data is processed by multiple nodes. The data arrives at the first stage from a data source, such as Kafka or Couchbase, through a hash partitioner (denoted by the blue square in Figure 2). The nodes in the first stage then shuffle this data by a grouping, which is a frequently used operation and could be the result of group-by or join operations. The second stage runs calculations on the shuffled data stream.

In our framework, we simulate a distributed system with multiple nodes by starting multiple sampling algorithms. We process the data stream in micro-batches and for each element of the micro-batch, we select a sampling algorithm with a hash partitioner to feed that element to. After each micro-batch, we gather the outputs of the sampling processes and aggregate them. Based on the aggregated output, we determine whether repartitioning is necessary with a decider strategy. If so, we create a new partitioning algorithm for the imbalance measurement, which will be used by the next micro-batch.

The calculation of the load imbalance happens during each micro-batch. A naive approach would be to calculate the partition loads from the reported data of the sampling processes; however, this would allow cheating, since a sampling algorithm could report falsified data and produce perfect results. To make sure this does not happen, we calculate the partition loads during the micro-batch processing. We determine this load by counting the actual number of elements that would be shuffled to each partition if we were to continue the computation. If repartitioning happens after processing a micro-batch, the load imbalance caused by the new partitioning algorithm will be calculated during the following micro-batch.
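A sketch of this measurement (illustrative; partitioner is any key-to-partition function such as the hash_partition sketch above, and percent_imbalance is the metric of Equation (2)):

from collections import Counter

def measure_partition_loads(micro_batch, partitioner, num_partitions):
    """Count how many elements of the micro-batch would be shuffled to each partition."""
    loads = Counter({p: 0 for p in range(num_partitions)})
    for key in micro_batch:
        loads[partitioner(key, num_partitions)] += 1
    return [loads[p] for p in range(num_partitions)]

# imbalance = percent_imbalance(measure_partition_loads(batch, hash_partition, 8))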

To rank multiple sampling algorithms, we only compare test cases that have the same data stream, number of nodes, decider strategy, repartitioner, and micro-batch size.

4.3 Optimizer

We provide a framework for optimization of sampling algorithms in which a new optimization strategy is easy to implement.

The optimization process works the following way:

  1. The initial population is selected as a starting point.

  2. The initial population is benchmarked

  3. In the selection process, the selectors are used to thin out the population based on their fitness.

  4. In the evolution phase, a new population is generated based on the previous generation’s survivors.

  5. The survivors are added to the new population

Figure 3: State diagram of our optimization process.

The optimization strategy we provide is a local minimum search, in which we choose the member of the population with the best fitness value in the selection process.

The hyperparameter space of an algorithm can be high-dimensional, and every additional parameter increases the number of neighbours exponentially. Therefore, calculating all possible neighbours of a configuration is infeasible; instead, we generate a subset of them and only include those in the population. Because of the complexity of the problem domain, the correct steps and ranges of the parameters are not universal. For example, a "probability" parameter only makes sense if it is between 0 and 1. Furthermore, it often does not make sense to use the whole domain of a parameter as its range; for example, a "window size" parameter should not be bigger than the stream length and should have a sensible minimum size as well.
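A sketch of such a local search (illustrative, not the framework's optimizer API; the neighbours function is assumed to yield configurations already clamped to sensible parameter ranges, and lower fitness is assumed to be better):

import random

def local_search(initial, neighbours, fitness, generations=50, sample_size=8, rng=random):
    """Random local minimum search over hyperparameter configurations."""
    best, best_fit = initial, fitness(initial)
    for _ in range(generations):
        candidates = list(neighbours(best))
        # Benchmark only a random subset of the neighbours per generation.
        for config in rng.sample(candidates, min(sample_size, len(candidates))):
            f = fitness(config)
            if f < best_fit:
                best, best_fit = config, f
    return best, best_fit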

To evaluate these algorithms with specific hyperparameters, the optimizer uses fitness functions. These can be defined using the reported metrics, such as accuracy, run time, and memory usage.

If the diversity of the data stream is low, overfitting can occur during the optimization process. To avoid this, multiple data streams are used during optimization, and different kinds of bursts can also be introduced to the data streams.

5 Our Algorithms

5.1 Temporal Smoothed

Our first algorithm is called Temporal Smoothed. It is inspired by Landmark and aims to fix a flaw in it: when the sampler is asked to report shortly after a landmark, the reported output is too small and may not show an accurate distribution of the sampled keys.

Our algorithm, like Landmark, works with windows (threshold), but rather than resetting the whole inner state of the sampler after each window, it starts a new sampling process instead. It maintains the original sampler (main sampler) and the newly created one (secondary sampler) for a predetermined window size (switch threshold). During this window, the incoming data is sampled by both of them and only the main sampler's results are reported. After the window is processed, the secondary sampler has gained a stable size and becomes the new main sampler.

Input : key: Key
Parameter : ms: SamplerBase, main sampler
Parameter : ss: SamplerBase, secondary sampler
Parameter : threshold
Parameter : switch threshold
increment the sample counter;
ms.recordKey(key);
if ss initialized then
       ss.recordKey(key);
end if
if ss not initialized and the threshold has been reached then
       initialize ss;
else if ss initialized and the switch threshold has elapsed since ss was initialized then
       ms ← ss;
       discard ss;
end if
Algorithm 1 TemporalSmoothed

The TemporalSmoothed algorithm can encapsulate any sampling algorithm. As a result, the memory usage and run time heavily depend on which algorithm is encapsulated.
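To make the mechanism concrete, here is a runnable Python sketch. The exact switching conditions (start the secondary sampler at the window boundary and promote it after the switch threshold) are our reading of Algorithm 1, and the sampler_factory / record_key / sample_distribution interface is assumed for illustration, not the framework's API:

class TemporalSmoothed:
    """Wrapper that periodically replaces its inner sampler with a fresh one."""

    def __init__(self, sampler_factory, threshold, switch_threshold):
        self.factory = sampler_factory
        self.threshold = threshold
        self.switch_threshold = switch_threshold
        self.ms = sampler_factory()  # main sampler
        self.ss = None               # secondary sampler
        self.count = 0               # samples since the last switch

    def record_key(self, key):
        self.count += 1
        self.ms.record_key(key)
        if self.ss is not None:
            self.ss.record_key(key)
        if self.ss is None and self.count >= self.threshold:
            self.ss = self.factory()
        elif self.ss is not None and self.count >= self.threshold + self.switch_threshold:
            self.ms, self.ss = self.ss, None  # the secondary becomes the main
            self.count = 0

    def sample_distribution(self):
        # Only the main sampler's results are ever reported.
        return self.ms.sample_distribution()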

Theorem 4 (Memory usage of Temporal Smoothed).

Assuming the chosen sampling algorithm has a maximum memory usage of where is the current length of the stream, the TemporalSmoothed algorithm’s maximum memory usage will be:

Proof.

First, it has to be stated that the memory function is monotone increasing, since the maximum memory usage of a stream cannot decrease with the increase of the stream length. Secondly, with the help of Figure 4, it can be shown that the main sampler during its lifetime will sample times, which means that the maximum memory usage for the main sampler will be . During the main sampler's lifetime, the secondary sampler samples times, which means a maximum memory usage of . ∎

Figure 4: Temporal Smoothed phases

Assuming the chosen sampling algorithm can initialize itself and sample a key in constant time, the Temporal Smoothed algorithm can also sample a key in constant time. This assumption is not unreasonable; the sampling process has to sample large amounts of data in real time.

Theorem 5 (Temporal Smoothed drift detection).

The Temporal Smoothed algorithm can detect a change in the stream within a bounded number of further samples.

Proof.

In the algorithm a newly created sampler has three phases (Figure 4).

  • In phase one, the sampler takes the role of the secondary sampler. The length of this phase is samples. (The exception to this is the very first sampler which already starts as main sampler)

  • In phase two, the sampler becomes a main sampler. The length of this phase is samples.

  • In phase three, the sampler is still the main sampler, but at the end of this phase, it gets discarded and replaced. This phase overlaps the newly created secondary sampler’s phase one. The length of this phase is samples.

By examining the phase in which the change in the stream happens, we can determine the following cases:

  1. The change happens exactly at the start of phase one. The change will be detected when the sampler becomes the main sampler, which is in samples. (This cannot happen for the very first sampler, because the starting point of the change and the stream would overlap)

  2. The change happens in phase one or two. The sampler may not correctly detect the change since it has samples from before the change and after it. On the other hand, the next main sampler will only have samples from after the change and can correctly detect it. This will happen in no more than samples, which is the upper estimate for the replacement of the current main sampler.

  3. The change happens in phase three. This case is covered by the first two cases because this coincides with the phase one of the current secondary sampler.

5.2 Checkpoint Smoothed

Our next algorithm is called Checkpoint Smoothed. This algorithm works similarly to Temporal Smoothed, but rather than periodically and rigidly renewing the main sampler, it aims to only replace it if we are certain enough that the main sampler’s reported results are not accurate due to a possible change in the stream.

Our algorithm works with a main sampler and a secondary sampler. After the main sampler has sampled enough data (checkpoint window), a new secondary sampler is initialized. The two samplers sample concurrently for another window (check threshold). After this, the reported results of the main sampler and the secondary sampler are compared using the Hellinger distance. If the result is beyond a predetermined threshold (error threshold), the main sampler is replaced by the secondary sampler. We then repeat the aforementioned process, either with the original main sampler or with the new one.

Input : key: Key
Parameter : ms: SamplerBase, main sampler
Parameter : ss: SamplerBase, secondary sampler
Parameter : checkpoint window
Parameter : check threshold
Parameter : error threshold
Parameter : helper variable (sample counter)
increment the sample counter;
ms.recordKey(key);
if ss initialized then
       ss.recordKey(key);
end if
if ss not initialized and the checkpoint window has elapsed then
       initialize ss;
else if ss initialized and the check threshold has elapsed since ss was initialized then
       compute the sample distribution of ms;
       compute the sample distribution of ss;
       compute the Hellinger distance of the two distributions;
       if the distance exceeds the error threshold then
              ms ← ss;
       end if
       discard ss;
       reset the sample counter;
end if
Algorithm 2 CheckpointSmoothed

The Checkpoint Smoothed algorithm can also encapsulate any sampling algorithm.
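A matching Python sketch (same assumed sampler interface as for Temporal Smoothed, with the hellinger function from Section 3.5.1; the condition placement is our reading of Algorithm 2, not the framework's implementation):

class CheckpointSmoothed:
    """Wrapper that replaces its inner sampler only when the secondary sampler drifts away from it."""

    def __init__(self, sampler_factory, checkpoint_window, check_threshold, error_threshold):
        self.factory = sampler_factory
        self.checkpoint_window = checkpoint_window
        self.check_threshold = check_threshold
        self.error_threshold = error_threshold
        self.ms = sampler_factory()
        self.ss = None
        self.count = 0  # samples since the last checkpoint

    def record_key(self, key):
        self.count += 1
        self.ms.record_key(key)
        if self.ss is not None:
            self.ss.record_key(key)
        if self.ss is None and self.count >= self.checkpoint_window:
            self.ss = self.factory()
        elif self.ss is not None and self.count >= self.checkpoint_window + self.check_threshold:
            drift = hellinger(self.ms.sample_distribution(), self.ss.sample_distribution())
            if drift > self.error_threshold:
                self.ms = self.ss  # the stream changed enough: adopt the new sampler
            self.ss = None
            self.count = 0

    def sample_distribution(self):
        return self.ms.sample_distribution()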

Theorem 6 (Memory usage of Checkpoint Smoothed).

Assuming the chosen sampling algorithm has a maximum memory usage of where is the current length of the stream, the Checkpoint Smoothed algorithm’s maximum memory usage will be:

Proof.

If there is too little or no change in the stream, the main sampler will never be replaced. This means that the main sampler’s maximum memory usage is . If the main sampler never gets replaced, a new secondary sampler is started periodically and will sample amount of data. This means that the maximum memory usage of the secondary sampler is . ∎

The sampling time of a key for the Checkpoint Smoothed algorithm depends on multiple things:

  1. The chosen sampling algorithm: As stated for the Temporal Smoothed algorithm's run time, it is not unreasonable to assume that a sampler can initialize and sample a key in constant time.

  2. To calculate the Hellinger distance periodically, the relative frequencies of the main sampler and secondary sampler are needed. The run time is therefore tied to the size of the main sampler’s and the secondary sampler’s output.

  3. The calculation of the Hellinger distance can be done in linear time based on the size of its inputs, therefore this does not increase the asymptotic run time.

Based on these, the Checkpoint Smoothed algorithm can sample a key in time proportional to the maximum number of relative frequencies calculated by the chosen sampling algorithm for the given number of samples.

The asymptotic run time can be quite misleading, because the Hellinger comparison is only performed once per check window, so the vast majority of sampling steps take constant time. The algorithm's sample time can be improved further: for example, if we run the calculation of the Hellinger distance concurrently, we do not have to interrupt the sampling process. This introduces a possible error, namely that we may calculate the distance between a main sampler and a secondary sampler that have each received a few additional samples in the meantime. If the parameters are reasonable in size, this will not cause a significant change in the algorithm's output.

6 Analysis

We use concept drifts where the midpoint of the concept drift is at the midpoint of the data stream. This is to give the algorithms as much time to react to the concept drifts as they had to estimate the frequencies before the concept drift.

All data sets consist of elements with a key set of size . In our tests, we use micro-batches of size and a top-K value of , because with the chosen Zipfian distributions this top-K should include the keys with meaningful frequencies ().

The number of test cases grows exponentially with each new algorithm and hyperparameter. Showing all of these would be impractical because of limited space, so instead we present only the relevant cases and provide all of our measurements on GitHub (https://github.com/g-jozsef/sampling-framework-aux). The measurements were made on a core, thread Ryzen CPU clocked at GHz with GB of RAM.

In this paper, we only show our algorithms together with Frequent and Lossy Counting. Frequent performed best in the majority of our tests and only seems to have difficulties with bursts. We use Lossy Counting as the encapsulated sampling algorithm in Temporal Smoothed and Checkpoint Smoothed due to its speed, low memory usage, and accuracy. Lossy Counting is robust and resilient against data bursts, but it is amongst the worst performers when it comes to reacting to concept drifts.

We use the same parameters for Temporal Smoothed and Checkpoint Smoothed as for Landmark to allow a fair comparison. After running multiple optimizations over various datasets, we decided to use an error threshold of for Checkpoint Smoothed.

  • Temporal Smoothed: threshold, switch threshold

  • Checkpoint Smoothed: checkpoint window, check threshold, error threshold

Figure 5: Hellinger distance between the sample distribution and the true distribution at each micro-batch for Frequent, Lossy Counting, Checkpoint Smoothed (CPS) and Temporal Smoothed (TMP), without bursts and with light and heavy bursts introduced (panels (a)–(f) show gradual drifts with no burst, a light burst, and a heavy burst).

Temporal Smoothed and Frequent react the fastest to concept drifts, which is due to their windowed nature. Checkpoint Smoothed is slightly inaccurate during the concept drift, but it quickly corrects itself. During the gradual drift, we can see multiple cuts made by Checkpoint Smoothed, but after the concept drift it rapidly corrects itself.

With the introduction of light bursts, Checkpoint Smoothed gives the best results on average because it is not hypersensitive to tiny changes. Temporal Smoothed and Frequent, however, give better results between light bursts than Checkpoint Smoothed.

With the introduction of heavy bursts, Checkpoint Smoothed dampens the effects of the bursts better than Frequent and Temporal Smoothed.

Figure 6: Memory usage (number of counters) for the Frequent, Lossy Counting, Checkpoint Smoothed (CPS) and Temporal Smoothed (TMP) algorithms with and without bursts, under Zipfian distributions (panels (a)–(d) show gradual drifts with and without a heavy burst).

We can observe that the standard deviation of the memory usage increases when bursts are introduced.

Checkpoint Smoothed uses almost twice as much memory as its base, Lossy Counting. Temporal Smoothed uses a similar amount of memory to its base, but with a greater deviation, which becomes more apparent with bursts.

Algorithm           | gradual | gradual, burst | gradual | gradual, burst
Frequent            | 1355 ms | 1168 ms        | 546 ms  | 421 ms
Lossy Counting      | 1493 ms | 1387 ms        | 616 ms  | 504 ms
Checkpoint Smoothed | 2089 ms | 1854 ms        | 868 ms  | 760 ms
Temporal Smoothed   | 1979 ms | 1784 ms        | 849 ms  | 743 ms
Table 1: Run time for the Frequent, Lossy Counting, Checkpoint Smoothed and Temporal Smoothed algorithms with and without bursts; the two column groups correspond to the two Zipfian exponents used in the tests.

The introduction of bursts does not influence run time significantly. We can see that our algorithms behave the same way as the state-of-the-art algorithms. When the exponent becomes higher, the run time becomes lower.

We chose a partition number of to test our algorithms.

Figure 7: Percent imbalance error for the Frequent, Lossy Counting, Checkpoint Smoothed (CPS) and Temporal Smoothed (TMP) algorithms for abrupt and gradual concept drifts, with and without bursts, under a Zipfian distribution and using the chosen number of partitions (panels (a)–(d): abrupt and gradual drifts with and without a heavy burst).

An abrupt concept drift causes a sudden jump in the imbalance error, which is hard to recover from, but Temporal Smoothed can do so quickly, while Checkpoint Smoothed improves more slowly. When heavy bursts are introduced, the imbalance caused by the concept drift is not significantly greater than before (note the scale of the diagrams in Figure 7).

A gradual concept drift causes a dip in the imbalance error, which can be observed here as well. A gradual drift is easier to recover from, and both Temporal Smoothed and Checkpoint Smoothed do so quickly. They also achieve load imbalance results similar to Frequent in heavy burst scenarios.

In summary, we found that Temporal Smoothed and Checkpoint Smoothed can react fast to concept drifts. These algorithms improved the reaction time to concept drifts compared to the encapsulated Lossy Counting algorithm. The cost of this improvement was the increase in memory usage and run time.

7 Summary and Future Work

In this work, we defined concepts and concept drifts of distributions for data streams. We approached concept drifts from different perspectives and showed that the two approaches have the same expressive power, and used this to provide an intuitive way to define concept drifts in data streams.

We provided a data generator that can handle data bursts and concept drifts, and showed two methods of measuring the error of sampling algorithms. An optimizer was also included in the framework with a random local minimum search that can be used to tune the hyperparameters of algorithms.

We introduced two novel algorithms that could react to concept drifts, and analyzed them in conjunction with the state-of-the-art algorithms using our framework. In our analysis, we found that Frequent reacts fastest to concept drifts, and our algorithms also show good reaction times. Checkpoint Smoothed could achieve better results than Frequent in heavy burst scenarios.

7.1 Future Work

We would like to measure our algorithms on a real-world dataset.

We conjecture that concepts and concept drifts can be extended to any number of underlying probability distributions.


Appendix A Drifts and concepts are equal in expressive power I.

Proof.

We construct such , and show that the construction is correct.

  1. if is a changing concept, say discrete probability distributions over , , let

  2. if is a constant concept, say discrete probability distribution over , , let be an arbitrary discrete probability distribution over (it does not matter)

For any (based on requirements for ). We show that for any concept, our constructed corresponding drift is correct by showing that

  1. if is a changing concept, with discrete probability distributions over , , from our construction

    reduces to

    Therefore we can use our first generation rule for drift,

  2. if is a constant concept, with discrete probability distribution over ,

    The concepts are non-overlapping, so

    (an arbitrary discrete probability distribution over ),

    if , we can use our second generation rule for drift, that is

    if , we can use our third generation rule for drift, that is

Appendix B Drifts and concepts are equal in expressive power II

Proof.

We begin by constructing such , ,

  1. if , let

    and if and

  2. if , let

For any , we show that for any drift, our constructed corresponding concept is correct by showing that

  1. if and then, based on the given construction