## I Introduction

Networking applications such as load balancing [LoadBalancing], traffic-engineering [TrafficEngeneering], SLA enforcement [SLA], and intrusion detection [IntrusionDetection, IntrusionDetection2] require measurement information such as flow sizes and heavy hitter flows.
Computing this information is challenging due to the limited amount of fast memory and the rapid line rates [Nitro, RHHH, Brick].
Such constraints motivate *approximate* measurements which reduce the overheads at the cost of introducing a provably bounded error [univmon, CountSketch, CountMinSketch, RandomizedCounterSharing, SketchVisor, SpaceSavings].

Accordingly, many measurement algorithms use a small number of "shared" counters for providing estimates for all flow sizes instead of tracking each with a dedicated counter.
Previous work suggests replacing counters used in these methods with shorter probabilistic counters (a.k.a. *estimators*) that approximately count up to large numbers with fewer bits [SAC, DISCO, CEDAR, ICE-Buckets, CASE, Infocom2019]. Such estimators require less memory than regular counters, allowing more to fit within a given amount of space.

Such estimators have been shown to empirically improve the accuracy on networking workloads at the cost of added complexity and reduced speed [Infocom2019]. Approximate measurement algorithms that can benefit from such estimators [CountSketch, CUSketch, univmon] often require significant per-packet processing to calculate multiple hash values or update sophisticated data structures. Sampling techniques [RHHH, Nitro] reduce the number of packets that need to be processed, increasing speed at the cost of losing accuracy and requiring more memory.

Our work provides simple and effective estimator techniques that increase the processing speed *and* reduce the required space. In particular, we make use of the fact that most sketching- and sampling-based algorithms yield *additive* errors on the order of , where is a pre-selected constant and is the size of the total count (in terms of number of packets or bytes).
Therefore, unlike previous work that provided estimators with a multiplicative error, we focus on estimators that themselves have an additive error bound.
As the combination of an additive-error algorithm with a multiplicative-error estimator results in an additive error solution anyway, we study the potential benefits of additive-error estimators for accuracy and speed.
We provide formal accuracy guarantees for our methods, including examples of practical configurations where our approach improves the accuracy.
We then evaluate our methods empirically on real network traces, and show that they improve the accuracy compared to the state-of-the-art estimators while being - faster. Further, for a given error target, we improve the speed and space of the uncompressed solutions by - and up to respectively.

## II Related Work

We describe the related work in terms of estimators, sketch algorithms, and cache-based counting algorithms. We note that this terminology does not appear to be standard, and previous works refer to all of them as "counters" or "approximate counters" (regardless of whether they are counting one object or many); we find that distinguishing the types of algorithms in this way is clearer.

#### II-1 Estimators

We use the term *Estimator* to refer to a small approximate counter (e.g., a register), which can approximately represent a large number. An estimator generally works via probabilistic increments; when an item corresponding to that counter arrives, we flip a coin and add one to the estimator with a certain probability. The estimator’s value is used to derive an approximate estimate for the actual count. In what follows we refer to a probabilistic increment operation (or PI) as an operation where the estimator may be increased, and an increment as a case where the estimator is actually incremented (due to a successful coin flip). The estimator value is used to estimate the number of PIs associated with the estimator. Estimators differ from each other in their PI probabilities. Some estimators work for fixed ranges, while others utilize techniques to dynamically increase the counting range (generally at the expense of a larger error). The *Approximate Counting* [ApproximateCounting] algorithm is the first estimator we are aware of, and it inspired a substantial number of follow-on works [ANLS, ANLSUpscaling, CEDAR, ICE-Buckets, CASE, SAC, DISCO] (that we do not discuss here).

#### II-2 Sketch Algorithms

Sketch algorithms for keeping large-scale count information in networks are typically composed of arrays of counters. When a packet arrives, the algorithm applies multiple hash functions to its flow id, mapping the flow to a set of counters. Examples include the Count Min Sketch (CMS) [CountMinSketch], the Count Sketch [CountSketch], Spectral Bloom Filter [SpectralBloom], and the Conservative Update (CU) Sketch [CUSketch]. CMS utilizes multiple counter arrays, where each has a hash function that associates each flow with a counter. To increment a flow count in CMS, we apply the hash function of each array to the element and increment the corresponding counter. We estimate the count for a flow by returning the minimal value of all of its relevant counters. The CU Sketch optimizes the accuracy of CMS in a simple manner. When we add an item to the CMS, we only increment the corresponding counters whose value is minimal. That is, if we read 3,4,3, and 5 then we only increment the counters that show 3 to 4. This optimization avoids unnecessary increments, giving more accurate estimates. However, while CMS supports decrements, the CU Sketch does not.
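The difference between the CMS update and the Conservative Update can be illustrated with a minimal Python sketch. The class name, the `conservative` flag, and the use of salted BLAKE2b as the per-array hash function are illustrative assumptions and are not taken from the cited works.

```python
import hashlib


class CountMinSketch:
    """Count-Min Sketch with an optional Conservative Update (CU) mode."""

    def __init__(self, depth, width, conservative=False):
        self.depth = depth              # number of counter arrays
        self.width = width              # counters per array
        self.conservative = conservative
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, flow_id):
        # Illustrative hash (an assumption): BLAKE2b salted with the row number.
        digest = hashlib.blake2b(flow_id.encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def increment(self, flow_id):
        cells = [(r, self._index(r, flow_id)) for r in range(self.depth)]
        if self.conservative:
            # CU: only the counters currently holding the minimum are raised.
            smallest = min(self.rows[r][i] for r, i in cells)
            for r, i in cells:
                if self.rows[r][i] == smallest:
                    self.rows[r][i] += 1
        else:
            # CMS: every mapped counter is raised.
            for r, i in cells:
                self.rows[r][i] += 1

    def query(self, flow_id):
        # The estimate is the minimum over all counters mapped to the flow.
        return min(self.rows[r][self._index(r, flow_id)]
                   for r in range(self.depth))
```

On the same stream and with the same hash functions, the CU estimate never falls below the true count and never exceeds the CMS estimate, which is the sense in which it is more accurate.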

*CounterBraids* [CounterBraids] introduces a hierarchical structure that reduces the average counter length of CMS at the expense of a much slower decoding process.
Alternatively, *Randomized Counter Sharing (RCS)* [RandomizedCounterSharing] only updates a single randomly selected counter to achieve a faster update time, and sums all counters for an estimate.
NitroSketch [Nitro] takes RCS a step further,
providing several techniques to accelerate software sketches in virtual switches, including geometric sampling. In general, NitroSketch increases the required space, but accelerates the sketch’s throughput in software.
*Counter Tree* [countertree] introduces multiple *virtual counters* that extend multiple physical counters in a tree structure.
Counter Tree also trades off speed for space efficiency.

The use of estimator algorithms to compress sketch counters is particularly relevant to our work. *Small Active Counters* [SAC] implement an array of estimators, where each estimator keeps track of an exponent and an estimation part. The exponent part determines the probability of success for the PI, which increments the estimation part. When the estimation part reaches its maximum value, the exponent increases and the estimation part resets to 0.
The *DISCO* [DISCO] algorithm improves [SAC]’s accuracy and supports weighted updates (where a counter increases by a given quantity). The work of [ANLSUpscaling] introduces a way to gradually increase the measurement scale when a counter overflows at the expense of larger error.
*CEDAR* [CEDAR] proves that their estimation function is optimal for min-max relative error.
*ICE-Buckets* [ICE-Buckets] uses multiple measurement scales within a single array of estimators to reduce the error, while
*CASE* [CASE] shows that using a cache to monitor the largest flows accurately improves the estimation accuracy. Most relevant to our paper, the recent work of [Infocom2019] suggests a new estimator with multiple counter scales and demonstrates an empirical error reduction at the expense of a slower run-time.

#### II-3 Cache-Based Algorithms

We use the term cache-based algorithms for the class of algorithms that maintain a small cache of entries, each generally containing at least the flow identifier and its packet or byte count [frequent4, SpaceSavingIsTheBest, HashPipe, HeavyHitters, 10.14778/3297753.3297762]. To keep space usage reasonable, cache-based algorithms do not keep counts for all flows.

Cache-based algorithms differ from each other in their cache policy, governing when to admit a new flow and which flow to evict when admitting a new flow to a full cache. In software deployments, cache-based algorithms often yield an attractive space/accuracy trade-off when compared to sketch algorithms [SpaceSavingIsTheBest, SpaceSavingIsTheBest2010, SpaceSavingIsTheBest2011]. The *Misra-Gries (MG)* algorithm [misra1982finding] is perhaps the most famous cache-based algorithm, and requires logarithmic update time. The works of [frequent4, BatchDecrement] independently improve the update time to a constant for unweighted streams.

The Space-Saving algorithm [SpaceSavings] maintains a cache of flow entries, each with its own packet (or byte) counter. When a packet from an unmonitored flow arrives at a full cache, we evict the entry whose packet count is the smallest among all monitored flows (there may be more than one), and admit the unmonitored flow with an initial packet count of . Space-Saving also supports weighted updates. In that case, we admit a new entry with a count of where is the weight of the update. Formally, when the Space-Saving algorithm is configured with entries (for some in ), it provides an additive error where is the total number of packets.
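The eviction rule can be made concrete with a short Python sketch. It assumes the standard Space-Saving rule, in which a newly admitted flow inherits the evicted minimal count plus its own weight. The class and method names are illustrative, and the linear scan for the minimum is used only for clarity (it is not one of the constant-time or heap-based structures discussed in the literature).

```python
class SpaceSaving:
    """Minimal Space-Saving cache with weighted updates."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}                # flow id -> (over)estimated count

    def add(self, flow_id, weight=1):
        if flow_id in self.counts:
            self.counts[flow_id] += weight
        elif len(self.counts) < self.capacity:
            self.counts[flow_id] = weight
        else:
            # Evict a minimal entry; the new flow starts from its count.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[flow_id] = floor + weight

    def query(self, flow_id):
        if flow_id in self.counts:
            return self.counts[flow_id]
        # An unmonitored flow cannot exceed the minimal monitored count.
        if len(self.counts) == self.capacity:
            return min(self.counts.values())
        return 0
```

With this rule the counters always sum to the total stream weight, so the minimal counter, and hence the overestimation of any flow, is at most the total weight divided by the number of entries.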

The Randomized Admission Policy (RAP) [RAP] provides a simple heuristic that optimizes cache-based algorithms for heavy-tailed workloads. RAP leverages the fact that most packets belong to small flows, so admitting them to the cache means that we stop monitoring important flows. Therefore, RAP admits a new flow with probability ( for unweighted streams). The technique gives a significant empirical improvement in accuracy but currently lacks formal correctness proofs. The authors also suggest d-way RAP, which has smaller implementation overhead by using limited associativity arrays. They show that 16-way RAP achieves almost the same results as its fully associative counterpart.

Cache-based algorithms can also process weighted inputs, but this generally requires more sophisticated algorithms and resources. The Space-Saving algorithm can be implemented with constant update complexity for unit weights and with a logarithmic complexity for general weights. Recent works suggest weighted cache-based algorithms with a constant update complexity [dimsum, IMSUM], at the expense of a larger space requirement.

To the best of our knowledge, estimators were not previously suggested for cache-based algorithms. A possible explanation lies with the data structures associated with counter algorithms. Specifically, flow identifiers are typically 13 bytes long, and such algorithms also have other space overheads. When the actual counters are typically 4-8 bytes long, the benefit of reducing the counter size is limited. We show that estimators can benefit cache-based algorithms, especially when optimizing their data structures for space.

## III Additive-error Estimator

We start by presenting our estimator.
In this section, we assume that the required counting range () is known in advance.
We later show in Section III-D how to dynamically increase the counting range.
Our additive error estimator can count up to with an *additive* error of at most , with probability at least .
We emphasize again that additive guarantees are uncommon in estimator algorithms, which typically provide *multiplicative* error [ICE-Buckets, CASE, ANLSUpscaling, DISCO].
We choose additive error as it allows for smaller estimators, and it is similar to the error of common frequency estimation and heavy hitter algorithms [SpaceSavings, CountMinSketch]. That is,
additive error is unavoidable even if we integrate multiplicative counters into such algorithms. Another argument for additive error is that our estimator size is independent of while the size of multiplicative error estimators cannot be independent of .

### III-A Unit Weight Estimators

A unit weight estimator supports the Probabilistic Increment (PIncrement, or in short PI) and Query methods. The PIncrement method adds one to our estimator with a (fixed) probability which we determine below. The Query method estimates the number of PIs attempted by returning the value where is the estimator value. To determine we first set , and .
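A minimal Python sketch of the two methods follows. The formulas for the sampling probability and the query value are not reproduced in this text, so the sketch takes the probability `p` as an input and assumes that Query scales the stored value by `1/p`, which is the natural unbiased choice; both the names and that scaling are assumptions.

```python
import random


class AdditiveErrorEstimator:
    """Unit-weight estimator with a fixed, value-independent probability."""

    def __init__(self, p, seed=None):
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        self.p = p                      # shared sampling probability
        self.value = 0                  # the small stored estimator value
        self._rng = random.Random(seed)

    def p_increment(self):
        # The coin flip does not depend on the current estimator value.
        if self._rng.random() < self.p:
            self.value += 1

    def query(self):
        # Assumed scaling: each successful increment stands for 1/p PIs.
        return self.value / self.p
```

Because the coin flip ignores the stored value, the decision to update can be taken before the estimator is even read.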

Since we know that the maximal query return value is , our estimator only needs to count to . Intuitively, if we want to increase the estimator above it is always due to oversampling. As a result, we require bits. Note that the number of bits we require to count until (estimator value of ) with an additive error of is independent of . That is, our estimators have an unbounded counting range within the additive error model (note that the error in the additive model depends on ). We note that representing requires bits which implies that our memory consumption still depends on . However, when we move to using arrays of these estimators, since all of the estimators use the same , encoding introduces a negligible overhead.

Theorem III-A shows that our estimation method has the desired property. The proof is deferred to Appendix A.

Theorem III-A (Single Estimator). For any number of probabilistic increments , we have

As an example, Theorem III-A implies that a -bit estimator can approximate any count up to any pre-specified within an additive error of for , and be correct with probability of .

### III-B Weighted Estimators

We now consider a weighted estimator where the desired increment can be an arbitrary number (and not just by 1). Such estimators are useful for applications that, for example, rely on the byte volume of flows rather than their packet counts. Further, most existing sketches (e.g., Count Min [CountMinSketch] and Count Sketch [CountSketch]) and counter-based algorithms (including Space Saving [SpaceSavings], Frequent [BatchDecrement, frequent4] and RAP [RAP]) support weighted updates. The recent estimators by [Infocom2019] support it as well.

Our weighted estimator supports the Add() method, and the Query method estimates the sum of all add operations. For example, PIncrement is equivalent to Add(). We generalize to be the sum of all add operations when discussing weighted measurements. The notation and are unchanged.

In the Add() method, we break the update into two parts. Let and . We increase the estimator (deterministically) by , and with a probability of (notice that and this is a valid probability), we further increase the estimator by 1. In Appendix B we prove the correctness of this approach.
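The two-part update can be sketched in Python as follows. The exact definitions of the two parts are not reproduced in this text, so the sketch assumes the natural split: for weight `w` and sampling probability `p`, the deterministic part is the integer part of `w * p` and the extra increment happens with probability equal to the fractional part. The function name and signature are hypothetical.

```python
import math
import random


def weighted_add(value, w, p, rng):
    """Return the estimator value after Add(w) under sampling probability p."""
    scaled = w * p
    whole = math.floor(scaled)          # deterministic part of the update
    frac = scaled - whole               # in [0, 1), so a valid probability
    bump = 1 if rng.random() < frac else 0
    return value + whole + bump


# Usage: 10,000 packets of 1,500 bytes each, sampled with p = 0.001.
rng = random.Random(3)
p = 0.001
value = 0
for _ in range(10000):
    value = weighted_add(value, 1500, p, rng)
estimate = value / p                    # assumed query scaling by 1/p
```

Under this split the expected growth of the estimator per update is exactly `w * p`, so scaling by `1/p` gives an unbiased estimate of the total weight.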

### III-C Estimator Arrays

We now discuss how to efficiently implement an estimator array, which is an important building block for sketch algorithms.
An *estimator array* supports the PIncrement and Query methods, for . Here, is the number of estimators in the array, also referred to as its *width*.
is then defined as the overall number of probabilistic increments across all ’s and the goal is to estimate the number of PIncrement’s to within an additive error.

We can further reduce the size of the array since the sum of all
estimators is unlikely to be much larger than , as an estimator value of yields an estimation of .
Specifically, in Appendix C we prove that the total number of actual increments to the array is at most with probability (the subscript denotes *oversampling* error probability to distinguish it from the other error sources).

Our goal is to use shorter estimators, and to do so we consider a threshold value , such that each estimator is bits long. *Heavy estimators* are those that reach the maximal estimator value of ; these counters *overflow* to a secondary data structure. Since we keep the sum of all counters bounded by , there can be at most heavy counters.

We store the list of heavy estimators in a hash table where the key is the index of the heavy estimator and the value contains the most significant bits of that estimator. For example, if , we can have two byte (16 bit) estimators, and extend estimators that require more than 16 bits with another 8 bits. In practice, we suggest storing the heavy counters in a compact hash table such as [TinyTable, TinyTable2] which adds an additional bits per heavy counter or bits overall.
This means that our total space requirement is .
We minimize this quantity by setting and which gives a total space of bits. That is, we save nearly bits per counter by encoding the heavy ones separately. (For performance, it may be better to set for some integer parameter . This allows byte alignment and a faster implementation.)
For example, if and , we can set to encode each counter with two bytes and have at most heavy counters (even if ), for a total memory of less than KB.
In comparison, allocating 3 bytes for each counter, as in the previous sections, requires KB (20% more space).
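The split between short in-array estimators and a side table for heavy ones can be sketched in Python. This is an illustration under stated assumptions: the low part is fixed at 16 bits, a plain `dict` stands in for the compact hash table, and the width of the overflow part is not bounded. None of the names come from the original work.

```python
from array import array


class OverflowEstimatorArray:
    """Array of 16-bit estimators; heavy ones spill their high bits to a table."""

    LOW_BITS = 16
    LOW_MAX = (1 << LOW_BITS) - 1

    def __init__(self, width):
        self.low = array("H", [0]) * width   # two bytes per estimator
        self.high = {}                       # index -> most significant bits

    def increment(self, i):
        if self.low[i] == self.LOW_MAX:
            # Carry out of the low part: the estimator becomes (or stays) heavy.
            self.low[i] = 0
            self.high[i] = self.high.get(i, 0) + 1
        else:
            self.low[i] += 1

    def value(self, i):
        return (self.high.get(i, 0) << self.LOW_BITS) | self.low[i]

    def heavy_count(self):
        return len(self.high)
```

Only the few indices that ever carry pay for a table entry, while every other estimator costs exactly two bytes.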

### III-D Dynamically Increasing the Counting Range

Heretofore, we have assumed that is known, which allowed us to tune our sampling rate . Sometimes may not be known in advance (e.g., in the case where the measurement length is defined in time and not packets). We propose two algorithms for such a scenario – MaxAccuracy and MaxSpeed. Intuitively, MaxAccuracy aims for the best accuracy possible given the counter size, while MaxSpeed uses the minimal sampling probability to preserve the accuracy guarantee and is therefore faster.

In MaxAccuracy, we start with , and whenever some counter needs to exceed its maximal value we independently replace each -valued counter with a generated binomial random variable and halve the value of . This procedure is called *downsampling* and was first introduced in [gibbons1998new]. That is, once *some* counter overflows we decrease the value of *all* counters. This simulates a process where each PIncrement increased the value of the estimator with the current value of . As a result, our accuracy guarantees seamlessly follow for the new estimator, given that are such that is smaller than for estimators of length . For example, if we are using -bit counters, then once a counter is incremented for the ’th time, we halve and downsample the estimator.
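A rough Python sketch of this overflow-triggered procedure is given below. It assumes that the sampling probability starts at 1, that a counter holding `c` is replaced by a Binomial(`c`, 1/2) draw, and that the increment which triggered the overflow must itself survive the halved probability. These choices, and all names, are assumptions made for illustration rather than a transcription of Algorithm 1.

```python
import random


class MaxAccuracyArray:
    """Estimator array that halves its sampling probability on overflow."""

    def __init__(self, width, bits, seed=None):
        self.max_value = (1 << bits) - 1
        self.counters = [0] * width
        self.p = 1.0                     # assumed initial sampling probability
        self._rng = random.Random(seed)

    def _downsample(self):
        # Keep each past increment independently with probability 1/2,
        # i.e. replace a counter holding c by a Binomial(c, 1/2) draw.
        for i, c in enumerate(self.counters):
            self.counters[i] = sum(self._rng.random() < 0.5 for _ in range(c))
        self.p /= 2

    def p_increment(self, i):
        if self._rng.random() >= self.p:
            return
        while self.counters[i] == self.max_value:
            self._downsample()
            # The pending increment is re-sampled at the halved rate.
            if self._rng.random() >= 0.5:
                return
        self.counters[i] += 1

    def query(self, i):
        return self.counters[i] / self.p
```

After each downsampling step, the array is distributed as if every earlier PIncrement had been sampled with the new, smaller probability.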

MaxSpeed does not wait for a counter to reach its maximal value, but instead tracks the number of PIs, which we denote by , and uses a sampling probability . That is, the first PIs are performed with probability , the next PIs with probability , then for PIs it is reduced to , etc. Whenever we halve the sampling probability, we also downsample the counter to maintain the accuracy guarantees. We note that this estimator requires bits, i.e., one additional bit compared to our estimator when knowing in advance.

The pseudocode for MaxAccuracy is given in Algorithm 1, and for MaxSpeed in Algorithm 2. These are generic algorithms that apply to many sketch and cache-based algorithms. Such algorithms vary in the way they implement Line 4 in Algorithm 1, and Line 12 in Algorithm 2. The line returns the counters of , which are algorithm dependent. For example, in the CM Sketch [CountMinSketch] and the CU Sketch [CUSketch] the set contains a single counter from each array chosen by applying a hash function to . In Space Saving [SpaceSavings] and RAP [RAP], the counter is counter if it is monitored, or the minimal counter if it is not monitored. Notice that the algorithms may take steps in addition to increasing the counters using our algorithm. For example, Space Saving and RAP may replace the identifier associated with the minimal counter in addition to increasing it.

Deterministic Downsampling. We now propose a deterministic method for reducing the estimator values (in both MaxAccuracy and MaxSpeed). Specifically, when downsampling a -valued estimator, we replace its value with instead of . (One can get slightly more accurate results by randomized rounding up the estimator by with probability 50% if was odd. However, as this improvement is negligible compared with the error of the estimator, we eschew it for a faster implementation.) The intuition is that this allows us to reduce the variance in the estimation. We have run experiments to confirm that the accuracy of the deterministic downsampling is superior to that of the probabilistic one. The theoretical accuracy guarantee of the deterministic downsampling is left for future work. The experiments, whose results are depicted in Figure 1, are obtained by running each point times and reporting its 95% confidence interval according to Student’s t-test [student1908probable]. As shown, the deterministic downsampling is indeed more accurate.

Deamortized Downsampling. Both algorithm variants include a downsampling operation that requires linear time. In some deployments, having a long maintenance operation may cause high latency and even packet drops. To deamortize the downsampling operation and ensure low worst-case update time, we add a *generation* bit to each counter, which specifies if it was downsampled an even number of times. Then, for each packet, we downsample a number of counters that asymptotically equals the amortized update time (e.g., with sixteen-bit counters, we can downsample counters in each update). Importantly, if a counter that has not been downsampled yet overflows, we immediately downsample it and switch its generation bit, to identify it once the maintenance operation reaches it.

### III-E Optimizing the Update Speed

While our proposed estimator saves space, we designed it in a manner that can also reduce the update time. The key aspect of our approach is that the probability for updating an estimator *does not* depend on its current value. In comparison, the update probability in all the estimator techniques surveyed in this work [ICE-Buckets, CASE, CEDAR, ANLSUpscaling, DISCO, SAC, Infocom2019] depends on the current estimator value.

Specifically, we can decide if an estimator is updated prior to calculating the sketch hash functions, and without reading any data structure. When is large enough, most packets require no additional work as they do not update any estimator.
Further, we can use Geometric Sampling [Nitro] to determine how many packets to skip before an estimator is updated. If each packet is sampled with probability , then the number of packets until the next sample is distributed geometrically with mean . Geometric Sampling simply generates a single variable (i.e., ) by using the Inverse Transform Sampling method.
The method sets for a uniform random variable ; it requires a single uniform variate and a few floating-point operations. The variable is shared across *all* estimators and thus does not impose a significant memory overhead (e.g., it can be implemented as a 64-bit integer).
While a similar approach for acceleration appears in NitroSketch [Nitro], it does not allow for shorter counters as they add to the sampled counters and vary over time.
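Geometric sampling by inverse transform can be sketched in a few lines of Python. The sketch uses the standard inverse-transform formula for a geometric variable. The function names are hypothetical, and the skip count returned here excludes the sampled packet itself, so the gap between consecutive samples is the skip count plus one.

```python
import math
import random


def packets_to_skip(p, rng):
    """Number of packets to pass over before the next sampled packet."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()              # uniform on (0, 1], avoids log(0)
    # Inverse transform: P(skip >= k) = (1 - p) ** k.
    return int(math.log(u) / math.log1p(-p))


def sampled_positions(n_packets, p, rng):
    """Indices of the packets that would update an estimator."""
    positions = []
    i = packets_to_skip(p, rng)
    while i < n_packets:
        positions.append(i)
        i += 1 + packets_to_skip(p, rng)
    return positions
```

A packet loop built on this only needs to decrement one shared skip counter per packet, and it touches hash functions and estimators only when that counter reaches zero.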

For sketches that associate each flow with estimators, such as the Count Min Sketch and Conservative update, the geometric sampling only requires operations per packets, which gives an amortized complexity of . That is, we have a constant update time for streams in which .

While cache-based algorithms such as Space-Saving and Frequent have data structures that allow constant-time updates [CormodeCode], they may require seven pointers per entry. Alternative approaches include a heap-implementation [CormodeCode] that, while being space-efficient, requires a logarithmic update time. Our approach allows using a heap while keeping the amortized update complexity constant (in streams in which ).

## IV Integrating Estimator Arrays with Sketches

Sketch data structures utilize several independent counter arrays. Intuitively, each array provides an estimation which is (roughly) accurate with a constant probability, and additional arrays amplify the success probability. For example, the Count Min Sketch (CMS) [CountMinSketch] employs arrays of counters each. Whenever an element arrives, it uses uncorrelated pairwise-independent hash functions that map the input to the range , and for each it increments the counter of the ’th array. When receiving a query for the multiplicity of , we take the minimum over all of . Clearly, CMS can be implemented using our estimator array algorithm above, replacing increment operations with the probabilistic increment operations. For example, with arrays of counters each, we require about KB for the entire encoding.

The sketch itself also has an error that is caused by collisions of different items that increment the same counter. For CMS, it guarantees that the error will be bounded by with probability , for and . Combining the error from the sketch with that of the counter arrays, we have an error of at most with probability at least . For example, if and then replacing the CMS’s counters (assuming they are 32-bits each) with our estimators reduces the space from KB to KB while increasing the error from 0.271% to 0.371% and the error probability from 0.67% to 0.97%. We note that a CMS configured for a 0.371% error except with probability 0.97% would still require more space (KB) than our solution (while also being considerably slower).

## V Cache-based Counter Algorithms

Sketches are a popular design choice for hardware as they are easy to implement there. In software, however, one can generally get a better accuracy-to-space tradeoff by using cache-based counter algorithms [SpaceSavingIsTheBest2010, SpaceSavingIsTheBest2011]. Specifically, algorithms like Space Saving [SpaceSavings], Misra-Gries [misra1982finding], and Frequent [BatchDecrement, frequent4] use counters (as opposed to in sketches such as Count Min).

In this section, we consider compact cache-based algorithms that can benefit from utilizing estimators, rather than full-sized counters. To obtain maximal benefits, we concurrently aim to minimize the overhead from the flow identifiers. For example, flows are typically defined by five-tuples that are 13 bytes long, whereas counters are typically 4 to 8 bytes long. In such a setting, reducing a 4-byte counter to a 2-byte estimator offers only marginal space improvements.
We therefore propose replacing the identifiers with *fingerprints*, i.e., short pseudo-random bitstrings generated as hashes of the identifiers.
Fingerprints were proposed before (e.g., see [HeavyHitters]) to compress identifiers; however, the following analysis, which asks for the shortest size at which an element experiences additive error at most , appears to be new.
In particular, it allows us to use shorter fingerprints compared to previous analyses.
If the stream contains distinct items, then fingerprints of size suffice to ensure that no two items have a fingerprint collision (with suitably high probability) and thus the accuracy is essentially unaffected by this compression. However, while fingerprints may be smaller than the bytes required for encoding five-tuples, they may still be significantly larger than the estimator.
We can do better by dropping the requirement that no collisions occur, and instead finding the minimal fingerprint length () that allows an error of at most with probability . We show that suffices, implying that the fingerprint length can be of the same order as our estimators.
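To make the notion of a fingerprint concrete, the following Python sketch packs a five-tuple into its 13-byte identifier and truncates a hash of it to a chosen number of bits. BLAKE2b is used here purely as a stand-in hash, and the helper names are hypothetical; the fingerprint length is left as a parameter because its recommended value comes from the analysis rather than from this sketch.

```python
import hashlib
import socket
import struct


def five_tuple_bytes(src_ip, dst_ip, src_port, dst_port, proto):
    """Pack an IPv4 five-tuple into its 13-byte flow identifier."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HHB", src_port, dst_port, proto))


def fingerprint(flow_key, bits):
    """Short pseudo-random bitstring derived from the flow identifier."""
    digest = hashlib.blake2b(flow_key, digest_size=8).digest()
    # Keep the top `bits` bits of the 64-bit digest.
    return int.from_bytes(digest, "big") >> (64 - bits)


key = five_tuple_bytes("10.0.0.1", "192.168.1.7", 443, 51234, 6)
fp = fingerprint(key, 24)               # a three-byte fingerprint
```

A cache entry then stores `fp` in place of the 13-byte key, so two flows are confused only when their fingerprints collide.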

We use a weighted variant of the Chernoff bound which states that for independent random variables with values in the interval for some , the sum satisfies for all ,

Given a parameter , we split the items into large and small ones. Let denote the set of items whose size is at least , and let denote the remaining.
Further, let denote the total size of the large items and let denote the total size of the small ones. We have that .
We want to set the fingerprint size such that with probability *none* of the large items collide with and the sum of sizes for the small colliding items is at most .
Using the union bound, and the fact that , we have that the probability for a collision with a large item is at most .
For each small item with size , we define the random variable to take the value if has the same fingerprint as and otherwise. The total volume that collides with is then (i.e., ). Since each is bounded by , we use the Chernoff bound with to conclude that

Therefore, the overall chance of failure is at most

(1)

To account for all possible splits of packets into large and small flows and guarantee that (1) is at most , we choose

to conclude that with probability at most packets collide with the fingerprint of . For example, by setting , we find that two byte identifiers yield , three bytes yield an error lower than , and -bit identifiers yield .

Space Saving, Misra Gries, and Frequent are all deterministic and have an additive error of , where and is again the width. Therefore, combining them with our estimators (with an guarantee) yields an overall error of with probability at least .

For brevity, we next provide two numerical examples with and .

Example 1. Consider and ; we get an error lower than with probability at least , while compressing the identifiers into three bytes and replacing the counters with two-byte estimators. That is, our example requires 5 bytes per entry, compared with bytes in the original. We also have at most large counters (see Section IV), for an overall memory of KB. In contrast, for a error guarantee, these algorithms would need counters, requiring more space.

Example 2. Consider and . That is, we require 24-bit estimators and have at most large estimators. This configuration has a total error of at most with probability and requires 6.2KB. In comparison, the uncompressed variants require nearly KB of space for the same guarantees.

## VI Evaluation

We evaluate our algorithms on two real packet traces: the first 98M packets of (1) the CAIDA equinix-newyork 2018 (NY18) [CAIDA2018] and (2) the CAIDA equinix-chicago 2016 (CH16) [CAIDA2016] backbone traces. We picked these traces as they are somewhat different: CH16 contains 2.5M flows while NY18 exhibits a heavier tail and has nearly 6.5M flows. We implement our algorithms in C++ and compare them with the state-of-the-art SAC estimators [Infocom2019], whose code we obtained from the authors. The Baseline code for Space Saving was taken from [CormodeCode] and we extended it to implement the RAP and dWay-RAP algorithms. For a fair comparison, all algorithms employ the same hash function (BobHash). The default setting for our algorithm is MaxAccuracy and we evaluate the difference from MaxSpeed in Section VI-E. We ran the evaluation on a PC with an Intel Core i7-7700 CPU @3.60GHz and 16GB DDR3 2133MHz RAM. Finally, we refer to a PI as an increment, to be consistent across all algorithms.

We use the following metrics. For speed, we use million operations per second (Mops). For accuracy, on single-estimator experiments, we use Normalized Error, which is defined as the absolute error divided by the number of increments (or the sum of additions in the weighted experiment).

Finally, we run every data point 10 times and use Student’s t-test [student1908probable] to report the 95% confidence intervals.

### VI-A Single Estimator

We begin by estimating the error and throughput of a single estimator as a function of the number of increments. We compare our Additive Error Estimator (AEE) to Static SAC [Infocom2019] and Dynamic SAC [Infocom2019]. Figure (a) shows the normalized error for each 8-bit estimator as a function of the number of increments. AEE retains roughly the same normalized error regardless of the number of increments. In contrast, Static SAC and Dynamic SAC experience higher error and can only count until about . This is because each SAC counter requires a few bits to encode its sampling probability, which leaves very few bits for the estimator itself. In contrast, all the AEE estimators use the same sampling probability, which means that we can leverage all 8 bits. Figure (b) shows the speed of an 8-bit estimator. AEE is orders of magnitude faster since we do not need to access the estimator to decide whether to increment it. Figure (c) and Figure (d) repeat this experiment for a 16-bit counter. Static SAC and Dynamic SAC perform better than in the 8-bit case but eventually experience increasing error when the count becomes sufficiently large. In comparison, AEE’s error remains the same regardless of the number of increments and is always lower than (or equal to) that of Static SAC and Dynamic SAC. Figure (d) compares the speed, showing that AEE is considerably faster. The non-monotone shape of the AEE curve is due to the computationally expensive random number generation. Specifically, AEE is especially fast when not sampling (less than increments) and when sampling aggressively (when is large, and is small). In between, there is a range in which sampling occurs with a relatively high probability (e.g., 1/2), slowing AEE down.

### VI-B Sketch Algorithms

Next, we evaluate the accuracy and speed of the CM Sketch [CountMinSketch] and the CU Sketch [CUSketch], using standard 32-bit counters (denoted Baseline), AEE, Dynamic SAC, and Static SAC estimators. Let us first consider the error on the NY18 trace (Figure (a) and Figure (e)). All estimators attain similar accuracy, which is better than Baseline for both the CM Sketch and the CU Sketch. Then, as the memory increases, the precision of the estimator-based sketches stops improving while that of the Baseline improves further. Intuitively, the error of estimator-based sketches has two components. One is the sketch error, which decreases as we allocate more estimators to the sketch. The other is the estimation error, which stays the same. Thus, as we gradually reduce the sketch error, it eventually becomes negligible compared to the estimation error. Since the CU Sketch is more accurate than the CM Sketch [CUSketch], the estimation error becomes the bottleneck earlier.

Figure (b) and Figure (f) repeat this experiment on the CH16 trace. The main difference is that the CH16 trace contains only 2.5M distinct flows, while the NY18 trace contains 6.4M distinct flows. As such, the sketch error is considerably lower in the CH16 trace (as there are fewer flows that share the same counters). Indeed, we see that the error of estimator-based sketches does not improve with additional memory, which implies that the estimation error is the dominant one throughout the range. Notably, AEE attains lower error than Static SAC and Dynamic SAC.

Figure (c), Figure (d), Figure (g), and Figure (h) show the speed for the CM Sketch and the CU Sketch. Static SAC and Dynamic SAC are slower than Baseline because their sampling probability depends on the specific counter. Therefore, for each increment, we first access the sketch counters (and calculate multiple hash functions), and only then determine the sampling probability. In contrast, in AEE, the sampling probability is identical for all counters. Thus, we first flip a coin and access the sketch counters only if we need to update them. As a result, AEE is considerably faster than Baseline.
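The ordering of operations described above can be sketched as follows. This is a minimal illustration under stated assumptions and not our implementation: the class name is hypothetical, the sampling probability `p` is fixed up front, the rows hold plain Python integers rather than short estimators, and Python's built-in `hash` with per-row salts stands in for the sketch's hash functions. The `hash_calls` field exists only to make the saved work visible.

```python
import random


class SampledCountMin:
    """Count-Min sketch that flips the sampling coin before any hashing."""

    def __init__(self, width, depth, p, seed=0):
        self.width = width
        self.p = p  # one sampling probability for all counters (assumed fixed)
        self.rng = random.Random(seed)
        self.salts = [self.rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]
        self.hash_calls = 0  # instrumentation only

    def _index(self, row, key):
        self.hash_calls += 1
        return hash((self.salts[row], key)) % self.width

    def update(self, key):
        # Coin flip first: a non-sampled packet costs one random number,
        # with no hash computation and no counter access.
        if self.rng.random() >= self.p:
            return
        for row in range(len(self.rows)):
            self.rows[row][self._index(row, key)] += 1

    def query(self, key):
        smallest = min(
            self.rows[row][self._index(row, key)] for row in range(len(self.rows))
        )
        return smallest / self.p  # scale back by the sampling probability
```

With a per-counter sampling probability, as in SAC, the `update` method would have to compute all row indices and read the counters before it could decide whether to skip the packet.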

### VI-C Cache-based Algorithms

We evaluate our cache-based algorithms against their vanilla baselines. Specifically, we compare Space Saving in the original implementation by [SpaceSavingIsTheBest] (denoted BaselineSS), RAP and 16-Way RAP (denoted BaselineRAP and Baseline16W-RAP), and our compressed versions of these algorithms (denoted AEE-SS, AEE-RAP, and AEE-16W-RAP, respectively). Figure (b) shows the update speed. The AEE algorithms are an order of magnitude faster than the Baseline algorithms as we do not need to update the data structures for each packet.

Figure (a) depicts the error for the NY18 trace. At the beginning of the range, each AEE algorithm is more accurate than its corresponding Baseline, and the most accurate ones are Baseline16W-RAP and AEE-16W-RAP. At first glance, it may seem strange that we gain better accuracy in the limited associativity model than in the fully associative model. However, 16W-RAP can be implemented efficiently in an array, whereas RAP uses the same heap data structure as in the Space Saving implementation, which requires about 41 bytes per entry [CormodeCode]. In contrast, 16W-RAP only takes 13 bytes for the flow identifier and 4 bytes for the estimator, for a total of 17 bytes per entry. AEE-16W-RAP takes it one step further with just 4 bytes for a fingerprint and 2 bytes for the estimator, i.e., 6 bytes per entry overall. Thus, for a given space, Baseline16W-RAP has more entries than BaselineRAP, and AEE-16W-RAP has even more. As we increase the amount of space, all Baseline algorithms improve, while the AEE algorithms improve until the estimation error becomes the dominant one.
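The effect of the per-entry sizes on the number of entries can be checked with a short calculation. The byte counts below are the ones quoted above; the 1 MiB budget and the function name are hypothetical choices for this example.

```python
# Bytes per entry, as quoted in the text.
BYTES_PER_ENTRY = {
    "BaselineRAP": 41,          # heap-based, about 41 bytes per entry
    "Baseline16W-RAP": 13 + 4,  # flow identifier + estimator
    "AEE-16W-RAP": 4 + 2,       # fingerprint + estimator
}


def entries_for_budget(budget_bytes):
    """Number of whole entries each variant fits into the given space."""
    return {name: budget_bytes // size for name, size in BYTES_PER_ENTRY.items()}


print(entries_for_budget(1 << 20))
# {'BaselineRAP': 25575, 'Baseline16W-RAP': 61680, 'AEE-16W-RAP': 174762}
```

For the same space, AEE-16W-RAP therefore holds roughly 2.8 times as many entries as Baseline16W-RAP and roughly 6.8 times as many as BaselineRAP.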

### VI-D Weighted Counters

We estimate the total byte volume of the NY18 trace using a single estimator. The results are depicted in Figures (a) and (b). As in the unweighted case, AEE has better accuracy () and speed () compared with Dynamic SAC.

Figures (c) and (d) show results for per-flow byte volume estimation on the NY18 trace. AEE is more accurate than the Baseline () until the estimation error becomes dominant (). AEE is also faster than the Baseline (). For accuracy, Dynamic SAC shows a similar trend, but its estimation error becomes dominant at a smaller size.

### VI-E The MaxSpeed Variant

We now evaluate MaxSpeed versus MaxAccuracy (which we used in previous sections). As shown in Figure 6, MaxSpeed is about faster than MaxAccuracy while offering similar accuracy when the allocated memory is small. We conclude that MaxSpeed is suitable when space is tight or if one requires extremely high speeds.

## VII Discussion

Our work explores the opportunities offered by replacing full-sized counters in approximate measurement algorithms with short estimators. Specifically, we observe that the target algorithms provide additive error guarantees, while most estimators are designed to provide multiplicative error, which adds needless complexity in this context.

We introduce an Additive Error Estimator (AEE) that offers benefits over multiplicative estimators when combined with sketches and cache-based counting algorithms. Most notably, it maintains the same additive error guarantee over any counting range. Namely, AEE allows us to count indefinitely without overflowing while maintaining the accuracy guarantee. Further, AEE offers faster update speed as it increments all counters with the same probability and avoids computing hash functions for non-sampled packets. Our empirical results show that the AEE estimator is faster and more accurate than existing estimators. The evaluation also shows the limitations of our estimator, which are in line with the theoretical results.

The code of our algorithms is available as open source [opensource].

### -A Proof of our single counter correctness

###### Proof.

If the number of Increments was , then . We have that and . We use a variant of the Bennett bound (see [janson2016large, Eq.1.15]) stating that for every set of independent Bernoulli random variables such that and otherwise, their sum satisfies such that :

Consider our counter where is the indicator of the event in which the ’th attempted increment operation increased the counter. Choosing we get that for all :

(2) |

We use (2) for our counter , and set to obtain:

The function is monotonically increasing in , and therefore so is . As we can bound the error probability as

We use the elementary inequality which gets us to

where the last inequality follows from our choice of . ∎

### -B Proof of our weighted updates correctness

Consider a stream of weighted updates and let denote the total additions made to the counter. For each , let and denote the partitioning of the weight as explained in Section III-B. We also use and to denote the partial weights.

For each , let denote whether we incremented the counter as a result of the coin flip for the ’th update, i.e., and otherwise. Observe that . As the first summand is deterministic, we denote for its probabilistic part; we have that and . Our goal is to show that as this would imply the correctness of our algorithm similarly to the unweighted case:

That is, we showed that and that . The correctness then follows from an analysis similar to that of Appendix -A.
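For intuition, the split of each update into a deterministic and a probabilistic part can be sketched as follows. This is one plausible reading for illustration only; the actual partitioning is the one defined in Section III-B. The sketch assumes that the weight is first scaled by the sampling probability `p`, that the integer part of the scaled weight is added deterministically, and that the fractional part decides a single coin flip. The function name is hypothetical.

```python
import math
import random


def weighted_add(counter, weight, p, rng):
    """Add `weight` to a counter sampled with probability `p` (assumed scheme)."""
    scaled = weight * p
    whole = math.floor(scaled)  # deterministic part of the update
    frac = scaled - whole       # probabilistic part, in [0, 1)
    # One coin flip per update; the expected increase is exactly weight * p.
    return counter + whole + (1 if rng.random() < frac else 0)


rng = random.Random(7)
counter = 0
for _ in range(10000):
    counter = weighted_add(counter, weight=3, p=0.5, rng=rng)
estimate = counter / 0.5  # estimate of the total weight, which is 30000
```

Only the coin flips contribute randomness, which is why the analysis reduces to bounding a sum of independent Bernoulli variables as in the unweighted case.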

### -C Proof of the sum-of-counters Bound

We now prove that the sum of compressed counters in our counter array is at most with probability . Let the number of times an Increment operation was called, for , and let denote the total number of increments. Notice that since we have . For , let denote whether the ’th increment operation (to any counter) resulted in an increase in a counter. We denote by the sum of all counters after the increments. Then and a simple application of the Chernoff bound implies .