# The Adversarial Robustness of Sampling


## 1 Introduction

Random sampling is a simple, generic, and universal method to deal with massive amounts of data across all scientific disciplines. It has wide-ranging applications in statistics, databases, networking, data mining, approximation algorithms, randomized algorithms, machine learning, and other fields (see e.g., [CJSS03, JMR05, JPA04, CDK11, CG05, CMY11] and [Cha01, Chapter 4]).

Perhaps the central reason for its wide applicability is the fact that it (provably, and with high probability) suffices to take only a small number of random samples from a large dataset in order to “represent” the dataset truthfully (the precise geometric meaning is explained later). Thus, instead of performing costly and sometimes infeasible computations on the full dataset, one can sample a small yet “representative” subset of the data, perform the required analysis on this small subset, and extrapolate (approximate) conclusions from the small subset to the entire dataset.

The analysis of sampling algorithms has mostly been studied in the non-adaptive (or static) setting, where the data is fixed in advance, and then the sampling procedure runs on the fixed data. However, it is not always realistic to assume that the data does not change during the sampling procedure, as described in [MNS11, GHR12, GHS12, HW13, NY15]. In this work, we study the robustness of sampling in an adaptive adversarial environment.

At a high level, the model is a two-player game between a randomized streaming algorithm, called Sampler, and an adaptive player, called Adversary. In each round,

1. Adversary first submits an element to Sampler. The choice of the element can depend, possibly in a probabilistic manner, on all elements submitted by Adversary up to this point, as well as all information that Adversary has observed from Sampler up to this point.

2. Next, Sampler probabilistically updates its internal state, i.e., the sample that it currently maintains. An update step usually involves inserting the newly received element into the sample with some probability, and sometimes deleting old elements from the sample.

3. Finally, Adversary is allowed to observe the current (updated) state of Sampler before proceeding to the next round.

Adversary's goal is to make the sample as unrepresentative as possible, causing Sampler to reach false conclusions about the data stream. The game is formally described in Section 2.
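The round structure above can be sketched as a small simulation harness. This is an illustrative sketch only: the function names, callback signatures, and the toy Bernoulli sampler and oblivious adversary in the example are our own, not part of the paper's formal model.

```python
import random

def run_game(sampler_update, adversary_next, n, rng):
    """Play n rounds of the game between Sampler and Adversary.

    sampler_update(state, x, t, rng) -> new state (the maintained sample);
    adversary_next(history, state, rng) -> next element, chosen with full
    knowledge of the stream so far and of Sampler's current state.
    """
    state, history = [], []
    for t in range(1, n + 1):
        x = adversary_next(history, state, rng)   # step 1: adaptive choice
        history.append(x)
        state = sampler_update(state, x, t, rng)  # step 2: probabilistic update
        # step 3: `state` is visible to the adversary in the next iteration
    return history, state

# Example: Bernoulli updates against an adversary that ignores the state.
rng = random.Random(0)
ber = lambda state, x, t, r: state + [x] if r.random() < 0.5 else state
stream, sample = run_game(ber, lambda hist, state, r: len(hist), 100, rng)
```

The sample produced this way is always a subsequence of the submitted stream, matching the description of Sampler's output in Section 2.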

Even when there is no apparent adversary, the adaptive perspective is sometimes natural and required. For instance, adaptive data analysis [DFH15, WFRS18] aims to understand the challenges arising when data arrives online, such as data reuse, the implicit bias “collected” over time in scientific discovery, and the evolution of statistical hypotheses over time. In graph algorithms, [CGP18] observed that an adversarial analysis of dynamic spanners would yield a simpler (and quantitatively better) alternative to their work.

In view of the importance of robustness against adaptive adversaries, and the fact that random sampling is very widely used in practice (including in streaming settings), we ask the following: are the classical sampling algorithms robust against adversarially chosen streams?

#### Bernoulli and reservoir sampling.

We mainly focus on two of the most basic and well-known sampling algorithms: Bernoulli sampling and reservoir sampling. The Bernoulli sampling algorithm with parameter p runs as follows: whenever it receives a stream element x, the algorithm stores the element with probability p. For a stream of length n, the sample size is expected to be pn; furthermore, it is well-concentrated around this value. We denote this algorithm by Ber(p).

The classical reservoir sampling algorithm [Vit85] (see also [Knu97, Section 3.4.2] and a formal description in Section 2) with parameter k maintains a uniform sample of fixed size k, acting as follows. The first k elements it receives are simply added to the memory with probability one. When the algorithm receives its t-th element, where t > k, it stores it with probability k/t, overriding a uniformly random element in the memory (so the memory size is kept fixed at k). We henceforth denote this algorithm by Res(k).
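Both algorithms are short enough to state as code. Below is a minimal sketch; the class names Ber and Res mirror the notation used here and are our own, not taken from the original.

```python
import random

class Ber:
    """Bernoulli sampling: keep each arriving element independently w.p. p."""
    def __init__(self, p, rng):
        self.p, self.rng, self.sample = p, rng, []

    def process(self, x):
        if self.rng.random() < self.p:
            self.sample.append(x)

class Res:
    """Reservoir sampling [Vit85]: maintain a uniform sample of fixed size k."""
    def __init__(self, k, rng):
        self.k, self.rng, self.t, self.sample = k, rng, 0, []

    def process(self, x):
        self.t += 1
        if self.t <= self.k:
            self.sample.append(x)          # first k elements are always kept
        elif self.rng.random() < self.k / self.t:
            # keep x_t with probability k/t, overriding a uniform victim
            self.sample[self.rng.randrange(self.k)] = x

rng = random.Random(0)
res = Res(10, rng)
for x in range(1000):
    res.process(x)
```

A short induction shows that after any number of rounds t ≥ k, each of the t elements seen so far resides in the reservoir with probability exactly k/t.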

#### Attacking sampling algorithms.

To answer the question above of whether sampling algorithms are robust against adversarially chosen streams, we must first define a notion of a representative sample, as several notions might be appropriate. However, we begin the discussion with an example showing how to attack the Bernoulli (and reservoir) sampling algorithm with respect to essentially any definition of “representative”.

Consider a setting where the stream consists of n points in the one-dimensional range of real numbers (0, 1). Sampler receives these points and samples each one independently with probability p. One can observe that, in the static setting and for sufficiently large n, the sampled set will be a good representation of the entire set of points for various definitions of the term “representation”. For example, the median of the stream will be ε-close111The term “close” here means that the median of the sampled set will be an element whose order among the elements of the full stream, when the elements are sorted by value from smallest to largest, is within the range ((1/2 − ε)n, (1/2 + ε)n), with high probability, where the parameter ε depends on the probability p. to the median of the sampled elements with high probability, as long as p ≥ c/n for some constant c depending on ε (this also holds for any other quantile).

Consider the following adaptive adversary, which demonstrates the difference in the adaptive setting. Adversary keeps a “working range” at any point during the game, starting with the full range (0, 1). In the first round, Adversary chooses the number 1/2 as the first element in the stream. If 1/2 is sampled, then Adversary moves to the range (1/2, 1), and otherwise, to the range (0, 1/2). Next, Adversary submits the middle of the current range. This continues for n steps. Formally, Adversary's strategy is as follows. Set a_0 = 0 and b_0 = 1. In round i, where i runs from 1 to n, Adversary submits x_i = (a_{i−1} + b_{i−1})/2 to Sampler; if x_i is sampled then Adversary sets (a_i, b_i) = (x_i, b_{i−1}), and otherwise, it sets (a_i, b_i) = (a_{i−1}, x_i). The final stream is x_1, …, x_n.

Note that at any point throughout the process, Adversary always submits an element that is larger than all elements in the current sampled set, and also smaller than all the non-sampled elements of the stream. Therefore, after this process is over, with probability 1, the sampled elements are precisely the smallest elements in the stream. Of course, the median of the sampled set is far from the median of the stream, as such a subset is very unrepresentative of the data. Actually, one might consider it the “most unrepresentative” subset of the data.

The exact same attack on Ber(p) works almost as effectively against Res(k). In this case, the attack will cause all of the sampled elements at the end of the process to lie within a small prefix of the stream with high probability. For more details, see Section 5.
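The bisection attack described above can be simulated directly. This is an illustrative sketch under the assumption, as in the game above, that the adversary observes after each round whether its element entered the sample; the function name is our own.

```python
import random

def bisection_attack(p, n, rng):
    """The adaptive attack on Ber(p): submit the midpoint of the current
    working range; if it is sampled, recurse into the upper half (so all
    future elements exceed it), otherwise into the lower half."""
    lo, hi = 0.0, 1.0
    stream, sampled = [], []
    for _ in range(n):
        x = (lo + hi) / 2
        stream.append(x)
        if rng.random() < p:   # the sampler's coin, visible to the adversary
            sampled.append(x)
            lo = x             # future elements will be larger than x
        else:
            hi = x             # future elements will be smaller than x
    return stream, sampled

rng = random.Random(1)
stream, sampled = bisection_attack(0.5, 50, rng)
not_sampled = [x for x in stream if x not in sampled]
# With probability 1, the sample is exactly the set of smallest elements:
assert all(s < x for s in sampled for x in not_sampled)
```

Note that after n rounds the working range has width 2^{−n}, which is why a discrete universe must be exponentially large in n for this attack to go through, as discussed below.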

#### The good news.

This attack joins a line of attacks in the adversarial model. Lipton and Naughton [LN93] showed that an adversary that can measure the time of operations in a dictionary can use this information to increase the probability of a collision and, as a result, significantly degrade the performance of the hash table. Hardt and Woodruff [HW13] showed that linear sketches are inherently non-robust and cannot be used to compute the Euclidean norm of their input (even though in the static setting this is one of their main uses). Naor and Yogev [NY15] showed that Bloom filters are susceptible to attacks by an adaptive stream of queries if the adversary is computationally unbounded, and they also constructed a robust Bloom filter against computationally bounded adversaries.

In our case, we note that properties of the given attack might categorize it as “theoretical” only. In practice, it is unrealistic to assume that the universe from which Adversary can pick elements is an infinite set; how would the attack look, then, if the universe is the discrete set {1, 2, …, m}? Adversary splits the range in half n times, meaning that the precision required of the elements is exponential in n; the analogous attack in the discrete setting requires the universe size m to be exponentially large with respect to the stream size n. Such a universe size is large and “unrealistic”: for Sampler to memorize even a single element requires memory size that is linear in n, whilst sampling and streaming algorithms usually aim to use an amount of memory sublinear in n.

Thus, the question remains whether there exist attacks that can be performed using substantially less precision, that is, over a significantly smaller discrete universe. In this work, we bring good news to both the Bernoulli and reservoir sampling algorithms by answering this question negatively. We show that both sampling algorithms, with the right parameters, will output a representative sample with good probability regardless of Adversary's strategy, thus exhibiting robustness for these algorithms in adversarial settings.

We note that any deterministic algorithm that works in the static setting is inherently robust in the adversarial adaptive setting as well. However, in many cases, deterministic algorithms with small memory simply do not exist, or they are complicated and tailored for a specific task. Here, we enjoy the simplicity of a generic randomized sampling algorithm combined with the robust guarantees of our framework.

#### What is a representative sample?

Perhaps the most standard and well-known notion of being representative is that of an ε-approximation, first suggested by Vapnik and Chervonenkis [VC71] (see also [MV17]), which originated as a natural notion of discrepancy [Cha01] in the geometric literature. It is closely related to the celebrated notion of VC-dimension [VC71, Sau72, She72], and captures many quantitative properties that are desired in a random subset. Let S = (x_1, …, x_n) be a sequence of elements from the universe U (repetitions are allowed) and let A ⊆ U. The density of A in S is the fraction of elements in S that are also in A (i.e., |{i : x_i ∈ A}|/n).

A set system is simply a pair (U, R) where R is a collection of subsets of U. A non-empty subsequence T of S is an ε-approximation of S with respect to the set system if it preserves densities (up to an additive ε) for all subsets A ∈ R.

###### Definition 1.1 (ε-approximation).

We say that a (non-empty) sample T is an ε-approximation of S with respect to (U, R) if for any subset A ∈ R it holds that

 |(density of A in T) − (density of A in S)| ≤ ε.

If the universe U is well-ordered, it is natural to take R as the collection of all consecutive intervals in U (including all singletons). With this set system in hand, ε-approximation is a natural form of “good representation” in the streaming setting, as witnessed by its deep connection to multiple classical problems in the streaming literature, like approximate median and, more generally, quantile estimation [MRL99, GK01, WLYC13, GK16, KLL16] and range searching [BCEG07]. In particular, if T is an ε-approximation of S with respect to intervals, then any quantile of T is ε-close to the corresponding quantile of S; this holds simultaneously for all quantiles (see Section 1.2).
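For the interval set system, Definition 1.1 can be checked by brute force. A minimal sketch, under the assumption that it suffices to consider intervals whose endpoints are values appearing in the stream (any other interval induces the same trace on both the stream and the sample); the function names are our own.

```python
from itertools import combinations

def density(seq, a, b):
    """Fraction of elements of seq lying in the interval [a, b]."""
    return sum(a <= x <= b for x in seq) / len(seq)

def is_eps_approximation(sample, stream, eps):
    """Check Definition 1.1 for the set system of consecutive intervals,
    enumerating all intervals with endpoints among the stream's values
    (including singletons)."""
    if not sample:
        return False
    values = sorted(set(stream))
    intervals = [(a, a) for a in values] + list(combinations(values, 2))
    return all(abs(density(sample, a, b) - density(stream, a, b)) <= eps
               for a, b in intervals)

stream = list(range(100))
sample = list(range(0, 100, 10))   # every 10th element
```

For this evenly spaced sample, the worst interval deviates by 0.09 in density, so the sample is a 0.1-approximation but not a 0.05-approximation of the stream.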

### 1.1 Our Results

Fix a set system (U, R) over the universe U. A sampling algorithm is called (ε, δ)-robust if for any (even computationally unbounded) strategy of Adversary, the output sample is an ε-approximation of the whole stream with respect to R, with probability at least 1 − δ.

Our main result is an upper bound (“good news”) on the robustness of Bernoulli and reservoir sampling, which we later complement with near-matching lower bounds.

###### Theorem 1.2.

For any ε, δ ∈ (0, 1), set system (U, R), and stream length n, the following holds.

• Ber(p) with parameter p ≥ C(log|R| + log(1/δ))/(ε²n), for a suitable absolute constant C, is (ε, δ)-robust.

• Res(k) with parameter k ≥ C(log|R| + log(1/δ))/ε², for a suitable absolute constant C, is (ε, δ)-robust.

The proof appears in Section 4. As the total number of elements sampled by Ber(p) is well-concentrated around pn, the above theorem implies that a sample of total size (at least) on the order of (log|R| + log(1/δ))/ε², obtained by either of the algorithms Ber(p) or Res(k), is an ε-approximation with probability 1 − δ.

This should be compared with the static setting, where the same result is known as long as p ≥ c(d + log(1/δ))/(ε²n) for Ber(p), and k ≥ c(d + log(1/δ))/ε² for Res(k), where d is the VC-dimension of the set system and c is a constant [VC71, Tal94, LLS01] (see also [MV17]).

Thus, to make the static sampling algorithm robust in the adaptive setting, one solely needs to modify the sample size by replacing the VC-dimension term with the cardinality dimension log|R| (and updating the multiplicative constant). Below, in our lower bounds, we show that this increase in the sample size is inherent, and not a byproduct of our analysis.

#### Lower Bounds.

We next show that being adaptively robust comes at a price. That is, the dependence on the cardinality dimension, as opposed to the VC dimension, is necessary. By an improved version of the attack described in the introduction, we show the following:

###### Theorem 1.3.

There exist a constant c and a set system with VC-dimension 1 such that, for the corresponding parameter regimes:

1. The algorithm Ber(p) is not (ε, δ)-robust.

2. The algorithm Res(k) is not (ε, δ)-robust.

Moreover, a set system as above exists over a wide range of universe sizes.

The proof can be found in Section 5.

#### Continuous robustness.

The condition of (ε, δ)-robustness requires that the sample be ε-representative of the stream at the end of the process. What if we wish the sample to be representative of the stream at any point throughout the stream? Formally, we say that a sampling algorithm is (ε, δ)-continuously robust if, with probability at least 1 − δ, at any point t the sampled set is an ε-approximation of the first t elements of the stream, i.e., of (x_1, …, x_t). The next theorem shows that continuous robustness of Res(k) can be obtained with just a small overhead compared to “standard” robustness. (For Ber(p) one cannot hope for such a result to be true, at least for the above definition of continuous robustness.)

###### Theorem 1.4.

There exists a constant C such that for any ε, δ ∈ (0, 1), set system (U, R), and stream length n, Res(k), with parameter k slightly larger than that of Theorem 1.2, is (ε, δ)-continuously robust.

Moreover, if only continuous robustness against a static adversary is desired, then the log|R| term can be replaced with the VC-dimension of the set system.

We are not aware of a previous analysis of continuous robustness, even in the static setting. The proof, appearing in Section 6, follows by applying Theorem 1.2 (or its static analogue) at carefully picked “checkpoints” along the stream. It shows that if the sample is representative of the stream at each of the checkpoints, then with high probability, the sample is also representative at any other point along the stream. (We remark that a similar statement with a weaker dependence on the stream length can be obtained from Theorem 1.2 by a straightforward union bound.)

#### Comparison to deterministic sampling algorithms.

Our results show that sampling algorithms provide an ε-approximation in the adversarial model. One advantage of using the notion of ε-approximation is its wide array of applications; for each such task we get a streaming algorithm in the adversarial model, as described in the following subsection. We stress that for any specific task, a deterministic algorithm that works in the static setting will also automatically be robust in the adversarial setting. However, deterministic algorithms tend to be more complicated, and in some cases they require larger memory. Here, we focus on showing that the simplest and most generic sampling algorithms, “as is”, are robust in our adaptive model and yield a representative sample of the data that can be used for many different applications.

The best known deterministic algorithm for computing an ε-approximating sample in the streaming model is that of Bagchi et al. [BCEG07]. The working space of their algorithm and the processing time per element depend on the scaffold dimension222The scaffold dimension is a variant of the VC-dimension. of the set system; the exact bounds are rather intricate, see Corollary 4.2 in [BCEG07]. While the space requirement of their approach does not depend on the stream length, its dependence on the other parameters is generally worse than ours, making their bounds somewhat incomparable to ours. Finally, we note that there exist more efficient methods to generate an ε-approximation in some special cases, e.g., when the set system consists of rectangles or halfspaces [STZ04].

### 1.2 Applications of Our Results

We next describe several representative applications and usages of ε-approximations (see also [BCEG07] for more applications in the area of robust statistics). For some of these applications, there exist deterministic algorithms known to require less memory than the simple random sampling algorithms discussed in this paper. However, one area where our generic random sampling approach shines compared to deterministic approaches is the query complexity, or running time (under a suitable computational model). Indeed, while deterministic algorithms must inherently query all elements in the stream in order to run correctly, our random sampling methods query just a small sublinear portion of the elements in the stream.

Consequently, to the best of our knowledge, Bernoulli and reservoir sampling are the first two methods known to compute an -approximation (and as a byproduct, solve the tasks described in this subsection) in adversarial situations where it is unrealistic or too costly to query all elements in the stream. The last part of this subsection exhibits an example of one such situation.

#### Quantile approximation.

As was previously mentioned, ε-approximations have a deep connection to approximate median computation (and, more generally, quantile estimation). Assume the universe U is well-ordered. We say that a streaming algorithm is an (ε, δ)-robust quantile sketch if, in our adversarial model, it provides a sample that allows one to approximate the rank333The rank of an element x in a stream is the total number of elements y in the stream such that y ≤ x. of any element in the stream up to additive error εn, with probability at least 1 − δ. Observe that this is achieved by an ε-approximation with respect to the interval set system described above. For example, let x be the median of the sample. Since the density of the interval of all elements at most x is preserved in the sample, the median of the sample is ε-close to the median of the stream; the same holds simultaneously for any other quantile. The required sample size is as in Theorem 1.2.
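The use of a representative sample for quantile estimation can be sketched as follows; the indexing convention is one reasonable choice, not a prescription from the paper, and the helper name is our own.

```python
def quantile_from_sample(sample, q):
    """Return the q-quantile of the sample. If the sample is an
    eps-approximation of the stream w.r.t. intervals, the returned value
    has rank within (additive) eps*n of the true q-quantile of the stream."""
    s = sorted(sample)
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]

sample = list(range(0, 1000, 10))  # stands in for a representative sample
```

For instance, on the evenly spaced stand-in sample above, the 0.5-quantile estimate is 500 and the 0.25-quantile estimate is 250, matching the corresponding quantiles of the stream 0, 1, …, 999 up to the approximation error.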

###### Corollary 1.5.

For any ε, δ ∈ (0, 1), well-ordered universe U, and stream length n, Ber(p) with the parameter of Theorem 1.2 is an (ε, δ)-robust quantile sketch. The same holds for the algorithm Res(k) with the parameter of Theorem 1.2.

A corollary in the same spirit regarding continuously robust quantile sketches can be derived from Theorem 1.4.

#### Range queries.

Suppose that the universe is a discrete grid of the form {1, …, Δ}^d for some parameters Δ and d. One basic problem is that of range queries: one is given a set of ranges, and each query consists of a range, where the desired answer is the number of points in the stream that lie in this range. Popular choices of such ranges are axis-aligned or rotated boxes, spherical ranges, and simplicial ranges. An ε-approximation allows us to answer such range queries up to an additive error of εn. If the sampled set is T, then an answer to a query range R is given by computing n · |T ∩ R| / |T|. For example, when the set system consists of all axis-parallel boxes, its cardinality is Δ^{O(d)}, and thus the sample size required to answer range queries robustly against adversarial streams scales with d log Δ; for rotated boxes, the corresponding quantity is slightly larger. See [BCEG07] for more details on the connection between ε-approximations and range queries.

#### Center points.

Our result is also useful for computing approximate center points. A point p is a β-center point of the stream if every closed halfspace containing p in fact contains at least βn points of the stream. In [CEM96, Lemma 6.1] it has been shown that an ε-approximation (with respect to halfspaces) can be used to get an approximate center point for suitable choices of the parameters: a center point of the sample is an approximate center point of the stream. Thus, we can compute an approximate center point of a stream in the adversarial model. See also [BCEG07].

#### Heavy hitters.

Finding the elements that appear many times in a stream is a fundamental problem in data mining, with a myriad of practical applications. In the heavy hitters problem, there is a threshold θ and an error parameter ε < θ. The goal is to output a list of elements such that if an element appears in at least a θ-fraction of the stream, it must be included in the list, and if an element appears in less than a (θ − ε)-fraction of the stream, it cannot be included in the list.

Our results yield a simple and efficient heavy hitters streaming algorithm in the adversarial model. For any universe U, let the set system be the collection of all singletons {x}, x ∈ U. Now, use either Bernoulli or reservoir sampling to compute an (ε/2)-approximation T of the stream, outputting all elements whose density in T is at least θ − ε/2. Indeed, if an element appears in at least a θ-fraction of the stream, then its density in T is at least θ − ε/2. On the other hand, if it appears in less than a (θ − ε)-fraction of the stream, then its density in T is less than θ − ε/2.
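The reduction above can be sketched as follows. The reporting threshold θ − ε/2 follows the argument in the text, while the function name and the toy sample are our own.

```python
from collections import Counter

def heavy_hitters_from_sample(sample, theta, eps):
    """Report every element whose density in the sample is >= theta - eps/2.
    If the sample is an (eps/2)-approximation w.r.t. singletons, this reports
    all elements of stream-frequency >= theta and none below theta - eps."""
    counts = Counter(sample)
    m = len(sample)
    return {x for x, c in counts.items() if c / m >= theta - eps / 2}

# Toy sample with densities 0.5, 0.3, 0.2:
sample = ['a'] * 50 + ['b'] * 30 + ['c'] * 20
hitters = heavy_hitters_from_sample(sample, theta=0.4, eps=0.1)
```

Here only 'a' (density 0.5 ≥ 0.35) is reported; 'b' and 'c' fall below the threshold.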

###### Corollary 1.6.

There exists a constant C such that for any θ, ε, δ, universe U, and stream length n, Ber(p) with a suitable parameter p solves the heavy hitters problem with error ε in the adversarial model. The same holds for Res(k) with a suitable parameter k.

#### Clustering.

The task of partitioning data elements into separate groups, where the elements in each group are “similar” and elements in different groups are “dissimilar”, is fundamental and useful for numerous applications across computer science. There has been much interest in clustering in a streaming setting; see e.g. [GLA16] for a survey of recent results. Our results suggest a generic framework to accelerate clustering algorithms in the adversarial model: instead of running clustering on the full data, one can simply sample the data to obtain (with high probability, even against an adversary) an ε-approximation of it, run the clustering algorithm on the sample, and then extrapolate the results to the full dataset.

#### Sampling in modern data-processing systems.

It is very common to use random sampling (sometimes “in disguise”) in modern data-intensive systems that operate on streaming data, arriving in an online manner. As an illustrative example, consider the following distributed database [OV11] setting. Suppose that a database system must receive and process a huge amount of queries per second. It is unrealistic for a single server to handle all the queries, and hence, for load balancing purposes, each incoming query is randomly assigned to one of many query-processing servers. Seeing that the set of queries that each such server receives is essentially a Bernoulli random sample of the full stream, one hopes that the portion of the stream sampled by each of these servers would truthfully represent the whole data stream (e.g., for query optimization purposes), even if the stream changes with time (either unintentionally or by a malicious adversary). Such “representation guarantees” are also desirable in distributed machine learning systems [GDG17, SKYL17], where each processing unit learns a model according to the portion of the data it received, and the models are then aggregated, with the hope that each of the units processed “similar” data.

In general, modern data-intensive systems like those described above become more and more complicated with time, consisting of a large number of different components. Making these systems robust against environmental changes in the data, let alone adversarial changes, is one of the greatest challenges in modern computer science. From our perspective, the following question naturally emerges:

Is random sampling a risk in modern data processing systems?

Fortunately, our results indicate that the answer to this question is largely negative. Our upper bounds, Theorems 1.2 and 1.4, show that a sufficiently large sample suffices to circumvent adversarial changes of the environment.

### 1.3 Related Work

#### Online learning.

One related field to our work is online learning, which was introduced for settings where the data is given in a sequential online manner or where it is necessary for the learning algorithm to adapt to changes in the data. Examples include stock price predictions, ad click prediction, and more (see [Sha12] for an overview and more examples).

Similar to our model, online learning is viewed as a repeated game between a learning algorithm (or a predictor) and the environment (i.e., the adversary). It considers rounds where, in each round, the environment submits an instance, the learning algorithm then makes a prediction for this instance, and the environment, in turn, chooses a loss for this prediction and sends it as feedback to the algorithm. The goal in this model is usually to minimize regret (the sum of losses) compared to the best fixed prediction in hindsight. This is the typical setting (e.g., [HAK07, SST10]); however, many different variants exist (e.g., [DGS15, ZLZ18]).

#### PAC learning.

In the PAC-learning framework [Val84], the learning algorithm receives samples generated from an unknown distribution and must choose a hypothesis function from a family of hypotheses that best predicts the data with respect to the given distribution. It is known that the number of samples required for a class to be learnable in this model depends on the VC-dimension of the class.

A recent work of Cullina et al. [CBM18] investigates the effect of evasion adversaries on the PAC-learning framework, coining the term of adversarial VC-dimension for the parameter governing the sample complexity. Despite the name similarity, their context is seemingly unrelated to ours (in particular, it is not a streaming setting), and correspondingly, their notion of adversarial VC-dimension does not seem to relate to our work.

#### Adversarial examples in deep learning.

A very popular line of research in modern deep learning proposes methods to attack neural networks, as well as countermeasures to these attacks. In such a setting, an adversary performs adaptive queries to the learned model in order to fool the model via a malicious input. The learning algorithms usually rely on the underlying assumption that the training and test data are generated from the same statistical distribution; in practice, however, the presence of an adaptive adversary violates this assumption. There are many devastating examples of attacks on learning models [SZS14, BCM13, PMG17, BR18, MHS19], and we stress that currently, the understanding of techniques to defend against such adversaries is rather limited [GMP18, MW18, MM19, MHS19].

#### Maintaining random samples.

Reservoir sampling is a simple and elegant algorithm for maintaining a random sample of a stream [Vit85], and since its proposal, many flavors have been introduced. Chung, Tirthapura, and Woodruff [CTW16] generalized reservoir sampling to the setting of multiple distributed streams, which need to coordinate in order to continuously respond to queries over the union of all streams observed so far (see also Cormode et al. [CMYZ12]). Another variant is weighted reservoir sampling, where the probability of sampling an element is proportional to a weight associated with the element in the stream [ES06, BOV15]. A distributed version as above was recently considered for the weighted case as well [JSTW19].

### 1.4 Paper Organization

Section 2 contains an overview of our adversarial model and a more precise and detailed definition than the one given in the introduction. In Section 3 we mention several concentration inequalities required for our analysis. In Section 4 we present and prove our main technical lemma, from which we derive Theorem 1.2; this includes the analysis of both Ber(p) and Res(k). In Section 5 we present our “attack”, i.e., our lower bound showing the tightness of our result. Finally, in Section 6, we prove our upper bounds in the continuous setting.

## 2 The Adversarial Model for Sampling

In this section, we formally define the online adversarial model discussed in this paper. Roughly speaking, we say that Sampler is an (ε, δ)-robust sampling algorithm for a set system (U, R) if for any adversary choosing an adaptive stream of n elements, the final state of the sampling algorithm is an ε-approximation of the stream with probability at least 1 − δ. This is formulated using a game between the two players, Sampler and Adversary.

#### Rules of the game:

1. Sampler is a streaming algorithm, which gets a sequence of n elements one by one in an online manner (the sampling algorithms we discuss in this paper do not need to know n in advance). Upon receiving an element x_t, Sampler can perform an arbitrary computation (the running time can be unbounded) and update a local state. We denote the local state after t steps by M_t.

2. The stream is chosen adaptively by Adversary: a probabilistic (unbounded) player that, given all previously sent elements and the current state of Sampler, chooses the next element to submit. The strategy that Adversary employs along the way, that is, the probability distribution over the choice of the next element given any possible history of elements and states, is fixed in advance. The underlying (finite or infinite) set from which Adversary is allowed to choose elements during the game is called the universe, and denoted by U. We assume that U does not change along the game.

3. Once all n rounds of the game have ended, Sampler outputs its final state. For the sampling algorithms discussed in this paper, this output is a subsequence of the stream, usually called the sample obtained by Sampler in the game.

For an illustration of the rules of the game, see Figure 1.

Using the game defined above, we now describe what it means for a sampling algorithm to be (adversarially) robust.

###### Definition 2.1 (Robust sampling algorithm).

We say that a sampling algorithm is (ε, δ)-robust with respect to the set system (U, R) and the stream length n if, for any (even unbounded) strategy of Adversary, the final sample is an ε-approximation of the stream with probability at least 1 − δ.

The memory size used by Sampler is defined to be the maximal size of its state throughout the game.

A stronger requirement that one can impose on the sampling algorithm is that it hold an ε-approximation of the stream at any step during the game. To handle this, we define a continuous variant of the game, presented in Figure 2.

For the sampling algorithms that we consider, the state at any time t is essentially equal to the current sample. In any case, the definition of the framework given in Figure 2 generally allows the state to contain additional information, if needed. A sampling algorithm is called (ε, δ)-continuously robust if the following holds with probability at least 1 − δ: for any strategy of Adversary and every time t, the sample held at time t is an ε-approximation of the stream at time t.

###### Definition 2.2 (Continuously robust sampling algorithm).

We say that a sampling algorithm is (ε, δ)-continuously robust with respect to the set system (U, R) and the stream length n if, for any (even unbounded) strategy of Adversary, it holds with probability at least 1 − δ that, at every time t, the sample is an ε-approximation of the stream at time t.

The memory size used by Sampler is defined to be the maximal size of its state throughout the game.

#### Reservoir sampling.

For completeness, we provide the pseudocode of the reservoir sampling algorithm [Vit85, Knu97]. Here, k denotes the (fixed) memory size of the algorithm, t denotes the current round number, and x_t is the currently received element.

Res(k), on receiving x_t with current memory M_{t−1}:

1. If t ≤ k then parse M_{t−1} = (y_1, …, y_{t−1}) and output M_t = (y_1, …, y_{t−1}, x_t).

2. Otherwise, parse M_{t−1} = (y_1, …, y_k).

3. With probability k/t do:
choose i ∈ {1, …, k} uniformly at random and output M_t = (y_1, …, y_{i−1}, x_t, y_{i+1}, …, y_k).

4. Otherwise, output M_t = M_{t−1}.

## 3 Technical Preliminaries

The logarithms in this paper are usually of base 2, and denoted by log. The exponential function is exp(x) = eˣ. For an integer n we denote by [n] the set {1, 2, …, n}. We state some concentration inequalities, useful for our analysis in later sections. We start with the well-known Chernoff inequality for sums of independent random variables.

###### Theorem 3.1 (Chernoff Bound [Che52]; see Theorem 3.2 in [CL06]).

Let X_1, …, X_n be independent random variables that take the value 1 with probability p_i and 0 otherwise, let X = ∑_{i=1}^n X_i, and let μ = E[X]. Then for any δ > 0,

 Pr[X ≤ (1−δ)μ] ≤ exp(−δ²μ/2)

and

 Pr[X ≥ (1+δ)μ] ≤ exp(−δ²μ/(2 + 2δ/3)).
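As a quick numerical sanity check (our own illustration, not from the paper), one can compare the empirical lower tail of a Binomial(n, p) sum against the stated bound exp(−δ²μ/2):

```python
import math
import random

def chernoff_lower_bound(mu, delta):
    """The theorem's bound on Pr[X <= (1 - delta) * mu]."""
    return math.exp(-delta ** 2 * mu / 2)

def empirical_lower_tail(n, p, delta, trials, rng):
    """Fraction of trials in which a Binomial(n, p) sum is at most (1-delta)*mu."""
    mu = n * p
    hits = 0
    for _ in range(trials):
        x = sum(rng.random() < p for _ in range(n))
        hits += x <= (1 - delta) * mu
    return hits / trials
```

For instance, with n = 500, p = 0.2 and δ = 0.3 the bound equals exp(−4.5) ≈ 0.011, comfortably above the empirical tail probability.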

Our analysis of adversarial strategies crucially makes use of martingale inequalities. We thus provide the definition of a martingale.

###### Definition 3.2.

A martingale is a sequence X_0, X_1, …, X_n of random variables with finite means, so that for every 0 < i ≤ n, it holds that E[X_i | X_0, …, X_{i−1}] = X_{i−1}.

The most basic and well-known martingale inequality, Azuma’s (or Hoeffding’s) inequality, asserts that martingales with bounded differences are well-concentrated around their mean. For our purposes, this inequality does not suffice, and we need a generalized variant of it, due to McDiarmid [McD98, Theorem 3.15]; see also Theorem 4.1 in [Fre75]. The formulation that we shall use is given as Theorem 6.1 in the survey of Chung and Lu [CL06].

###### Lemma 3.3 (See [CL06], Theorem 6.1).

Let X_0, X_1, …, X_n be a martingale. Suppose further that for any 1 ≤ i ≤ n, the variance satisfies

 Var(X_i | X_0, …, X_{i−1}) ≤ σ_i²

for some values σ_1, …, σ_n, and there exists some M > 0 so that |X_i − X_{i−1}| ≤ M always holds. Then, for any λ > 0, we have

 Pr(X_n − X_0 ≥ λ) ≤ exp(−λ² / (2∑_{i=1}^n σ_i² + Mλ/3)).

In particular,

 Pr(|X_n − X_0| ≥ λ) ≤ 2 exp(−λ² / (2∑_{i=1}^n σ_i² + Mλ/3)).

Unlike Azuma’s inequality, Lemma 3.3 is well-suited to deal with martingales where the maximum possible difference |X_i − X_{i−1}| is large, but large differences are rarely attained (making the variance σ_i² much smaller than M²). The martingales we investigate in this paper exhibit this behavior.
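To illustrate, here is a small simulation (ours; the step distribution is a hypothetical example) of a martingale with rare large jumps, the kind Lemma 3.3 is designed for, together with the stated tail bound:

```python
import math
import random

def freedman_bound(lam, var_sum, M):
    """Tail bound of Lemma 3.3 as stated: exp(-lam^2 / (2*var_sum + M*lam/3))."""
    return math.exp(-lam ** 2 / (2 * var_sum + M * lam / 3))

def rare_jump_tail(n, p, lam, trials, rng):
    """Empirical Pr[X_n - X_0 >= lam] for the martingale whose i.i.d. steps
    are (1/p - 1) with probability p and -1 otherwise: each step has mean 0,
    variance (1 - p)/p, and magnitude at most 1/p - 1."""
    hits = 0
    for _ in range(trials):
        x = 0.0
        for _ in range(n):
            x += (1 / p - 1) if rng.random() < p else -1.0
        hits += x >= lam
    return hits / trials
```

When the jumps are rare, the per-step variance (1−p)/p is far below the squared step bound (1/p − 1)², which is exactly the regime where Lemma 3.3 beats Azuma’s inequality.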

## 4 Adaptive Robustness of Sampling: Main Technical Result

In this section, we prove the main technical lemma underlying our upper bounds for Bernoulli sampling and reservoir sampling. The lemma asserts that for both sampling methods, and any given subset R of the universe U, the fraction of elements from R within the sample typically does not differ by much from the corresponding fraction among the whole stream.

###### Lemma 4.1.

Fix ε, δ ∈ (0, 1), a universe U and a subset R ⊆ U, and let X be the sequence chosen by the adversary in the game against either Bernoulli sampling or reservoir sampling.

1. For Bernoulli sampling with parameter p ≥ Cε⁻² log(1/δ)/n, where C > 0 is a sufficiently large absolute constant, we have Pr(|d_R(X) − d_R(S)| ≥ ε) ≤ δ.

2. For reservoir sampling with memory size k ≥ Cε⁻² log(1/δ), it holds that Pr(|d_R(X) − d_R(S)| ≥ ε) ≤ δ.

Both of these bounds are tight up to an absolute multiplicative constant, even for a static adversary (that has to submit all elements in advance); see Section 6 for more details.

The proof of Theorem 1.2 follows immediately from Lemma 4.1, and is given below. The proof of Theorem 1.4 requires slightly more effort, and is given in Section 6.

###### Proof of Theorem 1.2.

Let ε, δ, U, ℛ be as in the statement of the theorem, and let X and S denote the stream and sample, respectively. We start with the Bernoulli sampling case, assuming that p satisfies the condition in the first part of Lemma 4.1 with δ/|ℛ| in place of δ. For each R ∈ ℛ, we apply the first part of Lemma 4.1 with parameters ε and δ/|ℛ|, concluding that

 Pr(|d_R(X) − d_R(S)| ≥ ε) ≤ δ/|ℛ|.

In the event that |d_R(X) − d_R(S)| < ε for every R ∈ ℛ, by definition S is an ε-approximation of X. Taking a union bound over all R ∈ ℛ, we conclude that the probability of this event failing to hold is bounded by δ, meaning that Bernoulli sampling with p as above is (ε, δ)-robust.

The proof for reservoir sampling is identical, except that we replace the condition on p with the corresponding condition on the memory size k, and apply the second part of Lemma 4.1. ∎

It is important to note that the typical proofs given for statements of this type in the static setting (i.e., when the adversary submits all elements in advance, and cannot act adaptively) do not apply in our adaptive setting. Indeed, the usual proof of the static analogue of the above lemma goes along the following lines: the adversary chooses which elements to submit in advance, and in particular determines the number of elements from R sent, call it m. Then, the number of sampled elements from R is distributed according to the binomial distribution Bin(m, p) for Bernoulli sampling, and according to a hypergeometric distribution for reservoir sampling. One can then employ the Chernoff bound to conclude the proof. This kind of analysis crucially relies on the adversary being static.

Here, we need to deal with an adaptive adversary. Recall that at any given point the adversary is modeled as a probabilistic process that, given the sequence of elements sent until now and the current state of the algorithm, probabilistically decides which element to submit next. Importantly, this makes for a well-defined probability space, and allows us to analyze the adversary’s behavior with probabilistic tools, specifically with concentration inequalities.

The Chernoff bound cannot be used here, as it requires the choices made by the adversary along the process to be independent of each other, which is clearly not the case. In contrast, martingale inequalities are well suited to this setting. We shall thus employ these, specifically Lemma 3.3, to prove both parts of our main result in this section.

### 4.1 The Bernoulli Sampling Case

We start by proving the Bernoulli sampling case (first statement of Lemma 4.1). Recall that here each element is sampled, independently, with probability p. At any given point along the process, let X_i denote the sequence of elements submitted by the adversary up to and including round i, and let S_i denote the subsequence of sampled elements from X_i. Note that X = X_n and S = S_n, and hence, to prove the lemma, we need to show that Pr(|d_R(X_n) − d_R(S_n)| ≥ ε) ≤ δ.

As a first attempt, it might make sense to try applying a martingale concentration inequality to the sequence of differences d_R(S_i) − d_R(X_i). Indeed, our end-goal is to bound the probability that this difference significantly deviates from zero at time n. However, a straightforward calculation shows that this sequence is not a martingale, since the required condition on conditional expectations does not hold in general. To overcome this, we show that a slightly different formulation of the random variables at hand does yield a martingale. Given the above, for any 0 ≤ i ≤ n we define the random variables

 A^R_i = (i/n)·d_R(X_i) = |R ∩ X_i| / n;  B^R_i = |R ∩ S_i| / (np);  Z^R_i = B^R_i − A^R_i, (1)

where, as before, the intersection R ∩ X between a set R and a sequence X is the subsequence of X consisting of all elements that also belong to R.
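Before turning to the martingale itself, the underlying phenomenon is easy to observe empirically for a fixed (non-adaptive) stream; the following sketch, with hypothetical parameters of our choosing, shows d_R of a Bernoulli sample staying close to d_R of the stream:

```python
import random

def bernoulli_sample(stream, p, rng=random):
    """Keep each element independently with probability p."""
    return [x for x in stream if rng.random() < p]

def density(R, seq):
    """d_R(seq): fraction of elements of seq belonging to R (0 if seq is empty)."""
    return sum(1 for x in seq if x in R) / len(seq) if seq else 0.0

rng = random.Random(3)
stream = [i % 10 for i in range(20000)]    # d_R(stream) = 0.3 for R = {0, 1, 2}
R = {0, 1, 2}
sample = bernoulli_sample(stream, 0.1, rng)
gap = abs(density(R, stream) - density(R, sample))
```

The point of this section is that a comparable guarantee survives even when the stream is chosen adaptively against the sampler.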

Importantly, as described in the next claim, the sequence of random variables Z^R_0, …, Z^R_n defined above forms a martingale. The claim also establishes several useful properties of these random variables, to be used later in combination with Lemma 3.3.

###### Claim 4.2.

The sequence Z^R_0, …, Z^R_n is a martingale. Furthermore, the variance of Z^R_i conditioned on Z^R_0, …, Z^R_{i−1} is bounded by 1/(n²p), and it always holds that |Z^R_i − Z^R_{i−1}| ≤ 1/(np).

We shall prove Claim 4.2 later on; first we use it to complete the proof of the main result.

###### Proof of Lemma 4.1, Bernoulli sampling case.

It suffices to prove the following two inequalities for any p satisfying the conditions of the lemma in the Bernoulli sampling case:

 Pr(|A^R_n − B^R_n| ≥ ε/2) ≤ δ/2;  Pr(|B^R_n − d_R(S_n)| ≥ ε/2) ≤ δ/2. (2)

Indeed, taking a union bound over these two inequalities, applying the triangle inequality, and observing that A^R_n = d_R(X_n), we conclude that Pr(|d_R(X_n) − d_R(S_n)| ≥ ε) ≤ δ, as desired.

The first inequality follows from Claim 4.2 and Lemma 3.3. Indeed, in view of Claim 4.2, we can apply Lemma 3.3 to Z^R_0, …, Z^R_n with parameters σ_i² = 1/(n²p), M = 1/(np), and λ = ε/2. As Z^R_0 = 0, we have Z^R_n = B^R_n − A^R_n, and so

 Pr(|A^R_n − B^R_n| ≥ ε/2) ≤ 2 exp(−(ε/2)² / (2n·(1/(n²p)) + ε/(6np))) < 2 exp(−ε²np/9).

The right-hand side is bounded by δ/2 when np ≥ 9ε⁻² ln(4/δ), settling the first inequality of (2).

We next prove the second inequality of (2). Observe that B^R_n = (|S_n|/(np))·d_R(S_n). Since each element is added to the sample with probability p, independently of other elements, the size of S_n is distributed according to the binomial distribution Bin(n, p), regardless of the adversary’s strategy. Applying the Chernoff inequality with μ = np and deviation parameter ε/2, we get that

 Pr(| |S_n| − np | ≥ εnp/2) ≤ 2 exp(−(ε/2)²np / (2 + ε/3)) < 2 exp(−ε²np/10).

This probability is bounded by δ/2 provided that np ≥ 10ε⁻² ln(4/δ). Conditioning on this event not occurring, we have that

 |d_R(S_n) − B^R_n| = |1 − |S_n|/(np)| · d_R(S_n) ≤ |1 − |S_n|/(np)| ≤ ε/2,

where the first inequality follows from the fact that densities (in this case, d_R(S_n)) are always bounded from above by one, and the second follows from our conditioning. This completes the proof of the second inequality in (2). ∎

The proof of Claim 4.2 is given next.

###### Proof of Claim 4.2.

We first show that Z^R_0, …, Z^R_n is a martingale. Fix i ∈ [n], and suppose that the first i − 1 rounds of the game have just ended (so the values of Z^R_0, …, Z^R_{i−1} are already fixed), and that the adversary now picks an element x_i to submit in round i of the game.

If x_i ∉ R then A^R_i = A^R_{i−1} and B^R_i = B^R_{i−1}, and so Z^R_i = Z^R_{i−1}, which trivially means that E[Z^R_i | Z^R_0, …, Z^R_{i−1}] = Z^R_{i−1}, as desired.

When x_i ∈ R, we have

 A^R_i = A^R_{i−1} + 1/n;  B^R_i = B^R_{i−1} if x_i is not sampled, and B^R_i = B^R_{i−1} + 1/(np) if x_i is sampled;

and consequently

 Z^R_i = Z^R_{i−1} − 1/n if x_i is not sampled, and Z^R_i = Z^R_{i−1} + 1/(np) − 1/n if x_i is sampled.

Recall that the algorithm uses Bernoulli sampling with probability p, that is, x_i is sampled with probability p (regardless of the outcomes of the previous rounds). Therefore, we have that

 E[Z^R_i | Z^R_0, …, Z^R_{i−1} ; x_i ∈ R] = Z^R_{i−1} + p·(1/(np) − 1/n) + (1−p)·(−1/n) = Z^R_{i−1}.

The analysis of both cases, x_i ∈ R and x_i ∉ R, implies that E[Z^R_i | Z^R_0, …, Z^R_{i−1}] = Z^R_{i−1}, as desired.
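The zero-drift computation above is easy to verify numerically; the snippet below uses hypothetical values of n and p of our choosing, with the increments taken from the case x_i ∈ R:

```python
# Increments of Z^R_i when x_i ∈ R: +(1/(n*p) - 1/n) with probability p
# (x_i is sampled), and -1/n with probability 1 - p (x_i is not sampled).
n, p = 1000, 0.05
drift = p * (1 / (n * p) - 1 / n) + (1 - p) * (-1 / n)
var = p * (1 / (n * p) - 1 / n) ** 2 + (1 - p) * (1 / n) ** 2
# The conditional mean increment vanishes, and the conditional variance
# equals (1 - p) / (n^2 * p), below the 1 / (n^2 * p) bound of Claim 4.2.
```

The same two quantities, drift and variance, are exactly what feed into the application of Lemma 3.3.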

We now turn to prove the other two statements of Claim 4.2. The maximum of the expression |Z^R_i − Z^R_{i−1}| is 1/(np) − 1/n ≤ 1/(np), obtained when x_i ∈ R is sampled. The variance of Z^R_i given Z^R_0, …, Z^R_{i−1} is zero under the additional assumption that x_i ∉ R; assuming that x_i ∈ R, the variance satisfies

 Var(ZR