
Tricking the Hashing Trick: A Tight Lower Bound on the Robustness of CountSketch to Adaptive Inputs

07/03/2022 · by Edith Cohen, et al. (Google, MIT)

CountSketch and Feature Hashing (the "hashing trick") are popular randomized dimensionality reduction methods that support recovery of ℓ_2-heavy hitters (keys i where v_i^2 > ϵ‖v‖_2^2) and approximate inner products. When the inputs are not adaptive (do not depend on prior outputs), classic estimators applied to a sketch of size O(ℓ/ϵ) are accurate for a number of queries that is exponential in ℓ. When inputs are adaptive, however, an adversarial input can be constructed after O(ℓ) queries with the classic estimator, and the best known robust estimator only supports Õ(ℓ^2) queries. In this work we show that this quadratic dependence is in a sense inherent: We design an attack that after O(ℓ^2) queries produces an adversarial input vector whose sketch is highly biased. Our attack uses "natural" non-adaptive inputs (only the final adversarial input is chosen adaptively) and universally applies with any correct estimator, including one that is unknown to the attacker. In this, we expose an inherent vulnerability of this fundamental method.



1 Introduction

CountSketch [11] and its variant feature hashing [30, 40] are immensely popular dimensionality reduction methods that map input vectors in ℝ^n to their sketches in ℝ^{b·ℓ} (where b·ℓ ≪ n). The methods have many applications in machine learning and data analysis and are often used as components in large models or pipelines [40, 34, 32, 12, 13, 1, 36, 2, 15].

The mapping is specified by internal randomness ρ that determines a set of linear measurement vectors in ℝ^n. The sketch of a vector v is the matrix of its linear measurements with these vectors.

The salient properties of CountSketch are that (when setting b = O(1/ε) and ℓ = O(log n)) the ε-heavy hitters of an input v, that is, keys i with v_i² ≥ ε‖v‖₂², can be recovered from the sketch, and that the inner product of two vectors u, v can be approximated from their respective sketches. This recovery is performed by applying an appropriate estimator to the sketch; for example, the median estimator [11] provides estimates of the values of keys and supports heavy hitters recovery. But recovery can also be implicit, for example, when the sketch is used as a compression module in a neural network [12].

Randomized data structures and algorithms, CountSketch included, are typically analysed under the assumption that the input sequence is generated in a way that does not depend on prior outputs or on the sketch randomness ρ. This assumption, however, does not always hold, for example, when there is an intention to construct an adversarial input or when the system has a feedback loop between inputs and outputs [36, 33].

The adaptive setting, where inputs may depend on prior outputs, is more challenging to analyse and there is growing interest in quantifying performance and in designing methods that are robust to adaptive inputs. Works in this vein span machine learning [38, 19, 5, 31], adaptive data analysis [17, 25, 28, 22, 16], dynamic graph algorithms [35, 3, 18, 21, 39, 7], and sketching and streaming algorithms [29, 3, 23, 9, 24, 41, 6, 8].

Robustness to adaptive inputs can trivially be achieved by using a fresh data structure for each query or, more finely, for each time the output changes, hence a robustness guarantee that grows only linearly with the sketch size. A powerful connection between adaptive robustness and differential privacy [16], utilizing the workhorse of DP composition, yielded essentially a wrapper around non-robust independent replicas that supports a quadratic number of adaptive queries (or changes to the output) [24, 20, 7]. Works on robust streaming algorithms include [9, 24, 41, 6, 8, 20, 14]. For the problem of recovering heavy hitters from CountSketch, the "wrapper" method supports Õ(ℓ²) adaptive queries. The current state of the art [14] is a robust estimator that works with a variant of CountSketch and supports Õ(ℓ²) adaptive queries.

Lower bounds on the performance of algorithms with adaptive inputs are obtained by designing an attack: a sequence of input vectors that yields a constructed input that is adversarial to the internal randomness ρ. Tight lower bounds on the robustness of statistical queries were established by [22, 37], who designed an attack with a number of queries that is quadratic in the sample size, which matches the known upper bounds [16]. Their construction was based on fingerprinting codes [10]. A downside of these constructions is that the inputs used in the attack are not "natural" and hence unlikely to shed understanding on practical vulnerability in the presence of feedback. Hardt and Woodruff [23] provided an impossibility result for the task of estimating the norm of the input within a constant factor from (general) linear sketches. Their construction works with arbitrary correct estimators and produces an adversarial distribution over inputs on which the sketch measurements are "far" from their expectations. The attack size, however, has a large polynomial dependence on the sketch size and is far from the respective upper bound. Ben-Eliezer et al. [9] present an attack on the AMS sketch [4] for the task of approximating the ℓ₂ norm of the input vector. The attack is tailored to a simplified estimator that is linear in the set of linear measurements (whereas the "classic" estimator uses a median of measurements and is not linear). Their attack is efficient in that the number of queries is of the order of the sketch size, rendering the estimator non-robust. It also has the advantage of using "natural" inputs. More recently, [14] presented attacks that are tailored to specific estimators for CountSketch, including an attack of size O(ℓ) on the classic median estimator and an attack of size Õ(ℓ²) on their proposed robust estimator.

Contribution

Existing works proposed attacks whose size is far from the corresponding known upper bounds or that are tailored to a particular estimator. Specifically for CountSketch, there is an upper bound of Õ(ℓ²) adaptive queries, but it is not even known whether there exist estimators that support a super-quadratic number of adaptive inputs. This question is of particular importance because CountSketch and its variants are the only known efficient sketching methods that allow recovery of ℓ₂-heavy hitters and approximation of norms and inner products. Moreover, their form as linear measurements is particularly suitable for efficient implementations and integration as components in larger pipelines. Finally, a recent lower bound precludes hope for an efficient deterministic (and hence fully robust) sketch [27], so it is likely that the vulnerabilities of CountSketch are inherent to ℓ₂-heavy hitter recovery from a small sketch.

We construct a universal attack on CountSketch that applies against any unknown, potentially non-linear, possibly state-maintaining estimator. We only require that the estimator is correct. Our attack uses Õ(ℓ²) queries, matching the robust estimator upper bound [14]. Moreover, it suffices for the purpose of the attack that the estimator only reports a set of candidate heavy keys without their approximate values (we only require that heavy hitters are reported with very high probability and that keys of value 0 are reported with very small probability).

The product of our attack (with high probability) is an adversarial input on which the measurement values of the sketch are very biased with respect to their distribution over a fresh choice of ρ. Specifically, the design of CountSketch results in linear measurements that are unbiased for any input under the sketch distribution: for each key i and measurement vector with a nonzero i-th entry, the adjusted measurement has expectation v_i; but the corresponding expected values on our adversarial input are large (larger by a desired factor). This "bias" means that the known standard (and robust) estimators for heavy hitters and inner products would fail on this adversarial input. More generally, estimators that satisfy the usual design goal of being correct on any input with high probability over the distribution of ρ may not be correct on the adversarial inputs. We note, however, that our result does not preclude the existence of specialized estimators that are correct on our adversarial inputs. We only show that a construction of an input that is adversarial to ρ is possible under any correct estimator.

Finally, our attack uses "natural" inputs that consist of a heavy key and random noise. The final adversarial input is a linear combination of the noise components according to the heavy hitter reports and is the only input that depends on prior outputs. The simplicity of this attack suggests a "practical" vulnerability of this fundamental sketching technique.

Technique. To construct an adversarial input with respect to a key k, we generate A "random tails" z^{(1)}, …, z^{(A)}, which are vectors with small random entries. Ideally, we would like to determine for each tail whether it is biased up or down with respect to key k; that is, considering the ℓ measurement vectors with a nonzero k-th entry, determine the sign of the average adjusted measurement of the tail. If we had that, the linear combination of the tails signed by these biases (with large enough A) is an adversarial input. The intuition why this works is that the bias accumulates linearly with the number of tails A, whereas the standard deviation (considering randomness of the selection of tails) increases only proportionally to √A. The attack strategy is then to design query vectors so that, from whether or not k is reported as a heavy hitter candidate, we obtain a sign guess that correlates with the true bias sign. A higher correlation yields more effective attacks: with constant correlation we get attacks of size O(ℓ), and with correlation of order 1/√ℓ we get attacks of size O(ℓ²). We show that we can obtain correlation of order 1/√ℓ (thus matching the upper bound) against arbitrary correct estimators.
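The accumulation argument can be sanity-checked with a toy simulation (ours, not the paper's; all names and parameter values are illustrative): if each sign guess agrees with the true bias sign with probability (1 + γ)/2, the signed sum of A unit-variance biases grows like γ·A while its standard deviation grows like √A.

```python
import numpy as np

rng = np.random.default_rng(0)
A, gamma, trials = 10_000, 0.1, 200   # gamma: advantage over random guessing

totals = []
for _ in range(trials):
    bias = rng.normal(0.0, 1.0, size=A)           # per-tail bias toward key k
    # Guess each sign correctly with probability (1 + gamma) / 2.
    correct = rng.random(A) < (1 + gamma) / 2
    a = np.where(correct, np.sign(bias), -np.sign(bias))
    totals.append((a * bias).sum())               # bias of the combined tail

print(np.mean(totals))   # ~ gamma * A * E|N(0,1)| ~ 0.1 * 10000 * 0.8 ~ 800
print(np.std(totals))    # ~ sqrt(A) scale, about 100
```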

Related work: In terms of techniques, our work is most related to [14], in that the structure of our attack vectors is similar to those used in [14] to construct a tailored attack on the classic estimator. The generalization to a "universal" attack that is effective against arbitrary and unknown estimators, however, was delicate and required multiple new ideas. Our contribution is in a sense complementary to [23], which designed an attack on linear sketches that applies with any correct estimator for (approximate) norms. Their attack is much less efficient in that its size is a higher-degree polynomial, and it uses dependent (adaptive) inputs (whereas in our attack only the final adversarial input depends on prior outputs). The products of their attack are constructed vectors that lie in the (approximate) null space of the sketching matrix. These "noise" vectors can have large norms but are "invisible" in the sketch. When such "noise" is added to an input with a signal (say, a heavy hitter), the "signal" is suppressed (the entry is no longer heavy) but can still be recovered from the sketch. Our attack fails the sketch matrix in a complementary way: we construct "noise" vectors that do not involve a signal (a heavy entry) but whose sketch mimics the presence of that particular signal.

2 Preliminaries

We use boldface notation for vectors v, non-boldface for scalars v_i, ⟨u, v⟩ for the inner product of vectors, and a·v for multiplication by a scalar. For a vector v we refer to i as a key and to v_i as the value of the i-th key (entry). For clarity of exposition, we use ≈ to mean "within a small relative error."

Definition 2.1.

(heavy hitter) For ε > 0 and a vector v, key i is an ℓ₂-ε-heavy hitter if v_i² ≥ ε‖v‖₂².

Clearly, there can be at most 1/ε heavy hitters, since each contributes at least an ε fraction of ‖v‖₂².

Definition 2.2.

(Heavy Hitters Problem) A set K of entries is a correct solution for the ℓ₂-ε-heavy hitters problem if (i) K includes all the ℓ₂-ε-heavy hitter keys and (ii) a key with value 0 can be included only with small probability.

Remark: This definition is weaker than what CountSketch provides [11]: CountSketch can recover keys that are heavy with respect to the norm of the tail (the input vector with its heavy entries nullified) instead of the (larger) norm of the full vector, can limit the size of the reported set to O(1/ε), and supports reporting of approximate values for the reported keys. Since our focus is designing an attack, working against a weaker estimator makes our result stronger.

2.1 CountSketch

The sketch [11] is specified by parameters (n, ℓ, b), where n is the dimension of the input vectors, ℓ is the number of repetitions, and b is the number of buckets per repetition. The internal randomness ρ specifies a set of random hash functions h_r : [n] → [b] and s_r : [n] → {−1, 1} (r ∈ [ℓ]), with the marginals that h_r(i) is uniform over [b] and s_r(i) is uniform over {−1, 1} for every key i. These hash functions define b·ℓ measurement vectors that are organized as ℓ sets of b vectors each (μ_{r,j}, r ∈ [ℓ], j ∈ [b]), where μ_{r,j,i} = s_r(i) if h_r(i) = j and μ_{r,j,i} = 0 otherwise.

The sketch of an input vector v is the set of the b·ℓ respective measurement values ⟨μ_{r,j}, v⟩. Note that for each key i there are exactly ℓ measurement vectors with a nonzero i-th entry, namely μ_{r,h_r(i)} for r ∈ [ℓ], and these measurement vectors are independent (the only dependency is between measurements in the same set, and there is exactly one vector from each set with a nonzero i-th entry).

For an input v, the respective set of adjusted measurements

c_r(i) := s_r(i) · ⟨μ_{r,h_r(i)}, v⟩,  r ∈ [ℓ]   (1)

are unbiased estimates of v_i: E_ρ[c_r(i)] = v_i.

The median estimator [11] uses the median adjusted measurement, median_{r∈[ℓ]} c_r(i). The keys with the highest-magnitude estimates are then reported as heavy hitters. For the heavy hitters problem with non-adaptive inputs (inputs selected independently of ρ), setting the sketch parameters b = O(1/ε) and ℓ = O(log n) and applying the median estimator yields a correct solution with high probability.
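For concreteness, here is a minimal NumPy sketch of CountSketch with the median estimator, assuming full randomness; the class and parameter names (n, ell, b) are our own illustrative choices, not the paper's code.

```python
import numpy as np

class CountSketch:
    """Minimal CountSketch: ell rows of b buckets each (full randomness)."""

    def __init__(self, n, ell, b, seed=0):
        rng = np.random.default_rng(seed)
        self.n, self.ell, self.b = n, ell, b
        # h[r, i] in [b]: bucket of key i in row r; s[r, i] in {-1, +1}: its sign.
        self.h = rng.integers(0, b, size=(ell, n))
        self.s = rng.choice([-1, 1], size=(ell, n))

    def sketch(self, v):
        """Return the ell x b matrix of measurements <mu_{r,j}, v>."""
        S = np.zeros((self.ell, self.b))
        for r in range(self.ell):
            np.add.at(S[r], self.h[r], self.s[r] * v)
        return S

    def adjusted_measurements(self, S, i):
        """The ell unbiased estimates c_r(i) = s_r(i) * S[r, h_r(i)] of v_i."""
        rows = np.arange(self.ell)
        return self.s[rows, i] * S[rows, self.h[rows, i]]

    def median_estimate(self, S, i):
        """Classic median estimator for the value of key i."""
        return np.median(self.adjusted_measurements(S, i))

# Example: recover a planted heavy hitter among noise.
rng = np.random.default_rng(1)
n = 10_000
v = rng.normal(0, 1, n)
v[42] = 50.0                      # planted heavy key
cs = CountSketch(n, ell=7, b=200)
S = cs.sketch(v)
print(cs.median_estimate(S, 42))  # close to 50
```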

CountSketch also supports estimation of inner products. For two vectors u, v we obtain an unbiased estimate of their inner product from the inner product of the respective r-th rows of measurements:

∑_{j∈[b]} ⟨μ_{r,j}, u⟩ · ⟨μ_{r,j}, v⟩.   (2)

The median over r ∈ [ℓ] of these estimates is within a small relative error with high probability.
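A hedged sketch of this estimate: the standalone function below computes the median of the ℓ row-wise estimates from two sketch matrices, assuming both were produced with the same randomness ρ; the function name is illustrative.

```python
import numpy as np

def inner_product_estimate(Su, Sv):
    """Median over rows r of sum_j Su[r, j] * Sv[r, j].

    Su, Sv: ell x b CountSketch matrices of u and v, built with the
    *same* hash/sign randomness. Each row sum is an unbiased estimate
    of <u, v>; the median over the ell rows concentrates it.
    """
    row_estimates = (Su * Sv).sum(axis=1)  # one estimate per row r
    return np.median(row_estimates)
```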

We note that pairwise-independent hash functions h_r and s_r suffice for obtaining these guarantees [11], and 4-wise independence is needed for approximate inner products. The analysis of the attack we present here, however, holds even under full randomness.

Definition 2.3 (Adversarial input).

We say that an attack that is applied to a sketch with randomness ρ and outputs a key k and a vector v (with v_k = 0) is C-adversarial (for C > 0) if, with high probability over the randomness of the attack, the adjusted measurements (1) satisfy

E[c_r(k)] ≥ C · σ,   (3)

where σ is the standard deviation of c_r(k) under a fresh choice of the sketch randomness.

3 Attack structure

We describe the three interacting components:

  • A sketch, initialized with internal randomness ρ that specifies the linear measurement vectors μ_{r,j}.

  • An estimator, that outputs a solution to the heavy hitters problem for each query (see Definition 2.2). The estimator has access to ρ (and hence to the measurement vectors), to the sketches of the queries, and to its outputs on prior queries.

  • An adversary, that issues input queries and observes the output of the estimator on each query. The goal of the adversary is to construct an adversarial input vector (see Definition 2.3).

Our attack uses inputs of a particular form. The only piece of information needed from the output is whether a particular key k is reported as a candidate heavy hitter. A schematic of this interaction appears below.
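The following is our own schematic of the three-party interaction, with a thresholded median rule as a stand-in estimator (the attack itself treats the estimator as an unknown black box); all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell, b = 5_000, 5, 50

# Component 1: the sketch, initialized with internal randomness rho.
h = rng.integers(0, b, size=(ell, n))
s = rng.choice([-1, 1], size=(ell, n))

def sketch(v):
    S = np.zeros((ell, b))
    for r in range(ell):
        np.add.at(S[r], h[r], s[r] * v)
    return S

# Component 2: a stand-in estimator; it may inspect rho, but here it just
# thresholds the median adjusted measurement of the queried key.
def estimator_reports(S, k, threshold=50.0):
    rows = np.arange(ell)
    c = s[rows, k] * S[rows, h[rows, k]]       # adjusted measurements of k
    return bool(np.median(c) >= threshold)

# Component 3: the adversary issues queries and sees only the boolean output.
k = 7
for t in range(3):
    v = np.zeros(n)
    v[k] = 100.0                               # heavy key
    noise_keys = rng.choice(n, 200, replace=False)
    v[noise_keys] += rng.normal(0, 1, 200)     # random noise
    print(estimator_reports(sketch(v), k))     # the only feedback observed
```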

3.1 Query vectors

Our attack constructs query vectors of the form

v^{(t)} = M · e_k + z^{(t)},   (4)

where k is a special heavy key that is selected uniformly at random and remains fixed, e_k is the axis-aligned unit vector along k, M is a large scalar, and the vectors z^{(t)} are tails. The (randomized) construction of tails is described in Algorithm 1.

The tail vectors have supports of size m that do not include key k (z_k^{(t)} = 0) and are chosen so that the supports of different tails are disjoint. For query t and key i in the support of z^{(t)}, the values z_i^{(t)} are selected i.i.d. N(0, 1). Note that ‖z^{(t)}‖₂² ≈ m.

Remark 3.1.

We choose the parameter m to be large enough so that m ≫ b, and thus with high probability there are close to m/b keys from the support of each tail that map to each measurement.

Algorithm 1: Construction of the random tails z^{(1)}, …, z^{(A)}.
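A minimal sketch of the tail construction under our reading of the algorithm: disjoint supports of size m per tail that avoid key k, with i.i.d. standard Normal values; function and parameter names are illustrative.

```python
import numpy as np

def generate_tails(n, k, m, A, seed=0):
    """Return an A x n matrix of tails z^(1..A): disjoint supports of
    size m that avoid key k, values i.i.d. N(0, 1)."""
    rng = np.random.default_rng(seed)
    keys = np.setdiff1d(np.arange(n), [k])   # all keys except k
    rng.shuffle(keys)
    assert A * m <= keys.size, "need n > A*m to keep supports disjoint"
    Z = np.zeros((A, n))
    for t in range(A):
        support = keys[t * m:(t + 1) * m]    # disjoint chunk for tail t
        Z[t, support] = rng.normal(0.0, 1.0, size=m)
    return Z
```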

Note that the tails, and (as we shall see) the selection of k, and hence all the query vectors, are constructed non-adaptively. Only the final adversarial input vector depends on the output of the estimator on prior queries.

Estimator: We accordingly restrict the output of the estimator to the boolean output of whether key k is reported as a candidate heavy hitter of the query. Note that disclosing additional information can only make the estimator more susceptible to attacks.

3.2 Sketch distribution

The sketch includes b·ℓ measurements, but with our specialized inputs (4) we can restrict the estimator to only use the ℓ adjusted measurements (1) of key k. We argue that this restriction does not limit the power of the estimator: the additional measurements have expectation 0 and do not depend on k. They do provide information on the tail support size parameter m, but we can assume m is known to the estimator. These measurements also provide information on the number of keys in the support that hash to our selected measurement from each of the ℓ sets of measurements. But in our regime of large m, this number is very close to m/b, and it only impacts the magnitude of the "noise" but not its form.

To simplify our notation going forward, we re-index the set of ℓ relevant measurement vectors and use μ_r for r ∈ [ℓ]. We use the notation c^{(t)} for the vector of the ℓ adjusted measurements the sketch provides for v^{(t)}, and the notation τ^{(t)} for the respective contributions of the tail z^{(t)} to these values:

c_r^{(t)} = M + τ_r^{(t)}   (5)

τ_r^{(t)} = s_r(k) · ⟨μ_r, z^{(t)}⟩.   (6)

Recall that E[τ_r^{(t)}] = 0. With our random tails, even when fixing the hash functions h and s and taking the expectation over the random choices of the tail entry values, we still get E[τ_r^{(t)}] = 0 and a symmetric distribution. Also observe that the values τ_r^{(t)} are i.i.d. for different measurements (even when there is overlap in the sets of keys other than k that map to different measurements, the sign randomness of the measurement vectors is independent, so the contributions of the same key to different measurements it maps to are independent).

We now consider the distribution of τ_r^{(t)} (and hence c_r^{(t)}) for some r. The respective measurement vector is μ_r. We have μ_{r,k} ≠ 0, and for each other key i, μ_{r,i} ≠ 0 with probability 1/b. Therefore, the number B of keys in the tail support that contribute to the measurement is a Binomial random variable B ∼ Binom(m, 1/b); hence, B has expectation m/b and variance (m/b)(1 − 1/b). The contributions of these keys are i.i.d. N(0, 1) (multiplying by the sign s_r(i) does not change that). Therefore, the contribution to the measurement, conditioned on B, has distribution

τ_r^{(t)} | B ∼ N(0, B).   (7)

In particular, it follows that the random variables τ_r^{(t)} are symmetric and E[τ_r^{(t)}] = 0.
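A quick Monte-Carlo check of this derivation (parameter values are illustrative): the contribution of an m-key tail to one measurement is a Binomial-many sum of i.i.d. N(0, 1) terms, which should be symmetric with variance m/b.

```python
import numpy as np

rng = np.random.default_rng(0)
m, b, trials = 5000, 100, 20000

taus = np.empty(trials)
for i in range(trials):
    # B ~ Binom(m, 1/b): how many support keys land in this measurement.
    B = rng.binomial(m, 1.0 / b)
    # Each landing key contributes s * N(0,1), which is still N(0,1).
    taus[i] = rng.normal(0.0, 1.0, size=B).sum()

print(taus.mean())        # ~ 0 (symmetric)
print(taus.var(), m / b)  # both ~ m/b = 50
```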

We can now consider the probability distribution over the vector τ^{(t)} = (τ_r^{(t)})_{r∈[ℓ]}. The values τ_r^{(t)} for r ∈ [ℓ] are independent, obtained by first drawing B_r ∼ Binom(m, 1/b) and then drawing τ_r^{(t)} according to (7). We therefore obtain:

Lemma 3.2.

The distribution of ∑_{r∈[ℓ]} τ_r^{(t)}, conditioned on ∑_{r∈[ℓ]} B_r = B, is N(0, B), where ∑_{r∈[ℓ]} B_r ∼ Binom(m·ℓ, 1/b).

Proof.

Taking the sum over r of the conditional distributions (7), a sum of independent Normal random variables N(0, B_r) has distribution N(0, ∑_{r∈[ℓ]} B_r), where the last equality follows from the properties of a sum of independent Binomial random variables. ∎

When m/b is large (see Remark 3.1), the distribution of B_r is approximately (up to discretizing to integral values) N(m/b, (m/b)(1 − 1/b)). Recall that the τ_r^{(t)} are i.i.d. The contribution of key k to each adjusted measurement c_r^{(t)} is exactly M.

We express τ^{(t)} in terms of the empirical mean

τ̄^{(t)} = (1/ℓ) ∑_{r∈[ℓ]} τ_r^{(t)}   (8)

and the "deviations" from the empirical mean

δ_r^{(t)} = τ_r^{(t)} − τ̄^{(t)}.   (9)

Note that by definition ∑_{r∈[ℓ]} δ_r^{(t)} = 0. Using the Normal approximation to the Binomial distribution we obtain (approximately)

τ̄^{(t)} ∼ N(0, m/(bℓ))   (10)

δ_r^{(t)} ∼ N(0, (m/b)(1 − 1/ℓ)).   (11)

For the purpose of a cleaner presentation, we will use this approximation going forward. The analysis carries over by carefully carrying an approximation error term that vanishes for large m. (In the regime far from the mean, measured in standard deviations, the Normal approximation to the Binomial can still be applied, and in the regime close to the mean the approximation error is small.)

The probability density function of τ = (τ_1, …, τ_ℓ), a vector of ℓ i.i.d. N(0, σ²) random variables (with σ² = m/b), is

f(τ) = (2πσ²)^{−ℓ/2} exp(−∑_{r∈[ℓ]} τ_r² / (2σ²)).   (12)

4 Estimators

In this section we provide a framework of estimators and establish properties common to any correct estimator.

The estimator is applied to the content of the sketch, which effectively is the vector c^{(t)} of adjusted measurements. In its most general form, an estimator fixes before each query t a reporting function q^{(t)} : ℝ^ℓ → [0, 1]. The estimator then returns T with probability q^{(t)}(c^{(t)}). We allow the estimator to modify the reporting function arbitrarily between queries, in a way that depends on the sketches of prior inputs, on prior outputs, and on a state maintained from past queries. The only constraint that we impose on the estimator is that (at each round) its output must be correct with high probability.
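In this framework an estimator is just a (possibly changing) reporting function q over the ℓ adjusted measurements; the wrapper below is a schematic of ours, with a thresholded median rule as one concrete instance, and all names and values are illustrative.

```python
import numpy as np

class ReportingEstimator:
    """Return T with probability q(c), where c holds the ell adjusted
    measurements of the watched key. q may be replaced between queries."""

    def __init__(self, q, seed=0):
        self.q = q
        self.rng = np.random.default_rng(seed)

    def report(self, c):
        return self.rng.random() < self.q(c)

# One concrete instance: report iff the median estimate crosses a threshold.
def median_threshold_q(threshold):
    return lambda c: 1.0 if np.median(c) >= threshold else 0.0

est = ReportingEstimator(median_threshold_q(threshold=10.0))
print(est.report(np.array([11.0, 9.5, 12.3, 10.8, 10.1])))  # True
```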

We can now express, using the reporting function, the probability (over the distribution of tails) of the estimator returning T in round t:

Pr[T] = E_τ[q^{(t)}(M + τ)].   (13)

We denote by

π_M(a) := E[q^{(t)}(M + τ) | τ̄ = a]   (14)

the probability (over the distribution of tails) of reporting T conditioned on the empirical mean τ̄ = a. Note that since q^{(t)} ∈ [0, 1], we have π_M(a) ∈ [0, 1].

Lemma 4.1 (Correct estimator basic property).

There are thresholds a⁻ < a⁺ so that the output of any correct ε-heavy hitter estimator satisfies the following:

  • T when τ̄ ≥ a⁺ (with probability at least 1 − δ, where δ is the allowed failure probability)

  • F with high probability when τ̄ ≤ a⁻

  • Otherwise, unrestricted

Proof.

The tail vectors have mass ‖z^{(t)}‖₂² ≈ m, and the value of key k in the query is M. From the definition of an ε-heavy hitter, a correct estimator must report key k as a heavy hitter when M² ≥ ε(M² + m). Recalling the sketch parameter settings, this determines the upper threshold a⁺.

We now establish the second part. Here we use the requirement that a correct estimator may report a 0-value key only with small probability. When τ̄ is small, the sketch distribution is close to that of a sketch in which key k has value 0 and the empirical mean absorbs the difference. This allows us to lower bound the reporting probability in this regime by that of a key with value 0.

The probability density of a sketch in which key k has value M and the tail has empirical mean a equals that of a sketch in which key k has value M + a − a′ and the tail has empirical mean a′: the estimator observes only the sums c_r. The ratio of the two conditional densities depends only on the two means and is larger when the observation is closer to one mean than to the other. Therefore, for the goal of bounding the maximum reporting probability subject to a bound on the reporting probability of a 0-value key, we place the reporting probability on the largest values of τ̄. This determines the lower threshold a⁻. ∎

We use the fact that the reporting function has an increase of almost 1 between a⁻ and a⁺ to establish that, on average, it increases over uniformly random sub-intervals:

Lemma 4.2 (Average increase property).

For a uniformly at random chosen a (with high probability over the distribution of τ̄), the average increase π(a) − π(−a) is bounded away from 0.

Proof.

From Lemma 4.1, the function π is very close to 1 when τ̄ ≥ a⁺ and very close to 0 when τ̄ ≤ a⁻. Since π increases by almost 1 over the interval [a⁻, a⁺], on average over uniformly random sub-intervals [−a, a] it must exhibit a proportional share of this increase. ∎

The following lemma shows that reporting k as a heavy hitter is equally likely when the heavy value is M and τ̄ = a as when the heavy value is M + 2a and τ̄ = −a.

Lemma 4.3.

For any M, reporting function q, and a: π_M(a) = π_{M+2a}(−a), where π_M denotes the reporting probability (14) with heavy value M.

Proof.

From (14): the probability π_M(a) is the mean over τ with τ̄ = a of q(M + τ). Similarly, π_{M+2a}(−a) is the mean over τ with τ̄ = −a of q(M + 2a + τ). The observed measurement vectors coincide, since M + (a + δ_r) = (M + 2a) + (−a + δ_r) for every r. It therefore suffices to establish that the conditional density of the deviations δ is the same when τ̄ = a and when τ̄ = −a.

This is a property of a product of Normal distributions: for any a and δ with ∑_{r∈[ℓ]} δ_r = 0, the outcome is equally likely with τ̄ = a and τ̄ = −a. Let τ_r = a + δ_r; using (12) and ∑_{r∈[ℓ]} δ_r = 0 we obtain

f(τ) = (2πσ²)^{−ℓ/2} exp(−(ℓa² + ∑_{r∈[ℓ]} δ_r²) / (2σ²)).

The calculation for τ′_r = −a + δ_r is similar and yields the same expression. ∎

From Lemma 4.2 and Lemma 4.3 we obtain:

Corollary 4.4.

For a uniformly at random chosen heavy value M (with high probability over the distribution of τ̄), any correct reporting function (and respective π) exhibits, on average, a positive gap π_M(a) − π_M(−a) for a > 0.

This means that if we choose the value M uniformly at random from a suitable range, then any correct estimator exhibits a gap in the probability of reporting k as a heavy hitter, depending on the sign of τ̄. The gap increases with the magnitude of τ̄. Our universal attack exploits this gap.
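The gap is easy to observe numerically. The snippet below uses the classic median rule as a concrete stand-in for "some correct estimator" (the point of the analysis is that every correct estimator leaks this way); all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, m, b, M, trials = 7, 5000, 100, 60.0, 4000

agree = 0
for _ in range(trials):
    tau = rng.normal(0.0, np.sqrt(m / b), size=ell)  # tail contributions tau_r
    c = M + tau                                      # adjusted measurements of k
    reported = np.median(c) >= M                     # stand-in reporting rule
    agree += reported == (tau.mean() >= 0)           # report vs. sign of tau-bar
print(agree / trials)  # noticeably above 1/2: the report leaks sign(tau-bar)
```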

5 Universal Attack

Algorithm 2: The universal attack.

Our attack is described in Algorithm 2. We generate the attack tails using Algorithm 1. We then construct queries of the form (4) with i.i.d. uniformly random heavy values M^{(t)}. For each query t we collect the output of the heavy hitter estimator. We then set a^{(t)} = 1 when key k is reported and a^{(t)} = −1 when it is not. The last step is the construction of the final adversarial tail:

z = ∑_{t∈[A]} a^{(t)} z^{(t)}.   (15)

The adversarial tail has z_k = 0 and squared norm ‖z‖₂² ≈ A·m (it has support of size A·m with values in the support i.i.d. from N(0, 1)). Therefore, when it is sketched with a fresh random initialization ρ′, the adjusted measurements of key k are unbiased, and since the measurements are independent, they concentrate as for a non-adaptive input. The adversarial input behaves differently with respect to the randomness ρ it was constructed for by the attack:
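Putting the pieces together, here is a hedged end-to-end sketch of the attack loop (our illustrative rendering of Algorithm 2, with a thresholded median rule and a fixed heavy value M standing in for the unknown estimator and the randomized values; all parameter values are chosen for a quick run).

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell, b = 60_000, 7, 50
m, A, M = 1_000, 50, 500.0            # tail support, #queries, heavy value

def adjusted(v, k, h, s):
    """The ell adjusted measurements c_r(k) of key k for input v."""
    S = np.zeros((ell, b))
    for r in range(ell):
        np.add.at(S[r], h[r], s[r] * v)
    rows = np.arange(ell)
    return s[rows, k] * S[rows, h[rows, k]]

# Sketch randomness rho under attack (hidden from the adversary).
h = rng.integers(0, b, size=(ell, n))
s = rng.choice([-1, 1], size=(ell, n))

k = 0
keys = rng.permutation(np.arange(1, n))        # disjoint supports avoiding k
a, Z = np.zeros(A), np.zeros((A, n))
e_k = np.zeros(n); e_k[k] = 1.0
for t in range(A):
    Z[t, keys[t * m:(t + 1) * m]] = rng.normal(0, 1, m)
    v = M * e_k + Z[t]                         # query (4): heavy key + tail
    reported = np.median(adjusted(v, k, h, s)) >= M  # stand-in boolean report
    a[t] = 1.0 if reported else -1.0           # sign derived from the report

z_adv = a @ Z                           # adversarial tail (15); z_adv[k] = 0
print(adjusted(z_adv, k, h, s).mean())  # strongly biased under rho

# Under fresh randomness rho' the same vector looks unbiased:
h2 = rng.integers(0, b, size=(ell, n))
s2 = rng.choice([-1, 1], size=(ell, n))
print(adjusted(z_adv, k, h2, s2).mean())  # unbiased: ~ 0 in expectation
```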

Lemma 5.1 (Properties of the adversarial tail).

With high probability over the randomness of the attack, the adjusted measurements of key k on the adversarial tail z are biased: their expectation grows linearly with the number of tails A, whereas their standard deviation grows only proportionally to √A.

Proof.

The contributions of the adversarial tail to the adjusted estimates in the sketch are

τ_r = ∑_{t∈[A]} a^{(t)} τ_r^{(t)}.

Therefore

E[τ̄] = ∑_{t∈[A]} E[a^{(t)} τ̄^{(t)}],   (16)

and each term is positive since, by Corollary 4.4, the sign a^{(t)} obtained from the report is positively correlated with the sign of τ̄^{(t)}. ∎