A Note on Sanitizing Streams with Differential Privacy

by   Haim Kaplan, et al.

The literature on data sanitization aims to design algorithms that take an input dataset and produce a privacy-preserving version of it that captures some of its statistical properties. In this note we study this question from a streaming perspective, where the goal is to sanitize a data stream. Specifically, we consider low-memory algorithms that operate on a data stream and produce an alternative privacy-preserving stream that captures some statistical properties of the original input stream.


1 Introduction and Notations

Data sanitization is one of the most well-studied topics in the literature of differential privacy. Informally, the goal is to design differentially private algorithms that take a dataset $D$ and produce an alternative dataset $\hat{D}$ which captures some statistical properties of $D$. Most of the research on data sanitization focuses on the “offline” setting, where the algorithm has full access to the input dataset $D$. We continue the study of this question from a streaming perspective. Specifically, we aim to design low-memory algorithms that operate on a large stream of data points and produce a privacy-preserving variant of the original stream while capturing some of its statistical properties. We present a simple reduction from the streaming setting to the offline setting. As a special case, we improve on the recent work of Alabi et al. [ABC21], who studied a special case of this question.

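Let us begin by recalling basic notations from the literature on data sanitization. Let $X$ be a data domain and let $C$ be a class of predicates, where each $c \in C$ maps $X$ to $\{0,1\}$. Given a dataset $D \in X^n$, a sanitization mechanism for $C$ is required to produce a synthetic dataset $\hat{D}$ such that for every $c \in C$ we have

$$\frac{1}{|D|}\sum_{x \in D} c(x) \;\approx\; \frac{1}{|\hat{D}|}\sum_{x \in \hat{D}} c(x).$$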

Definition 1.1.

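Let $C$ be a class of predicates mapping $X$ to $\{0,1\}$. Let $\mathcal{A}$ be an algorithm that takes an input dataset $D \in X^n$ and outputs a dataset $\hat{D} \in X^n$. Algorithm $\mathcal{A}$ is an $(\alpha,\beta,\varepsilon,\delta,n)$-sanitizer for $C$ if

  1. $\mathcal{A}$ is $(\varepsilon,\delta)$-differentially private (see [DMNS16, DR14, Vad17] for background on differential privacy).

  2. For every input $D \in X^n$ and for every predicate $c \in C$ we have

    $$\Pr\left[\,\left|\frac{1}{n}\sum_{x \in D} c(x) - \frac{1}{n}\sum_{x \in \hat{D}} c(x)\right| > \alpha\,\right] \le \beta.$$ The probability is over the coin tosses of $\mathcal{A}$.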
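As an illustration of the utility requirement, the following minimal sketch measures the largest gap between the averages of a set of predicates on an original dataset and on a synthetic one. The function names and the toy data are assumptions made for this example; nothing here is private, it only evaluates the accuracy condition.

```python
from typing import Callable, Iterable, Sequence

Predicate = Callable[[int], int]  # maps a domain element to 0 or 1


def average(pred: Predicate, data: Sequence[int]) -> float:
    """Fraction of points in `data` that satisfy `pred`."""
    return sum(pred(x) for x in data) / len(data)


def max_error(preds: Iterable[Predicate],
              original: Sequence[int],
              synthetic: Sequence[int]) -> float:
    """Largest gap between original and synthetic averages over `preds`."""
    return max(abs(average(c, original) - average(c, synthetic)) for c in preds)


# Toy data (assumed for illustration): predicates "x <= t" for t = 0..3.
original = [0, 1, 1, 2, 3, 3, 3, 2]
synthetic = [0, 1, 2, 2, 3, 3, 3, 3]
thresholds = [lambda x, t=t: int(x <= t) for t in range(4)]
err = max_error(thresholds, original, synthetic)
```

In this toy instance the synthetic dataset would satisfy the accuracy condition for any error parameter of at least `err`.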


Remark 1.2.

It is often convenient to allow the size of $\hat{D}$ to be different from $n$ (the size of the original dataset). For simplicity, here we assume that $\mathcal{A}$ produces a dataset of the same size as $D$.

We consider a variant of the above definition where the input of the algorithm is a stream, and its output is also a stream. The utility requirement is that, at the end of the stream, the produced stream is similar to the original stream (in the same sense as above). Formally,

Definition 1.3.

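Let $C$ be a class of predicates mapping $X$ to $\{0,1\}$. Let $\mathcal{B}$ be an algorithm that operates on a stream $S \in X^m$ and outputs a stream $\hat{S} \in X^m$. Algorithm $\mathcal{B}$ is an $(\alpha,\beta,\varepsilon,\delta,m)$-streaming-sanitizer for $C$ if

  1. $\mathcal{B}$ is $(\varepsilon,\delta)$-differentially private;

  2. For every input stream $S$ of length $m$ and for every predicate $c \in C$ we have
    $$\Pr\left[\,\left|\frac{1}{m}\sum_{x \in S} c(x) - \frac{1}{m}\sum_{x \in \hat{S}} c(x)\right| > \alpha\,\right] \le \beta.$$ The probability is over the coin tosses of $\mathcal{B}$.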

Remark 1.4.

For simplicity, in the above definition we required utility to hold only at the end of the stream. Our results remain essentially unchanged also with a variant of the definition where utility must hold at any moment throughout the execution. We can also allow the size of the output stream to be different than the size of the input stream.

2 A Generic Reduction to the Offline Setting

We observe that every (offline) sanitizer can be transformed into a streaming-sanitizer as follows.

Theorem 2.1.

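Let $\mathcal{A}$ be an $(\alpha,\beta,\varepsilon,\delta,n)$-sanitizer for a class of predicates $C$, with space complexity $s$. Let $m = kn$ for some positive integer $k$. Then, there exists an $(\alpha, k\beta, \varepsilon, \delta, m)$-streaming-sanitizer for $C$ using space $O(s)$ (we assume that $\mathcal{A}$’s space is at least linear).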

Remark 2.2.

The point here is that, even though the stream is large (of length $m$), the space complexity of the streaming-sanitizer essentially depends only on the space complexity of the (offline) sanitizer when applied to “small” datasets of size $n$. (As Theorem 2.1 is stated, the length of the stream affects the confidence parameter of the resulting streaming-sanitizer; however, in Remark 2.3 we explain how this can be avoided.)

Proof of Theorem 2.1.

We construct a streaming sanitizer $\mathcal{B}$ as follows:

  1. Let $D$ denote the next $n$ items in the stream.

  2. Output $\mathcal{A}(D)$ and go to Step 1.

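First observe that $\mathcal{B}$ applies $\mathcal{A}$ to disjoint portions of its input stream, and hence algorithm $\mathcal{B}$ is $(\varepsilon,\delta)$-differentially private. Next, by a union bound, with probability at least $1 - k\beta$, all $k$ applications of algorithm $\mathcal{A}$ succeed in producing a dataset that maintains averages of predicates in $C$ up to an error of $\alpha$. In such a case, write the entire input stream as $S = S_1 \circ \dots \circ S_k$ and the output stream as $\hat{S} = \hat{S}_1 \circ \dots \circ \hat{S}_k$, where $\hat{S}_i = \mathcal{A}(S_i)$ and $|S_i| = |\hat{S}_i| = n$. Then for every $c \in C$ it holds that

$$\sum_{x \in \hat{S}} c(x) = \sum_{i=1}^{k} \sum_{x \in \hat{S}_i} c(x) \le \sum_{i=1}^{k} \left( \sum_{x \in S_i} c(x) + \alpha n \right) = \sum_{x \in S} c(x) + \alpha m,$$

and hence

$$\frac{1}{m}\sum_{x \in \hat{S}} c(x) \le \frac{1}{m}\sum_{x \in S} c(x) + \alpha.$$

The other direction is symmetric. ∎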
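The construction in the proof can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the note's implementation: `streaming_sanitizer` and `make_histogram_sanitizer` are hypothetical names, and the offline sanitizer used here is a toy Laplace-noised histogram over a small integer domain (floating-point noise, resampled output), chosen only so that the example runs. Any offline sanitizer with the same signature could be plugged in.

```python
import random
from typing import Callable, Iterable, Iterator, List

OfflineSanitizer = Callable[[List[int]], List[int]]


def streaming_sanitizer(stream: Iterable[int], n: int,
                        offline: OfflineSanitizer) -> Iterator[int]:
    """Buffer n items, sanitize the block, emit it, and repeat."""
    block: List[int] = []
    for x in stream:
        block.append(x)
        if len(block) == n:
            yield from offline(block)
            block = []  # only one block is held in memory at a time
    # A trailing partial block is dropped; the reduction assumes the
    # stream length is a multiple of n.


def make_histogram_sanitizer(domain_size: int, eps: float,
                             rng: random.Random) -> OfflineSanitizer:
    """Toy stand-in: noisy histogram, then draw a synthetic block from it."""
    def sanitize(block: List[int]) -> List[int]:
        counts = [0] * domain_size
        for x in block:
            counts[x] += 1
        # Replacing one point changes two counts by 1 each, so Laplace noise
        # of scale 2/eps is added (difference of two exponentials).
        lam = eps / 2.0
        noisy = [max(0.0, c + rng.expovariate(lam) - rng.expovariate(lam))
                 for c in counts]
        if sum(noisy) == 0:
            noisy = [1.0] * domain_size
        # Post-processing: sample a block of the same size.
        return rng.choices(range(domain_size), weights=noisy, k=len(block))
    return sanitize


rng = random.Random(0)
stream = [rng.randrange(8) for _ in range(1000)]
out = list(streaming_sanitizer(stream, n=100,
                               offline=make_histogram_sanitizer(8, 1.0, rng)))
```

Because each input item falls in exactly one block, changing one item affects a single call to the offline sanitizer, which is the observation the privacy argument relies on.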

Remark 2.3.

Two remarks are in order. First, assuming that $m$ is big enough w.r.t. $n$, we can avoid blowing up the confidence parameter by a factor of $k = m/n$. The reason is that, by the Chernoff bound, w.h.p. at least a $(1-2\beta)$ fraction of the executions of $\mathcal{A}$ succeed, in which case the overall error would be at most $\alpha + 2\beta$ (each failed execution contributes error at most 1 on its own block). Second, again assuming that $m$ is big enough, we could relax the privacy guarantees of $\mathcal{A}$ while keeping $\mathcal{B}$’s privacy guarantees unchanged. Specifically, a well-known fact is that we can boost the privacy guarantees of a differentially private algorithm by applying it to a subsample of its input dataset. Hence, assuming that is -differentially private, we can execute the above algorithm on a random subsample from the original input stream (by selecting each element of independently with probability ). Assuming that , then the subsampled stream is big enough such that the additional error introduced by this subsampling is at most .
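The subsampling step described in the second remark can be sketched as a one-pass filter. The function name `subsample` and the parameter `p` are assumptions for this illustration; the sampling probability the remark calls for is not fixed by this sketch.

```python
import random
from typing import Iterable, Iterator


def subsample(stream: Iterable[int], p: float,
              rng: random.Random) -> Iterator[int]:
    """Keep each element independently with probability p."""
    for x in stream:
        if rng.random() < p:
            yield x
```

The filter uses constant memory, so it could be placed in front of a streaming sanitizer without changing its space complexity.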

3 Application to Bounded Space Differentially Private Quantiles

Recently, Alabi et al. [ABC21] introduced the problem of differentially private quantile estimation with sublinear space complexity. Specifically, let $\alpha$ be an approximation parameter, and consider an input stream containing $m$ points from a domain $X$. The goal is to design a small-space differentially private algorithm that, at the end of the stream, is capable of approximating all quantiles in the data up to error $\alpha$ and confidence $\beta$. Specifically, [ABC21] designed a private variant of the streaming algorithm of Greenwald and Khanna [GK01]. They obtained an $\varepsilon$-differentially private algorithm with space complexity (we use to hide factors), which is great because it matches the non-private space dependency of . However, the following questions were left open (and stated explicitly as open questions by Alabi et al. [ABC21]).

Question 3.1.

The algorithm of [ABC21] was tailored to the non-private algorithm of Greenwald and Khanna [GK01], which is known to be sub-optimal in the non-private setting. Can we devise a more general approach that would allow us to instantiate (and benefit from) the state-of-the-art non-private algorithms? In particular, can we avoid the dependency of the space complexity on $\log|X|$?

Question 3.2.

The space complexity of the algorithm of [ABC21] grows with the stream length $m$. Can this be avoided?

We observe that our notion of streaming-sanitizers immediately resolves these two questions. To see this, let $X$ be a totally ordered data domain. For a point $t \in X$, let $f_t : X \to \{0,1\}$ be the threshold function defined by $f_t(x) = 1$ iff $x \le t$. Let $\mathrm{THRESH}_X$ denote the class of all threshold functions over $X$. This class captures all quantiles of the original stream. Now, to design a differentially private quantile estimation algorithm (with sublinear space), all we need to do is apply our generic construction of a streaming-sanitizer for $\mathrm{THRESH}_X$ (instantiated with the state-of-the-art offline sanitizer for this class), and run any non-private streaming algorithm for quantile estimation on the output of the streaming-sanitizer. Using the state-of-the-art offline sanitizer for $\mathrm{THRESH}_X$ from [KLM20] and the state-of-the-art non-private streaming algorithm of [KLL16], we get an $(\varepsilon,\delta)$-differentially-private quantiles-estimation algorithm using space complexity .
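To see why threshold predicates capture quantiles, the following minimal sketch recovers a quantile from threshold averages alone. The function names and the toy data are assumptions for this example, and the exact scan over the domain stands in for a small-space streaming quantile algorithm.

```python
from typing import Callable, Sequence


def threshold(t: int) -> Callable[[int], int]:
    """Threshold predicate: 1 if x <= t, else 0."""
    return lambda x: int(x <= t)


def cdf_at(t: int, data: Sequence[int]) -> float:
    """Average of the threshold predicate at t over `data`."""
    f = threshold(t)
    return sum(f(x) for x in data) / len(data)


def quantile(q: float, data: Sequence[int], domain: Sequence[int]) -> int:
    """Smallest domain point whose threshold average is at least q."""
    for t in domain:  # `domain` is assumed sorted in increasing order
        if cdf_at(t, data) >= q:
            return t
    return domain[-1]


data = [5, 1, 3, 3, 9, 7, 1, 3]  # toy data, assumed for illustration
median = quantile(0.5, data, range(10))
```

If two streams agree on every threshold average up to some error, then a quantile computed this way on one stream is an approximate quantile of the other, with the rank off by at most that error.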

Remark 3.3.

The above idea is general, and is not restricted to the algorithms of [KLM20] and [KLL16]. In particular, we could have used an $(\varepsilon,0)$-differentially private sanitizer, at the expense of having the space complexity grow with $\log|X|$ instead of $\log^*|X|$. See, e.g., [BNS16, BNSV15, KSS21] for additional constructions of (offline) sanitizers for $\mathrm{THRESH}_X$.

Remark 3.4.

As in Remark 2.3, assuming that the stream length is big enough, the space complexity can be made independent of .

Alabi et al. [ABC21] also asked if it is possible to obtain a differentially private streaming algorithm for quantiles estimation that allows for continually monitoring how the quantiles evolve throughout the stream (rather than only at the end of the stream). Specifically, the algorithm of [ABC21] works in the “one-shot” setting where the quantiles are computed once after observing the entire stream. However, many relevant applications require real-time release of statistics. Using our notion of a streaming-sanitizer, this comes essentially for free (see Remark 1.4), because once the stream is private, any post-processing of it satisfies privacy.
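Continual monitoring as post-processing can be sketched as follows: the monitor reads only the already-sanitized stream and reports a statistic at regular checkpoints. The name `running_median` and the checkpoint parameter `every` are assumptions for this illustration, and the sorted list is an exact, linear-space stand-in for a small-space quantile sketch.

```python
import bisect
from typing import Iterable, Iterator, List, Tuple


def running_median(sanitized: Iterable[int],
                   every: int) -> Iterator[Tuple[int, int]]:
    """Yield (items_seen, median_so_far) after every `every` items."""
    seen: List[int] = []
    for i, x in enumerate(sanitized, start=1):
        bisect.insort(seen, x)  # keep the prefix sorted
        if i % every == 0:
            yield i, seen[(i - 1) // 2]  # lower median of the prefix
```

Since the monitor never touches the raw input, releasing its output at every checkpoint costs no additional privacy.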


  • [ABC21] Daniel Alabi, Omri Ben-Eliezer, and Anamay Chaturvedi. Bounded space differentially private quantiles. In TPDP, 2021.
  • [BNS16] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1–61, 2016.
  • [BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634–649. IEEE Computer Society, 2015.
  • [DMNS16] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. J. Priv. Confidentiality, 7(3):17–51, 2016.
  • [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • [GK01] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In SIGMOD Conference, pages 58–66. ACM, 2001.
  • [KLL16] Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. In FOCS, pages 71–78. IEEE Computer Society, 2016.
  • [KLM20] Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, volume 125 of Proceedings of Machine Learning Research, pages 2263–2285. PMLR, 2020.
  • [KSS21] Haim Kaplan, Shachar Schnapp, and Uri Stemmer. Differentially private approximate quantiles. CoRR, abs/2110.05429, 2021.
  • [Vad17] Salil P. Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pages 347–450. Springer International Publishing, 2017.