1 Introduction and Notations
Data sanitization is one of the most well-studied topics in the literature of differential privacy. Informally, the goal is to design differentially private algorithms that take a dataset $D$ and produce an alternative dataset $\hat{D}$ which captures some statistical properties of $D$. Most of the research on data sanitization focuses on the "offline" setting, where the algorithm has full access to the input dataset $D$. We continue the study of this question from a streaming perspective. Specifically, we aim to design low-memory algorithms that operate on a large stream of data points, and produce a privacy-preserving variant of the original stream while capturing some of its statistical properties. We present a simple reduction from the streaming setting to the offline setting. In particular, we improve on the recent work of Alabi et al. [ABC21], who studied a special case of this question.
Let us begin by recalling basic notations from the literature on data sanitization. Let $X$ be a data domain and let $\mathcal{F}$ be a class of predicates, where each $f \in \mathcal{F}$ maps $X$ to $\{0,1\}$. Given a dataset $D \in X^n$, a sanitization mechanism for $\mathcal{F}$ is required to produce a synthetic dataset $\hat{D} \in X^n$ such that for every $f \in \mathcal{F}$ we have
$$\left|\frac{1}{n}\sum_{x \in D} f(x) - \frac{1}{n}\sum_{x \in \hat{D}} f(x)\right| \le \alpha.$$
Formally,
Definition 1.1.
Let $\mathcal{F}$ be a class of predicates mapping $X$ to $\{0,1\}$. Let $A$ be an algorithm that takes an input dataset $D \in X^n$ and outputs a dataset $\hat{D} \in X^n$. Algorithm $A$ is an $(\alpha,\beta,\varepsilon,\delta)$-sanitizer for $\mathcal{F}$ if:

1. $A$ is $(\varepsilon,\delta)$-differentially private;

2. For every input dataset $D \in X^n$, with probability at least $1-\beta$, for every $f \in \mathcal{F}$ it holds that $\left|\frac{1}{n}\sum_{x \in D} f(x) - \frac{1}{n}\sum_{x \in \hat{D}} f(x)\right| \le \alpha$.
Remark 1.2.
It is often convenient to allow the size of $\hat{D}$ to be different from the size of the original dataset $D$. For simplicity, here we assume that $A$ produces a dataset $\hat{D}$ of the same size as $D$.
We consider a variant of the above definition where the input of algorithm $A$ is a stream, and its output is also a stream. The utility requirement is that at the end of the stream, the produced stream is similar to the original stream (in the same sense as above). Formally,
Definition 1.3.
Let $\mathcal{F}$ be a class of predicates mapping $X$ to $\{0,1\}$. Let $A$ be an algorithm that operates on a stream $(x_1,\dots,x_n) \in X^n$ and outputs a stream $(\hat{x}_1,\dots,\hat{x}_n) \in X^n$. Algorithm $A$ is an $(\alpha,\beta,\varepsilon,\delta)$-streaming-sanitizer for $\mathcal{F}$ if:

1. $A$ is $(\varepsilon,\delta)$-differentially private;

2. For every input stream $(x_1,\dots,x_n) \in X^n$ of length $n$ and for every predicate $f \in \mathcal{F}$ we have
$$\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n} f(x_i) - \frac{1}{n}\sum_{i=1}^{n} f(\hat{x}_i)\right| \le \alpha\right] \ge 1-\beta.$$
The probability is over the coin tosses of $A$.
Remark 1.4.
For simplicity, in the above definition we required utility to hold only at the end of the stream. Our results remain essentially unchanged under a variant of the definition where utility must hold at every moment throughout the execution. We can also allow the size of the output stream to be different from the size of the input stream.
2 A Generic Reduction to the Offline Setting
We observe that every (offline) sanitizer can be transformed into a streaming-sanitizer as follows.
Theorem 2.1.
Let $A$ be an $(\alpha,\beta,\varepsilon,\delta)$-sanitizer for a class of predicates $\mathcal{F}$, with space complexity $s(m)$ when applied to datasets of size $m$. Let $n = k \cdot m$ for some positive integer $k$. Then, there exists an $(\alpha, \frac{n}{m}\beta, \varepsilon, \delta)$-streaming-sanitizer for $\mathcal{F}$ for streams of length $n$, using space $O(s(m))$ (we assume that $A$'s space is at least linear in $m$).
Remark 2.2.
The point here is that, even though the stream is large (of length $n$), the space complexity of the streaming-sanitizer essentially depends only on the space complexity of the (offline) sanitizer when applied to "small" datasets of size $m$. (As Theorem 2.1 is stated, the length of the stream affects the confidence parameter of the resulting streaming-sanitizer; however, in Remark 2.3 we explain how this can be avoided.)
Proof of Theorem 2.1.
We construct a streaming-sanitizer $B$ as follows:

1. Let $D_j$ denote the next $m$ items in the stream.

2. Output $\hat{D}_j \leftarrow A(D_j)$ and goto Step 1.
First observe that $B$ applies $A$ to disjoint portions of its input stream, and hence, algorithm $B$ is $(\varepsilon,\delta)$-differentially private. Next, by a union bound, with probability at least $1 - \frac{n}{m}\beta$, all of the $\frac{n}{m}$ applications of algorithm $A$ succeed in producing a dataset that maintains averages of predicates in $\mathcal{F}$ up to an error of $\alpha$. In such a case, for the entire input stream $D = D_1 \circ \cdots \circ D_{n/m}$ and output stream $\hat{D} = \hat{D}_1 \circ \cdots \circ \hat{D}_{n/m}$, and for every $f \in \mathcal{F}$, it holds that
$$\frac{1}{n}\sum_{x \in D} f(x) = \frac{m}{n}\sum_{j=1}^{n/m} \frac{1}{m}\sum_{x \in D_j} f(x) \le \frac{m}{n}\sum_{j=1}^{n/m}\left(\frac{1}{m}\sum_{x \in \hat{D}_j} f(x) + \alpha\right) = \frac{1}{n}\sum_{x \in \hat{D}} f(x) + \alpha,$$
and hence
$$\frac{1}{n}\sum_{x \in D} f(x) - \frac{1}{n}\sum_{x \in \hat{D}} f(x) \le \alpha.$$
The other direction is symmetric. ∎
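The reduction in the proof above can be sketched in a few lines of Python. Here `offline_sanitizer` is a black box standing in for the offline sanitizer; the `toy_sanitizer` below is a hypothetical, non-private stand-in used only to make the sketch runnable.

```python
import random
from typing import Callable, Iterable, Iterator, List

def streaming_sanitizer(
    stream: Iterable[float],
    offline_sanitizer: Callable[[List[float]], List[float]],
    block_size: int,
) -> Iterator[float]:
    """Generic reduction: buffer the next `block_size` items, sanitize the
    buffer with the offline sanitizer, output the result, and repeat.
    Since the sanitizer is applied to disjoint blocks of the stream, the
    privacy guarantee of the offline sanitizer carries over as a whole."""
    buffer: List[float] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) == block_size:
            yield from offline_sanitizer(buffer)
            buffer = []

# Toy stand-in for an offline sanitizer (NOT differentially private;
# a real instantiation would plug in an actual private sanitizer).
def toy_sanitizer(block: List[float]) -> List[float]:
    return [x + random.uniform(-0.01, 0.01) for x in block]
```

Because each block is discarded after it is sanitized and emitted, the memory footprint is one block plus whatever the offline sanitizer itself uses.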
Remark 2.3.
Two remarks are in order. First, assuming that $n$ is big enough w.r.t. $m$, we can avoid blowing up the confidence parameter by a factor of $\frac{n}{m}$. The reason is that, by the Chernoff bound, w.h.p. we get that at least a $(1-2\beta)$ fraction of the executions of $A$ succeed, in which case the overall error would be at most $\alpha + 2\beta$. Second, again assuming that $n$ is big enough, we could relax the privacy guarantees of $A$ while keeping $B$'s privacy guarantees unchanged. Specifically, a well-known fact is that we can boost the privacy guarantees of a differentially private algorithm by applying it to a subsample of its input dataset. Hence, assuming that $A$ is differentially private with a constant privacy parameter, we can execute the above algorithm on a random subsample from the original input stream (by selecting each element of the stream independently with probability $\approx\varepsilon$). Assuming that $n \gtrsim \frac{m}{\varepsilon}$, the subsampled stream is big enough such that the additional error introduced by this subsampling is at most $\alpha$.
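The subsampling step from the second remark can be sketched as follows (a minimal sketch; the exact sampling rate and the resulting privacy-amplification factor depend on the specific analysis and are assumptions here):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def subsample_stream(stream: Iterable[T], rate: float, rng: random.Random) -> Iterator[T]:
    """Bernoulli subsampling: keep each stream element independently with
    probability `rate`. Running a differentially private algorithm on such
    a subsample amplifies its privacy guarantee (roughly by a factor of
    `rate`), at the cost of a small additional sampling error."""
    for item in stream:
        if rng.random() < rate:
            yield item
```

The subsampled stream can then be fed directly into the streaming-sanitizer from the proof of Theorem 2.1.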
3 Application to Bounded Space Differentially Private Quantiles
Recently, Alabi et al. [ABC21]
introduced the problem of differentially private quantile estimation with sublinear space complexity. Specifically, let $\alpha$ be an approximation parameter, and consider an input stream containing $n$ points from a domain $X$. The goal is to design a small-space differentially private algorithm that, at the end of the stream, is capable of approximating all quantiles in the data up to error $\alpha$ with confidence $1-\beta$. Specifically, [ABC21] designed a private variant of the streaming algorithm of Greenwald and Khanna [GK01]. They obtained an $(\varepsilon,\delta)$-differentially private algorithm with space complexity $\tilde{O}(\frac{1}{\varepsilon\alpha})$ (we use $\tilde{O}(\cdot)$ to hide logarithmic factors), which is great because it matches the non-private space dependency of $\frac{1}{\alpha}$ exhibited by [GK01]. However, the following questions were left open (and stated explicitly as open questions by Alabi et al. [ABC21]).
Question 3.1.
The algorithm of [ABC21] was tailored to the non-private algorithm of Greenwald and Khanna [GK01], which is known to be suboptimal in the non-private setting. Can we devise a more general approach that would allow us to instantiate (and benefit from) the state-of-the-art non-private algorithms? In particular, can we avoid the dependency of the space complexity on the stream length $n$?
Question 3.2.
The space complexity of the algorithm of [ABC21] grows with $\frac{1}{\varepsilon}$. Can this be avoided?
We observe that using our notion of streaming-sanitizers immediately resolves these two questions. To see this, let $X$ be a totally-ordered data domain. For a point $y \in X$, let $c_y$ be the threshold function defined by $c_y(x)=1$ iff $x \le y$. Let $\mathcal{T} = \{c_y : y \in X\}$ denote the class of all threshold functions over $X$. This class captures all quantiles of the original stream. Now, to design a differentially private quantile estimation algorithm (with sublinear space), all we need to do is to apply our generic construction of a streaming-sanitizer for $\mathcal{T}$ (instantiated with the state-of-the-art offline sanitizer for this class), and to run any non-private streaming algorithm for quantile estimation on the outcome of the streaming-sanitizer. Using the state-of-the-art offline sanitizer for $\mathcal{T}$ from [KLM20] and the state-of-the-art non-private streaming algorithm of [KLL16], we get an $(\varepsilon,\delta)$-differentially private quantiles-estimation algorithm whose space complexity is independent of the stream length $n$.
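A toy end-to-end illustration of why preserving threshold averages preserves quantiles (the shifted "sanitized" stream below is a hypothetical stand-in, not the output of an actual private sanitizer such as [KLM20], and the sorted-index quantile is a stand-in for a small-space sketch such as [KLL16]):

```python
from typing import Callable, List

def threshold(y: float) -> Callable[[float], int]:
    """The threshold predicate c_y(x) = 1 iff x <= y."""
    return lambda x: 1 if x <= y else 0

def empirical_quantile(data: List[float], q: float) -> float:
    """Non-private q-th quantile of a finite stream."""
    s = sorted(data)
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]

# If the sanitized stream preserves the averages of all threshold
# predicates up to error alpha, then every quantile of the sanitized
# stream is an alpha-approximate quantile of the original stream.
data = [float(i) for i in range(1, 101)]    # original stream
sanitized = [x + 0.5 for x in data]         # toy "sanitized" stream
f = threshold(50.0)
orig_avg = sum(map(f, data)) / len(data)            # fraction of points <= 50
san_avg = sum(map(f, sanitized)) / len(sanitized)   # nearly the same fraction
```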
Remark 3.3.
The above idea is general, and is not restricted to the algorithms of [KLM20] and [KLL16]. In particular, we could have used a pure $\varepsilon$-differentially private sanitizer, at the expense of having the space complexity grow with $\log|X|$ instead of (roughly) $\log^*|X|$. See, e.g., [BNS16, BNSV15, KSS21] for additional constructions of (offline) sanitizers for $\mathcal{T}$.
Remark 3.4.
As in Remark 2.3, assuming that the stream length $n$ is big enough, the space complexity can be made independent of $\frac{1}{\varepsilon}$.
Alabi et al. [ABC21] also asked if it is possible to obtain a differentially private streaming algorithm for quantiles estimation that allows for continually monitoring how the quantiles evolve throughout the stream (rather than only at the end of the stream). Specifically, the algorithm of [ABC21] works in the "one-shot" setting where the quantiles are computed once after observing the entire stream. However, many relevant applications require real-time release of statistics. Using our notion of a streaming-sanitizer, this comes essentially for free (see Remark 1.4), because once the stream is private, any post-processing of it satisfies privacy.
References
 [ABC21] Daniel Alabi, Omri Ben-Eliezer, and Anamay Chaturvedi. Bounded space differentially private quantiles. In TPDP, 2021.
 [BNS16] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1–61, 2016.
 [BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634–649. IEEE Computer Society, 2015.
 [DMNS16] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. J. Priv. Confidentiality, 7(3):17–51, 2016.
 [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, 2014.
 [GK01] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In SIGMOD Conference, pages 58–66. ACM, 2001.
 [KLL16] Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. In FOCS, pages 71–78. IEEE Computer Society, 2016.

 [KLM20] Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, volume 125 of Proceedings of Machine Learning Research, pages 2263–2285. PMLR, 2020.
 [KSS21] Haim Kaplan, Shachar Schnapp, and Uri Stemmer. Differentially private approximate quantiles. CoRR, abs/2110.05429, 2021.
 [Vad17] Salil P. Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pages 347–450. Springer International Publishing, 2017.