I Introduction
I-A Background
In recent years, differential privacy (DP) [1, 2] has been increasingly accepted as the current standard for data privacy [3, 4, 5, 6]. With the centralized model of DP, a trusted curator has access to compute on the entire raw data of users (e.g., the Census Bureau [7, 8]). By ‘trusted’, we mean that curators do not misuse or leak private information from individuals. However, this assumption does not always hold in real life, e.g., data breaches are all too common [9].
To preserve privacy on the user side, an alternative approach, namely, local differential privacy (LDP), was initially formalized in [10]. With LDP, rather than trusting a data curator to hold the raw data and sanitize it before answering queries, each user applies a DP mechanism to their data before transmitting it to the data collector server. The local DP model allows collecting data in unprecedented ways and, therefore, it has led to several adoptions by industry (e.g., Google Chrome browser [11], Microsoft Windows 10 operating system [12], Apple iOS and macOS [13]).
I-B Motivation and problem statement
When collecting data in practice, one is often interested in multiple attributes of a population, i.e., multidimensional data. For instance, in crowdsourcing applications, the server may collect both demographic information (e.g., gender, nationality) and user habits in order to develop personalized solutions for specific groups. In addition, one generally aims to collect data from the same users throughout time (i.e., longitudinal studies), which is essential in many situations [11, 12]. For example, two medical acts recorded at different times indicate a treatment if they were performed on the same patient, but two isolated acts if they were performed on two different patients.
So, in this paper, we focus on the problem of private frequency (or histogram) estimation of multiple attributes throughout time with LDP. Frequency estimation is a primary objective of LDP, in which the data collector (a.k.a. the aggregator) decodes all the privatized data of the users and can then estimate the number of users for each possible value. More formally, we assume there are $d$ attributes $A = \{A_1, A_2, \ldots, A_d\}$, where each attribute $A_j$ with a discrete domain has a specific number of values $|A_j| = k_j$. Each user $u_i$, for $i \in \{1, 2, \ldots, n\}$, has a tuple $\mathbf{v}^{(i)} = (v_1^{(i)}, v_2^{(i)}, \ldots, v_d^{(i)})$, where $v_j^{(i)}$ represents the value of attribute $A_j$ in record $\mathbf{v}^{(i)}$. Thus, for each attribute $A_j$ at time $t \in [1, \tau]$, the aggregator's goal is to estimate a $k_j$-bins histogram, including the frequency of all values in $A_j$.
Indeed, in both longitudinal and multidimensional settings, one needs to consider the allocation of the privacy budget, which can grow extremely quickly due to the composition theorem [3]. However, on the one hand, most of the academic literature on frequency estimation [14, 15, 16, 17, 18, 19, 20] focuses on a single data collection (i.e., non-longitudinal studies). On the other hand, the studies on collecting multidimensional data with LDP mainly focus on other complex tasks (e.g., analytical/range queries [21, 22, 23, 24], estimating marginals [25, 26, 27, 28, 29]) and on numerical data only (e.g., [30, 31, 32, 33]).
I-C Summary of contributions
In this paper, we extend the analysis of three state-of-the-art LDP protocols, namely, generalized randomized response (GRR) [16], optimized unary encoding (OUE) [14], and symmetric unary encoding (SUE) [11], to both longitudinal and multidimensional frequency estimates. On the one hand, for all three protocols, we theoretically prove that randomly sampling a single attribute per user improves data utility, which is an extension of common results in the LDP literature [34, 22, 35, 27, 36].
On the other hand, in the literature, both SUE and OUE protocols have been extended (and also applied [37, 38]) to longitudinal studies based on the concept of memoization [11, 12], i.e., LSUE and LOUE, respectively. However, we numerically and experimentally show that combining both protocols provides higher data utility, i.e., starting with OUE and then applying SUE (LOSUE) yields better data utility than using SUE or OUE twice. In addition, we also extend GRR to longitudinal studies (i.e., LGRR), which provides higher data utility than the other protocols based on unary encoding for attributes with a small domain size.
Lastly, in a multidimensional setting with different domain sizes for each attribute, a dynamic selection of longitudinal LDP protocols is preferred. Therefore, we also propose a new solution named Adaptive LDP for LOngitudinal and Multidimensional FREquency Estimates (ALLOMFREE), which combines all the aforementioned results. More specifically, ALLOMFREE randomly samples a single attribute to send with the whole privacy budget and adaptively selects the optimal protocol, i.e., either LGRR or LOSUE. To validate our proposal, we conducted an extensive set of experiments on four real-world open datasets. Under the same privacy guarantee, results show that ALLOMFREE consistently and considerably outperforms the state-of-the-art LSUE and LOUE protocols in the quality of the frequency estimates.
Paper's Outline. The remainder of this paper is organized as follows. In Section II, we review the privacy notion that we are considering, i.e., LDP, and the protocols we further analyze in this paper. In Section III, we extend the analysis of GRR, OUE, and SUE to multidimensional data collections. In Section IV, we present the memoization-based framework for longitudinal data collections, the extension and analysis of longitudinal GRR and longitudinal UE-based protocols, the numerical evaluation of their performance, and our ALLOMFREE solution. In Section V, we present experimental results, discuss our results and limitations, and review related work. Lastly, in Section VI, we present the concluding remarks and future directions.
II Theoretical background
In this section, we briefly present the concept of privacy considered in this work, that is, LDP (Subsection II-A), and the LDP protocols we will apply in this paper and their analysis (Subsection II-B).
II-A Local differential privacy (LDP)
Local differential privacy, initially formalized in [10], protects an individual’s privacy during the data collection process. A formal definition of LDP is given in the following:
Definition 1 (Local Differential Privacy).
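A randomized algorithm $\mathcal{A}$ satisfies $\epsilon$-LDP if, for any pair of input values $v_1, v_2 \in Domain(\mathcal{A})$ and any possible output $y$ of $\mathcal{A}$:
$$\Pr[\mathcal{A}(v_1) = y] \leq e^{\epsilon} \cdot \Pr[\mathcal{A}(v_2) = y].$$
Similar to the centralized model of DP, LDP also enjoys several important properties, e.g., immunity to post-processing ($f(\mathcal{A})$ is $\epsilon$-LDP for any function $f$) and composability [3]. That is, combining the results from $m$ locally differentially private protocols also satisfies LDP. If these protocols are applied separately to disjoint subsets of the dataset, the combination satisfies $\max(\epsilon_1, \ldots, \epsilon_m)$-LDP (parallel composition). On the other hand, if these protocols are sequentially applied to the same dataset, the combination satisfies $\sum_{i=1}^{m} \epsilon_i$-LDP (sequential composition).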
II-B LDP protocols
Randomized response (RR), a surveying technique proposed by Warner [39], has been the building block for many LDP protocols. Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $\epsilon$ be the privacy budget. We review three state-of-the-art LDP mechanisms for single-frequency estimation (a.k.a. frequency oracles) that will be used in this paper.
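II-B1 Generalized randomized response (GRR)
The $k$-Ary RR [16] mechanism extends RR to the case of $k_j \geq 2$ and is also referred to as direct encoding [14] or generalized RR (GRR) [40, 41, 27]. Throughout this paper, we use the term GRR for this LDP protocol. Given a value $v$, GRR($v$) outputs the true value $v$ with probability $p$, and any other value $v' \in A_j$ such that $v' \neq v$ with probability $q = \frac{1-p}{k_j-1}$. More formally, the perturbation function is defined as:
$$\forall y \in A_j: \quad \Pr[\mathrm{GRR}(v) = y] = \begin{cases} p = \frac{e^{\epsilon}}{e^{\epsilon} + k_j - 1}, & \text{if } y = v, \\ q = \frac{1}{e^{\epsilon} + k_j - 1}, & \text{if } y \neq v. \end{cases}$$
This satisfies $\epsilon$-LDP since $\frac{p}{q} = e^{\epsilon}$. To estimate the frequency $\hat{f}(v_i)$ with which a value $v_i$ occurs, for $i \in [1, k_j]$, one calculates [14]:
$$\hat{f}(v_i) = \frac{N_i - nq}{n(p - q)}, \tag{1}$$
in which $N_i$ is the number of times the value $v_i$ has been reported and $n$ is the total number of users. In [14], it is shown that $\hat{f}(v_i)$ is an unbiased estimation of the true frequency $f(v_i)$, and the variance of this estimation is $Var[\hat{f}(v_i)] = \frac{q(1-q)}{n(p-q)^2} + \frac{f(v_i)(1-p-q)}{n(p-q)}$. In the case of small $f(v_i) \sim 0$, this variance is dominated by the first term, which gives the approximate variance as [14]:
$$Var^*[\hat{f}(v_i)] = \frac{q(1-q)}{n(p-q)^2}. \tag{2}$$
Replacing $p = \frac{e^{\epsilon}}{e^{\epsilon} + k_j - 1}$ and $q = \frac{1-p}{k_j-1}$ into Eq. (2), the GRR variance is calculated as:
$$Var^*[\hat{f}_{GRR}(v_i)] = \frac{e^{\epsilon} + k_j - 2}{n(e^{\epsilon} - 1)^2}. \tag{3}$$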
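To make the GRR steps concrete, the following is a minimal Python sketch of the client-side perturbation, the server-side estimator of Eq. (1), and the approximate variance of Eq. (3). The function names (`grr_params`, `grr_perturb`, `grr_estimate`, `grr_variance`) are illustrative choices and do not come from the paper.

```python
import math
import random
from collections import Counter

def grr_params(epsilon, k):
    """Return (p, q) for GRR over a domain of size k."""
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = (1.0 - p) / (k - 1)
    return p, q

def grr_perturb(value, domain, epsilon, rng=random):
    """Report the true value w.p. p, otherwise a uniformly chosen other value."""
    p, _ = grr_params(epsilon, len(domain))
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])

def grr_estimate(reports, domain, epsilon):
    """Frequency estimates following Eq. (1): (N_i - n*q) / (n*(p - q))."""
    n = len(reports)
    p, q = grr_params(epsilon, len(domain))
    counts = Counter(reports)
    return {v: (counts[v] - n * q) / (n * (p - q)) for v in domain}

def grr_variance(epsilon, k, n):
    """Approximate variance following Eq. (3)."""
    return (math.exp(epsilon) + k - 2) / (n * (math.exp(epsilon) - 1) ** 2)
```

Because $p + (k_j - 1)q = 1$, the estimates returned by `grr_estimate` always sum to one, although individual entries may be negative for rare values.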
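II-B2 Unary encoding-based protocols
Protocols based on unary encoding (UE) consist of transforming a value $v$ into a binary representation of it. So, first, for a given value $v$, $B = Encode(v)$, where $B = [0, \ldots, 0, 1, 0, \ldots, 0]$, a $k_j$-bit array where only the $v$-th position is set to one. Next, the bits from $B$ are flipped, depending on parameters $p$ and $q$, to generate a sanitized vector $B'$, in which:
$$\Pr[B'_i = 1] = \begin{cases} p, & \text{if } B_i = 1, \\ q, & \text{if } B_i = 0. \end{cases}$$
The proof that UE-based protocols satisfy $\epsilon$-LDP for
$$\epsilon = \ln\left(\frac{p(1-q)}{(1-p)q}\right) \tag{4}$$
is known in the literature and can be found in [11, 14]. In [14], the authors present two ways of selecting the probabilities $p$ and $q$, which determine the protocol variance. One well-known UE-based protocol is the Basic One-time RAPPOR [11], referred to as symmetric UE (SUE), which selects $p = \frac{e^{\epsilon/2}}{e^{\epsilon/2}+1}$ and $q = \frac{1}{e^{\epsilon/2}+1}$, where $p + q = 1$ (symmetric). The estimated frequency $\hat{f}(v_i)$ with which a value $v_i$ occurs, for $i \in [1, k_j]$, is also calculated using Eq. (1). Replacing $p$ and $q$ into Eq. (2), the SUE variance is calculated as [11]:
$$Var^*[\hat{f}_{SUE}(v_i)] = \frac{e^{\epsilon/2}}{n(e^{\epsilon/2} - 1)^2}. \tag{5}$$
Moreover, rather than selecting $p$ and $q$ to be symmetric, Wang et al. [14] proposed optimized UE (OUE), which selects the parameters $p = \frac{1}{2}$ and $q = \frac{1}{e^{\epsilon}+1}$ that minimize the variance of UE-based protocols while still satisfying $\epsilon$-LDP. Similarly, the estimation method used in Eq. (1) equally applies to OUE. Replacing $p$ and $q$ into Eq. (2), the OUE variance is calculated as [14]:
$$Var^*[\hat{f}_{OUE}(v_i)] = \frac{4e^{\epsilon}}{n(e^{\epsilon} - 1)^2}. \tag{6}$$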
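The UE pipeline can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names (`ue_params`, `ue_perturb`, `ue_estimate`, `ue_epsilon`) are assumptions, values are represented by their index in the domain, and the SUE and OUE parameter choices are the standard ones, namely $p = e^{\epsilon/2}/(e^{\epsilon/2}+1)$, $q = 1-p$ for SUE and $p = 1/2$, $q = 1/(e^{\epsilon}+1)$ for OUE.

```python
import math
import random

def ue_params(epsilon, optimized=False):
    """Return (p, q) for OUE if optimized is True, otherwise for SUE."""
    if optimized:
        return 0.5, 1.0 / (math.exp(epsilon) + 1)
    p = math.exp(epsilon / 2) / (math.exp(epsilon / 2) + 1)
    return p, 1.0 - p

def ue_perturb(index, k, p, q, rng=random):
    """One-hot encode `index` over k bits, keeping a 1 w.p. p and raising a 0 w.p. q."""
    return [int(rng.random() < (p if i == index else q)) for i in range(k)]

def ue_estimate(reports, p, q):
    """Apply the estimator of Eq. (1) to the per-position bit counts."""
    n = len(reports)
    k = len(reports[0])
    counts = [sum(r[i] for r in reports) for i in range(k)]
    return [(c - n * q) / (n * (p - q)) for c in counts]

def ue_epsilon(p, q):
    """Privacy level guaranteed by a UE protocol, as in Eq. (4)."""
    return math.log(p * (1 - q) / ((1 - p) * q))
```

Calling `ue_epsilon(*ue_params(eps))` returns `eps` (up to floating-point error) for both parameter choices, which is a quick way to check a configuration.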
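III Multidimensional Frequency Estimates with LDP
In the literature, there are few works on collecting multidimensional data with LDP based on random sampling (i.e., dividing users into groups) [30, 31, 32, 33, 14, 36]. This technique reduces both dimensionality and communication costs, which will also be the focus of this paper. Let $d$ be the total number of attributes, $k_j$ be the domain size of each attribute $A_j$, $n$ be the number of users, and $\epsilon$ be the privacy budget. An intuitive solution (Spl) is splitting the privacy budget, i.e., assigning $\epsilon/d$ to each attribute. The other solution (Smp) is based on uniformly sampling (without replacement) only $r$ attribute(s) out of the $d$ possible ones, i.e., assigning $\epsilon/r$ per attribute. Notice that both solutions satisfy $\epsilon$-LDP according to the sequential composition theorem [3].
For the first case, Spl, the variances ($Var^*_1$) of GRR, SUE, and OUE are, respectively:
$$Var^*_1[\hat{f}_{GRR}] = \frac{e^{\epsilon/d} + k_j - 2}{n(e^{\epsilon/d} - 1)^2}, \quad Var^*_1[\hat{f}_{SUE}] = \frac{e^{\epsilon/2d}}{n(e^{\epsilon/2d} - 1)^2}, \quad Var^*_1[\hat{f}_{OUE}] = \frac{4e^{\epsilon/d}}{n(e^{\epsilon/d} - 1)^2}. \tag{7}$$
For the second case, Smp, the number of users per attribute is reduced to $nr/d$. Thus, the variances ($Var^*_2$) of GRR, SUE, and OUE are, respectively:
$$Var^*_2[\hat{f}_{GRR}] = \frac{d(e^{\epsilon/r} + k_j - 2)}{nr(e^{\epsilon/r} - 1)^2}, \quad Var^*_2[\hat{f}_{SUE}] = \frac{d\,e^{\epsilon/2r}}{nr(e^{\epsilon/2r} - 1)^2}, \quad Var^*_2[\hat{f}_{OUE}] = \frac{4d\,e^{\epsilon/r}}{nr(e^{\epsilon/r} - 1)^2}. \tag{8}$$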
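Notice that if $r = d$ in Eq. (8), one achieves Eq. (7). Practically, the objective is reduced to finding the $r$ that minimizes $Var^*_2$ for each protocol. To find the optimal $r$ for each protocol, we first multiply each $Var^*_2$ in Eq. (8) by the positive constant $\frac{n\epsilon}{d}$, which does not depend on $r$. Without loss of generality, minimizing $Var^*_2[\hat{f}_{GRR}]$, $Var^*_2[\hat{f}_{SUE}]$, and $Var^*_2[\hat{f}_{OUE}]$ is equivalent to minimizing $\frac{\epsilon}{r} \cdot \frac{e^{\epsilon/r} + k_j - 2}{(e^{\epsilon/r} - 1)^2}$, $\frac{\epsilon}{r} \cdot \frac{e^{\epsilon/2r}}{(e^{\epsilon/2r} - 1)^2}$, and $\frac{\epsilon}{r} \cdot \frac{4e^{\epsilon/r}}{(e^{\epsilon/r} - 1)^2}$, respectively. Hence, let $x = r/\epsilon$ be the independent variable; the three expressions can then be rewritten as $\frac{1}{x} \cdot \frac{e^{1/x} + k_j - 2}{(e^{1/x} - 1)^2}$, $\frac{1}{x} \cdot \frac{e^{1/2x}}{(e^{1/2x} - 1)^2}$, and $\frac{1}{x} \cdot \frac{4e^{1/x}}{(e^{1/x} - 1)^2}$ as functions over $x$. It is not hard to prove that these are increasing functions w.r.t. $x$ and, hence, the variance is minimized when $r = 1$ (a single attribute per user) for all three protocols. We highlight that this is a common result in the LDP literature obtained for different protocols and contexts [30, 31, 33, 14, 22, 35, 34, 42].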
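The claim that a single sampled attribute is optimal can be checked numerically. The sketch below assumes the Smp variances are obtained by substituting a per-attribute budget of $\epsilon/r$ and $nr/d$ users per attribute into Eqs. (3), (5), and (6); the function names and the example values of $\epsilon$, $n$, $d$, and $k_j$ are illustrative assumptions.

```python
import math

def smp_variance_grr(epsilon, k, n, d, r):
    """GRR variance with budget epsilon/r and n*r/d users per attribute."""
    e = math.exp(epsilon / r)
    return d * (e + k - 2) / (n * r * (e - 1) ** 2)

def smp_variance_sue(epsilon, n, d, r):
    """SUE variance under the same sampling setting."""
    e = math.exp(epsilon / (2 * r))
    return d * e / (n * r * (e - 1) ** 2)

def smp_variance_oue(epsilon, n, d, r):
    """OUE variance under the same sampling setting."""
    e = math.exp(epsilon / r)
    return 4 * d * e / (n * r * (e - 1) ** 2)

def best_r(variance_fn, d):
    """Return the number of sampled attributes in 1..d with the lowest variance."""
    return min(range(1, d + 1), key=variance_fn)

if __name__ == "__main__":
    eps, n, d, k = 1.0, 10000, 5, 10  # illustrative values only
    for r in range(1, d + 1):
        print(r, smp_variance_grr(eps, k, n, d, r),
              smp_variance_sue(eps, n, d, r),
              smp_variance_oue(eps, n, d, r))
```

Setting `r = d` recovers the Spl variance, so the same functions also cover the budget-splitting baseline.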
Therefore, in this paper, we adopt the multidimensional setting Smp with $r = 1$. In this setting, each user tells the data collector which attribute was sampled, together with its perturbed value, which satisfies $\epsilon$-LDP by applying either GRR or a UE-based protocol; the server receives no information about the remaining attributes.
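A client-side sketch of this setting, assuming GRR as the perturbation step, could look as follows. The function name `smp_report` and the representation of a record as a list of values with one domain list per attribute are assumptions made for illustration.

```python
import math
import random

def smp_report(record, domains, epsilon, rng=random):
    """Sample one attribute uniformly and perturb its value with GRR.

    record  -- list with one value per attribute
    domains -- list with one list of possible values per attribute
    Returns (attribute index, perturbed value); nothing else leaves the device.
    """
    j = rng.randrange(len(record))
    domain = domains[j]
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return j, record[j]
    return j, rng.choice([v for v in domain if v != record[j]])
```

On the server side, the reports would be grouped by attribute index, and Eq. (1) applied per group with $n$ replaced by the number of users who sampled that attribute.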
IV Longitudinal Frequency Estimates with LDP
In this section, we present the memoization-based framework for longitudinal data collections (Subsection IV-A). Next, we present the analysis of longitudinal GRR (Subsection IV-B) and longitudinal UE-based protocols (Subsection IV-C). Lastly, we numerically evaluate the extended longitudinal protocols (Subsection IV-D) and propose our ALLOMFREE solution (Subsection IV-E).
IV-A Memoization-based data collection with LDP
In the literature, many works study how to collect and analyze categorical data longitudinally based on memoization [11, 12, 34]. The key idea behind memoization is using two sanitization processes. The first round ($\mathcal{M}_1$) replaces the real value $B$ with a sanitized one $B'$ with a higher epsilon ($\epsilon_\infty$). Whenever one intends to report $B$, $B'$ shall be reused to produce other sanitized versions $B''$ with lower epsilon values. Notice that the second sanitization ($\mathcal{M}_2$) is a must to avoid 'averaging attacks', in which adversaries can reconstruct the true value from multiple sanitized versions of it. This technique allows achieving privacy over time with an upper bound value of $\epsilon_\infty$-LDP.
Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $\epsilon$ be the privacy budget. In this paper, for both the $\mathcal{M}_1$ and $\mathcal{M}_2$ steps, we will apply either GRR, SUE, or OUE. The unbiased estimator in Eq. (1) for the frequency $f(v_i)$ of each value $v_i$, for $i \in [1, k_j]$, is now extended to:
$$\hat{f}_L(v_i) = \frac{N_i - nq_1(p_2 - q_2) - nq_2}{n(p_1 - q_1)(p_2 - q_2)}, \tag{9}$$
in which $N_i$ is the number of times the value $v_i$ has been reported, $n$ is the total number of users, $p_1$ and $q_1$ are the parameters used by an LDP protocol for $\mathcal{M}_1$, and $p_2$ and $q_2$ are the parameters used by an LDP protocol for $\mathcal{M}_2$.
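As a sanity check on the two-round estimator, the sketch below assumes Eq. (9) has the form $(N_i - nq_1(p_2-q_2) - nq_2)/(n(p_1-q_1)(p_2-q_2))$ and feeds it the expected count of a single UE bit after both sanitization rounds. The helper `expected_count` and the numeric parameters in the usage lines are illustrative assumptions.

```python
def longitudinal_estimate(count, n, p1, q1, p2, q2):
    """Two-round frequency estimate in the assumed form of Eq. (9)."""
    return (count - n * q1 * (p2 - q2) - n * q2) / (n * (p1 - q1) * (p2 - q2))

def expected_count(f, n, p1, q1, p2, q2):
    """Expected number of reported 1s at one position when a fraction f of users hold it.

    A true 1 survives both rounds w.p. p1*p2 + (1 - p1)*q2;
    a true 0 is reported as 1 w.p. q1*p2 + (1 - q1)*q2.
    """
    one_given_one = p1 * p2 + (1 - p1) * q2
    one_given_zero = q1 * p2 + (1 - q1) * q2
    return n * (f * one_given_one + (1 - f) * one_given_zero)

if __name__ == "__main__":
    params = (0.5, 0.2, 0.75, 0.25)  # p1, q1, p2, q2 (arbitrary example)
    count = expected_count(0.3, 1000, *params)
    print(longitudinal_estimate(count, 1000, *params))  # recovers the frequency 0.3
```

Plugging the expected count back into the estimator returns the true frequency, which is the unbiasedness property stated in Theorem 1.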
Theorem 1.
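The estimation result $\hat{f}_L(v_i)$ in Eq. (9) is an unbiased estimation of $f(v_i)$ for any value $v_i \in A_j$.
Proof 1
$$E[\hat{f}_L(v_i)] = E\left[\frac{N_i - nq_1(p_2 - q_2) - nq_2}{n(p_1 - q_1)(p_2 - q_2)}\right] = \frac{E[N_i] - nq_1(p_2 - q_2) - nq_2}{n(p_1 - q_1)(p_2 - q_2)}.$$
Let us focus on
$$E[N_i] = nf(v_i)\left(p_1p_2 + (1 - p_1)q_2\right) + n(1 - f(v_i))\left(q_1p_2 + (1 - q_1)q_2\right) = nf(v_i)(p_1 - q_1)(p_2 - q_2) + nq_1(p_2 - q_2) + nq_2.$$
Thus,
$$E[\hat{f}_L(v_i)] = f(v_i).$$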
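Proof 2
Thanks to Eq. (9) we have
$$Var[\hat{f}_L(v_i)] = \frac{Var[N_i]}{n^2(p_1 - q_1)^2(p_2 - q_2)^2}.$$
Since $N_i$ is the number of times the value $v_i$ is observed, it can be defined as $N_i = \sum_{z=1}^{n} X_z$, where $X_z$ is equal to 1 if the user $z$, $1 \leq z \leq n$, reports value $v_i$, and 0 otherwise. We thus have $Var[N_i] = Var\left[\sum_{z=1}^{n} X_z\right]$. Since all the users are independent,
$$Var[N_i] = \sum_{z=1}^{n} Var[X_z] = n\left(\gamma - \gamma^2\right), \quad \text{where } \gamma = \Pr[X_z = 1] = f(v_i)(p_1 - q_1)(p_2 - q_2) + q_1(p_2 - q_2) + q_2.$$
We thus have $Var[N_i] = n\gamma(1 - \gamma)$ and, finally,
$$Var[\hat{f}_L(v_i)] = \frac{\gamma(1 - \gamma)}{n(p_1 - q_1)^2(p_2 - q_2)^2}. \tag{10}$$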
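In this work, we will use the approximate variance, in which $f(v_i) = 0$ in Eq. (10), which gives:
$$Var^*[\hat{f}_L(v_i)] = \frac{\delta(1 - \delta)}{n(p_1 - q_1)^2(p_2 - q_2)^2}, \quad \text{where } \delta = q_1(p_2 - q_2) + q_2. \tag{11}$$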
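A small Python sketch of the approximate variance follows. It assumes the form $\delta(1-\delta)/(n(p_1-q_1)^2(p_2-q_2)^2)$ with $\delta = q_1(p_2-q_2)+q_2$, i.e., the two-round variance evaluated at $f(v_i) = 0$; the function name and the example parameters are illustrative.

```python
def approx_variance(n, p1, q1, p2, q2):
    """Approximate variance of the two-round estimator (assumed form of Eq. (11))."""
    delta = q1 * (p2 - q2) + q2  # probability of reporting a value the user does not hold
    return delta * (1 - delta) / (n * (p1 - q1) ** 2 * (p2 - q2) ** 2)

if __name__ == "__main__":
    # With a noiseless second round (p2 = 1, q2 = 0) the expression
    # collapses to the single-round variance q1*(1 - q1) / (n*(p1 - q1)**2).
    print(approx_variance(1000, 0.7, 0.1, 1.0, 0.0))
    # Adding second-round noise increases the variance.
    print(approx_variance(1000, 0.7, 0.1, 0.9, 0.1))
```

The first call shows that the longitudinal expression is a strict generalization of Eq. (2), which is a convenient regression check when implementing it.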
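IV-B Longitudinal GRR (LGRR): definition and LDP study
Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $v$ be the real value. We now describe an extension of GRR for longitudinal studies; we refer to this protocol as LGRR for the rest of this paper. First, $B = Encode(v) = v$ (direct encoding). Next, there are two rounds of sanitization, $\mathcal{M}_1$ and $\mathcal{M}_2$, applying GRR, described in the following.
1) $\mathcal{M}_1$: Memoize a value $B'$ such that
$$B' = \begin{cases} B, & \text{with probability } p_1, \\ v' \in A_j \setminus \{B\}, & \text{with probability } q_1 \text{ for each } v', \end{cases}$$
in which $p_1$ and $q_1$ control the level of longitudinal $\epsilon_\infty$-LDP. The value $B'$ shall be reused as the basis for all future reports on the real value $v$.
2) $\mathcal{M}_2$: Generate a report $B''$ such that
$$B'' = \begin{cases} B', & \text{with probability } p_2, \\ v' \in A_j \setminus \{B'\}, & \text{with probability } q_2 \text{ for each } v', \end{cases}$$
in which $B''$ is the report to be sent to the server.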
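Visually, Fig. 1 illustrates the probability tree of the LGRR protocol. In the first round of sanitization, $\mathcal{M}_1$, our proposed LGRR applies GRR with $p_1$ and $q_1$ (underlined in the middle of Fig. 1), where $p_1 = \frac{e^{\epsilon_\infty}}{e^{\epsilon_\infty} + k_j - 1}$ and $q_1 = \frac{1 - p_1}{k_j - 1}$. As discussed in Subsection II-B1, this permanent memoization satisfies $\epsilon_\infty$-LDP since $\frac{p_1}{q_1} = e^{\epsilon_\infty}$, which is the upper bound.
On the other hand, with a single collection of data, the attacker's knowledge of $B$ comes only from $B''$, which is generated using two randomization steps with GRR. This provides a higher level of privacy protection [11]. From Fig. 1, we can obtain the following conditional probabilities:
$$\Pr[B'' = v_i \mid B = v_i] = p_1p_2 + q_1q_2, \qquad \Pr[B'' = v_i \mid B = v_{k \neq i}] = p_1q_2 + q_1p_2.$$
Let $p_s = \Pr[B'' = v_i \mid B = v_i]$ and $q_s = \Pr[B'' = v_i \mid B = v_{k \neq i}]$ (underlined in the far right of Fig. 1). With the second round of sanitization, $\mathcal{M}_2$, our proposed LGRR protocol satisfies $\epsilon_1$-LDP since $\frac{p_s}{q_s} = e^{\epsilon_1}$. Notice that $\epsilon_1$ corresponds to a single report (lower bound) and its extension to infinitely many reports is limited by $\epsilon_\infty$ (upper bound), since $\mathcal{M}_2$ uses as input the output of $\mathcal{M}_1$. More specifically, $\epsilon_1$ for LGRR is calculated as:
$$\epsilon_1 = \ln\left(\frac{p_1p_2 + q_1q_2}{p_1q_2 + q_1p_2}\right), \tag{12}$$
in which $p_1 = \frac{e^{\epsilon_\infty}}{e^{\epsilon_\infty} + k_j - 1}$, $q_1 = \frac{1 - p_1}{k_j - 1}$, and both $p_2$ and $q_2$ are selectable according to $\epsilon_\infty$, $\epsilon_1$, and $k_j$, calculated as:
$$p_2 = \frac{e^{\epsilon_\infty + \epsilon_1} - 1}{-k_je^{\epsilon_1} + (k_j - 1)e^{\epsilon_\infty} + e^{\epsilon_1} + e^{\epsilon_1 + \epsilon_\infty} - 1}, \qquad q_2 = \frac{1 - p_2}{k_j - 1}. \tag{13}$$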
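The parameter selection for LGRR can be sketched as below. The closed form for $p_2$ is an assumption: it is obtained by solving $\epsilon_1 = \ln\left(\frac{p_1p_2 + q_1q_2}{p_1q_2 + q_1p_2}\right)$ for $p_2$ with $q_2 = (1-p_2)/(k_j-1)$, and the function names are illustrative. The second function recomputes $\epsilon_1$ from the parameters, so the pair can be used as a round-trip check.

```python
import math

def lgrr_params(eps_inf, eps_1, k):
    """Return (p1, q1, p2, q2) for LGRR given the upper/lower budgets and domain size k."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1 = a / (a + k - 1)
    q1 = (1 - p1) / (k - 1)
    # Assumed closed form, solved from the eps_1 expression above.
    p2 = (a * b - 1) / (-k * b + (k - 1) * a + b + a * b - 1)
    q2 = (1 - p2) / (k - 1)
    return p1, q1, p2, q2

def lgrr_eps_1(p1, q1, p2, q2):
    """First-report privacy level implied by the four parameters."""
    return math.log((p1 * p2 + q1 * q2) / (p1 * q2 + q1 * p2))

if __name__ == "__main__":
    params = lgrr_params(eps_inf=2.0, eps_1=1.0, k=4)  # illustrative budgets
    print(params, lgrr_eps_1(*params))
```

The sketch assumes $0 < \epsilon_1 < \epsilon_\infty$; outside that range the resulting $p_2$ is not a valid probability.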
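IV-C Longitudinal UE (LUE): definition and LDP study
We now describe UE-based protocols for longitudinal studies; we refer to this protocol as LUE for the rest of this paper. Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $v$ be the real value. First, $B = Encode(v)$ (unary encoding), where $B = [0, \ldots, 0, 1, 0, \ldots, 0]$, a $k_j$-bit array where only the $v$-th position is set to one. Next, there are two rounds of sanitization, $\mathcal{M}_1$ and $\mathcal{M}_2$, applying UE-based protocols, described in the following.
1) $\mathcal{M}_1$: For each bit $i$, $1 \leq i \leq k_j$, in $B$, memoize a value $B'_i$ such that
$$\Pr[B'_i = 1] = \begin{cases} p_1, & \text{if } B_i = 1, \\ q_1, & \text{if } B_i = 0, \end{cases}$$
in which $p_1$ and $q_1$ control the level of longitudinal $\epsilon_\infty$-LDP. The value $B'$ shall be reused as the basis for all future reports on the real value $v$.
2) $\mathcal{M}_2$: For each bit $i$, $1 \leq i \leq k_j$, in $B'$, generate a report $B''_i$ such that
$$\Pr[B''_i = 1] = \begin{cases} p_2, & \text{if } B'_i = 1, \\ q_2, & \text{if } B'_i = 0, \end{cases}$$
in which $B''$ is the report to be sent to the server.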
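The two sanitization rounds map naturally onto a small stateful client. The class below is a hypothetical sketch, not the authors' implementation: the class name, the dictionary used for memoization, and the use of a domain index as the value are all assumptions.

```python
import random

class LUEClient:
    """Two-round unary-encoding client with a memoized first round."""

    def __init__(self, k, p1, q1, p2, q2, rng=None):
        self.k = k
        self.p1, self.q1, self.p2, self.q2 = p1, q1, p2, q2
        self.rng = rng or random.Random()
        self._memo = {}  # value index -> memoized first-round vector

    def _flip(self, bits, p, q):
        # Keep a 1 with probability p; turn a 0 into a 1 with probability q.
        return [int(self.rng.random() < (p if b == 1 else q)) for b in bits]

    def report(self, index):
        """Return a fresh second-round report for the value at position `index`."""
        if index not in self._memo:
            one_hot = [int(i == index) for i in range(self.k)]
            self._memo[index] = self._flip(one_hot, self.p1, self.q1)
        return self._flip(self._memo[index], self.p2, self.q2)
```

Each call to `report` reuses the same memoized vector and only redraws the second round, so an adversary who averages many reports can at best recover the memoized vector, never the one-hot encoding itself.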
Visually, Fig. 2 illustrates the probability tree of the LUE protocol. One natural question emerges: how should the parameters $p_1$, $q_1$, $p_2$, and $q_2$ be selected in order to optimize the utility of this LUE protocol? One can see $\mathcal{M}_1$ as a permanent sanitization and $\mathcal{M}_2$ as a 'small' perturbation to avoid averaging attacks and keep privacy over time.
Based on SUE and OUE, we are then left with four options: two popular solutions that strictly use only OUE or SUE parameters in both sanitization steps and two proposed settings that combine both OUE and SUE. These four LUE protocols are summarized below:
I. both sanitizations with OUE (LOUE);
II. both sanitizations with SUE (LSUE);
III. starting with OUE and then with SUE (LOSUE);
IV. starting with SUE and then with OUE (LSOUE);
in which LSUE is the well-known Basic-RAPPOR protocol [11], LOUE is the state-of-the-art OUE protocol [14] with memoization, and both LOSUE and LSOUE are proposed in this paper.
As presented in [14], the OUE variance in Eq. (6) is smaller than the SUE variance in Eq. (5) and, therefore, the former can provide higher utility than the latter for $\mathcal{M}_1$. On the other hand, we argue that OUE might be too strict for $\mathcal{M}_2$ since the parameter $p_2 = \frac{1}{2}$ is constant. Thus, we hypothesize that option III (i.e., LOSUE) is the most suitable one. Without loss of generality, the following analyses are done only for LOSUE and can be easily extended to any of the other combinations.
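In the first round of sanitization, $\mathcal{M}_1$, our solution LOSUE applies OUE with $p_1 = \frac{1}{2}$ and $q_1 = \frac{1}{e^{\epsilon_\infty} + 1}$ (underlined in the middle of Fig. 2). As discussed in Section II-B2, this permanent memoization satisfies $\epsilon_\infty$-LDP since $\frac{p_1(1 - q_1)}{(1 - p_1)q_1} = e^{\epsilon_\infty}$, which is the upper bound.
Following the same development as for LGRR, with a single collection of data, the attacker's knowledge of $B$ comes only from $B''$, which is generated using two randomization steps with OUE and SUE, respectively. This provides a higher level of privacy protection [11]. From Fig. 2, we can obtain the following conditional probabilities for each bit $i \in [1, k_j]$:
$$\Pr[B''_i = 1 \mid B_i = 1] = p_1p_2 + (1 - p_1)q_2, \qquad \Pr[B''_i = 1 \mid B_i = 0] = q_1p_2 + (1 - q_1)q_2.$$
Let $p_s = \Pr[B''_i = 1 \mid B_i = 1]$ and $q_s = \Pr[B''_i = 1 \mid B_i = 0]$ (underlined in the far right of Fig. 2). With the second round of sanitization, $\mathcal{M}_2$, our proposed LOSUE protocol satisfies $\epsilon_1$-LDP since $\frac{p_s(1 - q_s)}{(1 - p_s)q_s} = e^{\epsilon_1}$. Notice that $\epsilon_1$ corresponds to a single report (lower bound) and its extension to infinitely many reports is limited by $\epsilon_\infty$ (upper bound), since $\mathcal{M}_2$ uses as input the output of $\mathcal{M}_1$. More specifically, $\epsilon_1$ for LOSUE (or LUE protocols in general) is calculated as:
$$\epsilon_1 = \ln\left(\frac{\left(p_1p_2 - q_2(p_1 - 1)\right)\left(p_2q_1 - q_2(q_1 - 1) - 1\right)}{\left(p_2q_1 - q_2(q_1 - 1)\right)\left(p_1p_2 - q_2(p_1 - 1) - 1\right)}\right), \tag{14}$$
in which, for LOSUE, we have $p_1 = \frac{1}{2}$, $q_1 = \frac{1}{e^{\epsilon_\infty} + 1}$, and both $p_2$ and $q_2$ are symmetric ($p_2 + q_2 = 1$) and selectable according to $\epsilon_\infty$ and $\epsilon_1$, calculated as:
$$p_2 = \frac{1 - e^{\epsilon_1 + \epsilon_\infty}}{e^{\epsilon_1} - e^{\epsilon_\infty} - e^{\epsilon_1 + \epsilon_\infty} + 1}, \qquad q_2 = 1 - p_2. \tag{15}$$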
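The LOSUE parameters can be derived and checked in the same way as for LGRR. In this sketch, the closed form for $p_2$ is an assumption obtained by solving the per-bit privacy condition $\frac{p_s(1-q_s)}{(1-p_s)q_s} = e^{\epsilon_1}$ with $p_1 = 1/2$, $q_1 = 1/(e^{\epsilon_\infty}+1)$, and $q_2 = 1 - p_2$; the function names are illustrative.

```python
import math

def losue_params(eps_inf, eps_1):
    """Return (p1, q1, p2, q2): OUE for the memoized round, symmetric UE for the second."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1 = 0.5
    q1 = 1 / (a + 1)
    p2 = (1 - a * b) / (b - a - a * b + 1)  # assumed closed form
    q2 = 1 - p2
    return p1, q1, p2, q2

def lue_eps_1(p1, q1, p2, q2):
    """First-report privacy level of any two-round UE protocol."""
    one_given_one = p1 * p2 + (1 - p1) * q2
    one_given_zero = q1 * p2 + (1 - q1) * q2
    return math.log(one_given_one * (1 - one_given_zero)
                    / (one_given_zero * (1 - one_given_one)))

if __name__ == "__main__":
    params = losue_params(eps_inf=2.0, eps_1=1.0)  # illustrative budgets
    print(params, lue_eps_1(*params))
```

Passing `p2 = 1` and `q2 = 0` to `lue_eps_1` returns $\epsilon_\infty$, which matches the intuition that a noiseless second round exposes the memoized vector directly.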
IV-D Numerical evaluation of LGRR and LUE protocols
In this subsection, we numerically evaluate the approximate variance of all developed longitudinal protocols, namely, LGRR and the four UE-based options LOUE, LSUE, LOSUE, and LSOUE. As mentioned above, once both the $\epsilon_\infty$ and $\epsilon_1$ privacy guarantees are defined, one can obtain the parameters $p_1$ and $q_1$ depending on $\epsilon_\infty$, and the parameters $p_2$ and $q_2$ depending on both $\epsilon_\infty$ and $\epsilon_1$ (and the domain size $k_j$ for LGRR), as given in Eq. (13) for LGRR and in Eq. (15) for LOSUE.
Next, once the parameters $p_1$, $q_1$, $p_2$, and $q_2$ are computed, one can calculate the approximate variance with Eq. (11) for each protocol. In other words, following our proposal, one has to set both the upper ($\epsilon_\infty$) and lower ($\epsilon_1$) bounds of the privacy guarantees. For example, let , one might want the first LDP report to have high privacy such as , i.e., (we will use this probability notation to set up the privacy guarantees).
Table I exhibits numerical values of the approximate variance using Eq. (11) for all longitudinal protocols with , (as in [14]), and . For values of higher than , neither LOUE nor LSOUE could satisfy some values of because of the constant in . Yet, it is not desirable to have higher values of and, thus, we did not consider values above in our analysis. Besides, Table II exhibits numerical values for nonlongitudinal GRR, OUE, and SUE protocols, which allows evaluating how utility degrades with a second step of sanitization.
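The numerical evaluation can be reproduced along the following lines. Everything specific in this sketch is an assumption made for illustration: the budgets $\epsilon_\infty = 0.5$ and $\epsilon_1 = 0.3$, the number of users $n = 10000$, the domain sizes 2 and 32, the closed forms for $p_2$ (solved from the $\epsilon_1$ expressions of LOSUE and LGRR), and the form of the approximate variance, taken as $\delta(1-\delta)/(n(p_1-q_1)^2(p_2-q_2)^2)$ with $\delta = q_1(p_2-q_2)+q_2$.

```python
import math

def approx_variance(n, p1, q1, p2, q2):
    """Approximate variance of the two-round estimator at f(v_i) = 0."""
    delta = q1 * (p2 - q2) + q2
    return delta * (1 - delta) / (n * (p1 - q1) ** 2 * (p2 - q2) ** 2)

def losue_variance(eps_inf, eps_1, n):
    """Approximate variance of LOSUE (OUE first round, symmetric UE second round)."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1, q1 = 0.5, 1 / (a + 1)
    p2 = (a * b - 1) / ((a - 1) * (b + 1))  # same value as the LOSUE closed form
    return approx_variance(n, p1, q1, p2, 1 - p2)

def lgrr_variance(eps_inf, eps_1, k, n):
    """Approximate variance of LGRR for a domain of size k."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1 = a / (a + k - 1)
    q1 = (1 - p1) / (k - 1)
    p2 = (a * b - 1) / ((k - 1) * (a - b) + a * b - 1)  # same value as the LGRR closed form
    q2 = (1 - p2) / (k - 1)
    return approx_variance(n, p1, q1, p2, q2)

if __name__ == "__main__":
    n = 10000  # assumed number of users
    print("LOSUE      :", losue_variance(0.5, 0.3, n))
    print("LGRR, k=2  :", lgrr_variance(0.5, 0.3, 2, n))
    print("LGRR, k=32 :", lgrr_variance(0.5, 0.3, 32, n))
```

Under these assumed inputs the three values come out at roughly 0.0044, 0.0011, and 0.98, which is in line with the LOSUE entry and the first two LGRR entries of the first row of Table I. It also illustrates why LGRR is attractive only for small domains.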
Privacy Guarantees | LGRR     | LGRR     | LGRR    | LUE: LOSUE | LUE: LSUE | LUE: LSOUE | LUE: LOUE
                   | 0.001103 | 0.980969 | 26706   | 0.004411   | 0.004436  | 0.005306   | 0.005549
                   | 0.000270 | 0.125036 | 3153    | 0.001078   | 0.001103  | 0.001234   | 0.001347
                   | 0.000062 | 0.006327 | 117     | 0.000247   | 0.000270  | 0.000264   | 0.000310
                   | 0.000011 | 0.000078 | 0.25903 | 0.000044   | 0.000062  | 0.000045   | 0.000057
                   | 0.001592 | 2.088372 | 60218   | 0.006367   | 0.006392  | 0.007336   | 0.007611
                   | 0.000392 | 0.268074 | 7198    | 0.001567   | 0.001592  | 0.001740   | 0.001872
                   | 0.000092 | 0.013926 | 281     | 0.000368   | 0.000392  | 0.000389   | 0.000447
                   | 0.000018 | 0.000188 | 0.74088 | 0.000072   | 0.000092  | 0.000073   | 0.000092
                   | 0.002492 | 4.530779 | 135874  | 0.009967   | 0.009992  | 0.011012   | 0.011324
                   | 0.000617 | 0.586823 | 16443   | 0.002467   | 0.002492  | 0.002658   | 0.002812
                   | 0.000148 | 0.031552 | 673     | 0.000593   | 0.000617  | 0.000617   | 0.000690
                   | 0.000032 | 0.000484 | 2.12772 | 0.000127   | 0.000148  | 0.000128   | 0.000156
                   | 0.004436 | 10       | 329836  | 0.017744   | 0.017769  | 0.018863   | 0.019214
                   | 0.001103 | 1.398568 | 40412   | 0.004411   | 0.004436  | 0.004620   | 0.004799
                   | 0.000270 | 0.078202 | 1737    | 0.001078   | 0.001103  | 0.001106   | 0.001198
                   | 0.000062 | 0.001389 | 6       | 0.000247   | 0.000270  | 0.000248   | 0.000291
                   | 0.009992 | 30       | 972656  | 0.039967   | 0.039992  | 0.041148   | 0.041536
                   | 0.002492 | 4.080052 | 120651  |