In recent years, differential privacy (DP) [1, 2] has been increasingly accepted as the current standard for data privacy [3, 4, 5, 6]. With the centralized model of DP, a trusted curator has access to compute on the entire raw data of users (e.g., the Census Bureau [7, 8]). By ‘trusted’, we mean that curators do not misuse or leak private information from individuals. However, this assumption does not always hold in real life, e.g., data breaches are all too common.
To preserve privacy at the user side, an alternative approach, namely, local differential privacy (LDP), has been formalized. With LDP, rather than trusting a data curator to hold the raw data and sanitize it to answer queries, each user applies a DP mechanism to their data before transmitting it to the data collector server. The local DP model allows collecting data in unprecedented ways and, therefore, it has led to several adoptions by industry (e.g., the Google Chrome browser, the Microsoft Windows 10 operating system, and Apple iOS and macOS).
I-B Motivation and problem statement
When collecting data in practice, one is often interested in multiple attributes of a population, i.e., multidimensional data. For instance, in crowd-sourcing applications, the server may collect both demographic information (e.g., gender, nationality) and user habits in order to develop personalized solutions for specific groups. In addition, one generally aims to collect data from the same users over time (i.e., longitudinal studies), which is essential in many situations [11, 12]. For example, two medical acts recorded at different times indicate an ongoing treatment if they were performed on the same patient, but two isolated acts if they were performed on two different patients.
So, in this paper, we focus on the problem of private frequency (or histogram) estimation of multiple attributes throughout time with LDP. Frequency estimation is a primary objective of LDP, in which the data collector (a.k.a. the aggregator) decodes all the privatized data of the users and can then estimate the number of users for each possible value. More formally, we assume there are $d$ attributes $A = \{A_1, A_2, \ldots, A_d\}$, where each attribute $A_j$ with a discrete domain has a specific number of values $k_j = |A_j|$. Each user $u_i$ for $i \in \{1, 2, \ldots, n\}$ has a tuple $v^{(i)} = (v^{(i)}_1, v^{(i)}_2, \ldots, v^{(i)}_d)$, where $v^{(i)}_j$ represents the value of attribute $A_j$ in record $v^{(i)}$. Thus, for each attribute $A_j$ at time $t$, the aggregator’s goal is to estimate a $k_j$-bins histogram, including the frequency of all values in $A_j$.
Indeed, in both longitudinal and multidimensional settings, one needs to consider the allocation of the privacy budget, which can grow extremely quickly due to the composition theorem. However, on the one hand, most of the frequency estimation literature [14, 15, 16, 17, 18, 19, 20] focuses on a single data collection (i.e., non-longitudinal studies). On the other hand, studies on collecting multidimensional data with LDP have mainly focused on other complex tasks (e.g., analytical/range queries [21, 22, 23, 24], estimating marginals [25, 26, 27, 28, 29]) and on numerical data only (e.g., [30, 31, 32, 33]).
I-C Summary of contributions
In this paper, we extend the analysis of three state-of-the-art LDP protocols, namely, generalized randomized response (GRR), optimized unary encoding (OUE), and symmetric unary encoding (SUE), for both longitudinal and multidimensional frequency estimates. On the one hand, for all three protocols, we theoretically prove that randomly sampling a single attribute per user improves data utility, which is an extension of common results in the LDP literature [34, 22, 35, 27, 36].
On the other hand, in the literature, both SUE and OUE protocols have been extended (and also applied [37, 38]) to longitudinal studies based on the concept of memoization [11, 12], i.e., L-SUE and L-OUE, respectively. However, we numerically and experimentally show that combining both protocols provides higher data utility, i.e., starting with OUE and then with SUE (L-OSUE) optimizes data utility rather than using SUE or OUE twice. In addition, we also extended GRR for longitudinal studies (i.e., L-GRR), which provides higher data utility than the other protocols based on unary encoding for attributes with small domain size.
Lastly, in a multidimensional setting with different domain sizes for each attribute, a dynamic selection of longitudinal LDP protocols is preferred. Therefore, we also proposed a new solution named Adaptive LDP for LOngitudinal and Multidimensional FREquency Estimates (ALLOMFREE), which combines all the aforementioned results. More specifically, ALLOMFREE randomly samples a single attribute to send with the whole privacy budget and adaptively selects the optimal protocol, i.e., either L-GRR or L-OSUE. To validate our proposal, we conducted a comprehensive and extensive set of experiments on four real-world open datasets. Under the same privacy guarantee, results show that ALLOMFREE consistently and considerably outperforms the state-of-the-art L-SUE and L-OUE protocols in the quality of the frequency estimates.
Paper’s Outline. The remainder of this paper is organized as follows. In Section II, we review the privacy notion that we are considering, i.e., LDP and the protocols we further analyze in this paper. In Section III, we extend the analysis of GRR, OUE, and SUE to multidimensional data collections. In Section IV we present the memoization-based framework for longitudinal data collections, the extension and analysis of longitudinal GRR and longitudinal UE-based protocols; the numerical evaluation of their performance, and we present our ALLOMFREE solution. In Section V, we present experimental results, discuss our results and limitations, and review related work. Lastly, in Section VI, we present the concluding remarks and future directions.
II Theoretical background
In this section, we briefly present the concept of privacy considered in this work, that is, LDP (Subsection II-A), and the LDP protocols we will apply in this paper and their analysis (Subsection II-B).
II-A Local differential privacy (LDP)
Local differential privacy protects an individual’s privacy during the data collection process. A formal definition of LDP is given in the following:
Definition 1 ($\epsilon$-Local Differential Privacy).
A randomized algorithm $\mathcal{A}$ satisfies $\epsilon$-LDP if, for any pair of input values $v_1, v_2 \in Domain(\mathcal{A})$ and any possible output $y$ of $\mathcal{A}$:

$$\Pr[\mathcal{A}(v_1) = y] \leq e^{\epsilon} \cdot \Pr[\mathcal{A}(v_2) = y].$$
Similar to the centralized model of DP, LDP also enjoys several important properties, e.g., immunity to post-processing ($F(\mathcal{A})$ is $\epsilon$-LDP for any function $F$) and composability. That is, combining the results from $m$ locally differentially private protocols also satisfies LDP. If these protocols are applied separately to disjoint subsets of the dataset, the result is $\max(\epsilon_1, \ldots, \epsilon_m)$-LDP (parallel composition). On the other hand, if these protocols are sequentially applied to the same dataset, the result is $\left(\sum_{i=1}^{m} \epsilon_i\right)$-LDP (sequential composition).
II-B LDP protocols
Randomized response (RR), a surveying technique proposed by Warner, has been the building block for many LDP protocols. Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $\epsilon$ be the privacy budget. We review three state-of-the-art LDP mechanisms for single-frequency estimation (a.k.a. frequency oracles) that will be used in this paper.
II-B1 Generalized randomized response (GRR)
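The k-Ary RR mechanism extends RR to the case of $k_j \geq 2$ and is also referred to as direct encoding or generalized RR (GRR) [40, 41, 27]. Throughout this paper, we use the term GRR for this LDP protocol. Given a value $v \in A_j$, GRR($v$) outputs the true value $v$ with probability $p$, and any other value $v' \in A_j$ such that $v' \neq v$ with probability $1 - p$. More formally, the perturbation function is defined as:

$$\forall y \in A_j: \quad \Pr[\mathrm{GRR}(v) = y] = \begin{cases} p = \dfrac{e^{\epsilon}}{e^{\epsilon} + k_j - 1}, & \text{if } y = v, \\[2mm] q = \dfrac{1}{e^{\epsilon} + k_j - 1}, & \text{otherwise.} \end{cases}$$

This satisfies $\epsilon$-LDP since $\frac{p}{q} = e^{\epsilon}$. To estimate the frequency $f(v_i)$ that a value $v_i$ occurs for $i \in [1, k_j]$, one calculates:

$$\hat{f}(v_i) = \frac{N_i - nq}{n(p - q)}, \tag{1}$$

in which $N_i$ is the number of times the value $v_i$ has been reported and $n$ is the total number of users. It has been shown that $\hat{f}(v_i)$ is an unbiased estimation of the true frequency $f(v_i)$, and the variance of this estimation is $Var[\hat{f}(v_i)] = \frac{q(1-q)}{n(p-q)^2} + \frac{f(v_i)(1-p-q)}{n(p-q)}$. In the case of small $f(v_i) \sim 0$, this variance is dominated by the first term, which gives the approximate variance as:

$$Var^{*}[\hat{f}(v_i)] = \frac{q(1-q)}{n(p-q)^2}. \tag{2}$$

Replacing $p = \frac{e^{\epsilon}}{e^{\epsilon} + k_j - 1}$ and $q = \frac{1-p}{k_j - 1}$ into Eq. (2), the GRR variance is calculated as:

$$Var^{*}[\hat{f}_{GRR}(v_i)] = \frac{e^{\epsilon} + k_j - 2}{n(e^{\epsilon} - 1)^2}. \tag{3}$$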
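As an illustration of the GRR client and aggregator described above, the following is a minimal sketch rather than a reference implementation. It assumes the standard GRR probabilities $p = e^{\epsilon}/(e^{\epsilon}+k-1)$ and $q = (1-p)/(k-1)$; the function names, the toy distribution, and the fixed seed are hypothetical choices made only for this example.

```python
import math
import random


def grr_params(eps, k):
    """GRR probabilities (assumed standard form): p for the true value, q per other value."""
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = (1 - p) / (k - 1)
    return p, q


def grr_perturb(value, k, eps, rng):
    """Client side: keep `value` with probability p, else pick another value uniformly."""
    p, _ = grr_params(eps, k)
    if rng.random() < p:
        return value
    other = rng.randrange(k - 1)  # index among the k - 1 remaining values
    return other if other < value else other + 1  # skip the true value


def grr_estimate(reports, k, eps):
    """Server side: (N_i - n*q) / (n*(p - q)) for every value i."""
    p, q = grr_params(eps, k)
    n = len(reports)
    counts = [0] * k
    for r in reports:
        counts[r] += 1
    return [(c - n * q) / (n * (p - q)) for c in counts]


def grr_approx_variance(eps, k, n):
    """Approximate variance q(1-q) / (n (p-q)^2), for small true frequencies."""
    p, q = grr_params(eps, k)
    return q * (1 - q) / (n * (p - q) ** 2)


rng = random.Random(0)  # fixed seed; toy data, not from the paper
k, eps, n = 4, 2.0, 20000
truth = rng.choices(range(k), weights=[0.5, 0.3, 0.15, 0.05], k=n)
reports = [grr_perturb(v, k, eps, rng) for v in truth]
estimates = grr_estimate(reports, k, eps)
```

Because $p + (k-1)q = 1$, the estimates produced this way always sum to one, although individual entries can be slightly negative for rare values.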
II-B2 Unary encoding-based
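Protocols based on unary encoding (UE) consist of transforming a value into a binary representation of it. First, a given value $v$ is encoded as $B = Encode(v)$, where $B = [0, \ldots, 0, 1, 0, \ldots, 0]$ is a $k_j$-bit array in which only the $v$-th position is set to one. Next, the bits from $B$ are flipped, depending on parameters $p$ and $q$, to generate a sanitized vector $B'$, in which:

$$\Pr[B'_i = 1] = \begin{cases} p, & \text{if } B_i = 1, \\ q, & \text{if } B_i = 0. \end{cases} \tag{4}$$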
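The proof that UE-based protocols satisfy $\epsilon$-LDP for $\epsilon = \ln\left(\frac{p(1-q)}{(1-p)q}\right)$ is known in the literature and can be found in [11, 14]. The literature presents two ways of selecting the probabilities $p$ and $q$, which determine the protocol variance. One well-known UE-based protocol is the Basic One-time RAPPOR, referred to as symmetric UE (SUE), which selects $p = \frac{e^{\epsilon/2}}{e^{\epsilon/2} + 1}$ and $q = \frac{1}{e^{\epsilon/2} + 1}$, where $p + q = 1$ (symmetric). The estimated frequency $\hat{f}(v_i)$ that a value $v_i$ occurs for $i \in [1, k_j]$ is also calculated using Eq. (1). Replacing $p$ and $q$ into Eq. (2), the SUE variance is calculated as:

$$Var^{*}[\hat{f}_{SUE}(v_i)] = \frac{e^{\epsilon/2}}{n(e^{\epsilon/2} - 1)^2}. \tag{5}$$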
Moreover, rather than selecting $p$ and $q$ to be symmetric, Wang et al. proposed optimized UE (OUE), which selects parameters $p = \frac{1}{2}$ and $q = \frac{1}{e^{\epsilon} + 1}$ that minimize the variance of UE-based protocols while still satisfying $\epsilon$-LDP. Similarly, the estimation method used in Eq. (1) equally applies to OUE. Replacing $p$ and $q$ into Eq. (2), the OUE variance is calculated as:

$$Var^{*}[\hat{f}_{OUE}(v_i)] = \frac{4e^{\epsilon}}{n(e^{\epsilon} - 1)^2}. \tag{6}$$
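The two UE parameterizations can be compared with a short sketch. It assumes the standard SUE choice ($p = e^{\epsilon/2}/(e^{\epsilon/2}+1)$, $q = 1-p$) and OUE choice ($p = 1/2$, $q = 1/(e^{\epsilon}+1)$); all function names and the toy data are hypothetical.

```python
import math
import random


def sue_params(eps):
    """Symmetric UE (assumed standard form): p + q = 1."""
    p = math.exp(eps / 2) / (math.exp(eps / 2) + 1)
    return p, 1 - p


def oue_params(eps):
    """Optimized UE (assumed standard form): p fixed to 1/2."""
    return 0.5, 1 / (math.exp(eps) + 1)


def ue_perturb(value, k, p, q, rng):
    """One-hot encode `value`, keep the 1-bit with prob. p, set each 0-bit with prob. q."""
    return [int(rng.random() < (p if i == value else q)) for i in range(k)]


def ue_estimate(reports, k, p, q):
    """Per-bit estimator (N_i - n*q) / (n*(p - q))."""
    n = len(reports)
    counts = [sum(rep[i] for rep in reports) for i in range(k)]
    return [(c - n * q) / (n * (p - q)) for c in counts]


def ue_epsilon(p, q):
    """Privacy level of a UE protocol: ln(p(1-q) / ((1-p)q))."""
    return math.log(p * (1 - q) / ((1 - p) * q))


def approx_variance(p, q, n):
    """Approximate variance q(1-q) / (n (p-q)^2)."""
    return q * (1 - q) / (n * (p - q) ** 2)


rng = random.Random(1)  # fixed seed; toy data, not from the paper
k, eps, n = 5, 1.0, 20000
truth = rng.choices(range(k), weights=[0.4, 0.3, 0.15, 0.1, 0.05], k=n)
p, q = oue_params(eps)
reports = [ue_perturb(v, k, p, q, rng) for v in truth]
estimates = ue_estimate(reports, k, p, q)
```

For any $\epsilon > 0$, `approx_variance(*oue_params(eps), n)` is smaller than `approx_variance(*sue_params(eps), n)`, which is the sense in which OUE is "optimized".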
III Multidimensional Frequency Estimates with LDP
In the literature, there are a few works on collecting multidimensional data with LDP based on random sampling (i.e., dividing users into groups) [30, 31, 32, 33, 14, 36]. This technique reduces both dimensionality and communication costs, which will also be the focus of this paper. Let $d$ be the total number of attributes, $\mathbf{k} = [k_1, k_2, \ldots, k_d]$ be the domain size of each attribute, $n$ be the number of users, and $\epsilon$ be the privacy budget. An intuitive solution (Spl) is splitting the privacy budget, i.e., assigning $\epsilon/d$ for each attribute. The other solution (Smp) is based on uniformly sampling (without replacement) only $r$ attribute(s) out of $d$ possible ones, i.e., assigning $\epsilon/r$ per attribute. Notice that both solutions satisfy $\epsilon$-LDP according to the sequential composition theorem.
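For the first case, Spl, the variances ($Var^{*}_{Spl}$) of GRR, SUE, and OUE are, respectively:

$$Var^{*}_{Spl}[GRR] = \frac{e^{\epsilon/d} + k_j - 2}{n(e^{\epsilon/d} - 1)^2}, \quad Var^{*}_{Spl}[SUE] = \frac{e^{\epsilon/2d}}{n(e^{\epsilon/2d} - 1)^2}, \quad Var^{*}_{Spl}[OUE] = \frac{4e^{\epsilon/d}}{n(e^{\epsilon/d} - 1)^2}. \tag{7}$$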
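For the second case, Smp, the number of users per attribute is reduced to $nr/d$. Thus, the variances ($Var^{*}_{Smp}$) of GRR, SUE, and OUE are, respectively:

$$Var^{*}_{Smp}[GRR] = \frac{d(e^{\epsilon/r} + k_j - 2)}{nr(e^{\epsilon/r} - 1)^2}, \quad Var^{*}_{Smp}[SUE] = \frac{d\,e^{\epsilon/2r}}{nr(e^{\epsilon/2r} - 1)^2}, \quad Var^{*}_{Smp}[OUE] = \frac{4d\,e^{\epsilon/r}}{nr(e^{\epsilon/r} - 1)^2}. \tag{8}$$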
Notice that if $r = d$ in Eq. (8), one achieves Eq. (7). Practically, the objective is reduced to finding $r$, which minimizes the variance for each protocol. This way, to find the optimal $r$ for each protocol, we first multiply each variance in Eq. (8) by $\epsilon$. Without loss of generality, and up to the constant factor $d/n$, minimizing each variance is then equivalent to minimizing a function of the ratio $r/\epsilon$ only. Hence, let $x = r/\epsilon$ be the independent variable; each rescaled variance can be rewritten as a function over $x$ (for GRR, for instance, $\frac{1}{x} \cdot \frac{e^{1/x} + k_j - 2}{(e^{1/x} - 1)^2}$). It is not hard to prove that these are increasing functions w.r.t. $x$ and, hence, we have a minimum and optimal when $r = 1$ (a single attribute per user) for all three protocols. We highlight that this is a common result in the LDP literature obtained for different protocols and contexts [30, 31, 33, 14, 22, 35, 34, 42].
Therefore, in this paper, we adopt the multidimensional setting Smp with $r = 1$. In this setting, each user tells the data collector which attribute was sampled and sends its perturbed value, obtained by applying either GRR or a UE-based protocol with $\epsilon$-LDP; the data analyst server would not receive any information about the remaining attributes.
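The effect of the number of sampled attributes can be checked numerically. The sketch below assumes the standard single-attribute approximate variances of GRR, SUE, and OUE and models Smp as "budget $\epsilon/r$ per attribute, $nr/d$ users per attribute"; the values of $\epsilon$, $n$, $d$, and $k$ are hypothetical.

```python
import math


def var_grr(eps, n, k):
    """Assumed GRR approximate variance: (e^eps + k - 2) / (n (e^eps - 1)^2)."""
    return (math.exp(eps) + k - 2) / (n * (math.exp(eps) - 1) ** 2)


def var_sue(eps, n):
    """Assumed SUE approximate variance: e^(eps/2) / (n (e^(eps/2) - 1)^2)."""
    return math.exp(eps / 2) / (n * (math.exp(eps / 2) - 1) ** 2)


def var_oue(eps, n):
    """Assumed OUE approximate variance: 4 e^eps / (n (e^eps - 1)^2)."""
    return 4 * math.exp(eps) / (n * (math.exp(eps) - 1) ** 2)


def smp_variance(var_fn, eps, n, d, r):
    """Each user reports r of d attributes: eps/r per attribute, n*r/d users each."""
    return var_fn(eps / r, n * r / d)


eps, n, d, k = 1.0, 10000, 5, 10  # hypothetical setting
protocols = {
    "GRR": lambda e, m: var_grr(e, m, k),
    "SUE": var_sue,
    "OUE": var_oue,
}
# One variance curve per protocol, for r = 1, ..., d (r = d coincides with Spl).
curves = {
    name: [smp_variance(fn, eps, n, d, r) for r in range(1, d + 1)]
    for name, fn in protocols.items()
}
best_r = {name: 1 + curve.index(min(curve)) for name, curve in curves.items()}
```

In this setting every curve increases with $r$, so `best_r` maps each protocol to 1, in line with sampling a single attribute per user.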
IV Longitudinal Frequency Estimates with LDP
In this section, we present the memoization-based framework for longitudinal data collections (Subsection IV-A). Next, we present the analysis of longitudinal GRR (Subsection IV-B) and longitudinal UE-based protocols (Subsection IV-C). Lastly, we evaluate numerically the extended longitudinal protocols (Subsection IV-D) and we propose our ALLOMFREE solution (Subsection IV-E).
IV-A Memoization-based data collection with LDP
In the literature, many works study how to collect and analyze categorical data longitudinally based on memoization [11, 12, 34]. The key idea behind memoization is using two sanitization processes. The first round replaces the real value $B$ with a sanitized one $B'$ with a higher epsilon ($\epsilon_{\infty}$). Whenever one intends to report $B$, $B'$ shall be reused to produce other sanitized versions $B''$ with lower epsilon values. Notice that the second round of sanitization is a must to avoid ‘averaging attacks’, in which adversaries can reconstruct the memoized value from multiple sanitized versions of it. This technique allows achieving privacy over time with an upper bound value of $\epsilon_{\infty}$-LDP.
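Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $\epsilon$ be the privacy budget. In this paper, for both rounds of sanitization, we will apply either GRR, SUE, or OUE. The unbiased estimator in Eq. (1) for the frequency $f(v_i)$ of each value $v_i$ for $i \in [1, k_j]$ is now extended to:

$$\hat{f}_L(v_i) = \frac{N_i - nq_1(p_2 - q_2) - nq_2}{n(p_1 - q_1)(p_2 - q_2)}, \tag{9}$$

in which $N_i$ is the number of times the value $v_i$ has been reported, $n$ is the total number of users, $p_1$ and $q_1$ are the parameters used by an LDP protocol for the first round, and $p_2$ and $q_2$ are the parameters used by an LDP protocol for the second round.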
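The estimation result in Eq. (9) is an unbiased estimation of $f(v_i)$ for any value $v_i \in A_j$.

Let us focus on $E[N_i]$. After the two rounds, a user whose true value is $v_i$ reports it with probability $p_1 p_2 + (1 - p_1) q_2$, and any other user reports $v_i$ with probability $q_1 p_2 + (1 - q_1) q_2$. Hence,

$$E[N_i] = n f(v_i)\left(p_1 p_2 + (1 - p_1) q_2\right) + n(1 - f(v_i))\left(q_1 p_2 + (1 - q_1) q_2\right) = n f(v_i)(p_1 - q_1)(p_2 - q_2) + n q_1 (p_2 - q_2) + n q_2,$$

and substituting into Eq. (9) gives $E[\hat{f}_L(v_i)] = f(v_i)$.

The variance of the estimation in Eq. (9) is:

$$Var[\hat{f}_L(v_i)] = \frac{\gamma(1 - \gamma)}{n(p_1 - q_1)^2(p_2 - q_2)^2}, \quad \text{where } \gamma = f(v_i)(p_1 - q_1)(p_2 - q_2) + q_1(p_2 - q_2) + q_2. \tag{10}$$

Thanks to Eq. (9) we have

$$Var[\hat{f}_L(v_i)] = \frac{Var[N_i]}{n^2(p_1 - q_1)^2(p_2 - q_2)^2}.$$

Since $N_i$ is the number of times the value $v_i$ is observed, it can be defined as $N_i = \sum_{z=1}^{n} X_z$, where $X_z$ is equal to 1 if the user $z$, $1 \leq z \leq n$, reports value $v_i$, and 0 otherwise. We thus have $Var[N_i] = Var\left[\sum_{z=1}^{n} X_z\right]$. Since all the users are independent,

$$Var[N_i] = \sum_{z=1}^{n} Var[X_z] = n\,Var[X], \quad \text{with } \Pr[X = 1] = \gamma \text{ for a user drawn from the population.}$$

We thus have $Var[X] = \gamma(1 - \gamma)$ and, finally,

$$Var[\hat{f}_L(v_i)] = \frac{\gamma(1 - \gamma)}{n(p_1 - q_1)^2(p_2 - q_2)^2}.$$

In this work, we will use the approximate variance, in which $f(v_i) = 0$ in Eq. (10), which gives:

$$Var^{*}[\hat{f}_L(v_i)] = \frac{\left(q_1 p_2 + (1 - q_1) q_2\right)\left(1 - q_1 p_2 - (1 - q_1) q_2\right)}{n(p_1 - q_1)^2(p_2 - q_2)^2}. \tag{11}$$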
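The two-round estimator and its approximate variance can be sketched as follows. The sketch assumes that chaining the two rounds gives a report probability of $p_1 p_2 + (1-p_1) q_2$ for users holding the value and $q_1 p_2 + (1-q_1) q_2$ for all other users; the function names and the numeric parameters are hypothetical.

```python
def chained_probabilities(p1, q1, p2, q2):
    """Probability of reporting value i after both rounds (assumed chaining)."""
    holds = p1 * p2 + (1 - p1) * q2   # user's true value is i
    other = q1 * p2 + (1 - q1) * q2   # user's true value is not i
    return holds, other


def longitudinal_estimate(count, n, p1, q1, p2, q2):
    """Estimator (N_i - n*q1*(p2 - q2) - n*q2) / (n*(p1 - q1)*(p2 - q2))."""
    return (count - n * q1 * (p2 - q2) - n * q2) / (n * (p1 - q1) * (p2 - q2))


def longitudinal_approx_variance(n, p1, q1, p2, q2):
    """Approximate variance, i.e., the variance evaluated at a true frequency of 0."""
    _, g = chained_probabilities(p1, q1, p2, q2)
    return g * (1 - g) / (n * (p1 - q1) ** 2 * (p2 - q2) ** 2)


# Hypothetical parameters, chosen only to exercise the formulas.
n, f = 1000, 0.3
p1, q1, p2, q2 = 0.75, 0.25, 0.8, 0.2
holds, other = chained_probabilities(p1, q1, p2, q2)
expected_count = n * (f * holds + (1 - f) * other)
recovered = longitudinal_estimate(expected_count, n, p1, q1, p2, q2)
```

Feeding the expected count back into the estimator returns the true frequency `f` (up to floating-point error), which is the unbiasedness property. Setting `p2 = 1` and `q2 = 0` disables the second round and reduces both functions to their single-round counterparts.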
IV-B Longitudinal GRR (L-GRR): definition and $\epsilon$-LDP study
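Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $B = v$ be the real value. We now describe an extension of GRR for longitudinal studies; we refer to this protocol as L-GRR for the rest of this paper. First, $Encode(v) = v$ (direct encoding). Next, there are two rounds of sanitization, both applying GRR, described in the following.

First round: Memoize a value $B'$ such that

$$B' = \begin{cases} v, & \text{with probability } p_1, \\ v' \in A_j \setminus \{v\}, & \text{with probability } q_1 = \frac{1 - p_1}{k_j - 1} \text{ each,} \end{cases}$$

in which $p_1$ and $q_1$ control the level of longitudinal $\epsilon_{\infty}$-LDP. The value $B'$ shall be reused as the basis for all future reports on the real value $v$.

Second round: Generate a report $B''$ such that

$$B'' = \begin{cases} B', & \text{with probability } p_2, \\ v' \in A_j \setminus \{B'\}, & \text{with probability } q_2 = \frac{1 - p_2}{k_j - 1} \text{ each,} \end{cases}$$

in which $B''$ is the report to be sent to the server.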
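Visually, Fig. 1 illustrates the probability tree of the L-GRR protocol. In the first round of sanitization, our proposed L-GRR applies GRR with $p_1 = \frac{e^{\epsilon_{\infty}}}{e^{\epsilon_{\infty}} + k_j - 1}$ and $q_1 = \frac{1 - p_1}{k_j - 1}$ (underlined in the middle of Fig. 1), where $k_j = |A_j|$. As discussed in Subsection II-B1, this permanent memoization satisfies $\epsilon_{\infty}$-LDP since $\frac{p_1}{q_1} = e^{\epsilon_{\infty}}$, which is the upper bound.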
On the other hand, with a single collection of data, the attacker’s knowledge of $B$ comes only from $B''$, which is generated using two randomization steps with GRR. This provides a higher level of privacy protection. From Fig. 1, we can obtain the following conditional probabilities:
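Let $p_s$ and $q_s$ denote these two conditional probabilities (underlined in far right of Fig. 1). With the second round of sanitization, our proposed L-GRR protocol satisfies $\epsilon_1$-LDP since $\frac{p_s}{q_s} = e^{\epsilon_1}$. Notice that $\epsilon_1$ corresponds to a single report (lower bound) and its extension to infinity reports is limited by $\epsilon_{\infty}$ (upper bound) since the second round uses as input the output of the first round. More specifically, the calculus of $\epsilon_1$ for L-GRR is:
in which $p_1 = \frac{e^{\epsilon_{\infty}}}{e^{\epsilon_{\infty}} + k_j - 1}$, $q_1 = \frac{1 - p_1}{k_j - 1}$, and both $p_2$ and $q_2$ are selectable according to $\epsilon_{\infty}$, $\epsilon_1$, and $k_j$, calculated as: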
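A client following the two rounds above can be sketched as below. This is a minimal sketch under a stated assumption: the single-report budget is taken to be $\epsilon_1 = \ln\frac{p_1 p_2 + q_1 q_2}{p_1 q_2 + q_1 p_2}$ (a two-branch reading of the probability tree), and the closed form for $p_2$ is what one obtains by solving this ratio with $q_2 = (1-p_2)/(k-1)$. The class name, method names, and numeric settings are hypothetical.

```python
import math
import random


def lgrr_params(eps_inf, eps_1, k):
    """p1, q1: GRR with eps_inf. p2, q2: solve the assumed eps_1 ratio (see lead-in)."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1 = a / (a + k - 1)
    q1 = (1 - p1) / (k - 1)
    p2 = (a * b - 1) / (-k * b + (k - 1) * a + b + a * b - 1)
    q2 = (1 - p2) / (k - 1)
    return p1, q1, p2, q2


class LGRRClient:
    """Memoizes the first-round output per true value and re-randomizes it per report."""

    def __init__(self, k, eps_inf, eps_1, seed=None):
        self.k = k
        self.p1, _, self.p2, _ = lgrr_params(eps_inf, eps_1, k)
        self.rng = random.Random(seed)
        self.memo = {}

    def _grr(self, value, p):
        # Keep `value` with probability p, otherwise pick another value uniformly.
        if self.rng.random() < p:
            return value
        other = self.rng.randrange(self.k - 1)
        return other if other < value else other + 1

    def report(self, value):
        if value not in self.memo:                    # first round, done once
            self.memo[value] = self._grr(value, self.p1)
        return self._grr(self.memo[value], self.p2)  # second round, every report


client = LGRRClient(k=4, eps_inf=2.0, eps_1=1.0, seed=1)
reports = [client.report(2) for _ in range(10)]
```

The memoized value is drawn only once per true value, so repeated calls to `report(2)` keep randomizing the same first-round output; this is what bounds the long-term leakage by $\epsilon_{\infty}$. The closed form requires $0 < \epsilon_1 < \epsilon_{\infty}$.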
IV-C Longitudinal UE (L-UE): definition and $\epsilon$-LDP study
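We now describe UE-based protocols for longitudinal studies; we refer to this protocol as L-UE for the rest of this paper. Let $A_j = \{v_1, v_2, \ldots, v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $v$ be the real value. First, $B = Encode(v)$ (unary encoding), where $B = [0, \ldots, 0, 1, 0, \ldots, 0]$ is a $k_j$-bit array in which only the $v$-th position is set to one. Next, there are two rounds of sanitization, both applying UE-based protocols, described in the following.

First round: For each bit $i$, $1 \leq i \leq k_j$, in $B$, memoize a value $B'_i$ such that

$$\Pr[B'_i = 1] = \begin{cases} p_1, & \text{if } B_i = 1, \\ q_1, & \text{if } B_i = 0, \end{cases}$$

in which $p_1$ and $q_1$ control the level of longitudinal $\epsilon_{\infty}$-LDP. The value $B'$ shall be reused as the basis for all future reports on the real value $v$.

Second round: For each bit $i$, $1 \leq i \leq k_j$, in $B'$, generate a report $B''_i$ such that

$$\Pr[B''_i = 1] = \begin{cases} p_2, & \text{if } B'_i = 1, \\ q_2, & \text{if } B'_i = 0, \end{cases}$$

in which $B''$ is the report to be sent to the server.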
Visually, Fig. 2 illustrates the probability tree of the L-UE protocol. One natural question emerges: how to select the parameters $p_1$, $q_1$, $p_2$, and $q_2$ in order to optimize the utility of this L-UE protocol? One can see the first round as a permanent sanitization and the second round as a ‘small’ perturbation to avoid averaging attacks and keep privacy over time.
Based on SUE and OUE, we are then left with four options: two popular solutions that strictly use only OUE or SUE parameters in both sanitization steps and two proposed settings that combine both OUE and SUE. These four L-UE protocols are summarized below:
I. both sanitizations with OUE (L-OUE);
II. both sanitizations with SUE (L-SUE);
III. starting with OUE and then with SUE (L-OSUE);
IV. starting with SUE and then with OUE (L-SOUE).
The OUE variance in Eq. (6) is smaller than the SUE variance in Eq. (5) and, therefore, the former can provide higher utility than the latter for the first round of sanitization. On the other hand, we argue that OUE might be too strict for the second round since its parameter $p = \frac{1}{2}$ is constant. Thus, we hypothesize that option III (i.e., L-OSUE) is the most suitable one. Without loss of generality, the following analyses are done only for L-OSUE, and can be easily extended to any of the other combinations.
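In the first round of sanitization, our solution L-OSUE applies OUE with $p_1 = \frac{1}{2}$ and $q_1 = \frac{1}{e^{\epsilon_{\infty}} + 1}$ (underlined in the middle of Fig. 2). As discussed in Section II-B2, this permanent memoization satisfies $\epsilon_{\infty}$-LDP since $\frac{p_1(1 - q_1)}{(1 - p_1)q_1} = e^{\epsilon_{\infty}}$, which is the upper bound.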
Following the same development as for L-GRR, with a single collection of data, the attacker’s knowledge of $B$ comes only from $B''$, which is generated using two randomization steps with OUE and SUE, respectively. This provides a higher level of privacy protection. From Fig. 2, we can obtain the following conditional probabilities according to each bit $i \in [1, k_j]$:

$$\Pr[B''_i = 1 \mid B_i = 1] = p_1 p_2 + (1 - p_1) q_2, \qquad \Pr[B''_i = 1 \mid B_i = 0] = q_1 p_2 + (1 - q_1) q_2.$$
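Let $p_s = \Pr[B''_i = 1 \mid B_i = 1]$ and $q_s = \Pr[B''_i = 1 \mid B_i = 0]$ (underlined in far right of Fig. 2). With the second round of sanitization, our proposed L-OSUE protocol satisfies $\epsilon_1$-LDP since $\frac{p_s(1 - q_s)}{(1 - p_s)q_s} = e^{\epsilon_1}$. Notice that $\epsilon_1$ corresponds to a single report (lower bound) and its extension to infinity reports is limited by $\epsilon_{\infty}$ (upper bound) since the second round uses as input the output of the first round. More specifically, the calculus of $\epsilon_1$ for L-OSUE (or L-UE protocols in general) is:

$$\epsilon_1 = \ln\left(\frac{\left(p_1 p_2 + (1 - p_1) q_2\right)\left(1 - q_1 p_2 - (1 - q_1) q_2\right)}{\left(q_1 p_2 + (1 - q_1) q_2\right)\left(1 - p_1 p_2 - (1 - p_1) q_2\right)}\right), \tag{14}$$

in which, for L-OSUE, we have $p_1 = \frac{1}{2}$, $q_1 = \frac{1}{e^{\epsilon_{\infty}} + 1}$, and both $p_2$ and $q_2$ are symmetric ($p_2 + q_2 = 1$) and selectable according to $\epsilon_{\infty}$ and $\epsilon_1$, calculated as:

$$p_2 = \frac{1 - e^{\epsilon_1 + \epsilon_{\infty}}}{e^{\epsilon_1} - e^{\epsilon_{\infty}} - e^{\epsilon_1 + \epsilon_{\infty}} + 1}, \qquad q_2 = 1 - p_2. \tag{15}$$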
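The L-OSUE parameter selection can be checked numerically. The sketch below assumes OUE parameters in the first round, a symmetric second round ($p_2 + q_2 = 1$), and the closed form for $p_2$ that follows from solving the single-report ratio for $\epsilon_1$; function names and numeric settings are hypothetical.

```python
import math
import random


def losue_params(eps_inf, eps_1):
    """First round: OUE with eps_inf. Second round: symmetric, tuned to eps_1."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1, q1 = 0.5, 1 / (a + 1)
    p2 = (1 - a * b) / (b - a - a * b + 1)
    q2 = 1 - p2
    return p1, q1, p2, q2


def ue_epsilon(p, q):
    """Privacy level of a per-bit UE mechanism: ln(p(1-q) / ((1-p)q))."""
    return math.log(p * (1 - q) / ((1 - p) * q))


def single_report_epsilon(p1, q1, p2, q2):
    """Privacy level of one report, obtained by chaining both rounds per bit."""
    ps = p1 * p2 + (1 - p1) * q2   # Pr[reported bit = 1 | true bit = 1]
    qs = q1 * p2 + (1 - q1) * q2   # Pr[reported bit = 1 | true bit = 0]
    return ue_epsilon(ps, qs)


def flip_bits(bits, p, q, rng):
    """Keep each 1-bit with probability p, set each 0-bit with probability q."""
    return [int(rng.random() < (p if bit == 1 else q)) for bit in bits]


rng = random.Random(7)  # fixed seed; toy example
k, value = 6, 3
p1, q1, p2, q2 = losue_params(eps_inf=2.0, eps_1=1.0)
encoded = [int(i == value) for i in range(k)]
memoized = flip_bits(encoded, p1, q1, rng)                       # first round, once
reports = [flip_bits(memoized, p2, q2, rng) for _ in range(3)]  # second round, per report
```

With these parameters, `single_report_epsilon(p1, q1, p2, q2)` evaluates to the requested `eps_1`, while `ue_epsilon(p1, q1)` evaluates to `eps_inf`. The closed form requires $0 < \epsilon_1 < \epsilon_{\infty}$.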
IV-D Numerical evaluation of L-GRR and L-UE protocols
In this subsection, we numerically evaluate the approximate variance of all developed longitudinal protocols, namely, L-GRR and the four UE-based options, L-OUE, L-SUE, L-OSUE, and L-SOUE. As mentioned above, once both the $\epsilon_{\infty}$ and $\epsilon_1$ privacy guarantees are defined, one can obtain the parameters $p_1$ and $q_1$ depending on $\epsilon_{\infty}$, and the parameters $p_2$ and $q_2$ depending on both $\epsilon_{\infty}$ and $\epsilon_1$ (and the domain size $k_j$ for L-GRR) as given in Eq. (13) for L-GRR and in Eq. (15) for L-OSUE.
Next, once the parameters $p_1$, $q_1$, $p_2$, and $q_2$ are computed, one can calculate the approximate variance with Eq. (11) for each protocol. In other words, following our proposal, one has to set both the upper ($\epsilon_{\infty}$) and lower ($\epsilon_1$) bounds of the privacy guarantees. For example, let , one might want the first $\epsilon_1$-LDP report to have high privacy such as , i.e., (we will use this probability notation to set up the privacy guarantees).
Table I exhibits numerical values of the approximate variance using Eq. (11) for all longitudinal protocols with , (as in ), and . For values of higher than , neither L-OUE nor L-SOUE could satisfy some values of because of the constant in . Yet, it is not desirable to have higher values of and, thus, we did not consider values above in our analysis. Besides, Table II exhibits numerical values for non-longitudinal GRR, OUE, and SUE protocols, which allows evaluating how utility degrades with a second step of sanitization.
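The kind of comparison described in this subsection can be reproduced in spirit with a small sketch. It does not regenerate Table I: the budgets and the number of users below are hypothetical, the L-GRR closed form assumes $\epsilon_1 = \ln\frac{p_1 p_2 + q_1 q_2}{p_1 q_2 + q_1 p_2}$, and the approximate variance assumes the two-round form with a true frequency of zero.

```python
import math


def lgrr_params(eps_inf, eps_1, k):
    """L-GRR parameters under the assumed eps_1 ratio (see lead-in)."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1 = a / (a + k - 1)
    q1 = (1 - p1) / (k - 1)
    p2 = (a * b - 1) / (-k * b + (k - 1) * a + b + a * b - 1)
    q2 = (1 - p2) / (k - 1)
    return p1, q1, p2, q2


def losue_params(eps_inf, eps_1):
    """L-OSUE parameters: OUE first, symmetric second round."""
    a, b = math.exp(eps_inf), math.exp(eps_1)
    p1, q1 = 0.5, 1 / (a + 1)
    p2 = (1 - a * b) / (b - a - a * b + 1)
    return p1, q1, p2, 1 - p2


def approx_variance(n, p1, q1, p2, q2):
    """Two-round approximate variance (true frequency set to 0)."""
    g = q1 * p2 + (1 - q1) * q2
    return g * (1 - g) / (n * (p1 - q1) ** 2 * (p2 - q2) ** 2)


n, eps_inf, eps_1 = 10000, 2.0, 1.0  # hypothetical setting, not the paper's
var_losue = approx_variance(n, *losue_params(eps_inf, eps_1))
var_lgrr = {
    k: approx_variance(n, *lgrr_params(eps_inf, eps_1, k))
    for k in (2, 4, 16, 100)
}
# Adaptive choice per domain size: keep whichever protocol has the lower variance.
choice = {k: "L-GRR" if v < var_losue else "L-OSUE" for k, v in var_lgrr.items()}
```

In this hypothetical setting, L-GRR has the lower approximate variance for the two smallest domains and L-OSUE for the two largest, which is the behavior an adaptive per-attribute selection relies on. The L-OSUE variance does not depend on the domain size.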