 # Answering Summation Queries for Numerical Attributes under Differential Privacy

In this work we explore the problem of answering a set of sum queries under Differential Privacy. This is a little-understood, non-trivial problem, especially in the case of numerical domains. We show that traditional techniques from the literature are not always the best choice and that a more rigorous approach is necessary to develop low-error algorithms.


## 1 Introduction

In recent years, Differential Privacy (DP) [DiffPriv] has emerged as the de-facto privacy standard for sensitive data analysis. Informally, the output of a DP mechanism does not change significantly under the presence or absence of any single tuple in the dataset. The privacy loss is captured by the privacy parameter ε, also referred to as the privacy budget. DP algorithms usually work by adding noise to query results. This noise is calibrated to ε and the query sensitivity, i.e., the maximum change in the query answer upon the deletion/addition of a single row in the dataset.
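As a concrete illustration (not taken from the paper; function and variable names are ours), the standard Laplace mechanism for a sensitivity-1 counting query can be sketched as:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=np.random.default_rng()):
    """Release true_answer plus Laplace noise with scale sensitivity/epsilon."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query has sensitivity 1: adding/removing one row changes it by at most 1.
salaries = [28_000, 45_000, 52_000, 1_000_000]
count_under_50k = sum(1 for s in salaries if s < 50_000)
noisy_count = laplace_mechanism(count_under_50k, sensitivity=1, epsilon=1.0)
```

A sum query would instead need `sensitivity` set to the largest possible tuple value, which is exactly the problem this work addresses.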

In this work we focus on privately releasing sum queries over numerical attributes. A sum query involves adding the values of all records meeting certain criteria. These queries can be a powerful tool for data analysts to gain insight about a dataset. In Example 1 we present a use case for sum queries.

###### Example 1

Consider the dataset of Fig. 1 and a social scientist querying it for insight on income distributions. More specifically, the scientist queries the database for the sum of salaries of people with income less than $30k, $40k, $50k, and so on up to $1M.

Answering even a single sum query under DP proves to be challenging due to its high sensitivity: the addition/removal of a single row can have a dramatic effect on the query answer. In Example 1, the sensitivity of the final sum query equals the largest salary in the domain ($1M). One way to reduce this sensitivity is via a truncation operator, which clips the data such that any value above a certain threshold contributes only the threshold value to the query answer. However, truncation techniques introduce bias to the final answer, even in the absence of any noise mechanism.

In this work we focus on answering a batch of sum queries under a common privacy budget. This problem is challenging for two reasons. First, prior work on batch query answering, such as the matrix mechanism, Identity, and Workload [DPBench, Matrix], focuses on workloads of queries with similar sensitivities. For example, Workload applies noise proportional to the query with the worst sensitivity to all of the queries.

Second, although post-processing techniques from prior work [Hay2010] have been shown to dramatically reduce the final error for a workload of queries, it is not clear how such techniques fare when each of the post-processed noisy answers carries a different bias (e.g., noisy answers computed with different truncation thresholds).

Our main contributions are as follows:

• We introduce methods of implementing truncation on sum queries, effectively reducing their sensitivity.

• In Section 3.2, we propose two new DP algorithms (TiMM and TaMM) for answering a batch of sum queries under the same privacy budget.

• In Section 4 we conduct a study on a U.S. Census dataset where we: (a) highlight the importance of truncation for sum queries and (b) evaluate the performance of our proposed algorithms with TaMM offering the overall best performance.

• We explore the effects of post-processing for noisy answers that are heterogeneously biased.

## 2 Background

Data Representation We consider databases where each tuple corresponds to a single individual and has a single numerical attribute. More specifically, a database instance is a multiset of records drawn from a numerical domain. For instance, the database of Example 1 consists of records drawn from the domain of natural numbers.

Many differentially private algorithms use the vector form of a database. More specifically, given a set of buckets over the domain, the original multiset of records is transformed to a vector of counts x, where each entry of x is the number of individuals whose value falls in the corresponding bucket. For brevity, in the remainder we use x to refer to the vector form of a database instance.

Prefix Sums As noted earlier, in this work we focus on summation queries over numerical attributes. More specifically, for a database instance we consider the prefix sum query, parameterized by a threshold θ.

The query returns the sum of the values of all tuples with value less than or equal to the threshold θ. We also consider sets of prefix sum queries, where the queries have increasing thresholds. In the motivating Example 1, the query “Total cost to company for employees with salary at most a given threshold” is encoded by a single prefix query, and the full workload asked is encoded by the set of prefix queries over all the thresholds. A set of prefix sum queries can be vectorized as a 0-1 lower triangular matrix W, which we call the workload matrix.

Sum queries on the vector form require every bucket to have a weight (e.g., a representative value for that bucket). We define the weight matrix D as a diagonal matrix whose diagonal entries are the bucket weights. The true answer to the workload is thus WDx, and we call WD the weighted workload matrix.
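To make the weighted-workload notation concrete, here is a minimal sketch (the bucket weights and counts are illustrative, not from the paper): W is the 0-1 lower-triangular prefix workload, D holds the bucket weights, x is the vector of counts, and WDx gives the true prefix-sum answers.

```python
import numpy as np

weights = np.array([10, 20, 30, 40])   # weight of each bucket (illustrative)
x = np.array([3, 1, 0, 2])             # counts per bucket: vector form of the DB

p = len(weights)
W = np.tril(np.ones((p, p)))           # 0-1 lower-triangular prefix workload
D = np.diag(weights)                   # diagonal weight matrix

true_answers = W @ D @ x               # prefix sums: 30, 50, 50, 130
```

Each entry of `true_answers` is the sum of weighted counts over all buckets up to that index, i.e., one prefix sum query.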

Sparse Vector Technique (SVT) [dwork2014:textbook] is a differentially private mechanism that reports whether the answers in a sequence of queries lie above a chosen threshold. SVT takes as input a sequence of queries and a threshold, and outputs the first query whose answer lies above the threshold.
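A sketch of the classic AboveThreshold variant of SVT for sensitivity-1 queries (the noise scales follow the standard textbook presentation; precomputed query answers are passed in for simplicity):

```python
import numpy as np

def above_threshold(query_answers, threshold, epsilon, rng=np.random.default_rng()):
    """Return the index of the first query whose noisy answer exceeds the
    noisy threshold, or None if no query does."""
    noisy_threshold = threshold + rng.laplace(scale=2.0 / epsilon)
    for i, answer in enumerate(query_answers):
        if answer + rng.laplace(scale=4.0 / epsilon) >= noisy_threshold:
            return i
    return None
```

Only the index of the first above-threshold query is released; the individual noisy answers are never output, which is what allows SVT to answer many queries under one budget.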

Recursive Mechanism [Chen13:recursive] is an algorithm for answering monotone SQL-like counting queries of high sensitivity. It internally finds a threshold to reduce the sensitivity of the query and then constructs a recursive sequence of lower-sensitivity queries which can be used to approximate the input query.

Matrix Mechanism (MM) [Li15matrix] is a differentially private mechanism for privately answering batches of queries. MM takes as input a workload of queries W in matrix form and a database in vector form. It computes a differentially private answer to W by measuring a different set of strategy queries A and reconstructing answers to W from the noisy answers to A.

Identity and Workload [DPBench, Zhang:2018:EFD:3183713.3196921] are particular strategies in the MM framework. Identity simply adds noise to each count in the data vector and then computes the query workload as normal. Workload, however, first computes the true query answers and then adds noise to them, scaled to the query with the highest sensitivity.

## 3 Answering Sum Queries

We now introduce the methods we developed for answering sum queries. In Section 3.1 we discuss answering a single sum query, and in Section 3.2 we propose algorithms for answering a workload of sum queries.

### 3.1 Answering a Single Query

A sum query can have very large (and even unbounded) sensitivity, which often leads to a prohibitively large scale of injected noise to satisfy the privacy guarantee. One simple, yet effective, method to reduce the sensitivity is truncating the values of tuples in the original database before answering the sum query. For a database and a threshold θ, all tuples with value higher than θ are replaced with the value θ. Then, for any sum query, the truncated query has sensitivity θ. In other words, asking a truncated sum query can have a much smaller sensitivity.
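A minimal sketch of the truncation operator and its effect on sensitivity (the threshold and salary values are illustrative, borrowed from the spirit of Example 1):

```python
def truncated_sum(values, theta):
    """Sum after replacing every value above theta with theta.
    Adding/removing one row now changes the sum by at most theta."""
    return sum(min(v, theta) for v in values)

salaries = [28_000, 45_000, 52_000, 1_000_000]
total = truncated_sum(salaries, theta=50_000)   # 28k + 45k + 50k + 50k = 173_000
# Sensitivity drops from the domain maximum (up to $1M here) to theta = 50_000,
# at the cost of bias: the untruncated sum is 1_125_000.
```

The gap between `total` and the untruncated sum is exactly the truncation bias discussed next.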

This sensitivity reduction comes at a cost in bias, since truncation reduces the true answer even in the absence of any noise mechanism. Thus, the choice of θ is crucial: very small values, which lead to low sensitivity, also lead to increased bias. For example, an extremely small truncation threshold in Example 1 would drive all of the query answers toward zero. Conversely, large truncation thresholds have small bias, but might not decrease the sensitivity.

At the same time, any choice of θ needs to be made either (a) data-independently (e.g., using an oracle), or (b) using the sensitive data under differential privacy. In the following we explore methods of privately choosing a threshold θ.

Recursive Mechanism The recursive mechanism can be used to privately find a truncation threshold: we simply use its internally computed threshold as the truncation threshold θ. The result is equivalent to running the whole recursive mechanism on the sum query. The proof and details of the algorithm are in the Appendix.

Sparse Vector Technique Likewise, we can use SVT to choose the truncation threshold. We choose a sequence of counting queries, each counting the number of tuples with weight at most some bound. We pick the fraction of the database that we would like to remain untruncated and set the SVT threshold to that fraction of the dataset size. SVT then returns the first bound for which the number of tuples at or below it exceeds the SVT threshold, and we use this bound as our truncation threshold. When choosing the sequence of counting queries, we begin with a bound far below the expected threshold and let the sequence be strictly increasing. We found that increasing the bound linearly with a reasonable interval gives very slow performance and returns a smaller threshold than expected. As such, we use an exponentially increasing sequence of bounds, defined by a fixed rate parameter.
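The threshold search described above can be sketched as follows. The parameter names (`keep_ratio`, `start`, `rate`) are ours, the SVT noise scales assume sensitivity-1 counting queries, and the search is capped for safety; treat this as a reading of the text, not the paper's exact algorithm.

```python
import numpy as np

def choose_threshold_svt(values, keep_ratio, start, rate, epsilon,
                         rng=np.random.default_rng()):
    """Return the first bound in the exponentially increasing sequence
    start, start*rate, start*rate**2, ... for which (noisily) at least a
    keep_ratio fraction of the data lies at or below the bound."""
    svt_threshold = keep_ratio * len(values) + rng.laplace(scale=2.0 / epsilon)
    bound = start
    for _ in range(64):                                      # bounded search
        count_below = sum(1 for v in values if v <= bound)   # sensitivity-1 query
        if count_below + rng.laplace(scale=4.0 / epsilon) >= svt_threshold:
            return bound
        bound *= rate
    return bound
```

The exponential sequence reaches a usable bound in logarithmically many SVT comparisons, which is why it outperforms a linear sweep.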

### 3.2 Answering a Workload of Sum Queries

We now propose a general routine for answering a workload of sum queries. In Table 1 we offer descriptions of 3 baseline algorithms and 2 new algorithms. In the table, we define b̃ as a vector of i.i.d. random variables drawn from a Laplace distribution with mean 0 and scale 1.

Baseline Algorithms The first baseline is SQM, which naively splits the budget across all queries and sequentially answers them with the Laplace mechanism. Additionally, direct implementations of Identity, Workload, and the Matrix Mechanism (MM) are also considered. Note that the queries in a sum query set can have vastly different sensitivities, and very high sensitivity for queries involving tuples with large weights. In Example 1, there may be only a few queries involving individuals with very high salaries, such as $1 million. To ensure the privacy of these individuals, we would need to add very large noise to all the queries under Workload, and to some queries under Identity and MM. This characteristic of sum query sets makes these algorithms sub-optimal. Thus, we propose truncated versions of these algorithms.
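The SQM baseline can be sketched as follows (our naming; thresholds are passed in fixed rather than privately learned, and we use the fact that a prefix query with threshold theta has sensitivity at most theta):

```python
import numpy as np

def sqm(values, thresholds, epsilon, rng=np.random.default_rng()):
    """Answer each prefix sum query independently with the Laplace mechanism,
    splitting the privacy budget evenly across the queries."""
    eps_each = epsilon / len(thresholds)
    answers = []
    for theta in thresholds:
        true_ans = sum(v for v in values if v <= theta)     # prefix sum at theta
        answers.append(true_ans + rng.laplace(scale=theta / eps_each))
    return answers
```

Because each query only receives a 1/k share of the budget, the noise grows linearly with the number of queries k, which is what the batch mechanisms below try to avoid.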

Truncation As in the case of answering a single query, truncation is a useful technique to reduce the sensitivity of sum queries. We split the privacy budget with a parameter, assigning part of the budget to SVT or Recursive to find a truncation threshold θ. We then obtain a truncated weight matrix T by changing all the diagonal values of D larger than θ to θ. Using WT as the weighted workload matrix is then equivalent to applying truncation to every query of W. We distinguish the two variants of this truncation method by whether SVT or Recursive is used. We can thus use T instead of D as the weight matrix to implement Identity and Workload.

Truncated Matrix Mechanisms As discussed earlier, MM and HDMM [HDMM] are preferred for answering a batch of queries, since they optimize for the input workload using a strategy matrix A. Ideally, in our problem, we want to jointly optimize the strategy matrix A and the truncated weight matrix T with respect to the workload and the data. In addition, T could in principle be any matrix, instead of a diagonal matrix obtained from D via a numerical threshold. We now introduce two heuristic algorithms that implement MM with truncation.

Both algorithms obtain T using SVT or Recursive as described above. After obtaining T, one option is to optimize the strategy A using the workload matrix W, without using the result of truncation T. We call this Truncation-independent Matrix Mechanism (TiMM). A different approach is to optimize A using the weighted workload matrix WT. We call this Truncation-aware Matrix Mechanism (TaMM). Both procedures are described in Table 1. The full description of the algorithms and the analytical expressions for their expected errors are presented in the appendix.
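The difference between the two variants can be sketched with a generic MM step. For brevity we use the identity strategy (the paper's GreedyH would supply a better A); the key contrast is that TiMM measures the weighted counts Tx with noise scaled to the sensitivity of AT, while TaMM measures the raw counts x with noise scaled to the sensitivity of A, applying the weights only during reconstruction.

```python
import numpy as np

def mm_reconstruct(M, v, A, sens, epsilon, rng=np.random.default_rng()):
    """Generic matrix-mechanism step: measure Av with Laplace noise of scale
    sens/epsilon, reconstruct v by least squares, then answer M @ v_hat."""
    noisy = A @ v + rng.laplace(scale=sens / epsilon, size=A.shape[0])
    v_hat = np.linalg.lstsq(A, noisy, rcond=None)[0]
    return M @ v_hat

p = 4
W = np.tril(np.ones((p, p)))            # prefix workload
T = np.diag([10.0, 20.0, 30.0, 30.0])   # truncated weights (theta = 30)
x = np.array([3.0, 1.0, 0.0, 2.0])      # counts
A = np.eye(p)                           # identity strategy, for brevity
theta = 30.0

timm = mm_reconstruct(W, T @ x, A, sens=theta, epsilon=1.0)  # noise on weighted counts Tx
tamm = mm_reconstruct(W @ T, x, A, sens=1.0, epsilon=1.0)    # noise on raw counts x
```

With the identity strategy this reproduces the observation from the experiments: TiMM's noise is spread evenly across buckets, while TaMM's noise is reweighted by T after measurement.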

## 4 Experiments

We experimentally evaluate our algorithms on a U.S. Census dataset. Our key findings are: (a) truncation improves the error of every algorithm tested, (b) our proposed algorithm TaMM performs best, and (c) traditional post-processing techniques do not necessarily reduce the error.

Dataset We use CPS, a dataset derived from the publicly available Current Population Survey [Zhang:2018:EFD:3183713.3196921]. CPS contains tuples corresponding to individuals, with 4 attributes: income, race, gender, and marital status. We project on the income attribute to derive our dataset.

Queries We evaluate using three workloads of prefix sum queries of increasing size, each consisting of prefix queries with increasing thresholds.

Vectorization All BQM algorithms presented in Section 3.2 operate on the vector form of the dataset and workload. We vectorize our dataset and queries using a fixed set of bins, and the three query workloads are vectorized to corresponding lower triangular workload matrices.

Algorithms We experimented with all 5 algorithms listed in Table 1. Each algorithm is executed without the Trunc subroutine (NoTrunc), or with the Trunc subroutine with a threshold learned using SVT or Recursive, for a total of 15 different configurations. We ran each algorithm on a unique input for a total of 100 independent trials, and we report aggregate statistics.

Algorithms using the Trunc subroutine use a budget split to learn the threshold privately; for brevity, in our results we report a single budget split. The SVT and Recursive subroutines are run with fixed parameter choices. We ran our experiments in the ektelo environment [Zhang:2018:EFD:3183713.3196921] and used the GreedyH algorithm provided by ektelo to compute the strategy matrix A. Since our workload is prefix sums, we used isotonic regression as a post-processing step for all results we present, unless explicitly stated. Across all algorithms, we fixed the privacy parameter ε.

Error For each query and algorithm, we report the relative error of the algorithm on the query, computed from the noisy answer, the true answer, and a positive smoothing parameter that is fixed across all experiments.

Results Our main experimental results are presented in Figs. 2, 3, and 4. Across all figures, the x-axis corresponds to the prefix sum queries: each point corresponds to the prefix query at that threshold. In Fig. 2 the y-axis corresponds to the noisy answers, while in Figs. 3 and 4 the y-axis shows the relative error. In Fig. 3 the solid black line corresponds to the true answers. Solid colored lines represent the mean answer (or error) and the shaded areas cover 90% (5th to 95th percentile) of the algorithm performance. Across all experiments, we observed that the Identity baseline dominated the rest of the baselines; this is expected due to the large size of the workload. Thus, we mostly use Identity as the baseline of comparison for our algorithms.

Effects of Truncation In Fig. 1(a) we compare the performance of the Identity algorithm with and without truncation. We can see that with truncation the overall variance of the noisy answers is much smaller than with NoTrunc. This is expected, as the scale of the Laplace noise added is significantly smaller when there is truncation, especially so for larger queries. Additionally, despite the fact that NoTrunc is unbiased, its mean answers deviate from the true answers more than either truncated variant. This is due to the large error of NoTrunc and the bias introduced by isotonic regression. Between the two techniques for computing the truncation threshold, one had smaller error in our experiments. However, since both methods depend on several free parameters, we cannot say which one is better in general. In the following, and for brevity, we present results using that technique.

Truncated Matrix Mechanisms In Fig. 1(b) we compare the answers from our new truncated matrix mechanisms with those of Identity, the best performing baseline. We see that both TaMM and TiMM offer less variance in their noisy answers than Identity, while their mean answers are comparable. To further examine the performance of these algorithms, in Fig. 3 we also present their relative errors. Overall, we see that TaMM performs best across most queries, and as the query size increases TiMM becomes more competitive. For very small queries the error of TiMM is approximately one order of magnitude larger than that of TaMM and Identity. This is because TiMM adds noise to the weighted counts, while both Identity and TaMM add noise to the raw counts. Thus, the noise of Identity and TaMM is proportional to the weight of each bucket, while the noise of TiMM is more evenly distributed. This makes the noise of TiMM significantly higher for small queries but similar to TaMM for larger queries. This also explains why TiMM has smaller error than TaMM for very large queries.

Effects of Isotonic Regression In Fig. 4 we present the performance of algorithms Workload and SQM for answering the workload with (Fig. 3(b)) and without (Fig. 3(a)) isotonic regression. The main finding is that isotonic regression offers a bigger boost in mean error for Workload than for SQM, and for the majority of queries the variance of SQM's errors is worsened. More specifically, Fig. 3(b) shows that the 5th-percentile errors of SQM are significantly raised after applying isotonic regression, while its mean error is only slightly improved.

As a reminder, Workload uses the same truncation threshold across all queries of the workload, while SQM uses a different threshold for each query. This results in Workload adding the same bias to each noisy answer and SQM adding a different bias to each query. Based on our findings, we conjecture that isotonic regression, and L2 minimization in general, may increase error when noisy answers carry different levels of bias.

## 5 Future work

In future work we hope both to test the proposed algorithms in other settings and to develop more sophisticated algorithms. Although the proposed algorithms apply to a range of high-sensitivity queries, they are only tested on sum queries. Likewise, all the experiments used diagonal truncation matrices due to the nature of prefix sums. Further analysis of the effects on other types of queries and on non-diagonal weight matrices would be valuable.

In Section 3.2 we introduced the idea of optimizing a strategy matrix and a truncation matrix jointly. The algorithms proposed do not reach this ideal and instead use a heuristic approach. As such, developing an algorithm which optimizes the strategy and truncation matrices jointly remains an open problem.

We also saw that some post-processing techniques may worsen the performance of some algorithms, particularly when the algorithm introduces a different bias to each answer. Investigating this phenomenon further is an interesting open problem.

## Appendix A Algorithms

### a.1 Single Query Mode

Algorithm 1, Algorithm 2 and Algorithm 3 are different algorithms used for the Single Query Mode (SQM).

Let the input be a dataset in multiset form, and let the query workload consist of prefix queries of income sums, each with an increasing threshold.

The parameters are: the ratio for splitting the privacy budget; the counting query with a given upper bound; the rate of increase of the counting query bound; the ratio of the dataset to be kept; and the threshold for the SVT algorithm. Truncate is the function that truncates the dataset by replacing all values above the threshold with the threshold.

### a.2 Batch Query Mode

We call all other algorithms we use Batch Query Mode (BQM) algorithms, which operate on the workload matrix and the vector form of the data. Algorithm 4 is used for vectorization and truncation, a common procedure for all BQM algorithms. Algorithm 5 and Algorithm 6 are baseline algorithms used together with SQM. The two new algorithms are Algorithm 7 and Algorithm 8. We implement all four algorithms together in our experiments using Algorithm 9.

We use the function Threshold to denote the selection of the truncation threshold θ. The underlying algorithm can be SVT or the recursive mechanism; we tested both in our experiments. We omit the parameters of the truncation algorithms, as they are listed in the single query mode section.

Here x is the vectorization of the dataset over the chosen bins, W is the workload matrix for this vectorization, and T is a diagonal matrix holding the (truncated) bucket weights. We define b̃ as a vector of independent samples from the Lap(1) distribution, used as an input for simplicity.

GreedyH is a function in ektelo which returns a strategy matrix A for a given workload matrix. LeastSquare returns the least-squares solution of a linear system.

In the experiments, we share the truncation threshold across all mechanisms in each instance to control for the effect of truncation. The algorithm used in the experiments is as follows.

Detailed explanations and error analyses of the newly proposed algorithms are below.

### a.3 Truncation-independent Matrix Mechanism

In Algorithm 7 (TiMM), the truncation matrix T is chosen without considering the strategy matrix. It can be chosen using the truncation methods for a single query, as described for the identity strategy.

Then, we choose a strategy matrix A under the matrix mechanism framework using the workload matrix W, and take Tx as the input. Thus, using the Laplace mechanism, we have

$$L(A, Tx) = ATx + \frac{\Delta(AT)}{\epsilon}\,\tilde{b}$$

where Δ(AT) is the sensitivity of AT, i.e., the maximum of the absolute column sums of AT, and A⁺ is the left pseudoinverse of A.

We then apply the matrix mechanism to obtain the output

$$M_A(W, Tx) = WA^{+}L(A, Tx) = WTx + \frac{\Delta(AT)}{\epsilon}\,WA^{+}\tilde{b}$$

The error for one query is

$$\mathrm{Error}[M_A(w, Tx)] = \mathbb{E}\!\left[\left(wDx - wTx - \frac{\Delta(AT)}{\epsilon}\,wA^{+}\tilde{b}\right)^{2}\right] = \bigl(w(D-T)x\bigr)^{2} + \frac{2\Delta(AT)^{2}}{\epsilon^{2}}\,w(A^{\top}A)^{-1}w^{\top}$$

where the identity A⁺(A⁺)ᵀ = (AᵀA)⁻¹ is from the matrix mechanism.

Thus, the total error is

$$\mathrm{Error}[M_A(W, Tx)] = \sum_{i=1}^{p}\mathrm{Error}[M_A(w_i, Tx)] = \bigl\lVert W(D-T)x \bigr\rVert_2^2 + \frac{2\Delta(AT)^{2}}{\epsilon^{2}}\,\mathrm{Tr}\!\left(W(A^{\top}A)^{-1}W^{\top}\right)$$

### a.4 Truncation-aware Matrix Mechanism

Algorithm 8 (TaMM) applies the matrix mechanism after truncation, and the strategy matrix A is decided using the truncated weight matrix T. Specifically, the matrix mechanism uses the weighted workload matrix WT as input.

We first compute a strategy matrix A and have

$$L(A, x) = Ax + \frac{\Delta A}{\epsilon}\,\tilde{b}$$

where ΔA is the sensitivity of A. Then, we apply the matrix mechanism by considering WT as the workload matrix, and the result is thus

$$T_{A,T}(W, x) = WTA^{+}L(A, x) = WTA^{+}\!\left(Ax + \frac{\Delta A}{\epsilon}\tilde{b}\right) = WT\!\left(x + \frac{\Delta A}{\epsilon}A^{+}\tilde{b}\right),$$

where A⁺ is the left pseudoinverse of A. In this case, the error for a single query is

$$\mathrm{Error}[T_{A,T}(w, x)] = \mathbb{E}\!\left[\bigl(wDx - T_{A,T}(w, x)\bigr)^{2}\right] = \mathbb{E}\!\left[\left(wDx - wTx - \frac{\Delta A}{\epsilon}\,wTA^{+}\tilde{b}\right)^{2}\right] = \bigl(w(D-T)x\bigr)^{2} + \frac{2(\Delta A)^{2}}{\epsilon^{2}}\,wT(A^{\top}A)^{-1}T^{\top}w^{\top}$$

since A⁺A = I and A⁺(A⁺)ᵀ = (AᵀA)⁻¹ from the matrix mechanism.

We thus have the total error as

$$\mathrm{Error}[T_{A,T}(W, x)] = \sum_{i=1}^{p}\mathrm{Error}[T_{A,T}(w_i, x)] = \bigl\lVert W(D-T)x \bigr\rVert_2^2 + \frac{2(\Delta A)^{2}}{\epsilon^{2}}\,\mathrm{Tr}\!\left(WT(A^{\top}A)^{-1}T^{\top}W^{\top}\right)$$

### a.5 The equivalence of our truncation method and the second part of the recursive mechanism

One important fact we used in our experiments is that, for sum queries specifically, the second part of the recursive mechanism [Chen13:recursive] is equivalent to the truncation mechanism we used, with the internal parameter of the recursive mechanism considered as the truncation threshold θ. We have the following theorem.

###### Theorem 1

When the internal parameter Δ̂ of the recursive mechanism is taken as the truncation threshold θ, the second part of the recursive mechanism is equivalent to the truncation method.

The proof is as follows. From the recursive mechanism,

$$X = \min\left\{\, H_i + (N - i)\hat{\Delta} \;:\; 0 \le i \le N \,\right\}$$

where H_i is the sum of the i smallest weights. Thus, H_i + (N − i)Δ̂ is equal to the sum over a new dataset in which the i smallest weights are unchanged and the remaining N − i weights are changed to Δ̂. Hence, the minimum of these sums keeps the weights smaller than or equal to Δ̂ unchanged and changes weights larger than Δ̂ to Δ̂, exactly the same as the truncation method using Δ̂ as the threshold.

In addition, the noise added is equal to the noise added in the truncation method when the same threshold is used. Thus, the two methods are equivalent in the case of sum queries.
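The equality between the minimum above and the truncated sum can be checked numerically with a small sketch (H_i is computed as the sum of the i smallest weights; weight values are illustrative):

```python
def truncated_sum(weights, delta_hat):
    """Sum after replacing every weight above delta_hat with delta_hat."""
    return sum(min(w, delta_hat) for w in weights)

def recursive_min(weights, delta_hat):
    """min over i of H_i + (N - i) * delta_hat, where H_i is the sum of the
    i smallest weights -- the quantity X from the recursive mechanism."""
    w = sorted(weights)
    N = len(w)
    H = [0]
    for v in w:                     # prefix sums of the sorted weights
        H.append(H[-1] + v)
    return min(H[i] + (N - i) * delta_hat for i in range(N + 1))

weights = [1, 5, 10, 42]
assert recursive_min(weights, 7) == truncated_sum(weights, 7)   # both equal 1+5+7+7
```

The minimum is attained at the index i separating weights at most Δ̂ from weights above it, which is exactly the split the truncation operator performs.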