## 1 Introduction

Differential privacy (DP) [dwork2006calibrating] comprises randomized methods for publishing the output of data queries while guaranteeing that the answers are statistically unlikely to reveal information about attributes of the data. Instead of releasing the exact result of the query, DP acts as a guard by randomly perturbing the query response. DP can also stop responding to repeated queries once a preset privacy budget is reached. DP methods do not perturb or release the data directly; they are tailored to the specific query.

Clustering algorithms are among the most common unsupervised learning techniques (see, e.g., [xu2005survey] for a survey). In many cases, by summarizing the emerging trends of a large, detailed collection in a small set of data patterns, the clustering query can provide sufficient information to improve the services offered by entities that do not have permission to observe the data directly. Just like publishing an average, publishing the average behavior of a data cluster for a certain set of data in a database leaks private information, which motivates the use of the DP framework for its release. With this in mind, this paper presents a novel approach for applying DP to the publication of the centroids computed by the $K$-means algorithm.

Prior art – Generic differentially private clustering techniques have been presented in, e.g., [balcan2017differentially, xia2020distributed, lu2020differentially]. The authors of [balcan2017differentially] proposed an iterative $K$-means clustering algorithm for data in high-dimensional Euclidean spaces. In [xia2020distributed], the authors proposed a local DP iterative clustering algorithm in which noise is added at the user's end before the data are transmitted to the aggregator. While guaranteeing DP, these techniques may not converge. In contrast, the authors of [lu2020differentially] proposed a clustering algorithm that performs an input perturbation in each iteration, which offers convergence guarantees but drives the cost of DP higher depending on the number of iterations required for convergence.

Contribution – The paper introduces the design of a DP mechanism that publishes cluster centroids by adding to them Gaussian noise with an optimum covariance. The proposed method is not iterative and, in our experiments, provides greater accuracy for a given privacy budget than both the white Gaussian noise mechanism and the mechanism of [balcan2017differentially]. The efficacy of the proposed mechanism is tested on samples drawn from the Marketing Campaign dataset [marketingdataset].

Paper organization – In Section 2, we introduce the DP framework and set up the problem statement. In Section 3, we describe a DP mechanism for the publication of the clustering query. In Section 4, we test our algorithms numerically, before concluding the paper in Section 5.

Notation – Boldfaced lower-case (upper-case, respectively) letters denote vectors (matrices, respectively), and $x_i$ ($X_{ij}$, respectively) denotes the $i^{\text{th}}$ element of a vector $\boldsymbol{x}$ (the $ij^{\text{th}}$ entry of a matrix $\boldsymbol{X}$, respectively). Calligraphic letters denote sets and $|\cdot|$ their cardinality. Finally, $[N]$ denotes the set of integers $\{1, \dots, N\}$.

## 2 Preliminaries and Problem Statement

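In the following, we denote by $\mathcal{X}$ a set of $N$ feature vectors $\boldsymbol{x}_i$, $i \in [N]$, embedded in $\mathbb{R}^d$, that are in a database. To review the basic concepts, we set the problem in general terms, denoting by $\boldsymbol{q}(\mathcal{X})$ the function mapping $\mathcal{X}$ onto the query answer, with outcome denoted by $\boldsymbol{q} \in \mathcal{Q}$, where $\mathcal{Q}$ is its domain.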

### 2.1 Differential Privacy

A DP randomized algorithm applied to a given query makes it difficult or impossible to tell whether the data $\mathcal{X}$ or $\mathcal{X}'$, which is missing one feature vector relative to $\mathcal{X}$, was queried. We denote the DP query answer by $\tilde{\boldsymbol{q}}(\mathcal{X})$; it has a random outcome $\tilde{\boldsymbol{q}}$ with distribution $f(\tilde{\boldsymbol{q}} \mid \mathcal{X})$ (the probability density function for continuous random queries and the probability mass function for discrete random queries). We briefly introduce the conventional definitions that explain how differential privacy is measured and established. The first and most widespread definition of differential privacy was introduced in [dwork2006calibrating, dwork2006our] and is referred to as $(\epsilon, \delta)$-DP. The one that we follow was introduced by [machanavajjhala2008privacy] and is referred to as $(\epsilon, \delta)$-Probabilistic Differential Privacy (PDP). It can be shown that $(\epsilon, \delta)$-PDP is a strictly stronger condition than $(\epsilon, \delta)$-DP.

###### Definition 1 ($(\epsilon, \delta)$-Probabilistic Differential Privacy).

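The so-called privacy leakage function $L_{\mathcal{X}\mathcal{X}'}(\tilde{\boldsymbol{q}})$ is the log-likelihood ratio between the two hypotheses that the query outcome $\tilde{\boldsymbol{q}}$ is the answer generated by the data $\mathcal{X}$ or by the data $\mathcal{X}'$, which differ by one element. Mathematically:

$$L_{\mathcal{X}\mathcal{X}'}(\tilde{\boldsymbol{q}}) \triangleq \ln \frac{f(\tilde{\boldsymbol{q}} \mid \mathcal{X})}{f(\tilde{\boldsymbol{q}} \mid \mathcal{X}')}. \tag{1}$$

A randomized mechanism is $(\epsilon, \delta)$-PDP for $\mathcal{X}$ iff, for every neighboring dataset $\mathcal{X}'$:

$$\Pr\left(L_{\mathcal{X}\mathcal{X}'}(\tilde{\boldsymbol{q}}) > \epsilon\right) \le \delta. \tag{2}$$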

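###### Theorem 2 (PDP implies DP [mcclure2015relaxations]).

If a randomized mechanism is $(\epsilon, \delta)$-PDP, then it is also $(\epsilon, \delta)$-DP, i.e.,

$$\Pr\left(\tilde{\boldsymbol{q}} \in \mathcal{S} \mid \mathcal{X}\right) \le e^{\epsilon} \Pr\left(\tilde{\boldsymbol{q}} \in \mathcal{S} \mid \mathcal{X}'\right) + \delta, \quad \forall\, \mathcal{S} \subseteq \mathcal{Q}.$$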

Definition 1 has a direct statistical interpretation: if the values of $\epsilon$ and $\delta$ are close to zero, then even the optimum statistical test between the hypotheses that the randomized answer was produced by the dataset $\mathcal{X}$ or by $\mathcal{X}'$ produces results that are mostly incorrect or unreliable. Of course, this comes at a cost in terms of the accuracy of the answer.
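As an illustration of Definition 1, the following minimal Python sketch estimates how often the privacy leakage exceeds a threshold $\epsilon$ for a scalar query perturbed by Gaussian noise. The scalar setting, the function names, and the parameter values are assumptions made for this example and are not taken from the paper. The sketch compares a Monte Carlo estimate of the exceedance probability with its closed form, which is available here because the leakage is itself Gaussian.

```python
import math
import random


def leakage(q_tilde, q_x, q_xp, sigma):
    """Log-likelihood ratio ln f(q_tilde | X) / f(q_tilde | X') for N(0, sigma^2) noise."""
    return ((q_tilde - q_xp) ** 2 - (q_tilde - q_x) ** 2) / (2.0 * sigma ** 2)


def exceed_probability(q_x, q_xp, sigma, eps, trials=100_000, seed=0):
    """Monte Carlo estimate of Pr(L > eps) when the true dataset is X."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        q_tilde = q_x + rng.gauss(0.0, sigma)  # noisy scalar answer
        if leakage(q_tilde, q_x, q_xp, sigma) > eps:
            hits += 1
    return hits / trials


def exact_exceed_probability(q_x, q_xp, sigma, eps):
    """Closed-form Pr(L > eps); assumes q_x != q_xp.

    With mu = q_x - q_xp, the leakage is Gaussian with
    mean mu^2 / (2 sigma^2) and standard deviation |mu| / sigma.
    """
    mu = q_x - q_xp
    mean = mu ** 2 / (2.0 * sigma ** 2)
    std = abs(mu) / sigma
    return 0.5 * math.erfc((eps - mean) / (std * math.sqrt(2.0)))
```

For a given $\epsilon$, the smallest $\delta$ for which this toy mechanism satisfies eq. 2 is the value returned by `exact_exceed_probability`, maximized over the neighboring answers of interest.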

### 2.2 The $K$-Means Clustering Query

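Let $[N]$ be the set of indices of the data points in $\mathcal{X}$, $\boldsymbol{x}_i \in \mathbb{R}^d$, which we can organize as the matrix $\boldsymbol{X} \in \mathbb{R}^{d \times N}$. The task of the $K$-means algorithm is to split the dataset into $K$ subsets (clusters) and to assign to each point a label corresponding to its nearest cluster centroid. In other words, the query we are interested in is given by the stacked cluster centroids, $\boldsymbol{q}(\mathcal{X}) = (\boldsymbol{c}_1^\top, \dots, \boldsymbol{c}_K^\top)^\top$. The problem can formally be posed as an optimization problem of the form:

$$\min_{\mathcal{C}_1, \dots, \mathcal{C}_K} \; \sum_{k=1}^{K} \sum_{i \in \mathcal{C}_k} \left\| \boldsymbol{x}_i - \boldsymbol{c}_k \right\|_2^2 \quad \text{s.t.} \quad \bigcup_{k=1}^{K} \mathcal{C}_k = [N], \;\; \mathcal{C}_k \cap \mathcal{C}_{k'} = \emptyset, \; k \neq k', \tag{3}$$

where $\{\mathcal{C}_k\}_{k=1}^{K}$ is the partition in $K$ clusters and $\boldsymbol{c}_k$ is the centroid of cluster $\mathcal{C}_k$, obtained by averaging the points $\boldsymbol{x}_i$, $i \in \mathcal{C}_k$. The objective in eq. 3 minimizes the cost of the clustering assignment under the constraint shown, so that every point in the database is assigned the cluster label whose centroid is the closest.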
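The clustering query can be sketched with a plain Lloyd-style iteration. This is a minimal, dependency-free illustration under assumed choices (random initialization from the data, a fixed iteration cap, and hypothetical function names such as `kmeans_centroids`); it is not the implementation used in the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans_centroids(points, k, iterations=50, seed=0):
    """Return k centroids found by alternating assignment and averaging."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # assumed initialization
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: each centroid becomes the average of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids
```

The returned list of centroids plays the role of the query answer that is subsequently perturbed before publication.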

## 3 An Optimized $(\epsilon, \delta)$-DP Gaussian Mechanism for $K$-Means Clustering

Prior to introducing our optimization in Section 3.2, in the next section we show how the most common DP mechanism for continuous queries would perform.

### 3.1 White Gaussian Noise Mechanism

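The Gaussian noise output perturbation mechanism is a popular option in DP for publishing a variety of statistics. It entails adding a sample of i.i.d. Gaussian random noise to a vector query prior to publishing it. When applied to the cluster centroids:

$$\tilde{\boldsymbol{q}}(\mathcal{X}) = \boldsymbol{q}(\mathcal{X}) + \boldsymbol{\eta}, \tag{4}$$

where $\boldsymbol{\eta}$ is a sample of random noise and $\tilde{\boldsymbol{q}}(\mathcal{X})$ is the DP answer.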
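A minimal sketch of the white Gaussian noise mechanism applied to a list of centroids is given below. The function names are hypothetical, and the calibration in `classical_sigma` is the textbook Gaussian-mechanism rule $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \epsilon$ (valid for $0 < \epsilon < 1$), which is assumed here for illustration and may differ from the condition stated in the theorem that follows.

```python
import math
import random


def classical_sigma(sensitivity, eps, delta):
    """Textbook noise scale for the Gaussian mechanism (assumed; needs 0 < eps < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def white_gaussian_mechanism(centroids, sigma, rng=None):
    """Add i.i.d. N(0, sigma^2) noise to every coordinate of every centroid."""
    if rng is None:
        rng = random.Random()
    return [[c + rng.gauss(0.0, sigma) for c in centroid] for centroid in centroids]
```

For example, `white_gaussian_mechanism(centroids, classical_sigma(1.0, 0.5, 1e-5))` perturbs the centroids with a noise scale of roughly 9.69 when the sensitivity is 1, which shows how quickly accuracy degrades when the same variance is applied in every direction.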

###### Theorem 3 (Cluster centroids are $(\epsilon, \delta)$-DP [dwork2006calibrating]).

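The mechanism in eq. 4, when $\boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$, provides $(\epsilon, \delta)$-DP for any two neighboring datasets $\mathcal{X}$ and $\mathcal{X}'$: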

(5) |

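and $\Delta$ is the query sensitivity given by:

$$\Delta = \sup_{\mathcal{X}, \mathcal{X}'} \left\| \boldsymbol{q}(\mathcal{X}) - \boldsymbol{q}(\mathcal{X}') \right\|_2.$$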

### 3.2 Colored Gaussian Noise Mechanism

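Our idea in this paper stems from the fact that the i.i.d. Gaussian noise mechanism is easily generalized to the case of correlated Gaussian noise:

$$\tilde{\boldsymbol{q}}(\mathcal{X}) = \boldsymbol{q}(\mathcal{X}) + \boldsymbol{\eta}, \quad \boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{W}^{-1}), \tag{6}$$

where $\boldsymbol{W}$ is the noise precision matrix, and we optimize $\boldsymbol{W}$ once the $(\epsilon, \delta)$-PDP tradeoff is computed. Intuitively, a different choice of the covariance can better capture how the centroids are collectively placed in $\mathbb{R}^d$. The following theorem states the privacy guarantees of this mechanism: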
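Sampling correlated Gaussian noise from a given precision matrix can be sketched as follows. The approach shown (factor the precision matrix as $\boldsymbol{L}\boldsymbol{L}^\top$ and solve $\boldsymbol{L}^\top \boldsymbol{\eta} = \boldsymbol{z}$ for standard normal $\boldsymbol{z}$) is one standard way to do this and is an assumption of this example; the function names are hypothetical, and the precision matrix is taken to be symmetric positive definite.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular factor `lower` with lower * lower^T == matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_colored_noise(precision, rng):
    """Draw one noise vector with zero mean and covariance inverse(precision)."""
    n = len(precision)
    lower = cholesky(precision)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Back substitution for lower^T * eta = z, so that
    # cov(eta) = inverse(lower * lower^T) = inverse(precision).
    eta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * eta[k] for k in range(i + 1, n))
        eta[i] = (z[i] - s) / lower[i][i]
    return eta
```

Working with the precision matrix directly avoids an explicit matrix inversion; the white noise mechanism is recovered when the precision matrix is a scaled identity.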

###### Theorem 4 (Colored Gaussian Noise Mechanism is $(\epsilon, \delta)$-DP).

The additive noise mechanism in eq. 6 provides $(\epsilon, \delta)$-DP for any two neighboring datasets $\mathcal{X}$ and $\mathcal{X}'$.

###### Proof.

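In order to show that the mechanism in eq. 6 is $(\epsilon, \delta)$-DP, we first show that it is $(\epsilon, \delta)$-PDP. With $\boldsymbol{W}$ denoting the noise precision matrix, consider:

$$L_{\mathcal{X}\mathcal{X}'}(\tilde{\boldsymbol{q}}) = \ln \frac{f(\tilde{\boldsymbol{q}} \mid \mathcal{X})}{f(\tilde{\boldsymbol{q}} \mid \mathcal{X}')} = \frac{1}{2} \left(\tilde{\boldsymbol{q}} - \boldsymbol{q}(\mathcal{X}')\right)^\top \boldsymbol{W} \left(\tilde{\boldsymbol{q}} - \boldsymbol{q}(\mathcal{X}')\right) - \frac{1}{2} \left(\tilde{\boldsymbol{q}} - \boldsymbol{q}(\mathcal{X})\right)^\top \boldsymbol{W} \left(\tilde{\boldsymbol{q}} - \boldsymbol{q}(\mathcal{X})\right). \tag{7}$$

With $\boldsymbol{\eta} = \tilde{\boldsymbol{q}} - \boldsymbol{q}(\mathcal{X})$ and $\boldsymbol{\mu} = \boldsymbol{q}(\mathcal{X}) - \boldsymbol{q}(\mathcal{X}')$, we have:

$$L_{\mathcal{X}\mathcal{X}'}(\tilde{\boldsymbol{q}}) = \boldsymbol{\eta}^\top \boldsymbol{W} \boldsymbol{\mu} + \frac{1}{2} \boldsymbol{\mu}^\top \boldsymbol{W} \boldsymbol{\mu}. \tag{8}$$

The privacy loss function is a linear transformation of a Gaussian random vector and thus it is a Gaussian random variable with expectation $\frac{1}{2} \boldsymbol{\mu}^\top \boldsymbol{W} \boldsymbol{\mu}$ and variance $\boldsymbol{\mu}^\top \boldsymbol{W} \boldsymbol{\mu}$. In order to prove $(\epsilon, \delta)$-PDP, we have to prove that the privacy leakage function exceeds $\epsilon$ with probability at most $\delta$, i.e.:

(9) |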

where is the local sensitivity given by:

(10) |

Thus, the colored noise additive mechanism in eq. 6 is $(\epsilon, \delta)$-PDP with mean $\boldsymbol{0}$ and covariance $\boldsymbol{W}^{-1}$. Finally, the mechanism is also $(\epsilon, \delta)$-DP by Theorem 2. ∎
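Because the leakage is Gaussian, its tail probability has a closed form that is easy to evaluate for a candidate precision matrix. The sketch below assumes that, for noise with precision matrix $\boldsymbol{W}$ and centroid shift $\boldsymbol{\mu} = \boldsymbol{q}(\mathcal{X}) - \boldsymbol{q}(\mathcal{X}')$, the leakage has mean $\frac{1}{2}\boldsymbol{\mu}^\top \boldsymbol{W} \boldsymbol{\mu}$ and variance $\boldsymbol{\mu}^\top \boldsymbol{W} \boldsymbol{\mu}$; the symbol and function names are illustrative and not taken from the paper.

```python
import math


def quadratic_form(matrix, vec):
    """Compute vec^T * matrix * vec."""
    n = len(vec)
    return sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))


def leakage_tail(precision, mu, eps):
    """Pr(L > eps) for a Gaussian leakage with mean s/2 and variance s,
    where s = mu^T * precision * mu."""
    s = quadratic_form(precision, mu)
    if s == 0.0:
        # mu = 0: the two datasets give the same answer and the leakage is zero.
        return 1.0 if eps < 0.0 else 0.0
    return 0.5 * math.erfc((eps - 0.5 * s) / math.sqrt(2.0 * s))
```

Evaluating `leakage_tail` over all admissible shifts $\boldsymbol{\mu}$ gives the $\delta$ achieved by a given precision matrix at a given $\epsilon$. Increasing the precision (less noise) along the direction of $\boldsymbol{\mu}$ raises the tail probability, which is the tension that the covariance design has to resolve.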

The design of the optimal noise vector hinges on the design of its covariance matrix. Let us define:

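Minimizing the mechanism error (or distortion) $\mathbb{E}\|\boldsymbol{\eta}\|_2^2$ while meeting the DP guarantees requires minimizing the trace $\mathrm{tr}(\boldsymbol{W}^{-1})$ of the noise covariance (the inverse of $\boldsymbol{W}$), i.e., solving the following optimization: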

(11) | ||||

s.t. |

In the following lemma, we provide the closed form solution:

###### Lemma 5 (Optimal Choice of ).

Let the matrix contain as its columns all possible , and let us assume that is full row rank. Let us assume that the first columns of , corresponding to the set have the smallest norms and are linearly independent, forming the matrix we refer to as . The optimization problem in eq. 11 has a unique solution and it evaluates to:

(12) |

where has only non-zero values which correspond to the constraints associated with the set and:

(13) |

where are the non-zero Lagrange multipliers for the problem in eq. 11. Their values are:

(14) |

###### Proof.

The Lagrangian of the optimization in eq. 11 is:

(15) |

(16) |

Let be the matrix containing all vectors for all . Note that

(17) |

where contains the Lagrange multipliers. The problem has a unique solution if is invertible at the optimum . A necessary condition is that and that there are at least non-zero Lagrange multipliers at the optimum point. In this case, the stationary point of the problem is:

(18) |

Substituting we have:

(19) |

which makes it clear that the non-zero multipliers (those for which the constraint is tight) should be the smallest columns in in terms of -norm. Since we need at least of them, in that are linearly independent, we can place them in the matrix , which is square and invertible, and assume that these are the first columns of without loss of generality. The individual constraints in eq. 11 :

(20) |

can be rewritten as:

(21) |

and the solution of (21) will give . With some algebra, one can express (21) as follows:

(22) |

where the entries of are and those of are . Solving for , one obtains the entries of as and can calculate the optimal that satisfies the conditions from eq. 18. ∎

## 4 Numerical Simulations

To illustrate the proposed methodology, we consider the Marketing Campaign dataset [marketingdataset], which records when customers accepted offers from five marketing campaigns, along with additional information such as income, household size, and amount spent on various products (see [marketingdataset] for a detailed description). The dataset's features (excluding the customer ID) include 3 categorical columns (marital status, education, and date of registration), which were encoded into numeric form. The objective of the exercise is to predict the demographics of the customers who will respond to a marketing campaign and thereby increase profits. Such campaigns have become commonplace and often violate customer privacy [tucker2014social]. Using the elbow method, we first find that there are four clusters in this dataset; this is corroborated by the scatter plot of the points (embedded in 2-d using Multidimensional Scaling) in fig. 1(a). In table 1, we show the population count of the individual clusters. We then add colored noise to the cluster centroids and reevaluate the labels by assigning to each point the closest noisy cluster centroid as its new label. In fig. 1(b), we show the noisy clusters in the plot of customer spending against income. The classes can be characterized as medium-income and low-spending (class 0), medium-income and medium-spending (class 1), high-income and high-spending (class 2), and low-income and low-spending (class 3).

Cluster | 0 | 1 | 2 | 3 |
---|---|---|---|---|
True | 586 | 510 | 506 | 610 |
Noisy | 581 | 483 | 459 | 679 |

Table 1: Population count of the individual clusters.
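The relabeling step described above, in which each point is assigned the label of its closest (possibly noisy) centroid and the cluster populations are then counted, can be sketched as follows. The function names and the toy data in the usage note are hypothetical and are not drawn from the Marketing Campaign dataset.

```python
from collections import Counter


def relabel(points, centroids):
    """Label each point with the index of its nearest centroid."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def cluster_counts(labels, k):
    """Number of points carrying each of the k labels."""
    counts = Counter(labels)
    return [counts.get(j, 0) for j in range(k)]
```

Calling `cluster_counts(relabel(points, centroids), k)` once with the true centroids and once with the noisy ones yields the two rows of a table like table 1; points near a cluster boundary are the ones that migrate when the centroids are perturbed.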

In fig. 2, we compare the performance of the white and colored noise mechanisms for various $(\epsilon, \delta)$ pairs, and we observe a performance improvement for the colored noise mechanism over both the white noise mechanism and the mechanism of [balcan2017differentially]^{1}. Finally, in figs. 3(a) and 3(b), we plot the distribution of the number of deals purchased and the total number of promotions accepted by customers belonging to the different clusters, with and without noise added. We observe that the DP colored noise does not appreciably affect the distribution of these counts. As such, the noisy clustering mechanism still supports accurate inferences drawn from these parameters, such as the conclusions that married customers accept deals and promotions in relatively higher numbers, that high-income and high-spending customers do not care for deals but are more likely to accept promotions, and that customers in class 0 are more likely to accept deals.

^{1} For the mechanism from [balcan2017differentially], the curve corresponds to .

## 5 Conclusion

In this paper, we proposed and analyzed a differentially private randomized mechanism for the $K$-means clustering query. The method consists of adding Gaussian noise with an optimum covariance to the cluster centroids. As shown via numerical simulations on a marketing campaign dataset, the method outperforms the traditional Gaussian noise mechanism and the iterative DP clustering method of [balcan2017differentially]. Finally, we showed that the mechanism preserves the count distributions of various metrics of the dataset, thereby leading to accurate inferences (relative to inferences drawn from non-noisy clustering).