k-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy

06/26/2022
by   Chenglin Fan, et al.

When designing clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. In this paper, we develop a new initialization scheme, called HST initialization, for the k-median problem in general metric spaces (e.g., discrete spaces induced by graphs), based on the construction of a metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm for good initial centers that can subsequently be used by the local search algorithm. Our proposed HST initialization can produce initial centers achieving lower errors than those from another popular initialization method, k-median++, with comparable efficiency. The HST initialization can also be extended to the setting of differential privacy (DP) to generate private initial centers. We show that applying DP local search on top of our private HST initialization improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments justify the theory and demonstrate the effectiveness of our proposed method. Our approach can also be extended to the k-means problem.


1 Introduction

Clustering is an important problem in unsupervised learning that has been widely studied in statistics, data mining, network analysis, etc. 

(Punj and Stewart, 1983; Dhillon and Modha, 2001; Banerjee et al., 2005; Berkhin, 2006; Abbasi and Younis, 2007)

. The goal of clustering is to partition a set of data points into clusters such that items in the same cluster are expected to be similar, while items in different clusters should be different. This is concretely measured by the sum of distances (or squared distances) from each point to its nearest cluster center. One conventional notion for evaluating a clustering algorithm is: with high probability,

$$\mathrm{cost}(C, D) \le \gamma \cdot \mathrm{OPT}_k(D) + \eta,$$

where C is the set of centers output by the algorithm, cost(C, D) is the cost of C on dataset D, and OPT_k(D) is the cost of the optimal (oracle) clustering solution on D. When everything is clear from context, we write cost(C) and OPT_k for short. Here, γ is called the multiplicative error and η the additive error. Alternatively, we may also use the notion of expected cost.

Two popularly studied clustering problems are 1) the k-median problem, and 2) the k-means problem. The origin of k-median dates back to the 1970's (e.g., Kaufman et al. (1977)), where one tries to find the best locations of facilities so as to minimize the cost measured by the distance between clients and facilities. Formally, given a set of points and a distance measure, the goal is to find k center points minimizing the sum of absolute distances from each sample point to its nearest center. In k-means, the objective is to minimize the sum of squared distances instead. In particular, k-median is usually the choice for clustering on graph/network data.

In general, there are two popular frameworks for clustering. One heuristic is Lloyd's algorithm (Lloyd, 1982), which is built upon an iterative distortion-minimization approach. In most cases, this method can only be applied to numerical data, typically in the (continuous) Euclidean space. Clustering in general (discrete) metric spaces is also important and useful, for example when dealing with graph data, where Lloyd's method is no longer applicable. A more broadly applicable approach, the local search method (Kanungo et al., 2002; Arya et al., 2004), has also been widely studied. It iteratively finds the optimal swap between the center set and the non-center data points to keep lowering the cost. Local search achieves a constant approximation ratio to the optimal solution for k-median (Arya et al., 2004). In this paper, we focus on clustering in the general metric space setting.

Initialization of cluster centers. It is well-known that the performance of clustering can be highly sensitive to initialization. If clustering starts with good initial centers (i.e., with small approximation error), the algorithm may need fewer iterations to find a better solution. The k-median++ algorithm (Arthur and Vassilvitskii, 2007) iteratively selects data points as initial centers, favoring distant points in a probabilistic way. Intuitively, the initial centers tend to be well spread over the data points (i.e., over different clusters). The produced initial centers are proved to achieve an O(log k) multiplicative error in expectation. Follow-up works on k-means++ further improved its efficiency and scalability, e.g., Bahmani et al. (2012); Bachem et al. (2016); Lattanzi and Sohler (2019). In this work, we propose a new initialization framework, called Hierarchically Well-Separated Tree (HST) initialization, based on metric embedding techniques. Our method is built upon a novel search algorithm on metric embedding trees, with approximation error and running time comparable to k-median++. Moreover, and importantly, our initialization scheme can be conveniently combined with the notion of differential privacy (DP).

Clustering with Differential Privacy. The concept of differential privacy (Dwork, 2006; McSherry and Talwar, 2007) has become a popular way to rigorously define and resolve the problem of keeping useful information for model learning while protecting the privacy of each individual. The private k-means problem has been widely studied, e.g., Feldman et al. (2009); Nock et al. (2016); Feldman et al. (2017), mostly in the continuous Euclidean space. Balcan et al. (2017) considered identifying (in a private manner) a good candidate set of centers before applying private local search, which yields both a multiplicative and an additive error. The Euclidean k-means errors were further improved by Stemmer and Kaplan (2018), with more advanced candidate set selection. Huang and Liu (2018) gave an optimal algorithm in terms of minimizing the Wasserstein distance under a data separability condition.

For private k-median clustering, Feldman et al. (2009) considered the problem in high-dimensional Euclidean space. The strategy of Balcan et al. (2017) to form a candidate center set can be adapted to k-median as well, which leads to a multiplicative and an additive error in high-dimensional Euclidean space. However, one main limitation of these methods is that they cannot be applied to general metric spaces (e.g., on graphs). In discrete spaces, Gupta et al. (2010) proposed a private method for the classical local search heuristic, which applies to both k-median and k-means. To make each swapping step private, the authors applied the exponential mechanism of McSherry and Talwar (2007). Their method produces an ε-differentially private solution whose additive error scales with the diameter of the point set. In this work, we show that our HST initialization can improve DP local search for k-median (Gupta et al., 2010) in terms of both approximation error and efficiency.

The main contributions of this work include:

  • We introduce the HST (Fakcharoenphol et al., 2004) to the k-median clustering problem for initialization. We design an efficient sampling strategy to select the initial center set from the tree, with a provable approximation factor in the non-private setting that, for bounded data, improves on the error of k-means++/k-median++ in, e.g., low-dimensional Euclidean space.

  • We propose a differentially private version of HST initialization under the setting of Gupta et al. (2010) in discrete metric spaces. The resulting DP-HST algorithm finds initial centers with a multiplicative and an additive error guarantee. Moreover, running DP local search starting from this initialization yields a multiplicative and an additive error that improve previous results towards the well-known lower bound on the additive error of DP k-median (Gupta et al., 2010), up to a small factor. To our knowledge, this is the first initialization method with a differential privacy guarantee and an improved error rate in general metric spaces.

  • We conduct experiments on simulated and real-world datasets to demonstrate the effectiveness of our methods. In both non-private and private settings, our proposed HST-based initialization approach achieves a smaller initial cost than k-median++ (i.e., it finds better initial centers), which may also lead to improvements in the final clustering quality.

2 Background and Setup

2.1 Differential Privacy (DP)

Definition 2.1 (Differential Privacy (DP) (Dwork, 2006)).

An algorithm A is said to be ε-differentially private if, for any two adjacent datasets D and D' whose symmetric difference has size one, and for any set of outputs O,

$$\Pr[A(D) \in O] \le e^{\epsilon} \cdot \Pr[A(D') \in O].$$

Intuitively, differential privacy requires that after removing any single data point (a graph node in our case), the output of A should not be too different from that on the original dataset D. A smaller ε indicates stronger privacy, which, however, usually sacrifices utility. Thus, one of the central topics in the differential privacy literature is to balance the utility-privacy trade-off.

To achieve DP, one approach is to add noise to the algorithm output. The Laplace mechanism adds Laplace(Δf/ε) noise to the output of a function f with sensitivity Δf, which is known to achieve ε-DP. The exponential mechanism is another tool underlying many DP algorithms. Let O be the set of feasible outputs and u(D, o) the utility function that we aim to maximize. The exponential mechanism outputs an element o ∈ O with probability proportional to exp(ε·u(D, o)/(2Δu)), where D is the input dataset and Δu is the sensitivity of u. Both mechanisms are used in this paper.
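To make the two mechanisms concrete, below is a minimal Python sketch. The function names and the way the random generator is passed are illustrative; the utility function and its sensitivity are assumed to be supplied by the caller.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng=None):
    """Release value + Laplace(sensitivity / eps) noise; this achieves eps-DP."""
    rng = rng or np.random.default_rng()
    return value + rng.laplace(scale=sensitivity / eps)

def exponential_mechanism(candidates, utility, dataset, sensitivity, eps, rng=None):
    """Sample one candidate o with probability proportional to
    exp(eps * utility(dataset, o) / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    scores = np.array([utility(dataset, o) for o in candidates], dtype=float)
    logits = eps * (scores - scores.max()) / (2.0 * sensitivity)  # stabilized
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```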

2.2 Metric k-Median Clustering

Following Arya et al. (2004); Gupta et al. (2010), the problem of metric k-median clustering (DP and non-DP) studied in our paper is stated as follows.

Definition 2.2 (k-median).

Given a universe point set U and a metric ρ, the goal of k-median is to pick a center set F ⊆ U with |F| = k to minimize

$$\mathrm{cost}(F) = \sum_{v \in U} \min_{f \in F} \rho(v, f). \qquad (3)$$

Let D ⊆ U be a set of demand points. The goal of DP k-median is to minimize

$$\mathrm{cost}_D(F) = \sum_{v \in D} \min_{f \in F} \rho(v, f). \qquad (4)$$

At the same time, the output F is required to be ε-differentially private with respect to D. We drop the subscript and simply write cost(F) when there is no risk of ambiguity.

To better understand the motivation for DP clustering, we provide a real-world example as follows.

Example 2.1.

Consider U to be the universe of all users in a social network (e.g., Twitter). Each user (account) is public, but also has some private information that can only be seen by the data holder. Let D ⊆ U be a group of users sharing some feature that might be regarded as private. Suppose a third party plans to collaborate with the most influential users in D for, e.g., commercial purposes, and thus requests the cluster centers of D. In this case, we need a strategy to release the centers safely, while protecting the individuals in D from being identified (since membership of D is private).

The local search procedure for k-median proposed by Arya et al. (2004) is summarized in Algorithm LABEL:alg:rand_local_search. First, we randomly pick k points in U as the initial centers. In each iteration, we search over all pairs of a current center and a non-center point, and perform the swap that improves the cost the most, provided the improvement exceeds a threshold factor controlled by a hyper-parameter. We repeat this procedure until no such swap exists. Arya et al. (2004) showed that the output centers achieve a 5-approximation to the optimal solution.

[Algorithm LABEL:alg:rand_local_search: local search for k-median]
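As a rough illustration of the procedure just described, the following Python sketch implements single-swap local search over a precomputed distance matrix. The random starting centers, the (1 - alpha) acceptance rule, and all names are assumptions of this sketch; the exact threshold used in Algorithm LABEL:alg:rand_local_search is left unspecified above.

```python
import numpy as np

def kmedian_cost(dist_matrix, centers):
    """Sum over all points of the distance to the nearest center;
    dist_matrix[i, j] is the metric distance between points i and j."""
    return dist_matrix[:, centers].min(axis=1).sum()

def local_search_kmedian(dist_matrix, k, alpha=1e-3, max_iter=100, seed=0):
    """Single-swap local search in the spirit of Arya et al. (2004): accept the
    best swap only if it improves the cost by more than an alpha fraction."""
    rng = np.random.default_rng(seed)
    n = dist_matrix.shape[0]
    centers = list(rng.choice(n, size=k, replace=False))
    cost = kmedian_cost(dist_matrix, centers)
    for _ in range(max_iter):
        best_centers, best_cost = None, cost
        for i in range(k):                      # center to swap out
            for x in range(n):                  # candidate point to swap in
                if x in centers:
                    continue
                trial = centers[:i] + [x] + centers[i + 1:]
                c = kmedian_cost(dist_matrix, trial)
                if c < best_cost:
                    best_centers, best_cost = trial, c
        if best_centers is None or best_cost > (1 - alpha) * cost:
            break                               # no sufficiently improving swap
        centers, cost = best_centers, best_cost
    return centers, cost
```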

2.3 k-median++ Initialization

Although local search is able to find a solution with constant error, each iteration is computationally heavy (Resende and Werneck, 2007) and many iterations may be needed when starting from a random center set, which would be slow for large datasets. Indeed, such a costly procedure is not needed to reduce the cost at the beginning, i.e., when the cost is still large. To accelerate the process, efficient initialization methods find a "roughly good" center set as the starting point for local search. In this paper, we compare our new initialization scheme mainly with a popular (and perhaps the most well-known) initialization method, k-median++ (Arthur and Vassilvitskii, 2007) (see Algorithm LABEL:alg:k-median++).

[Algorithm LABEL:alg:k-median++: k-median++ initialization]

Here, D(x, C) denotes the shortest distance from a data point x to the closest (center) point in the current set C. Arthur and Vassilvitskii (2007) showed that the centers output by k-median++ achieve an O(log k) approximation error, with running time linear in the number of data points for each of the k selection rounds. Starting from this initialization, only a few rounds of the computationally heavy local search are needed to reach a constant-error solution. Thus, initialization may greatly improve the clustering efficiency.
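For comparison, a hedged sketch of the k-median++ seeding described above, again over a precomputed distance matrix. Each point keeps track of D(x), its distance to the closest center chosen so far, and the next center is sampled proportionally to D(x); the names and seed handling are illustrative.

```python
import numpy as np

def kmedianpp_init(dist_matrix, k, seed=0):
    """k-median++ seeding: the first center is uniform at random; each further
    center is sampled with probability proportional to D(x)."""
    rng = np.random.default_rng(seed)
    n = dist_matrix.shape[0]
    centers = [int(rng.integers(n))]
    closest = dist_matrix[:, centers[0]].copy()      # D(x) for every point x
    for _ in range(k - 1):
        probs = closest / closest.sum()
        new_center = int(rng.choice(n, p=probs))
        centers.append(new_center)
        # Update each point's distance to its nearest chosen center.
        closest = np.minimum(closest, dist_matrix[:, new_center])
    return centers
```

Already-chosen centers have D(x) = 0 and therefore are never re-sampled.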

3 Initialization via Hierarchically Well-Separated Tree (HST)

In this section, we propose our novel initialization scheme for k-median clustering, and provide our analysis in the non-private case, i.e., solving (3). The idea is based on metric embedding theory. We start with an introduction to the main tool used in our approach.

3.1 Hierarchically Well-Separated Tree (HST)

In this paper, for an L-level tree, we count levels in descending order down the tree. For a node v, we use lev(v) to denote its level, and n_i to denote the number of nodes at level i. The Hierarchically Well-Separated Tree (HST) is built from padded decompositions of a general metric space in a hierarchical manner (Fakcharoenphol et al., 2004). Let (U, ρ) be the metric space with |U| = n, which we refer to throughout without further clarification. A padded decomposition of U with parameter r is a probabilistic distribution over partitions of U such that the diameter of each cluster in a partition is at most r. The formal definition of the HST is given below.

Definition 3.1.

Assume the diameter of (U, ρ) is Δ and let L denote the tree depth. An α-Hierarchically Well-Separated Tree (α-HST) with depth L is an edge-weighted rooted tree T in which the weight of any edge between a node at level i+1 and its child at level i is at most Δ/α^{L−i}; that is, edge weights decrease by a factor of α at each level down the tree.

In this paper, we consider 2-HSTs for simplicity, as α only affects the constants in our theoretical analysis. As presented in Algorithm LABEL:alg:new_hst, the construction starts by applying a random permutation to U, so that in the following steps the points are picked in a random order. We first find a padded decomposition of U with diameter parameter Δ/2. The center of each partition serves as a node at the level below the root. Then we re-apply a padded decomposition to each partition to find sub-partitions of half the diameter, and set the corresponding centers as the nodes one level further down, and so on; the partitions at each level are obtained with the diameter parameter halved. This process proceeds until a node contains a single point, or a pre-specified tree depth is reached.


Figure 1: An illustrative example of a 3-level padded decomposition and its corresponding 2-HST. Left: the padded decomposition, where the thickness of each ball indicates its level and the colors correspond to the levels of the HST in the right panel; the marked points are the center nodes of the partitions (balls), and the remaining points are non-center data points. Right: the resulting 2-HST generated from the padded decomposition.

Figure 1 provides an example of a 3-level padded decomposition (left panel) along with the resulting 2-HST (right panel). Besides this basic implementation, which we describe for ease of illustration, Blelloch et al. (2017) proposed an efficient HST construction whose running time is near-linear in the numbers of nodes and edges of the graph.
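The following Python sketch mirrors the construction described above in a simplified form: at every level, balls of half the previous diameter are carved out around points scanned in one fixed random order. It is a stand-in for the padded-decomposition-based construction of Algorithm LABEL:alg:new_hst rather than a faithful reproduction; the node representation and parameters are assumptions.

```python
import numpy as np

def build_2hst(points, dist, diameter, seed=0):
    """Build a simplified 2-HST as nested dicts {"level", "points", "children"}.
    `dist` is the underlying metric and `diameter` an upper bound on it."""
    rng = np.random.default_rng(seed)
    order = list(rng.permutation(len(points)))       # random permutation of U

    def split(cluster, radius, level):
        if len(cluster) == 1 or level == 0:          # leaf: single point or bottom level
            return {"level": level, "points": cluster, "children": []}
        children, remaining = [], set(cluster)
        for c in order:                              # scan points in random order
            if c not in remaining:
                continue
            ball = [j for j in remaining if dist(points[c], points[j]) <= radius]
            remaining -= set(ball)
            # Each ball has diameter at most 2 * radius and is split further
            # with the radius halved, giving the 2-HST structure.
            children.append(split(ball, radius / 2.0, level - 1))
            if not remaining:
                break
        return {"level": level, "points": cluster, "children": children}

    n_levels = max(1, int(np.ceil(np.log2(max(diameter, 2.0)))))
    return split(list(range(len(points))), diameter / 2.0, n_levels)
```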

[Algorithm LABEL:alg:new_hst: 2-HST construction]

[Algorithm LABEL:alg:new_initial-NDP: NDP-HST initialization]

The first step of our method is to embed the data points into an HST (see Algorithm LABEL:alg:new_initial-NDP). Next, we describe our proposed new strategy to search for the initial centers on the tree (w.r.t. the tree metric). Before moving on, it is worth mentioning that there are polynomial-time algorithms for computing an exact k-median solution in the tree metric (Tamir (1996); Shah (2003)). However, these dynamic programming algorithms have high complexity, making them unsuitable for the purpose of fast initialization; moreover, it is unknown how to apply them effectively in the private case. As will be shown, our new algorithm 1) is very efficient, 2) gives a provable approximation error in the tree metric, and 3) can be easily and effectively extended to DP (Section 4).

3.2 HST Initialization Algorithm

Suppose T is an L-level 2-HST of the data in U, where we assume L is an integer. For a node v at level i, we use T_v to denote the subtree rooted at v, and N_v the number of data points in T_v. The search strategy for the initial centers, NDP-HST initialization ("NDP" stands for "Non-Differentially Private"), is presented in Algorithm LABEL:alg:new_initial-NDP and consists of two phases.

Subtree search. The first step is to identify the subtrees that contain the centers. To begin with, k initial nodes are picked as those with the largest scores. This is intuitive, since to get a good clustering, we typically want the ball surrounding each center to include more data points. Next, we screen the selected set: if it contains any ancestor-descendant pair of nodes, we remove the ancestor. If the current size of the set is smaller than k, we repeat the process until k nodes are chosen (nodes already selected, and their ancestors, are not re-selected). In this way, the selected set contains the roots of k disjoint subtrees.

[Algorithm LABEL:alg:new_initial_prune: FIND-LEAF]

Leaf search. After we find the set of subtrees, the next step is to find the center within each subtree using Algorithm LABEL:alg:new_initial_prune ("FIND-LEAF"). We employ a greedy search strategy: at each level we move to the child node with the largest score, until a leaf is reached. This is intuitive since the diameter of the partition ball decays exponentially with the level; we thereby focus more and more on the region with higher density (i.e., with more data points).
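Putting the two phases together, below is a hedged sketch of the NDP-HST search over trees produced by the construction sketch above. The score used here, the number of points in a subtree weighted by its ball diameter 2**level, is an assumption (the precise score of Algorithm LABEL:alg:new_initial-NDP is not reproduced in the text), and the greedy ancestor screening is a simplified variant of the screening loop described in the subtree-search paragraph.

```python
def hst_initialization(root, k, score=None):
    """Two-phase HST search: pick roots of k disjoint subtrees, then descend
    greedily inside each of them to a leaf (FIND-LEAF)."""
    score = score or (lambda v: len(v["points"]) * (2 ** v["level"]))

    # Enumerate every node together with the ids of its ancestors.
    def collect(v, anc):
        yield v, anc
        for child in v["children"]:
            yield from collect(child, anc + [id(v)])
    nodes = list(collect(root, []))

    # Phase 1 (subtree search): scan nodes by decreasing score. A previously
    # chosen ancestor is dropped in favor of a later-chosen descendant, so the
    # k survivors are roots of pairwise disjoint subtrees.
    chosen, banned = [], set()
    for v, anc in sorted(nodes, key=lambda t: -score(t[0])):
        if len(chosen) == k:
            break
        if id(v) in banned:
            continue
        chosen = [(u, ua) for u, ua in chosen if id(u) not in anc]
        chosen.append((v, anc))
        banned.add(id(v))
        banned.update(anc)        # ancestors of a chosen node are not reused

    # Phase 2 (FIND-LEAF): inside each chosen subtree, follow the child with
    # the largest score until a leaf is reached.
    centers = []
    for v, _ in chosen:
        while v["children"]:
            v = max(v["children"], key=score)
        centers.append(v["points"][0])
    return centers
```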

The complexity of our search algorithm is given as follows.

Proposition 3.1 (Complexity).

Algorithm LABEL:alg:new_initial-NDP takes time in the Euclidean space.

Remark 3.1.

The complexity of HST initialization is in general comparable to that of k-median++. Our algorithm is faster when the number of centers k is large. A similar comparison also holds for general metrics.

3.3 Approximation Error of HST Initialization

First, we show that the initial center set produced by NDP-HST is already a good approximation to the optimal k-median solution. Let ρ_T(x, y) denote the "2-HST metric" between x and y, i.e., the tree distance between the corresponding nodes in the 2-HST T. By Definition 3.1, in the analysis we equivalently assume that the edge weight at the i-th level is proportional to 2^i. The crucial step of our analysis is to examine the approximation error in terms of the 2-HST metric, after which the error can be translated to general metrics by the following lemma (Bartal, 1996).

Lemma 3.2.

In a metric space with and diameter , it holds that . In the Euclidean space , .

Recall the notation from Algorithm LABEL:alg:new_initial-NDP. We define

(5)
(6)
(7)

For simplicity, we write cost_T(·) for the k-median cost under the 2-HST metric. Here, (7) is the cost of the globally optimal solution under the 2-HST metric; the last equality in (7) holds because the optimal center set can always be located in disjoint subtrees, as each leaf contains only one point. (5) is the 2-HST k-median cost of the output of Algorithm LABEL:alg:new_initial-NDP. (6) is the oracle cost after the subtrees are chosen, i.e., the optimal cost of picking one center from each subtree in the selected set. We first bound the approximation errors of the subtree search and the leaf search, respectively.

Lemma 3.3 (Subtree search).

.

Lemma 3.4 (Leaf search).

.

Combining Lemma 3.3 and Lemma 3.4, we obtain

Theorem 3.5 (2-HST error).

Running Algorithm LABEL:alg:new_initial-NDP, we have .

Thus, HST initialization produces an approximation to the optimal solution in the 2-HST metric. Define the cost of our HST centers under the original metric as in (3), and the optimal cost w.r.t. the original metric as

(8)

We have the following result based on Lemma 3.2.

Theorem 3.6.

In a general metric space, the expected k-median cost of Algorithm LABEL:alg:new_initial-NDP is .

Remark 3.2.

In the Euclidean space, Makarychev et al. (2019) proved that a small number of random projections suffice for k-median to achieve small distortion. Thus, for bounded data, by Lemma 3.2, HST initialization is able to achieve an error that is better than that of k-median++ when the dimension is small.

NDP-HST Local Search. We are interested in the approximation quality of standard local search (Algorithm LABEL:alg:rand_local_search), when initialized by our NDP-HST.

Theorem 3.7.

NDP-HST local search achieves approximation error within an expected number of iterations, for inputs in a general metric space.

Before ending this section, we remark that the initial centers found by NDP-HST can analogously be used for k-means clustering, with a similar guarantee in terms of the optimal k-means cost for general metrics. See Appendix B for the detailed (and similar) analysis.

4 HST Initialization with Differential Privacy

In this section, we consider the initialization method with differential privacy (DP). Recall from (4) that U is the universe of data points and D ⊆ U is the demand set that needs to be clustered with privacy. Since U is public, simply running initialization algorithms on U would preserve the privacy of D. However, 1) this might be too expensive, and 2) in many cases one would want to incorporate some information about D into the initialization, since D could be a very imbalanced subset of U. For example, D may contain data points from only one cluster, out of tens of clusters in U. In this case, initialization on U is likely to pick initial centers from multiple clusters, which would not be helpful for clustering D. Next, we show how our proposed HST initialization can be easily combined with differential privacy in a way that also incorporates information about the demand set D, leading to improved approximation error (Theorem 4.3). Again, suppose T is an L-level 2-HST of the universe U in a general metric space, and denote by N_v the number of demand points in the subtree rooted at a node v. Our private HST initialization (DP-HST) is similar to the non-private Algorithm LABEL:alg:new_initial-NDP. To gain privacy, we perturb N_v by adding i.i.d. Laplace noise:

N̂_v = N_v + Lap(·), where Lap(·) denotes a Laplace random variable whose rate is chosen according to the privacy budget. We use the perturbed counts N̂_v for node selection instead of the true values N_v, as described in Algorithm LABEL:alg:new_initial_DP. The DP guarantee of this initialization scheme follows directly from composition (Dwork, 2006).

[Algorithm LABEL:alg:new_initial_DP: DP-HST initialization]
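A minimal sketch of the perturbation step, reusing the tree representation from the sketches above: Laplace noise is attached to every node's demand count, and these noisy counts replace the true counts in the scores used by the subtree and leaf searches. Splitting the budget ε evenly across the L levels is an assumption here, consistent with the composition argument in the proof of Theorem 4.1 below.

```python
import numpy as np

def attach_noisy_counts(root, demand_idx, eps, n_levels, rng=None):
    """For every HST node, set noisy_count = |subtree points in demand set|
    + Laplace(n_levels / eps), i.e., an eps/n_levels budget per level (assumed)."""
    rng = rng or np.random.default_rng()
    demand = set(demand_idx)
    stack = [root]
    while stack:
        v = stack.pop()
        true_count = sum(1 for p in v["points"] if p in demand)
        v["noisy_count"] = true_count + rng.laplace(scale=n_levels / eps)
        stack.extend(v["children"])
    return root

# DP-HST then runs the same two-phase search with a noisy score, e.g.:
# hst_initialization(root, k,
#                    score=lambda v: v["noisy_count"] * (2 ** v["level"]))
```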

Theorem 4.1.

Algorithm LABEL:alg:new_initial_DP is ε-differentially private.

Proof.

For each level i, the subtrees at that level are disjoint from each other, so perturbing their counts at level i consumes a privacy budget of ε/L. Composing over the L levels gives total privacy ε. ∎

We now consider the approximation error. As the structure of the analysis is similar to the non-DP case, we present the main result here and defer the detailed proofs to Appendix A.

Theorem 4.2.

Algorithm LABEL:alg:new_initial_DP finds initial centers such that

DP-HST Local Search. Similarly, we can use private HST initialization to improve the performance of private k-median local search, which is presented in Algorithm LABEL:alg:DP-HST. After initialization, the DP local search procedure follows Gupta et al. (2010), using the exponential mechanism for each swap.

[Algorithm LABEL:alg:DP-HST: DP-HST local search]
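For concreteness, a hedged sketch of one swap step of the DP local search, following the exponential-mechanism idea of Gupta et al. (2010): the utility of a candidate center set is its negative k-median cost on the demand set. The per-step budget eps_step, the use of the diameter diam as the cost sensitivity, and the enumeration of all single swaps are assumptions of this sketch; budget splitting across iterations is left to the caller.

```python
import numpy as np

def dp_local_search_step(dist_matrix, demand_idx, centers, eps_step, diam, rng=None):
    """Sample one single-swap neighbor of `centers` via the exponential
    mechanism, with utility = -cost on the demand set."""
    rng = rng or np.random.default_rng()
    n, k = dist_matrix.shape[0], len(centers)

    def cost(center_set):
        return dist_matrix[np.ix_(demand_idx, center_set)].min(axis=1).sum()

    swaps, utils = [], []
    for i in range(k):                       # center to swap out
        for x in range(n):                   # candidate point to swap in
            if x in centers:
                continue
            trial = centers[:i] + [x] + centers[i + 1:]
            swaps.append(trial)
            utils.append(-cost(trial))
    utils = np.array(utils)
    logits = eps_step * (utils - utils.max()) / (2.0 * diam)   # stabilized
    probs = np.exp(logits)
    probs /= probs.sum()
    return swaps[rng.choice(len(swaps), p=probs)]
```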

Theorem 4.3.

Algorithm LABEL:alg:DP-HST achieves ε-differential privacy. With high probability, the output centers admit

in iterations.

The DP local search with random initialization (Gupta et al., 2010) has a 6-multiplicative error and an additive error; our result improves the corresponding term in the additive error. Meanwhile, the number of iterations needed is also reduced (see Section 5.3 for an empirical justification). Notably, it has been shown in Gupta et al. (2010) that for the k-median problem, any ε-DP algorithm is subject to lower bounds on both the multiplicative and the additive error. Our result matches the lower bound on the multiplicative error, and the additive error is worse than the lower bound only by a factor that is small in many cases. To our knowledge, Theorem 4.3 is the first result in the literature to improve the error of DP local search in general metric spaces.

5 Experiments

5.1 Datasets and Algorithms

Figure 2: Examples of synthetic graphs (subgraphs of 50 nodes), shown for two values of the inter-cluster weight parameter (left and right panels). Darker and thicker edges indicate smaller distances. With a larger parameter, the graph is more separable.

Discrete Euclidean space.   Following previous work, we test k-median clustering on the MNIST hand-written digit dataset (LeCun et al., 1998) with 10 natural clusters (digits 0 to 9). We set the universe U as 10000 randomly chosen data points. We choose the demand set D using two strategies: 1) "balanced", where we randomly choose 500 samples from U; 2) "imbalanced", where D contains 500 random samples drawn only from digits "0" and "8" (two clusters). We note that an imbalanced D is a very practical setting in real-world scenarios, where data are typically not uniformly distributed. On this dataset, we test clustering with two different distances as the underlying metric.

Metric space induced by graph.   Random graphs have been widely used for testing k-median methods (Balcan et al., 2013; Todo et al., 2019). The construction of the graphs follows an approach similar to the synthetic pmedinfo graphs provided by the popular OR-Library (Beasley, 1990). The metric for this experiment is the shortest (weighted) path distance. To generate a graph, we first randomly split the nodes into clusters. Within each cluster, each pair of nodes is connected with some probability, with edge weight drawn from the standard uniform distribution. For each pair of clusters, we randomly connect some nodes from the two clusters, with heavier weights drawn from a uniform distribution whose range controls the separation. A larger inter-cluster weight makes the graph more separable, i.e., clusters are farther from each other. In Figure 2, we plot two example graphs (subgraphs of 50 nodes) for two values of this parameter, and we report results for both settings. For this task, the universe U has 3000 nodes, and the private demand set (500 nodes) is chosen using the same "balanced" and "imbalanced" schemes as described above; in the imbalanced case, we choose D randomly from only two clusters.
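A hedged sketch of the synthetic graph generator described above, using networkx. The number of cross-cluster edges and the inter-cluster weight range are assumptions standing in for the parameters that are not spelled out here; the clustering metric is then the shortest weighted path distance on the resulting graph.

```python
import numpy as np
import networkx as nx

def make_cluster_graph(n, k, p, inter_weight_range=(1.0, 2.0),
                       cross_edges=3, seed=0):
    """n nodes split randomly into k clusters; intra-cluster pairs are connected
    with probability p and Uniform(0, 1) weight; each pair of clusters is linked
    by a few heavier random edges."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j] and rng.random() < p:
                G.add_edge(i, j, weight=float(rng.random()))   # Uniform(0, 1)
    lo, hi = inter_weight_range
    for a in range(k):
        for b in range(a + 1, k):
            A, B = np.flatnonzero(labels == a), np.flatnonzero(labels == b)
            for _ in range(cross_edges):
                if len(A) and len(B):
                    G.add_edge(int(rng.choice(A)), int(rng.choice(B)),
                               weight=float(rng.uniform(lo, hi)))
    return G, labels

# Shortest weighted path distances, e.g.:
# dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
```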

Algorithms. We compare the following clustering algorithms in both the non-DP and the DP setting: (1) NDP-rand: local search with random initialization; (2) NDP-kmedian++: local search with k-median++ initialization (Algorithm LABEL:alg:k-median++); (3) NDP-HST: local search with NDP-HST initialization (Algorithm LABEL:alg:new_initial-NDP), as described in Section 3; (4) DP-rand: the standard DP local search algorithm (Gupta et al., 2010), i.e., Algorithm LABEL:alg:DP-HST with initial centers randomly chosen from U; (5) DP-kmedian++: DP local search with k-median++ initialization run on U; (6) DP-HST: DP local search with HST initialization (Algorithm LABEL:alg:DP-HST). Parameter settings for the non-DP and DP tasks are given below.

For non-DP methods, we fix the swap threshold parameter of Algorithm LABEL:alg:rand_local_search and set the maximum number of iterations to 20. To examine the quality of the initialization as well as of the final centers, we report both the cost at initialization and the cost of the final output. For DP methods, we run the algorithms for a fixed number of steps and report results for several values of the privacy budget ε. The average cost over iterations is reported for robustness. All results are averaged over 10 independent repetitions.

5.2 Results

The results on the MNIST dataset are given in Figure 3. The comparisons are similar for both distance metrics:

  • From the left column, the initial centers found by HST have lower cost than those from k-median++ and random initialization, in both the non-DP and the DP setting, and for both the balanced and the imbalanced demand set. This confirms that the proposed HST initialization is more powerful than k-median++ at finding good initial centers.

  • From the right column, we also observe a lower final cost of HST followed by local search in DP clustering. In the non-DP case, the final cost curves overlap, which means that although HST offers better initial centers, local search can always find a good solution eventually.

  • The advantage of DP-HST, in terms of both the initial and the final cost, is more significant when the demand set is an imbalanced subset of the universe. As mentioned before, this is because our DP-HST initialization approach also privately incorporates the information of the demand set.

The results on graphs are reported in Figure 4, which lead to similar conclusions. In all cases, our proposed HST scheme finds better initial centers with smaller cost than k-median++. Moreover, HST again considerably outperforms k-median++ in the private and imbalanced setting, for both the highly separable and the less separable graphs. The advantages of HST over k-median++ are especially significant in the harder tasks where the clusters are nearly mixed up.


Figure 3: Initial and final k-median cost on the MNIST dataset, with one distance metric per column.


Figure 4: Initial and final k-median cost on the graph dataset, with one graph setting per column.

5.3 Improved Iteration Cost of DP-HST

In Theorem 4.3, we show that under differential privacy constraints, the proposed DP-HST (Algorithm LABEL:alg:DP-HST) improves both the approximation error and the number of iterations required to find a good solution, compared with the classical DP local search (Gupta et al., 2010). In this section, we provide numerical results to justify the theory.

First, we need to properly measure the iteration cost of DP local search. This is because, unlike in non-private clustering, the k-median cost after each iteration of DP local search does not decrease monotonically, due to the probabilistic exponential mechanism. To this end, for the cost sequence over the iterations, we compute its moving-average sequence with a fixed window size. Attaining the minimal value of the moving average indicates that the algorithm has found a "local optimum", i.e., it has reached a "neighborhood" of solutions with small clustering cost. Thus, we use the number of iterations needed to reach such a local optimum as the measure of iteration cost. The results are provided in Figure 5. On all the tasks (MNIST with both distance metrics, and the graph dataset with both settings), DP-HST has a significantly smaller iteration cost. In Figure 6, we further report the k-median cost of the best solution found by each DP algorithm within the iteration budget; DP-HST again provides the smallest cost. This additional set of experiments validates the claims of Theorem 4.3: DP-HST is able to find better centers within fewer iterations.
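The moving-average-based iteration-cost measure can be computed as follows; the window size is illustrative.

```python
import numpy as np

def iterations_to_local_opt(cost_sequence, window=5):
    """Smooth the per-iteration k-median costs with a moving average and return
    the iteration index at which the smoothed curve attains its minimum."""
    costs = np.asarray(cost_sequence, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(costs, kernel, mode="valid")
    return int(np.argmin(smoothed)) + window - 1   # index in the original sequence
```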

Figure 5: Iteration cost to reach a locally optimal solution, on the MNIST and graph datasets under different settings. The demand set is an imbalanced subset of the universe.

Figure 6: The k-median cost of the best solution found by each differentially private algorithm. The demand set is an imbalanced subset of the universe. The same comparison holds on the graph data.

5.4 Running Time Comparison with k-median++

Proposition 3.1 gives the complexity of our HST initialization algorithm in the Euclidean space. With a smart implementation of Algorithm LABEL:alg:k-median++, in which each data point tracks its distance to the current closest candidate center, k-median++ has a comparable running time. Therefore, the running time of our algorithm is in general comparable to that of k-median++, and our method runs faster when the number of centers is large.


Figure 7: Empirical running time of HST initialization vs. k-median++ on the MNIST dataset. Left: running time against the number of centers k, on a subset of the data points. Right: running time against the sample size, with a fixed number of centers.

In Figure 7, we plot the empirical running time of HST initialization against k-median++ on the MNIST dataset (the comparison is similar for both distance metrics). From the left subfigure, we see that k-median++ becomes slower as the number of centers k increases, and our method is more efficient when k is large. In the right panel, we observe that the running time of both methods increases with the sample size. Our HST algorithm has a slightly faster growth rate, as predicted by the complexity comparison, but this difference would not be significant unless the sample size is extremely large. Overall, the results suggest that the proposed HST initialization has efficiency similar to k-median++ in common practical scenarios.

6 Conclusion

In this paper, we propose a new initialization framework for the metric k-median problem in general (discrete) metric spaces. Our approach, called HST initialization, leverages tools from metric embedding theory. Our novel tree search approach has efficiency and approximation error comparable to the popular k-median++ initialization. Moreover, we propose a differentially private (DP) HST initialization algorithm, which adapts to the private demand point set, leading to better clustering performance. When combined with the subsequent DP local search heuristic, our algorithm improves the additive error of DP local search and comes close to the theoretical lower bound, within a small factor. Experiments with Euclidean metrics and graph metrics verify the effectiveness of our method, which improves the cost of both the initial centers and the final k-median output.

References

  • Abbasi and Younis (2007) Ameer Ahmed Abbasi and Mohamed F. Younis. A survey on clustering algorithms for wireless sensor networks. Comput. Commun., 30(14-15):2826–2841, 2007.
  • Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035, New Orleans, LA, 2007.
  • Arya et al. (2004) Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004.
  • Bachem et al. (2016) Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Dale Schuurmans and Michael P. Wellman, editors,

    Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI)

    , pages 1459–1467, Phoenix, AZ, 2016.
  • Bahmani et al. (2012) Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proc. VLDB Endow., 5(7):622–633, 2012.
  • Balcan et al. (2013) Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In Advances in Neural Information Processing Systems (NIPS), pages 1995–2003, Lake Tahoe, NV, 2013.
  • Balcan et al. (2017) Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. Differentially private clustering in high-dimensional euclidean spaces. In

    Proceedings of the 34th International Conference on Machine Learning (ICML)

    , pages 322–331, Sydney, Australia, 2017.
  • Banerjee et al. (2005) Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with bregman divergences. J. Mach. Learn. Res., 6:1705–1749, 2005.
  • Bartal (1996) Yair Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science (FOCS), pages 184–193, Burlington, VT, 1996.
  • Beasley (1990) John E Beasley. OR-Library: distributing test problems by electronic mail. Journal of the Operational Research Society, 41(11):1069–1072, 1990.
  • Berkhin (2006) Pavel Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25–71. Springer, 2006.
  • Blelloch et al. (2017) Guy E. Blelloch, Yan Gu, and Yihan Sun. Efficient construction of probabilistic tree embeddings. In Proceedings of the 44th International Colloquium on Automata, Languages, and Programming (ICALP), pages 26:1–26:14, Warsaw, Poland, 2017.
  • Dhillon and Modha (2001) Inderjit S. Dhillon and Dharmendra S. Modha. Concept decompositions for large sparse text data using clustering. Mach. Learn., 42(1/2):143–175, 2001.
  • Dwork (2006) Cynthia Dwork. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP),Part II, pages 1–12, Venice, Italy, 2006.
  • Fakcharoenphol et al. (2004) Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. J. Comput. Syst. Sci., 69(3):485–497, 2004.
  • Feldman et al. (2009) Dan Feldman, Amos Fiat, Haim Kaplan, and Kobbi Nissim. Private coresets. In

    Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC)

    , pages 361–370, Bethesda, MD, 2009.
  • Feldman et al. (2017) Dan Feldman, Chongyuan Xiang, Ruihao Zhu, and Daniela Rus. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pages 3–15, Pittsburgh, PA, 2017.
  • Gupta et al. (2010) Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar.

    Differentially private combinatorial optimization.

    In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1106–1125, Austin, TX, 2010.
  • Huang and Liu (2018) Zhiyi Huang and Jinyan Liu. Optimal differentially private algorithms for k-means clustering. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), pages 395–408, Houston, TX, 2018.
  • Kanungo et al. (2002) Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry (CG), pages 10–18, Barcelona, Spain, 2002.
  • Kaufman et al. (1977) Leon Kaufman, Marc Vanden Eede, and Pierre Hansen. A plant and warehouse location problem. Journal of the Operational Research Society, 28(3):547–554, 1977.
  • Lattanzi and Sohler (2019) Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3662–3671, Long Beach, CA, 2019.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lloyd (1982) Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–136, 1982.
  • Makarychev et al. (2019) Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1027–1038, Phoenix, AZ, 2019.
  • McSherry and Talwar (2007) Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 94–103, Providence, RI, 2007.
  • Nock et al. (2016) Richard Nock, Raphaël Canyasse, Roksana Boreli, and Frank Nielsen. k-variates++: more pluses in the k-means++. In Proceedings of the 33nd International Conference on Machine Learning (ICML), pages 145–154, New York City, NY, 2016.
  • Punj and Stewart (1983) Girish Punj and David W Stewart. Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 20(2):134–148, 1983.
  • Resende and Werneck (2007) Mauricio G. C. Resende and Renato Fonseca F. Werneck. A fast swap-based local search procedure for location problems. Ann. Oper. Res., 150(1):205–230, 2007.
  • Shah (2003) Rahul Shah. Faster algorithms for k-median problem on trees with smaller heights. Technical report, 2003.
  • Stemmer and Kaplan (2018) Uri Stemmer and Haim Kaplan. Differentially private k-means with constant multiplicative error. In Advances in Neural Information Processing Systems (NeurIPS), pages 5436–5446, Montréal, Canada, 2018.
  • Tamir (1996) Arie Tamir. An algorithm for the p-median and related problems on tree graphs. Oper. Res. Lett., 19(2):59–64, 1996.
  • Todo et al. (2019) Keisuke Todo, Atsuyoshi Nakamura, and Mineichi Kudo. A fast approximate algorithm for k-median problem on a graph. In Proceedings of the 15th International Workshop on Mining and Learning with Graphs (MLG), Anchorage, AK, 2019.

Appendix A Proofs

The following composition result of differential privacy will be used in our proof.

Theorem A.1 (Composition Theorem [Dwork, 2006]).

If algorithms A_1, ..., A_m are ε_1, ..., ε_m-differentially private, respectively, then their combination is (Σ_i ε_i)-differentially private.

A.1 Proof of Lemma 3.3

Proof.

Consider the intermediate output of Algorithm LABEL:alg:new_initial-NDP, , which is the set of roots of the minimal subtrees each containing exactly one output center . Suppose one of the optimal “root set” that minimizes (6) is . If , the proof is done. Thus, we prove the case for . Note that are disjoint subtrees. We have the following reasoning.

  • Case 1: for some , is a descendant node of . Since the optimal center point is a leaf node by the definition of (6), we know that there must exist one child node of that expands a subtree which contains . Therefore, we can always replace by one of its child nodes. Hence, we can assume that is not a descendant of .

    Note that, we have if . Algorithm LABEL:alg:new_initial-NDP sorts all the nodes based on cost value, and it would have more priority to pick than if and is not a child node of .

  • Case 2: for some , is a descendant of . In this case, optimal center point , which is a leaf of , must also be a leaf node of . We can simply replace with the swap which does not change . Hence, we can assume that is not a descendant of .

  • Case 3: Otherwise. By the construction of , we know that when . Consider the swap between and . By the definition of tree distance, we have , since does not contain any center of the optimal solution determined by (which is also the optimal “root set” for ).

Thus, we only need to consider Case 3. Let us consider the optimal clustering with center set be (each center is a leaf of subtree whose root be ), and be the leaves assigned to . Let denote the set of leaves in whose distance to is strictly smaller than its distance to any centers in . Let denote the union of paths between leaves of to its closest center in . Let be the nodes in with highest level satisfying . The score of is . That means the swap with a center into can only reduce to (the tree distance between any leaf in and its closest center in is at most ). We just use to represent for later part of this proof for simplicity. By our reasoning, summing all the swaps over gives

Also, based on our discussion on Case 1, it holds that

Summing them together, we have . ∎

A.2 Proof of Lemma 3.4

Proof.

Since the subtrees in are disjoint, it suffices to consider one subtree with root . With a little abuse of notation, let denote the optimal k-median cost within the point set with one center in the 2-HST:

(9)

which is the optimal cost within the subtree. Suppose has more than one child; otherwise the optimal center is clear. Suppose the optimal solution of chooses a leaf node in , and our HST initialization algorithm picks a leaf of . If they coincide, then HST chooses the optimal one and the argument holds trivially. Thus, we consider the case where they differ. We have the following two observations:

  • Since one needs to pick a leaf of to minimize , we have where denotes the children nodes of .

  • By our greedy strategy, .

Since , we have

since our algorithm picks subtree roots with the highest scores. Then we have . Since the subtrees in are disjoint, the union of centers for forms the optimal centers with size . Note that, for any data point, the tree distance to every leaf node of the same subtree is the same. That is, the choice of leaf in as the center does not affect the k-median cost under the 2-HST metric. Therefore, summing over the subtree costs completes the proof. ∎

A.3 Proof of Proposition 3.1

Proof.

It is known that the 2-HST can be constructed in  [Bartal, 1996]. The subtree search in Algorithm LABEL:alg:new_initial-NDP involves at most sorting all the nodes in the HST based on the score, which takes . We use a priority queue to store the selected nodes; when we insert a new node into the queue, its parent node (if it exists in the queue) is removed from the queue. The number of nodes is , and each operation (insertion, deletion) in a score-based priority queue has  complexity. Lastly, the total time of the leaf search is , as FIND-LEAF only requires a top-down scan in the disjoint subtrees. Summing these parts together proves the claim. ∎

A.4 Proof of Theorem 4.2

Similarly, we prove the error in the general metric by first analyzing the error in the 2-HST metric; the result then follows from Lemma 3.2. Let the quantities be defined analogously to (5), (6) and (7), with the summation taken over the demand set instead of the universe. That is,

(10)
(11)
(12)

We have the following.

Lemma A.2.

with probability .

Proof.

The result follows by combining the following Lemma A.4, Lemma A.5, and applying union bound. ∎

Lemma A.3.

For any node in , with probability , .

Proof.

Since , we have

As , we have

Hence, for some constant ,

Lemma A.4 (DP Subtree Search).

With probability , .

Proof.

The proof is similar to that of Lemma 3.3. Consider the intermediate output of Algorithm LABEL:alg:new_initial-NDP, , which is the set of roots of the minimal disjoint subtrees each containing exactly one output center . Suppose one of the optimal “root set” that minimizes (6) is . Assume . By the same argument as the proof of Lemma 3.3, we consider for some such that , where is not a descendent of and is either a descendent of . By the construction of , we know that when . Consider the swap between and . By the definition of tree distance, we have , since does not contain any center of the optimal solution determined by (which is also the optimal “root set” for ). Let us consider the optimal clustering with center set be (each center is a leaf of subtree whose root be ), and be the leaves assigned to . Let denote the set of leaves in whose distance to is strictly smaller than its distance to any centers in . Let denote the union of paths between leaves of to its closest center in . Let be the nodes in with highest level satisfying . The score of is . That means the swap with a center into can only reduce to (the tree distance between any leaf in and its closest center in is at most ). We just use to represent for later part of this proof for simplicity. Summing all the swaps over , we obtain

Applying union bound with Lemma A.3, with probability , we have

Consequently, we have with probability, ,

Lemma A.5 (DP Leaf Search).

With probability , Algorithm LABEL:alg:new_initial_DP produces initial centers with .

Proof.

The proof strategy follows Lemma 3.4. We first consider one subtree with root . Let denote the optimal k-median cost within the point set with one center in the 2-HST:

(13)

Suppose has more than one child, the optimal solution of chooses a leaf node in , and our HST initialization algorithm picks a leaf of . If they coincide, then HST chooses the optimal one and the argument holds trivially. Thus, we consider the case where they differ. We have the following two observations:

  • Since one needs to pick a leaf of to minimize , we have where