1 Introduction
Clustering is an important problem in unsupervised learning that has been widely studied in statistics, data mining, network analysis, etc.
(Punj and Stewart, 1983; Dhillon and Modha, 2001; Banerjee et al., 2005; Berkhin, 2006; Abbasi and Younis, 2007). The goal of clustering is to partition a set of data points into clusters such that items in the same cluster are similar, while items in different clusters are dissimilar. This is concretely measured by the sum of distances (or squared distances) between each point and its nearest cluster center. One conventional notion for evaluating a clustering algorithm is: with high probability,
where the first quantity is the cost, on the dataset, of the centers output by the algorithm, and the second is the cost of the optimal (oracle) clustering solution on the same dataset. When everything is clear from context, we write the cost for short. Here, the factor multiplying the optimal cost is called the multiplicative error, and the extra term is called the additive error. Alternatively, we may also use the notion of expected cost.
Two popularly studied clustering problems are 1) the k-median problem, and 2) the k-means problem. The origin of k-median dates back to the 1970s (e.g., Kaufman et al. (1977)), where one tries to find the best locations of facilities that minimize the cost measured by the distance between clients and facilities. Formally, given a set of points and a distance measure, the goal is to find k center points minimizing the sum of absolute distances of each sample point to its nearest center. In k-means, the objective is to minimize the sum of squared distances instead. In particular,
k-median is usually the method of choice for clustering on graph/network data. In general, there are two popular frameworks for clustering. One heuristic is Lloyd's algorithm
(Lloyd, 1982), which is built upon an iterative distortion-minimization approach. In most cases, this method can only be applied to numerical data, typically in (continuous) Euclidean space. Clustering in general (discrete) metric spaces is also important and useful when dealing with, for example, graph data, where Lloyd's method is no longer applicable. A more broadly applicable approach, the local search method (Kanungo et al., 2002; Arya et al., 2004), has also been widely studied. It iteratively finds the optimal swap between the center set and the non-center data points to keep lowering the cost. Local search can achieve a constant approximation ratio to the optimal solution for k-median (Arya et al., 2004). In this paper, we focus on clustering in this general metric space setting.
Initialization of cluster centers. It is well-known that the performance of clustering can be highly sensitive to initialization. If clustering starts with good initial centers (i.e., with small approximation error), the algorithm may use fewer iterations to find a better solution. The k-median++ algorithm (Arthur and Vassilvitskii, 2007) iteratively selects data points as initial centers, favoring distant points in a probabilistic way. Intuitively, the initial centers tend to be well spread over the data points (i.e., over different clusters). The produced initial centers are proved to have bounded multiplicative error. Follow-up works on k-means++ further improved its efficiency and scalability, e.g., Bahmani et al. (2012); Bachem et al. (2016); Lattanzi and Sohler (2019). In this work, we propose a new initialization framework, called Hierarchically Well-Separated Tree (HST) initialization, based on metric embedding techniques. Our method is built upon a novel search algorithm on metric embedding trees, with approximation error and running time comparable to k-median++.
Moreover, and importantly, our initialization scheme can be conveniently combined with the notion of differential privacy (DP).
Clustering with Differential Privacy. The concept of differential privacy (Dwork, 2006; McSherry and Talwar, 2007) has become popular for rigorously defining and resolving the problem of keeping useful information for model learning while protecting the privacy of each individual. The private k-means problem has been widely studied, e.g., Feldman et al. (2009); Nock et al. (2016); Feldman et al. (2017), mostly in the continuous Euclidean space. Balcan et al. (2017) considered identifying, in a private manner, a good candidate set of centers before applying private local search, which yields bounded multiplicative and additive errors. Later on, the Euclidean k-means errors were further improved by Stemmer and Kaplan (2018) with more advanced candidate set selection. Huang and Liu (2018) gave an optimal algorithm in terms of minimizing the Wasserstein distance, under some data separability condition.
For private k-median clustering, Feldman et al. (2009) considered the problem in high-dimensional Euclidean space. The strategy of Balcan et al. (2017) to form a candidate center set can also be adapted to k-median, which leads to bounded multiplicative and additive errors in high-dimensional Euclidean space. However, one main limitation of these methods is that they cannot be applied to general metric spaces (e.g., on graphs). In discrete spaces, Gupta et al. (2010) proposed a private method for the classical local search heuristic, which applies to both k-median and k-means. To cast privacy on each swapping step, the authors applied the exponential mechanism of McSherry and Talwar (2007). Their method produces a differentially private solution whose cost depends on the diameter of the point set. In this work, we show that our HST initialization can improve DP local search for k-median (Gupta et al., 2010) in terms of both approximation error and efficiency.
The main contributions of this work include:

We introduce the HST (Fakcharoenphol et al., 2004) to the k-median clustering problem for initialization. We design an efficient sampling strategy to select the initial center set from the tree, with a provable approximation factor in the non-private setting (which simplifies further for, e.g., bounded data). This improves the error of k-means/k-median++ in, e.g., lower-dimensional Euclidean space.

We propose a differentially private version of HST initialization under the setting of Gupta et al. (2010) in discrete metric spaces. The so-called DP-HST algorithm finds initial centers with bounded multiplicative and additive errors. Moreover, running DP local search starting from this initialization improves both errors over previous results, approaching the well-known lower bound on the additive error of DP k-median (Gupta et al., 2010) within a small factor. To our knowledge, this is the first initialization method with a differential privacy guarantee and improved error rate in general metric spaces.

We conduct experiments on simulated and real-world datasets to demonstrate the effectiveness of our methods. In both the non-private and private settings, our proposed HST-based initialization achieves smaller initial cost than k-median++ (i.e., finds better initial centers), which may also lead to improvements in the final clustering quality.
2 Background and Setup
2.1 Differential Privacy (DP)
Definition 2.1 (Differential Privacy (DP) (Dwork, 2006)).
If for any two adjacent datasets D and D′ with symmetric difference of size one, and for any set of possible outputs O, an algorithm A satisfies
Pr[A(D) ∈ O] ≤ e^ε · Pr[A(D′) ∈ O],
then algorithm A is said to be ε-differentially private (ε-DP).
Intuitively, differential privacy requires that after removing any data point (a graph node in our case), the output of A should not differ too much from that on the original dataset D. A smaller ε indicates stronger privacy, which, however, usually sacrifices utility. Thus, one of the central topics in the differential privacy literature is to balance the utility-privacy trade-off.
To achieve DP, one approach is to add noise to the algorithm output. The Laplace mechanism adds Laplace(Δ/ε) noise to the output, where Δ is the sensitivity of the output, and is known to achieve ε-DP. The exponential mechanism is another tool underlying many DP algorithms. Let O be the set of feasible outputs and let u(D, o) be the utility function we aim to maximize, with input dataset D. The exponential mechanism outputs an element o ∈ O with probability proportional to exp(ε u(D, o) / (2Δ)), where Δ is the sensitivity of u. Both mechanisms will be used in this paper.
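As a concrete illustration, the two mechanisms can be sketched as follows (a minimal Python sketch; the function names and the sampling details are our own, not the paper's):

```python
import math
import random

def laplace_mechanism(value, sensitivity, epsilon):
    # Add Laplace(sensitivity/epsilon) noise; this releases the value with epsilon-DP.
    scale = sensitivity / epsilon
    # A Laplace variate is the difference of two i.i.d. Exponential(1) draws, scaled.
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return value + noise

def exponential_mechanism(candidates, utility, epsilon, sensitivity):
    # Sample candidate r with probability proportional to
    # exp(epsilon * utility(r) / (2 * sensitivity)).
    scores = [epsilon * utility(r) / (2.0 * sensitivity) for r in candidates]
    m = max(scores)                          # shift scores for numerical stability
    weights = [math.exp(s - m) for s in scores]
    x = random.random() * sum(weights)
    for r, w in zip(candidates, weights):
        x -= w
        if x <= 0:
            return r
    return candidates[-1]                    # guard against floating-point leftovers
```

Higher-utility candidates are exponentially more likely to be chosen, while any candidate retains nonzero probability, which is what yields the privacy guarantee.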
2.2 Metric k-Median Clustering
Following Arya et al. (2004); Gupta et al. (2010), the metric k-median clustering problem (DP and non-DP) studied in our paper is stated as follows.
Definition 2.2 (k-median).
Given a universe point set U and a metric d(·, ·), the goal of k-median is to pick a center set C ⊆ U with |C| = k to minimize
cost(U, C) = Σ_{x∈U} min_{c∈C} d(x, c).   (3)
Let D ⊆ U be a set of demand points. The goal of DP k-median is to minimize
cost(D, C) = Σ_{x∈D} min_{c∈C} d(x, c).   (4)
At the same time, the output center set C is required to be differentially private with respect to D. We may drop the arguments and simply write cost(U) or cost(D) when there is no risk of ambiguity.
To better understand the motivation for DP clustering, we provide a real-world example as follows.
Example 2.1.
Consider U to be the universe of all users in a social network (e.g., Twitter). Each user (account) is public, but also has some private information that can only be seen by the data holder. Let D be a set of users grouped by some feature that might be private. Suppose a third party plans to collaborate with the most influential users in D for, e.g., commercial purposes, and thus requests the cluster centers of D. In this case, we need a strategy to safely release the centers, while protecting the individuals in D from being identified (since membership in D is private).
The local search procedure for k-median proposed by Arya et al. (2004) is summarized in Algorithm LABEL:alg:rand_local_search. First, we randomly pick k points in U as the initial centers. In each iteration, we search over all center and non-center pairs, and perform the swap that improves the cost the most (provided the improvement exceeds a threshold factor controlled by a hyper-parameter). We repeat this procedure until no such swap exists. Arya et al. (2004) showed that the output centers achieve a 5-approximation to the optimal solution, i.e., the cost is at most 5 times the optimal cost.
algocf[t]
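The local search procedure above can be sketched as follows (a simplified Python sketch; the threshold parameter `alpha` and the helper names are illustrative assumptions, not the paper's notation):

```python
import random

def kmedian_cost(points, centers, dist):
    # Sum of distances from each point to its nearest center.
    return sum(min(dist(x, c) for c in centers) for x in points)

def local_search_kmedian(points, k, dist, alpha=1e-3, max_iter=100):
    # Local search in the style of Arya et al. (2004): start from k random
    # centers and repeatedly apply the single center/non-center swap that
    # lowers the cost the most, provided it beats a (1 - alpha/k) threshold.
    centers = random.sample(points, k)
    cost = kmedian_cost(points, centers, dist)
    for _ in range(max_iter):
        best = None
        for c in centers:
            for x in points:
                if x in centers:
                    continue
                trial = [x if y == c else y for y in centers]
                trial_cost = kmedian_cost(points, trial, dist)
                if trial_cost < (1 - alpha / k) * cost and \
                        (best is None or trial_cost < best[0]):
                    best = (trial_cost, trial)
        if best is None:          # no sufficiently improving swap: stop
            break
        cost, centers = best
    return centers, cost
```

Each sweep costs O(nk) cost evaluations, which is why a cheap but good initialization can save many of these expensive iterations.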
2.3 k-median++ Initialization
Although local search is able to find a solution with constant error, each iteration is computationally expensive (Resende and Werneck, 2007), and many iterations may be needed when starting from a random center set, which would be slow for large datasets. Indeed, such a heavyweight algorithm is not needed to reduce the cost at the beginning, i.e., when the cost is still large. To accelerate the process, efficient initialization methods find a "roughly" good center set as the starting point for local search. In this paper, we compare our new initialization scheme mainly with a popular (and perhaps the most well-known) initialization method, k-median++ (Arthur and Vassilvitskii, 2007) (see Algorithm LABEL:alg:kmedian++).
algocf[h]
Here, the function gives the shortest distance from a data point to its closest point in the current center set. Arthur and Vassilvitskii (2007) showed that the centers output by k-median++ achieve an O(log k) approximation error in expectation. Starting from this initialization, we only need a few steps of the computationally heavy local search to reach a constant-error solution. Thus, initialization may greatly improve the clustering efficiency.
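For concreteness, k-median++ seeding with the distance-tracking trick can be sketched as follows (a Python sketch with assumed function names; note that for k-median the sampling weight is the distance itself, not its square):

```python
import random

def kmedianpp_init(points, k, dist):
    # k-median++ seeding (Arthur and Vassilvitskii, 2007): the first center is
    # uniform; each next center is sampled with probability proportional to its
    # distance to the closest center chosen so far.
    centers = [random.choice(points)]
    # d[i] tracks the distance from points[i] to its nearest current center.
    d = [dist(x, centers[0]) for x in points]
    while len(centers) < k:
        total = sum(d)
        r = random.random() * total
        for i, x in enumerate(points):
            r -= d[i]
            if r <= 0:
                break
        centers.append(points[i])
        # Update nearest-center distances in one pass over the data.
        d = [min(d[j], dist(y, points[i])) for j, y in enumerate(points)]
    return centers
```

Maintaining the nearest-center distances incrementally means each new center costs only one pass over the data, which is the "smart implementation" referred to in Section 5.4.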
3 Initialization via Hierarchically Well-Separated Tree (HST)
In this section, we propose our novel initialization scheme for k-median clustering and provide an analysis in the non-private case, solving (3). The idea is based on metric embedding theory. We start with an introduction to the main tool used in our approach.
3.1 Hierarchically Well-Separated Tree (HST)
In this paper, for an L-level tree, we count levels in descending order down the tree; we use h(v) to denote the level of a node v, and n_i the number of nodes at level i. The Hierarchically Well-Separated Tree (HST) is based on padded decompositions of a general metric space, applied in a hierarchical manner (Fakcharoenphol et al., 2004). Let (U, d) be a metric space with |U| = n; we refer to this metric space throughout without further clarification. A β-padded decomposition of U is a probability distribution over partitions of U such that the diameter of each cluster in the partition is at most β. The formal definition of an HST is given below.
Definition 3.1.
Assume the diameter of U is Δ and denote L = ⌈log₂ Δ⌉. A μ-Hierarchically Well-Separated Tree (μ-HST) with depth L is an edge-weighted rooted tree T, such that an edge between any pair of nodes at level i−1 and level i has length at most Δ/μ^{L−i}.
In this paper, we consider μ = 2 (i.e., 2-HSTs) for simplicity, as μ only affects the constants in our theoretical analysis. As presented in Algorithm LABEL:alg:new_hst, the construction starts by applying a random permutation on U, so that in the following steps the points are picked in a random sequence. We first find a padded decomposition of U; the center of each partition serves as a node at the top non-root level. Then, we redo a padded decomposition within each partition to find sub-partitions of half the diameter, and set the corresponding centers as the nodes one level below, and so on. Each partition at level i thus has diameter at most 2^i. This process proceeds until a node contains a single point, or a pre-specified tree depth is reached.
In Figure 1, we provide an example of a 2-HST (left panel), along with its underlying padded decompositions (right panel). Besides this basic implementation, presented for better illustration, Blelloch et al. (2017) proposed an efficient HST construction running in time near-linear in the number of nodes and edges of the graph.
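A simplified sketch of the hierarchical construction described above (random permutation, random radius multiplier, clusters refined level by level) is given below; this is an illustrative simplification, not the exact Algorithm LABEL:alg:new_hst:

```python
import math
import random

def build_2hst(points, dist):
    # Hierarchical decomposition in the spirit of FRT (Fakcharoenphol et al.,
    # 2004): at level i, each surviving cluster is cut into pieces of radius
    # at most beta * 2^i around points taken in random priority order.
    # Returns one list of clusters per level, from coarsest to finest.
    diam = max(dist(x, y) for x in points for y in points)
    L = max(1, math.ceil(math.log2(max(diam, 1))))
    pi = list(points)
    random.shuffle(pi)                   # random priority order over all points
    beta = random.uniform(0.5, 1.0)      # random radius multiplier
    levels = [[list(points)]]            # top level: one cluster with everything
    for i in range(L - 1, -1, -1):
        radius = beta * (2 ** i)
        new_level = []
        for cluster in levels[-1]:
            remaining = set(range(len(cluster)))
            # Each point joins the first priority point within `radius` of it.
            for p in pi:
                part = [j for j in remaining if dist(cluster[j], p) <= radius]
                if part:
                    new_level.append([cluster[j] for j in part])
                    remaining -= set(part)
        levels.append(new_level)
    return levels
```

Each cluster at level i has diameter at most 2 * beta * 2^i, matching the exponential decay of cluster diameters used throughout the analysis.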
algocf[t]
algocf[t]
The first step of our method is to embed the data points into an HST (see Algorithm LABEL:alg:new_initialNDP). Next, we describe our proposed new strategy to search for the initial centers on the tree (w.r.t. the tree metric). Before moving on, it is worth mentioning that there are polynomial-time algorithms for computing an exact k-median solution in a tree metric (Tamir (1996); Shah (2003)). However, these dynamic programming algorithms have high complexity, making them unsuitable for the purpose of fast initialization. Moreover, it is unknown how to apply them effectively in the private case. As will be shown, our new algorithm 1) is very efficient, 2) gives a provably bounded approximation error in the tree metric, and 3) can easily be extended to the DP setting (Section 4).
3.2 HST Initialization Algorithm
Suppose T is an L-level 2-HST of the point set, where we assume L is an integer. For a node v, we use T(v) to denote the subtree rooted at v, and N(v) the number of data points in T(v). The search strategy for the initial centers, NDP-HST initialization ("NDP" stands for "Non-Differentially Private"), is presented in Algorithm LABEL:alg:new_initialNDP, and consists of two phases.
Subtree search. The first step is to identify the subtrees that contain the centers. To begin with, k initial nodes are picked as those with the largest scores. This is intuitive: to get a good clustering, we typically want the ball surrounding each center to include more data points. Next, we do a screening over the chosen nodes: if there is any ancestor-descendant pair of nodes, we remove the ancestor. If the current number of chosen nodes is smaller than k, we repeat the process until k nodes are chosen (we do not reselect removed nodes or their ancestors). This way, the chosen set contains the root nodes of k disjoint subtrees.
algocf[h]
Leaf search. After we find the set of k subtrees, the next step is to find a center within each subtree using Algorithm LABEL:alg:new_initial_prune ("FIND-LEAF"). We employ a greedy search strategy: find the child node with the largest score, level by level, until a leaf is found. This approach is intuitive, since the diameter of the partition ball decays exponentially with the level. Therefore, we are in a sense focusing more and more on the region with higher density (i.e., with more data points).
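The two phases can be sketched as follows (a Python sketch over a dictionary-based tree; taking the node score to be N(v) weighted by 2^level is our assumption for illustration, chosen to match the intuition that larger, higher-level subtrees are preferred):

```python
def find_leaf(node):
    # Greedy leaf search (FIND-LEAF sketch): descend level by level, always
    # entering the child whose subtree holds the most points.
    while node["children"]:
        node = max(node["children"], key=lambda c: c["count"])
    return node["point"]

def hst_initial_centers(root, k, score):
    # Subtree search sketch: scan nodes in decreasing score order, keeping
    # only pairwise-disjoint subtrees (a descendant displaces a previously
    # chosen ancestor, mirroring the screening step), then run FIND-LEAF
    # inside each chosen subtree.
    nodes = []
    def walk(v, path):
        nodes.append((v, path))          # path = child indices from the root
        for i, c in enumerate(v["children"]):
            walk(c, path + (i,))
    walk(root, ())
    nodes.sort(key=lambda t: score(t[0]), reverse=True)
    chosen = []
    for v, path in nodes:
        # Screening: drop any already-chosen ancestor of v.
        chosen = [(u, q) for u, q in chosen if path[:len(q)] != q]
        # Skip v if it is an ancestor of an already-chosen node.
        if any(q[:len(path)] == path for u, q in chosen):
            continue
        chosen.append((v, path))
        if len(chosen) == k:
            break
    return [find_leaf(v) for v, _ in chosen]
```

Two subtrees are disjoint exactly when neither root-to-node path is a prefix of the other, which is what the prefix checks test.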
The complexity of our search algorithm is given as follows.
Proposition 3.1 (Complexity).
Algorithm LABEL:alg:new_initialNDP takes time near-linear in the number of data points in the Euclidean space.
Remark 3.1.
The complexity of HST initialization is in general comparable to that of k-median++. Our algorithm would be faster when the number of centers k is large, since its cost does not grow with k. A similar comparison also holds for general metrics.
3.3 Approximation Error of HST Initialization
First, we show that the initial center set produced by NDP-HST is already a good approximation to the optimal k-median solution. Let d_T(x, y) denote the "2-HST metric" between x and y, i.e., the tree distance between the corresponding nodes in the 2-HST T. By Definition 3.1, in the analysis we may equivalently assume that the edge weight at the i-th level is 2^i. The crucial step of our analysis is to examine the approximation error in terms of the 2-HST metric, after which the error can be transferred to general metrics by the following lemma (Bartal, 1996).
Lemma 3.2.
In a metric space of n points with diameter Δ, the 2-HST metric satisfies d(x, y) ≤ d_T(x, y), and in expectation d_T(x, y) exceeds d(x, y) by at most a logarithmic distortion factor. In the Euclidean space, a dimension-dependent distortion factor applies.
Recall the notation from Algorithm LABEL:alg:new_initialNDP. We define

cost_T(U, C0) = Σ_{x∈U} min_{c∈C0} d_T(x, c),   (5)

cost_T'(U) = min over C picking one center per chosen subtree of Σ_{x∈U} min_{c∈C} d_T(x, c),   (6)

OPT_T(U) = min_{C⊆U, |C|=k} Σ_{x∈U} min_{c∈C} d_T(x, c),   (7)

where C0 is the center set output by Algorithm LABEL:alg:new_initialNDP. For simplicity, we drop the arguments when they are clear from context. Here, (7) is the cost of the globally optimal solution under the 2-HST metric; the minimizing center set can always be located in k disjoint subtrees, as each leaf contains only one point. (5) is the k-median cost, under the 2-HST metric, of the output of Algorithm LABEL:alg:new_initialNDP. (6) is the oracle cost after the subtrees are chosen, i.e., the optimal cost of picking exactly one center from each subtree found by the subtree search. We first bound the approximation errors of the subtree search and the leaf search, respectively.
Lemma 3.3 (Subtree search).
.
Lemma 3.4 (Leaf search).
.
Theorem 3.5 (2HST error).
Running Algorithm LABEL:alg:new_initialNDP, we have .
Thus, HST initialization produces a provably good approximation to the optimal cost in the 2-HST metric. Define cost(U, C0) as in (3) for our HST centers C0, and the optimal cost w.r.t. the original metric as
OPT = min_{C⊆U, |C|=k} Σ_{x∈U} min_{c∈C} d(x, c).   (8)
We have the following result based on Lemma 3.2.
Theorem 3.6.
In general metric spaces, the expected k-median cost of Algorithm LABEL:alg:new_initialNDP is within the distortion factor of Lemma 3.2 times the optimal cost.
Remark 3.2.
NDP-HST Local Search. We are interested in the approximation quality of the standard local search (Algorithm LABEL:alg:rand_local_search) when initialized by our NDP-HST.
Theorem 3.7.
NDP-HST local search achieves a constant approximation error, with a reduced expected number of iterations, for inputs in general metric spaces.
Before ending this section, we remark that the initial centers found by NDP-HST can analogously be used for k-means clustering; for general metrics, a similar guarantee holds with respect to the optimal k-means cost. See Appendix B for the detailed (and similar) analysis.
4 HST Initialization with Differential Privacy
In this section, we consider initialization with differential privacy (DP). Recall from (4) that U is the universe of data points and D ⊆ U is the demand set that needs to be clustered with privacy. Since U is public, simply running initialization algorithms on U would preserve the privacy of D. However, 1) this might be too expensive; and 2) in many cases one would want to incorporate some information about D in the initialization, since D could be a very imbalanced subset of U. For example, D may contain data points from only one cluster, out of tens of clusters in U. In this case, initialization on U is likely to pick initial centers in multiple clusters, which would not be helpful for clustering on D. Next, we show how our proposed HST initialization can be easily combined with differential privacy in a way that also incorporates information about the demand set D, leading to improved approximation error (Theorem 4.3). Again, suppose T is an L-level 2-HST of the universe U in a general metric space. Denote by N(v) the number of demand points in the subtree rooted at a node v. Our private HST initialization (DP-HST) is similar to the non-private Algorithm LABEL:alg:new_initialNDP. To gain privacy, we perturb N(v) by adding i.i.d. Laplace noise:
where the added noise is a Laplace random number whose rate is determined by the privacy budget and the tree depth. We use the perturbed counts for node sampling instead of the true values N(v), as described in Algorithm LABEL:alg:new_initial_DP. The DP guarantee of this initialization scheme follows directly from the composition theorem (Dwork, 2006).
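A minimal sketch of the count perturbation (assuming, as one natural choice, that the privacy budget is split evenly across the L levels; the function names are ours):

```python
import random

def perturb_counts(level_counts, epsilon):
    # level_counts: one list of disjoint-subtree point counts per tree level.
    # Add i.i.d. Laplace noise to every count. Subtrees on one level are
    # disjoint, so removing one demand point changes one count per level;
    # spending epsilon/L per level (Laplace scale L/epsilon) then gives
    # epsilon-DP in total by composition (an assumed, standard budget split).
    L = len(level_counts)
    scale = L / epsilon

    def laplace(b):
        # Laplace(b) as the difference of two i.i.d. Exponential draws.
        return b * (random.expovariate(1.0) - random.expovariate(1.0))

    return [[c + laplace(scale) for c in counts] for counts in level_counts]
```

The subsequent subtree and leaf searches then rank nodes by these noisy counts instead of the true ones.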
algocf[h]
Theorem 4.1.
Algorithm LABEL:alg:new_initial_DP is differentially private.
Proof.
For each level, the subtrees are disjoint from each other, so removing one demand point changes only one subtree count per level. The privacy budget consumed at each level is therefore bounded, and summing over the L levels yields the total privacy guarantee by composition. ∎
We now consider the approximation error. As the structure of the analysis is similar to the non-DP case, we present the main result here and defer the detailed proofs to Appendix A.
Theorem 4.2.
Algorithm LABEL:alg:new_initial_DP finds initial centers such that
DP-HST Local Search. Similarly, we can use private HST initialization to improve the performance of private k-median local search, as presented in Algorithm LABEL:alg:DPHST. After initialization, the DP local search procedure follows Gupta et al. (2010), using the exponential mechanism.
algocf[h]
Theorem 4.3.
Algorithm LABEL:alg:DPHST achieves differential privacy. With high probability, the output centers admit
within a bounded number of iterations.
The DP local search with random initialization (Gupta et al., 2010) has multiplicative error 6 and a larger additive error. Our result improves a factor in the additive error, and the number of iterations needed is also reduced (see Section 5.3 for an empirical justification). Notably, it has been shown in Gupta et al. (2010) that for the k-median problem there are lower bounds on the multiplicative and additive errors of any DP algorithm. Our result matches the lower bound on the multiplicative error, and our additive error exceeds the lower bound by only a factor that would be small in many cases. To our knowledge, Theorem 4.3 is the first result in the literature to improve the error of DP local search in general metric spaces.
5 Experiments
5.1 Datasets and Algorithms
Discrete Euclidean space. Following previous work, we test
k-median clustering on the MNIST handwritten digit dataset
(LeCun et al., 1998) with 10 natural clusters (digits 0 to 9). We set U to be 10,000 randomly chosen data points. We choose the demand set D using two strategies: 1) "balanced", where we randomly choose 500 samples from U; 2) "imbalanced", where D contains 500 random samples drawn only from digits "0" and "8" (two clusters). We note that the imbalanced case is a very practical setting in real-world scenarios, where data are typically not uniformly distributed. On this dataset, we test clustering with two different distances as the underlying metric.
Metric space induced by graph. Random graphs have been widely considered for testing k-median methods (Balcan et al., 2013; Todo et al., 2019). The construction of the graphs follows a similar approach as the synthetic pmedinfo graphs provided by the popular OR-Library (Beasley, 1990). The metric for this experiment is the shortest (weighted) path distance. To generate a graph of a given size, we first randomly split the nodes into clusters. Within each cluster, each pair of nodes is connected with some probability, with edge weight drawn from the standard uniform distribution. For each pair of clusters, we randomly connect some nodes from each cluster, with heavier weights drawn from a shifted uniform distribution; a larger shift makes the graph more separable, i.e., clusters are farther from each other. In Figure 2, we plot two example graphs (subgraphs of 50 nodes) at the two separability levels considered in our experiments. For this task, U has 3000 nodes, and the private set D (500 nodes) is chosen using the same "balanced" and "imbalanced" schemes as described above. In the imbalanced case, we choose D randomly from only two clusters.
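The generator described above can be sketched as follows (the parameter names `p_in`, `sep`, and `inter_edges`, and the exact inter-cluster weight distribution, are assumptions for illustration):

```python
import random

def clustered_graph(n, k, p_in, sep, inter_edges=2):
    # n nodes split evenly into k clusters. Intra-cluster pairs are joined
    # with probability p_in and Uniform(0, 1) weights; for each cluster pair,
    # a few random nodes are joined with heavier weights shifted by `sep`,
    # so a larger `sep` makes the clusters more separable.
    clusters = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    adj = {u: {} for u in range(n)}      # adjacency dict {u: {v: weight}}

    def add(u, v, w):
        adj[u][v] = adj[v][u] = w        # undirected: store both directions

    for cl in clusters:
        for i, u in enumerate(cl):
            for v in cl[i + 1:]:
                if random.random() < p_in:
                    add(u, v, random.random())
    for a in range(k):
        for b in range(a + 1, k):
            for _ in range(inter_edges):
                u = random.choice(clusters[a])
                v = random.choice(clusters[b])
                add(u, v, sep + random.random())   # shifted uniform (assumed)
    return adj
```

The clustering metric is then the shortest weighted path distance over this graph, which can be computed with any standard all-pairs shortest-path routine.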
Algorithms. We compare the following clustering algorithms in both the non-DP and DP settings: (1) NDP-rand: local search with random initialization; (2) NDP-kmedian++: local search with k-median++ initialization (Algorithm LABEL:alg:kmedian++); (3) NDP-HST: local search with NDP-HST initialization (Algorithm LABEL:alg:new_initialNDP), as described in Section 3; (4) DP-rand: standard DP local search (Gupta et al., 2010), which is Algorithm LABEL:alg:DPHST with initial centers randomly chosen from U; (5) DP-kmedian++: DP local search with k-median++ initialization run on U; (6) DP-HST: DP local search with HST initialization (Algorithm LABEL:alg:DPHST). The parameter settings for the non-DP and DP tasks are as follows.
For non-DP methods, we set the hyper-parameters of Algorithm LABEL:alg:rand_local_search accordingly and cap the maximum number of iterations at 20. To examine the quality of the initialization as well as of the final centers, we report both the cost at initialization and the cost of the final output. For DP methods, we run the algorithms for a fixed number of steps and report results under several privacy budgets. The average cost over iterations is reported for robustness. All results are averaged over 10 independent repetitions.
5.2 Results
The results on the MNIST dataset are given in Figure 3. The comparisons are similar for both distance metrics:

From the left column, the initial centers found by HST have lower cost than those found by k-median++ and random initialization, in both the non-DP and DP settings, and for both the balanced and imbalanced demand set D. This confirms that the proposed HST initialization is more powerful than k-median++ in finding good initial centers.

From the right column, we also observe a lower final cost for HST followed by local search in DP clustering. In the non-DP case, the final cost curves overlap, which means that although HST offers better initial centers, local search can always find a good solution eventually.

The advantage of DP-HST, in terms of both the initial and the final cost, is more significant when D is an imbalanced subset of U. As mentioned before, this is because our DP-HST initialization approach also privately incorporates the information of D.
The results on graphs are reported in Figure 4, with similar conclusions. In all cases, our proposed HST scheme finds better initial centers with smaller cost than k-median++. Moreover, HST again considerably outperforms k-median++ in the private and imbalanced setting, for both the highly separable and the less separable graphs. The advantages of HST over k-median++ are especially significant in the harder tasks where the clusters are nearly mixed up.
5.3 Improved Iteration Cost of DP-HST
In Theorem 4.3, we show that under differential privacy constraints, the proposed DP-HST (Algorithm LABEL:alg:DPHST) improves both the approximation error and the number of iterations required by classical DP local search (Gupta et al., 2010) to find a good solution. In this section, we provide numerical results to support the theory.
First, we need to properly measure the iteration cost of DP local search. This is because, unlike in non-private clustering, the k-median cost after each iteration of DP local search does not decrease monotonically, due to the probabilistic exponential mechanism. To this end, for the cost sequence, we compute its moving-average sequence with a fixed window size. Attaining the minimal value of the moving average indicates that the algorithm has found a "local optimum", i.e., it has reached a "neighborhood" of solutions with small clustering cost. Thus, we use the number of iterations needed to reach such a local optimum as the measure of iteration cost. The results are provided in Figure 5. We see that on all tasks (MNIST with both distance metrics, and the graph datasets at both separability levels), DP-HST has significantly smaller iteration cost. In Figure 6, we further report the k-median cost of the best solution found by each DP algorithm within the iteration budget. We see that DP-HST again provides the smallest cost. This additional set of experiments validates the claims of Theorem 4.3: DP-HST is able to find better centers in fewer iterations.
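The iteration-cost measure described above can be sketched as:

```python
def iteration_cost(costs, window):
    # Smooth the (non-monotone) DP local search cost sequence with a
    # moving average, and return the index at which the smoothed sequence
    # attains its minimum, i.e., the iteration where a "local optimum"
    # neighborhood is reached.
    ma = [sum(costs[i:i + window]) / window
          for i in range(len(costs) - window + 1)]
    return ma.index(min(ma))
```

For example, on a noisy cost trace the smoothed minimum locates where the search settles, ignoring single-step fluctuations from the exponential mechanism.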
5.4 Running Time Comparison with k-median++
In Proposition 3.1, we showed that our HST initialization algorithm admits near-linear complexity in the Euclidean space. With a smart implementation of Algorithm LABEL:alg:kmedian++, in which each data point tracks its distance to the closest candidate center chosen so far, k-median++ is also highly efficient. Therefore, the running time of our algorithm is in general comparable to that of k-median++, and our method runs faster when the number of centers k is large.
In Figure 7, we plot the empirical running time of HST initialization against k-median++ on the MNIST dataset (the comparison is similar for both distance metrics). From the left subfigure, we see that k-median++ becomes slower as k increases, while our method is more efficient for large k. In the right panel, we observe that the running time of both methods increases with the sample size. Our HST algorithm has a slightly faster growth rate, as predicted by the complexity comparison; however, this difference would not be significant unless the sample size is extremely large. Overall, these results suggest that the proposed HST initialization has similar efficiency to k-median++ in common practical scenarios.
6 Conclusion
In this paper, we propose a new initialization framework for the k-median problem in general (discrete) metric spaces. Our approach, called HST initialization, leverages tools from metric embedding theory. The novel tree-search procedure has efficiency and approximation error comparable to the popular k-median++ initialization. Moreover, we propose a differentially private (DP) HST initialization algorithm, which adapts to the private demand point set and leads to better clustering performance. When combined with the subsequent DP local search heuristic, our algorithm improves the additive error of DP local search, approaching the theoretical lower bound within a small factor. Experiments with Euclidean metrics and graph metrics verify the effectiveness of our method, which improves the cost of both the initial centers and the final k-median output.
References
 Abbasi and Younis (2007) Ameer Ahmed Abbasi and Mohamed F. Younis. A survey on clustering algorithms for wireless sensor networks. Comput. Commun., 30(14-15):2826–2841, 2007.
 Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035, New Orleans, LA, 2007.
 Arya et al. (2004) Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004.

 Bachem et al. (2016) Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), pages 1459–1467, Phoenix, AZ, 2016.
 Bahmani et al. (2012) Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proc. VLDB Endow., 5(7):622–633, 2012.
 Balcan et al. (2013) Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In Advances in Neural Information Processing Systems (NIPS), pages 1995–2003, Lake Tahoe, NV, 2013.

 Balcan et al. (2017) Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. Differentially private clustering in high-dimensional Euclidean spaces. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 322–331, Sydney, Australia, 2017.
 Banerjee et al. (2005) Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. J. Mach. Learn. Res., 6:1705–1749, 2005.
 Bartal (1996) Yair Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science (FOCS), pages 184–193, Burlington, VT, 1996.
 Beasley (1990) John E Beasley. ORLibrary: distributing test problems by electronic mail. Journal of the Operational Research Society, 41(11):1069–1072, 1990.
 Berkhin (2006) Pavel Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25–71. Springer, 2006.
 Blelloch et al. (2017) Guy E. Blelloch, Yan Gu, and Yihan Sun. Efficient construction of probabilistic tree embeddings. In Proceedings of the 44th International Colloquium on Automata, Languages, and Programming (ICALP), pages 26:1–26:14, Warsaw, Poland, 2017.
 Dhillon and Modha (2001) Inderjit S. Dhillon and Dharmendra S. Modha. Concept decompositions for large sparse text data using clustering. Mach. Learn., 42(1/2):143–175, 2001.
 Dwork (2006) Cynthia Dwork. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP), Part II, pages 1–12, Venice, Italy, 2006.
 Fakcharoenphol et al. (2004) Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. J. Comput. Syst. Sci., 69(3):485–497, 2004.

Feldman et al. (2009) Dan Feldman, Amos Fiat, Haim Kaplan, and Kobbi Nissim. Private coresets. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 361–370, Bethesda, MD, 2009.
 Feldman et al. (2017) Dan Feldman, Chongyuan Xiang, Ruihao Zhu, and Daniela Rus. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pages 3–15, Pittsburgh, PA, 2017.

Gupta et al. (2010) Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially private combinatorial optimization. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1106–1125, Austin, TX, 2010.
 Huang and Liu (2018) Zhiyi Huang and Jinyan Liu. Optimal differentially private algorithms for k-means clustering. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), pages 395–408, Houston, TX, 2018.
 Kanungo et al. (2002) Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry (CG), pages 10–18, Barcelona, Spain, 2002.
 Kaufman et al. (1977) Leon Kaufman, Marc Vanden Eede, and Pierre Hansen. A plant and warehouse location problem. Journal of the Operational Research Society, 28(3):547–554, 1977.
 Lattanzi and Sohler (2019) Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3662–3671, Long Beach, CA, 2019.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lloyd (1982) Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–136, 1982.
 Makarychev et al. (2019) Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1027–1038, Phoenix, AZ, 2019.
 McSherry and Talwar (2007) Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 94–103, Providence, RI, 2007.
 Nock et al. (2016) Richard Nock, Raphaël Canyasse, Roksana Boreli, and Frank Nielsen. k-variates++: more pluses in the k-means++. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 145–154, New York City, NY, 2016.
 Punj and Stewart (1983) Girish Punj and David W Stewart. Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 20(2):134–148, 1983.
 Resende and Werneck (2007) Mauricio G. C. Resende and Renato Fonseca F. Werneck. A fast swap-based local search procedure for location problems. Ann. Oper. Res., 150(1):205–230, 2007.
 Shah (2003) Rahul Shah. Faster algorithms for k-median problem on trees with smaller heights. Technical report, 2003.
 Stemmer and Kaplan (2018) Uri Stemmer and Haim Kaplan. Differentially private k-means with constant multiplicative error. In Advances in Neural Information Processing Systems (NeurIPS), pages 5436–5446, Montréal, Canada, 2018.
 Tamir (1996) Arie Tamir. An algorithm for the p-median and related problems on tree graphs. Oper. Res. Lett., 19(2):59–64, 1996.
 Todo et al. (2019) Keisuke Todo, Atsuyoshi Nakamura, and Mineichi Kudo. A fast approximate algorithm for k-median problem on a graph. In Proceedings of the 15th International Workshop on Mining and Learning with Graphs (MLG), Anchorage, AK, 2019.
Appendix A Proofs
The following composition result of differential privacy will be used in our proof.
Theorem A.1 (Composition Theorem [Dwork, 2006]).
If algorithms $\mathcal{A}_1, \dots, \mathcal{A}_m$ are $\epsilon_1, \dots, \epsilon_m$-differentially private, respectively, then their union is $(\sum_{i=1}^m \epsilon_i)$-DP.
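As a brief sketch of why additive composition holds (a standard one-line derivation, assuming mechanisms $\mathcal{A}_1,\dots,\mathcal{A}_m$ with budgets $\epsilon_1,\dots,\epsilon_m$ are run independently on neighboring datasets $D, D'$):

```latex
\Pr[(\mathcal{A}_1,\dots,\mathcal{A}_m)(D) = (o_1,\dots,o_m)]
  = \prod_{i=1}^{m} \Pr[\mathcal{A}_i(D) = o_i]
  \le \prod_{i=1}^{m} e^{\epsilon_i}\,\Pr[\mathcal{A}_i(D') = o_i]
  = e^{\sum_{i=1}^{m} \epsilon_i}\,\Pr[(\mathcal{A}_1,\dots,\mathcal{A}_m)(D') = (o_1,\dots,o_m)],
```

so the union of the outputs is $(\sum_{i=1}^m \epsilon_i)$-DP.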
A.1 Proof of Lemma 3.3
Proof.
Consider the intermediate output of Algorithm LABEL:alg:new_initialNDP, , which is the set of roots of the minimal subtrees, each containing exactly one output center . Suppose one of the optimal “root sets” minimizing (6) is . If , the proof is done; thus, we prove the case . Note that are disjoint subtrees. We have the following case analysis.

Case 1: for some , is a descendant node of . Since the optimal center point is a leaf node by the definition of (6), there must exist one child node of spanning a subtree that contains . Therefore, we can always replace by one of its child nodes. Hence, we may assume that is not a descendant of .
Note that we have if . Algorithm LABEL:alg:new_initialNDP sorts all the nodes by cost value, so it gives higher priority to picking than if and is not a child node of .

Case 2: for some , is a descendant of . In this case, the optimal center point , which is a leaf of , must also be a leaf node of . We can simply perform the swap , which does not change . Hence, we may assume that is not a descendant of .

Case 3: Otherwise. By the construction of , we know that when . Consider the swap between and . By the definition of the tree distance, we have , since does not contain any center of the optimal solution determined by (which is also the optimal “root set” for ).
Thus, we only need to consider Case 3. Consider the optimal clustering with center set (each center being a leaf of the subtree whose root is ), and let be the leaves assigned to . Let denote the set of leaves in whose distance to is strictly smaller than their distance to any center in . Let denote the union of the paths between the leaves of and their closest centers in . Let be the nodes in with the highest level satisfying . The score of is . This means that the swap with a center into can only reduce to (the tree distance between any leaf in and its closest center in is at most ). For simplicity, we write for in the rest of this proof. By this reasoning, summing all the swaps over gives
Also, based on our discussion of Case 1, it holds that
Summing these together, we have . ∎
A.2 Proof of Lemma 3.4
Proof.
Since the subtrees in are disjoint, it suffices to consider one subtree with root . With a slight abuse of notation, let denote the optimal median cost within the point set with one center in the 2-HST:
(9) 
which is the optimal cost within the subtree. Suppose has more than one child ; otherwise the optimal center is clear. Suppose the optimal solution of chooses a leaf node in , and our HST initialization algorithm picks a leaf of . If , then HST chooses the optimal one and the argument holds trivially. Thus, we consider . We have the following two observations:

Since one needs to pick a leaf of to minimize , we have , where denotes the children of .

By our greedy strategy, .
Since , we have
since our algorithm picks subtree roots with the highest scores. Then we have . Since the subtrees in are disjoint, the union of the centers for forms the optimal center set of size . Note that for any data point , the tree distance for that is a leaf node of is the same; that is, the choice of the leaf in as the center does not affect the median cost under the 2-HST metric. Therefore, taking a union bound over the subtree costs completes the proof. ∎
A.3 Proof of Proposition 3.1
Proof.
It is known that the 2-HST can be constructed in [Bartal, 1996]. The subtree search in Algorithm LABEL:alg:new_initialNDP involves at most sorting all the nodes in the HST by score, which takes . We use a priority queue to store the nodes in . When a new node is inserted into the queue, its parent node (if present in the queue) is removed. The number of nodes is , and each priority-queue operation (insertion, deletion) keyed by score has complexity . Lastly, the total time to obtain is , as FIND-LEAF only requires a top-down scan of the disjoint subtrees of . Summing these parts together proves the claim. ∎
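The priority-queue bookkeeping described in the proof above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Node` class, the `score` field, and the `subtree_search` name are hypothetical, and node scores are assumed precomputed.

```python
import heapq
import itertools

class Node:
    """A node of the (2-)HST; `score` is assumed to be precomputed."""
    def __init__(self, score, children=None):
        self.score = score
        self.children = children or []

def subtree_search(root, k):
    """Greedily maintain roots of disjoint subtrees in a max-priority queue:
    when a node's children are inserted, the node itself (their parent) is
    dropped, so the queue always holds roots of disjoint subtrees.
    Returns up to k subtree roots with the highest scores."""
    tie = itertools.count()  # tie-breaker so heapq never compares Node objects
    heap = [(-root.score, next(tie), root)]  # negate scores for a max-queue
    while len(heap) < k:
        entry = heapq.heappop(heap)  # current highest-score node
        node = entry[2]
        if not node.children:        # a leaf cannot be split further
            heapq.heappush(heap, entry)
            break
        for child in node.children:  # children replace their popped parent
            heapq.heappush(heap, (-child.score, next(tie), child))
    return [n for _, _, n in heapq.nsmallest(k, heap)]
```

Each push/pop costs logarithmic time and every node enters the queue at most once, matching the complexity accounting in the proof.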
A.4 Proof of Theorem 4.2
Similarly, we prove the error bound in a general metric by first analyzing the error under the 2-HST metric; the result then follows from Lemma 3.2. Let , , and be defined analogously to (5), (6), and (7), where “” in the summation is changed into “” since is the demand set. That is,
(10)  
(11)  
(12) 
We have the following.
Lemma A.2.
with probability .
Proof.
Lemma A.3.
For any node in , with probability , .
Proof.
Since , we have
As , we have
Hence, for some constant ,
∎
Lemma A.4 (DP Subtree Search).
With probability , .
Proof.
The proof is similar to that of Lemma 3.3. Consider the intermediate output of Algorithm LABEL:alg:new_initialNDP, , which is the set of roots of the minimal disjoint subtrees, each containing exactly one output center . Suppose one of the optimal “root sets” minimizing (6) is , and assume . By the same argument as in the proof of Lemma 3.3, we consider for some such that , where is not a descendant of and is not a descendant of . By the construction of , we know that when . Consider the swap between and . By the definition of the tree distance, we have , since does not contain any center of the optimal solution determined by (which is also the optimal “root set” for ). Consider the optimal clustering with center set (each center being a leaf of the subtree whose root is ), and let be the leaves assigned to . Let denote the set of leaves in whose distance to is strictly smaller than their distance to any center in . Let denote the union of the paths between the leaves of and their closest centers in . Let be the nodes in with the highest level satisfying . The score of is . This means that the swap with a center into can only reduce to (the tree distance between any leaf in and its closest center in is at most ). For simplicity, we write for in the rest of this proof. Summing all the swaps over , we obtain
Applying a union bound with Lemma A.3, with probability , we have
Consequently, with probability , we have
∎
Lemma A.5 (DP Leaf Search).
With probability , Algorithm LABEL:alg:new_initial_DP produces initial centers with .
Proof.
The proof strategy follows that of Lemma 3.4. We first consider one subtree with root . Let denote the optimal median cost within the point set with one center in the 2-HST:
(13) 
Suppose has more than one child , and the optimal solution of chooses a leaf node in , while our HST initialization algorithm picks a leaf of . If , then HST chooses the optimal one and the argument holds trivially. Thus, we consider . We have the following two observations:

Since one needs to pick a leaf of to minimize , we have where