Bandit-PAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits

06/11/2020 ∙ by Mo Tiwari, et al. ∙ University of Illinois at Urbana-Champaign Harvard University Stanford University 8

Clustering is a ubiquitous task in data science. Compared to the commonly used k-means clustering algorithm, k-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art k-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size n for each iteration, being prohibitively expensive for large datasets. We propose Bandit-PAM, a randomized algorithm inspired by techniques from multi-armed bandits, that significantly improves the computational efficiency of PAM. We theoretically prove that Bandit-PAM reduces the complexity of each PAM iteration from O(n^2) to O(n log n) and returns the same results with high probability, under assumptions on the data that often hold in practice. We empirically validate our results on several large-scale real-world datasets, including a coding exercise submissions dataset from, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. We observe that Bandit-PAM returns the same results as PAM while performing up to 200x fewer distance computations. The improvements demonstrated by Bandit-PAM enable k-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release Python and C++ implementations of our algorithm.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


BanditPAM ( C++ implementation and Python package

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many modern data science applications require the clustering of very-large-scale data. Due to its computational efficiency, the -means clustering algorithm MacQueen and others (1967); Lloyd (1982) has been one of the most widely-used clustering algorithms. -means alternates between assigning points to their nearest cluster centers and recomputing those centers. Central to its success is the specific choice of the cluster center: for a set of points, -means defines the cluster center as the point with the smallest average squared Euclidean distance to all other points in the set. Under such a definition, the cluster center is the arithmetic mean of the cluster’s points and can be computed efficiently.

While commonly used in practice, -means clustering suffers from several drawbacks. First, while one can efficiently compute the cluster centers under squared Euclidean distance, it is not straightforward to generalize to other distance metrics Overton (1983); Jain and Dubes (1988); Bradley et al. (1997). However, a different distance may be desirable in different applications. For example, and cosine distance are often used in sparse data, such as in recommendation systems Leskovec et al. (2020) and single-cell RNA-seq analysis Ntranos et al. (2016); additional examples include string edit distance in text data Navarro (2001), and graph metrics in social network data Mishra et al. (2007). Second, the cluster center in -means clustering is in general not a point in the dataset and may not be interpretable in many applications. This is especially problematic when the data is structured, such as parse trees in context-free grammars, sparse data in recommendation systems Leskovec et al. (2020)

, or images in computer vision where the mean image is visually random noise

Leskovec et al. (2020).

Alternatively, -medoids clustering algorithms Kaufman and Rousseeuw (1987, 1990) use the medoid to define the cluster center for a set of points, where for an arbitrary distance function, the medoid is the point in the set that minimizes the average distance to all the other points. Note that the distance metric can be arbitrary—indeed, it need not be a distance metric at all and could be an asymmetric dissimilarity measure—which addresses the first shortcoming of -means outlined above. Moreover, unlike -means, the cluster centers in -medoids, i.e. the medoids, are restricted to be points in the dataset, thus addressing the second shortcoming of -means clustering described above.

Despite its advantages, -medoids clustering is less popular than -means due to its computational efficiency. Indeed, state-of-art -medoids clustering algorithms are iterative and are quadratic in the data size, whereas -means is linear in dataset size in each iteration.

Mathematically, for data points and a user-specified distance function , the -medoids problem is to find a set of medoids to minimize the overall distance of points from their closest medoids:


This problem is, unfortunately, NP-hard in general Schubert and Rousseeuw (2019). Partitioning Around Medoids (PAM) Kaufman and Rousseeuw (1987, 1990)

is one of the most widely used heuristic algorithms for

-medoids clustering. PAM is split into two subroutines: BUILD and SWAP. First, in the BUILD step, PAM aims to find an initial set of medoids by greedily and iteratively selecting points that minimize the -medoids clustering loss (1). Next, in the SWAP step, PAM considers all possible pairs of medoid and non-medoid points and swaps the pair that reduces the loss the most. The SWAP step is repeated until no further improvements can be made with more swaps.

PAM has been empirically shown to produce better results than other popular -medoids clustering algorithms Reynolds et al. (2006); Schubert and Rousseeuw (2019). However, the BUILD step and each of the SWAP steps require distance evaluations and can be prohibitively expensive to run, especially for large datasets or when the distance evaluations are themselves expensive (e.g. edit distance between two long strings).

Randomized algorithms like CLARA Kaufman and Rousseeuw (1990) and CLARANS Ng and Han (2002) have been proposed to improve computational efficiency, but at the cost of deteriorated clustering quality. More recently, Schubert et al. Schubert and Rousseeuw (2019) proposed a deterministic algorithm, dubbed FastPAM1, that guarantees the same output as PAM but improves the complexity to when the cluster sizes are similar. However, the factor improvement becomes less relevant when the sample size is large and the number of medoids is small compared to . Throughout the rest of this work, we treat fixed and assume .


In this work, we propose a novel randomized -medoids algorithm, called Bandit-PAM, that significantly improves the computational efficiency of PAM while returning the same result with high probability. We theoretically prove that Bandit-PAM reduces the complexity on the sample size from to , both for the BUILD step and each SWAP step, under reasonable assumptions that hold in many real-world datasets. We empirically validate our results on several large-scale real-world datasets and observe that Bandit-PAM provides a reduction of distance evaluations of up to 200x while returning the same result as PAM. We also release a high-performance C++ implementation of Bandit-PAM, which brings a 3.2x wall-clock-time speedup over the state-of-the-art FastPAM implementation Schubert and Zimek (2019) on the full MNIST dataset without precomputing and caching the pairwise distances as in prior state-of-the-art approaches.

Intuitively, Bandit-PAM works by recasting each step of PAM from a deterministic computational problem to a

statistical estimation problem

. In the BUILD step assignment of the th medoid, for example, we need to choose the point amongst all non-medoids that will lead to the lowest overall loss (1) if chosen as the next medoid. Mathematically, we wish to find that minimizes:


where is a function that depends on and . Eq. (2) shows that the loss of a new medoid assignment can be written as the summation of the value of the function evaluated on all points in the dataset. Though approaches such as PAM compute exactly for each , Bandit-PAM adaptively estimates this quantity by sampling reference points for the most promising candidates. Indeed, computing exactly for every is not required; promising candidates can be estimated with higher accuracy (more reference point ’s) and less promising ones can be discarded early without requiring further unnecessary computation.

To design the adaptive sampling strategy, we show that the BUILD step and each SWAP iteration can be formulated as a best-arm identification problem from the multi-armed bandits (MAB) literature Audibert and Bubeck (2010); Even-Dar et al. (2002); Jamieson et al. (2014); Jamieson and Nowak (2014). In the typical version of the best-arm identification problem, we have arms. At each time step we decide to pull an arm , and receive a reward with . The goal is to identify the arm with the largest expected reward with high probability while expending the fewest number of total arm pulls. In the BUILD step, we view each candidate medoid as an arm in a best-arm identification problem. The arm parameter corresponds , and by pulling an arm, we observe the loss evaluated on a randomly sampled data point . Using this reduction, the best candidate medoid can be estimated using existing best-arm algorithms like the Upper Confidence Bound (UCB) algorithm Lai and Robbins (1985).

Related work:

Many other -medoids algorithms exist, in addition to CLARA, CLARANS, and FastPAM as described above. Park et al. Park and Jun (2009) proposed a -means-like algorithm that alternates between reassigning the points to their closest medoid and recomputing the medoid for each cluster until the -medoids clustering loss can no longer be improved. Other proposals include optimizations for Euclidean space and tabu search heuristics Estivill-Castro and Houle (2001). Recent work has also focused on distributed PAM, where the dataset cannot fit on one machine Song et al. (1967). All of these algorithms, however, scale quadratically in dataset size or concede the final clustering quality for improvements in runtime.

The idea of algorithm acceleration by converting a computational problem into a statistical estimation problem and designing the adaptive sampling procedure via multi-armed bandits has witnessed a few recent successes Chang et al. (2005); Kocsis and Szepesvári (2006); Li et al. (2016); Jamieson and Talwalkar (2016); Bagaria et al. (2018a); Zhang et al. (2019). In the context of -medoids clustering, previous work Bagaria et al. (2018b); Baharav and Tse (2019) has considered finding the single medoid of a set points (i.e. the -medoid problem). In these works, the -medoid problem was also formulated as a best-arm identification problem, with each point being an arm and its average distance to other points being the arm parameter.

While the -medoid problem considered in prior work can be solved exactly, the -medoids problem is NP-Hard and is therefore only tractable with heuristic solutions. Hence, this paper focuses on improving the computational efficiency of an existing heuristic solution, PAM, that has been empirically observed to be superior to other techniques. Moreover, instead of having a single best-arm identification problem, we reformulate PAM as a sequence of best-arm problems. We treat different objects as arms in different steps of PAM; in the BUILD step, each point corresponds to an arm, whereas in the SWAP step, each medoid-and-non-medoid pair corresponds to an arm. We further notice that the intrinsic difficulties of this sequence of best-arm problems are different, which can be exploited to further speed up the algorithm, as demonstrated in Section 5 and Appendix 1.2.

2 Preliminaries

For data points and a user-specified distance function , the -medoids problem aims to find a set of medoids to minimize the overall distance of points from their closest medoids:


Note that does not need to satisfy symmetry, triangle inequality, or positivity. For the rest of the paper, we use to denote the set and to represent the cardinality of a set . For two scalars , we let and .

2.1 Partitioning Around Medoids (PAM)

The original PAM algorithm Kaufman and Rousseeuw (1987, 1990) first initializes the set of medoids via the BUILD step and then repeatedly performs the SWAP step to improve the loss (3) until convergence.


PAM initializes a set of medoids by greedily assigning medoids one-by-one so as to minimize the overall loss (3). The first point added in this manner is the medoid of all points. Given the current set of medoids , the next point to add can be written as



PAM then swaps the medoid-nonmedoid pair that would reduce the loss (3) the most among all possible such pairs. Let be the current set of medoids. Then the best pair to swap is


The second term in both (4) and (5), namely and , can be determined by caching the smallest and the second smallest distances from each point to the previous set of medoids, namely in (4) and in (5). Therefore, in both (4) and (5), we only need to compute the distance once for each summand. As a result, PAM needs distance computations for the greedy searches in the entire BUILD step and distance computations for each SWAP iteration.

3 Bandit-PAM

At the core of the PAM algorithm is the BUILD search (4), which is repeated times for initialization, and the SWAP search (5), which is repeated until convergence. We first show that both searches share a similar mathematical structure, and then show that such a structure can be optimized efficiently using a bandit-based randomized algorithm, thus giving rise to Bandit-PAM. Rewriting the BUILD search (4) and the SWAP search (5) in terms of the change in total loss yields

BUILD: (6)
SWAP: (7)

One may notice that the above two problems share the following similarities. First, both are searching over a finite set of parameters: points in the BUILD search and swaps in the SWAP search. Second, both objective functions have the form of an average of an function evaluated over a finite set of reference points. We formally describe the shared structure:


for target points , reference points , and an objective function that depends on the target point . Then both the BUILD search and the SWAP search can be written as instances of Problem (8) with:


Crucially, in the SWAP search, each pair of medoid-and-non-medoid points is treated as one target point in in this new formulation.

3.1 Adaptive search for the shared problem

Recall that the computation of is . A naive, explicit method would require computations of to solve Problem (8). However, as shown in previous works Bagaria et al. (2018b, a), a randomized search would return the correct result with high confidence in computations of . Specifically, for each target in Problem (8), let denote its objective function. Computing exactly takes computations of , but we can instead estimate with fewer computations by drawing independent samples uniformly with replacement from . Then, and can be estimated as , where determines the estimation accuracy. To estimate the solution to Problem (8) with high confidence, we can then choose to sample different targets in to different degrees of accuracy. Intuitively, promising targets with small values of should be estimated with high accuracy, while less promising ones can be discarded without being evaluated on too many reference points.

The specific adaptive estimation procedure is described in Algorithm 1. It can be viewed as a batch version of the conventional UCB algorithm Lai and Robbins (1985); Zhang et al. (2019) and is easier to implement. The algorithm uses the set to track all potential solutions to Problem (8); is initialized as the set of all target points . For each potential solution , the algorithm maintains its mean objective estimate

as well as a confidence interval

, where the latter depends on the exclusion probability as well as the dispersion parameter .

In each iteration, a new batch of reference points is evaluated for all potential solutions in , making the estimate of more accurate. Based on the current estimate, if a target’s lower confidence bound is still greater than the upper confidence bound of the most promising target , we remove it from the set of possible solutions . This process continues until there is only one point in or until we have sampled more reference points than in the whole reference set. In the latter case, we know that the difference between the remaining targets in is so subtle that an exact computation is more efficient. We then compute those targets’ objectives exactly and return the best target in the set.

1: Set of potential solutions to Problem (8)
2: Number of reference points evaluated
3:For all , set , Initial mean and confidence interval for each arm
4:while  and  do
5:     Draw a batch of samples with replacement from reference
6:     for all  do
7:          Update running mean
8:          Update confidence interval      
9:      Remove points that can no longer be solution
11:if  = 1 then
12:     return
14:     Compute exactly for all
15:     return
Algorithm 1 Adaptive-Search ( batchsize, , )

3.2 Algorithmic details

Estimation of each : Bandit-PAM uses Algorithm 1 in both the BUILD step and each SWAP iteration, with input parameters specified in (9) and (10). In practice, is not known a priori and we estimate for each from the data. In the first batch of sampled reference points in Algorithm 1, we estimate each as:



denotes standard deviation. Intuitively, this allows for smaller confidence intervals in later iterations, especially in the BUILD step, when we expect the average arm returns to become smaller as we add more medoids (since we are taking the minimum over a larger set on the RHS of Eq. (

4)). We also allow for arm-dependent , as opposed to a fixed global

, which allows for narrower confidence intervals for arms whose returns are heavily concentrated (e.g., distant outliers). Empirically, this results in significant speedups and results in fewer arms being computed exactly (Line 14 in Algorithm

1). In all experiments, the batchsize is set to 100 and the error probability is set to . Empirically, these values of batch size and this setting of are such that Bandit-PAM recovers the same results in PAM in almost all cases.

Combination with FastPAM1: We also combine Bandit-PAM with the FastPAM1 optimization Schubert and Rousseeuw (2019). We discuss this optimization in Appendix 1.2.

4 Analysis of the Algorithm

The goal of Bandit-PAM is to track the optimization trajectory of the standard PAM algorithm, ultimately identifying the same set of medoids with high probability. In this section, we formalize this statement and provide bounds on the number of distance computations required by Bandit-PAM.

We will assume that both PAM and Bandit-PAM place a hard constraint on the maximum number of SWAP steps that are allowed. Notice that, as long as Bandit-PAM finds the correct solution to the search problem (6) at each BUILD step and to the search problem (7) at each SWAP step, it will reproduce the sequence of BUILD and SWAP steps of PAM identically, returning the same set of medoids in the end. The hard constraint guarantees that, even if the trajectories of PAM and Bandit-PAM deviate from each other, at most calls to Algorithm 1 will be performed.

Consider one call to Algorithm 1 and suppose is the optimal target point (i.e., the one with minimum ). For another target point , let . To state the next result, we will assume that, for a randomly sampled reference point, say

, the random variable

is -sub-Gaussian; i.e. that , for some known parameter . In addition, we assume that the data is generated in a way such that the mean rewards

follow a sub-Gaussian distribution (see Sec. 

6 for a discussion).

Theorem 1.

If Bandit-PAM is run on a dataset with , then it returns the same set of medoids as PAM with probability . Furthermore, the total number of distance computations required satisfies

When the number of desired medoids is a constant and the number of allowed SWAP steps is small (which is often sufficient in practice, as discussed in Sec. 6), Theorem 1 implies that only distance computations are necessary to reproduce the results of PAM with high probability.

In order to prove Theorem 1, we prove a more detailed result for each call that Bandit-PAM makes to Algorithm 1. For this more specific case, we assume that, for target point , is -sub-Gaussian, where is a parameter specific to (and can change across different calls to Algorithm 1). As it turns out, in practice one can estimate each by performing a small number of distance computations. Allowing to be estimated separately for each arm is beneficial in practice, as discussed in Sec. 6. The following theorem is proved in Appendix 3.

Theorem 2.

For , with probability at least , Algorithm 1 returns the correct solution to (6) (for a BUILD step) or (7) (for a SWAP step), using a total of distance computations, where

While the assumption that is known for every may seem excessive, it is worth pointing out that Algorithm 1 does not need to know all s exactly and an upper bound is sufficient. Notice that, if a random variable is -sub-Gaussian, it is also -sub-Gaussian for . Hence, if we have a universal upper bound for all , the algorithm can be run with replacing each . In that case, a direct consequence of Theorem 2 is that the total number of distance computations per call to Algorithm 1 satisfies


Furthermore, as proved in Appendix 2 of Bagaria et al. Bagaria et al. (2018b), such an instance-wise bound converts to an bound when ’s follow a sub-Gaussan distribution. Moreover, from Theorem 2, the probability that Algorithm 1 does not return the target point with the smallest value of is at most . By the union bound, the probability that Bandit-PAM does not return the same set of medoids as PAM is at most . Moreover, since at most calls to Algorithm 1 are made, from (12) we see that the total number of distance computations required by Bandit-PAM satisfies . This proves Theorem 1.

5 Empirical Results

We run experiments on three real-world datasets to validate the expected behavior of Bandit-PAM: the MNIST hand-written digits dataset LeCun et al. (1998), the 10x Genomics 68k PBMCs scRNA-seq dataset Zheng et al. (2017) (, and the Hour Of Code #4 (HOC4) coding exercise submission dataset (, all of which are publicly available.


The MNIST dataset LeCun et al. (1998)

consists of 70,000 black-and-white images of handwritten digits, where each digit is represented as a 784 dimensional vector. On MNIST, We consider two distance metrics, namely

distance and cosine distance. The scRNA-seq dataset contains the gene expression levels of 10,170 different genes in each of 40,000 cells after standard filtering. On scRNA-seq, we consider distance, which is recommended Ntranos et al. (2016). The HOC4 dataset from (2013) consists of 3,360 unique solutions to a block-based programming exercise on Solutions to the programming exercise are represented as abstract syntax trees (ASTs), and we consider the tree edit distance to quantify the similar between solutions.


In Subsec. 5.1, we show that Bandit-PAM returns the same results as PAM. We also compare the clustering loss with that of other popular -medoids clustering algorithms (3), including FastPAM Schubert and Rousseeuw (2019), CLARANS Ng and Han (2002), and Voronoi Iteration Park and Jun (2009). In Subsec. 5.2, we demonstrate that Bandit-PAM scales linearly in the number of samples for all datasets and all metrics considered. Each parameter setting was repeated times with data subsampled from the original dataset. 95% confidence intervals are provided.

Figure 1: (a) Clustering loss relative to the PAM loss. Data is subsampled from MNIST, sample size varies from to , and 95% confidence intervals are provided. Bandit-PAM always returns the same solution as PAM and hence has loss ratio . FastPAM has a comparable performance, while the other two algorithms are significantly worse. (b-c) Average number of distance calls per iteration vs sample size for MNIST and distance with (b) and (c) . The plot is shown on a log-log scale. Lines of best fit (black) are plotted, as are reference lines demonstrating the expected scaling of PAM (red).

5.1 Clustering/loss quality

Figure 1 (a) shows the relative losses of algorithms with respect to the loss of PAM. Bandit-PAM and three other baselines, namely FastPAM Schubert and Rousseeuw (2019), CLARANS Ng and Han (2002), and Voronoi Iteration Park and Jun (2009). We note that FastPAM is different from the FastPAM1 mentioned before; FastPAM takes for each SWAP step but does not guarantee the same solution as PAM. Bandit-PAM always returns the same solution as PAM and hence has loss ratio . FastPAM has a comparable performance, while the other two algorithms are significantly worse.

5.2 Scaling with for different datasets, distance metric, and values

Figure 2: Average number of distance calls per iteration vs sample size , for (a) MNIST and cosine distance, (b) scRNA-seq and distance, and (c) HOC4 and tree edit distance. The plot is shown on a log-log scale. Lines of best fit (black) are plotted, as are reference lines demonstrating the expected scaling of PAM (red).

We next consider the number of distance calls per iteration as the sample size increases. The number of distance calls per iteration is defined as the total distance calls divided by the number of SWAP steps plus 1, where the plus 1 corresponds to the BUILD step. We choose to look at this quantity to account for different number of SWAPs for different runs, in order to provide a fair comparison.

If the complexity is linear, then the slope would be in the log-log plot. Indeed, as shown in Figure 1 (b-c), the slope for and are 0.979 and 0.930, respectively, indicating the scaling is linear in for different values of .

In addition, as shown in Figure 2, the slopes of the log-log plot are 1.018, 0.899 and 1.046 for MNIST with cosine distance, scRNA-seq with distance, and HOC4 with tree edit distance, respectively, validating our theory that Bandit-PAM takes almost linear number of distance evaluations per iteration for different datasets and different distance metrics.

6 Discussion and Conclusions

In all experiments, we have observed that the numbers of SWAPs are very small, typically fewer than 10, justifying the assumption of having an upper limit on the PAM SWAP step prior to running the algorithm in Sec. 4.

We also observe that for all datasets, the randomly sampled distances have an empirical distribution similar to Gaussian distribution (Appendix Figures 4-5), justifying the sub-Gaussian assumption in Sec. 4. In addition, we observe that the the sub-Gaussian parameters are different for different steps and different points (Appendix Figures 2), justifying the adaptive estimation of the sub-Gaussianity parameters in SubSec. 3.2.

In addition, the distribution of the true arm parameters also mostly have a heavy-tailed distribution (Appendix Figure 3), justifying the distributional assumption of ’s in Sec. 4.

Our application to the HOC4 dataset also suggests a method for scaling personalized feedback to individual students in online courses. If limited resources are available, instructors can choose to provide feedback on just the medoids of submitted solutions instead of exhaustively providing feedback on every unique solution, of which there may be several thousand. Instructors can then refer individual students to the feedback provided for their closest medoid. We anticipate that this approach can be applied generally for students of Massive Open Online Courses (MOOCs), thereby enabling more equitable access to education and personalized feedback for students.

Broader Impact

In this work, we proposed an algorithm that accelerated finding solutions to the -medoids problem while producing comparable – and usually equivalent – final cluster assignments. Our work enables the discovery of high-quality medoid assignments in very large datasets, including some on which prior algorithms were prohibitively expensive. A potential negative consequence of this is that practitioners may be incentivized to gather and store larger amounts of data now that it can be meaningfully processed, in a phenomenon more generally described as induced demand Hymel et al. (2010). This incentive realignment could potentially result in negative externalities such as an increase in energy consumption and carbon footprints.

We also anticipate, however, that Bandit-PAM will enable several beneficial applications in biomedicine, education, and fairness. For example, the evolutionary pathways of infectious diseases could possibly be constructed from the medoids of genetic sequences available at a given point in time, if prior temporal information about these sequences’ histories is not available. Similarly, the medoids of patients infected in a disease outbreak may elucidate the origins of outbreaks, as did prior analyses of cholera outbreaks using Voronoi Iteration Cameron and Jones (1983). As discussed in Section 6, our application to the HOC4 data also demonstrates the utility of Bandit-PAM in online education. In particular, especially with recent interest in online learning, we hope that our work will improve the quality of online learning for students worldwide.


  • [1] J. Audibert and S. Bubeck (2010) Best arm identification in multi-armed bandits. In Arxiv, Cited by: §1.
  • [2] V. Bagaria, G. M. Kamath, and D. N. Tse (2018) Adaptive monte-carlo optimization. arXiv preprint arXiv:1805.08321. Cited by: §1, §2, §3.1.
  • [3] V. Bagaria, G. Kamath, V. Ntranos, M. Zhang, and D. Tse (2018) Medoids in almost-linear time via multi-armed bandits. In

    International Conference on Artificial Intelligence and Statistics

    pp. 500–509. Cited by: §1, §1.3, §2, §3.1, §4.
  • [4] T. Baharav and D. Tse (2019) Ultra fast medoid identification via correlated sequential halving. In Advances in Neural Information Processing Systems, pp. 3650–3659. Cited by: §1, §2.
  • [5] P. S. Bradley, O. L. Mangasarian, and W. N. Street (1997) Clustering via concave minimization. In Advances in Neural Information Processing Systems, pp. 368–374. Cited by: §1.
  • [6] D. Cameron and I. Jones (1983) John snow, the broad street pump and modern epidemiology. In International Journal of Epidemiology, Vol. 12, pp. 393–396. Cited by: Broader Impact.
  • [7] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus (2005)

    An adaptive sampling algorithm for solving markov decision processes

    Operations Research 53 (1), pp. 126–139. Cited by: §1.
  • [8] (2013) Research at Note: External Links: Link Cited by: §5.
  • [9] V. Estivill-Castro and M. E. Houle (2001) Robust distance-based clustering with applications to spatial data mining. Algorithmica 30 (2), pp. 216–242. Cited by: §1.
  • [10] E. Even-Dar, S. Mannor, and Y. Mansour (2002) PAC bounds for multi-armed bandit and markov decision processes. In

    International Conference on Computational Learning Theory

    pp. 255–270. Cited by: §1.
  • [11] K. Hymel, K. Small, and K. Van Dender (2010) Induced demand and reboundeffects in road transport. In Transportation Research B, Methodological, Vol. 44, pp. 1220–1241. Cited by: Broader Impact.
  • [12] A. K. Jain and R. C. Dubes (1988) Algorithms for clustering data. Prentice-Hall, Inc.. Cited by: §1.
  • [13] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck (2014) Lil’ucb: an optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pp. 423–439. Cited by: §1.
  • [14] K. Jamieson and R. Nowak (2014) Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6. Cited by: §1.
  • [15] K. Jamieson and A. Talwalkar (2016)

    Non-stochastic best arm identification and hyperparameter optimization

    In Artificial Intelligence and Statistics, pp. 240–248. Cited by: §1.
  • [16] L. Kaufman and P. J. Rousseeuw (1987) Clustering by means of medoids. statistical data analysis based on the l1 norm. Y. Dodge, Ed, pp. 405–416. Cited by: §1, §1, §2.1.
  • [17] L. Kaufman and P. J. Rousseeuw (1990) Partitioning around medoids (program pam).

    Finding groups in data: an introduction to cluster analysis

    , pp. 68–125.
    Cited by: §1, §1, §1, §2.1.
  • [18] L. Kocsis and C. Szepesvári (2006) Bandit based monte-carlo planning. In

    European conference on machine learning

    pp. 282–293. Cited by: §1.
  • [19] B. Kveton, C. Szepesvari, and M. Ghavamzadeh (2019) Perturbed-history exploration in stochastic multi-armed bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Cited by: §2.
  • [20] B. Kveton, C. Szepesvari, S. Vaswani, Z. Wen, M. Ghavamzadeh, and T. Lattimore (2019) Garbage in, reward out: bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 3601–3610. Cited by: §2.
  • [21] T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §1, §3.1.
  • [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5, §5.
  • [23] J. Leskovec, A. Rajaraman, and J. D. Ullman (2020) Mining of massive data sets. Cambridge university press. Cited by: §1.
  • [24] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2016) Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560. Cited by: §1.
  • [25] S. Lloyd (1982) Least squares quantization in pcm. IEEE transactions on information theory 28 (2), pp. 129–137. Cited by: §1.
  • [26] M. Luecken and F. Theis (2019) Current best practices in single-cell rna-seq analysis: a tutorial. Molecular Systems Biology 155:e8746 (6). Cited by: §1.3.
  • [27] J. MacQueen et al. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281–297. Cited by: §1.
  • [28] N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan (2007) Clustering social networks. In International Workshop on Algorithms and Models for the Web-Graph, pp. 56–67. Cited by: §1.
  • [29] G. Navarro (2001) A guided tour to approximate string matching. ACM computing surveys (CSUR) 33 (1), pp. 31–88. Cited by: §1.
  • [30] R. T. Ng and J. Han (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE transactions on knowledge and data engineering 14 (5), pp. 1003–1016. Cited by: §1, §5, §5.1.
  • [31] V. Ntranos, G. M. Kamath, J. M. Zhang, L. Pachter, and N. T. David (2016) Fast and accurate single-cell rna-seq analysis by clustering of transcript-compatibility counts. Genome biology 17 (1), pp. 112. Cited by: §1, §5.
  • [32] M. L. Overton (1983) A quadratically convergent method for minimizing a sum of euclidean norms. Mathematical Programming 27 (1), pp. 34–63. Cited by: §1.
  • [33] H. Park and C. Jun (2009) A simple and fast algorithm for k-medoids clustering. Expert systems with applications 36 (2), pp. 3336–3341. Cited by: §1, §5, §5.1.
  • [34] A. P. Reynolds, G. Richards, B. de la Iglesia, and V. J. Rayward-Smith (2006)

    Clustering rules: a comparison of partitioning and hierarchical clustering algorithms

    Journal of Mathematical Modelling and Algorithms 5 (4), pp. 475–504. Cited by: §1.
  • [35] E. Schubert and P. J. Rousseeuw (2019) Faster k-medoids clustering: improving the pam, clara, and clarans algorithms. In International Conference on Similarity Search and Applications, pp. 171–187. Cited by: §1.1, §1, §1, §1, §3.2, §5, §5.1.
  • [36] E. Schubert and A. Zimek (2019)

    ELKI: a large open-source library for data analysis-elki release 0.7. 5" heidelberg"

    arXiv preprint arXiv:1902.03616. Cited by: §1.
  • [37] H. Song, J. Lee, and W. Han (1967) PAMAE: parallel k -medoids clustering with high accuracy and efficiency. In Proc. 23 ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, Vol. 1, pp. 281–297. Cited by: §1.
  • [38] C. Wang, Y. Yu, B. Hao, and G. Cheng (2020) Residual bootstrap exploration for bandit algorithms. In Arxiv, Cited by: §2.
  • [39] M. Zhang, J. Zou, and D. Tse (2019) Adaptive monte carlo multiple testing via multi-armed bandits. In International Conference on Machine Learning, pp. 7512–7522. Cited by: §1, §3.1.
  • [40] G. X. Zheng, J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, T. D. Wheeler, G. P. McDermott, J. Zhu, et al. (2017) Massively parallel digital transcriptional profiling of single cells. Nature communications 8 (1), pp. 1–12. Cited by: §5.


1 Additional Discussions

1.1 FastPAM1 optimization

Algorithm 1 can also be combined with the FastPAM1 optimization from [35] to reduce the number of computations in each SWAP iteration. For a given candidate swap , we rewrite from Eq. (10) as:


where denotes the set of points whose closest medoid is and and are the distance from to its nearest and second nearest medoid, respectively, before the swap is performed. We cache the values , and the cluster assignments so that Eq. (13) no longer depends on and instead depend only on , which is cached. This allows for an speedup in each SWAP iteration since we do not need to recompute Equation 13 for each of the distinct medoids (s).

1.2 Value of re-estimating each

Appendix Figure 2:

Boxplot showing the min, max, and each quartile for the set of all

estimates for the full MNIST dataset, in the BUILD step.

The theoretical results in Section 4 and empirical results in Section 5 suggest that Bandit-PAM scales almost linearly in dataset size for a variety of real-world datasets and commonly used metrics. One may also ask if Lines 7-8 of Algorithm 1, in which we re-estimate each from the data, are necessary. In some sense, we treat the set of {} as adaptive in two different ways: is calculated on a per-arm basis (hence the subscript ), as well recalculated in each BUILD and SWAP iteration. In practice, we observe that re-estimating each for each sequential call to Algorithm 1 significantly improves the performance of our algorithm. Figure 2 describes the distribution of estimate for the MNIST data at different stages of the BUILD step. The median drops dramatically after the first medoid has been assigned and then steadily decreases, as indicated by the orange lines, and suggests that each should be recalculated at every assignment step. Furthermore, the whiskers demonstrate significant variation amongst the in a given assignment step and suggest that having arm-dependent parameters is necessary. Without these modifications to our algorithm, we find that the confidence intervals used by Bandit-PAM (Line 8) are unnecessarily large and cause computation to be expended needlessly as it becomes harder to identify good arms. Intuitively, this is due to the much larger confidence intervals that make it harder to distinguish between arms’ mean returns. For a more detailed discussion of the distribution of and examples where the assumptions of Theorem 1 are violated, we refer the reader to Appendix 1.3.

1.3 Violation of distributional assumptions

In this section, we investigate the robustness of Bandit-PAM to violations of the assumptions in Theorem 1 on an example dataset and provide intuitive insights into the degradation of scaling. We create a new dataset from the scRNA dataset by projecting each point onto the top 10 principal components of the dataset; we call the dataset of projected points scRNA-PCA. Such a transformation is commonly used in prior work; the most commonly used distance metric between points is then the distance [26].

Figure 3 shows the distribution of arm parameters for various (dataset, metric) pairs in the first BUILD step. In this step, the arm parameter corresponds to the mean distance from the point (the arm) to every other point. We note that the true arm parameters in scRNA-PCA are more heavily concentrated about the minimum than in the other datasets. Intuitively, we have projected the points from a 10,170-dimensional space into a 10-dimensional one and have lost significant information in doing so. This makes many points appear "similar" in the projected space.

Figures 4 and 5 show the distribution of arm rewards for 4 arms (points) in MNIST and scRNA-PCA, respectively, in the first BUILD step. We note that the examples from scRNA-PCA display much larger tails, suggesting that their sub-Gaussianity parameters are very high.

Together, these observations suggest that the scRNA-PCA dataset may violate the assumptions of Theorems 1 and 2 and hurt the scaling of Bandit-PAM with . Figure 6 demonstrates the scaling of Bandit-PAM with on scRNA-PCA. The slope of the line of best fit is 1.204, suggesting that Bandit-PAM scales as approximately in dataset size. We note that this is higher than the exponents suggested for other datasets by Figures 1 and 2, likely to the different distributional characteristics of the arm means and their spreads.

We note that, in general, it may be possible to characterize the distribution of arm returns at and the distribution of , the sub-Gaussianity parameter, at every step of Bandit-PAM, from properties of the data-generating distribution, as done for several distributions in [3]. We leave this more general problem, as well as its implications for the complexity of our Bandit-PAM , to future work.

Appendix Figure 3: Histogram of true arm parameters, , for 1000 randomly sampled arms in the first BUILD step of various datasets. For scRNA-PCA with (bottom right), the arm returns are much more sharply peaked about the mininum than for the other datasets. In plots where the bin widths are less than 1, the frequencies can be greater than 1.
Appendix Figure 4: Example distribution of rewards for 4 points in MNIST in the first BUILD step. The minimums and maximums are indicated with red markers.
Appendix Figure 5: Example distribution of rewards for 4 points in scRNA-PCA in the first BUILD step. The minimums and maximums are indicated with red markers. The distributions shown here are more heavy-tailed than in Figure 4. In plots where the bin widths are less than 1, the frequencies can be greater than 1.
Appendix Figure 6: Average number of distance calls per iteration vs , for scRNA-PCA and distance on a log-log scale. The line of best fit (black) are plotted, as are reference lines demonstrating the expected scaling of PAM (red).

2 Future Work

There are several ways in which Bandit-PAM could be improved or made more impactful. In this work, we chose to implement a UCB-based algorithm to find the medoids of a dataset. Other best-arm-identification approaches, however, could also be used for this problem. An alternate approach using bootstrap-based bandits could also be valuable, especially in relaxing the distributional assumptions on the data that the quantities of interest are -sub-Gaussian [38, 20, 19]. It may also be possible to generalize a recent single-medoid approach, Correlation-Based Sequential Halving [4], to more than 1 medoid. Though we do not have reason to suspect an algorithmic speedup (as measured by big-O), we may see constant factor improvements or improvements in wall clock time.

Throughout this work, we assumed that computing the distance between two points was an operation. This obfuscates the dependence on the dimensionality of the data, . If we consider computing the distance between two points an computation, the complexity of Bandit-PAM could be expressed as log in the BUILD step and each SWAP iteration. Recent work [2] suggests that this could be further improved; instead of computing the difference in each of the coordinates, we may be able to adaptively sample which of the coordinates to use in our distance computations and reduce the dependence on dimensionality from to , especially in the case of sparse data.

Finally, it may be possible to improve the theoretical bounds presented in Theorem 1. We also note that it may be possible to prove the optimality of Bandit-PAM in regards to algorithmic complexity, up to constant factors, using techniques from [3] that were developed for sample-efficiency guarantees in hypothesis testing.

3 Proof of Theorem 2


First, we show that, with probability , all confidence intervals computed throughout the algorithm are true confidence intervals, in the sense that they contain the true parameter . To see this, notice that for a fixed and a fixed iteration of the algorithm, is the average of i.i.d. samples of a -sub-Gaussian distribution. From Hoeffding’s inequality,

Notice that there are at most such confidence intervals computed across all target points (i.e., arms) and all steps of the algorithm. If we set , we see that for every and for every step of the algorithm with probability at least , by the union bound.

Let . Notice that if all confidence intervals throughout the algorithm are correct, it is impossible for to be removed from the set of candidate target points. Moreover, it is clear that the main while loop in the algorithm can only run times and that the algorithm must terminate. Hence, (or some with ) must be returned upon termination.

Let be the total number of arm pulls computed for each of the arms remaining in the set of candidate arms at some point in the algorithm. Notice that, for any suboptimal arm that has not left the set of candidate arms, we must have . Moreover, if , we have that

and implying that must be removed from the set of candidate arms at the end of that iteration. Hence, the number of distance computations required for target point is at most

Notice that this holds simultaneously for all with probability . We conclude that the total number of distance computations satisfies

where we used the fact that the maximum number of distance computations per target point is . ∎