Medoids in almost linear time via multi-armed bandits

11/02/2017 · Vivek Bagaria, et al. · Stanford University

Computing the medoid of a large number of points in high-dimensional space is an increasingly common operation in many data science problems. We present Med-dit, an algorithm which uses O(n log n) distance evaluations to compute the medoid with high probability. Med-dit is based on a connection with the multi-armed bandit problem. We evaluate the performance of Med-dit empirically on the Netflix-prize and single-cell RNA-Seq datasets, containing hundreds of thousands of points living in tens of thousands of dimensions, and observe a 5-10x improvement in performance over the current state of the art. Med-dit is available at https://github.com/bagavi/Meddit


1 Introduction

An important building block in many modern data analysis problems such as clustering is the efficient computation of a representative point for a large set of points in high dimension. A commonly used representative point is the centroid of the set, the point which has the smallest average distance to all other points in the set. For example, k-means clustering [28, 23, 21] computes the centroid of each cluster under the squared Euclidean distance. In this case, the centroid is the arithmetic mean of the points in the cluster, which can be computed very efficiently. Efficient computation of the centroid is central to the success of k-means, as this computation has to be repeated many times in the clustering process.

While commonly used, the centroid suffers from several drawbacks. First, the centroid is in general not a point in the dataset and thus may not be interpretable in many applications. This is especially true when the data is structured, as with images, or sparse, as in recommendation systems [19]. Second, the centroid is sensitive to outliers: a few far-away points can significantly affect its location. Third, while the centroid can be efficiently computed under the squared Euclidean distance, there are many applications where this distance measure is not suitable. Examples include applications where the data is categorical, such as medical records [29]; situations where the data points have different support sizes, such as recommendation systems [19]; cases where the data points lie on a high-dimensional probability simplex, as in single cell RNA-Seq analysis [25]; and cases where the data lives in a space with no natural Euclidean structure, such as clustering on graphs from social networks.

An alternative to the centroid is the medoid: the point in the set that minimizes the average distance to all the other points. It is used, for example, in k-medoids clustering [15]. On the real line, the medoid is the median of the set of points. The medoid overcomes the first two drawbacks of the centroid: it is by definition one of the points in the dataset, and it is less sensitive to outliers than the centroid. In addition, centroid algorithms are usually specific to the distance used to define the centroid, whereas medoid algorithms usually work for arbitrary distances.

The naive method of computing the medoid requires evaluating all pairwise distances between points in the set. For a set with n points, this means computing n(n-1)/2 = O(n^2) distances, which is computationally prohibitive when there are hundreds of thousands of points and each point lives in a space with tens of thousands of dimensions.

In the one-dimensional case, the medoid problem reduces to the problem of finding the median, which can be solved in linear time through Quick-select [12]. However, in higher dimensions, no linear-time algorithm is known. RAND [8] is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of O((n/ε²) log n) distance computations to approximate the medoid within εΔ with high probability, where Δ is the maximum distance between two points in the dataset. We remark that this is an approximation algorithm, and moreover Δ may not be known a priori. RAND was leveraged by TOPRANK [26], which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points exactly, and picks the minimum of those. TOPRANK needs O(n^{5/3} log^{4/3} n) distance computations to find the exact medoid with high probability under a distributional assumption on the average distances. trimed [24] presents a clever algorithm to find the medoid with O(n^{3/2} 2^{Θ(d)}) distance evaluations, which works very well when the points live in a space of dimension d smaller than log n, under a distributional assumption on the points (in particular on the distribution of points around the medoid). However, the exponential dependence on the dimension makes it impractical when d is much larger than log n, which is the case we consider here. Note that their result also requires the distance measure to satisfy the triangle inequality.

Figure 1: We plot the probability that RAND and Med-dit do not return the true medoid as a function of the number of distance evaluations per point, over the Monte Carlo trials. We note that Med-dit needs far fewer distance evaluations than RAND for comparable performance. The computation of the true medoid here is also computationally prohibitive and is discussed in Section 4.

In this paper, we present Med(oid)-(Ban)dit, a sampling-based algorithm that finds the exact medoid with high probability. It takes O(n log n) distance evaluations under natural assumptions on the distances, which are justified by the characteristics observed in real data (Section 2). In contrast to trimed, the number of distance computations is independent of the dimension; moreover, there is no specific requirement on the distance measure, e.g. the triangle inequality or symmetry. Thus, Med-dit is particularly suitable for high-dimensional data and general distance measures.

In Figure 1, we showcase the performance of RAND and Med-dit on the largest cluster of the single cell RNA-Seq gene expression dataset of [1]. This consists of the gene expression of each cell in the cluster, i.e., points whose dimension is the number of genes measured (tens of thousands). We note that Med-dit evaluates several times fewer distances than RAND to achieve similar performance. At this scale, running TOPRANK and trimed is computationally prohibitive.

The main idea behind Med-dit is that the problem of computing the medoid can be posed as that of computing the best arm in a multi-armed bandit (MAB) setting [9, 13, 17]. One views each point in the medoid problem as an arm whose unknown parameter is its average distance to all the other points. Pulling an arm corresponds to evaluating the distance of that point to a randomly chosen point, which provides an estimate of the arm's unknown parameter. We leverage the extensive literature on the multi-armed bandit problem to propose Med-dit, a variant of the Upper Confidence Bound (UCB) algorithm [17].

Like the other sampling-based algorithms RAND and TOPRANK, Med-dit aims to estimate the average distance of every point by evaluating its distance to a random subset of points. However, unlike these algorithms, where each point receives a fixed amount of sampling decided a priori, the sampling in Med-dit is done adaptively. More specifically, Med-dit maintains confidence intervals for the average distances of all points and adaptively evaluates distances only for those points which could potentially be the medoid. Loosely speaking, the points whose lower confidence bounds on the average distance are small form the set of points that could potentially be the medoid. As the algorithm proceeds, additional distance evaluations narrow the confidence intervals. This rapidly shrinks the set of points that could potentially be the medoid, enabling the algorithm to declare a medoid after evaluating relatively few distances.

In Section 2 we delve into the problem formulation and assumptions. In Section 3 we present Med-dit and analyze its performance. In Section 4 we extensively validate the performance of Med-dit empirically on two large-scale datasets: a single cell RNA-Seq gene expression dataset [1] and the Netflix-prize dataset [3].

2 Problem Formulation

Consider n points x_1, ..., x_n lying in some space equipped with a distance function d(·,·). Note that we do not assume the triangle inequality or symmetry for the distance function d; therefore the analysis here encompasses directed graphs, as well as divergences such as the Bregman divergence and the squared Euclidean distance. We use the standard notation [n] to refer to the set {1, 2, ..., n}. Let d_{i,j} = d(x_i, x_j), and let the average distance of a point i be

μ_i = (1/(n-1)) Σ_{j ≠ i} d_{i,j}.

The medoid problem can be defined as follows.

Definition 1.

(The medoid problem) For a set of points {x_1, ..., x_n}, the medoid is the point in the set that has the smallest average distance to the other points. Let x_{i*} be the medoid. The index and the average distance of the medoid are given by

i* = argmin_{i ∈ [n]} μ_i,   μ* = min_{i ∈ [n]} μ_i.

The problem of finding the medoid of the set is called its medoid problem.

In the worst case over distances and points, we would need to evaluate Ω(n²) distances to compute the medoid. Consider the following example.

Example 1.

Consider a setting with n points where all pairwise distances are 1, with the following exceptions. We pick a point i* uniformly at random from [n] and leave all distances to point i* equal to 1. For every other point i, we pick a point σ(i) uniformly at random from [n] \ {i, i*} and set d_{i,σ(i)} to a large value, say n. As a result, each point apart from i* has a large distance to at least one point, so only i* has average distance 1. Since the single large distance of each point is hidden at a uniformly random position, certifying which point has the smallest average distance requires examining a constant fraction of each point's distances, i.e., Ω(n²) distance evaluations in total. We note that this distance does not satisfy the triangle inequality, and thus is not a metric.

The issue with Example 1 is that the empirical distributions of the distances of any point (apart from the true medoid) have heavy tails. However, such worst-case instances rarely arise in practice. Consider the following real-world example.

Example 2.

We consider the ℓ1 distance between each pair of points in a cluster of points lying on a high-dimensional probability simplex, taken from a single cell RNA-Seq expression dataset [1]. Empirically, the ℓ1 distance from a given point to the other points follows a Gaussian-like distribution with a small variance, as shown in Figure 2. Modelling distances among high-dimensional points by a Gaussian distribution is also discussed in [20]. Clearly the distances do not follow a heavy-tailed distribution as in Example 1.

Figure 2: Histograms of the ℓ1 distances from a few randomly chosen points to all other points in the cluster. We note that the standard deviations are small compared to the mean distances.

2.1 The Distance Sampling Model

Example 2 suggests that for any point i, we can estimate its average distance μ_i by the empirical mean of its distances to a random subset of points. Let us state this formally. For any point i, let D_i = {d_{i,j} : j ≠ i} be the set of distances associated with point i. Define the random distance samples X_i(1), X_i(2), ... to be i.i.d., drawn uniformly with replacement from D_i. We note that

E[X_i(ℓ)] = (1/(n-1)) Σ_{j ≠ i} d_{i,j} = μ_i.

Hence the average distance can be estimated by the empirical mean

μ̂_i = (1/k) Σ_{ℓ=1}^{k} X_i(ℓ).

To capture the concentration property illustrated in Example 2, we assume that the random variables X_i(1), X_i(2), ..., i.e. samples drawn with replacement from D_i, are σ-sub-Gaussian¹ for all i ∈ [n].

¹A random variable X with mean μ is σ-sub-Gaussian if E[exp(λ(X − μ))] ≤ exp(σ²λ²/2) for all λ.
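For reference, the following standard sub-Gaussian (Hoeffding-type) bound is what this assumption buys us; it is the form of the confidence interval used in Section 3, written out here as a textbook fact rather than a new result.

    % Empirical mean of T i.i.d. sigma-sub-Gaussian samples X_i(1), ..., X_i(T) with mean mu_i:
    \[
      \hat{\mu}_i = \frac{1}{T}\sum_{\ell=1}^{T} X_i(\ell), \qquad
      \Pr\big(|\hat{\mu}_i - \mu_i| \ge \epsilon\big) \;\le\; 2\exp\!\Big(-\frac{T\epsilon^2}{2\sigma^2}\Big).
    \]
    % Setting the right-hand side to delta and solving for epsilon gives a
    % (1 - delta)-confidence interval of half-width
    \[
      C = \sqrt{\frac{2\sigma^2 \log(2/\delta)}{T}}.
    \]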

2.2 Connection with Best-Arm Problem

For points whose average distance is close to that of the medoid, we need an accurate estimate of their average distance, while for points whose average distance is far away, coarse estimates suffice. Thus the problem reduces to adaptively choosing the number of distance evaluations for each point so that, on one hand, we have good enough estimates of the average distances to identify the true medoid, and on the other hand, we minimise the total number of distance evaluations.

This problem has been addressed as the best-arm identification problem in the multi-armed bandit (MAB) literature (see, for instance, the review article [13]). In a typical setting, we have n arms. At time t, we decide to pull an arm A_t ∈ [n] and receive a reward X_t with E[X_t | A_t = i] = μ_i. The goal is to identify the arm with the largest expected reward with high probability, with as few arm pulls as possible.

The medoid problem can be formulated as a best-arm problem as follows: let each point i in the ensemble be an arm, and let the average distance μ_i be the loss of the i-th arm. The medoid corresponds to the arm with the smallest loss, i.e. the best arm. At each iteration t, pulling arm i is equivalent to sampling (with replacement) from D_i, which has expected value μ_i. As stated previously, the goal is to find the best arm (the medoid) with as few pulls (distance computations) as possible.
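Spelled out, the correspondence is as follows (a restatement in the notation of Section 2, with no additional assumptions).

    \begin{align*}
      \text{arm } i &\;\leftrightarrow\; \text{point } x_i, &
      \text{loss of arm } i &= \mu_i, \\
      \text{pulling arm } i &\;\leftrightarrow\; \text{observing } d_{i,J} \text{ with } J \sim \mathrm{Unif}([n]\setminus\{i\}), &
      \mathbb{E}[d_{i,J}] &= \mu_i, \\
      \text{best arm} &\;=\; \arg\min_{i \in [n]} \mu_i \;=\; i^*. &&
    \end{align*}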

3 Algorithm

We have n points x_1, ..., x_n. Recall that the average distance for arm i is μ_i. At each iteration t of the algorithm, we evaluate one distance to some point i. Let T_i(t) be the number of distances of point i evaluated up to time t. We use a variant of the Upper Confidence Bound (UCB) algorithm [17], sampling the distance of a point to another point chosen uniformly at random.

In the initial stages of the algorithm, we compute the empirical mean μ̂_i(t) of the sampled distances and use it as an estimate of μ_i at time t. More concretely, while T_i(t) < n,

μ̂_i(t) = (1/T_i(t)) Σ_{ℓ=1}^{T_i(t)} X_i(ℓ).

Next, we recall from Section 2 that the sampled distances are independent and σ-sub-Gaussian (as points are sampled with replacement). Thus for all i ∈ [n] and all t, with probability at least 1 − δ, |μ̂_i(t) − μ_i| < C_i(t), where

C_i(t) = sqrt( 2σ² log(2/δ) / T_i(t) ).   (1)

When T_i(t) reaches n, we instead compute the exact average distance of point i and set its confidence interval width to zero. Algorithm 1 describes the Med-dit algorithm.

1:  Evaluate the distance of each point i to one randomly chosen point and build a (1 − δ)-confidence interval [μ̂_i − C_i, μ̂_i + C_i] for the mean distance of each point i.
2:  while true do
3:     At iteration t, pick the point i that minimises μ̂_i(t) − C_i(t).
4:     if fewer than n distances of point i have been evaluated then
5:        evaluate the distance of point i to a randomly picked point; update μ̂_i(t) and C_i(t).
6:     else
7:        Set μ̂_i(t) to the exact mean distance of point i by computing its distance to all other points, and set C_i(t) = 0.
8:     if there exists a point i* such that μ̂_{i*}(t) + C_{i*}(t) ≤ μ̂_j(t) − C_j(t) for all j ≠ i* then
9:        return  x_{i*}.
Algorithm 1 Med-dit
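To make the pseudocode concrete, below is a minimal Python sketch of Algorithm 1 under the assumptions above. The names (med_dit, dist, sigma, delta) are ours, and the distance oracle dist(i, j) is a placeholder for whatever distance the application uses; the authors' released implementation at https://github.com/bagavi/Meddit is the reference, and this sketch omits its optimizations.

    import numpy as np

    def med_dit(dist, n, sigma, delta, rng=None):
        """Minimal sketch of Med-dit (Algorithm 1).

        dist(i, j) -> float : distance oracle between points i and j
        n                   : number of points
        sigma               : sub-Gaussianity parameter of the sampled distances
        delta               : confidence parameter (Theorem 1 suggests delta = 1/n**3)
        """
        rng = rng or np.random.default_rng()
        sums = np.zeros(n)               # running sum of sampled distances per point
        pulls = np.zeros(n, dtype=int)   # T_i(t): number of evaluations per point
        exact = np.zeros(n, dtype=bool)  # whether mu_i has been computed exactly

        def sample(i):
            j = rng.integers(n - 1)
            j = j if j < i else j + 1    # uniform over [n] \ {i}
            sums[i] += dist(i, j)
            pulls[i] += 1

        for i in range(n):               # line 1: one evaluation per point
            sample(i)

        while True:                      # line 2
            mu_hat = sums / pulls
            width = np.where(exact, 0.0,
                             np.sqrt(2 * sigma**2 * np.log(2 / delta) / pulls))
            lcb, ucb = mu_hat - width, mu_hat + width

            star = int(np.argmin(ucb))   # line 8: stopping condition
            if np.all(ucb[star] <= np.delete(lcb, star)):
                return star

            i = int(np.argmin(lcb))      # line 3: smallest lower confidence bound
            if pulls[i] < n and not exact[i]:   # line 4
                sample(i)                       # line 5
            else:                               # line 7: exact mean, zero-width interval
                sums[i] = sum(dist(i, j) for j in range(n) if j != i)
                pulls[i] = n - 1
                exact[i] = True

For instance, med_dit(lambda i, j: np.abs(X[i] - X[j]).sum(), len(X), sigma, 1.0 / len(X)**3) runs the sketch with the ℓ1 distance on the rows of a matrix X (here X and sigma are placeholders supplied by the user).
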
Theorem 1.

For i ≠ i*, let Δ_i = μ_i − μ*. If we pick δ = 1/n³ in Algorithm 1, then with probability at least 1 − 1/n, it returns the true medoid, and the number of distance evaluations M satisfies

M = O( n + Σ_{i ≠ i*} min( (σ²/Δ_i²) log n, n ) ).   (2)

If the Δ_i are Θ(1), then Theorem 1 gives that Med-dit takes O(n log n) distance evaluations. In addition, if one assumes a Gaussian prior on the average distances of the points (as one would expect from Figure 3), then Med-dit also takes O(n log n) distance evaluations in expectation over the prior. See Appendix 1.
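As a sanity check on the bound (2), the Δ_i = Θ(1) case is a one-line calculation:

    \[
      M \;=\; O\!\Big( n + \sum_{i \ne i^*} \min\!\Big(\tfrac{\sigma^2}{\Delta_i^2}\log n,\; n\Big) \Big)
      \;=\; O\!\Big( n + \sum_{i \ne i^*} O(\log n) \Big)
      \;=\; O(n \log n),
    \]

since each of the n − 1 summands is O(log n) when Δ_i is bounded below by a constant (σ is treated as a constant throughout).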

If we use the same assumption as TOPRANK, namely that the μ_i are i.i.d. uniform on an interval, then the number of distance evaluations needed by Med-dit is O(n^{3/2} √(log n)), compared to O(n^{5/3} log^{4/3} n) for TOPRANK. Under these assumptions, RAND takes Ω(n²) distance evaluations to return the true medoid.

We assume that σ is known in the proof below, whereas we estimate it in the empirical evaluations. See Section 4 for details.

Remark 1.

With a small modification to Med-dit, the sub-Gaussian assumption in Section 2 can be relaxed further to a finite-variance assumption [4]. On the other hand, one could theoretically adapt Med-dit to other best-arm algorithms to find the medoid with O(n) distance computations (when the Δ_i are constants), at the cost of a large constant factor [14]. Refer to Appendix 2 for details.

Figure 3: The distribution of the mean distances μ_i of all the points in the single cell RNA-Seq dataset.
Proof.

Let M be the total number of distance evaluations when the algorithm stops. As defined above, let T_i(t) be the number of distance evaluations of point i up to time t.

We first assume that the C_i(t) are true (1 − δ)-confidence intervals (recall that δ = 1/n³) and show the result. We then prove this statement.

Without loss of generality, let point 1 be the true medoid. We note that if we choose to update an arm i ≠ 1 at time t, then we have

μ̂_i(t) − C_i(t) ≤ μ̂_1(t) − C_1(t).

For this to occur, at least one of the following three events must occur:

E1 = { μ̂_i(t) ≤ μ_i − C_i(t) },   E2 = { Δ_i ≤ 2 C_i(t) },   E3 = { μ̂_1(t) ≥ μ_1 + C_1(t) }.

To see this, note that if none of E1, E2, E3 occur, we have

μ̂_i(t) − C_i(t) > μ_i − 2C_i(t) > μ_i − Δ_i = μ_1 > μ̂_1(t) − C_1(t),

where the first, second, and third inequalities follow because E1, E2, and E3 do not hold, respectively; this contradicts the inequality above.

We note that T_i(t) ≤ n, as we compute confidence intervals at most n times for each point. Thus we have at most n² computations of confidence intervals in total.

Thus E1 and E3 do not occur during any iteration, with probability at least 1 − 1/n, because by the union bound over the at most n² confidence intervals computed,

P( some confidence interval fails ) ≤ n² δ = 1/n.   (3)

This also implies that, with probability at least 1 − 1/n, the algorithm does not evaluate any further distances to a point i ≠ 1 once the event E2, a deterministic condition on T_i(t), stops occurring.

Let τ_i be the iteration of the algorithm when it evaluates a distance to point i for the last time. From the previous discussion, the algorithm stops evaluating distances to a point i ≠ 1 once

2 C_i(τ_i) < Δ_i,   i.e.   T_i(τ_i) > (8σ²/Δ_i²) log(2/δ).

Thus with probability at least 1 − 1/n, the algorithm returns x_1 as the medoid with at most M distance evaluations, where

M ≤ 2n + Σ_{i ≠ 1} min( (8σ²/Δ_i²) log(2/δ) + 1, 2n ) = O( n + Σ_{i ≠ 1} min( (σ²/Δ_i²) log n, n ) );

the min with 2n accounts for points whose exact mean distance is computed (at most n sampled distances plus n − 1 exact evaluations), and the leading 2n accounts for the medoid itself.

To complete the proof, we next show that the C_i(t) are true (1 − δ)-confidence intervals for μ_i. We observe that if point i has been picked by Med-dit fewer than n times at time t, then T_i(t) equals the number of times the point has been picked, and C_i(t) is the true (1 − δ)-confidence interval from Eq (1).

However, if point i is picked for the n-th time at some iteration t (line 7 of Algorithm 1), then the empirical mean is replaced by the exact mean, computed by evaluating all n − 1 of its distances (some distances are thus evaluated again). Hence μ̂_i(t) = μ_i and C_i(t) = 0. As we then know the mean distance of point i exactly, the degenerate interval C_i(t) = 0 is still a valid confidence interval.

We remark that if a point i whose exact mean has been computed is picked again to be updated, then, as C_i(t) = 0, we have for all j that

μ̂_i(t) + C_i(t) = μ̂_i(t) − C_i(t) ≤ μ̂_j(t) − C_j(t).

This gives us that the stopping criterion is satisfied and point i is declared the medoid.

4 Empirical Results

We empirically evaluate the performance of Med-dit on two real-world large-scale high-dimensional datasets: the Netflix-prize dataset [3] and the 10x single cell RNA-Seq dataset [1].

We picked these large datasets because they have sparse high-dimensional feature vectors and are the subject of active efforts to cluster using non-Euclidean distances. In both datasets, we use a set of randomly chosen points to estimate the sub-Gaussianity parameter σ by cross-validation, and we set δ as prescribed by Theorem 1.
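The cross-validation procedure is not spelled out here; the following is one plausible sketch of how σ can be estimated from a random subsample (the subsample sizes and the use of the largest per-point standard deviation are our choices, not the authors' protocol).

    import numpy as np

    def estimate_sigma(dist, n, num_points=100, num_samples=200, rng=None):
        """Estimate a sub-Gaussianity parameter sigma from a random subsample.

        For a handful of randomly chosen points, sample distances to random
        other points and take a conservative (large) per-point standard
        deviation as a proxy for sigma.
        """
        rng = rng or np.random.default_rng()
        stds = []
        for i in rng.choice(n, size=min(num_points, n), replace=False):
            js = rng.choice(n, size=num_samples)
            d = np.array([dist(i, j) for j in js if j != i])
            stds.append(d.std())
        return float(np.max(stds))  # a high quantile is a less conservative option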

Running trimed and TOPRANK was computationally prohibitive due to the large size and high dimensionality of the above two datasets. Therefore, on these we compare Med-dit only to RAND. We also ran Med-dit on three real-world datasets tested in [24] to compare it to trimed and TOPRANK.

4.1 Single Cell RNA-Seq

Rapid advances in sequencing technology have enabled us to sequence DNA/RNA on a cell-by-cell basis, and single cell RNA-Seq datasets can contain up to a million cells [1, 16, 22, 30].

This technology involves sequencing an ensemble of single cells from a tissue and obtaining the gene expression of each cell. These gene expressions are subsequently used to cluster the cells in order to discover subclasses that would have been hard to physically isolate and study separately. The technology therefore allows biologists to infer the diversity in a given tissue.

We note that tens of thousands of gene expressions are measured in each cell, and millions of cells are sequenced in each dataset. Therefore single cell RNA-Seq presents us with a very important large-scale clustering problem with high-dimensional data [30]. Moreover, the distance of interest here is the ℓ1 distance: since the features of each point form a probability distribution, Euclidean (or squared Euclidean) distances do not capture the relevant subtleties, as discussed in [2, 25]. Further, very few genes are expressed by most cells, and the data is thus sparse.
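For concreteness, here is a minimal sketch of the distance used in this setting: each cell's expression vector is normalized to a probability distribution and two cells are compared with the ℓ1 distance, computed on a sparse representation. The scipy-based code is our illustration rather than the authors' pipeline.

    import numpy as np
    from scipy import sparse

    def l1_distance(counts, i, j):
        """l1 distance between cells i and j of a sparse count matrix.

        counts : scipy.sparse matrix with rows = cells and columns = genes.
        Each row is normalized to sum to one (assumed non-empty), so it lies
        on the probability simplex, before the distance is taken.
        """
        pi = counts.getrow(i)
        pj = counts.getrow(j)
        pi = pi / pi.sum()
        pj = pj / pj.sum()
        return float(np.abs((pi - pj).toarray()).sum())

    # Example: counts = sparse.random(1000, 20000, density=0.05, format="csr")
    #          l1_distance(counts, 0, 1)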

4.1.1 10xGenomics Mouse Dataset

We test the performance of Med-dit on the single cell RNA-Seq dataset of [1]. This dataset from 10xGenomics consists of the gene expression of 1.3 million neural cells from the cortex, hippocampus, and subventricular zone of a mouse brain. These cells are partitioned into clusters. Most of the gene expression entries in this dataset are zero.

We test Med-dit on subsets of this dataset of two sizes:

  • small dataset: a randomly chosen subset of cells. We use this because we can compute the true medoid on this dataset by brute force, and thus can compare the performance of Med-dit and RAND against the ground truth.

  • large dataset: cluster 1, the largest cluster in this dataset. We use the most commonly returned point as the true medoid for comparison. We note that this point is the same for both Med-dit and RAND, and has the smallest mean distance among the top points returned across the trials of both.

Figure 4: Results on the small dataset, for which we computed the true medoid by brute force. The y-axis shows the probability that the point with the smallest estimated mean distance is not the true medoid, as a function of the number of pulls per arm. We note that Med-dit has a stopping condition while RAND does not; however, we ignore the stopping condition for Med-dit here. Med-dit stops after a small number of distance evaluations per point without failing in any of the trials, while RAND needs many more distance evaluations per point to reach a comparable probability of error.

Performance Evaluation: We compare the performance of RAND and Med-dit in Figures 1 and 4 on the large and the small dataset respectively. We note that in both cases, RAND needs roughly 5-10 times more distance evaluations to achieve the error rate at which Med-dit stops (without an error in any of the trials).

On the small dataset, we note that Med-dit stopped after a modest number of distance evaluations per arm in each of the trials and never returned the wrong answer, while RAND needs several times more distance evaluations per point to obtain a comparable error rate.

On the large dataset (cluster 1), the behaviour is similar: Med-dit stopped after a modest number of distance evaluations per arm in each of the trials and never returned the wrong answer, while RAND again needs several times more distance evaluations per point to obtain a comparable error rate.

4.2 Recommendation Systems

With the explosion of e-commerce, recommending products to users has become an important problem. Research in this area took off with Netflix releasing the Netflix-prize dataset [3].

In such datasets, we typically have either the ratings given by users to different products (movies, books, etc.) or the items bought by different users. The task at hand is to recommend products to a user based on the behaviour of similar users. Thus a common approach is to cluster similar users and to use trends within a cluster to recommend products.

We note that in such datasets we typically have the behaviour of millions of users over tens of thousands of items. Thus recommendation systems present us with an important large-scale clustering problem in high dimensions [7]. Further, most users buy (or rate) very few items in the inventory, so the data is sparse. Moreover, the number of items bought (or rated) varies significantly across users, and hence distance measures that take this into account, such as the cosine distance and the Jaccard distance, are of interest for clustering, as discussed in [19, Chapter 9].
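As an illustration of such a measure, here is a minimal sketch of the cosine distance between two users' sparse rating vectors (our code, assuming a scipy sparse ratings matrix; the Jaccard distance would instead compare the sets of rated items).

    import numpy as np
    from scipy import sparse

    def cosine_distance(ratings, u, v):
        """Cosine distance between the rating vectors of users u and v.

        ratings : scipy.sparse matrix with rows = users and columns = items.
        The distance depends only on the angle between the vectors, so users
        who rate different numbers of items remain comparable.
        """
        ru = ratings.getrow(u)
        rv = ratings.getrow(v)
        denom = np.sqrt(ru.multiply(ru).sum()) * np.sqrt(rv.multiply(rv).sum())
        if denom == 0:
            return 1.0  # convention for users with no ratings
        return 1.0 - float(ru.multiply(rv).sum()) / denom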

4.2.1 Netflix-prize Dataset

Figure 5: Left: the smaller Netflix-prize subset, for which we computed the true medoid by brute force. The y-axis shows the probability that the point with the smallest estimated mean distance is not the true medoid, as a function of the number of pulls per arm. We note that Med-dit has a stopping condition while RAND does not; however, we ignore the stopping condition for Med-dit here. Med-dit stops after a small number of distance evaluations per point without failing in any of the trials, while RAND needs many more distance evaluations to reach a comparable probability of error. Right: the larger Netflix-prize subset. The y-axis again shows the probability that the point with the smallest estimated mean distance is not the true medoid as a function of the number of pulls per arm. Med-dit stops after a small number of distance evaluations per point with a small probability of error, while RAND takes many more distance evaluations to reach a comparable probability of error. The computation of the true medoid here is computationally prohibitive, so for comparison we use the most commonly returned point as the true medoid (which was the same for both Med-dit and RAND).

We test the performance of Med-dit on the Netflix-prize dataset of [3]. This dataset from Netflix consists of the ratings of 17,770 movies by 480,189 users. Only around 1% of the entries in the ratings matrix are non-zero. As discussed in [19, Chapter 9], cosine distance is a popular measure of dissimilarity here.

As in Section 4.1.1, we use a small and a large subset of the users, picked at random. For the former we compute the true medoid by brute force, while for the latter we use the most commonly returned point as the true medoid for comparison. We note that this point is the same for both Med-dit and RAND, and has the smallest mean distance among the top points returned across the trials of both.

Performance Evaluation:

On the smaller user subset (Figure 5, left), we note that Med-dit stopped after a modest number of distance evaluations per arm in each of the trials and never returned the wrong answer, while RAND needs several times more distance evaluations per point to obtain a comparable error rate.

On the larger user subset of the Netflix-prize dataset (Figure 5, right), we note that Med-dit stopped within 150 distance evaluations per arm in each of the 1000 trials (with 120 distance evaluations per arm on average) and returned the wrong answer only once. RAND needs around 500 distance evaluations per point to obtain a comparable error rate.

4.3 Performance on Previous Datasets

We show the performance of Med-dit on three datasets used in prior works (Europe [11], Gnutella [27], and MNIST [18]) in Table 1. The numbers of distance evaluations for TOPRANK and trimed are taken from [24]. We note that the performance of Med-dit on real-world datasets with low-dimensional feature vectors (like Europe) is worse than that of trimed, while it is better on datasets with high-dimensional feature vectors (like MNIST).

We conjecture that this is because distances in high dimensions are averages of simple per-coordinate functions and hence have Gaussian-like behaviour due to the Central Limit Theorem (and the delta method [6]).

Dataset     TOPRANK     trimed     Med-dit
Europe
Gnutella                NA
MNIST
Table 1: Number of distance evaluations on the real-world datasets of [24]; the figures for TOPRANK and trimed are taken from there. We note that the performance of Med-dit on datasets whose features are low-dimensional (Europe) is not as good as that of trimed, while on high-dimensional datasets (MNIST) Med-dit performs much better.

4.4 Inner workings of Med-dit

Empirical running time: We computed the true mean distance μ_i and the variance of the distances for all the points in the small dataset by brute force. Figure 3 shows the histogram of the μ_i's. Plugging these values into Theorem 1 gives a theoretical average number of distance evaluations that is close to the empirical average observed over our experiments. This suggests that the theoretical analysis captures the essence of the algorithm.

Progress of Med-dit: At any iteration of Med-dit, only the points whose lower confidence bound is smaller than the smallest upper confidence bound among all points can have their distances evaluated. At any iteration, we say that such points are under consideration. We note that a point that goes out of consideration can come back under consideration at a later iteration (this happens if the smallest upper confidence bound increases). We illustrate the fraction of points under consideration at different iterations (indexed by the number of distance evaluations per point) in the left panel of Figure 6. We also pick snapshots of the algorithm and illustrate the true distance distributions of the points under consideration at those epochs in the right panel of Figure 6. We note that both the mean and the standard deviation of these distributions decrease as the algorithm progresses.²

²When the number of points under consideration is large, we draw the distribution using the true distances of a random sample of them.

Confidence intervals at stopping time: Med-dit computes the mean distance of a few points exactly. These are usually points whose mean distance is close to that of the medoid. Points whose mean distances are farther from that of the medoid have fewer distances evaluated, as their confidence intervals suffice to rule them out as the medoid. The confidence intervals of the top points in one run of the experiment are shown in Figure 7 for the single cell RNA-Seq dataset. This allows the algorithm to save on distance computations while returning the correct answer.

Speed-accuracy tradeoff: Choosing a larger value of δ results in smaller confidence intervals around the estimates μ̂_i, which in turn reduces the number of distance evaluations. One could also run Med-dit for a fixed number of iterations and return the point with the smallest mean estimate. Both of these methods improve the running time at the cost of accuracy. In practice, the first method has a better trade-off between accuracy and running time, whereas the second has a deterministic stopping time.

Figure 6: In the left panel, we show the number of points under consideration at different iterations of Med-dit. We note that although the number of points under consideration shows a decreasing trend, it does not decrease monotonically. In the right panel, we show the distribution of the true distances among the arms under consideration at various snapshots of the algorithm. We note that both the mean and the variance of these distributions keep decreasing. The last snapshot shows only the distance of the point declared the medoid.
Figure 7: Confidence intervals of the points with the smallest estimated mean distances in the single cell dataset at the termination of Med-dit. The true means are shown by the orange line, while the estimated means are shown by the pink line. We note that the distances of a few points are computed to all other points. We also note that the mean of the medoid is less than the lower confidence bound of all the other points.

References

  • [1] 10xGenomics. 1.3 Million Brain Cells from E18 Mice. 10x Genomics, 2017. available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/1M_neurons.
  • [2] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 259–269. IEEE, 2000.
  • [3] James Bennett, Stan Lanning, et al. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35. New York, NY, USA, 2007.
  • [4] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
  • [5] Olivier Catoni et al. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185. Institut Henri Poincaré, 2012.
  • [6] Aad W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 1998.
  • [7] Srivatsava Daruru, Nena M Marin, Matt Walker, and Joydeep Ghosh. Pervasive parallelism in data mining: dataflow solution to co-clustering large and sparse netflix data. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1115–1124. ACM, 2009.
  • [8] David Eppstein and Joseph Wang. Fast approximation of centrality. Graph Algorithms and Applications 5, 5:39, 2006.
  • [9] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
  • [10] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
  • [11] P. Fränti, M. Rezaei, and Q. Zhao. Centroid index: cluster level similarity measure. Pattern Recognition, 47(9):3034–3045, 2014.
  • [12] Charles AR Hoare. Algorithm 65: find. Communications of the ACM, 4(7):321–322, 1961.
  • [13] Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–6. IEEE, 2014.
  • [14] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1238–1246, 2013.
  • [15] Leonard Kaufman and Peter Rousseeuw. Clustering by means of medoids. North-Holland, 1987.
  • [16] Allon M Klein, Linas Mazutis, Ilke Akartuna, Naren Tallapragada, Adrian Veres, Victor Li, Leonid Peshkin, David A Weitz, and Marc W Kirschner. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161(5):1187–1201, 2015.
  • [17] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • [18] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [19] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge university press, 2014.
  • [20] George C Linderman and Stefan Steinerberger. Clustering with t-sne, provably. arXiv preprint arXiv:1706.02582, 2017.
  • [21] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.
  • [22] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
  • [23] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA., 1967.
  • [24] James Newling and François Fleuret. A sub-quadratic exact medoid algorithm. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
  • [25] Vasilis Ntranos, Govinda M Kamath, Jesse M Zhang, Lior Pachter, and David N Tse. Fast and accurate single-cell rna-seq analysis by clustering of transcript-compatibility counts. Genome biology, 17(1):112, 2016.
  • [26] Kazuya Okamoto, Wei Chen, and Xiang-Yang Li. Ranking of closeness centrality for large-scale social networks. In International Workshop on Frontiers in Algorithmics, pages 186–195. Springer, 2008.
  • [27] Matei Ripeanu, Ian Foster, and Adriana Iamnitchi. Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for system design. arXiv preprint cs/0209028, 2002.
  • [28] Hugo Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci, 1(804):801, 1956.
  • [29] Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779, 2015.
  • [30] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8:14049, 2017.

1 The O(n log n) Distance Evaluations Under a Gaussian Prior

We assume that the mean distances μ_i of the points are i.i.d. samples from a Gaussian distribution. Let X be a random variable with the same law as the μ_i.

From the concentration of the minimum of n i.i.d. Gaussians, the smallest mean μ* concentrates around its typical value (roughly √(2 log n) standard deviations below the prior mean), which in turn controls the gaps Δ_i = μ_i − μ*. We note that, by Eq (2), the expected number of distance evaluations is of the order of the expectation of Σ_i min((σ²/Δ_i²) log n, n), where the expectation is taken with respect to the randomness of the μ_i.

To show this, it is enough to show that,

for some constant .

To compute that for this prior, we divide the real line into three intervals, namely

  • ,

  • ,

  • ,

and compute the expectation on these three ranges. We note that for if , while for , . Thus we have that,

where we use the Bounded Convergence Theorem in the last step.

We next show that all three terms in the above equation are of the required order. We first handle the two easier terms and then proceed to the remaining one.

  • To establish that is , we start by defining,

    Further note that,

    Thus

  • To establish that is , we note that,

  • Finally to establish that is , we note that,

    Letting to go to faster than , we see that,

This gives us that, under this model, the expected number of distance evaluations is O(n log n).

2 Extensions to the Theoretical Results

In this section we discuss two extensions to the theoretical results presented in the main article:

  1. We relax the sub-Gaussian assumption used in the analysis of Med-dit. We show that assuming the sampled distances have finite variance is enough for a variant of Med-dit to need essentially the same number of distance evaluations as in Theorem 1. The variant uses a different estimator for the mean distance of each point, namely Catoni's M-estimator [5], instead of the empirical average, leveraging the work of [4].

  2. We note that there exists an algorithm which can compute the medoid with O(n) distance evaluations when the Δ_i are constants. The algorithm, called the exponential-gap algorithm [14], is discussed below.

2.1 Weakening the Sub-Gaussian Assumption

We note that in order to obtain the O(n log n) sample complexity, Med-dit relies on a concentration bound whose tail probability decays exponentially. This is achieved in Theorem 1 by assuming that, for each point i, the random variable obtained by sampling with replacement from D_i is σ-sub-Gaussian. As a result, we have the sub-Gaussian tail bound that for any point i at time t, with probability at least 1 − δ, the empirical mean satisfies

|μ̂_i(t) − μ_i| < sqrt( 2σ² log(2/δ) / T_i(t) ).

In fact, as pointed out by [4], to achieve this sample complexity all we need is a performance guarantee like the one shown above for the empirical mean. To be more precise, we need the following property.

Assumption 1.

[4] Let ε ∈ (0, 1] be a positive parameter and let c, v be positive constants. Let X_1, ..., X_T be i.i.d. random variables with finite mean μ. Suppose that for all δ ∈ (0, 1) there exists an estimator μ̂ = μ̂(T, δ) such that, with probability at least 1 − δ,

μ̂ ≤ μ + v^{1/(1+ε)} ( c log(1/δ) / T )^{ε/(1+ε)},

and the symmetric lower-tail bound holds with probability at least 1 − δ.

Remark 2.

If the distribution of the X_i satisfies the σ-sub-Gaussian condition, then Assumption 1 is satisfied for ε = 1, c = 2, and variance factor v = σ².

However, Assumption 1 can be satisfied under conditions much weaker than the sub-Gaussian condition. One way is to substitute the empirical mean estimator with a refined mean estimator that enjoys an exponential tail bound. Specifically, as suggested by [4], we can use Catoni's M-estimator [5].

Catoni's M-estimator is defined as follows: let ψ be a continuous, strictly increasing function satisfying

−log(1 − x + x²/2) ≤ ψ(x) ≤ log(1 + x + x²/2).

Let δ ∈ (0, 1) be such that T > 2 log(1/δ), and introduce

α_δ = sqrt( 2 log(1/δ) / ( T ( σ² + 2σ² log(1/δ) / (T − 2 log(1/δ)) ) ) ).

If X_1, ..., X_T are i.i.d. random variables, Catoni's estimator μ̂_C is defined as the unique value y such that

Σ_{ℓ=1}^{T} ψ( α_δ (X_ℓ − y) ) = 0.

Catoni [5] proved that if T ≥ 4 log(1/δ) and the X_ℓ have mean μ and variance at most σ², then with probability at least 1 − 2δ,

|μ̂_C − μ| ≤ 2σ sqrt( log(1/δ) / T ).   (4)
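A minimal Python sketch of this estimator is given below (our code; the choice ψ(x) = sign(x) log(1 + |x| + x²/2), which satisfies the two constraints above, and the bisection root-finder are implementation choices rather than part of [5]).

    import numpy as np

    def catoni_estimate(x, sigma, delta, tol=1e-9):
        """Catoni's M-estimator of the mean of samples x with variance <= sigma**2.

        Solves sum_l psi(alpha * (x_l - y)) = 0 for y by bisection, where
        psi(u) = sign(u) * log(1 + |u| + u**2 / 2).
        Requires len(x) > 2 * log(1 / delta).
        """
        x = np.asarray(x, dtype=float)
        T = len(x)
        log1d = np.log(1.0 / delta)
        assert T > 2 * log1d, "need more samples for this confidence level"
        alpha = np.sqrt(2 * log1d / (T * (sigma**2 + 2 * sigma**2 * log1d / (T - 2 * log1d))))

        def score(y):
            u = alpha * (x - y)
            return np.sum(np.sign(u) * np.log1p(np.abs(u) + 0.5 * u**2))

        lo, hi = x.min() - 1.0, x.max() + 1.0  # score is decreasing in y; root is bracketed
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if score(mid) > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)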

The corresponding modification to Med-dit is as follows.

  1. For the initialization step, sample each point ⌈4 log(1/δ)⌉ times to meet the condition for the concentration bound of Catoni's M-estimator.

  2. For each arm i with T_i(t) < n, maintain the confidence interval [μ̂_{C,i}(t) − C_i(t), μ̂_{C,i}(t) + C_i(t)], where μ̂_{C,i}(t) is Catoni's estimator of μ_i and C_i(t) = 2σ sqrt( log(1/δ) / T_i(t) ).

Proposition 1.

For i ≠ i*, let Δ_i = μ_i − μ*. If we pick δ = 1/n³ in the above algorithm, then with probability at least 1 − 1/n, it returns the true medoid with a number of distance evaluations M such that

M = O( n log n + Σ_{i ≠ i*} min( (σ²/Δ_i²) log n, n ) ).

Proof.

The initialization step takes an extra O(n log n) distance computations, since each point is sampled ⌈4 log(1/δ)⌉ = O(log n) times. Following the same argument as in the proof of Theorem 1, with the confidence intervals of Eq (4) in place of Eq (1), the modified algorithm returns the true medoid with probability at least 1 − 1/n, and apart from the initialization, the total number of distance computations can be upper bounded by O( Σ_{i ≠ i*} min( (σ²/Δ_i²) log n, n ) ). So the total number of distance computations can be upper bounded as stated.

Remark 3.

By using Catoni's estimator instead of the empirical mean, we need only the much weaker assumption that the sampled distances have finite variance to obtain the same guarantee as Theorem 1.

2.2 On the O(n) Algorithm

The best-arm algorithm exponential-gap [14] can be applied directly to the medoid problem. It takes O( Σ_{i ≠ i*} (σ²/Δ_i²) log log (1/Δ_i) ) distance evaluations, which is essentially O(n) if the Δ_i are constants. It is a variation of the family of action elimination algorithms for the best-arm problem. A typical action elimination algorithm proceeds as follows. It maintains a set of surviving arms S_r for r = 1, 2, ..., initialized as S_1 = [n]. It then proceeds in epochs: in epoch r it samples each arm in S_r a predetermined number of times t_r and retains arms according to the rule

S_{r+1} = { i ∈ S_r : μ̂_i ≤ μ̂_{i_r} + 2^{−r} },

where i_r is a reference arm, e.g. the arm with the smallest estimate μ̂_i. The algorithm terminates when S_r contains only one element.

The above vanilla version of the action elimination algorithm takes O(n log n) distance evaluations, the same as Med-dit. The improvement of exponential-gap comes from observing that the suboptimal log n factor is due to the large deviations of the reference estimate μ̂_{i_r}. Instead, exponential-gap uses the subroutine median elimination [10] to determine an alternative reference arm with smaller deviations, which allows the removal of the log n term; median elimination takes O(n/ε²) distance evaluations to return an ε-optimal arm. However, this introduces a prohibitively large constant due to the use of median elimination. Regarding the technical details, we note that both papers [14, 10] assume boundedness of the random variables in their proofs, which is used only to obtain the Hoeffding concentration bound. Therefore, under our sub-Gaussian assumption, their proofs carry over line by line.
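For intuition, here is a minimal Python sketch of the vanilla action-elimination loop described above, adapted to losses (smaller is better). The epoch lengths and the 2^{−r} threshold follow the generic recipe under the sub-Gaussian assumption; the constants are ours and not those of [14] or [10].

    import numpy as np

    def action_elimination(pull, n, sigma, delta):
        """Vanilla action elimination for the arm with the smallest mean loss.

        pull(i) -> float : one sample of arm i's loss (one distance evaluation).
        In epoch r every surviving arm is sampled often enough that its mean
        is known to accuracy about 2**-r, and arms that are clearly worse than
        the best current estimate are removed.
        """
        survivors = list(range(n))
        sums = np.zeros(n)
        counts = np.zeros(n, dtype=int)
        r = 1
        while len(survivors) > 1:
            eps = 2.0 ** (-r)
            # Enough samples for each surviving mean to be (eps / 2)-accurate w.h.p.
            target = int(np.ceil(8 * sigma**2 * np.log(4 * n * r**2 / delta) / eps**2))
            for i in survivors:
                while counts[i] < target:
                    sums[i] += pull(i)
                    counts[i] += 1
            means = {i: sums[i] / counts[i] for i in survivors}
            best = min(means.values())          # reference: smallest estimated loss
            survivors = [i for i in survivors if means[i] <= best + eps]
            r += 1
        return survivors[0]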