    The celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. We propose adaptive Monte Carlo optimization: a general framework for the discrete optimization of an expensive-to-compute function by adaptive random sampling. Applications of this framework have already appeared in machine learning, but they are tied to their specific contexts and were developed in isolation. We take a unified view and show that the framework has broad applicability by applying it to several common machine learning problems: k-nearest neighbors, hierarchical clustering and maximum mutual information feature selection. On real data, we show that this framework allows us to develop algorithms that confer a gain of one to two orders of magnitude over exact computation. We also characterize the performance gain theoretically under regularity assumptions on the data that we verify on real-world data. The code is available at https://github.com/govinda-kamath/combinatorial_MAB.


## 1 Introduction

The use of random sampling to efficiently estimate a deterministic quantity dates back to Buffon's needle experiment, performed by Mario Lazzarini in 1901, and was later developed into the celebrated Monte Carlo method by Stanisław Ulam. In computer science, this idea has been applied to develop techniques such as locality-sensitive hashing (Gionis et al., 1999) and sketching (Alon et al., 1999). More recently it has been used for fast approximations of matrix operations like matrix multiplication, low-rank approximation and matrix decomposition (Drineas et al., 2006a, b, c), and to approximate Hessians for convex optimization (Pilanci and Wainwright, 2016, 2017). These techniques find wide applications in machine learning.

In this paper, we consider the use of random sampling to solve discrete optimization problems of the form:

 $\min_{i \in \mathcal{I}} f(i)$ (1)

where $\mathcal{I}$ is a large but finite set and $f$ is an expensive-to-compute function. A direct application of the Monte Carlo method to this problem is to generate enough samples to estimate each value $f(i)$ accurately and then compute the minimum of the estimates. However, this is computationally inefficient, since the values $f(i)$ that are much larger than the minimum need not be estimated as accurately as the values $f(i)$ close to the minimum. Instead, a more efficient procedure is to first estimate all the $f(i)$'s crudely using few samples and then adaptively focus the sampling on the more promising candidates. We term this general approach adaptive Monte-Carlo optimization.
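This crude-then-refine loop can be sketched in a few lines. The following is a minimal illustration, not the paper's algorithm: it assumes each sample of $f(i)$ is unbiased with $1$-sub-Gaussian noise and uses an LUCB-style stopping rule; all function and variable names are ours.

```python
import math
import random

def adaptive_mc_min(sample, arms, delta=0.01, max_pulls=100_000):
    """Estimate argmin_i f(i) by adaptive sampling: crude estimates of
    every arm first, then sampling focused on promising candidates.
    `sample(i)` returns one unbiased noisy sample of f(i)."""
    counts = {i: 0 for i in arms}
    means = {i: 0.0 for i in arms}

    def pull(i):
        counts[i] += 1
        means[i] += (sample(i) - means[i]) / counts[i]  # O(1) running mean

    def width(i):  # sub-Gaussian confidence-interval width (sigma = 1 assumed)
        return math.sqrt(2 * math.log(len(arms) / delta) / counts[i])

    for i in arms:                      # phase 1: crude estimate of every arm
        pull(i)
    while True:                         # phase 2: adaptive refinement
        best = min(arms, key=lambda i: means[i])
        rival = min((i for i in arms if i != best),
                    key=lambda i: means[i] - width(i))
        # stop when best's upper bound clears the most optimistic rival
        if (means[best] + width(best) <= means[rival] - width(rival)
                or counts[best] >= max_pulls):
            return best
        pull(best)
        pull(rival)
```

With well-separated arms, the loop concentrates almost all of its samples on the top one or two candidates, which is where the computational savings come from.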

Instances of this general approach have already appeared in machine learning:


• The Monte-Carlo tree search method (Chang et al., 2005; Kocsis and Szepesvári, 2006) to solve large-scale Markov decision problems. Here, the set $\mathcal{I}$ is the set of all possible actions at a state, and $f(i)$ is the expected value of the total reward obtained by taking action $i$ at that state and then following the optimal policy. The problem of interest is finding the action that maximizes the reward, i.e., solving (1) (but with minimization replaced by maximization) at some state. The Monte-Carlo tree search method is an improvement over the non-adaptive Monte-Carlo planning method of Kearns et al. (2002), and is a central component of modern reinforcement learning systems such as AlphaZero (Silver et al., 2017).

• Adaptive hyper-parameter tuning for deep neural networks (Jamieson and Talwalkar, 2016; Li et al., 2016). Here, the set $\mathcal{I}$ consists of the possible hyper-parameter configurations, and $f(i)$ is the validation performance of the neural network under the $i$th hyper-parameter configuration. The problem is to find the best hyper-parameter configuration using the validation data.

• Computing the medoid (exemplar) of a large number of points in high-dimensional space (Bagaria et al., 2018). Here, the set $\mathcal{I}$ consists of all the points in the dataset, and $f(i)$ is the average of the distances $d_{ij}$ of point $i$ to all the other points, i.e.

 $f(i) = \frac{1}{n-1} \sum_{j \neq i} d_{ij}.$

The medoid is the point in the dataset that minimizes the average distance to the other points.

Figure 1: Performance gain of adaptive Monte Carlo optimization algorithms over the brute-force algorithm in terms of the number of computations in four applications: (a) k-nearest neighbors graph construction, (b) k-means, (c) hierarchical clustering, (d) maximum mutual information feature selection, as a function of the number of points and the dimension. We see a gain roughly linear in the dimension in the first three and linear in the sample size in the last, consistent with the theoretical results in Table 1. For k-nearest neighbors, k-means and hierarchical clustering we use the tiny-imagenet dataset. For maximum mutual information we use the UCI greenhouse-gas dataset. More details are in Appendix B.

Each of these three problems is solved by a reformulation into a multi-armed bandit problem, where each arm corresponds to an element $i$ of the set $\mathcal{I}$, each arm's reward corresponds to $f(i)$, and each pull of arm $i$ corresponds to generating a sample to update the estimate of $f(i)$. For example, in the medoid problem, each pull of arm $i$ corresponds to randomly sampling another point $j$, computing the distance $d_{ij}$ from point $i$, and using it to update the estimate of the average distance of point $i$ to the other points. Well-known algorithms developed for multi-armed bandits, like the upper confidence bound (UCB) algorithm (Lai and Robbins, 1985), can be used to derive good adaptive sampling strategies. Computational savings of one or more orders of magnitude can sometimes be obtained by using such an algorithm. For example, under some mild conditions on the distances, the computation cost of the medoid of $n$ points can be reduced from the $O(n^2)$ distance evaluations of naive exact computation to $O(n \log n)$ distance evaluations using the UCB algorithm, with a high-probability correctness guarantee (Bagaria et al., 2018). To find the medoid to within a small multiplicative accuracy, such an algorithm needs far fewer distance computations than a non-adaptive sampling method (Eppstein and Wang, 2006).
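For concreteness, one pull of arm $i$ in the medoid problem might look as follows (an illustrative sketch; the names and the euclidean metric are our choices, not the paper's implementation):

```python
import random

def pull_medoid_arm(points, i, state):
    """One pull of arm i: sample a uniformly random other point j, compute
    the distance d_ij, and fold it into the running estimate of point i's
    average distance. `state[i]` holds the pair (num_pulls, mean_so_far)."""
    j = random.randrange(len(points) - 1)
    if j >= i:                      # j uniform over the other n - 1 points
        j += 1
    d_ij = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
    count, mean = state.get(i, (0, 0.0))
    count += 1
    mean += (d_ij - mean) / count   # O(1) running-mean update
    state[i] = (count, mean)
    return mean
```

Each pull costs one distance computation, so a bandit strategy that concentrates pulls on promising arms directly reduces the total number of distance computations.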

Curiously, although the solution approaches to these three problems are similar, their individual development is tied to the specific context and done in isolation from the others. The goal of the present paper is to advocate a broader and more unified view of adaptive Monte-Carlo optimization as a general framework which can be applied to a wide range of other problems in machine learning and beyond. Such opportunities are enhanced particularly by the fact that modern datasets often consist of not only a large number of samples but also a large number of features, or dimensions. Thus, both a computation across all features for a sample and a computation across all samples for a feature can be expensive, and their costs can be significantly ameliorated by adaptive Monte-Carlo optimization.

We support our thesis by applying the framework to develop new methods for several problems and evaluate their performance empirically on large datasets (Figure 1) as well as theoretically (Table 1).


• Nearest neighbor: The basic problem of computing the nearest neighbor of a point among $n$ points in $d$-dimensional space is expensive if both $n$ and $d$ are large. For many common distance measures which are additive, i.e., for which the distance between two points $x$ and $y$ can be written as:

 $\rho(x, y) = \sum_{j=1}^{d} \rho_j(x_j, y_j),$

where $x_j$ is the $j$th component of the point $x$ (and respectively for $y$), the adaptive Monte-Carlo optimization framework can be applied. The set $\mathcal{I}$ is the set of all contending points, $f(i)$ is the distance of point $i$ to the point of interest, and each sample for $f(i)$ is an evaluation of the distance from point $i$ to the point of interest on a randomly selected coordinate. We apply this accelerated nearest neighbor subroutine to improve $k$-nearest neighbors graph construction as well as the assignment step of the Lloyd iteration in the $k$-means algorithm.

• Hierarchical clustering: Hierarchical clustering algorithms take $n$ points in $d$-dimensional space and derive a binary tree representing the closeness among the points. Agglomerative algorithms start off with $n$ singleton clusters, each corresponding to a point, and at every step the two closest clusters are merged to form a new cluster, until there is one cluster left. Finding the two closest clusters to merge at each step can be put into the adaptive Monte-Carlo optimization framework. At each step of the algorithm, $\mathcal{I}$ is the set of all pairs of remaining clusters at that step, and $f(i)$ is the average distance between points in the $i$th pair of clusters. Between two clusters $C_1$ and $C_2$:

 $f(i) = \frac{1}{|C_1||C_2|} \sum_{x \in C_1, y \in C_2} \rho(x, y) = \frac{1}{|C_1||C_2|} \sum_{x \in C_1, y \in C_2} \sum_{j} \rho_j(x_j, y_j),$

assuming that the distance measure is additive. Each sample for $f(i)$ can be obtained by randomly sampling over both points and coordinates. This application is an example of a dynamic algorithm in which each step is an adaptive Monte-Carlo optimization, but the estimates are carried forward from one step to the next.

• Maximum mutual information feature selection: Consider a supervised learning problem with $n$ training samples and $d$ features plus one target variable. The features and the target variable are all continuous-valued. The goal is to select the feature with the maximum mutual information with the target. We pose this as an adaptive Monte-Carlo optimization with $\mathcal{I}$ as the set of features. Here $f(i)$ is an empirical estimate of the mutual information between the $i$th feature and the target, calculated from all $n$ sample points. Because the features are continuous-valued, this estimate is not in an additive form across the sample points. Nevertheless, we can randomly sample and efficiently update estimates of $f(i)$. This example shows that even complicated non-additive $f(i)$'s can be put into the adaptive Monte-Carlo optimization framework. Moreover, unlike the previous two examples, where the set $\mathcal{I}$ represents the sample points and the computation of $f(i)$ is along the features (dimensions), here $\mathcal{I}$ represents the features and the computation of $f(i)$ is along the sample points. This shows the broad applicability of the framework.

In Sections 2-4, we pose the applications discussed above in the adaptive Monte Carlo optimization framework and derive algorithms to solve them. In Section 5 we discuss leveraging sparsity of the dataset in the adaptive Monte Carlo optimization framework. We then conclude with some discussion in Section 6.

## 2 k-nearest neighbors graph construction and k-means

### 2.1 k-nearest neighbor graph construction

$k$-nearest neighbor graph construction is the problem of constructing the $k$-nearest neighbor graph of $n$ unlabelled data points with respect to some distance metric (Hastie et al., 2009, Chapter 13). It is used as a subroutine in many unsupervised learning algorithms such as Isomap (Tenenbaum et al., 2000), Locally Linear Embedding (Roweis and Saul, 2000), Hessian Eigenmaps (Donoho and Grimes, 2003), and some algorithms in spectral clustering (Linderman et al., 2017), among others. In recent years there has been much work on the construction of $k$-nearest neighbor graphs on high-dimensional data, such as the YFCC100M dataset in Johnson et al. (2017). While dimensionality reduction is often helpful in applications like embedding graphs (Wang et al., 2016) and videos (Ramanathan et al., 2015), the dimension of the space embedded into is also in the few thousands, which makes constructing nearest neighbor graphs in high dimensions a problem of interest.

For low dimensions there are heuristics known to perform well, such as k-d trees (Bentley, 1975) and ball-trees (Omohundro, 1989). One popular algorithm is KGraph (Dong, 2014), a heuristic improving over Dong et al. (2011).

In this section we consider the $k$-nearest neighbor graph construction problem, pose it in the adaptive Monte Carlo optimization framework, and then evaluate the performance of the resulting algorithm.

Let us assume that we have an $n \times d$ data matrix corresponding to $n$ points: $x_1, \ldots, x_n$. Our goal is to find the $k$-nearest neighbor graph in $\ell_2$ distance. As a subroutine, consider finding the $k$ nearest neighbors of point $x_1$; without loss of generality the same procedure applies to every other point.

We consider each of the other $n - 1$ points as arms. As finding the $k$-nearest neighbors in $\ell_2$ distance is the same as finding the $k$-nearest neighbors in squared euclidean distance, we take $f(i)$ to be

 $f(i) = \frac{1}{d} \sum_{t=1}^{d} (x_{1,t} - x_{i,t})^2,$

and the objective is to find the $k$ points with the smallest distances to point $x_1$.

Here and in the applications to follow, we design a sequence of estimators $\hat{f}_1(i), \hat{f}_2(i), \ldots$ for $f(i)$ with increasing accuracy, such that: 1) we can use evaluations of $\hat{f}_\ell(i)$ to construct confidence intervals on $f(i)$ (which often involves making some regularity assumptions about the data), and 2) updating $\hat{f}_\ell(i)$ to $\hat{f}_{\ell+1}(i)$ is computationally cheap.

An estimator after $\ell$ pulls of arm $i$ (for squared euclidean distances) is

 $\hat{f}_\ell(i) = \frac{1}{\ell} \sum_{k=1}^{\ell} (x_{1,t_k} - x_{i,t_k})^2,$ (2)

where the coordinates $t_1, \ldots, t_\ell$ are uniformly sampled from $\{1, \ldots, d\}$ with replacement.

The update of the estimator at step $\ell + 1$ is

 $\hat{f}_{\ell+1}(i) = \frac{\ell}{\ell+1} \hat{f}_\ell(i) + \frac{1}{\ell+1} (x_{1,t_{\ell+1}} - x_{i,t_{\ell+1}})^2,$ (3)

which takes $O(1)$ time, where $t_{\ell+1}$ is sampled from $\{1, \ldots, d\}$ uniformly at random. We further assume that the estimator of Eq. (2) has $\sigma$-sub-Gaussian tails, with $\sigma$ independent of $\ell$, which allows us to construct confidence intervals. Figure 2(b) shows that such an assumption is reasonable. In practice, we estimate $\sigma$ from some random samples at the beginning of the run and update it as the algorithm progresses; details are in Appendix B. Thus we have reformulated $k$-nearest neighbors as a multi-armed bandit problem using the adaptive Monte Carlo optimization framework.
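The $O(1)$ update of Eq. (3) can be sketched as follows (an illustrative sketch; the names are ours):

```python
import random

def pull_arm(x1, xi, f_hat, ell):
    """One pull of arm i for the k-nearest-neighbor bandit: sample a
    coordinate t uniformly with replacement and fold the squared
    coordinate difference into the running estimate, as in Eq. (3)."""
    t = random.randrange(len(x1))
    sample = (x1[t] - xi[t]) ** 2
    f_hat = (ell / (ell + 1)) * f_hat + sample / (ell + 1)  # O(1) work
    return f_hat, ell + 1
```

After $\ell$ pulls, `f_hat` is exactly the estimator of Eq. (2), while each update touches only a single coordinate of the two points.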

Throughout this manuscript, we restrict ourselves to using a slightly modified version of the standard Upper Confidence Bound algorithm of Lai and Robbins (1985) to solve the bandit problem. We note that in reformulations obtained from the adaptive Monte Carlo optimization framework, there exists an integer MAX_PULLS such that the total computation required to compute $\hat{f}_\ell(i)$ (over all updates) equals the computation required to compute $f(i)$ exactly. We thus evaluate $f(i)$ exactly if arm $i$ is pulled MAX_PULLS times and set its confidence interval to zero width; this is the only modification we make to the algorithm of Lai and Robbins (1985). Given a set of arms, MAX_PULLS, the number of arms to be returned, and a procedure to update the estimators and construct confidence intervals, we present this as Algorithm 2 of Appendix A and call it UCB.
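A sketch of this modified UCB routine for a single best arm (the full Algorithm 2 in Appendix A returns $k$ arms; the names, the stopping rule, and the single-arm simplification here are ours):

```python
import math

def ucb_min(arms, pull, exact, max_pulls, sigma, delta=0.01):
    """UCB for minimization with the MAX_PULLS modification: an arm
    pulled MAX_PULLS times is evaluated exactly and its confidence
    interval collapses to zero width. `pull(i)` returns one unbiased
    sample of f(i); `exact(i)` returns f(i) exactly."""
    count = {i: 1 for i in arms}
    mean = {i: pull(i) for i in arms}
    exact_arms = set()

    def lcb(i):                      # lower confidence bound of arm i
        if i in exact_arms:
            return mean[i]           # zero-width interval
        return mean[i] - sigma * math.sqrt(
            2 * math.log(len(arms) / delta) / count[i])

    while True:
        i = min(arms, key=lcb)       # most promising arm
        if i in exact_arms:
            # its exact value is below every other arm's lower bound: done
            return i
        if count[i] >= max_pulls:
            mean[i] = exact(i)       # fall back to exact computation
            exact_arms.add(i)
        else:
            mean[i] += (pull(i) - mean[i]) / (count[i] + 1)
            count[i] += 1
```

The stopping rule is sound because the returned arm's exact value lies below every other arm's lower confidence bound, which with high probability lies below that arm's true $f$ value.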

We find the $k$-nearest neighbors using Algorithm 1, which leverages the UCB subroutine of Appendix A.

Let $\mu^{(k)}$ be the distance of the $k$-th nearest neighbor of point $x_1$, and let $\mu_i$ be the distance of point $x_i$ from point $x_1$. Define $\Delta_i = \max(\mu_i - \mu^{(k)}, 0)$. Thus $\Delta_i$ is $0$ for the $k$ nearest neighbours of point $x_1$, and for the other points it measures how much larger their distance to point $x_1$ is than $\mu^{(k)}$.

###### Theorem 2.1.

With probability at least $1 - \delta$, Algorithm 1 returns the $k$-nearest neighbors of point $x_1$, in time bounded in terms of the gaps $\Delta_i$ (see Appendix E).

###### Theorem 2.2.

We can modify Algorithm 1 to obtain an algorithm which, with probability at least $1 - \delta$, finds the $k$-nearest neighbors of point $x_1$ up to a multiplicative approximation; the precise guarantee and running time are given in Appendix E.

###### Remark 1.

The running time of the above algorithm for finding the $k$-nearest neighbors of all $n$ points is, under some natural assumptions on the distribution of the gaps $\Delta_i$, almost independent of the dimension $d$, which is better than the $O(n^2 d)$ running time of the naive algorithm. A detailed analysis is in Appendix E.

Figure 2: (a) The gain of our Algorithm 1 and that obtained by KGraph on the tiny-imagenet dataset. (b) The gain obtained by our Algorithm 1 on the sparse gene dataset of 10x Genomics. (c) Histogram of coordinate-wise distances for a randomly chosen pair of points in the dense dataset (imagenet) and the sparse dataset (10x Genomics).

#### Empirical Performance

Since the bandit-based algorithms do not compute exact distances between any two points, we measure the performance of algorithms in terms of the effective number of distance evaluations. We define the effective number of distance evaluations between two points as the fraction of coordinates involved in estimating the distance between these two points. The effective number of distance evaluations of an algorithm is the sum of the effective number of distance evaluations between every pair of points considered by the algorithm. For example, the effective number of distance evaluations to compute the distance between two vectors exactly is $1$, as all the coordinates are used.
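The metric can be made concrete with a toy accounting example (illustrative numbers only):

```python
def effective_evaluations(coords_used_per_pair, d):
    """Effective number of distance evaluations of an algorithm: for each
    pair of points it touched, the fraction of the d coordinates used to
    estimate that pair's distance, summed over all pairs."""
    return sum(c / d for c in coords_used_per_pair)

# Exact distances over 3 pairs in d = 100 dimensions cost 3.0 effective
# evaluations; a bandit run sampling 20, 100 and 10 coordinates costs 1.3.
```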

In Figure 1, we show the gain of the bandit-based algorithm over the brute-force algorithm in terms of the effective number of distance evaluations on the tiny-imagenet training data, as we vary the number of images and their dimensions. We obtain a gain of well over an order of magnitude over the naive algorithm. We also notice that the gain over the naive algorithm increases almost linearly with the dimension, indicating that this algorithm behaves in an almost dimension-free manner. We hasten to add that the gain in absolute running time in our proof-of-concept implementation (which has no systems-level optimization) was smaller, but there is scope for significant improvement with more careful cache management and parallelization.

A state-of-the-art algorithm for computing the $k$-nearest neighbor graph is KGraph (Dong, 2014). We compare our performance with KGraph on the tiny-imagenet dataset in Figure 2. KGraph and the bandit-based algorithm operate in two fundamentally different ways. While KGraph improves its effective number of distance evaluations by computing the distances between fewer pairs of points, using the fact that neighborhoods of neighboring points have large intersections (giving it good scaling with the number of points $n$), our algorithm improves its effective number of distance evaluations by computing distances between all pairs of points approximately (giving it good scaling with the dimension $d$). Combining the two approaches to improve scaling with both $n$ and $d$ is an avenue for further research.

###### Remark 2.

Another natural algorithm for computing $k$-nearest neighbors would be to first project the data into a lower-dimensional space using the Johnson-Lindenstrauss transform and then compute the $k$-nearest neighbors in the lower-dimensional space. We discuss connections to this in Appendix I.

### 2.2 k-Means

The canonical algorithm for computing $k$-means, Lloyd's algorithm (Lloyd, 1982), starts with $k$ initial centroids and then iteratively alternates between two steps: the assignment step, where each of the $n$ points is assigned to the centroid closest to it, and the update step, where the centroids are updated. In a standard implementation, every iteration of $k$-means takes $O(kd)$ time per point for the assignment step, while the update step is much cheaper and takes $O(d)$ time per point, where $d$ is the ambient dimension of the points.

We note that one can view the assignment step of $k$-means as finding the nearest neighbor of each of the $n$ points among the $k$ centroids. For each point, this can thus be posed as a nearest neighbor problem with $k$ arms. Under mild assumptions, this algorithm has a per-point running time that is almost independent of $d$; we analyse this formally in Appendix F. As shown in Figure 1, we get an improvement of 30-50x in terms of the effective number of distance evaluations compared to the naive algorithm on the tiny-imagenet dataset.

## 3 Agglomerative Algorithms for Hierarchical clustering

We next illustrate incorporating the adaptive Monte Carlo optimization framework into sequential algorithms to obtain computational savings, using hierarchical clustering as an example.

Hierarchical clustering is widely used in data analysis (Leskovec et al., Chapter 7). It is especially popular if the number of clusters is not known a priori, or if the data has hierarchical structure. Given a set of $n$ points and a distance measure, hierarchical clustering finds a binary tree with $n$ leaves to represent the data, with each leaf corresponding to a point and the distance between two points along the tree reflecting the dissimilarity between them.

A class of algorithms for hierarchical clustering are agglomerative algorithms, which combine points in a bottom-up manner. Such algorithms start off with $n$ clusters, each corresponding to a point. At every step, the algorithm merges the two "closest" clusters to form a single cluster, until there is only one cluster remaining. This naturally sets up a binary tree representation where each leaf is a cluster with one point, the root is the cluster of all points, and all intermediate vertices are clusters formed at intermediate steps. The two children of any vertex in the tree are the clusters that were merged to form that cluster.

For two clusters consisting of only one point each, the notion of closeness is unambiguously defined by the distance metric. For clusters with more than one point, one way to define the distance between clusters is average linkage: the average distance between points in the two clusters.

Each step of this algorithm computes the pair of clusters with the smallest distance between them. Each step can be posed as a bandit problem, with the arms being the pairs of clusters in the system at that step. If we use squared euclidean distance as our distance measure (other distances can be handled using the trick discussed in Appendix D), then pulling an arm corresponds to evaluating the squared difference of one coordinate of two points, each chosen at random from one of the two clusters corresponding to the arm. A more detailed analysis is presented in Appendix G.
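A single arm pull for a cluster pair can be sketched as follows (an illustrative sketch; names are ours):

```python
import random

def pull_pair_arm(data, cluster_a, cluster_b):
    """One pull of the arm for the pair (cluster_a, cluster_b): pick one
    point from each cluster and one coordinate, all uniformly at random,
    and return the squared coordinate difference. Its expectation is the
    average-linkage squared distance between the clusters, divided by d."""
    x = data[random.choice(cluster_a)]
    y = data[random.choice(cluster_b)]
    t = random.randrange(len(x))
    return (x[t] - y[t]) ** 2
```

Because the estimates for surviving cluster pairs are carried forward across merge steps, each step of the agglomerative algorithm starts from already-partial information rather than from scratch.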

Running this on the tiny-imagenet validation dataset of RGB images, we obtain a gain of more than an order of magnitude over the naive algorithm in terms of the effective number of distance computations, as shown in Figure 1.

###### Remark 3.

The bandit-based algorithm improves over the running time of the naive algorithm under some regularity assumptions discussed in Appendix G.

## 4 Maximum Mutual Information Feature Selection

In all the applications we considered so far, $f(i)$ was a simple function, and constructing confidence intervals for estimates of $f(i)$ was easy. Here we consider a case where $f(i)$ is itself a sophisticated estimator of an underlying statistical quantity, the mutual information, in the context of maximum mutual information feature selection.

A reasonably popular feature selection procedure is the Maximum Mutual Information criterion (Yang and Moody, 2000; Peng et al., 2005). In this scheme, given a set of features and a target, one picks the feature with the highest estimated mutual information with the target as the most relevant feature. Approaches to pick multiple features include Max-Relevance Min-Redundancy (Peng et al., 2005), Joint Mutual Information (Yang and Moody, 2000), and Chow-Liu trees (Chow and Liu, 1968). Here, we focus on applying the adaptive Monte Carlo optimization framework to the problem of picking the single most informative feature. The framework can also be applied to accelerate the approaches which pick multiple features on a sequential basis, such as building Chow-Liu trees.

As before, we have an $n \times d$ data matrix corresponding to $n$ points, each with $d$ features. Let us also assume we have an $n$-vector $Y$ corresponding to the target. Here we focus on picking the single most relevant feature to the target. (If both the features and the target are discrete, one can use plug-in estimators and put the problem into our framework as well.) When the features and the target are continuous, a clever non-parametric estimator of Kozachenko and Leonenko (1987) is often used to empirically estimate the mutual information.

On putting this problem into the framework (with min replaced by max), every feature is an arm. $f(i)$ is the empirical mutual information of the $i$th feature with $Y$. We next describe the estimators $\hat{f}_\ell(i)$ used in this application.

We rely on the differential entropy estimator of Kozachenko and Leonenko (1987) to obtain $\hat{f}_\ell(i)$. Let $W_1, \ldots, W_k$ be iid samples of a random variable. Define $\rho_i = \min_{j \neq i} \|W_i - W_j\|$, the distance of $W_i$ to its nearest neighbor. Then the Kozachenko-Leonenko estimator is given by

 $\hat{h}_k(\{W_i\}_1^k) = \frac{m}{k} \sum_{i=1}^{k} \log \rho_i + \log(k-1) + C_m,$

for a constant $C_m$ depending only on the dimension $m$ of the random variable. Given an iid sequence of pairs of random variables $(W_i, Z_i)$, we estimate the mutual information between the two as

 $\hat{I}_k(\{W_i\}_1^k; \{Z_i\}_1^k) = \hat{h}_k(\{W_i\}_1^k) + \hat{h}_k(\{Z_i\}_1^k) - \hat{h}_k(\{(W_i, Z_i)\}_1^k),$

and $f(i)$ and its estimators are given by:

 $f(i) = \hat{I}_n(\{X_{t,i}\}_{t=1}^{n}; \{Y_t\}_{t=1}^{n}), \qquad \hat{f}_\ell(i) = \hat{I}_\ell(\{X_{t_k,i}\}_{k=1}^{\ell}; \{Y_{t_k}\}_{k=1}^{\ell}).$

We obtain the confidence intervals using a central limit theorem derived for this estimator (Delattre and Fournier, 2017; Bickel and Breiman, 1983).
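For completeness, here is a naive $O(k^2)$ sketch of these estimators (our own illustration: the constant is instantiated in the standard Kozachenko-Leonenko form using the unit-ball volume and the Euler-Mascheroni constant, and a production implementation would update the nearest-neighbor distances incrementally, as in Appendix H):

```python
import math

EULER_GAMMA = 0.5772156649015329

def nn_distances(pts):
    """Naive O(k^2) euclidean nearest-neighbor distances."""
    return [min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
            for i, p in enumerate(pts)]

def kl_entropy(pts):
    """Kozachenko-Leonenko estimate for k iid m-dimensional samples:
    (m/k) * sum_i log rho_i + log V_m + log(k-1) + gamma."""
    k, m = len(pts), len(pts[0])
    v_m = math.pi ** (m / 2) / math.gamma(m / 2 + 1)  # unit-ball volume
    return ((m / k) * sum(math.log(r) for r in nn_distances(pts))
            + math.log(v_m) + math.log(k - 1) + EULER_GAMMA)

def kl_mutual_information(ws, zs):
    """hat-I_k = hat-h(W) + hat-h(Z) - hat-h((W, Z)); points are tuples."""
    joint = [w + z for w, z in zip(ws, zs)]
    return kl_entropy(ws) + kl_entropy(zs) - kl_entropy(joint)
```

On independent samples the estimate hovers near zero, while on strongly dependent samples it grows, which is all the bandit needs to separate promising features from the rest.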

While updating the estimators cheaply is not as trivial as in the previous applications, it can be done as explained in Appendix H. We consider the greenhouse gas sensor dataset of Lucas et al. (2015) obtained from the UCI repository (Dheeru and Karra Taniskidou, 2017), which consists of greenhouse gas (GHG) concentrations measured on a collection of sensors. We consider one of the sensors as the target and empirically find a saving of around an order of magnitude in the effective number of computations (the total number of points used in estimating the mutual information) to find the feature with maximum mutual information, as shown in Figure 1.

## 5 Improved Estimators for Sparse Datasets

In this section we illustrate that the performance of algorithms from our framework can be improved significantly by designing better estimators that exploit structure in the data using sparsity as an example.

Algorithms based on the adaptive Monte Carlo optimization framework adaptively sample a fraction of the data to obtain estimates of $f(i)$ to solve the minimization problem. Often we have to deal with sparse datasets. Ideally, we would design estimators that leverage the sparsity and sample only the non-zero entries of the data. For example, in the single-cell gene dataset of 10x Genomics (2017), only a small fraction of the entries are non-zero, even though the points live in a high-dimensional space.

Consider estimating the squared euclidean distance between sparse points $x_0$ and $x_1$ lying in $d$-dimensional space, with $n_0$ and $n_1$ non-zero coordinates respectively. Rather than sampling over all coordinates, we would like to sample coordinates which are non-zero in at least one of the two points. However, computing the set union of the non-zero coordinates of $x_0$ and $x_1$ would take $O(n_0 + n_1)$ time, neutralizing any gain obtained using our framework. We circumvent this expensive computation by using the following estimator, which does not use the set union:

 $\hat{d}_\ell(x_0, x_1) = \frac{1}{\ell} \sum_{k=1}^{\ell} \tilde{d}(x_0, x_1, k),$

where $\tilde{d}(x_0, x_1, k)$ is obtained by first sampling coordinates $t_k^{(0)}$ and $t_k^{(1)}$ uniformly from the non-zero coordinates of $x_0$ and $x_1$ respectively, and setting

 $\tilde{d}(x_0, x_1, k) = \frac{n_0}{2d} \big(x_{0,t_k^{(0)}} - x_{1,t_k^{(0)}}\big)^2 \big(1 + \mathbb{I}(x_{1,t_k^{(0)}} = 0)\big) + \frac{n_1}{2d} \big(x_{0,t_k^{(1)}} - x_{1,t_k^{(1)}}\big)^2 \big(1 + \mathbb{I}(x_{0,t_k^{(1)}} = 0)\big).$

In more detail:

1. Sample a non-zero coordinate of $x_0$ and a non-zero coordinate of $x_1$, uniformly at random. Let the coordinates obtained be $t_k^{(0)}$ and $t_k^{(1)}$ respectively.

2. Check if $t_k^{(0)}$ is a non-zero coordinate of $x_1$. Based on this, compute

 $\tilde{d}^{(0)}(x_0, x_1, k) = \begin{cases} \frac{n_0}{2d} \big(x_{0,t_k^{(0)}} - x_{1,t_k^{(0)}}\big)^2 & \text{if } x_{1,t_k^{(0)}} \neq 0, \\ \frac{n_0}{d} \, x_{0,t_k^{(0)}}^2 & \text{if } x_{1,t_k^{(0)}} = 0. \end{cases}$

$\tilde{d}^{(1)}(x_0, x_1, k)$ is computed similarly, depending on whether $t_k^{(1)}$ is a non-zero coordinate of $x_0$.

3. We thus have

 $\tilde{d}(x_0, x_1, k) = \tilde{d}^{(0)}(x_0, x_1, k) + \tilde{d}^{(1)}(x_0, x_1, k).$

We next claim that the estimator is unbiased. Let the sets of non-zero coordinates of $x_0$ and $x_1$ be $S_0$ and $S_1$, with $|S_0| = n_0$ and $|S_1| = n_1$. To see the claim, note that, as $t_k^{(0)}$ is sampled uniformly from $S_0$, every element of $S_0$ is picked with probability $1/n_0$. Thus, we have

 $\mathbb{E}[\tilde{d}^{(0)}(x_0, x_1, k)] = \frac{1}{d} \sum_{j \in S_0 - S_1} x_{0,j}^2 + \frac{1}{2d} \sum_{j \in S_0 \cap S_1} (x_{0,j} - x_{1,j})^2.$

With an analogous evaluation of $\mathbb{E}[\tilde{d}^{(1)}(x_0, x_1, k)]$, we have

 $\mathbb{E}[\tilde{d}(x_0, x_1, k)] = \frac{1}{d} \sum_{j \in S_0 - S_1} x_{0,j}^2 + \frac{1}{d} \sum_{j \in S_0 \cap S_1} (x_{0,j} - x_{1,j})^2 + \frac{1}{d} \sum_{j \in S_1 - S_0} x_{1,j}^2 \overset{(a)}{=} \frac{1}{d} \sum_{j \in S_0 \cup S_1} (x_{0,j} - x_{1,j})^2 = \frac{1}{d} \sum_{j \in [d]} (x_{0,j} - x_{1,j})^2,$

which is the squared distance between the two points, normalized by $d$. We note that $(a)$ follows from the definitions of $S_0$ and $S_1$.

We note that this estimator thus needs only two non-trivial operations: sampling from the non-zero coordinates of a sparse vector, and checking whether a coordinate is non-zero in a sparse vector. To see that these can be done in $O(1)$ time, we elaborate on how sparse vectors are usually stored in standard implementations.

A sparse vector is stored in a data structure with two objects: a vector storing the non-zero entries, and a dictionary (or unordered map) which maps a coordinate of the original vector to its position in that vector of non-zero entries. We can check the membership of a key in a dictionary, and find its value if present, in $O(1)$ amortised time. Further, we can generate a random key of the dictionary in $O(1)$ time. This gives us that $\tilde{d}(x_0, x_1, k)$ can be computed in $O(1)$ time.
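A sketch in code of this data structure and the one-sample estimator $\tilde{d}$ (our own illustration; class and method names are not from the paper's implementation):

```python
import random

class SparseVec:
    """Sparse vector: list of non-zero coordinates/values plus a dict
    mapping coordinate -> position in that list."""
    def __init__(self, entries, d):          # entries: {coordinate: value}
        self.d = d
        self.coords = list(entries)          # O(1) uniform random coordinate
        self.values = [entries[c] for c in self.coords]
        self.index = {c: k for k, c in enumerate(self.coords)}

    def get(self, c):                        # O(1) amortised lookup
        k = self.index.get(c)
        return 0.0 if k is None else self.values[k]

def d_tilde(x0, x1):
    """One unbiased sample of the (1/d-normalized) squared distance,
    combining a draw from the non-zeros of x0 and one from x1."""
    out = 0.0
    for a, b in ((x0, x1), (x1, x0)):
        t = random.choice(a.coords)          # non-zero coordinate of a
        diff = (a.get(t) - b.get(t)) ** 2
        n_a = len(a.coords)
        # weight n_a/d if the coordinate is zero in b, else n_a/(2d)
        out += (n_a / a.d) * diff if b.get(t) == 0 else (n_a / (2 * a.d)) * diff
    return out
```

Averaging many such samples recovers $\frac{1}{d}\sum_j (x_{0,j} - x_{1,j})^2$, consistent with the unbiasedness computation above.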

As shown in Figure 2, we obtain a significant gain over brute force in the effective number of distance evaluations for $k$-nearest neighbours and $k$-means on the dataset of 10x Genomics (2017). The baseline considered here takes sparsity into account. We also observe that, had we used the estimator of Section 2, we would not have improved over the brute-force algorithm.

## 6 Discussions

We show in our empirical and theoretical results that adaptive Monte-Carlo optimization is a powerful and broad framework that can provide significant computational savings in many basic machine learning problems. Some generalizations of the framework are discussed in Appendix D.

Modern datasets are often large in both sample size and dimension, and our algorithms provide gains in one of the two. In applications such as $k$-nearest neighbors, methods like KGraph (Dong, 2014), Lanczos-bisection-based methods (Chen et al., 2009), and locality-sensitive-hashing-based methods (Andoni and Indyk, 2008; Andoni et al., 2015) provide savings in the sample size, whereas our algorithm provides savings in the dimension. Integrating both approaches to obtain savings in both sample size and dimension is an open question.

## References

• 10xGenomics  10xGenomics. 1.3 Million Brain Cells from E18 Mice. 10x Genomics, 2017.
• Alon et al.  Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
• Andoni and Indyk  Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for near neighbor problem in high dimension. Communications of the ACM, 51(1), 2008.
• Andoni et al.  Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal lsh for angular distance. In Advances in Neural Information Processing Systems, pages 1225–1233, 2015.
• Bagaria et al.  Vivek Bagaria, Govinda Kamath, Vasilis Ntranos, Martin Zhang, and David Tse. Medoids in almost-linear time via multi-armed bandits. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 500–509, 2018.
• Bentley  Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
• Bickel and Breiman  Peter J Bickel and Leo Breiman. Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test. The Annals of Probability, pages 185–214, 1983.
• Chang et al.  Hyeong Soo Chang, Michael C Fu, Jiaqiao Hu, and Steven I Marcus. An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1):126–139, 2005.
• Chen et al.  Jie Chen, Haw-ren Fang, and Yousef Saad. Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. Journal of Machine Learning Research, 10(Sep):1989–2012, 2009.
• Chow and Liu  C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
• Delattre and Fournier  Sylvain Delattre and Nicolas Fournier. On the kozachenko–leonenko entropy estimator. Journal of Statistical Planning and Inference, 185:69–93, 2017.
• Dheeru and Taniskidou  Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
• Dong  Wei Dong. Kgraph, 2014.
• Dong et al.  Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586. ACM, 2011.
• Donoho and Grimes  David L Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.
• Drineas et al. [2006a] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices i: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006a.
• Drineas et al. [2006b] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices ii: Computing a low-rank approximation to a matrix. SIAM Journal on computing, 36(1):158–183, 2006b.
• Drineas et al. [2006c] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184–206, 2006c.
• Eppstein and Wang  David Eppstein and Joseph Wang. Fast approximation of centrality. Graph Algorithms and Applications 5, 5:39, 2006.
• Even-Dar et al.  Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(Jun):1079–1105, 2006.
• Gionis et al.  Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Vldb, volume 99, pages 518–529, 1999.
• Hastie et al.  Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Heidelberg, 2nd edition, 2009.
• Jamieson and Talwalkar  Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016.
• Johnson et al.  Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
• Kearns et al.  Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning, 49(2-3):193–208, 2002.
• Kocsis and Szepesvári  Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
• Kozachenko and Leonenko  LF Kozachenko and Nikolai N Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
• Lai and Robbins  Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
• Leskovec et al.  Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge university press, 2014.
• Li et al.  Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
• Linderman et al.  George C Linderman, Gal Mishne, Yuval Kluger, and Stefan Steinerberger. Randomized near neighbor graphs, giant components, and applications in data science. arXiv preprint arXiv:1711.04712, 2017.
• Lloyd  Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.
• Lucas et al.  DD Lucas, C Yver Kwok, P Cameron-Smith, H Graven, D Bergmann, TP Guilderson, R Weiss, and R Keeling. Designing optimal greenhouse gas observing networks that consider performance and cost. Geoscientific Instrumentation, Methods and Data Systems, 4(1):121, 2015.
• Moseley and Wang  Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems, pages 3097–3106, 2017.
• Murtagh and Contreras  Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97, 2012.
• Omohundro  Stephen M Omohundro. Five balltree construction algorithms. International Computer Science Institute Berkeley, 1989.
• Peng et al.  Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8):1226–1238, 2005.
• Pilanci and Wainwright  Mert Pilanci and Martin J Wainwright. Iterative hessian sketch: Fast and accurate solution approximation for constrained least-squares. The Journal of Machine Learning Research, 17(1):1842–1879, 2016.
• Pilanci and Wainwright  Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
• Ramanathan et al.  Vignesh Ramanathan, Kevin Tang, Greg Mori, and Li Fei-Fei. Learning temporal embeddings for complex video analysis. In Proceedings of the IEEE International Conference on Computer Vision, pages 4471–4479, 2015.
• Roweis and Saul  Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326, 2000.
• Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
• Silver et al.  David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
• Tenenbaum et al.  Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
• Wang et al.  Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1225–1234. ACM, 2016.
• Yang and Moody  Howard Hua Yang and John Moody. Data visualization and feature selection: New algorithms for nongaussian data. In Advances in Neural Information Processing Systems, pages 687–693, 2000.

## Appendix A The modified UCB Algorithm

Let I be the set of arms defined by the routine (say k-nearest neighbors) that calls the function UCB. The calling routine also defines the methods to pull arms, to update the estimates of their means, and to construct confidence intervals around these estimates. Further, it provides a method to evaluate the mean of an arm exactly.

Let μ_i be the true value of the mean of arm i. Let μ̂_i(t) be the estimate of the mean of arm i at the t-th iteration of the algorithm. Let C_i(t) be the width of the (1−δ)-confidence interval for arm i maintained at the t-th iteration of the algorithm.

The UCB algorithm defined below is essentially the UCB of Lai and Robbins, with the extra condition that the mean of an arm is evaluated exactly, and the width of its confidence interval set to 0, if it is pulled more than MAX_PULLS times. k is the number of best arms returned by the algorithm.
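The modified UCB routine described above can be sketched as follows. This is a minimal illustration in the spirit of our C++ implementation, not the actual code from the repository: the `Arm` and `modified_ucb` names, the sub-Gaussian confidence-interval width (taken from Appendix E), and the simple stopping rule are our own simplifying assumptions, and the sketch returns only the single best arm.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Running statistics of one arm.
struct Arm {
    double sum = 0;    // running sum of samples
    long   pulls = 0;  // number of pulls so far
    double mu_hat() const { return pulls ? sum / pulls : 0.0; }
};

// Returns the index of the arm with smallest mean, pulling arms adaptively.
// pull(i) returns one sample of arm i; exact(i) returns its exact mean.
template <class Pull, class Exact>
int modified_ucb(int n, long max_pulls, double sigma, double delta,
                 Pull pull, Exact exact) {
    std::vector<Arm> arms(n);
    std::vector<bool> exact_done(n, false);
    auto width = [&](const Arm& a, bool done) {
        if (done) return 0.0;           // mean is known exactly
        if (a.pulls == 0) return 1e18;  // force at least one pull
        return std::sqrt(2 * sigma * sigma * std::log(2.0 / delta) / a.pulls);
    };
    while (true) {
        // pick the arm with the lowest lower confidence bound
        int best = 0;
        double best_lcb = 1e18;
        for (int i = 0; i < n; ++i) {
            double lcb = arms[i].mu_hat() - width(arms[i], exact_done[i]);
            if (lcb < best_lcb) { best_lcb = lcb; best = i; }
        }
        Arm& a = arms[best];
        // stop when that arm's upper confidence bound beats every other LCB
        double ucb = a.mu_hat() + width(a, exact_done[best]);
        bool done = true;
        for (int i = 0; i < n; ++i)
            if (i != best &&
                arms[i].mu_hat() - width(arms[i], exact_done[i]) < ucb) {
                done = false; break;
            }
        if (done) return best;
        if (a.pulls >= max_pulls && !exact_done[best]) {
            // evaluate the mean exactly and collapse the confidence interval
            a.sum = exact(best); a.pulls = 1; exact_done[best] = true;
        } else {
            a.sum += pull(best); ++a.pulls;
        }
    }
}
```

In the actual algorithm the candidate arms are kept in a priority queue rather than scanned linearly (see Appendix B.4).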

## Appendix B Details of experiments

### b.1 Datasets

For k-nearest neighbours, k-means and hierarchical clustering, we empirically evaluate the performance of MAB-based algorithms on two real-world high-dimensional datasets: tiny-imagenet [Russakovsky et al., 2015] and a single-cell RNA-Seq dataset [10xGenomics, 2017].

Tiny-imagenet consists of 100k images of size 64×64 with 3 channels, thus living in 12,288-dimensional space. We downsize these images to evaluate our method on smaller dimensions. We chose imagenet because it is one of the most popular benchmark datasets in the field of computer vision, where many applications use nearest neighbours and clustering in their pipelines.

To further exhibit the robustness of our approach across applications, our second dataset is a single-cell RNA-Seq dataset of 10xGenomics, from the field of computational biology. The scale of this dataset is large both in sample size and dimension, and additionally it is very sparse. For our empirical evaluations we randomly sub-sample points from this dataset. We note that each row lies in a roughly 28k-dimensional space, with one dimension per gene measured.

For maximum mutual information feature selection we evaluated the performance of the MAB-based algorithm on the smaller greenhouse gas sensor dataset of Lucas et al. This was the largest dataset we could find with continuous features and a continuous response. We restricted ourselves to such datasets because we wanted to exhibit the applicability of our approach with sophisticated estimators (like the Kozachenko–Leonenko estimator of mutual information). This dataset consists of readings from a set of sensors, each recorded over a series of time points. For a fixed sensor, our goal is to find the sensor (among the others) with maximum mutual information with it.

### b.2 Implementation and evaluation

We implemented the code in C++. The results and figures of the paper can be reproduced from our GitHub repository.

A schematic showing the structure of the implementation is shown in Figure 3. Figure 3: The implementation consists of three modules. The first module loads the data, cleans it, and puts it in a form in which it can be processed. The second module uses the data to design the arms corresponding to the problem at hand, and then uses a multi-armed bandit algorithm from the third module.

UCB has only one tuning parameter, the error probability δ, which we fix for all the experiments. We evaluate the accuracy of the algorithms and control it as follows:

1. k-nearest neighbors: For n points, let the true k nearest neighbors of point i be the set N_i (evaluated using the naive algorithm) and let the answer given by the bandit-based algorithm be the set N̂_i. We define the accuracy as (1/nk) Σ_{i=1}^{n} |N_i ∩ N̂_i|. For all of the k-nearest-neighbour experiments, the bandit-based algorithm attains near-perfect accuracy at our chosen δ.

2. k-means: For n points and k cluster centres, for each point i let the nearest cluster be c_i and the answer returned by the bandit-based method be ĉ_i. We define the accuracy as (1/n) |{i : c_i = ĉ_i}|. For all of the k-means experiments, the bandit-based algorithm likewise attains near-perfect accuracy.

3. Maximal mutual information: For a dataset with d features, let the feature with maximal mutual information with the label be i* and let the answer given by the MAB-based algorithm be î. We define the accuracy as the indicator that î = i*. For all of the maximal mutual information experiments, the MAB-based algorithm again attains near-perfect accuracy.

4. Hierarchical clustering: For n points, let T* be the tree graph given by the brute-force hierarchical method, let T_MAB be the tree graph given by the MAB-based algorithm, and let T_random be a random tree graph with n leaves. We measure the accuracy of T_MAB as

1 - \frac{\sum_{i=0}^{n}\sum_{j=0}^{i-1} \mathbb{E}\left[\,\lvert \mathrm{dist}^*(i,j) - \mathrm{dist}_{\mathrm{MAB}}(i,j)\rvert\,\right]}{\sum_{i=0}^{n}\sum_{j=0}^{i-1} \mathbb{E}\left[\,\lvert \mathrm{dist}^*(i,j) - \mathrm{dist}_{\mathrm{random}}(i,j)\rvert\,\right]},

where dist*(i,j), dist_MAB(i,j) and dist_random(i,j) are the distances between points i and j in the tree graphs T*, T_MAB and T_random, respectively. For all hierarchical clustering experiments, this accuracy (an adjusted random score) is near-perfect.

In the above expressions, the expectation is taken over the randomness in the sampling to obtain the estimators.

We evaluate the algorithms based on the effective number of dimensions or points used by the estimators in the algorithm. For k-nearest neighbors, k-means and hierarchical clustering, this is the effective number of distance evaluations.

The effective number of distance computations between two (non-sparse) points is the fraction of dimensions used by the algorithm to estimate the distance between them. The effective number of distance computations of an algorithm is the sum of the effective number of distance computations over all pairs of points considered by the algorithm. This is a reasonable metric as the amount of work done between distance computations is at most logarithmic (the priority-queue bookkeeping discussed below).

For the sparse case, we define the effective number of distance computations between two points as the fraction of the non-zero coordinates used by the algorithm to estimate the distance between them. The effective number of distance computations of an algorithm is again the sum over all pairs of points considered by the algorithm.

For the maximum mutual information estimation problem, we use the effective number of samples as the metric to measure performance. This is the sum of the number of samples used across all of the estimators.

### b.3 Exploration vs Exploitation

The bandit-based algorithm for nearest neighbors evaluates the true distance only for a handful of contenders. For the rest, it only computes an approximation of their distances. For a point from imagenet, Figure 4 shows that UCB achieves a good trade-off between exploration and exploitation for the nearest-neighbour problem: it evaluates the distance along all the coordinates (measures the exact distance) to the close points, whereas for far-off points it computes the distance along only very few coordinates (obtaining a coarse estimate of the distance). These observations generalise to the other three applications. Figure 4: The number of distance evaluations for the k-nearest-neighbour problem between a point and the other points in its neighborhood.

### b.4 Practice

By setting δ to large values, UCB maintains narrower confidence intervals around the mean estimates of the arms, which reduces the total number of arm pulls. Therefore, one can trade off accuracy for faster running time. Further, a deterministic running time can be achieved by fixing the total number of arm pulls of UCB. This might be desirable in certain applications.

#### Confidence intervals:

UCB relies heavily on the accuracy of its confidence intervals, whose estimation depends on the value of σ. Overestimating σ (and hence the confidence intervals) increases the running time, whereas underestimating it decreases the accuracy. Hence having a good estimate of σ is crucial for the bandit-based algorithm to perform well. We obtain one by maintaining a running estimate of the mean and the second moment of every arm. For instances, which are application- and dataset-specific, where one can obtain tighter confidence intervals, the running time will improve substantially.

Estimating confidence intervals using other techniques could potentially improve the performance.

#### Others miscellaneous observations

1. All the arms are maintained in a priority queue sorted by their lower confidence bounds. This adds a logarithmic overhead in the running time over the raw number of arm pulls.

2. In our current implementation we sample the arms with replacement for the sake of simplicity. However, we note that sampling without replacement would improve both the accuracy and the running time of UCB.

3. We have not taken cache management into consideration in our implementation. Thus the gains in sample complexity are not directly reflected in the gains in running time.
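The priority-queue bookkeeping of observation 1 can be sketched with `std::priority_queue`. The `ArmState` layout and the `next_arm` helper are our own illustrative assumptions:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// An arm as stored in the queue: its id and its current lower confidence
// bound (mean estimate minus confidence-interval width).
struct ArmState {
    int id;
    double lcb;
};
struct ByLCB {
    bool operator()(const ArmState& a, const ArmState& b) const {
        return a.lcb > b.lcb;  // smallest LCB on top (min-heap)
    }
};
using ArmQueue = std::priority_queue<ArmState, std::vector<ArmState>, ByLCB>;

// Pop the arm with the smallest LCB; the caller pulls it, updates its
// LCB from the new estimate, and pushes it back, all in O(log n).
int next_arm(ArmQueue& q) {
    ArmState top = q.top();
    q.pop();
    return top.id;
}
```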

## Appendix C More examples in adaptive Monte Carlo optimization framework

#### The medoid-finding problem of Bagaria et al.:

The problem they considered was finding medoids. Given a set of n points in R^d, the medoid is defined as the point with the smallest average distance to all other points in the ensemble (with respect to some distance). For simplicity of exposition we present this only in the context of the ℓ1 distance; we discuss how other distances fall in our framework in Appendix D. The objective is

\min_{i\in[n]} f(i) = \min_{i\in[n]} \frac{1}{n}\sum_{j=1}^{n}\sum_{k=1}^{d} \lvert x_{i,k}-x_{j,k}\rvert

Bagaria et al. assume that, for any point i, the empirical distribution of its distances to all other points is σ-sub-Gaussian (where σ is an absolute constant), which allows them to use the estimator

\hat{f}_\ell(i) = \frac{1}{\ell}\sum_{j=1}^{\ell}\sum_{k=1}^{d} \lvert x_{i,k}-x_{t_j,k}\rvert,

where t_j is sampled uniformly from [n] for j ∈ [ℓ]. The sub-Gaussianity assumption gives that this estimator is (σ/√ℓ)-sub-Gaussian. They pose the problem as a multi-armed bandit problem with each point being an arm, and pulling an arm corresponding to updating the estimator, which can be done in constant time per distance evaluation. They then run the UCB algorithm of Appendix A and report that this gives considerable computational savings.
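One arm pull of this medoid bandit can be sketched as below, with the ℓ1 distance as in the display above. The `MedoidArm` structure and function names are ours, not those of Bagaria et al., and the sampling of the reference point t is left to the caller:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Running statistics of arm i: the average of the distances seen so far.
struct MedoidArm {
    double sum = 0;
    long pulls = 0;
    double estimate() const { return sum / pulls; }  // \hat f_l(i)
};

// l1 distance between two points.
double l1(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (size_t k = 0; k < a.size(); ++k) s += std::fabs(a[k] - b[k]);
    return s;
}

// One "pull" of arm i: fold the distance to the sampled point t into the
// running average.
void pull(MedoidArm& arm, const std::vector<std::vector<double>>& x,
          size_t i, size_t t) {
    arm.sum += l1(x[i], x[t]);
    ++arm.pulls;
}
```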

#### Monte-Carlo tree search of Kocsis and Szepesvári:

Here the authors aim to carry out planning to decide the best possible policy at a state, with respect to a finite horizon. Classical Monte-Carlo planning algorithms sample actions in each state, run the simulations for some fixed time, and pick the action with the best rewards. The authors propose treating each action as an arm and running a bandit at each state to decide on the best action.

#### Hyper-parameter tuning of Jamieson and Talwalkar and Li et al.:

We note that the hyperparameter tuning example of Jamieson and Talwalkar and Li et al. can also be put into this framework, with a twist. As discussed before, the authors are trying to find the hyperparameters with minimum validation error for a deep neural network. Each configuration of hyperparameters is an arm, and pulling an arm corresponds to running training for an epoch and measuring the change in validation error. However, unlike our other examples, it is not easy to come up with confidence intervals around the estimate, which makes the UCB algorithm untenable. A successive-elimination-based algorithm like that of Even-Dar et al. can provide computational savings here, and Li et al. present an algorithm similar in flavour.

#### Non-convex optimization:

We note that non-convex optimization algorithms, especially alternating-minimization-based algorithms where one starts from different initial points and picks the run with the best objective, can in general be put in such a framework, and one can use successive elimination to obtain computational savings. This falls into the framework of [Jamieson and Talwalkar, 2016, Li et al., 2016].

#### Others:

Quickhull is a divide-and-conquer method to compute the convex hull of n points. The divide step of this algorithm first finds the farthest point on each side of a hyperplane; one could improve the running time of this step by posing it in the framework of adaptive Monte Carlo optimization. Graph centrality measures such as betweenness centrality and closeness centrality can also be modeled in the adaptive Monte Carlo framework.

## Appendix D An extension: When we have a nice function of a good estimator

Consider a smooth function g: R → R. We can generalise the adaptive Monte Carlo optimization framework of Eq. (1) to solve

\min_{i\in\mathcal{I}} g(f(i))

Let \hat{f}_\ell(i) be an estimator of f(i). We use the delta method to approximate

\sqrt{n}\left(g(f(i)) - g(\hat{f}_\ell(i))\right) \approx \sqrt{n}\, g'(\hat{f}_\ell(i))\left(\hat{f}_\ell(i) - f(i)\right) \quad (4)

which gives us that,

\sqrt{n}\left(g(f(i)) - g(\hat{f}_\ell(i))\right) \to \mathcal{N}\!\left(0,\; n\, g'(\hat{f}_\ell(i))^2 \operatorname{Var}(\hat{f}_\ell(i))\right) \quad (5)

This allows us to construct confidence intervals around g(f(i)) using estimators of f(i), and thus to extend the objective to additive functions composed with smooth functions.

If g' is assumed to be uniformly continuous on the closure of the range of f, we have that |g'(\hat{f}_\ell(i))| is bounded by some known constant (independent of i). We further assume that |g''| is bounded by some known constant (independent of i).

For this objective, we define the sequence of estimators g(\hat{f}_\ell(i)), whose confidence intervals can be obtained using equation (5). We then use this in the adaptive Monte Carlo optimization framework. We note that the estimators g(\hat{f}_\ell(i)) are not necessarily unbiased and can have a bias of order g'' \operatorname{Var}(\hat{f}_\ell(i)). An example here would be to compute medoids under a distance like the ℓ2 distance.
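As a worked instance of the delta method above (our own example, not one spelled out in the text), take g(x) = √x, which turns an average of squared coordinate differences into an ℓ2-type distance. With C(ℓ) the confidence-interval width of the underlying estimator, (5) gives an approximate interval for g(f(i)):

```latex
g(x) = \sqrt{x}, \qquad g'(x) = \frac{1}{2\sqrt{x}}
\quad\Longrightarrow\quad
g(f(i)) \in \left[\sqrt{\hat{f}_\ell(i)} - \frac{C(\ell)}{2\sqrt{\hat{f}_\ell(i)}},\;
\sqrt{\hat{f}_\ell(i)} + \frac{C(\ell)}{2\sqrt{\hat{f}_\ell(i)}}\right]
\quad \text{w.p. } \approx 1-\delta.
```

Note that the interval width blows up as \hat{f}_\ell(i) \to 0, which is where the boundedness assumptions on g' above become relevant.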

## Appendix E Details on k-nearest neighbors

To recap the setting: to pose k-nearest-neighbour graph construction in our framework, for each point we run a multi-armed bandit in which each of the remaining points is an arm.

We begin by noting that we measure the performance of algorithms in the effective number of distance evaluations. We define the effective number of distance evaluations between two points as the fraction of coordinates involved in estimating the distance between them. The effective number of distance evaluations of an algorithm is the sum of the effective number of distance evaluations over every pair of points in the algorithm. We define the number of coordinate-wise distance evaluations of an algorithm to be d times the effective number of distance evaluations of the algorithm.

Without loss of generality, we assume that we are finding the k nearest neighbors of point 1. In this case we have n−1 arms, one corresponding to each of the points 2, …, n. Further, the mean of each arm can be evaluated exactly in a maximum of d pulls.

Let f(i) denote the average per-coordinate squared distance between point 1 and point i.

The expectation of each pull of arm i is thus f(i). The estimator after ℓ pulls of arm i is given by

\hat{f}_\ell(i) = \frac{1}{\ell}\sum_{j=1}^{\ell}\left(x_{1,t_j}-x_{i,t_j}\right)^2, \quad (6)

where the t_j are sampled from [d] with replacement, for j ∈ [ℓ]. The update of the estimator at step ℓ+1 is

\hat{f}_{\ell+1}(i) = \frac{1}{\ell+1}\left(\ell\,\hat{f}_\ell(i) + \left(x_{1,t_{\ell+1}}-x_{i,t_{\ell+1}}\right)^2\right). \quad (7)

We assume that the estimators \hat{f}_\ell(i) are σ-sub-Gaussian, for some σ independent of i. In practice, we estimate σ from a few initial samples and update it after every pull. We construct the confidence intervals of f(i) as

C(\ell) = \begin{cases} \sqrt{\dfrac{2\sigma^2\log\frac{2}{\delta}}{\ell}} & \text{if } \ell \le d \\[2mm] 0 & \text{if } \ell > d, \end{cases} \quad (8)

f(i) \in \left[\hat{f}_\ell(i) - C(\ell),\; \hat{f}_\ell(i) + C(\ell)\right], \quad \text{w.p. } 1-\delta. \quad (9)
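Equations (6)–(8) can be sketched as a small arm structure. The `KnnArm` name and the treatment of σ and δ as known constants are our simplifying assumptions; a pull folds one sampled squared coordinate difference into the running mean as in (7), and the confidence-interval width follows (8):

```cpp
#include <cassert>
#include <cmath>

// One arm of the k-nearest-neighbour bandit for point 1.
struct KnnArm {
    double mean = 0;  // \hat f_l(i)
    long   ell = 0;   // number of pulls so far
    // Pull on sampled coordinate t: fold (x_{1,t} - x_{i,t})^2 into the
    // running mean, as in eq. (7).
    void pull(double x1_t, double xi_t) {
        double s = (x1_t - xi_t) * (x1_t - xi_t);
        mean = (ell * mean + s) / (ell + 1);
        ++ell;
    }
    // Confidence-interval width of eq. (8); requires ell >= 1.
    double ci_width(double sigma, double delta, long d) const {
        if (ell > d) return 0.0;  // the mean has been evaluated exactly
        return std::sqrt(2 * sigma * sigma * std::log(2.0 / delta) / ell);
    }
};
```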

We assume that the estimate of σ is the true value. Further, we set δ = 2/(n³d). Then we have that,

###### Lemma E.1.

With probability at least 1 − 2/n², each one of the true values lies in its confidence interval throughout the run of the algorithm.

Let i*_k be the k-th nearest neighbor of point 1, and define

\Delta_i = \max\left(0,\, f(i) - f(i^*_k)\right)

Using this lemma, we now state the following theorem,

###### Theorem E.2 (Restating Theorem 2.1).

With probability at least 1 − 2/n², Algorithm 1 returns the k nearest neighbors of point 1 with at most

M \le \sum_{i=2}^{n}\left(\left(\frac{24\sigma^2}{\Delta_i^2}\log(nd)\right) \wedge 2d\right) \quad (10)

distance evaluations. This takes

O\!\left(\sum_{i=2}^{n}\log(n)\left(\left(\frac{\sigma^2\log(nd)}{\Delta_i^2}\right) \wedge d\right)\right) \quad (11)

time.

###### Proof.

Let T_i(t) be the number of times arm i has been pulled up to iteration t.

Let i*_1, …, i*_k be the k nearest neighbors of point 1, in order. For simplicity of notation, let us define

\hat{\mu}_i(t) = \hat{f}_{T_i(t)}(i), \qquad C_i(t) = C(T_i(t)), \qquad \Delta_i^{(w)} = \max\left(0,\, f(i) - f(i^*_w)\right)

In words, \hat{\mu}_i(t) is simply the estimate of the mean of arm i at the t-th iteration of the algorithm, and C_i(t) is the width of the confidence interval of arm i at iteration t of the algorithm. For any point i which is not among the k nearest neighbors, \Delta_i^{(w)} is the difference between the distance of point i from point 1 and that of the w-th nearest neighbour. Also, clearly from this definition, \Delta_i^{(1)} \ge \Delta_i^{(2)} \ge \cdots \ge \Delta_i^{(k)}. Further, \Delta_i^{(k)} equals the \Delta_i defined above.

Let us first analyse the run of the algorithm until it finds the first nearest neighbour of point 1.

We observe that if we choose to update arm i at time t, then we have

\hat{\mu}_i(t) - C_i(t) \le \hat{\mu}_{i^*_1}(t) - C_{i^*_1}(t).

For this to occur, at least one of the following three events must occur:

E_1 = \left\{\hat{\mu}_{i^*_1}(t) \ge \mu_{i^*_1} + C_{i^*_1}(t)\right\}, \quad E_2 = \left\{\hat{\mu}_i(t) \le \mu_i - C_i(t)\right\}, \quad E_3 = \left\{\Delta_i^{(1)} = \mu_i - \mu_{i^*_1} \le 2C_i(t)\right\}.

To see this, note that if none of E_1, E_2, E_3 occur, we have

\hat{\mu}_i(t) - C_i(t) \overset{(a)}{>} \mu_i - 2C_i(t) \overset{(b)}{>} \mu_{i^*_1} \overset{(c)}{>} \hat{\mu}_{i^*_1}(t) - C_{i^*_1}(t),

where (a), (b), and (c) follow because E_2, E_3, and E_1 do not hold, respectively.

We note that we compute confidence intervals at most 2d times for each point. Thus we have at most 2nd computations of confidence intervals in total.

Thus E_1 and E_2 do not occur during any iteration with probability at least 1 − 2/n², because

\text{w.p. } \left(1 - \frac{2}{n^2}\right): \quad \left|f(i) - \hat{\mu}_i(t)\right| \le C_i(t), \quad \forall\, i \in [n],\ \forall\, t. \quad (12)

This also implies that, with probability at least 1 − 2/n², the algorithm does not stop pulling arm i unless the event E_3, a deterministic condition, stops occurring.

Let \zeta_i^{(1)} be the iteration of the algorithm at which it evaluates a distance to point i for the last time before declaring the 1-nearest neighbour. From the previous discussion, the algorithm stops evaluating distances to point i when the following holds:

C_i\!\left(\zeta_i^{(1)}\right) \le \frac{\Delta_i^{(1)}}{2} \implies \frac{\Delta_i^{(1)}}{2} \ge \sqrt{\frac{2\sigma^2\log(n^3 d)}{T_i\!\left(\zeta_i^{(1)}\right)}} \ \text{ or }\ C_i\!\left(\zeta_i^{(1)}\right) = 0, \implies T_i\!\left(\zeta_i^{(1)}\right) \ge \frac{8\sigma^2}{\left(\Delta_i^{(1)}\right)^2}\log(n^3 d) \ \text{ or }\ T_i\!\left(\zeta_i^{(1)}\right) \ge 2d.

We note that the above is true for i*_1 as well, as we have at most 2d distance evaluations there.

As the bandit algorithm progresses, let us similarly define \zeta_i^{(w)} to be the iteration of the algorithm at which it evaluates a distance to point i for the last time before declaring the w-th nearest neighbour. By the same computation as above, we have that

T_i\!\left(\zeta_i^{(w)}\right) \ge \frac{8\sigma^2}{\left(\Delta_i^{(w)}\right)^2}\log(n^3 d) \ \text{ or }\ T_i\!\left(\zeta_i^{(w)}\right) \ge 2d.

We further note that the total number of distance computations is

M = \sum_{i=2}^{n} T_i\!\left(\zeta_i^{(k)}\right) \le \sum_{i=2}^{n}\left(\left(\frac{8\sigma^2}{\left(\Delta_i^{(k)}\right)^2}\log(n^3 d)\right) \wedge 2d\right) \overset{(d)}{\le} \sum_{i=2}^{n}\left(\left(\frac{24\sigma^2}{\Delta_i^2}\log(nd)\right) \wedge 2d\right),

where (d) follows since \Delta_i = \Delta_i^{(k)} and \log(n^3 d) \le 3\log(nd), matching (10).