# Adaptive Estimation for Approximate k-Nearest-Neighbor Computations

Algorithms often carry out equally many computations for "easy" and "hard" problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate k-nearest-neighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We propose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal out of algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.

## 1 Introduction

A large number of algorithms in machine learning and signal processing are based on distance computations. The algorithms for solving the associated computational problems are typically designed to perform well on a set of problem instances, in a worst-case or average-case sense, but do not necessarily have optimal or close-to-optimal computational complexity on any given problem instance. As a consequence, these algorithms often have running times and guarantees that are the same for “easy” and “hard” problems.

Ideally, we would like an algorithm that adapts to any given problem instance and only carries out the computations necessary for that problem instance. In this paper, we consider an approach for speeding up algorithms and evaluating their complexity that is based on adapting to a given problem instance with random adaptive sampling techniques (Bagaria et al., 2018a, b). This approach is applicable to a variety of algorithms that are based on distance computations. Adding such adaptivity to algorithms can significantly speed up the algorithms’ running times, since computationally easy problems have accordingly smaller running times.

For concreteness, we focus on the problem of approximate $k$-nearest-neighbor ($k$-NN) computation. Specifically, given a query point $x$ and a set of $n$ points, $\mathcal{X}$, our goal is to find a subset containing the $k$ nearest neighbors of the query point. Our intuition is that an “easy” $k$-NN problem instance is one where there is a set of at least $k$ points that are close to the query point, and the other points are rather far, such that the evaluation of only a few coordinate-wise distances of the far points is sufficient to know that they are farther away than the close points. Conversely, a “hard” problem instance is one where the distances from $x$ to all other points are very similar, and thus it is difficult to find a subset of nearest neighbors.

We note that other formulations of the approximate $k$-NN problem are common as well. For example, Andoni and Indyk (2006) consider a formulation in which the returned points may be any points whose distance from the query point is within a multiplicative factor of the distance to its nearest points.

We propose an algorithm, which we call the adaptive $k$-NN algorithm, that adaptively estimates the distances and exploits the fact that for finding a set containing the $k$ nearest neighbors, it is not necessary to compute all distances exactly. In particular, for easy problem instances, coarse estimates are sufficient to identify a subset containing the $k$ nearest neighbors. In contrast to previous approaches, in particular that of Bagaria et al. (2018b), we focus on identifying a set containing the $k$ nearest neighbors, since this is computationally considerably cheaper than identifying the exact set of $k$ nearest neighbors.

We prove that the adaptive $k$-NN algorithm is near instance-optimal for a restricted class of problems in the class of randomized algorithms that are based on adaptively estimating the distances. In a nutshell, the proof strategy is as follows: guaranteeing a solution to a given computational problem based on estimated distances (say, $k$-NNs) requires sufficiently good estimates of the distances. With standard change of measure techniques (Kaufmann et al., 2016) we can derive instance-dependent lower bounds on the sample complexity required to obtain sufficiently good estimates. This sample complexity is also a lower bound on the computational complexity of the respective algorithm, since at the very least the distances have to be computed sufficiently well to be sure that a subset containing the $k$ nearest neighbors can be identified with high probability. Further, we show that this computational complexity can also be achieved by designing an approximate $k$-NN algorithm that estimates the distances adaptively in time almost linear in the sample complexity.

## 2 Related Work on k-NN

There are many highly efficient algorithms that solve versions of the (approximate) $k$-NN problem. If the dimension of the data points is low, then the k-d tree algorithm (Bentley, 1975) performs very well. This algorithm builds a balanced k-d tree and traverses the tree to find the nearest neighbors. In contrast to our approach, the algorithm is based on pre-processing the data and thus becomes efficient only when performing many queries, so that the cost of building the tree becomes negligible. In addition, k-d trees become inefficient in high dimensions.

In order to overcome complexity in high dimensions, a number of works have proposed to find solutions that are approximate in that the algorithm is only asked to return points that are close in distance to the true nearest neighbors, for example, by using random projections (Ailon and Chazelle, 2006). Perhaps the most popular class of algorithms for performing approximate nearest-neighbor search is based on locality sensitive hashing (Andoni and Indyk, 2006). This class of algorithms works very well in theory and practice, but again uses a pre-processing step that is not negligible if only one query is executed.

If many $k$-NN queries are carried out on the same dataset, then the k-d tree algorithm for small dimensions and locality sensitive hashing algorithms for higher dimensions are significantly more efficient than algorithms based on adaptively estimating scores, such as the algorithm proposed here, since then the amortized pre-processing costs become negligible. In contrast, our approach is efficient in high dimensions and when we only carry out one or very few queries.

A setting particularly relevant to our approach is that in which the dataset is rapidly changing, where the assumption of other $k$-NN algorithms that pre-processing costs become negligible over time no longer holds. One such example is the Implicit Maximum Likelihood Estimation procedure by Li and Malik (2018), where at each iteration nearest-neighbor queries must be performed against a set of samples from the new estimate of the distribution.

There are a few recent success stories where adaptive randomized algorithms significantly speed up computational problems: the Monte-Carlo tree search method for decision processes (Kocsis and Szepesvári, 2006), hyperparameter optimization in machine learning (Li et al., 2018), finding the medoid in a large collection of high-dimensional points (Bagaria et al., 2018a), and solving discrete optimization problems involving distance computations adaptively (Bagaria et al., 2018b). All four works apply standard bandit algorithms in a creative way. Most related to our work is that of Bagaria et al. (2018b), which proposes an efficient algorithm for solving the $k$-NN problem using an adaptive sampling strategy similar to the one proposed here. The main difference is that we consider the approximate $k$-NN problem, which is a more general problem that contains the problem of finding the exact $k$-NN as a special case. In order to solve the approximate $k$-NN problem, we have to solve a non-standard approximate bandit problem. In addition, we provide an algorithm that is near instance-optimal in the class of algorithms that estimate the distances for a restricted class of possible data points.

## 3 Problem Statement

Suppose we are given a set of high-dimensional points

$$\mathcal{X} = \{x_1, \ldots, x_n\} \subset \mathbb{R}^m,$$

and our goal is to find, for another given point $x$, a set of size $O(k)$ that includes the $k$ nearest neighbors of $x$ in $\mathcal{X}$ in $\ell_2$-distance (our results and discussion generalize to other distances, such as the $\ell_1$-distance). This is a generalization of the exact $k$-nearest-neighbor problem and has applications in a vast number of classification tasks (Hastie et al., 2009).

For convenience, we assume that all points are normalized such that $\|x_i\|_\infty \le 1/2$, where $\|\cdot\|_\infty$ denotes the largest absolute value of the entries of a vector, so that every coordinate-wise squared difference lies between 0 and 1. We can solve the problem by brute force, computing all $n$ distances and then sorting, which takes on the order of $nm$ operations for the distance computations plus the cost of sorting. Our intuition is that it is unnecessary to compute the distances exactly, and that by approximating the distances we can save computations.
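As a point of reference, the brute-force baseline described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: points are plain lists of floats, and the function names `squared_distance` and `brute_force_knn` are hypothetical, not taken from the authors' implementation.

```python
import heapq

def squared_distance(x, y):
    """Squared l2-distance; touches all m coordinates."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def brute_force_knn(query, points, k):
    """Return the indices of the k points closest to `query`.

    Every one of the n distances is computed exactly, so the cost is
    n * m coordinate operations regardless of how easy the instance is.
    """
    distances = [(squared_distance(query, p), i) for i, p in enumerate(points)]
    return [i for _, i in heapq.nsmallest(k, distances)]

points = [[0.0, 0.0], [0.5, 0.5], [0.1, 0.0], [0.4, -0.5]]
print(brute_force_knn([0.0, 0.1], points, 2))
```

The adaptive approach aims to avoid exactly this uniform cost by spending fewer coordinate evaluations on points that are clearly far from the query.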

## 4 The Adaptive k-NN Algorithm

The idea behind our adaptive $k$-NN algorithm is to adaptively estimate the distances. Then, for an approximation parameter $h$, the problem of finding a superset containing the $k$ nearest neighbors reduces to a multi-armed bandit problem with the goal of identifying a set of size $k+h$ containing the $k$ smallest arms. Because the sample complexity of the corresponding algorithm is, up to a logarithmic factor, equal to its computational complexity, we can provide an upper bound on the computational complexity of the algorithm by proving an upper bound on the sample complexity. Likewise, we can prove a lower bound for all algorithms that use adaptive estimates of the distances.

Our algorithm is inspired by the Hamming-LUCB algorithm of Heckel et al. (2018), which in turn builds on the Lower-Upper Confidence Bound (LUCB) strategy, a popular algorithm for identifying the top-$k$ items in a bandit problem (Kalyanakrishnan et al., 2012) and for ranking from pairwise comparisons (Heckel et al., 2019). The algorithm is based on actively identifying sets $\hat{S}_{\mathrm{close}}$ and $\hat{S}_{\mathrm{far}}$ consisting of $k$ and $n-k-h$ points, respectively, such that with high confidence the points in the first set have a smaller distance to $x$ than the points in the second set. Once we have found such sets, the set

$$\hat{S} = \{1, \ldots, n\} \setminus \hat{S}_{\mathrm{far}}$$

contains the $k$ closest points. Note that the cardinality of $\hat{S}$ is $k+h$, so to obtain a set of cardinality $O(k)$ containing the $k$ nearest neighbors, we choose $h$ on the order of $k$. We adaptively estimate the normalized squared distances

$$d_i = \frac{1}{m} \|x - x_i\|_2^2$$

by sampling indices $j_1, \ldots, j_{T_i}$ uniformly at random from all indices $\{1, \ldots, m\}$ and then estimating the squared distance as

$$\hat{d}_i(T_i) = \frac{1}{T_i} \sum_{j \in \{j_1, \ldots, j_{T_i}\}} \left| [x_i]_j - [x]_j \right|^2,$$

where $[x]_j$ denotes the $j$-th coordinate of $x$. (Alternatively, one could select the indices uniformly at random without replacement, in which case $\hat{d}_i(m) = d_i$ exactly. This could be implemented efficiently using ciphers, such as those described by Black and Rogaway (2002).) The key idea is to estimate the distances only well enough to be able to obtain the two sets $\hat{S}_{\mathrm{close}}$ and $\hat{S}_{\mathrm{far}}$ from them.
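The coordinate-sampling estimator above can be illustrated with a short Python sketch. It assumes sampling with replacement and points stored as lists of floats; the names `exact_distance` and `estimate_distance` are hypothetical and chosen only for this example.

```python
import random

def exact_distance(x, xi):
    """Normalized squared distance (1/m) * ||x - xi||_2^2, using all m coordinates."""
    return sum((a - b) ** 2 for a, b in zip(xi, x)) / len(x)

def estimate_distance(x, xi, num_samples, rng):
    """Estimate the normalized squared distance from `num_samples` coordinates
    drawn uniformly at random with replacement."""
    m = len(x)
    total = 0.0
    for _ in range(num_samples):
        j = rng.randrange(m)              # uniformly random coordinate index
        total += (xi[j] - x[j]) ** 2      # coordinate-wise squared difference
    return total / num_samples

rng = random.Random(0)
m = 10_000
x = [rng.uniform(-0.5, 0.5) for _ in range(m)]
xi = [rng.uniform(-0.5, 0.5) for _ in range(m)]
print(exact_distance(x, xi), estimate_distance(x, xi, 500, rng))
```

Each sampled term is an unbiased estimate of the normalized squared distance, so a coarse estimate from a few hundred coordinates is often enough to tell a far point from a close one.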

Let $T_i$ be the counter of the number of samples used for estimating the respective distance. We define a confidence bound based on a non-asymptotic version of the law of the iterated logarithm (Kaufmann et al., 2016; Jamieson et al., 2014); it is of the form

$$\alpha(u) \propto \sqrt{\frac{\log(\log(u)\, n / \delta)}{u}},$$

where $u$ is an integer corresponding to the number of samples, and $\delta$ is a parameter such that the algorithm succeeds with probability at least $1-\delta$ (the constants involved can be chosen explicitly). Within each round, we also let $(\cdot)$ denote a permutation of $\{1, \ldots, n\}$ such that $\hat{d}_{(1)} \le \hat{d}_{(2)} \le \ldots \le \hat{d}_{(n)}$. We then define the indices

$$q_1 = \arg\max_{i \in \{(1), \ldots, (k)\}} \hat{d}_i + \alpha_i, \qquad (1)$$

$$q_2 = \arg\min_{i \in \{(k+1+h), \ldots, (n)\}} \hat{d}_i - \alpha_i, \qquad (2)$$

where $\alpha_i = \alpha(T_i)$. These indices are the analogues of the standard indices of the Lower-Upper Confidence Bound (LUCB) strategy from the bandit literature (Kalyanakrishnan et al., 2012) for the bottom and top arms. The LUCB strategy for exact bottom-$k$ recovery would update the scores of $q_1$ and $q_2$ (for $h = 0$) at each round. Our strategy will go after what it “thinks” are the bottom $k$ items, $\hat{S}_{\mathrm{close}} = \{(1), \ldots, (k)\}$, and what it “thinks” are the top $n-k-h$ items, $\hat{S}_{\mathrm{far}} = \{(k+1+h), \ldots, (n)\}$. Moreover, the algorithm keeps all the other items, $\hat{S}_{\mathrm{mid}} = \{(k+1), \ldots, (k+h)\}$, in consideration for inclusion in these sets by keeping their confidence intervals below the confidence intervals of the selected items (see (3) in the algorithm below). This is crucial to ensure that the algorithm does not get stuck trying to distinguish the middle items, which in general requires many samples.
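To make the selection rule concrete, the following Python sketch runs a simplified version of the loop. It is not the authors' algorithm: the constant `c` in the confidence radius, the guard `max(u, 3)`, the rule for sampling a middle item (whichever of the selected item and the middle items has the widest interval, in the spirit of Hamming-LUCB), and the stopping test (upper bound of $q_1$ at most the lower bound of $q_2$) are all assumptions made for illustration. It also re-sorts in every round, so it does not have logarithmic per-iteration cost.

```python
import math
import random

def radius(u, n, delta, c=0.5):
    """Confidence radius of the form sqrt(log(log(u) * n / delta) / u).
    The constant c and the guard max(u, 3) are assumptions for this sketch."""
    return c * math.sqrt(math.log(math.log(max(u, 3)) * n / delta) / u)

def adaptive_knn_sketch(x, points, k, h, delta=0.1, max_rounds=200_000, seed=0):
    """Return (indices of a size k+h candidate set, coordinate samples used)."""
    rng = random.Random(seed)
    n, m = len(points), len(x)
    assert n > k + h, "the far set must be non-empty"
    sums = [0.0] * n      # running sums of sampled squared differences
    counts = [0] * n      # number of samples per point (the counters T_i)

    def sample(i):
        j = rng.randrange(m)
        sums[i] += (points[i][j] - x[j]) ** 2
        counts[i] += 1

    def estimate(i):
        return sums[i] / counts[i]

    def alpha(i):
        return radius(counts[i], n, delta)

    for i in range(n):    # one initial sample per point
        sample(i)

    for _ in range(max_rounds):
        order = sorted(range(n), key=estimate)   # O(n log n); see lead-in
        close, mid, far = order[:k], order[k:k + h], order[k + h:]
        q1 = max(close, key=lambda i: estimate(i) + alpha(i))
        q2 = min(far, key=lambda i: estimate(i) - alpha(i))
        if estimate(q1) + alpha(q1) <= estimate(q2) - alpha(q2):
            break                                # assumed stopping test
        for q in (q1, q2):
            # Assumed middle rule: sample the widest interval among q and mid.
            sample(max([q] + mid, key=alpha))

    order = sorted(range(n), key=estimate)
    return sorted(order[:k + h]), sum(counts)

data_rng = random.Random(1)
m = 2000
query = [0.0] * m
near = [[data_rng.uniform(-0.05, 0.05) for _ in range(m)] for _ in range(3)]
far_points = [[data_rng.choice((-0.5, 0.5)) for _ in range(m)] for _ in range(17)]
subset, used = adaptive_knn_sketch(query, near + far_points, k=3, h=2)
print(subset, used, "samples versus", 20 * m, "for brute force")
```

On this deliberately easy instance the three near points are separated from the rest by a large gap, so the loop stops after sampling only a small fraction of the coordinates.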

### 4.1 Logarithmic Computational Complexity for Each Iteration

We next describe several implementation details that are critical for ensuring that each iteration of the adaptive $k$-NN algorithm has computational complexity $O(\log n)$ and not $O(n \log n)$, which a naïve implementation that computes the permutation via sorting and the indices via max and min computations would have. The key to achieving a logarithmic computational complexity is realizing that since only two distance estimates are updated in each iteration, the orderings of the distance estimates and confidence bounds will not change much between iterations, and at each iteration we only update the distance estimate for some indices that minimize or maximize some quantities relating to the confidence bound. This makes the algorithm amenable to the use of a heap data structure (Cormen et al., 2009) to reduce computational complexity.

Figure 1 illustrates how a total of seven heaps can be used to implement the adaptive $k$-NN algorithm efficiently. For each of the three sets $\hat{S}_{\mathrm{close}}$, $\hat{S}_{\mathrm{mid}}$, and $\hat{S}_{\mathrm{far}}$ (the $k$ items with the smallest distance estimates, the next $h$ items, and the remaining items), a set of two or three coupled heaps is maintained, providing ordering information on both the distance estimates (so that we can maintain our permutation at each iteration) and the confidence bounds (so that we can select which distance estimates to update). For example, to determine $q_2$, we can simply extract the minimum from the min-heap on $\hat{S}_{\mathrm{far}}$ defined on the lower confidence bounds $\hat{d}_i - \alpha_i$, which has a computational cost of at most $O(\log n)$. Later in the iteration, if we update the distance estimate at $q_2$, we update both the distance estimate min-heap and the lower confidence bound min-heap on $\hat{S}_{\mathrm{far}}$ accordingly, which has a computational cost of $O(\log n)$. At the end of the iteration, after making such updates across all three sets of heaps, the distance estimate ordering may not be maintained; e.g., the largest distance estimate in $\hat{S}_{\mathrm{close}}$ may be larger than the smallest distance estimate in $\hat{S}_{\mathrm{mid}}$. Items from each set must be swapped with items from other sets accordingly to restore the distance estimate ordering. Only two distance estimates are updated at each iteration, so at most a constant number of swaps that does not depend on $n$ are required, and each swap involves updating the appropriate heaps, so the computational cost of restoring the distance estimate ordering is also $O(\log n)$. Thus, the overall complexity per iteration is $O(\log n)$. We ask the reader to refer to our implementation for further details.
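As a rough illustration of why a heap avoids re-sorting, the sketch below keeps a priority queue over lower confidence bounds and supports key updates through lazy invalidation. This is a swapped-in, simpler technique than the coupled heaps described above, and the class name `LazyMinHeap` and its methods are hypothetical. Each update pushes one entry, so its cost is logarithmic in the number of entries pushed so far rather than strictly in $n$.

```python
import heapq

class LazyMinHeap:
    """Min-heap over (key, item) pairs with lazy invalidation of stale keys."""

    def __init__(self):
        self._heap = []       # may contain outdated entries
        self._current = {}    # item -> most recent key

    def update(self, item, key):
        """Insert an item or change its key."""
        self._current[item] = key
        heapq.heappush(self._heap, (key, item))

    def remove(self, item):
        """Drop an item, e.g. when it is swapped into another set."""
        self._current.pop(item, None)

    def peek_min(self):
        """Return (item, key) with the smallest current key."""
        while self._heap:
            key, item = self._heap[0]
            if self._current.get(item) == key:
                return item, key
            heapq.heappop(self._heap)   # discard a stale entry
        raise IndexError("peek_min on an empty heap")

lower_bounds = LazyMinHeap()
for index, bound in enumerate([0.30, 0.10, 0.25]):
    lower_bounds.update(index, bound)
print(lower_bounds.peek_min())   # item 1 currently has the smallest lower bound
lower_bounds.update(1, 0.40)     # its estimate was refined and the bound moved up
print(lower_bounds.peek_min())   # item 2 now has the smallest lower bound
```

The same pattern, applied to the distance estimates and confidence bounds of each set, gives the minimum or maximum needed in a round without sorting all $n$ items.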

### 4.2 Guarantees and Optimality of the Adaptive k-NN Algorithm

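We next establish guarantees on the adaptive $k$-NN algorithm’s success as well as on its computational complexity. With the distances indexed in increasing order, $d_1 \le \ldots \le d_n$, the computational complexity depends on the gaps between the distances, defined as $\Delta_{i,j} = d_j - d_i$, through the function

$$N(x, \mathcal{X}, h) = \tilde{O}\Bigg( \sum_{i=1}^{k} \min\big(\Delta_{i,k+1+h}^{-2}, m\big) + \sum_{i=k+1+h}^{n} \min\big(\Delta_{k,i}^{-2}, m\big) + h \min\big(\Delta_{k,k+1+h}^{-2}, m\big) \Bigg). \qquad (5)$$

The notation $\tilde{O}$ absorbs factors logarithmic in $n/\delta$ and doubly logarithmic in the gaps.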
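To get a feel for the quantity inside the $\tilde{O}$ in (5), the following Python sketch evaluates the sum of gap terms for a given list of normalized distances, ignoring all constants and logarithmic factors. The function name `complexity_proxy` is hypothetical, and treating a zero gap as costing the full $m$ samples is a convention of this sketch.

```python
def complexity_proxy(distances, k, h, m):
    """Sum of min(gap^-2, m) terms from (5), without constants or log factors."""
    d = sorted(distances)          # d[0] <= d[1] <= ... (0-indexed)
    n = len(d)

    def term(gap):
        # A zero gap cannot be resolved by sampling, so it costs all m coordinates.
        return m if gap <= 0 else min(gap ** -2, m)

    # d_{k+1+h} is d[k + h] and d_k is d[k - 1] in 0-indexed form.
    near = sum(term(d[k + h] - d[i]) for i in range(k))
    far = sum(term(d[i] - d[k - 1]) for i in range(k + h, n))
    middle = h * term(d[k + h] - d[k - 1])
    return near + far + middle

m = 1000
easy = complexity_proxy([0.01] * 3 + [0.5] * 17, k=3, h=2, m=m)
hard = complexity_proxy([0.25] * 20, k=3, h=2, m=m)
print(easy, hard, 20 * m)
```

For the well-separated instance the proxy is a few dozen samples, while for the instance with all distances equal every one of the $n$ terms saturates at $m$, matching the cost of brute force.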

###### Theorem 1.

For any query point $x$ and set of points $\mathcal{X}$, the adaptive $k$-NN algorithm run with parameters $\delta$ and $h$ yields a set of size $k+h$ that contains the $k$ nearest neighbors and has computational complexity at most

$$N(x, \mathcal{X}, h)$$

with probability at least $1-\delta$.

Note that the computational complexity of the adaptive $k$-NN algorithm is upper-bounded, up to logarithmic factors, by the complexity of the naïve brute force method, since each of the $n$ terms in (5) is at most $m$, but it is potentially significantly smaller. In particular, the computational complexity is small if there is a large gap between the $k$-th closest point and the $(k+1+h)$-th closest point. Taking $h = k$, for example, the algorithm returns a set of size $2k$ containing the $k$ nearest neighbors. We next present two examples, one where the sample complexity of the adaptive $k$-NN algorithm is small, and one where it does not realize savings over the brute force method.

Both of these examples assume the data lies in a low-dimensional linear subspace, a regime where one might consider using projection-based $k$-NN methods. However, such methods require knowledge of the dimension of the subspace and often incur a non-negligible projection cost. Even if the dimension is known and the projection is as efficient as possible (such as subsampling), our method will still have the advantage that it adapts to the distances.

First, consider a $p$-dimensional subspace spanned by the columns of a matrix $U \in \mathbb{R}^{m \times p}$ that has orthogonal columns. In addition, suppose that the columns of $U$ are incoherent with respect to the standard basis vectors $e_j$. Specifically, suppose that the maximum inner product between a normalized column $u_i$ of $U$ and a standard basis vector obeys

$$\max_{i,j} \left| \left\langle \frac{u_i}{\|u_i\|_2}, e_j \right\rangle \right| \le \frac{B}{\sqrt{m}}$$

for some constant $B$. We say $U$ is incoherent if this bound holds for small values of $B$. Suppose then that the columns of $U$ are normalized to $\ell_2$-norm $\sqrt{m}/(B\sqrt{p})$ and that the dataset and query point lie in that subspace, i.e.,

$$x = U y, \qquad \mathcal{X} = \{U y_i : y_i \in \mathcal{Y}\},$$

where $y$ and $\mathcal{Y}$ are the associated coefficient vectors. Assume that the associated coefficient vectors are normalized to have $\ell_2$-norm equal to $1/2$. Denote the gaps in the coefficient space by $\Delta'_{i,j} = \|y - y_j\|_2^2 - \|y - y_i\|_2^2$. From these assumptions, we are guaranteed that $\|x\|_\infty \le 1/2$ and the points in $\mathcal{X}$ are bounded in $\ell_\infty$-norm by $1/2$. In addition, we have that

$$\frac{1}{m} \|x - x_i\|_2^2 = \frac{1}{p B^2} \|y - y_i\|_2^2.$$

Then $\Delta_{i,j} = \Delta'_{i,j} / (p B^2)$, so for large $m$ each term $\min(\Delta_{i,j}^{-2}, m)$ in (5) equals $p^2 B^4 (\Delta'_{i,j})^{-2}$ and no longer depends on $m$; i.e., the computational complexity does not scale linearly with $m$. Hence, we can expect the adaptive $k$-NN algorithm to achieve significant computational savings when the data lies in a low-dimensional subspace of $\mathbb{R}^m$. Furthermore, the algorithm is able to realize these savings without having this subspace or its dimension specified. We illustrate this ability in our experiments below.

Next, suppose that the subspace is coherent with respect to the identity matrix. For example, consider the extreme case where all points lie in the one-dimensional subspace spanned by a single standard basis vector $e_j$. Then, estimating the distances to any useful accuracy requires on the order of $m$ samples, since only one coordinate carries information about the distances, and thus the adaptive $k$-NN algorithm will always have the same sample complexity as the naïve brute force algorithm.

We next show that the algorithm is optimal among active algorithms that estimate the distances by sampling indices when the data points satisfy $x, x_i \in \{-1/2, 1/2\}^m$. We note that it is only the coordinates of the data that are so constrained; the normalized distances themselves can be essentially any values between 0 and 1 for large enough $m$.

###### Theorem 2.

For any , let denote an algorithm that can only interact with the data by sampling coordinates uniformly at random and yields, for any and , the nearest neighbors with probability at least . Then, when is run on any set of data points , such that each coordinate of each point is either or , it has expected sample complexity at least

$$N_{\mathrm{low}}(x, \mathcal{X}, h) = c' \left( \sum_{i=1}^{k-h} \Delta_{i,k+1+h}^{-2} + \sum_{i=k+1+h}^{n} \Delta_{k-h,i}^{-2} \right),$$

where .

Note that the above lower bound does not depend on the gaps involving the items . However, in the case when , we can relate the lower bound and the upper bound by

$$N(x, \mathcal{X}, 2h) \le \tilde{O}\big(N_{\mathrm{low}}(x, \mathcal{X}, h)\big),$$

so that, up to rescaling of the approximation parameter and logarithmic factors, the upper and lower bounds match. We emphasize that the lower bound only applies to algorithms that interact with the data by uniformly sampling the distances and only when we constrain the data points. Thus, Theorem 2 only tells us that the adaptive $k$-NN algorithm is optimal among algorithms based on adaptively estimating the distances, but it does not state that algorithms based on other strategies cannot perform better.

## 5 Experiments

We run the adaptive $k$-NN algorithm both on artificial data restricted to low-dimensional subspaces and on real data to demonstrate its effectiveness in reducing computation over the naïve algorithm. The artificial data is generated via , where is an i.i.d. Gaussian matrix, normalized to have unit-norm columns, are drawn uniformly at random from the unit sphere, and is the largest scalar such that for all in the generated dataset. The real data comes from the Tiny ImageNet dataset (2015), taking pixel values in . For each trial, we select by sampling points at random (without replacement for Tiny ImageNet) and then select the query point by drawing another sample. For these experiments we used and , where the choice for comes from the dimensionality of the Tiny ImageNet images, which are . Further, we use and vary .

Figure 2 depicts the fraction of the $k$ nearest neighbors contained in $\hat{S}$ (recall) along with the ratio of the number of iterations taken by the adaptive $k$-NN algorithm versus the naïve algorithm as we vary . Here , , and . We perform 20 random trials at each and plot the median value (lines) and interquartile range (shaded area) over the trials. As we expect from our previous discussion, for low-dimensional subspaces (e.g., ) we see significant computational savings (multiple orders of magnitude) by using the adaptive $k$-NN algorithm over the naïve method while still being able to return a set containing all of the nearest neighbors. For larger , such as , this advantage is nearly non-existent. On Tiny ImageNet, we see similar performance gains to the small setting, which can be explained by the fact that, like most real datasets, the data can be well-approximated as lying in a low-dimensional subspace.

## 6 Proof of Theorem 1

The proof of Theorem 1 relies on relating the sample complexity to the computational complexity. We use that the computational complexity of the adaptive $k$-NN algorithm is no more than $O(\log n)$ times the sample complexity of the adaptive $k$-NN algorithm. To see this, note that, as discussed in Section 4.1, each iteration has computational cost at most $O(\log n)$. The initialization of the heaps at the start of the algorithm can be done in $O(n)$ computations using Floyd’s algorithm, which is smaller than or equal to the sample complexity, since each of the $n$ distance estimates requires at least one sample. As a consequence, the computational complexity of the adaptive $k$-NN algorithm is no larger than $O(\log n)$ times the sample complexity.

For notational convenience, we assume throughout that the distances are ordered so that

$$d_1 \le d_2 \le \ldots \le d_n.$$

We begin by showing that the estimate $\hat{d}_i(t)$ is guaranteed to be $\alpha_i$-close to $d_i$, for all $i \in [n]$, with high probability.

###### Lemma 3 (Kaufmann et al., 2016, Lemma 7).

For any , with probability at least , the event

$$\mathcal{E}_\alpha := \left\{ \left| \hat{d}_i(t) - d_i \right| \le \alpha_i, \ \forall i \in [n], \ \forall t \ge 1 \right\} \qquad (6)$$

occurs. The statement continues to hold for any with , .

Lemma 3 is a non-asymptotic version of the law of the iterated logarithm from Kaufmann et al. (2016) and Jamieson et al. (2014). Note that the lemma uses that $\hat{d}_i$ is a scaled sum of $T_i$ independent random variables, each bounded between $0$ and $1$ and with mean $d_i$.

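We first show that, on the event $\mathcal{E}_\alpha$ defined in equation (6), the set $\hat{S}$ contains the $k$ nearest neighbors. On the event $\mathcal{E}_\alpha$, we have by the termination condition (4) (which is satisfied since the algorithm has terminated) that the items in the set $\hat{S}_{\mathrm{close}}$ (the $k$ items with the smallest distance estimates) have smaller distances than the items in the set $\hat{S}_{\mathrm{far}}$. Because there are $k$ distances that are smaller than the distances in $\hat{S}_{\mathrm{far}}$, the set $\hat{S}_{\mathrm{far}}$ cannot contain any of the $k$ nearest neighbors, i.e., $\hat{S}_{\mathrm{far}} \cap \{1, \ldots, k\} = \emptyset$. Thus $\{1, \ldots, k\} \subseteq \hat{S}$.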

We next show that on the event , the adaptive -NN algorithm terminates after the desired number of samples. Let , and define the event that item is bad as

###### Lemma 4.

If occurs and the termination condition (4) is false, then either or occurs.

Lemma 4 is proved in Section A.1 in the supplementary material. Given Lemma 4, we can complete the proof in the following way. For an item $i$, define

$$\Delta_i = \begin{cases} d_{k+1+h} - d_i, & i \in \{1, \ldots, k\} \\ d_i - d_k, & i \in \{k+1+h, \ldots, n\} \\ d_{k+1+h} - d_k, & \text{otherwise,} \end{cases}$$

and let be the smallest integer satisfying the bound . A simple calculation (see Section A.2 in the supplementary material) yields the following fact.

###### Fact 5.

On the event , if holds, then does not occur.

Let be the -th iteration of the steps in the algorithm, and let and be the two items selected in the algorithm. Note that in each iteration only those distances are estimated. By Lemma 4, we can therefore bound the total number distance estimate updates by

For inequality (i), we used Fact 5, and inequality (ii) follows because can only be true for iterations .

We conclude the proof by noting that the definition of yields the following upper bound (see Section A.3 in the supplementary material).

###### Fact 6.

For $c$ sufficiently large,

$$\tilde{T}_i \le \frac{c \log(n/\delta) \log\big(2 \log(2/\Delta_i)\big)}{\Delta_i^2}. \qquad (8)$$

Applying this inequality to the right-hand side of equation (7) above concludes the proof.

## 7 Proof of Theorem 2

Consider an algorithm that can only interact with the data by sampling an index $j$ uniformly at random and then obtaining $|[x_i]_j - [x]_j|^2$ in response. Because the points satisfy $x, x_i \in \{-1/2, 1/2\}^m$, this is equivalent to drawing from a distribution corresponding to a binary random variable that has mean $d_i$. The problem of finding a subset containing the $k$ nearest neighbors then corresponds to the problem of identifying a subset of all distributions consisting of $k+h$ distributions containing the $k$ smallest means. Here we will consider only the case where . This is an approximate version of the bottom-$k$ identification problem in the bandit literature (Kalyanakrishnan et al., 2012). Thus, to prove Theorem 2, we provide a lower bound on the sample complexity required to find a subset of $k+h$ distributions containing the $k$ smallest means, and this lower bound is also a lower bound on the computational complexity.

Towards this goal, we first introduce some notation required to state a useful lemma (Kaufmann et al., 2016, Lemma 1) from the bandit literature. Let $\nu = \{\nu_i\}_{i=1}^{n}$ be a collection of $n$ probability distributions, each supported on $\mathbb{R}$. Consider an algorithm $\mathcal{A}$ that, at times $t = 1, 2, \ldots$, selects the index $i_t \in [n]$ and receives an independent draw $X_t$ from the distribution $\nu_{i_t}$ in response. Algorithm $\mathcal{A}$ may select $i_t$ only based on past observations; that is, $i_t$ is $\mathcal{F}_{t-1}$-measurable, where $\mathcal{F}_t$ is the $\sigma$-algebra generated by $i_1, X_1, \ldots, i_t, X_t$. Algorithm $\mathcal{A}$ has a stopping rule $\xi$ that determines the termination of $\mathcal{A}$. We assume that $\xi$ is a stopping time measurable with respect to $\mathcal{F}_t$ and obeying $P[\xi < \infty] = 1$.

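Let $N_i(\xi)$ denote the total number of times index $i$ has been selected by the algorithm (until termination). For any pair of distributions $\nu_i$ and $\nu'_i$, we let $\mathrm{KL}(\nu_i, \nu'_i)$ denote their Kullback-Leibler divergence, and for any $p, q \in [0, 1]$, let $\mathrm{kl}(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}$ denote the Kullback-Leibler divergence between two binary random variables with success probabilities $p, q$.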

With this notation, the following lemma relates the cumulative number of samples to the uncertainty between the actual distribution $\nu$ and an alternative distribution $\nu'$.

###### Lemma 7 (Kaufmann et al., 2016, Lemma 1).

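Let $\nu, \nu'$ be two collections of $n$ probability distributions on $\mathbb{R}$. Then for any event $E \in \mathcal{F}_\xi$ with $P_\nu[E] \in (0, 1)$, we have

$$\sum_{i=1}^{n} E_\nu[N_i(\xi)] \, \mathrm{KL}(\nu_i, \nu'_i) \ge \mathrm{kl}\big(P_\nu[E], P_{\nu'}[E]\big). \qquad (9)$$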

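In our setting, since $\nu_i$ and $\nu'_i$ are binary distributions, $\mathrm{KL}(\nu_i, \nu'_i)$ equals the binary divergence $\mathrm{kl}$ between their means. Let us now use Lemma 7 to prove Theorem 2.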

Let be the set of distributions out of the distributions with the smallest means. Define the event

$$E := \{S_1(\nu) \subset \hat{S}\},$$

corresponding to success of the algorithm . Recalling that is the stopping rule of algorithm , we are guaranteed that . Let be a set of distinct items from the set of distributions with the largest means denoted by . We next construct an alternative distribution such that under that distribution, , or equivalently, that . Since we assume algorithm succeeds for any distribution with probability at least , we have both and . To see this, note that if succeeds under , then . As such, there can be at most elements of in , which means that does not occur.

It follows that

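$$\mathrm{kl}\big(P_\nu[E], P_{\nu'}[E]\big) \ge \mathrm{kl}(\delta, 1-\delta) = (1 - 2\delta) \log\frac{1-\delta}{\delta} \ge \log\frac{1}{2\delta}, \qquad (10)$$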

where the last inequality holds for . It remains to specify the alternative distribution . The alternative distribution is defined as

$$\nu'_i = \begin{cases} \nu_{k-h}, & i \in M \\ \nu_i, & \text{otherwise.} \end{cases}$$

To be most precise on avoiding ties, for , one should take for some and let , but we omit carrying out the associated technical details in this proof. It follows that, under the distribution , the items in the set are not among the items with the largest means which ensures that .
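The closed form and the final inequality in (10) can be checked numerically. The Python sketch below does so for a few sample values of $\delta$; these values are chosen for illustration only and are not a statement of the exact range for which the inequality is claimed.

```python
import math

def kl(p, q):
    """Binary Kullback-Leibler divergence kl(p, q) for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for delta in (0.01, 0.05, 0.10, 0.15):
    divergence = kl(delta, 1 - delta)
    closed_form = (1 - 2 * delta) * math.log((1 - delta) / delta)
    lower_bound = math.log(1 / (2 * delta))
    print(f"delta={delta:.2f}  kl={divergence:.4f}  "
          f"closed form={closed_form:.4f}  log(1/(2*delta))={lower_bound:.4f}")

# The last inequality needs delta to be small: it already fails at delta = 0.2.
print(kl(0.2, 0.8) >= math.log(1 / (2 * 0.2)))
```

The first two columns agree for every $\delta$, while the comparison with $\log\frac{1}{2\delta}$ only holds once $\delta$ is small enough, which is why a restriction on $\delta$ accompanies the inequality.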

Let be the total number of draws from the distribution . We have that

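$$\sum_{\ell \in M} \mathrm{kl}(\nu_\ell, \nu'_\ell) \, E_\nu[N_\ell] \overset{(i)}{=} \sum_{i=1}^{n} E_\nu[N_i] \, \mathrm{kl}(\nu_i, \nu'_i) \overset{(ii)}{\ge} \mathrm{kl}\big(P_\nu[E], P_{\nu'}[E]\big) \ge \log\frac{1}{2\delta}. \qquad (11)$$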

Here step (i) follows from the fact that for all by definition of the , and step (ii) follows from Lemma 7. Finally, inequality (11) follows from inequality (10).

We next upper bound the KL divergence on the left hand side of inequality (11). Using the inequality , valid for , we have that for ,

$$\mathrm{kl}(\nu_\ell, \nu'_\ell) \le \mathrm{kl}_\ell, \qquad \mathrm{kl}_\ell := \frac{(d_\ell - d_{k-h})^2}{d_{k-h}(1 - d_{k-h})}. \qquad (12)$$

Applying inequality (12) to the left hand side of inequality (11) yields

$$\sum_{\ell \in M} \mathrm{kl}_\ell \, E_\nu[N_\ell] \ge \log\frac{1}{2\delta}, \qquad (13)$$

which is valid for each subset of cardinality .
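The upper bound in (12) has the form $\mathrm{kl}(p, q) \le (p - q)^2 / (q(1 - q))$ with $p = d_\ell$ and $q = d_{k-h}$. As a sanity check, the Python sketch below evaluates both sides on a grid of probabilities; the helper names and the grid are assumptions of this example.

```python
import math

def kl(p, q):
    """Binary Kullback-Leibler divergence kl(p, q) for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def quadratic_bound(p, q):
    """Right-hand side of (12): (p - q)^2 / (q * (1 - q))."""
    return (p - q) ** 2 / (q * (1 - q))

grid = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
worst_violation = max(kl(p, q) - quadratic_bound(p, q) for p in grid for q in grid)
print(worst_violation)                  # non-positive if the bound holds on the grid

# For nearby means the divergence scales like the squared gap.
print(kl(0.30, 0.25), quadratic_bound(0.30, 0.25))
```

Because the divergence between two nearby means is at most quadratic in their difference, distinguishing them requires a number of samples that grows like the inverse squared gap, which is the source of the $\Delta^{-2}$ terms in the lower bound.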

We can therefore obtain a lower bound on by solving the minimization problem:

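$$\underset{e_\ell \ge 0}{\text{minimize}} \ \sum_{\ell \in S_2(\nu)} e_\ell \quad \text{subject to} \quad \sum_{\ell \in M} \mathrm{kl}_\ell \, e_\ell \ge \log\frac{1}{2\delta} \ \text{ for all } M \subseteq S_2(\nu) \text{ of cardinality } h+1. \qquad (14)$$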

Since the are increasing in (recall that we assume the distances to be ordered such that ) the solution to this optimization problem is and for .

Using an analogous line of arguments for items in the set (see Section B in the supplemental material), we arrive at the lower bound

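$$\log\frac{1}{2\delta} \sum_{i=1}^{k-h} \frac{d_{k+1+h}(1 - d_{k+1+h})}{(d_i - d_{k+1+h})^2} + \log\frac{1}{2\delta} \sum_{i=k+1+h}^{n} \frac{d_{k-h}(1 - d_{k-h})}{(d_{k-h} - d_i)^2}$$

on the number of samples. This concludes the proof.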

#### Acknowledgements

We thank the anonymous reviewers for their constructive feedback. DL and RB are partially supported by NSF grants IIS-17-30574 and IIS-18-38177, AFOSR grant FA9550-18-1-0478, ARO grant W911NF-15-1-0316, ONR grants N00014-17-1-2551 and N00014-18-12571, DARPA grant G001534-7500, and DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047. RH is partially supported by NSF award IIS-1816986.

## References

• tin (2015) Tiny imagenet visual recognition challenge. Accessed: 2019-02-21.
• Ailon and Chazelle (2006) N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Thirty-eighth Annual ACM Symposium on Theory of Computing, pages 557–563, 2006.
• Andoni and Indyk (2006) A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 459–468. IEEE, 2006.
• Bagaria et al. (2018a) V. Bagaria, G. Kamath, V. Ntranos, M. Zhang, and D. Tse. Medoids in almost-linear time via multi-armed bandits. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pages 500–509, 2018a.
• Bagaria et al. (2018b) V. Bagaria, G. M. Kamath, and D. N. Tse. Adaptive monte-carlo optimization. arXiv:1805.08321, 2018b.
• Bentley (1975) J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, Sept. 1975.
• Black and Rogaway (2002) J. Black and P. Rogaway. Ciphers with arbitrary finite domains. In Proceedings of the Cryptographers’ Track at the RSA Conference on Topics in Cryptology, pages 114–130. Springer, 2002.
• Cormen et al. (2009) T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
• Hastie et al. (2009) T. J. Hastie, R. J. Tibshirani, and J. J. H. Friedman. The elements of statistical learning. Springer, 2nd edition, 2009.
• Heckel et al. (2018) R. Heckel, M. Simchowitz, K. Ramchandran, and M. Wainwright. Approximate ranking from pairwise comparisons. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pages 1057–1066, 2018.
• Heckel et al. (2019) R. Heckel, N. B. Shah, K. Ramchandran, and M. J. Wainwright. Active ranking from pairwise comparisons and when parametric assumptions don’t help. Annals of Statistics, 2019. arXiv:1606.08842.
• Jamieson et al. (2014) K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil’ UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, pages 423–439, 2014.
• Kalyanakrishnan et al. (2012) S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, pages 655–662, 2012.
• Kaufmann et al. (2016) E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016.
• Kocsis and Szepesvári (2006) L. Kocsis and C. Szepesvári. Bandit based monte-carlo planning. In Proceedings of the 17th European Conference on Machine Learning, pages 282–293. Springer-Verlag, 2006.
• Li and Malik (2018) K. Li and J. Malik. Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087, 2018.
• Li et al. (2018) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.

## Appendix A Proofs of Lemmas and Facts

### A.1 Proof of Lemma 4

The proof is very similar to the proof of Lemma 2 of Heckel et al. (2018). There are several cases of and to consider. We will show each by contradiction, starting with the assumption that the termination condition is false and both and do not occur, all under the event . Let denote the complement of . It also will be useful to define the quantity

$$m_2 = \operatorname*{arg\,max}_{i \in \{k+1, \ldots, k+h\}} \alpha_i \tag{15}$$

such that .

1. When and , we have by that

$$\hat{d}_{q_1} + \alpha_{q_1} < \hat{d}_{q_1} + 3\alpha_{q_1} \le \gamma \tag{16}$$

and similarly that by . Since , we have that in both the case that and . Together, this implies that the termination condition (4) is true, which violates our assumption.

2. When and , we have first by that . Starting from here, and using the definition of , we have for all ,

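$$\begin{aligned} \gamma &\ge \hat{d}_{q_1} + \alpha_{q_1} + 2\alpha_{q_1} \\ &\ge \hat{d}_i + \alpha_i + 2\alpha_{q_1} \\ &\ge d_i + 2\alpha_{q_1} \\ &> d_i. \end{aligned} \tag{17}$$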

Now we let denote . By definition of , using , we have that for all . Then we can start from to conclude that for all ,

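$$\begin{aligned} \gamma &> \hat{d}_{q_1} + \alpha_{q_1} \\ &\overset{(i)}{>} \hat{d}_{q_2} - \alpha_{q_2} \\ &\ge \hat{d}_{q_2} - \frac{\Delta}{4} \\ &\ge \hat{d}_j - \frac{\Delta}{4} \\ &\ge d_j - \alpha_j - \frac{\Delta}{4} \\ &\ge d_j - \frac{\Delta}{2}, \end{aligned} \tag{18}$$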

where (i) comes from our assumption that the termination condition (4) is false. Combining (17) and (18) along with , we obtain that for all , which is a contradiction, since there can be at most values of that are smaller than .

3. When and , the case is similar to the previous case, except that we need to bound for in a different way. By , , and