1 Introduction
K-nearest neighbor (kNN) search refers to the problem of finding the K points closest to a given data point under a distance metric of interest. It is an important task in a wide range of applications, including similarity search in data mining [15, 19], fast kernel methods in machine learning [17, 30, 38], nonparametric density estimation [5, 29, 31] and intrinsic dimension estimation [6, 26] in statistics, as well as anomaly detection algorithms [2, 10, 37]. Numerous algorithms have been proposed for kNN search; readers are referred to [35, 46] and references therein. Our interest is kNN search in emerging applications. Two salient features of such applications are the expected scalability of the algorithms and their ability to handle data of high dimensionality. Additionally, such applications often demand more accurate kNN search. For example, robotic route planning and face-based surveillance systems require high accuracy for the robust execution of tasks. However, most existing work on kNN search [1, 4, 12, 15] has focused mainly on fast computation, with accuracy a lesser concern. These observations form the major motivation of the present work.
We propose to use random projection forests (rpForests) for kNN search, inspired by the success of Random Forests (RF), random projection trees (rpTree), as well as some previous work of the author [42, 44, 45] that uses the idea of ensembles or random projections. rpForests is an ensemble of rpTrees, a randomized version of the popular kd-tree [3, 19, 21, 33]. The idea of using ensembles to improve algorithmic performance is well established in statistics and machine learning (see, for example, [7, 8, 18, 42]), and has been the essential ingredient underlying some of the most successful machine learning algorithms [8, 18]. More recently, the winner of the well-known Netflix Challenge is an ensemble of 106 classifiers.
As rpForests uses rpTrees as its building block, it inherits several desirable properties of tree-based methods. Trees are known to be invariant to monotonic transformations of the data and are easy to interpret. Tree-based methods are typically very efficient, with a computational complexity on the order of the tree height (which is on average the logarithm of the total number of data points). Arguably more importantly, tree-based methods can be viewed as recursive space partitioning [3, 11, 43]. Thus data points living in the same tree leaf node would be “similar”. This property is frequently leveraged for approximate large-scale computation, either with the data points in the same leaf node as one large computation unit, or with one such point or a signature of them as a proxy for further computation [12, 13, 28, 39, 44].
An undesirable effect of tree-based methods is the introduction of boundaries among data points not in the same leaf node (i.e., data points in different leaf nodes are viewed as lying outside each other's locality). This is not a problem for some applications, as the overall decision boundary is only affected by tree nodes near the decision boundary (which amount to a negligible fraction of all the data points). However, it may cause errors in algorithms that are affected by the “artificial” boundary introduced at every leaf node. For example, in kNN search it has been observed [12, 28] that the best matches for data points may be missed near the boundary of the leaf nodes. This is illustrated in Figure 1, where rpTree is used for kNN search. Initially, the root node consists of all the data points. Suppose we are interested in a given data point A, and all its kNNs are marked in blue (other points are not drawn, for clarity of visualization). The split of the root node causes one of the kNNs of point A to lie in a different child node from the one containing A and all its other kNNs. During the growth of the rpTree, additional kNNs of A may be separated from A. Eventually, point A and most of its kNNs end up in the same leaf node, with a few of its kNNs landing in different leaf nodes.
Several innovative ideas have been proposed as a remedy. For example, the spill tree allows the child nodes in a kd-tree to overlap, thus effectively reducing the chance of missing the best matches. A variant, the virtual spill tree, allocates each data point to one leaf node, but an overlapping split routes a query to multiple leaf nodes and kNN search is performed on their union. The implementation of the spill tree or its variants is, however, complicated, as it involves decisions on the overlap of splits or nodes.
rpForests uses an ensemble of rpTrees to reduce the chance of misses in kNN search. For each rpTree, kNN search is routed to one leaf node, but the search for kNNs is conducted on the union of all the routed leaf nodes in the ensemble. As the individual rpTrees are grown independently, the union of such leaf nodes extends the boundary beyond that of any single leaf node, thus reducing the chance of a miss. The ensemble nature of the algorithm and the random nature of the decisions involved in the growth of an rpTree make it very easy to implement, and also make it possible to run the algorithm on multi-core or clustered computers. As our algorithm uses rpTree as a building block (modified in the choice of splitting direction) and resembles RF, we term it rpForests.
Our main contributions are as follows. First, we propose a method that has the flexibility of tree-based methods and the power of ensemble methods; the method is simple to implement, highly scalable, and readily adapts to the geometry of the data. As the method is ensemble-based, it can easily run on clustered or multi-core computers. Second, we develop a theory on the probability of neighboring points being separated by ensemble rpTrees. Such a probability explains why a tree built through random projections may be suitable for kNN search; indeed, the error decays exponentially fast as the number of trees increases. Third, our theory can be used to guide the choice of random projections: those aligning with directions along which the data stretches more are preferred. Almost all previous methods on random trees pursue the selection of the splitting point while leaving the splitting direction purely random; rpForests refines the splitting direction, our experiments suggest that this works well, and such a strategy may be applied in more general settings.
The remainder of this paper is organized as follows. In Section 2, we give a detailed description of rpForests. This is followed in Section 3 by a short theoretical analysis of the probability of a miss in kNN search by rpForests. Related work is discussed in Section 4. In Section 5, we present experimental results on a wide variety of real datasets. Finally, we conclude in Section 6.
2 An algorithmic description of rpForests
rpForests uses rpTree as its building block. The growth of a tree proceeds as follows. Starting with the entire data set as the root node of a tree, it first splits the root node into two child nodes according to a splitting rule. On each of the two child nodes, it recursively applies the same procedure until some stopping criterion is met, e.g., the node becomes too small (i.e., contains too few data points).
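The split of a node, say $W$, in rpTree will be along a randomly generated direction, $\vec{r}$. There are many ways to randomly split a node into its left and right children, $W_L$ and $W_R$. One choice is to select a point, say $c$, uniformly at random over the interval formed by the projections of all points in $W$ onto $\vec{r}$. For a point $x \in W$, its projection onto $\vec{r}$ is given by
$$P_{\vec{r}}(x) = \frac{x \cdot \vec{r}}{\vec{r} \cdot \vec{r}}\,\vec{r},$$
where $\cdot$ indicates dot product. Define the projection coefficients of points in $W$ along direction $\vec{r}$ as $W_{\vec{r}} = \{x_{\vec{r}} : x \in W\}$, where $x_{\vec{r}} = (x \cdot \vec{r})/(\vec{r} \cdot \vec{r})$. Denote the projection coefficient of the splitting point by $c_{\vec{r}}$. Then the left child is given by
$$W_L = \{x \in W : x_{\vec{r}} < c_{\vec{r}}\},$$
and the right child by the rest of the points, $W_R = W \setminus W_L$. Another popular way is to choose the median of $W_{\vec{r}}$ as the split point.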
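The split just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`dot`, `unit_direction`, `split_node`) are hypothetical, the direction is drawn with Gaussian coordinates and scaled to unit length (an assumption about how the random direction is generated), and points whose coefficient equals the split value are sent to the right child.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def unit_direction(dim, rng):
    """Random direction: Gaussian coordinates scaled to unit length (assumed scheme)."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(r, r))
    return [a / norm for a in r]


def split_node(points, rng, use_median=False):
    """Split a node along a random direction; returns (left, right, r, c)."""
    r = unit_direction(len(points[0]), rng)
    coeffs = [dot(x, r) for x in points]           # projection coefficients
    if use_median:
        c = sorted(coeffs)[len(coeffs) // 2]       # median split point
    else:
        c = rng.uniform(min(coeffs), max(coeffs))  # uniform over the projected interval
    left = [x for x, a in zip(points, coeffs) if a < c]
    right = [x for x, a in zip(points, coeffs) if a >= c]
    return left, right, r, c


rng = random.Random(0)
node = [(rng.random(), rng.random()) for _ in range(50)]
left, right, r, c = split_node(node, rng)
print(len(left), len(right))
```

Because the direction has unit length, the dot product equals the projection coefficient; for a direction of any other length the coefficients are only rescaled, which leaves the resulting split unchanged.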
One advantage of rpTree over traditional tree-based methods such as the kd-tree is its ability to adapt to the geometry of the data and readily overcome the curse of dimensionality. rpTree has been used frequently as a central data structure for fast computation; see, for example, [12, 13, 44]. kNN search by rpTree has a very low computational complexity. The growth of the tree for $n$ data points has an expected computational complexity of $O(n \log n)$, and a search involves routing a data point from the root node down to a leaf node which, on average, costs $O(\log n)$.
The rpForests algorithm for kNN search consists of three parts: an algorithm for rpTree (Algorithm 1), an algorithm for the selection of a splitting direction (Algorithm 2), and an algorithm that ensembles many instances of rpTrees to find kNNs (Algorithm 3). We start by describing Algorithm 1.
Let $D$ denote the given data set. Let $t$ denote the rpTree to be built from $D$. Let $\mathcal{W}$ denote the set of working nodes. Let $n_s$ denote a constant for the minimal number of data points in a tree node for which we will split further. Let $x_{\vec{r}}$ denote the projection coefficient of point $x$ onto line $\vec{r}$. Let $\mathcal{N}$ denote the set of neighborhoods such that each element of $\mathcal{N}$ is a set of neighboring points in $D$.
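The tree growth described above (a set of working nodes, a minimal node size for further splitting, and leaf nodes collected as neighborhoods) might be sketched as follows. This is an illustrative reading of the description, not a transcription of Algorithm 1: the name `grow_rptree`, the dict-based node layout, and the handling of degenerate splits are all assumptions, and the direction is left unnormalized because rescaling it does not change the split.

```python
import random


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def grow_rptree(data, n_s, rng):
    """Grow one rpTree; returns (root, neighborhoods).

    Nodes are dicts holding point indices. A node with fewer than n_s
    points is not split further and becomes a leaf (a neighborhood).
    """
    root = {"idx": list(range(len(data)))}
    working = [root]                      # working nodes still to be processed
    neighborhoods = []                    # leaf nodes collected so far
    while working:
        node = working.pop()
        idx = node["idx"]
        if len(idx) < n_s:
            neighborhoods.append(idx)
            continue
        r = [rng.gauss(0.0, 1.0) for _ in data[0]]   # random direction
        coeff = {i: dot(data[i], r) for i in idx}    # projections (up to scale)
        c = rng.uniform(min(coeff.values()), max(coeff.values()))
        left = [i for i in idx if coeff[i] < c]
        right = [i for i in idx if coeff[i] >= c]
        if not left or not right:         # degenerate split: keep node as a leaf
            neighborhoods.append(idx)
            continue
        node.update(r=r, c=c, left={"idx": left}, right={"idx": right})
        working += [node["left"], node["right"]]
    return root, neighborhoods


rng = random.Random(1)
data = [(rng.random(), rng.random(), rng.random()) for _ in range(300)]
root, hoods = grow_rptree(data, 20, rng)
print(len(hoods), max(len(h) for h in hoods))
```

The returned neighborhoods partition the data, so every point belongs to exactly one leaf of the tree.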
In choosing the splitting direction, a basic implementation of our algorithm would simply generate a random direction. A refinement is to generate a number, nTry, of random projections, and choose the one along which the projected data stretches the most. In statistics, the stretch or spread (dispersion) of the data is measured by its variance, so we use the standard deviation of the projected data as a measure of the data spread. As will be clear from our theory, such a choice guides the split along a direction in which the data stretches the “most”, thus avoiding the situation where a split leads to thin slices (a setting where kNN search would easily fail). This is described as Algorithm 2.
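A minimal sketch of this selection step, assuming unit-length Gaussian directions and the population standard deviation as the spread measure (the text does not say which variant of the standard deviation is used, and the ranking of directions is the same either way). The names `choose_direction` and `n_try` are hypothetical.

```python
import math
import random
import statistics


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def unit_direction(dim, rng):
    """Random direction: Gaussian coordinates scaled to unit length (assumed scheme)."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(r, r))
    return [a / norm for a in r]


def choose_direction(points, n_try, rng):
    """Among n_try random directions, keep the one with the largest projected spread."""
    best_r, best_spread = None, -1.0
    for _ in range(n_try):
        r = unit_direction(len(points[0]), rng)
        spread = statistics.pstdev([dot(x, r) for x in points])
        if spread > best_spread:
            best_r, best_spread = r, spread
    return best_r, best_spread


rng = random.Random(2)
# Data stretched along the first coordinate and thin along the second.
pts = [(rng.uniform(-10, 10), rng.uniform(-0.1, 0.1)) for _ in range(200)]
r, spread = choose_direction(pts, 50, rng)
print(r, spread)
```

On data that is long in one coordinate and thin in the other, the selected direction leans towards the long axis, which is the behavior the refinement is meant to produce.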
Next we describe Algorithm 3 for finding kNNs using the ensemble of rpTrees. Let $Q$ be the set of data points for which we wish to find their kNNs in $D$. Note that $Q$ can be the set $D$ itself.
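The ensemble search can be sketched as follows: each query is routed to one leaf per tree, the routed leaves are pooled, and the nearest points are found by brute force within the pool. This is a hedged illustration, not the authors' Algorithm 3. The names `grow`, `route`, and `knn_forest`, the `leaf_size` stopping rule, and the use of Euclidean distance are assumptions made for the example.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def grow(data, idx, leaf_size, rng):
    """Recursively grow one rpTree over the point indices in idx."""
    if len(idx) <= leaf_size:
        return {"leaf": idx}
    r = [rng.gauss(0.0, 1.0) for _ in data[0]]       # random direction
    coeffs = [dot(data[i], r) for i in idx]
    c = rng.uniform(min(coeffs), max(coeffs))        # uniform split point
    left = [i for i, a in zip(idx, coeffs) if a < c]
    right = [i for i, a in zip(idx, coeffs) if a >= c]
    if not left or not right:                        # degenerate split
        return {"leaf": idx}
    return {"r": r, "c": c,
            "left": grow(data, left, leaf_size, rng),
            "right": grow(data, right, leaf_size, rng)}


def route(tree, q):
    """Route a query from the root down to the indices stored in one leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if dot(q, tree["r"]) < tree["c"] else tree["right"]
    return tree["leaf"]


def knn_forest(data, queries, k, n_trees, leaf_size, seed=0):
    """For each query, search the union of its routed leaves over all trees."""
    rng = random.Random(seed)
    everything = list(range(len(data)))
    forest = [grow(data, everything, leaf_size, rng) for _ in range(n_trees)]
    result = []
    for q in queries:
        candidates = set()
        for tree in forest:
            candidates.update(route(tree, q))
        ranked = sorted(candidates, key=lambda i: math.dist(data[i], q))
        result.append(ranked[:k])
    return result


rng = random.Random(3)
data = [(rng.random(), rng.random()) for _ in range(200)]
print(knn_forest(data, data[:5], k=3, n_trees=10, leaf_size=20))
```

When the queries are taken from the data set itself, as in this usage, each query is returned as its own nearest neighbor at distance zero; a caller who wants only the other points can request one extra neighbor and drop the first.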
3 Theoretical analysis
As rpForests involves randomness in both the split direction and the split point in the growth of individual trees, it is desirable to know what performance guarantee rpForests delivers. Our analysis estimates the probability that two points, one of which is a kNN of the other, get separated (i.e., land in different leaf nodes) during the growth of a tree (for the basic implementation). That is when there would be a miss in kNN search on a single tree. We will show that this probability is small for any given pair of kNN points. Of course, the ensemble further reduces this probability, and indeed the probability decreases sharply as the ensemble size increases.
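Definition. Let $\Omega$ be a set of points. Define its neck size, denoted as $\nu(\Omega)$, as the following
$$\nu(\Omega) = \min_{\|\vec{r}\| = 1} \Big( \max_{x \in \Omega} x \cdot \vec{r} - \min_{x \in \Omega} x \cdot \vec{r} \Big).$$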
The above definition defines the “minor” direction of a data set, i.e., the direction along which the data points “stretch” the least, while the principal direction the most. The neck is the size of the range of data points along the minor direction. For kNN search, it is undesirable to have a small neck size during any stage of the tree growth, as that will increase the chance of separating two nearby points by a tree split (a potential miss in kNN search). Algorithm 2 aims at reducing the chance of a small neck as the node split selected by the algorithm will be along a direction that the data “stretches a lot”.
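To make the notion concrete, the neck can be approximated numerically by taking it, as the text describes, to be the range of the data along the direction in which that range is smallest. The sketch below scans random unit directions and keeps the smallest projected range, so it can only overestimate the true neck. The names `projected_range` and `approx_neck` and the random-scan strategy are assumptions made for illustration, not part of the paper.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def projected_range(points, r):
    """Size of the interval covered by the projections of the points onto r."""
    coeffs = [dot(x, r) for x in points]
    return max(coeffs) - min(coeffs)


def approx_neck(points, n_dirs, rng):
    """Smallest projected range over n_dirs random unit directions."""
    best = math.inf
    for _ in range(n_dirs):
        r = [rng.gauss(0.0, 1.0) for _ in points[0]]
        norm = math.sqrt(dot(r, r))
        r = [a / norm for a in r]
        best = min(best, projected_range(points, r))
    return best


# Corners of a 10-by-1 rectangle: the minor direction is the short side.
rectangle = [(0.0, 0.0), (10.0, 0.0), (10.0, 1.0), (0.0, 1.0)]
print(approx_neck(rectangle, 2000, random.Random(4)))
```

For the rectangle, the projected range along a unit direction (a, b) is 10|a| + |b|, which is smallest when the direction is parallel to the short side; a split along that direction would produce the thin slices the text warns about.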
Theorem 3.1. Let $\Omega$ be a set of data points with neck size $\nu$. Assume each tree in the ensemble splits at most $J$ times, and the neck of the child nodes shrinks by at most a factor of $\gamma$. Then, given any two points, $A$ and $B$, with distance $d$, the probability that they will be separated in the ensemble is at most
For a given $k$, the kNN distance decreases very quickly when the number of data points increases [6, 36]. So we can reasonably assume that the distance $d$ is small compared to other quantities such as the neck size $\nu$ in Theorem 3.1. If one can properly control the number of splits $J$, then the probability that any given two nearby points are separated into different buckets (tree leaf nodes) is small for a single tree in the ensemble, and this probability will further decrease when the size of the ensemble grows. This is feasible, as the value of $J$ only affects the size of the leaf nodes and can be adjusted.
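To see how a bound of this kind can arise, the following is an illustrative calculation and not the statement of the theorem. All symbols are assumptions introduced for this sketch: $d$ is the distance between the two points $A$ and $B$, $\nu$ is the neck size of the root node, $J$ is the maximal number of splits along a root-to-leaf path, $\gamma \in (0, 1)$ is such that the neck of a child node is at least $\gamma$ times that of its parent, and $T$ is the number of independently grown trees. The sketch further assumes that the split point is uniform over the projected range and that the neck is the smallest projected range over all directions.

```latex
% One split of a node with neck \nu_j along a unit direction \vec{r}:
% the pair is separated iff the uniform split point falls between the two projections.
\Pr(\text{separated by this split} \mid \vec{r})
  = \frac{|(A - B) \cdot \vec{r}|}{\text{projected range along } \vec{r}}
  \le \frac{d}{\nu_j}.

% Along a path, \nu_j \ge \gamma^{j} \nu, so a union bound over at most J splits gives
p := \Pr(\text{separated in one tree})
  \le \sum_{j=0}^{J-1} \frac{d}{\gamma^{j} \nu}
  = \frac{d}{\nu} \cdot \frac{\gamma^{-J} - 1}{\gamma^{-1} - 1}.

% Trees are grown independently, and a neighbor is missed only if it is
% separated from the query in every tree:
\Pr(\text{separated in all } T \text{ trees}) \le p^{T}.
```

Under these assumptions the bound is informative only when $p < 1$, which requires $d$ to be small relative to $\nu$ and $J$ to be moderate; it then decays geometrically in the number of trees, in line with the exponential decay discussed in the text.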
4 Related work
Algorithms for fast kNN search can be divided into three categories: hashing-based, graph-based, and partitioning tree-based. Hashing-based algorithms typically need to build a locality-sensitive hash function to find nearby points. The hash function routes neighboring points into the same hash bucket with higher probability than points far apart. Thus, its design is critical and determines the quality of kNN search. Graph-based methods [17, 30] construct a kNN graph over the data points, which is then used for fast kNN search. Though such methods are generally computationally efficient at query time, the index construction is typically slow. Space-partitioning trees are more popular for kNN search. For example, the kd-tree divides the data space recursively into cells along coordinate-aligned axes, and then searches for kNNs by a backtracking or priority search over the tree. Many methods have been proposed for space partitioning, such as k-means trees, cover trees, VP trees and ball trees. However, these methods typically require a long index building process for large data sets.
Recently, randomized trees have been used for kNN search, such as randomized kd-trees and random projection trees [12, 22, 39]. Randomized kd-trees grow a tree by randomly choosing a split point along coordinate-aligned axes. Random projection trees choose splitting hyperplanes sampled randomly from the unit sphere. Hyvönen et al. use sparse random projections to grow the trees and then form an ensemble. Sparse projections are generated and shared among nodes at the same level of a tree. Implementations with sparse projections are only slightly faster than dense ones, and it is not clear whether there is an adverse effect on the quality of kNN search.
rpForests is a partitioning-tree-based algorithm. It was primarily inspired by our past experience with RF and the use of rpTrees in algorithmic design [42, 44, 45]. The important ingredients of rpForests, the selection of random projections and the ensemble of rpTrees, are motivated by the theory we have developed. Our random projections are onto a real line, which is fast and easy to implement, while those in related work map one multi-dimensional space to another, which involves expensive matrix multiplication and is typically slow. rpForests can be viewed as complementing existing work on unsupervised extensions to RF, namely Cluster Forests, which aims at clustering by an ensemble of randomized feature pursuits, in the sense that it preserves the locality of nearby data points by an ensemble of rpTrees. One existing work also uses the name random projection forests, but is fundamentally different in that it implements RF by replacing the random selection of candidate features at node splits with sparse random projections. More recently, random projection ensembles have been considered for classification, where each ensemble instance is built on a random projection selected out of a block of projections for quality assurance. Also related is multi-view learning, which combines views from different sources to improve generalization [14, 41].
5 Experiments
We conduct experiments on a wide variety of real datasets. Most are taken from the UC Irvine Machine Learning Repository, with the exception of the Olivetti face dataset from the Cambridge University Computer Laboratory (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html). A summary of these datasets is given in Table 1. A particularly remarkable feature of these datasets is their wide coverage of data dimensions, ranging from 19 to about 10000. We conduct experiments on both accuracy and running time, presented in Section 5.2 and Section 5.3, respectively.
Table 1. Summary of the datasets.
|Dataset||Data dimension||Number of data points|
|Wisconsin breast cancer (WDBC)||30||569|
|CT Slice Localization||386||53500|
5.1 Evaluation metrics
Our performance evaluations are based on two metrics, the average missing rate $\bar{m}$ and the average discrepancy in the k-th nearest neighbor distance $\bar{\delta}_k$. Let the data be denoted by $X_1, \ldots, X_n$. For each point $X_i$, denote the number of its kNNs missed by the algorithm by $m_i$. Then,
$$\bar{m} = \frac{1}{nk} \sum_{i=1}^{n} m_i.$$
Clearly, for $\bar{m}$, the smaller the better.
For each point $X_i$, the distance between $X_i$ and its k-th nearest neighbor is called the kNN distance for $X_i$, denoted by $d_k(X_i)$. Thus,
$$\bar{\delta}_k = \frac{1}{n} \sum_{i=1}^{n} \big( \hat{d}_k(X_i) - d_k(X_i) \big).$$
For any point, say $X_i$, it is clear that if one misses any of its kNNs, then the estimated kNN distance, $\hat{d}_k(X_i)$, obtained by an algorithm satisfies $\hat{d}_k(X_i) \ge d_k(X_i)$. Thus, the average kNN distance will be larger as well. So for $\bar{\delta}_k$, the smaller the better.
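The two metrics can be computed as in the sketch below. It makes several assumptions that are not confirmed by the text: the missing rate is normalized by both the number of points and k, the discrepancy is the plain average of the differences between estimated and exact k-th neighbor distances, distances are Euclidean, and each neighbor list holds exactly k indices sorted by increasing distance. The function names are hypothetical.

```python
import math


def exact_knn(data, i, k):
    """Indices of the k nearest neighbors of data[i] by brute force (excluding i)."""
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: math.dist(data[i], data[j]))
    return others[:k]


def average_missing_rate(exact, approx, k):
    """Fraction of true kNNs that the approximate search failed to return."""
    missed = sum(len(set(e) - set(a)) for e, a in zip(exact, approx))
    return missed / (k * len(exact))


def average_knn_distance_discrepancy(data, exact, approx):
    """Mean gap between the estimated and the exact k-th neighbor distance."""
    total = 0.0
    for i, (e, a) in enumerate(zip(exact, approx)):
        d_true = math.dist(data[i], data[e[-1]])   # exact k-th neighbor distance
        d_hat = math.dist(data[i], data[a[-1]])    # distance reported by the search
        total += d_hat - d_true
    return total / len(exact)


data = [(0.0,), (1.0,), (3.0,), (7.0,)]
exact = [exact_knn(data, i, 1) for i in range(len(data))]
approx = [[1], [0], [3], [2]]          # the search misses the true neighbor of point 2
print(average_missing_rate(exact, approx, 1))
print(average_knn_distance_discrepancy(data, exact, approx))
```

In this toy example one of the four nearest neighbors is missed, and the reported distance for that point is 4 instead of 2, so both metrics are positive; a perfect search would give zero for both.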
5.2 Experiments on accuracy
For each dataset, we vary the number of trees in rpForests over a range of values. The results are averaged over 100 runs. We consider two values of $k$ (not all results are reported due to space limits), as these two are of the most interest in real applications. When growing the rpTrees, we fix the size of the leaf nodes (i.e., the node capacity) to be no more than 20 and 30 for the two values of $k$, respectively. Here, we normalize the average discrepancy in kNN distance for each dataset when plotting the results.
Figure 2 shows kNN search results under the two metrics. The number of random projections, nTry, is set per dataset for the 9 datasets listed in Table 1. It can be seen that the errors in kNN search by rpForests, as measured by the two metrics, decrease quickly as the number of trees increases. Indeed, the decay appears to be exponentially fast, as predicted by our theory. For most datasets we explore, the errors vanish sharply to 0 with about 20-40 trees.
We also assess the role of the number of random projections, nTry, on kNN search. Here, we report only the result on the Musk dataset. As shown in Figure 3, more trees or more projections tend to improve kNN search quality. The benefit of choosing the splitting direction from a larger pool of random projections can be explained by our theory: more random projections lead to a cut along a direction in which the data stretches more, thus reducing the chance of thin slices (a situation that would cause misses in kNN search). Similar patterns are also observed on all other datasets (not reported here due to space limits).
5.3 Experiments on running time
The running time of rpForests is evaluated and compared to the cover tree and the CR algorithm. We reuse some data from Table 1 with the addition of some larger datasets. This gives seven datasets with varying sizes and data dimensions; see Table 2. To leverage the ensemble nature of rpForests and the wide availability of multicore machines, we run rpForests on 2-core and 4-core machines, termed rpF$_2$ and rpF$_4$, respectively. For all data, rpForests consists of 40 trees, which is deemed adequate according to the experiments discussed above. As the running times vary widely across datasets, we present them on a logarithmic (base 2) scale. Figure 4 shows the running time on all datasets as a bar chart for one value of $k$. It can be seen that rpForests is competitive on most of the data when a 4-core machine is used. A similar pattern can be seen for the other value of $k$ (omitted here). The running times were obtained on a MacBook Air with a 1.7 GHz Intel Core i7 processor and 8 GB of memory.
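Table 2. Datasets used in the running time experiments.
|Dataset||Data dimension||Number of data points|
|Gas sensor array||19||4,178,504|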
For the datasets we have explored, the running time is nearly inversely proportional to the number of cores of the computing machine. We hypothesize that this is true in general; a conclusive statement requires more experiments on machines with more cores, which we leave to future work.
6 Conclusions
rpForests is an efficient kNN search algorithm that is simple to implement, highly scalable, and readily adapts to the geometry of the underlying data. The ensemble nature of rpForests makes it easy to run in parallel on multicore or clustered computers, with the running time decreasing as more cores or computers are used in the computation. rpForests has the flexibility of tree-based methods; it is easy to interpret and is invariant under (monotonic) data transformations. On a wide variety of real datasets, with data dimension ranging from a few dozen to about 10000, rpForests quickly achieves near-zero error in kNN search as the number of trees increases. We develop a theory on the fast decay of the probability that neighboring points are separated by ensemble rpTrees. Interestingly, our theory gives guidance on the selection of random projections: those aligning with directions along which the data stretches more are preferred, which is confirmed by our experiments. This is different from previous random tree-based methods, which focus on the selection of the splitting point rather than the splitting direction; this strategy is expected to be applicable in more general settings.
References
[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122, 2008.
[2] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. Lecture Notes in Computer Science, 2431:43–78, 2002.
[3] J. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[4] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.
[5] P. J. Bickel and L. Breiman. Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test. The Annals of Probability, 11(1):185–214, 1983.
[6] P. J. Bickel and D. Yan. Sparsity and the possibility of inference. Sankhya: The Indian Journal of Statistics, Series A (2008-), 70(1):1–24, 2008.
[7] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[8] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[9] T. Cannings and R. Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society, Series B, 79(4):959–1035, 2017.
[10] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. Technical Report, University of Minnesota, 2007.
[11] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In Fortieth ACM Symposium on Theory of Computing (STOC), 2008.
[12] S. Dasgupta and K. Sinha. Randomized partition trees for nearest neighbor search. Algorithmica, 72(1):237–263, 2015.
[13] A. Dhesi and P. Kar. Random projection trees revisited. In Advances in Neural Information Processing Systems (NIPS), 2010.
[14] Z. Ding, M. Shao, and Y. Fu. Robust multi-view representation: A unified perspective from multi-view learning to domain adaption. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5434–5440, 2018.
[15] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, 2011.
[16] A. Feuerverger, Y. He, and S. Khatri. Statistical significance of the Netflix challenge. Statistical Science, 27(2):202–231, 2012.
[17] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 2004.
[18] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning (ICML), 1996.
[19] J. Friedman, J. Bentley, and R. Finkel. An algorithm for finding the best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, 1977.
[20] R. Hartley. Optimised kd-trees for fast image descriptor matching. In Proceedings of IEEE CVPR, 2008.
[21] W. Hunt, W. Mark, and G. Stoll. Fast kd-tree construction with an adaptive error-bounded heuristic. In IEEE Symposium on Interactive Ray Tracing, pages 81–88, 2006.
[22] V. Hyvönen, T. Pitkänen, S. K. Tasoulis, E. Jaasaari, R. Tuomainen, L. Wang, J. Corander, and T. Roos. Fast nearest neighbor search through sparse random projections and voting. In Proceedings of the IEEE International Conference on Big Data (Big Data), 2016.
[23] M. Kleinbort, O. Salzman, and D. Halperin. Efficient high-quality motion planning by fast all-pairs r-nearest-neighbors. In IEEE International Conference on Robotics and Automation (ICRA), pages 2985–2990, 2015.
[24] D. Lee, M.-H. Yang, and S. Oh. Fast and accurate head pose estimation via random projection forests. In IEEE International Conference on Computer Vision (ICCV), 2015.
[25] B. Leibe, K. Mikolajczyk, and B. Schiele. Efficient clustering and matching for object class recognition. In British Machine Vision Conference (BMVC’06), September 2006.
[26] E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems 17, 2005.
[27] M. Lichman. UC Irvine Machine Learning Repository. http://archive.ics.uci.edu/ml, 2013.
[28] T. Liu, A. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In Neural Information Processing Systems (NIPS), volume 19, pages 825–832, 2004.
[29] D. O. Loftsgaarden and C. P. Quesenberry. A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics, 36(3):1049–1051, 1965.
[30] M. Lucinska and S. Wierzchon. Spectral clustering based on k-nearest neighbor graph. In Conference on Computer Information Systems and Industrial Management, pages 254–265, September 2012.
[31] Y. P. Mack and M. Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9:1–15, 1979.
[32] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:2227–2240, 2014.
[33] M. Otair. Approximate k-nearest neighbor based spatial clustering using k-d tree. International Journal of Database Management Systems, 5(1):97–108, 2013.
[34] C. Otto, D. Wang, and A. K. Jain. Clustering millions of faces by identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):289–303, 2018.
[35] A. N. Papadopoulos and Y. Manolopoulos. Nearest Neighbor Search: A Database Perspective. Springer, 2005.
[36] M. Penrose and J. Yukich. Laws of large numbers and nearest neighbor distances. Advances in Directional and Linear Statistics, pages 189–199, 2010.
[37] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD, pages 427–438, 2000.
[38] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[39] K. Sinha. LSH vs randomized partition trees: Which one to use for nearest neighbor search? In Proceedings of the 13th International Conference on Machine Learning and Applications, pages 41–46, 2014.
[40] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, 2002.
[41] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv:1304.5634, 2013.
[42] D. Yan, A. Chen, and M. I. Jordan. Cluster Forests. Computational Statistics and Data Analysis, 66:178–192, 2013.
[43] D. Yan and G. E. Davis. The turtleback diagram for conditional probability. The Open Journal of Statistics, 8(4):684–705, 2018.
[44] D. Yan, L. Huang, and M. I. Jordan. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD, pages 907–916, 2009.
[45] D. Yan, T. W. Randolph, J. Zou, and P. Gong. Incorporating deep features in the analysis of tissue microarray images. Statistics and Its Interface, 2018.
[46] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of ACM-SIAM SODA, pages 311–321, 1993.