1 Introduction
Stochastic optimization of a function defined on a large finite set arises in many practical problems, such as finding the most attractive design for a webpage or the node with maximum influence in a social network. There are a number of approaches to this problem. At one extreme, we can use global optimization methods such as simulated annealing, genetic algorithms, or cross-entropy methods
(Rubinstein, 1997; Robert and Casella, 1999). Although only limited theoretical performance guarantees are available, these methods perform well in a number of practical applications. At the other extreme, we can use best-arm identification algorithms from the bandit literature or hypothesis testing from the statistics community (Mannor and Tsitsiklis, 2004; Even-Dar et al., 2006; Audibert and Bubeck, 2010). Stronger sample complexity results are available for this family of algorithms: with high probability, the algorithm returns a near-optimal solution after a small number of observations, and this number typically grows polynomially with the size of the decision set. These methods, however, often perform poorly when the decision set is large. In such problems, it is important to exploit the structure of the specific problem to speed up the optimization. If appropriate features are available and the objective function can be assumed to be linear in these features, then algorithms designed for bandit linear optimization are applicable
(Auer, 2002). More generally, kernel-based or neural-network-based bandit algorithms are available for problems with nonlinear objective functions
(Srinivas et al., 2010). We are particularly interested in problems where item similarity is captured by a graph structure. In general, and with no further conditions, we cannot hope to show nontrivial sample complexity rates. We observe that in many real-world applications, the objective function is easy and hill-climbing friendly, in the sense that from many nodes there exists a monotonic path to the global minimum. We make this property explicit by defining a notion of convexity
for functions defined on graphs. Under this condition, we show that a hill-climbing procedure that uses a variant of best-arm identification as a subroutine has a small sample complexity that is independent of the size of the graph. In the presence of local minima, this greedy procedure might require many restarts and might not be efficient in practice; simulated annealing is commonly used for such problems. We therefore also define a notion of nearly convex functions that allows for the existence of shallow local minima. We show that for nearly convex functions, and using an appropriate estimation of function values, the classical simulated annealing procedure finds near-optimal nodes after a small number of function evaluations.
While the asymptotic convergence of simulated annealing has been studied extensively in the literature, there are only a few finite-time convergence rates for this important algorithm. Sreenivas Pydi et al. (2018) show finite-time convergence rates, but they consider only deterministic functions and their rates scale with the size of the graph. These results, which are obtained under very general conditions, do not fully explain the success of simulated annealing in large-scale problems; in practice, simulated annealing finds a near-optimal solution with a much smaller number of function evaluations. Our bounds are in terms of the convexity of the function, and the rates do not scale with the size of the graph. Additionally, our results hold in the noisy setting. Bouttier and Gavra (2017) show convergence rates for simulated annealing applied to noisy functions; their results, however, appear to have several gaps.
1.1 Related Work
A best-arm identification algorithm from the bandit literature can find a near-optimal node, but its time and sample complexity are linear in the number of nodes (Mannor and Tsitsiklis, 2004; Even-Dar et al., 2006; Audibert and Bubeck, 2010; Jamieson and Nowak, 2014; Kaufmann et al., 2016). Such complexity is not acceptable when the graph is very large. We are interested in designing and analyzing algorithms that find a near-optimal node with sample and computational complexity sublinear in the number of nodes.
Bandit algorithms for graph functions are studied in Valko et al. (2014). The sample complexity of these algorithms scales with the smoothness of the function, but the computational complexity scales linearly with the number of nodes. We are interested in problems where the graph is very large and we might have access to only local information. As we will see in the experiments, the SpectralBandit algorithm of Valko et al. (2014) is not applicable to problems with large graphs. Further, we study a different notion of regularity that is inspired by convexity in continuous optimization. Bandit problems with large action sets have also been studied under a number of different notions of regularity (Bubeck et al., 2009; Abbasi-Yadkori et al., 2011; Carpentier and Valko, 2015; Grill et al., 2015).
1.2 Contributions
We cannot hope to achieve a sublinear rate without further conditions on the function. We define a notion of strong convexity for functions defined on a graph. For strongly convex functions, we design an algorithm, called ExploreDescend, that finds the optimal node with a guaranteed error probability. The ExploreDescend algorithm uses a best-arm identification algorithm as a subroutine; the best-arm identification problem is that of finding the optimal action given a sampling budget. For nearly convex functions, we show that the classical simulated annealing algorithm finds the global minimum in a reasonable time. For convex functions, ExploreDescend has a lower sample complexity, but simulated annealing can handle nonconvex functions.
We study the empirical performance of ExploreDescend and simulated annealing in two applications. First, we consider the problem of content optimization in a digital marketing application. In each round, a user visits a webpage that is composed of a number of components. The learning agent chooses the contents of these components and shows the resulting page to the user. The user feedback, in the form of click/no click or conversion/no conversion, is recorded and used to update the decision-making policy. The objective is to return a page configuration with a near-optimal click-through rate, using as few user interactions as possible. In this problem, each page configuration is a node of the graph, and two nodes are connected if the corresponding page configurations differ in only one component.
Our second application is the problem of nearest neighbor search and classification. Given a set of points , a query point , and a distance function , we are interested in solving . A trivial solution is to examine all points in and return the one with the smallest distance value. The computational complexity of this method is , which is not practical in large-scale problems. An approximate nearest neighbor method returns an approximate solution in sublinear time. A class of approximate nearest neighbor methods that is particularly suitable for big-data applications is graph-based nearest neighbor search (Arya and Mount, 1993). In a graph-based search, we construct a proximity graph in the training phase and perform a hill-climbing search in the test phase. To improve the performance, we perform an additional smoothing procedure that replaces the value of a node by the average function value in a vicinity of the node. In practice, these averages are estimated by performing random walks, and hence the problem is a graph optimization problem with noisy observations. We show that the proposed graph optimization technique outperforms popular nearest neighbor search methods such as KDTree and LSH in two image classification problems. Interestingly, and compared to these methods, the computational complexity of the proposed technique appears to be less sensitive to the size of the training data. This property is particularly appealing in big-data applications.
1.3 Notations
We use to denote the set . Let be a graph with nodes and the set of edges . Let be some unknown function defined on . Let be the global minimizer of . The goal is to find a node with small loss . In this paper, we study the problem where the evaluation of is noisy: we can only observe , where is a zero-mean sub-Gaussian random variable, meaning that for any , . Let denote the set of neighbors of node . Let . For simplicity, we assume all nodes have the same degree, and we let denote the number of neighbors. We sometimes write to denote the function value . For two nodes , we use to denote the set of all paths from to in the graph. We use to denote the set of all paths starting from node . We use to denote the length of a path.
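For concreteness, the observation model can be written as follows; the particular symbols (v, f, eta, sigma) are our own choice and may differ from the original notation.

```latex
% Noisy zero-order oracle on a graph G = (V, E): querying node v_t
% returns y_t = f(v_t) + \eta_t with \eta_t zero-mean \sigma-sub-Gaussian.
\[
  y_t = f(v_t) + \eta_t,
  \qquad
  \mathbb{E}\!\left[e^{\lambda \eta_t}\right] \le e^{\lambda^2 \sigma^2 / 2}
  \quad \text{for all } \lambda \in \mathbb{R}.
\]
% The target is the global minimizer, and the loss of a returned node
% \hat{v} is its suboptimality:
\[
  v^* \in \operatorname*{arg\,min}_{v \in V} f(v),
  \qquad
  \mathrm{loss}(\hat{v}) = f(\hat{v}) - f(v^*).
\]
```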
1.4 Convexity
The general discrete optimization problem is hard for an unrestricted function and graph . As such, we study a restricted class of problems that allows for efficient algorithms. Let be the amount of improvement if we move from node to the neighbor . We say path is strongly convex if , for all . Sometimes we use to denote if the strongly convex path is clear from the context. We use to denote .

Definition 1 (Convex Functions) A function defined on a graph is (strongly) convex if from any node , there exists a (strongly) convex path to the global minimum .
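One simple instance of such a condition is a per-step improvement of at least a constant c > 0 along the path; this formalization is ours, and the constant may enter differently in the original definition (for example, with a gain that grows with the distance to the minimizer).

```latex
% A path v_0, v_1, \dots, v_k = v^* is called c-strongly convex when
% every step improves the objective by at least c > 0:
\[
  \Delta_i \;=\; f(v_i) - f(v_{i+1}) \;\ge\; c
  \qquad \text{for } i = 0, \dots, k - 1.
\]
% Summing the per-step gains telescopes, bounding the path length:
\[
  f(v_0) - f(v^*) \;=\; \sum_{i=0}^{k-1} \Delta_i \;\ge\; c\,k
  \quad\Longrightarrow\quad
  k \;\le\; \frac{f(v_0) - f(v^*)}{c}.
\]
```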
For a nearly convex function, defined next, the convexity condition need not hold at every node.

Definition 2 (Nearly Convex Functions) Let be a parameter. Let be the set of points such that . We say function is nearly convex if for any , there exists a and a such that and .
A path that satisfies the above conditions is called a low-energy path. A convex function is also a nearly convex function.
Lemma 1.
We have that .
Proof.
Let be a strongly convex path. Then
We have that
We conclude that , from which the statement follows. ∎
Concave and nearly concave functions are defined analogously. The convexity condition allows for efficient algorithms. The intuition behind our analysis is that, by strong convexity, if a node is far from the global minimum, the improvement is large. When the degree is relatively small, local search methods have a sufficiently large probability of hitting a good direction (the strongly convex path). So either we are already close to the global minimum, or we are far from it and, with constant probability, we make a large improvement.
2 Approximating the Global Minimizer
We analyze two algorithms for graph-based optimization. The first is a local search procedure in which, in each round, the learner attempts to identify a good neighbor and move there; this algorithm is analyzed under the strong convexity condition. The second is the well-known simulated annealing procedure with an exponential transition function. We provide high-probability error bounds for the greedy method, while the sample complexity of the more complicated simulated annealing procedure is analyzed in expectation.
2.1 The Local Search Algorithm and Best-Arm Identification
The greedy approach is an intuitive approach to the graph optimization problem: we start from a random node, and at each node, given a sampling budget, we explore its neighbors to find the best one. The problem we solve at each node can be viewed as a fixed-budget best-arm identification problem (Audibert and Bubeck, 2010). In the fixed-budget setting, the learner plays actions for a number of rounds in some fashion and returns an approximately optimal arm at the end. An example of an algorithm designed for this setting is the SuccessiveReject algorithm of Audibert and Bubeck (2010).
Before describing the best-arm identification algorithm, we introduce some notation. Let be an integer, be the set of arms in a bandit problem, be the mean value of arm , and be the empirical mean of arm after observations. Let be the optimal arm (breaking ties arbitrarily) and be its mean, and let be the optimality gap of arm , i.e. . Without loss of generality, assume that . Let be the budget. Define
The SuccessiveReject algorithm is a procedure with rounds, in each of which one arm is eliminated. Let be the set of remaining arms at the beginning of round . In round , each arm is selected for rounds. At the end of the round, the arm with the smallest empirical mean is removed.
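As a concrete sketch of this elimination schedule (for reward maximization; the `arms` list, the `pull` callback, and all variable names are our own, and the phase lengths follow the standard log-bar normalization of Audibert and Bubeck):

```python
import math


def successive_reject(arms, pull, budget):
    """Fixed-budget best-arm identification sketch.

    arms:   list of arm identifiers
    pull:   pull(arm) -> noisy reward sample
    budget: total number of pulls n
    Returns the single surviving arm.
    """
    K = len(arms)
    if K == 1:
        return arms[0]
    # log-bar(K) = 1/2 + sum_{i=2}^K 1/i normalizes the phase lengths
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(arms)
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    n_prev = 0
    for k in range(1, K):
        # cumulative per-arm pulls for phase k
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for a in active:
            for _ in range(max(n_k - n_prev, 0)):
                sums[a] += pull(a)
                counts[a] += 1
        n_prev = max(n_k, n_prev)
        # eliminate the empirically worst arm (smallest mean, since we
        # maximize reward; flip the comparison to minimize a loss)
        worst = min(active, key=lambda a: sums[a] / max(counts[a], 1))
        active.remove(worst)
    return active[0]
```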
Theorem 1 (Audibert and Bubeck (2010)).
The probability of error of the SuccessiveReject algorithm with a budget of is smaller than
In the ExploreDescend procedure, we use the above bandit algorithm to find the best neighbor in each round. More specifically, for a node , we solve a best-arm identification problem with action values , where is the th neighbor of node . We then move to the chosen neighbor and repeat the process until the budget is consumed. We call this approach “Explore and Descend”; it is detailed in Algorithm 1.
The algorithm depends on the subroutine . This subroutine is an implementation of the best-arm identification algorithm described earlier, with decision set and budget .
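The outer loop can be sketched as follows; here a simple uniform-allocation estimate stands in for the SuccessiveReject subroutine, the objective is minimized, and all names (`neighbors`, `pull`, the budget parameters) are our own:

```python
def explore_descend(start, neighbors, pull, steps, per_step_budget):
    """Hedged sketch of Explore-and-Descend: at each node, identify the
    best neighbor from noisy pulls and move there.

    neighbors(v): adjacent nodes of v
    pull(v):      noisy evaluation of f(v) (loss; smaller is better)
    """
    node = start
    for _ in range(steps):
        cand = list(neighbors(node)) + [node]  # staying put is allowed
        per_arm = max(per_step_budget // len(cand), 1)
        # uniform allocation + empirical means as a stand-in for a
        # best-arm identification subroutine over the neighborhood
        est = {v: sum(pull(v) for _ in range(per_arm)) / per_arm
               for v in cand}
        best = min(est, key=est.get)
        if best == node:  # no neighbor looks better: stop early
            return node
        node = best
    return node
```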
Corollary 1.
Let be a strongly convex function, let be an arbitrary starting node, and let , , be per-round budgets. Assume that is sufficiently large for the steepest-descent path from to reach . Let be the probability of error of the ExploreDescend algorithm. Then is upper bounded as follows:
(1) 
where is the th optimality gap at node .
Proof.
On the other hand, we can also apply the SuccessiveReject algorithm directly on , the set of all nodes of the graph. In this case, each node in the graph is one arm, and the graph structure is disregarded. The probability of error of SuccessiveReject, using Theorem 1 with and , is upper bounded by the following inequality:
(2) 
Using the loose upper bound , the above can be written as:
(3) 
Note that the error bounds of SuccessiveReject given in Equations 2 and 3 are vacuous when . In contrast, Equation 1 provides a meaningful error bound for ExploreDescend even in this small-budget regime, when . Additionally, we see that the error bound for ExploreDescend is independent of the size of the graph. Rather, it depends on , which in turn depends on the convexity constant . A larger means the function is steeper and fewer steps (smaller ) are required to reach the global optimum.
2.2 Nearly Convex Problems: Simulated Annealing
Given that we have access only to noisy observations, we consider the following Metropolis-Hastings algorithm with exponential weights:
To simplify the analysis, we use a fixed, time-independent inverse temperature, although in practice a time-varying inverse temperature might be preferable. We estimate each function value by samples. Next, we provide sample complexity bounds for the above procedure in expectation.
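A sketch of this procedure with a fixed inverse temperature `beta` and `samples` noisy pulls per function evaluation; the uniform proposal matches the equal-degree assumption of the notation section (for non-regular graphs, a degree correction would be needed), and all names are our own:

```python
import math
import random


def sim_annealing(start, neighbors, pull, beta, rounds, samples=5,
                  rng=random):
    """Metropolis-Hastings sketch with exponential acceptance weights
    and a fixed, time-independent inverse temperature beta."""

    def estimate(v):
        # average of `samples` noisy pulls as the function-value estimate
        return sum(pull(v) for _ in range(samples)) / samples

    node, f_node = start, estimate(start)
    best, f_best = node, f_node
    for _ in range(rounds):
        cand = rng.choice(list(neighbors(node)))  # uniform proposal
        f_cand = estimate(cand)
        # accept with probability min(1, exp(-beta * (f_cand - f_node)))
        if f_cand <= f_node or rng.random() < math.exp(-beta * (f_cand - f_node)):
            node, f_node = cand, f_cand
        if f_node < f_best:  # track the best node visited so far
            best, f_best = node, f_node
    return best
```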
Theorem 2.
For a strongly convex function , let and be the path generated by SimAnnealing from an arbitrary initial node . Given a constant and with the choice of , after rounds, we have
For a nearly convex function, let and let . With the choice of and after rounds, we have that
Proof.
Let be the closest point in on a low-energy path from and let be the distance to this point from . Let be the set of paths of length less than and starting at such that at least one node on the path is not the same as . Consider the low-energy subpath starting at :
Let be the terminating node of a path . Given that the function is nearly convex and the noise is sub-Gaussian, the probability that this low-energy path is taken by the algorithm is
Because is a low-energy path, we have . Let , and . Notice that given , and are deterministic. We write
The first term on the right side is related to the event that the state follows the low energy path to the state for rounds, and then goes to the best immediate neighbor at state . The second term is related to the event that the state is not the same as after rounds. Finally, the last term is related to the event that the state stays in for the next rounds. If , then by Definition 1.4, we already have
Otherwise if , we continue as follows:
where the last step follows from inequality . By Definition 1.4, . Thus,
where we used in the last step. Let . We have,
where the second step holds by , , and , and is defined in the theorem statement. Using the fact that and , a simple calculation shows that after rounds, we have
If , then we get that
After rounds, we have that
For a strongly convex function, following a similar argument, we have that
Thus, given and with the choice of in the theorem statement, after rounds, we have . ∎
For nearly convex problems, the error bound in this theorem is meaningful only if the parameter is small. In our experiments, this value is or .
3 Experiments
We implemented the ExploreDescend and simulated annealing algorithms and compared them with the SpectralBandit algorithm of Valko et al. (2014) and the SuccessiveReject algorithm of Audibert and Bubeck (2010). SuccessiveReject treats all nodes of the graph as one large multi-armed bandit problem. SpectralBandit uses the adjacency matrix to compute the Laplacian and its eigenvectors. Both of these algorithms therefore require global information about the graph, while our algorithms assume only local information: from a node, one can access only its neighbors.
We note that the SpectralBandit algorithm was originally designed to minimize cumulative regret, which partly explains its poor performance in a best-arm identification problem. In the fixed-budget best-arm identification setting, we run SpectralBandit until the budget is consumed and output the most frequently pulled arm, which, in our experiments, is better than taking the arm with the best empirical mean. The algorithm has a high time complexity because of its reliance on operations on matrices and vectors of dimension . As SpectralBandit does not scale well with the size of the graph due to these matrix operations, we generated a small synthetic graph in order to evaluate it.
3.1 Applications in Graph-Structured Best-Arm Identification: Synthetic Data
First, we evaluated the different algorithms on synthetic graphs, generated as follows. Each node of the graph is a point , where for some . Each node is connected to all of its eight immediate neighbors on the plane. Additionally, random edges are added so that the degree of each node is 15. The mean function value is . This mean is unknown to the algorithms: when an algorithm requests the value of , it receives a stochastic value, 1 with probability and 0 with probability . It is easy to see that this function is concave by Definition 1.4.
The results of the experiment are presented in Figure 1. The performance measure is the average suboptimality gap, i.e. , where is the solution returned by the algorithm. The average is taken over 1000 trials. Overall, our algorithms significantly outperform SpectralBandit and SuccessiveReject, both in terms of time and of suboptimality gap, especially when the budget is smaller than the graph size, which is our intended setting. For simulated annealing, we use a fixed inverse temperature . The results could be further improved by optimizing a schedule for this parameter. Interestingly, the number of pulls per function evaluation also has a significant impact on the performance, as can be seen from the plots for Sim Annealing 1 (1 pull) and Sim Annealing 5 (5 pulls) in Figure 1. As for ExploreDescend, we simply allocate the budget equally to each node in the descending path, with the maximum path length set to for and for . This algorithm is the fastest and also offers the best performance. Source code is available at https://github.com/tan1889/graphopt.
3.2 Applications in Graph-Structured Best-Arm Identification: Web Document Reranking
To demonstrate the performance of our algorithms on real-world non-concave problems, we used data from the Yandex Personalized Web Search Challenge to build a graph as follows. Each query forms a graph whose nodes correspond to lists of 5 items (documents) for the given query. Two nodes are connected if they have 4 or more items in common at the same positions. The value of a node is a Bernoulli random variable with mean equal to the click-through rate of the associated list. The goal is to find the node with maximum click-through rate, i.e. the most relevant list. We chose the query that generated the largest possible graph (query no. 8107157), with 4514 nodes. As there were many small connected components in this graph, we took the largest one as the input for our experiment. The resulting graph has 3992 nodes, with degree varying from 1 to 171 (mean 35) and a maximum function value at .
For non-concave graphs, ExploreDescend needs to make multiple restarts. We set the number of restarts to and allocate the budget equally among the restarts and then, within each restart, equally among the nodes on the path. All other parameters are the same as before.
The results of the experiment are presented in Figure 2. In the intended setting, our algorithms significantly outperform SuccessiveReject. For very small budgets, simulated annealing is better than ExploreDescend, but this is reversed as the budget grows. Additionally, for this graph we do not see the large advantage of Sim Annealing 5 over Sim Annealing 1 that we saw before. Outside of the intended setting, SuccessiveReject quickly becomes the best algorithm once the budget exceeds the graph size. However, this algorithm requires global information about the graph, which may not always be available.
3.3 Applications in Graph-Based Nearest-Neighbor Classification
In this section, we use the proposed graph-based optimization methods in a graph-based nearest neighbor search problem. The graph-based nearest neighbor method takes a training set, constructs a proximity graph on this set, and, when queried in the test phase, performs a hill-climbing search to find an approximate nearest neighbor. More details are given in Appendix A. This procedure is particularly well suited to big-data problems: in such problems, points have close neighbors, and so the geodesic distance with respect to even a simple metric, such as the Euclidean metric, should be appropriate. Further, as we will show, the computational complexity of the graph-based method in the test phase scales gracefully with the size of the training set, a property that is particularly important in big-data applications. The intuition is that, in big-data problems, although a descent path might be longer, the objective function is generally smoother and hence easier to minimize.
We apply the local search and simulated annealing methods, along with an additional smoothing step, to the problem of nearest neighbor search. For simulated annealing, we call the resulting algorithm SGNN, for Smoothed Graph-based Nearest Neighbor search. The pseudocode of the algorithm and more details are in Appendix A. ExploreDescend is denoted by E&D in these experiments. We compared the proposed methods with state-of-the-art nearest neighbor search methods (KDTree and LSH) in two image classification problems (MNIST and COIL100). In an approximate nearest neighbor search problem, it is crucial to have sublinear time complexity, and thus SpectralBandit and SuccessiveReject are not applicable here.
Figure 11 (a-d) shows the accuracy of the different methods on different portions of the MNIST dataset. The graphs in these experiments are not concave, but they are nearly concave by Definition 1.4. The results for COIL100 are shown in Appendix A. As the size of the training set increases, the prediction accuracy of all methods improves. Figure 11 (e-h) shows that the test-phase runtime of the SGNN method grows more modestly for larger datasets. In contrast, KDTree becomes much slower on larger training sets. The LSH method generally performs poorly, and it is hard to make it competitive with the other methods. When using all training data, the SGNN method has roughly the same accuracy as KDTree, but less than 20% of its test-phase runtime.
4 Conclusions and Future Work
We studied the sample complexity of stochastic optimization of graph functions. We defined a notion of convexity for graph functions and showed that, under this condition, a greedy algorithm and simulated annealing enjoy sample complexity bounds that are independent of the size of the graph. An interesting direction for future work is the study of cumulative regret in this setting.
We showed the effectiveness of the proposed techniques in a web document reranking problem as well as a graph-based nearest neighbor search problem. The computational complexity of the resulting nearest neighbor method scales gracefully with the size of the dataset, which is particularly appealing in big-data applications. Further quantification of this property remains for future work.
Acknowledgement
TN was supported by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).
References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In NIPS.
Arya, S. and Mount, D. M. (1993). Algorithms for fast vector quantization. In IEEE Data Compression Conference.
Audibert, J.-Y. and Bubeck, S. (2010). Best arm identification in multi-armed bandits. In COLT.
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397–422.
Bouttier, C. and Gavra, I. (2017). Convergence rate of a simulated annealing algorithm with noisy observations. arXiv:1703.00329.
Brito, M. R., Chávez, E. L., Quiroz, A. J., and Yukich, J. E. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33–42.
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. (2009). Online optimization in X-armed bandits. In NIPS.
Carpentier, A. and Valko, M. (2015). Simple regret for infinitely many armed bandits. In ICML.
Chen, J., Fang, H., and Saad, Y. (2009). Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. JMLR, 10:1989–2012.
Connor, M. and Kumar, P. (2010). Fast construction of k-nearest neighbor graphs for point clouds. IEEE Transactions on Visualization and Computer Graphics, 16(4):599–608.
Dong, W., Charikar, M., and Li, K. (2011). Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW.
Eppstein, D., Paterson, M. S., and Yao, F. F. (1997). On nearest-neighbor graphs. Discrete & Computational Geometry, 17(3):263–282.
Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 7:1079–1105.
Grill, J.-B., Valko, M., and Munos, R. (2015). Black-box optimization of noisy functions with unknown smoothness. In NIPS.
Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., and Zhang, H. (2011). Fast approximate nearest-neighbor search with k-nearest neighbor graph. In IJCAI.
Jamieson, K. and Nowak, R. (2014). Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In CISS.
Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. JMLR, 17:1–42.
Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. JMLR, 5:623–648.
Miller, G. L., Teng, S.-H., Thurston, W., and Vavasis, S. A. (1997). Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM, 44(1):1–29.
Plaku, E. and Kavraki, L. E. (2007). Distributed computation of the kNN graph for large high-dimensional point sets. Journal of Parallel and Distributed Computing, 67(3):346–359.
Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York.
Rubinstein, R. Y. (1997). Optimization of computer simulation models with rare events. European Journal of Operational Research, 99:89–112.
Sreenivas Pydi, M., Jog, V., and Loh, P.-L. (2018). Graph-based ascent algorithms for function maximization. arXiv:1802.04475.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML.
Valko, M., Munos, R., Kveton, B., and Kocák, T. (2014). Spectral bandits for smooth graph functions. In ICML.
Wang, J., Wang, J., Zeng, G., Tu, Z., Gan, R., and Li, S. (2012). Scalable k-NN graph construction for visual descriptors. In CVPR.
Appendix A More Details for Experiments
First, we explain the graph-based nearest neighbor search. Let k be a positive integer. Let G be a proximity graph constructed on a dataset S in an offline phase, i.e. S is the set of nodes of G, and each point in S is connected to its k nearest neighbors with respect to some distance metric d. In our experiments, we use the Euclidean metric. Given the graph G and a query point q, the problem is reduced to minimizing the function f(v) = d(v, q) over the nodes of G. The algorithm is shown in Figures 12 and 13.
The SGNN method continues for a fixed number of iterations; in our experiments, the number of rounds of the simulated annealing procedure depends on the size of the training set. See Figure 13 for pseudocode. Finally, SGNN runs the simulated annealing procedure several times and returns the best outcome of these runs. The resulting algorithm with random restarts is shown in Figure 12. This algorithm returns an approximate nearest neighbor point; to find several nearest neighbors of q, we simply return the best elements in the last line of Figure 12.
We use approximate nearest neighbors to predict a class for each given query. We construct a directed graph by connecting each node to its k closest nodes in Euclidean distance. For smoothing, we tried random walks of different lengths ℓ. This means that, to evaluate a node, we run a random walk of length ℓ from that node and return the observed value at the stopping point as an estimate of the value of the node. This operation smoothens the function and generally improves performance. The SGNN method with ℓ = 1 is denoted by SGNN(1), and SGNN with ℓ = 0, i.e. pure simulated annealing on the graph, is denoted by SGNN(0).
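The exact procedure is given in Figures 12 and 13. A minimal sketch of the search phase, assuming a dict-based adjacency structure, a geometric cooling schedule, and uniform random restarts (all of which are our assumptions rather than the paper's exact choices), might look as follows:

```python
import math
import random

def anneal_on_graph(graph, f, start, n_rounds, temp=1.0, cooling=0.99):
    """Simulated annealing over a proximity graph.
    graph: dict mapping node -> list of neighbor nodes.
    f: objective to minimize, e.g. distance to the query point."""
    current = start
    best = start
    for _ in range(n_rounds):
        candidate = random.choice(graph[current])
        delta = f(candidate) - f(current)
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp), then cool the temperature.
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            current = candidate
        if f(current) < f(best):
            best = current
        temp *= cooling
    return best

def sgnn(graph, f, n_rounds, n_restarts):
    """Run annealing from several random starts; keep the best outcome."""
    nodes = list(graph)
    results = [anneal_on_graph(graph, f, random.choice(nodes), n_rounds)
               for _ in range(n_restarts)]
    return min(results, key=f)
```

For a single approximate nearest neighbor, f would be the Euclidean distance to the query; returning the best few nodes visited instead of a single minimizer gives the multi-neighbor variant described above.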
Graph-based nearest neighbor search has been studied by Arya and Mount (1993), Brito et al. (1997), Eppstein et al. (1997), Miller et al. (1997), Plaku and Kavraki (2007), Chen et al. (2009), Connor and Kumar (2010), Dong et al. (2011), Hajebi et al. (2011), and Wang et al. (2012). In the worst case, construction of the proximity graph has complexity quadratic in the size of the dataset, but this is an offline operation. The choice of k impacts the prediction accuracy and the computational complexity: a smaller k means lighter training-phase computation and heavier test-phase computation (as we need more random restarts to achieve a given prediction accuracy). Having a very large k also makes the test-phase computation heavy.
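As a concrete illustration of this offline step, a brute-force k-NN graph construction (quadratic in the number of points; the NumPy formulation here is ours, not the paper's) can be sketched as:

```python
import numpy as np

def build_knn_graph(points, k):
    """Brute-force k-NN graph: O(n^2) distance computations (offline).
    Returns an (n, k) array whose i-th row lists the k nearest
    neighbors of point i under Euclidean distance."""
    sq = np.sum(points ** 2, axis=1)
    # Pairwise squared Euclidean distances via the expansion
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # exclude self-edges
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
pts = rng.standard_normal((200, 5))
graph = build_knn_graph(pts, k=10)
```

Approximate constructions (e.g. the recursive Lanczos bisection of Chen et al. (2009)) avoid the quadratic cost, but the brute-force version suffices for the offline phase on moderate datasets.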
We used the MNIST and COIL-100 datasets, which are standard datasets for image classification. The MNIST dataset is a black and white image dataset, consisting of 60000 training images and 10000 test images in 10 classes; each image is 28 x 28 pixels. The COIL-100 dataset is a colored image dataset, consisting of 100 objects, with 72 images of each object taken at 5-degree pose intervals; each image is 128 x 128 pixels. We use 80% of the images for training and 20% for testing.
For the LSH and KDTree algorithms, we use the implementations in the scikit-learn library with the following parameters. For LSH, we use LSHForest with min hash match=4, #candidates=50, #estimators=50, #neighbors=50, radius=1.0, radius cutoff ratio=0.9. For KDTree, we use leaf size=1 and k=50, meaning that the indices of the 50 closest neighbors are returned. The KDTree method always significantly outperforms LSH. For SGNN, we pick the number of restarts so that all methods have similar prediction accuracy.
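For reference, the KDTree baseline setup in scikit-learn looks roughly like the following (LSHForest was removed from recent scikit-learn releases, so only the KDTree side is sketched; the data shapes are placeholders, not the experiment's):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 16))   # placeholder training set
query = rng.standard_normal((5, 16))      # placeholder queries

# KD-tree with leaf_size=1; each query returns its 50 nearest neighbors.
tree = KDTree(train, leaf_size=1)
dist, idx = tree.query(query, k=50)
```

The returned `idx` rows are the indices of the 50 closest training points per query, ordered by increasing distance, matching the k=50 setting above.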
Figure 22 (a–d) shows the accuracy of different methods on different portions of the COIL-100 dataset. As the size of the training set increases, the prediction accuracy of all methods improves. Figure 22 (e–h) shows that the test-phase runtime of the SGNN method grows more modestly for larger datasets. In contrast, KDTree becomes much slower for larger training datasets. When using all training data, the proposed method has roughly the same accuracy, while having less than 50% of the test-phase runtime of KDTree. Using the exact nearest neighbor search, we get the following prediction accuracy results (the error bands are 95% bootstrapped confidence intervals): with full data, accuracy is ; with 3/4 of data, accuracy is ; with 1/2 of data, accuracy is ; and with 1/4 of data, accuracy is .
Next, we study how the performance of SGNN changes with the length of the random walks. We vary the walk length ℓ and compare the different methods on the same datasets. The results are shown in Figure 27. The SGNN(2) method outperforms the competitors. Interestingly, SGNN(2) also outperforms the exact nearest neighbor algorithm on the MNIST dataset. This result might appear counterintuitive, but we explain it as follows. Given that we use a simple metric (Euclidean distance), the exact nearest neighbors are not necessarily appropriate candidates for making a prediction: although the exact nearest neighbor algorithm finds the global minimum, the neighbors of the global minimum on the graph might have large values. In contrast, the SGNN(2) method finds points that have small values and whose neighbors also have small values. This stability acts as an implicit regularization in SGNN(2), leading to improved performance.
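The random-walk smoothing underlying SGNN(ℓ) can be sketched as follows (the dict-based graph representation is our assumption); with walk length 0 it reduces to evaluating the raw objective, as in SGNN(0):

```python
import random

def smoothed_value(graph, f, node, walk_len, rng=random):
    """Estimate a node's value by an l-step random walk:
    walk walk_len steps from `node` along graph edges and return f
    at the stopping point. walk_len = 0 gives the raw objective."""
    current = node
    for _ in range(walk_len):
        current = rng.choice(graph[current])
    return f(current)
```

Averaging several such walks per node would reduce the variance of the estimate at the cost of extra evaluations; the single-walk version matches the description above.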
These results show the advantage of using graph-based nearest neighbor algorithms: as the size of the training set increases, the proposed method is much faster than KDTree.