Sample Efficient Graph-Based Optimization with Noisy Observations

06/04/2020 ∙ by Tan Nguyen, et al.

We study the sample complexity of optimizing "hill-climbing friendly" functions defined on a graph under noisy observations. We define a notion of convexity, and we show that a variant of best-arm identification can find a near-optimal solution after a small number of queries that is independent of the size of the graph. For functions that have local minima and are nearly convex, we derive a sample complexity bound for the classical simulated annealing algorithm under noisy observations. We demonstrate the effectiveness of the greedy algorithm with restarts and of simulated annealing on problems of graph-based nearest neighbor classification as well as a web document re-ranking application.


1 Introduction

Stochastic optimization of a function defined on a large finite set arises frequently in practical problems. Instances of this problem include finding the most attractive design for a webpage, the node with maximum influence in a social network, etc. There are a number of approaches to this problem. At one extreme, we can use global optimization methods such as simulated annealing, genetic algorithms, or cross-entropy methods (Rubinstein, 1997; Christian and Casella, 1999). Although only limited theoretical performance guarantees are available, these methods perform well in a number of practical applications. At the other extreme, we can use best-arm identification algorithms from the bandit literature or hypothesis testing from the statistics community (Mannor and Tsitsiklis, 2004; Even-Dar et al., 2006; Audibert and Bubeck, 2010). We have stronger sample complexity results for this family of algorithms: with high probability, the algorithm returns a near-optimal solution after a small number of observations, and this number typically grows polynomially with the size of the decision set. These methods, however, often perform poorly when the decision set is large. In such problems, it is important to exploit the structure of the specific problem to speed up the optimization. If appropriate features are available and the objective function can be assumed linear in these features, then algorithms designed for bandit linear optimization are applicable (Auer, 2002). More generally, kernel-based or neural-network-based bandit algorithms are available for problems with non-linear objective functions (Srinivas et al., 2010).

We are particularly interested in problems where item similarity is captured by a graph structure. In general, and with no further conditions, we cannot hope to show non-trivial sample complexity rates. We observe that in many real-world applications, the objective function is easy and hill-climbing friendly, in the sense that from many nodes there exists a monotonic path to the global minimum. We make this property explicit by defining a notion of convexity for functions defined on graphs. Under this condition, we show that a hill-climbing procedure that uses a variant of best-arm identification as a subroutine has a small sample complexity that is independent of the size of the graph. In the presence of local minima, this greedy procedure might require many restarts and might not be efficient in practice. Simulated annealing is commonly used in practice for such problems. We also define a notion of nearly convex functions that allows for the existence of shallow local minima. We show that for nearly convex functions, and using an appropriate estimation of function values, the classical simulated annealing procedure finds near-optimal nodes after a small number of function evaluations.

While the asymptotic convergence of simulated annealing has been studied extensively in the literature, there are only a few finite-time convergence rates for this important algorithm. Sreenivas Pydi et al. (2018) show finite-time convergence rates for the algorithm, but they consider only deterministic functions, and their rates scale with the size of the graph. These results, which are obtained under very general conditions, do not quite explain the success of the simulated annealing algorithm in large-scale problems. In practice, simulated annealing finds a near-optimal solution with a much smaller number of function evaluations. Our bounds are in terms of the convexity of the function, and the rates do not scale with the size of the graph. Additionally, our results hold in the noisy setting. Bouttier and Gavra (2017) show convergence rates for simulated annealing applied to noisy functions. Their results, however, appear to have several gaps.

1.1 Related Work

A best-arm identification algorithm from the bandit literature can find a near-optimal node, but its time and sample complexity is linear in the number of nodes (Mannor and Tsitsiklis, 2004; Even-Dar et al., 2006; Audibert and Bubeck, 2010; Jamieson and Nowak, 2014; Kaufmann et al., 2016). Such complexity is not acceptable when the graph is very large. We are interested in designing and analyzing algorithms that find a near-optimal node with sample and computational complexity sublinear in the number of nodes.

Bandit algorithms for graph functions are studied in Valko et al. (2014). The sample complexity of these algorithms scales with the smoothness of the function, but the computational complexity scales linearly with the number of nodes. We are interested in problems where the graph is very large, and we might have access to only local information. As we will see in the experiments, the SpectralBandit algorithm of Valko et al. (2014) is not applicable to problems with large graphs. Further, we study a different notion of regularity that is inspired by convexity in continuous optimization. Bandit problems for large action sets are also studied under a number of different notions of regularity (Bubeck et al., 2009; Abbasi-Yadkori et al., 2011; Carpentier and Valko, 2015; Grill et al., 2015).

1.2 Contributions

We cannot hope to achieve a sublinear rate without further conditions on the function. We define a notion of strong convexity for functions defined on a graph. For strongly convex functions, we design an algorithm, called Explore-Descend, that finds the optimal node with a guaranteed error probability. The Explore-Descend algorithm uses a best-arm identification algorithm as a submodule; the best-arm identification problem is that of finding the optimal action given a sampling budget. For nearly convex functions, we show that the classical simulated annealing algorithm finds the global minimum in a reasonable time. For convex functions, Explore-Descend has lower sample complexity, but simulated annealing can handle non-convex functions.

We study the empirical performance of Explore-Descend and simulated annealing in two applications. First, we consider the problem of content optimization in a digital marketing application. In each round, a user visits a webpage that is composed of a number of components. The learning agent chooses the contents of these components and shows the resulting page to the user. The user feedback, in the form of click/no click or conversion/no conversion, is recorded and used to update the decision making policy. The objective is to return a page configuration with near-optimal click-through rate. We would like to find such a configuration as quickly as possible with a small number of user interactions. In this problem, each page configuration is a node of the graph, and two nodes are connected if the corresponding page configurations differ in only one component.

Our second application is the problem of nearest neighbor search and classification. Given a set of points , a query point , and a distance function , we are interested in solving . A trivial solution to this problem is to examine all points in and return the point with the smallest distance value. The computational complexity of this method is , and is not practical in large-scale problems. An approximate nearest neighbor method returns an approximate solution in sublinear time. A class of approximate nearest neighbor methods that is particularly suitable for big-data applications is the graph-based nearest neighbor search of Arya and Mount (1993). In a graph-based search, we construct a proximity graph in the training phase and perform a hill-climbing search in the test phase. To improve the performance, we perform an additional smoothing procedure that replaces the value of a node by the average of the function values in a vicinity of the node. In practice, these average values are estimated by performing random walks, and hence the problem is a graph optimization problem with noisy observations. We show that the proposed graph optimization technique outperforms popular nearest neighbor search methods such as KDTree and LSH in two image classification problems. Interestingly, and compared to these methods, the computational complexity of the proposed technique appears to be less sensitive to the size of the training data. This property is particularly appealing in big-data applications.

1.3 Notations

We use to denote the set . Let be a graph with nodes and the set of edges . Let be some unknown function defined on . Let be the global minimizer of . The goal is to find a node with small loss . In this paper, we study the problem where the evaluation of is noisy: we can only observe , where is a zero-mean sub-Gaussian random variable.

Let denote the set of neighbors of node . Let . For simplicity, we assume all nodes have the same degree and we let denote the number of neighbors. We sometimes write to denote the function value . For two nodes , we use to denote all paths from to in the graph. We use to denote all paths starting from node . We use to denote the length of a path.

1.4 Convexity

The general discrete optimization problem is hard for an unrestricted function and graph . As such, we study a restricted class of problems that allows for efficient algorithms. Let be the amount of improvement if we move from node to the neighbor . We say path is -strongly convex if , for all . Sometimes we use to denote if the -strongly convex path is clear from the context. We use to denote .

  • Definition 1 (Convex Functions) A function defined on a graph is (strongly) convex if from any node , there exists a (strongly) convex path to the global minimum .

For a nearly convex function, as defined next, the convexity condition is not satisfied at all points.

  • Definition 2 (Nearly Convex Functions) Let be a parameter. Let be the set of points such that . We say function is -nearly convex if for any , there exists a and a such that and .

A path that satisfies the above conditions is called a low energy path. A convex function is also a -nearly convex function.
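Both definitions above rest on the existence of paths along which the function improves step by step. The following is a minimal Python sketch of such a path check; the function name, the dict representation of the objective, and the constant per-step lower bound `c` are our own simplifications (the paper's strong-convexity condition ties the required improvement to the distance from the minimizer).

```python
def is_descending_path(f, path, c=0.0):
    """Return True if f decreases by at least c along every step of `path`.
    With c = 0 this checks the plain "monotonic path" property; a constant
    bound c is a simplification of the strong-convexity condition."""
    return all(f[u] - f[v] >= c for u, v in zip(path, path[1:]))

# Toy example: integer nodes with f(x) = x^2, minimized at 0.
f = {x: x * x for x in range(-3, 4)}
print(is_descending_path(f, [3, 2, 1, 0]))         # True: every step improves
print(is_descending_path(f, [3, 2, 1, 0], c=2.0))  # steps improve by 5, 3, 1
```

The second call fails because the last step only improves the value by 1, illustrating how a strong-convexity requirement is stricter than mere monotonicity.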

Lemma 1.

We have that .

Proof.

Let be a strongly convex path. Then

We have that

We conclude that , from which the statement follows. ∎

Concave and nearly concave functions are defined similarly. The convexity condition allows for efficient algorithms. The intuition behind our analysis is that, by strong convexity, if node is far from the global minimum, the improvement is large. When the degree is relatively small, local search methods have a sufficiently large probability of hitting a good direction (the strongly convex path). So either we are already close to the global minimum, or we are far from it and with constant probability we make a large improvement.

2 Approximating the Global Minimizer

We analyze two algorithms for the graph-based optimization. The first algorithm is a local search procedure where in each round, the learner attempts to identify a good neighbor and move there. This algorithm is analyzed under the strong convexity condition. The second algorithm is the well-known simulated annealing procedure with an exponential transition function. We provide high probability error bounds for the greedy method, while the sample complexity of the more complicated simulated annealing is analyzed in expectation.

2.1 The Local Search Algorithm and Best-Arm Identification

The greedy approach is an intuitive approach to the graph optimization problem: we start from a random node, and at each node, given a sampling budget, we explore its neighbors to find the best one. The problem that we solve at each node can be viewed as a fixed-budget best-arm identification problem (Audibert and Bubeck, 2010). In a fixed-budget setting, the learner plays actions for a number of rounds in some fashion and returns an approximately optimal arm at the end. An example of an algorithm designed for the fixed-budget setting is the SuccessiveReject algorithm of Audibert and Bubeck (2010).

Before describing the best-arm identification algorithm, we introduce some notation. Let be an integer, be the set of arms in a bandit problem, be the mean value of arm , and be the empirical mean of arm after observations. Let be the optimal arm (break tie arbitrarily) and be its mean, and let be the optimality gap of arm , i.e. . Without loss of generality assume that . Let be the budget. Define

The SuccessiveReject algorithm is a procedure with rounds where in each round one arm is eliminated. Let be the remaining arms at the beginning of round . In round , each arm is selected for rounds. At the end of this round, the arm with the smallest empirical mean is removed.
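The elimination scheme just described can be sketched in Python as follows. This is a hedged rendering of the SuccessiveReject algorithm of Audibert and Bubeck (2010); the function `pull(i)` stands in for a noisy query of arm i, and the reward model in the example is our own illustration.

```python
import math
import random

def successive_reject(pull, K, budget):
    """Fixed-budget best-arm identification (after Audibert & Bubeck, 2010):
    K - 1 elimination rounds; in round k each surviving arm has been pulled
    a total of n_k times, then the arm with the smallest empirical mean is
    removed. `pull(i)` returns a noisy reward for arm i."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    sums, counts = [0.0] * K, [0] * K
    n_prev = 0
    for k in range(1, K):
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for i in active:
            for _ in range(n_k - n_prev):  # top up each arm's pull count
                sums[i] += pull(i)
                counts[i] += 1
        n_prev = n_k
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
    return active[0]

# Illustrative example: three arms with Gaussian rewards; the best arm (2)
# should survive with high probability.
means = [0.2, 0.5, 0.8]
rng = random.Random(0)
best = successive_reject(lambda i: means[i] + rng.gauss(0, 0.1), K=3, budget=3000)
```

With these well-separated means and a budget of 3000 pulls, the empirical estimates concentrate tightly, so the returned arm is the true best arm with overwhelming probability.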

Theorem 1 (Audibert and Bubeck (2010)).

The probability of error of the SuccessiveReject algorithm with a budget of is smaller than

In the Explore-Descend procedure, we use the above bandit algorithm to find the best neighbor in each round. More specifically, for a node , we solve a best-arm identification problem with action values , where is the th neighbor of node . We then move to the chosen neighbor and repeat the process until the budget is consumed. We call this approach “Explore and Descend”; it is detailed in Algorithm 1.

Input : graph , starting node , budget
Output :  such that with high probability
1 Per round budget such that ;
2 for  do
3       DescentOracle();
4      
return
Algorithm 1 Explore and Descend

The algorithm depends on the subroutine . This subroutine is an implementation of the best-arm bandit algorithms described earlier with the decision set and budget .
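Under these conventions, Algorithm 1 can be sketched as below. The uniform-sampling DescentOracle here is a simplified stand-in for the SuccessiveReject subroutine, and the graph and objective in the example are illustrative assumptions.

```python
import random

def explore_descend(graph, f_noisy, start, rounds, per_round_budget):
    """Sketch of Explore-Descend: at each node, spend the per-round budget
    estimating neighbor values, then move to the neighbor with the smallest
    empirical mean. A faithful implementation would use SuccessiveReject as
    the DescentOracle; uniform sampling is a simplified stand-in."""
    v = start
    for _ in range(rounds):
        nbrs = graph[v]
        m = max(1, per_round_budget // len(nbrs))  # pulls per neighbor
        means = {u: sum(f_noisy(u) for _ in range(m)) / m for u in nbrs}
        best = min(means, key=means.get)
        # stop early if no neighbor looks better than the current node
        if means[best] >= sum(f_noisy(v) for _ in range(m)) / m:
            break
        v = best
    return v

# Illustrative example: a path graph whose (noisy) objective |v - 7| has
# its minimum at node 7.
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
rng = random.Random(1)
def f_noisy(v):
    return abs(v - 7) + rng.gauss(0, 0.05)
found = explore_descend(graph, f_noisy, start=0, rounds=10, per_round_budget=40)
```

On this convex path graph, the per-step gaps are at least 1 while the noise is small, so the descent reaches the minimizer after a handful of rounds.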

Corollary 1.

Let be a -strongly convex function, let be an arbitrary starting node, and let , , be per round budgets. Assume that is sufficiently large for the steepest descent path from to reach . Let be the probability of error of the Explore-Descend algorithm. Then is upper bounded by the following inequality:

(1)

where is the -th optimality gap at node .

Proof.

Let . In round , the Explore-Descend algorithm uses SuccessiveReject for the subroutine . Thus, by Theorem 1, the probability of error in round is upper bounded by

Using the union bound on the above inequality for round , and the loose upper bound , we obtain Equation 1. ∎

On the other hand, we can also apply the SuccessiveReject algorithm directly on , the set of all nodes of the graph. In this case, each node in the graph is one arm, and the graph structure is disregarded. The probability of error of SuccessiveReject, using Theorem 1 with and , is upper bounded by the following inequality:

(2)

Using the loose upper bound , the above can be written as:

(3)

Note that the error bounds of SuccessiveReject given in Equations 2 and 3 are vacuous when . In contrast, Equation 1 provides meaningful error bounds for Explore-Descend even in this small-budget regime, when . Additionally, we see that the error bound for Explore-Descend is independent of the size of the graph. Rather, it depends on , which in turn depends on the convexity constant . A larger means the function is steeper and fewer steps (smaller ) are required to reach the global optimum.

2.2 Nearly Convex Problems: Simulated Annealing

Given that we have access only to noisy observations, we consider the following Metropolis-Hastings Algorithm with exponential weights:

To simplify the analysis, we use a fixed time-independent inverse temperature, although in practice a time-varying inverse temperature might be preferable. We estimate each function value by samples. Next, we provide sample complexity bounds for the above procedure in expectation.
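The procedure can be sketched as a Metropolis-Hastings chain with exponential acceptance weights and a fixed inverse temperature, with each function value estimated by averaging a fixed number of noisy samples. Returning the best visited node is our own convenience rather than part of the analyzed procedure, and all names in the example are illustrative.

```python
import math
import random

def sim_annealing(graph, f_noisy, start, beta, steps, pulls, rng):
    """Metropolis-Hastings sketch with exponential acceptance weights and a
    fixed (time-independent) inverse temperature beta; each function value
    is estimated by averaging `pulls` noisy samples."""
    def est(v):
        return sum(f_noisy(v) for _ in range(pulls)) / pulls

    v, fv = start, est(start)
    best, fbest = v, fv  # track the best node seen (a convenience)
    for _ in range(steps):
        u = rng.choice(graph[v])  # propose a uniformly random neighbor
        fu = est(u)
        # accept with probability min(1, exp(-beta * (f(u) - f(v))))
        if fu <= fv or rng.random() < math.exp(-beta * (fu - fv)):
            v, fv = u, fu
            if fv < fbest:
                best, fbest = v, fv
    return best

# Illustrative example: the same path-graph objective |v - 7|.
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
noise = random.Random(3)
def f_noisy(v):
    return abs(v - 7) + noise.gauss(0, 0.05)
found = sim_annealing(graph, f_noisy, start=0, beta=3.0, steps=200, pulls=5,
                      rng=random.Random(2))
```

With beta = 3, an uphill step of size 1 is accepted with probability roughly exp(-3) ≈ 0.05, so the chain drifts downhill and visits the minimizer well within the step budget.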

Theorem 2.

For a -strongly convex function , let and be the path generated by SimAnnealing from an arbitrary initial node . Given a constant and with the choice of , after rounds, we have

For a nearly convex function, let and let . With the choice of and after rounds, we have that

Proof.

Let be the closest point in on a low energy path from and let be the distance to this point from . Let be the set of paths of length less than and starting at such that at least one node on the path is not the same as . Consider the low energy sub-path starting at :

Let be the terminating node in a path . Given that function is -nearly convex, and given that the noise is sub-Gaussian, the probability that this low-energy path is taken by the algorithm is

Because is a low energy path, we have . Let , and . Notice that given , and are deterministic. We write

The first term on the right side is related to the event that the state follows the low energy path to the state for rounds, and then goes to the best immediate neighbor at state . The second term is related to the event that the state is not the same as after rounds. Finally, the last term is related to the event that the state stays in for the next rounds. If , then by Definition 1.4, we already have

Otherwise if , we continue as follows:

where the last step follows from inequality . By Definition 1.4, . Thus,

where we used in the last step. Let . We have,

where the second step holds by , , and , and is defined in the theorem statement. Using the fact that and , and a simple calculation shows that after rounds, we have that


If , then we get that

After rounds, we have that

For a strongly convex function, following a similar argument, we have that

Thus, given and with the choice of in the theorem statement, after rounds, we have . ∎


For nearly convex problems, the error bound in this theorem is meaningful only if the parameter is small. In our experimental data, this value is or .

3 Experiments

We implemented the Explore-Descend and Simulated Annealing algorithms and compared them with SpectralBandit of Valko et al. (2014) and SuccessiveReject of Audibert and Bubeck (2010). The SuccessiveReject algorithm works by treating all nodes of the graph as one large multi-armed bandit problem. SpectralBandit uses the adjacency matrix to compute the Laplacian and its eigenvectors. Both of these algorithms therefore require global information about the graph, while our algorithms assume only local information: from one node, one can only access its neighbors.

We note that the SpectralBandit algorithm is originally designed to minimize cumulative regret, which partly explains its poor performance in a best-arm identification problem. In the fixed-budget best-arm identification setting, we run SpectralBandit until the budget is consumed and output the most frequently pulled arm, which in our experiments is better than taking the arm with the best empirical mean. The algorithm has high time complexity because of its reliance on operations on matrices and vectors of dimension . As SpectralBandit does not scale well with the size of the graph due to these matrix operations, we generated a small synthetic graph in order to evaluate it.

3.1 Applications in Graph-Structured Best-Arm Identification: Synthetic Data

First, we evaluated the different algorithms on synthetic graphs, generated as follows. Each node of the graph is a point , where for some . Each node is connected to all of its eight immediate neighbors on the plane. Additionally, random edges are added such that the degree of each node is 15. The mean function value is . This mean is unknown to the algorithms, i.e., when the algorithm requests the value of , a stochastic value is returned: 1 with probability and 0 with probability . It is easy to see that this graph is concave by Definition 1.4.
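Under our reading of this description, the graph construction can be sketched as follows. The mean function is elided in the text, so only the graph is generated here; producing degrees of "at least 15" rather than exactly 15 is an assumption, since adding symmetric random edges can overshoot the target degree.

```python
import random

def make_synthetic_graph(n, min_degree=15, rng=random):
    """n x n grid; each node is linked to its (up to) eight immediate grid
    neighbours, then random extra edges are added until every node has
    degree at least `min_degree`."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    adj = {v: set() for v in nodes}
    for (x, y) in nodes:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                u = (x + dx, y + dy)
                if (dx, dy) != (0, 0) and u in adj:  # skip self and off-grid
                    adj[(x, y)].add(u)
    for v in nodes:
        while len(adj[v]) < min_degree:
            u = rng.choice(nodes)
            if u != v:  # add a symmetric random edge
                adj[v].add(u)
                adj[u].add(v)
    return adj

g = make_synthetic_graph(21, rng=random.Random(0))  # 441 nodes, as in Figure 1
```

A 21 x 21 grid gives the 441-node "small synthetic graph" of Figure 1; the larger 40401-node graph corresponds to a 201 x 201 grid.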

The results of the experiment are presented in Figure 1. The performance measure is the average sub-optimality gap, i.e. , where is the solution returned by the algorithm. The average is taken over 1000 trials. Overall, our algorithms significantly outperform SpectralBandit and SuccessiveReject both in terms of time and sub-optimality gap, especially when the budget is smaller than the graph size, which is our intended setting. For Simulated Annealing, we use a fixed inverse temperature . The result could be further improved by optimizing a schedule for this parameter. Interestingly, the number of pulls for each function evaluation also has a significant impact on the performance, which can be seen from the plots for Sim Annealing 1 (1 pull) and Sim Annealing 5 (5 pulls) in Figure 1. As for Explore-Descend, we simply allocate the budget equally to each node in the descending path, with the maximum path length set to for and for . This algorithm is the fastest and also offers the best performance. Source code is available at https://github.com/tan1889/graph-opt

Figure 1: Comparing the performance of different algorithms on synthetic data. Two left figures: Small synthetic graph (441 nodes). Two right figures: Large synthetic graph (40401 nodes). Figures show average sub-optimality gaps over 1000 trials and run-time per trial of the algorithms. Same legend (on the second figure) for all figures.

3.2 Applications in Graph-Structured Best-Arm Identification: Web Document Reranking

To demonstrate the performance of our algorithm on real-world non-concave problems, we used data from Yandex Personalized Web Search Challenge to build a graph as follows. Each query forms a graph, whose nodes correspond to lists of 5 items (documents) for the given query. Two nodes are connected if they have 4 or more items in common at the same positions. The value of a node is a Bernoulli random variable with mean equal to the click-through rate of the associated list. The goal is to find the node with maximum click-through rate, i.e. the most relevant list. We chose the query that generated the largest possible graph (query no. 8107157) of 4514 nodes. As there were many small partitions in this graph, we took the largest partition as the input for our experiment. The resulting graph has 3992 nodes with degree varying from 1 to 171 (mean=35) and a maximum function value at .

Figure 2: Comparing the performance of algorithms on web document reranking data. Left: Average sub-optimal gap over 1000 trials. Right: Run-time (s) per trial. Same legend for both figures.

For non-concave graphs, Explore-Descend needs to make multiple restarts. We set the number of restarts to and allocate the budget equally between all restarts and then, for each restart, equally between the nodes in the path. All other parameters are the same as before.

The results of the experiment are presented in Figure 2. In the intended setting, our algorithms significantly outperform SuccessiveReject. For very small budgets, Simulated Annealing is better than Explore-Descend, but this is reversed as the budget grows. Additionally, for this graph we do not see the large advantage of Sim Annealing 5 over Sim Annealing 1 that we saw before. Outside the intended setting, SuccessiveReject quickly becomes the best algorithm once the budget exceeds the graph size. However, this algorithm requires global information about the graph, which may not always be feasible.

3.3 Applications in Graph-Based Nearest-Neighbor Classification

Figure 11: Prediction accuracy and running time of different methods on the MNIST dataset as the size of the training set increases. (a,e) Using 25% of training data. (b,f) Using 50% of training data. (c,g) Using 75% of training data. (d,h) Using 100% of training data.

In this section, we use the proposed graph-based optimization methods in a graph-based nearest neighbor search problem. The graph-based nearest neighbor method takes a training set, constructs a proximity graph on this set and, when queried in the test phase, performs a hill-climbing search to find an approximate nearest neighbor. More details are given in Appendix A. This procedure is particularly well suited to big-data problems: in such problems, points will have close neighbors, and so the geodesic distance with respect to even a simple metric such as the Euclidean metric should provide an appropriate metric. Further, as we will show, the computational complexity of the graph-based method in the test phase scales gracefully with the size of the training set, a property that is particularly important in big-data applications. The intuition is that, in big-data problems, although a descent path might be longer, the objective function is generally smoother and hence easier to minimize.

We apply the local search and simulated annealing methods, along with an additional smoothing step, to the problem of nearest neighbor search. For Simulated Annealing, we call the resulting algorithm SGNN, for Smoothed Graph-based Nearest Neighbor search. The pseudocode of the algorithm and more details are in Appendix A. Explore-Descend is denoted by E&D in these experiments. We compared the proposed methods with state-of-the-art nearest neighbor search methods (KDTree and LSH) in two image classification problems (MNIST and COIL-100). In an approximate nearest neighbor search problem, it is crucial to have sublinear time complexity, and thus SpectralBandit and SuccessiveReject are not applicable here.
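The hill-climbing search at the heart of the graph-based method can be sketched as below, without the smoothing step; the graph, points, and distance function in the example are illustrative.

```python
def graph_nn_search(graph, dist, points, query, start):
    """Greedy hill-climb on a proximity graph: repeatedly move to the
    neighbour closest to the query, stopping at a local minimum of the
    distance to the query."""
    v = start
    while True:
        best = min(graph[v], key=lambda u: dist(points[u], query))
        if dist(points[best], query) >= dist(points[v], query):
            return v  # no neighbour is closer: local minimum reached
        v = best

# Illustrative example: points on a line with a path proximity graph.
points = [float(i) for i in range(20)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}
def dist(p, q):
    return abs(p - q)
res = graph_nn_search(graph, dist, points, query=5.2, start=0)
```

On this path graph, the distance to the query is convex along the line, so the climb from either end terminates at the true nearest neighbor; on non-convex real graphs this is exactly where the smoothing and restart machinery above becomes useful.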

Figure 11 (a-d) shows the accuracy of different methods on different portions of the MNIST dataset. The graphs in these experiments are not concave, but -nearly concave by Definition 1.4. The results for COIL-100 are shown in Appendix A. As the size of the training set increases, the prediction accuracy of all methods improves. Figure 11 (e-h) shows that the test-phase runtime of the SGNN method grows more modestly for larger datasets. In contrast, KDTree becomes much slower for larger training datasets. The LSH method generally performs poorly, and it is hard to make it competitive with the other methods. When using all training data, the SGNN method has roughly the same accuracy as KDTree, but less than 20% of its test-phase runtime.

4 Conclusions and Future Work

We studied the sample complexity of stochastic optimization of graph functions. We defined a notion of convexity for graph functions and showed that, under the convexity condition, a greedy algorithm and simulated annealing enjoy sample complexity bounds that are independent of the size of the graph. An interesting direction for future work is the study of cumulative regret in this setting.

We showed effectiveness of the proposed techniques in a web document re-ranking problem as well as a graph-based nearest neighbor search problem. The computational complexity of the resulting nearest neighbor method scales gracefully with the size of the dataset, which is particularly appealing in big-data applications. Further quantification of this property remains for future work.

Acknowledgement

TN was supported by the Australian Research Council Centre of Excellence for Mathematical and Statistics Frontiers (ACEMS).

References

  • Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011) Improved algorithms for linear stochastic bandits. In NIPS, Cited by: §1.1.
  • S. Arya and D. M. Mount (1993) Algorithms for fast vector quantization. In IEEE Data Compression Conference, Cited by: Appendix A, §1.2.
  • J. Audibert and S. Bubeck (2010) Best arm identification in multi-armed bandits. In COLT 2010 - 23rd Conference on Learning Theory, Haifa, Israel. Cited by: §1.1, §1, §2.1, §3, Theorem 1.
  • P. Auer (2002) Using confidence bounds for exploitation-exploration trade-offs. JMLR (3), pp. 397–422. Cited by: §1.
  • C. Bouttier and I. Gavra (2017) Convergence rate of a simulated annealing algorithm with noisy observations. ArXiv e-prints. External Links: 1703.00329 Cited by: §1.
  • M.R. Brito, E.L. Chavez, A.J. Quiroz, and J.E. Yukich (1997) Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters 35 (1), pp. 33–42. Cited by: Appendix A.
  • S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári (2009) Online optimization in -armed bandits. In NIPS, Cited by: §1.1.
  • A. Carpentier and M. Valko (2015) Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, Cited by: §1.1.
  • J. Chen, H. Fang, and Y. Saad (2009) Fast approximate knn graph construction for high dimensional data via recursive lanczos bisection. Journal of Machine Learning Research 10, pp. 1989–2012. Cited by: Appendix A.
  • P. R. Christian and G. Casella (1999) Monte carlo statistical methods. Springer New York. Cited by: §1.
  • M. Connor and P. Kumar (2010) Fast construction of k-nearest neighbor graphs for point clouds. IEEE Transactions on Visualization and Computer Graphics 16 (4), pp. 599–608. Cited by: Appendix A.
  • W. Dong, M. Charikar, and K. Li (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, Cited by: Appendix A.
  • D. Eppstein, M. S. Paterson, and F. F. Yao (1997) On nearest-neighbor graphs. Discrete & Computational Geometry 17 (3), pp. 263–282. Cited by: Appendix A.
  • E. Even-Dar, S. Mannor, and Y. Mansour (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR 7, pp. 1079–1105. Cited by: §1.1, §1.
  • J. Grill, M. Valko, and R. Munos (2015) Black-box optimization of noisy functions with unknown smoothness. In NIPS, Cited by: §1.1.
  • K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang (2011) Fast approximate nearest-neighbor search with k-nearest neighbor graph. In IJCAI, Cited by: Appendix A.
  • K. Jamieson and R. Nowak (2014) Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems, CISS 2014, pp. 1–6. Cited by: §1.1.
  • E. Kaufmann, O. Cappé, and A. Garivier (2016) On the complexity of best-arm identification in multi-armed bandit models. JMLR 17, pp. 1–42. Cited by: §1.1.
  • S. Mannor and J. N. Tsitsiklis (2004) The sample complexity of exploration in the multi-armed bandit problem. JMLR 5, pp. 623–648. Cited by: §1.1, §1.
  • G. L. Miller, S. Teng, W. Thurston, and S. A. Vavasis (1997) Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM 44 (1), pp. 1–29. Cited by: Appendix A.
  • E. Plaku and L. E. Kavraki (2007) Distributed computation of the knn graph for large high-dimensional point sets. Journal of Parallel and Distributed Computing 67 (3), pp. 346–359. Cited by: Appendix A.
  • R.Y. Rubinstein (1997) Optimization of computer simulation models with rare events. European Journal of Operations Research (99), pp. 89–112. Cited by: §1.
  • M. Sreenivas Pydi, V. Jog, and P.-L. Loh (2018) Graph-based ascent algorithms for function maximization. ArXiv e-prints. External Links: 1802.04475 Cited by: §1.
  • N. Srinivas, A. Krause, S. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML, Cited by: §1.
  • M. Valko, R. Munos, B. Kveton, and T. Kocák (2014) Spectral bandits for smooth graph functions. In International Conference on Machine Learning, Cited by: §1.1, §3.
  • J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li (2012) Scalable k-nn graph construction for visual descriptors. In CVPR, Cited by: Appendix A.

References

  • Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011) Improved algorithms for linear stochastic bandits. In NIPS, Cited by: §1.1.
  • S. Arya and D. M. Mount (1993) Algorithms for fast vector quantization. In IEEE Data Compression Conference, Cited by: Appendix A, §1.2.
  • J. Audibert and S. Bubeck (2010) Best arm identification in multi-armed bandits. In COLT - 23th Conference on Learning Theory - 2010, Haifa, Israel. Cited by: §1.1, §1, §2.1, §3, Theorem 1.
  • P. Auer (2002) Using confidence bounds for exploitation-exploration trade-offs. JMLR (3), pp. 397–422. Cited by: §1.
  • C. Bouttier and I. Gavra (2017) Convergence rate of a simulated annealing algorithm with noisy observations. ArXiv e-prints. External Links: 1703.00329 Cited by: §1.
  • M.R. Brito, E.L. Chavez, A.J. Quiroz, and J.E. Yukich (1997) Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters 35 (1), pp. 33–42. Cited by: Appendix A.
  • S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári (2009) Online optimization in X-armed bandits. In NIPS, Cited by: §1.1.
  • A. Carpentier and M. Valko (2015) Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, Cited by: §1.1.
  • J. Chen, H. Fang, and Y. Saad (2009) Fast approximate knn graph construction for high dimensional data via recursive lanczos bisection. Journal of Machine Learning Research 10, pp. 1989–2012. Cited by: Appendix A.
  • P. R. Christian and G. Casella (1999) Monte carlo statistical methods. Springer New York. Cited by: §1.
  • M. Connor and P. Kumar (2010) Fast construction of k-nearest neighbor graphs for point clouds. IEEE Transactions on Visualization and Computer Graphics 16 (4), pp. 599–608. Cited by: Appendix A.
  • W. Dong, M. Charikar, and K. Li (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, Cited by: Appendix A.
  • D. Eppstein, M. S. Paterson, and F. F. Yao (1997) On nearest-neighbor graphs. Discrete & Computational Geometry 17 (3), pp. 263–282. Cited by: Appendix A.
  • E. Even-Dar, S. Mannor, and Y. Mansour (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR 7, pp. 1079–1105. Cited by: §1.1, §1.
  • J. Grill, M. Valko, and R. Munos (2015) Black-box optimization of noisy functions with unknown smoothness. In NIPS, Cited by: §1.1.
  • K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang (2011) Fast approximate nearest-neighbor search with k-nearest neighbor graph. In IJCAI, Cited by: Appendix A.
  • K. Jamieson and R. Nowak (2014) Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems, CISS 2014, pp. 1–6. Cited by: §1.1.
  • E. Kaufmann, O. Cappé, and A. Garivier (2016) On the complexity of best-arm identification in multi-armed bandit models. JMLR 17, pp. 1–42. Cited by: §1.1.
  • S. Mannor and J. N. Tsitsiklis (2004) The sample complexity of exploration in the multi-armed bandit problem. JMLR 5, pp. 623–648. Cited by: §1.1, §1.
  • G. L. Miller, S. Teng, W. Thurston, and S. A. Vavasis (1997) Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM 44 (1), pp. 1–29. Cited by: Appendix A.
  • E. Plaku and L. E. Kavraki (2007) Distributed computation of the knn graph for large high-dimensional point sets. Journal of Parallel and Distributed Computing 67 (3), pp. 346–359. Cited by: Appendix A.
  • R.Y. Rubinstein (1997) Optimization of computer simulation models with rare events. European Journal of Operational Research 99, pp. 89–112. Cited by: §1.
  • M. Sreenivas Pydi, V. Jog, and P.-L. Loh (2018) Graph-based ascent algorithms for function maximization. ArXiv e-prints. External Links: 1802.04475 Cited by: §1.
  • N. Srinivas, A. Krause, S. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML, Cited by: §1.
  • M. Valko, R. Munos, B. Kveton, and T. Kocák (2014) Spectral bandits for smooth graph functions. In International Conference on Machine Learning, Cited by: §1.1, §3.
  • J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li (2012) Scalable k-nn graph construction for visual descriptors. In CVPR, Cited by: Appendix A.

Appendix A More Details for Experiments

First, we explain the graph-based nearest neighbor search. Let k be a positive integer, and let G be a proximity graph constructed on the dataset D in an offline phase, i.e. D is the set of nodes of G, and each point in D is connected to its k nearest neighbors with respect to some distance metric d. In our experiments, we use the Euclidean metric. Given the graph G and a query point q, the problem reduces to minimizing the function f(x) = d(x, q) over the nodes of G. The algorithm is shown in Figures 12 and 13. The simulated annealing subroutine runs for a fixed number of iterations; in our experiments, the number of rounds depends on the size of the training set. See Figure 13 for the pseudo-code. The SGNN method runs the simulated annealing procedure several times and returns the best outcome of these runs; the resulting algorithm with random restarts is shown in Figure 12. The above algorithm returns an approximate nearest neighbor point. To find k approximate nearest neighbors of q, we simply return the best k elements in the last line of Figure 12. We use the approximate nearest neighbors to predict a class for each given query. We construct a directed graph by connecting each node to its k closest nodes in Euclidean distance. For smoothing, we tried random walks of different lengths ℓ. This means that, to evaluate a node, we run a random walk of length ℓ from that node and return the observed value at the stopping point as an estimate of the value of the node. This operation smoothens the function and generally improves the performance. The SGNN method with ℓ = 1 is denoted by SGNN(1), and SGNN with ℓ = 0, i.e. pure simulated annealing on the graph, is denoted by SGNN(0). For the SGNN algorithm, the number of rounds in each restart is fixed.

  Input: Number of random restarts R, number of hill-climbing steps T, length of random walks ℓ.
  Initialize the candidate set S ← ∅.
  for r = 1, ..., R do
     Initialize a random point x in the node set of G.
     Run the Smoothed-Simulated-Annealing subroutine (Figure 13) from x with T steps and walk length ℓ, and add its output to S.
  end for
  Return the best element in S.

Figure 12: The Optimization Method with Random Restarts

  Input: Starting point x, number of hill-climbing steps T, length of random walks ℓ.
  for t = 1, ..., T do
     Perform a random walk of length ℓ from x. Let y be the stopping state.
     Let u be the observed value at y.
     Let x' be a neighbor of x chosen uniformly at random.
     Perform a random walk of length ℓ from x'. Let y' be the stopping state.
     Let u' be the observed value at y'.
     if u' ≤ u then
        Update x ← x'.
     else
        Set the temperature τ_t according to the cooling schedule.
        With probability exp(-(u' - u)/τ_t), update x ← x'.
     end if
  end for
  Return x.

Figure 13: The Smoothed-Simulated-Annealing Subroutine
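As a concrete illustration, the procedure of Figures 12 and 13 can be sketched in Python on a toy graph. The graph, its node values, the noise level, the logarithmic cooling schedule, and all names below are our own assumptions for the sketch, not the paper's exact choices.

```python
import math
import random

random.seed(0)

# Toy problem (our own): a cycle graph whose true node values form a single
# basin with its minimum at node 0; every observation carries Gaussian noise.
N = 50
graph = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}
true_value = {i: min(i, N - i) for i in range(N)}

def observe(node, walk_len=0):
    """Noisy evaluation; a random walk of length walk_len smooths the estimate."""
    for _ in range(walk_len):
        node = random.choice(graph[node])
    return true_value[node] + random.gauss(0, 0.1)

def smoothed_annealing(start, steps, walk_len, temp=1.0):
    """Sketch of the Figure 13 subroutine: simulated annealing with
    random-walk smoothing (cooling schedule is an assumption)."""
    x = start
    for t in range(1, steps + 1):
        u = observe(x, walk_len)
        x_new = random.choice(graph[x])
        u_new = observe(x_new, walk_len)
        tau = temp / math.log(t + 1)  # logarithmic cooling (our choice)
        if u_new <= u or random.random() < math.exp(-(u_new - u) / tau):
            x = x_new
    return x

def sgnn(restarts, steps, walk_len):
    """Sketch of Figure 12: run the subroutine from random starts and keep
    the best outcome (for brevity, restarts are compared by true value here;
    the paper would compare noisy estimates)."""
    candidates = [smoothed_annealing(random.randrange(N), steps, walk_len)
                  for _ in range(restarts)]
    return min(candidates, key=lambda v: true_value[v])

best = sgnn(restarts=5, steps=200, walk_len=1)
```

With a handful of restarts, the returned node lands close to the true minimum of the basin despite the observation noise.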

The graph-based nearest neighbor search has been studied by Arya and Mount (1993), Brito et al. (1997), Eppstein et al. (1997), Miller et al. (1997), Plaku and Kavraki (2007), Chen et al. (2009), Connor and Kumar (2010), Dong et al. (2011), Hajebi et al. (2011), and Wang et al. (2012). In the worst case, construction of the proximity graph has complexity quadratic in the size of the dataset, but this is an offline operation. The choice of k impacts both the prediction accuracy and the computational complexity: a smaller k means lighter training-phase computation and heavier test-phase computation (as we need more random restarts to achieve a given prediction accuracy). A very large k also makes the test-phase computation heavy.
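For context, the quadratic offline step mentioned above can be sketched as a brute-force k-nearest-neighbor graph construction; the toy data and parameter values are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 5))  # toy dataset (our own)
k = 3

# Brute force: all pairwise Euclidean distances, O(n^2) in the dataset size.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)         # a point is not its own neighbor
knn = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest points

# Directed proximity graph: each node points to its k nearest neighbors.
adjacency = {i: [int(j) for j in knn[i]] for i in range(len(points))}
```

Approximate constructions (e.g. Chen et al. 2009, Dong et al. 2011) avoid the quadratic cost, which matters once the dataset is large.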

We used the MNIST and COIL-100 datasets, which are standard image classification benchmarks. The MNIST dataset is a black-and-white image dataset consisting of 60000 training images and 10000 test images in 10 classes; each image is 28x28 pixels. The COIL-100 dataset is a colored image dataset consisting of 100 objects, with 72 images of each object taken at every 5 degrees of rotation; each image is 128x128 pixels. We use 80% of the images for training and 20% for testing.

For the LSH and KDTree algorithms, we use the implementations in the scikit-learn library with the following parameters. For LSH, we use LSHForest with min_hash_match=4, n_candidates=50, n_estimators=50, n_neighbors=50, radius=1.0, radius_cutoff_ratio=0.9. For KDTree, we use leaf_size=1 and k=50, meaning that the indices of the 50 closest neighbors are returned. The KDTree method always significantly outperforms LSH. For SGNN, we pick the number of restarts so that all methods have similar prediction accuracy.
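A minimal sketch of the KDTree baseline with the settings above, assuming the standard scikit-learn API; the toy data and query are our own.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))   # toy training set (our own)
query = rng.normal(size=(1, 10))

# leaf_size=1 as in the experiments; query the 50 closest neighbors.
tree = KDTree(X, leaf_size=1)
dist, ind = tree.query(query, k=50)  # distances and indices, sorted ascending
```

The returned `ind` array holds the indices of the 50 nearest training points, which is what the classifier consumes downstream.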

Figure 22 (a-d) shows the accuracy of different methods on different portions of the COIL-100 dataset. As the size of the training set increases, the prediction accuracy of all methods improves. Figure 22 (e-h) shows that the test-phase runtime of the SGNN method grows more modestly for larger datasets; in contrast, KDTree becomes much slower as the training set grows. When using all training data, the proposed method has roughly the same accuracy as KDTree while requiring less than 50% of its test-phase runtime. We also measured the prediction accuracy of the exact nearest neighbor search with the full data and with 3/4, 1/2, and 1/4 of the data (the error bands are 95% bootstrapped confidence intervals).

Figure 22: Prediction accuracy and running time of different methods on COIL-100 dataset as the size of training set increases. (a,e) Using 25% of training data. (b,f) Using 50% of training data. (c,g) Using 75% of training data. (d,h) Using 100% of training data.

Next, we study how the performance of SGNN changes with the length of the random walks. We choose ℓ = 2 and compare different methods on the same datasets. The results are shown in Figure 27. The SGNN(2) method outperforms the competitors. Interestingly, SGNN(2) also outperforms the exact nearest neighbor algorithm on the MNIST dataset. This result might appear counter-intuitive, but we explain it as follows. Given that we use a simple metric (Euclidean distance), the exact k nearest neighbors are not necessarily appropriate candidates for making a prediction: although the exact nearest neighbor algorithm finds the global minima, the neighbors of the global minima on the graph might have large values. On the other hand, the SGNN(2) method finds points that have small values and whose neighbors also have small values. This stability acts as an implicit regularization in the SGNN(2) algorithm, leading to improved performance.
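The implicit-regularization effect described above can be illustrated on a toy graph: a sharp minimum with high-valued neighbors wins without smoothing, while a stable, slightly higher minimum wins once values are estimated via length-2 random walks. The graph, values, and sample counts below are our own assumptions.

```python
import random

random.seed(1)

# Toy graph (our own): node 0 is a sharp minimum surrounded by large values;
# node 5 is a slightly higher but stable minimum with low-valued neighbors.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 5: [6, 7], 6: [5, 7], 7: [5, 6]}
value = {0: 0.0, 1: 10.0, 2: 10.0, 5: 1.0, 6: 1.5, 7: 1.5}

def smoothed(node, walk_len, samples=20000):
    """Monte Carlo estimate of the smoothed value: average observed value
    at the endpoint of a random walk of length walk_len."""
    total = 0.0
    for _ in range(samples):
        x = node
        for _ in range(walk_len):
            x = random.choice(graph[x])
        total += value[x]
    return total / samples

# Without smoothing the sharp minimum looks best; with length-2 walks the
# stable minimum does, since half the walks from node 0 end on value 10.
raw = (smoothed(0, 0), smoothed(5, 0))
smooth = (smoothed(0, 2), smoothed(5, 2))
```

Under this smoothing, node 0's estimate rises to about 5 while node 5's stays near 1.25, so the stable basin is preferred, which mirrors the behavior we observe for SGNN(2).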

Figure 27: Prediction accuracy and running time of the SGNN method with random walks of length two. (a) Accuracy on MNIST dataset using 100% of training data. (b) Running time on MNIST dataset using 100% of training data. (c) Accuracy on COIL-100 dataset using 100% of training data. (d) Running time on COIL-100 dataset using 100% of training data.

These results show the advantages of using graph-based nearest neighbor algorithms: as the size of the training set increases, the proposed method is much faster than KDTree.