Efficient and Scalable Batch Bayesian Optimization Using K-Means

06/04/2018 ∙ by Matthew Groves, et al. ∙ IBM ∙ University of Warwick

We present K-Means Batch Bayesian Optimization (KMBBO), a novel batch sampling algorithm for Bayesian Optimization (BO). KMBBO uses unsupervised learning to efficiently estimate peaks of the model acquisition function. We show in empirical experiments that our method outperforms the current state-of-the-art batch allocation algorithms on a variety of test problems including tuning of algorithm hyper-parameters and a challenging drug discovery problem. In order to accommodate the real-world problem of high dimensional data, we propose a modification to KMBBO by combining it with compressed sensing to project the optimization into a lower dimensional subspace. We demonstrate empirically that this 2-step method is competitive with algorithms where no dimensionality reduction has taken place.


1 Introduction

Bayesian optimization (BO) is a popular framework for the optimization of black-box functions, where the analytic form of the function being optimized is unknown or too expensive to evaluate. BO has found extensive use for the optimization of machine learning algorithms [Snoek, Larochelle, and Adams2012], and for the experimental design of complex systems.

In its native form, BO is a sequential optimization procedure, since new information is required to update the posterior, and therefore the acquisition function. For many of the emerging uses of BO this is a severe limitation: due to the size of the optimization problem, data must be acquired in a highly parallel manner in order for the optimization to be completed in a relevant time frame. Several methods for parallelizing the BO process have been proposed, and are reviewed in Section 2. It is important to point out that there are two separate, yet complementary, approaches to the parallel BO problem. One is to minimize the strict number of function evaluations, typically achieved by a dynamically allocated batch size, and the other is to minimize the number of epochs for a given batch size. Whilst there are situations in which both are valid, the focus of this paper is to minimize the number of epochs, since there are a number of situations in which a fixed batch size is required; for example, in the screening of potential pharmaceutical compounds there is a pre-determined number of 'slots' in which compounds can be tested.

2 Related work

Ginsbourger, Le Riche, and Carraro generalize expected improvement (EI) to the batch setting, proposing the qEI acquisition function for batches of q points [Ginsbourger, Le Riche, and Carraro2007]. Unfortunately, identifying the points that jointly maximize qEI is difficult, as the computational cost of evaluating the function and its derivative scales poorly with increasing q.

Several works have suggested heuristic approaches for approximating qEI (see for example [Snoek, Larochelle, and Adams2012], [Chevalier and Ginsbourger2013], [Wang et al.2016]). One popular qEI-based method is the Constant Liar (CL) approach of Ginsbourger, Le Riche, and Carraro [Ginsbourger, Le Riche, and Carraro2010]. CL is a sequential batch building method, based on iteratively adding the point that maximizes the single-point acquisition function under the assumption that evaluating this point will return a particular constant 'lie' value, temporarily augmenting the model training set with these synthetic values and refitting the GP.

In recent work, González et al. propose an alternative batching method that approximates the repulsive effect between batch members [González et al.2016]. Under a Gaussian process prior, target values of nearby points in sample space are expected to be highly correlated. Thus, when choosing a batch of samples, we may wish for the batch members to be sufficiently far apart to maximize the information gained. To do this, the authors propose the Local Penalization (LP) method, which sequentially assembles batches of samples by successively penalizing the acquisition function around previously selected points, using a penalization radius based on the estimated Lipschitz constant of the acquisition function surface.

The above methods all take a greedy sequential approach to batch building, iteratively adding points to the batch that maximize a particular criterion, such as the locally-penalized acquisition function. In contrast, Hernández-Lobato et al. propose a fully parallel batch sampling technique using Thompson sampling [Thompson1933], in which the posterior is sampled to generate a 'panel of experts' which are then polled in parallel as to which data point should be acquired [Hernández-Lobato et al.2017].

Shah and Ghahramani suggest Parallel Predictive Entropy Search (PPES), a non-greedy batch sampling approach that aims to maximize the expected information gain from sampling the chosen batch, in terms of the expected reduction in differential entropy of the predictive distribution of the global maximizer given the sampled data [Shah and Ghahramani2015].

Nguyen et al. propose a novel batch selection method called Budgeted Batch Bayesian Optimization (B3O), which aims to build sample batches containing the peaks of the acquisition function [Nguyen et al.2017]. To find these peaks, whilst avoiding costly optimization routines, the authors propose a generalized slice sampling procedure. Slice sampling preferentially accepts samples from high-density regions of the acquisition function surface, allowing peaks to be reliably estimated even with modest numbers of samples. Peak picking is then done using an Infinite Gaussian Mixture Model (IGMM) [Rasmussen2000]. The authors show empirically that B3O performs well on a variety of test functions and common BO applications, such as hyperparameter tuning. However, the inability of B3O to allow fixed batch sizes is a potential limitation, as real-world applications of batch BO can have an effective constraint on the possible batch size; for example, the number of available compute nodes (simulation), the number of different molecules that a robotic assay can test simultaneously (drug discovery), or the quantity of samples that can fit in a furnace (alloy hardening). Under-utilizing the available resources with smaller batch sizes costs information that could be gained at little additional cost, whereas choosing to allocate too many samples to a batch may be impossible.

3 Proposed Method

In the BO formalism, the target function is not directly optimized. In its place, an acquisition function is constructed using a probabilistic model built upon previously determined values of the target function. Typically this model is a Gaussian process (GP) [Rasmussen and Williams2004], although other models, including neural networks, have been used [Snoek et al.2015].

There are many different versions of the acquisition function, depending upon the type of optimization task being performed, but the most commonly used is expected improvement, EI [Mockus1974], which is determined as follows:

$$\mathrm{EI}(x) = I(x)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{I(x)}{\sigma(x)} \tag{1}$$

where $\Phi$ denotes the CDF (cumulative distribution function) of the standard normal distribution, $\phi$ denotes the PDF (probability density function) of the standard normal distribution, and $I(x)$ denotes the improvement, which can be expressed as:

$$I(x) = \mu(x) - f(x^{+}) \tag{2}$$

where $f(x^{+})$ is the best target value observed so far, $\mu(x)$ is the predicted mean, and $\sigma^{2}(x)$ is the corresponding variance.
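For illustration, a minimal sketch of this computation, assuming a maximization problem and a surrogate that returns a predictive mean and standard deviation (the helper name and the small numerical guard are our own, not part of the original method):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Expected improvement (Eqs. 1-2) for maximization.

    mu, sigma : predictive mean and standard deviation of the GP at candidate points
    f_best    : best target value observed so far, f(x+)
    """
    sigma = np.maximum(sigma, 1e-12)        # guard against zero predictive variance
    improvement = mu - f_best               # Eq. 2
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)   # Eq. 1
```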

At its core, this procedure is inherently serial, as it is based upon the updating of a probabilistic model, and is thus limited by data acquisition. Our contribution is twofold: first, we propose a novel parallel (or batch) Bayesian optimization procedure based upon K-means, K-Means Batch Bayesian Optimization (KMBBO); second, we propose a modification of this method for very high-dimensional data using a dimensionality reduction step based upon compressed sensing.

The central aim of KMBBO is to efficiently select a batch of high-quality points to evaluate, i.e., during each sampling epoch we would like our batch to contain points from high-density regions of the acquisition function. However, modeling the landscape of the acquisition function directly is generally intractable, except in very low dimensions. In order to approximately learn the locations of its peaks, we fit a K-Means clustering model to a collection of points in our sample space chosen using slice sampling. Slice sampling draws samples uniformly from the volume under the acquisition function, and so will preferentially select samples from regions where the acquisition function value is highest.

K-Means [MacQueen1967] is one of the simplest and most commonly used clustering methods. Given a set of points $X = \{x_1, \ldots, x_n\}$ and a number of clusters $k$, the K-Means method attempts to find a partition $S = \{S_1, \ldots, S_k\}$ of $X$ that minimizes the within-cluster sum of squared distances between cluster members and the cluster centroid, i.e.:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2} \tag{3}$$

where $\mu_i$ represents the centroid of cluster $S_i$. Thus, KMBBO allows the user to specify the batch size directly as the number of clusters for the K-Means method.

Input: Sampling domain $\mathcal{X}$, initial samples $D_0$, batch size $k$, # epochs $N$, # slice samples $m$
For $t = 1$ to $N$:
  1. Fit GP model to training data $D_{t-1}$
  2. Collect $m$ slice samples from the acquisition function
  3. Fit a K-Means model with $k$ clusters to the slice samples to obtain centroids $c_1, \ldots, c_k$
  4. Sample the centroids: $y_i = f(c_i)$, $i = 1, \ldots, k$
  5. Add the newly observed values to the dataset: $D_t = D_{t-1} \cup \{(c_i, y_i)\}_{i=1}^{k}$
End for
Return $D_N$
Table 1: K-Means Batch Bayesian Optimization (KMBBO)
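A minimal Python sketch of this loop is given below, assuming a scikit-learn Gaussian process surrogate and a placeholder `slice_sample_acquisition` routine standing in for the BGSS step described next; the function names and signatures are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def kmbbo(f, X_init, y_init, batch_size, n_epochs, slice_sample_acquisition):
    """Illustrative sketch of the KMBBO loop in Table 1.

    f                        : expensive black-box function, accepts a 2-D array of points
    X_init, y_init           : initial design points and their observed values
    slice_sample_acquisition : callable returning points sampled under the acquisition
                               function surface (a stand-in for the BGSS step)
    """
    X, y = np.asarray(X_init, dtype=float), np.asarray(y_init, dtype=float)
    for _ in range(n_epochs):
        # 1. Fit the GP surrogate to all data observed so far
        gp = GaussianProcessRegressor(
            kernel=RBF(length_scale=np.ones(X.shape[1])), normalize_y=True).fit(X, y)
        # 2. Draw slice samples from the acquisition function (e.g. EI)
        samples = slice_sample_acquisition(gp, y.max(), n_samples=200)
        # 3. Cluster the slice samples; the k centroids form the next batch
        centroids = KMeans(n_clusters=batch_size).fit(samples).cluster_centers_
        # 4-5. Evaluate the batch and augment the training set
        X = np.vstack([X, centroids])
        y = np.concatenate([y, f(centroids)])
    return X, y
```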

To collect its slice samples, KMBBO utilizes the batch generalized slice sampling (BGSS) method described in [Nguyen et al.2017], where the joint density is defined as

$$p(x, y) = \begin{cases} \frac{1}{Z_{\alpha}} & \text{if } 0 \leq y \leq \alpha(x) - L \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $Z_{\alpha}$ is a normalizing constant and $L = \min_x \alpha(x)$ is obtained through minimization using a non-convex global optimizer, thus not requiring the acquisition function $\alpha$ to be non-negative or a proper distribution. However, like standard slice sampling, BGSS scales poorly with the dimensionality of the sampling domain [Neal2003], making it impractical for use in high-dimensional settings. To address this we add a dimensionality reduction step based upon compressed sensing. Our use of the compressed sensing methodology is based upon the observation that most high-dimensional data follows a sparse encoding and is thus compressible. In the compressed sensing scheme, the aim is to reconstruct a signal using the smallest possible number of observations (which are linear functions of the components of the signal). This is achieved by solving the basis pursuit problem, in which we search for the sparsest matrix $s$ which can reconstruct the full matrix $B$:

$$\hat{s} = \underset{s}{\arg\min}\, \lVert s \rVert_{1} \quad \text{subject to} \quad y = \Phi \Psi s \tag{5}$$

where $\Psi$ is the change-of-basis matrix (so that $B = \Psi s$) and $y$ is a set of randomly measured entries of the matrix $B$, obtained via the measurement matrix $\Phi$.

We apply this method to the original feature space of a high dimensional problem, but instead of using the sparse solution to reconstruct the original function, we instead use it as a compressed basis in which to perform the BO sampling.

The upper bound on the lossless dimensionality reduction which can be achieved using compressed sensing is thus equivalent to the number of measurements $m$ required for compressed sensing to perfectly recover $B$ from $y$, which has been shown to scale as follows [Candès and Wakin2008]:

$$m \geq C\,\mu^{2}(\Phi, \Psi)\,S\,\log n \tag{6}$$

where $n$ is the original number of features, $S$ is the number of non-zero elements, $C$ is a positive constant, and $\mu$ is the coherence between the measurement and representation bases, which in general ranges from $1$ to $\sqrt{n}$.

Input: Domain $\mathcal{X}$, compression error tolerance $\epsilon$, batch size $k$, # epochs $N$, # slice samples $m$
1. Draw samples from $\mathcal{X}$
2. Compress the domain using TwIST to obtain a lower-dimensional basis
3. Run KMBBO in the compressed space
4. Decompress the selected samples back into the original space
Table 2: KMBBO with compressed sensing (CS-KMBBO)

Whilst some other methods, such as REMBO [Wang et al.2013], have used a compressive scheme, the exact dimensionality of the compression was left as a parameter to tune. In CS-KMBBO, we instead use the Two-step Iterative Shrinkage/Thresholding (TwIST) [Bioucas-Dias and Figueiredo2007] optimization technique - a variant of the popular Iterative Shrinkage/Thresholding algorithm (IST) which is more robust to ill-defined measurements - to determine the optimal dimensionality of the compression step. Whilst it is possible in some discrete problems, such as the drug discovery challenge tackled within this paper, to know the entire space of inputs, we recognize that this is not always the case. We therefore sample 1,000 data points on which to perform the TwIST-based dimensionality optimization, creating a procedure which is transferable between discrete and continuous spaces.
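TwIST itself is not commonly packaged for Python, so the following is only a rough sketch of the dimensionality-selection idea: it uses a random Gaussian measurement matrix and scikit-learn's Lasso as an L1 (basis-pursuit-style) stand-in for TwIST, and returns the smallest compressed dimension that reconstructs the sampled points within a tolerance. All names, the solver choice, and the error criterion are illustrative assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

def choose_compressed_dim(X, eps=0.05, seed=0):
    """Smallest projection dimension m that reconstructs the sampled points X
    (n_samples x d) within relative error eps; a stand-in for the TwIST step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    for m in range(1, d + 1):
        A = rng.normal(size=(m, d)) / np.sqrt(m)          # random measurement matrix
        Y = X @ A.T                                       # compressed measurements
        # L1 recovery of each sample: approximately min ||s||_1  s.t.  A s = y
        X_rec = np.vstack([Lasso(alpha=1e-3, max_iter=5000).fit(A, y).coef_ for y in Y])
        if np.linalg.norm(X - X_rec) <= eps * np.linalg.norm(X):
            return m, A                                   # compression to use for BO
    return d, np.eye(d)                                   # no lossless compression found
```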

4 Experiments

4.1 Comparison to Existing Methods

For this study we compare the performance of KMBBO to a range of currently used parallel BO methods, the details of which have been described in Section 2, using the Expected Improvement acquisition function, which has been shown to have strong theoretical guarantees [Vazquez and Bect2010] and empirical effectiveness [Snoek, Larochelle, and Adams2012]. In addition to naive qEI, the most basic parallel sampling method, we compare to Thompson sampling, Constant Liar (mean), Local Penalization, a batch predictive entropy search model to represent a non-greedy search strategy, and the dynamic batch method B3O. We investigate two metrics for success:

  1. The convergence of the search to the global minimum (where known) as a function of the number of epochs

  2. The robustness of the search, as demonstrated through 100 repeat runs of the sampling experiment.

For this study, a batch size of 8 was arbitrarily chosen. Throughout the study the Bayesian model was provided by a Gaussian process with a squared-exponential kernel with automatic relevance determination (ARD), as implemented in the Scikit-Learn library [Pedregosa et al.2011]:

$$k(x, x') = \sigma_f^{2} \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^{2}}{\ell_d^{2}} \right) \tag{7}$$

seeded with 10 randomly selected data points. The GP's hyperparameters were optimized using gradient descent on the marginal likelihood. Finally, both B3O and KMBBO used 200 slice samples when generating each batch, to maintain consistency with [Nguyen et al.2017].
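For reference, a surrogate of this form can be constructed in scikit-learn roughly as follows; the kernel bounds, restart count, and random seed are illustrative choices rather than values reported in the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

n_dims = 6  # e.g. the Hartmann-6 task
# One length-scale per input dimension gives the ARD squared-exponential kernel of Eq. 7
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(n_dims),
                                   length_scale_bounds=(1e-3, 1e3))
gp = GaussianProcessRegressor(kernel=kernel,
                              n_restarts_optimizer=5,  # gradient-based marginal-likelihood fit
                              normalize_y=True,
                              random_state=0)
# gp.fit(X_seed, y_seed) on the 10 random seed points, then
# mu, sigma = gp.predict(X_candidates, return_std=True) feeds the EI calculation above.
```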

4.2 Optimization Tasks

4.2.1 Synthetic Functions

We test the ability of KMBBO to find the global extrema of three synthetic functions commonly used for benchmarking machine learning algorithms: Branin-Hoo (2D), Camelback-6 (2D), and Hartmann (6D), as described in the Virtual Library of Simulation Experiments test function database [Surjanovic and Bingham].

4.2.2 SVM

A common use for Bayesian optimization is the tuning of hyperparameters for machine learning models. In order to test the effectiveness of KMBBO for this task, we use it to determine optimal hyperparameters for a support vector machine on the Abalone regression task [Nash and Laboratories1994]. In this context we tune three hyperparameters: $C$ (regularization parameter), $\epsilon$ (insensitive loss) for regression, and $\gamma$ (RBF kernel width). The loss function is the root mean squared error of the prediction.
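These three hyperparameters map directly onto the arguments of scikit-learn's RBF-kernel SVR; the objective evaluated at each point sampled by the optimizers might look like the sketch below (the cross-validation setup is an illustrative assumption).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def svm_objective(params, X, y):
    """RMSE of an RBF-kernel SVR for a candidate (C, epsilon, gamma) triple."""
    C, epsilon, gamma = params
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=gamma)
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    return np.sqrt(mse)  # the loss minimized by the BO methods on this task
```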

4.2.3 Drug Discovery

This task is taken to illustrate the utility of the procedure for lead identification in drug discovery, where rapid identification of desirable compounds at low cost is essential. The target for maximization is the pEC50, a value which describes the potency of the drug. The data, taken from hits of Plasmodium falciparum (P. falciparum) whole-cell screening, originates from the GlaxoSmithKline Tres Cantos Antimalarial Set (TCAMS), the Novartis-GNF Malaria Box data set, and St. Jude Children's Research Hospital's data set (EC50 in M against P. falciparum 3D7), as released through the Medicines for Malaria Venture website [mmv.org]. Each molecule was described using MACCS keys [Durant et al.2002] - a common cheminformatics descriptor - generated using the RDKit software [Landrum], resulting in a 167-dimensional optimization problem.
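For context, fingerprints of this kind can be generated with RDKit as follows; the example SMILES string is arbitrary.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, an arbitrary example molecule
fp = MACCSkeys.GenMACCSKeys(mol)                    # MACCS key fingerprint
features = np.array(list(fp))                       # the 167-dimensional binary descriptor
print(features.shape)                               # (167,)
```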

5 Results

Figure 1: The distribution of the 'first encounter time', i.e. when the global optimum is first located, for the Branin-Hoo function. Statistics are generated from 100 repeats of the experiment.

5.1 Synthetic Functions

For the Branin-Hoo function, we observe that both the Constant Liar and KMBBO methods are able to approach the minimum quickly, achieving low regret after only a few sampling epochs, with both B3O and Thompson sampling also reliably finding the optimum within 8 sampling epochs. LP, however, performs poorly, achieving similar regret to Naive qEI, with many iterations of both methods failing to discover the minimum after 10 epochs. The performance of LP relies heavily on the quality of the Lipschitz constant estimate, which is calculated over the entire sampling domain. For the Branin-Hoo function, this is dominated by the quartic term away from the function minima, leading to a Lipschitz constant estimate poorly suited to the region around the optimum. Figure 1 shows the 'first encounter time' of the global optimum for each method. We see that, even though the initial reduction in regret is similar for KMBBO and Constant Liar, KMBBO is able to locate the optimal value earlier and more consistently than the other methods. All methods perform well on the Camelback task, although we observe that KMBBO converges to the true minimum faster than the others.

Figure 2: The optimization performance for the 6-dimensional Hartmann function. Statistics are generated from 100 repeats of the experiment, and confidence intervals are calculated to 1 sigma.

The 6-dimensional Hartmann function is a more challenging optimization problem. We observe in Figure 2 that after 10 epochs the methods have still not managed to identify the global optimum. LP, B3O and KMBBO all perform well, achieving similar average regret values, but B3O and KMBBO perform more consistently, with lower variance on the regret obtained.

5.2 Tuning of Hyperparameters

KMBBO displays the best performance on the SVM hyperparameter tuning task, shown in Figure 4. With a low-dimensional sampling space, the slice sampling method used by B3O and KMBBO performs particularly well at approximating high-density regions of the acquisition function. Indeed, the violin plot in Figure 4 shows that, not only are KMBBO and B3O the best performers at minimizing RMSE, they also perform most consistently, with the smallest error variance.

Figure 4: Optimization performance for the tuning of the hyperparameters of an SVM, as displayed through the RMSE of the SVM on the abalone problem with respect to the number of epochs. Statistics are generated from 100 repeats of the experiment, and confidence intervals are bootstrapped to 1 sigma. Note that the batch-PES methodology is excluded from this plot, as its large variance over runs made interpretation of the performance of the other methodologies impossible. The results for this method can be seen in Table 3.
Method          Branin-Hoo (Regret ± Std.Dev)   Camelback-6 (Regret ± Std.Dev)   Hartmann (Regret ± Std.Dev)   SVM (RMSE ± Std.Dev)     Malaria (Regret ± Std.Dev)
Naive qEI       0.803 ± 1.47                    0.0276 ± 0.0974                  1.74 ± 0.700                  1.9453 ± 0.00170         3.1185 ± 0.8392
Thompson        0.00619 ± 0.00186               0.0727 ± 0.179                   1.33 ± 0.438                  1.9430 ± 0.00125         3.0000 ± 1.0888
Constant Liar   0.00584 ± 0.00129               0.0778 ± 0.207                   1.70 ± 0.511                  1.9430 ± 0.000802        X
LP              0.637 ± 1.28                    0.0292 ± 0.0947                  0.916 ± 0.673                 1.9441 ± 0.00105         X
KMBBO           0.00523 ± 0.000488              0.0354 ± 0.0616                  0.922 ± 0.311                 1.9416 ± 0.000577        2.3802 ± 1.4003
B3O             0.00591 ± 0.00170               0.130 ± 0.338                    0.882 ± 0.320                 1.9422 ± 0.000580        X
Batch PES       0.5486 ± 0.3682                 0.1619 ± 0.0967                  1.4257 ± 0.4736               1.9406 ± 0.7322          3.2626 ± 0.7322
Table 3: Final performance (mean ± standard deviation over 100 repetitions) after 10 sampling epochs for each method on each of the test problems. Values are regret, except for the SVM task, where RMSE is reported. For the Malaria task, an X indicates that the method was not run, due to computational intractability or algorithmic instability.

5.3 Drug Discovery

The high-dimensional nature of the drug discovery task presented significant challenges to several of the benchmark methods. In 167 dimensions, the slice sampling method used by B3O is unable to produce any reasonable approximation of the acquisition function surface with the original sampling budget of 200, and we found that the substantial increase in samples required led to prohibitively long running times. The LP method was hamstrung by the computational cost of approximating the Lipschitz constant in this high-dimensional space. Furthermore, the Constant Liar methodology is reliant upon a high-quality model, and is thus very sensitive to hyperparameter selection and to the addition of reasonable-quality pseudo-inputs. Unfortunately, during our testing of this method on the drug discovery problem, a large number of runs failed because the GP model did not converge during fitting, and it is thus excluded from the results.

Of the remaining methods, Thompson sampling, qEI, batch-PES, and CS-KMBBO can be used for this task. Figure 5 shows that KMBBO displays strong performance, reaching low regret after 10 sampling epochs, having sampled only 90 out of approximately 19,000 candidate molecules. Thompson sampling, qEI and batch-PES display similar behaviors, discovering a local maximum on the potency landscape, but none is able to discover molecules with as low regret as KMBBO. It is worth noting that, due to the discrete nature of the search space, regret here is not a continuous function; for example, a regret of 2 places a molecule within the top 0.7% of values for the task.

Figure 5: Optimization performance for the Malaria drug discovery problem, as displayed through instantaneous regret with respect to the number of epochs. Statistics are generated from 10 repeats of the experiment, and confidence intervals are bootstrapped to 1 sigma.

5.4 Rankings

One way to measure the robustness of a search method is to compare its rankings over the whole range of tasks performed in this study. Since raw rankings can be misleading (a close second ranks the same as a second place in which the gap between methods was much wider), we instead use a normalized ranking, $Z$, proposed in [Jasrasaria and Pyzer-Knapp2018]:

$$Z = \frac{s - s_{\text{best}}}{s_{\max} - s_{\min}} \tag{8}$$

where $s$ represents the result of a particular strategy, $s_{\text{best}}$ the result of the best strategy, and $s_{\max}$ and $s_{\min}$ represent the range of results encountered in the study. This results in a score bounded between 0 and 1, where 0 represents a perfect performance across tasks.

We calculate $Z$ both for the performance of the optimization task, as measured by regret or RMSE where appropriate, and for the variance of the task as measured over multiple runs.
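A minimal sketch of this normalization, assuming the results are arranged with one row per strategy and one column per task (lower values being better); the aggregation across tasks shown in the comment is an illustrative choice:

```python
import numpy as np

def normalized_scores(results):
    """Eq. 8 applied column-wise: 0 marks the best strategy on each task."""
    best = results.min(axis=0)                           # s_best per task
    spread = results.max(axis=0) - results.min(axis=0)   # s_max - s_min per task
    return (results - best) / spread

# Per-strategy Z score, aggregated (here averaged) across tasks:
# Z = normalized_scores(results).mean(axis=1)
```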

Figure 6: Z score calculated for each of the parallel optimization strategies investigated in this study.

It can easily be seen from Figure 6 that KMBBO achieves a significantly better Z score than any other method for pure optimization performance, and also achieves the best Z score for variance, albeit by a smaller margin. This demonstrates both the class-leading nature of KMBBO and its strong reproducibility; a property which is key for Bayesian optimization, where each data point is expensive to acquire and the reliability of a methodology is therefore strongly desired.

5.5 Computational Cost

We have analyzed the complexity of the rate-limiting step for each of the methods used in this work, and performed additional empirical experiments examining real-world running times. The poor scaling of slice sampling with dimensionality is shared by the B3O method, and is worse than the scaling of the LP method. We address this in CS-KMBBO through the incorporation of compressed sensing for dimensionality reduction. Even when compression is not required, our empirical timings, shown in Figure 7, indicate that the runtime per sampling epoch for KMBBO is generally significantly smaller than for B3O, which we attribute to the simplicity and scalability of the K-Means algorithm compared to the IGMM used in B3O. However, it is worth keeping in mind that in the Bayesian optimization framework it is generally assumed that obtaining ground truth values by sampling the black-box function is substantially more expensive (in time or computational cost) than the calculation of the sampling batch. This somewhat mitigates concerns about the computational cost of the sampling methodology, since an expensive yet efficient sampling scheme will have a lower real-world cost than an inefficient yet fast alternative.

Figure 7: Runtimes of KMBBO and B3O in 2D and 6D. Runtime is calculated as seconds per sampling epoch.

5.6 Algorithmic Insight

In this section we discuss the different characteristics of the sampling methods by analyzing their sample selections for an easy-to-visualize 1-dimensional optimization problem.

Figure 8: Points selected to form the next batch by each sampling method when minimizing the 1-dimensional toy function, given 5 initial random points. The acquisition function is shown in blue, with non-zero regions shaded. The blue histogram shows the samples taken by BGSS.

Figure 8 shows the acquisition function curve and the subsequent samples chosen by each of the sampling algorithms while minimizing the toy function, after 5 randomly chosen initial samples. This gives some visual insight into the behavior of each of the methods. We observe that all of the methods are able to identify the main peak of the acquisition function and allocate at least one sample nearby. Naive qEI simply chooses the q points from the sample space closest to this peak, leading to highly local sampling and insufficient exploration of other areas of density in the acquisition function. LP also does a good job of identifying the main acquisition function peak, and the local penalization factor ensures somewhat more exploration than with the Naive qEI method. However, this still seems insufficient to cause exploration of other areas of density in the acquisition function. In contrast, Constant Liar is susceptible to over-exploration and selects several low-quality points. We posit that this is due to the assumption that the true value for each sample added to the batch is represented by the mean value of the GP prediction. Since violation of this assumption can lead to large movements in the GP posterior, it can cause erratic behavior and lead to these poor selections. In our toy example, B3O successfully identifies two of the acquisition function peaks, but does not represent the third. The IGMM used by B3O seems to be sensitive to the number of slice samples provided, as experiments with different numbers of slice samples led to substantial variations in the number and location of the points chosen.

KMBBO is able to achieve a good balance between exploration and exploitation, with all three maxima in the acquisition function represented and the remaining samples well distributed over the non-zero areas of the curve. When the number of local optima of the acquisition function is lower than the batch size, the quadratic penalization of within-cluster distance used by K-Means ensures that the remaining cluster centroids spread out over the set of slice sample values.

6 Summary

We propose a novel batch sampling algorithm for Bayesian optimization based upon K-means: K-Means Batch Bayesian Optimization (KMBBO). KMBBO was tested on a variety of tasks, from common synthetic functions, to the tuning of a machine learning algorithm, to a high-dimensional drug discovery problem. Over these tasks KMBBO displays superior sampling behavior to other common batch Bayesian optimization methods, such as LP, Thompson sampling, Constant Liar, and B3O, delivering optimal or near-optimal behavior in all tasks. It also delivered this performance more reliably than any other method, consistently showing the smallest standard deviation in results over 100 repetitions across tasks. This is an important result, since the major utility of Bayesian optimization is when each sample is expensive or difficult to collect, and reliability in optimization performance is thus strongly desirable. We also propose a modification to KMBBO, CS-KMBBO, for use in high-dimensional problems, where the slice sampling in KMBBO adds significant computational overhead. In this adaptation, the optimal dimensionality is determined through the use of the TwIST technique on a sampled subset of the problem space. CS-KMBBO shows better performance than all other methods despite operating on a reduced-dimensional data set. Finally, we discuss insights into the performance of KMBBO through the visualization of the batching process for a toy problem and comparison to the other methods studied within this paper. Over a wide variety of tasks it is inevitable that, for any specific task, some particular sampling technique will be optimal, but the strong performance of KMBBO over the whole range of tasks and dimensions makes it a reliable choice.

7 Acknowledgements

This work was supported by the STFC Hartree Centre’s Innovation Return on Research programme, funded by the Department for Business, Energy & Industrial Strategy.

References

  • [Bioucas-Dias and Figueiredo2007] Bioucas-Dias, J. M., and Figueiredo, M. A. 2007. A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image processing 16(12):2992–3004.
  • [Candès and Wakin2008] Candès, E. J., and Wakin, M. B. 2008. An introduction to compressive sampling. IEEE signal processing magazine 25(2):21–30.
  • [Chevalier and Ginsbourger2013] Chevalier, C., and Ginsbourger, D. 2013. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, 59–69. Springer.
  • [Durant et al.2002] Durant, J. L.; Leland, B. A.; Henry, D. R.; and Nourse, J. G. 2002. Reoptimization of mdl keys for use in drug discovery. Journal of Chemical Information and Computer Sciences 42(6):1273–1280. PMID: 12444722.
  • [Ginsbourger, Le Riche, and Carraro2007] Ginsbourger, D.; Le Riche, R.; and Carraro, L. 2007. A multi-points criterion for deterministic parallel global optimization based on kriging. In NCP07.
  • [Ginsbourger, Le Riche, and Carraro2010] Ginsbourger, D.; Le Riche, R.; and Carraro, L. 2010. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems. Springer. 131–162.
  • [González et al.2016] González, J.; Dai, Z.; Hennig, P.; and Lawrence, N. 2016. Batch bayesian optimization via local penalization. In Artificial Intelligence and Statistics, 648–657.
  • [Hernández-Lobato et al.2017] Hernández-Lobato, J. M.; Requeima, J.; Pyzer-Knapp, E. O.; and Aspuru-Guzik, A. 2017. Parallel and distributed thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning.
  • [Jasrasaria and Pyzer-Knapp2018] Jasrasaria, D., and Pyzer-Knapp, E. O. 2018. Dynamic Control of Explore/Exploit Trade-Off In Bayesian Optimization. arXiv:1807.01279 [cs, stat]. arXiv: 1807.01279.
  • [Landrum] Landrum, G. RDKit: Open-source cheminformatics.
  • [MacQueen1967] MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–297. Berkeley, Calif.: University of California Press.
  • [mmv.org] mmv.org. Malaria Box supporting information | MMV.
  • [Mockus1974] Mockus, J. 1974. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974, 400–404. Springer, Berlin, Heidelberg.
  • [Nash and Laboratories1994] Nash, W. J., and Laboratories, T. M. R. 1994. The Population biology of abalone (Haliotis species) in Tasmania. 1, Blacklip abalone (H. rubra) from the north coast and the islands of Bass Strait.
  • [Neal2003] Neal, R. M. 2003. Slice sampling. Annals of statistics 705–741.
  • [Nguyen et al.2017] Nguyen, V.; Rana, S.; Gupta, S.; Li, C.; and Venkatesh, S. 2017. Budgeted batch bayesian optimization. arXiv preprint arXiv:1703.04842v2.
  • [Pedregosa et al.2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
  • [Rasmussen and Williams2004] Rasmussen, C. E., and Williams, C. K. I. 2004. Gaussian Processes for Machine Learning. MIT Press.
  • [Rasmussen2000] Rasmussen, C. E. 2000. The infinite gaussian mixture model. In Advances in neural information processing systems, 554–560.
  • [Shah and Ghahramani2015] Shah, A., and Ghahramani, Z. 2015. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, 3330–3338. Cambridge, MA, USA: MIT Press.
  • [Snoek et al.2015] Snoek, J.; Rippel, O.; Swersky, K.; Kiros, R.; Satish, N.; Sundaram, N.; Patwary, M.; Prabhat, M.; and Adams, R. 2015. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2171–2180.
  • [Snoek, Larochelle, and Adams2012] Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, 2951–2959.
  • [Surjanovic and Bingham] Surjanovic, S., and Bingham, D. Virtual library of simulation experiments: Test functions and datasets. Retrieved May 16, 2018, from http://www.sfu.ca/~ssurjano.
  • [Thompson1933] Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.
  • [Vazquez and Bect2010] Vazquez, E., and Bect, J. 2010. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference 140(11):3088–3095.
  • [Wang et al.2013] Wang, Z.; Zoghi, M.; Hutter, F.; Matheson, D.; De Freitas, N.; et al. 2013. Bayesian optimization in high dimensions via random embeddings. In IJCAI, 1778–1784.
  • [Wang et al.2016] Wang, J.; Clark, S. C.; Liu, E.; and Frazier, P. I. 2016. Parallel bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149.