Bayesian optimization is a sample-efficient strategy for black-box optimization Shahriari et al. (2016); Frazier (2018). In many practical circumstances, such as robotics systems, measuring performance is complicated by having only an implicit understanding of utility, the existence of multiple competing metrics, or reliance on perceptual metrics that are not easily instrumented or quantified Wirth et al. (2017); Thatte et al. (2017); Pinsler et al. (2018). In such circumstances, standard Bayesian optimization may be infeasible: it may be more practical to explicitly incorporate comparative feedback into the search for a stakeholder’s most preferred outcome. Preference-based optimization offers an approach that relies only on pairwise comparative evaluations, rather than forcing the design of a single criterion or utility to which standard black-box optimization can be applied Cano et al. (2018); Burger et al. (2017).
This strategy of preference-based optimization, often involving a human-in-the-loop, has been developed over the past decade Brochu et al. (2010, 2008); Gonzalez et al. (2017); Houlsby et al. (2012); Thatte et al. (2017). These tools build on the foundation of sequential model-based Bayesian optimization Bergstra et al. (2011); Hutter et al. (2011); Snoek et al. (2012). The structure of these optimization strategies is to sequentially query the user with (generally) two possible options, of which the user must choose one.
In this article, we start from an existing preference model defined by a Gaussian process latent variable model; we describe an extension, first presented in Dewancker et al. (2016), that allows the user to report configurations as equivalently preferable.
These so-called “ties” provide the opportunity for users to avoid stating a preference in circumstances where such imprecision exists; we present a short empirical analysis to consider the impact of allowing ties. We also show a synthetic example for this strategy to identify a user’s most preferred Pareto efficient outcomes without exploring the entire efficient frontier Ehrgott (2005); Knowles (2006).
Our eventual goal for this work is to be able to identify most-preferred outcomes for circumstances which lack quantifiable metrics. We use coloring artwork to define one such circumstance as a stand-in for application-specific or confidential circumstances. We present the desired experimental workflow, and also some of the complications which have occurred in initial testing. We hope that participating in this ICML Human in the Loop workshop will introduce us to others who have insights into the psychological complications involved in incorporating humans into similar training/tuning/iterative processes.
2 A Preference Model Supporting Ties
The use of Gaussian process latent variable models (GPLVM) for capturing user preferences has been well studied in the past Brochu et al. (2010, 2008); Chu & Ghahramani (2005); Guo et al. (2010). Previous preference models have required that the user consider two alternatives and state a binary preference. However, even experts occasionally have difficulty discerning between two alternatives in terms of absolute preference. To this end, we extend the discrete preference observations with a third option: specifying equivalence between the two alternatives. Specifically, we adopt a modified Bradley-Terry model that supports ties, or configurations with equivalent preference Rao & Kupper (1967), summarized in Figure 2.
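After a number of such preferences have been stated, we collect the results as a set of comparisons, each recording the two configurations that were queried and the outcome the user stated. Among all the queries, there is a set of unique locations.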
The model draws a vector of latent function values from a Gaussian process prior, where each entry corresponds to one of the unique query points the user has compared. A second set of latent variables is drawn from normal priors and then transformed to form the lengthscales of the covariance function, which is used to populate the covariance matrix.
Each of these variables is transformed so that it always lies within specified lower and upper bounds on the lengthscales; together the transformed values form the vector of lengthscales.
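The bounded transformation can be illustrated with a short sketch. The exact transformation used in the model is not recoverable from the text above, so the logistic squashing below is an assumption, as are the function name `bounded_lengthscale` and the example bounds.

```python
import math


def bounded_lengthscale(z, lower, upper):
    """Map an unconstrained real z to a lengthscale between lower and upper.

    Assumption: a logistic squashing is used; the source only states that the
    transformed value must respect the lengthscale bounds.
    """
    # Numerically stable logistic function.
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return lower + (upper - lower) * s


# Hypothetical bounds and unconstrained values, one per input dimension.
lengthscales = [bounded_lengthscale(z, 0.1, 2.0) for z in (-3.0, 0.0, 3.0)]
```

Because the unconstrained variables carry normal priors, inference can work on the whole real line while the covariance function only ever sees lengthscales inside the allowed range.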
The generalized Bradley-Terry model Rao & Kupper (1967) relates the observed discrete preference data to the latent function values associated with the two points compared by the user during an interactive query. The tie parameter is inversely related to the precision with which a user can state a preference. A higher value for the tie parameter leads to more mass being placed in the equivalence bin of the categorical distribution over the three possible preference outcomes for two query points.
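As an illustration of how a tie parameter shifts probability into the equivalence bin, the sketch below uses the Rao and Kupper form with a threshold `theta >= 1`, applied to exponentiated latent values. The paper's own parameterization of the tie parameter is not shown in the text above, so this form and the name `rao_kupper_probs` are assumptions.

```python
import math


def rao_kupper_probs(f_a, f_b, theta):
    """Return (P(a preferred), P(tie), P(b preferred)).

    Assumption: Rao-Kupper form with worths exp(f) and threshold theta >= 1;
    theta == 1 recovers the ordinary Bradley-Terry model with no ties.
    """
    if theta < 1.0:
        raise ValueError("theta must be at least 1")
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(f_a, f_b)
    w_a, w_b = math.exp(f_a - m), math.exp(f_b - m)
    denom_a = w_a + theta * w_b
    denom_b = theta * w_a + w_b
    p_a = w_a / denom_a
    p_b = w_b / denom_b
    p_tie = (theta ** 2 - 1.0) * w_a * w_b / (denom_a * denom_b)
    return p_a, p_tie, p_b


# Equal latent values: a larger theta puts more mass on the tie outcome.
low_tie = rao_kupper_probs(0.0, 0.0, 1.5)
high_tie = rao_kupper_probs(0.0, 0.0, 3.0)
```

With equal latent values and `theta = 2`, each of the three outcomes receives probability one third.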
2.1 Variational Inference
In place of approximating the posterior with a multivariate Gaussian using the Laplace approximation around a MAP estimate of the latent variables Chu & Ghahramani (2005); Guo et al. (2010), we opt for an approximation that employs variational inference. We set out to approximate the posterior of the combined set of latent random variables in our model.
We use a mean field approximation strategy to construct our approximating distribution: a factored set of Gaussians, each parameterized by a mean and variance.
We rely on black box variational inference techniques Tran et al. (2016); Ranganath et al. (2013) to perform the optimization required to recover the variational parameters that minimize the reverse KL divergence between the true posterior distribution and the approximating distribution. Each latent variable contributes two variational parameters, a mean and a variance: two for each entry of the latent function vector and two for each of the lengthscale variables. The posterior inference problem is transformed into a minimization of a tractable expected value Tran et al. (2016); Ranganath et al. (2013).
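The tractable expected value can be made concrete with a Monte Carlo estimate of the evidence lower bound (ELBO) for a factored Gaussian; maximizing the ELBO is equivalent to minimizing the reverse KL divergence. This is a toy sketch only: `log_joint` below is a hypothetical stand-in target rather than the preference model, and no optimizer is included.

```python
import math
import random


def log_joint(z):
    """Toy unnormalized log posterior (assumption): independent N(2, 1) terms."""
    return sum(-0.5 * (zi - 2.0) ** 2 for zi in z)


def elbo_estimate(means, log_stds, num_samples=5000, seed=0):
    """Monte Carlo ELBO for a mean-field Gaussian with the given parameters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Draw one sample from the factored Gaussian approximation.
        z = [m + math.exp(ls) * rng.gauss(0.0, 1.0)
             for m, ls in zip(means, log_stds)]
        # Log density of that sample under the approximation.
        log_q = sum(
            -0.5 * math.log(2.0 * math.pi) - ls
            - 0.5 * ((zi - m) / math.exp(ls)) ** 2
            for zi, m, ls in zip(z, means, log_stds)
        )
        total += log_joint(z) - log_q
    return total / num_samples


# Two variational parameters (mean, log standard deviation) per latent variable.
matched = elbo_estimate([2.0, 2.0], [0.0, 0.0])
mismatched = elbo_estimate([0.0, 0.0], [0.0, 0.0])
```

The approximation centered on the toy target attains the higher ELBO, which is the quantity a black box variational inference routine would ascend.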
2.2 Sequential Preference Based Optimization
To determine the next point to be presented to the user as a comparison point, we adopt a strategy that searches the domain for the location where the expected improvement of the latent function is highest relative to the current most preferred point Brochu et al. (2008). With our approximation of the posterior of the latent variables of the preference model, it is possible to explore the use of an integrated acquisition function, as proposed in Snoek et al. (2012). We can produce a Monte Carlo estimate of the integrated expected improvement.
The expected improvement is written in terms of the CDF and PDF of the standard normal distribution, and the incumbent value is the latent function value associated with the currently most preferred configuration. The user is always asked to compare against the current most preferred point, and the most preferred point is updated as a result of this comparison. Algorithm 1 encapsulates the sequential optimization process using discrete preference observations.
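The Monte Carlo estimate of integrated expected improvement can be sketched as an average of the standard closed-form expected improvement over posterior draws. The sketch assumes that larger latent values are more preferred and that each draw supplies a predictive mean, a predictive standard deviation and an incumbent latent value at the candidate point; the function names are illustrative.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, f_best):
    """Closed-form expected improvement over f_best (larger is better)."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    return (mu - f_best) * normal_cdf(z) + sigma * normal_pdf(z)


def integrated_ei(posterior_draws):
    """Average expected improvement over (mu, sigma, f_best) posterior draws."""
    values = [expected_improvement(mu, sigma, f_best)
              for mu, sigma, f_best in posterior_draws]
    return sum(values) / len(values)


# Hypothetical predictions at one candidate point under three posterior draws.
score = integrated_ei([(0.2, 0.5, 0.1), (0.0, 1.0, 0.3), (-0.1, 0.8, 0.0)])
```

The candidate with the largest averaged score would be the next point shown to the user alongside the incumbent.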
Here, we initialize the search with samples from a Latin hypercube sequence. Several model quantities are held fixed, and two others are set relative to the length of the optimization domain in each dimension.
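A minimal sketch of Latin hypercube initialization follows. The number of initial samples and the domain bounds are placeholders, since the values used in the experiments are not given in the text above.

```python
import random


def latin_hypercube(num_samples, bounds, seed=0):
    """Draw num_samples points with one point per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of num_samples equal-width strata.
        strata = [(i + rng.random()) / num_samples for i in range(num_samples)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(strata)
        columns.append([low + (high - low) * u for u in strata])
    return [tuple(col[i] for col in columns) for i in range(num_samples)]


# Hypothetical two-dimensional domain with five initial configurations.
initial_points = latin_hypercube(5, [(0.0, 1.0), (-5.0, 5.0)])
```

Each dimension is covered evenly by construction, which is useful when only a handful of initial comparisons can be asked of the user.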
2.3 Synthetic Numerical Experiments Involving Ties
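We now experiment to understand the impact of allowing ties in the optimization process. To do so, we phrase the preference optimization process using a scalar function and a stated tolerance: the simulated user reports a tie when the function values of the two configurations lie within the tolerance of one another, and otherwise prefers the configuration with the better function value.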
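A simulated user of this kind can be sketched in a few lines. The sketch assumes that smaller function values are preferred and that the tie condition is an absolute difference no larger than the tolerance; the exact rule used in the experiments is not shown in the text above, and the names are illustrative.

```python
def simulated_preference(f, x_a, x_b, tolerance):
    """Return "a", "b", or "tie" for two configurations.

    Assumptions: smaller f is better, and a tie is reported when the two
    function values differ by no more than the tolerance.
    """
    f_a, f_b = f(x_a), f(x_b)
    if abs(f_a - f_b) <= tolerance:
        return "tie"
    return "a" if f_a < f_b else "b"


def toy_objective(x):
    """Hypothetical one-dimensional objective with its minimum at x = 1."""
    return (x - 1.0) ** 2


loose = simulated_preference(toy_objective, 0.9, 1.05, tolerance=0.1)
strict = simulated_preference(toy_objective, 0.9, 1.05, tolerance=0.0)
```

With the looser tolerance the two nearby configurations are reported as tied, while a zero tolerance forces a preference for the one closer to the minimum.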
In a first experiment, we consider a standard scalar test function, the Shekel05 function McCourt (2016). We analyze the convergence behavior of Algorithm 1 on this function with a range of tolerance values up to 1. Figure 2(a) shows how the convergence behavior depends on the tolerance.
Figure 4 shows the impact of varying tolerances on a multicriteria optimization problem using a multi-modal function in two dimensions. Success is defined as reaching the region containing the local minimum with the second lowest value; we approach this by balancing the competing objectives of minimizing function value and maximizing distance from the global minimum. To automate the testing process (that is, to simulate a human in the loop) we define a nonlinear scalarized quality function that depends on the location of the true minimum.
As we can see in the bottom portion of Figure 4, the clearest impact of tighter tolerances seems to be the narrowing down of results to be closer to the Pareto frontier.
3 Proposed Artwork Coloring Examples
The eventual goal of our preference-based optimization strategy is to be able to identify most-preferred outcomes when no scalar function such as the one used in the synthetic experiments exists. We now present a testing situation which lacks any such function: the coloring of an abstract piece of art. The shape is shown in Figure 1 with two possible colorings presented; the user is expected either to choose one of them or to state that they are “roughly equal”.
The shape is a fractal, defined in Lee (2014), built from a polynomial with 3 roots; points in the complex domain converge to these roots through Newton’s method at different rates. The search domain for this coloring problem is, at present, a 10-dimensional space defining 3 different colors associated with the 3 polynomial roots, the transition speed between those 3 colors, and the base color (which has the effect of dulling the colors).
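To make the coloring idea concrete, the sketch below uses plain Newton iteration on the polynomial z^3 - 1 as a stand-in; it is not the alternating Newton variant of Lee (2014). The coloring rule, which fades from a root color toward the base color as more iterations are needed, and all names are hypothetical.

```python
import cmath
import math

# The three roots of z**3 - 1 (stand-in polynomial, an assumption).
ROOTS = [cmath.exp(2j * cmath.pi * k / 3) for k in range(3)]


def newton_basin(z, max_iter=50, tol=1e-8):
    """Return (root_index, iterations) for Newton's method on z**3 - 1."""
    for iteration in range(max_iter):
        for k, root in enumerate(ROOTS):
            if abs(z - root) < tol:
                return k, iteration
        if z == 0:
            return None, iteration  # derivative vanishes, no Newton step
        z = z - (z ** 3 - 1) / (3 * z ** 2)
    return None, max_iter


def color_for(z, root_colors, base_color, speed):
    """Blend the converged root's RGB color toward the base color."""
    k, iterations = newton_basin(z)
    if k is None:
        return base_color
    weight = math.exp(-speed * iterations)  # slower convergence, duller color
    return tuple(weight * c + (1.0 - weight) * b
                 for c, b in zip(root_colors[k], base_color))


# Hypothetical parameters: three root colors, one base color, one speed.
palette = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
pixel = color_for(-1 + 1j, palette, (0.2, 0.2, 0.2), speed=0.3)
```

Evaluating `color_for` on a grid of complex starting points would produce one candidate image for a given parameter vector.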
Our goal in running these experiments was to determine the performance of Algorithm 1 relative to a purely random search Bergstra & Bengio (2012) of the domain. In particular, users would be asked to run through some number of optimizations using random search and some number using Algorithm 1, followed by a final comparison asking which of the most-preferred outcomes from the two strategies they prefer.
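The random search baseline can be sketched as a loop that keeps an incumbent and challenges it with uniformly sampled configurations. The comparison callback stands in for the human; its "a"/"b"/"tie" return values and the rule of keeping the incumbent on a tie are assumptions made for this sketch.

```python
import random


def random_search_preferences(compare, bounds, budget, seed=0):
    """Return the most preferred configuration after `budget` comparisons.

    compare(challenger, incumbent) must return "a" if the challenger is
    preferred, "b" if the incumbent is preferred, or "tie".
    """
    rng = random.Random(seed)

    def sample():
        return tuple(rng.uniform(low, high) for low, high in bounds)

    incumbent = sample()
    for _ in range(budget):
        challenger = sample()
        # Assumption: the incumbent is kept unless the challenger wins outright.
        if compare(challenger, incumbent) == "a":
            incumbent = challenger
    return incumbent


def prefer_closer_to_origin(x_a, x_b):
    """Hypothetical simulated user who prefers points nearer the origin."""
    d_a = sum(v * v for v in x_a)
    d_b = sum(v * v for v in x_b)
    if d_a == d_b:
        return "tie"
    return "a" if d_a < d_b else "b"


best = random_search_preferences(prefer_closer_to_origin, [(-1.0, 1.0)], budget=200)
```

Replacing the uniform `sample` call with a model-based proposal is what distinguishes the adaptive strategy from this baseline.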
3.1 Initial Testing and User Feedback
We built a web app to provide a sequence of comparisons to users in the format of Figure 1; the incumbent most-preferred coloring is always on the right-hand side. This was the first point of feedback that we received from users: “Why does the winner always stick around?” While it presents a viable optimization strategy, we may need to be more cognizant of the persistent presence of the incumbent during the optimization.
Additionally, some users complained that they felt “tricked” or that we “misled” them when learning that some of the optimizations were powered by random search rather than something more intelligent/adaptive. This was surely caused by running tests on people who had already been informed of the purpose of the testing (a form of the placebo effect, perhaps). Even though testers did not know whether they were running with random search or preference optimization (to avoid a true placebo effect), some frustration around the distinction was present.
A more severe complication, and one which we had failed to consider, is the user’s sense of completion. In the pure Algorithm 1, a budget would be set at the start. It seemed, though, that users were frustrated by this fixed budget. Some commented “Do I keep going?” or “I think I’m done.” part way through the budget; others said “I want to keep going.” after reaching the budget.
3.2 Proposed Next Steps
One option to address the user discomfort with always seeing the incumbent in comparisons is to consider more of a tournament strategy. While this is straightforward when randomly searching the space, it is more complicated to adapt Algorithm 1 for this situation.
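For the random search case, a tournament could look like the single-elimination sketch below, in which no configuration is shown in two consecutive rounds against the same opponent. The bracket structure, the random resolution of ties and the handling of an odd candidate are all assumptions made for this sketch.

```python
import random


def tournament(candidates, compare, seed=0):
    """Single-elimination tournament over candidates; returns the winner.

    compare(a, b) must return "a", "b", or "tie".
    """
    pool = list(candidates)
    if not pool:
        raise ValueError("at least one candidate is required")
    rng = random.Random(seed)
    while len(pool) > 1:
        rng.shuffle(pool)
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            outcome = compare(a, b)
            if outcome == "tie":
                # Assumption: ties are broken at random.
                outcome = rng.choice(["a", "b"])
            winners.append(a if outcome == "a" else b)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # the unpaired candidate gets a bye
        pool = winners
    return pool[0]


def prefer_smaller(a, b):
    """Hypothetical simulated user who prefers the smaller number."""
    if a == b:
        return "tie"
    return "a" if a < b else "b"


winner = tournament([3.0, 1.0, 4.0, 1.5, 9.0, 2.0, 6.0], prefer_smaller)
```

A tournament over n candidates needs n - 1 comparisons, so the comparison budget is fixed by the number of candidates drawn.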
To address the user’s discomfort in the process of random versus intelligent testing, we should consider not informing people how the testing process is designed and proceeding. Originally, we were afraid that testers would not find this opaque presentation reasonable, but it may be preferable to not give users any understanding as to the structure. One user commented “I wasn’t necessarily choosing what I thought was the most aesthetically pleasing but making choices to try and direct the system in a certain way that would generate better images down the road.” which is, obviously, not part of our optimization strategy.
The frustration users felt around the budget is reasonable, but not something that we immediately know how to address. One option might be to allow users to run the optimization until they “feel” it is complete; doing so changes the experimental structure but might yield additional insights into user behavior (especially regarding their tolerance).
Probably the most problematic component of this testing framework was something alluded to by users but not explicitly stated: a lack of understanding about the actual range of outcomes in the search space. Users asked “Is this how it is supposed to look?” and “What colors are available?” during the testing, which led us to realize that, without any sense of what could occur in the coloring, it would be difficult to confidently make judgments.
It is our hope that we can discuss these final two points with other researchers at this ICML HILL workshop and try to come away with strategies for future experimentation.
Special thanks to Eric Lee for helping prepare the graphics for the preference comparison experiments.
- Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
- Bergstra et al. (2011) Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.
- Brochu et al. (2008) Brochu, E., de Freitas, N., and Ghosh, A. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, pp. 409–416, 2008.
- Brochu et al. (2010) Brochu, E., Brochu, T., and de Freitas, N. A bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 103–112. Eurographics Association, 2010.
- Burger et al. (2017) Burger, R., Bharatheesha, M., van Eert, M., and Babuška, R. Automated tuning and configuration of path planning algorithms. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4371–4376. IEEE, 2017.
- Cano et al. (2018) Cano, J., Yang, Y., Bodin, B., Nagarajan, V., and O’Boyle, M. Automatic parameter tuning of motion planning algorithms. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8103–8109. IEEE, 2018.
- Chu & Ghahramani (2005) Chu, W. and Ghahramani, Z. Preference learning with gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pp. 137–144. ACM, 2005.
- Dewancker et al. (2016) Dewancker, I., McCourt, M., and Ainsworth, S. Interactive preference learning of utility functions for multi-objective optimization. In NIPS Future of Interactive Learning Machines Workshop, 2016.
- Ehrgott (2005) Ehrgott, M. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005.
- Frazier (2018) Frazier, P. I. Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, pp. 255–278. INFORMS, 2018.
- Gonzalez et al. (2017) Gonzalez, J., Dai, Z., Damianou, A., and Lawrence, N. D. Preferential bayesian optimization. arXiv preprint arXiv:1704.03651, 2017.
- Guo et al. (2010) Guo, S., Sanner, S., and Bonilla, E. V. Gaussian process preference elicitation. In Advances in Neural Information Processing Systems, pp. 262–270, 2010.
- Houlsby et al. (2012) Houlsby, N., Huszar, F., Ghahramani, Z., and Hernández-Lobato, J. M. Collaborative gaussian processes for preference learning. In Advances in Neural Information Processing Systems, pp. 2096–2104, 2012.
- Hutter et al. (2011) Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.
- Knowles (2006) Knowles, J. ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 10(1):50–66, 2006.
- Lee (2014) Lee, E. H. Alternating newton’s method. github.com/ericlee0803/NewtonEllipsoid, 2014.
- McCourt (2016) McCourt, M. Optimization Test Functions. github.com/sigopt/evalset, 2016.
- Pinsler et al. (2018) Pinsler, R., Akrour, R., Osa, T., Peters, J., and Neumann, G. Sample and feedback efficient hierarchical reinforcement learning from human preferences. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 596–601. IEEE, 2018.
- Ranganath et al. (2013) Ranganath, R., Gerrish, S., and Blei, D. M. Black box variational inference. arXiv preprint arXiv:1401.0118, 2013.
- Rao & Kupper (1967) Rao, P. and Kupper, L. L. Ties in paired-comparison experiments: A generalization of the bradley-terry model. Journal of the American Statistical Association, 62(317):194–204, 1967.
- Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
- Thatte et al. (2017) Thatte, N., Duan, H., and Geyer, H. A sample-efficient black-box optimizer to train policies for human-in-the-loop systems with user preferences. IEEE Robotics and Automation Letters, 2(2):993–1000, 2017.
- Tran et al. (2016) Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
- Wirth et al. (2017) Wirth, C., Akrour, R., Neumann, G., and Fürnkranz, J. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.