Optimization problems are common in aerospace science and engineering. Practical examples include the design of vehicles, systems and structures, which requires the evaluation of disciplinary models and objective functions that are frequently treated as black-box functions. Typically, an optimization algorithm operates sequentially, choosing each new point at which to evaluate the objective function on the basis of its previous evaluations, until a stopping criterion is met. When the evaluation of the function is expensive, traditional methods for black-box optimization, which require a considerable number of evaluations, are poorly suited. Surrogate-Based Optimization (SBO) can significantly improve the efficiency of the optimization procedure: the available information is exploited and synthesized into a surrogate model to reduce the number of expensive function evaluations required, thus saving time, resources and the associated costs [jones1998efficient, queipo2005surrogate, eldred2006formulations, robinson2008surrogate, forrester2009recent, bhosekar2018advances]. Efficiency can be further improved in a multifidelity setting, where cheaper but potentially biased approximations of the function can be used to assist the search for optimal points [kennedy2000predicting, forrester2007multi, fernandez2016review, park2017remarks, peherstorfer2018survey, beran2020comparison]. Within this context, we propose a scheme for resource-aware multifidelity active learning to reduce the computational time and cost associated with the optimization of black-box functions. We aim to achieve this goal through the optimal exploitation of computational budgets (time and computing resources) and of the information contained in the surrogate model (continuously updated while searching for the optimum).
Multifidelity active learning for the optimization of black-box functions has been widely studied in the Bayesian Optimization (BO) setting [viana2013efficient, lam2015multifidelity, takeno2019multi], which consists of two components: (i) a Bayesian statistical model to approximate the objective function, and (ii) an acquisition function to decide where to sample next [brochu2010tutorial, frazier2018tutorial]. The statistical models are almost invariably Gaussian Processes (GP), owing to their capability to model arbitrarily complex functions, their analytical tractability and their ability to estimate uncertainty in a probabilistic framework [williams2006gaussian, forrester2007multi, kennedy2000predicting, le2014recursive, cutajar2019deep]. The search for the optimum is guided by an acquisition function, computed on the statistical surrogate model, which defines a metric for evaluating the next point to sample, balancing the trade-off between global exploration and local exploitation of the surrogate. The BO framework for multifidelity settings combines different information sources (the objective function and its approximations at different levels of fidelity) into a single surrogate model and implements active learning strategies by adaptively sampling from different fidelity levels.
Multifidelity Bayesian Optimization has been widely explored in the literature [viana2014special, guo2018analysis, meliani2019multi, kontogiannis2020comparison]. However, many challenges remain open. The optimization of the multifidelity acquisition function is of critical importance for the implementation of an effective active learning strategy, and it may be computationally demanding: in real-world physics-based problems (e.g. the design of aerospace systems and vehicles), the acquisition function is defined over multidimensional domains and is subject to non-trivial, non-convex constraints that limit the space of feasible and acceptable solutions [gardner2014bayesian, hernandez2016general, perrone2019constrained]. Moreover, the rationale behind the construction and optimization of the acquisition function, at each sampling step, is the balance between exploration and exploitation: exploitation involves greedily improving over an already good point, whereas exploration is the attempt to gain information about the optimum in under-explored regions. This motivates the interest not only in the point-wise maximization of the acquisition function, but also in the overall shape it assumes over the entire search domain. This aspect is crucial within an active learning process and contributes to knowledge acquisition and uncertainty reduction during the optimization of the black-box function. Finally, BO approaches commonly have difficulty in optimally exploiting a given computational budget, and greedy strategies that simply maximize the acquisition function point-wise are usually adopted.
Stemming from these open challenges, this paper proposes a scheme for resource-aware multifidelity active learning to inform and accelerate optimization. In particular, we present a computational approach to enable: (i) constraint-aware, space-filling sampling; (ii) optimal allocation of the available resources at each single step, including making the best use of parallel computing architectures through the optimal distribution of sample evaluations; (iii) optimally informative multipoint and multifidelity sampling at each step.
To achieve these goals, we formulate the sampling task at each step of the BO as a knapsack problem to select multiple points and allocate resources for their evaluation. Specifically, this means identifying the best candidate locations and the associated fidelity sources in order to maximize the information gain that can be acquired during a parallel evaluation of the objective function, while accounting for the limited computational budget. Unlike most approaches, rather than explicitly optimizing the acquisition function [wilson2018maximizing], we evaluate it on a set of feasible points checked beforehand and then consider the problem of selecting an appropriate subset of candidate points with good informative properties, consistent with a knapsack problem approach. By splitting the feasibility check and the point selection tasks, it is possible to quickly optimize even complex multifidelity acquisition functions constrained over a non-convex domain. The knapsack problem is implemented as a mixed-integer linear programming (MILP) model over the candidate points within the feasible domain. The domain is partitioned into strata to capture multiple features of the acquisition function by sampling it at well-distributed locations [dambrosio2017milp]. During the (active) learning process, the choice of the sampling locations is driven and refined at each step through adaptive discretization techniques. The optimized sampling procedure is aware of the computational time budget and of the parallel computing resources available, which are therefore leveraged to balance the trade-off between exploration and exploitation in a principled way. In addition, the optimal use of the available resources for the learning process permits a major reduction of the time required to approach and eventually achieve (or closely approximate) the optimum.
The paper is organized as follows. Section 2 discusses the setup of the Bayesian Optimization problem and its extension to multifidelity formulations. Section 3 introduces the Resource Aware Active Learning scheme (RAAL for short) with formulations for multipoint and multifidelity adaptive sampling. Section 4 demonstrates the RAAL scheme for the multifidelity optimization of a variety of standard analytical test functions and for classical benchmark problems. Finally, Section 5 summarizes the concluding remarks.
2 Optimization framework
2.1 Bayesian Optimization
Bayesian Optimization (BO) is a class of machine learning techniques for the efficient optimization of expensive black-box functions [brochu2010tutorial, frazier2018tutorial]. Let us consider the constrained optimization problem in the form:
$$\min_{x \in \mathcal{A}} f(x) \qquad (1)$$
where $x \in \mathbb{R}^d$ is the input, $\mathcal{A}$ is a feasible set in which it is easy to assess membership, and $f$ is the continuous objective function. In this context, the term black-box denotes functions that lack any special structure, such as concavity or linearity, or for which derivatives are not known. This is the case in a wide range of applications, such as design of engineering and control systems [mockus2012bayesian, forrester2008engineering, holicki2020controller], design of laboratory experiments [negoescu2011knowledge, packwood2017bayesian], model calibration [majda2010quantifying], [lizotte2007automatic, brochu2010tutorial, cutler2014reinforcement], and hyperparameter tuning of machine learning algorithms [snoek2012practical, chakraborty2020transfer]. In the following, we will denote by $f^*$ the solution to problem (1), and by $x^*$ its location. The BO framework consists of two components: a Bayesian surrogate model for modelling the objective function, and an Acquisition Function (AF) for deciding where to sample next. The surrogate models are frequently in the form of Gaussian Processes (GP), which can provide efficient representations of complex functions and characterize model uncertainty in probabilistic frameworks (Section 2.1.1). The search for the optimum is guided by an acquisition function, which is defined on the statistical surrogate and provides a metric for evaluating the next point to sample through a continuous trade-off between global exploration and local exploitation of the surrogate (Section 2.1.2).
2.1.1 Gaussian processes
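The main building block of our approach is Gaussian Process regression [williams2006gaussian]. Let us consider a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of paired input/output observations, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, generated by the unknown mapping function $y_i = f(x_i) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is the measurement noise. GP regression defines a supervised learning problem in which we assign to the function $f$ a GP prior with zero mean and covariance function $k(x, x')$, such that
$$f(x) \sim \mathcal{GP}\big(0, k(x, x')\big). \qquad (2)$$
Denoting by $K$ the kernel matrix, such that $K_{ij} = k(x_i, x_j)$, and $y = [y_1, \dots, y_N]^\top$, the predictive distribution of the GP is defined by the mean function
$$\mu(x) = k_N(x)^\top \left(K + \sigma_\epsilon^2 I\right)^{-1} y$$
and the variance function
$$\sigma^2(x) = k(x, x) - k_N(x)^\top \left(K + \sigma_\epsilon^2 I\right)^{-1} k_N(x), \qquad (3)$$
where $k_N(x) = [k(x, x_1), \dots, k(x, x_N)]^\top$ and $I$ is the $N$-dimensional identity matrix.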
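As a minimal sketch of the predictive mean and variance described above, the following NumPy code computes them through a Cholesky factorization. The squared-exponential kernel, its hyperparameters, the toy training data and the function names are assumptions made for illustration; they are not the kernel or settings used in the experiments.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between two sets of row vectors (assumed kernel)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at the test points."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, x_test)            # shape (n_train, n_test)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)               # clip tiny negative round-off

# Toy data (hypothetical): three noiseless observations of sin(3x).
x_train = np.array([[0.0], [0.5], [1.0]])
y_train = np.sin(3.0 * x_train[:, 0])
mean, var = gp_posterior(x_train, y_train, np.array([[0.5], [0.25]]))
```

At a training input the posterior mean reproduces the observation and the variance collapses to roughly the noise level, while away from the data the variance grows; this is the uncertainty signal that the acquisition function exploits.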
2.1.2 Acquisition function
Once we have a statistical model to represent our belief about the unknown function $f$ given the observed dataset $\mathcal{D}$, we need a sampling strategy or policy for selecting the new query point $x_{N+1}$. In Bayesian optimization, the selection strategy uses the posterior distribution to guide the search and usually consists in the maximization of a quantity that measures how much information this query will provide, i.e. its expected utility. More formally, the unknown objective function will be evaluated at $x_{N+1} = \arg\max_{x} U(x \mid \mathcal{D})$, where $U$ is the Acquisition Function (AF). Common acquisition functions are the Probability of Improvement (PI) [kushner1964new], Expected Improvement (EI) [jones1998efficient], Entropy Search (ES) [hennig2012entropy] and Predictive Entropy Search (PES) [hernandez2014predictive]. The results of this work are obtained using the EI, which, given its analytical tractability and good trade-off between computational cost and accuracy, is the most widely used in the literature [jones1998efficient]. For the minimization problem (1), the Expected Improvement is defined as
$$\mathrm{EI}(x) = \begin{cases} \big(f(x^+) - \mu(x) - \xi\big)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0, \\ 0 & \text{if } \sigma(x) = 0, \end{cases} \qquad (4)$$
where $\mu(x)$ and $\sigma(x)$ are the predictive mean and standard deviation in Equation (3), $f(x^+)$ is the value of the best sample so far and $x^+$ is the location of that sample. $\Phi$ and $\phi$ denote the cumulative distribution function and the probability density function of the standard normal distribution, and $Z$ is the standardized improvement
$$Z = \frac{f(x^+) - \mu(x) - \xi}{\sigma(x)}. \qquad (5)$$
The parameter $\xi$ allows tuning the trade-off between exploration and exploitation, determining the relative importance of the posterior mean with respect to the potential improvement in regions with high uncertainty, i.e. large $\sigma(x)$.
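The Expected Improvement can be sketched in a few lines of standard-library Python. The version below assumes a minimization problem, so the improvement is measured as the best observed value minus the predictive mean, and it names the exploration parameter `xi` following [brochu2010tutorial]; both the sign convention and the names are assumptions of this sketch.

```python
import math

def expected_improvement(mean, std, best, xi=0.0):
    """EI at a single point for a minimization problem (assumed convention).

    mean, std: GP predictive mean and standard deviation at the point.
    best:      lowest objective value observed so far.
    xi:        exploration parameter; larger values favour exploration.
    """
    if std <= 0.0:
        return 0.0                      # no uncertainty: EI is set to zero
    improvement = best - mean - xi
    z = improvement / std               # standardized improvement
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return improvement * cdf + std * pdf

ei_at_best = expected_improvement(mean=0.0, std=1.0, best=0.0)
```

When the predictive mean equals the incumbent, the first term vanishes and EI is driven entirely by the predictive uncertainty, which is why regions with large standard deviation remain attractive.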
2.2 Multifidelity Bayesian Optimization
Multifidelity optimization approaches leverage the availability of analysis models characterized by different levels of fidelity. Typically, high fidelity models consist of ground-truth observations, which are costly to obtain, and/or accurate computer representations of the physics which can be expensive to evaluate. Cheap low-fidelity models may come in various forms: coarser discretizations and resolutions of numerical models, simplified representations which neglect physical effects included in the more expensive high-fidelity models, or approximations through surrogate modeling techniques.
2.2.1 Multifidelity Gaussian processes
The Gaussian process regression can be extended to combine different sources of information in a single probabilistic model. For this purpose, let us assume that observation values are available at different fidelity levels, where is the lowest fidelity and the highest. The training dataset is then composed of the paired input/output observations , generated by the unknown mapping function , where the measurement noise is assumed to have the same distribution across the fidelities. In this setting, multifidelity Gaussian process regression (MF-GP) can be formulated using an autoregressive scheme [kennedy2000predicting], where the lowest fidelity function is characterized by a GP prior with kernel function , and the higher fidelities are defined recursively as
where is a constant factor that scales the contribution of the preceding fidelity to the following one, and models the bias between fidelities.
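For reference, the autoregressive recursion of [kennedy2000predicting] is commonly written as follows. The symbols $f_l$, $\rho_{l-1}$, $\delta_l$ and the level index $l = 2, \dots, L$ are assumed here for illustration and may differ from the notation adopted elsewhere in this paper.

```latex
f_l(x) = \rho_{l-1}\, f_{l-1}(x) + \delta_l(x), \qquad l = 2, \dots, L,
```

In this form $\rho_{l-1}$ is the constant scaling factor and $\delta_l$ is a GP, independent of $f_{l-1}$, that captures the discrepancy between two consecutive fidelity levels.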
The autoregressive formulation implies the following
which can be interpreted as a Markov property: given the point , we can learn nothing more about from any other model evaluation , for [kennedy2000predicting, o1998markov]. A kernel function between a pair of samples can be written as
Denoting the kernel matrix, such that , the predictive distribution of the MF-GP is defined by the predictive mean and variance
where and .
2.2.2 Multifidelity acquisition function
The availability of multiple fidelity levels poses a new challenge for Bayesian Optimization: we have to determine not only the location of the new sample to evaluate, but also the most convenient fidelity level to query. Different approaches can be found in the literature, such as Multifidelity Expected Improvement (MFEI) [huang2006sequential], Multifidelity Predictive Entropy Search (MFES) [zhang2017information] or Multifidelity Max-value Entropy Search [takeno2019multi]. For consistency, our formulation is based on the MFEI, which preserves the good properties of its single-fidelity counterpart. The MFEI (or Augmented EI) [huang2006sequential] is defined as
where the first term is simply the EI evaluated at the highest fidelity, and can therefore be derived directly from (4). The utility functions and are defined as
where is designed to discount the utility when a lower fidelity evaluation is considered, whereas takes into account the stochastic nature of the unknown function due to the presence of noise, therefore for deterministic problems is . The function has been chosen to be straightforward to compute under the assumptions of the GP regression, and it holds and, for deterministic problems, if . For a more detailed analysis of the MFEI, please refer to [huang2006sequential].
3 Resource-Aware Active Learning
The computational improvements introduced by the adoption of multifidelity surrogates can be further enhanced by the optimal use of parallel or distributed computing architectures. This section introduces our Resource-Aware Active Learning algorithm (RAAL, for short) that leverages the availability of multiple sources of information at different levels of fidelity in conjunction with the possibility to distribute the evaluations across multiple computational resources (CPUs) in parallel. Section 3.1 outlines the main steps of the RAAL algorithm. We then discuss the two elements representing the core of the approach: the multipoint exploration/exploitation of the AF, in Section 3.2, and the optimization procedure for the multipoint multifidelity seeding in Section 3.3. In the following, for the sake of compactness, we write for .
In conventional BO, one point is selected at each iteration and evaluated at a prescribed fidelity level. This information is then used to learn/update the surrogate model and the associated AF, before the next point selection can be made. In the remainder of this paper, this conventional BO is referred to as sequential BO. In contrast, the RAAL scheme samples multiple points across different fidelities at each iteration. In addition, the RAAL scheme optimally allocates the available computational resources to take full advantage of parallel and/or distributed computing architectures.
We describe here the main steps of the RAAL strategy (Algorithm 1). We start from an initial set of feasible points that can be assembled through any Design of Experiment (DOE) procedure, for instance a Latin Hypercube Design (LHD), and then remove the infeasible points . At the first iteration , a subset of points is selected together with a set of fidelity levels and used to run the first evaluations of the function , and hence obtain the dataset employed to learn an initial surrogate model and the associated AF, as described in Section 2. At this point, similarly to the sequential BO, the RAAL algorithm optimizes the AF, but with the notable difference of selecting a subset of points together with the associated fidelity levels . At the next iteration , parallel computational resources are used to build simultaneously the dataset , where , and, similarly to the sequential BO, the optimization loop is executed based on the augmented dataset . As will be shown in the numerical results of Section 4, the seeding of more than one point at each BO iteration significantly reduces the overall computational time, mitigating the impact of the main bottleneck in the BO process, i.e. the evaluation of the black-box function . The process is then iterated until a maximum number of iterations or a maximum computational budget is reached.
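The parallel evaluation step of the loop described above can be sketched with Python's standard `concurrent.futures` module. The toy `objective`, the fidelity encoding (1 for high, 0 for low) and the helper `evaluate_batch` are hypothetical stand-ins introduced for illustration, not the implementation of Algorithm 1.

```python
from concurrent.futures import ThreadPoolExecutor

def objective(x, fidelity):
    """Hypothetical stand-in for the expensive black-box function."""
    bias = 0.0 if fidelity == 1 else 0.5      # the low fidelity is biased
    return (x - 0.3) ** 2 + bias

def evaluate_batch(points, fidelities, n_workers):
    """Evaluate the selected (point, fidelity) pairs concurrently."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map preserves the input order of the submitted pairs
        values = list(pool.map(objective, points, fidelities))
    return list(zip(points, fidelities, values))

# Start from an initial design, then augment the dataset with one batch.
dataset = [(0.0, 1, objective(0.0, 1))]
dataset += evaluate_batch([0.2, 0.8, 0.5], [1, 0, 0], n_workers=3)
```

For CPU-bound simulations a process pool or a job scheduler would be the more natural choice; threads are used here only to keep the sketch self-contained.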
The RAAL multipoint selection comes with a number of favourable properties. As explained in more detail in Section 3.2, the optimization of the AF, which may be complex, high-dimensional and multifidelity, is accomplished by evaluating it point-wise at points in and picking those points that cumulatively maximize the function over the search domain. By doing so it is possible to handle even complex constraints on the design space, since feasibility is checked beforehand and not during the AF optimization. Moreover, the selected points can be chosen to achieve a tunable exploration/exploitation trade-off of the AF, combining space-filling characteristics with selective exploitation of the AF shape. Another important aspect of the multipoint seeding is that the sampled points maximize the usage of the available computational resources, in terms of the computational burden required to evaluate the points at the selected fidelity levels. The aim is to make the most of the parallel resources in order to reduce the impact of the function evaluations on the BO iterations, also by properly allocating the different evaluation tasks to the parallel CPUs.
Summarizing, the points in the set and their associated fidelities , selected at each iteration of the RAAL algorithm for the maximization of the AF, (i) are feasible with respect to design constraints, (ii) are well-distributed and have tunable space-filling properties, and (iii) maximize the usage of available computational resources. Section 3.2 will describe how the points should be processed in order to take into consideration a trade-off between AF exploration and exploitation; Section 3.3 will describe the multipoint multifidelity optimization routine.
3.2 Optimal exploration/exploitation of the Acquisition Function in multipoint scenario
The optimization of the AF in the RAAL multipoint scenario relies on a tunable exploitation and exploration of the AF, which will be actively employed in the optimization procedure of Section 3.3 for the selection of multiple points at each BO iteration.
When a feasible set of points in the AF domain is given, the maximization (i.e. exploitation) of the AF itself can be done by simply evaluating the function at these points for each required level of fidelity, obtaining a set of values , with and , and then picking the highest value. In a standard BO scheme, this optimization-by-evaluation procedure can be especially suitable when the numerical optimization of the AF is particularly challenging, mainly due to the presence of complex and non-convex feasibility constraints. While still convenient in a parallel BO scheme, it requires some care: a greedy selection of only the best points would lead to oversampling the AF in a close neighborhood of the optimum, which provides little additional information compared with picking the single optimum and, in fact, wastes computational resources.
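The following sketch illustrates optimization by evaluation and the clustering effect of a greedy batch. The one-dimensional `acquisition` function with a single sharp peak and the regular candidate grid are hypothetical choices made for illustration.

```python
import math

def acquisition(x):
    """Toy acquisition function with one sharp peak at x = 0.7 (hypothetical)."""
    return math.exp(-((x - 0.7) ** 2) / 0.005)

# Candidate points assumed to have passed the feasibility check beforehand.
candidates = [i / 20 for i in range(21)]
scores = [acquisition(x) for x in candidates]

# Sequential BO: pick the single best candidate.
best = max(range(len(candidates)), key=scores.__getitem__)

# Greedy batch of three: the top scores are all neighbours of the peak.
top3 = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)[:3]
batch = sorted(candidates[i] for i in top3)
```

Here the greedy batch collapses onto the immediate neighbours of the maximizer, so two of the three parallel evaluations add little beyond what the single best point already provides.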
A better strategy can be devised if we include aspects related not only to the exploitation, but also to the exploration of the AF. One way to explore the AF (and hence the domain of the original objective function) is to use experimental design techniques. A very common approach in this field is the LHD, which divides each dimension of a -dimensional space into equispaced levels (also known as strata or bins), where each level contains exactly one point in the design. In the literature (see [dambrosio2017milp] for a brief survey) there are different criteria to evaluate the goodness of an LHD configuration. A popular criterion is the -discrepancy with respect to the uniform distribution, which measures the difference between the empirical distribution of a set of points and the multivariate uniform distribution over the same domain. Such uniform discrepancy is a multidimensional property that is difficult to evaluate; however, computing the discrepancy along each one-dimensional projection is a much simpler task. Notice that, for example, the defining property of an LHD is that each one-dimensional projection has low uniform discrepancy along that dimension.
If we take a purely exploratory point of view, our measure of the goodness of a set of points is related to the difference from the ideal one-dimensional distributions of the points along the axes. First we divide each continuous domain into a pre-specified number of bins with the mapping , and then project each point onto the axes:
where and represent the interval of the -th domain. Then, each vector is mapped onto an extended vector . If the points in a set were uniformly distributed, the projection along each axis would look like a univariate uniform distribution. Given equispaced strata of each variable , each stratum should contain the same number of points, i.e. . In Section 3.3 we will see how to implement this in a proper optimization problem.
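A minimal sketch of the uniform discretization is given below. It assumes that the extended vector is a concatenation of one indicator (one-hot) vector per dimension, which is one plausible reading of the description above; the function names and the example bounds are hypothetical.

```python
def bin_index(value, lower, upper, n_bins):
    """0-based index of the equal-width bin of [lower, upper] containing value."""
    frac = (value - lower) / (upper - lower)
    return min(int(frac * n_bins), n_bins - 1)    # the upper bound falls in the last bin

def one_hot_bins(point, bounds, n_bins):
    """Concatenate one indicator vector of length n_bins per dimension."""
    encoded = []
    for value, (lower, upper) in zip(point, bounds):
        idx = bin_index(value, lower, upper, n_bins)
        encoded.extend(1 if b == idx else 0 for b in range(n_bins))
    return encoded

bounds = [(0.0, 1.0), (-2.0, 2.0)]                # hypothetical 2-D domain
encoded = one_hot_bins((0.3, 1.0), bounds, n_bins=4)
```

With this encoding, summing the extended vectors of a set of selected points gives the number of points falling in each stratum of each axis, which is the quantity a uniformity constraint needs to control.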
The discretization technique of Equation (13) produces a uniform grid without taking into consideration the shape of the AF, since the points in are discretized into partitions of equal width, which mirrors as closely as possible the measure of discrepancy from the uniform distribution. In order to balance the pure exploration given by the uniform gridding with the exploitation of the AF, we propose an AF-weighted discretization technique. Instead of transforming the points in on a purely geometric basis, the AF values are used to guide their discretization. First, for each dimension , the points with an AF value smaller than a predefined quantile level are removed. Subsequently, quantiles are computed on the AF values of the remaining points, where is the -th -quantile of the points along the dimension , with . Figure 1 offers an illustrative example: on the left we see the case with , resulting in a uniform grid, while on the right we see the impact of selecting a , which results in a more refined grid around the modes of the AF.
The parameter can also be automatically updated during the iterations of the RAAL algorithm, so as to favour exploration at the beginning of the algorithm and foster exploitation as the available computational budget approaches depletion. Formally, denoting by the maximum allowed value for the parameter , the update rule can be specified as
where is the parameter value at iteration , is the current budget over the total and is the learning rate.
Once the quantiles are defined for each dimension, each point in is projected onto the axis by the mapping
where , and , . In this case we highlight the dependence on the parameter because, contrary to (13), the width of the bins is adjusted depending on the distribution of the AF values using the quantile-based method. This leads to a finer grid resolution in those areas where the AF has higher values, and therefore where it is more likely to sample more informative points. Similarly to the previous case, each vector is finally mapped onto an extended vector . Section 3.3 will show how to implement proper constraints such that the projection along each axis looks like a univariate uniform distribution, in order to obtain well-distributed points over the defined grid.
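The quantile-based discretization can be sketched along one dimension as follows. The sketch assumes that the bin edges are the empirical quantiles of the coordinates of the points that survive the AF threshold, and that points outside the resulting range are assigned to the extreme bins; the name `alpha` for the threshold quantile level, the toy acquisition values and both function names are assumptions for illustration.

```python
import numpy as np

def af_weighted_edges(coords, af_values, alpha, n_bins):
    """Bin edges along one dimension from the quantiles of the high-AF points."""
    threshold = np.quantile(af_values, alpha)     # AF value at quantile level alpha
    kept = coords[af_values >= threshold]         # discard the low-AF points
    return np.quantile(kept, np.linspace(0.0, 1.0, n_bins + 1))

def assign_bins(coords, edges):
    """0-based bin index of each coordinate; outside points are clipped."""
    idx = np.searchsorted(edges, coords, side="right") - 1
    return np.clip(idx, 0, len(edges) - 2)

coords = np.linspace(0.0, 1.0, 101)
af_values = np.exp(-((coords - 0.7) ** 2) / 0.01)   # toy AF peaked at 0.7

edges_uniform = af_weighted_edges(coords, af_values, alpha=0.0, n_bins=4)
edges_refined = af_weighted_edges(coords, af_values, alpha=0.8, n_bins=4)
bins_uniform = assign_bins(np.array([0.1, 0.6, 1.0]), edges_uniform)
```

With `alpha = 0` no point is discarded and the edges reduce to a uniform grid over evenly spread candidates, whereas a larger `alpha` concentrates all the edges around the mode of the AF.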
3.3 Multipoint multifidelity seeding
When multiple computational resources are available at each iteration of the BO loop, we face the problem of how to best utilize these resources to gain as much information as possible from the current surrogate model and its related AF. Section 3.2 already discussed the proposed strategies for the multipoint maximization of the AF, in the direction of balancing exploitation and exploration. This section describes how these strategies can be embedded in an optimization program that takes into consideration three main aspects characterizing our multipoint multifidelity seeding, namely:
the maximization of the usage of the computational resources available and their optimal allocation for the evaluation of the objective function at different fidelity levels;
the optimal exploitation and exploration of the AF;
a sampling strategy compatible with the recursive GP model used to build the surrogate (Section 2.2.1).
From a high-level perspective, the seeding routine is implemented as a knapsack problem, where the candidate points and the associated fidelity sources are selected so that the information acquired during a parallel evaluation of the objective function is maximized, and the computational load is less than or equal to the available parallel resources. Consequently, the optimization problem takes four groups of input parameters: the points to be selected , transformed and ‘decoded’ into their categorical version , thanks to the discretization procedures (13) or (15); the values of the AF evaluated at points and fidelities ; the computational cost of evaluating the objective function at fidelity ; the computational resources of each single computational unit , where is the number of available CPUs, to be allocated for the evaluation of at the fidelity levels that will be selected.
The decision variables that allow for the selection of the next points and their fidelities to be evaluated (i.e. the sets and in Algorithm 1) are arranged into two groups: variables equal iff the point at fidelity is chosen and assigned to the computational unit ; variables equal iff the point is chosen from , independently of the fidelity level. The variables are used in combination with the discretized data to select a set of points such that each stratum of the discretized grid is represented, in order to measure a (scaled) discrepancy with respect to the uniform distribution. It is worth recalling that, according to the discretization procedure chosen from Section 3.2, such a measure enforces the property of weighted well-distributed points, hence balancing the exploration/exploitation of the AF itself over the selected points. On the other hand, variables determine the fidelity level and the computational unit assigned to point to minimize the waste of resources, accounting for the specific computational cost associated with the fidelity level .
The optimization routine is finally formulated as the following Mixed Integer Linear Programming (MILP) problem:
where is the total number of bins used in the processing of data .
The MILP objective induces a hierarchical order in the optimization of the two goals of maximizing the usage of resources and maximizing the AF: the former is prioritized over the latter through the weighting factor set such that . With constraints (16a) we impose that at most one point can be chosen from each bin , hence enforcing the well-distributed property described in Section 3.2. Constraints (16b) guarantee that the capacity of each computational unit is not violated when the evaluations of the objective function are assigned. The logical interdependence between the groups of variables and is imposed by constraints (16c) and (16d), while (16e) ensures that a single point cannot be evaluated on more than one CPU. Finally, the fidelity interdependence required for the construction of a coherent autoregressive GP model (described in Section 2) is implemented by (16f).
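To make the structure of the selection problem concrete, the sketch below solves a tiny instance by brute-force enumeration rather than with a MILP solver. It is a simplified illustration under stated assumptions: the bins are one-dimensional, the fidelity interdependence constraint is omitted, the hierarchical objective is realized by a lexicographic comparison instead of a weighting factor, and all the data (`bins`, `af`, `cost`, `capacity`) are hypothetical.

```python
from itertools import product

# Hypothetical toy instance: 4 candidates in 1-D, 2 fidelities, 2 CPUs.
bins = [0, 0, 1, 2]                                     # stratum of each candidate
af = [[0.2, 0.9], [0.1, 0.8], [0.3, 0.5], [0.4, 0.6]]   # af[i][l], l = 1 is the highest
cost = [0.5, 1.0]                                       # cost of one evaluation at fidelity l
capacity = [1.0, 1.0]                                   # budget of each CPU

# Each candidate is either skipped (None) or assigned a (fidelity, cpu) pair.
options = [None] + [(l, k) for l in range(len(cost)) for k in range(len(capacity))]

def evaluate(assignment):
    """Return (used budget, total AF), or None if a constraint is violated."""
    load = [0.0] * len(capacity)
    seen_bins, total_af = set(), 0.0
    for i, choice in enumerate(assignment):
        if choice is None:
            continue
        l, k = choice
        if bins[i] in seen_bins:                        # at most one point per bin
            return None
        seen_bins.add(bins[i])
        load[k] += cost[l]
        total_af += af[i][l]
    if any(load[k] > capacity[k] + 1e-9 for k in range(len(capacity))):
        return None                                     # CPU capacity exceeded
    return (sum(load), total_af)

# Lexicographic maximization: resource usage first, acquisition value second.
best_score, best_assignment = None, None
for assignment in product(options, repeat=len(bins)):
    score = evaluate(assignment)
    if score is not None and (best_score is None or score > best_score):
        best_score, best_assignment = score, assignment
```

In this instance both CPUs are fully used, with one high-fidelity evaluation on one unit and two low-fidelity evaluations on the other. Enumeration grows exponentially with the number of candidates, which is why a MILP formulation is the practical choice beyond toy sizes.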
In this section we present numerical experiments on a set of benchmark problems, demonstrating the performance of the RAAL algorithm compared to a standard sequential BO scheme (i.e. using a single CPU). In the following, we first describe the experimental setup adopted in all the experiments, including the description of the benchmarks and the test configurations; we then discuss the results obtained in both the single-fidelity and multifidelity versions of the benchmark problems.
We implemented the RAAL algorithm, its statistical models and acquisition functions in Python 3.7.3, leveraging functionality from the Emukit toolkit [emukit2019], while the MILP Optimization Routine of Section 3.3 was implemented with PuLP [mitchell2011pulp], a linear programming modeler written in Python, and solved by means of COIN CLP/CBC [lougee2003common].
4.1 Experimental setup
We conducted experiments on a variety of popular benchmark problems to test the efficiency and robustness of the proposed approach against the standard sequential BO, in both single-fidelity (SF) and multifidelity (MF) settings. The benchmark functions were selected to exemplify different types of correlations among the fidelity levels, described in the following. Consistently with the notation already used, we denote the objective functions, and sort the fidelities in increasing order . Accordingly, is the representation at the highest fidelity and is considered the reference ground truth.
Analytical Test 1
The first benchmark is the popular Forrester function [forrester2008engineering], one of the most common analytical benchmarks in the literature. It is a 1-dimensional nonlinear function over the domain , defined as
with and . Its low fidelity level is given by the linear mapping
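For illustration, the Forrester function and a linear low-fidelity mapping can be coded as below. The high-fidelity expression is the standard one from [forrester2008engineering]; the low-fidelity coefficients `a`, `b` and `c` are the values commonly reported in [forrester2007multi] and are assumed here, since they may differ from those used in this benchmark.

```python
import math

def forrester_high(x):
    """High-fidelity Forrester function on [0, 1]."""
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)

def forrester_low(x, a=0.5, b=10.0, c=-5.0):
    """Linear transformation of the high-fidelity function (assumed coefficients)."""
    return a * forrester_high(x) + b * (x - 0.5) + c

# Value near the known global minimizer of the high-fidelity function.
f_min = forrester_high(0.7572)
```

Evaluated near its global minimizer, the high-fidelity function returns approximately -6.02.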
Analytical Test 2
The second benchmark is a sinusoidal squared 1-dimensional function [cutajar2019deep], with domain in the interval . The high fidelity function is defined as
which is a non linear function of the low fidelity variant, given by
Its ground truth solution to problem (1) is at .
Analytical Test 3
The third benchmark problem is the -dimensional Rosenbrock function [rosenbrock1960automatic], a non-convex function with domain in the interval and defined as
The global minimum lies in a narrow, parabolic valley and is located at . The low fidelity observations are given by a linear mapping defined as [bryson2017unified]:
Figure 2 illustrates the three analytical objective functions, together with their low-fidelity alternatives. In the remainder of the paper, “SF Test ” denotes the optimization of the -th Analytical Test problem in a single-fidelity setting, where only the highest fidelity level is considered. Conversely, “MF Test ” indicates the optimization of the -th Analytical Test problem in a multifidelity setting, considering all the available fidelity levels as sources of information.
For all the numerical results, the same initial conditions were imposed on each algorithm configuration: an identical initial set , with cardinality dictated by the specific benchmark application, was selected randomly from the feasible set , drawn quasi-randomly via LHD over the feasible domain of each benchmark. We also allocated, for each experiment, the same maximum total computational budget to both the sequential BO and the RAAL algorithm, i.e. the highest level of fidelity can be evaluated the same number of times. Finally, for all the analytical benchmarks, we assigned a unitary cost to the maximum fidelity level and a fractional cost to all the lower fidelity levels, according to the following rule
In the RAAL algorithm, each available CPU was assigned with a computational budget capable of running a single evaluation of the objective function at the maximum fidelity level.
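The budget arithmetic implied by this setup can be sketched as follows. This is one possible reading of the allocation, not the seeding procedure of the RAAL algorithm itself: the function names are hypothetical, and the fractional cost values passed in the usage lines are placeholders, since the cost rule is not reproduced here. Costs are handled as exact fractions to avoid floating-point artefacts in the floor and ceiling operations.

```python
from fractions import Fraction
import math


def evaluations_per_cpu(cost, cpu_budget=1):
    """Number of evaluations of a given cost that fit in one CPU budget.

    Pass costs as int, str (e.g. "0.25") or Fraction, not float, so that
    the division is exact.
    """
    return math.floor(Fraction(cpu_budget) / Fraction(cost))


def evaluations_per_iteration(cost, n_cpus, cpu_budget=1):
    """Evaluations of a given cost that all CPUs can run in one iteration."""
    return n_cpus * evaluations_per_cpu(cost, cpu_budget)


def parallel_iterations(total_budget, n_cpus, cpu_budget=1):
    """Iterations needed to spend the total budget with n_cpus in parallel."""
    return math.ceil(Fraction(total_budget) / (n_cpus * Fraction(cpu_budget)))


# Usage with placeholder values: unit cost for the highest fidelity,
# a hypothetical cost of 0.25 for a lower fidelity level.
high_per_iter = evaluations_per_iteration(1, n_cpus=5)
low_per_iter = evaluations_per_iteration("0.25", n_cpus=5)
```

Under these assumptions, a total budget of 20 high fidelity evaluations is consumed in 20 iterations by a sequential scheme and in 4 iterations when 5 CPUs are available, which is why the iteration count is a reasonable proxy for wall-clock time.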
In the following results we report the Root Squared Error between the optimal solution computed at each step and the known global optimum (minimum) of the high fidelity function of each benchmark. The error is plotted as a function of the iterations used by each algorithm, to which we allocate the same total computational budget for a fair comparison. This metric is directly related to the execution time taken by the sequential and parallel BO, since the computational overhead of choosing the next information source and sample is omitted: it is negligible compared to the cost of invoking an information source in real-world applications. All the numerical experiments were randomized over 20 runs, starting from different initial sets : all diagrams report the median values (solid lines) together with the observations falling between the 25th and 75th percentiles (shaded areas). The hyperparameters of the kernel and mean functions of the GP surrogate models were optimized via Maximum Likelihood Estimation [forrester2007multi, forrester2009recent].
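The randomized protocol described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the code used for the experiments: the helper names are hypothetical, the Latin hypercube sampler is a basic stratified variant, the error is assumed to be measured in the input space, and the percentiles use the "inclusive" interpolation of the Python standard library.

```python
import math
import random
import statistics


def latin_hypercube(n_points, bounds, rng):
    """Basic Latin hypercube sample: one point per stratum in each dimension."""
    columns = []
    for low, high in bounds:
        strata = list(range(n_points))
        rng.shuffle(strata)  # random pairing of strata across dimensions
        columns.append(
            [low + (s + rng.random()) / n_points * (high - low) for s in strata]
        )
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def root_squared_error(x_best, x_opt):
    """Euclidean distance between the current best point and the optimum."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_best, x_opt)))


def summarize_runs(error_curves):
    """Per-iteration (25th percentile, median, 75th percentile) over runs.

    error_curves holds one list of errors per run, all of equal length.
    """
    summary = []
    for errors in zip(*error_curves):
        q25, med, q75 = statistics.quantiles(errors, n=4, method="inclusive")
        summary.append((q25, med, q75))
    return summary


# Usage: a different initial design for each of 20 randomized runs.
designs = [
    latin_hypercube(5, [(0.0, 1.0)], random.Random(seed)) for seed in range(20)
]
```

The tuples returned by `summarize_runs` correspond to the lower edge of the shaded area, the solid line, and the upper edge of the shaded area in the convergence plots.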
4.2 Single fidelity results
This section discusses the results observed for the single fidelity (SF) version of the analytical benchmarks. This set of experiments allows us to investigate the impact of different parameter values of the RAAL algorithm on its performance, namely the accuracy and the speed of convergence to the known optimum. In particular, we focus on different grid resolutions and different values of the learning rate parameter while varying the number of available CPUs (that is, the number of points that can be evaluated simultaneously).
Figure 3 shows the results on the SF Test 1 benchmark, for which we chose an initial DOE of points. We ran the tests with , and all the parameter combinations resulting from two discretization levels and and three learning rates . First of all, the RAAL algorithm achieves better results than the sequential BO in terms of convergence speed: the RAAL algorithm takes 2.5 iterations on average to reach the optimum when 5 CPUs are employed, whereas the sequential BO takes 5 iterations on average. Similar results are obtained for all the parameter settings of the grid discretization, that is, the number of bins and the learning rate . In these experiments, the learning rate does not seem to have a significant impact on the convergence speed. This holds for the simple case of SF Test 1, whose AF shape may be simple enough to capture and exploit.
We now investigate the impact of the number of bins in the discretization grid and of the learning rate on the SF Test 2. For this test we used points as initial DOE and a maximum computational budget of . Interestingly, Figure 4 shows results similar to Figure 3: the parallel selection of multiple points leads to a significantly improved convergence speed, without compromising accuracy. While the sequential BO takes 20 iterations to reach the optimum, the RAAL algorithm takes 10 iterations with 2 CPUs and fewer than 5 iterations with 5 CPUs. Another important aspect is the resolution of the grid: doubling the number of bins from (top row) to (bottom row) helped the RAAL algorithm converge faster and avoid local optima, which appear as plateaus in the convergence history. This suggests that, given the high multimodality of SF Test 2, the RAAL algorithm benefits from a higher number of bins in the search grid, which allows a finer search and the movement from one local optimum to another until the true optimum of the objective function is reached. Lastly, higher learning rates degrade the performance in the presence of local optima, with both a coarse and a fine grid. The main reason is that the adaptation yields denser sampling in the proximity of the peaks of the acquisition function, thereby weakening the exploratory effect of the uniform gridding, which would help to escape local minima. This is confirmed by the behaviour observed for the uniform gridding, for which we record shorter stagnation at the local minimum.
An additional analysis was carried out on the multidimensional domain of the SF Test 3 (Rosenbrock), where we set uniform gridding for the entire optimization procedure () and bins for each dimension. Investigations were conducted for different dimensionalities of the problem, namely for and . In this scenario too, the RAAL BO outperforms the sequential BO in terms of convergence speed, even if with the sequential BO achieves, on average, a slightly better accuracy. A possible reason for this is that, ideally, the cardinality of the set of points to evaluate at each iteration should increase with the dimensionality of the domain to sample.
4.3 Multifidelity results
In this section we describe the results for the multifidelity (MF) version of the analytical benchmarks and discuss the impact of the RAAL parameters. In addition, the outcomes are compared to the single fidelity experiments, in order to verify whether similar considerations can be drawn. As in the single fidelity case, we investigate the impact of different parameters of the RAAL algorithm, that is, the number of bins in the discretization grid and the learning rate . In particular, we investigate their role for the 2-CPU and 5-CPU architectures.
Figure 6 reports the results obtained for the benchmark MF Test 1, with a maximum computational budget . Here we use a uniform grid of bins and establish an initial set of 5 and 2 points for the low and high fidelity levels, respectively. The multipoint selection of the RAAL BO appreciably accelerates the convergence to the optimum; this already emerges when only 2 CPUs are available. Moreover, the parallel selection of different points reduces the variability of the results across the experiments, that is, the proposed multipoint and multifidelity seeding enhances the robustness of the BO scheme with respect to a sequential approach.
We ran the MF Test 2 with for all the settings of the algorithmic parameters resulting from two discretization levels and , and three learning rates , , and . As in the MF Test 1, the initial DOE consists of 5 and 2 points for the low and high fidelity, respectively. Figure 7 shows that, also in this second multifidelity benchmark, the multipoint selection dramatically accelerates the search for the optimum, which in many cases can be found in only 5 iterations when exploiting 5 CPUs. Furthermore, the RAAL algorithm performs better on MF Test 2 than on its single fidelity version SF Test 2. The comparison of Figures 4 and 7 reveals that the RAAL algorithm always reaches the global optimum of the benchmark function in the MF setting, whereas it does not do so in the SF case when 5 CPUs are used. In fact, access to a lower fidelity and less costly representation of the objective function allows the RAAL algorithm to sample more points and to better explore the search space, which turns out to be very useful for highly multimodal problems of this kind. Regarding the effect of different discretization levels, we observe that a coarse grid leads to better optimization performance when fewer CPUs are available, whereas a finer grid achieves faster convergence when the number of CPUs is higher. This is related to the trade-off between exploration and exploitation: the combination of a lower number of CPUs with a finer grid is too unbalanced towards exploitation, whereas a coarse grid paired with a high number of CPUs (and therefore samples per iteration) biases the optimization in favour of exploration. Lastly, unlike what was observed for the single fidelity settings, increments of the learning rate do not have any significant impact on the convergence history of the multifidelity implementation of benchmark Test 2.
Finally, experiments are reported for the multidimensional benchmark Test 3 (Rosenbrock function) in the multifidelity scenario. Investigations were conducted for and dimensional domains with different maximum budgets , , and , respectively allocated; a uniform gridding is adopted to discretize each dimension with bins.
As observed for the single fidelity experiment, the results recorded for the multifidelity settings (Figure 8) demonstrate the faster convergence of the RAAL BO, which is particularly marked in the highest dimensional domain of : the parallel multipoint selection of the RAAL algorithm leads to a smaller final error with respect to the true optimum, which was achieved in a small fraction of the iterations taken by the sequential BO.
5 Concluding Remarks
In this work we proposed a novel multipoint and multifidelity Bayesian Optimization (BO) scheme, with the objective of accelerating the optimization of expensive-to-evaluate black-box functions. Our Resource Aware Active Learning (RAAL) algorithm maximizes the information gain acquired at each step of the underlying BO methodology by seeding multiple points and the associated fidelities while optimally allocating the parallel/distributed computational resources available for their evaluation. The core of the algorithm is the seeding procedure, implemented as a mathematical programming problem, which leverages in a principled way the computational time budget and the parallel resources available to balance the trade-off between exploration and exploitation of the Acquisition Function (AF), leading to a major speed-up in the iterative optimization task. Another main characteristic of the RAAL algorithm is its general formulation, which can scale to any finite number of fidelities, handle any statistical model and deal with any AF. This should guarantee a wide applicability of the approach, without limiting its validity to any specific BO-related setting.
The performance of the approach was empirically evaluated on a number of well-known analytical benchmarks available in the literature, with nonlinear and multimodal characteristics, tested with two fidelity levels for demonstration purposes. The results obtained for all the numerical experiments reveal a significant speed-up of the RAAL algorithm in solving the optimization problem with respect to a standard BO scheme, where the AF is optimized and sampled at only one point. Interestingly, the RAAL algorithm achieves even better performance in multifidelity scenarios, demonstrating the ability to take full advantage of the lower fidelity and cheaper-to-evaluate approximation of the objective function by seeding more points and hence better exploring the search domain at each algorithm iteration.
As potential extensions of this work, we are currently investigating several opportunities. First, the numerical results should be extended to physics-based applications and problems, for an additional validation of the approach on multidomain use cases. Another worthwhile investigation concerns the use of suitably extended Multipoint Acquisition Functions, explicitly formulated to maximize the information gain either at the same BO iteration or over a look-ahead on future iterations, in the spirit of a Dynamic Programming approach. Some attempts are already available in the literature, but they only focus on the single fidelity scenario. Lastly, a potential advancement of the algorithm is its adaptation to so-called Constrained Bayesian Optimization, where the objective function has to be optimized in the presence of expensive-to-evaluate feasibility constraints, which usually involves the formulation of modified Acquisition Functions.
This work was supported by the IDA Center of Excellence in Cyber Physical Systems Grant No. 176474 under the Industrial Development Agency (Ireland) program.