1 Introduction
Mathematical optimization plays a central role in natural science. Indeed, most theoretical predictions are preceded by a calibration stage whereby the parameters of the model are optimized to reproduce a selected set of calibration data. In nuclear physics, the coupling constants of any theory of the strong interaction between protons and neutrons (nucleons) must be determined from experimental data before one can attempt to solve the Schrödinger equation to make quantitative predictions of the properties of atomic nuclei.
Typically, measured low-energy nucleon-nucleon (NN) cross sections and the properties of light nuclei with mass number A ≤ 4 have been used for calibrating the NN and three-nucleon (NNN) interaction sectors of the nuclear force; see e.g. Refs. [1, 2, 3] and references therein. However, it is still an open question, with several parallel and ongoing research efforts [4, 5, 6, 7], how to best constrain the unknown coupling constants in theoretical descriptions of the nuclear interaction and how to incorporate the covariance structure of the experimental data and the model discrepancy [5, 8, 9, 10]. Recent theoretical studies [7, 11, 12, 13] indicate that NN scattering data and the properties of very light nuclei do not contain enough information to constrain all directions in the parameter domain at a sufficient level. Instead, it has been proposed that the pool of fit data needs to be augmented with complementary data types, such as scattering cross sections and/or the ground-state properties of heavier nuclei, or even empirical properties of infinite nuclear matter. However, the modeling of such observables is typically much more complex and requires substantial computational effort, ranging from hours to days for just one model evaluation, even on a supercomputer. Consequently, the optimization of the underlying model parameters becomes difficult. The main focus of the present work is to investigate a possible strategy for mitigating this computational challenge.
Inspired by recent progress in the optimization of hyperparameters of deep neural networks [14], we explore several Bayesian optimization (BayesOpt) strategies (we employ the BayesOpt implementation provided through the Python package GPyOpt [15]) for maximizing the likelihood of objective functions based on complex models in nuclear physics. BayesOpt originated more than 50 years ago [16], was popularized in the 1990s, see e.g. [17, 18], and has since been applied in various disciplines, from selecting experiments in material and drug design [19] to tuning event generators in particle physics [20]. The central idea in BayesOpt is to construct a probabilistic surrogate model, usually a Gaussian process (GP), to capture our prior beliefs and knowledge about the objective function f, and to iteratively confront the surrogate with actual data samples from f, thereby refining our posterior for this function. The main advantage of BayesOpt is that we can incorporate prior beliefs in a straightforward way. The disadvantage lies in the arbitrariness and uncertainty of a priori information.

In the following we will be dealing with complex models in nuclear physics. The origin of the underlying physics model and its parameters is briefly introduced in this section, while more details and relevant references are provided in B. Nucleons are made of quarks and gluons, and it is well known that the strong interaction between these fundamental particles is described in detail by quantum chromodynamics (QCD), which is part of the standard model of particle physics. It is equally well known that QCD is not perturbative in the low-energy region relevant for nuclear structure physics. This prevents straightforward application of, e.g., perturbation theory to compute atomic nuclei starting from QCD. Instead, chiral effective field theory (χEFT) [21, 22, 23] is constructed as a low-energy approximation to QCD. This framework shows promising signs of being an operational approach to analyzing atomic nuclei while maintaining a firm link with the more fundamental theory.
In χEFT, the long-ranged part of the nuclear interaction is described in terms of pion exchanges, while the short-ranged part is parameterized by contact interactions. The unknown parameters of this description are known as low-energy constants (LECs). It is important to realize that any realistic description of the nuclear interaction has to introduce unknown coefficients, but it has been found that χEFT is able to capture the physics with a relatively small number of parameters. Depending on the level of detail included, there are typically on the order of 10-30 LECs.
Clearly, if model predictions of physical observables, such as scattering cross sections, are computationally expensive, we face the challenge of optimization with a limited budget of evaluations of the objective function. For the overwhelming majority of well-developed research problems there does not exist a universal optimization strategy that is guaranteed to arrive at a global optimum of an objective function in a finite number of iterations. Instead, any information regarding the mathematical or computational structure of the objective function, perhaps guided by the physical nature of the underlying problem, should play an important role in the choice or design of the optimization algorithm.
In this paper, we will systematically study the application of BayesOpt to optimization problems of increasing degree of complexity. The BayesOpt algorithm is presented in Sec. 2 with its main ingredients: the kernel and the acquisition function. The performance of different optimization algorithms can be compared using a data profile. This measure, as used in the present work, is introduced in Sec. 3. In order to benchmark the performance of BayesOpt with various settings under controlled problem conditions we will employ a selected set of six test functions in parameter domains of varying dimensionality. This study is presented in Sec. 4, while the test functions themselves are listed in A. The main focus of this work is found in Sec. 5 with the application of BayesOpt to a real nuclear physics problem. We will use BayesOpt to optimize the 12 LECs appearing at next-to-next-to-leading order (NNLO) in χEFT, using the proton-neutron scattering data in the Granada database [24] with laboratory energies of the incident proton beam below 75 MeV. This case is complex enough to constitute a non-trivial problem from a physics as well as an optimization perspective. However, it is still computationally straightforward enough that we can easily afford a detailed analysis of 12 different BayesOpt algorithms. Indeed, each evaluation of the objective function at a specific point in the parameter domain only takes a couple of seconds on a standard desktop computer. Still, evaluating the objective function in all 2^{12} = 4,096 corners of the corresponding parameter-domain hypercube would take a couple of hours, so the premise for BayesOpt is well justified. We will compare the optimization performance of BayesOpt with the POUNDERs algorithm [25]. This is a simulation-based optimization algorithm that has been successfully applied in non-linear least-squares optimization in nuclear physics before [26, 27]. Finally, we end with a summary and outlook in Sec. 6.
2 Bayesian optimization
Without any loss of generality, we will frame the determination of the LECs in χEFT as a minimization problem. Global minimization of a function f(x), with input parameters x that are perhaps subject to some constraints and typically belong to a compact input domain \mathcal{X}, is a long-standing and central problem in science. Here, we also specialize the formalism to scalar-valued functions f: \mathcal{X} \to \mathbb{R}. Mathematically, we are trying to find a global minimizer
x^\ast = \arg\min_{x \in \mathcal{X}} f(x),   (1)
i.e. a point x^\ast that fulfills f(x^\ast) \le f(x) for all x \in \mathcal{X}. In general, this is an intractable problem unless we have detailed information about f or the parameter domain only contains a finite number of points. In reality, we are trying to find local minimizers of f, i.e. points x^\dagger for which f(x^\dagger) \le f(x) for all x close to x^\dagger. A substantial complication arises if f is computationally expensive to evaluate. Then it becomes even more important to adopt a well-founded strategy, based on all present knowledge about f, for carefully selecting a dataset of sequential function queries \mathcal{D}_N = \{(x_i, f(x_i))\}_{i=1}^{N}, such that we increase our chances of rapid convergence towards x^\dagger. A BayesOpt framework can provide this. Moreover, at each iteration BayesOpt will consider all data points collected so far and thereby take full advantage of the history of the optimization run. Note that we refer to a set of function evaluations as data. This should not be confused with experimental data. There are two main components in a BayesOpt algorithm:

A prior probabilistic belief for the function f given some data \mathcal{D}. The prior is often a GP. This is updated in every iteration.

An acquisition function \alpha(x \mid \mathcal{D}) given some data \mathcal{D}, i.e. a heuristic that balances exploration against exploitation and determines where to evaluate the objective function next.
The next iterate, x_{N+1}, is selected where we expect the minimizer x^\dagger to be, based on some utility function. Below, we will define two different acquisition functions \alpha(x \mid \mathcal{D}), and show how to embed them in an iterative context for selecting sample points. In the following we will also drop the explicit data dependence in the notation for the acquisition function and only write \alpha(x). A pseudocode for BayesOpt is listed in Algorithm 1 and a pictorial exposition of a handful of BayesOpt iterations of a simple univariate function is provided in Fig. 1.
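The iterative structure sketched in Algorithm 1 can be condensed into a few lines of Python. The following is an illustrative sketch only, not the GPyOpt implementation used in this work: it fixes the kernel length scale instead of estimating it, and maximizes the acquisition over random candidate points.

```python
import numpy as np

def bayesopt_minimize(f, bounds, n_init=4, n_iter=20, beta=2.0, seed=0):
    """Minimal BayesOpt loop: fit a GP surrogate to all data collected
    so far, optimize an LCB-style acquisition over random candidates,
    evaluate f there, and repeat. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, float).T
    dim = len(lo)

    def kern(A, B, ell=0.5):
        # squared exponential kernel with unit signal variance (assumed fixed)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ell ** 2)

    X = lo + (hi - lo) * rng.random((n_init, dim))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        K = kern(X, X) + 1e-6 * np.eye(len(X))
        cand = lo + (hi - lo) * rng.random((512, dim))
        Ks = kern(cand, X)
        mu = y.mean() + Ks @ np.linalg.solve(K, y - y.mean())
        var = np.clip(1.0 - (Ks * np.linalg.solve(K, Ks.T).T).sum(-1), 1e-12, None)
        lcb = mu - beta * np.sqrt(var)      # low where improvement is expected
        x_next = cand[np.argmin(lcb)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    best = np.argmin(y)
    return X[best], y[best]
```

In practice the kernel hyperparameters would be re-estimated in every iteration and the acquisition function maximized with a dedicated optimizer rather than by random sampling.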
2.1 The prior: a Gaussian process
To model the prior for the objective function we use a Gaussian process (GP) with mean function m(x) and covariance matrix K with entries K_{ij} = k(x_i, x_j). The mean and covariance functions fulfill the expected relations m(x) = \mathbb{E}[f(x)] and k(x, x') = \mathrm{Cov}[f(x), f(x')] for all x, x' \in \mathcal{X}. Any real-valued mean function is permissible, but the kernel k must be such that the corresponding covariance matrix K is positive semidefinite. A GP is one example of a stochastic process that is very useful in statistical modeling [28]. In brief, it is a collection of function evaluations f(x_1), \ldots, f(x_N), with mean (often shifted to zero), that are jointly Gaussian, i.e.
[f(x_1), \ldots, f(x_N)]^T \sim \mathcal{N}(m, K),   (2)
where \mathcal{N}(m, K) denotes a normally distributed random variable with mean m and covariance K. After conditioning this prior on some data \mathcal{D} = (X, y), with observed function values y, we obtain the mean \mu(x) and variance \sigma^2(x) of the GP model according to

\mu(x) = m(x) + k(x, X) K^{-1} (y - m(X)),
\sigma^2(x) = k(x, x) - k(x, X) K^{-1} k(X, x).   (3)
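As a concrete illustration, the conditioning formulas of Eq. (3) translate directly into a few lines of linear algebra. The function below is a generic sketch (the names and the small jitter term added for numerical stability are our own), not part of any particular GP library:

```python
import numpy as np

def gp_posterior(k, X, y, Xs, m=None, jitter=1e-10):
    """GP conditioning as in Eq. (3): posterior mean and covariance at
    test points Xs given data (X, y). k(A, B) must return the kernel
    matrix between two point sets; m is the prior mean function
    (zero by default)."""
    m = m if m is not None else (lambda A: np.zeros(len(A)))
    K = k(X, X) + jitter * np.eye(len(X))   # K = k(X, X), regularized
    Ks = k(Xs, X)                           # k(x, X)
    mu = m(Xs) + Ks @ np.linalg.solve(K, y - m(X))
    cov = k(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov
```

Conditioning on a single observation reproduces it exactly: the posterior mean passes through the data point and the posterior variance collapses there.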
These explicit expressions follow from the fact that the conditional distribution of a multivariate Gaussian is also Gaussian. The mean and variance of the prior, i.e. the GP, for the schematic example in Fig. 1 are indicated with a green line and a green shaded band, respectively.

In this work we will consider three common types of covariance functions, also referred to as kernels:

Squared exponential: k(x, x') = \sigma_f^2 \exp(-r^2/2)

Matern 3/2: k(x, x') = \sigma_f^2 (1 + \sqrt{3}\, r) \exp(-\sqrt{3}\, r)

Matern 5/2: k(x, x') = \sigma_f^2 (1 + \sqrt{5}\, r + \tfrac{5}{3} r^2) \exp(-\sqrt{5}\, r)

where r = \|(x - x')/\ell\| is the scaled distance and \theta = (\sigma_f, \ell) is a set of hyperparameters for the kernel. The correlation length(s) \ell indicate how far you need to move (along a particular direction in the parameter domain) for the function values to become uncorrelated. With automatic relevance determination (ARD), we optimize the vector of correlation lengths separately in each direction of the function domain \mathcal{X}. Without ARD, the kernel hyperparameters are isotropic. In this work, we extract the hyperparameters using maximum likelihood estimation.
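A minimal implementation of the three kernels may clarify the role of the hyperparameters; here `ell` plays the role of the correlation length(s), and passing a vector for `ell` corresponds to ARD:

```python
import numpy as np

def scaled_r(x, xp, ell):
    """r = ||(x - x') / ell||: a vector ell gives ARD length scales,
    a scalar ell the isotropic kernel."""
    return float(np.sqrt((((np.asarray(x) - np.asarray(xp)) / ell) ** 2).sum()))

def squared_exponential(x, xp, ell=1.0, sf2=1.0):
    r = scaled_r(x, xp, ell)
    return sf2 * np.exp(-0.5 * r ** 2)

def matern32(x, xp, ell=1.0, sf2=1.0):
    r = scaled_r(x, xp, ell)
    return sf2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def matern52(x, xp, ell=1.0, sf2=1.0):
    r = scaled_r(x, xp, ell)
    return sf2 * (1.0 + np.sqrt(5.0) * r + 5.0 * r ** 2 / 3.0) * np.exp(-np.sqrt(5.0) * r)
```

All three evaluate to the signal variance \sigma_f^2 at zero separation and decay monotonically with r; they differ in the smoothness of the sample functions they generate.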
The characteristic features of the GPs based on the three kernels listed above are illustrated in Fig. 2. Their smoothness, related to the typical correlation length, is one of the key differences between them. This feature of the kernel affects the prior modeling of the objective function, and as such the ensuing performance of BayesOpt. We will see this clearly in the analyses in Secs. 4 and 5.
2.2 The acquisition function
The acquisition function determines the most likely improvement over the currently best minimizer in the parameter domain. In Fig. 1 the mean of our posterior belief (green line) about the unknown values of the objective function (black line) is sequentially augmented with one new data point (black dot) in each iteration. The best candidate for further minimizing f at iteration N+1 is the parameter x that maximizes the acquisition function (red curve). Although the acquisition function is also optimized over \mathcal{X}, it is significantly faster to evaluate than the underlying objective function f, since it only relies on draws from the GP prior. Still, the complexity of maximizing \alpha(x) increases as we increase the dimensionality of the parameter domain \mathcal{X}. This aspect should not be underestimated, and it is in fact one of the main challenges with BayesOpt. Another challenge, although unlikely to materialize, could emerge if the set of collected data points becomes very large. Indeed, in each iteration the evaluation of the GP requires the inversion of an N \times N matrix, and the complexity of that operation naively scales as \mathcal{O}(N^3). A Cholesky decomposition reduces this cost somewhat, but the cubic scaling remains. In reality, however, this is rarely a limiting factor since we typically resort to BayesOpt when only a small number of function evaluations can be carried out in the first place.
A very appealing feature of BayesOpt is its ability to select a new point in a region of \mathcal{X} where the prior model of f exhibits a large uncertainty. This means that the algorithm can be rather explorative and therefore escape a local minimum of the objective function f. Depending on the details of the acquisition function, the explorative nature is balanced with a certain degree of exploitation, i.e. evaluating points in the parameter domain where the prior model for f exhibits a low mean value. To study the exploration-exploitation balance we will consider two of the most common acquisition functions: the expected improvement (EI) and the lower confidence-bound (LCB). In the following, we denote by f_{\min} the lowest value of f recorded so far.
The expected improvement acquisition function is defined as the expectation value of the rectifier \max(f_{\min} - f(x), 0), i.e. we reward any expected reduction of f in proportion to the reduction f_{\min} - f(x). This expectation can be evaluated analytically,

\alpha_{\mathrm{EI}}(x) = \mathbb{E}[\max(f_{\min} - f(x), 0)] = (f_{\min} - \mu(x))\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{f_{\min} - \mu(x)}{\sigma(x)},   (4)

where the cumulative distribution function and the probability density function of the standard normal distribution are denoted \Phi and \phi, respectively, and the result is written in the standardized variable z. BayesOpt will exploit regions of expected improvement when the first term dominates, while new, unknown regions will be explored when the second term dominates. For the expected improvement acquisition function, the exploration-exploitation balance is entirely determined by the set of observed data and the kernel.

The lower confidence-bound acquisition function introduces an additional parameter \beta that explicitly sets the level of exploration
\alpha_{\mathrm{LCB}}(x) = \beta\,\sigma(x) - \mu(x).   (5)

The maximum of this acquisition function occurs at the maximum of the enlarged confidence envelope of the GP. We use \beta = 2, which is a very common setting. Larger values of \beta lead to even more explorative BayesOpt algorithms.
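Both acquisition functions are cheap to evaluate once the GP mean \mu and standard deviation \sigma at a candidate point are known. A direct transcription of Eqs. (4) and (5):

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, f_min):
    """EI of Eq. (4): (f_min - mu) * Phi(z) + sigma * phi(z),
    with z = (f_min - mu) / sigma for the GP mean mu and standard
    deviation sigma at a candidate point."""
    z = (f_min - mu) / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)    # standard normal PDF
    return (f_min - mu) * Phi + sigma * phi

def lower_confidence_bound(mu, sigma, beta=2.0):
    """LCB acquisition of Eq. (5), written so that larger values are better."""
    return beta * sigma - mu
```

Note that EI is always non-negative and rewards both a low predicted mean and a large predictive uncertainty, while LCB trades the two off linearly through \beta.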
Besides being derivative-free (although derivatives can in fact be incorporated [29]), it is in many ways the explorative nature of BayesOpt that is most attractive. This feature is most easily observed in the optimization of a rather complex two-dimensional function with several local minima, such as the Langermann function

f(x) = \sum_{i=1}^{m} c_i \exp\!\left(-\|x - A_i\|^2 / \pi\right) \cos\!\left(\pi \|x - A_i\|^2\right),   (6)

where the c_i are weights and the A_i are center points.
In Fig. 3 we show the sequence of 50 evaluation points, following two initial points, of BayesOpt with a Matern 3/2 kernel and the expected improvement acquisition function, without ARD.
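For reference, the Langermann function is straightforward to code. The coefficients below are the conventional two-dimensional choice (m = 5); the exact values used in this work are those listed in A and may differ:

```python
import numpy as np

# Conventional 2D Langermann coefficients (m = 5); assumed here for
# illustration, not necessarily the values used in this work.
C = np.array([1.0, 2.0, 5.0, 2.0, 3.0])
A = np.array([[3.0, 5.0], [5.0, 2.0], [2.0, 1.0], [1.0, 4.0], [7.0, 9.0]])

def langermann(x):
    d2 = ((A - np.asarray(x, float)) ** 2).sum(axis=1)  # squared distance to each center
    return float((C * np.exp(-d2 / np.pi) * np.cos(np.pi * d2)).sum())
```

The cosine factor produces many shallow oscillatory minima around each center, which is exactly the structure that rewards an explorative optimizer.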
It should be made clear that, given a limited computational budget, the success of BayesOpt hinges on the choice of kernel and acquisition function. In the example above with the Langermann function, the non-smooth nature of the Matern 3/2 kernel is advantageous compared to, e.g., the squared exponential kernel. The exploration-exploitation balance also affects the success rate of BayesOpt. To learn more about BayesOpt, it is obviously instructive to benchmark and compare different BayesOpt algorithms using well-known test functions. For this purpose we first need to select a performance measure for optimization algorithms.
3 Measuring the performance of optimization algorithms
To analyze the performance of derivative-free optimization algorithms we follow Ref. [30] and define a data profile

d_s(\kappa) = \frac{1}{|P|} \left| \left\{ p \in P : t_{p,s} \le \kappa\,(n_p + 1) \right\} \right|.   (7)

The data profile enables a direct comparison between a set of optimization algorithms s \in S, all of which are applied to a set of well-defined optimization problems p \in P. For each pair (p, s), the performance measure t_{p,s} denotes the number of function evaluations that are required for optimization algorithm s applied to problem p to satisfy some convergence criterion. Thus, d_s(\kappa) is the fraction of problems that can be solved within the budget of function calls parameterized by \kappa. The budget is normalized as \kappa(n_p + 1), where n_p denotes the dimensionality of the parameter domain in problem p. This is an attempt to account for some of the complexity due to a larger number of parameters in the problem. This dimensional scaling is only approximate and motivated by the n_p + 1 function evaluations required to compute a simplex gradient, i.e. an n_p-dimensional triangle of function evaluations.
Each combination of starting point and objective function (plus other possible constraints) constitutes a separate problem p. The data profile is a monotonically increasing function between zero and one, and a large value of d_s(\kappa) for small values of \kappa is better. In line with Ref. [30] we employ the convergence criterion

f(x_0) - f(x) \ge (1 - \tau)\,(f(x_0) - f_L),   (8)

which is fulfilled for any x where the initial function value f(x_0) is reduced by at least 1 - \tau times the best possible reduction f(x_0) - f_L. We will set f_L to be the lowest value of f achieved by any solver s \in S. Although we will come close to the true solution in a few cases, it is highly unlikely that any derivative-free solver will arrive at the exact minimum. For that, one would typically have to resort to gradient-based optimization algorithms. With BayesOpt, we will consider a 90% reduction in the function value as a sign of convergence, i.e. set \tau = 0.1. For the purpose of comparing BayesOpt with other derivative-free optimization algorithms we will occasionally use a smaller value of \tau. When the objective functions are represented by known test functions (described in Sec. 4 and A), we will set f_L to the known minimum value. Throughout this study we will only allow a maximum of 250 function calls per solver and problem, and set t_{p,s} = \infty in case the convergence criterion is not fulfilled. This upper limit should cover most scenarios that would call for the use of BayesOpt.
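The bookkeeping behind Eqs. (7) and (8) is simple to implement. The sketch below (with our own notation) counts, for each trial budget, the fraction of problems a solver managed to converge:

```python
import numpy as np

def data_profile(t, n_dims, kappas):
    """d_s(kappa) of Eq. (7): fraction of problems solved within
    kappa * (n_p + 1) function evaluations. t[p] is the number of
    evaluations the solver needed on problem p (np.inf if the
    convergence test was never met)."""
    t = np.asarray(t, float)
    n = np.asarray(n_dims, float)
    return np.array([float((t <= k * (n + 1.0)).mean()) for k in kappas])

def converged(f_x0, f_x, f_L, tau=0.1):
    """Convergence test of Eq. (8)."""
    return f_x0 - f_x >= (1.0 - tau) * (f_x0 - f_L)
```

With \tau = 0.1, a run that reduces the objective from 10 to 0.9 (with f_L = 0) passes the test, while a reduction to 1.1 does not.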
3.1 Selecting parameter starting points in \mathcal{X}
A multidimensional parameter domain \mathcal{X} of the objective function leads to the increasingly likely existence of multiple points corresponding to local and/or global minimizers. In practice, the choice of initial point, i.e. where we start the optimization run, will determine the local minimum that the optimizer converges to. When benchmarking, it is therefore necessary to start each solver from several different starting points in \mathcal{X}. In this work, we use an identical set of 2D starting points for all optimizers in a D-dimensional domain. This is somewhat motivated by Wendel's theorem [31], which states that for N random points on the sphere in D dimensions, the probability p that they all lie on some hemisphere is given by

p = 2^{1-N} \sum_{k=0}^{D-1} \binom{N-1}{k}.   (9)

Thus for N = 2D random points on the sphere, p = 1/2. Although we do not distribute points on a hypersphere, we use this result to motivate a rule of thumb. It should be pointed out that some authors argue for a slightly more expensive rule of thumb for selecting starting points when initializing computer simulations [32].
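Equation (9) is easy to evaluate numerically, which makes it convenient for checking rules of thumb such as N = 2D:

```python
from math import comb

def hemisphere_probability(n_points, dim):
    """Wendel's theorem, Eq. (9): probability that n_points random
    points on the sphere in dim dimensions all lie on some hemisphere,
    p = 2**(1 - N) * sum_{k=0}^{D-1} C(N - 1, k)."""
    return 2.0 ** (1 - n_points) * sum(comb(n_points - 1, k) for k in range(dim))
```

For N = 2D the sum covers exactly half of the binomial coefficients of order N - 1, which is why the probability evaluates to exactly 1/2 in any dimension.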
A priori, we do not distinguish between different parts of the parameter domain, so we select the starting points using a space-filling algorithm in the form of a quasi-random Sobol sequence, for which we employ a Python implementation. The mathematical underpinnings of the Sobol sequence are provided in the original paper [33] and a discussion related to the numerical implementation is given in e.g. Ref. [34]. This is a so-called low-discrepancy sequence, which in fact is the opposite of a random sequence: it is designed to generate each successive sample point as far away as possible from all existing sample points. This tends to sample the space more uniformly than pseudo-random numbers, at least for lower-dimensional domains; we will not go beyond D = 12 in any part of this work. Although the Sobol sequence can exhibit gaps in multidimensional spaces, it has several advantages. In addition to fast generation, a Sobol sequence is straightforward to augment with additional sample points. We remind the reader that Latin hypercube sampling (LHS) [35], a different kind of algorithm for generating space-filling samples, is in most cases not easy to augment while preserving the Latin hypercube structure. The space-filling differences between Sobol, LHS, and conventional pseudo-random numbers are illustrated in Fig. 4.
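As an illustration, modern SciPy ships such generators in its scipy.stats.qmc module; this snippet (using SciPy rather than the implementation employed in this work) draws an unscrambled Sobol design and scales it to a target box:

```python
from scipy.stats import qmc

# Sketch using SciPy's quasi-Monte Carlo module; scramble=False gives
# the plain (unscrambled) Sobol sequence.
sobol = qmc.Sobol(d=2, scramble=False)
unit_points = sobol.random(8)                    # 8 points in [0, 1)^2
points = qmc.scale(unit_points, l_bounds=[-5.0, -5.0], u_bounds=[5.0, 5.0])
```

Calling sobol.random(n) again continues the sequence, which is precisely the augmentation property discussed above; a Latin hypercube design offers no such guarantee.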
4 Analyzing a set of test functions
Before we tackle an optimization problem in nuclear physics, we will explore and benchmark the performance of BayesOpt on a set of test functions. To this end, we have selected a set of six multivariate and continuous functions f, each defined on some compact domain \mathcal{X}. The functions are defined for any dimensionality D, but we will only consider a few representative values of D. The set of test functions covers a range of different spatial characteristics. Two-dimensional graphical representations are shown in Figs. 5 and 6, with explicit expressions given in A.
A comparison of two or more optimization algorithms on a finite set of test functions is neither fair nor conclusive. Indeed, even if one algorithm (or class of methods) appears to be more successful in finding optima, it is clear from the no-free-lunch theorem in optimization that the averaged performance of any two algorithms on the set of all possible functions is the same. Here, we merely set out to compare and analyze the performance of BayesOpt on a limited set of characteristically different continuous test functions.
We now turn our attention to the resulting data profiles when applying BayesOpt to find the minimizer of each of these test functions. In total, we will analyze 12 BayesOpt algorithms composed by combining three kernels, two acquisition functions, and the use or non-use of ARD. We use a Sobol sequence to initiate each BayesOpt algorithm at 2D different starting points in each D-dimensional parameter domain. Remember that we refer to each specific combination of starting point and test function as a problem. Combined with the different solvers (optimization algorithms with specific settings) this is a rather large dataset, and we analyze it from several different angles. The data profiles for the different versions of BayesOpt applied to all six test functions in the lowest-dimensional domains, see Fig. 7 (top row), indicate that a majority of the problems are converged at the \tau = 0.1 level within 50 function evaluations. The performances of the expected improvement and the lower confidence-bound acquisition functions are very similar, although expected improvement exhibits a slightly better and more coherent performance across all kernels, and at 150 function evaluations more than 80% of the problems are converged.
ARD, i.e. learning an individual length-scale hyperparameter for each direction in the kernel, is often an efficient way of pruning irrelevant features. We find that exploiting ARD increases the performance of BayesOpt once data from a sufficient number of function evaluations has informed the kernel about the possible anisotropic structure of the objective function. This becomes even clearer if we enforce a stronger convergence criterion with a smaller \tau, see Fig. 7 (bottom row). There seems to be a slight advantage of using ARD with the expected improvement acquisition function.
For this set of functions, the Matern kernels perform slightly better than the squared exponential one. In general, the Matern kernels are better tailored to non-smooth objectives. This becomes even clearer when we study the performance of all BayesOpt algorithms on the Ackley function (one deep hole on a semi-flat surface with several periodic shallow local minima), see Fig. 8.
From the characteristics of the data profiles shown in Fig. 8, it is obvious that the exploratory nature of the lower confidence-bound acquisition function is highly advantageous for finding the global minimum hiding on a surface covered with local minima. Clearly, it is important to tailor the BayesOpt acquisition function and underlying kernel to the spatial structure of the objective function. This result also reflects the two fundamental and competing aspects of BayesOpt. If we incorporate prior beliefs, or even knowledge, about the objective function, then BayesOpt will perform rather well. On the other hand, the usefulness of BayesOpt will consequently be limited by the arbitrariness and uncertainty of a priori information. This is further complicated by the fact that we typically resort to BayesOpt when we know very little about the objective function in the first place, since it is computationally expensive to evaluate.
Data profiles for the test functions as we increase the dimensionality of the parameter domain beyond that used in Fig. 7 are shown in Fig. 9.
We can conclude that a larger number of parameters provides a significant challenge for BayesOpt, as for all optimization algorithms. Indeed, in the higher-dimensional domains it takes substantially more function evaluations to reach a given data-profile value. The more exploratory lower confidence-bound acquisition function exhibits a slightly larger performance spread with respect to the different kernels. This becomes even more prominent as we increase the dimensionality of the parameter domain further. For the set of test functions that we have employed, it is marginally advantageous to be more exploratory for higher-dimensional objective functions. We also note that reaping the potential benefit of ARD requires data from a larger number of function evaluations. This is natural since ARD introduces more hyperparameters that need to be determined. For all problems in the highest-dimensional domains, BayesOpt with Matern kernels converges faster than BayesOpt with a squared exponential kernel. As above, the squared exponential kernel does not capture the high-frequency modes that are present in some of the functions.
5 Bayesian optimization of the nucleonnucleon interaction
A χEFT of the nuclear interaction essentially corresponds to a low-energy parameterization of QCD in a fashion that is consistent with the symmetries of the more fundamental theory. To devise a true EFT description of the nuclear interaction is a major intellectual challenge in ab initio nuclear theory, and several avenues are currently being explored. The physical underpinnings of the objective function we seek to minimize are described in somewhat more detail in B. Regarding the optimization of the LECs, the inclusion of calibration data from more complex atomic nuclei and low-energy nuclear processes in the objective function is currently being considered in the community. Such extensions of the calibration data clearly increase the information content, but do so at the expense of an increased complexity in the computer modeling.
A specific aim of this work is to analyze the performance of BayesOpt for determining the parameters in a χEFT model of the NN interaction. For this, we use the proton-neutron sector of a χEFT at NNLO with 12 parameters and try to find the vector of model parameters that brings the model into agreement with existing experimental data on proton-neutron scattering cross sections. This type of observable is a measure of the probability for an incident particle (a proton) with a given kinetic energy to scatter into a certain solid-angle element due to the mutual interaction with the target particle (a neutron; since free neutrons decay within minutes, the neutron target is in practice a composite material containing neutrons). We have deliberately chosen this class of calibration data since it does not render particularly challenging model evaluations; one evaluation of the full objective function takes merely a couple of seconds on a standard desktop. Still, the complexity of the physical model provides a non-trivial testing ground for assessing nascent applications of BayesOpt in ab initio nuclear physics.
The experimental dataset is composed of groups of measurements, where the jth group consists of N_j measured cross sections \sigma_{ij}^{\exp} with associated random measurement uncertainties \delta_{ij}, for i = 1, \ldots, N_j, together with a common normalization constant n_j and a corresponding experimental systematic error \delta_{n_j}. We employ the measurement errors as reported by each experimenter. Each group of data originates from a specific experiment. We restrict the inclusion of experimental data in the present case to laboratory scattering energies below 75 MeV. This ensures a rather fast (seconds) evaluation of the objective function and therefore enables a more detailed analysis of BayesOpt. Experimental errors across measurement groups are considered independent and identically distributed. The normalization constant, together with its uncertainty, represents the systematic uncertainty of the measurement. For an absolute measurement, the normalization is given by n_j = 1; usually this means that the statistical and systematic errors have been combined in the reported uncertainties. Certain experiments are not normalized at all; instead, only the angular or energy dependence of the cross section was determined. For these groups of data, n_j is solved for in the optimization by minimizing the discrepancy between the model predictions \sigma_{ij}^{\mathrm{mod}} and the experimental data points \sigma_{ij}^{\exp}. For such freely renormalized data, the optimal n_j is easily obtained in closed form. For practical purposes, the normalization error \delta_{n_j} can be considered infinite in these cases and will therefore not contribute to the objective function.
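For a freely renormalized group, setting the derivative of the group's squared residuals with respect to n_j to zero gives the closed-form solution mentioned above. A sketch, in our own (assumed) notation:

```python
import numpy as np

def optimal_normalization(sigma_exp, sigma_mod, delta):
    """Closed-form n_j for a freely renormalized group: minimizing
    sum_i ((sigma_exp_i - n * sigma_mod_i) / delta_i)**2 over n gives
    the weighted-least-squares ratio below."""
    w = 1.0 / np.asarray(delta, float) ** 2
    se = np.asarray(sigma_exp, float)
    sm = np.asarray(sigma_mod, float)
    return float((w * se * sm).sum() / (w * sm ** 2).sum())
```

If the data are exactly a constant multiple of the model prediction, the formula returns that multiple; otherwise it is the error-weighted best-fit scale factor.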
In summary, we seek to find the parameter vector x that minimizes the deviation between the model and the experimental data, as measured by the objective function defined below:
\chi^2(x) = \sum_j \left[ \sum_{i=1}^{N_j} \left( \frac{\sigma_{ij}^{\exp} - n_j\, \sigma_{ij}^{\mathrm{mod}}(x)}{\delta_{ij}} \right)^{\!2} + \left( \frac{1 - n_j}{\delta_{n_j}} \right)^{\!2} \right].   (10)
This type of (non-linear) least-squares objective function, where we assume normally distributed experimental errors, appears naturally in most parameter-estimation problems. In a setting defined by χEFT it would be natural to also incorporate a theoretical model-discrepancy term motivated by the low-energy EFT expansion itself [5, 8, 9, 36, 37]. However, in this paper we focus on the challenges of mathematical optimization that are associated with a numerically costly objective function, and for simplicity we therefore only incorporate the experimental errors of the data.
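Structurally, Eq. (10) is a double sum over groups and data points. The sketch below shows that structure only; the placeholder `model` stands in for the actual cross-section calculation, which is where all the computational cost resides:

```python
import numpy as np

def chi_squared(x, groups, model):
    """Sketch of the objective in Eq. (10). Each group is a dict with
    measured cross sections, their errors, a normalization n_j and its
    error delta_norm (np.inf for freely renormalized data, so the
    penalty term vanishes). `model(x, group)` is a placeholder for the
    actual cross-section calculation."""
    total = 0.0
    for g in groups:
        residuals = (g["sigma_exp"] - g["norm"] * model(x, g)) / g["delta"]
        total += float((residuals ** 2).sum())
        if np.isfinite(g["delta_norm"]):
            total += ((1.0 - g["norm"]) / g["delta_norm"]) ** 2
    return total
```

A perfect model prediction with n_j = 1 yields zero, and pulling n_j away from unity is penalized only for groups with a finite normalization error.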
We define three different parameter domains for minimizing the objective function: small, medium, and large, see Tab. 1. They differ by the level of included prior knowledge regarding the range of plausible parameter values.
                                   Domains
parameter          small             medium            large
C~_1S0            (-0.2, -0.1)      (-5, +5)          (-5, +5)
C_1S0             (+2, +3)          (-5, +5)          (-5, +5)
C~_3S1            (-0.2, -0.1)      (-5, +5)          (-5, +5)
C_3P0             (-1, +1)          (-5, +5)          (-5, +5)
C_1P1             (-1, +1)          (-5, +5)          (-5, +5)
C_3P1             (-1, +1)          (-5, +5)          (-5, +5)
C_3S1             (-1, +1)          (-5, +5)          (-5, +5)
C_3S1-3D1         (-1, +1)          (-5, +5)          (-5, +5)
C_3P2             (-1, +1)          (-5, +5)          (-5, +5)
c_1               (-0.76, -0.72)    (-0.76, -0.72)    (-5, +5)
c_3               (-3.66, -3.56)    (-3.66, -3.56)    (-5, +5)
c_4               (+2.41, +2.47)    (+2.41, +2.47)    (-5, +5)
The limits of the large parameter domain are based on the most naive estimate without any significant prior information. In contrast, the limits of the medium domain are partly informed by prior data. Specifically, the three parameters (the LECs c_1, c_3, c_4) associated with the long-range part of the nuclear interaction are constrained to ranges determined from a separate analysis of pion-nucleon scattering data [38, 39]. The small domain is further informed by previous experience of typical values for the LECs in the short-ranged contact potential.
The data profiles of the BayesOpt algorithms applied to all domains of the nuclear physics objective function in Eq. 10 are plotted in Fig. 10. Clearly, the expected improvement acquisition function performs slightly better than the lower confidence-bound acquisition function. It is difficult to draw firm conclusions regarding an optimal choice of kernel. In contrast to the test-function study, the squared exponential kernel now seems to work a little bit better than the Matern 5/2 kernel. This result is also somewhat surprising since the performance of the squared exponential kernel is nearly identical to that of the Matern 3/2 kernel.
When we analyze the performance of BayesOpt in each of the three parameter domains separately (small, medium, and large; see Fig. 11), we note that the medium and large domains stand out. In the former, the expected improvement acquisition function with the Matern 3/2 ARD kernel reaches 50% performance already after 50 iterations. This is the top-performing BayesOpt algorithm for the nuclear physics objective function. In the small and large domains the Matern 3/2 kernel performs best without ARD. Since BayesOpt works best with a sensible prior, this suggests rather short correlation lengths. In the large domain, the lower confidence-bound acquisition function shows a tendency to perform better than expected improvement. This is perhaps not too surprising, since the parameter domain is large enough to benefit from substantial exploration. Clearly, the ARD kernels require much more data in larger spaces in order to prune irrelevant features of the objective function.
5.1 Comparison with the POUNDERs algorithm
To facilitate a benchmark and further analysis of the strengths and weaknesses of BayesOpt, we have also optimized the nuclear physics objective function using the POUNDERs (Practical Optimization Using No Derivatives for sums of Squares) algorithm [25]. This is a well-tested derivative-free algorithm for optimizing nonlinear least-squares problems. Indeed, it has been applied successfully in basic nuclear physics research for almost a decade [26, 27]. The key feature of POUNDERs is that it assumes a least-squares structure of the objective function, i.e. that it consists of a sum of squared residuals written as
f(x) = Σ_i r_i(x)²   (11)
Tailoring an optimization algorithm to exploit this mathematical structure, i.e. that each term is a squared function of the parameters x, is very fruitful. A quadratic model,

m(x) = f(x_k) + gᵀ(x − x_k) + ½ (x − x_k)ᵀ H (x − x_k),   (12)

for an objective function f at the current iterate x_k, with g = ∇f(x_k) and H = ∇²f(x_k), is a very common choice. If the corresponding derivatives are known, the subproblem of minimizing m can be solved quite efficiently. However, the derivatives ∇f and ∇²f are considered unavailable for the present problems, and only function values f(x_j) and residual values r_i(x_j) for some set of points {x_j} can be accessed. POUNDERs sets up a quadratic model for each residual in Eq. 11,

q_i(x) = r_i(x_k) + g_iᵀ(x − x_k) + ½ (x − x_k)ᵀ H_i (x − x_k),   (13)

centered around the current iterate x_k. The model for each residual is thus defined by the parameters r_i(x_k), g_i, and H_i. These model parameters are determined by solving an interpolation problem, see Ref. [25] and references therein. Once the model parameters are obtained, they can be used to approximate the derivatives of the objective function. In principle, the first- and second-order derivatives of f with respect to x are given by ∇f = 2 Σ_i r_i ∇r_i and ∇²f = 2 Σ_i (∇r_i ∇r_iᵀ + r_i ∇²r_i), respectively. Consequently, POUNDERs sets up a master model for the objective function,

m(x) = f(x_k) + 2 Σ_i r_i(x_k) g_iᵀ(x − x_k) + (x − x_k)ᵀ { Σ_i [ g_i g_iᵀ + r_i(x_k) H_i ] } (x − x_k).   (14)
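Assembling the master-model derivatives from the residual models can be sketched as follows. This is a simplified illustration of the sum-of-squares calculus, not the actual POUNDERs implementation; the function name and array layout are our own.

```python
import numpy as np

def master_model_derivatives(r, G, H):
    """Given residual values r_i(x_k) (shape [m]), residual gradients g_i
    (rows of G, shape [m, n]) and residual Hessians H_i (shape [m, n, n]),
    assemble the gradient and Hessian of f = sum_i r_i^2 at x_k."""
    grad = 2.0 * G.T @ r                                     # 2 Σ_i r_i g_i
    hess = 2.0 * (G.T @ G + np.einsum('i,ijk->jk', r, H))    # 2 Σ_i (g_i g_iᵀ + r_i H_i)
    return grad, hess
```

For linear residuals the Hessians H_i vanish and the expression reduces to the familiar Gauss-Newton approximation 2 GᵀG.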
It should be noted that the master model itself does not interpolate the objective over the interpolation set .
POUNDERs is a trust-region method. The master model is believed to approximate f in a neighborhood B_k of the current iterate x_k, where

B_k = { x : ‖x − x_k‖ ≤ Δ_k }.   (15)

The master model is therefore minimized over B_k with radius Δ_k. The solution to this subproblem determines the next iterate x_{k+1}, and a new trust-region radius Δ_{k+1} is determined based on how good the model prediction was; see Ref. [25] for a full specification of the algorithm. We run POUNDERs in its default mode, meaning that we only have to input the objective function, a starting point, some initial trust-region radius Δ_0, and the lower and upper bounds of the parameter domain. We also scale the problem such that the bounds correspond to a unit hypercube. This scaling is not performed in BayesOpt.
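The rescaling of the parameter bounds to a unit hypercube can be sketched as below (the function names are illustrative, not those of any particular library):

```python
import numpy as np

def to_unit_cube(x, lower, upper):
    """Map a point from the box [lower, upper] to the unit hypercube."""
    return (x - lower) / (upper - lower)

def from_unit_cube(z, lower, upper):
    """Inverse map: from the unit hypercube back to the physical box."""
    return lower + z * (upper - lower)
```

Working in the unit cube makes a single trust-region radius meaningful across parameters of very different natural scales.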
In Fig. 12 we present the data profiles for POUNDERs applied to the physics objective function in the small, medium, and large domains and compare with BayesOpt. The results are only compared with the expected improvement acquisition function, as it was shown in Fig. 11 to perform significantly better than the lower confidence-bound acquisition function for this optimization problem. We remind the reader that all algorithms are initiated from an identical set of 24 starting points for each parameter domain.
As seen clearly in Fig. 12, the choice of initial trust-region radius determines the performance of POUNDERs. Setting the initial radius too small hampers the POUNDERs algorithm by trapping it in a shallow local minimum. This is not an issue with BayesOpt, which continues to improve as more and more function values extend the data vector. As we have already noted several times, the overall performance of BayesOpt depends crucially on the prior. We also note that when POUNDERs performs well, it does so within rather few function evaluations, but it halts once it is trapped in a local minimum. The advantages of BayesOpt are most prominent when optimizing over the medium domain. Regardless of kernel, BayesOpt performs rather well even with few function evaluations. In the large domain, the good performance of POUNDERs clearly indicates the usefulness of encoding prior knowledge about the mathematical structure of the objective function. There is likely some large-scale structure in the objective function that the BayesOpt kernel would benefit from learning. It would therefore probably be advantageous to amend the BayesOpt prior with a polynomial regression term fitted to the first few evaluations of the objective function.
BayesOpt is not intended for pinpointing the exact location of an optimum. In the neighborhood of an optimum, most objective functions can be well approximated by a quadratic polynomial. For this reason, POUNDERs will always outperform BayesOpt as the tolerance τ in the convergence criterion given in Eq. 8 is decreased. For τ = 10⁻², i.e. a convergence criterion corresponding to a 99% reduction of the objective function, BayesOpt only reaches a limited performance, whereas POUNDERs can approach full performance for an optimal choice of the initial trust-region radius Δ_0, see Fig. 13.
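Assuming the Moré-Wild benchmarking convention [30] for the convergence criterion (we state the form here as an assumption, since Eq. 8 is defined earlier in the paper), the convergence test used in such data profiles can be sketched as:

```python
def converged(f_x, f_x0, f_best, tau):
    """Moré-Wild style convergence test: has the achieved reduction
    reached a fraction (1 - tau) of the best attainable reduction?
    tau = 1e-2 corresponds to a 99% reduction of the objective,
    measured from the starting value f_x0 toward the best value f_best."""
    return f_x <= f_best + tau * (f_x0 - f_best)
```

A solver "solves" the problem for a given budget if any of its iterates satisfies this test; the data profile then reports the fraction of solved problem instances as a function of the number of function evaluations.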
6 Summary and Outlook
Some of the most interesting optimization problems in nuclear physics, as well as in other fields of science, typically involve computationally expensive objective functions defined on multidimensional parameter domains. Moreover, derivatives with respect to those parameters are usually not accessible.
In this paper we explore the potential benefits of BayesOpt (Sec. 2) for efficiently exploring the parameter space of an EFT (Appendix B) in computationally complex circumstances. A local minimum with realistic physical properties in this parameter domain allows for numerical simulations of atomic nuclei and therefore improves our understanding of the origin, evolution, and structure of all matter in the universe. The underlying optimization problem is well known in the nuclear physics research community, and several classes of numerical optimization algorithms have already been employed: derivative-based [5] as well as derivative-free approaches [27].
BayesOpt presupposes a prior on the objective function, usually in the form of a Gaussian process (GP). The original optimization challenge is transformed into a design problem that boils down to choosing an appropriate acquisition function to facilitate an exploration-exploitation balance. This choice is encoded in a utility function that decides where to collect training data for the GP. Several choices of kernels and utility functions exist. Our initial studies of BayesOpt applied to a set of six test functions clearly demonstrate the importance of a sensible prior assumption on the objective function, see Fig. 7. From this analysis it is also clear that BayesOpt performs rather well in low-dimensional parameter domains. It turns out that the choice of acquisition function is even more important than the choice of kernel, see Fig. 8. We see this also when we study the data profiles of BayesOpt applied to the nuclear physics objective, see Fig. 11.
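A minimal BayesOpt loop of the kind studied here can be sketched with an off-the-shelf GP regressor. This illustration uses scikit-learn and random candidate points instead of a dedicated inner optimizer, and is not the GPyOpt [15] setup used in the paper; the function and variable names are our own.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X_cand, gp, y_best):
    """EI for minimization: large where the GP predicts values below the
    current best, or where the predictive uncertainty is large."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(f, bounds, n_init=24, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))   # initial design
    y = np.array([f(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        # random candidates stand in for a proper acquisition optimizer
        cand = rng.uniform(lo, hi, size=(2000, len(bounds)))
        x_next = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X, y

# usage: minimize the Sphere function over (-5, 5)^2
X, y = bayes_opt(lambda x: np.sum(x**2), [(-5, 5)] * 2)
print(y.min())
```

Swapping `Matern(nu=2.5)` for another kernel, or EI for a lower confidence bound, reproduces the design choices compared in the text.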
Our main findings and conclusions can be summarized as follows:

- In general, BayesOpt will never find a narrow minimum, nor be useful for extracting the exact location of any optimum. For that to work, in anything but a trivial case, it is necessary to have detailed information about the objective function and to successfully encode this prior knowledge into the algorithm. This is a situation that we typically do not have, since BayesOpt is applied to computationally expensive objective functions. Instead, BayesOpt might find use as a first stage in a hierarchical optimization protocol to identify an interesting region of the parameter domain. It might also be advantageous to design an acquisition function that is more explorative during the first iterations and then switches to an acquisition function that exploits more than it explores.

- When we compare with the POUNDERs algorithm (Sec. 5.1), a derivative-free optimization algorithm that successfully incorporates the squared-sum structure of the objective function, we find that BayesOpt in ab initio nuclear physics would probably benefit from a prior with a polynomial regression term to efficiently capture the large-scale structure of the objective function.

- We find that the choice of acquisition function is more important than the specific form of the kernel. For the present case, the expected improvement acquisition function performed slightly better than the lower confidence bound in smaller parameter domains, while more exploration, as achieved with the lower confidence-bound acquisition function, was shown to be beneficial in larger domains.

- The kernel can be improved with ARD tuning of the hyperparameters. However, this feature is only useful if a minimum number of iterations can be afforded. In fact, the ARD kernels require significantly more data in larger spaces in order to prune irrelevant features of the objective function.

- Although we employ an objective function consisting of independent scattering cross sections, all of them with similar and low computational cost, a multi-fidelity scenario is equally probable. Consider, e.g., calibrating an EFT model of the nuclear interaction using scattering cross sections together with data on bound states in multi-nucleon systems such as isotopes of oxygen and calcium. In such scenarios, where the computational cost of solving the Schrödinger equation for bound states naively grows exponentially with the number of nucleons, it would be interesting to study the benefits of existing BayesOpt frameworks that can maximize information gathering across multiple functions with varying degrees of accuracy and computational cost; see e.g. Ref. [40] and references therein.
Acknowledgments
The research leading to these results received financial support from the BigData@Chalmers initiative. This research also received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 758027), the Swedish Research Council under Grants No. 2015-00225 and 2017-04234, and the Marie Sklodowska Curie Actions, Cofund, Project INCA 600398. Computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at NSC, Linköping.
Appendix A Test functions
In this appendix we provide the expressions for the six test functions that we used for the initial analysis of BayesOpt.

- Ackley: one hole on a semi-flat surface with shallow local minima.

  f(x) = −20 exp(−0.2 √((1/d) Σ_{i=1}^{d} x_i²)) − exp((1/d) Σ_{i=1}^{d} cos(2π x_i)) + 20 + e   (16)

- Deceptive: a very challenging multivariate test function for which the total size of the region with local minima is far larger than the region with the global minimum. This function also has a large number of local minima.

  (17)

- Rastrigin: spherical function with a cosine modulation that generates frequent local minima.

  f(x) = 10d + Σ_{i=1}^{d} [x_i² − 10 cos(2π x_i)]   (18)

- Rosenbrock: classic test function with the minimum located in a very shallow valley.

  f(x) = Σ_{i=1}^{d−1} [100 (x_{i+1} − x_i²)² + (1 − x_i)²]   (19)

- Schwefel: smooth surface with several local minima and a global minimum located far away in a corner, which in turn is far away from the second-best local minimum.

  f(x) = 418.9829 d − Σ_{i=1}^{d} x_i sin(√|x_i|)   (20)

- Sphere: in many ways the simplest possible test function. It is convex and unimodal.

  f(x) = Σ_{i=1}^{d} x_i²   (21)
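In their standard parametrizations (the paper's exact constants may differ), these test functions can be implemented as:

```python
import numpy as np

def sphere(x):
    return np.sum(x**2)

def rastrigin(x):
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def rosenbrock(x):
    return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

def ackley(x):
    d = x.size
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / d) + 20 + np.e)

def schwefel(x):
    return 418.9829 * x.size - np.sum(x * np.sin(np.sqrt(np.abs(x))))
```

The global minima sit at the origin (Sphere, Rastrigin, Ackley), at x_i = 1 (Rosenbrock), and near x_i ≈ 420.97 (Schwefel), which makes the Schwefel corner minimum particularly hard to find.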
Appendix B Physics model: EFT for neutron-proton scattering at NNLO
In the interest of keeping this paper somewhat self-contained, we briefly review the formalism that underpins the physics model we are optimizing. To that end, we include a selected set of expressions that are representative of the full theoretical framework for computing neutron-proton (np) scattering cross sections starting from a potential description in EFT. Exhaustive reviews of EFT are given in [21, 22, 23]. The model evaluation of an np scattering cross section, to be confronted with an experimentally determined value at a given on-shell momentum, proceeds in three steps:

1. compute the momentum-space proton-neutron interaction potential;

2. use the potential to compute the quantum-mechanical scattering amplitude by solving the non-relativistic Lippmann-Schwinger equation;

3. use the amplitude to compute a model value for the scattering cross section by evaluating the spin-scattering matrix.
B.1 The interaction potential at NNLO in EFT
The interaction potential in EFT consists of non-polynomial terms that describe the long-range part, corresponding to pion exchanges, and contact-type interactions given by polynomial expressions corresponding to a short-range part:

V(p′, p) = V_π(p′, p) + V_ct(p′, p),   (22)

where p′ and p denote the final and initial nucleon momenta in the center-of-mass system (CMS), and the special case |p′| = |p| corresponds to an on-shell momentum. The potential can be written as a systematic expansion, with higher-order terms being less important than lower-order ones. In this work we employ a potential including terms up to NNLO in EFT. This means that there are terms also at leading order (LO) and next-to-leading order (NLO). In general, at higher orders there are more pion exchanges and higher-order polynomial terms in the momenta that flow through the contact diagrams. At LO, the one-pion exchange potential is given by
V_1π(p′, p) = − (g_A² / (4 f_π²)) (τ₁·τ₂) (σ₁·q)(σ₂·q) / (q² + m_π²),   (23)

where q = p′ − p is the momentum transfer, σ and τ are the spin and isospin operators of nucleons 1 and 2, and g_A, f_π, and m_π denote the axial-vector coupling constant, the pion decay constant, and the pion mass, respectively. We use fixed values of f_π (in MeV) and g_A throughout this work. Higher-order corrections to one-pion exchange renormalize these coupling constants, and the LO long-ranged part is considered parameter-free in this work. Up to NNLO, leading and subleading two-pion exchange enters, and the long-ranged part of the interaction is given by
(24) 
where q = |q| and the loop functions are here written as
(25) 
and the so-called spectral function cutoff is set to a fixed value. The long-ranged part contains three of the 12 unknown LECs that we seek to constrain using BayesOpt; these appear in Eq. (24). The remaining nine LECs control the short-ranged part of the NNLO potential, which can be written as a linear combination of terms polynomial in the initial and final momenta
(26) 
B.2 Proton-neutron scattering observables
Proton-neutron elastic scattering observables are calculated from the spin-scattering matrix [41, 42]. This is a matrix in spin space that operates on the initial state to give the scattered part of the final state. It is related to the conventional scattering matrix through a relation involving the relative momentum between the nucleons. The matrix for a scattering channel with given angular momentum can be parameterized by the Stapp phase shifts [43]. The Stapp phase shifts are calculated from the potential by solving the Lippmann-Schwinger equation. This equation describes quantum-mechanical scattering and is an integral equation of the Fredholm type that can be solved as a matrix equation. In our application, and for each value of the on-shell momentum, this amounts to inverting a 200-by-200 matrix followed by a matrix-vector multiplication. The matrix inversion prevents linearizing this particular EFT model in its parameters. Although the matrix inverse is not particularly time-consuming in the present case, it should be pointed out that the complexity of the corresponding quantum-mechanical equations for describing scattering states, or bound states, of more than two nucleons typically scales exponentially with the particle number.
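Once the potential has been evaluated on a momentum mesh, the discretized Lippmann-Schwinger equation T = V + V G₀ T reduces to a single linear solve for the T-matrix. A sketch with toy matrices (the physical kernel and mesh weights are not reproduced here):

```python
import numpy as np

def solve_lippmann_schwinger(V, G0):
    """Solve the discretized Lippmann-Schwinger equation T = V + V G0 T
    for the T-matrix: T = (I - V G0)^{-1} V. V and G0 are N x N matrices
    on a momentum mesh; here they are random stand-ins, not the physical
    potential and free propagator."""
    N = V.shape[0]
    return np.linalg.solve(np.eye(N) - V @ G0, V)

# toy check on a 200-point mesh, matching the matrix size quoted in the text
rng = np.random.default_rng(1)
V = 0.1 * rng.standard_normal((200, 200))
G0 = np.diag(0.1 * rng.standard_normal(200))
T = solve_lippmann_schwinger(V, G0)
assert np.allclose(T, V + V @ G0 @ T)   # T satisfies the original equation
```

The matrix inverse hidden in `np.linalg.solve` is exactly what prevents the model from being linear in its parameters, since the LECs enter V inside the inverted operator.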
References
[1] V. G. J. Stoks, R. A. M. Klomp, M. C. M. Rentmeester, and J. J. de Swart. Partial-wave analysis of all nucleon-nucleon scattering data below 350 MeV. Phys. Rev. C, 48:792–815, Aug 1993.
[2] Robert B. Wiringa, V. G. J. Stoks, and R. Schiavilla. An accurate nucleon-nucleon potential with charge-independence breaking. Phys. Rev. C, 51:38–51, 1995.
[3] Doron Gazit, Sofia Quaglioni, and Petr Navratil. Three-Nucleon Low-Energy Constants from the Consistency of Interactions and Currents in Chiral Effective Field Theory. Phys. Rev. Lett., 103:102502, 2009.
[4] P. Reinert, H. Krebs, and E. Epelbaum. Semilocal momentum-space regularized chiral two-nucleon potentials up to fifth order. Eur. Phys. J. A, 54(5):86, 2018.
[5] B. D. Carlsson, A. Ekström, C. Forssén, D. Fahlin Strömberg, G. R. Jansen, O. Lilja, M. Lindby, B. A. Mattsson, and K. A. Wendt. Uncertainty Analysis and Order-by-Order Optimization of Chiral Nuclear Interactions. Physical Review X, 6(1):011019, February 2016.
[6] M. Piarulli, L. Girlanda, R. Schiavilla, R. Navarro Pérez, J. E. Amaro, and E. Ruiz Arriola. Minimally nonlocal nucleon-nucleon potentials with chiral two-pion exchange including Δ resonances. Phys. Rev. C, 91(2):024003, 2015.
[7] Serdar Elhatisari, Ning Li, Alexander Rokash, Jose Manuel Alarcón, Dechuan Du, Nico Klein, Bingnan Lu, Ulf-G. Meißner, Evgeny Epelbaum, Hermann Krebs, Timo A. Lähde, Dean Lee, and Gautam Rupak. Nuclear Binding Near a Quantum Phase Transition. Physical Review Letters, 117(13):132501, September 2016.
[8] J. A. Melendez, S. Wesolowski, and R. J. Furnstahl. Bayesian truncation errors in chiral effective field theory: Nucleon-nucleon observables. Phys. Rev. C, 96:024003, Aug 2017.
[9] E. Epelbaum, H. Krebs, and U.-G. Meißner. Precision nucleon-nucleon potential at fifth order in the chiral expansion. Phys. Rev. Lett., 115:122301, Sep 2015.
[10] S. Wesolowski, N. Klco, R. J. Furnstahl, D. R. Phillips, and A. Thapaliya. Bayesian parameter estimation for effective field theories. Journal of Physics G: Nuclear and Particle Physics, 43(7):074001, 2016.
[11] A. Ekström, G. R. Jansen, K. A. Wendt, G. Hagen, T. Papenbrock, B. D. Carlsson, C. Forssén, M. Hjorth-Jensen, P. Navratil, and W. Nazarewicz. Accurate nuclear radii and binding energies from a chiral interaction. Physical Review C, 91(5):051301, May 2015.
[12] V. Lapoux, V. Somà, C. Barbieri, H. Hergert, J. D. Holt, and S. R. Stroberg. Radii and Binding Energies in Oxygen Isotopes: A Challenge for Nuclear Forces. Physical Review Letters, 117(5):052501, July 2016.
[13] G. Hagen, M. Hjorth-Jensen, G. R. Jansen, and T. Papenbrock. Emergent properties of nuclei from ab initio coupled-cluster calculations. Physica Scripta, 91(6):063006, May 2016.
[14] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS'12, pages 2951–2959, USA, 2012. Curran Associates Inc.
[15] The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt, 2016.
[16] H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, Transactions of the ASME, 86(1):97–106, January 1964.
[17] Jonas Mockus. Bayesian Approach to Global Optimization: Theory and Applications. Mathematics and Its Applications (Soviet Series). Springer, Dordrecht, 1989.
[18] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, Dec 1998.
[19] Diana M. Negoescu, Peter I. Frazier, and Warren B. Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
[20] P. Ilten, M. Williams, and Y. Yang. Event generator tuning using Bayesian optimization. Journal of Instrumentation, 12(04):P04028, 2017.
[21] P. F. Bedaque and U. van Kolck. Effective field theory for few-nucleon systems. Annual Review of Nuclear and Particle Science, 52(1):339–396, 2002.
[22] E. Epelbaum, H.-W. Hammer, and Ulf-G. Meißner. Modern theory of nuclear forces. Rev. Mod. Phys., 81:1773–1825, Dec 2009.
[23] R. Machleidt and D. R. Entem. Chiral effective field theory and nuclear forces. Physics Reports, 503(1):1–70, 2011.
[24] R. Navarro Pérez, J. E. Amaro, and E. Ruiz Arriola. Coarse-grained potential analysis of neutron-proton and proton-proton scattering below the pion production threshold. Phys. Rev. C, 88:064002, Dec 2013.
[25] Stefan M. Wild. Solving derivative-free nonlinear least squares problems with POUNDERS. In Tamás Terlaky, Miguel F. Anjos, and Shabbir Ahmed, editors, Advances and Trends in Optimization with Engineering Applications, pages 529–540. SIAM, 2017.
[26] M. Kortelainen, T. Lesinski, J. Moré, W. Nazarewicz, J. Sarich, N. Schunck, M. V. Stoitsov, and S. Wild. Nuclear energy density optimization. Phys. Rev. C, 82:024313, Aug 2010.
[27] A. Ekström, G. Baardsen, C. Forssén, G. Hagen, M. Hjorth-Jensen, G. R. Jansen, R. Machleidt, W. Nazarewicz, T. Papenbrock, J. Sarich, and S. M. Wild. Optimized chiral nucleon-nucleon interaction at next-to-next-to-leading order. Phys. Rev. Lett., 110:192502, May 2013.
[28] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.
[29] Jian Wu, Matthias Poloczek, Andrew G. Wilson, and Peter Frazier. Bayesian optimization with gradients. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5267–5278. Curran Associates, Inc., 2017.
[30] Jorge J. Moré and Stefan M. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20(1):172–191, 2009.
[31] J. G. Wendel. A problem in geometric probability. Mathematica Scandinavica, 11:109–112, 1962.
[32] Jason L. Loeppky, Jerome Sacks, and William J. Welch. Choosing the sample size of a computer experiment: A practical guide. Technometrics, 51(4):366–376, 2009.
[33] I. M. Sobol'. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967.
[34] Paul Bratley and Bennett L. Fox. Algorithm 659: Implementing Sobol's quasi-random sequence generator. ACM Trans. Math. Softw., 14(1):88–100, March 1988.
[35] M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.
[36] E. Epelbaum, H. Krebs, and U.-G. Meißner. Improved chiral nucleon-nucleon potential up to next-to-next-to-next-to-leading order. The European Physical Journal A, 51(5):53, May 2015.
[37] R. J. Furnstahl, N. Klco, D. R. Phillips, and S. Wesolowski. Quantifying truncation errors in effective field theory. Physical Review C, 92(2):024005, August 2015.
[38] Martin Hoferichter, Jacobo Ruiz de Elvira, Bastian Kubis, and Ulf-G. Meißner. Matching Pion-Nucleon Roy-Steiner Equations to Chiral Perturbation Theory. Physical Review Letters, 115(19):192301, November 2015.
[39] Martin Hoferichter, Jacobo Ruiz de Elvira, Bastian Kubis, and Ulf-G. Meißner. Roy-Steiner-equation analysis of pion-nucleon scattering. Physics Reports, 625:1–88, April 2016.
[40] Jialin Song, Yuxin Chen, and Yisong Yue. A general framework for multi-fidelity Bayesian optimization with Gaussian processes, 2018.
[41] J. Bystricky, F. Lehar, and P. Winternitz. Formalism of nucleon-nucleon elastic scattering experiments. J. Phys. France, 39(1):1–32, 1978.
[42] P. La France and P. Winternitz. Scattering formalism for nonidentical spinor particles. J. Phys. France, 41(12):1391–1417, 1980.
[43] H. P. Stapp, T. J. Ypsilantis, and N. Metropolis. Phase-shift analysis of 310-MeV proton-proton scattering experiments. Phys. Rev., 105:302–310, Jan 1957.