## 1 Introduction

In materials design and discovery, we face the problem of choosing the chemical structure, composition, or processing conditions of a material to meet design criteria. The traditional approach is to use iterative trial and error, in which we (1) choose some material design that we think will work well based on intuition, past experience, or theoretical knowledge; (2) synthesize and test the material in physical experiments; and (3) use what we learn from these experiments in choosing the material design to try next. This iterative process is repeated until some combination of success and exhaustion is achieved.

While trial and error has been extremely successful, we believe that mathematics and computation together promise to accelerate the pace of materials discovery, not by changing the fundamental iterative nature of materials design, but by improving the choices that we make about which material designs to test, and by improving our ability to learn from previous experimental results.

In this chapter, we describe a collection of mathematical techniques, based on Bayesian statistics and decision theory, for augmenting and enhancing the trial and error process. We focus on one class of techniques, called Bayesian optimization (BO), or Bayesian global optimization (BGO), which use machine learning to build a predictive model of the underlying relationship between the design parameters of a material and its properties, and then use decision theory to suggest which design or designs would be most valuable to try next. The most well-developed Bayesian optimization methods assume that (1) the material is described by a vector of continuous variables, as is the case, e.g., when choosing ratios of constituent compounds, or choosing a combination of temperature and pressure to use during manufacture; (2) we have a single measure of quality that we wish to make as large as possible; and (3) the constraints on feasible materials designs are all , so that any unknown constraints are incorporated into the quality measure. There is also a smaller body of work on problems that go beyond these assumptions, either by considering discrete design decision (such as small molecule design), multiple competing objectives, or by explicitly allowing unknown constraints.

Bayesian optimization was pioneered by [33], with early development through the 1970s and 1980s by Mockus and Zilinskas [37, 36]. Development in the 1990s was marked by the popularization of Bayesian optimization by Jones, Schonlau, and Welch, who, building on previous work by Mockus, introduced the Efficient Global Optimization (EGO) method [28]. This method became quite popular and well-known in engineering, where it has been adopted for design applications involving time-consuming computer experiments, within a broader set of methods designed for optimization of expensive functions [5]

. In the 2000s, development of Bayesian optimization continued in statistics and engineering, and the 2010s have seen additional development from the machine learning community, where Bayesian optimization is used for tuning hyperparameters of computationally expensive machine learning models

[47]. Other introductions to Bayesian optimization may be found in the tutorial article [6] and textbooks [9, 44], and an overview of the history of the field may be found in [45].We begin in Section 2 by introducing the precise problem considered by Bayesian Optimization. We then describe in Section 3 the predictive technique used by Bayesian Optimization, which is called Gaussian Process (GP) regression. We then show, in Section 4, how Bayesian Optimization recommends which experiments to perform. In Section 5 we provide an overview of software packages, both freely available and commercial, that implement the Bayesian Optimization methods described in this chapter. We offer closing remarks in Section 6.

## 2 Bayesian Optimization

Bayesian optimization considers materials designs parameterized by a -dimensional vector . We suppose that the space of materials designs in which takes values is a known set .

For example, could give the ratio of each of different constituents mixed together to create some aggregate material. In this case, we would choose to be the set . As another example, setting , could give the temperature () and pressure () used in material processing. In this case, we would choose to be the rectangle bounded by the experimental setup’s minimum and maximal achievable temperature on one axis, and , and the minimum and maximum achievable pressure on the other. As a final example, we could let be the temperatures used in some annealing schedule, assumed to be decreasing over time. In this case, we would set to be the set .

Let be the quality of the material with design parameter . The function is unknown, and observing requires synthesizing material design and observing its quality in a physical experiment. We would like to find a design for which is large. That is, we would like to solve

(1) |

This is challenging because evaluating is typically expensive and time-consuming. While the time and expense depends on the setting, synthesizing and testing a new material design could easily take days or weeks of effort and thousands of dollars of materials.

In Bayesian optimization, we use mathematics to build a predictive model for the function based on observations of previous materials designs, and then use this predictive model to recommend a materials design that would be most valuable to test next. We first describe this predictive model in Section 3, which is performed using a machine learning technique called Gaussian process regression. We then describe, in Section 4, how this predictive model is used to recommend which design to test next.

## 3 Gaussian Process regression

The predictive piece of Bayesian optimization is based on a machine learning technique called Gaussian process regression. This technique is a Bayesian version of a frequentist technique called kriging, introduced in the geostatistics literature by South-African mining engineer Daniel Krige [30], and popularized later by Matheron and colleagues [35], as described in [8]. A modern monograph on Gaussian process regression is [41], and a list of software implementing Gaussian process regression may be found at [40].

In Gaussian process regression, we seek to predict based on observations at previously evaluated points, call them . We first treat the case where can be observed exactly, without noise, and then later treat noise in Section 3.5. In this noise-free case, our observations are for .

Gaussian process regression is a Bayesian statistical method, and in Bayesian statistics we perform inference by placing a so-called prior probability distribution

on unknown quantities of interest. The prior probability distribution is often called, more simply, the

prior distribution or, even more simply, the prior. This prior distribution is meant to encode our intuition or domain expertise regarding which values for the unknown quantity of interest are most likely. We then use Bayes rule, together with any data observed, to calculate a posterior probability distribution on these unknowns. For a broader introduction to Bayesian statistics, see the textbook [19] or the research monograph [4].In Gaussian process regression, if we wish to predict the value of at a single candidate point , it is sufficient to consider our unknowns to be the values of at the previously evaluated points, , and the new point at which we wish to predict. That is, we take our unknown quantity of interest to be the vector . We then take our data, which is , and use Bayes rule to calculate a posterior probability distribution on the full vector of interest, , or, more simply, just on .

To calculate the posterior, we must first specify the prior, which Gaussian process regression assumes to be multivariate normal. It calculates the mean vector of this multivariate normal prior distribution using a function, called the mean function and written here as , which takes a single as an argument. It applies this mean function to each of the points to create an -dimensional column vector. Gaussian process regression creates the covariance matrix of the multivariate normal prior distribution using another function, called the covariance function or covariance kernel and written here as , which takes a pair of points as arguments. It applies this covariance function to every pair of points in to create an matrix.

Thus, Gaussian process regression sets the prior probability distribution to,

(2) |

The subscript “0” in and indicate that these functions are relevant to the prior distribution, before any data has been collected.

We now discuss how the mean and covariance functions are chosen, focusing on the covariance function first because it tends to be more important in getting good results from Gaussian process regression.

### 3.1 Choice of covariance function

In choosing the covariance function , we wish to satisfy two requirements.

The first is that it should encode the belief that points and near each other tend to have more similar values for and . To accomplish this, we want the covariance matrix in (2) to have entries that are larger for pairs of points that are closer together, and closer to 0 for pairs of points that are further apart.

The second is that the covariance function should always produce positive semidefinite covariance matrices in the multivariate normal prior. That is, if is the covariance matrix in (2), then we require that for all column vectors (where is assumed to have the appropriate length, ). This requirement is necessary to ensure that the multivariate normal prior distribution is a well-defined probability distribution, because if is multivariate normal with mean vector and covariance matrix

, then the variance of

is , and we require variances to be non-negative.Several covariance functions satisfy these two requirements. The most commonly used is called the squared exponential, or Gaussian kernel, and is given by,

(3) |

This kernel is parameterized by parameters: , and .

The parameter controls how much overall variability there is in the function . We observe that under the prior, the variance of is . Thus, when is large, we are encoding in our prior distribution that is likely to take a larger range of values.

The parameters controls how quickly the function varies with . For example, consider the relationship between some point and another point . When is small (close to ), the covariance between and is , giving a correlation between and of nearly . This reflects a belief that and are likely to be very similar, and that learning the value of will also teach us a great deal about . In contrast, when is large, the covariance between and is nearly , given a correlation between and that is also nearly , reflecting a belief that and are unrelated to each other, and learning something about will teach us little about .

#### Going beyond the squared exponential kernel

There are several other possibilities for the covariance kernel beyond the squared exponential kernel, which encode different assumptions about the underlying behavior of the function . One particularly useful generalization of the squared exponential covariance kernel is the Matérn covariance kernel, which allows more flexibility in modeling the smoothness of .

Before describing this kernel, let be the Euclidean distance between and , but where we have altered the length scale in each dimension by some strictly positive parameter . Then, the squared exponential covariance kernel can be written as, .

With this notation, the Matérn covariance kernel is,

where is the modified Bessel function. If we take the limit as , we obtain the squared exponential kernel ([41], Section 4.2 page 85).

The Matérn covariance kernel is useful because it allows modeling the smoothness of in a more flexible way, as compared with the squared exponential kernel.
Under the squared exponential covariance kernel, the function is infinitely mean-square differentiable^{1}^{1}1Being “mean-square differentiable” at in the direction given by the unit vector means that the limit exists in mean square. Being “-times mean-square differentiable” is defined analogously., which may not be an appropriate assumption in many applications.
In contrast, under the Matérn covariance kernel, is -times mean-square differentiable if and only if . Thus, we can model a function that is twice differentiable but no more by choosing , and a function that is once differentiable but no more by choosing .

While the squared exponential and Matérn covariance kernels allow modeling a wide range of behaviors, and together represent a toolkit that will handle a wide variety of applications, there are other covariance kernels. For a thorough discussion of these, see Chapter 4 of [41].

Both the Matérn and squared exponential covariance kernel require choosing parameters. While it certainly is possible for one to choose the parameters and (and in the case of Matérn) based on one’s intuition about , and what kinds of variability is likely to have in a particular application, it is more common to choose these parameters (especially and ) adaptively, so as to best fit previously observed points. We discuss this more below in Section 3.6. First, however, we discuss the choice of the mean function.

### 3.2 Choice of mean function

We now discuss choosing the mean function . Perhaps the most common choice is to simply set the mean function equal to a constant,

. This constant must be estimated, along with parameters of the covariance kernel such as

and , and is discussed in Section 3.6.Beyond this simple choice, if one believes that there will be trends in that can be described in a parametric way, then it is useful to include trend terms into the mean function. This is accomplished by choosing

where are known functions, and , along with , are parameters that must be estimated.

A common choice for the , if one chooses to include them, are polynomials in up to some small order. For example, if , so is two-dimensional, then one might include all polynomials up to second order, , , , , , setting . One recovers the constant mean function by setting .

### 3.3 Inference

Given the prior distribution (2) on , and given (noise-free) observations of …, , the critical step in Gaussian process regression is calculating the posterior distribution on

. We rely on the following general result about conditional probabilities and multivariate normal distributions. Its proof, which may be found in Section

8, relies on Bayes rule and algebraic manipulation of the probability density of the multivariate normal distribution.###### Proposition 1.

Let be a -dimensional multivariate normal random column vector, with mean vector and covariance matrix . Let be two integers summing to . Decompose , and as

so that and are -column vectors, and is a matrix, for each .

If and are invertible, then, for any , the conditional distribution of given that is multivariate normal with mean

and covariance matrix

We use this proposition to calculate the posterior distribution on , given .

Before doing so, however, we first introduce some additional notation. We let indicate the column vector , and we let indicate the sequence of vectors . We let , and similarly for other functions of , such as . We introduce similar additional notation for functions that take pairs of points , so that is the matrix

is the row vector
,
and
is the column vector

.

This notation allows us to rewrite (2) as

(4) |

We now examine this expression in the context of Proposition 1. We set , , , , , , , and .

Then, applying Proposition 1, we see that the posterior distribution on given observations is normal, with a mean and variance given by,

(5) | ||||

(6) |

The invertibility of (and also ) required by Proposition 1 depends on the covariance kernel and its parameters (typically called hyperparameters), but this invertibility typically holds as long as these hyperparameters satisfy mild non-degeneracy conditions, and the are distinct, i.e., that we have not measured the same point more than once. For example, under the squared exponential covariance kernel, invertibility holds as long as and the are distinct. If we have measured a point multiple times, then we can safely drop all but one of the measurements, here where observations are noise-free. Below, we treat the case where observations are noisy, and in this case including multiple measurements of the same point is perfectly reasonable and does not cause issues.

Figure 1 shows the output from Gaussian process regression. In the figure, circles show points , the solid line shows as a function of , and the dashed lines are positioned at

, forming a 95% Bayesian credible interval for

, i.e., an interval in whichlies with posterior probability 95%. (A credible interval is the Bayesian version of a frequentist confidence interval.) Because observations are noise-free, the posterior mean

interpolates the observations .### 3.4 Inference with just one observation

The expressions (5) and (6) are complex, and perhaps initially difficult to assimilate. To give more intuition about them, and also to support some additional analysis below in Section 4, it is useful to consider the simplest case, when we have just a single measurement, .

In this case, all matrices in (5) and (6) are scalars, , and the expressions (5) and (6) can be rewritten as,

(7) | ||||

(8) |

#### Intuition about the expression for the posterior mean

We first examine (7). We see that the posterior mean of , , which we can think of as our estimate of after observing , is obtained by taking our original estimate of , , and adding to it a correction term. This correction term is itself the product of two quantities: the error in our original estimate of , and the quantity . Typically, will be positive, and hence also . (Recall, is a variance, so is never negative.) Thus, if is bigger than expected, will be positive, and our posterior mean will be larger than our prior mean . In contrast, if is smaller than expected, will be negative, and our posterior mean will be smaller than our prior mean .

We can examine the quantity to understand the effect of the position of relative to on the magnitude of the correction to the posterior mean. Notice that only enters this expression through the numerator. If is close to , then will be large under the squared exponential and most other covariance kernels, and positive values for will also cause a strong positive change in relative to . If is far from , then will be close to , and will have little effect on .

#### Intuition about the expression for the posterior variance

Now we examine (8). We see that the variance of our belief on under the posterior, , is smaller than its value under the prior, . Moreover, when is close to , will be large, and the reduction in variance from prior to posterior will also be large.

Conversely, when is far from , will be close to , and the variance under the posterior will be similar to its value under the prior.

As a final remark, we can also rewrite the expression (8) in terms of the squared correlation under the prior, , as

We thus see that the reduction in variance of the posterior distribution depends on the squared correlation under the prior, with larger squared correlation implying a larger reduction.

### 3.5 Inference with noisy observations

The previous section assumed that can be observed exactly, without any error. When is the outcome of a physical experiment, however, our observations are obscured by noise. Indeed, if we were to synthesize and test the same material design multiple times, we might observe different results.

To model this situation, Gaussian process regression can be extended to allow observations of the form,

where we assume that the are normally distributed with mean and constant variance, , with independence across . In general, the variance is unknown, but we treat it as a known parameter of our model, and then estimate it along with all the other parameters of our model, as discussed below in Section 3.6.

These assumptions of constant variance (called homoscedasticity) and independence make the analysis significantly easier, although they are often violated in practice. Experimental conditions that tend to violate these assumptions are discussed below, as are versions of GP regression that can be used when they are violated.

#### Analysis of independent homoscedastic noise

To performance inference under independent homoscedastic noise, and calculate a posterior distribution on the value of the function at a given point

, our first step is to write down the joint distribution of our observations

and the quantity we wish to predict, , under the prior. That is, we write down the distribution of the vector .We first observe that is the sum of and another vector, . The first vector has a multivariate normal distribution given by (4). The second vector is independent of the first and is also multivariate normal, with a mean vector that is identically , and a covariance matrix . The sum of two independent multivariate normal random vectors is itself multivariate normal, with a mean vector and covariance matrix given, respectively, by the sums of the mean vectors and covariance matrices of the summands. This gives the distribution of as

(9) |

where is the

-dimensional identity matrix.

As we did in Section 3.3, we can use Proposition 1 with the above expression to compute the posterior on given . We obtain,

(10) | ||||

(11) |

Figure 2 shows an example of a posterior distribution calculated with Gaussian process regression with noisy observations. Notice that the posterior mean no longer interpolates the observations, and the credible interval has a strictly positive width at points where we have measured. Noise prevents us from observing function values exactly, and so we remain uncertain about the function value at points we have measured.

#### Going beyond homoscedastic independent noise

Constant variance is violated if the experimental noise differs across materials designs, which occurs most frequently when noise arises during the synthesis of the material itself, rather than during the evaluation of a material that has already been created. Some work has been done to extend Gaussian process regression to flexibly model heteroscedastic noise (i.e., noise whose variance changes)

[24, 31, 1, 52]. The main idea in much of this work is to use a second Gaussian process to model the changing variance across the input domain. Much of this work assumes that the noise is independent and Gaussian, though [52] considers non-Gaussian noise.Independence is most typically violated, in the context of physical experiments, when the synthesis and evaluation of multiple materials designs is done together, and the variation in some shared component simultaneously influences these designs, e.g., through variation in the temperature while the designs are annealing together, or through variation in the quality of some constituent used in synthesis. We are aware of relatively little work modeling dependent noise in the context of Gaussian process regression and Bayesian optimization, with one exception being [17].

### 3.6 Parameter Estimation

The mean and covariance functions contain several parameters. For example, if we use the squared exponential kernel, a constant mean function, and observations have independent homoscedastic noise, then we must choose or estimate the parameters . These parameters are typically called hyperparameters because they are parameters of the prior distribution. ( is actually a parameter of the likelihood function, but it is convenient to treat it together with the parameters of the prior.) While one may simply choose these hyperparameters directly, based on intuition about the problem, a more common approach is to choose them adaptively, based on data.

To accomplish this, we write down an expression for the probability of the observed data in terms of the hyperparameters, marginalizing over the uncertainty on . Then, we optimize this expression over the hyperparameters to find settings that make the observed data as likely as possible. This approach to setting hyperparameters is often called empirical Bayes

, and it can be seen as an approximation to full Bayesian inference.

We detail this approach for the squared exponential kernel with a constant mean function. Estimating for other kernels and mean functions is similar. Using the probability distribution of from (9), and neglecting constants, the natural logarithm of this probability, (called the “log marginal likelihood”), can be calculated as

where applied to a matrix indicates the determinant.

To find the hyperparameters that maximize this log marginal likelihood (the neglected constant does not affect the location of the maximizer), we will take partial derivatives with respect to each hyperparameter. We will then use them to find maximizers of and analytically, and then use gradient-based optimization to maximize the other hyperparameters.

Taking a partial derivative with respect to , setting it to zero, and solving for , we get that the value of that maximizes the marginal likelihood is

Define as the matrix with components

where . Then and can be written in terms of as . The log marginal likelihood (still neglecting constants) becomes

Taking the partial derivative with respect to , and noting that does not depend on , we solve for and obtain

Substituting this estimate, the log marginal likelihood becomes

(12) |

The expression (12) cannot in general be optimized analytically. Instead, one typically optimizes it numerically using a first- or second-order optimization algorithm, such as Newton’s method or gradient descent, obtaining estimates for and . These estimates are in turn substituted to provide an estimate of , from which estimates and may be computed. Finally, using and the estimated value of , we may estimate and .

### 3.7 Diagnostics

When using Gaussian process regression, or any other machine learning technique, it is advisable to check the quality of the predictions, and to assess whether the assumptions made by the method are met. One way to do this is illustrated by Figure 3, which comes from a simulation of blood flow near the heart, based on [43], for which we get exact (not noisy) observations of

This plot is created with a technique called leave-one-out cross validation. In this technique, we iterate through the datapoints , , and for each , we train a Gaussian process regression model on all of the data except , and then use it, together with , to predict what the value should be. We obtain from this a posterior mean (the prediction), call it

, and also a posterior standard deviation, call it

. When calculating these estimates, it is best to separately re-estimate the hyperparameters each time, leaving out the data . We then calculate a 95% credible interval , and create Figure 3 by plotting “Predicted” vs. “Actual”, where the “Actual” coordinate (on the x-axis) is , and the “Predicted” value (on the y-axis) is pictured as an error bar centered at with half-width .If the uncertainty estimates outputted by Gaussian process regression are behaving as anticipated, then 95% of the credible intervals will intersect the diagonal line Predicted=Actual. Moreover, if Gaussian process regression’s predictive accuracy is high, then the credible intervals will be short, and their centers will be close to this same line Predicted=Actual.

This idea may be extended to noisy function evaluations, under the assumption of independent homoscedastic noise. To handle the fact that the same point may be sampled multiple times, let be the number of times that a point was sampled, and let be the average of the observed values at this point. Moreover, by holding out all samples of and training Gaussian process regression, we would obtain a normal posterior distribution on that has mean and standard deviation .

Since is then the sum of and some normally distributed noise with mean and variance , the resulting distribution of is normal with mean and standard deviation .

From this, a 95% credible interval for is then . We would plot Predicted vs. Observed by putting this credible interval along the y-axis at x-coordinate . If Gaussian process regression is working well, then 95% of these credible intervals will intersect the line Predicted=Observed.

For Gaussian process regression to best support Bayesian optimization, it is typically most important to have good uncertainty estimates, and relatively less important to have high predictive accuracy. This is because Bayesian optimization uses Gaussian process regression as a guide for deciding where to sample, and so if Gaussian process regression reports that there is a great deal of uncertainty at a particular location and thus low predictive accuracy, Bayesian optimization can choose to sample at this location to improve accuracy. Thus, Bayesian optimization has a recourse for dealing with low predictive accuracy, as long as the uncertainty is accurately reported. In contrast, if Gaussian process regression estimates poor performance at a location that actually has near-optimal performance, and also provides an inappropriately low error estimate, then Bayesian optimization may not sample there within a reasonable timeframe, and thus may never correct the error.

If either the uncertainty is incorrectly estimated, or the predictive accuracy is unsatisfactorily low, then the most common “fixes” employed are to adopt a different covariance kernel, or to transform the objective function . If the objective function is known to be non-negative, then the transformations and are convenient for optimization because they are both strictly increasing, and so do not change the set of maximizers (or minimizers). If is not non-negative, but is bounded below by some other known quantity , then one may first shift upward by .

### 3.8 Predicting at more than one point

Below, to support the development of the knowledge-gradient method in Sections 4.2 and 8.3, it will be useful to predict the value of at multiple points, , with noise. To do so, we could certainly apply (10) and (11) separately for each , and this would provide us with both an estimate (the posterior mean) and a measure of the size of the error in this estimate (the posterior variance) associated with each . It would not, however, quantify the relationship between the errors at several different locations. For this, we must perform the estimation jointly.

As we did in Section 3.5, we begin with our prior on , which is,

We then use Proposition 1 to compute the posterior on given , which is multivariate normal with mean vector and covariance matrix given by,

(13) | ||||

(14) |

### 3.9 Avoiding matrix inversion

The expressions (10) and (11) for the posterior mean and variance in the noisy case, and also (7) and (8) in the noise-free case, include a matrix inversion term. Calculating this matrix inversion is slow and can be hard to accomplish accurately in practice, due to the finite precision of floating point implementations. Accuracy is especially an issue when has terms that are close to , which arises when data points are close together.

In practice, rather than calculating a matrix inverse directly, it is typically faster and more accurate to use a mathematically equivalent algorithm, which performs a Cholesky decomposition and then solves a linear system. This algorithm is described below, and is adapted from Algorithm 2.1 in Section 2.3 of [41]. This algorithm also computes the log marginal likelihood required for estimating hyperparameters in Section 3.6.

## 4 Choosing where to sample

Being able to infer the value of the objective function at unevaluated points based on past data , is only one part of finding good designs. The other part is using this ability to make good decisions about where to direct future sampling.

Bayesian optimization methods addresses this by using a measure of the value of the information that would be gained by sampling at a point. Bayesian optimization methods then choose the point to sample next for which this value is largest. A number of different ways of measuring the value of information have been proposed. Here, we describe two in detail, expected improvement [36, 28], and the knowledge gradient [15, 46], and then survey a broader collection of design criteria.

### 4.1 Expected Improvement

Expected improvement, as it was first proposed, considered only the case where measurements are free from noise. In this setting, suppose we have taken measurements at locations and observed . Then

is the best value observed so far. Suppose we are considering evaluating at a new point . After this evaluation, the best value observed will be

and the difference between these values, which is the improvement due to sampling, is

where indicates the positive part function.

Ideally, we would choose to make this improvement as large as possible. Before actually evaluating , however, we do not know what this improvement will be, so we cannot implement this strategy. However, we do have a probability distribution on , from Gaussian process regression. The expected improvement, indicated , is obtained by taking the expectation of this improvement with respect to the posterior distribution on given .

(15) |

where indicates the expectation with respect to the posterior distribution.

The expectation in (15

) can be computed more explicitly, in terms of the normal cumulative distribution function (cdf)

, and the normal probability density function (pdf)

. Recalling from Section 3.3 that , where and are given by (5) and (6), and integrating with respect to the normal di