Active Multi-Information Source Bayesian Quadrature

03/27/2019 ∙ Alexandra Gessner, et al. ∙ Max Planck Society ∙ Amazon

Bayesian quadrature (BQ) is a sample-efficient probabilistic numerical method to solve integrals of expensive-to-evaluate black-box functions, yet so far, active BQ learning schemes focus merely on the integrand itself as information source, and do not allow for information transfer from cheaper, related functions. Here, we set the scene for active learning in BQ when multiple related information sources of variable cost (in input and source) are accessible. This setting arises for example when evaluating the integrand requires a complex simulation to be run that can be approximated by simulating at lower levels of sophistication and at lesser expense. We construct meaningful cost-sensitive multi-source acquisition rates as an extension to common utility functions from vanilla BQ (VBQ), and discuss pitfalls that arise from blindly generalizing. Furthermore, we show that the VBQ acquisition policy is a corner case of all considered cost-sensitive acquisition schemes, which collapse onto one single degenerate policy in the case of one source and constant cost. In proof-of-concept experiments we scrutinize the behavior of our generalized acquisition functions. On an epidemiological model, we demonstrate that active multi-source BQ (AMS-BQ) allocates budget more efficiently than VBQ for learning the integral to a good accuracy.


1 Introduction

Integrals of expensive-to-evaluate functions arise in many scientific and industrial applications, for example when complex simulations have uncertain inputs and we wish to estimate the expected outcome. This could be a fluid dynamics model, e.g., in meteorology or astrophysics, in which every function query corresponds to running a simulation on a super-computer. Within reasonable budget, integration using Monte Carlo may not be feasible and alternative numerical integration schemes are needed that require fewer simulation runs (i.e., function evaluations). Bayesian quadrature (bq)—a means of constructing posterior measures over the unknown value of the integral OHagan1991, Diaconis1988, Briol2015b—mitigates a high sample demand by encoding known or assumed structure of the integrand such as smoothness or regularity, usually via a Gaussian process (gp) measure. With its increased ‘data’ efficiency (see e.g., Hennig2015 and Cockayne2017 for a discussion of ‘data’ in numerical solvers), bq is a natural choice in a setting where function evaluations are precious Rasmussen2003. In the past, bq has been applied in reinforcement learning Paul2018 and filtering Kersting2016, and has been extended to encode properties of probabilistic integrals Osborne2012b, Osborne2012a, Gunter2014, Chai2018.

A complementary approach to integrating costly functions is to make use of related, cheaper information sources about the integrand of interest. The task of finding approximations to computationally demanding numerical models is an area of active research all on its own (see e.g., Benner2017). Examples of information sources of diverse cost and quality include numerical models that are run at different resolutions, computational models that are approximated by an analytical expression, or different kinds of models for the same target. Multi-source modeling is a statistical technique for harvesting information from related functions by constructing correlated surrogates over multiple sources. When the information sources are hierarchical in that they are ordered from most to least informative, this concept is known as multi-fidelity modeling Kennedy2000,Peherstorfer2018,Forrester2007,LeGratiet2012. The notion of multi-source models is more overarching and includes settings in which sources do not exhibit an easily identifiable order, if any. Each of the sources has its own cost function that quantifies the cost of evaluating the source at a certain input. Integrands with input-dependent cost arise for example when querying the integrand involves running a simulation that needs to be refined for certain values of input. A linear instance of a multi-source model is a multi-output gp aka. co-kriging Alvarez2011a. bq with multi-output gps to integrate several related functions has been studied by Xi2018, who impose properties on data—which they assume given—to prove theoretical guarantees and consistency of the Bayes estimator.

bq leverages active learning schemes similar to Bayesian optimization Shahriari2016 or experimental design Atkinson2007, Yu2006. Through its argumentum maximi, an acquisition function identifies optimal future locations at which to query the integrand according to a user-defined metric. Metrics of interest in bq are information gain on the value of the integral or the predictive variance of the integral. By optimizing the target per cost, active multi-source bq (ams-bq) trades off improvement on the target (the integral) and resources spent. In Bayesian optimization, this setting has been explored by Poloczek2017.

We summarize our contributions:


  • We lay the foundations for active bq in the multi-source setting for the task of integrating an expensive function that comes with cheaper approximations. In particular, we assign cost to function evaluations and generalize vbq acquisition functions to acquisition rates.

  • We analyze these acquisition rates and discover that some exhibit sane, others pathological, behavior, which can be traced back to the meaning of the acquisition function. We argue that these pathologies were not present in classically used vbq acquisition functions, since most are degenerate in their policy and do not rely on values of the acquisition function, while cost-adapted rates inevitably do so. In other words, all considered multi-source acquisition functions collapse onto a single policy for vbq, as a corner case of ams-bq; we discuss the implications.

  • We conduct proof-of-concept experiments which show that ams-bq improves upon vbq in that it spends less budget on learning the integral to an acceptable precision.

2 Model

We wish to estimate the integral over the information source of interest (the primary source), w.l.o.g. indexed by 1, f_1: Ω → ℝ, and integrated against the probability measure π on Ω ⊆ ℝ^d,

    F = ∫_Ω f_1(x) dπ(x),    (1)

in presence of L − 1 not necessarily ordered or orderable secondary information sources f_l, with l = 2, …, L. Each source comes with an input-dependent cost c_l(x) which must be invested to query f_l at location x. For ease of interpretation and numerical stability we set c_l(x) ≤ 1 and c_l(x) > 0. This is equivalent to assuming there exists a c_max < ∞ and a c_min > 0 s.t. c_min ≤ c_l(x) ≤ c_max, and then normalizing w.r.t. c_max, i.e., c_l(x) ← c_l(x)/c_max. In other words, no query takes an infinite amount of resources, nor does any evaluation come for free. Normalization is no requirement and in practice, neither c_min nor c_max need to be known. A small sketch of this normalization is given below.
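To make the normalization concrete, here is a minimal Python sketch (our illustration, not the authors' code; the example cost functions are arbitrary assumptions) of rescaling positive cost functions so that all values lie in (0, 1]:

    import numpy as np

    def normalize_costs(costs, grid):
        """Rescale positive cost callables by the largest cost observed on a grid."""
        c_max = max(float(np.max(c(grid))) for c in costs)
        return [lambda x, c=c: c(x) / c_max for c in costs]

    grid = np.linspace(0.0, 1.0, 101)
    raw = [lambda x: 2.0 + np.sin(3.0 * x), lambda x: 0.3 + 0.0 * x]  # toy costs
    costs = normalize_costs(raw, grid)
    print(max(float(np.max(c(grid))) for c in costs))  # <= 1 after normalization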

In the following sections we review the tools for building a statistical model that allows us to harvest information from both the primary and the secondary sources for learning the integral of Eq. (1), before turning to the decision theoretic problem of how to actively select locations and sources to query next in Section 3.

2.1 Vanilla Bayesian Quadrature

Let f: Ω → ℝ be a function and π a probability measure on Ω ⊆ ℝ^d such that the integral F = ∫_Ω f(x) dπ(x) is intractable. In vanilla Bayesian quadrature (vbq), we express our epistemic uncertainty about the value of F through a random variable. The distribution over F is obtained by integrating a Gaussian process (gp) prior that is placed over the integrand f, i.e., f ~ GP(μ, k), where μ: Ω → ℝ denotes the prior mean function and k: Ω × Ω → ℝ the covariance function or kernel. Observations come in form of potentially noisy function evaluations y_i = f(x_i) + ε_i with ε_i ~ N(0, σ²) (noise-free evaluations are usually assumed in bq, but that might not be true for a black-box integrand). Let X denote the matrix of input locations and y the set of corresponding observations, summarized in D = (X, y) (see Rasmussen2006 for an introduction to gp inference). With the closure property of gps, the posterior over F when conditioning on D is a univariate Gaussian distribution with posterior mean and variance that are integrals over the gp’s posterior mean and covariance. These expressions are detailed below for the more general multi-source case that vbq is a subset of; further derivations can be found in Briol2015b. So as not to replace an intractable integral by another intractable integral, the kernel k is chosen to be integrable against π. A worked numerical sketch follows below.
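For concreteness, the following minimal Python sketch (our illustration, not the authors' code) implements vbq with an rbf kernel against a uniform measure on [0, 1]; the closed-form kernel mean and initial error follow from standard Gaussian integrals, and all parameter values are arbitrary assumptions:

    import numpy as np
    from scipy.special import erf

    ell, s2 = 0.2, 1.0                                  # lengthscale, output scale

    def k(a, b):                                        # RBF kernel matrix
        return s2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

    def kernel_mean(a):                                 # z(x) = int_0^1 k(x, t) dt
        r = np.sqrt(2.0) * ell
        return s2 * ell * np.sqrt(np.pi / 2) * (erf((1 - a) / r) + erf(a / r))

    def initial_error():                                # int int k(x, x') dx dx'
        return s2 * (np.sqrt(2 * np.pi) * ell * erf(1 / (np.sqrt(2) * ell))
                     - 2 * ell**2 * (1 - np.exp(-0.5 / ell**2)))

    f = lambda x: np.sin(2 * np.pi * x) + x             # integrand; true integral 0.5
    X = np.linspace(0.05, 0.95, 8)                      # evaluation nodes
    y = f(X)

    G = k(X, X) + 1e-8 * np.eye(len(X))                 # jitter for stability
    z = kernel_mean(X)
    mean = z @ np.linalg.solve(G, y)                    # posterior mean of F
    var = initial_error() - z @ np.linalg.solve(G, z)   # posterior variance of F
    print(mean, var)                                    # mean close to 0.5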

2.2 Multi-Source Models

We consider linear multi-source models, which can equally be phrased as multi-output Gaussian processes Alvarez2011a over the vector-valued function f = (f_1, …, f_L): Ω → ℝ^L. Non-linear models for multi-source modeling exist and have been considered by Perdikaris2017. They do, however, come with the additional technical difficulty that the model may not be integrable analytically—a sensible pre-requisite for bq—and are thus another beast altogether. The notation mimics the single-output case, that is, f ~ GP(μ, K), where K is an L × L matrix-valued covariance function. More precisely, the covariance between two sources l and l′ at inputs x and x′ is k((x, l), (x′, l′)). The kernel encodes not only characteristics of the individual sources (e.g., smoothness), but crucially the correlation between them. In the multi-source setting, observations come in source-location-evaluation triplets (l_i, x_i, y_i) with y_i = f_{l_i}(x_i) + ε_i and source-dependent observation noise ε_i ~ N(0, σ_{l_i}²), as usually only one element of f is being observed. (This can be re-written in the multi-output gp framework in terms of linear projections of f onto the observed space; see Section A in the supp. mat. for details.) The dataset D = {(l_i, x_i, y_i)}_{i=1}^N contains N data triplets from evaluating elements of f at locations X with corresponding sources ℓ and observations y. The gp posterior over f has mean and covariance

    m(x, l) = μ(x, l) + k((x, l), (X, ℓ)) G⁻¹ (y − μ(X, ℓ)),
    v((x, l), (x′, l′)) = k((x, l), (x′, l′)) − k((x, l), (X, ℓ)) G⁻¹ k((X, ℓ), (x′, l′)),    (2)

with the kernel Gram matrix G = k((X, ℓ), (X, ℓ)) + Σ and the diagonal noise matrix Σ = diag(σ_{l_1}², …, σ_{l_N}²). A summary of the notation used can be found in Table 1 in the supplementary material (supp. mat.).

2.3 Multi-Source Bayesian Quadrature

The multi-source model of Section 2.2 can be integrated and gives rise to a quadrature rule similar to vbq (cf. Sec. 2.1). Let F denote the random variable representing the integral of interest of Eq. (1). The posterior over F given N data triplets is a univariate Gaussian with mean and variance

    E[F | D] = ∫_Ω μ(x, 1) dπ(x) + zᵀ G⁻¹ (y − μ(X, ℓ)),
    Var[F | D] = 𝒞 − zᵀ G⁻¹ z,    (3)

where z = ∫_Ω k((x, 1), (X, ℓ)) dπ(x) is the kernel mean and 𝒞 = ∫_Ω ∫_Ω k((x, 1), (x′, 1)) dπ(x) dπ(x′) the initial error, both of source 1. Just as in vbq, the kernel is required to be integrable analytically.

We choose an intrinsic coregionalization model (icm) Alvarez2011a with kernel

    k((x, l), (x′, l′)) = B_{l l′} k_x(x, x′),    (4)

where B ∈ ℝ^{L×L} is a positive definite matrix. Eq. (4) is a simple extension of a standard kernel to the multi-source case which factorizes the correlation between the sources and the input locations. If k_x is integrable analytically, k will be, too, and thus retains the favorable property of a bq-kernel. A typical choice for k_x is the squared-exponential, aka. rbf kernel, with no dependence on the sources l and l′. This model can easily be extended, e.g., to a linear model of coregionalization (lmc) without challenging the integrability of k. This would untie the lengthscales between sources, but would also introduce additional, generally unknown kernel parameters. The simpler icm is also used by Xi2018 to establish convergence rates for a multi-output bq rule. A small sketch of the icm kernel and the posterior of Eq. (2) follows below.
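The following Python sketch (our illustration under toy assumptions: a hypothetical coregionalization matrix B, zero prior mean, and made-up data) shows the icm kernel of Eq. (4) plugged into the posterior of Eq. (2):

    import numpy as np

    def k_x(a, b, ell=0.2):                    # spatial RBF kernel
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

    def icm(a, la, b, lb, B):                  # k((x,l),(x',l')) = B[l,l'] k_x(x,x')
        return B[np.ix_(la, lb)] * k_x(a, b)

    B = np.array([[1.0, 0.8], [0.8, 1.0]])     # positive definite coregionalization
    X = np.array([0.1, 0.4, 0.4, 0.9])         # observed locations ...
    ls = np.array([0, 1, 0, 1])                # ... and their sources
    y = np.array([0.3, 0.1, 0.35, -0.2])
    noise = np.array([1e-6, 1e-4])             # per-source noise variances

    G = icm(X, ls, X, ls, B) + np.diag(noise[ls])       # Gram matrix of Eq. (2)
    Xq = np.linspace(0.0, 1.0, 5)
    lq = np.zeros(5, dtype=int)                          # query the primary source
    Kq = icm(Xq, lq, X, ls, B)
    m = Kq @ np.linalg.solve(G, y)                       # posterior mean, Eq. (2)
    v = icm(Xq, lq, Xq, lq, B) - Kq @ np.linalg.solve(G, Kq.T)
    print(m, np.diag(v))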

3 Active Learning


Figure 1: The multi-source acquisition-cube for a few of the possible acquisition functions. mi, ivr, and ip stand for ‘mutual information’, ‘integral variance reduction’, and ‘integral precision’, respectively. The forward arrows denote the special case of one source only (L = 1) as in the case of vbq. The downward facing arrows denote the special case where the cost does not depend on the location x. The double lines between nodes denote that these acquisition functions are equivalent in the sense that they yield the same optimal action. The two grayed-out acquisitions for ip highlight that they exhibit non-favorable behavior as described in the text. The bottom front row in the cube denotes the special case of vbq (one source, constant cost), where all three acquisition policies (mi, ivr, ip) coincide.

Active learning describes the automation of the decision-making about prospective actions to be taken by an algorithm in order to achieve a certain learning objective. A heuristic measure of improvement towards the specified goal (here: learning the value of an integral) is defined through a utility function. It transfers the decision-theoretic problem to an optimization problem, but usually an infeasible one. Therefore, the utility is commonly approximated by an acquisition function. Optimizing the acquisition function induces an acquisition policy that pins down what action to take next. To obtain a sequence of actions, the considered method (here: bq) is placed in a loop where it is iteratively fed with new observations in the general multi-step look-ahead (non-myopic) approach. A myopic approximation is to instead optimize for a single new observation triplet at a time. Besides feasibility, the lack of exact model knowledge motivates a loop in which the model is repeatedly updated with new observations. A minimal sketch of such a loop is given below.
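A skeletal Python version of the myopic loop (ours, not the authors' implementation; the surrogate, acquisition, and sources are hypothetical placeholders):

    import numpy as np

    def ams_bq_loop(sources, costs, acquisition, refit, budget, grid):
        """Myopic loop: pick the best source-location pair per acquisition rate,
        pay its cost, observe, and refit the surrogate."""
        surrogate, spent, data = None, 0.0, []
        while spent < budget:
            scores = [(acquisition(surrogate, x, l), x, l)
                      for l in range(len(sources)) for x in grid]
            _, x_star, l_star = max(scores, key=lambda t: t[0])
            y = sources[l_star](x_star)          # query the chosen source
            spent += costs[l_star](x_star)       # pay the evaluation cost
            data.append((l_star, x_star, y))
            surrogate = refit(data)              # update the multi-source GP
        return surrogate, data

    rng = np.random.default_rng(0)
    grid = np.linspace(0.0, 1.0, 51)
    sources = [np.sin, lambda x: np.sin(x) + 0.1]        # toy primary/secondary
    costs = [lambda x: 1.0, lambda x: 0.1]               # source-dependent cost
    acq = lambda s, x, l: rng.random() / costs[l](x)     # placeholder rate
    _, data = ams_bq_loop(sources, costs, acq, refit=lambda d: d,
                          budget=5.0, grid=grid)
    print(len(data), "queries made")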

In Section 3.1, we recapitulate several utilities that are commonly used for vbq. All these utilities give rise to the same acquisition policy in the absence of cost and are thus not greatly differentiated between in the literature. Intriguingly, the policies do not coincide for ams-bq if cost is accounted for in the acquisition functions, as will be shown and discussed in Section 3.2.

3.1 Cost-Insensitive bq Acquisitions

In the absence of any notion of evaluation cost (or if all sources come at the same cost), the utility functions from vbq generalize straightforwardly to the multi-source case. The vbq case can be recovered by setting the number of sources to one.

3.1.1 Mutual Information

From an information theoretic perspective, new source-location pairs can be chosen such that they jointly maximize the mutual information (mi) between the integral F and a set of M new but yet unobserved observations y*, with I[F, y*] = H[F] + H[y*] − H[F, y*] in terms of the individual and joint differential entropies over F and y*. Sections 2.2 and 2.3 imply that both F and y* are normally distributed and so is their joint. The differential entropy of a multivariate normal distribution with covariance matrix Σ is H = ½ log det(2πe Σ). Since there is no explicit dependence on the value of y*, we (sloppily) express the mutual information as a function of the new source-location pairs (X*, ℓ*),

    I(X*, ℓ*) = −½ log(1 − ϱ²(X*, ℓ*)),    (5)

where we introduce the scalar correlation

    ϱ²(X*, ℓ*) = Cov[F, y*]ᵀ V⁻¹ Cov[F, y*] / Var[F],    (6)

with the noise-corrected posterior covariance matrix V of the new observations. In the one-step look-ahead case (M = 1),

    ϱ(x*, l*) = Cov[F, y*] / (Var[F] Var[y*])^{1/2}    (7)

is the bivariate correlation between F and y*.

3.1.2 Variance-Based Acquisitions

Variance-based approaches attempt to select source-location pairs such that the variance on F shrinks maximally. As mi, the integral variance reduction (ivr) normalized by the current integral variance can be written in terms of the correlation as

    ivr(X*, ℓ*) = ϱ²(X*, ℓ*).    (8)

Eq. (8) is a monotonic transformation of Eq. (5) and therefore both utility functions share the same global maximizer. In fact, any monotonic transformation of Eq. (6), whether interpretable or not, gives rise to the same acquisition policy. This is because the policy only depends on the maximizing locations, not on the value of the utility function’s global maximum. Hence, in vbq it is equivalent to consider maximal shrinkage of the variance of the integral, minimization of the integral’s standard deviation, or maximal increase of the integral’s precision (ip), to name a few—they all lead to the same active learning scheme and have thus not been greatly distinguished between in previous work on active vbq.

3.2 Cost-Sensitive Multi-Information Source Acquisitions

When there is a location- and/or source-dependent cost associated with evaluating the information sources (cf. Section 2), the utility function should trade off the improvement made on the integral and the budget spent for function evaluations. This is achieved by considering the ratio of a cost-insensitive bq utility and the cost function c. Such a ratio can be interpreted as an acquisition rate and bears the units of the utility function divided by units of cost. The notion of a rate becomes clearer when considering for example the mutual information utility (Eq. (5)) with cost measured in terms of evaluation time: the unit is information per time, i.e., a rate of information gain.
This construction has an important consequence: modification of the vbq utility function (i.e., the numerator), even by a monotonic transformation, changes the maximizer of the cost-adapted acquisition rate and hence also the acquisition policy. In other words, the degeneracy of bq acquisition functions in terms of the policy they induce in the absence of cost is lifted when evaluation cost is included, firstly, because the argmax of each acquisition is shifted differently with cost, and, secondly, because acquisition values from different sources are now weighed against each other. As will be discussed below, not all monotonic transformations yield a sensible acquisition policy; indeed, some display pathological behavior.

The adapted non-myopic acquisition rates for the bq utilities mutual information (mi, Eq. (5)) and integral variance reduction (ivr, Eq. (8)) are

    a_mi(X*, ℓ*) = −log(1 − ϱ²(X*, ℓ*)) / c(X*, ℓ*),    (9)
    a_ivr(X*, ℓ*) = ϱ²(X*, ℓ*) / c(X*, ℓ*),    (10)

where we have dropped the factor ½ in mi as an arbitrary scaling factor. It is evident that these acquisition rates no longer share their maximizer; yet they still induce a meaningful acquisition scheme. Both mi and ivr have the property of being zero at ϱ² = 0 and thus never select points that are uncorrelated with the integral F, no matter the cost, e.g., locations that have already been observed exactly (without noise). Such points do not update the posterior of the integral when conditioned on. In vbq these locations are the minimizers of all acquisition functions and thus excluded no matter their value. This is not ensured for the cost-adapted acquisition rates and therefore, they additionally require the numerator to be zero at ϱ² = 0. Hence, not every monotonic transformation of the bq utility produces a sane acquisition policy in the presence of cost.
Consider for example the valid transformation ϱ² ↦ ϱ² − 1, which equals −1 at ϱ² = 0. Maximizing this utility function corresponds to maximizing the negative integral variance, i.e., minimizing the integral variance, which is very commonly done in vbq. Since ϱ² ≤ 1, the numerator is negative everywhere and the rate gets larger (takes a smaller negative value) with larger cost. Hence, when maximized, this acquisition would favor expensive evaluations.
More subtle is the misbehavior of the integral precision (ip), which is positive everywhere and has the desired behavior of favoring low-cost evaluations. In terms of the squared correlation r := ϱ² from Eq. (6) (with simplified notation for convenience), the numerator of the ip acquisition rate is proportional to 1/(1 − r). This expression is non-zero at r = 0 and therefore does not exclude points of zero correlation when they come at sufficiently cheap cost, and in experiments we observe it getting stuck re-evaluating at the location of minimum cost ad infinitum. We conjecture that this is because ip only encodes an absolute scale of the integral variance but does not quantify any “improvement” on the integral value. A small numeric check below illustrates both the cost-free degeneracy and how cost lifts it.
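The following self-contained Python check (our illustration; the correlation profile and cost are synthetic assumptions) reproduces the qualitative behavior discussed here: identical maximizers without cost, dispersed maximizers once divided by cost:

    import numpy as np

    x = np.linspace(0.0, 1.0, 1001)
    rho2 = 0.9 * np.exp(-0.5 * ((x - 0.3) / 0.15) ** 2)    # synthetic rho^2(x)
    cost = 0.1 + 0.9 / (1.0 + np.exp(-10.0 * (x - 0.5)))   # logistic cost in (0.1, 1)

    mi = -np.log(1.0 - rho2)       # Eq. (5), factor 1/2 dropped
    ivr = rho2                     # Eq. (8)
    ip = 1.0 / (1.0 - rho2)        # integral precision, up to a constant scale

    # Cost-free: all three are monotone in rho^2, so their argmax coincides.
    print([float(x[np.argmax(a)]) for a in (mi, ivr, ip)])

    # Cost-sensitive rates: maximizers disperse; ivr moves furthest toward
    # cheap locations, mi stays closer to high correlation (cf. Section 3.2).
    print([float(x[np.argmax(a / cost)]) for a in (mi, ivr, ip)])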
Figure 1 illustrates the augmentation of utility functions from vbq with multiple information sources and cost.


Figure 2: Right: vbq acquisitions as functions of the squared correlation ϱ²; left: vbq acquisitions as a function of univariate x for a myopic step (M = 1) and a synthetic ϱ²(x). Without cost, their maximizers coincide (top), but when divided by an input-dependent cost (middle), the maximizers disperse, indicated by the dashed vertical lines (bottom). For implications, cf. Section 3.2.

Figure 2 displays the behavior of a few acquisitions, mi, ivr, and ip. The right plot shows these acquisitions as used in vbq in terms of the squared correlation ϱ² (Eq. (6)) in the absence of cost. All acquisitions are strictly monotonically increasing functions of ϱ². Among the sane acquisition rates that are zero at ϱ² = 0, the differences in the corresponding policy can also be understood from the functional dependence on ϱ². mi diverges at perfect correlation ϱ² = 1. Therefore, and since the cost lies in (0, 1], mi will always take a ‘perfect step’ to learn the integral exactly, i.e., it will always select points with correlation ϱ² = 1, if such a step is available and no matter the cost. ivr, however, is finite at ϱ² = 1 and trades off cost and correlation even if a perfect step with ϱ² = 1 exists. These interpretations are reinforced by the left three plots of Figure 2, in which we plot mi, ivr, and ip versus a univariate x for a synthetic choice of ϱ²(x) and a myopic step (M = 1). In the pure vbq situation, the locations of all their maxima coincide, but as soon as a non-constant cost is applied, the shapes of the acquisition functions become relevant, which discriminates their maximizers and lifts the degeneracy in policies. mi tends more towards higher correlation than ivr, the maximizer of which moves further towards locations of lower cost. While mi and ivr act differently, they are both sensible choices for acquisition functions in ams-bq. In fact, for low to mid-ranged values of ϱ², where mi is approximately a linear function of ϱ², they roughly coincide.
The choice of acquisition ultimately depends on the application and the user, who may choose which measures of improvement on the integral and cost to trade off.

4 Experiments


Figure 3: Demonstration of the sequential selection of new source-location pairs to query using the mi acquisition in a two-source setting with source and location dependent cost (Figure 8). Left column: The gp; right column: the acquisition function for the primary (solid) and secondary source (dashed) for three consecutive iterations. Vertical orange lines indicate the location and source of the new query.

The key practical application for ams-bq is solving integrals of expensive-to-evaluate black-box functions that are accompanied by cheaper approximations, potentially in a setting where only a finite budget is available. Typical applications are models of complex nonlinear systems that need to be tackled computationally. With evaluations being precious, the goal is to get a decent estimate of the integral with as little budget as possible, rather than caring about floating-point precision. In the experiments, we focus on the rear vertices of the acquisition cube (Figure 1), i.e., multiple sources with source- and input-dependent or only source-dependent cost, and separate them into two main experiments:

  1. A synthetic multi-source setting with cost that varies in source and location, for the purpose of exploring and demonstrating the behavior of the acquisition functions derived in Section 3.

  2. An epidemiological model of the spread of a disease with uncertain input, in which two sources correspond to simulations that differ in cost as well as in quality of the quantity of interest.

Additionally, we present a bivariate experiment with three sources in Section D of the supplementary material. We take a myopic approach in all scenarios in that we optimize the acquisition for a single source-location pair at a time.

4.1 Multiple Sources, Non-Constant Cost

We initially consider a synthetic two-source setting with univariate input. The cost functions depend on both source and location. The experiment’s purpose is to demonstrate our findings from Section 3 and convey intuition about the behavior of the novel acquisition functions. The sources we consider have been suggested by Forrester2007, with f_1 as the primary and f_2 as the secondary source on the unit interval; a sketch of their commonly cited forms is given below. The cost functions both take the form of a scaled and shifted logistic function such that the cost lies in (0, 1], cf. Figure 8 in the supp. mat. The costs of both sources converge to the same value close to x = 0; for larger x, f_2 is two orders of magnitude cheaper than f_1. Figure 3 shows snapshots of three consecutive query decisions taken by the mi multi-source acquisition. The gp model (depicted in the left column) has been initialized with 3 datapoints in the primary and 5 in the secondary source, and merely the noise variance was constrained. The mi acquisition given the current state of the gp is shown on the right—the top left frame is shown for mi, ivr, and ip in Figure 7 in the supp. mat. to emphasize the pathology of ip and to highlight the subtle difference between mi and ivr in practice. The acquisition function is optimized using the l-bfgs-b optimizer in scipy. We observe that ams-bq does not query f_2 for small x, where the source costs are almost identical (see Figure 9 in supp. mat.). This is because the two sources are not perfectly correlated and evaluating f_1 always conveys more information about F than f_2. The fact that the secondary cost decreases with increasing x is nicely represented in the increasing height of the maxima of the dashed acquisition function for the secondary source in the top left frame of Figure 3.
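For reference, a sketch of this setup (the Forrester function in its commonly cited form, a standard low-fidelity variant, and a logistic cost; the constants A, B, C and the cost parameters are our assumptions, not necessarily the paper's exact values):

    import numpy as np

    def f1(x):                                 # primary: the Forrester function
        return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

    def f2(x, A=0.5, B=10.0, C=-5.0):          # common low-fidelity variant
        return A * f1(x) + B * (x - 0.5) + C

    def logistic_cost(x, lo=0.01, hi=1.0, center=0.4, slope=10.0):
        """Scaled and shifted logistic, decreasing in x, with values in (lo, hi)."""
        return lo + (hi - lo) / (1.0 + np.exp(slope * (x - center)))

    x = np.linspace(0.0, 1.0, 5)
    print(f1(x), f2(x), logistic_cost(x), sep="\n")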


Figure 4: Top left: the wigglified Forrester functions with f_1 and f_2 as primary and secondary source, respectively; top right: the cost functions used; bottom: relative error with two std. deviations (shaded) as a function of normalized cost for the ams-bq acquisitions mi and ivr, compared to vbq and a percentile estimator (pe). Vertical dashed lines are a visual help to indicate the cost spent to achieve acceptable accuracy.

For assessing the performance of ams-bq, we compare against vbq and a percentile estimator (pe) that both operate on the primary source. The latter is obtained by separating the domain into intervals that contain the same probability mass and summing up the function values at these nodes. For the uniform integration measure used here, this is equivalent to a right Riemann sum. We assume that gp inference comes at negligible cost as compared to the function evaluations and thus consider cost to be incurred purely by querying the information sources.
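A minimal sketch of such a percentile estimator (our illustration; the integrand is an arbitrary assumption):

    import numpy as np
    from scipy import stats

    def percentile_estimator(f, measure, n):
        """Nodes at equal probability mass of `measure`; for a uniform measure
        this reduces to a right Riemann sum."""
        nodes = measure.ppf(np.arange(1, n + 1) / n)
        return float(np.mean(f(nodes)))

    f = lambda x: np.sin(2 * np.pi * x) + x
    print(percentile_estimator(f, stats.uniform(0, 1), 1000))   # close to 0.5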
To render the integration problem more difficult, we modify the Forrester functions to vary on a smaller length scale by adding a sinusoidal term and adapting some parameters, yielding ‘wigglified’ primary and secondary sources f_1 and f_2 which we integrate from 0 to 1 against a uniform measure (cf. Figure 4, top left). To avoid over- or underfitting, we set a conservative gamma prior on the lengthscale with a mode at a small fraction of the domain for both vbq and ams-bq, and assume zero observation noise. With six more hyperparameters than vbq (due to the construction of B), ams-bq is more prone to over-/underfitting, and we further set a prior on the coregionalization matrix B (cf. Section 2.3) with parameters estimated from the initial three data points using empirical Bayes. This is to avoid initial over- or under-estimation of the correlation between sources, which would cause the active scheme to select only the secondary or only the primary source, respectively. Compared to the previous experiment, the cost is changed to have a minimum, but is still composed of a sum of logistic functions and normalized to lie in (0, 1] (Figure 4, top right). The effect of these cost functions on the final state is depicted in Figure 10 in the supp. mat. Furthermore, this setting reveals the pathology of the ip acquisition (cf. Section 3.2), which everlastingly re-evaluates the secondary source at the location of minimal cost. The convergence behavior of the well-behaved acquisition functions mi and ivr is displayed in Figure 4 (bottom) in comparison to vbq and pe. The hyperparameters of the gp are optimized after every newly acquired node, both for vbq and ams-bq. Figure 4 shows the superior performance of both ams-bq methods in arriving close to the true integral with little budget. The vertical jumps in the ams-bq curves occur when the secondary source is evaluated at cheap cost.

4.2 A Simulation of Infections


Figure 5: Relative error vs. budget spent for the sir model for the maximum number of simultaneously infected individuals (left) and for the time after which the maximum occurs (right). A cost of 1 corresponds to one evaluation of the primary source.

We now consider multi-source models in which sources come with input-independent cost, a.k.a. multi-fidelity models (bottom rear mi vertex in Figure 1). We choose an epidemiological model in which evaluating the primary source requires running numerous stochastic simulations, while the secondary source solves a system of ordinary differential equations. Epidemiological models deal with simulating the propagation of an infectious disease through a population. The sir model forms the base for many compartmental models and assumes a population of fixed size N where, at any point in time, each individual is in one of three states—susceptible, infected, and recovered (sir)—with sizes S, I, and R Kermack1927. The dynamics are determined by stochastic discrete-time events of individuals changing infection state, for which Poisson processes (i.e., exponentially distributed inter-event times) are commonly assumed (see e.g., Daley1999). In the thermodynamic limit where N is large, the average dynamics is governed by a system of odes that does not admit a generic analytic solution. There are two parameters in the sir model: the infection rate β and the recovery rate γ. Model details and experiment setup can be found in Section C (supp. mat.).

For the ams-bq experiment, we assume that we know γ, but we are uncertain about β. We are interested in the expected maximum number of simultaneously infected individuals and the time at which this maximum occurs, which might be relevant for vaccination planning. Querying the primary source for the quantities of interest as a function of β requires numerous realizations of a stochastic four-compartment epidemic model (an extension to the sir model) using the Gillespie algorithm Gillespie1976, Gillespie1977. For each trajectory, the maximum value and time are computed and henceforth averaged over. In our implementation, each query of the primary source takes orders of magnitude longer on a laptop’s cpu than the secondary source, which solves the system of odes for a given β and computes the maximum value and time of the resulting trajectory. As in previous experiments, we set a gamma prior on the lengthscale, a prior on the coregionalization matrix B, and the noise variance to zero as in Section 4.1. Both vbq and ams-bq are given the same initial value of the primary source, and ams-bq additionally gets the value of the secondary source at the same location, as well as one more random datum from the secondary source. This is justified since ams-bq needs to learn more hyperparameters than vbq and secondary source evaluations are very cheap. Otherwise, if the initial evaluations of the secondary source were further apart than the prior lengthscale from the location of the initial primary datum, virtually zero correlation would be inferred between the sources, and the primary source would be evaluated until a sampled location roughly coincides with locations where the secondary source has been evaluated.


Figure 6: Evaluation sequences of primary and secondary source in the sir experiment.

Figure 5 shows the relative error of the ams-bq estimator against normalized cost, as compared to vbq and pe, for the maximum number of simultaneously infected individuals (left) and the time after which the maximum occurs (right). The horizontal dashed line shows the integral of the secondary source obtained with a Monte Carlo estimator. This illustrates that simply using the secondary source for the integral estimate might be computationally cheap, but results in an unknown bias of the integral estimate. In the left plot, ams-bq achieves a good estimate with one additional evaluation of the primary source only, while vbq takes another six evaluations. Again, the vertical jumps for ams-bq are caused by evaluations of the secondary source. The initial high confidence on the integral is caused by the choice of prior on the output scale from the initial data, which is located in the tail of the gamma prior on β. Figure 6 displays the order in which ams-bq evaluates the primary and secondary source.

5 Discussion

The multi-source model presented in Section 2.2 can be extended in various ways to increase its expressiveness, by using a more elaborate kernel (e.g., one lengthscale per source), or by encoding knowledge about the functions to be integrated, e.g., a probabilistic integrand. Other applications might come with the complication that the cost function is unknown a priori and needs to be learned during the active bq-loop from measurements of the amount of resources required during the queries. A simple example of this was presented in Section 4.2, where the cost was parameterized by a constant which was then estimated during the initial observations. A probabilistic (in contrast to parametrized) model of the cost would induce an acquisition function which is conditioned not only on the uncertain model predictions but also on the uncertain cost predictions. Furthermore, as in other active learning schemes, non-myopic steps for acquiring multiple observations at once might be beneficial, especially when the multi-source model is already well-known, i.e., it does not benefit from being re-fitted after seeing new observations, or when multiple evaluations of sources come at lower cost than evaluating sequentially. On the experimental side, more elaborate applications of ams-bq in areas of active research are reserved for future work.

5.1 Conclusion

We have placed multi-source bq in a loop and thus enabled active learning to infer the integral of a primary source while including information from cheaper secondary sources. We discovered that utilities that yield redundant acquisition policies in vbq give rise to various policies, some desirable and others pathological, when evaluation cost is accounted for. Our experiments illustrate that with the sensible acquisition functions, the ams-bq algorithm allocates budget to information retrieval more efficiently than traditional methods do for solving expensive integrals.

Acknowledgements

The authors thank Andrei Paleyes for advice in software-related matters, as well as Philipp Hennig, Motonobu Kanagawa, and Aaron Klein for useful discussions. Alexandra Gessner acknowledges support by the International Max Planck Research School for Intelligent Systems (IMPRS-IS).

Supplementary Material

Active Multi-Information Source Bayesian Quadrature

Appendix A Multi-source models and multi-output GPs

We have seen in Section 2.2 that linear multi-source models can be phrased in terms of multi-output gps. Typically, the goal of multi-output gps is to model a vector-valued function f: Ω → ℝ^L, and observations come as a vector ŷ_i = f(x_i) + ε̂_i, where ε̂_i is a noise vector. In multi-source models, we wish to observe only elements of f, i.e., y_i = f_{l_i}(x_i) + ε_i. These observations can be written as projections of the vector-valued observations,

    y_i = e_{l_i}ᵀ ŷ_i,    (11)

where e_l denotes a vector with a 1 in the l-th coordinate and zeros elsewhere. Let ŷ denote the vector of stacked vector-valued noisy observations ŷ_i. Then the corresponding vector of observations of single elements is

    y = P ŷ,    (12)

where P is a sparse matrix whose rows are the projection vectors e_{l_i}ᵀ placed at the corresponding blocks. Note the delicate notational difference between the observations of single elements of f, y, and a single evaluation of the vector-valued function, ŷ_i. The covariance matrix between all of the observations is

    cov(y, y) = P cov(ŷ, ŷ) Pᵀ,    (13)

where cov(ŷ, ŷ) is the covariance of the stacked vector-valued evaluations, including the noise. With the following mapping, we arrive at the multi-source notation introduced in Section 2:

    k((x_i, l_i), (x_j, l_j)) = e_{l_i}ᵀ K(x_i, x_j) e_{l_j}.    (14)

Hence, the notational detour over vector-valued observations is not required and evaluations of individual sources can be incorporated easily in the multi-source model. From the mapping of Eq. (14) follow the posterior mean and covariance of the multi-source model, Eq. (2).

Appendix B Additional plots for Section 4.1

Section 4.1 showed two examples to demonstrate the behavior of our derived acquisition functions. All relevant details and cross-references are in the captions.


Figure 7: mi, ivr, and ip acquisitions for the top row of Figure 3. mi and ivr do not differ a lot, i.e., the correlation is rarely large enough for mi to leave the linear regime. mi puts slightly more emphasis on the primary source where x is close to 1. This indicates that the correlation between the primary source and F is quite large there. The bottom plot displays the pathology of ip, where the acquisition for the secondary source essentially follows the inverse cost.


Figure 8: The cost used for the experiment in Figure 3.


Figure 9: A later state of the experiment shown in Figure 3. Note the absence of evaluations for small x, where the costs of the two sources are similar.


Figure 10: Final state of the gp for the second experiment explained in Section 4.1 and shown in Figure 4 (‘wigglified Forrester’). Note the increasing density of evaluations of the secondary source where the cost is minimal, and the lack of queries where the sources’ costs are similar. The leftmost evaluation is at the primary source. See Figure 4 for the cost functions. In this experiment, the ip acquisition exclusively evaluates at the location of the minimum of the secondary cost function and is thus stuck.

Appendix C Details for the infections model

C.1 The SIR model and extensions

When the population size N is large, the sir (susceptible, infected, recovered) model can be described by the following system of ordinary differential equations,

    dS/dt = −β S I / N,
    dI/dt = β S I / N − γ I,    (15)
    dR/dt = γ I,

in which β is the rate of infection and γ the rate of recovery. It is the most basic of a series of compartmental epidemiological models. Various extensions exist to accommodate additional effects, e.g., vital dynamics, immunity, incubation time (cf. e.g., Hethcote2000). Some of these extensions serve as a general model refinement, others are relevant to specific diseases.

Statistical properties, however, are not captured by the description through odes and call for a stochastic model. The Gillespie algorithm Gillespie1976, Gillespie1977 enables discrete and stochastic simulations in which every trajectory is an exact sample of the solution of the ‘master equation’ that defines a probability distribution over solutions to a stochastic equation. In the sir model, the rate constants are time-independent and thus the underlying process is Markovian, with exponentially distributed inter-event times. Here, an event denotes the transition of one individual from one compartment to another (e.g., an infection, S → I). A minimal sketch of such a simulation is given below.
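A minimal, self-contained Gillespie simulation of the plain sir model (our sketch; the paper's primary source simulates a four-compartment extension, and the parameter values here are illustrative assumptions):

    import numpy as np

    def gillespie_sir(beta, gamma, N=500, I0=1, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        S, I, R, t = N - I0, I0, 0, 0.0
        times, infected = [t], [I]
        while I > 0:
            rate_inf = beta * S * I / N           # S -> I events
            rate_rec = gamma * I                  # I -> R events
            total = rate_inf + rate_rec
            t += rng.exponential(1.0 / total)     # exponential inter-event time
            if rng.random() < rate_inf / total:   # choose the next event
                S, I = S - 1, I + 1
            else:
                I, R = I - 1, R + 1
            times.append(t)
            infected.append(I)
        return np.array(times), np.array(infected)

    t, I = gillespie_sir(beta=1.0, gamma=0.25)
    print(I.max(), t[I.argmax()])   # max simultaneously infected and its time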

C.2 Experimental setup


Figure 11: Demonstration of the sir and seir models. See text for details.

For the ams-bq experiment, we assume that we know the recovery rate γ, but we are uncertain about the infection rate β. Therefore, we rescale the odes and place a shifted gamma prior on β with shape and scale parameters 5 and 4, respectively. With this prior we encode our belief that the infection rate is significantly larger than the recovery rate, so an onset of the epidemic is very likely. Also, we set the population size N to be well below the thermodynamic limit and set one individual to be infected initially. We are interested in the expected maximum number of simultaneously infected individuals and the time at which this maximum occurs, which might be relevant for vaccination planning. Querying the primary source for the quantities of interest as a function of β requires numerous realizations of a stochastic four-compartment epidemic model using the Gillespie algorithm Gillespie1976, Gillespie1977; in addition to the base model (sir), we include the state ‘exposed’, in which individuals are infected but not yet infectious. The modified system of odes that also accounts for the assumed known incubation time is

    dS/dt = −β S I / N,
    dE/dt = β S I / N − α E,
    dI/dt = α E − γ I,    (16)
    dR/dt = γ I,

where α denotes the inverse incubation time, which we set to a fixed value. We absorb the prior on β in the black-box function for all methods.

Figure 11 shows the sir and seir models (Eq. (15) and (16), respectively) with 10 stochastic trajectories (thin lines). The solid lines indicate the mean of 100 of these stochastic realizations, and the dashed lines show the solution of the odes, in both cases for the sir model. We also use the sir model for solving the odes even though the stochastic model simulates the seir model. The purpose of this is to mimic a case where secondary sources are simplified simulations in which minor components are omitted. In the stochastic case, there is not always an outbreak of the disease, i.e., the initially infected individual recovers before infecting someone else. This causes the average to level off significantly below 1. For the integrals, only outbreaks are taken into account. The corresponding integrands for the quantities of interest are shown in Figure 12. A sketch of the cheap secondary source is given below.
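A sketch of this secondary source under our illustrative assumptions (sir instead of seir, arbitrary parameter values), solving the odes of Eq. (15) and extracting the quantities of interest:

    import numpy as np
    from scipy.integrate import solve_ivp

    def secondary_source(beta, gamma=0.25, N=500, I0=1, t_end=100.0):
        """Solve the SIR odes and return (max of I(t), time of the maximum)."""
        def rhs(t, y):
            S, I, R = y
            return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]
        t_eval = np.linspace(0.0, t_end, 2001)
        sol = solve_ivp(rhs, (0.0, t_end), [N - I0, I0, 0.0], t_eval=t_eval)
        I = sol.y[1]
        return I.max(), sol.t[I.argmax()]

    print(secondary_source(beta=1.0))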


Figure 12: Integrands used for the epidemiological model. Solid lines denote the primary source (i.e., stochastic simulations), dashed lines indicate the secondary source (solving the system of odes). It is apparent that simply integrating the cheap source introduces a significant bias.

Appendix D Bivariate linear combination of Gaussians

We construct an integrand (primary source) f_1 in the 2D domain as a linear combination of normalized Gaussian basis functions,

    f_1(x) = Σ_i w_i N(x; μ_i, Σ_i).    (17)

For this, we sample means μ_i uniformly in the 2D domain. We then sample corresponding random covariance matrices Σ_i. The scalar weights w_i are sampled from a standard Gaussian and can be negative. Thus f_1 is not a probability density function but rather a linear combination of Gaussians with varying location, shape, and weight. We then construct secondary sources f_2 and f_3 consecutively by adding uniform noise to the means, and additive uniform noise to the diagonal of the covariance matrices. Thus, with each additional source, each of the means gets randomly but consecutively shifted up and right, and the basis functions randomly become wider and flatter. Additionally, we consecutively add Gaussian random noise to the weights, which ensures that the true integrals of the secondary sources differ from the integral of the primary source. All sources are depicted in Figure 13: the primary source on the left, and the secondary sources f_2 and f_3 in the middle and on the right, respectively. The cost of evaluating the primary source is 1 everywhere; the costs of evaluating f_2 and f_3 are 0.05 of the primary cost each. A sketch of this construction is given below.
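A minimal Python sketch of this construction (the sampling ranges are our assumptions):

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    n_basis = 10
    means = rng.uniform(0.0, 1.0, size=(n_basis, 2))          # mu_i in the 2D domain
    covs = [np.diag(rng.uniform(0.01, 0.05, size=2)) for _ in range(n_basis)]
    weights = rng.standard_normal(n_basis)                    # w_i, possibly negative

    def mixture(x, w, mu, cov):
        return sum(wi * multivariate_normal(mi, ci).pdf(x)
                   for wi, mi, ci in zip(w, mu, cov))

    f1 = lambda x: mixture(x, weights, means, covs)           # primary source

    # A secondary source: shift means up/right, widen covariances, jitter weights.
    means2 = means + rng.uniform(0.0, 0.1, size=means.shape)
    covs2 = [c + np.diag(rng.uniform(0.0, 0.02, size=2)) for c in covs]
    weights2 = weights + 0.1 * rng.standard_normal(n_basis)
    f2 = lambda x: mixture(x, weights2, means2, covs2)

    p = np.array([0.5, 0.5])
    print(f1(p), f2(p))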

The priors on the kernel lengthscale and coregionalization matrix are set analogously to the other experiments already described in Section 4.1. ams-bq is initialized with one evaluation of the primary source and two evaluations each of the secondary sources which amounts to a total initial cost of 1.2 (initial evaluations shown as red dots in Figure 13). Vanilla-bq is initialized with three evaluations which are needed to get an initial guess for its hyperparameters (initial cost=3). The result is shown in Figure 14 which plots relative error of the integral estimate versus the budget spent as well as two standard deviations of the relative error as returned by the model. It is apparent that ams-bq finds a good solution faster than vanilla-bq.


Figure 13: Integrands used for the bivariate linear combination of Gaussians. From left to right: primary source f_1 and secondary sources f_2 and f_3. Initial evaluations are marked as red dots.


Figure 14: Relative error vs. budget spent for vanilla-bq and ams-bq.


Figure 15: Evaluation sequence of primary and secondary sources in 2D experiment (250 evaluations shown).

Figure 15 illustrates the sequence of sources chosen by ams-bq. Secondary source f_2 is chosen more often than secondary source f_3 at equal evaluation cost of 0.05. This is intuitive since f_2, by construction, provides more information about f_1 than f_3, but both secondary sources shrink the budget equally when queried. The percentages of evaluations spent on the individual sources f_1, f_2, and f_3 after spending a total budget of 50 reflect this preference.

Table 1: Summary of the notation used. Generally, vector-valued quantities are denoted by lower-case bold letters and matrices by upper-case bold letters; normal font denotes scalars.

L — number of sources, indexed by l, where l = 1 is the primary source
d — dimension of the input space
N — number of source-input-evaluation triplets, indexed by i
M — number of potential new source-input-evaluation triplets
x — input location
f_l — the l-th source, f_l: Ω → ℝ
F_l — integral of the l-th source
π — integration measure on Ω
Ω — domain that is integrated over, Ω ⊆ ℝ^d
F — random variable for the integral of interest
(x, l) — source-location pair at which f_l is evaluated
(X, ℓ) — source-location pairs
f_ℓ(X) — noise-free function evaluations
y — noisy function evaluations
ŷ — simultaneous evaluation of all sources
ε — noise vector
Σ — diagonal noise matrix with noise per level
D — collected data triplets
K — matrix-valued covariance function
k — covariance function
k((X, ℓ), (x, l)) — vector-valued covariance
G — Gram matrix
μ — gp prior mean for the multi-output gp
μ(X, ℓ) — prior mean evaluated at source-location pairs
m — posterior mean at a source
v — posterior covariance of sources
∫ μ dπ — integrated prior mean
z — kernel mean of a source at source-location pairs
𝒞 — initial error
B — coregionalization matrix for the kernel used in the icm
k_x — kernel encoding purely spatial correlation in the icm
(X*, ℓ*, y*) — potential new source-location-evaluation triplets
c_l(x) — cost of evaluating f_l at x; c_l(x) ∈ (0, 1]
c(X*, ℓ*) — query cost; in the myopic case denoted c_l(x)
ϱ² — scalar correlation function for (X*, ℓ*), defined in Eq. (6)
a — non-myopic acquisition function; a(x, l) in the myopic case