Knowledge Gradient for Selection with Covariates: Consistency and Computation

06/12/2019 ∙ by Xiaowei Zhang, et al. ∙ Shanghai Jiao Tong University

Knowledge gradient is a design principle for developing Bayesian sequential sampling policies. In this paper we consider the ranking and selection problem in the presence of covariates, where the best alternative is not universal but depends on the covariates. In this context, we prove that under minimal assumptions, the sampling policy based on knowledge gradient is consistent, in the sense that following the policy the best alternative as a function of the covariates will be identified almost surely as the number of samples grows. We also propose a stochastic gradient ascent algorithm for computing the sampling policy and demonstrate its performance via numerical experiments.




1 Introduction

We consider the ranking and selection (R&S) problem in the presence of covariates. This problem has emerged naturally with the popularization of data and decision analytics in recent years. For example, the appeal of an online advertisement depends on consumer preference (Arora et al. 2008). Customized advertising therefore aims to present to each consumer the advertisement that is most suitable for her. As another example, the effect of a treatment regimen depends on patients’ biometric characteristics (Kim et al. 2011). Personalized medicine therefore aims to select the treatment regimen that is customized to each patient.

Formally, R&S with covariates can be postulated as follows. A decision maker is presented with a finite collection of alternatives. The performance of each alternative is unknown and depends on the covariates. Suppose that the decision maker has access to noisy samples of any alternative for any chosen value of the covariates, but the samples are expensive to acquire. Given a finite sampling budget, the goal is to develop an efficient sampling policy that specifies the sampling locations, i.e., which alternative and what value of the covariates to sample from, so that upon termination of the sampling, the decision maker can identify a decision rule that accurately specifies the best alternative as a function of the covariates.

Being a classic problem in the area of stochastic simulation, R&S has a vast literature. We refer to Kim and Nelson (2006) and Chen et al. (2015) for reviews on the subject with emphasis on frequentist and Bayesian approaches, respectively. Most of the prior work, however, does not consider the presence of covariates, and thus the best alternative to select is universal rather than varying as a function of the covariates. There are several exceptions, including Shen et al. (2017), Hu and Ludkovski (2017), and Pearce and Branke (2017). Among them the first takes a frequentist approach to solve R&S with covariates, whereas the other two take a Bayesian approach. The present paper adopts a Bayesian perspective as well.

A first main contribution of this paper is to develop a sampling policy based on knowledge gradient (KG) for R&S with covariates. KG, introduced in Frazier et al. (2008), is a design principle that has been widely used for developing Bayesian sequential sampling policies to solve a variety of optimization problems, including R&S, in which evaluation of the objective function is noisy and expensive. In its basic form, KG begins with assigning a multivariate normal prior on the unknown constant performance of all alternatives. In each iteration, it chooses the sampling location by maximizing the increment in the expected value of the information that would be gained by taking a sample from the location. Then, the posterior is updated upon observing the noisy sample from the chosen location. The sampling efficiency of KG-type policies is often competitive with, or better than, that of other sampling policies; see Frazier et al. (2009), Scott et al. (2011), Ryzhov (2016), and Pearce and Branke (2018) among others.

A KG-based sampling policy for R&S with covariates is also proposed in Pearce and Branke (2017). The main difference here is that our treatment is more general. First, we allow the sampling noise to be heteroscedastic, whereas it is assumed to be constant for different locations of the same alternative in their work. Heteroscedasticity is of particular significance for simulation applications such as queueing systems. Second, we take into account possible variations in sampling cost at different locations, whereas the sampling cost is simply treated as constant everywhere in Pearce and Branke (2017). Hence, our policy, which we refer to as integrated knowledge gradient (IKG), attempts in each iteration to maximize a “cost-adjusted” increment in the expected value of information.

A second main contribution of this paper is to provide a theoretical analysis of the asymptotic behavior of the IKG policy, whereas Pearce and Branke (2017) conducted only numerical investigation. In particular, we prove that IKG is consistent in the sense that for any value of the covariates, the selected alternative upon termination of the policy will converge to the true best almost surely as the sampling budget grows to infinity.

Consistency of KG-type policies has been established in various settings, mostly for problems where the number of feasible solutions is finite, including R&S (Frazier et al. 2008, 2009, Frazier and Powell 2011, Mes et al. 2011), and discrete optimization via simulation (Xie et al. 2016). KG is also used for Bayesian optimization of continuous functions in Wu and Frazier (2016), Poloczek et al. (2017), and Wu et al. (2017). However, in these papers the continuous domain is discretized first, which effectively reduces the problem to one with finite feasible solutions, in order to facilitate their asymptotic analysis. The finiteness of the domain is critical in the aforementioned papers, because the asymptotic analysis there boils down to proving that each feasible solution can be sampled infinitely often. This, by the law of large numbers, implies that the variance of the objective value of each solution will converge to zero. Thus, the optimal solution will be identified ultimately since the uncertainty about the solutions will be removed completely in the end.

By contrast, proving consistency of KG-type policies for continuous solution domains demands a fundamentally different approach, since most solutions in a continuous domain would hardly be sampled even once after all. To the best of our knowledge, the only work of this kind is Scott et al. (2011), which studies a KG-type policy for Bayesian optimization of continuous functions. Assigning a Gaussian process prior on the objective function, they established the consistency of the KG-type policy basically by leveraging the continuity of the covariance function of the Gaussian process, which intuitively suggests that if the variance at one location is small, then the variance in its neighborhood ought to be small too.

We cast R&S with covariates as a problem of ranking a finite number of Gaussian processes, which therefore has both discrete and continuous elements structurally. As a result, we establish the consistency of the proposed IKG policy by proving the following two facts: (i) each Gaussian process is sampled infinitely often, and (ii) the infinitely many samples assigned to a given Gaussian process drive its posterior variance at any location to zero, thanks to the assumed continuity of its covariance function. The theoretical analysis in this paper combines ideas developed for discrete and continuous problems in Frazier et al. (2008) and Scott et al. (2011), respectively.

A particularly noteworthy characteristic of this paper is that our assumptions are simple and minimal. By contrast, for the proof in Scott et al. (2011) to be valid, technical conditions are imposed to regulate the asymptotic behavior of the posterior mean function and the posterior covariance function of the underlying Gaussian process. Nevertheless, the two conditions are difficult to verify. We do not impose such conditions. We achieve the substantial simplification of the assumptions by leveraging reproducing kernel Hilbert space (RKHS) theory. The theory has been used widely in machine learning (Steinwart and Christmann 2008), but its use in the analysis of KG-type policies is new. We develop several technical results based on RKHS theory to facilitate analysis of the asymptotic behavior of the posterior covariance function.

A third main contribution of this paper is that we develop an algorithm to solve the stochastic optimization problem that determines the sampling decision of the IKG policy in each iteration. In Pearce and Branke (2017), this optimization problem is addressed by the sample average approximation method with a derivative-free optimization solver. Instead, we propose a stochastic gradient ascent (SGA) algorithm, taking advantage of the fact that a gradient estimator can be derived analytically for many popular covariance functions. Numerical experiments demonstrate the finite-sample performance of the IKG policy in conjunction with the SGA algorithm.

We conclude the introduction by reviewing briefly the most pertinent literature. A closely related problem is multi-armed bandit (MAB); see Bubeck and Cesa-Bianchi (2012) for a comprehensive review on the subject. The significance of covariates, and thereby contextual MAB, has also drawn substantial attention in recent years; see Rusmevichientong and Tsitsiklis (2010), Yang and Zhu (2002), Krause and Ong (2011), and Perchet and Rigollet (2013) among others. There are two critical differences between contextual MAB and R&S with covariates. First, the former generally assumes that the covariates arrive exogenously in a sequential manner, and the decision maker can choose at which arm (or alternative) to sample but not the value of the covariates. By contrast, the latter assumes that the decision maker is capable of choosing both the alternative and the covariates when specifying sampling locations. Second, MAB focuses on minimizing the regret that is caused by choosing inferior alternatives and accumulated during the sampling process, whereas R&S focuses on identifying the best alternative eventually, and the regret is not the primary concern.

The rest of the paper is organized as follows. In Section 2 we follow a nonparametric Bayesian approach to formulate the problem of R&S with covariates, introduce the IKG policy, and present the main result. In Section 3 we prove that the posterior variance function converges uniformly. This result is not only of interest in its own right, but also crucial for proving the consistency of our sampling policy under assumptions weaker than those imposed for prior related problems. In Section 4 we prove the consistency of our sampling policy in the sense that the estimated best alternative as a function of the covariates converges to the truth with probability one as the number of samples grows to infinity. In Section 5 we develop an SGA algorithm for computing our sampling policy and demonstrate its performance via numerical experiments. We conclude in Section 6 and collect additional technical results in the Appendix.

2 Problem Formulation

Suppose that a decision maker is presented with $M$ competing alternatives. For each $i = 1, \ldots, M$, the performance of alternative $i$ depends on a vector of $d$ covariates and is denoted by $\theta_i(x)$ for $x \in \mathcal{X} \subseteq \mathbb{R}^d$. The performances are unknown and can only be learned via sampling. In particular, for any $i$ and $x$, one can acquire possibly multiple noisy samples of $\theta_i(x)$. The decision maker aims to select the “best” alternative for a given value of $x$, i.e., identify $\arg\max_{1 \le i \le M} \theta_i(x)$. However, since the sampling is usually expensive in time and/or money, instead of estimating the performances every time a new value of $x$ is observed and then ranking them, it is preferable to learn offline the decision rule

$$i^*(x) \in \arg\max_{1 \le i \le M} \theta_i(x) \tag{1}$$

as a function of $x$, through a carefully designed sampling process. Equipped with such a decision rule, the decision maker can select the best alternative upon observing the covariates in a timely fashion. In addition, the decision maker may have some knowledge with regard to the covariates. For example, certain values of the covariates may be more important or appear more frequently than others. Suppose that this kind of knowledge is expressed by a probability density function $\gamma(x)$ on $\mathcal{X}$.

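During the offline learning period, we need to make a sequence of sampling decisions $\{(a^n, v^n) : n = 0, 1, \ldots\}$, where $(a^n, v^n)$ means that the $(n+1)$-th sample, denoted by $y^{n+1}$, is taken from alternative $a^n$ at location $v^n$. We assume that given $\theta_{a^n}(v^n)$, $y^{n+1}$ is an independent unbiased sample having a normal distribution, i.e.,

$$y^{n+1} \mid \theta_{a^n}(v^n) \sim \mathcal{N}\bigl(\theta_{a^n}(v^n), \lambda_{a^n}(v^n)\bigr).$$

Here, $\lambda_i(x)$ is the variance of a sample of $\theta_i(x)$ given $\theta_i(x)$ and is assumed to be known. Moreover, suppose that the cost of taking a sample from alternative $i$ at location $x$ is $c_i(x)$, which is also assumed to be known. In practice, both $\lambda_i(x)$ and $c_i(x)$ are unknown and need to be estimated. Suppose that the total sampling budget for offline learning is $B > 0$, and the sampling process is terminated when the budget is exhausted. Mathematically, we will stop with the $N(B)$-th sample, where

$$N(B) := \min\Bigl\{N : \sum_{n=0}^{N} c_{a^n}(v^n) > B\Bigr\}.$$

Consequently, the sampling decisions are $\{(a^n, v^n) : n = 0, \ldots, N(B) - 1\}$ and the samples taken during the process are $\{y^{n+1} : n = 0, \ldots, N(B) - 1\}$. Notice that $N(B) = \lfloor B \rfloor$ if $c_i(x) \equiv 1$ for $i = 1, \ldots, M$, in which case the sampling budget is reduced to the number of samples.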

We follow a nonparametric Bayesian approach to model the unknown functions $\{\theta_i : i = 1, \ldots, M\}$ as well as to design the sampling policy. We treat the $\theta_i$’s as random functions and impose a prior on them under which they are mutually independent, although this assumption may be relaxed. Suppose that $x$ takes continuous values and that under the prior, $\theta_i$ is a Gaussian process with mean function $\mu_i^0(x) := \mathbb{E}[\theta_i(x)]$ and covariance function $k_i^0(x, x') := \mathrm{Cov}[\theta_i(x), \theta_i(x')]$ that satisfies the following assumption.

Assumption 1.

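For each $i = 1, \ldots, M$, there exists a constant $\tau_i > 0$ and a positive continuous function $\rho_i : \mathbb{R}^d \to \mathbb{R}_+$ such that $k_i^0(x, x') = \tau_i^2 \rho_i(x - x')$. Moreover,

  (i) $\rho_i(|\delta|) = \rho_i(\delta)$, where $|\cdot|$ means taking the absolute value component-wise;

  (ii) $\rho_i(\delta)$ is decreasing in $\delta$ component-wise for $\delta \ge 0$;

  (iii) $\rho_i(0) = 1$ and $\rho_i(\delta) \to 0$ as $\|\delta\| \to \infty$, where $\|\cdot\|$ denotes the Euclidean norm.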

Assumption 1 stipulates that $k_i^0$ is second-order stationary, i.e., it depends on $x$ and $x'$ only through the difference $x - x'$. In addition, $\tau_i^2$ can be interpreted as the prior variance of $\theta_i(x)$ for all $x$, and $\rho_i(x - x')$ as the prior correlation between $\theta_i(x)$ and $\theta_i(x')$, which decreases as $|x - x'|$ increases.

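A variety of covariance functions satisfy Assumption 1. Notable examples include the squared exponential (SE) covariance function

$$k_i^0(x, x') = \tau_i^2 \exp\Bigl(-\sum_{j=1}^{d} \alpha_{i,j} (x_j - x'_j)^2\Bigr),$$

where $\tau_i$ and $\alpha_{i,j}$’s are positive parameters, and the Matérn covariance function

$$k_i^0(x, x') = \tau_i^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \Bigl(\sqrt{2\nu}\, r_i(x, x')\Bigr)^{\nu} K_\nu\Bigl(\sqrt{2\nu}\, r_i(x, x')\Bigr), \qquad r_i(x, x') := \Bigl(\sum_{j=1}^{d} \alpha_{i,j} (x_j - x'_j)^2\Bigr)^{1/2},$$

where $\nu$ is a positive parameter that is typically taken as half-integer (i.e., $\nu = p + 1/2$ for some nonnegative integer $p$), $\Gamma$ is the gamma function, and $K_\nu$ is the modified Bessel function of the second kind. The covariance function reflects one’s prior belief about the unknown functions. We refer to Rasmussen and Williams (2006, Chapter 4) for more types of covariance functions.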
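As an illustration, the two covariance families named above can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions not fixed by the text: the function names `se_cov` and `matern52_cov`, the per-coordinate scale parameters `alpha`, and the choice of smoothness 5/2 for the Matérn case (for which the Bessel-function expression reduces to a well-known closed form) are all illustrative.

```python
import math

def _scaled_sq_dist(x, xp, alpha):
    """Weighted squared distance sum_j alpha_j * (x_j - xp_j)^2."""
    if alpha is None:
        alpha = [1.0] * len(x)  # hypothetical default: unit scales
    return sum(a * (u - v) ** 2 for a, u, v in zip(alpha, x, xp))

def se_cov(x, xp, tau=1.0, alpha=None):
    """Squared exponential covariance: tau^2 * exp(-weighted squared distance)."""
    return tau ** 2 * math.exp(-_scaled_sq_dist(x, xp, alpha))

def matern52_cov(x, xp, tau=1.0, alpha=None):
    """Matern covariance with smoothness 5/2 (closed form, no Bessel function)."""
    r = math.sqrt(_scaled_sq_dist(x, xp, alpha))
    s = math.sqrt(5.0) * r
    return tau ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)
```

Both functions return `tau ** 2` at zero distance and decay as the two points move apart, in line with the interpretation of the prior variance and prior correlation given after Assumption 1.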

2.1 Bayesian Updating Equations

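For each $n$, let $\mathcal{F}^n$ denote the $\sigma$-algebra generated by $(a^0, v^0), y^1, \ldots, (a^{n-1}, v^{n-1}), y^n$, the sampling decisions and the samples collected up to time $n$. Suppose that $(a^n, v^n)$ is $\mathcal{F}^n$-measurable, that is, $(a^n, v^n)$ depends only on the information available at time $n$. In addition, we use the notation $\mathbb{E}^n[\cdot] := \mathbb{E}[\cdot \mid \mathcal{F}^n]$, and define $\mathrm{Var}^n[\cdot]$ and $\mathrm{Cov}^n[\cdot, \cdot]$ likewise.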

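Given the setup of our model, it is easy to derive that $\theta_1, \ldots, \theta_M$ are independent Gaussian processes under the posterior distribution conditioned on $\mathcal{F}^n$, $n = 1, 2, \ldots$. In particular, under the prior mutual independence, taking samples from one unknown function does not provide information on another. Let $V_i^n := \{v^\ell : a^\ell = i, \ell = 0, \ldots, n-1\}$ denote the set of the locations of the samples taken from $\theta_i$ up to time $n$ and define $y_i^n := \{y^{\ell+1} : a^\ell = i, \ell = 0, \ldots, n-1\}$ likewise. With slight abuse of notation, when necessary, we will also treat $V_i^n$ as a matrix wherein the columns correspond to the points in the set and are arranged in the order of appearance, and $y_i^n$ as a column vector with elements also arranged in the order of appearance. Then, the posterior mean and covariance functions of $\theta_i$ are given by

$$\mu_i^n(x) := \mathbb{E}^n[\theta_i(x)] = \mu_i^0(x) + k_i^0(x, V_i^n)\bigl[k_i^0(V_i^n, V_i^n) + \lambda_i(V_i^n)\bigr]^{-1}\bigl[y_i^n - \mu_i^0(V_i^n)\bigr], \tag{3}$$

$$k_i^n(x, x') := \mathrm{Cov}^n[\theta_i(x), \theta_i(x')] = k_i^0(x, x') - k_i^0(x, V_i^n)\bigl[k_i^0(V_i^n, V_i^n) + \lambda_i(V_i^n)\bigr]^{-1} k_i^0(V_i^n, x'), \tag{4}$$

where for two sets $V$ and $V'$, $k_i^0(V, V') := [k_i^0(x, x')]_{x \in V, x' \in V'}$ is a matrix of size $|V| \times |V'|$, $\lambda_i(V) := \mathrm{diag}\{\lambda_i(x) : x \in V\}$ is a diagonal matrix of size $|V| \times |V|$, and $\mu_i^0(V) := (\mu_i^0(x) : x \in V)$ is a column vector of size $|V| \times 1$. We refer to, for example, Scott et al. (2011, Section 3.2) for details. Further, the following updating equation can be derived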



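$$\mu_i^{n+1}(x) = \mu_i^n(x) + \sigma_i^n(x, v^n) Z^{n+1}, \tag{5}$$

$$k_i^{n+1}(x, x') = k_i^n(x, x') - \sigma_i^n(x, v^n)\, \sigma_i^n(x', v^n), \tag{6}$$

where $Z^{n+1}$ is an independent standard normal random variable, and

$$\sigma_i^n(x, v^n) := \begin{cases} \dfrac{k_i^n(x, v^n)}{\sqrt{k_i^n(v^n, v^n) + \lambda_i(v^n)}}, & \text{if } a^n = i, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$

In particular, conditioned on $\mathcal{F}^n$ and prior to taking a sample at $v^n$, the predictive distribution of $\mu_i^{n+1}(x)$ is normal with mean $\mu_i^n(x)$ and standard deviation $|\sigma_i^n(x, v^n)|$. Moreover, notice that

$$\mathrm{Var}^{n+1}[\theta_i(x)] = k_i^{n+1}(x, x) = k_i^n(x, x) - [\sigma_i^n(x, v^n)]^2 \le k_i^n(x, x) = \mathrm{Var}^n[\theta_i(x)]. \tag{8}$$

(Note that eqs. 5 to 8 are still valid even if , and/or ) Hence, $\mathrm{Var}^n[\theta_i(x)]$ is non-increasing in $n$. This basically suggests that the uncertainty about each unknown function under the posterior decreases as more samples from it are collected. It is thus both desirable and practically meaningful that such uncertainty would be completely eliminated if the sampling budget is unlimited, in which case one would be able to identify the decision rule eq. 1 perfectly. In particular, we define consistency of a sampling policy as follows.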

Definition 1.

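A sampling policy is said to be consistent if it ensures that

$$\lim_{B \to \infty} \arg\max_{1 \le i \le M} \mu_i^{N(B)}(x) = \arg\max_{1 \le i \le M} \theta_i(x) \tag{9}$$

almost surely (a.s.) for all $x \in \mathcal{X}$.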

Remark 1.

Under the assumption that $\theta_1, \ldots, \theta_M$ are prior independent, collecting samples from $\theta_i$ does not provide information about $\theta_{i'}$ if $i \ne i'$. Therefore, a consistent policy under the independence assumption ought to ensure that the number of samples taken from each $\theta_i$ grows without bound.
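The recursive updating described in this subsection can be illustrated numerically for a single alternative. The sketch below is not taken from the paper: it assumes a one-dimensional covariate restricted to a finite grid, a squared exponential prior with hypothetical parameters, a zero prior mean, and made-up observations. Each call to `update` applies the standard Gaussian conditioning step for one noisy sample whose variance may differ from sample to sample (heteroscedastic noise).

```python
import math

def se_cov(x, xp, tau=1.0, alpha=4.0):
    """Squared exponential prior covariance (hypothetical parameters)."""
    return tau ** 2 * math.exp(-alpha * (x - xp) ** 2)

def update(mu, K, j, y, lam):
    """Condition on one sample y taken at grid index j with sampling variance lam.

    mu is the current posterior mean on the grid, K the posterior covariance matrix.
    """
    denom = math.sqrt(K[j][j] + lam)
    sig = [K[i][j] / denom for i in range(len(mu))]   # scaled covariance with the sampled point
    z = (y - mu[j]) / denom                           # standardized surprise of the sample
    mu_new = [m + s * z for m, s in zip(mu, sig)]
    K_new = [[K[i][l] - sig[i] * sig[l] for l in range(len(mu))]
             for i in range(len(mu))]
    return mu_new, K_new

grid = [i / 10 for i in range(11)]
K0 = [[se_cov(a, b) for b in grid] for a in grid]
mu, K = [0.0] * len(grid), K0

# (grid index, observed value, sampling variance): illustrative data only
samples = [(2, 1.0, 0.1), (5, 0.3, 0.5), (2, 0.8, 0.1), (9, -0.4, 0.2)]
variances = [[K[i][i] for i in range(len(grid))]]
for j, y, lam in samples:
    mu, K = update(mu, K, j, y, lam)
    variances.append([K[i][i] for i in range(len(grid))])
```

Each entry of `variances` is the posterior variance over the grid after one more sample; comparing consecutive entries shows that the variance at every grid point never increases, and that it drops most near the sampled location.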

2.2 Knowledge Gradient Policy

We first assume temporarily that $x$ is given and fixed, and that $c_i(x) = 1$ for $i = 1, \ldots, M$. Then, solving $\max_{1 \le i \le M} \theta_i(x)$ is a selection-of-the-best problem having finite alternatives, and each sampling decision is reduced to choosing an alternative $i$ to take a sample of $\theta_i(x)$. The knowledge gradient (KG) policy introduced in Frazier et al. (2008) is designed exactly to solve such a problem assuming an independent normal prior. Specifically, the knowledge gradient at $i$ is defined there as the increment in the expected value of the information about the maximum at $x$ gained by taking a sample at $i$, that is,

$$\mathrm{KG}^n(i; x) := \mathbb{E}\Bigl[\max_{1 \le a \le M} \mu_a^{n+1}(x) - \max_{1 \le a \le M} \mu_a^n(x) \Bigm| \mathcal{F}^n, a^n = i, v^n = x\Bigr]. \tag{10}$$

Then, each time the alternative $i$ that has the largest value of $\mathrm{KG}^n(i; x)$ is selected to generate a sample of $\theta_i(x)$.

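Let us now return to our context where (1) the covariates are present, (2) each sampling decision consists of both $a^n$ and $v^n$, and (3) each sampling decision may induce a different sampling cost. Since a sample of $\theta_i(x)$ would alter the posterior belief about $\theta_i(v)$ for $v \in \mathcal{X}$, we generalize eq. 10 and define

$$\mathrm{KG}^n(i, x; v) := \frac{1}{c_i(x)}\, \mathbb{E}\Bigl[\max_{1 \le a \le M} \mu_a^{n+1}(v) - \max_{1 \le a \le M} \mu_a^n(v) \Bigm| \mathcal{F}^n, a^n = i, v^n = x\Bigr], \tag{11}$$

which can be interpreted as the increment in the expected value of the information about the maximum at $v$ gained per unit of sampling cost by taking a sample at $(i, x)$. Then, we consider the following integrated KG (IKG)

$$\mathrm{IKG}^n(i, x) := \int_{\mathcal{X}} \mathrm{KG}^n(i, x; v)\, \gamma(v)\, dv, \tag{12}$$

and define the IKG sampling policy as

$$(a^n, v^n) \in \arg\max_{1 \le i \le M,\ x \in \mathcal{X}} \mathrm{IKG}^n(i, x). \tag{13}$$

The integrand of eq. 12 can be calculated analytically, as shown in Lemma 1, whose proof is deferred to the Appendix.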

Lemma 1.

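For all $i = 1, \ldots, M$ and $x \in \mathcal{X}$,

$$\mathrm{IKG}^n(i, x) = \frac{1}{c_i(x)} \int_{\mathcal{X}} \biggl[\, |\tilde\sigma_i^n(v, x)|\, \phi\biggl(\biggl|\frac{\Delta_i^n(v)}{\tilde\sigma_i^n(v, x)}\biggr|\biggr) - |\Delta_i^n(v)|\, \Phi\biggl(-\biggl|\frac{\Delta_i^n(v)}{\tilde\sigma_i^n(v, x)}\biggr|\biggr) \biggr] \gamma(v)\, dv, \tag{14}$$

where $\Delta_i^n(v) := \mu_i^n(v) - \max_{a \ne i} \mu_a^n(v)$, $\tilde\sigma_i^n(v, x) := k_i^n(v, x) / \sqrt{k_i^n(x, x) + \lambda_i(x)}$ (the integrand being understood as zero where $\tilde\sigma_i^n(v, x) = 0$), $\Phi$ is the standard normal distribution function, and $\phi$ is its density function.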
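To make the analytic integrand concrete, the sketch below evaluates the standard closed form for the expected increase of a maximum when one posterior mean receives a normal perturbation, and checks it against a brute-force Monte Carlo average. It is an illustration under stated assumptions rather than the paper's code: `delta` stands for the gap between the posterior mean of the sampled alternative and the best of the others at a location, `sigma_tilde` for the standard deviation of the change in that posterior mean, and all numerical values are made up.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kg_value(delta, sigma_tilde, cost=1.0):
    """Closed form of E[max(mu_i + s*Z, best_other)] - max(mu_i, best_other), per unit cost."""
    s = abs(sigma_tilde)
    if s == 0.0:
        return 0.0  # a sample that cannot move the mean carries no value
    z = abs(delta) / s
    return (s * norm_pdf(z) - abs(delta) * norm_cdf(-z)) / cost

def kg_monte_carlo(mu, i, sigma_tilde, cost=1.0, n_draws=200_000, seed=0):
    """Same quantity estimated by simulation; mu lists the current posterior means."""
    rng = random.Random(seed)
    best_other = max(m for a, m in enumerate(mu) if a != i)
    total = 0.0
    for _ in range(n_draws):
        total += max(mu[i] + sigma_tilde * rng.gauss(0.0, 1.0), best_other)
    return (total / n_draws - max(mu)) / cost

mu = [0.2, 0.5, -0.1]                            # hypothetical posterior means at one location
closed = kg_value(mu[0] - max(mu[1:]), 0.4)
simulated = kg_monte_carlo(mu, 0, 0.4)
```

The two numbers `closed` and `simulated` should agree up to Monte Carlo error. Averaging `kg_value` over locations drawn from the covariate density would then give a simple estimate of the integrated quantity.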

We solve eq. 13 by first solving $\max_{x \in \mathcal{X}} \mathrm{IKG}^n(i, x)$ for all $i = 1, \ldots, M$ and then enumerating the results. The computational challenge in the former lies in the numerical integration in eq. 14. Notice that $\max_{x \in \mathcal{X}} \mathrm{IKG}^n(i, x)$ is in fact a stochastic optimization problem if we view the integration in eq. 14 as an expectation with respect to the probability density $\gamma(v)$ on $\mathcal{X}$. One might apply the sample average approximation method to solve it, but it would be computationally prohibitive if $x$ is high dimensional. Instead, we show in Section 5 that the gradient of the integrand in eq. 14 with respect to $x$ can be calculated explicitly, which is an unbiased estimator of $\nabla_x \mathrm{IKG}^n(i, x)$ under regularity conditions, thereby leading to a stochastic gradient ascent method (Kushner and Yin 2003).
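The idea of stochastic gradient ascent on an integrated objective can be sketched in a deliberately simplified special case. Everything specific below is an assumption made for illustration and is not the estimator derived in Section 5: a one-dimensional covariate on [0, 1] with uniform density, the prior stage only (no samples yet), a squared exponential covariance, constant sampling variance and unit cost, and a constant gap `DELTA` between the prior means. In this case the derivative of the closed-form integrand with respect to the standard deviation term is the normal density, which gives the per-sample gradient in `grad_h` by the chain rule.

```python
import math
import random

TAU2, LAM, ALPHA, DELTA = 1.0, 0.25, 4.0, 0.2   # all hypothetical values

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sigma_tilde(v, x):
    """Std. dev. of the change in the mean at v caused by one sample at x (prior stage)."""
    return TAU2 * math.exp(-ALPHA * (v - x) ** 2) / math.sqrt(TAU2 + LAM)

def h(v, x):
    """Integrand: value at location v of sampling at x."""
    s = sigma_tilde(v, x)
    z = DELTA / s
    return s * norm_pdf(z) - DELTA * norm_cdf(-z)

def grad_h(v, x):
    """Derivative of h with respect to x: dh/ds = pdf(DELTA/s), ds/dx = s * 2 * ALPHA * (v - x)."""
    s = sigma_tilde(v, x)
    return norm_pdf(DELTA / s) * s * 2.0 * ALPHA * (v - x)

def sga(x0, n_iter=5000, a=2.0, t0=10.0, seed=0):
    """Projected stochastic gradient ascent with step size a / (t + t0)."""
    rng = random.Random(seed)
    x = x0
    for t in range(n_iter):
        v = rng.random()                      # draw a location from the uniform density
        x += a / (t + t0) * grad_h(v, x)      # noisy ascent step
        x = min(1.0, max(0.0, x))             # project back onto [0, 1]
    return x

def integrated_value(x, n_grid=2000):
    """Midpoint-rule approximation of the integrated objective, used only for checking."""
    return sum(h((j + 0.5) / n_grid, x) for j in range(n_grid)) / n_grid

x_start = 0.05
x_star = sga(x_start)
```

In this symmetric toy setting the integrated objective is maximized at the center of the interval, so `x_star` should land near 0.5 and `integrated_value(x_star)` should exceed `integrated_value(x_start)`. Each iteration costs one gradient evaluation, in contrast with sample average approximation, which re-evaluates the whole sample at every candidate point.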

We now present our main theoretical result — the IKG policy is consistent under simple assumptions. The proof will be given in Section 4.

Assumption 2.

The design space $\mathcal{X}$ is a compact set in $\mathbb{R}^d$ with nonempty interior.

Assumption 3.

For each $i = 1, \ldots, M$, $\mu_i^0(x)$, $\lambda_i(x)$, and $c_i(x)$ are all continuous on $\mathcal{X}$, and $c_i(x) > 0$ on $\mathcal{X}$.

Theorem 1.

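If Assumptions 1, 2 and 3 hold, then the IKG policy (13) is consistent, that is, under the IKG policy,

  (i) $k_i^{N(B)}(x, x) \to 0$ a.s. as $B \to \infty$ for all $x \in \mathcal{X}$ and $i = 1, \ldots, M$;

  (ii) $\mu_i^{N(B)}(x) \to \theta_i(x)$ a.s. as $B \to \infty$ for all $x \in \mathcal{X}$ and $i = 1, \ldots, M$;

  (iii) $\arg\max_{1 \le i \le M} \mu_i^{N(B)}(x) \to \arg\max_{1 \le i \le M} \theta_i(x)$ a.s. as $B \to \infty$ for all $x \in \mathcal{X}$.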

We conclude this section by highlighting the differences between our assumptions and those in Scott et al. (2011), in which the consistency of a KG-type policy driven by a Gaussian process is proved. First and foremost, they impose conditions on both the posterior mean function and the posterior covariance function to regulate their large-sample asymptotic behavior. Specifically, they assume that uniformly for all and with , (1) is bounded a.s., and (2) is bounded above away from one, where means the posterior correlation. (The subscript $i$ is ignored here because there is only one Gaussian process involved in Scott et al. (2011).) The two assumptions are nontrivial to verify and critical for their analysis.

By contrast, we do not make such assumptions. Condition (1) is not necessary in our analysis because the “increment in the expected value of the information” is defined as eq. 12 in this paper, whereas in a different form without integration in Scott et al. (2011). There is no need for us to impose Condition (2) in order to regulate the asymptotic behavior of the posterior covariance function, because instead we achieve the same goal by utilizing reproducing kernel Hilbert space theory.

Second, the sampling variance is assumed to be a constant in their work, whereas we allow it to vary at different locations. This is significant, because the sampling process is usually heteroscedastic, especially for simulation models that stem from queueing systems. Allowing unequal sampling variances enhances substantially the applicability of our work.

Last but not least, in Scott et al. (2011) the prior covariance function of the underlying Gaussian process is of SE type. We relax it to Assumption 1, which allows a great variety of covariance functions. Another relaxation is that the prior mean function is assumed to be a constant in their work, whereas it is allowed to be any continuous function in this paper. We also take into account possibly varying sampling costs at different locations.

3 Convergence of Posterior Covariance Function

We now characterize the asymptotic behavior of the posterior covariance function. We show in Proposition 1 that if the prior covariance function is stationary, then for any $x' \in \mathcal{X}$, $\{k^n(x, x') : n = 1, 2, \ldots\}$, a sequence of functions of $x$, converges uniformly as $n \to \infty$ for any arbitrary sequence of sampling decisions. Our proof is built on reproducing kernel Hilbert space (RKHS) theory. We will collect below several basic results on RKHS and refer to Berlinet and Thomas-Agnan (2004) for an extensive treatment on the subject.

To simplify notation, in this section we assume $M = 1$ and suppress the subscript $i$, but the results can be generalized to the case of $M > 1$ without essential difficulty. In particular, we use $k$ to denote a generic covariance function, $k^0$ the prior covariance function of a Gaussian process, and $k^n$ the posterior covariance function.

Definition 2.

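Let $\mathcal{X}$ be a nonempty set and $k$ be a covariance function on $\mathcal{X} \times \mathcal{X}$. A Hilbert space $\mathcal{H}_k$ of functions on $\mathcal{X}$ equipped with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ is called an RKHS with reproducing kernel $k$, if (i) $k(x, \cdot) \in \mathcal{H}_k$ for all $x \in \mathcal{X}$, and (ii) $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}$ for all $x \in \mathcal{X}$ and $f \in \mathcal{H}_k$. Furthermore, the norm of $\mathcal{H}_k$ is induced by the inner product, i.e., $\|f\|_{\mathcal{H}_k}^2 = \langle f, f \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$.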

Remark 2.

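In Definition 2, for a fixed $x \in \mathcal{X}$, $k(x, \cdot)$ is understood as a function mapping $\mathcal{X}$ to $\mathbb{R}$ such that $y \mapsto k(x, y)$ for $y \in \mathcal{X}$. Moreover, condition (ii) is called the reproducing property. In particular, it implies that $k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}_k}$ and $\|k(x, \cdot)\|_{\mathcal{H}_k}^2 = k(x, x)$ for all $x, x' \in \mathcal{X}$.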

Remark 3.

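By the Moore-Aronszajn theorem (Berlinet and Thomas-Agnan 2004, Theorem 3), for each covariance function $k$ there exists a unique RKHS $\mathcal{H}_k$ for which $k$ is its reproducing kernel. Specifically,

$$\mathcal{H}_k = \Bigl\{ f = \sum_{i=1}^{\infty} c_i k(x_i, \cdot) : c_i \in \mathbb{R},\ x_i \in \mathcal{X},\ i = 1, 2, \ldots, \text{ such that } \|f\|_{\mathcal{H}_k} < \infty \Bigr\},$$

where $\|f\|_{\mathcal{H}_k}^2 := \sum_{i, j = 1}^{\infty} c_i c_j k(x_i, x_j)$. Moreover, the inner product is defined by

$$\langle f, g \rangle_{\mathcal{H}_k} = \sum_{i, j = 1}^{\infty} c_i c'_j k(x_i, x'_j)$$

for any $f = \sum_{i=1}^{\infty} c_i k(x_i, \cdot)$ and $g = \sum_{j=1}^{\infty} c'_j k(x'_j, \cdot)$ in $\mathcal{H}_k$.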

The following lemma asserts that convergence in norm in a RKHS implies uniform pointwise convergence, provided that the covariance function is stationary.

Lemma 2.

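Let $\mathcal{X}$ be a nonempty set and $k$ be a covariance function on $\mathcal{X} \times \mathcal{X}$. Suppose that a sequence of functions $\{f_n\} \subset \mathcal{H}_k$ converges in norm $\|\cdot\|_{\mathcal{H}_k}$ as $n \to \infty$; then the limit, denoted by $f$, is in $\mathcal{H}_k$. Moreover, if $k$ is stationary, then $f_n(x) \to f(x)$ as $n \to \infty$ uniformly in $x \in \mathcal{X}$.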

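Proof of Lemma 2.

First of all, $f \in \mathcal{H}_k$ is guaranteed as a Hilbert space is a complete metric space. A basic property of RKHS is that convergence in norm implies pointwise convergence to the same limit; see, e.g., Corollary 1 of Berlinet and Thomas-Agnan (2004, page 10). Namely, $f_n(x) \to f(x)$ as $n \to \infty$ for all $x \in \mathcal{X}$.

To show the pointwise convergence is uniform, note that since $k$ is stationary, there exists a function $\varphi$ such that $k(x, x') = \varphi(x - x')$. Hence, $\|k(x, \cdot)\|_{\mathcal{H}_k}^2 = k(x, x) = \varphi(0)$. It follows that

$$|f_n(x) - f_m(x)| = \bigl|\langle f_n - f_m, k(x, \cdot) \rangle_{\mathcal{H}_k}\bigr| \le \|f_n - f_m\|_{\mathcal{H}_k} \|k(x, \cdot)\|_{\mathcal{H}_k} = \|f_n - f_m\|_{\mathcal{H}_k} \sqrt{\varphi(0)} \tag{15}$$

for all $x \in \mathcal{X}$ and $n, m \ge 1$, where the first equality follows from the reproducing property and the inequality from the Cauchy-Schwarz inequality.

Since the sequence $\{f_n\}$ converges in $\|\cdot\|_{\mathcal{H}_k}$, it is a Cauchy sequence in $\mathcal{H}_k$, meaning that $\|f_n - f_m\|_{\mathcal{H}_k} \to 0$ as $n \to \infty$ uniformly for all $m \ge n$. Since this convergence to zero is independent of $x$, it follows from eq. 15 that $\{f_n\}$ is a uniform Cauchy sequence of functions, thereby converging to $f$ uniformly in $x \in \mathcal{X}$. ∎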

We show in the following Proposition 1 that irrespective of the allocation of the design points $\{v^n\}$ and the sampling variance $\lambda(\cdot)$, $k^n(x, x')$ converges uniformly in $x \in \mathcal{X}$ as $n \to \infty$ for all $x' \in \mathcal{X}$. (Note that this does not mean the limit is necessarily zero.) The uniform convergence preserves the continuity of $k^n(\cdot, x')$ in the limit, a property that is crucial for the proof of Proposition 2 later in Section 4.

Proposition 1.

If $k^0$ is stationary, then for any $x' \in \mathcal{X}$, $k^n(x, x')$ converges to a limit, denoted by $k^\infty(x, x')$, uniformly in $x \in \mathcal{X}$ as $n \to \infty$.

In light of Lemma 2, in order to establish the uniform convergence of $k^n(x, x')$ as a function of $x$, it suffices to prove the norm convergence of $k^n(\cdot, x')$ in the RKHS induced by $k^0$. We first establish this result for a more general case in the following Lemma 3, where $k^0$ is not required to be stationary. After that, we will prove Proposition 1.

Lemma 3.

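Let $\mathcal{H}_{k^0}$ be the RKHS induced by $k^0$. If $k^0(x, x) > 0$ for all $x \in \mathcal{X}$, then for any $x' \in \mathcal{X}$, $k^n(\cdot, x')$ converges in norm $\|\cdot\|_{\mathcal{H}_{k^0}}$ as $n \to \infty$.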

Proof of Lemma 3.

Fix . The fact that is due to eq. 4. It follows from eq. 8 that form a non-increasing sequence bounded below by zero. The monotone convergence theorem implies that converges as . Hence, for all ,


Let and . Then, by eq. 4,


For notational simplicity, let . Then,



Moreover, note that by eq. 4,


Let . Then, it follows from eq. 19 and the reproducing property that



denotes the identity matrix of a compatible size. Furthermore, note that


We now combine eqs. 21 and 20 to have

which is the difference between two positive semi-definite matrices. Therefore, by eq. 18,

where the second inequality follows from the definition of and the equality follows from eq. 17. Then, we apply eq. 16 to conclude that as for all . Therefore, converges in norm as . ∎

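Proof of Proposition 1.

Since $k^0$ is stationary, $k^0(x, x) = k^0(x', x') > 0$ for all $x, x' \in \mathcal{X}$. Then by Lemma 3, for any $x' \in \mathcal{X}$, $k^n(\cdot, x')$ converges in norm $\|\cdot\|_{\mathcal{H}_{k^0}}$ as $n \to \infty$. Then by Lemma 2, for the $\|\cdot\|_{\mathcal{H}_{k^0}}$-converging limit $k^\infty(\cdot, x')$, $k^n(x, x') \to k^\infty(x, x')$ uniformly in $x \in \mathcal{X}$ as $n \to \infty$. ∎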

4 Consistency

It is straightforward to show that $N(B) \to \infty$ if and only if $B \to \infty$, since $c_i(x)$ is bounded both above and below away from zero on $\mathcal{X}$ for each $i = 1, \ldots, M$ under Assumptions 2 and 3. Thus, Theorem 1 is equivalent to Theorem 2 as follows.

Theorem 2.

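If Assumptions 1, 2 and 3 hold, then under the IKG policy,

  (i) $k_i^n(x, x) \to 0$ a.s. as $n \to \infty$ for all $x \in \mathcal{X}$ and $i = 1, \ldots, M$;

  (ii) $\mu_i^n(x) \to \theta_i(x)$ a.s. as $n \to \infty$ for all $x \in \mathcal{X}$ and $i = 1, \ldots, M$;

  (iii) $\arg\max_{1 \le i \le M} \mu_i^n(x) \to \arg\max_{1 \le i \le M} \theta_i(x)$ a.s. as $n \to \infty$ for all $x \in \mathcal{X}$.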

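For each $i = 1, \ldots, M$, let $S_i^n$ denote the (random) number of times that a sample is taken from alternative $i$ regardless of the value of $x$ up to the $n$-th sample, i.e.,

$$S_i^n := \sum_{\ell=0}^{n-1} \mathbb{1}\{a^\ell = i\}.$$

Further, let $S_i^\infty := \lim_{n \to \infty} S_i^n$, which is well defined since it is a limit of a non-decreasing sequence of random variables.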

Under Assumptions 1, 2 and 3, the IKG policy (13) is well defined. This can be seen by noting that the maximum of $\mathrm{IKG}^n(i, x)$ over $(i, x)$ is attainable since $\mathrm{IKG}^n(i, x)$ is continuous in $x$ by Assumptions 1 and 3 together with Lemma 1, and $\mathcal{X}$ is compact by Assumption 2. The bulk of the proof of consistency of the IKG policy lies in part (i) of Theorem 2, i.e., to show that