# Ranking and Selection as Stochastic Control

Under a Bayesian framework, we formulate the fully sequential sampling and selection decision in statistical ranking and selection as a stochastic control problem, and derive the associated Bellman equation. Using value function approximation, we derive an approximately optimal allocation policy. We show that this policy is not only computationally efficient but also possesses both one-step-ahead and asymptotic optimality for independent normal sampling distributions. Moreover, the proposed allocation policy is easily generalizable in the approximate dynamic programming paradigm.


## I Introduction

In this paper, we consider a simulation optimization problem of choosing the alternative with the highest mean from a finite set of alternatives, where the means are unknown and must be estimated by statistical sampling. In simulation, this problem is often called the statistical ranking and selection (R&S) problem (see Bechhofer et al. [1995]). Applications of R&S include selecting the best alternative from many complex discrete event dynamic systems (DEDS) that are computationally intensive to simulate (see Chen et al. [2013]), and finding the most effective drug among several alternatives when each sample for testing a drug's effectiveness is economically expensive (see Powell and Ryzhov [2012]). Broadly speaking, there are two main approaches in R&S (Goldsman and Nelson [1998], Chen and Lee [2011]). The first approach allocates samples to guarantee the probability of correct selection (PCS) up to a pre-specified level (Rinott [1978], Kim and Nelson [2006], Kim [2013]), whereas the second approach maximizes the PCS (or a similar metric) subject to a given sampling budget (Chen et al. [2000], Chick and Inoue [2001], Lee et al. [2012], Pasupathy et al. [2014]).

The earliest sampling allocation schemes use two-stage procedures (e.g., Rinott [1978], Chen et al. [2000], Chick and Inoue [2001]), where unknown parameters are estimated in the first stage. More recently, fully sequential sampling allocation procedures have been developed (Kim and Nelson [2001], Hong and Nelson [2005], Frazier [2014]). In the Bayesian framework, Chen et al. [2006], Frazier et al. [2008], and Chick et al. [2010] proposed sequential algorithms by allocating each replication to maximize the posterior information gains one step ahead; Chick and Gans [2009] and Chick and Frazier [2012] provided sequential policies analogous to a multi-armed bandit problem and used a continuous-time approximation to solve their Bellman equation; Peng et al. [2016] offered a sequential rule achieving the asymptotically optimal sampling rate of the PCS; and Peng and Fu [2017] developed a sequential algorithm that possesses both one-step-ahead and asymptotic optimality.

Previous work using the Bayesian framework approached the difficult dynamic R&S problem by replacing the sequential sampling and selection decisions with a more tractable surrogate optimization problem. In this work, we formulate the R&S problem as a stochastic control problem (SCP) and derive the associated Bellman optimality equation, which requires care due to the interaction between the sampling allocation policy and the posterior distribution. We show that under a canonical condition in R&S, the sampling allocation decision does not affect the Bayesian posterior distributions conditional on the information of sample observations; thus the SCP is proved to be a Markov decision process (MDP). To the best of our knowledge, this is the first work to study R&S as an SCP and use MDP to analyze it.

We then analyze the optimal allocation and selection (A&S) policy of the resulting MDP and prove that a commonly used selection policy, selecting the alternative with the largest sample mean, is asymptotically consistent with the optimal selection policy under some mild conditions. For independent discrete sampling distributions, the size of the state space of the sampling allocation policy is shown to grow only polynomially in the number of allocated replications when the number of alternatives and the number of possible outcomes of the discrete distributions are fixed, but exponentially when these two quantities grow together.

Sampling from independent normal distributions is a standard assumption in the R&S literature, so we focus on this setting. In contrast to the usual approach of replacing the SCP with a tractable approximate surrogate (static) optimization problem, we address the SCP directly by approximating the value function, as in approximate dynamic programming (ADP) (see Powell [2007]). The value function approximation (VFA), using a simple feature of the value function, yields an approximately optimal allocation policy (AOAP) that is not only computationally efficient but also possesses both one-step-ahead and asymptotic optimality. In addition, the VFA approach is easily generalizable in the ADP paradigm. For example, we show how to extend the AOAP to a multi-step look-ahead sampling allocation procedure, and how to obtain an efficient sampling algorithm for a low-confidence scenario (see Peng et al. [2017a]) by implementing an off-line learning algorithm.
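As a concrete illustration of the one-step look-ahead idea (not this paper's exact AOAP formula, whose VFA feature is derived later), the following sketch allocates the next replication under independent normal sampling with known variances: for each candidate alternative, it Monte-Carlo-averages the approximate terminal value, the largest posterior mean, after one hypothetical sample drawn from the predictive distribution, and samples the argmax. All function and parameter names are illustrative.

```python
import math
import random

def lookahead_allocate(post_mean, post_var, sampling_var, n_draws=2000, seed=0):
    """One-step look-ahead allocation sketch for independent normal sampling
    with known variances. For each candidate i, draw hypothetical next samples
    from the predictive distribution N(post_mean[i], post_var[i] + sampling_var[i]),
    apply the normal-normal conjugate update, and average the approximate
    terminal value max_j (posterior mean). Return the best candidate."""
    rng = random.Random(seed)
    k = len(post_mean)
    scores = []
    for i in range(k):
        pred_sd = math.sqrt(post_var[i] + sampling_var[i])   # predictive std dev
        new_var = 1.0 / (1.0 / post_var[i] + 1.0 / sampling_var[i])
        other_best = max(post_mean[j] for j in range(k) if j != i)
        total = 0.0
        for _ in range(n_draws):
            x = rng.gauss(post_mean[i], pred_sd)             # hypothetical observation
            new_mean = new_var * (post_mean[i] / post_var[i] + x / sampling_var[i])
            total += max(new_mean, other_best)
        scores.append(total / n_draws)
    return max(range(k), key=lambda i: scores[i])
```

As expected of a look-ahead rule, an alternative with large posterior uncertainty attracts the next sample, since a new observation can move its posterior mean substantially.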

The rest of the paper is organized as follows: Section II formulates the SCP in R&S, and the associated Bellman equation is derived in Section III. Section IV offers further analysis on the optimal A&S policy, and Section V focuses on the approximations of the optimal A&S policy for normal sampling distributions. Numerical results are given in Section VI. The last section offers conclusions.

## II Problem Formulation

Among $k$ alternatives with unknown means $\mu_i$, $i=1,\ldots,k$, our objective is to find the best alternative defined by

$$\langle 1\rangle \triangleq \arg\max_{i=1,\ldots,k}\mu_i,$$

where each $\mu_i$ is estimated by sampling. Let $X_{i,t}$ be the $t$-th replication for alternative $i$. Suppose $X_t \triangleq (X_{1,t},\ldots,X_{k,t})$, $t\in\mathbb{Z}^+$, follows an independent and identically distributed (i.i.d.) joint sampling distribution, i.e., $X_t\sim Q(\cdot\,;\theta)$, with a density (or probability mass function) $q(\cdot\,;\theta)$, where $\theta\in\Theta$ comprises all unknown parameters in the parametric family. The marginal distribution of alternative $i$ is denoted by $Q_i(\cdot\,;\theta_i)$, with a density $q_i(\cdot\,;\theta_i)$, where $\theta_i$ comprises all unknown parameters in the marginal distribution. Generally, $\mu_i$ is a function of $\theta_i$, $i=1,\ldots,k$, and $\theta=(\theta_1,\ldots,\theta_k)$. In addition, we assume the unknown parameter follows a prior distribution, i.e., $\theta\sim F(\cdot\,;\zeta_0)$, where $\zeta_0$ contains all hyper-parameters for the parametric family of the prior distribution.

We define the two parts of an A&S policy. The allocation policy is a sequence of mappings $A\triangleq\{A_t\}_{t=1}^{T}$, where $A_t(\cdot)\triangleq(A_{1,t}(\cdot),\ldots,A_{k,t}(\cdot))$ with $A_{i,t}\in\{0,1\}$ and $\sum_{i=1}^{k}A_{i,t}=1$, which allocates the $t$-th sample to an alternative based on the information set collected through all previous steps. The information set at step $t$ is given by

$$\mathcal{E}_t^a \triangleq \{A_t(\mathcal{E}_{t-1}^a);\,\mathcal{E}_t\},$$

where $\mathcal{E}_t$ contains all sample information $\{X_{i,\ell}:\,A_{i,\ell}=1,\ \ell\leq t\}$ and prior information $\zeta_0$. Define $\mathcal{E}_0^a\triangleq\mathcal{E}_0\triangleq\zeta_0$. The information collection procedure following a sampling allocation policy in the R&S problem is illustrated in Figure 1 for allocating four samples among three alternatives. Given prior information $\zeta_0$, the collected information set is determined by the two tables in the figure. The allocation decision represented by the table at the bottom determines the (bold) observable elements in the table on the top.

The sampling decision and information flow have an interactive relationship, shown in Figure 2. The sampling decision and the information set are nested in each other as $t$ evolves. We reorganize the (allocated) sample observations by putting them together and ordering them chronologically, i.e., $\bar X_{i,\ell}$, $\ell=1,\ldots,t_i$, where $t_i$ is the number of samples allocated to alternative $i$ through step $t$, $i=1,\ldots,k$. Although $t_i$ is also a map from the information set, we suppress the argument for simplicity. For the example in Figure 1, we specifically illustrate how to reorganize the sample observations in Figure 3. We have $\sum_{i=1}^{k}t_i=t$.

The selection policy is a map $S:\ \mathcal{E}_T^a\to\{1,\ldots,k\}$, which makes the final selection at step $T$ and indicates the best alternative chosen by the A&S algorithm. The final reward for selecting an alternative is a function of $\theta$, given the selection decision $i$, i.e., $\mathcal{V}(\theta;i)$. In R&S, two of the most frequently used candidates for the final reward are

$$\mathcal{V}_P(\theta;i)\triangleq \mathbb{1}\{i=\langle 1\rangle\},\qquad \mathcal{V}_E(\theta;i)\triangleq \mu_i-\mu_{\langle 1\rangle},$$

where the subscripts $P$ and $E$ stand for PCS and expected opportunity cost (EOC), respectively, and $\mu_{\langle i\rangle}$, $i=1,\ldots,k$, are order statistics s.t. $\mu_{\langle 1\rangle}\geq\mu_{\langle 2\rangle}\geq\cdots\geq\mu_{\langle k\rangle}$. If the alternative selected as the best is the true best alternative, $\mathcal{V}_P$ is one, otherwise $\mathcal{V}_P$ is zero; $\mathcal{V}_E$ is the difference between the mean of the selected alternative and the mean of the true best alternative, which measures the economic opportunity cost (regret) of the selection decision. Notice that the values of the final rewards $\mathcal{V}_P$ and $\mathcal{V}_E$ are unknown due to the uncertainty of parameter $\theta$, which is quantified by the prior distribution of the parameter in the Bayesian framework.
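For concreteness, both rewards can be evaluated directly once the true means are revealed; a minimal sketch (the function name is ours):

```python
def final_rewards(true_means, selected):
    """Terminal rewards for a selection decision, given the true means:
    V_P = 1 if the selected alternative is the true best, else 0;
    V_E = mu_selected - mu_<1> (<= 0; zero iff the selection is correct)."""
    best = max(true_means)
    v_p = 1 if true_means[selected] == best else 0
    v_e = true_means[selected] - best
    return v_p, v_e
```

For example, with true means (1.0, 3.0, 2.0), selecting alternative 1 gives rewards (1, 0.0), while selecting alternative 0 gives (0, -2.0).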

We formulate the dynamic decision in R&S as an SCP as follows. Under the Bayesian framework, the expected payoff for an A&S policy $(A,S)$, where $A\triangleq\{A_t\}_{t=1}^{T}$, in the SCP can be defined recursively by

$$V_T(\mathcal{E}_T^a;A,S)\triangleq \mathbb{E}\big[\mathcal{V}(\theta;i)\,\big|\,\mathcal{E}_T^a\big]\Big|_{i=S(\mathcal{E}_T^a)}=\int_{\theta\in\Theta}\mathcal{V}(\theta;i)\;F(d\theta|\mathcal{E}_T^a)\bigg|_{i=S(\mathcal{E}_T^a)},$$

where $F(d\theta|\mathcal{E}_T^a)$ is the posterior distribution of $\theta$ conditioned on the information set $\mathcal{E}_T^a$, and $d\theta$ in $F(d\theta|\mathcal{E}_T^a)$ stands for the Lebesgue measure for continuous distributions and the counting measure for discrete distributions, and for $t<T$,

$$V_t(\mathcal{E}_t^a;A,S)\triangleq \mathbb{E}\big[V_{t+1}(\mathcal{E}_t^a\cup\{X_{i,t+1}\};A,S)\,\big|\,\mathcal{E}_t^a\big]\Big|_{i=A_{t+1}(\mathcal{E}_t^a)}=\int_{\mathcal{X}_i}V_{t+1}(\mathcal{E}_t^a\cup\{x_{i,t+1}\};A,S)\;Q_i(dx_{i,t+1}|\mathcal{E}_t^a)\bigg|_{i=A_{t+1}(\mathcal{E}_t^a)},$$

where $\mathcal{X}_i$ is the support of $X_{i,t+1}$, and $Q_i(\cdot\,|\,\mathcal{E}_t^a)$ is the predictive distribution for $X_{i,t+1}$ conditioned on the information set $\mathcal{E}_t^a$. The posterior and predictive distributions can be calculated using Bayes rule:

$$F(d\theta|\mathcal{E}_t^a)=\frac{L(\mathcal{E}_t^a;\theta)\;F(d\theta;\zeta_0)}{\int_{\theta\in\Theta}L(\mathcal{E}_t^a;\theta)\;F(d\theta;\zeta_0)},\qquad(1)$$

and

$$Q_i(dx_{i,t+1}|\mathcal{E}_t^a)=\frac{\int_{\theta\in\Theta}Q_i(dx_{i,t+1};\theta_i)\;L(\mathcal{E}_t^a;\theta)\;F(d\theta;\zeta_0)}{\int_{\theta\in\Theta}L(\mathcal{E}_t^a;\theta)\;F(d\theta;\zeta_0)},\qquad(2)$$

where $L(\mathcal{E}_t^a;\theta)$ is the likelihood of the samples. The posterior and predictive distributions for specific sampling distributions will be discussed in the next section. With the formulation of the SCP, we define an optimal A&S policy as

$$(A^*,S^*)\triangleq \arg\sup_{A,S}V_0(\zeta_0;A,S)\,.\qquad(3)$$

## III R&S as Stochastic Control

In Section III-A, we establish the Bellman equation for SCP (3). In Section III-B, we show that the information set determining the posterior and predictive distributions can be further reduced to hyper-parameters by using conjugate priors.

### III-A Optimal A&S Policy

To avoid having to keep track of the entire sampling allocation policy history, the following theorem establishes that the posterior and predictive distributions at step $t$ are determined by $\{\zeta_0,\bar X_{i,\ell},\ \ell=1,\ldots,t_i,\ i=1,\ldots,k\}$; thus, if we define $\mathcal{E}_t\triangleq\{\zeta_0,\bar X^{(t)}_1,\ldots,\bar X^{(t)}_k\}$ as the state at step $t$, then SCP (3) satisfies the optimality equation of an MDP.

###### Theorem 1.

Under the Bayesian framework introduced in Section II, the posterior distribution (1) of $\theta$ conditioned on $\mathcal{E}_t^a$ and the predictive distribution (2) of $X_{i,t+1}$ conditioned on $\mathcal{E}_t^a$ are independent of the allocation policy $A$.

###### Proof:

At any step $\ell$, all replications except for the replication of the alternative being sampled, i.e., $X_{i,\ell}$ with $A_{i,\ell}(\mathcal{E}_{\ell-1}^a)=1$, are missing. The likelihood of observations collected by the sequential sampling procedure through step $t$ is given by

$$\begin{aligned} L(\mathcal{E}_t^a;\theta)&=\int\cdots\int_{\mathcal{X}^t}\prod_{\ell=1}^{t}q(x_\ell;\theta)\prod_{i=1}^{k}\Big\{A_{i,\ell}(\mathcal{E}_{\ell-1}^a)\,\delta_{X_{i,\ell}}(dx_{i,\ell})+\big(1-A_{i,\ell}(\mathcal{E}_{\ell-1}^a)\big)\,dx_{i,\ell}\Big\}\\ &=\Big(\sum_{i=1}^{k}A_{i,t}(\mathcal{E}_{t-1}^a)\;q_i(X_{i,t};\theta_i)\Big)\int\cdots\int_{\mathcal{X}^{t-1}}\prod_{\ell=1}^{t-1}q(x_\ell;\theta)\prod_{i=1}^{k}\Big\{A_{i,\ell}(\mathcal{E}_{\ell-1}^a)\,\delta_{X_{i,\ell}}(dx_{i,\ell})+\big(1-A_{i,\ell}(\mathcal{E}_{\ell-1}^a)\big)\,dx_{i,\ell}\Big\}\\ &=\prod_{\ell=1}^{t}\Big(\sum_{i=1}^{k}A_{i,\ell}(\mathcal{E}_{\ell-1}^a)\;q_i(X_{i,\ell};\theta_i)\Big)=\prod_{i=1}^{k}\prod_{\ell=1}^{t_i}q_i(\bar X_{i,\ell};\theta_i),\end{aligned}\qquad(4)$$

where $x_\ell\triangleq(x_{1,\ell},\ldots,x_{k,\ell})$ and $\delta_{X_{i,\ell}}(\cdot)$ is the delta-measure with a mass point at $X_{i,\ell}$. The first equality in (4) holds because $X_\ell$, $\ell\in\mathbb{Z}^+$, are assumed to be i.i.d. and the $t$-th replication is independent of the information flow before step $t$, i.e., $X_t\perp\mathcal{E}_{t-1}^a$, by construction of the information set; thus the variables of the missing replications at step $t$ in the joint density are integrated out, leaving only the marginal likelihood of the observation at step $t$. By using the same argument inductively, the second equality in (4) holds. The last equality in (4) holds because the product operation is commutative. With (4), we can denote the likelihood as $L(\mathcal{E}_t;\theta)$, since the information set $\mathcal{E}_t$ completely determines the likelihood.

Following Bayes rule, the posterior distribution of $\theta$ is

$$F(d\theta|\mathcal{E}_T^a)=\frac{L(\mathcal{E}_T;\theta)\;F(d\theta;\zeta_0)}{\int_{\theta\in\Theta}L(\mathcal{E}_T;\theta)\;F(d\theta;\zeta_0)}=\frac{\prod_{i=1}^{k}\prod_{t=1}^{T_i}q_i(\bar X_{i,t};\theta_i)\;F(d\theta;\zeta_0)}{\int_{\theta\in\Theta}\prod_{i=1}^{k}\prod_{t=1}^{T_i}q_i(\bar X_{i,t};\theta_i)\;F(d\theta;\zeta_0)},\qquad(5)$$

which is independent of the allocation policy $A$, conditioned on $\mathcal{E}_T$. With (5), we can denote the posterior distribution as $F(d\theta|\mathcal{E}_T)$, since the information set $\mathcal{E}_T$ completely determines the posterior distribution. Similarly, the predictive distribution of $X_{i,t+1}$ is

$$Q_i(dx_{i,t+1}|\mathcal{E}_t^a)=\frac{\int_{\theta\in\Theta}Q_i(dx_{i,t+1};\theta_i)\;L(\mathcal{E}_t;\theta)\;F(d\theta;\zeta_0)}{\int_{\theta\in\Theta}L(\mathcal{E}_t;\theta)\;F(d\theta;\zeta_0)}=\frac{\int_{\theta\in\Theta}Q_i(dx_{i,t+1};\theta_i)\;\prod_{i=1}^{k}\prod_{\ell=1}^{t_i}q_i(\bar X_{i,\ell};\theta_i)\;F(d\theta;\zeta_0)}{\int_{\theta\in\Theta}\prod_{i=1}^{k}\prod_{\ell=1}^{t_i}q_i(\bar X_{i,\ell};\theta_i)\;F(d\theta;\zeta_0)},\qquad(6)$$

which is independent of the allocation policy $A$, conditioned on $\mathcal{E}_t$. With (6), we can denote the predictive distribution of $X_{i,t+1}$ as $Q_i(dx_{i,t+1}|\mathcal{E}_t)$, since $\mathcal{E}_t$ completely determines the predictive distribution. ∎

Remark. The interaction between sampling allocation policy and posterior distribution has also been studied by Görder and Kolonko [2017], but they introduced a monotone missing pattern that is not satisfied by the sequential sampling mechanism assumed in our paper. If the sampling distribution is assumed to be independent, i.e., $q(x;\theta)=\prod_{i=1}^{k}q_i(x_i;\theta_i)$, the missing pattern fits into the missing at random (MAR) paradigm studied in incomplete data analysis. MAR means that the missing rule is independent of the missing data, given the observations; see Chapter 2 of Kim and Shao [2013] for a rigorous definition. If the sampling distribution is dependent, the sequential information collection procedure in our work does not satisfy the classic MAR paradigm. For the example in Figure 1, we can see that if the sampling distribution is not independent, the missing rule could depend on the missing data, since the observed replication driving the next allocation decision and the unobserved replications in the same sample vector can be dependent. Even without the MAR condition, we can still prove our conclusion because of two facts: (1) the replications $X_\ell$, $\ell\in\mathbb{Z}^+$, of the sampling distribution are assumed to be independent; (2) the allocation decision at step $t$, i.e., $A_t(\mathcal{E}_{t-1}^a)$, only depends on the information set collected through step $t-1$ in our setting. We call this special structure of the sequential sampling decision in R&S sequentially MAR (SMAR).

Dependence in the sampling distribution is often introduced by using common random numbers (CRN) to enhance the efficiency of R&S (see Fu et al. [2007] and Peng et al. [2013]). Although dependence within the sampling distribution is not a problem, our proof for Theorem 1 does not apply if there is dependence between replications, because $X_\ell$ and $X_{\ell'}$, $\ell\neq\ell'$, could be dependent in that case. The i.i.d. assumption for replications, assumed in our paper, is a canonical condition in R&S.

Bellman Equation:

With the conclusion of Theorem 1, the R&S problem is an MDP with state $\mathcal{E}_t=\{\zeta_0,\bar X^{(t)}_1,\ldots,\bar X^{(t)}_k\}$, action $A_{t+1}$ for $t<T$ and $S$ for $t=T$, no reward for $t<T$ and final reward $\mathcal{V}(\theta;i)|_{i=S(\mathcal{E}_T)}$ for $t=T$, and the following transition for $t<T$:

$$\{\zeta_0,\bar X^{(t)}_1,\ldots,\bar X^{(t)}_k\}\ \to\ \{\zeta_0,\bar X^{(t)}_1,\ldots,\bar X^{(t)}_i\cup\{X_{i,t+1}\},\ldots,\bar X^{(t)}_k\}\Big|_{i=A_{t+1}},$$

where $\bar X^{(t)}_i\triangleq(\bar X_{i,1},\ldots,\bar X_{i,t_i})$, $i=1,\ldots,k$. Then, we can recursively compute the optimal A&S policy of the SCP (3) by the following Bellman equation:

$$V_T(\mathcal{E}_T)\triangleq V_T(\mathcal{E}_T;i)\big|_{i=S^*(\mathcal{E}_T)},\qquad(7)$$

where $V_T(\mathcal{E}_T;i)\triangleq \mathbb{E}[\mathcal{V}(\theta;i)\,|\,\mathcal{E}_T]$, and

$$S^*(\mathcal{E}_T)=\arg\max_{i=1,\ldots,k}V_T(\mathcal{E}_T;i),$$

and for $t<T$,

$$V_t(\mathcal{E}_t)\triangleq V_t(\mathcal{E}_t;i)\big|_{i=A^*_{t+1}(\mathcal{E}_t)},\qquad(8)$$

where $V_t(\mathcal{E}_t;i)\triangleq \mathbb{E}[V_{t+1}(\mathcal{E}_t\cup\{X_{i,t+1}\})\,|\,\mathcal{E}_t]$, and

$$A^*_{t+1}(\mathcal{E}_t)=\arg\max_{i=1,\ldots,k}V_t(\mathcal{E}_t;i)\,.$$

For an MDP, the equivalence between the optimal policy of the SCP, i.e., (3), and the optimal policy determined by the Bellman equation, i.e., (7) and (8), can be established straightforwardly by induction. The equivalence discussion can be found in Proposition 1.3.1 of Bertsekas [1995].
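To make the backward recursion (7)-(8) concrete, the following sketch solves a tiny Bernoulli instance exactly by memoized backward induction, anticipating the conjugate Beta representation of Section III-B. It rests on a simplification of ours: under the EOC criterion, $\mathbb{E}[\max_i\mu_i\,|\,\mathcal{E}_t]$ is a martingale and hence policy-independent in expectation, so it suffices to maximize the expected largest posterior mean at the horizon. All names below are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def bellman_value(state, budget):
    """Exact Bellman recursion for Bernoulli sampling with Beta posteriors.
    state: tuple of (alpha_i, beta_i) hyper-parameters per alternative;
    terminal value: largest posterior mean (the EOC-optimal selection)."""
    if budget == 0:
        return max(a / (a + b) for a, b in state)
    best = float("-inf")
    for i, (a, b) in enumerate(state):
        p1 = a / (a + b)  # predictive probability of observing 1 (gamma_i)
        up = state[:i] + ((a + 1.0, b),) + state[i + 1:]    # observe 1
        down = state[:i] + ((a, b + 1.0),) + state[i + 1:]  # observe 0
        best = max(best, p1 * bellman_value(up, budget - 1)
                         + (1 - p1) * bellman_value(down, budget - 1))
    return best
```

For example, with two alternatives under uniform Beta(1,1) priors and one remaining sample, either allocation yields value $(2/3 + 1/2)/2 = 7/12$ by symmetry.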

### Iii-B Conjugacy

Notice that the dimension of the state space of the MDP in the last section grows as the step $t$ grows. Under a conjugate prior, the information set can be completely determined by the posterior hyper-parameters, i.e., $\zeta_t\triangleq(\zeta_{t,1},\ldots,\zeta_{t,k})$. Thus, the dimension of the state space is the dimension of the hyper-parameters, which is fixed at any step. We provide specific forms for the conjugacy of independent Bernoulli distributions and independent normal distributions with known variances.

1. Bernoulli Distribution

The Bernoulli distribution is a discrete distribution with probability mass function (p.m.f.) $q_i(1;\theta_i)=\theta_i$ and $q_i(0;\theta_i)=1-\theta_i$, so the mean of alternative $i$ is $\mu_i=\theta_i$. The conjugate prior for the Bernoulli distribution is a beta distribution with density

$$f_i(\theta_i;\alpha_i,\beta_i)=\frac{\theta_i^{\alpha_i-1}(1-\theta_i)^{\beta_i-1}}{\int_0^1\theta_i^{\alpha_i-1}(1-\theta_i)^{\beta_i-1}\,d\theta_i},\qquad \theta_i\in[0,1],\ \alpha_i,\beta_i>0.$$

With (5) and (6), the posterior distribution of $\theta_i$ is

$$F_i(d\theta_i;\zeta_{t,i})=f_i\big(\theta_i;\alpha_i^{(t)},\beta_i^{(t)}\big)\,d\theta_i,$$

where $\zeta_{t,i}=\big(\alpha_i^{(t)},\beta_i^{(t)}\big)$, and

$$\alpha_i^{(t)}=\alpha_i^{(0)}+t_i m_i^{(t)},\qquad \beta_i^{(t)}=\beta_i^{(0)}+t_i\big(1-m_i^{(t)}\big),\qquad m_i^{(t)}\triangleq\frac{\sum_{\ell=1}^{t_i}\bar X_{i,\ell}}{t_i},$$

and the predictive p.m.f. of $X_{i,t+1}$ is

$$q_i(1;\zeta_{t,i})=\gamma_i^{(t)},\qquad q_i(0;\zeta_{t,i})=1-\gamma_i^{(t)},$$

where

$$\gamma_i^{(t)}\triangleq\frac{\alpha_i^{(t)}}{\alpha_i^{(t)}+\beta_i^{(t)}}\,.$$

Assuming $t_i>0$, as $\alpha_i^{(0)}\to 0$ and $\beta_i^{(0)}\to 0$, we have $\gamma_i^{(t)}\to m_i^{(t)}$. In this limit, the prior is called an uninformative prior, which is not a proper distribution, although the posterior distribution can be appropriately defined in the same way as for an informative prior.
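These updates are trivial to implement; a minimal sketch matching the conjugate formulas above (the function name is ours):

```python
def beta_bernoulli_update(alpha0, beta0, observations):
    """Posterior hyper-parameters and predictive success probability for
    Bernoulli samples under a Beta(alpha0, beta0) prior:
    alpha_t = alpha0 + #successes, beta_t = beta0 + #failures,
    gamma_t = alpha_t / (alpha_t + beta_t)."""
    successes = sum(observations)            # equals t_i * m_i^{(t)}
    failures = len(observations) - successes
    alpha_t = alpha0 + successes
    beta_t = beta0 + failures
    gamma_t = alpha_t / (alpha_t + beta_t)   # predictive P(X = 1)
    return alpha_t, beta_t, gamma_t
```

For instance, a Beta(1,1) prior updated with observations (1, 1, 0) yields hyper-parameters (3, 2) and predictive probability 0.6.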

2. Normal Distribution

The conjugate prior for the normal distribution with unknown mean $\mu_i$ and known variance $\sigma_i^2$ is a normal distribution $N\big(\mu_i^{(0)},(\sigma_i^{(0)})^2\big)$. With (5) and (6), the posterior distribution of $\mu_i$ is $N\big(\mu_i^{(t)},(\sigma_i^{(t)})^2\big)$, where

$$\mu_i^{(t)}=\big(\sigma_i^{(t)}\big)^2\left(\frac{\mu_i^{(0)}}{\big(\sigma_i^{(0)}\big)^2}+\frac{t_i m_i^{(t)}}{\sigma_i^2}\right),\qquad \big(\sigma_i^{(t)}\big)^2=\left(\frac{1}{\big(\sigma_i^{(0)}\big)^2}+\frac{t_i}{\sigma_i^2}\right)^{-1},$$

and the predictive distribution of $X_{i,t+1}$ is $N\big(\mu_i^{(t)},(\sigma_i^{(t)})^2+\sigma_i^2\big)$. As $(\sigma_i^{(0)})^2\to\infty$, $\mu_i^{(t)}\to m_i^{(t)}$, and the prior is the uninformative prior in this case. For a normal distribution with unknown variance, there is a normal-gamma conjugate prior (see DeGroot [2005]).
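A minimal sketch of the normal-normal update (function name ours); as `var0` grows, the posterior mean approaches the sample mean, recovering the uninformative-prior limit:

```python
def normal_update(mu0, var0, sampling_var, observations):
    """Posterior (mean, variance) for normal samples with known variance
    under a N(mu0, var0) prior: a precision-weighted average of the prior
    mean and the sample mean. The predictive distribution for the next
    sample is N(mu_t, var_t + sampling_var)."""
    t = len(observations)
    if t == 0:
        return mu0, var0
    m = sum(observations) / t                      # sample mean m_i^{(t)}
    var_t = 1.0 / (1.0 / var0 + t / sampling_var)
    mu_t = var_t * (mu0 / var0 + t * m / sampling_var)
    return mu_t, var_t
```

For example, a N(0, 1) prior and one observation 2.0 with sampling variance 1 give a posterior N(1.0, 0.5).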

## IV Analysis of Optimal A&S Policy

In Section IV-A, we analyze the properties of the optimal selection policy. For discrete sampling and prior distributions, an explicit form for the optimal A&S policy and its computational complexity are provided in Section IV-B.

### IV-A Optimal Selection Policy

The optimal selection policy is the last step in the Bellman equation. From (5), we know the posterior distributions of $\mu_i$ conditioned on $\mathcal{E}_T$ are independent when the prior distributions for different alternatives are independent, which will be assumed in this section. For PCS, the optimal selection policy is

$$S^*(\mathcal{E}_T)=\arg\max_{i=1,\ldots,k}P\big(\mu_i\geq\mu_j,\ \forall j\neq i\,\big|\,\mathcal{E}_T\big)=\arg\max_{i=1,\ldots,k}\int_{\mathcal{O}_i}\prod_{j\neq i}F_j(x|\mathcal{E}_T)\;f_i(x|\mathcal{E}_T)\,dx,$$

where $\mathcal{O}_i$ is the feasible set of $\mu_i$, and $F_i(\cdot|\mathcal{E}_T)$ is the posterior distribution of $\mu_i$ with density $f_i(\cdot|\mathcal{E}_T)$, $i=1,\ldots,k$. For EOC, the optimal selection policy is

$$S^*(\mathcal{E}_T)=\arg\max_{i=1,\ldots,k}\mathbb{E}[\mu_i|\mathcal{E}_T],$$

and

$$\begin{aligned}V_T(\mathcal{E}_T)&=\mathbb{E}\big[\mu_i-\mu_{\langle 1\rangle}\,\big|\,\mathcal{E}_T\big]\Big|_{i=S^*(\mathcal{E}_T)}\\&=\mathbb{E}[\mu_i|\mathcal{E}_T]\big|_{i=S^*(\mathcal{E}_T)}-\mathbb{E}\Big[\sum_{i=1}^{k}\mu_i\,\mathbb{1}\{\mu_i>\mu_j,\ j\neq i\}\,\Big|\,\mathcal{E}_T\Big]\\&=\mathbb{E}[\mu_i|\mathcal{E}_T]\big|_{i=S^*(\mathcal{E}_T)}-\sum_{i=1}^{k}\mathbb{E}\Big[\mu_i\prod_{j\neq i}\mathbb{E}\big[\mathbb{1}\{\mu_i>\mu_j\}\,\big|\,\mu_i,\mathcal{E}_T\big]\,\Big|\,\mathcal{E}_T\Big]\\&=\mathbb{E}[\mu_i|\mathcal{E}_T]\big|_{i=S^*(\mathcal{E}_T)}-\sum_{i=1}^{k}\int_{\mathcal{O}_i}x\prod_{j\neq i}F_j(x|\mathcal{E}_T)\;f_i(x|\mathcal{E}_T)\,dx\,.\end{aligned}$$
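The decomposition above reduces the EOC terminal value to the largest posterior mean minus the posterior expectation of the largest mean. A Monte Carlo sketch for independent normal posteriors illustrates this (names ours); e.g., for two standard-normal posteriors the value is $-\mathbb{E}[\max(Z_1,Z_2)]=-1/\sqrt{\pi}\approx-0.564$.

```python
import random

def eoc_terminal_value(post_means, post_sds, n_draws=200000, seed=1):
    """Monte Carlo estimate of the EOC terminal value
    max_i E[mu_i | E_T] - E[max_i mu_i | E_T]
    for independent normal posteriors N(post_means[i], post_sds[i]^2)."""
    rng = random.Random(seed)
    k = len(post_means)
    total = 0.0
    for _ in range(n_draws):
        total += max(rng.gauss(post_means[i], post_sds[i]) for i in range(k))
    return max(post_means) - total / n_draws
```

The value is always nonpositive, since the expectation of a maximum dominates the maximum of expectations.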

For EOC, the optimal selection policy for the Bernoulli distribution under conjugacy is

$$S^*(\mathcal{E}_T)=\arg\max_{i=1,\ldots,k}\gamma_i^{(T)},$$

and the optimal selection policy for the normal distribution under conjugacy is

$$S^*(\mathcal{E}_T)=\arg\max_{i=1,\ldots,k}\mu_i^{(T)}\,.$$

For PCS, the optimal selection policy depends on the entire posterior distributions rather than just the posterior means. For normal distributions with conjugate priors, Peng et al. [2016] showed that except for $k=2$, selecting the alternative with the largest posterior mean is not in general the optimal selection policy, which should also incorporate the posterior variances.

The following theorem establishes that under some mild conditions, the selection policy selecting the alternative with the largest sample mean is asymptotically consistent with the optimal selection policy for EOC, which is analogous to the result for PCS in Peng et al. [2016].

###### Theorem 2.

Suppose for $i=1,\ldots,k$, the replications $X_{i,t}$, $t\in\mathbb{Z}^+$, are i.i.d. with mean $\mu_i$, the sampling distributions are mutually independent across alternatives, the priors are mutually independent, and the following conditions are satisfied:

• whenever $i\neq j$, $\mu_i\neq\mu_j$;

• ;

• $\mathbb{E}[|\mu_i|]<\infty$, $i=1,\ldots,k$;

• For any and finite , , .

Then, we have

$$\lim_{T\to\infty}\mathbb{E}\big[\mathcal{V}_E(\theta;i)\,\big|\,\mathcal{E}_T^*\big]\Big|_{i=S_m(\mathcal{E}_T^*)}=0\quad a.s.,$$

where $S_m(\mathcal{E}_T)\triangleq\arg\max_{i=1,\ldots,k}m_i^{(T)}$ and $\mathcal{E}_T^*$ means the information set obtained by following the optimal allocation policy $A^*$, and

$$\lim_{T\to\infty}\mathbb{E}\big[\mathcal{V}_E(\theta;i)\,\big|\,\mathcal{E}_T^*\big]\Big|_{i\neq S_m(\mathcal{E}_T^*)}<0\quad a.s.;$$

therefore,

$$\lim_{T\to\infty}\big[S^*(\mathcal{E}_T^*)-S_m(\mathcal{E}_T^*)\big]=0\quad a.s.$$
###### Proof:

Denote $A^e$ as the equal allocation policy. Following $A^e$, every alternative will be sampled infinitely often as $T$ goes to infinity. By the law of large numbers, we know

$$\lim_{T\to\infty}\max_{i=1,\ldots,k}m_i^{(T)}=\max_{i=1,\ldots,k}\mu_i^*\quad a.s.,$$

where $\mu_i^*$ means the true parameter. In addition, $\mathbb{E}[\mu_i|\mathcal{E}_T^e]$ and $\mathbb{E}[\max_{i=1,\ldots,k}\mu_i|\mathcal{E}_T^e]$ are martingales. With condition (iii), we have

$$\mathbb{E}\big[\big|\mathbb{E}[\mu_i|\mathcal{E}_T^e]\big|\big]\leq\mathbb{E}[|\mu_i|]<\infty,\qquad \mathbb{E}\Big[\Big|\mathbb{E}\big[\max_{i=1,\ldots,k}\mu_i\,\big|\,\mathcal{E}_T^e\big]\Big|\Big]\leq\sum_{i=1}^{k}\mathbb{E}[|\mu_i|]<\infty,$$

where $\mathcal{E}_T^e$ means the information set obtained by following $A^e$. By Doob's Martingale Convergence and Consistency Theorems (see Doob [1953] and Van der Vaart [2000]),

$$\lim_{T\to\infty}\mathbb{E}[\mu_i|\mathcal{E}_T^e]=\mu_i^*,\qquad \lim_{T\to\infty}\mathbb{E}\big[\max_{i=1,\ldots,k}\mu_i\,\big|\,\mathcal{E}_T^e\big]=\max_{i=1,\ldots,k}\mu_i^*\quad a.s.,$$

so

$$\lim_{T\to\infty}\mathbb{E}\big[\mathcal{V}_E\big(\theta;S_m(\mathcal{E}_T^e)\big)\,\big|\,\mathcal{E}_T^e\big]=0\quad a.s.$$

By definition, we have

$$0=\lim_{T\to\infty}\mathbb{E}\big[\mathcal{V}_E\big(\theta;S_m(\mathcal{E}_T^e)\big)\,\big|\,\mathcal{E}_T^e\big]\leq\lim_{T\to\infty}\mathbb{E}\big[\mathcal{V}_E\big(\theta;S^*(\mathcal{E}_T^*)\big)\,\big|\,\mathcal{E}_T^*\big]\leq 0\quad a.s.$$

Then, we prove that following the optimal policy $A^*$, every alternative will be sampled infinitely often almost surely as $T$ goes to infinity. Otherwise, $\exists\, k_1,k_2>0$ with $k_1+k_2=k$ s.t.

$$\{i_1,\ldots,i_{k_1}:\ T^*_{i_l}\triangleq\lim_{T\to\infty}T_{i_l}<\infty,\ l=1,\ldots,k_1\}\neq\emptyset,\qquad \{j_1,\ldots,j_{k_2}:\ \lim_{T\to\infty}T_{j_l}=\infty,\ l=1,\ldots,k_2\}\neq\emptyset\,.$$

We have

$$\mathbb{E}\Big[\Big|\mathbb{E}\big[\max_{i=1,\ldots,k}\mu_i\,\big|\,\mathcal{E}_T,\mu_{i_1},\ldots,\mu_{i_{k_1}}\big]\Big|\Big]\leq\mathbb{E}\Big[\mathbb{E}\big[\big|\max_{i=1,\ldots,k}\mu_i\big|\,\big|\,\mathcal{E}_T,\mu_{i_1},\ldots,\mu_{i_{k_1}}\big]\Big]\leq\sum_{i=1}^{k}\mathbb{E}[|\mu_i|]<\infty\,.$$

By the Dominated Convergence Theorem (see Rudin [1987]) and Doob's Martingale Convergence and Consistency Theorems,

$$\lim_{T\to\infty}\mathbb{E}\big[\max_{i=1,\ldots,k}\mu_i\,\big|\,\mathcal{E}_T^*\big]=\lim_{T\to\infty}\mathbb{E}\Big[\mathbb{E}\big[\max_{i=1,\ldots,k}\mu_i\,\big|\,\mathcal{E}_T^*,\mu_{i_1},\ldots$$