# Ranking and Selection with Covariates for Personalized Decision Making

We consider a ranking and selection problem in the context of personalized decision making, where the best alternative is not universal but varies as a function of observable covariates. The goal of ranking and selection with covariates (R&S-C) is to use sampling to compute a decision rule that can specify the best alternative, with a certain statistical guarantee, for each subsequent individual after observing his or her covariates. A linear model is proposed to capture the relationship between the mean performance of an alternative and the covariates. Under the indifference-zone formulation, we develop two-stage procedures for homoscedastic and heteroscedastic sampling errors, respectively, and prove their statistical validity, which is defined in terms of probability of correct selection. We also generalize the well-known slippage configuration, and prove that the generalized slippage configuration is the least favorable configuration of our procedures. Extensive numerical experiments are conducted to investigate the performance of the proposed procedures. Finally, we demonstrate the usefulness of R&S-C via a case study of selecting the best treatment regimen for the prevention of esophageal cancer. We find that by leveraging disease-related personal information, R&S-C can substantially improve the expected quality-adjusted life years for some groups of patients by providing patient-specific treatment regimens.

## 1 Introduction

Ranking and selection (R&S) is concerned with choosing the best from a finite collection of alternatives, whose performances are unknown and can only be learned through sampling. In this paper we introduce a new R&S problem in which the performance of an alternative varies as a function of some observable random covariates, which are also known as side information, auxiliary quantities, or contextual variables. This is mainly motivated by the emerging popularity of personalized decision making in various areas such as healthcare, e-commerce, and wealth management as customer-specific data grows exponentially and powerful computational infrastructure becomes more accessible. By taking advantage of personalized information as covariates, decisions can be tailored to the individual characteristics of each customer, and are thereby conceivably more beneficial.

For instance, medical studies show that the effectiveness of cancer chemotherapy treatment depends on the biometric characteristics of a patient such as tumor biomarker and gene expression, and thus the treatment outcome can be improved significantly by personalizing treatment regimen (Yap et al. 2009, Kim et al. 2011). In marketing research, it is known that by sending customized advertisements or promotions based on consumers’ demographic information and purchasing behaviors, companies can increase both profits and customer satisfaction considerably (Arora et al. 2008). For a third example, leveraging rapid advances in financial technology, automated asset management firms (i.e., robo-advisors) assign portfolios based on investor characteristics such as age, income, and risk preference in order to meet the individual financial need of each client (Faloon and Scherer 2017).

A critical feature of ranking and selection with covariates (R&S-C) is that the best alternative is not universal but depends on the covariates. Hence, the solution to an R&S-C problem is a decision rule, as a function of the covariates, that specifies the best alternative for each given value of the covariates. We assume in this paper that we are able to sample each alternative at any value of the covariates, so the computation is done offline. (By contrast, in an online environment the covariates are observed sequentially and their values are not controllable.) Nevertheless, after the decision rule is computed, it can be applied online to select the best alternative for each subsequent individual after observing his or her covariates.

R&S-C reflects a shift in perspective with regard to the role of simulation in decision making. R&S-C can be viewed as a tool for system control, since the decision rule it produces can determine the optimal decision dynamically across time. By contrast, R&S is a tool for system design, since it is used to find the best alternative (e.g., production line configuration) before implementation, which is generally expensive and unlikely to change in near future. This shift in perspective is discussed extensively in Nelson (2016) who envisions future development of simulation in the next decade as computational facilities progress rapidly. This perspective is also taken in Jiang et al. (2017) to design algorithms for dynamic risk monitoring.

We assume a linear relationship between the response of an alternative and the covariates. Despite their simplicity, linear models have distinct advantages in terms of their interpretability and robustness relative to model misspecification, and often show good performance in prediction (James et al. 2013). Moreover, they can be generalized easily to accommodate nonlinearity by applying covariate transformation via basis functions (Hastie et al. 2009). Linear models have also been used in R&S problems; see Negoescu et al. (2011) and Chen et al. (2015), where the former assumes that the responses are linear in the covariates, while the latter assumes that they are linear in certain functions of the covariates. However, these procedures aim to select the best alternative as a static decision rather than the kind of decision rule that we seek.

We develop R&S-C procedures that provide a certain statistical guarantee in terms of probability of correct selection (PCS), whose definition, nevertheless, is complicated by the presence of covariates. Specifically, correct selection is now a conditional event given the covariates, thereby suggesting a conditional PCS. We define two forms of unconditional PCS, one by taking the expectation with respect to the distribution of the covariates, and the other by taking the minimum over the support of the covariates. Statistical validity of an R&S-C procedure is defined via either form of unconditional PCS.

### 1.1 Main Contributions

First and foremost, we formulate R&S-C as a novel framework to facilitate personalized decision making for choosing the best from a set of competing alternatives. Along the way, we generalize important concepts for R&S problems, including the indifference-zone formulation and PCS, to the new setting.

Second, since the sampling errors of an alternative sampled at different values of the covariates may have unequal variances, we propose two-stage procedures for the homoscedastic and heteroscedastic cases, respectively, and prove that they are statistically valid. In addition, the procedures can be revised to accommodate different forms of unconditional PCS.

Third, we generalize the concept of slippage configuration of the means for R&S problems to the R&S-C setting and prove that it is the least favorable configuration for a family of R&S-C procedures including ours.

Fourth, we conduct extensive numerical experiments to assess the performance of the proposed procedures in terms of the achieved unconditional PCS, and investigate its sensitivity to various aspects such as the form of the unconditional PCS and the configuration of the variances.

Finally, we formulate a personalized medicine problem of selecting the best prevention regimen for esophageal cancer as an R&S-C problem, and demonstrate its practical value and advantage relative to a more traditional approach to treatment selection, which corresponds to an R&S problem, using a Markov simulation model developed and calibrated by domain experts in cancer research.

### 1.2 Related Literature

R&S has been a classic research problem in the simulation literature over the past decades, and a large number of selection procedures have been developed to solve it. In general, a procedure specifies the proper sample size of each alternative and determines which alternative to select. The procedures in the literature are developed following either a frequentist or a Bayesian approach, depending on whether the decision maker interprets the mean performance of an alternative as a constant or a random variable; see Kim and Nelson (2006) and Chen et al. (2015) for overviews of the two approaches, respectively. Frequentist procedures aim to provide a certain statistical guarantee (usually in terms of PCS) even for the least favorable configuration (Rinott 1978, Kim and Nelson 2001, Hong and Nelson 2007, Luo et al. 2015). Thus, they are typically conservative and require more samples than necessary for average cases. Bayesian procedures, on the other hand, aim to allocate a finite computational budget to different alternatives in order to either maximize the PCS or minimize the expected opportunity cost. There are a variety of approaches to developing a Bayesian procedure, including value of information (Chick and Inoue 2001, Chick et al. 2010), knowledge gradient (Frazier et al. 2008, 2009), optimal computing budget allocation (Chen et al. 1997, Fu et al. 2007), and economics of selection procedures (Chick and Gans 2009, Chick and Frazier 2012). Bayesian procedures often require fewer samples than frequentist procedures to achieve the same level of PCS. However, they do not provide a statistical guarantee in general, except for Frazier (2014) in which a Bayes-inspired procedure is proposed to achieve a pre-specified PCS in the frequentist sense.

The present paper follows a frequentist perspective. Among the frequentist procedures in the literature, there are two-stage procedures and sequential procedures. The former use the first stage to estimate the appropriate sample size for each alternative and select the best alternative at the end of the second stage. Examples include Rinott (1978), Nelson and Banerjee (2001), and Chick and Wu (2005). Sequential procedures do not specify the sample size in advance. Instead, they take samples gradually and meanwhile eliminate the inferior alternatives sequentially once enough statistical evidence is collected, thereby generally requiring fewer samples than their two-stage counterparts; see, e.g., Paulson (1964), Kim and Nelson (2001), Hong (2006), and Fan et al. (2016). However, they may induce substantially more computational overhead due to repeated switches between simulation models from which samples of different alternatives are taken (Hong and Nelson 2005). The procedures developed in this paper are stage-wise, have a structure similar to the two-stage procedure of Rinott (1978), and are easy to implement.

Our research is also related to the literature on multi-armed bandits (MAB) with covariates. MAB is an important class of sequential decision making problems in fields such as operations research and statistics. It was first proposed by Robbins (1952) and has been studied extensively since then; see, for instance, Bubeck and Cesa-Bianchi (2012) for a comprehensive review of MAB. In recent years, MAB with covariates (otherwise known as contextual MAB) has drawn considerable attention as a tool for facilitating personalized decision making. The reward is often modeled as a linear function of the covariates (Auer 2002, Rusmevichientong and Tsitsiklis 2010). In particular, Goldenshluger and Zeevi (2013) consider a linear model whose linear coefficients are arm-dependent, which motivates our formulation of R&S-C. We refer to Slivkins (2014) and references therein for nonparametric models, and to Bubeck and Cesa-Bianchi (2012) for a review of recent advances in contextual MAB problems.

The remainder of the paper is organized as follows. In §2 we introduce the R&S-C problem and define various concepts, such as correct selection and statistical validity, that generalize their conventional meanings in the R&S setting. In §3 and §4, we develop two-stage selection procedures for homoscedastic and heteroscedastic sampling errors, respectively. In §5, we discuss the least favorable configuration of the means for R&S-C problems and show that, for a family of selection procedures, it is the so-called generalized slippage configuration. In §6 we conduct extensive numerical experiments to investigate the performance of the proposed procedures in various settings. In §7 we demonstrate the practical value of R&S-C in the context of personalized medicine for esophageal cancer prevention. We conclude in §8 and collect some technical proofs in the Appendix.

## 2 Problem Formulation

We consider a collection of $k$ distinctive alternatives. Suppose that the performance of each alternative depends on a vector of random covariates. For each $i=1,\ldots,k$ and $x\in\Theta$, let $Y_i(x)$ denote the mean performance of alternative $i$ and $Y_{i\ell}(x)$ denote the $\ell$th sample from alternative $i$, where $X=(1,X_1,\ldots,X_d)^\intercal$ denotes the augmented covariates with support $\Theta$. Our goal is to select the alternative with the largest mean performance conditionally on $X=x$,

$$i^*(x) \coloneqq \operatorname*{arg\,max}_{1\le i\le k}\,\{Y_i(X)\mid X=x\}.$$

We call this problem R&S-C. By contrast, the sought alternative in the conventional R&S setting is independent of the values of the covariates. In particular, a decision maker who is risk-neutral with respect to the covariates would seek the best alternative via solving

$$i^\dagger \coloneqq \operatorname*{arg\,max}_{1\le i\le k}\,\mathrm{E}[Y_i(X)],$$

where the expectation is taken with respect to the distribution of $X$. Notice that

$$\mathrm{E}\bigl[Y_{i^*(X)}(X)\bigr] = \mathrm{E}\Bigl[\max_{1\le i\le k} Y_i(X)\Bigr] \ge \max_{1\le i\le k}\mathrm{E}[Y_i(X)] = \mathrm{E}\bigl[Y_{i^\dagger}(X)\bigr],$$

by Jensen's inequality. This indicates that it is better to select an alternative after observing the covariates than before the observation, if the decision maker is risk-neutral with respect to the covariates. The usefulness of R&S-C will be demonstrated further in the context of personalized medicine in §7.
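The ordering above can be checked on a toy example (hypothetical numbers, not from the paper): with two alternatives whose ranking flips with a binary covariate, the personalized rule strictly dominates the best static rule.

```python
import numpy as np

# Hypothetical example: two alternatives; the covariate X takes values 0 and 1
# with equal probability, and the best alternative depends on X.
means = np.array([[1.0, 0.0],   # Y_1(x) for x = 0, 1
                  [0.0, 1.0]])  # Y_2(x) for x = 0, 1
p_x = np.array([0.5, 0.5])      # distribution of X

# Personalized rule: pick the best alternative after observing X.
value_personalized = np.sum(p_x * means.max(axis=0))   # E[max_i Y_i(X)]
# Static rule: commit to one alternative before observing X.
value_static = np.max(means @ p_x)                     # max_i E[Y_i(X)]

assert value_personalized >= value_static   # Jensen's inequality
```

Here the personalized rule attains an expected performance of 1.0, while any static choice attains only 0.5.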

We assume a linear model in which the mean performance $Y_i(x)$ is linear in $x$ and each sample $Y_{i\ell}(x)$ is unbiased.

###### Assumption 1.

For each $i=1,\ldots,k$ and $x\in\Theta$, conditionally on $X=x$,

$$Y_i(x) = x^\intercal\beta_i, \qquad Y_{i\ell}(x) = Y_i(x) + \epsilon_{i\ell}(x),$$

where $\beta_i$ is a vector of unknown coefficients and $\epsilon_{i\ell}(x)$ is the sampling error with the following properties:

(i) $\epsilon_{i\ell}(x)\sim\mathcal{N}\bigl(0,\sigma_i^2(x)\bigr)$, i.e., normal distribution with mean 0 and variance $\sigma_i^2(x)$;

(ii) $\epsilon_{i\ell}(x)$ is independent of $\epsilon_{i'\ell'}(x')$ for any $(i,\ell,x)\ne(i',\ell',x')$.

###### Remark 1.

Property (i) in Assumption 1 allows the sampling errors to have unequal variances for different values of $x$ for each alternative, which is often the case in stochastic simulation. Property (ii), on the other hand, indicates that samples taken from different alternatives, different replications, or different values of the covariates are independent.

###### Remark 2.

Notice that the linear model is a natural extension of the normality assumption commonly used in R&S to the R&S-C setting. It is simple to handle, easy to interpret, and robust to model misspecification. Moreover, it can be generalized easily to accommodate nonlinearity by applying covariate transformation via basis functions (Hastie et al. 2009).
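As a small illustration of the basis-function idea (our sketch, with made-up data): a quadratic mean response is still linear in the coefficients once the covariate is expanded as $\phi(x) = (1, x, x^2)$, so ordinary least squares applies directly.

```python
import numpy as np

# Illustration (not from the paper): a nonlinear response recast as a model
# linear in the coefficients via the basis phi(x) = (1, x, x^2).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2.0 - x + 3.0 * x**2 + rng.normal(0, 0.1, size=x.size)  # true quadratic mean

Phi = np.column_stack([np.ones_like(x), x, x**2])   # design matrix of basis values
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # ordinary least squares

# beta_hat recovers the true coefficients (2, -1, 3) up to sampling noise.
```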

### 2.1 Indifference-Zone Formulation

We adopt the indifference-zone (IZ) formulation (Bechhofer 1954) to develop selection procedures for the R&S-C problem. The sought procedures ought to provide a lower bound for both the probability of correct selection (CS) and the probability of good selection (GS) under the IZ formulation; see Ni et al. (2017) for their definitions in the context of R&S problems. However, the events of CS and GS first need to be redefined carefully in light of the covariates.

Let $\delta > 0$ be a pre-specified IZ parameter that represents the smallest difference in performance between the competing alternatives that the decision maker considers worth detecting. Let $\hat i^*(x)$ denote the selected alternative given $X=x$ upon termination of a procedure. Clearly, $\hat i^*(x)$ is not necessarily identical to $i^*(x)$ due to the random sampling errors. A CS event occurs when the alternative selected by the procedure is the same as the true best, i.e.,

$$\mathrm{CS}(x) \coloneqq \bigl\{\hat i^*(X) = i^*(X) \mid X = x\bigr\},$$

where the probability of this event is taken with respect to the distribution of the samples used by the selection procedure producing $\hat i^*$. In particular, CS is a conditional event and its meaning is ambiguous unless the value of the covariates is specified.

If $Y_{i^*(x)}(x) - Y_i(x) \le \delta$ for some $i \ne i^*(x)$, namely, conditionally on $X=x$ there exists a “good” alternative whose mean performance is within $\delta$ of the best alternative, then the decision maker feels indifferent between alternative $i$ and the best alternative when $X=x$. We define the GS event as the following conditional event where one of the good alternatives is selected:

$$\mathrm{GS}(x) \coloneqq \bigl\{Y_{i^*(X)}(X) - Y_{\hat i^*(X)}(X) < \delta \mid X = x\bigr\}.$$

Notice that if $Y_{i^*(x)}(x) - Y_i(x) \ge \delta$ for all $i \ne i^*(x)$, then the GS event is reduced to the CS event. In the R&S literature, most frequentist procedures based on the IZ formulation are developed for the situation where the best alternative is better than the other alternatives by at least $\delta$, and thus it is conventional to define statistical validity of a procedure by assessing the PCS it achieves. In the presence of covariates, it appears too restrictive to assume that $Y_{i^*(x)}(x) - Y_i(x) \ge \delta$ for all $i \ne i^*(x)$ and all $x \in \Theta$. Hence, in the rest of this paper we use the term PCS in an extended way that accommodates both the events of CS and GS. In particular, we define the conditional PCS as

$$\mathrm{PCS}(x) \coloneqq \mathrm{P}\bigl(Y_{i^*(X)}(X) - Y_{\hat i^*(X)}(X) < \delta \mid X = x\bigr), \tag{1}$$

so that it represents the conditional probability of CS if $Y_{i^*(x)}(x) - Y_i(x) \ge \delta$ for all $i \ne i^*(x)$, and the conditional probability of GS otherwise.

We then define two forms of unconditional PCS. Specifically,

$$\mathrm{PCS}_{\mathrm{E}} \coloneqq \mathrm{E}[\mathrm{PCS}(X)], \tag{2}$$

where the expectation is taken with respect to the distribution of $X$, and

$$\mathrm{PCS}_{\min} \coloneqq \min_{x\in\Theta}\,\mathrm{PCS}(x). \tag{3}$$

Fixing a particular form (either $\mathrm{PCS}_{\mathrm{E}}$ or $\mathrm{PCS}_{\min}$), we aim to develop selection procedures that provide a lower bound for the unconditional PCS. In particular, a selection procedure is said to be statistically valid if the achieved unconditional PCS is no smaller than a pre-specified value $1-\alpha$.

We do not argue or suggest that one form of unconditional PCS is better than the other, although they may be suitable for different circumstances. In particular, a decision maker may use $\mathrm{PCS}_{\mathrm{E}}$ if he or she is risk-neutral with respect to the covariates and the distribution of the covariates is known or can be estimated credibly from data. When $\mathrm{PCS}_{\mathrm{E}}$ is used, some individuals may have a conditional PCS that is smaller than $1-\alpha$. On the other hand, $\mathrm{PCS}_{\min}$ represents a more conservative criterion than $\mathrm{PCS}_{\mathrm{E}}$, since $\mathrm{PCS}_{\min} \le \mathrm{PCS}_{\mathrm{E}}$ by definition, so a lower bound for $\mathrm{PCS}_{\min}$ must be a lower bound for $\mathrm{PCS}_{\mathrm{E}}$ but not vice versa. Using $\mathrm{PCS}_{\min}$ essentially ensures that all individuals will have a conditional PCS that is at least as large as $1-\alpha$. Hence, $\mathrm{PCS}_{\min}$ may be a better choice for a risk-averse decision maker, or if the distribution of the covariates is unknown due to lack of information.
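The relationship between the two forms can be seen with hypothetical numbers: given the conditional PCS on a finite support, $\mathrm{PCS}_{\mathrm{E}}$ is a weighted average while $\mathrm{PCS}_{\min}$ is a worst case, so a target may be met by one and not the other.

```python
import numpy as np

# Hypothetical conditional PCS values on a finite support (our numbers).
support = np.array([0.0, 0.5, 1.0])
p_x     = np.array([0.2, 0.5, 0.3])     # assumed distribution of X
pcs_x   = np.array([0.93, 0.97, 0.96])  # assumed conditional PCS at each x

pcs_e   = np.sum(p_x * pcs_x)   # PCS_E   = E[PCS(X)]  -> 0.959
pcs_min = np.min(pcs_x)         # PCS_min = min_x PCS(x) -> 0.93

assert pcs_min <= pcs_e
# With target 1 - alpha = 0.95, PCS_E is met here but PCS_min is not.
```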

### 2.2 Fixed Design

We consider the fixed design setting as follows. Suppose that $m$ design points $x_1,\ldots,x_m \in \Theta$ are chosen properly and fixed, and that alternative $i$ can be sampled at $x_j$ repeatedly, arbitrarily many times, for each $i=1,\ldots,k$ and $j=1,\ldots,m$. The fixed design is suitable when a simulation model is available and the decision maker can perform experiment design in advance. However, if observations are collected from real experiments or in a sequential manner, such as clicks of banner ads on a webpage in online advertising, the fixed design may not be applicable.

The placement of the design points given a computational budget is certainly an important problem in practice. A popular approach for linear models is the so-called D-optimal design, which minimizes the determinant of the covariance matrix of the least-squares estimators of the coefficients $\beta_i$. There are various other optimality criteria for computing a good experiment design; see Atkinson et al. (2007, Chapter 10) for more details on the subject. Intuitively, the linear model would prefer the design points to be allocated far away from each other, which implies that the interior of the domain has scarce design points. However, one may consider spreading the design points roughly evenly over $\Theta$ in order to protect against model misspecification. It is beyond the scope of the present paper to discuss the experiment design extensively and we leave the investigation to future study. In the rest of this paper, we simply assume that the design points are properly chosen and satisfy the following condition.

###### Assumption 2.

$\mathcal{X}^\intercal\mathcal{X}$ is a nonsingular matrix, where $\mathcal{X} \coloneqq (x_1,\ldots,x_m)^\intercal$ is the design matrix.

###### Remark 3.

Albeit written in the form of an “assumption” for ease of presentation, the above condition can always be satisfied, since we may incorporate it as a constraint in the experiment design when we choose the design points. Notice that this is also a typical requirement in linear regression.
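In practice the condition can be verified directly for a candidate design; a minimal check (our sketch, with $d=1$ and an augmented intercept column) is:

```python
import numpy as np

# Check Assumption 2 for a candidate design: with m design points and
# augmented covariates (1, x), the m x (d+1) design matrix must have full
# column rank so that X^T X is nonsingular.
design_points = np.array([0.0, 0.5, 1.0])               # m = 3 points, d = 1
X = np.column_stack([np.ones_like(design_points), design_points])

XtX = X.T @ X
assert np.linalg.matrix_rank(X) == X.shape[1]   # full column rank
assert np.linalg.det(XtX) > 1e-12               # X^T X nonsingular

# A degenerate design (all points identical) violates the condition:
X_bad = np.column_stack([np.ones(3), np.full(3, 0.5)])
assert np.linalg.matrix_rank(X_bad) < X_bad.shape[1]
```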

Given a design matrix, we will develop selection procedures for both homoscedastic and heteroscedastic sampling errors. Here, homoscedasticity or heteroscedasticity refers to the variances of the sampling errors of the same alternative at different design points. The variances of the sampling errors of different alternatives are always allowed to be different. In particular, the sampling errors for each alternative have equal variances for different values of the covariates in the homoscedasticity case, while they have unequal variances in the heteroscedasticity case. This is analogous to the difference between the ordinary least squares method and the generalized least squares method for estimating the unknown coefficients in a linear regression model.

## 3 Homoscedastic Sampling Errors

By homoscedastic sampling errors, we mean the following assumption.

###### Assumption 3.

$\sigma_i^2(x) = \sigma_i^2$ for $i = 1,\ldots,k$ and $x \in \Theta$.

We emphasize that by homoscedasticity we do not mean $\sigma_1^2 = \cdots = \sigma_k^2$; these are instead allowed to be unequal. Notice that for certain situations where Assumption 3 fails, it is possible to apply a variance stabilizing transformation by redefining the covariates properly to achieve homoscedasticity (Box and Cox 1964). Similar to the setting of linear regression, the assumption of homoscedasticity simplifies mathematical and computational treatment in our development of selection procedures. Nevertheless, misusing the procedure devised for homoscedasticity in a highly heteroscedastic environment may cause deterioration in the achieved unconditional PCS. We defer the related discussion to §4 after introducing the procedure for the case of heteroscedasticity.

### 3.1 A Two-stage Procedure

We develop a two-stage procedure for the R&S-C problem with fixed design and homoscedastic sampling errors and call it Procedure FDHom. The structure of the procedure is simple and similar to typical two-stage R&S procedures such as Rinott’s procedure (Rinott 1978). The first stage takes a small number of samples in order to estimate the total sample size that is required to deliver the desired statistical guarantee, while the second stage takes the additional samples and produces a selection rule based on the overall samples.

There are several distinctive features in the procedure stemming from the presence of covariates. First, the concept of PCS is more subtle. Since the form of unconditional PCS is not unique, it must be specified before the procedure commences as it is one of the factors that determine the required total sample size. Second, upon termination the procedure yields a decision rule, instead of a single alternative, which stipulates the best alternative for any given value of the covariates. Third, estimation of the mean performances of the alternatives becomes essentially estimation of the unknown coefficients of the covariates. Hence, theoretical analysis of the procedure is closely related to linear regression. We now present Procedure FDHom.

###### Procedure FDHom
1. Setup: Specify the form of unconditional PCS (either $\mathrm{PCS}_{\mathrm{E}}$ or $\mathrm{PCS}_{\min}$), the target unconditional PCS $1-\alpha$, the IZ parameter $\delta$, the first-stage sample size $n_0$, the number of design points $m$, and the design matrix $\mathcal{X}$. Set $h = h_{\mathrm{E}}^{\mathrm{Hom}}$ if $\mathrm{PCS}_{\mathrm{E}}$ is used, and $h = h_{\min}^{\mathrm{Hom}}$ if $\mathrm{PCS}_{\min}$ is used, where the constants $h_{\mathrm{E}}^{\mathrm{Hom}}$ and $h_{\min}^{\mathrm{Hom}}$ respectively satisfy the following equations

$$\mathrm{E}\left\{\int_0^\infty\left[\int_0^\infty \Phi\left(\frac{h_{\mathrm{E}}^{\mathrm{Hom}}}{\sqrt{(n_0 m - d - 1)(t^{-1}+s^{-1})\,X^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}X}}\right)\eta(s)\,\mathrm{d}s\right]^{k-1}\eta(t)\,\mathrm{d}t\right\} = 1-\alpha, \tag{4}$$

where the expectation is taken with respect to the distribution of $X$, and

$$\min_{x\in\Theta}\left\{\int_0^\infty\left[\int_0^\infty \Phi\left(\frac{h_{\min}^{\mathrm{Hom}}}{\sqrt{(n_0 m - d - 1)(t^{-1}+s^{-1})\,x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x}}\right)\eta(s)\,\mathrm{d}s\right]^{k-1}\eta(t)\,\mathrm{d}t\right\} = 1-\alpha, \tag{5}$$

where $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution and $\eta$ is the probability density function (pdf) of the chi-squared distribution with $n_0 m - d - 1$ degrees of freedom.

2. First-stage Sampling: Take $n_0$ independent samples of each alternative $i$ at each design point $x_j$, and denote them by $Y_{i\ell} = (Y_{i\ell}(x_1),\ldots,Y_{i\ell}(x_m))^\intercal$, $\ell = 1,\ldots,n_0$, $i = 1,\ldots,k$. For each $i$, set

$$\hat\beta_i(n_0) = \frac{1}{n_0}(\mathcal{X}^\intercal\mathcal{X})^{-1}\mathcal{X}^\intercal\sum_{\ell=1}^{n_0} Y_{i\ell},$$

and compute the residual sample variance

$$S_i^2 = \frac{1}{n_0 m - d - 1}\sum_{\ell=1}^{n_0}\sum_{j=1}^{m}\bigl(Y_{i\ell}(x_j) - x_j^\intercal\hat\beta_i(n_0)\bigr)^2.$$

3. Second-stage Sampling: Compute the total sample size $N_i = \max\bigl\{n_0, \lceil (h S_i/\delta)^2\rceil\bigr\}$ for each $i = 1,\ldots,k$, where $\lceil\cdot\rceil$ denotes the smallest integer no less than its argument. Take $N_i - n_0$ additional independent samples of alternative $i$ at each design point $x_j$, $j = 1,\ldots,m$.

4. Selection: For each alternative $i$, compute the overall estimate of its unknown coefficients

$$\hat\beta_i = \frac{1}{N_i}(\mathcal{X}^\intercal\mathcal{X})^{-1}\mathcal{X}^\intercal\sum_{\ell=1}^{N_i} Y_{i\ell}.$$

Return $\hat i^*(x) = \operatorname*{arg\,max}_{1\le i\le k}\, x^\intercal\hat\beta_i$ as the decision rule.
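The steps above can be sketched as follows (our reading of the procedure; the constant $h$ is assumed to be precomputed from (4) or (5), and `sampler` is a hypothetical stand-in for the simulation model):

```python
import numpy as np

def fd_hom(sampler, X, k, n0, h, delta):
    """Sketch of Procedure FDHom. X: m x (d+1) design matrix; sampler(i, j)
    returns one sample of alternative i at design point j (0-indexed)."""
    m, p = X.shape                              # p = d + 1
    proj = np.linalg.solve(X.T @ X, X.T)        # (X^T X)^{-1} X^T

    # Stage 1: n0 samples of every alternative at every design point.
    Y = {i: [np.array([sampler(i, j) for j in range(m)]) for _ in range(n0)]
         for i in range(k)}
    N = np.empty(k, dtype=int)
    for i in range(k):
        beta0 = proj @ (sum(Y[i]) / n0)         # first-stage OLS estimate
        resid = np.concatenate([y - X @ beta0 for y in Y[i]])
        S2 = resid @ resid / (n0 * m - p)       # residual sample variance
        N[i] = max(n0, int(np.ceil((h / delta) ** 2 * S2)))

    # Stage 2: N_i - n0 additional samples, then the overall OLS estimate.
    betas = np.empty((k, p))
    for i in range(k):
        Y[i] += [np.array([sampler(i, j) for j in range(m)])
                 for _ in range(N[i] - n0)]
        betas[i] = proj @ (sum(Y[i]) / N[i])

    # Decision rule: observe x, return argmax_i x^T beta_i.
    return (lambda x: int(np.argmax(betas @ x))), betas, N

# Hypothetical usage: two alternatives, two design points, true betas known.
rng = np.random.default_rng(1)
design = np.array([[1.0, 0.0], [1.0, 1.0]])
true_betas = np.array([[0.0, 1.0], [1.0, -1.0]])
sampler = lambda i, j: float(true_betas[i] @ design[j] + rng.normal(0.0, 0.05))
rule, betas, N = fd_hom(sampler, design, k=2, n0=10, h=3.0, delta=0.2)
```

With these true coefficients, the decision rule should prefer alternative 2 at $x = (1, 0)^\intercal$ and alternative 1 at $x = (1, 1)^\intercal$.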

### 3.2 Implementation Guide

The constant $h$ (either $h_{\mathrm{E}}^{\mathrm{Hom}}$ or $h_{\min}^{\mathrm{Hom}}$) is computed numerically. In our implementation, the integration (including the expectation) is computed by the MATLAB built-in numerical integration function integral, and then $h$ is solved by the MATLAB built-in root-finding function fzero. However, since the expectation in (4) for computing $h_{\mathrm{E}}^{\mathrm{Hom}}$ is taken with respect to $X$, a $(d+1)$-dimensional random vector, using numerical integration suffers from the curse of dimensionality if $d$ is large. In this case, one may use the Monte Carlo method to approximate the expectation.
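A possible SciPy analogue of this computation (our sketch, not the paper's code: `lhs` evaluates the left-hand side of (5) for a fixed quadratic form $V = x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x$ on a fixed Gauss-Legendre grid, and `solve_h` is the root-finding step; names, grid size, and brackets are ours):

```python
import numpy as np
from scipy import optimize, stats

def lhs(h, V, k, nu, n=200):
    """Left-hand side of (5) for fixed V; nu = n0*m - d - 1. The chi-squared
    integrals are approximated on a Gauss-Legendre grid over the bulk of
    the distribution."""
    chi2 = stats.chi2(nu)
    lo, hi = chi2.ppf(1e-9), chi2.ppf(1 - 1e-9)
    nodes, weights = np.polynomial.legendre.leggauss(n)
    t = 0.5 * (hi - lo) * nodes + 0.5 * (hi + lo)    # map nodes to [lo, hi]
    w = 0.5 * (hi - lo) * weights * chi2.pdf(t)      # fold in the eta weights
    inv = 1.0 / t
    arg = h / np.sqrt(nu * (inv[:, None] + inv[None, :]) * V)
    inner = (stats.norm.cdf(arg) * w[None, :]).sum(axis=1)  # integral over s
    return float(((inner ** (k - 1)) * w).sum())            # integral over t

def solve_h(V, k, nu, alpha):
    # lhs is increasing in h, so a bracketing root search applies.
    return optimize.brentq(lambda h: lhs(h, V, k, nu) - (1 - alpha), 1e-6, 100.0)

h = solve_h(V=0.5, k=3, nu=20, alpha=0.05)
```

As expected, a larger number of alternatives $k$ (or a smaller $\alpha$) forces a larger $h$.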

It is computationally easier to solve for $h_{\min}^{\mathrm{Hom}}$. Notice that the minimizer of the left-hand side of (5) is the same as the maximizer of $x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x$, since the function $\Phi$ is increasing. But this maximization problem is relatively easy to solve, as indicated by the following result.

###### Proposition 1.

If $\Theta$ is a non-empty bounded closed set and Assumption 2 holds, then

$$\max_{x\in\Theta}\, x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x = \max_{x\in\mathcal{E}(\Theta)}\, x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x,$$

where $\mathcal{E}(\Theta)$ is the set of all extreme points of the convex hull of $\Theta$.

###### Proof.

It is straightforward to see that $\mathcal{X}^\intercal\mathcal{X}$ is positive definite under Assumption 2. Hence, $(\mathcal{X}^\intercal\mathcal{X})^{-1}$ is positive definite as well, and thus $x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x$ is convex in $x$. Then, the result follows immediately from Theorem 32.2 and Corollary 32.3.3 in Rockafellar (1970). ∎

For instance, if $\Theta$ is a (continuous or discrete) hyper-rectangle of dimension $d$, we can simply evaluate the quadratic form at its $2^d$ corner points to find the maximum.
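For a concrete check (hypothetical design, $\Theta = [-1,1]^2$, augmented covariates $(1, u, v)$), the corner maximum indeed dominates a brute-force grid search over the interior:

```python
import numpy as np
from itertools import product

# Hypothetical 2^2 factorial design (m = 4 points, d = 2, augmented).
design = np.array([[1.0, -1.0, -1.0],
                   [1.0, -1.0,  1.0],
                   [1.0,  1.0, -1.0],
                   [1.0,  1.0,  1.0]])
A = np.linalg.inv(design.T @ design)

# The convex quadratic form x^T A x attains its maximum over the
# hyper-rectangle Theta at one of its corners (Proposition 1).
bounds = [(-1.0, 1.0), (-1.0, 1.0)]
corners = [np.array([1.0, *c]) for c in product(*bounds)]
corner_max = max(x @ A @ x for x in corners)

# Brute-force check: no grid point exceeds the corner maximum.
grid = [np.array([1.0, u, v]) for u in np.linspace(-1, 1, 21)
                              for v in np.linspace(-1, 1, 21)]
assert max(x @ A @ x for x in grid) <= corner_max + 1e-12
```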

### 3.3 Statistical Validity

We have the following statistical validity of Procedure FDHom.

###### Theorem 1.

Under Assumptions 1, 2, and 3, Procedure FDHom is statistically valid, i.e., $\mathrm{PCS}_{\mathrm{E}} \ge 1-\alpha$ if $h = h_{\mathrm{E}}^{\mathrm{Hom}}$ and $\mathrm{PCS}_{\min} \ge 1-\alpha$ if $h = h_{\min}^{\mathrm{Hom}}$.

In order to highlight the critical issue for establishing the finite-sample unconditional PCS guarantee in Theorem 1, we first revisit R&S problems with unknown variances and no covariates. Suppose that $Y_1, Y_2, \ldots$ are independent and identically distributed (i.i.d.) samples taken from an alternative. A typical two-stage R&S procedure uses the first-stage samples to determine a total sample size $N$ for the alternative and uses the overall sample mean $\bar Y(N)$ to rank the alternatives (see, for instance, Rinott (1978)). In order to establish a finite-sample PCS guarantee of the two-stage procedure, one needs to characterize the probability distribution of $\bar Y(N)$ explicitly. However, because $N$ is a random variable that is determined by the first-stage samples, this distribution is unknown and difficult to characterize in general.

A result due to Stein (1945) stipulates that, if $Y_1, Y_2, \ldots$ are i.i.d. normal random variables and $N$ depends on the first-stage samples only through the sample variance, then $\bar Y(N)$ conditionally has a normal distribution with known parameters. Consequently, this result is a cornerstone of finite-sample statistical validity of R&S procedures with unknown variances; see Dudewicz and Dalal (1975) and Rinott (1978) for early use of this result in designing two-stage R&S procedures, and Theorem 2 of Kim and Nelson (2006) for a rephrased version of the result. We now extend this result to the setting of R&S-C and state it in Lemma 1. We defer its proof to the Appendix but remark here that the assumption of the linear model is crucial.

###### Lemma 1.

Let $Y_\ell = \mathcal{X}\beta + \epsilon_\ell$, $\ell = 1, 2, \ldots$, where $\epsilon_\ell \sim \mathcal{N}(0_m, \sigma^2 I_m)$, with $0_m$ denoting the zero vector in $\mathbb{R}^m$ and $I_m$ the identity matrix in $\mathbb{R}^{m\times m}$. Assume that $\mathcal{X}^\intercal\mathcal{X}$ is nonsingular. Let $S^2$ be a random variable independent of $\{\hat\beta(n) : n \ge 1\}$, where $\hat\beta(n) \coloneqq \frac{1}{n}(\mathcal{X}^\intercal\mathcal{X})^{-1}\mathcal{X}^\intercal\sum_{\ell=1}^{n} Y_\ell$ and $Y_1, Y_2, \ldots$ are independent samples. Suppose that $N$ is an integer-valued function of $S^2$ and no other random variables. Let $\hat\beta \coloneqq \hat\beta(N)$. Then, for any $x$,

(i) $x^\intercal\hat\beta \mid S^2 \sim \mathcal{N}\bigl(x^\intercal\beta,\ \frac{\sigma^2}{N}\, x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x\bigr)$;

(ii) $\dfrac{x^\intercal\hat\beta - x^\intercal\beta}{\sigma\sqrt{N^{-1}\, x^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}x}}$ is independent of $S^2$ and has the standard normal distribution.

Notice that the PCS is bounded from below by the joint probability that the best alternative eliminates all the other alternatives, which is messy to characterize in general even if all alternatives are sampled independently. The following result due to Slepian (1962) provides a solution by bounding the joint probability from below through marginal probabilities.

###### Lemma 2 (Slepian 1962).

Suppose that $(Z_1,\ldots,Z_k)$ has a multivariate normal distribution. If $\mathrm{Cov}(Z_i, Z_j) \ge 0$ for all $i \ne j$, then for any constants $c_1,\ldots,c_k$,

$$\mathrm{P}\left(\bigcap_{i=1}^{k}\{Z_i \ge c_i\}\right) \ge \prod_{i=1}^{k}\mathrm{P}(Z_i \ge c_i).$$
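Slepian's bound can be verified numerically for $k = 2$ (our illustration, not from the paper; the joint upper-tail probability is computed from the bivariate normal cdf by inclusion-exclusion):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Bivariate normal with nonnegative correlation (hypothetical numbers).
rho, c1, c2 = 0.5, 0.3, -0.2
cov = np.array([[1.0, rho], [rho, 1.0]])

# P(Z1 >= c1, Z2 >= c2) via inclusion-exclusion on the joint cdf.
joint_cdf = multivariate_normal(mean=[0, 0], cov=cov).cdf([c1, c2])
p_joint = 1 - norm.cdf(c1) - norm.cdf(c2) + joint_cdf
p_prod = norm.sf(c1) * norm.sf(c2)

assert p_joint >= p_prod   # Slepian's inequality
```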

Now we are ready to prove Theorem 1.

###### Proof of Theorem 1.

Notice that under Assumptions 1, 2, and 3, $\hat\beta_i(n_0)$ and $S_i^2$ are estimators of $\beta_i$ and $\sigma_i^2$ based on $\{Y_{i\ell} : \ell = 1,\ldots,n_0\}$, respectively, for each $i$. By Theorem 7.6b in Rencher and Schaalje (2008), $\hat\beta_i(n_0)$ and $S_i^2$ are independent; moreover, $(n_0 m - d - 1)S_i^2/\sigma_i^2$ has the chi-squared distribution with $n_0 m - d - 1$ degrees of freedom. Therefore, $S_i^2$ is independent of $\hat\beta_i$. Obviously, $S_i^2$ is independent of $X$ as well. Since $N_i$ is an integer-valued function only of $S_i^2$, by Lemma 1,

$$X^\intercal\hat\beta_i \,\big|\, X, S_i^2 \;\sim\; \mathcal{N}\left(X^\intercal\beta_i,\ \frac{\sigma_i^2}{N_i}\, X^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}X\right), \quad i = 1,\ldots,k. \tag{6}$$

For notational simplicity, we let $V(X) \coloneqq X^\intercal(\mathcal{X}^\intercal\mathcal{X})^{-1}X$ and temporarily suppress the dependence of $i^*$ on $X$. Let $\Omega(X)$ be the set of alternatives outside the IZ given $X$. For each $i \in \Omega(X)$, $\hat\beta_i$ is independent of $\hat\beta_{i^*}$ given $X$. It then follows from (6) that

$$X^\intercal\hat\beta_{i^*} - X^\intercal\hat\beta_i \,\big|\, X, S_{i^*}^2, S_i^2 \;\sim\; \mathcal{N}\left(X^\intercal\beta_{i^*} - X^\intercal\beta_i,\ \bigl(\sigma_{i^*}^2/N_{i^*} + \sigma_i^2/N_i\bigr)V(X)\right). \tag{7}$$

Hence, letting $Z$ denote a standard normal random variable, for each $i \in \Omega(X)$,

$$\begin{aligned}
\mathrm{P}\bigl(X^\intercal\hat\beta_{i^*} - X^\intercal\hat\beta_i > 0 \,\big|\, X, S_{i^*}^2, S_i^2\bigr)
&= \mathrm{P}\left(Z > -\frac{X^\intercal\beta_{i^*} - X^\intercal\beta_i}{\sqrt{(\sigma_{i^*}^2/N_{i^*} + \sigma_i^2/N_i)\,V(X)}} \,\Bigg|\, X, S_{i^*}^2, S_i^2\right) \\
&\ge \mathrm{P}\left(Z > -\frac{\delta}{\sqrt{\bigl[\sigma_{i^*}^2\delta^2/(h^2 S_{i^*}^2) + \sigma_i^2\delta^2/(h^2 S_i^2)\bigr]\,V(X)}} \,\Bigg|\, X, S_{i^*}^2, S_i^2\right) \\
&= \Phi\left(\frac{h}{\sqrt{(n_0 m - d - 1)\bigl(\xi_{i^*}^{-1} + \xi_i^{-1}\bigr)\,V(X)}}\right),
\end{aligned} \tag{8}$$

where the inequality follows from the definitions of $\Omega(X)$ and $N_i$, and the last equality follows from the definition of $\xi_i \coloneqq (n_0 m - d - 1)S_i^2/\sigma_i^2$.

Then, conditionally on $X$, by the definition (1) the GS event must occur if alternative $i^*$ eliminates all alternatives in $\Omega(X)$. Thus,

$$\begin{aligned}
\mathrm{PCS}(X) &\ge \mathrm{P}\left(\bigcap_{i\in\Omega(X)}\bigl\{X^\intercal\hat\beta_{i^*} - X^\intercal\hat\beta_i > 0\bigr\} \,\Bigg|\, X\right) \\
&= \mathrm{E}\left[\mathrm{P}\left(\bigcap_{i\in\Omega(X)}\bigl\{X^\intercal\hat\beta_{i^*} - X^\intercal\hat\beta_i > 0\bigr\} \,\Bigg|\, X, S_{i^*}^2, \{S_i^2 : i\in\Omega(X)\}\right) \Bigg|\, X\right],
\end{aligned} \tag{9}$$

where the equality is due to the tower law of conditional expectation. Notice that conditionally on , is multivariate normal by (7). Moreover, for and , due to the conditional independence between and ,

$$\mathrm{Cov}\big(X^{\top}\hat{\beta}_{i^*}-X^{\top}\hat{\beta}_i,\ X^{\top}\hat{\beta}_{i^*}-X^{\top}\hat{\beta}_{i'}\,\big|\,X,S_{i^*}^2,\{S_i^2:i\in\Omega(X)\}\big)=\mathrm{Var}\big(X^{\top}\hat{\beta}_{i^*}\,\big|\,X,S_{i^*}^2\big)>0.$$

Therefore, applying (9) and Lemma 2,

$$\begin{aligned}
\mathrm{PCS}(X)&\ge\mathbb{E}\left[\prod_{i\in\Omega(X)}\mathbb{P}\big(X^{\top}\hat{\beta}_{i^*}-X^{\top}\hat{\beta}_i>0\,\big|\,X,S_{i^*}^2,S_i^2\big)\,\Bigg|\,X\right]\\
&\ge\int_0^{\infty}\left[\int_0^{\infty}\Phi\left(\frac{h}{\sqrt{(n_0m-d-1)(t^{-1}+s^{-1})V(X)}}\right)\eta(s)\,ds\right]^{|\Omega(X)|}\eta(t)\,dt,
\end{aligned} \tag{10}$$

where the second inequality follows from (8). Since $\Phi(\cdot)\le 1$ and $\eta$ is a pdf, the integral inside the square brackets in (10) is no greater than 1. Moreover, since $|\Omega(X)|\le k-1$,

$$\mathrm{PCS}(X)\ \ge\ \int_0^{\infty}\left[\int_0^{\infty}\Phi\left(\frac{h}{\sqrt{(n_0m-d-1)(t^{-1}+s^{-1})V(X)}}\right)\eta(s)\,ds\right]^{k-1}\eta(t)\,dt.$$

Then, it follows immediately from (2), the definition of $\mathrm{PCS_E}$, and (4), the definition of $\mathrm{PCS_{\min}}$, that $\mathrm{PCS_E}\ge 1-\alpha$ if $h=h_{\mathrm{E}}$. Likewise, $\mathrm{PCS_{\min}}\ge 1-\alpha$ if $h=h_{\min}$. ∎
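From an implementation standpoint, the constant $h$ can be computed by evaluating the double-integral lower bound numerically and root-finding for the target $1-\alpha$. The sketch below is our own illustration, not the paper's code: the degrees of freedom $\nu=n_0m-d-1=20$, the value $V(X)=0.05$, $k=5$, and $\alpha=0.05$ are arbitrary, and we use simple midpoint-rule quadrature with bisection.

```python
import math

def chi2_pdf(t, nu):
    # pdf of the chi-squared distribution with nu degrees of freedom
    return t**(nu/2 - 1) * math.exp(-t/2) / (2**(nu/2) * math.gamma(nu/2))

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def pcs_lower_bound(h, nu, V, k, t_max=120.0, n=400):
    """Midpoint-rule evaluation of the bound
    int [ int Phi(h / sqrt(nu (1/t + 1/s) V)) eta(s) ds ]^(k-1) eta(t) dt,
    with eta the chi2(nu) pdf (eta puts negligible mass beyond t_max)."""
    dt = t_max / n
    ts = [(i + 0.5) * dt for i in range(n)]
    w = [chi2_pdf(t, nu) * dt for t in ts]          # quadrature weights
    total = 0.0
    for t, wt in zip(ts, w):
        inner = sum(Phi(h / math.sqrt(nu * (1/t + 1/s) * V)) * ws
                    for s, ws in zip(ts, w))
        total += inner**(k - 1) * wt
    return total

def solve_h(nu, V, k, alpha, lo=1e-3, hi=50.0, tol=1e-4):
    # bisection: the bound is increasing in h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pcs_lower_bound(mid, nu, V, k) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

h = solve_h(nu=20, V=0.05, k=5, alpha=0.05)
print(h)
```

Because the bound is monotone in $h$, bisection converges reliably; in practice one would use an adaptive quadrature routine rather than a fixed grid.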

## 4 Heteroscedastic Sampling Errors

In this section we drop Assumption 3 and consider heteroscedastic sampling errors, a more general case. We develop a two-stage procedure for the R&S-C problem with fixed design and heteroscedasticity and call it Procedure FDHet. For simplicity, we use $\chi^2_{\nu}$ to denote the chi-squared distribution with $\nu$ degrees of freedom.

###### Procedure FDHet
1. Setup: Specify the form of unconditional PCS (either $\mathrm{PCS_E}$ or $\mathrm{PCS_{\min}}$), the target unconditional PCS $1-\alpha$, the IZ parameter $\delta$, the first-stage sample size $n_0$, the number of design points $m$, and the design matrix $\mathbb{X}$. Set $h=h_{\mathrm{Het,E}}$ if $\mathrm{PCS_E}$ is used, and $h=h_{\mathrm{Het,\min}}$ if $\mathrm{PCS_{\min}}$ is used, where the constants $h_{\mathrm{Het,E}}$ and $h_{\mathrm{Het,\min}}$ respectively satisfy the following equations

$$\mathbb{E}\left\{\int_0^{\infty}\left[\int_0^{\infty}\Phi\left(\frac{h_{\mathrm{Het,E}}}{\sqrt{(n_0-1)(t^{-1}+s^{-1})\,X^{\top}(\mathbb{X}^{\top}\mathbb{X})^{-1}X}}\right)\gamma_{(1)}(s)\,ds\right]^{k-1}\gamma_{(1)}(t)\,dt\right\}=1-\alpha, \tag{11}$$

where the expectation is taken with respect to the distribution of $X$, and

$$\inf_{x\in\Theta}\int_0^{\infty}\left[\int_0^{\infty}\Phi\left(\frac{h_{\mathrm{Het,\min}}}{\sqrt{(n_0-1)(t^{-1}+s^{-1})\,x^{\top}(\mathbb{X}^{\top}\mathbb{X})^{-1}x}}\right)\gamma_{(1)}(s)\,ds\right]^{k-1}\gamma_{(1)}(t)\,dt=1-\alpha, \tag{12}$$

where $\gamma_{(1)}$ is the pdf of the smallest order statistic of $m$ i.i.d. $\chi^2_{n_0-1}$ random variables, i.e.,

$$\gamma_{(1)}(t)=m\,\gamma(t)\,\big(1-\Gamma(t)\big)^{m-1},$$

with $\gamma$ and $\Gamma$ denoting the pdf and cdf of the $\chi^2_{n_0-1}$ distribution, respectively.

2. First-stage Sampling: Take $n_0$ independent samples of each alternative $i$ at each design point $x_j$, and denote them by $Y_{i\ell}(x_j)$, $\ell=1,\ldots,n_0$, $j=1,\ldots,m$. For each $i$ and $j$, set $S_{ij}^2=\frac{1}{n_0-1}\sum_{\ell=1}^{n_0}\big(Y_{i\ell}(x_j)-\bar{Y}_{ij}\big)^2$, where $\bar{Y}_{ij}=\frac{1}{n_0}\sum_{\ell=1}^{n_0}Y_{i\ell}(x_j)$.

3. Second-stage Sampling: Compute the total sample size $N_{ij}=\max\big\{n_0,\lceil h^2S_{ij}^2/\delta^2\rceil\big\}$ for each $i$ and $j$. Take $N_{ij}-n_0$ additional independent samples of alternative $i$ at design point $x_j$.

4. Selection: For each alternative $i$, compute the overall estimate of its unknown coefficients $\hat{\beta}_i=(\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\hat{Y}_i$, where $\hat{Y}_i=(\hat{Y}_{i1},\ldots,\hat{Y}_{im})^{\top}$ and

$$\hat{Y}_{ij}=\frac{1}{N_{ij}}\sum_{\ell=1}^{N_{ij}}Y_{i\ell}(x_j).$$

Return $\hat{i}^*(x)\in\arg\max_{1\le i\le k}x^{\top}\hat{\beta}_i$ as the decision rule.
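The order-statistic pdf $\gamma_{(1)}$ used in the Setup step can be validated against simulation. A minimal sketch of our own (the values $n_0=10$, $m=4$, and the checkpoint $t_0=5$ are arbitrary) compares the cdf implied by $\gamma_{(1)}$, namely $1-(1-\Gamma(t))^m$, with the empirical cdf of the minimum of $m$ i.i.d. $\chi^2_{n_0-1}$ draws:

```python
import math
import random

random.seed(42)
nu, m, n_rep = 9, 4, 100_000   # nu = n0 - 1 with n0 = 10; m design points

def chi2_cdf(t, nu, terms=200):
    """chi2(nu) cdf via the series for the regularized lower incomplete gamma."""
    a, x = nu / 2, t / 2
    term = total = 1.0 / a
    for j in range(1, terms):
        term *= x / (a + j)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

# cdf of the smallest order statistic implied by gamma_(1):
# P(min <= t) = 1 - (1 - Gamma(t))^m
t0 = 5.0
theoretical = 1 - (1 - chi2_cdf(t0, nu)) ** m

# empirical counterpart: chi2(nu) equals Gamma(shape nu/2, scale 2)
count = 0
for _ in range(n_rep):
    smallest = min(random.gammavariate(nu / 2, 2) for _ in range(m))
    count += smallest <= t0
empirical = count / n_rep

print(theoretical, empirical)
assert abs(theoretical - empirical) < 0.01
```

The two values agree to within Monte Carlo error, confirming the formula for $\gamma_{(1)}$ used in computing $h_{\mathrm{Het,E}}$ and $h_{\mathrm{Het,\min}}$.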

###### Remark 4.

From the implementation point of view, the constants $h_{\mathrm{Het,E}}$ and $h_{\mathrm{Het,\min}}$ can be computed in a way similar to that for $h_{\mathrm{E}}$ and $h_{\min}$; see the discussion in §3.2.

A noticeable difference in Procedure FDHet relative to Procedure FDHom is the involvement of the smallest order statistic, which is introduced for computational feasibility. Without it, the equations for computing the constant $h$ would involve high-dimensional numerical integration, with dimension growing in the number of design points $m$, which becomes prohibitively difficult to solve even for moderate $m$. See Remark 5 in the Appendix for details.

The following theorem establishes the statistical validity of Procedure FDHet. Its proof is similar to that of Theorem 1, albeit technically more involved, so we defer it to the Appendix; we remark here that the proof relies critically on a generalized version of Lemma 1, which is stated and proved as Lemma 3 in the Appendix.

###### Theorem 2.

Under Assumptions 1 and 2, Procedure FDHet is statistically valid, i.e., $\mathrm{PCS_E}\ge 1-\alpha$ if $h=h_{\mathrm{Het,E}}$ and $\mathrm{PCS_{\min}}\ge 1-\alpha$ if $h=h_{\mathrm{Het,\min}}$.

Clearly, the assumption of homoscedasticity yields more analytical and computational tractability than the assumption of heteroscedasticity. However, if Procedure FDHom is used in the presence of heteroscedastic sampling errors, it may fail to deliver the desired unconditional PCS guarantee. An intuitive explanation is that using a single variance estimate for all the design points may underestimate the variance at some design points, leading to insufficient sampling effort there. On the other hand, Procedure FDHet may behave in an overly conservative manner in the presence of homoscedastic sampling errors. This is because Procedure FDHet demands a separate variance estimate at each design point, which amounts to repeatedly estimating the common variance in the homoscedastic setting, resulting in excessive samples. The conservativeness is also attributable to the fact that using the smallest order statistic in Procedure FDHet further loosens the lower bound of the unconditional PCS. These trade-offs will be revealed clearly in the numerical experiments in §6.

The above discussion provides a rule of thumb for choosing between the procedures in practice. Procedure FDHom may be preferred if the problem of interest has approximately homoscedastic sampling errors, or if the decision maker can tolerate some underachievement relative to the desired unconditional PCS. On the other hand, Procedure FDHet may be a better choice if the sampling errors are notably heteroscedastic or if the decision maker insists on delivering the unconditional PCS guarantee.

## 5 Least Favorable Configuration

For R&S problems, many selection procedures are designed by analyzing the so-called least favorable configuration (LFC) of the means, which, given the variance and the number of samples of each alternative, yields the greatest lower bound of PCS amongst all possible configurations of the means (Bechhofer 1954). If a selection procedure can meet the target PCS for the LFC, it can certainly meet the same target for all configurations. It is well known that under the IZ formulation, the LFC for R&S problems is the slippage configuration (SC) for many selection procedures (Gupta and Miescke 1982). The SC is such that there exists a unique best alternative and all the other alternatives have equal means which differ from the best by exactly the IZ parameter.

In this section, we generalize the SC to the R&S-C setting and define the generalized slippage configuration (GSC) as follows:

$$\beta_{10}-\delta=\beta_{i0},\qquad \beta_{1l}=\beta_{il},\qquad i=2,\ldots,k,\ \ l=1,\ldots,d. \tag{13}$$

In the degenerate case where $\beta_{il}=0$ for all $i=1,\ldots,k$ and $l=1,\ldots,d$, we have $x^{\top}\beta_i=\beta_{i0}$, so the mean performance of an alternative is independent of the covariates and the GSC reduces to the SC. In the general case, for any $x\in\Theta$,

$$x^{\top}\beta_1-x^{\top}\beta_i=\beta_{10}-\beta_{i0}=\delta,\qquad i=2,\ldots,k. \tag{14}$$

Hence, with the GSC, the best alternative is the same for all $x\in\Theta$, and the other alternatives have equal mean performances. Geometrically, the GSC means that the hyperplanes formed by the mean performances of the inferior alternatives are identical and are parallel to the hyperplane that corresponds to the best alternative. Moreover, the “distance” between the two parallel hyperplanes is $\delta$; see Figure 1 for an illustration.
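The GSC is straightforward to instantiate numerically. The following sketch (our own toy values for $k$, $d$, and $\delta$) builds a configuration satisfying (13) and confirms that the mean-performance gap in (14) equals $\delta$ for every sampled covariate vector:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, delta = 4, 2, 0.5

# Common slope coefficients; intercepts of the inferior alternatives sit
# exactly delta below that of the best alternative, as in (13).
slope = rng.normal(size=d)
beta_best = np.concatenate(([1.0], slope))
betas = [beta_best] + [np.concatenate(([1.0 - delta], slope))
                       for _ in range(k - 1)]

# Under the GSC the best alternative is the same for every x, and the
# gap x'beta_1 - x'beta_i equals delta exactly, cf. (14).
for _ in range(100):
    x = np.concatenate(([1.0], rng.normal(size=d)))  # x = (1, x_1, ..., x_d)
    gaps = [x @ betas[0] - x @ b for b in betas[1:]]
    assert all(abs(g - delta) < 1e-9 for g in gaps)
print("gap equals delta for all sampled x")
```

Such a configuration is exactly the stress-test instance suggested by Theorem 3 below: the inferior alternatives are as close to the best as the IZ formulation allows, uniformly in $x$.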

It turns out that the GSC defined in (13) is the LFC for a family of selection procedures, including Procedure FDHom and Procedure FDHet, for R&S-C problems under the IZ formulation, as stated in the following Theorem 3 and subsequent Corollary 1. These results not only deepen our understanding of the structure of R&S-C problems, but also help us design numerical experiments to serve as a stress test for proposed selection procedures, and can even potentially guide future development of more efficient selection procedures.

###### Theorem 3.

Let $N_{ij}$ denote the number of samples of alternative $i$ taken at design point $x_j$, and $\bar{Y}_{ij}$ denote their mean, for $i=1,\ldots,k$, $j=1,\ldots,m$. Let $\bar{Y}_i=(\bar{Y}_{i1},\ldots,\bar{Y}_{im})^{\top}$ and $\hat{\beta}_i=(\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\bar{Y}_i$ for $i=1,\ldots,k$. Under Assumptions 1 and 2, the GSC defined in (13) is the LFC for a selection procedure of the R&S-C problem with the IZ formulation and a fixed design, if all the following properties hold:

(i) The selected alternative is $\hat{i}^*(x)\in\arg\max_{1\le i\le k}x^{\top}\hat{\beta}_i$.

(ii) Conditionally on $\{N_{ij}:1\le i\le k,\,1\le j\le m\}$, $x^{\top}\hat{\beta}_i$ is normally distributed for all $i=1,\ldots,k$, and $\hat{\beta}_i$ is independent of $\hat{\beta}_{i'}$ if $i\ne i'$.

(iii) $N_{ij}$ is independent of the configuration of the means, for all $i=1,\ldots,k$, $j=1,\ldots,m$.

###### Proof.

Suppose that the configuration of the means $\beta=(\beta_1,\ldots,\beta_k)$ follows the GSC. Then alternative 1 is the best for every $x\in\Theta$, and by Property (i),

$$\begin{aligned}
\mathrm{PCS}(x;\beta)&=\mathbb{P}\big(x^{\top}\hat{\beta}_1-x^{\top}\hat{\beta}_i>0,\ \forall i=2,\ldots,k\big)\\
&=\mathbb{E}\big[\mathbb{P}\big(x^{\top}\hat{\beta}_1-x^{\top}\hat{\beta}_i>0,\ \forall i=2,\ldots,k\,\big|\,\{N_{ij}:1\le i\le k,\,1\le j\le m\}\big)\big],
\end{aligned} \tag{15}$$

where the expectation is taken with respect to the $N_{ij}$'s, and we write $\mathrm{PCS}(x;\beta)$ to stress its dependence on $\beta$ since we will consider a different configuration of the means later.

By Property (ii), conditionally on $\{N_{ij}:1\le i\le k,\,1\le j\le m\}$, $\hat{\beta}_i$ is independent of $\hat{\beta}_{i'}$ for $i\ne i'$; moreover,

$$x^{\top}\hat{\beta}_i\,\big|\,\{N_{ij}:1\le i\le k,\,1\le j\le m\}\ \sim\ N\big(x^{\top}\beta_i,\ \tilde{\sigma}^2(x,\Sigma_i)\big),$$

where $\tilde{\sigma}^2(x,\Sigma_i)$ denotes the conditional variance of $x^{\top}\hat{\beta}_i$ and $\Sigma_i$ collects the conditional sampling variances of the $\bar{Y}_{ij}$'s. In particular, $\tilde{\sigma}^2(x,\Sigma_i)$ does not depend on the configuration of the means by Property (iii). Hence, letting $\phi(\,\cdot\,;\mu,\sigma^2)$ denote the pdf of the $N(\mu,\sigma^2)$ distribution, it follows from (15) that

$$\mathrm{PCS}(x;\beta)=\mathbb{E}\left[\int_{-\infty}^{+\infty}\prod_{i=2}^{k}\Phi\left(\frac{t-x^{\top}\beta_i}{\tilde{\sigma}(x,\Sigma_i)}\right)\phi\big(t;x^{\top}\beta_1,\tilde{\sigma}^2(x,\Sigma_1)\big)\,dt\right]. \tag{16}$$

We now consider a different configuration of the means, $\beta^{\dagger}=(\beta_1^{\dagger},\ldots,\beta_k^{\dagger})$. We will show below that $\mathrm{PCS}(x;\beta^{\dagger})\ge\mathrm{PCS}(x;\beta)$ for all $x\in\Theta$. For each $i=1,\ldots,k$, we define sets $\Theta_i^{(1)}$ and $\Theta_i^{(2)}$ as follows

$$\begin{aligned}
\Theta_i^{(1)}&=\big\{x\in\Theta:\ x^{\top}\beta_i^{\dagger}-x^{\top}\beta_j^{\dagger}\ge\delta\ \text{for all}\ j\ne i\big\},\\
\Theta_i^{(2)}&=\big\{x\in\Theta:\ x^{\top}\beta_i^{\dagger}-x^{\top}\beta_j^{\dagger}\ge 0\ \text{for all}\ j\ne i,\ \text{and}\ x^{\top}\beta_i^{\dagger}-x^{\top}\beta_j^{\dagger}<\delta\ \text{for some}\ j\ne i\big\},
\end{aligned}$$