Survey Bandits with Regret Guarantees

02/23/2020 ∙ by Sanath Kumar Krishnamurthy, et al.

We consider a variant of the contextual bandit problem. In standard contextual bandits, when a user arrives we get the user's complete feature vector and then assign a treatment (arm) to that user. In a number of applications (like healthcare), collecting features from users can be costly. To address this issue, we propose algorithms that avoid needless feature collection while maintaining strong regret guarantees.


1 Introduction

In a number of applications, like healthcare and mobile health, collecting features from users can be costly. This encourages us to develop variants of contextual bandits that at any time-step select a set of features to be collected from users, and then select an action based on these observed features. We use the term survey bandits to refer to this set-up. Through this formulation, we address the issue of needless feature collection in contextual bandits.

Suppose we are building a system to recommend charities that users can donate to. Since there are many charities we could recommend, it is efficient to make personalized recommendations. This requires us to collect features from users, which can be done by asking users to fill out a survey or questionnaire. The reward at any time-step might be the amount donated. As usual, our goal is to minimize regret. Beyond regret, we would also like to improve the user experience by shortening the survey we ask users to fill out. We try to answer the following question: can we ensure strong regret guarantees while shortening our survey over time?

Our contributions: We answer this question in the affirmative. We start by considering zero-shot surveys, where the decision maker must decide the set of features to query at every time-step before the user arrives, and then make a personalized decision based on the responses. In addition to the standard assumptions made for LinUCB, we introduce Assumption 1 (beta-min), which is common in the feature selection literature. We propose algorithms that are natural variants of LinUCB (Li et al., 2010) for the survey bandit framework. Under our assumptions, we prove regret guarantees for these algorithms. At the same time, these algorithms exploit sparsity in the arm parameters to reduce the set of features collected from users.

In fact, our algorithm Ridge SurveyUCB has regret guarantees that are tight even for standard contextual bandits. We also provide an algorithm, Elastic Net SurveyUCB, which is more robust to Assumption 1 but has a weaker regret guarantee. This result requires us to prove a new adaptive tail bound for the elastic net estimator, which may be of independent interest.

Through simulations, we demonstrate that both algorithms perform well in terms of regret when Assumption 1 holds. Unfortunately, Ridge SurveyUCB can perform poorly on regret when Assumption 1 is violated, whereas Elastic Net SurveyUCB performs well even in that case. It is worth noting that we can still use Ridge SurveyUCB; we just need a conservative choice of the beta-min parameter (in Assumption 1). Even for conservative choices of the beta-min parameter, we eventually see benefits in survey length.

Through simulations, we also see that both algorithms yield savings in terms of survey length. In fact, in the absence of sub-optimal arms, both algorithms always remove from the survey the features that are not relevant for reward prediction.

We also consider settings with interactive surveys, where at each time-step, before making a personalized decision, the decision maker can continually expand the set of queries based on the user's previous responses at that time-step. This allows us to start by querying a smaller set of features and to expand the survey for a user only if doing so is needed to make a better personalized decision. We develop variants of our algorithms that use interactive surveys to ensure lower survey lengths, especially in the presence of sub-optimal arms, while maintaining the same regret performance.

Related work: Prior work (e.g. (Abbasi-Yadkori et al., 2012) and (Bastani and Bayati, 2015)) has also exploited sparsity in arm parameters to provide stronger sparsity-dependent regret guarantees. When an upper bound on the sparsity of the arm parameters is known to the decision maker, Abbasi-Yadkori et al. (2012) provide algorithms with tight regret guarantees for contextual bandits. Under distributional assumptions on user covariates, Bastani and Bayati (2015) provide stronger regret guarantees, even when no sparsity parameter is provided to the algorithm. While these regret guarantees are stronger than the ones we provide, it is important to note that we make no distributional assumptions on user covariates and do not assume that the decision maker knows an upper bound on the sparsity of the arm parameters.

Bouneffouf et al. (2017) is the paper most closely related to our work. They develop interesting algorithms for a very similar setup and evaluate them empirically. At every time-step, their algorithms query a fixed number of features, and this number must be provided as an input to the algorithm. For a conservative choice of this parameter, we would see little benefit in survey length; unfortunately, their algorithms can empirically perform much worse than standard contextual bandits for small choices of it. We view our work as an alternative approach, with guarantees on regret.

2 Preliminaries

2.1 Problem Setting

Let $T$ denote the number of time-steps. At each time-step, a new user arrives. The decision maker has a survey with $d$ questions and can ask a user any subset of these questions. At any time-step $t$, the user comes with a set of observable covariates $x_t \in \mathbb{R}^d$. Here $x_t$ is the $d$-vector corresponding to the $t$-th user's answers to the $d$ questions on the survey.

The decision maker has access to $K$ arms (decisions). Each arm yields a noisy, user-specific reward. In particular, each arm $a$ has an unknown parameter $\beta_a \in \mathbb{R}^d$. At time $t$, pulling arm $a$ yields the reward $r_{t,a} = x_t^\top \beta_a + \epsilon_{t,a}$, where the noise terms $\epsilon_{t,a}$ form an independent sequence of $\sigma$-sub-Gaussian random variables. Note that at any time-step, the decision maker only observes the reward of the arm that was pulled.
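For concreteness, the following Python sketch simulates this reward model. The class and parameter names (SurveyBanditEnv, n_features, and so on) and the Gaussian noise are illustrative assumptions, not the paper's code; any $\sigma$-sub-Gaussian noise fits the model.

import numpy as np

class SurveyBanditEnv:
    """Toy survey-bandit environment: K arms with sparse linear rewards."""

    def __init__(self, n_features=20, n_arms=3, sparsity=4, noise_sigma=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.sigma = noise_sigma
        self.d, self.K = n_features, n_arms
        # Each arm's parameter beta_a is supported on a small random subset of features.
        self.beta = np.zeros((n_arms, n_features))
        for a in range(n_arms):
            support = self.rng.choice(n_features, size=sparsity, replace=False)
            self.beta[a, support] = self.rng.uniform(0.5, 1.0, size=sparsity)  # respects a beta-min gap

    def next_user(self):
        # Full covariate vector x_t; the learner only ever sees the coordinates it queries.
        self.x = self.rng.uniform(-1.0, 1.0, size=self.d)
        return self.x

    def answer(self, queried):
        # Reveal only the queried survey questions.
        return {j: self.x[j] for j in queried}

    def pull(self, arm):
        # Noisy linear reward r_{t,a} = x_t . beta_a + eps (Gaussian noise is sub-Gaussian).
        return float(self.x @ self.beta[arm] + self.rng.normal(0.0, self.sigma))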

Some notation: For any vector $x$ and any index set $S \subseteq [d]$, we let $x_S$ denote the vector obtained by setting the coordinates of $x$ that are not in $S$ to zero. For any matrix $M$ and any index set $S$, we let $M_S$ denote the matrix obtained by setting the rows and columns of $M$ that do not correspond to an index in $S$ to zero. For any vector $\beta$, we let $\mathrm{supp}(\beta)$ denote its support, i.e. the set of coordinates on which $\beta$ is non-zero. And for any set $S$, we let $|S|$ denote its cardinality.
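These restriction operations are just coordinate masking. A minimal NumPy sketch (the helper names are ours, chosen for illustration):

import numpy as np

def restrict_vector(x, S):
    """Return x_S: zero out the coordinates of x that are not in the index set S."""
    out = np.zeros_like(x)
    idx = list(S)
    out[idx] = x[idx]
    return out

def restrict_matrix(M, S):
    """Return M_S: zero out rows and columns of M whose index is not in S."""
    mask = np.zeros(M.shape[0], dtype=bool)
    mask[list(S)] = True
    return np.where(np.outer(mask, mask), M, 0.0)

def support(beta, tol=0.0):
    """Return supp(beta): the set of coordinates with |beta_j| > tol."""
    return set(np.flatnonzero(np.abs(beta) > tol))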

Goal: The goal is to design a sequential decision-making policy $\pi$ that maximizes expected cumulative reward and, subject to strong regret guarantees, minimizes the total number of questions asked of users. Let $Q_t^\pi$ denote the subset of survey questions queried by policy $\pi$ at time $t$, and let $a_t^\pi$ denote the arm chosen by policy $\pi$ at time $t$. Note that we do not observe the full vector $x_t$ and only observe $(x_t)_{Q_t^\pi}$; hence the arm must be chosen using only the observed covariates and the data collected at previous time-steps.

Target policy: We now describe a "sensible" target policy. Consider a target policy that already knows the arm parameters $\beta_a$ but not the noise. We want the target policy to maximize expected cumulative reward, and hence at any time-step $t$ the policy must pick an arm $a_t^* \in \arg\max_{a} x_t^\top \beta_a$. Therefore, the target policy only needs to query the features that influence arm rewards, that is, the set

$S^* = \bigcup_{a} \mathrm{supp}(\beta_a) .$

Note that for any vector $x$ and any arm $a$, we have $x^\top \beta_a = (x_{S^*})^\top \beta_a$. Therefore, at any time-step $t$, having observed $(x_t)_{S^*}$, the target policy is able to choose the best arm for the covariate $x_t$:

$a_t^* \in \arg\max_{a} \, (x_t)_{S^*}^\top \beta_a .$

Regret: If $a_t$ is the arm chosen at time $t$, we define the expected regret at time $t$ as the difference between the maximum expected reward and the expected reward of arm $a_t$ at time $t$, i.e. $\mathrm{reg}_t = \max_{a} x_t^\top \beta_a - x_t^\top \beta_{a_t}$. And we define the cumulative expected regret as $R_T = \sum_{t=1}^{T} \mathrm{reg}_t$.

Additional notation: Let $X$ denote the design matrix whose rows are $x_1^\top, \dots, x_T^\top$. For each arm $a$, let $Y_a$ denote the vector of observations $(r_{1,a}, \dots, r_{T,a})$, where entries of $Y_a$ may be missing (if arm $a$ was not played at time $t$, then the $t$-th coordinate of $Y_a$ is missing). For all arms $a$ and any time $t$, define the sample set $\mathcal{T}_{t,a} = \{s \le t : a_s = a\}$, and let $n_{t,a} = |\mathcal{T}_{t,a}|$ denote the number of times arm $a$ was pulled up to time $t$. For any set of time-steps $\mathcal{T}$, we let $X_{\mathcal{T}}$ be the submatrix of $X$ whose rows are $x_s^\top$ for each $s \in \mathcal{T}$. Similarly, when $\mathcal{T} \subseteq \mathcal{T}_{t,a}$, we let $Y_{a,\mathcal{T}}$ be the $|\mathcal{T}|$-vector whose coordinates are $r_{s,a}$ for all $s \in \mathcal{T}$ (since arm $a$ was played at every $s \in \mathcal{T}$, the vector $Y_{a,\mathcal{T}}$ has no missing values). Also let $I$ denote the identity matrix.

2.2 Assumptions

We make two assumptions. The first allows us to ignore features that have only a small influence on arm rewards: we assume that such features in fact have no influence on arm rewards.

Assumption 1 (Beta-min).

The decision maker knows a parameter $\beta_{\min} > 0$ such that for all arms $a$ and all coordinates $j \in [d]$, either $\beta_{a,j} = 0$ or $|\beta_{a,j}| \ge \beta_{\min}$.

The following is a common assumption in contextual multi-armed bandit problems; it is equivalent to assuming that expected rewards are bounded.

Assumption 2 (Bounded rewards).

We make the following assumptions to ensure that the expected reward of any arm for any context is bounded: for all arms $a$ and time-steps $t$, we have $\|x_t\|_2 \le x_{\max}$ and $\|\beta_a\|_2 \le b$.
For simplicity, we further assume that for all $t$ and $a$, the potential reward of pulling arm $a$ lies in a known bounded interval. (This additional assumption can be avoided; we would just need to choose the radius parameter in Definition 2 appropriately for the regret guarantee to work out.)

Since $\|x_S\|_2 \le \|x\|_2$ for any vector $x$ and any index set $S$, Assumption 2 gives us that the $\ell_2$-norms of $(x_t)_S$ and $(\beta_a)_S$ are bounded by $x_{\max}$ and $b$ respectively, for all index sets $S$, time-steps $t$, and arms $a$.

3 UCB for Survey Bandits

In this section, we describe a natural extension of LinUCB (Li et al., 2010) to survey bandits. This involves specifying a policy for selecting survey questions, together with a consistent arm-selection policy that chooses an arm given only the observed covariates.

1:  Initialize confidence sets $C_{0,a}$ for all arms $a$. (See section 4.1)
2:  for $t = 1, 2, \dots, T$ do
3:     Let $S_{t-1,a} = \mathrm{supp}(C_{t-1,a})$ for all arms $a$.
4:     Query $Q_t = \bigcup_{a} S_{t-1,a}$, and observe $(x_t)_{Q_t}$.
5:     Pick $a_t \in \arg\max_{a} \max_{\beta \in C_{t-1,a}} (x_t)_{Q_t}^\top \beta$.
6:     Play arm $a_t$ and observe reward $r_{t,a_t}$.
7:     Set $C_{t,a} = C_{t-1,a}$ for all arms $a \ne a_t$.
8:     Update $C_{t,a_t}$ (using AlgConfidence).
9:  end for
Algorithm 1 SurveyUCB
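The following Python sketch mirrors the control flow of Algorithm 1 in the notation above. The confidence-set object and its support / ucb / update methods are hypothetical placeholders for the Standard form sets and AlgConfidence; this is an illustration of the loop structure, not the paper's implementation.

def survey_ucb(env, confidence_sets, T):
    """confidence_sets[a] must expose: support() -> set of features,
    ucb(x_obs) -> optimistic reward estimate, update(x_obs, reward) -> None."""
    history = []
    for t in range(1, T + 1):
        env.next_user()
        # Lines 3-4: query the union of the supports of all confidence sets.
        queried = set().union(*(C.support() for C in confidence_sets))
        answers = env.answer(queried)          # observed coordinates of x_t
        # Line 5: optimistic arm choice based only on the observed covariates.
        arm = max(range(len(confidence_sets)),
                  key=lambda a: confidence_sets[a].ucb(answers))
        # Line 6: play the arm and observe its reward.
        reward = env.pull(arm)
        # Lines 7-8: only the pulled arm's confidence set is updated.
        confidence_sets[arm].update(answers, reward)
        history.append((len(queried), arm, reward))
    return history

Any confidence-set class with this interface can be plugged in; sections 7 and 8 correspond to different choices of the regression step inside update.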

Upper confidence bound (UCB) algorithms follow the principle of optimism in the face of uncertainty. The essential idea is to construct high-probability confidence sets $C_{t-1,a}$ for the parameter $\beta_a$ of every arm $a$, from the covariates and rewards observed up to time $t-1$. That is, with high probability, $\beta_a \in C_{t-1,a}$ for all time-steps $t$ and for all arms $a$. The algorithm then queries the set of features:

$Q_t = \bigcup_{a} \mathrm{supp}(C_{t-1,a}), \qquad \text{where } \mathrm{supp}(C_{t-1,a}) = \bigcup_{\beta \in C_{t-1,a}} \mathrm{supp}(\beta) .$

Therefore, we have that $\mathrm{supp}(\beta) \subseteq Q_t$ for every $\beta \in C_{t-1,a}$. Hence, for any arm $a$ and any $\beta \in C_{t-1,a}$, we have that $x_t^\top \beta = (x_t)_{Q_t}^\top \beta$. Therefore, it follows that:

$\max_{\beta \in C_{t-1,a}} x_t^\top \beta \;=\; \max_{\beta \in C_{t-1,a}} (x_t)_{Q_t}^\top \beta \qquad \text{for every arm } a .$

The algorithm chooses an optimistic estimate $\tilde\beta_{t,a} \in \arg\max_{\beta \in C_{t-1,a}} (x_t)_{Q_t}^\top \beta$ for every arm $a$ and then chooses an arm which maximizes reward according to these optimistic estimates. Equivalently, and more compactly, the algorithm chooses the arm:

$a_t \in \arg\max_{a} \max_{\beta \in C_{t-1,a}} (x_t)_{Q_t}^\top \beta .$

Note that the arm $a_t$ is chosen given only the observed covariates $(x_t)_{Q_t}$. We call the resulting algorithm SurveyUCB.
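When the confidence sets have the ellipsoidal Standard form introduced in section 4.1 (with center $\hat\beta_{t-1,a}$, weighting matrix $M_{t-1,a}$, radius $\alpha_{t-1,a}$, and support $S_{t-1,a}$, all generic names used here), the inner maximization has a familiar closed form, with $M^{+}$ denoting a pseudo-inverse on the support:

$\max_{\beta \in C_{t-1,a}} (x_t)_{Q_t}^\top \beta \;=\; (x_t)_{S_{t-1,a}}^\top \hat\beta_{t-1,a} \;+\; \alpha_{t-1,a} \, \big\| (x_t)_{S_{t-1,a}} \big\|_{M_{t-1,a}^{+}} .$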

4 Confidence Sets for SurveyUCB

In this section, we define confidence sets in Standard form and describe their construction using AlgConfidence. It will turn out that AlgConfidence constructs confidence sets in Standard form. Throughout this manuscript, we use AlgConfidence to construct the confidence sets for SurveyUCB.

4.1 Confidence Sets in Standard Form

We start by defining weighted norms and use them to define confidence sets in Standard form.

Definition 1 (Weighted norm).

For any vector $x$ and any positive semi-definite matrix $M$, we define the weighted norm of $x$ with respect to $M$ as follows: $\|x\|_M := \sqrt{x^\top M x}$. ($\|\cdot\|_M$ is a norm when $M$ is positive semi-definite.)

Definition 2 (Standard form).

We say the confidence sets are in Standard form if, at any time-step $t$ and for any arm $a$, the set $C_{t,a}$ is a ball centered around our estimate $\hat\beta_{t,a}$ of the true arm parameter under a weighted norm, and the supports of the confidence sets are non-increasing over time. More specifically, for all $t$ and $a$, and for some radius $\alpha_{t,a} > 0$, positive semi-definite matrix $M_{t,a}$, and support set $S_{t,a}$, we have:
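In the notation above (where $S_{t,a}$, $\hat\beta_{t,a}$, $M_{t,a}$, and $\alpha_{t,a}$ are generic names for the support, center, weighting matrix, and radius), such a set can be written as a support-restricted ellipsoid with nested supports:

$C_{t,a} \;=\; \big\{ \beta \in \mathbb{R}^{d} : \ \mathrm{supp}(\beta) \subseteq S_{t,a}, \ \ \|\beta - \hat\beta_{t,a}\|_{M_{t,a}} \le \alpha_{t,a} \big\}, \qquad S_{t,a} \subseteq S_{t-1,a} .$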

In SurveyUCB, note that the confidence sets determine the set of features queried. Also, at any time-step $t$, the confidence sets must be constructed using only the covariates and rewards observed at time-steps up to $t$. Lemma 1 identifies a set of covariates that are guaranteed to be observed when SurveyUCB uses confidence sets in Standard form.

Lemma 1.

Suppose SurveyUCB uses confidence sets in Standard form for times $1, \dots, t$. Then for any arm $a$, we observe $(x_s)_{S_{s,a}}$ at every time-step $s \le t$, where $S_{s,a} = \mathrm{supp}(C_{s,a})$ for all arms $a$ and time-steps $s$.

Proof.

All statements in this proof hold for all arms $a$ and time-steps $s \le t$. SurveyUCB queries the set of features $Q_s = \bigcup_{a'} \mathrm{supp}(C_{s-1,a'})$ at time $s$, so $S_{s-1,a} = \mathrm{supp}(C_{s-1,a}) \subseteq Q_s$. From the structure of confidence sets in Standard form, we have that $S_{s,a} \subseteq S_{s-1,a}$. This implies that:

$S_{s,a} \subseteq S_{s-1,a} \subseteq Q_s .$

That is, we observe $(x_s)_{S_{s,a}}$. ∎

Note that at any time-step $t$, SurveyUCB observes $(x_t)_{Q_t}$. Lemma 1 shows that for any arm $a$, we also observe $(x_t)_{S_{t,a}}$, because the set of features queried by SurveyUCB at time $t$ is a superset of the support of the confidence set $C_{t,a}$. That is, under the conditions of Lemma 1 we have:

$S_{t,a} \subseteq S_{t-1,a} \subseteq Q_t \qquad \text{for all arms } a .$

Initializing confidence sets: We now define our confidence sets for time-step zero based on Assumption 2. For any arm $a$, Assumption 2 gives us that $\|\beta_a\|_2 \le b$. That is, we have that $\beta_a \in C_{0,a}$ for all arms $a$, where:
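Concretely, with the norm bound $b$ from Assumption 2, one initialization consistent with the Standard form (taking $\hat\beta_{0,a} = 0$, $M_{0,a} = I$, $\alpha_{0,a} = b$, and full support) is:

$C_{0,a} \;=\; \big\{ \beta \in \mathbb{R}^{d} : \|\beta\|_2 \le b \big\}, \qquad S_{0,a} = [d] .$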

Based on Assumption 1, section 4.2 describes the construction of the confidence sets $C_{t,a}$ for all arms $a$ and time-steps $t \ge 1$.

4.2 AlgConfidence

Consider any time-step $t$. Suppose that the confidence sets constructed up to time $t-1$ are in Standard form, i.e. $\{C_{s,a}\}_{s \le t-1, a}$ are in Standard form. Let $a_t$ be the arm pulled at time $t$. We now describe the AlgConfidence update for the confidence set $C_{t,a_t}$ and show that the resulting confidence sets $\{C_{s,a}\}_{s \le t, a}$ are in Standard form. This inductively implies that AlgConfidence constructs confidence sets in Standard form, since the base case trivially holds (the confidence sets constructed up to and including time-step zero are trivially in Standard form).

From Lemma 1 we already know that if the confidence sets constructed up to time $t-1$ are in Standard form, then at every time-step $s \le t$ the decision maker using SurveyUCB observes at least the coordinates of $x_s$ in $S_{s-1,a_t} = \mathrm{supp}(C_{s-1,a_t})$. Hence, AlgConfidence can use this data to construct the confidence set $C_{t,a_t}$ for arm $a_t$ at time $t$.

Input : Given the previous confidence set $C_{t-1,a}$, the observed covariates and rewards for arm $a$, the parameter $\beta_{\min}$, and the radius $\alpha_{t,a}$, for some time-step $t$ and arm $a$.
Output : $C_{t,a}$.

1:  Initialize $S_{t,a} = \emptyset$.
2:  for $j \in S_{t-1,a}$ do
3:     if $\max_{\beta \in C_{t-1,a}} |\beta_j| \ge \beta_{\min}$  then
4:        $S_{t,a} = S_{t,a} \cup \{j\}$.
5:     end if
6:  end for
7:  Set $D = \{((x_s)_{S_{t,a}}, r_{s,a}) : s \in \mathcal{T}_{t,a}\}$. {Note that $S_{t,a} \subseteq S_{s-1,a}$ for all $s \le t$, hence we have observed $(x_s)_{S_{t,a}}$ for every $s \in \mathcal{T}_{t,a}$.}
8:  Estimate $\tilde\beta_{t,a}$ by regression over the features in $S_{t,a}$ on the data $D$.
9:  Compute $\hat\beta_{t,a}$ by setting the coordinates of $\tilde\beta_{t,a}$ not in $S_{t,a}$ to zero, and compute the weighting matrix $M_{t,a}$ (supported on $S_{t,a}$).
10:  Output $C_{t,a} = \{\beta : \mathrm{supp}(\beta) \subseteq S_{t,a}, \ \|\beta - \hat\beta_{t,a}\|_{M_{t,a}} \le \alpha_{t,a}\}$.
Algorithm 2 AlgConfidence (Constructing confidence sets in Standard form)
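A sketch of this update in Python, following the lines of Algorithm 2. The beta-min screening and the ridge-regression step are as described in the text; the helper names and the particular choice of weighting matrix (a regularized Gram matrix on the surviving features) are our assumptions for illustration.

import numpy as np

def alg_confidence(prev_support, beta_hat_prev, coord_upper_bound, X_obs, y_obs,
                   beta_min, alpha_new, ridge_lambda=1.0):
    """One AlgConfidence-style update for the pulled arm.

    prev_support      : sorted list of feature indices S_{t-1,a}
    beta_hat_prev     : previous estimate (full d-vector)
    coord_upper_bound : function j -> max over beta in C_{t-1,a} of |beta_j|
    X_obs, y_obs      : covariates (columns ordered as prev_support) and rewards for this arm
    """
    # Lines 1-6: keep only features that could still exceed the beta-min threshold.
    new_support = [j for j in prev_support if coord_upper_bound(j) >= beta_min]

    # Lines 7-8: ridge regression using only the surviving features.
    cols = [prev_support.index(j) for j in new_support]
    A = X_obs[:, cols]                                   # n x |S_{t,a}| design
    M = ridge_lambda * np.eye(len(cols)) + A.T @ A       # weighting matrix for the new ellipsoid
    theta = np.linalg.solve(M, A.T @ y_obs)              # ridge estimate on the support

    # Line 9: embed the estimate back into d dimensions (zeros off the support).
    beta_hat = np.zeros_like(beta_hat_prev)
    beta_hat[new_support] = theta

    # Line 10: the new confidence set is the ellipsoid (beta_hat, M, alpha_new) on new_support.
    return new_support, beta_hat, M, alpha_new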

AlgConfidence starts by constructing $S_{t,a_t}$ from $S_{t-1,a_t}$ by relying on Assumption 1, where $S_{t-1,a_t} = \mathrm{supp}(C_{t-1,a_t})$ is the support of the previous confidence set. We would like to construct $S_{t,a_t}$ so that it contains the support of $\beta_{a_t}$, i.e. $\mathrm{supp}(\beta_{a_t}) \subseteq S_{t,a_t}$. In particular, if the confidence set for arm $a_t$ at time $t-1$ holds (i.e. $\beta_{a_t} \in C_{t-1,a_t}$), then from Assumption 1 we have that for all features $j \in S_{t-1,a_t} \setminus S_{t,a_t}$:

$\beta_{a_t, j} = 0 .$

Also, $S_{t,a_t}$ is by construction a subset of the support of the confidence set $C_{t-1,a_t}$ of arm $a_t$ at time $t-1$. Hence we have that:

$\mathrm{supp}(\beta_{a_t}) \subseteq S_{t,a_t} \subseteq S_{t-1,a_t} .$

AlgConfidence now constructs a confidence set with support $S_{t,a_t}$. We first estimate $\tilde\beta_{t,a_t}$ by regressing the observed rewards over the features in $S_{t,a_t}$, on the observed data set $\{((x_s)_{S_{t,a_t}}, r_{s,a_t}) : s \in \mathcal{T}_{t,a_t}\}$. We then set the components of the estimate not in $S_{t,a_t}$ to zero, i.e. $\hat\beta_{t,a_t} = (\tilde\beta_{t,a_t})_{S_{t,a_t}}$. Now, with a weighting matrix $M_{t,a_t}$ supported on $S_{t,a_t}$ and radius $\alpha_{t,a_t}$, we construct the confidence set for arm $a_t$ at time $t$:

$C_{t,a_t} = \big\{ \beta \in \mathbb{R}^{d} : \mathrm{supp}(\beta) \subseteq S_{t,a_t}, \ \|\beta - \hat\beta_{t,a_t}\|_{M_{t,a_t}} \le \alpha_{t,a_t} \big\} .$

Note that $\mathrm{supp}(\hat\beta_{t,a_t}) \subseteq S_{t,a_t}$ and $S_{t,a_t} \subseteq S_{t-1,a_t}$. Hence, given the above form of the confidence set, we have that the sets $\{C_{s,a}\}_{s \le t, a}$ are in Standard form.

4.3 Probability Aggregation

To construct the confidence set $C_{t,a_t}$, recall that AlgConfidence assumes that the previous confidence set holds, i.e. that $\beta_{a_t} \in C_{t-1,a_t}$. This is unlike LinUCB (Li et al., 2010) and several other UCB algorithms, where confidence sets are constructed from observed data without directly relying on previous confidence sets. Here, we argue that our construction does not lead to any unexpected issues. We now state a helpful lemma and its corollary, and defer the proofs to appendix A.

Lemma 2 (Probability aggregation).

Consider a probability space $(\Omega, \mathcal{F}, P)$. Consider any sequence of events $E_1, E_2, \dots, E_T$ such that $P(E_1) \ge 1 - \delta_1$ and $P(E_t \mid E_1 \cap \dots \cap E_{t-1}) \ge 1 - \delta_t$ for any $t \ge 2$. Let $E = \bigcap_{t=1}^{T} E_t$. We then have that:
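With the generic failure probabilities $\delta_t$ introduced above, the conclusion takes the familiar form obtained by chaining the conditional probabilities (the paper's statement may use different symbols):

$P(E) \;=\; P\Big( \bigcap_{t=1}^{T} E_t \Big) \;\ge\; \prod_{t=1}^{T} (1 - \delta_t) \;\ge\; 1 - \sum_{t=1}^{T} \delta_t .$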

Corollary 1.

Suppose SurveyUCB constructs Standard form confidence sets $C_{t,a}$ for all arms $a$ and all time-steps $t$. We then have that:

$P\big( \beta_a \in C_{t,a} \ \text{for all arms } a \text{ and time-steps } t \big) \;\ge\; 1 - \sum_{t,a} \delta_{t,a},$

where $\delta_{t,a}$ bounds the conditional probability that $\beta_a \notin C_{t,a}$ given that all previously constructed confidence sets contain their arm parameters, for all arms $a$ and time-steps $t$.

Therefore, from Corollary 1, for any $\delta \in (0,1)$ we have that all confidence sets hold simultaneously with probability at least $1 - \delta$ if, for all arms $a$ and time-steps $t$:

$P\big( \beta_a \notin C_{t,a} \ \big|\ \beta_{a'} \in C_{s,a'} \ \text{for all } a' \text{ and } s < t \big) \;\le\; \delta_{t,a}, \qquad \text{where} \quad \sum_{t,a} \delta_{t,a} \le \delta. \qquad (1)$

5 General Regret Analysis

In section 4.2, we already saw that SurveyUCB constructs confidence sets in Standard form. In lemma 3 we exploit the structure of confidence sets in Standard form to get a general regret bound for SurveyUCB.

Lemma 3 (General regret analysis).

Suppose Assumption 2 holds. Suppose the confidence sets used in SurveyUCB are in Standard form with radius parameters $\alpha_{t,a}$ (as in Definition 2), and suppose that $\beta_a \in C_{t,a}$ at every time-step $t$ and for all arms $a$. We then have:

  • a bound on the instantaneous regret at time-step $t$ in terms of the radius $\alpha_{t-1,a_t}$ and the weighted norm of the observed covariates, and

  • a bound on the cumulative regret in $T$ rounds, obtained by summing the instantaneous bounds.

We defer the proof of Lemma 3 to appendix B. It is worth noting that the proof sheds some light on the need for confidence sets to be in Standard form.
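For intuition, the core step is the standard optimism argument; in our notation (a generic sketch of the argument's shape, not the paper's exact constants): if $\beta_a \in C_{t-1,a}$ for all $a$, then with $a_t^* \in \arg\max_a x_t^\top \beta_a$,

$x_t^\top \beta_{a_t^*} \;\le\; \max_{\beta \in C_{t-1,a_t^*}} x_t^\top \beta \;\le\; \max_{\beta \in C_{t-1,a_t}} x_t^\top \beta \;\le\; x_t^\top \beta_{a_t} + 2\,\alpha_{t-1,a_t}\, \big\| (x_t)_{S_{t-1,a_t}} \big\|_{M_{t-1,a_t}^{+}},$

so the instantaneous regret is at most $2\,\alpha_{t-1,a_t}\, \|(x_t)_{S_{t-1,a_t}}\|_{M_{t-1,a_t}^{+}}$; summing over $t$ and applying Cauchy-Schwarz together with the usual elliptical-potential argument yields a cumulative bound.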

6 Tail Inequalities for Adapted Observations

Consider a linear model $y = X\beta^* + \epsilon$, with design matrix $X \in \mathbb{R}^{n \times d}$, response vector $y \in \mathbb{R}^{n}$, and noise vector $\epsilon \in \mathbb{R}^{n}$, where the entries of $\epsilon$ form an independent sequence of $\sigma$-sub-Gaussian random variables.

6.1 Ridge Estimator

We now define the Ridge estimator for estimating the parameter $\beta^*$ as follows:

Definition 3 (Ridge).

Given a regularization parameter $\lambda > 0$, the Ridge estimate is given by:
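In the notation of this section (the scaling of the squared-loss term is one common convention; the paper's may differ by a constant factor), this reads:

$\hat\beta^{\mathrm{ridge}}_{\lambda} \;=\; \arg\min_{\beta \in \mathbb{R}^{d}} \ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \;=\; (X^\top X + \lambda I)^{-1} X^\top y .$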

From theorem 2 in (Abbasi-Yadkori et al., 2011) we get:

Lemma 4 (Abbasi-Yadkori et al., 2011, Theorem 2).

Let $x_i^\top$ denote the $i$-th row of $X$, and let $y_i$ denote the $i$-th entry of $y$. The sequence $(x_1, y_1), \dots, (x_n, y_n)$ forms an adapted sequence of observations; that is, $x_i$ may depend on $(x_1, y_1), \dots, (x_{i-1}, y_{i-1})$. Assume that $\|\beta^*\|_2 \le b$. Then, for any $\delta \in (0,1)$, with probability at least $1 - \delta$ we have that:

$\big\| \hat\beta^{\mathrm{ridge}}_{\lambda} - \beta^* \big\|_{V_\lambda} \;\le\; \sigma \sqrt{2 \log\!\Big( \frac{\det(V_\lambda)^{1/2} \det(\lambda I)^{-1/2}}{\delta} \Big)} + \lambda^{1/2} b .$

Furthermore, if $\|x_i\|_2 \le x_{\max}$ for all $i$, we have that:

$\big\| \hat\beta^{\mathrm{ridge}}_{\lambda} - \beta^* \big\|_{V_\lambda} \;\le\; \sigma \sqrt{d \log\!\Big( \frac{1 + n x_{\max}^2 / \lambda}{\delta} \Big)} + \lambda^{1/2} b .$

Where $\hat\beta^{\mathrm{ridge}}_{\lambda}$ is the Ridge estimate with parameter $\lambda$, and $V_\lambda = \lambda I + X^\top X$.

6.2 Elastic Net Estimator

We now define the Elastic net estimator for estimating the parameter $\beta^*$ as follows:

Definition 4 (Elastic net).

Given regularization parameters $\lambda_1, \lambda_2 > 0$, the Elastic net estimate is given by:
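A standard form of this estimator (again, up to the normalization convention for the squared loss) is:

$\hat\beta^{\mathrm{en}}_{\lambda_1, \lambda_2} \;\in\; \arg\min_{\beta \in \mathbb{R}^{d}} \ \frac{1}{n}\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 .$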

We now provide an adaptive tail inequality for the Elastic net estimator that may be of independent interest. We defer the proof to appendix C.

Lemma 5 (Elastic net tail inequality for Adapted observations).

Let $x_i^\top$ denote the $i$-th row of $X$, and let $y_i$ denote the $i$-th entry of $y$. The sequence $(x_1, y_1), \dots, (x_n, y_n)$ forms an adapted sequence of observations; that is, $x_i$ may depend on $(x_1, y_1), \dots, (x_{i-1}, y_{i-1})$. Also assume that all realizations of $x_i$ are bounded and that $\beta^*$ has bounded norm. Then, for an appropriate choice of the regularization parameters, we have:

Where $\hat\beta^{\mathrm{en}}_{\lambda_1, \lambda_2}$ is the Elastic net estimate with parameters $\lambda_1, \lambda_2$.

7 Ridge SurveyUCB

Note that the description and analysis of SurveyUCB give us a fair amount of flexibility. We are still free to specify the regression method used to estimate $\hat\beta_{t,a}$ for all arms $a$ and times $t$. We are also free to specify our choice of the radii $\alpha_{t,a}$ for all arms $a$ and times $t$.

Ridge SurveyUCB is a version of SurveyUCB in which we use Ridge regression with a fixed regularization parameter $\lambda$ to estimate $\hat\beta_{t,a}$. We also choose the radii $\alpha_{t,a}$ for all arms and time-steps based on Corollary 1 (eq. 1) and Lemma 4, so that the Standard form confidence sets we construct hold with high probability. Then, from Lemma 3, we naturally get a high-probability regret bound for Ridge SurveyUCB.

Say we want our bounds to hold with probability at least $1 - \delta$. Then, from Corollary 1, it is enough to show that eq. 1 holds for all arms and times; that is, for all arms $a$ and all times $t$, we want each confidence-set update to fail with conditional probability at most $\delta_{t,a}$, with $\sum_{t,a} \delta_{t,a} \le \delta$, as in eq. 1.

Given $\delta_{t,a}$, Lemma 4 gives us that eq. 1 holds for the Ridge estimate $\hat\beta_{t,a}$ with support $S_{t,a}$ if $\alpha_{t,a}$ is chosen so that:

(2)
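One sufficient choice of this shape, obtained by applying Lemma 4 on the support $S_{t,a}$ with the regularized Gram matrix $M_{t,a} = \lambda I + \sum_{s \in \mathcal{T}_{t,a}} (x_s)_{S_{t,a}} (x_s)_{S_{t,a}}^\top$ (with the determinant taken over the coordinates in $S_{t,a}$), is the following; the exact constants in eq. 2 may differ:

$\alpha_{t,a} \;=\; \sigma \sqrt{2 \log\!\Big( \frac{\det(M_{t,a})^{1/2}\, \lambda^{-|S_{t,a}|/2}}{\delta_{t,a}} \Big)} \;+\; \lambda^{1/2} b .$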

Hence, from Lemma 3 with the above choice of $\alpha_{t,a}$ and the corresponding high-probability guarantee, we get Theorem 1. We defer the proof to appendix E.

Theorem 1 (Ridge SurveyUCB regret).

Suppose Assumptions 1 and 2 hold. Let $\lambda$ and $\delta$ be fixed constants, and let $\alpha_{t,a}$ be chosen as in eq. 2 for all $t$ and $a$. Then, with probability at least $1 - \delta$, Ridge SurveyUCB has the following regret guarantee:

8 Elastic Net SurveyUCB

In this section, we want to develop a variant of SurveyUCB that is more robust to the choice of the beta-min parameter in Assumption 1. One way to do this is to modify line 3 in AlgConfidence. In particular, suppose $a_t$ is the arm pulled at time $t$. We then construct $S_{t,a_t}$ by removing from $S_{t-1,a_t}$ only those features $j$ that we estimate to be zero (i.e. $\hat\beta_{t-1,a_t,j} = 0$) and that we determine are irrelevant based on Assumption 1 (as in line 3 of AlgConfidence). That is, a feature is dropped only when both conditions hold.

It is easy to see that all our arguments for SurveyUCB continue to hold with the above modification. Note that this modification makes SurveyUCB more robust to Assumption 1, because we additionally need the $j$-th coordinate of our estimate ($\hat\beta_{t-1,a_t,j}$) to be zero before we remove feature $j$ (at time $t$) from the model of arm $a_t$. This modification encourages us to use sparse estimators.
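In the notation of the AlgConfidence sketch above (the helper names are hypothetical), the difference between the two screening rules is a single extra condition:

def keep_feature_ridge(j, coord_upper_bound, beta_min):
    # Original AlgConfidence (line 3): drop j once the confidence set rules out |beta_j| >= beta_min.
    return coord_upper_bound(j) >= beta_min

def keep_feature_elastic_net(j, coord_upper_bound, beta_hat_prev, beta_min):
    # Modified rule: additionally require the current (sparse) estimate to be exactly zero at j
    # before dropping the feature, making the screening robust to an optimistic beta_min.
    return (coord_upper_bound(j) >= beta_min) or (beta_hat_prev[j] != 0)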

Elastic Net SurveyUCB is a variant of SurveyUCB with the above modification. In Elastic Net SurveyUCB, we use Elastic net regression to estimate $\hat\beta_{t,a}$, with regularization parameters $\lambda_1$ and $\lambda_2$ chosen for each arm $a$ and time-step $t$ as follows:

(3)

Similar to the arguments in section 7, Corollary 1 and Lemma 5 give us that our confidence sets hold with probability at least $1 - \delta$ if, for all arms $a$ and times $t$, the radius $\alpha_{t,a}$ satisfies:

(4)

Again, from Lemma 3 with the above choice of $\alpha_{t,a}$ and the corresponding high-probability guarantee, we get Theorem 2. We defer the proof to appendix F.

Theorem 2 (Elastic Net SurveyUCB regret).

Suppose Assumptions 1 and 2 hold, and fix the remaining constants. For all $t$ and $a$, let the regularization parameters and the radii be chosen as in eq. 3 and eq. 2 respectively. Then, with probability at least $1 - \delta$, Elastic Net SurveyUCB has the following regret guarantee:

9 Interactive Surveys

We start by defining sub-optimal arms, then describe an inefficiency in SurveyUCB and propose a fix using interactive surveys.

Definition 5 (Sub-optimal arms).

An arm is said to be sub-optimal if the target policy would not pick it for any context vector in the context space.

At any time-step $t$, the SurveyUCB algorithms query the full set $Q_t = \bigcup_{a} \mathrm{supp}(C_{t-1,a})$. Now suppose the decision maker plays arm $a_t$ at time $t$. The decision maker only needs the reward and the features corresponding to $\mathrm{supp}(C_{t-1,a_t})$ to update the model. Recall that the reason the decision maker queries all the features in $Q_t$ is to be able to compute the upper confidence bounds of all arms.

Note that the bandit assignment only depends on the arm with the highest upper confidence bound, so it is not necessary to compute all the upper confidence bounds exactly. The algorithm's inefficiency becomes more evident in the presence of sub-optimal arms, because these arms are played less frequently, hence their confidence sets are updated less frequently, and they therefore increase the survey length.

Interactive input : User at time $t$, who answers queries whenever asked.
Output: Queries to user $t$.

1:  Create an ordered list $L$ of arms, starting from the most pulled arm to the least pulled arm.
2:  Initialize $U = -\infty$, the largest upper confidence bound among queried arms.
3:  Initialize the set of queried features $Q = \emptyset$.
4:  while $L$ is not empty do
5:     Let $a$ be the first arm in $L$.
6:     Query the features in $\mathrm{supp}(C_{t-1,a}) \setminus Q$ and observe the corresponding coordinates of $x_t$.
7:     Update $Q = Q \cup \mathrm{supp}(C_{t-1,a})$.
8:     Update $U = \max\{U, \ \max_{\beta \in C_{t-1,a}} (x_t)_{Q}^\top \beta\}$.
9:     Remove $a$ from the list $L$.
10:     for each arm $a'$ in $L$ do
11:        Set $\bar{U}_{a'}$ to the upper bound on arm $a'$'s upper confidence bound computed from the features queried so far (eq. 5).
12:        if $\bar{U}_{a'} \le U$ then
13:           Remove arm $a'$ from $L$.
14:        end if
15:     end for
16:  end while {Note that we observe enough information to determine which arm has the largest upper confidence bound, and we can update its confidence set after observing its reward.}
Algorithm 3 Interactive Survey Protocol
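A Python sketch in the spirit of Algorithm 3. The exact_ucb and relaxed_ucb_upper_bound callables stand in for the exact upper confidence bound of a fully queried arm and for the relaxation in eq. 5, respectively; their names and the data structures are our own.

def interactive_survey(env, arms, pull_counts, supports, exact_ucb, relaxed_ucb_upper_bound):
    """arms: list of arm ids; supports[a]: set of features needed to score arm a exactly."""
    # Line 1: most-pulled arms first, so sub-optimal arms tend to sit at the tail of the list.
    order = sorted(arms, key=lambda a: pull_counts[a], reverse=True)
    best_ucb, best_arm = float("-inf"), None
    queried, answers = set(), {}

    while order:
        a = order.pop(0)                                  # lines 5 and 9
        new_features = supports[a] - queried              # line 6: only ask what is still unknown
        answers.update(env.answer(new_features))
        queried |= supports[a]                            # line 7
        ucb_a = exact_ucb(a, answers)                     # line 8: arm a is now fully queried
        if ucb_a > best_ucb:
            best_ucb, best_arm = ucb_a, a

        # Lines 10-15: prune unqueried arms whose relaxed bound already falls below the best UCB.
        order = [b for b in order
                 if relaxed_ucb_upper_bound(b, answers, queried) > best_ucb]

    return best_arm, queried

The returned arm has the largest upper confidence bound, and only the features in `queried` were ever asked of the user.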

We attempt to resolve this inefficiency through interactive surveys. Consider the user at time $t$: the decision maker needs to query some features from this user before taking an action. We start by creating an ordered list of all arms, starting with the most pulled arms and ending with the least pulled arms. This ordering is a heuristic choice; the idea is to keep sub-optimal arms (which are pulled less frequently) towards the end of the list. We keep removing arms from this list and terminate (take an action) when it is empty.

The main idea is to sequentially query the feature sets for arms in the list, remove queried arms from the list as we go, and also remove unqueried arms that provably do not have the largest upper confidence bound. We now explain how we determine that some unqueried arm does not have the largest upper confidence bound. First, note that for each queried arm we can exactly compute its upper confidence bound at time $t$. For an unqueried arm, the following optimization problem, which uses only the features queried so far, gives an upper bound on its upper confidence bound (here $M^{+}$ denotes the pseudo-inverse of the matrix $M$):

(5)