In a number of applications, like healthcare and mobile health, collecting features from users can be costly. This encourages us to develop variants of contextual bandits that at any time-step select a set of features to be collected from users, and then select an action based on these observed features. We use the term survey bandits to refer to this set-up. Through this formulation, we address the issue of needless feature collection in contextual bandits.
Suppose we are building a system to recommend charities that users can donate to. Since there are many charities we could recommend, it is efficient to make personalized recommendations. This requires us to collect features from users, which can be done by requiring users to fill a survey/questionnaire. The reward at any time-step might be the amount donated. As usual, our goal is to minimize regret. Beyond regret, we would also like to improve user experience by shortening the survey we require users to fill. We try to answer the following question: Can we ensure strong regret guarantees, while being able to shorten our survey over time?
Our contributions: We answer this question in the affirmative. We start by considering zero-shot surveys, where the decision maker has to decide the set of features to be queried at every time-step before the user arrives and then make a personalized decision based on the responses. We now state our assumptions. In addition to the standard assumptions made for LinUCB, we introduce assumption 1 (beta-min), which is common in the feature selection literature. We propose algorithms that are natural variants of LinUCB (Li et al., 2010) for the survey bandits framework. Under our assumptions, we prove regret guarantees for these algorithms. At the same time, these algorithms exploit sparsity in arm parameters to reduce the set of features collected from users.
In fact, our algorithm Ridge SurveyUCB has regret guarantees that are tight even for standard contextual bandits. We also provide an algorithm, Elastic Net SurveyUCB, which is more robust to violations of assumption 1 but has a weaker regret guarantee. This result requires us to prove a new adaptive tail bound for the elastic net estimator, which may be of independent interest.
Through simulations, we demonstrate that both algorithms perform well in terms of regret when assumption 1 holds. Ridge SurveyUCB can perform poorly on regret when assumption 1 is violated, whereas Elastic Net SurveyUCB performs well even in that case. It is worth noting that we can still use Ridge SurveyUCB; we just need to use a conservative choice for the beta-min parameter (in assumption 1). Even for conservative choices of the beta-min parameter, we eventually see benefits on survey length.
Through simulations, we also see that both algorithms demonstrate savings in terms of survey length. In fact, in the absence of sub-optimal arms, both algorithms always remove the features that are not relevant for reward prediction in the survey.
We also consider settings with interactive surveys where, at each time-step, before making a personalized decision, the decision maker can continually expand the set of queries based on previous responses by the user at that time-step. This allows us to start by querying a smaller set of features and expand our survey for the user only if it is needed to make a better personalized decision. We develop variants of our algorithms that use interactive surveys to ensure lower survey lengths, especially in the presence of sub-optimal arms, while maintaining the same regret performance.
Related work: Prior work (e.g. Abbasi-Yadkori et al., 2012; Bastani and Bayati, 2015) has also exploited sparsity in arm parameters to provide stronger sparsity-dependent regret guarantees. When an upper bound on the sparsity of arm parameters is known to the decision maker, Abbasi-Yadkori et al. (2012) provide algorithms with tight regret guarantees for contextual bandits. Under distributional assumptions on user covariates, Bastani and Bayati (2015) provide stronger regret guarantees, even when no sparsity parameter is provided to the algorithm. While these regret guarantees are stronger than the ones we provide, it is important to note that we do not make any distributional assumption on user covariates and we do not assume that the decision maker knows an upper bound on the sparsity of arm parameters.
Bouneffouf et al. (2017) is the paper most closely related to our work. They develop interesting algorithms for a very similar setup and evaluate them empirically. At every time-step, their algorithms query a fixed number of features, and this number must be supplied as an input to the algorithm. For a conservative choice of this parameter, we would see little benefit to the survey length. Unfortunately, for small choices of this parameter, their algorithms can empirically perform much worse than contextual bandits. We view our work as an alternate approach with guarantees on regret.
2.1 Problem Setting
Let $T$ denote the number of time-steps. At each time-step, a new user arrives. The decision maker has a survey with $d$ questions and can ask a user any subset of these questions. At any time-step $t$, a user comes with a set of observable covariates $X_t$. Here $X_t$ is a $d$-vector, corresponding to the $t$-th user's observable answers to the $d$ questions on the survey.
The decision maker has access to $K$ arms (decisions). Each arm yields a noisy, user-specific reward. In particular, each arm $i \in [K]$ has an unknown parameter $\beta_i \in \mathbb{R}^d$. At time $t$, pulling arm $i$ would yield a reward $Y_i(t) = X_t^\top \beta_i + \epsilon_{i,t}$, where the $\epsilon_{i,t}$ are an independent sequence of $\sigma$-sub-Gaussian random variables. Note that at any time-step, the decision maker can only observe the reward of the arm that was pulled.
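Some notation: For any vector $v \in \mathbb{R}^d$ and any index set $Q \subseteq [d]$, we let $v_Q$ denote the vector obtained by setting coordinates of $v$ that are not in $Q$ to zero. For any matrix $A \in \mathbb{R}^{d \times d}$ and any index set $Q \subseteq [d]$, we let $A_Q$ denote the matrix obtained by setting rows and columns of $A$ that do not correspond to an index in $Q$ to zero. For any $v \in \mathbb{R}^d$, we let $S(v)$ denote the support of $v$, i.e. the set of indices of its non-zero coordinates. And for any set $C \subseteq \mathbb{R}^d$, we let $S(C) = \bigcup_{v \in C} S(v)$.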
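The restriction and support operations above can be sketched in a few lines of NumPy. This is only an illustration: the function names are hypothetical, indices are zero-based, and the support of a set is taken to be the union of the supports of its members, which is an assumption that can only be computed here for a finite collection of vectors.

```python
import numpy as np

def restrict_vector(v, Q):
    """Copy of v with every coordinate outside the index set Q set to zero."""
    out = np.zeros_like(v)
    idx = sorted(Q)
    out[idx] = v[idx]
    return out

def restrict_matrix(A, Q):
    """Copy of A with every row and column outside the index set Q set to zero."""
    out = np.zeros_like(A)
    idx = sorted(Q)
    out[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
    return out

def support(v):
    """Set of indices of the non-zero coordinates of v."""
    return {int(j) for j in np.flatnonzero(v)}

def support_of_set(vectors):
    """Assumed definition: union of the supports of a finite collection of vectors."""
    return set().union(*(support(v) for v in vectors))
```

For example, restricting a 4-vector to the index set {0, 3} keeps its first and last coordinates and zeroes the other two.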
Goal: The goal is to design a sequential decision making policy $\pi$ that maximizes expected cumulative reward and, subject to strong reward guarantees, minimizes the total number of questions asked to users. Let $Q_t \subseteq [d]$ denote the subset of survey questions queried by policy $\pi$ at time $t$. And let $a_t \in [K]$ denote the arm chosen by policy $\pi$ at time $t$. Note that we do not observe $X_t$ and only observe $(X_t)_{Q_t}$; hence we should be able to choose the arm $a_t$ using only the observed covariates $(X_t)_{Q_t}$ and data collected from previous time-steps.
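Target policy: We now describe a "sensible" target policy. Consider the target policy that already knows the arm parameters $\beta_i$ but not the noise parameters. We want the target policy to maximize expected cumulative reward, and hence at any time-step $t$ the policy must pick an arm $a_t^* \in \arg\max_{i \in [K]} X_t^\top \beta_i$. Therefore, the target policy only needs to query features that influence arm rewards. That is, it queries
$$Q^* = \bigcup_{i \in [K]} S(\beta_i).$$
Note that for any vector $x \in \mathbb{R}^d$ and any arm $i \in [K]$, we have that $x^\top \beta_i = x_{Q^*}^\top \beta_i$. Therefore at any time-step $t$, having observed $(X_t)_{Q^*}$, the target policy is able to choose the best arm for the covariate $X_t$:
$$a_t^* \in \arg\max_{i \in [K]} (X_t)_{Q^*}^\top \beta_i.$$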
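Regret: If $a_t = i$, we define the expected regret at time $t$ as the difference between the maximum expected reward and the expected reward of arm $i$ at time $t$, i.e. $r_t = \max_{j \in [K]} X_t^\top \beta_j - X_t^\top \beta_i$. And we define the cumulative expected regret as $R_T = \sum_{t=1}^{T} r_t$.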
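As a small worked illustration of the regret definition, the sketch below assumes the linear reward model of the problem setting (the expected reward of an arm is the inner product of the context with that arm's parameter). The function names and the array layout, with one row of `betas` per arm, are hypothetical.

```python
import numpy as np

def expected_regret(x, betas, chosen_arm):
    """Best expected reward for context x minus the expected reward of chosen_arm."""
    means = betas @ x  # expected reward of every arm for this context
    return float(means.max() - means[chosen_arm])

def cumulative_regret(contexts, betas, chosen_arms):
    """Sum of the per-step expected regrets over a sequence of contexts and pulls."""
    return sum(expected_regret(x, betas, a) for x, a in zip(contexts, chosen_arms))
```

Pulling the best arm at a time-step contributes zero regret, so the cumulative quantity only grows at the time-steps where a worse arm was pulled.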
Additional notation: Let denote the design matrix, whose rows are . Let denote the vector of observations , where entries of may be missing. (If arm wasn't played at time , then the -th coordinate of would be missing.) For all and for any , define the sample set and let . Let denote the number of times arm was pulled up to time (i.e. ). For any , we let be the submatrix of whose rows are for each . Similarly when , we let be the -vector whose coordinates are for all . (Since when , we know that for all . Therefore, has no missing values.) Also let denote the identity matrix.
We make two assumptions. The first allows us to ignore features that have a small influence on arm rewards: we assume that such features in fact have no influence on arm rewards.
Assumption 1 (Beta-min).
The decision maker knows a parameter such that for all arms and all , either or .
The following is a common assumption in contextual multi-armed bandit problems; it is equivalent to assuming that expected rewards are bounded.
Assumption 2 (Bounded rewards).
We make the following assumptions to ensure that the expected reward of any arm for any context is bounded: For all and , we have and .
For simplicity we further assume that, for all and , the potential reward of pulling arm lies in , i.e. . (We can avoid the assumption that for all and . We would just need to choose in definition 2 for the regret guarantee to work out, where for all .)
Since for any vector , we know that for all . Therefore for all , assumption 2 gives us that the -norms of and are bounded by and for all time-steps and arms .
3 UCB for Survey Bandits
In this section, we describe a natural extension of LinUCB (Li et al., 2010) for survey bandits which involves describing a policy for selecting survey questions and a consistent arm selection policy that can choose an arm given the observed covariates.
Upper confidence bound (UCB) algorithms follow the principle of optimism in the face of uncertainty. The essential idea is to construct high-probability confidence sets, for the parameter of every arm , from observed covariates and rewards . That is, with high probability, for all time-steps and for all arms . The algorithm then queries the set of features:
Therefore, we have that for . Hence, for any and , we have that . Therefore, it follows that:
The algorithm chooses an optimistic estimate for every arm and then chooses an arm which maximizes reward according to the optimistic estimates. Equivalently, and more compactly, the algorithm chooses the arm:
Note that the arm is chosen given only the observed covariates . We call the resulting algorithm SurveyUCB.
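One round of this query-then-select loop can be sketched as follows. The sketch assumes that each arm's confidence set is a weighted-norm ball of squared radius `alpha` around an estimate `beta_hat`, restricted to a support set, and that the survey asks for the union of the arms' supports. The dictionary keys and the `get_features` callback are hypothetical names introduced for illustration.

```python
import numpy as np

def survey_ucb_step(get_features, arms):
    """One round: query the union of supports, then pick the arm with the largest UCB.

    arms: list of dicts with keys 'beta_hat' (d,), 'V' (d, d), 'alpha' (float)
          and 'support' (set of feature indices); all names are illustrative.
    get_features: callable that returns the user's answers for a list of indices.
    """
    query = set().union(*(arm["support"] for arm in arms))
    idx = sorted(query)
    d = arms[0]["beta_hat"].shape[0]
    x = np.zeros(d)
    x[idx] = get_features(idx)  # unqueried coordinates stay at zero
    ucbs = []
    for arm in arms:
        s = sorted(arm["support"])
        x_s = x[s]
        V_s = arm["V"][np.ix_(s, s)]
        # Largest value of x^T beta over the ball: estimate plus confidence width.
        width = np.sqrt(arm["alpha"] * (x_s @ np.linalg.pinv(V_s) @ x_s))
        ucbs.append(float(x_s @ arm["beta_hat"][s] + width))
    return query, int(np.argmax(ucbs)), ucbs
```

Features outside every arm's support are never requested from the user, which is where the savings in survey length come from in this sketch.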
4 Confidence Sets for SurveyUCB
In this section, we define confidence sets in Standard form and describe the construction of these sets using AlgConfidence. It will turn out that AlgConfidence constructs confidence sets in Standard form. Throughout this manuscript, we use AlgConfidence to construct confidence sets for SurveyUCB.
4.1 Confidence Sets in Standard Form
We start by defining weighted norms and use that to define confidence sets in Standard form.
Definition 1 (Weighted norm).
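For any vector $x \in \mathbb{R}^d$ and any positive semi-definite matrix $A \in \mathbb{R}^{d \times d}$, we define the weighted norm of $x$ with respect to $A$ as follows: $\|x\|_A = \sqrt{x^\top A x}$. ($\|\cdot\|_A$ is a norm when $A$ is positive definite, and a semi-norm when $A$ is only positive semi-definite.)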
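A minimal NumPy sketch of the weighted norm, assuming the standard form, the square root of the quadratic form of the vector in the matrix. The function name is hypothetical.

```python
import numpy as np

def weighted_norm(x, A):
    """sqrt(x^T A x) for a symmetric positive semi-definite matrix A."""
    value = float(x @ A @ x)
    return float(np.sqrt(max(value, 0.0)))  # clip tiny negative round-off
```

With the identity matrix this reduces to the Euclidean norm. With a singular matrix, a non-zero vector can have weighted norm zero.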
Definition 2 (Standard form).
We say confidence sets are in Standard form if, at any time-step and for any arm , the set is a ball centered around our estimate of the true arm parameter under a weighted norm, and the sets have non-increasing supports. More specifically, for all , and for some , we have:
In SurveyUCB, note that the confidence sets determine the set of features queried. Also, at any time-step , the confidence sets must be constructed using only the covariates and rewards observed at every time-step up to . Lemma 1 gives us a set of observed covariates when SurveyUCB uses confidence sets in Standard form.
Lemma 1.
If SurveyUCB uses confidence sets in Standard form for times , then for any arm , we observe , where for all arms and time-steps .
Proof. All statements in this proof hold for all arms and time-steps . SurveyUCB queries the set of features at time , where . From the structure of confidence sets in Standard form, we have that . This implies that:
That is, we observe . ∎
Note that at any time-step , SurveyUCB observes . Lemma 1 shows that for any , we also observe because the set of features queried by SurveyUCB at is a superset of the support of the confidence set . That is, under the conditions of lemma 1 we have:
Consider any time-step . Suppose that the confidence sets constructed up to time are in Standard form, i.e. are in Standard form. Let be the arm pulled at time . We now describe the AlgConfidence update for the confidence set and show that the confidence sets are in Standard form. This would inductively imply that AlgConfidence constructs confidence sets in Standard form, since the base case trivially holds (confidence sets constructed up to and including time-step zero are trivially in Standard form).
From lemma 1 we already know that if the confidence sets constructed up to time are in Standard form, then the decision maker using SurveyUCB at least observes at every time-step , where . Hence, AlgConfidence can use this to construct the confidence set () for arm at time .
AlgConfidence starts by constructing from by relying on assumption 1. Here , with , is the support of the confidence set . We would like to construct so that it contains the support of , i.e. . In particular, if the confidence set for arm at time holds (i.e. ), then from assumption 1 we have that for all :
Also we get that is a subset of the support of the confidence set () of arm at time . Hence we have that:
AlgConfidence now constructs confidence set with support . We then estimate by regressing over the features in , on the observed data set: . We then set the components of not in to zero, i.e. . Now with , we construct the confidence set for arm at time :
Note that and . Hence given the above form of the confidence set, we have that are in Standard form.
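The update described above can be sketched in two steps, under stated assumptions. First, a feature is dropped from an arm's support when every parameter in the previous confidence ball has magnitude below the beta-min threshold on that coordinate. Second, the arm parameter is re-estimated on the surviving features only. The sketch uses ridge regression for the second step, and the function names, the exact pruning rule and the non-strict inequality are illustrative choices, not necessarily the ones used in AlgConfidence.

```python
import numpy as np

def prune_support(beta_hat, V, alpha, support, beta_min):
    """Keep feature j only if some parameter in the ball can reach |beta_j| >= beta_min.

    The ball is {b : (b - beta_hat)^T V (b - beta_hat) <= alpha}, restricted to `support`.
    """
    s = sorted(support)
    V_inv = np.linalg.pinv(V[np.ix_(s, s)])
    radius = np.sqrt(alpha * np.diag(V_inv))  # per-coordinate half-width of the ball
    largest = np.abs(beta_hat[s]) + radius    # largest |beta_j| inside the ball
    return {j for j, m in zip(s, largest) if m >= beta_min}

def restricted_ridge(X, y, support, lam):
    """Ridge regression on the columns in `support`; other coordinates are set to zero."""
    d = X.shape[1]
    s = sorted(support)
    X_s = X[:, s]
    V_s = X_s.T @ X_s + lam * np.eye(len(s))
    beta_hat = np.zeros(d)
    beta_hat[s] = np.linalg.solve(V_s, X_s.T @ y)
    V = np.zeros((d, d))
    V[np.ix_(s, s)] = V_s  # zero rows and columns outside the support
    return beta_hat, V
```

Because the new support is always a subset of the old one, repeating these two steps yields supports that never grow, matching the non-increasing support requirement of the Standard form.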
4.3 Probability Aggregation
To construct the confidence set , recall that AlgConfidence assumes that . This is unlike LinUCB (Li et al., 2010) and several other UCB algorithms where confidence sets are constructed from observed data without directly relying on previous confidence sets. Here, we argue that our construction does not lead to any unexpected issues. We now state a helpful lemma and its corollary, and defer proofs to appendix A.
Lemma 2 (Probability aggregation).
Consider a probability space . Consider any sequence of events , such that and for any . Let . We then have that:
Corollary 1. Suppose SurveyUCB constructs the Standard form confidence sets for all arms and all time-steps . We then have that:
Where and for all arms and time-steps .
Therefore from corollary 1 for any , we have that if for all and we have: 666Note that
5 General Regret Analysis
In section 4.2, we already saw that SurveyUCB constructs confidence sets in Standard form. In lemma 3 we exploit the structure of confidence sets in Standard form to get a general regret bound for SurveyUCB.
Lemma 3 (General regret analysis).
6 Tail Inequalities for Adapted Observations
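Consider a linear model $Y = X\beta + \epsilon$, with design matrix $X \in \mathbb{R}^{n \times d}$, response vector $Y \in \mathbb{R}^{n}$, and noise vector $\epsilon \in \mathbb{R}^{n}$, where the entries of $\epsilon$ are an independent sequence of $\sigma$-sub-Gaussian random variables.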
6.1 Ridge Estimator
We now define the Ridge estimator for estimating the parameter as follows:
Definition 3 (Ridge).
Given regularization parameters . The Ridge estimate is given by:
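For concreteness, here is a minimal sketch of a ridge estimate, assuming the common unnormalized form that minimizes the squared residual plus `lam` times the squared Euclidean norm of the parameter. The exact scaling used in definition 3 may differ, and the function name is hypothetical.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Return the ridge estimate and the regularized Gram matrix V = X^T X + lam * I."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(V, X.T @ y), V
```

The matrix `V` is returned alongside the estimate because weighted-norm confidence balls are typically stated in terms of this regularized Gram matrix.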
From theorem 2 in (Abbasi-Yadkori et al., 2011) we get:
Lemma 4 (Abbasi-Yadkori, et al 2).
Let denote the -th row of . Let denote the -th entry of . The sequence form an adapted sequence of observations. That is, may depend on . Assume that . Then for , we have that:
Furthermore, if for all , we have that:
Where is the Ridge estimate with parameter . And, .
6.2 Elastic Net Estimator
We now define the Elastic net estimator for estimating the parameter as follows:
Definition 4 (Elastic net).
Given regularization parameters , . The Elastic net estimate is given by:
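The sketch below computes an elastic net estimate by proximal gradient descent (ISTA). It assumes the objective is half the squared residual plus `lam1` times the L1 norm plus half of `lam2` times the squared L2 norm. This parametrization is an assumption made for illustration and may differ from definition 4 by constant factors. The function names and the iteration count are hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def elastic_net_estimate(X, y, lam1, lam2, n_iter=5000):
    """Minimize 0.5*||y - X b||^2 + lam1*||b||_1 + 0.5*lam2*||b||^2 by ISTA.

    Assumes X^T X + lam2 * I is not the zero matrix, so the step size is finite.
    """
    d = X.shape[1]
    gram = X.T @ X + lam2 * np.eye(d)
    step = 1.0 / np.linalg.eigvalsh(gram).max()  # 1 / Lipschitz constant of the smooth part
    Xty = X.T @ y
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = gram @ beta - Xty                 # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam1)
    return beta
```

The L1 term is what produces exact zeros in the estimate. With `lam1 = 0` the iteration converges to the ridge solution for the same `lam2`.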
We now provide an adaptive tail inequality for the Elastic net estimator that may be of independent interest. We defer the proof to appendix C.
Lemma 5 (Elastic net tail inequality for Adapted observations).
Let denote the -th row of . Let denote the -th entry of . The sequence form an adapted sequence of observations. That is, may depend on . Also assume all realizations of satisfy and that . Then, if , we have:
Where is the Elastic net estimate with parameters . And, .
7 Ridge SurveyUCB
Note that the description and analysis of SurveyUCB gives us a fair amount of flexibility. We are still free to specify the regression method used to estimate for all and times . We are also free to specify our choice of for all arms and all times .
Ridge SurveyUCB is a version of SurveyUCB. In Ridge SurveyUCB, we use Ridge regression with a fixed regularization parameter () to estimate . We also choose for all arms and time-steps based on corollary 1 (eq. 1) and lemma 4, so that the Standard form confidence sets that we construct hold with high probability. Then from lemma 3 we naturally get a high-probability regret bound for Ridge SurveyUCB.
8 Elastic Net SurveyUCB
In this section, we want to develop a variant of SurveyUCB that is more robust to the choice of the beta-min parameter in assumption 1. One way to do this is to modify line 3 in AlgConfidence. In particular, suppose is the arm pulled at time . We then construct by removing all features from that we estimate to be zero (i.e. ) and that we determine are irrelevant based on assumption 1. That is:
It is easy to see that all our arguments for SurveyUCB continue to hold with the above modification. Note that this modification makes SurveyUCB more robust to violations of assumption 1, because we additionally need the -th coordinate of our estimate () to be zero before we remove feature (at time ) from the model of arm . This modification encourages us to use sparse estimators.
Elastic Net SurveyUCB is a variant of SurveyUCB with the above modification. In Elastic Net SurveyUCB, we use the Elastic net regression to estimate , with regularization parameters and for all arms and time-steps . Where:
9 Interactive Surveys
We start by defining sub-optimal arms, describe an inefficiency in SurveyUCB, and propose a fix using interactive surveys.
Definition 5 (Sub-optimal arms).
An arm is said to be sub-optimal if the target policy would not pick it for any context vector in the context space.
At any time-step , the SurveyUCB algorithms query . Now suppose the decision maker plays arm at time . The decision maker only needs the reward and the features corresponding to to update the model. Recall that the reason the decision maker queries all the features in is to be able to compute the upper confidence bounds for all arms.
Note that the bandit assignment only depends on the highest upper confidence bound, so it is not necessary to compute all the upper confidence bounds. The algorithm’s inefficiency becomes more evident in the presence of sub-optimal arms because these arms are played less frequently, hence updated less frequently, and increase survey length.
We attempt to resolve this inefficiency through interactive surveys. Consider the user at time . The decision maker needs to query some features from the user before taking an action. We start by creating an ordered list of all arms, starting from the most pulled arms and ending with the least pulled arms. This ordering is a heuristic choice; the idea is to keep sub-optimal arms (which are less frequently pulled) towards the end of the list. We keep removing arms from this list and terminate (take an action) when it is empty.
The main idea is to sequentially query the feature sets for arms in the list, simultaneously remove queried arms from the list, and also remove unqueried arms that do not have the largest upper confidence bound. We will now explain how we determine that some unqueried arm doesn't have the largest upper confidence bound. First note that for each queried arm, we can exactly compute its upper confidence bound at time . Clearly the following optimization problem (where is the pseudo-inverse of the matrix) gives us an upper bound to the upper confidence bound of arm :
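The control flow of this interactive procedure can be sketched as follows. The two callbacks are hypothetical placeholders: `exact_ucb` stands for the exact upper confidence bound of an arm whose features have been queried, and `ucb_upper_bound` stands for any valid upper bound on the upper confidence bound of an unqueried arm given the features seen so far (for instance, the value of an optimization problem of the kind described above). Only the ordering and elimination logic is illustrated here.

```python
def interactive_survey(pull_counts, supports, exact_ucb, ucb_upper_bound):
    """Query arms in order of decreasing pull count, pruning arms that cannot win.

    pull_counts[i]: number of times arm i has been pulled so far.
    supports[i]: set of feature indices needed to evaluate arm i.
    exact_ucb(i, queried): UCB of arm i once its features are in `queried` (hypothetical).
    ucb_upper_bound(i, queried): bound on arm i's UCB from `queried` alone (hypothetical).
    """
    remaining = sorted(range(len(pull_counts)), key=lambda i: -pull_counts[i])
    queried, best_arm, best_ucb = set(), None, float("-inf")
    while remaining:
        arm = remaining.pop(0)        # most-pulled arm still in the list
        queried |= supports[arm]      # extend the survey for this user
        ucb = exact_ucb(arm, queried)
        if ucb > best_ucb:
            best_arm, best_ucb = arm, ucb
        # Drop unqueried arms whose UCB provably cannot exceed the current best.
        remaining = [j for j in remaining if ucb_upper_bound(j, queried) >= best_ucb]
    return best_arm, queried
```

In this sketch a rarely pulled arm is queried only if its bound still allows it to have the largest upper confidence bound, so its features are often never asked for.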