Codes for paper Heteroscedastic Bandits with Reneging https://arxiv.org/abs/1810.12418
Although shown to be useful in many areas as models for solving sequential decision problems with side observations (contexts), contextual bandits are subject to two major limitations. First, they neglect user "reneging" that occurs in real-world applications. That is, users unsatisfied with an interaction quit future interactions forever. Second, they assume that the reward distribution is homoscedastic, which is often invalidated by real-world datasets, e.g., datasets from finance. We propose a novel model of "heteroscedastic contextual bandits with reneging" to overcome the two limitations. Our model allows each user to have a distinct "acceptance level," with any interaction falling short of that level resulting in that user reneging. It also allows the variance to be a function of context. We develop a UCB-type policy, called HR-UCB, and prove that with high probability it achieves $\mathcal{O}(\sqrt{T}(\log T)^{3/2})$ regret.
Multi-armed bandits (MAB) have been extensively used to model sequential decision problems with uncertain rewards. Such problems commonly arise in a large number of real-world applications such as clinical trials, search engines, online advertising, and notification systems. While in those applications users (e.g., patients) have been modeled as being homogeneous, there is a strong motivation to enhance user experience by personalizing interactions and attending to users' specific demands, and thereby to increase revenue. The model of “contextual bandits” seeks to do so by proposing a MAB model for learning how to act optimally based on contexts (features) of users and arms. At the beginning of each round, the learner observes a context from the context set (e.g., medical records, treatment details) and selects an arm from the arm set (e.g., different treatments). At the end of the round, the learner receives a random reward (e.g., the result of the treatment) with the mean value of its distribution depending on the observed context. The objective of the learner is to accumulate as much reward as possible within the available rounds. Since the parameters involved in the dependence of the mean reward on the context are unknown, the learner has to handle a trade-off between exploration (e.g., choosing a new treatment with possibly higher effectiveness) and exploitation (e.g., choosing the best known treatment) at each round.
While this model has been usefully applied in many areas, it is subject to two major limitations. First, it neglects the phenomenon of “reneging” that is common in real-world applications. Reneging here refers to the behavior of users cutting ties with the learner after an unsatisfactory experience and desisting from any future interactions. This is also referred to as “churn”, “disengagement”, “abandonment”, or “unsubscribing”. Since, as is well known, the acquisition cost for new users is much higher than the retention cost for existing users, handling reneging plays a critical role in business success. For instance, in clinical trials, a patient dissatisfied with the effectiveness of a treatment quits all further trials. Search services face a similar problem; users may never again use a search engine after it returns results regarded as irrelevant. Another example is online advertising, where users stop clicking on any future advertisements after the pursuit of one or more delivered advertisements leads to a loss of trust in the advertiser. Similar concerns are found in notification systems employed by content creators, where there is value in sending more e-mail notifications, but each e-mail also risks the user disabling the notification functionality, permanently eliminating any opportunity for the creator to interact with the user in the future.
Second, previous studies have usually assumed that rewards are generated from an underlying reward distribution that is homoscedastic, i.e., its variance is independent of contexts. Unfortunately, this assumption is violated by the “heteroscedasticity” present in many real-world datasets, so learning algorithms based on it leave room for improvement. Examples abound in financial applications such as portfolio selection for hedge funds. In online advertising or notification systems, the click-through rate can vary among users due to differences in their spare time. Users with more spare time tend to be more tolerant of advertisements/notifications and may continue to click on them, while users with little spare time will in most cases ignore them.
We propose a novel model of contextual bandits that addresses the challenges arising from reneging risk and heteroscedasticity. We call the model “heteroscedastic bandits with reneging.” In our model, at each round of interaction with a user, the learner observes a collection of contexts, each drawn from the context set. After observing the contexts, the learner selects an action and receives a reward drawn from a reward distribution. To model heteroscedasticity, we allow both the mean and the variance of the reward distribution to depend on the context of the selected action. To model the reneging risk, we suppose that each user has a satisfaction level, also called the acceptance level. If the observed reward is below that level, the user quits all future interactions; otherwise, the user stays. We assume that the satisfaction level for each user is fixed beforehand and does not depend on the decision of the learner. Under this model, the reneging risk associated with an action for a user is the probability that the observed reward is below the user's acceptance level. The parameters of the mean and variance functions are unknown and need to be learned on the fly.
Three key challenges arise in finding the optimal policy for heteroscedastic bandits with reneging. First, to estimate the unknown variance function, we have to construct a satisfactory estimator and the corresponding confidence interval. Since in statistics, there is usually no explicit way to represent the confidence interval for variance estimation, establishing regret bounds for upper confidence bound (UCB) algorithms becomes difficult. Second, the presence of reneging makes estimation of unknown functions more difficult. Each round has a non-zero probability of being the last round, and so some user-arm pair may be pulled. As a result, the conventional definition of regret needs to be modified. Moreover, since the mean and variance depend on the context, the reward distributions to be learned for one user are different from those for another user. How to transfer the knowledge accumulated on one user to another user has to be carefully handled. Third, the optimal policy needs to handle the issue of exploration vs. exploitation in terms of both rewards and risk. Intuitively, a good policy should prefer actions with high expected return and low reneging risks. This becomes difficult when there are arms that have high expected return and high risk. This work focuses on developing optimal learning algorithms that address the above challenges.
Seminal studies on contextual bandits consider linear contextual bandits [1, 10, 4], assuming that the expected reward is a linear function on contexts. Although these models have been shown to be useful in some areas, they do not address reneging and heteroscedasticity. Reneging can be handled as risk to be avoided or controlled. The risk in bandit problems has been studied for variance minimization  and value-at-risk maximization [21, 8, 9], and guarantees provided that outperform baselines [14, 24]. However, the risks those studies handle are different from those we are motivated by, and their models cannot be used to solve the problems of interest here. The risks they handle usually have no impact on lifetimes of bandits. Their approaches encode the consideration of risk in statistics and put them in objective functions, while in our problem, the reneging risk comes from the probability that the observed reward is below an acceptance level. Moreover, their models are restricted to homoscedastic datasets, while our model is applicable to both heteroscedastic and homoscedastic datasets. The acceptance level in our formulation has a flavor of thresholding bandits [2, 17, 12, 19]. However, the latter is based on a very different setting and assumes the distribution is context independent and homoscedastic (a more careful review and comparison are given in Section 2).
Contributions. Our research contributions can be summarized as follows:
Reward heteroscedasticity and reneging risk are common in real-world applications but not taken into account in existing bandit models. We formulate a novel model, dubbed “heteroscedastic bandits with reneging.” To the best of our knowledge, this paper is the first to address them in a bandit model.
To solve the proposed model, we develop a UCB-type policy, called HR-UCB, that is proved to achieve a regret bound with high probability. Although the proposed solution mainly applies to heteroscedastic bandits with reneging, the techniques employed here to handle heteroscedasticity can be used to solve bandits that are sensitive to variance, e.g., risk-averse bandits, thresholding bandits etc.
Contextual bandits, as an approach to solve sequential decision problems with side observations (contexts) and user heterogeneity, have attracted considerable research attention recently. The most well known studies are of linear contextual bandits [1, 10, 4], where it is assumed that the expected reward is a linear function of context, an assumption also made in this paper. Although previous studies of contextual bandits have been useful in many areas, they are subject to two major limitations. First, they neglect user reneging that is commonly found in real-world applications, e.g., search engines and online advertising. That is, a user not satisfied with one interaction drops out forever from any future interactions. Handling it appropriately has therefore been regarded by many real-world practitioners as key to their long-term viability and success [13, 3]. Second, it is usually assumed that the reward distribution is homoscedastic in contexts, which is often invalidated by real-world datasets, e.g., datasets from finance-related applications. When the reward distribution is allowed to be context-dependent, the assumption that only the mean of the distribution depends on context restricts the applicability of those models. So motivated, in this paper we propose a novel model of contextual bandits. Differing from previous works, our model allows each user to have a distinct acceptance level, with interactions falling below it resulting in the user reneging. Moreover, our model allows the variance also to be a function of context. Modeling reneging and heteroscedasticity in contextual bandits are the salient features of this paper. Compared to conventional contextual bandits, both the function for variance and for mean need to be learned in our model; in addition, reneging aborts future interactions and makes the learning task more complex. Moreover, diverse reward distributions make the avoidance of reneging more difficult.
The objective of our paper is to propose an optimal policy that attacks those challenges. As far as we are aware, our model is the first one that addresses the two issues and achieves optimal regret.
There are two main lines of research related to our work: bandits with risk and thresholding bandits.
Bandits with Risk.
Reneging can be viewed as a type of risk that the learner tries to avoid or control. The risk in bandit problems has been studied in terms of variance, quantiles, and guarantees that outperform baselines. In one line of work and its many follow-ups, mean-variance models to handle return (reward) and risk (variability) are studied, where the objective to be maximized is a linear combination of mean reward and variance. Subsequent studies [21, 8] propose a quantile (value at risk) to replace rewards and variance in evaluating which arm to select. In contrast to these works, [14, 24] control the risk by requiring that the accumulated rewards while learning the optimal policy be above those of baselines. Similarly, in another study, each arm is associated with some risk; safety is guaranteed by requiring the accumulated risk to be below a given budget. Although these studies investigate optimal policies under risk, the risks they handle are different from ours and their models cannot be used to solve our problem. The risks they handle usually have no impact on the lifetime of bandits. Their approaches to handling risk are based on more straightforward statistics, while in our problem the reneging risk is relatively complex, i.e., it comes from the probability that the observed reward is below an acceptance level. Moreover, their models assume homoscedasticity, while we allow the variance to depend on the context.
Thresholding Bandits. The acceptance level in our model has the flavor of thresholding bandits. However, the thresholds in the existing literature differ from our perspective. In one formulation, the action receives a unit payoff in the event that the sampled reward exceeds a threshold. In another, the objective is to find the set of arms whose means are above a given threshold up to a precision. In a third, the threshold is used to trigger a one-shot reward, i.e., for an arm, no rewards can be collected until the total number of successes exceeds the threshold, but once a reward is collected, the arm is removed from the interaction. The problem most similar to ours has also been studied previously. However, it has a very different setting and assumes that the distribution is context independent and homoscedastic. In that work, each arm is represented by a real number; users may abandon the program whenever the pulled arm exceeds a threshold, which measures user tolerance capability. By comparison, we consider a contextual bandit model; we allow the reward distribution to be heteroscedastic; and we capture the reneging through a probability.
As far as we are aware, only one very recent paper discusses bandits under heteroscedasticity. Compared to it, our paper has two salient differences. First, we discuss heteroscedasticity in the presence of reneging. The presence of reneging makes the learning problem more challenging, as the learner must always be prepared for the possibility that plans for the future cannot be carried out. Second, the solution in that paper is based on information directed sampling. In contrast, we present in this paper a heteroscedastic UCB policy that is efficient, easier to implement, and achieves sub-linear regret.
In heteroscedastic bandits with reneging, since the interaction with one user is often aborted after a finite number of rounds with new users joining in the interactions afterwards, we index users by their order of interaction and conduct a regret analysis in terms of the total number of interacting users. Let be the number of users, who are indexed by . Let be the context set, where denotes the -norm. At each round for user , the learner observes a set of contexts . After observing the contexts, the learner selects an action and receives a random reward drawn from a reward distribution that satisfies:
denotes the Gaussian distribution with zero mean and variance. For the mean of the reward distribution we operate under the linear realizability assumption: that is there is an unknown with so that
for all and . For the variance of the reward distribution, heteroscedasticity is taken into account through a function
where is known and is required to be nonnegative, strictly increasing, and bi-Lipschitz continuous, i.e. there exists a constant with such that , for all . For example, we can choose or . The parameter vector with is unknown and will be learned during interactions. Since is bounded over all possible and , we know that is also bounded, i.e. for some , for all and defined above. This also implies that is -sub-Gaussian, for all .
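The reward model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `theta`, `phi`, and `variance_link`, the offset link f(z) = z + 1, and all numeric values are hypothetical and are not taken from the paper.

```python
import math
import random

def variance_link(z, offset=1.0):
    # Hypothetical link f(z) = z + offset. It is nonnegative and strictly
    # increasing whenever z > -offset; an illustrative choice only.
    return z + offset

def sample_reward(x, theta, phi, rng):
    # The mean is linear in the context; the variance is the link applied
    # to a second linear function of the same context.
    mean = sum(t * xi for t, xi in zip(theta, x))
    var = variance_link(sum(p * xi for p, xi in zip(phi, x)))
    return mean + rng.gauss(0.0, math.sqrt(var))

def empirical_variance(x, theta, phi, rng, n=20000):
    draws = [sample_reward(x, theta, phi, rng) for _ in range(n)]
    m = sum(draws) / n
    return sum((d - m) ** 2 for d in draws) / n

rng = random.Random(0)
theta, phi = [0.6, 0.2], [0.5, -0.3]       # assumed parameters
low_var_x, high_var_x = [-0.6, 0.6], [0.6, -0.6]  # both have norm <= 1

# True variances are 0.52 and 1.48 under the assumed link and parameters.
var_low = empirical_variance(low_var_x, theta, phi, rng)
var_high = empirical_variance(high_var_x, theta, phi, rng)
print(var_low, var_high)
```

The two contexts have the same norm but different reward variances, which is exactly what a homoscedastic model cannot represent.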
The minimal expectation in an interaction of a user is characterized by its acceptance level. Denote by the acceptance level of user . We assume that acceptance levels of users, like their context, are available before interacting with them. Denote by the observed reward for user at round . When is below , reneging occurs and the user drops out from any future interaction. Suppose that at round , arm is selected for user , then the risk that reneging occurs is
where is the cumulative distribution function (CDF) for . Without loss of generality, we also assume that is lower bounded by for some . Let be the stopping time that denotes the first time that is below the acceptance level,
A policy is a rule for selecting an arm at each round of a user based on the preceding interactions with that user and other users, where denotes the set of all admissible policies. In fact, the stopping time also depends on the policy that is used, so we use to represent the stopping time of user operating under policy . Let denote the sequence of contexts that correspond to the actions of user under policy . Let be the expected reward of user under the action sequence . Then we have
where is the probability of the event that the user stays for at least rounds. Then the total expected reward collected from users can be represented by
We are ready to define the pseudo-regret of the heteroscedastic bandits with reneging as
where is the optimal policy in terms of pseudo-regret among admissible policies, i.e.,
The objective of the learner is to learn a policy that achieves as minimal a regret as possible.
Illustrative examples for heteroscedasticity and reneging risk are shown in Figure 1. In Figure 1(a), the variance of the reward distribution gradually increases as the value of the one-dimensional context increases. Although the mean of the reward distribution still follows the conventional formulation of being a linear function of context, and thus the ordinary least squares estimator is still unbiased, the context-dependent variance makes the standard error estimates biased and invalidates the method usually used to construct the confidence bounds. Each user-arm pair corresponds to a distribution with distinct mean and variance. Moreover, the presence of reneging risk makes every observation have a probability of being the last one, which makes the learning task more challenging. Intuitively, the optimal policy prefers the distribution that has a large mean and a low reneging risk. Unfortunately, it is nontrivial to follow that intuition in optimal policy construction. Figure 1(b) shows two reward distributions, each with its own mean and variance. The two correspond to the same user, but to different arms. Thus they have the same acceptance level. A learner may prefer pulling the first distribution as its mean reward is higher than that of the second. However, since its variance is also higher, its reneging risk (the blue shaded area) is higher than that of the second (the red shaded area) as well. When considering which arm to pull, the learner faces an additional dilemma (beyond the exploration vs. exploitation dilemma) of choosing between receiving a higher reward for one pull and staying longer to collect more future rewards. This makes the model distinct and especially difficult to solve.
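The two-arm situation in Figure 1(b) can be reproduced numerically. The sketch below assumes Gaussian rewards, as in the model above; the means, variances, and acceptance level are made-up values chosen only to show an arm with a higher mean that also has a higher reneging risk.

```python
import math

def gaussian_cdf(z):
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def reneging_risk(mean, variance, acceptance_level):
    # P(reward < acceptance_level) for reward ~ N(mean, variance).
    return gaussian_cdf((acceptance_level - mean) / math.sqrt(variance))

acceptance_level = 0.0  # same user, hence the same level for both arms
risk_high_mean = reneging_risk(mean=1.0, variance=4.0,
                               acceptance_level=acceptance_level)
risk_low_mean = reneging_risk(mean=0.8, variance=0.25,
                              acceptance_level=acceptance_level)
print(risk_high_mean, risk_low_mean)
```

With these illustrative numbers the higher-mean arm carries a reneging risk of roughly 0.31 per round, against roughly 0.05 for the lower-mean arm.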
In this section, we present a UCB-type algorithm for heteroscedastic bandits with reneging. We start by introducing general results on heteroscedastic regression.
In this section, we consider a general regression problem with heteroscedasticity.
With a slight abuse of notation, let be a collection of pairs of context and reward realization that are collected sequentially. Recall from (1)-(3) that and with unknown parameters and . Note that given the contexts , are mutually independent. Let and be the row vectors of the reward realizations and the deviations from the mean reward, respectively. Let be an matrix in which the -th row is , for all . We use to denote the estimators of and based on the observations , respectively. Moreover, define the estimated deviation with respect to as
where is some regularization parameter and is the pre-image of the vector .
Note that the estimator in (13) is the conventional ridge regression estimator. On the other hand, the estimator in (14) still follows the ridge regression approach, but with two additional steps: (i) derive the estimated deviation based on the estimator in (13), and (ii) apply the inverse of the variance link function to the square of the estimated deviation. It is known that the estimator defined in (14) has some nice asymptotic properties (e.g. Chapter 8.2 of ). However, it remains unknown how to obtain non-asymptotic results regarding the confidence set for . This question will be answered rigorously in Section 4.1.2.
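The two-step procedure just described (ridge regression for the mean parameters, then ridge regression on the inverse link of the squared residuals) can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function names, the link f(z) = z + 1, the regularization value, and the synthetic data are all assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def two_step_estimate(X, r, lam, link_inverse):
    theta_hat = ridge(X, r, lam)            # mean parameters
    residual = r - X @ theta_hat            # (i) estimated deviations
    target = link_inverse(residual ** 2)    # (ii) inverse link of the squares
    phi_hat = ridge(X, target, lam)         # ridge regression once more
    return theta_hat, phi_hat

# Synthetic check with a hypothetical link f(z) = z + 1, so f^{-1}(y) = y - 1.
rng = np.random.default_rng(0)
n = 20000
X = rng.uniform(-1.0, 1.0, size=(n, 2)) / np.sqrt(2.0)  # rows have norm <= 1
theta_true = np.array([0.6, 0.2])
phi_true = np.array([0.5, -0.3])
noise_std = np.sqrt(X @ phi_true + 1.0)     # variance stays positive here
r = X @ theta_true + noise_std * rng.standard_normal(n)

theta_hat, phi_hat = two_step_estimate(
    X, r, lam=1.0, link_inverse=lambda y: y - 1.0
)
print(theta_hat, phi_hat)
```

On this synthetic data both estimates land close to the assumed true parameters. The sketch says nothing about finite-sample confidence sets, which is the question the text goes on to address.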
In this section, we discuss the confidence sets for the estimators and described above. To simplify notation, we define a matrix as
A confidence set for was introduced in . For convenience, we restate the results in the following lemma.
(Theorem 2 in ) For all , define
For any , with probability at least , for all , we have
where is the induced vector norm of vector with respect to .
Next, we derive the confidence set for . Define
where and are some universal constants that will be described in Lemma 3. The following is the main theorem on the confidence set for .
For all , define
For any , with probability at least , for all , we have
To demonstrate the main idea behind Theorem 1, we highlight the proof procedure in the following Lemma 2-5. First, to quantify the difference between and , we start by considering the inner product of an arbitrary vector and in the following lemma.
For any , we have
The proof is provided in Appendix 6.1.
For any , for any , with probability at least , we have
We highlight the main idea of the proof. Recall that . Therefore, is a χ²-distribution with a scaling of . Hence, each element in has zero mean. Moreover, we observe that is quadratic. Since the χ²-distribution is sub-exponential, we utilize a proper tail inequality for quadratic forms of sub-exponential distributions to derive an upper bound. The complete proof is provided in Appendix 6.2.
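The zero-mean property used in this argument follows from standard facts about squared Gaussians. As a short worked statement, writing $\varepsilon$ for a generic deviation and $\sigma^2$ for its variance (placeholder symbols, since the paper's own notation is not recoverable here):

```latex
\varepsilon \sim \mathcal{N}(0, \sigma^2)
\;\Longrightarrow\;
\frac{\varepsilon^2}{\sigma^2} \sim \chi^2_1,
\qquad
\mathbb{E}\big[\varepsilon^2\big] = \sigma^2,
\qquad
\operatorname{Var}\big(\varepsilon^2\big) = 2\sigma^4,
\qquad
\mathbb{E}\big[\varepsilon^2 - \sigma^2\big] = 0 .
```

So the squared deviation, centered by its variance, is a zero-mean scaled $\chi^2_1$ variable, which is sub-exponential rather than sub-Gaussian.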
Next, we derive an upper bound for (24).
For any , for any , with probability at least , we have
The main challenge is that (27) involves the product of the deviation and the estimation error . Through some manipulation, we can decouple from and apply a proper tail inequality for quadratic forms of sub-Gaussian distributions. The complete proof is provided in Appendix 6.3.
Next, we provide an upper bound for (25).
For any , for any , with probability at least , we have
Now we are ready to put all the above together and prove Theorem 1.
to denote the smallest eigenvalue of a square symmetric matrix. Recall that is positive definite for all . Then we have
Since , we know that for a given and a given , with probability at least ,
Finally, to obtain a uniform bound, we simply choose and apply the union bound to (32) over all . Note that . Therefore, we conclude that with probability at least , for all ,
The proof is complete.
In this section, we formally introduce the proposed UCB policy based on the heteroscedastic regression discussed in Section 4.1.
In this section, we consider a policy which has access to an oracle with full knowledge of and . Consider users that arrive sequentially. Let be the sequence of contexts that correspond to the actions for the user under an oracle policy . The oracle policy is constructed by choosing
for each . Due to the construction in (34), we know that achieves the largest possible expected reward for each user , and is hence optimal in terms of pseudo-regret defined in Section 3. Based on (8) and (34), by using a one-step optimality argument, it is easy to verify that is a fixed policy for each user , i.e. , for all . Let denote the total expected reward of user under . We have
Next, we derive a useful property regarding (35). For any given , define the function as
Note that for any given , equals the total expected reward of a single user with threshold if a fixed action with context is chosen under parameters . We show that has the following nice property.
Let be an invertible matrix. For any with , , for any with , , for any , for any ,
where and are some finite positive constants that are independent of and .
The main idea is to apply first-order approximation under Lipschitz continuity of and . The detailed proof is provided in Appendix 6.5.
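To make the total expected reward of a user under a fixed action concrete, the sketch below compares a closed form with a simulation. It assumes Gaussian rewards, that the reward of the round in which the user reneges is still collected, and that the per-round reneging probability is constant under a fixed action, so the lifetime is geometric and the expected total is the mean divided by the reneging probability. Function names and numeric values are illustrative only.

```python
import math
import random

def gaussian_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_lifetime_reward(mean, variance, acceptance_level):
    # Geometric lifetime with per-round reneging probability `risk`:
    # sum over rounds of mean * (1 - risk)^(s - 1) = mean / risk.
    risk = gaussian_cdf((acceptance_level - mean) / math.sqrt(variance))
    return mean / risk

def simulate_lifetime_reward(mean, variance, acceptance_level, rng):
    # Play the same action until the observed reward drops below the level.
    total = 0.0
    while True:
        reward = rng.gauss(mean, math.sqrt(variance))
        total += reward  # assumption: the final reward is still collected
        if reward < acceptance_level:
            return total

rng = random.Random(0)
n_users = 20000
closed_form_risky = expected_lifetime_reward(1.0, 4.0, 0.0)
closed_form_safe = expected_lifetime_reward(0.8, 0.25, 0.0)
simulated_risky = sum(
    simulate_lifetime_reward(1.0, 4.0, 0.0, rng) for _ in range(n_users)
) / n_users
print(closed_form_risky, simulated_risky, closed_form_safe)
```

Under these made-up numbers the arm with the lower per-round mean yields a much larger expected lifetime reward, because its user stays far longer. This is the trade-off between immediate reward and lifetime that the oracle policy resolves.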
To begin with, we introduce an upper confidence bound based on the GLSE described in Section 4.1. Note that the results in Theorem 1 depend on the size of the set of context-reward pairs. Moreover, in our bandit model, the number of rounds of each user is a stopping time and can be arbitrarily large. To address this, we propose to actively maintain a regression sample set through a function . Specifically, we let the size of grow at a proper rate regulated by . One example is to choose for some constant . Since each user will play for at least one round, we know is at least after interacting with users. We use to denote the regression sample set right after the departure of user . Moreover, let be the matrix in which the rows are composed by the contexts of all the elements in . Similar to (15), we define , for all . To simplify notation, we also define
For any , we define the upper confidence bound as follows:
Next, we show that is indeed an upper confidence bound.
The proof is provided in Appendix 6.6.
Now, we formally introduce the HR-UCB algorithm. The complete algorithm is shown in Algorithm 1 and can be described in detail as follows:
After applying an action, HR-UCB observes the corresponding reward and the reneging event if any. The current context-reward pair will be added to only if the size of is less than .
Based on the regression sample set , HR-UCB updates the estimators and right after the departure of each user.
Note that under HR-UCB, the estimators and are updated right after the departure of each user (Line 1). Alternatively, and can be updated whenever is updated. While this alternative may make slightly better use of the observations, it also incurs more computational overhead. For ease of exposition, we still focus on the “lazy-update” version presented in Algorithm 1.
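The bookkeeping for the regression sample set can be sketched as follows. This covers only the capped growth of the set and the point at which a lazy update would happen; it does not implement the confidence bound or arm selection. The linear cap `growth_constant * user_index` is a hypothetical choice of growth function, and the data are placeholders.

```python
def admit_samples(sample_set, user_rounds, user_index, growth_constant=2):
    """Add this user's (context, reward) pairs while the set is under its cap."""
    cap = growth_constant * user_index  # hypothetical linear growth rule
    for pair in user_rounds:
        if len(sample_set) < cap:
            sample_set.append(pair)
    return sample_set

sample_set = []
sizes = []
# Placeholder interactions: three users who stay 5, 1, and 4 rounds.
users = [[("x", 0.1)] * 5, [("x", 0.4)] * 1, [("x", 0.2)] * 4]
for index, rounds in enumerate(users, start=1):
    admit_samples(sample_set, rounds, index)
    # Lazy update: the estimators would be re-fit here, once per departed user.
    sizes.append(len(sample_set))
print(sizes)
```

Capping the set keeps the number of regression samples tied to the number of users rather than to the unbounded number of rounds an individual user may stay.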
In this section, we provide the regret analysis for the proposed HR-UCB policy.
Under HR-UCB, with probability at least , the pseudo regret is upper bounded as
Moreover, by choosing for some constant , we have
The proof is provided in Appendix 6.7.
We briefly discuss the difference between our regret bound and the regret bounds of other related settings. Note that if the acceptance level for all , then all the users will quit after exactly one round. This corresponds to the conventional contextual bandits setting (e.g. homoscedastic case  and heteroscedastic case ). In this degenerate case, our regret bound is , which has an additional factor resulting from the heteroscedasticity with reneging.
In this paper, we have studied the challenges in bandit modeling that arise from heteroscedasticity and reneging. Most existing contextual bandit algorithms neglect both and therefore cannot be applied directly. These complications exist in many real-world applications, and taking them into account is economically important for business success. To attack the above challenges, we have formulated a heteroscedastic bandit model with reneging, where the user may quit future interactions if the reward falls below its acceptance level, and the variance of the reward distribution can depend on context. We have proposed a UCB-type policy, called HR-UCB, to solve this novel model, and proved that with high probability it achieves $\mathcal{O}(\sqrt{T}(\log T)^{3/2})$ regret. The techniques we developed to estimate heteroscedastic variance and to establish sub-linear regret in the presence of heteroscedasticity can be extended to other variance-sensitive bandit problems, such as risk-averse bandits and thresholding bandits.
International Conference on Machine Learning, pages 127–135, 2013.
Association for Uncertainty in Artificial Intelligence, 2018.
Recall that . Note that
Therefore, for any , we know
Moreover, by rewriting , we have
We first introduce the following useful lemmas.
Let be independent random complex variables with zero mean and variance and having the uniform sub-exponential decay, i.e. there exists such that
We use to denote the conjugate transpose of . Let , let denote the complex conjugate of , for all , and let be a complex matrix. Then, we have
where and are positive constants that depend only on . Moreover, for the standard χ²-distribution, and .
By the definition of induced matrix norm,