Contextual Bandit with Missing Rewards

by   Djallel Bouneffouf, et al.

We consider a novel variant of the contextual bandit problem (i.e., the multi-armed bandit with side-information, or context, available to a decision-maker) where the reward associated with each context-based decision may not always be observed ("missing rewards"). This new problem is motivated by certain online settings, including clinical trial and ad recommendation applications. In order to address the missing rewards setting, we propose to combine the standard contextual bandit approach with an unsupervised learning mechanism such as clustering. Unlike standard contextual bandit methods, by leveraging clustering to estimate missing rewards, we are able to learn from each incoming event, even those with missing rewards. Promising empirical results are obtained on several real-life datasets.


1 Introduction

Sequential decision making is a common problem in many practical applications, broadly encompassing situations in which an agent must choose the best action to perform at each iteration while maximizing cumulative reward over some period of time BouneffoufBG13; ChoromanskaCKLR19; RiemerKBF19; LinC0RR20; lin2020online; lin2020unified; NoothigattuBMCM19; lin2019split; lin2020story. One of the key challenges in sequential decision making is to achieve a good trade-off between the exploration of new actions and the exploitation of known actions. This exploration vs. exploitation trade-off is often formulated as the multi-armed bandit (MAB) problem. In the MAB problem setting, given a set of bandit “arms” (actions), each associated with a fixed but unknown reward probability distribution surveyDB; LR85; UCB; Bouneffouf0SW19; LinBCR18; DB2019; BalakrishnanBMR19ibm; BouneffoufLUFA14; RLbd2018; balakrishnan2020constrained; BouneffoufRCF17, the agent selects an arm to play at each iteration and receives a reward drawn according to the selected arm’s distribution, independently of the previous actions.

A particularly useful version of MAB is the contextual multi-armed bandit (CMAB), or simply the contextual bandit problem, where at each iteration the agent observes a $d$-dimensional context, or feature vector, prior to choosing an arm AuerC98; AuerCFS02; BalakrishnanBMR18; BouneffoufBG12. Over time, the goal is to learn the relationship between the context vectors and rewards, in order to make better action choices given the context AgrawalG13. Common sequential decision making problems with side information (context) that utilize the contextual bandit approach range from clinical trials villar2015multi to recommender systems MaryGP15; Bouneffouf16; aaai0G20, where the patient’s information (medical history, etc.) or an online user profile provides a context for making better decisions about which treatment to propose or which ad to show. The reward reflects the outcome of the selected action, such as success or failure of a particular treatment option, or whether an ad is clicked or not.

In this paper we consider a new problem setting referred to as contextual bandit with missing rewards, where the agent can always observe the context but may not always observe the reward. This setting is motivated by several real-life applications where the reward associated with a selected action can be missing, or unobservable by the agent, for various reasons. For instance, in medical decision making settings, a doctor can decide on a specific treatment option for a patient, but the patient may not come back for follow-up appointments; though the reward feedback regarding the treatment success is missing, the context, in this case the patient’s medical record, is still available and can potentially be used to learn more about the patient population. Missing rewards can also occur in information retrieval or online search settings where a user enters a search request but, for various reasons, may not click on any of the suggested website links, and thus the reward feedback about those choices is missing. Yet another example is online advertising, where a user clicking on a proposed ad represents a positive reward, but the absence of a click can be either a negative reward (the user did not like the ad) or a consequence of a bug or connection loss.

The contextual bandit with missing rewards framework proposed here aims to capture the situations described above, and provide an approach to exploit all context information for future decision making, even if some rewards are missing. More specifically, we will combine unsupervised online clustering with the standard contextual bandit. Online clustering allows us to learn representations of all the context vectors, with or without the observed rewards. Utilizing the contextual bandit on top of clustering makes use of the reward information when it is available. We demonstrate on several real-life datasets that this approach consistently outperforms the standard contextual bandit approach when rewards are missing.

2 Related Work

The multi-armed bandit problem provides a framework for studying the exploration versus exploitation trade-off AllesiardoFB14; dj2020; Sohini2019. This problem has been extensively studied. Optimal solutions have been provided using a stochastic formulation LR85; UCB; BouneffoufF16, a Bayesian formulation T33, and an adversarial formulation AuerC98; AuerCFS02. However, these approaches do not take into account the relationship between context and reward, potentially inhibiting overall performance. In LINUCB Li2010; ChuLRS11 and in Contextual Thompson Sampling (CTS) AgrawalG13, the authors assume a linear dependency between the expected reward of an action and its context; the representation space is modeled using a set of linear predictors. However, these algorithms assume that the bandit can observe the reward at each iteration, which is not the case in many practical applications, including those discussed earlier in this paper. The authors of bartok2014partial considered a kind of incomplete feedback called "Partial Monitoring (PM)", developing a general framework for sequential decision making problems with incomplete feedback. The framework allows the learner to retrieve the expected value of actions through an analysis of the feedback matrix when possible, assuming both the loss and feedback matrices are known to the learner.

In bouneffouf2020online, the authors study a variant of the stochastic multi-armed bandit (MAB) problem in which the contexts are corrupted. That problem is motivated by certain online settings, including clinical trial and ad recommendation applications. In order to address the corrupted-context setting, the authors propose to combine the standard contextual bandit approach with a classical multi-armed bandit mechanism. Unlike standard contextual bandit methods, they are able to learn from all iterations, even those with corrupted context, by improving the computation of the expectation for each arm. Promising empirical results are obtained on several real-life datasets.

In this paper we focus on handling incomplete feedback in the bandit problem setting more generally, without assuming the existence of a systematic corruption process. Our work is somewhat comparable to online semi-supervised learning Yver2009; ororbia2015online, a field of machine learning that studies learning from both labeled and unlabeled examples in an online setting. However, in online semi-supervised learning, whenever feedback is given it is the true label, whereas in the contextual bandit with missing rewards only bandit feedback is available, and the true label, or best action, is unknown.

3 Problem Setting

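Algorithm 1 presents the contextual bandit setting at a high level, where $x_t \in C$ (we will assume here $C = \mathbb{R}^d$) is a vector describing the context at time $t$, $r_{t,k} \in [0,1]$ is the reward of action $k$ at time $t$, and $r_t \in [0,1]^K$ denotes the vector of rewards for all arms at time $t$. Also, $D_{x,r}$ denotes a joint probability distribution over $(x, r)$, $A$ denotes a set of $K$ actions, $A = \{1, \dots, K\}$, and $\pi: C \to A$ denotes a policy. We operate under the linear realizability assumption; that is, there exists an unknown weight vector $\theta_k^* \in \mathbb{R}^d$ with $\|\theta_k^*\|_2 \le 1$ so that,

$$\forall k, t: \quad \mathbb{E}[r_{t,k} \mid x_t] = \theta_k^{*\top} x_t,$$

where $\theta_k^*$ is an unknown coefficient vector associated with arm $k$, which needs to be learned from data. Hence, we assume that the $r_{t,k}$ are independent random variables with expectation $x_t^\top \theta_k^*$, observed with some measurement noise $\epsilon_t$. We also assume here that the measurement noise $\epsilon_t$ is independent of everything and is $\sigma$-sub-Gaussian for some $\sigma > 0$, i.e., $\mathbb{E}[e^{\lambda \epsilon_t}] \le \exp(\lambda^2 \sigma^2 / 2)$ for all $\lambda \in \mathbb{R}$.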

Definition 1 (Cumulative regret).

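The regret of an algorithm accumulated during $T$ iterations is given as:

$$R(T) = \sum_{t=1}^{T} r_{t,k^*(t)} - \sum_{t=1}^{T} r_{t,k(t)},$$

where $k(t)$ is the action chosen at step $t$ and $k^*(t) = \arg\max_k r_{t,k}$ is the best action at step $t$ according to $r_t$.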

Repeat
   $(x_t, r_t)$ is drawn according to $D_{x,r}$
   $x_t$ is revealed to the player
   The player chooses an action $k(t) = \pi_t(x_t)$
   The reward $r_{t,k(t)}$ is revealed
   The player updates its policy $\pi_t$
   $t = t + 1$
Until $t = T$
Algorithm 1 Contextual Bandit
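
The interaction protocol of Algorithm 1 can be illustrated with a minimal Python sketch. The names `simulate`, `choose`, and `update` are hypothetical, and the uniform contexts and Gaussian noise with standard deviation 0.1 are assumptions made only for the simulation. Regret is accumulated in expectation here (the gap between the best and the chosen mean reward), which is one common way to measure it and may differ in detail from Definition 1.

```python
import random

def simulate(choose, update, thetas, horizon, seed=0):
    """Run the contextual bandit protocol and return cumulative expected regret.

    choose(x) -> arm index; update(x, arm, reward) -> None.
    thetas holds one (simulated, normally unknown) weight vector per arm.
    """
    rng = random.Random(seed)
    dim = len(thetas[0])
    regret = 0.0
    for _ in range(horizon):
        # The context is drawn and revealed to the player.
        x = [rng.uniform(0.0, 1.0) for _ in range(dim)]
        # Linear realizability: the mean reward of arm k is theta_k . x
        means = [sum(w * xi for w, xi in zip(theta, x)) for theta in thetas]
        arm = choose(x)
        # Only the reward of the chosen arm is revealed (bandit feedback).
        reward = means[arm] + rng.gauss(0.0, 0.1)
        update(x, arm, reward)
        regret += max(means) - means[arm]
    return regret

# Usage: a policy that always plays arm 0, with no learning.
thetas = [[1.0, 0.0], [0.0, 1.0]]
fixed_arm_regret = simulate(lambda x: 0, lambda x, k, r: None, thetas, horizon=200)
```

A learning policy such as LINUCB would plug into `choose` and `update`; its regret should grow more slowly than that of the fixed-arm baseline.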

4 LINUCB with Missing Rewards (MLINUCB)

One solution for the contextual bandit is the LINUCB algorithm Li2010, where the key idea is to apply online ridge regression to incoming data to obtain an estimate of the coefficients $\theta_k$. In order to make use of the context even in the absence of the corresponding reward, we propose to use an unsupervised learning approach; specifically, we use an online clustering step to retrieve missing rewards from available rewards with similar contexts. At each time step, the context vectors are clustered into $N$ clusters, where $N$ is selected a priori.

We adapt the LINUCB algorithm for our setting, proposing to use a clustering step for imputing the reward data when missing. At each time step $t$, we perform a clustering step on the context vectors $x_1, \dots, x_t$, where the total number of clusters $N$ is a hyperparameter. For each cluster $j$ we define the average reward for each arm $k$ as below:

$$\bar{r}_j(k) = \frac{1}{n_j} \sum_{i=1}^{n_j} r_{j_i}(k),$$

where $r_{j_i}(k)$ is the reward recorded for arm $k$ at the $i$-th data point of cluster $j$.

Let $\rho_j = \rho(x_t, c_j)$ be the distance under the metric used for clustering, where $c_j$ is the centroid of cluster $j$ and $n_j$ is the number of data points in cluster $j$. We choose the $m$ smallest $\rho_j$ as the closest clusters to $x_t$ and compute a weighted average of the average cluster rewards as formulated below:

$$g(x_t) = \frac{\sum_{j=1}^{m} \bar{r}_j(k) / \rho_j}{\sum_{j=1}^{m} 1 / \rho_j},$$

where the sums run over the $m$ closest clusters and $k$ is the chosen arm.

When $r_t$ is missing we assign $r_t = g(x_t)$. Note that if $m = 1$, $g(x_t)$ is simply the average reward of all the points within the cluster that $x_t$ belongs to.
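
The imputation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `impute_reward` and its parameters are hypothetical, the distance is Euclidean, and the weights are taken to be inverse distances to the $m$ nearest centroids, with a small `eps` added to avoid division by zero.

```python
import math

def impute_reward(x, centroids, cluster_mean_reward, m=1, eps=1e-12):
    """Estimate a missing reward from the m clusters nearest to context x.

    centroids: list of centroid vectors, one per cluster.
    cluster_mean_reward: average observed reward per cluster (for the chosen arm).
    Assumption: inverse-distance weights; eps guards against a zero distance.
    """
    dists = [math.dist(x, c) for c in centroids]
    # Indices of the m closest clusters, nearest first.
    nearest = sorted(range(len(centroids)), key=dists.__getitem__)[:m]
    weights = [1.0 / (dists[j] + eps) for j in nearest]
    weighted_sum = sum(w * cluster_mean_reward[j] for w, j in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Usage: with m=1 the estimate is the mean reward of the nearest cluster.
centroids = [[0.0, 0.0], [10.0, 10.0]]
mean_rewards = [0.2, 0.9]
single_cluster_estimate = impute_reward([1.0, 1.0], centroids, mean_rewards, m=1)
two_cluster_estimate = impute_reward([1.0, 1.0], centroids, mean_rewards, m=2)
```

With `m=2` the estimate is pulled toward the second cluster's average, but only slightly, because that cluster is nine times farther away.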

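Input: value for $\alpha$, $A_0$, $b_0$, $N$, $m$
for $t = 1$ to $T$ do
   cluster $\{x_1, \dots, x_t\}$ into $N$ clusters
   for all $k \in \{1, \dots, K\}$ do
      $\theta_k \leftarrow A_k^{-1} b_k$
      $p_{t,k} \leftarrow \theta_k^\top x_t + \alpha \sqrt{x_t^\top A_k^{-1} x_t}$
   end for
   Choose arm $k(t) = \arg\max_k p_{t,k}$, and observe real-valued payoff $r_t$
   if $r_t$ available then
      retrieve $r_t$ from data
   else
      $r_t = g(x_t)$
   end if
   $A_{k(t)} \leftarrow A_{k(t)} + x_t x_t^\top$
   $b_{k(t)} \leftarrow b_{k(t)} + r_t x_t$
end for
Algorithm 2 MLINUCB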
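
To make the flow of Algorithm 2 concrete, here is a minimal pure-Python sketch, not the authors' implementation. Several simplifications are assumptions: the class name `MLinUCBSketch` is hypothetical, the centroids are fixed rather than re-estimated online, only the single nearest cluster is used ($m = 1$), only observed rewards feed the cluster averages, and an update is skipped when a reward is missing and the cluster has no observed reward for that arm yet. The inverse of each per-arm matrix is maintained with the Sherman-Morrison formula instead of being recomputed.

```python
import math

class MLinUCBSketch:
    def __init__(self, n_arms, dim, centroids, alpha=1.0):
        self.alpha = alpha
        # Per-arm ridge statistics: A_inv starts as the identity, b as zero.
        self.A_inv = [[[float(i == j) for j in range(dim)] for i in range(dim)]
                      for _ in range(n_arms)]
        self.b = [[0.0] * dim for _ in range(n_arms)]
        # Fixed centroids stand in for the online clustering step (assumption).
        self.centroids = centroids
        self.sum_r = [[0.0] * n_arms for _ in centroids]  # reward sums per cluster, arm
        self.cnt = [[0] * n_arms for _ in centroids]      # observed counts per cluster, arm

    @staticmethod
    def _matvec(M, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

    def choose(self, x):
        scores = []
        for A_inv, b in zip(self.A_inv, self.b):
            Ax = self._matvec(A_inv, x)       # A^{-1} x
            theta = self._matvec(A_inv, b)    # ridge estimate A^{-1} b
            mean = sum(t * xi for t, xi in zip(theta, x))
            width = math.sqrt(max(sum(a * xi for a, xi in zip(Ax, x)), 0.0))
            scores.append(mean + self.alpha * width)  # upper confidence bound
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, x, arm, reward=None):
        # Assign the context to its nearest centroid.
        j = min(range(len(self.centroids)),
                key=lambda c: math.dist(x, self.centroids[c]))
        if reward is None:
            if self.cnt[j][arm] == 0:
                return  # nothing to impute from yet; skip (assumption)
            reward = self.sum_r[j][arm] / self.cnt[j][arm]  # cluster average, m = 1
        else:
            self.sum_r[j][arm] += reward
            self.cnt[j][arm] += 1
        # Sherman-Morrison update of (A + x x^T)^{-1}; A_inv is symmetric.
        A_inv = self.A_inv[arm]
        Ax = self._matvec(A_inv, x)
        denom = 1.0 + sum(a * xi for a, xi in zip(Ax, x))
        for i in range(len(x)):
            for k in range(len(x)):
                A_inv[i][k] -= Ax[i] * Ax[k] / denom
        self.b[arm] = [bi + reward * xi for bi, xi in zip(self.b[arm], x)]

# Usage: one observed reward, then a missing reward for a similar context.
agent = MLinUCBSketch(n_arms=2, dim=2, centroids=[[0.0, 0.0], [5.0, 5.0]])
agent.update([1.0, 0.0], arm=0, reward=1.0)
agent.update([1.0, 0.0], arm=0, reward=None)  # imputed from the cluster average
```

The second call still updates the model for arm 0, which is the point of the method: a standard LINUCB implementation would have to discard that event.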

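We now upper bound the regret of MLINUCB. Note that the general contextual bandit problem (CBP) setting of abbasi2011improved takes one context per arm, whereas in our setting one context is shared by all actions. To bound the regret of our algorithm in the general CBP setting, we cast our setting as theirs in the following steps. We choose a global vector $\theta^*$ as the concatenation of the $K$ vectors $\theta_k^*$, so $\theta^* = (\theta_1^*, \dots, \theta_K^*)$. We define a context per action $x_{t,k} = (0, \dots, x_t, \dots, 0)$, where $x_{t,k} \in \mathbb{R}^{Kd}$ and $x_t$ is the $k$-th vector within the concatenation. The global quantities $A_t$, $b_t$, and $\hat{\theta}_t$ can be defined similarly from the per-arm $A_k$, $b_k$, and $\hat{\theta}_k$.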

Theorem 1.

With probability $1 - \delta$, where $0 < \delta < 1$, the upper bound on the regret $R(T)$ of MLINUCB in the contextual bandit problem with $K$ arms and $d$ features (context size) is given as follows:


with $S_T = I + \Sigma_T$, where $\Sigma_T$ is constructed from the set $s$ of contexts with missing rewards.

Theorem 1 shows that MLINUCB has a better upper bound than LINUCB abbasi2011improved: the LINUCB upper bound has $\log(\det(A_T))$ under the square root, where we have $\log(\det(A_T) / \det(S_T))$. The upper bound therefore depends on $\det(S_T)$, so the more contexts with missing rewards there are, the better the bound.

5 Regret analysis of BILINUCB

We now upper bound the regret of BILINUCB. The general Contextual Bandit Problem (CBP) setting of abbasi2011improved assumes one context per arm, whereas in the BILINUCB setting the same context is shared across arms. To upper bound the regret of BILINUCB we cast our setting as the general CBP setting in the following way. We choose a global vector $\theta^*$ as the concatenation of the $K$ vectors $\theta_k^*$, so $\theta^* = (\theta_1^*, \dots, \theta_K^*)$. Next, define a context per arm as $x_{t,k} = (0, \dots, x_t, \dots, 0)$, with $x_t$ being the $k$-th vector within the concatenation. Let $S_t = I + \Sigma_t$, where $\Sigma_t = \sum_{i \in s_t} x_i x_i^\top$ and $s_t$ contains the contexts with missing rewards up to step $t$. We have the following theorem regarding the regret bound up to step $T$.

Theorem 1 of the main text shows that BILINUCB has a better upper bound than LINUCB abbasi2011improved: the LINUCB upper bound has $\log(\det(A_T))$ under the square root, where we have $\log(\det(A_T) / \det(S_T))$. The matrix $S_T$ is the sum of the identity matrix $I$ and the covariance matrix $\Sigma_T$ constructed using the contexts with missing rewards. Both $I$ and $\Sigma_T$ are real symmetric and hence Hermitian matrices. Further, $\Sigma_T$ is positive semi-definite as a covariance matrix. Since all the eigenvalues of $I$ equal $1$ and all the eigenvalues of $\Sigma_T$ are non-negative, by Weyl’s inequality in matrix theory for perturbation of Hermitian matrices, the eigenvalues of $S_T$ are lower bounded by $1$. Hence $\det(S_T)$, which is the product of the eigenvalues of $S_T$, is lower bounded by $1$, so $\log(\det(A_T) / \det(S_T)) \le \log(\det(A_T))$. Hence BILINUCB, which involves the term $\log(\det(A_T) / \det(S_T))$, has a provably better guarantee than LINUCB, which involves only the term $\log(\det(A_T))$ (without $\det(S_T)$).
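
The eigenvalue argument can be checked numerically on a small example. The sketch below is only an illustration: it assumes two-dimensional contexts and takes the matrix built from the missing-reward contexts to be the sum of outer products $\sum x x^\top$, which is an assumption about its exact form. The helper names `gram` and `eig_sym_2x2` are hypothetical.

```python
import math

def gram(contexts):
    """Sum of outer products x x^T over the given context vectors."""
    d = len(contexts[0])
    G = [[0.0] * d for _ in range(d)]
    for x in contexts:
        for i in range(d):
            for j in range(d):
                G[i][j] += x[i] * x[j]
    return G

def eig_sym_2x2(M):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    half_trace = (M[0][0] + M[1][1]) / 2.0
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(max(half_trace ** 2 - det, 0.0))
    return half_trace - disc, half_trace + disc

# Contexts whose rewards were missing (made-up values for illustration).
missing_contexts = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
Sigma = gram(missing_contexts)
# S = I + Sigma
S = [[Sigma[i][j] + (1.0 if i == j else 0.0) for j in range(2)] for i in range(2)]

smallest, largest = eig_sym_2x2(S)
det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
# Every eigenvalue of S is at least 1, so det(S) >= 1 and log det(S) >= 0.
log_det_S = math.log(det_S)
```

Because `log_det_S` is non-negative, subtracting it from the log-determinant term can only shrink the quantity under the square root, which is the comparison made in the paragraph above.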

5.1 Proof of Theorem 1


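We need the following assumption: the noise introduced by the imputed reward is heteroscedastic. Formally, let $\varphi: C \to \mathbb{R}_{>0}$ be a continuous, positive function such that $\epsilon_t$ is conditionally $\varphi(x_t)$-subgaussian, that is, for all $t \ge 1$ and $\lambda \in \mathbb{R}$,

$$\mathbb{E}\left[ e^{\lambda \epsilon_t} \mid \mathcal{F}_{t-1}, x_t \right] \le \exp\left( \frac{\lambda^2 \varphi(x_t)^2}{2} \right), \qquad (4)$$

where $\mathcal{F}_{t-1}$ denotes the history up to step $t-1$. Note that this condition implies that the noise has zero mean; common examples include Gaussian, Rademacher, and uniform random variables. We need the following lemma.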

Lemma 1.

Assume that the measurement noise $\epsilon_t$ satisfies assumption (4). Then, with probability $1 - \delta$, where $0 < \delta < 1$, $\theta^*$ lies in the confidence ellipsoid.

The lemma is adapted from Theorem 2 in abbasi2011improved, using heteroscedastic noise. We follow the same steps of the proof; the main difference is that our bound additionally involves $S_T$, which is built from the contexts with missing rewards.

where is the parameter of the classical LINUCB, and then

Now we investigate and separately.

Following the same step as the proof of theorem 2 in abbasi2011improved we also have the following,

, and using Cauchy-Schwarz with , we get

and then,

Since for all then we have . Therefore,

Our bound on the imputed reward assures . Therefore,

we have,

, with monotonically increasing

since for ,

we have ,

here we also use the fact that we have to get the last inequality.

by upper bounding using lemma 1 we get our result. ∎

6 Experiments

In order to verify the proposed MLINUCB methodology, we ran the LINUCB and MLINUCB algorithms on four different datasets: three derived from the UCI Machine Learning Repository (Covertype, CNAE-9, and Internet Advertisements) and one external dataset, Warfarin. The Warfarin dataset concerns the dosage of the drug Warfarin, where each record consists of a context of patient information and the corresponding appropriate dosage, or action. The reward is then defined as 1 if the correct action is chosen and 0 otherwise. The details of each dataset are summarized in Table 1.

| Dataset | Instances | Features | Classes |
|---|---|---|---|
| Covertype | 500 000 | 95 | 7 |
| CNAE-9 | 1080 | 856 | 9 |
| Internet Advertisements | 3279 | 1558 | 2 |
| Warfarin | 5528 | 93 | 3 |

Table 1: Datasets

To evaluate the performance of MLINUCB and LINUCB we use an accuracy metric that checks whether the selected action equals the best action, which is revealed for the purposes of evaluation. Defined as such, accuracy decreases as regret increases. In the following experiments we fix $m = 1$ and use the mini-batch K-means algorithm for clustering. In Table 2, we report the total average accuracies of running LINUCB and MLINUCB with 2, 5, 10, 15, and 20 clusters on each dataset.
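
The evaluation protocol can be sketched as follows. This is an illustration under assumptions, not the authors' experimental code: the helper names `mask_rewards` and `accuracy` are hypothetical, and rewards are hidden independently at random with a fixed probability, which is one simple way to produce a given missing-reward rate.

```python
import random

def mask_rewards(rewards, missing_rate, seed=0):
    """Hide each reward independently with probability missing_rate (None = missing)."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else r for r in rewards]

def accuracy(chosen_actions, best_actions):
    """Fraction of steps at which the chosen action equals the best action."""
    hits = sum(c == b for c, b in zip(chosen_actions, best_actions))
    return hits / len(best_actions)

# Usage: hide about half of the rewards, then score a short run.
observed = mask_rewards([1, 0, 1, 1, 0, 1], missing_rate=0.5)
score = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
```

With 0/1 rewards defined as in the Warfarin example, each miss costs exactly one unit of regret, so accuracy and average regret carry the same information.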

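10% Missing Rewards

| Algorithm | Covertype | CNAE-9 | Internet Ads | Warfarin |
|---|---|---|---|---|
| LINUCB | 0.884 | 0.644 | 0.866 | 0.643 |
| MLINUCB, N = 2 | 0.869 | 0.643 | 0.898 | 0.643 |
| MLINUCB, N = 5 | 0.874 | 0.626 | 0.895 | 0.656 |
| MLINUCB, N = 10 | 0.880 | 0.664 | 0.894 | 0.650 |
| MLINUCB, N = 15 | 0.877 | 0.678 | 0.902 | 0.647 |
| MLINUCB, N = 20 | 0.878 | 0.675 | 0.898 | 0.653 |

50% Missing Rewards

| Algorithm | Covertype | CNAE-9 | Internet Ads | Warfarin |
|---|---|---|---|---|
| LINUCB | 0.884 | 0.566 | 0.824 | 0.615 |
| MLINUCB, N = 2 | 0.838 | 0.578 | 0.888 | 0.630 |
| MLINUCB, N = 5 | 0.847 | 0.546 | 0.896 | 0.641 |
| MLINUCB, N = 10 | 0.863 | 0.592 | 0.897 | 0.640 |
| MLINUCB, N = 15 | 0.854 | 0.608 | 0.903 | 0.638 |
| MLINUCB, N = 20 | 0.853 | 0.592 | 0.901 | 0.639 |

75% Missing Rewards

| Algorithm | Covertype | CNAE-9 | Internet Ads | Warfarin |
|---|---|---|---|---|
| LINUCB | 0.880 | 0.483 | 0.786 | 0.610 |
| MLINUCB, N = 2 | 0.784 | 0.461 | 0.881 | 0.594 |
| MLINUCB, N = 5 | 0.797 | 0.494 | 0.890 | 0.612 |
| MLINUCB, N = 10 | 0.837 | 0.521 | 0.887 | 0.624 |
| MLINUCB, N = 15 | 0.824 | 0.500 | 0.891 | 0.600 |
| MLINUCB, N = 20 | 0.819 | 0.493 | 0.896 | 0.611 |

Table 2: Total average accuracy (N is the number of clusters)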

As the MLINUCB regret upper bound is lower than the LINUCB regret upper bound when is small, minimizing clustering error is critical to performance. Accordingly, successful MLINUCB operates on the assumption that the context vectors live in a manifold that can be described by a set of clusters. Thus MLINUCB has the potential to outperform LINUCB when this manifold assumption holds, specifically when the number of clusters chosen adequately describes the structure of the context vector space. Visualizing the context vectors suggests that some of our test datasets violate this assumption, some respect this assumption, and when an appropriate number of clusters is chosen, MLINUCB performance aligns as expected.

Consider the Internet Advertisements and Warfarin datasets, where 2D projections of the context vectors capture the majority of the variance in the context vector space, 100.0% and 98.2% respectively. In Figures 3 and 4 the projected context vector spaces appear clustered, not randomly scattered, and MLINUCB outperforms LINUCB for most choices of $N$, the number of clusters. The Internet Advertisements dataset yields the best results: when switching from LINUCB to MLINUCB, accuracy rises from 0.866 to as much as 0.902 when 10% of the reward data is missing, from 0.824 to as much as 0.903 when 50% is missing, and from 0.786 to as much as 0.896 when 75% is missing (Table 2).

Although the 2D projections of the Covertype and CNAE-9 context vectors in Figures 1 and 2 appear well clustered, both projections capture only a small amount of the variance in the context vector space: 29.7% in the Covertype dataset and 13.9% in the CNAE-9 dataset. MLINUCB results do not show improvement for the cases tried on the Covertype dataset, suggesting that it violates the manifold assumption for the context space. However, in the CNAE-9 dataset, MLINUCB outperforms LINUCB for most choices of the number of clusters $N$, which supports the observation that the context space is clustered.

(a) Context vector visualization with 5 clusters and 2D PCA. 2D PCA captures 29.7% of the variance in the Covertype dataset.
(b) LINUCB and MLINUCB accuracy comparison
Figure 1: Covertype
(a) Context vector visualization with 5 clusters and 2D PCA. 2D PCA captures 13.9% of the variance in the CNAE-9 dataset.
(b) LINUCB and MLINUCB accuracy comparison
Figure 2: CNAE-9
(a) Context vector visualization with 5 clusters and 2D PCA. 2D PCA captures 100.0% of the variance in the Internet Advertisements dataset.
(b) LINUCB and MLINUCB accuracy comparison
Figure 3: Internet Advertisements
(a) Context vector visualization with 5 clusters and 2D PCA. 2D PCA captures 98.2% of the variance in the Warfarin dataset.
(b) LINUCB and MLINUCB accuracy comparison
Figure 4: Warfarin

Taking a more in-depth look at the CNAE-9 dataset, in Figure 5 we vary LINUCB and MLINUCB’s common hyperparameter $\alpha$, which controls the ratio of exploration to exploitation, and see that MLINUCB continues to result in higher accuracies than LINUCB for most values of $\alpha$.

Note that $N$, the number of clusters, is a hyperparameter of the algorithm, and while it is initialized a priori, it could be changed and optimized online as more context vectors are revealed. Alternatively, we could leverage clustering algorithms that do not fix $N$ a priori and instead learn the best $N$ from the available data.

Figure 5: LINUCB and MLINUCB cumulative accuracies on CNAE-9 across various values of $\alpha$

7 Conclusions and Future Work

In this paper we studied the effect of data imputation in the case of missing rewards for multi-armed bandit problems. We proved an upper bound on the total regret of our algorithm, following the CBP upper bound. Our MLINUCB algorithm shows improvements over LINUCB in terms of total average accuracy in most cases. The main observation is that when the context vector space lives in a clustered manifold, we can take advantage of this structure and impute the missing reward at each step given similar contexts in previous events. An obvious next step is to try using the weighted average introduced in equation 1 with $m$ greater than $1$. This would use more topological information from the context feature space and would not rely on a single cluster. Additionally, the algorithm does not rely on a fixed value for the number of clusters $N$, so we could optimize the value of $N$ at each event, using some clustering metric to find the best $N$ at each time. This work can also be extended by replacing the simple clustering step with more complex methodologies for learning a representation of the context vector space, for example sparse dictionary learning.