## 1 Introduction

Contextual bandits are ubiquitous models in reinforcement learning and sequential decision making in environments with finite action spaces. The range of applications is extensive and includes problems in which time-varying and action-dependent information is important, such as personalized recommendation of news articles, healthcare interventions, advertisements, and clinical trials [li2010contextual, bouneffouf2012contextual, tewari2017ads, nahum2018just, durand2018contextual, varatharajah2018contextual, ren2020dynamic].

In many applications, consequential variables for decision making are not perfectly observed. Technically, the context vectors are often observed in a partial, transformed, and/or noisy manner [bensoussan2004stochastic, bouneffouf2017context, tennenholtz2021bandits]. In general, sequential decision making algorithms under imperfect observations provide a richer class of models compared to those of perfect observations. Accordingly, they are commonly used in different problems, including state-space models for robot control and filtering [roesser1975discrete, nagrath2006control, kalman1960new, stratonovich1960application].

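We study contextual bandits with imperfectly observed context vectors. The probabilistic structure of the problem under study, as time proceeds, is as follows. At every time step $t$, there are $N$ available actions (also referred to as ‘arms’), and the unobserved context of arm $i$ at time $t$, denoted by $x_i(t) \in \mathbb{R}^{d_x}$, is generated according to a multivariate normal distribution $\mathcal{N}(0_{d_x}, \Sigma_x)$. Moreover, the corresponding observation (i.e., output) is $y_i(t) \in \mathbb{R}^{d_y}$, while the stochastic reward $r_i(t)$ of arm $i$ is determined by the context and the unknown parameter $\mu_* \in \mathbb{R}^{d_x}$. Formally, we have

$$y_i(t) = A x_i(t) + \varepsilon_{y_i}(t), \tag{1}$$

$$r_i(t) = x_i(t)^\top \mu_* + \varepsilon_{r_i}(t), \tag{2}$$

where $\varepsilon_{y_i}(t)$ and $\varepsilon_{r_i}(t)$ are the noises of observation and reward, following the distributions $\mathcal{N}(0_{d_y}, \Sigma_y)$ and $\mathcal{N}(0, \sigma_r^2)$, respectively. Further, the $d_y \times d_x$ sensing matrix $A$ captures the relationship between $x_i(t)$ and the noiseless portion of $y_i(t)$.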
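The data-generating process described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' simulation code: it assumes zero-mean Gaussian contexts, a linear observation model with additive Gaussian noise as in (1), and a linear reward with scalar Gaussian noise as in (2). The function name `simulate_round` and the names `A`, `mu_star`, `Sigma_x`, `Sigma_y`, and `sigma_r` are illustrative choices.

```python
import numpy as np

def simulate_round(rng, A, mu_star, Sigma_x, Sigma_y, sigma_r, n_arms):
    """Draw one round of contexts, observations, and rewards for all arms."""
    d_y, d_x = A.shape
    # Unobserved contexts, one row per arm (assumed zero-mean Gaussian).
    x = rng.multivariate_normal(np.zeros(d_x), Sigma_x, size=n_arms)
    # Observation noise, one row per arm.
    eps_y = rng.multivariate_normal(np.zeros(d_y), Sigma_y, size=n_arms)
    # Observations: sensing matrix applied to the context, plus noise, as in (1).
    y = x @ A.T + eps_y
    # Rewards: linear in the unobserved context, plus scalar noise, as in (2).
    r = x @ mu_star + sigma_r * rng.standard_normal(n_arms)
    return x, y, r

rng = np.random.default_rng(0)
d_x, d_y, n_arms = 4, 2, 5
A = rng.standard_normal((d_y, d_x))      # hypothetical sensing matrix
mu_star = rng.standard_normal(d_x)       # hypothetical reward parameter
x, y, r = simulate_round(rng, A, mu_star, np.eye(d_x), np.eye(d_y), 1.0, n_arms)
print(x.shape, y.shape, r.shape)
```

Note that the agent would only ever see `y` (for all arms) and the entry of `r` for the arm it selects; `x` is returned here purely for inspection.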

At each time, the goal is to learn to choose the optimal arm maximizing the reward, by utilizing the information available by time $t$. That is, the agent chooses an arm based on the observations collected so far from the model in (1), and the resulting reward is then provided to the agent according to the equation in (2). Clearly, to choose high-reward arms, the agent needs accurate estimates of the unknown parameter $\mu_*$, as well as those of the contexts $x_i(t)$, for $i = 1, \dots, N$. However, because the context $x_i(t)$ is not observed, the estimation of $\mu_*$ is available only through the output $y_i(t)$. Therefore, the design of efficient reinforcement learning algorithms with guaranteed performance is challenging.

Bandits are thoroughly investigated in the literature, assuming that the contexts are perfectly observed. Early papers focus on the method of Upper-Confidence-Bound (UCB) for addressing the exploitation-exploration trade-off [lai1985asymptotically, abe1999associative, auer2002using, abbasi2011improved, chu2011contextual]. UCB-based methods take actions following optimistic estimates of the parameters, and are commonly studied in reinforcement learning [abbasi2011regret, faradonbeh2020optimism]. Another popular and efficient family of reinforcement learning policies uses randomized exploration, usually in the Bayesian form of Thompson sampling [chapelle2011empirical, agrawal2013thompson, faradonbeh2019applications, faradonbeh2020input, faradonbeh2020adaptive, modi2020no]. For contextual bandits in which contexts are generated under certain conditions, exploration-free policies of Greedy nature can exhibit efficient performance [bastani2021mostly]. Besides, numerical analysis shows that Greedy algorithms outperform Thompson sampling under imperfect context observations [park2022efficient]. Therefore, this work focuses on the theoretical analysis of Greedy policies for imperfectly observed contextual bandits.

Currently, theoretical results for bandits with imperfect context observations are scarce. One approach is to assume that additional information is provided to the algorithm (e.g., a known nuisance parameter) [tennenholtz2021bandits]. However, such information is not available in general, and such assumptions can lead to persistently sub-optimal action selection. When the output observations are of the same dimension as the context vectors, average-case asymptotic efficiency is shown [park2021analysis]. That is, the expected regret of Thompson sampling policies that use context estimates for calculating the posterior scales asymptotically as the logarithm of time [park2021analysis].

However, comprehensive analyses and non-asymptotic theoretical performance guarantees for general output observations are not currently available, and they are the focus of this work. We perform the *finite-time worst-case* analysis of Greedy reinforcement learning algorithms for imperfectly observed contextual bandits. We establish efficiency and provide high probability upper bounds for the regret that consist of logarithmic factors of the time horizon and of the failure probability. Furthermore, the effects of other problem parameters such as the number of arms and the dimension are fully characterized. Illustrative numerical experiments showcasing the efficiency are also provided.

To study the performance of reinforcement learning policies, different technical difficulties arise in the high probability analyses. First, one needs to study the eigenvalues of the empirical covariance matrices, since the estimation accuracy depends on them. Furthermore, it is required to consider the number of times the algorithm selects sub-optimal arms. Note that both quantities are stochastic, and so worst-case (i.e., high probability) results are needed for a statistically dependent sequence of random objects. To obtain the presented theoretical results, we employ advanced technical tools from martingale theory and random matrices. Indeed, by utilizing concentration inequalities for matrices with martingale difference structures, we carefully characterize the effects of order statistics and tail-properties of the estimation errors.

The remainder of this paper is organized as follows. In Section 2, we formulate the problem and discuss the relevant preliminary materials. Next, a Greedy reinforcement learning algorithm for contextual bandits with imperfect context observations is presented in Section 3. In Section 4, we provide theoretical performance guarantees for the proposed algorithm, followed by numerical experiments in Section 5. Finally, we conclude the paper and discuss future directions in Section 6.

We use $M^\top$ to refer to the transpose of the matrix $M$. For a vector $v$, we denote the $\ell_2$ norm by $\|v\|$. Additionally, $C(M)$ and $C(M)^\perp$ are employed to denote the column-space of the matrix $M$ and its orthogonal subspace, respectively. Further, $P_{C(M)}$ is the projection operator onto $C(M)$. Finally, $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the minimum and maximum eigenvalues of the symmetric matrix $M$, respectively.

## 2 Problem Formulation

First, we formally discuss the problem of contextual bandits with imperfect context observations. A bandit machine has $N$ arms, each of which has its own unobserved context $x_i(t)$, for $i \in \{1, \dots, N\}$. Equation (1) presents the observation model, where the observations $y_i(t)$ are linearly transformed functions of the contexts, perturbed by additive noise vectors $\varepsilon_{y_i}(t)$. Equation (2) describes the process of reward generation for different arms, depicting that *if* the agent selects arm $i$, then the resulting reward is an *unknown* linear function of the unobserved context vector, subject to some additional randomness due to the reward noise $\varepsilon_{r_i}(t)$.

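The agent aims to maximize the cumulative reward over time, by utilizing the sequence of observations. To gain the maximum possible reward, the agent needs to learn the relationship between the rewards $r_i(t)$ and the observations $y_i(t)$. For that purpose, we proceed by considering the conditional distribution of the reward $r_i(t)$ given the observation $y_i(t)$, i.e., $\mathbb{P}(r_i(t) \mid y_i(t))$, which is

$$r_i(t) \mid y_i(t) \sim \mathcal{N}\left( y_i(t)^\top D^\top \mu_*, \; \gamma_r^2 \right), \tag{3}$$

where, for the context covariance $\Sigma_x$, the observation-noise covariance $\Sigma_y$, and the reward-noise variance $\sigma_r^2$, we have $D = (A^\top \Sigma_y^{-1} A + \Sigma_x^{-1})^{-1} A^\top \Sigma_y^{-1}$ and $\gamma_r^2 = \mu_*^\top (A^\top \Sigma_y^{-1} A + \Sigma_x^{-1})^{-1} \mu_* + \sigma_r^2$.

Based on the conditional distribution in (3), in order to maximize the expected reward given the observation, we consider the conditional expectation of the reward given the observations, $\mathbb{E}[r_i(t) \mid y_i(t)] = y_i(t)^\top D^\top \mu_*$. So, letting $\eta_* = D^\top \mu_*$ be the transformed parameter, we focus on the estimation of $\eta_*$. The rationale is twofold. First, the conditional expected reward can be inferred by knowing only $\eta_*$, regardless of the exact value of the true parameter $\mu_*$. Second, $\mu_*$ is not estimable when the rank of the sensing matrix $A$ in the observation model is less than the dimension of $\mu_*$. Indeed, estimability of $\mu_*$ needs the restrictive assumptions of non-singular $A$ and $D$.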
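As a sanity check on the conditional distribution, the following sketch computes the transformed parameter and the conditional variance through standard Gaussian conditioning, and compares them with a Monte Carlo regression of rewards on observations. It assumes the linear-Gaussian model of (1) and (2) with zero-mean contexts; the function name `transformed_parameter` and all numerical values are illustrative assumptions.

```python
import numpy as np

def transformed_parameter(A, mu_star, Sigma_x, Sigma_y, sigma_r):
    """Return (D, eta_star, gamma2) obtained by Gaussian conditioning of x on y."""
    Sy_inv = np.linalg.inv(Sigma_y)
    # Posterior covariance of the context given the observation.
    post_cov = np.linalg.inv(A.T @ Sy_inv @ A + np.linalg.inv(Sigma_x))
    D = post_cov @ A.T @ Sy_inv                  # E[x | y] = D y
    eta_star = D.T @ mu_star                     # E[r | y] = y^T eta_star
    gamma2 = mu_star @ post_cov @ mu_star + sigma_r**2   # Var[r | y]
    return D, eta_star, gamma2

rng = np.random.default_rng(1)
d_x, d_y, n = 4, 2, 200_000
A = rng.standard_normal((d_y, d_x))              # hypothetical sensing matrix
mu_star = rng.standard_normal(d_x)               # hypothetical reward parameter
D, eta_star, gamma2 = transformed_parameter(A, mu_star, np.eye(d_x), np.eye(d_y), 0.5)

# Monte Carlo check: regress simulated rewards on simulated observations.
x = rng.standard_normal((n, d_x))
y = x @ A.T + rng.standard_normal((n, d_y))
r = x @ mu_star + 0.5 * rng.standard_normal(n)
eta_ls, *_ = np.linalg.lstsq(y, r, rcond=None)
resid_var = np.mean((r - y @ eta_ls) ** 2)
print("coefficient gap:", np.max(np.abs(eta_ls - eta_star)))
print("variance gap:", abs(resid_var - gamma2))
```

Both gaps should shrink as `n` grows, which illustrates why only the transformed parameter, a vector of the same dimension as the observation, needs to be learned from the data.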

The optimal policy that reinforcement learning policies need to compete against knows the true parameter $\mu_*$. That is, to maximize the reward given the output observations, the optimal arm at time $t$, denoted by $a^*(t)$, is

$$a^*(t) = \underset{1 \le i \le N}{\arg\max} \; y_i(t)^\top \eta_*, \tag{4}$$

where $\eta_*$ is the transformed parameter defined above.

Then, the performance degradation due to uncertainty about the environment that the parameter $\mu_*$ represents is the assessment criterion for reinforcement learning policies. So, we consider the following performance measure, which is commonly used in the literature, and is known as the *regret* of the reinforcement learning policy that selects the sequence of actions $a(t)$, $t = 1, 2, \dots$:

$$\mathrm{Regret}(T) = \sum_{t=1}^{T} \left( y_{a^*(t)}(t) - y_{a(t)}(t) \right)^\top \eta_*. \tag{5}$$

In other words, the regret at time $T$ is the total difference in the obtained rewards, up to time $T$, where the difference at time $t$ is between the optimal arm $a^*(t)$ and the arm $a(t)$ chosen by the reinforcement learning policy based on the output observations by time $t$.
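The regret can be computed directly from logged observations and actions. The sketch below assumes that regret is measured through conditional expected rewards, i.e., through the inner products of the observations with the transformed parameter; the function name `cumulative_regret` and the array layout are illustrative assumptions.

```python
import numpy as np

def cumulative_regret(Y, chosen, eta_star):
    """Cumulative regret of a sequence of chosen arms.

    Y: array of shape (T, N, d_y) holding the observations of all arms.
    chosen: integer array of shape (T,) with the arm selected at each time.
    eta_star: transformed parameter of shape (d_y,).
    """
    scores = Y @ eta_star                              # (T, N) conditional expected rewards
    best = scores.max(axis=1)                          # value of the optimal arm
    picked = scores[np.arange(len(chosen)), chosen]    # value of the selected arm
    return np.cumsum(best - picked)

rng = np.random.default_rng(2)
T, N, d_y = 1000, 5, 3
Y = rng.standard_normal((T, N, d_y))
eta_star = np.array([1.0, -0.5, 0.25])                 # hypothetical transformed parameter

oracle = (Y @ eta_star).argmax(axis=1)                 # policy that knows eta_star
uniform = rng.integers(0, N, size=T)                   # policy that ignores the observations
print(cumulative_regret(Y, oracle, eta_star)[-1])      # zero by construction
print(cumulative_regret(Y, uniform, eta_star)[-1])
```

The oracle policy incurs zero regret, while the uniformly random policy accumulates regret that grows linearly with time, which is the baseline a learning policy is meant to beat.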

## 3 Reinforcement Learning Policy

In this section, we explain the details of the Greedy algorithm for contextual bandits with imperfect observations. Although inefficient in some reinforcement learning problems, Greedy algorithms are known to be efficient under certain conditions such as covariate diversity [bastani2021mostly]. Intuitively, the latter condition expresses that the context vectors cover all directions of their ambient space with a non-trivial probability, so that additional exploration is not necessary.

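As discussed in Section 2, it suffices for the policy to learn to maximize

$$\mathbb{E}\left[ r_i(t) \mid y_i(t) \right] = y_i(t)^\top \eta_*, \tag{6}$$

where $\eta_*$ is the transformed parameter. To estimate the quantity $y_i(t)^\top \eta_*$, we use the least-squares estimate $\widehat{\eta}(t)$ (that will be explained shortly), in lieu of the truth $\eta_*$. So, the Greedy algorithm selects the arm $a(t)$ at time $t$, such that

$$a(t) = \underset{1 \le i \le N}{\arg\max} \; y_i(t)^\top \widehat{\eta}(t). \tag{7}$$

The recursions to update the parameter estimate $\widehat{\eta}(t)$ and the empirical inverse covariance matrix $B(t)$ are as follows:

$$B(t+1) = B(t) + y_{a(t)}(t) \, y_{a(t)}(t)^\top, \tag{8}$$

$$\widehat{\eta}(t+1) = B(t+1)^{-1} \left( B(t) \, \widehat{\eta}(t) + y_{a(t)}(t) \, r_{a(t)}(t) \right), \tag{9}$$

where the initial values consist of $B(1) = \Sigma^{-1}$, for some arbitrary positive definite matrix $\Sigma$, and $\widehat{\eta}(1) = 0_{d_y}$. Algorithm 1 provides the pseudo-code for the Greedy reinforcement learning policy.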
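A minimal sketch of the Greedy policy is given next. It is not the authors' Algorithm 1 verbatim: it assumes the recursions take the usual regularized least-squares form, with the inverse covariance matrix initialized at the inverse of an arbitrary positive definite matrix and the estimate initialized at zero. The function name `greedy_bandit` and the simulated data are illustrative assumptions.

```python
import numpy as np

def greedy_bandit(Y, R, Sigma0):
    """Run the Greedy policy on pre-drawn observations Y (T, N, d_y) and rewards R (T, N)."""
    T, N, d_y = Y.shape
    B = np.linalg.inv(Sigma0)        # initial (unnormalized) inverse covariance matrix
    s = np.zeros(d_y)                # running sum of y * r, equal to B(t) @ eta_hat(t)
    eta_hat = np.zeros(d_y)          # initial estimate of the transformed parameter
    chosen = np.empty(T, dtype=int)
    for t in range(T):
        # Greedy step: pick the arm with the largest estimated conditional reward.
        a = int(np.argmax(Y[t] @ eta_hat))
        y, r = Y[t, a], R[t, a]      # only the selected arm's reward is revealed
        B += np.outer(y, y)          # rank-one update of the inverse covariance
        s += y * r
        eta_hat = np.linalg.solve(B, s)   # least-squares estimate after the update
        chosen[t] = a
    return chosen, eta_hat

# Simulated environment (all values are hypothetical).
rng = np.random.default_rng(4)
T, N, d_x, d_y = 3000, 5, 4, 2
A = rng.standard_normal((d_y, d_x))
mu_star = rng.standard_normal(d_x)
X = rng.standard_normal((T, N, d_x))                      # unobserved contexts
Y = X @ A.T + rng.standard_normal((T, N, d_y))            # observations
R = X @ mu_star + 0.5 * rng.standard_normal((T, N))       # rewards

chosen, eta_hat = greedy_bandit(Y, R, np.eye(d_y))
# Transformed parameter for identity covariances, via Gaussian conditioning.
eta_star = np.linalg.solve(A @ A.T + np.eye(d_y), A @ mu_star)
print("estimation error:", np.linalg.norm(eta_hat - eta_star))
```

Storing the running sum `s` instead of multiplying the previous estimate by the previous matrix is algebraically equivalent to the recursion for the estimate and avoids one matrix-vector product per step.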

## 4 Theoretical Performance Guarantees

In this section, we present a theoretical result for Algorithm 1 presented in the previous section. The result provides a worst-case analysis and establishes a high probability upper-bound for the regret in (5).

###### Theorem 1.

Assume that Algorithm 1 is used in a contextual bandit with arms and the output dimension . Then, with probability at least , we have

Before proceeding to the proof of the theorem, we discuss its intuition. The regret bound above scales linearly with the number of arms $N$, quadratically with the dimension of the observations $d_y$, and logarithmically with the time horizon $T$. Note that the dimension of the unobserved context vectors does not appear in the regret bound, because the optimal policy in (4) does not have the exact values of the context vectors. So, similar to the reinforcement learning policy, the optimal policy needs to estimate the contexts as well, as can be seen in (4).

Intuitively, the linear growth of the regret bound with $N$ relies on the fact that a policy is more likely to choose one of the sub-optimal arms when $N$ is larger. So, as more sub-optimal arms exist, larger sub-optimalities in the arm selection and larger regret are incurred. In addition, it can be seen in the proof of the theorem that the quadratic terms of $d_y$ are generated by the scaling of the norms of the $d_y$-dimensional random vectors $y_i(t)$, as well as the rates of concentration of empirical covariance matrices . Further, the term reflects the tail-effect of the random context vectors. Lastly, aggregating reward difference at time (which decays proportional to ) together with the probability of choosing a sub-optimal arm (which shrinks with the same rate), we obtain the last logarithmic term in the regret bound in the theorem; .

###### Proof.

First, for and , define the event

(10) |

where . We use the following intermediate results, for which the proofs are deferred to the appendix.

###### Lemma 1.

Let be the sigma-field generated by random vectors . For the observation of chosen arm at time t and the filtration defined according to

we have

where and . That is, is the expected maximum of independent with the standard normal distribution.

###### Lemma 2.

For the event defined in (10), we have .

###### Lemma 3.

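(Matrix Azuma Inequality [tropp2012user]) Consider the sequence $\{X_k\}_{k=1}^{n}$ of symmetric $d \times d$ random matrices adapted to some filtration $\{\mathcal{G}_k\}_{k=0}^{n}$, such that $\mathbb{E}[X_k \mid \mathcal{G}_{k-1}] = 0$. Assume that there is a deterministic sequence of symmetric matrices $\{M_k\}_{k=1}^{n}$ that satisfy $X_k^2 \preceq M_k^2$, almost surely. Let $\sigma^2 = \left\| \sum_{k=1}^{n} M_k^2 \right\|$, where the norm is the spectral norm. Then, for all $\varepsilon \ge 0$, it holds that

$$\mathbb{P}\left( \lambda_{\max}\left( \sum_{k=1}^{n} X_k \right) \ge \varepsilon \right) \le d \cdot \exp\left( -\frac{\varepsilon^2}{8 \sigma^2} \right). \tag{11}$$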
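To make the matrix Azuma inequality concrete, the following sketch checks it numerically on a simple martingale difference sequence. It uses the form of the bound given in [tropp2012user], namely that the probability of the largest eigenvalue of the sum exceeding a threshold is at most the dimension times the exponential of minus the squared threshold over eight times the variance parameter. The construction (fixed symmetric matrices multiplied by independent random signs) and all names are illustrative assumptions, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, trials = 3, 50, 2000

# Fixed (deterministic once drawn) symmetric matrices M_k.
G = rng.standard_normal((n, d, d))
M_seq = (G + G.transpose(0, 2, 1)) / 2

# Variance parameter: spectral norm of the sum of the squares M_k^2.
sigma2 = np.linalg.norm((M_seq @ M_seq).sum(axis=0), 2)

# X_k = sign_k * M_k has zero conditional mean and satisfies X_k^2 = M_k^2.
signs = rng.choice([-1.0, 1.0], size=(trials, n))
S = np.einsum("tk,kij->tij", signs, M_seq)       # one summed matrix per trial
lam_max = np.linalg.eigvalsh(S)[:, -1]           # largest eigenvalue of each sum

threshold = 3.5 * np.sqrt(sigma2)
empirical = np.mean(lam_max >= threshold)
bound = d * np.exp(-threshold**2 / (8 * sigma2))
print("empirical tail:", empirical, "Azuma bound:", bound)
```

The empirical tail frequency is typically far below the bound, which reflects that the inequality is a worst-case guarantee over all admissible martingale difference sequences.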

###### Lemma 4.

###### Lemma 5.

Given and , if

then, the probability of choosing a sub-optimal arm is bounded as follows:

where is the second best arm (i.e., it has the second largest conditional expected reward).

###### Lemma 6.

Suppose that we have independent normal random variables. Then, the probability density function of the difference between the maximum of the random variables and the second largest among all, is upper-bounded by .

First, we consider the inverse of the (unnormalized) empirical covariance matrix in (8). By Lemma 1, the minimum eigenvalue of is greater than , for all . Thus, for all , it holds that

(13) |

Now, we focus on a high probability lower-bound for the smallest eigenvalue of . On the event , the matrix is positive semidefinite for all and . Let

(14) |

Then, and . Thus, is a martingale difference sequence. Because for all and , and , for all , on the event . By Lemma 3, we get

(15) |

for . Now, using

(18) |

where is arbitrary, and we used the fact that . Indeed, on the event defined in (10), for we have

(19) |

where . In other words, with the probability at least , we have

(20) |

for all . Next, we investigate the estimation error . Using on the event , we have

(21) |

So, we write the regret in the following form:

(22) |

where is the indicator function. Here, we denote . By (20), we can find , such that

(23) |

with the probability at least , for all . By Lemma 4, (21) and (23), for all , with the probability at least , we have

(24) |

where .

Note that is the difference of the maximum and second largest variables of independent ones with the distribution . By Lemma 6, the density of

is bounded by . Thus, by marginalizing from , which is obtained in Lemma 5, we have

(25) |

where . Since , for all by (23), we have

(26) |

Now, to find the upper bound of the sum of indicator functions in (22), we construct a martingale difference sequence that satisfies the conditions in Lemma 3. To that end, let ,

and , where

(27) |

Since , the above sequences and are a martingale difference sequence and a martingale with respect to the filtration , respectively. Let . Since , by Lemma 3, we have

(28) |

Thus, with the probability at least , it holds that

(29) |

By (26), above is bounded by , for all . Thus, we can bound the second term as

(30) |

Therefore, by (29) and (30), putting together the terms for and , with probability at least , the following inequalities hold for the regret of the algorithm, which yield the desired result:

(31) | |||||

## 5 Numerical Illustrations

In this section, we perform numerical analyses for the theoretical result in the previous section. We simulate cases for imperfectly observed contextual bandits with arms and with different dimensions of the observations; , while we fix the context dimension to . Each case is repeated times and the average and worst quantities of scenarios are reported.

In Figure 1, the left plot depicts the average-case (solid) and worst-case (dashed) regret among all scenarios, normalized by . The number of arms varies as shown in the graph, while the dimension is fixed to . Next, the plot on the right-hand side of the figure illustrates that the normalized regret increases over time for different , and for the fixed number of arms . For both plots, the worst-case regret curves are well above the average-case ones, as expected. However, the slopes of the curves in both cases decay as time goes on, which is consistent with the worst-case regret growing logarithmically with time.

Figure 2 presents the average-case and worst-case regret at time (not normalized by ) for different numbers of arms and different dimensions . The graph shows how the regret increases as the number of arms and the dimension grow.

## 6 Conclusion

This work investigates reinforcement learning algorithms for contextual bandits in which the contexts are observed imperfectly, and focuses on the theoretical analysis of the regret. We establish performance guarantees for Greedy algorithms and obtain high probability regret bounds growing logarithmically with the horizon.

There are multiple interesting directions for future work. First, it will be of interest to study reinforcement learning policies for settings in which each arm has its own parameter. Further, regret analysis for contextual bandits under imperfect context observations where the sensing matrix is unknown is another problem for future work.

## 7 Appendices

### 7.1 Proof of Lemma 1

We use the following decomposition

(32) |

We claim that and are statistically independent. To show it, define

(33) |

where has the distribution and is an arbitrary vector in . The vector can be decomposed as . Then, we have , because . This implies that only the first term of the decomposition, , affects the result of . Hence, has the same distribution as , which means

(34) |

where $\overset{d}{=}$ is used to denote the equality of the probability distributions. Note that

Thus, has the same distribution as , where and are statistically independent. By the decomposition (32) and the independence,

can be written as