# Incentives in the Dark: Multi-armed Bandits for Evolving Users with Unknown Type

Design of incentives or recommendations to users is becoming more common as platform providers continually emerge. We propose a multi-armed bandit approach to the problem in which user types are unknown a priori and evolve dynamically in time. Unlike the traditional bandit setting, observed rewards are generated by a single Markov process. We demonstrate via an illustrative example that blindly applying the traditional bandit algorithms results in very poor performance as measured by regret. We introduce two variants of classical bandit algorithms, upper confidence bound (UCB) and epsilon-greedy, for which we provide theoretical bounds on the regret. We conduct a number of simulation-based experiments to show how the algorithms perform in comparison to traditional UCB and epsilon-greedy algorithms as well as reinforcement learning (Q-learning).

## 1 Introduction

Online learning and the theory of multi-armed bandits play a key role in shaping how digital platforms actively engage with users. A common theme underlying these platforms is the presence of mechanisms that target individuals with personalized options chosen from a pool of available actions. Tools from the theory of multi-armed bandits provide a systematic framework for synthesizing algorithms where a decision-maker interacts with an agent, by balancing exploration (learning how agents react to new alternatives) with exploitation (repeatedly offering the best performing options to an agent). Typical applications where such trade-offs arise include recommender systems (Li, Chu, Langford, and Schapire, 2010, 2016), crowdsourcing (Liu and Liu, 2017; Tran-Thanh et al., 2014), and incentive design in e-commerce and physical retail (Chakrabarti et al., 2008). Although we motivate our work with applications pertaining to interactions between a decision-maker and an agent, an example being digital platforms that actively engage with users, our framework applies to any correlated Markovian environment, e.g., spectrum access applications (Tekin and Liu, 2012). A rich stream of literature has investigated multi-armed bandit algorithms for such scenarios, obtaining near-optimal performance guarantees for a multitude of settings (see, e.g., Bubeck and Cesa-Bianchi, 2012; Slivkins, forthcoming; Lattimore and Szepesvári, forthcoming).

Despite their popularity, most of the bandit algorithms in this line of research fail to take into account a crucial component of human interaction prevalent in the aforementioned applications. Notably, the type or state of an agent at any given point in time depends primarily on their underlying beliefs, opinions, and preferences; as these states evolve, so do the rewards for pursuing each distinct action. For example, there is mounting empirical evidence that humans make decisions by comparing to evolving reference points such as the status quo, expectations about the future, or past experiences (Kahneman and Tversky, 1984). Against this backdrop, we illuminate three critical challenges that motivate our work and render current bandit approaches ineffective in a dynamic setting.


Correlated Evolution of Rewards:

The evolution of an agent’s beliefs or reference point is inextricably tied to the mechanisms that they interact with. Hence, the preferences of an agent evolve in a correlated fashion, and consequently, so do the rewards for each action a decision-maker can select. As a result, standard techniques that ignore such correlations may misjudge the rewards of each action. It is worth noting that several previous works studying online decision-making problems have identified similar dynamic feedback loops in applications such as click-through rate prediction (Graepel et al., 2010) and rating systems (Herbrich et al., 2007).

Lack of State Information:

Digital platforms that engage with users often observe only their actions (e.g., click or no click) that can act as a proxy for a reward, and rarely have access to their underlying beliefs, opinions, or preferences that induce responses. Thus, it is imperative to design learning policies that are unaware of the agent’s state, yet which remain cognizant of the general fact that rewards are drawn from evolving distributions over the agent’s state.

Batch Feedback:

In real-world systems, it is often the case that reward feedback is processed in batches to update a learning policy because of various runtime constraints (see Chapter 8 of Slivkins, forthcoming). Moreover, immediate feedback may not reflect the long-term mean reward of an action when agent preferences are rapidly evolving. In such scenarios, frequently adapting a policy can further complicate the task of distinguishing between the underlying value of actions.

The objective of the work in this paper is to develop a principled approach to solving bandit problems in environments where: (i) future rewards are correlated with past actions as a result of a dependence on the agent’s underlying state, (ii) the decision-maker does not have a priori information regarding the state nor do they observe the state during their interaction with the agent, and (iii) reward feedback cannot be collected immediately to update a learning policy. In order to capture these features, we consider a bandit setting with $m$ arms or actions and an underlying state $\theta \in \Theta$. The policy class is restricted such that a fixed action must be played within each epoch, a time interval consisting of a number of iterations that grows linearly as a function of the number of times an arm has been played in the past. The observed feedback at the end of an epoch is a smoothed reward. The smoothed reward can be modeled as a discount-averaged reward or a time-averaged reward of the instantaneous rewards at each iteration within an epoch. In the discount-averaged reward model, rewards carry more weight toward the end of an epoch, while in the time-averaged reward model, rewards are given equal weight within an epoch. The reward for an action at each iteration within an epoch is drawn from a distribution that depends on $\theta$. Moreover, the state evolves in each iteration according to a Markov chain whose transition matrix depends on the arm selection.

Although our setup is a generalization of the classic bandit problem, popular approaches such as UCB and $\epsilon$-greedy (Auer et al., 2002a) perform poorly here, often converging to suboptimal arms; we demonstrate this in Example 2.2 of Section 2. The failure of existing algorithms stems from the correlation between actions: since the evolution of $\theta$ depends on the action played, observed rewards are a function of past actions. Finally, it is worth noting that problems with Markovian rewards are traditionally studied through the lens of reinforcement learning (Jaksch et al., 2010; Azar et al., 2013). However, all of the algorithms in this domain assume that the decision-maker has full or partial information on an agent’s state. We believe that the state-agnostic bandit algorithms developed in this work will serve as a bridge between the traditional bandit theory and techniques from reinforcement learning.

### 1.1 Contributions and Organization

Given a multi-armed bandit problem where the arm rewards evolve in a correlated fashion according to a Markov chain, the fundamental question studied in this work is the following: Can we design an algorithm that provably guarantees sublinear regret as a function of the time horizon in the absence of state observations and knowledge of the way in which the reward distributions are evolving?

The contributions of this work are now summarized. To demonstrate the necessity of our work, we show that traditional bandit algorithms such as UCB and $\epsilon$-greedy can misidentify the optimal arm and suffer linear regret as a result of underestimating the value of the optimal arm in a correlated Markovian environment. We then develop a general framework for analyzing epoch-based bandit algorithms for problems with correlated Markovian rewards. We join this framework with the theory of Markov chain mixing times and existing bandit principles to design algorithms capturing the uncertainty in the empirical mean rewards arising from the unobserved, evolving state distribution and the stochasticity of the reward distributions. This gives us the central contributions of the paper, bandit algorithms called EpochUCB and EpochGreedy. We prove gap-dependent regret bounds for each algorithm under discount-averaged and time-averaged reward feedback. Given that no online learning algorithm can obtain sublogarithmic regret bounds, even in the case of independent and identically distributed rewards, our algorithms produce regret bounds that are asymptotically tight. Moreover, we obtain a gap-independent regret bound for EpochUCB under discount-averaged and time-averaged reward feedback. Finally, we present a number of simulations comparing the empirical and theoretical performance of our proposed algorithms. This is augmented with a set of illustrative experiments demonstrating that our proposed algorithms empirically outperform existing bandit and reinforcement learning algorithms in correlated Markovian environments.

We now briefly characterize the challenges involved in designing online learning algorithms when the rewards evolve in a correlated fashion and outline our techniques to overcome them. Unlike a typical bandit problem, there are two sources of uncertainty in the empirical rewards: (i) uncertainty regarding the state, and (ii) uncertainty from the stochasticity of the distributions from which rewards are drawn. As we demonstrate with illustrative examples, misjudging the expected reward of an arm as an artifact of failing to take into account the multiple sources of uncertainty can lead to significant regret. Moreover, the distribution from which an arm’s reward is drawn can change even when that arm is not selected since rewards depend on the evolving state. To overcome these obstacles, our proposed algorithms leverage techniques from the theory of mixing times for Markov chains, while retaining the spirit of the UCB and $\epsilon$-greedy algorithms developed in Auer et al. (2002a). Specifically, by characterizing the mixing times, we are able to obtain estimates of the empirical mean rewards that closely approximate the expected stationary distribution rewards and maintain accurate confidence windows representing the uncertainty in these estimates. The execution of each algorithm in epochs of growing length ensures that the uncertainty in the empirical mean reward estimates for each arm eventually dissipates.

Following a formalization of the problem we study in Section 2, we present our proposed EpochUCB and EpochGreedy algorithms and analyze their regret in Section 3. In Section 4, we present simulation results. We conclude in Section 5 with a discussion of open questions and comments on future work. Finally, to promote readability, an appendix comes after the primary presentation of our work. In Appendix A, we put forward a notation table that includes the most important and frequently used notation in the paper. The majority of the proofs backing our theoretical results are contained in Appendix B. However, proof sketches are provided immediately following the statements of our main results. In Appendix C, further details are given on the existing algorithms that we empirically compare to our proposed algorithms.

### 1.2 Background and Related Work

The two distinct features separating our model from previous work on multi-armed bandits with Markov chains are that the rewards evolve in a correlated fashion and the decision-maker is fully unaware of the agent’s state. These features are missing in the related rested and restless multi-armed bandit problems.

In the rested (Anantharam et al., 1987; Tekin and Liu, 2010) and restless (Tekin and Liu, 2012; Ortner et al., 2014) bandit literature, there is an independent Markov chain tied to each arm and the reward for an arm depends on the state of the Markov chain for that arm. In each model, when an arm is selected the state of the Markov chain for that arm is observed and transitions. The distinguishing characteristic between the problems is the behavior of the Markov chains tied to arms that are not selected. In the rested bandit model, the Markov chains associated with unselected arms remain unaltered; however, in the restless bandit model, the Markov chains associated with unselected arms evolve precisely as they would had they been played.

Hence, as is true in our model, in the restless bandit problem the reward distribution on an arm can evolve even when the arm is not being played. However, a key point of distinction is that the evolution of the reward distribution on an arm is independent of the reward distribution for every other arm. In contrast, the problem setting we study is such that there is a shared Markov chain between arms so that the evolution of the rewards on each arm is correlated. Moreover, in the setting under consideration there are no observations of the Markov chain state or distribution. Although these distinctions may appear subtle, the correlation between present actions and future rewards, along with the absence of state observations, results in a number of technical difficulties.

An exception to the formulations of the rested and restless bandit problems is the manuscript of Mazumdar et al. (2017), which proposes a UCB-inspired strategy for expert selection in a Markovian environment. Our proposed UCB-inspired algorithm, called EpochUCB, improves upon the regret bound in that work. Moreover, we consider more general reward feedback structures and propose an $\epsilon$-greedy inspired algorithm called EpochGreedy. Another work along this same vein is that of Azar et al. (2013); in this work, an expert selection strategy is proposed for finding policies in Markov decision processes with partial state feedback.

The problem we study bears some conceptual similarity to the traditional principal-agent model from contract theory (Bolton et al., 2005; Laffont and Martimort, 2009). The standard principal-agent model is a one-shot interaction: a principal selects a signal to send to an agent with a type variable, and then the self-interested agent pursues an action depending on the signal and its type. The reward the principal obtains is a function of the action of the agent, and consequently, of the signal and the type. Typically, there exists an information asymmetry between the principal and the agent. Prominent examples include adverse selection (agent type is unobservable to the principal) and moral hazard (agent action is unobservable to the principal). Recently, repeated principal-agent problems have begun to be studied as bandit problems where each round corresponds to the standard principal-agent formulation with adverse selection and moral hazard. A notable example of such a formulation is the work of Ho et al. (2016). The caveat of this work is that in each round a brand-new agent arrives with a type variable sampled from an i.i.d. distribution; following an interaction with the principal, an agent exits the system forever. In contrast, the problem we study can be viewed as a repeated principal-agent problem where a unique agent continuously interacts with the principal. However, while our formulation is an example of adverse selection since the state is unobserved and dynamically evolves, we do not model the strategic nature of the agent.

Finally, we remark that our setting is general enough to model a number of existing works, which we describe below:

1. Bandits with Positive Externalities: The state could represent the agent’s bias toward actions that have yielded high reward in the past as in Shah et al. (2018). The decision-maker receives a higher expected reward by selecting actions that the agent is positively predisposed toward.

2. Bandits Tracking Arm Selection History: In recharging bandits (Immorlica and Kleinberg, 2018), the reward on an arm is an increasing function of the time since it was last selected. The state could track such a time period. Along the same lines, in rotting bandits (Levine et al., 2017), the reward on an arm is a decreasing function of the number of times it has been played in the past. The state could track the number of plays of each arm.

## 2 Preliminaries

We now formalize the problem we study, detail technical challenges, and present an example demonstrating the insufficiency of existing algorithms that necessitates our work.

### 2.1 Problem Statement

Consider a decision-maker that faces the problem of repeatedly choosing an action to impact an agent over a finite time horizon. We refer to the set of possible actions as arms and use the notation $j \in [m] = \{1, \ldots, m\}$ to index them. The agent is modeled as having state $\theta \in \Theta$, where $\Theta$ is a finite set, that evolves in time according to a Markov chain whose transition matrix depends on the arm selected by the decision-maker. In turn, the agent’s state influences the rewards the decision-maker receives. The goal of the decision-maker is to construct a policy that sequentially selects arms in order to maximize the cumulative reward over a finite horizon.

We restrict the decision-maker’s policy class to a specific type of multi-armed bandit algorithm that we refer to as an epoch mixing policy. Such a policy $\alpha$ is executed over epochs indexed by $k$. In each epoch $k$, the policy selects an arm $\alpha(k) \in [m]$ and repeatedly ‘plays’ this arm for $\tau_k^{\alpha}$ iterations within this epoch, where the superscript $\alpha$ indicates that the epoch length depends on the arm selected, before receiving feedback in the form of a reward. When the decision-maker makes an arm choice at an epoch $k$, the state of the agent evolves for $\tau_k^{\alpha}$ iterations. The reward the decision-maker observes at the end of the epoch is a function of the rewards at each iteration within the epoch. In short, an epoch mixing policy proceeds on two time scales: each selection of an arm corresponds to an epoch that begins at time $t_k$ and ends after $\tau_k^{\alpha}$ iterations, so that $t_{k+1} = t_k + \tau_k^{\alpha}$. Our motivation for restricting the policy class in this way stems from the inherent challenges online platforms face to process immediate feedback in order to update learning algorithms frequently, and the necessity of observing feedback based upon periods of near-stationarity to obtain meaningful regret bounds.
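The two time scales can be made concrete with a small bookkeeping sketch. It assumes, for illustration only, an epoch length of the form `base + slope * (previous plays of the arm)`, which matches the linear growth described in the introduction; the constants `base` and `slope`, the function name `epoch_schedule`, and the choice to start the clock at time 0 are hypothetical and not taken from the paper.

```python
def epoch_schedule(arm_choices, base=1, slope=1):
    """Return (first iteration, last iteration, arm) for each epoch.

    Assumption: the epoch length is base + slope * (previous plays of the arm).
    base and slope are placeholder constants, and time starts at 0.
    """
    plays = {}          # number of past epochs in which each arm was selected
    t = 0               # start time of the current epoch
    schedule = []
    for arm in arm_choices:
        tau = base + slope * plays.get(arm, 0)   # length of this epoch
        schedule.append((t, t + tau - 1, arm))
        t += tau                                 # next epoch starts here
        plays[arm] = plays.get(arm, 0) + 1
    return schedule


if __name__ == "__main__":
    for start, end, arm in epoch_schedule([0, 1, 0, 0]):
        print(f"arm {arm}: iterations {start}..{end}")
```

Under this assumed schedule, an arm that is selected more often is held fixed for longer stretches, which is what gives the state distribution time to approach that arm's stationary distribution.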

Returning to the agent model, the agent’s state is modeled as the state of a Markov chain. Let $\beta_t$ denote the state distribution at time $t$ and let $\theta_t$ denote the random variable representing the agent’s state at time $t$, having distribution $\beta_t$. Given that arm $j$ is selected at epoch $k$, let the arm-specific transition matrix of the Markov chain be denoted as $P_j$. Then, the state distribution on each $\theta \in \Theta$ evolves between epochs $k$ and $k+1$ as follows:

$$\beta_{t_{k+1}}(\theta) = \sum_{\theta' \in \Theta} P_j^{\tau_k^{\alpha}}(\theta', \theta)\, \beta_{t_k}(\theta'),$$

where $P_j^{\tau_k^{\alpha}}(\theta', \theta)$ is the probability of the state transition from $\theta'$ to $\theta$ in $\tau_k^{\alpha}$ iterations; that is, $P_j^{\tau_k^{\alpha}}$ is the $\tau_k^{\alpha}$-fold composition of $P_j$. Observe that the agent model we adopt captures the fact that the agent’s preferences depend on past interactions with the decision-maker since the state distribution when an epoch begins depends on previous epochs.

Assumption 2.1. For each arm $j \in [m]$, the transition matrix $P_j$ is irreducible and aperiodic.

Irreducible and aperiodic Markov chains are ergodic. Assumption 2.1 implies that for each $j \in [m]$, the Markov chain characterized by $P_j$ has a unique positive stationary distribution that we denote by $\pi_j$. Furthermore, let $\hat{P}_j$ denote the time reversal of $P_j$, defined as $\hat{P}_j(\theta, \theta') = \pi_j(\theta') P_j(\theta', \theta) / \pi_j(\theta)$. The time reversal matrix is also irreducible and aperiodic with unique positive stationary distribution $\pi_j$. Define the multiplicative reversiblization of $P_j$ to be $M_j = P_j \hat{P}_j$, which is itself a reversible transition matrix. The eigenvalues of $M_j$ are real and non-negative, so that the second largest eigenvalue satisfies $\lambda_2(M_j) \in [0, 1]$ (Fill, 1991).

We additionally assume that, for each arm $j \in [m]$, the transition matrix $P_j$ is such that $M_j$ is irreducible. This is a standard assumption in the bandit literature with Markov chains (see Tekin and Liu, 2012, and the references therein) that implies $\lambda_2(M_j) < 1$. Recall that Assumption 2.1 on the transition matrices already ensures $\lambda_2(M_j) \le 1$. Hence, the irreducibility assumption on $M_j$ only restricts the boundary case when $\lambda_2(M_j) = 1$. The assumption is necessary to derive meaningful bounds on the deviation between the expected reward feedback and the expected stationary distribution reward of an arm.

A permissive sufficient condition on an ergodic transition matrix $P_j$ that ensures its multiplicative reversiblization is irreducible is $P_j(\theta, \theta) > 0$ for all $\theta \in \Theta$ (Tekin and Liu, 2012). This condition holds naturally for a large class of applications. A much more restrictive, yet still relevant, sufficient condition on an ergodic transition matrix certifying that its multiplicative reversiblization is irreducible would be that $P_j$ is also reversible. We emphasize that these sufficient conditions are not necessary for the irreducibility assumption to hold.
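The quantities introduced above can be checked numerically on a toy chain. The sketch below is an illustration under stated assumptions, not part of the paper: the two-state matrix `P` is hypothetical, the helper names are ours, the stationary distribution is approximated by repeated multiplication, and the eigenvalue shortcut in the last line is valid only for two-state chains (where the eigenvalues of a stochastic matrix are 1 and the trace minus 1).

```python
# Hypothetical two-state transition matrix for a single arm (illustrative values).
P = [[0.9, 0.1],
     [0.4, 0.6]]


def mat_mul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def evolve(beta, P, tau):
    """State distribution after the same arm is played for tau iterations."""
    n = len(P)
    for _ in range(tau):
        beta = [sum(beta[i] * P[i][j] for i in range(n)) for j in range(n)]
    return beta


def stationary(P, iters=1000):
    """Approximate the stationary distribution by iterating from uniform."""
    n = len(P)
    return evolve([1.0 / n] * n, P, iters)


def time_reversal(P, pi):
    """Time reversal: P_hat(x, y) = pi(y) * P(y, x) / pi(x)."""
    n = len(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]


pi = stationary(P)
P_hat = time_reversal(P, pi)
M = mat_mul(P, P_hat)                  # multiplicative reversiblization
# Two-state shortcut: eigenvalues of a 2x2 stochastic matrix are 1 and trace - 1.
lambda_2 = M[0][0] + M[1][1] - 1.0

if __name__ == "__main__":
    print("stationary distribution:", pi)
    print("second largest eigenvalue of M:", lambda_2)
```

Every two-state ergodic chain is reversible, so here the time reversal coincides with `P` itself; this is an instance of the more restrictive sufficient condition mentioned above. A chain with more states would need a general eigenvalue routine in place of the trace shortcut.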

Reward Models: The reward the decision-maker receives is dependent on the state of the agent. Let $(\theta_t)_{t=t_k}^{t_{k+1}-1}$ be the sequence of random state variables in epoch $k$ and let $\bar{r}^{\theta}_{\alpha(k)}$ be the observed reward at the end of the epoch, where $\alpha(k)$ and $\theta$ denote the dependence on the arm selected and the state, respectively. Similarly, let $r^{\theta_t}_{\alpha(k)}$ denote the instantaneous reward at an iteration $t$ during epoch $k$. While a number of different functions of the instantaneous rewards in each epoch could be considered, we restrict our attention to the case where $\bar{r}^{\theta}_{\alpha(k)}$ is a smoothed reward over the epoch. We consider the observed smoothed reward to be a discount-averaged or time-averaged reward of the instantaneous rewards in an epoch. For a discount factor $\gamma \in (0, 1]$ selected by the decision-maker, the discount-averaged reward is defined as

$$\bar{r}^{\theta}_{\alpha(k)} = \frac{1}{\Gamma_k} \sum_{t=t_k}^{t_{k+1}-1} \gamma^{\,t_{k+1}-1-t}\, r^{\theta_t}_{\alpha(k)},$$

where

$$\Gamma_k = \sum_{t=t_k}^{t_{k+1}-1} \gamma^{\,t_{k+1}-1-t}$$

denotes the sum of the discount factors in the epoch. In the special case that the discount factor $\gamma = 1$, the observed reward is a time-averaged reward:

$$\bar{r}^{\theta}_{\alpha(k)} = \frac{1}{\Gamma_k} \sum_{t=t_k}^{t_{k+1}-1} \gamma^{\,t_{k+1}-1-t}\, r^{\theta_t}_{\alpha(k)} = \frac{1}{\tau_k^{\alpha}} \sum_{t=t_k}^{t_{k+1}-1} r^{\theta_t}_{\alpha(k)}.$$

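The rewards at an iteration are assumed bounded; without loss of generality, they take values in $[0, 1]$. Moreover, they are stochastic: given the arm and the state, the instantaneous reward is drawn from a stochastic kernel that maps each (arm, state) pair to an element of the space of probability distributions on $[0, 1]$.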

Observe that when $\gamma \in (0, 1)$, the discount factors are growing within an epoch so that rewards are given more weight toward the end of an epoch, and when $\gamma = 1$, the rewards within an epoch are given equal weight. This general framework allows us to model a variety of objectives. For instance, agents are likely to have recency bias and hence, if a decision-maker’s instantaneous reward depends on some measure of agent happiness or satisfaction, then discounting rewards over the epoch is reasonable. On the other hand, if the decision-maker’s instantaneous reward measures revenue or profit, then equally weighting all rewards accrued in an epoch is reasonable.
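As a minimal sketch of the two feedback models, the function below computes the smoothed reward of one epoch from a list of instantaneous rewards. The function name `smoothed_reward` and the input format are our own illustrative choices; the weights follow the discount-averaged definition, with the final iteration of the epoch receiving weight one.

```python
def smoothed_reward(rewards, gamma=1.0):
    """Discount-averaged reward of one epoch's instantaneous rewards.

    rewards[i] is the reward at the i-th iteration of the epoch. The last
    entry gets weight gamma**0 = 1, so later iterations carry more weight
    when gamma < 1. Setting gamma = 1 recovers the time-averaged reward.
    """
    if not rewards:
        raise ValueError("an epoch must contain at least one iteration")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    last = len(rewards) - 1
    weights = [gamma ** (last - i) for i in range(len(rewards))]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


if __name__ == "__main__":
    epoch = [0.0, 0.0, 1.0, 1.0]          # rewards improve late in the epoch
    print("time-averaged:    ", smoothed_reward(epoch, gamma=1.0))
    print("discount-averaged:", smoothed_reward(epoch, gamma=0.5))
```

When rewards improve toward the end of an epoch, as they do once the state has mixed toward a favorable stationary distribution, the discount-averaged value exceeds the time-averaged one because the early, unrepresentative iterations are down-weighted.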

Given Assumption 2.1, if an arm $j$ is chosen at every iteration, then the Markov chain would eventually converge toward its stationary distribution $\pi_j$. This would, in turn, give rise to a fixed reward distribution. We define the expected stationary distribution reward for arm $j$ to be

$$\mu_j = \mathbb{E}\Big[\sum_{\theta \in \Theta} r^{\theta}_j\, \pi_j(\theta)\Big],$$

where the expectation is with respect to the random draw of the rewards. Likewise, we define the optimal arm, indexed by $j^*$ and denoted as $*$ when used in a subscript, to be the arm that yields the highest expected reward from its stationary distribution $\pi_*$. Hence, the expected reward of the optimal arm is

$$\mu_* = \max_{j \in [m]} \mathbb{E}\Big[\sum_{\theta \in \Theta} r^{\theta}_j\, \pi_j(\theta)\Big].$$

We use a notion of regret as a performance metric that compares the cumulative expected reward over a finite horizon of a benchmark policy and that of the policy $\alpha$. The benchmark policy we compare to is the best fixed arm in hindsight on the stationary distribution rewards. That is, we compare to the policy that plays the optimal arm at each epoch and receives rewards drawn from its stationary distribution.

Definition (Cumulative Regret). The cumulative regret after $n$ epochs of policy $\alpha$ is given by

$$R^{\alpha}(n) = n\mu_* - \mathbb{E}\Big[\sum_{k=1}^{n} \bar{r}^{\theta}_{\alpha(k)}\Big], \tag{1}$$

where the expectation is with respect to the random draw of the rewards through the stochastic kernel, the arms selected by the decision-maker using $\alpha$, and the state $\theta$.
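The benchmark and the regret can be computed directly once the stationary distributions and mean rewards are known. The sketch below is only an illustration: the helper names and all numbers are hypothetical, and `cumulative_regret` returns the realized quantity for a single run, whereas the regret in (1) is its expectation.

```python
def stationary_reward(pi, mean_reward):
    """Expected stationary reward: sum over states of pi(state) * mean reward."""
    return sum(p * r for p, r in zip(pi, mean_reward))


def cumulative_regret(mu, observed):
    """n * mu_star minus the sum of the n observed smoothed rewards.

    mu is the list of expected stationary rewards, one per arm, and observed
    holds one smoothed reward per epoch. This is a single-run realization.
    """
    mu_star = max(mu)
    return len(observed) * mu_star - sum(observed)


if __name__ == "__main__":
    # Hypothetical two-arm, two-state instance.
    mu = [
        stationary_reward([0.8, 0.2], [1.0, 0.0]),   # arm 0
        stationary_reward([0.1, 0.9], [0.0, 0.5]),   # arm 1
    ]
    print("stationary rewards:", mu)
    print("realized regret:", cumulative_regret(mu, [0.5, 0.8, 0.3]))
```

Note that the benchmark uses stationary rewards, so a policy can incur positive regret in an epoch even when it plays the optimal arm, if the state distribution has not yet mixed.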

#### 2.1.1 Discussion of Regret Notion

Let us briefly comment on the regret notion we consider. It is worth noting that the benchmark policy being compared to is the optimal policy within the policy class that is restricted to a fixed arm being played. In general, however, the globally optimal policy for a given problem instance may not always play a fixed arm at each epoch. Simply put, the globally optimal policy may select an arm dependent on the state of the Markov chain. In fact, we would expect the globally optimal policy to be a deterministic policy in each state; that is, conditioning on the state, the globally optimal policy would play the best arm for that state. This is because in the full information case, where the decision-maker observes the initial state distribution, the dynamics, etc., the decision-maker simply faces a Markov decision process, which is known to admit a deterministic state-dependent optimal policy (Bellman, 1957). Of course, since in our problem the state is fully unobserved and no prior on the distribution is available, finding such a policy is infeasible. For this reason, measuring the regret with respect to the best fixed arm in hindsight on the stationary distribution rewards is standard in multi-armed bandit problems with Markov chains (see, e.g., Tekin and Liu, 2010, 2012; Gai et al., 2011). The regret notion we adopt, which compares to the best fixed arm in hindsight when the globally optimal policy may not always play a fixed arm, is often referred to as weak regret (Auer et al., 2002b).

### 2.2 Details on Technical Challenges and Insufficiency of Existing Methods

The key technical challenge stems from the dynamic nature of the reward distributions. Indeed, the rewards depend on an underlying state distribution which is common across arms; the initial distribution of the Markov chain when an arm is pulled is the distribution at the end of the preceding arm pull. The consequences of the evolving nature of the state distribution are two-fold: (i) the reward distribution on any given arm can evolve even when the arm is not being played by the algorithm and (ii) the fashion in which the reward distribution on each arm evolves when not being played depends on the current arm being played by the algorithm. That is, future reward distributions on each arm are correlated with the present actions of an algorithm. The manner in which the reward distributions evolve is precisely where the problem deviates from the rested and restless bandit problems and becomes more challenging.

Since rewards are dependent on the algorithm, i.i.d. reward assumptions, such as those found in the stochastic bandit problem, fail to hold. Despite this, a natural question may be whether or not naively employing algorithms from this literature, such as UCB and $\epsilon$-greedy, is sufficient in a correlated Markovian environment. We consider a simple example that indicates it is not.

Example (Failure of UCB and $\epsilon$-greedy). Consider a problem instance with two arms and two states. The state transition matrix and reward structure for each arm are depicted in panel (a) of the accompanying figure, where the transition probabilities are parameterized by a sufficiently small constant. At the stationary distribution of one arm, the agent is almost surely in one of the two states, and vice versa for the other arm. The deterministic reward for each (arm, state) pair is provided under the state in the figure. The optimal strategy is to play a single arm repeatedly, which yields a per arm selection reward of nearly one; we refer to this arm as the optimal arm.

Suppose that the initial state distribution is concentrated on the state in which the optimal arm earns almost no reward. Every time UCB and $\epsilon$-greedy play the optimal arm, the agent is in this unfavorable state with high probability; consequently, the reward for the optimal arm is estimated to be close to zero. Therefore, UCB and $\epsilon$-greedy underestimate the reward for the optimal arm and misidentify the other arm as the optimal arm, since the agent almost always remains in the unfavorable state as a result of the induced Markov chain. Simulations support this finding, as demonstrated in panel (b) of the figure. Indeed, UCB and $\epsilon$-greedy converge to the suboptimal arm and suffer linear regret. In contrast, our proposed algorithms, EpochUCB (Section 3.3) and EpochGreedy (Section 3.4), identify the optimal arm rapidly and incur only sublinear regret.
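The failure mode can be reproduced qualitatively in a few lines. The instance below is hypothetical and is not the one depicted in the paper's figure: the constants `EPS` and `DELTA`, the transition matrices, and the rewards are our own choices, made so that arm 0 has the higher stationary reward (about 0.99 against about 0.5) while a single pull of arm 0 from the unfavorable state almost always pays zero. The algorithm is standard UCB1 with one iteration per pull and no knowledge of the state.

```python
import math
import random

# Hypothetical instance (not the paper's figure). States: 0 = favorable, 1 = unfavorable.
EPS, DELTA = 0.01, 0.0001
P = {
    0: [[1 - DELTA, DELTA], [EPS, 1 - EPS]],   # arm 0: drifts slowly toward state 0
    1: [[EPS, 1 - EPS], [DELTA, 1 - DELTA]],   # arm 1: pushes the state to state 1
}
REWARD = {0: [1.0, 0.0], 1: [0.0, 0.5]}        # REWARD[arm][state], deterministic


def run_ucb(horizon, seed=0):
    """Plain UCB1 with one iteration per pull; returns the pull count per arm."""
    rng = random.Random(seed)
    state = 1                                   # start in the unfavorable state
    pulls, totals = [0, 0], [0.0, 0.0]
    for t in range(1, horizon + 1):
        if 0 in pulls:
            arm = pulls.index(0)                # play each arm once first
        else:
            arm = max((0, 1), key=lambda j: totals[j] / pulls[j]
                      + math.sqrt(2 * math.log(t) / pulls[j]))
        totals[arm] += REWARD[arm][state]       # reward depends on the hidden state
        pulls[arm] += 1
        # The hidden state transitions according to the arm that was just played.
        state = 0 if rng.random() < P[arm][state][0] else 1
    return pulls


if __name__ == "__main__":
    for seed in range(3):
        print("seed", seed, "pulls per arm:", run_ucb(2000, seed))
```

In most runs of this sketch, arm 1 receives the overwhelming majority of pulls even though arm 0 has the higher stationary reward; occasionally a lucky sequence of transitions lets UCB1 discover arm 0, so the behavior is typical rather than guaranteed.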

## 3 Regret Analysis

We begin this section by deriving a general framework for analyzing the regret of any multi-armed bandit policy interacting with a correlated Markovian environment in which the observed feedback is a smoothed (discount-averaged or time-averaged) reward over an epoch. We then introduce our proposed EpochUCB algorithm and present gap-dependent and gap-independent regret bounds for both discount-averaged and time-averaged reward feedback. This is followed by an exposition of our proposed EpochGreedy algorithm. For EpochGreedy, we prove a gap-dependent regret bound for both discount-averaged and time-averaged reward feedback.

### 3.1 Regret Decomposition

In this section, we derive a regret bound for a generic policy in terms of the expected number of plays of each suboptimal arm. While such regret decompositions are typical in the bandit literature, our bound is novel. This is owing to the fact that we need to employ results on the mixing times of Markov chains to decompose the regret into components arising from the selection of a suboptimal arm and that coming from the misalignment of the agent’s state distribution with the stationary distribution.

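Define $T^{\alpha}_j(n)$ to be the random variable representing the number of epochs in which arm $j$ was selected by algorithm $\alpha$ in the initial $n$ epochs. We use $I\{\cdot\}$ to denote the indicator function, meaning that $I\{\alpha(k) = j\} = 1$ when $\alpha(k) = j$ and $I\{\alpha(k) = j\} = 0$ when $\alpha(k) \neq j$. Moreover, observe that $T^{\alpha}_j(n) = \sum_{k=1}^{n} I\{\alpha(k) = j\}$. Our goal is to relate $\mathbb{E}_{\alpha}[T^{\alpha}_j(n)]$, where $\mathbb{E}_{\alpha}$ emphasizes the randomness in the algorithm and the rewards, to the regret $R^{\alpha}(n)$. Toward this end, define $\Delta_j = \mu_* - \mu_j$ for each $j \in [m]$ to be the reward gap. We can add and subtract $\sum_{j \in [m]} \mathbb{E}_{\alpha}[T^{\alpha}_j(n)]\mu_j$ into (1) to obtain

$$\begin{aligned}
R^{\alpha}(n) &= n\mu_* - \sum_{j \in [m]} \mathbb{E}_{\alpha}[T^{\alpha}_j(n)]\mu_j + \sum_{j \in [m]} \mathbb{E}_{\alpha}[T^{\alpha}_j(n)]\mu_j - \mathbb{E}\Big[\sum_{k=1}^{n} \bar{r}^{\theta}_{\alpha(k)}\Big] \\
&= \sum_{j \in [m]} \mathbb{E}_{\alpha}[T^{\alpha}_j(n)](\mu_* - \mu_j) + \mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n} \sum_{j \in [m]} I\{\alpha(k) = j\}\mu_j\Big] - \mathbb{E}\Big[\sum_{k=1}^{n} \sum_{j \in [m]} I\{\alpha(k) = j\}\bar{r}^{\theta}_{j}\Big] \\
&= \sum_{j \neq j^*} \mathbb{E}_{\alpha}[T^{\alpha}_j(n)]\Delta_j + \mathbb{E}\Big[\sum_{k=1}^{n} \sum_{j \in [m]} I\{\alpha(k) = j\}(\mu_j - \bar{r}^{\theta}_{j})\Big].
\end{aligned} \tag{2}$$

Compared to the regret decomposition for the stochastic multi-armed bandit problem, which has the form $\sum_{j \neq j^*} \mathbb{E}_{\alpha}[T^{\alpha}_j(n)]\Delta_j$, the dynamic nature of the reward distributions in the problem leads any algorithm to incur an additional regret penalty through the term

$$\mathbb{E}\Big[\sum_{k=1}^{n} \sum_{j \in [m]} I\{\alpha(k) = j\}(\mu_j - \bar{r}^{\theta}_{j})\Big]. \tag{3}$$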

Intuitively, this regret term captures the fact that the expected reward for selecting an arm at any given epoch can deviate unfavorably from the expected stationary distribution reward of the arm when the state distribution is not at the stationary distribution.

Example (Markovian Regret Penalty). Consider an optimal arm with two states , stationary distribution given by and where is a small constant, and deterministic state-dependent rewards given by and , so that the expected stationary distribution reward for the arm is nearly one. Moreover, suppose that the initial state distribution of a problem instance is given by . The expected reward of arm is close to zero in the initial epoch for this problem instance, implying that the regret penalty for selecting arm in the epoch is almost one despite the reward gap being zero. This example highlights precisely what (3) elucidates in the regret decomposition: an arm selection can yield regret beyond the reward gap when the state distribution departs from the stationary distribution of the chosen arm. We often refer to the regret term in (3) as the Markovian regret penalty.
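The effect described in the example can be sketched numerically. The snippet below is only an illustration under assumed values: the two-state chain, the transition probability `a`, the constant `eps`, and the helper name `expected_reward_trajectory` are hypothetical and are not taken from the paper. It tracks the expected per-step reward of an arm whose stationary reward is `1 - eps` when all initial mass sits on the zero-reward state.

```python
def expected_reward_trajectory(eps=0.01, a=0.5, steps=50):
    """Expected per-step reward of a two-state chain started away from stationarity.

    Hypothetical instance: state 0 pays reward 0, state 1 pays reward 1, and the
    stationary distribution is (eps, 1 - eps).
    """
    # Transition probabilities: 0 -> 1 with probability a, 1 -> 0 with probability b.
    # Choosing b = a * eps / (1 - eps) makes (eps, 1 - eps) stationary.
    b = a * eps / (1.0 - eps)
    p0, p1 = 1.0, 0.0  # assumed initial distribution: all mass on the zero-reward state
    rewards = []
    for _ in range(steps):
        rewards.append(p1)  # expected reward equals the probability of state 1
        p0, p1 = p0 * (1 - a) + p1 * b, p0 * a + p1 * (1 - b)
    return rewards


traj = expected_reward_trajectory()
stationary_reward = 1.0 - 0.01
# Per-step gap between the stationary reward and the expected reward actually earned.
penalties = [stationary_reward - r for r in traj]
print(f"penalty at step 1: {penalties[0]:.3f}, at step 50: {penalties[-1]:.2e}")
```

In this toy instance the gap starts near one and shrinks geometrically, which is the behavior the mixing-time argument later in this section exploits.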

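In order to bound the Markovian regret penalty, we need some technical machinery. Let $r^{i}_{\theta,j}$ be the reward received when arm $j$ is chosen for the $i$-th time, where we include $\theta$ in the subscript to note the state dependence of the random reward. For each arm $j \in [m]$, define the $i$-th filtration:

$$
\mathcal{F}^{j}_{i} = \sigma\big(r^{1}_{\theta,j}, \ldots, r^{i}_{\theta,j}, \theta_{t^{j}_{1}}, \ldots, \theta_{t^{j}_{i}}\big),
$$

where $t^{j}_{i}$ is the time instance at which arm $j$ is chosen for the $i$-th time. That is, $\mathcal{F}^{j}_{i}$ is the smallest $\sigma$-algebra generated by the random variables $r^{1}_{\theta,j}, \ldots, r^{i}_{\theta,j}, \theta_{t^{j}_{1}}, \ldots, \theta_{t^{j}_{i}}$. From the tower property of expectation, we have

$$
\begin{aligned}
\mathbb{E}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}(\mu_{j} - \bar{r}^{k}_{j})\Big]
&= \mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\mathbb{E}\big[\mu_{j} - \bar{r}^{k}_{j} \,\big|\, \mathcal{F}^{j}_{T^{\alpha}_{j}(k)-1}\big]\Big] \\
&\le \mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\Big|\mathbb{E}\big[\mu_{j} - \bar{r}^{k}_{j} \,\big|\, \mathcal{F}^{j}_{T^{\alpha}_{j}(k)-1}\big]\Big|\Big]. \qquad (4)
\end{aligned}
$$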

Before continuing to bound the Markovian regret penalty, we introduce the epoch sequence considered in this work. Recall that $\alpha(k)$ represents the arm selected by the policy at the beginning of epoch $k$ and $T^{\alpha}_{\alpha(k)}(k-1)$ is the number of times this arm has been selected in previous epochs. At each epoch $k$, the policy-dependent epoch length is

$$
\tau_{k} = \tau_{0} + \zeta\, T^{\alpha}_{\alpha(k)}(k-1), \qquad (5)
$$

where $\tau_{0}$ and $\zeta$ are constants selected by the decision-maker. We also use the notation $\tau_{k,j}$ to denote the epoch length when $\alpha(k) = j$ at epoch $k$. It is important to recognize that the length of each epoch is random owing to the dependence on not only the epoch index, but also on the identity of the arm selected in the epoch. This is a reasonable model since a learning strategy should only be altered for an arm as a result of acquiring more information about the arm. The sequence ensures that as an arm is repeatedly selected and the confidence in the expected stationary reward of the arm grows, so does the length of each epoch when the arm is selected. Consequently, once highly suboptimal arms are discarded, each epoch contains sufficiently many iterations to guarantee that the observed rewards closely approximate the stationary distribution rewards; this is crucial for discriminating between the optimal arm and nearly optimal arms.

We remark that epochs of a fixed duration would lead to a regret bound that is linear in the time horizon under our analysis. This provides theoretical justification beyond Example 2.2 on the insufficiency of existing bandit algorithms for this problem. Informally, this is because algorithms that do not play arms an increasing number of times by design may never push the state distribution toward a stationary distribution, and the rewards drawn from a distribution misaligned with a stationary distribution could be highly suboptimal. This observation serves as further motivation for the feedback model we study apart from relevant applications. More detail is provided on this point in Appendix D.
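As a small illustration of the epoch length rule in (5), the following sketch computes the epoch lengths induced by a given sequence of arm choices. The function name `epoch_lengths` and the example values of `tau0` and `zeta` are assumptions made for illustration only.

```python
def epoch_lengths(arm_sequence, tau0=1, zeta=1):
    """Epoch lengths tau_k = tau0 + zeta * (number of prior pulls of the chosen arm)."""
    pulls = {}  # arm -> number of epochs in which it has been selected so far
    lengths = []
    for arm in arm_sequence:
        prior = pulls.get(arm, 0)
        lengths.append(tau0 + zeta * prior)
        pulls[arm] = prior + 1
    return lengths


arms = [0, 1, 0, 0, 1, 0]
print(epoch_lengths(arms, tau0=2, zeta=3))
```

The length depends on the identity of the selected arm rather than on the epoch index alone: an arm that has been pulled often is played for longer stretches, while a rarely pulled arm still gets short epochs.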

We now return to deriving a bound on the Markovian regret penalty. To do so, we adopt tools from the theory of Markov chains. Indeed, we need a classic result about the convergence rates of Markov chains.

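Proposition 6 (Fill, 1991). Let $P$ be an irreducible and aperiodic transition matrix on a finite state space $\Theta$ and $\pi$ be the stationary distribution. Define the chi-squared distance from stationarity at time $n$ as $\chi^{2}_{n} = \sum_{\theta \in \Theta} (\pi_{n}(\theta) - \pi(\theta))^{2}/\pi(\theta)$, where $\pi_{n} = \pi_{0}P^{n}$ and $\pi_{0}$ is the initial distribution of the Markov chain. Then, $4\|\pi_{n} - \pi\|^{2} \le \chi^{2}_{n} \le (\lambda_{2}(M(P)))^{n}\chi^{2}_{0}$. Furthermore,

$$
\max_{\pi_{0} \in \mathcal{P}(\Theta)} \Big\|\sum_{\theta} P^{n}(\theta, \cdot)\pi_{0}(\theta) - \pi(\cdot)\Big\|^{2} \le \frac{1}{4}\Big(1 + \frac{(1 - \min_{\theta}\pi(\theta))^{2}}{\min_{\theta}\pi(\theta)}\Big)(\lambda_{2}(M(P)))^{n}, \qquad (6)
$$

where $\mathcal{P}(\Theta)$ is the space of probability distributions on $\Theta$. Noting that $\chi^{2}_{0}$ is always bounded above by $1 + (1 - \min_{\theta}\pi(\theta))^{2}/\min_{\theta}\pi(\theta)$, (6) is easily derived. Proposition 6 provides a bound certifying that the state distribution of a Markov chain with an ergodic transition matrix will converge toward its stationary distribution at least at a geometric rate in $n$ when $\lambda_{2}(M(P)) < 1$. Recall that when $M(P)$ is irreducible, $\lambda_{2}(M(P)) < 1$.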

The ensuing lemma translates Proposition 6 into a bound on the deviation of the expected reward of an arm selection from the expected stationary distribution reward of the arm; the deviation decays geometrically as a function of the epoch length. Beforehand, for each arm $j \in [m]$, define the following constants:

$$
\lambda_{j} = (\lambda_{2}(M(P_{j})))^{1/2}, \quad \eta_{j} = \min\{\gamma, \lambda_{j}\}, \quad \phi_{j} = \max\{\gamma, \lambda_{j}\}, \quad \psi_{j} = \eta_{j}/\phi_{j}.
$$

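Lemma 3.1 (Convergence of Expected Reward to Expected Stationary Reward). Suppose Assumptions 2.1 and 2.1 hold and $\alpha(k) = j$ at epoch $k$. Then,

$$
\Big|\mathbb{E}\big[\mu_{j} - \bar{r}^{k}_{j} \,\big|\, \mathcal{F}^{j}_{T^{\alpha}_{j}(k)-1}\big]\Big| \le \frac{C_{j}\, u_{j}(\tau_{k,j})}{\bar{\sigma}_{k}},
$$

where

$$
C_{j} = \frac{1}{2}\Big(1 + \frac{(1 - \min_{\theta}\pi_{j}(\theta))^{2}}{\min_{\theta}\pi_{j}(\theta)}\Big)^{1/2}
$$

and $u_{j}(\tau_{k,j})$ is defined as follows depending on the type of reward feedback:

1. Discount-Averaged Reward Feedback: $\gamma \in (0, 1)$.

$$
u_{j}(\tau_{k,j}) =
\begin{cases}
\dfrac{\phi_{j}^{\tau_{k,j}-1}\,(1 - \psi_{j}^{\tau_{k,j}})}{1 - \psi_{j}}, & \text{if } \gamma \neq \lambda_{j}, \\
\phi_{j}^{\tau_{k,j}-1}\,\tau_{k,j}, & \text{otherwise.}
\end{cases}
\qquad (7)
$$

2. Time-Averaged Reward Feedback: $\gamma = 1$.

$$
u_{j}(\tau_{k,j}) = \frac{1 - \lambda_{j}^{\tau_{k,j}}}{1 - \lambda_{j}}. \qquad (8)
$$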

The proof of Lemma 3.1 is primarily a consequence of Proposition 6 and can be found in Appendix B.2. Observe that since Proposition 6 holds for any initial state distribution, Lemma 3.1 holds irrespective of the state distribution at the beginning of an epoch, and hence is independent of the algorithm. Lemma 3.1 contains a discount factor dependent definition for $u_{j}(\tau_{k,j})$ under discount-averaged reward feedback. To be precise, the definition of $u_{j}(\tau_{k,j})$ depends on whether $\gamma = \lambda_{j}$. The definition of $u_{j}(\tau_{k,j})$ provided for the case that $\gamma = \lambda_{j}$ holds even when $\gamma \neq \lambda_{j}$, since $(1 - \psi_{j}^{\tau})/(1 - \psi_{j}) = \sum_{i=0}^{\tau-1}\psi_{j}^{i} \le \tau$. However, the bound specified for when $\gamma \neq \lambda_{j}$ is tighter than that specified for when $\gamma = \lambda_{j}$. More generally, each bound we give in this paper for discount-averaged reward feedback contains similar discount factor dependent definitions; it will always be the case that the bounds provided for the event in which $\gamma = \lambda_{j}$ for some $j \in [m]$ hold when $\gamma \neq \lambda_{j}$ for each $j \in [m]$, but the latter bounds are stronger.
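The case structure in (7) and (8) can be sketched as a small helper. This is a minimal sketch, not the authors' code: the function name `u_bound` and the convention that `gamma=None` stands for time-averaged feedback are assumptions, and `math.isclose` is used as a floating-point stand-in for the exact equality test between the discount factor and `lam`.

```python
import math


def u_bound(tau, lam, gamma=None):
    """Evaluate u_j(tau): equation (7) for a discount factor gamma, else equation (8).

    `lam` plays the role of lambda_j; `gamma=None` selects time-averaged feedback
    (a convention assumed here for illustration).
    """
    if gamma is None:
        # Time-averaged feedback, equation (8).
        return (1.0 - lam ** tau) / (1.0 - lam)
    phi, eta = max(gamma, lam), min(gamma, lam)
    if math.isclose(gamma, lam):
        # Equal case of equation (7).
        return phi ** (tau - 1) * tau
    psi = eta / phi
    # Unequal case of equation (7).
    return phi ** (tau - 1) * (1.0 - psi ** tau) / (1.0 - psi)


tau = 10
tight = u_bound(tau, lam=0.5, gamma=0.9)
loose = 0.9 ** (tau - 1) * tau  # equal-case formula evaluated with phi = 0.9
print(tight, loose)
```

Comparing `tight` with `loose` illustrates the remark above: the equal-case expression remains a valid upper bound when the two constants differ, but it is weaker.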

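Returning to the regret decomposition, we apply Lemma 3.1 to (4) and obtain

$$
\mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\Big|\mathbb{E}\big[\mu_{j} - \bar{r}^{k}_{j} \,\big|\, \mathcal{F}^{j}_{T^{\alpha}_{j}(k)-1}\big]\Big|\Big] \le \mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\frac{C_{j}\, u_{j}(\tau_{k,j})}{\bar{\sigma}_{k}}\Big]. \qquad (9)
$$

We now derive a bound on (9) dependent on the type of reward feedback. Building on Lemma 3.1, we need to consider several cases: 1) $\gamma \in (0, 1)$ and $\gamma \neq \lambda_{j}$ for all $j \in [m]$, 2) $\gamma \in (0, 1)$ and $\gamma = \lambda_{j}$ for some $j \in [m]$, and 3) $\gamma = 1$.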

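Case 1. $\gamma \in (0, 1)$ and $\gamma \neq \lambda_{j}$ for all $j \in [m]$.

$$
\begin{aligned}
\mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\frac{C_{j}\, u_{j}(\tau_{k,j})}{\bar{\sigma}_{k}}\Big]
&= \mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\frac{C_{j}\,\phi_{j}^{\tau_{k,j}-1}(1 - \psi_{j}^{\tau_{k,j}})}{\bar{\sigma}_{k}(1 - \psi_{j})}\Big] \\
&\le \mathbb{E}_{\alpha}\Big[\sum_{j \in [m]} \frac{C_{j}}{1 - \psi_{j}}\sum_{k=1}^{n} \mathbb{I}\{\alpha(k) = j\}\,\frac{\phi_{j}^{\tau_{k,j}-1}}{\bar{\sigma}_{k}}\Big] \\
&\le \mathbb{E}_{\alpha}\Big[\sum_{j \in [m]} \frac{C_{j}}{1 - \psi_{j}}\sum_{k=1}^{n} \mathbb{I}\{\alpha(k) = j\}\,\phi_{j}^{\tau_{k,j}-1}\Big] \\
&= \mathbb{E}_{\alpha}\Big[\sum_{j \in [m]} \frac{C_{j}}{1 - \psi_{j}}\sum_{i=1}^{T^{\alpha}_{j}(n)} \phi_{j}^{\tau_{0}+\zeta(i-1)-1}\Big] \\
&\le \sum_{j \in [m]} \frac{C_{j}}{1 - \psi_{j}}\sum_{i=1}^{n} \phi_{j}^{\tau_{0}+\zeta(i-1)-1} \qquad (10) \\
&= \sum_{j \in [m]} C_{j}\Big(\frac{\phi_{j}^{\tau_{0}}}{\phi_{j} - \eta_{j}}\Big)\Big(\frac{1 - \phi_{j}^{\zeta n}}{1 - \phi_{j}^{\zeta}}\Big).
\end{aligned}
$$

Observe that as $n \to \infty$, the inner sum found in (10) approaches the constant $\phi_{j}^{\tau_{0}-1}/(1 - \phi_{j}^{\zeta})$ since it is a geometric series.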

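Case 2. $\gamma \in (0, 1)$ and $\gamma = \lambda_{j}$ for some $j \in [m]$.

$$
\begin{aligned}
\mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\frac{C_{j}\, u_{j}(\tau_{k,j})}{\bar{\sigma}_{k}}\Big]
&= \mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\frac{C_{j}\,\phi_{j}^{\tau_{k,j}-1}\,\tau_{k,j}}{\bar{\sigma}_{k}}\Big] \\
&\le \mathbb{E}_{\alpha}\Big[\sum_{j \in [m]} C_{j}\sum_{k=1}^{n} \mathbb{I}\{\alpha(k) = j\}\,\phi_{j}^{\tau_{k,j}-1}\,\tau_{k,j}\Big] \\
&= \mathbb{E}_{\alpha}\Big[\sum_{j \in [m]} C_{j}\sum_{i=1}^{T^{\alpha}_{j}(n)} \phi_{j}^{\tau_{0}+\zeta(i-1)-1}\,(\tau_{0} + \zeta(i-1))\Big] \\
&\le \sum_{j \in [m]} C_{j}\sum_{i=1}^{n} \phi_{j}^{\tau_{0}+\zeta(i-1)-1}\,(\tau_{0} + \zeta(i-1)) \qquad (11) \\
&= \sum_{j \in [m]} C_{j}\,\phi_{j}^{\tau_{0}-1}\Big(\frac{\tau_{0} - \phi_{j}^{\zeta n}(\tau_{0} + \zeta n)}{1 - \phi_{j}^{\zeta}} + \frac{\zeta\,\phi_{j}^{\zeta}(1 - \phi_{j}^{\zeta n})}{(1 - \phi_{j}^{\zeta})^{2}}\Big).
\end{aligned}
$$

The final equality follows from recognizing that the inner sum contained in (11) is an arithmetico-geometric series and substituting the expression for the finite sum. Observe that as $n \to \infty$, the arithmetico-geometric series in (11) approaches the constant $\phi_{j}^{\tau_{0}-1}\big(\tau_{0}/(1 - \phi_{j}^{\zeta}) + \zeta\phi_{j}^{\zeta}/(1 - \phi_{j}^{\zeta})^{2}\big)$. For more details on this series, see Appendix B.3.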

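Case 3. $\gamma = 1$.

$$
\begin{aligned}
\mathbb{E}_{\alpha}\Big[\sum_{k=1}^{n}\sum_{j \in [m]} \mathbb{I}\{\alpha(k) = j\}\,\frac{C_{j}\, u_{j}(\tau_{k,j})}{\bar{\sigma}_{k}}\Big]
&\le \mathbb{E}_{\alpha}\Big[\sum_{j \in [m]} \frac{C_{j}}{1 - \lambda_{j}}\sum_{k=1}^{n} \mathbb{I}\{\alpha(k) = j\}\,\frac{1}{\tau_{k,j}}\Big] \\
&= \mathbb{E}_{\alpha}\Big[\sum_{j \in [m]} \frac{C_{j}}{1 - \lambda_{j}}\sum_{i=1}^{T^{\alpha}_{j}(n)} \frac{1}{\tau_{0} + \zeta(i-1)}\Big] \\
&\le \sum_{j \in [m]} \frac{C_{j}}{1 - \lambda_{j}}\sum_{i=1}^{n} \frac{1}{\tau_{0} + \zeta(i-1)} \qquad (12) \\
&\le \sum_{j \in [m]} \frac{C_{j}}{1 - \lambda_{j}}\Big(\frac{1}{\tau_{0}} + \frac{1}{\zeta}\log\Big(1 + \frac{\zeta n}{\tau_{0}}\Big)\Big).
\end{aligned}
$$

The final inequality is obtained from the observation that the inner sum found in (12) is a harmonic sum that can be bounded with standard techniques. We include the derivation in Appendix B.1.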
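The closed forms used in the three cases can be checked numerically. The sketch below compares the inner sums in (10), (11), and (12) against the expressions substituted for them; all function names and the parameter values (`phi`, `tau0`, `zeta`, `n`) are arbitrary choices made for illustration.

```python
import math


def geometric_sum(phi, tau0, zeta, n):
    """Inner sum of (10), evaluated term by term."""
    return sum(phi ** (tau0 + zeta * (i - 1) - 1) for i in range(1, n + 1))


def geometric_closed(phi, tau0, zeta, n):
    """Closed form of the geometric series in (10)."""
    return phi ** (tau0 - 1) * (1 - phi ** (zeta * n)) / (1 - phi ** zeta)


def arith_geo_sum(phi, tau0, zeta, n):
    """Inner sum of (11), evaluated term by term."""
    return sum(
        phi ** (tau0 + zeta * (i - 1) - 1) * (tau0 + zeta * (i - 1))
        for i in range(1, n + 1)
    )


def arith_geo_closed(phi, tau0, zeta, n):
    """Closed form of the arithmetico-geometric series in (11)."""
    r = phi ** zeta
    return phi ** (tau0 - 1) * (
        (tau0 - r ** n * (tau0 + zeta * n)) / (1 - r)
        + zeta * r * (1 - r ** n) / (1 - r) ** 2
    )


def harmonic_sum(tau0, zeta, n):
    """Inner sum of (12), evaluated term by term."""
    return sum(1.0 / (tau0 + zeta * (i - 1)) for i in range(1, n + 1))


def harmonic_bound(tau0, zeta, n):
    """Logarithmic upper bound on the harmonic-type sum in (12)."""
    return 1.0 / tau0 + math.log(1 + zeta * n / tau0) / zeta


phi, tau0, zeta, n = 0.8, 2, 3, 25
print(geometric_sum(phi, tau0, zeta, n), geometric_closed(phi, tau0, zeta, n))
print(arith_geo_sum(phi, tau0, zeta, n), arith_geo_closed(phi, tau0, zeta, n))
print(harmonic_sum(tau0, zeta, n), harmonic_bound(tau0, zeta, n))
```

The first two pairs agree up to floating-point error, while the third shows the harmonic-type sum sitting below its logarithmic bound.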

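The bounds we just derived give our final bounds on the Markovian regret penalty. Plugging them back into the initial expression for the regret decomposition found in (2) gives rise to the following proposition.

Proposition 3.1 (Regret Decomposition). Suppose Assumptions 2.1 and 2.1 hold. Then, for any given algorithm $\alpha$ with corresponding epoch length sequence as given in (5):

$$
R^{\alpha}(n) \le \sum_{j \neq j^{*}} \mathbb{E}_{\alpha}[T^{\alpha}_{j}(n)]\Delta_{j} + \sum_{j \in [m]} L_{j}(n),
$$

where $L_{j}(n)$ is defined as follows depending on the type of reward feedback:

1. Discount-Averaged Reward Feedback: $\gamma \in (0, 1)$.

$$
L_{j}(n) =
\begin{cases}
C_{j}\Big(\dfrac{\phi_{j}^{\tau_{0}}}{\phi_{j} - \eta_{j}}\Big)\Big(\dfrac{1 - \phi_{j}^{\zeta n}}{1 - \phi_{j}^{\zeta}}\Big), & \text{if } \gamma \neq \lambda_{j} \ \forall\, j \in [m], \\
C_{j}\,\phi_{j}^{\tau_{0}-1}\Big(\dfrac{\tau_{0} - \phi_{j}^{\zeta n}(\tau_{0} + \zeta n)}{1 - \phi_{j}^{\zeta}} + \dfrac{\zeta\,\phi_{j}^{\zeta}(1 - \phi_{j}^{\zeta n})}{(1 - \phi_{j}^{\zeta})^{2}}\Big), & \text{otherwise.}
\end{cases}
\qquad (13)
$$

2. Time-Averaged Reward Feedback: $\gamma = 1$.

$$
L_{j}(n) = \frac{C_{j}}{1 - \lambda_{j}}\Big(\frac{1}{\tau_{0}} + \frac{1}{\zeta}\log\Big(1 + \frac{\zeta n}{\tau_{0}}\Big)\Big). \qquad (14)
$$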

Which type of reward feedback (discount-averaged or time-averaged) yields the smaller Markovian regret penalty bound $\sum_{j \in [m]} L_{j}(n)$ depends on the precise discount factor $\gamma$ under discount-averaged reward feedback, the Markov chain statistics for each arm $j \in [m]$, and the time horizon $n$. Typically, however, the Markovian regret penalty will be smaller under discount-averaged reward feedback than under time-averaged reward feedback. In most cases, this is to be expected since the rewards are given increased weight as the state distribution tends closer to a stationary distribution.

#### 3.1.1 Discussion of Regret Decomposition

The dynamic and evolving reward structure present in the problem we study leads any algorithm to incur regret beyond the usual penalty for playing suboptimal arms via what we refer to as the Markovian regret penalty (see (3)). Leveraging classic results on the mixing of Markov chains and the construction of the epoch length sequence considered in this work, we bounded the Markovian regret penalty with $\sum_{j \in [m]} L_{j}(n)$. In essence, this bound limits the regret arising from the rewards on an arm being drawn from an evolving distribution to a term that quickly approaches a constant as the time horizon grows in the case of discount-averaged reward feedback, and to a term that grows only logarithmically in the time horizon in the case of time-averaged reward feedback. The regret decomposition allows us to now focus solely on the selection of suboptimal arms.

### 3.2 Preliminaries for Algorithm–Based Regret Bounds

Given Proposition 3.1, in order to obtain a bound on the regret for a particular algorithm $\alpha$, we need to limit $\mathbb{E}_{\alpha}[T^{\alpha}_{j}(n)]$ for each $j \neq j^{*}$. To do so, it is important to characterize the uncertainty in the empirical mean reward of each arm as a function of the number of times the arm has been pulled. Fundamentally, there are two sources of uncertainty in the observed rewards:

1. The reward distribution on each arm is dynamic owing to the dependence on the unobserved and evolving state distribution.

2. The observed rewards derive from stochastic reward distributions.

Hence, in contrast to the conventional stochastic multi-armed bandit problem, where the stochasticity of the observed rewards is the only source of uncertainty, we must also carefully consider how much uncertainty arises from the dynamic nature of the reward distributions as an artifact of the unobserved and evolving state distribution.

From Lemma 3.1, we can observe that the upper bound on the deviation between the expected reward of an arm selection and the expected stationary distribution reward for that arm decays as a function of the number of times the arm has been selected, since epochs grow linearly in the number of times an arm has been pulled in the past. Consequently, the mean of these deviations vanishes as the number of times the arm has been pulled grows. Using this observation, the following lemma separates the maximum amount of uncertainty in the empirical mean reward of an arm arising from the dynamic nature of the reward distribution on the arm from that arising from the stochasticity of the rewards. Precisely, Lemma 3.2 provides a bound on the deviation between the expected mean reward and the expected stationary distribution reward for an arm after it has been selected $T_{j}$ times.

Lemma 3.2 (Convergence of Expected Mean Reward to Expected Stationary Reward). Suppose Assumptions 2.1 and 2.1 hold. Then, after an arm $j$ has been played $T_{j}$ times by an algorithm $\alpha$ with corresponding epoch length sequence as given in (5),

$$
\Big|\mu_{j} - \frac{1}{T_{j}}\sum_{i=1}^{T_{j}} \mathbb{E}\big[r^{i}_{\theta,j} \,\big|\, \mathcal{F}^{j}_{i-1}\big]\Big| \le \frac{L_{j}(T_{j})}{T_{j}}.
$$

The proof of Lemma 3.2 follows from manipulating the expression that needs to be bounded into a sum over terms that can each be bounded using Lemma 3.1 and then applying similar analysis to that which was used to bound (9) when deriving Proposition 3.1. The full proof can be found in Appendix B.3.

In a similar manner to how we were able to limit the Markovian regret penalty, Lemma 3.2 limits the amount of uncertainty stemming from the dynamic nature of the reward distribution on an arm to a term that tends toward zero quickly as a function of the number of times the arm has been pulled.

Given that Lemma 3.2 characterizes the maximum amount of uncertainty coming solely from the evolution of the reward distributions in time, we are left to identify the uncertainty arising from the stochasticity in the rewards. To do so, we need a concentration inequality that does not require independence in the observed rewards of an arm since the underlying Markov chain that generates the rewards is common across the arms. On that account, an important technical tool for our impending algorithm-based regret analysis is the Azuma-Hoeffding inequality.

Azuma-Hoeffding Inequality (Azuma, 1967; Hoeffding, 1963). Suppose $(Z_{i})_{i=0}^{n}$ is a martingale with respect to the filtration $(\mathcal{F}_{i})_{i=0}^{n}$ and there are finite, non-negative constants $c_{i}$ such that $|Z_{i} - Z_{i-1}| \le c_{i}$ almost surely for all $i \in \{1, \ldots, n\}$. Then for all $\epsilon > 0$

$$
P\big(Z_{n} - \mathbb{E}[Z_{n}] \le -\epsilon\big) \le \exp\Big(-\frac{\epsilon^{2}}{2\sum_{i=1}^{n} c_{i}^{2}}\Big).
$$

To apply the Azuma-Hoeffding inequality, we need to formulate our problem as a martingale difference sequence. Toward this end, define the random variables

$$
X_{j,i} = r^{i}_{\theta,j} - \mathbb{E}\big[r^{i}_{\theta,j} \,\big|\, \mathcal{F}^{j}_{i-1}\big],
$$

where the expectation is taken with respect to , and

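$$
Y_{j,T_{j}} = \sum_{i=1}^{T_{j}} X_{j,i}, \qquad (15)
$$

where $T_{j}$ denotes the number of times arm $j$ has been played. Note that $(Y_{j,T_{j}})_{T_{j} \ge 1}$ is a martingale; indeed, since $Y_{j,T_{j}}$ is $\mathcal{F}^{j}_{T_{j}}$-measurable by construction,

$$
\mathbb{E}\big[Y_{j,T_{j}+1} \,\big|\, \mathcal{F}^{j}_{T_{j}}\big] = \mathbb{E}\big[X_{j,T_{j}+1} \,\big|\, \mathcal{F}^{j}_{T_{j}}\big] + \mathbb{E}\big[Y_{j,T_{j}} \,\big|\, \mathcal{F}^{j}_{T_{j}}\big] = Y_{j,T_{j}}
$$

and $\mathbb{E}[|Y_{j,T_{j}}|] < \infty$ since rewards are bounded. Moreover, the boundedness of the rewards also implies the martingale has bounded differences: $|Y_{j,i} - Y_{j,i-1}| = |X_{j,i}| \le 1$ almost surely since rewards are normalized to be on the interval $[0, 1]$, without loss of generality.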
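To give a feel for the Azuma-Hoeffding bound with unit difference bounds, the sketch below evaluates the right-hand side of the inequality and compares it with an empirical lower-tail frequency. The martingale used here, a sum of independent fair $\pm 1$ steps, is a stand-in chosen for illustration; it is not the reward martingale constructed above, and the function names are hypothetical.

```python
import math
import random


def azuma_hoeffding_bound(eps, c):
    """Right-hand side of the inequality: exp(-eps^2 / (2 * sum_i c_i^2))."""
    return math.exp(-eps ** 2 / (2.0 * sum(ci ** 2 for ci in c)))


def empirical_lower_tail(eps, n, trials=5000, seed=0):
    """Fraction of trials in which a sum of n fair +/-1 steps falls to -eps or below.

    The +/-1 random walk is a martingale with zero mean and differences bounded by 1.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = sum(rng.choice((-1.0, 1.0)) for _ in range(n))
        if z <= -eps:
            hits += 1
    return hits / trials


n, eps = 100, 20.0
bound = azuma_hoeffding_bound(eps, [1.0] * n)
freq = empirical_lower_tail(eps, n)
print(f"bound = {bound:.4f}, empirical frequency = {freq:.4f}")
```

The bound is distribution-free, so it is expected to be conservative relative to the empirical frequency in this simple setting.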

The remainder of this section is devoted to presenting our proposed EpochUCB and EpochGreedy algorithms along with the regret bound guarantees we obtain for each of these algorithms. The environment simulation procedure for the algorithms is given in Algorithm 1. To derive the algorithm-based regret bounds, we make use of the techniques we have developed to reason about the uncertainty in the empirical mean reward of each arm in conjunction with the proof techniques developed to analyze the UCB and $\epsilon$-greedy algorithms.

### 3.3 EpochUCB Algorithm Analysis

In this section, we analyze the regret of EpochUCB (Algorithm 2). At a high level, at each epoch EpochUCB plays the arm that maximizes the sum of the empirical mean reward and the confidence window, for a time period that grows linearly as a function of the number of times the arm has been chosen in the past. More formally, for each arm $j \in [m]$, define the empirical mean reward after $k-1$ epochs to be

$$
\bar{R}^{T^{\alpha}_{j}(k-1)}_{j} = \frac{1}{T^{\alpha}_{j}(k-1)}\sum_{i=1}^{T^{\alpha}_{j}(k-1)} r^{i}_{\theta,j},
$$

and the confidence window at epoch $k$ to be

$$
c^{\,j}_{k,\,T^{\alpha}_{j}(k-1)} = \frac{L_{j}(T^{\alpha}_{j}(k-1))}{T^{\alpha}_{j}(k-1)} + \sqrt{\frac{6\log(k)}{T^{\alpha}_{j}(k-1)}}. \qquad (16)
$$
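The confidence window in (16) can be sketched as follows for time-averaged reward feedback, using (14) for the Markovian term. This is a sketch under assumptions: in practice the constants `C` and `lam` (standing for $C_{j}$ and $\lambda_{j}$) depend on the unknown chain, so the values below are placeholders, and the function names are hypothetical.

```python
import math


def markov_term_time_avg(T, C, lam, tau0=1, zeta=1):
    """L_j(T) from (14) for time-averaged reward feedback."""
    return C / (1.0 - lam) * (1.0 / tau0 + math.log(1.0 + zeta * T / tau0) / zeta)


def confidence_window(k, T, C, lam, tau0=1, zeta=1):
    """Confidence window (16) at epoch k for an arm pulled T times so far."""
    markov_part = markov_term_time_avg(T, C, lam, tau0, zeta) / T
    stochastic_part = math.sqrt(6.0 * math.log(k) / T)
    return markov_part + stochastic_part


# Placeholder constants for illustration only.
for pulls in (1, 10, 50):
    print(pulls, confidence_window(k=100, T=pulls, C=1.0, lam=0.5))
```

The first summand accounts for the drift of the reward distribution toward stationarity, and the second is the usual UCB-style term for reward noise; both shrink as the arm is pulled more often.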

Following an initialization round in which each arm is played once, the algorithm selects the arm at epoch $k$ such that:

 α(k)=argmaxj∈[m]\barRTαj(k−1)j+kT