A classical Markov Decision Process (MDP) provides a formal description of a sequential decision-making problem. MDPs are a standard model for decision making under uncertainty (Puterman (1994), Bertsekas and Tsitsiklis (1996)), and in particular for reinforcement learning. In the classical MDP model, the uncertainty is captured by stochastic state-transition dynamics and reward functions, which remain fixed throughout. In contrast, we consider a setting in which both the transition dynamics and the reward functions are allowed to change over time. As motivation, consider the problem of deciding which ads to place on a webpage. The instantaneous reward is the payoff when viewers are redirected to an advertiser, and the state captures the details of the current ad. With a heterogeneous group of viewers, an invariant state-transition function cannot accurately capture the transition dynamics. The instantaneous reward, which depends on external factors, is also better represented by changing reward functions. For details on how this example fits our model, cf. Yuan Yu and Mannor (2009a), which studies a similar MDP problem, as well as Yuan Yu and Mannor (2009b) and Abbasi et al. (2013) for additional motivation and further practical applications of this problem setting.
1.1 Main contribution
For the mentioned switching-MDP problem setting in which an adversary can make abrupt changes to the transition probabilities and reward distributions a certain number of times, we provide an algorithm called SW-Ucrl, a version of Ucrl2 (Jaksch et al. (2010)) that employs a sliding window to quickly adapt to potential changes. We derive a high-probability upper bound on the cumulative regret of our algorithm of when the window size is adapted to the problem setting, including the number of changes. This improves upon the upper bound for Ucrl2 with restarts (Jaksch et al. (2010)) for the same problem in terms of dependence on , and . Moreover, our algorithm also works without the knowledge of the number of changes, although with a more convoluted regret bound, which shall be specified later.
1.2 Related work
There exist several works on reinforcement learning in finite (non-changing) MDPs, including Burnetas and Katehakis (1997), Bartlett and Tewari (2009), and Jaksch et al. (2010), to mention only a few. MDPs in which the state-transition probabilities change arbitrarily but the reward functions remain fixed have been considered by Nilim and El Ghaoui (2005) and Xu and Mannor (2006). On the other hand, Even-dar et al. (2005) and Dick et al. (2014) consider MDPs with fixed state-transition probabilities and changing reward functions. Moreover, Even-dar et al. (2005, Theorem 11) show that the case of MDPs with both changing state-transition probabilities and changing reward functions is computationally hard. Yuan Yu and Mannor (2009a) and Yuan Yu and Mannor (2009b) consider arbitrary changes in the reward functions and arbitrary, but bounded, changes in the state-transition probabilities. They also give regret bounds that scale with the proportion of changes in the state-transition kernel and that, in the worst case, grow linearly with time. Abbasi et al. (2013) consider MDP problems with (oblivious) adversarial changes in state-transition probabilities and reward functions and provide algorithms for minimizing the regret with respect to a comparison set of stationary (expert) policies. The MDP setting we consider is similar; however, our regret formalization is different in the sense that we consider the regret against an optimal non-stationary policy (across changes). This setting has already been considered by Jaksch et al. (2010), and we use the suggested Ucrl2 with restarts algorithm as a benchmark to compare our work with.
Sliding window approaches to deal with changing environments have been considered in other learning problems, too. In particular, Garivier and Moulines (2011) consider the problem of changing reward functions for multi-armed bandits and provide a variant of UCB (Auer et al. (2002)) using a sliding window.
The rest of the article is structured as follows. In Section 2, we formally define the problem at hand. This is followed by our algorithmic solution, SW-Ucrl, presented in Section 3, which also features regret bounds and a sample complexity bound. Next, in Section 4, we analyze our algorithm and provide proofs for the regret bound. Section 5 provides some complementary experimental results, followed by a concluding discussion in Section 6.
2 Problem setting
In an MDP with finite state space (S = ) and a finite action space (A = ), the learner’s task at each time step is to choose an action to execute in the current state . Upon executing the chosen action in state , the learner receives a reward given by some reward function . Here, we assume that returns a value drawn iid from some unknown distribution on with mean and the environment transitions into the next state selected randomly according to the unknown probabilities .
In this article, we consider a setting in which reward distributions and state-transition probabilities are allowed to change (but not the state space and action space) at unknown time steps (called change-points henceforth). We call this setting a switching-MDP problem (following the naming of a similar multi-armed bandit setting by Garivier and Moulines (2011)). Neither the change-points nor the changes in reward distributions and state-transition probabilities depend on the previous behavior of the algorithm or the filtration of the history . It can be assumed that the change-points are set in advance at time steps by an oblivious adversary. At time step , a switching-MDP is in its initial configuration where rewards are drawn from an unknown distribution on with mean and state transitions occur according to the transition probabilities . At time step , a switching-MDP is in configuration . Thus, a switching-MDP problem is completely defined by a tuple .
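To make the setting concrete, the following is a minimal sketch of a switching-MDP environment whose change-points are fixed before any interaction takes place, as an oblivious adversary would do. The class and attribute names (`MDPConfig`, `SwitchingMDP`, `active_index`), the Bernoulli reward model, and the convention that a change-point is the first time step at which the next configuration is active are all assumptions made for illustration; they are not taken from the article.

```python
import bisect
import random
from dataclasses import dataclass


@dataclass
class MDPConfig:
    """One constituent MDP (hypothetical container)."""
    mean_reward: list  # mean_reward[s][a], assumed to lie in [0, 1]
    transition: list   # transition[s][a][s2] = probability of moving from s to s2


class SwitchingMDP:
    """Switching-MDP with change-points chosen in advance (oblivious adversary)."""

    def __init__(self, configs, change_points, seed=0):
        # Assumed convention: change_points[i] is the first time step
        # at which configs[i + 1] is active.
        assert len(configs) == len(change_points) + 1
        assert change_points == sorted(change_points)
        self.configs = configs
        self.change_points = change_points
        self.rng = random.Random(seed)
        self.t = 1  # time steps are 1-indexed in this sketch

    def active_index(self, t):
        # The number of change-points <= t selects the active configuration.
        return bisect.bisect_right(self.change_points, t)

    def step(self, state, action):
        cfg = self.configs[self.active_index(self.t)]
        # Bernoulli reward with the configured mean (an illustrative choice).
        reward = 1.0 if self.rng.random() < cfg.mean_reward[state][action] else 0.0
        probs = cfg.transition[state][action]
        next_state = self.rng.choices(range(len(probs)), weights=probs)[0]
        self.t += 1
        return reward, next_state
```

Note that neither the change-points nor the configurations depend on the actions passed to `step`, which mirrors the obliviousness assumption stated above.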
An algorithm attempting to solve a switching-MDP from an initial state chooses an action to execute at time step , i.e. it finds a policy . A policy can either choose the same action for a particular state at any time step (stationary policy), or it might choose different actions for the same state when it is visited at different time steps (non-stationary policy). The sequence of the states visited by at step as decided by its policy , the action chosen and the subsequent reward received for can be thought of as the result of a stochastic process.
As a performance measure, we use regret, which is also used in various other learning paradigms. In order to arrive at the definition of the regret of an algorithm for a switching-MDP , let us define a few other terms. The average reward for a constituent MDP is the limit of the expected average accumulated reward when an algorithm following a stationary policy is run on from an initial state .
We note that for a given (fixed) MDP the optimal average reward is attained by a stationary policy and cannot be increased by using non-stationary policies.
Another intrinsic parameter of an MDP configuration is its diameter.
(Diameter of an MDP) The diameter of an MDP is defined as follows:
where the random variable denotes the number of steps needed to reach state from state in an MDP for the first time following any policy from the set of feasible stationary policies.
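As an illustration of this definition, the diameter of a small, known MDP can be computed numerically. The sketch below assumes the standard form of the definition from Jaksch et al. (2010), namely the maximum over ordered pairs of distinct states of the minimal expected number of steps to reach one from the other. The function names and the nested-list layout `transition[s][a][s2]` are hypothetical.

```python
def min_expected_hitting_times(transition, target, tol=1e-10, max_iter=100_000):
    """Minimal expected number of steps to reach `target` from every state.

    Uses value iteration on the stochastic shortest-path problem
    h(s) = min_a [1 + sum_{s2 != target} p(s2 | s, a) * h(s2)], h(target) = 0.
    """
    n_states = len(transition)
    h = [0.0] * n_states
    for _ in range(max_iter):
        new_h = [0.0] * n_states
        for s in range(n_states):
            if s == target:
                continue
            new_h[s] = min(
                1.0 + sum(p * h[s2] for s2, p in enumerate(probs) if s2 != target)
                for probs in transition[s]
            )
        if max(abs(x - y) for x, y in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    raise RuntimeError("no convergence; the MDP may have infinite diameter")


def diameter(transition):
    """Largest minimal expected hitting time over all ordered state pairs."""
    n_states = len(transition)
    worst = 0.0
    for target in range(n_states):
        h = min_expected_hitting_times(transition, target)
        worst = max(worst, max((h[s] for s in range(n_states) if s != target), default=0.0))
    return worst
```

For a two-state MDP with a single action that moves from state 0 to state 1 with probability 0.5 and from state 1 back to state 0 with probability 1, the expected hitting times are 2 and 1, so this sketch returns a value close to 2.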
For MDPs with finite diameter, the optimal average reward does not depend on the initial state (Puterman (1994)). Thus, assuming finite diameter for all the constituent MDPs of a switching-MDP problem, for constituent MDP is defined as
With the above in hand, we can state that the regret of an algorithm for a switching-MDP problem is the sum of the missed rewards compared to the optimal average rewards ’s when the corresponding constituent MDP is active.
(Regret for a switching-MDP problem) The regret of an algorithm operating on a switching-MDP problem = and starting at an initial state is defined as
where, if is active at time .
When it is clear from the context, we drop the subscript and simply use to denote .
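The regret described above, i.e. the sum over time of the optimal average reward of the currently active constituent MDP minus the reward actually received, can be sketched as follows. The function name, the argument layout, and the convention that a change-point is the first time step of the new configuration are assumptions for illustration.

```python
import bisect


def switching_regret(rewards, change_points, optimal_gains):
    """Sum of missed rewards relative to the active MDP's optimal average reward.

    rewards[t - 1]   : reward received at time step t (1-indexed time)
    change_points    : sorted time steps at which a new configuration becomes active
    optimal_gains[i] : optimal average reward of the i-th constituent MDP
    """
    assert len(optimal_gains) == len(change_points) + 1
    regret = 0.0
    for t, r in enumerate(rewards, start=1):
        active = bisect.bisect_right(change_points, t)
        regret += optimal_gains[active] - r
    return regret
```

Because the comparison value changes with the active configuration, this measures performance against the optimal non-stationary policy across changes rather than against a single fixed policy.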
3 Proposed algorithm: SW-UCRL
Our proposed algorithm, called Sliding Window UCRL (SW-Ucrl), is a non-trivial modification of the Ucrl2 algorithm given by Jaksch et al. (2010). Unlike Ucrl2, our algorithm SW-Ucrl only maintains the history of the last (called the window size) time steps. In a way, this can be interpreted as SW-Ucrl sliding a window of size across the filtration of history.
At its core, SW-Ucrl works on the principle of “optimism in the face of uncertainty”. Like its predecessor Ucrl2, it proceeds in episodes, each divided into three phases. At the start of every episode , it assesses its performance in the past time steps and changes the policy, if necessary. More precisely (see Figure 1), during the initialization phase for episode (steps , and ), it computes the estimates and for the mean rewards of each state-action pair and the state-transition probabilities of each triplet from the last observations. In the policy computation phase (steps and ), SW-Ucrl defines a set of MDPs which are statistically plausible given and . The mean rewards and the state-transition probabilities of every MDP in are stipulated to be close to the estimated mean rewards and estimated state-transition probabilities , respectively. The corresponding confidence intervals are specified in Eq. (1) and Eq. (2). The algorithm then chooses an optimistic MDP from and uses extended value iteration (Jaksch et al., 2010) to select a near-optimal policy for . In the last phase of the episode (step ), is executed. The lengths of the episodes are not fixed a priori, but depend on the observations made so far in the current episode as well as the observations before the start of the episode. Episode ends when the number of occurrences of the current state-action pair in the episode is equal to the number of occurrences of the same state-action pair in observations before the start of episode . It is worth restating that the values , , and are computed only from the previous observations at the start of each episode. Observations beyond are disregarded with the intention of “forgetting” previously active MDP configurations. Note that, due to the episode termination criterion, no episode can be longer than steps.
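The bookkeeping described above can be sketched in a few lines. This is only an illustration of the sliding-window estimates and the episode termination test; it does not include the confidence intervals of Eq. (1) and Eq. (2) or extended value iteration. The names `SlidingWindowStats` and `episode_should_end` are hypothetical, and the `max(1, ...)` guard for state-action pairs with no visits in the window is an assumption borrowed from the stopping rule of Ucrl2.

```python
from collections import Counter, deque


class SlidingWindowStats:
    """Keeps only the most recent `window_size` observations."""

    def __init__(self, window_size):
        # A deque with maxlen silently discards the oldest observation when full.
        self.window = deque(maxlen=window_size)

    def record(self, state, action, reward, next_state):
        self.window.append((state, action, reward, next_state))

    def estimates(self):
        """Visit counts, empirical mean rewards and transition frequencies in the window."""
        visits, reward_sum, transitions = Counter(), Counter(), Counter()
        for s, a, r, s2 in self.window:
            visits[(s, a)] += 1
            reward_sum[(s, a)] += r
            transitions[(s, a, s2)] += 1
        mean_reward = {sa: reward_sum[sa] / n for sa, n in visits.items()}
        trans_prob = {
            (s, a, s2): c / visits[(s, a)] for (s, a, s2), c in transitions.items()
        }
        return visits, mean_reward, trans_prob


def episode_should_end(in_episode_counts, counts_at_start, state, action):
    """True once the in-episode visits of (state, action) match its window count
    from the start of the episode (at least one visit is required, by assumption)."""
    return in_episode_counts[(state, action)] >= max(1, counts_at_start[(state, action)])
```

Since a state-action pair can have at most as many visits in the window as the window is long, this stopping test also reflects why no episode can outlast the window.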
The following theorem provides an upper bound on the regret of SW-Ucrl. The elements of its proof can be found in Section 4.
Given a switching-MDP with changes in the reward distributions and state-transition probabilities, with probability at least , it holds that for any initial state and any , the regret of SW-Ucrl using window size is bounded by
From above, one can compute the optimal value of as follows:
If the time horizon and the number of changes are known to the algorithm, then can be set to its optimal value given by Eq. (3), and we get the following bound.
Given a switching-MDP problem with and changes in the reward distributions and state-transition probabilities, the regret of SW-Ucrl using for any initial state and any is upper bounded by
with probability at least .
The proof of this corollary is detailed in Appendix III.
This bound improves upon the bound provided for Ucrl2 with restarts (Jaksch et al. (2010, Theorem 6)) in terms of its dependence on , and . Our bound features , and while the provided bound for Ucrl2 with restarts features , and . We note, however, that it might be possible to get an improved bound for Ucrl2 with restarts using an optimized restarting schedule.
Finally, we also obtain the following PAC-bound for our algorithm.
Given a switching-MDP problem with changes, with probability at least , the average per-step regret of SW-Ucrl using is at most after any steps with
The proof of this corollary is detailed in Appendix IV.
4 Analysis of Sliding Window UCRL
The regret can be split up into two components: the regret incurred due to the changes in the MDP () and the regret incurred when the MDP remains the same (). Due to the definition of SW-Ucrl, a change in the MDP can only affect the episode in which the said change has occurred or the following episode. Due to the episode stopping criterion, the length of an episode can be at most equal to the window size. Hence .
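The counting argument in this paragraph can be written out explicitly. The symbols below are assumptions, since they are not fixed in the surrounding text: $l$ denotes the number of changes, $W$ the window size, and $\Delta'$ the regret incurred due to changes. The sketch additionally assumes that rewards lie in $[0,1]$, so that the regret of a single step is at most 1.

```latex
\Delta' \;\le\;
\underbrace{l}_{\text{changes}} \cdot
\underbrace{2}_{\text{affected episodes per change}} \cdot
\underbrace{W}_{\text{maximal episode length}} \cdot
\underbrace{1}_{\text{maximal per-step regret}}
\;=\; 2\,l\,W .
```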
Now, we compute the regret in the episodes in which the MDP doesn’t change. This computation is similar to the analysis of Ucrl2 in (Jaksch et al., 2010, Section 4). We define the regret in episode in which the switching-MDP doesn’t change its configuration and only stays in configuration as
Then, considering only episodes which are not affected by changes, one can show that
with probability at least where is the respective number of episodes up to time-step .
Denoting the unchanged MDP in episode as , with probability at least ,
To proceed from here, we make use of the following novel lemmas which present some challenges related to handling the limitation of history to the sliding window.
Provided that , the number of episodes of SW-Ucrl up to time-step is upper bounded as
The proof of Lemma 1 is given in Appendix I. Here we only provide the key idea behind the proof. We argue that the number of episodes in a batch is maximal if the state-action counts at the first step of the batch are all . Summing up this maximal number of episodes over batches of size gives the claimed bound.
Proof sketch. Divide the time horizon into batches such that the first batch starts at and each batch ends with the earliest episode termination after the batch size reaches . Then the size of each batch and the number of batches . Let in the current batch when episode starts, , and in batch . Then, and we have
5 Experiments
For practical evaluation, we generated switching-MDPs with , and . The changes are set to happen at every time steps. This simple setting can be motivated from the ad example given in Section 1 in which changes happen at regular intervals.
For SW-Ucrl, the window size was chosen to be the optimal as given by Eq. (3), using a lower bound of for the diameter. For comparison, we used two algorithms: Ucrl2 with restarts as given in Jaksch et al. (2010) (referred to as Ucrl2-R henceforth) and Ucrl2 with restarts after every time steps (referred to as Ucrl2-RW henceforth). Note that the latter restarting schedule is a modification by us, not provided by Jaksch et al. (2010). SW-Ucrl, Ucrl2-R, and Ucrl2-RW were run with on switching-MDP problems with random rewards and state-transition probabilities.
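An experimental setup of this kind, with random constituent MDPs and changes at regular intervals, might be generated along the following lines. This is a hedged sketch only: the sampling distributions (uniform mean rewards, normalized uniform weights for transitions), the function names, and the parameters `horizon` and `change_interval` are assumptions for illustration and are not claimed to match the procedure used for the reported results.

```python
import random


def random_mdp(n_states, n_actions, rng):
    """Draw mean rewards in [0, 1) and a random transition kernel."""
    mean_reward = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    transition = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            # 1 - random() lies in (0, 1], so the normalizing sum is never zero.
            weights = [1.0 - rng.random() for _ in range(n_states)]
            total = sum(weights)
            rows.append([w / total for w in weights])
        transition.append(rows)
    return mean_reward, transition


def random_switching_mdp(n_states, n_actions, horizon, change_interval, seed=0):
    """Constituent MDPs plus change-points spaced `change_interval` steps apart."""
    rng = random.Random(seed)
    # With 1-indexed time, the first configuration covers steps 1..change_interval.
    change_points = list(range(change_interval + 1, horizon + 1, change_interval))
    configs = [random_mdp(n_states, n_actions, rng) for _ in range(len(change_points) + 1)]
    return configs, change_points
```

Fixing the seed makes a generated problem instance reproducible, so that several algorithms can be compared on the same sequence of configurations.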
Figure (a) shows the average regret for changes and Figure (b) for changes. A clearly noticeable trend in both plots (at least for SW-Ucrl and our modification, Ucrl2-RW) is the “bumps” in the regret curves at the time steps where changes occur. This behaviour is expected: the regret curves begin to flatten as the algorithms learn the current MDP configuration, and a change to another MDP then causes the curves to rise again. Ucrl2-R and Ucrl2-RW give only slightly worse performance when the number of changes is limited to . However, even for a moderate number of changes such as , SW-Ucrl and our modification, Ucrl2-RW, are observed to give better performance than Ucrl2-R. In both cases, our proposed algorithm gives improved performance over Ucrl2-RW.
6 Discussion and Further Directions
The theoretical performance guarantees and the experimental results demonstrate that the algorithm introduced in this article, SW-Ucrl, provides a competent solution for the task of regret minimization on MDPs with arbitrarily changing rewards and state-transition probabilities. We have also provided a sample complexity bound on the number of sub-optimal steps taken by SW-Ucrl.
We conjecture that the sample complexity bound can be used to provide a variation-dependent regret bound, although the proof might present a few technical difficulties when handling the sliding window aspect of the algorithm. A related question is to establish a link between the extent of allowable variation in rewards and state-transition probabilities and the minimal achievable regret, as was done recently for the problem of multi-armed bandits with non-stationary rewards in Besbes et al. (2014). Another direction is to refine the episode-stopping criterion so that a new policy is computed only when the currently employed policy performs below a suitable reference value.
References
- Abbasi et al. [2013] Yasin Abbasi, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems 26, pages 2508–2516. Curran Associates, Inc., 2013.
- Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, May 2002.
- Bartlett and Tewari [2009] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 35–42, Arlington, Virginia, United States, 2009. AUAI Press. ISBN 978-0-9749039-5-8.
- Bertsekas and Tsitsiklis [1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. 1996.
- Besbes et al. [2014] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 199–207. Curran Associates, Inc., 2014.
- Burnetas and Katehakis [1997] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Math. Oper. Res., 22(1):222–255, February 1997. ISSN 0364-765X.
- Dick et al. [2014] T Dick, András György, and Csaba Szepesvári. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the International Conference on Machine Learning, pages 512–520, 01 2014.
- Even-dar et al. [2005] Eyal Even-dar, Sham M Kakade, and Yishay Mansour. Experts in a Markov decision process. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 401–408. MIT Press, 2005.
- Garivier and Moulines [2011] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT’11, pages 174–188, Berlin, Heidelberg, 2011. Springer-Verlag.
- Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 11:1563–1600, August 2010.
- Nilim and El Ghaoui [2005] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Oper. Res., 53(5):780–798, September 2005.
- Puterman [1994] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.
- Xu and Mannor [2006] Huan Xu and Shie Mannor. The robustness-performance tradeoff in Markov decision processes. In NIPS, pages 1537–1544. MIT Press, 2006.
- Yuan Yu and Mannor [2009a] Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In Proceedings of the IEEE Conference on Decision and Control, pages 2946–2953, 12 2009a.
- Yuan Yu and Mannor [2009b] Jia Yuan Yu and Shie Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In 2009 International Conference on Game Theory for Networks, pages 314–322, May 2009b.
Appendix I Proof of Lemma 1
Divide the time steps into batches of equal size (with the possible exception of the last batch). For each of these batches we consider the maximal number of episodes contained in this batch. Obviously, the maximal number of episodes can be obtained greedily, if each episode is as short as possible. For each time step with state-action counts in the window reaching back to , the shortest possible episode starting at (according to the episode termination criterion) will consist of repeated visits to a fixed state-action pair contained in .
Accordingly, in a window of size , the number of episodes is largest, if the state-action counts at the first step of the batch are all . For this case we know (cf. Lemma of Jaksch et al. ) that the number of episodes within steps is bounded by Summing up over all batches gives the claimed bound.
Appendix II Technical Details for the proof of Lemma 2
A Proof of Lemma 2
We shall prove this lemma by dividing the time horizon into batches (different from those used in the proof of Lemma 1) as follows. The first batch starts at and each batch ends with the earliest episode termination after the batch size has reached . That way, each episode is completely contained in one batch. As any episode can be at most of size , it holds that the size of each batch . Therefore, the number of batches .
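The batch construction used in this proof can be sketched as follows. The function name and the parameter `threshold` (presumably the window size, which is an assumption here) are hypothetical. Episodes are accumulated until the running batch size reaches the threshold, and the batch is closed at that episode's termination, so every episode lies entirely within one batch.

```python
def partition_into_batches(episode_lengths, threshold):
    """Group consecutive episodes into batches.

    A batch is closed at the first episode termination at which its total
    length has reached `threshold`. The final batch may be shorter.
    """
    batches, current, size = [], [], 0
    for length in episode_lengths:
        current.append(length)
        size += length
        if size >= threshold:
            batches.append(current)
            current, size = [], 0
    if current:
        batches.append(current)  # last, possibly incomplete, batch
    return batches
```

If no episode is longer than the threshold, a batch is below the threshold before its last episode is added, so every closed batch has a size of at least the threshold and less than twice the threshold.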
Let be the set containing the episodes in batch b, and let number of occurrences of state-action pair in the current batch when episode starts. Clearly . Let . Furthermore, let number of occurrences of state-action pair in batch , setting . Note that .
In the above, the first inequality follows from using Proposition 1 with , , , , and , while the second inequality follows from Jensen’s inequality. ∎
B Proposition required to prove Lemma 2
For any non-negative integers , and with the following properties
it holds that,
We now prove the proposition by induction over .
The first equality is true because and the last inequality is true because
if , then , and and the RHS is non-negative since all and are non-negative integers.
if , then and using (12).