1 Introduction
A classical Markov Decision Process (MDP) provides a formal description of a sequential decision making problem. Markov decision processes are a standard model for problems of decision making under uncertainty (Puterman (1994); Bertsekas and Tsitsiklis (1996)) and in particular for reinforcement learning. In the classical MDP model, the uncertainty is modeled by stochastic state-transition dynamics and reward functions, which however remain fixed throughout. In contrast, here we consider a setting in which both the transition dynamics and the reward functions are allowed to change over time. As a motivation, consider the problem of deciding which ads to place on a webpage. The instantaneous reward is the payoff when viewers are redirected to an advertiser, and the state captures the details of the current ad. With a heterogeneous group of viewers, an invariant state-transition function cannot accurately capture the transition dynamics. The instantaneous reward, which depends on external factors, is also better represented by changing reward functions. For more details on how this particular example fits our model, cf. Yu and Mannor (2009a), which studies a similar MDP problem, as well as Yu and Mannor (2009b) and Abbasi et al. (2013) for additional motivation and further practical applications of this problem setting.
1.1 Main contribution
For the mentioned switching-MDP problem setting, in which an adversary can make abrupt changes to the transition probabilities and reward distributions a certain number of times, we provide an algorithm called SWUcrl, a version of Ucrl2 (Jaksch et al. (2010)) that employs a sliding window to quickly adapt to potential changes. We derive a high-probability upper bound on the cumulative regret of our algorithm when the window size is adapted to the problem setting, including the number of changes. This improves upon the upper bound for Ucrl2 with restarts (Jaksch et al. (2010)) for the same problem in terms of the dependence on the problem parameters. Moreover, our algorithm also works without knowledge of the number of changes, although with a more convoluted regret bound, which will be specified later.
1.2 Related work
There exist several works on reinforcement learning in finite (non-changing) MDPs, including Burnetas and Katehakis (1997), Bartlett and Tewari (2009), and Jaksch et al. (2010), to mention only a few. MDPs in which the state-transition probabilities change arbitrarily but the reward functions remain fixed have been considered by Nilim and El Ghaoui (2005) and Xu and Mannor (2006). On the other hand, Even-Dar et al. (2005) and Dick et al. (2014) consider the problem of MDPs with fixed state-transition probabilities and changing reward functions. Moreover, Even-Dar et al. (2005, Theorem 11) also show that the case of MDPs with both changing state-transition probabilities and changing reward functions is computationally hard. Yu and Mannor (2009a) and Yu and Mannor (2009b) consider arbitrary changes in the reward functions and arbitrary, but bounded, changes in the state-transition probabilities. They also give regret bounds that scale with the proportion of changes in the state-transition kernel and that in the worst case grow linearly with time. Abbasi et al. (2013) consider MDP problems with (oblivious) adversarial changes in state-transition probabilities and reward functions and provide algorithms for minimizing the regret with respect to a comparison set of stationary (expert) policies. The MDP setting we consider is similar; however, our regret formalization is different in the sense that we consider the regret against an optimal non-stationary policy (across changes). This setting has already been considered by Jaksch et al. (2010), and we use the suggested Ucrl2 with restarts algorithm as a benchmark to compare our work with.
Sliding-window approaches to dealing with changing environments have been considered in other learning problems, too. In particular, Garivier and Moulines (2011) consider the problem of changing reward functions for multi-armed bandits and provide a variant of UCB (Auer et al. (2002)) using a sliding window.
1.3 Outline
The rest of the article is structured as follows. In Section 2, we formally define the problem at hand. This is followed by our algorithmic solution, SWUcrl, presented in Section 3, together with regret bounds and a sample complexity bound. Next, in Section 4, we analyze our algorithm, providing proofs for the regret bound. Section 5 provides some complementing experimental results, followed by some concluding discussion in Section 6.
2 Problem setting
In an MDP with a finite state space and a finite action space, the learner's task at each time step is to choose an action to execute in the current state. Upon executing the chosen action in the current state, the learner receives a reward drawn i.i.d. from some unknown distribution whose mean depends on the state-action pair, and the environment transitions into the next state, selected randomly according to the unknown state-transition probabilities.
In this article, we consider a setting in which reward distributions and state-transition probabilities are allowed to change (but not the state space and action space) at unknown time steps (called change-points henceforth). We call this setting a switching-MDP problem (following the naming of a similar multi-armed bandit setting by Garivier and Moulines (2011)). Neither the change-points nor the changes in reward distributions and state-transition probabilities depend on the previous behavior of the algorithm or on the history so far. It can be assumed that the change-points are set in advance by an oblivious adversary. Initially, a switching-MDP is in its first configuration, in which rewards are drawn from an unknown distribution and state transitions occur according to the corresponding unknown transition probabilities; after each change-point, it moves to the next configuration. Thus, a switching-MDP problem is completely defined by the state space, the action space, the sequence of configurations, and the change-points.
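As an illustration of the setting just defined, the following sketch simulates a switching-MDP. All names, the problem sizes, and the choice of Bernoulli rewards are our assumptions for concreteness, not part of the formal model:

```python
import numpy as np

def make_random_mdp(n_states, n_actions, rng):
    """Sample one MDP configuration: mean rewards and a transition kernel."""
    rewards = rng.random((n_states, n_actions))            # mean rewards in [0, 1]
    transitions = rng.random((n_states, n_actions, n_states))
    transitions /= transitions.sum(axis=2, keepdims=True)  # normalize each row
    return rewards, transitions

def run_switching_mdp(policy, horizon, change_points, n_states=5, n_actions=3, seed=0):
    """Simulate a switching-MDP: the active configuration changes at the
    (obliviously fixed) change-points; state and action spaces stay fixed."""
    rng = np.random.default_rng(seed)
    configs = [make_random_mdp(n_states, n_actions, rng)
               for _ in range(len(change_points) + 1)]
    state, total_reward, active = 0, 0.0, 0
    for t in range(horizon):
        if active < len(change_points) and t == change_points[active]:
            active += 1                                    # abrupt change of configuration
        rewards, transitions = configs[active]
        a = policy(state, t)
        total_reward += rng.binomial(1, rewards[state, a])  # Bernoulli reward with given mean
        state = rng.choice(n_states, p=transitions[state, a])
    return total_reward
```

A fixed round-robin policy such as `lambda s, t: t % 3` can serve as a trivial baseline when experimenting with this simulator.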
An algorithm attempting to solve a switching-MDP from an initial state chooses an action to execute at each time step, i.e., it follows a policy. A policy can either choose the same action for a particular state at any time step (stationary policy), or it might choose different actions for the same state when it is visited at different time steps (non-stationary policy). The sequence of the states visited by the algorithm, the actions chosen as decided by its policy, and the subsequent rewards received can be thought of as the result of a stochastic process.
As a performance measure, we use regret, which is used in various other learning paradigms as well. In order to arrive at the definition of the regret of an algorithm for a switching-MDP problem, let us define a few other terms. The average reward of a constituent MDP under a stationary policy is the limit of the expected average accumulated reward when an algorithm following that policy is run on the MDP from an initial state.
We note that for a given (fixed) MDP the optimal average reward is attained by a stationary policy and cannot be increased by using nonstationary policies.
Another intrinsic parameter of an MDP configuration is its diameter.
Definition 1.
(Diameter of an MDP) The diameter of an MDP $M$ is defined as
$$D(M) \;:=\; \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}\big[\,T(s' \mid M, \pi, s)\,\big],$$
where the random variable $T(s' \mid M, \pi, s)$ denotes the number of steps needed to reach state $s'$ from state $s$ in MDP $M$ for the first time, following policy $\pi$ from the set of feasible stationary policies. For MDPs with finite diameter, the optimal average reward does not depend on the initial state (Puterman (1994)). Thus, assuming finite diameter for all the constituent MDPs of a switching-MDP problem, the optimal average reward of a constituent MDP is defined as the maximum of its average reward over all stationary policies.
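To make the definition concrete, the diameter can be approximated numerically by solving, for every target state, the stochastic shortest-path problem via value iteration. The following sketch assumes a dense transition array and is an illustration of the definition, not part of the paper's analysis:

```python
import numpy as np

def diameter(transitions, tol=1e-8, max_iter=100_000):
    """Numerically estimate the diameter of an MDP: the maximum over state
    pairs (s, s') of the minimal expected time to reach s' from s, where the
    minimum is over stationary policies. `transitions` has shape (S, A, S)."""
    S = transitions.shape[0]
    worst = 0.0
    for target in range(S):
        h = np.zeros(S)                    # expected hitting times to `target`
        for _ in range(max_iter):
            # Bellman update: one step plus the expected hitting time of the
            # successor state; the target itself absorbs (hitting time 0).
            q = 1.0 + transitions @ h      # shape (S, A)
            h_new = q.min(axis=1)
            h_new[target] = 0.0
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        worst = max(worst, h.max())
    return worst
```

For instance, a deterministic two-state cycle has diameter 1, and a deterministic three-state ring has diameter 2, which this routine recovers.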
With the above in hand, we can state that the regret of an algorithm for a switching-MDP problem is the sum of the missed rewards compared to the optimal average rewards of the constituent MDPs that are active at the respective time steps.
Definition 2.
(Regret for a switching-MDP problem) The regret of an algorithm operating on a switching-MDP problem for $T$ time steps and starting at an initial state is defined as
$$\Delta \;:=\; \sum_{t=1}^{T} \big(\rho^*_t - r_t\big),$$
where $r_t$ is the reward received at step $t$, and $\rho^*_t = \rho^*(M_i)$ if constituent MDP $M_i$ is active at time step $t$.
When it is clear from the context, we drop the subscripts and use the abbreviated notation for the regret.
3 Proposed algorithm: SWUCRL
Our proposed algorithm, called Sliding Window UCRL (SWUcrl), is a non-trivial modification of the Ucrl2 algorithm given by Jaksch et al. (2010). Unlike Ucrl2, SWUcrl only maintains the history of the most recent time steps, their number being called the window size. In a way, SWUcrl can be interpreted as sliding a window of this size across the filtration of the history.
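The windowed bookkeeping this requires can be sketched as follows. The class name and the exact constants in the confidence widths are our simplifications for illustration; the paper's Eq. (1) and Eq. (2) contain the precise expressions:

```python
import math
from collections import deque, defaultdict

class SlidingWindowStats:
    """Sketch of the statistics SWUcrl needs at the start of an episode:
    empirical mean rewards and transition frequencies computed from only
    the last `window` observations."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()                  # (s, a, r, s_next) tuples in order
        self.counts = defaultdict(int)         # N(s, a) within the window
        self.reward_sums = defaultdict(float)
        self.trans_counts = defaultdict(int)   # N(s, a, s') within the window

    def observe(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))
        self.counts[(s, a)] += 1
        self.reward_sums[(s, a)] += r
        self.trans_counts[(s, a, s_next)] += 1
        if len(self.buffer) > self.window:     # forget the oldest observation
            s0, a0, r0, sn0 = self.buffer.popleft()
            self.counts[(s0, a0)] -= 1
            self.reward_sums[(s0, a0)] -= r0
            self.trans_counts[(s0, a0, sn0)] -= 1

    def estimates(self, s, a, n_states, delta=0.05):
        """Windowed estimates plus schematic confidence widths."""
        n = max(1, self.counts[(s, a)])
        r_hat = self.reward_sums[(s, a)] / n
        p_hat = [self.trans_counts[(s, a, sn)] / n for sn in range(n_states)]
        conf_r = math.sqrt(math.log(2 / delta) / (2 * n))          # schematic width
        conf_p = math.sqrt(2 * n_states * math.log(2 / delta) / n) # schematic width
        return r_hat, p_hat, conf_r, conf_p
```

The eviction step in `observe` is what implements the "forgetting" of observations older than the window.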
At its core, SWUcrl works on the principle of “optimism in the face of uncertainty”. Like its predecessor Ucrl2, it proceeds in episodes divided into three phases. At the start of every episode, it assesses its performance over the past time steps and changes the policy, if necessary. More precisely (see Figure 1), during the initialization phase of an episode, it computes the estimates of the mean rewards for each state-action pair and of the state-transition probabilities for each state-action-state triplet from the observations within the window. In the policy computation phase, SWUcrl defines a set of MDPs which are statistically plausible given these estimates: the mean rewards and the state-transition probabilities of every MDP in this set are stipulated to be close to the estimated mean rewards and estimated state-transition probabilities, respectively. The corresponding confidence intervals are specified in Eq. (1) and Eq. (2). The algorithm then chooses an optimistic MDP from this set and uses extended value iteration (Jaksch et al., 2010) to select a near-optimal policy for it. In the last phase of the episode, this policy is executed. The lengths of the episodes are not fixed a priori, but depend upon the observations made so far in the current episode as well as the observations before the start of the episode. An episode ends when the number of occurrences of the current state-action pair in the episode equals the number of occurrences of the same state-action pair in the windowed observations before the start of the episode. It is worth restating that the estimates and confidence intervals are computed only from the observations within the window at the start of each episode. Not considering observations beyond the window is done with the intention of “forgetting” previously active MDP configurations. Note that due to the episode termination criterion, no episode can be longer than the window size.

The following theorem provides an upper bound on the regret of SWUcrl. The elements of its proof can be found in Section 4.
Theorem 1.
Given a switching-MDP with changes in the reward distributions and state-transition probabilities, with probability at least , it holds that for any initial state and any , the regret of SWUcrl using window size is bounded by
where .
From the above, one can compute the optimal value of the window size as follows:
(3) 
If the time horizon and the number of changes are known to the algorithm, then the window size can be set to its optimal value given by Eq. (3), and we get the following bound.
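To indicate where an optimal window size of this kind typically comes from, the trade-off can be sketched as follows. The constants $c_1, c_2$, the linear-in-$W$ change term, and the $T/\sqrt{W}$ estimation term are our assumptions for illustration, not the paper's exact bound:

```latex
% Schematic trade-off: regret caused by changes grows with the window size W,
% while regret from estimation within a window shrinks with W.
\Delta(W) \;\lesssim\; c_1\,\ell\,W \;+\; c_2\,\frac{T}{\sqrt{W}},
\qquad
\frac{d}{dW}\Big(c_1\,\ell\,W + c_2\,T\,W^{-1/2}\Big) = 0
\;\;\Longrightarrow\;\;
W^* \;=\; \Big(\frac{c_2\,T}{2\,c_1\,\ell}\Big)^{2/3} \;\propto\; \Big(\frac{T}{\ell}\Big)^{2/3}.
```

Under this schematic form, plugging $W^*$ back in makes both terms scale as $\ell^{1/3} T^{2/3}$, which is the usual shape of sliding-window bounds.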
Corollary 1.
Given a switching-MDP problem with and changes in the reward distributions and state-transition probabilities, the regret of SWUcrl using for any initial state and any is upper bounded by
with probability at least .
The proof of this corollary is detailed in Appendix III.
This bound improves upon the bound provided for Ucrl2 with restarts (Jaksch et al. (2010, Theorem 6)) in terms of the dependence on , and . Our bound features , and , while the provided bound for Ucrl2 with restarts features , and . We note, however, that it might be possible to get an improved bound for Ucrl2 with restarts using an optimized restarting schedule.
Finally, we also obtain the following PAC bound for our algorithm.
Corollary 2.
Given a switching-MDP problem with changes, with probability at least , the average per-step regret of SWUcrl using is at most after any steps with
The proof of this corollary is detailed in Appendix IV.
4 Analysis of Sliding Window UCRL
The regret can be split into two components: the regret incurred due to the changes in the MDP and the regret incurred while the MDP remains the same. By the definition of SWUcrl, a change in the MDP can only affect the episode in which the change occurs or the following episode. Due to the episode stopping criterion, the length of an episode can at most be equal to the window size. Hence, the first component is bounded by the number of changes times twice the window size.
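The decomposition just described can be written schematically. The symbols (number of changes $\ell$, window size $W$, and the two regret components) are our labels, and the bound assumes rewards in $[0,1]$ so that the per-step regret is at most $1$:

```latex
\Delta \;=\; \Delta_{\mathrm{change}} + \Delta_{\mathrm{same}},
\qquad
\Delta_{\mathrm{change}}
\;\le\;
\underbrace{\ell}_{\text{changes}}
\cdot
\underbrace{2}_{\substack{\text{affected}\\ \text{episodes}}}
\cdot
\underbrace{W}_{\substack{\text{max episode}\\ \text{length}}}
\;=\; 2\,\ell\,W,
```

since each change can affect only the episode in which it occurs and the following one, and no episode is longer than $W$ steps.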
Now, we compute the regret in the episodes in which the MDP does not change. This computation is similar to the analysis of Ucrl2 in (Jaksch et al., 2010, Section 4). We define the regret in an episode in which the switching-MDP does not change its configuration as
Then, considering only episodes which are not affected by changes, one can show that
(4) 
with probability at least , where is the respective number of episodes up to the considered time step.
Denoting the unchanged MDP in episode as , with probability at least ,
(5) 
Furthermore, as for the derivations of (4) and (5), following the proof of Jaksch et al. (2010), one can show that
To proceed from here, we make use of the following novel lemmas which present some challenges related to handling the limitation of history to the sliding window.
Lemma 1.
Provided that , the number of episodes of SWUcrl up to the considered time step is upper bounded as
The proof of Lemma 1 is given in Appendix I. Here we only provide the key idea behind the proof. We argue that the number of episodes in a batch is maximal if the state-action counts at the first step of the batch are all . Summing up this maximal number of episodes over the batches gives the claimed bound.
Lemma 2.
The detailed proof for Lemma 2 is given in Appendix II. Here, we provide a brief overview of the proof.
Proof sketch. Divide the time horizon into batches such that the first batch starts at the first time step and each batch ends with the earliest episode termination after the batch size has reached the window size. Then the size of each batch, and hence the number of batches, is bounded accordingly. Counting, for each state-action pair, its occurrences within the current batch when an episode starts as well as its total number of occurrences in the batch, we have
The first inequality follows from Proposition 1 given in Appendix II, while the second inequality follows from Jensen's inequality. ∎
5 Experiments
For practical evaluation, we generated switching-MDPs of fixed size. The changes are set to happen at regular intervals of time steps. This simple setting can be motivated by the ad example given in Section 1, in which changes happen at regular intervals.
For SWUcrl, the window size was chosen to be the optimal one as given by Eq. (3), using a lower bound of for the diameter. For comparison, we used two algorithms: Ucrl2 with restarts as given in Jaksch et al. (2010) (referred to as Ucrl2R henceforth) and Ucrl2 with restarts after a fixed number of time steps (referred to as Ucrl2RW henceforth). Note that the latter restarting schedule is a modification by us, not provided by Jaksch et al. (2010). SWUcrl, Ucrl2R, and Ucrl2RW were run on randomly generated switching-MDP problems with random rewards and state-transition probabilities.
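A sketch of the evaluation loop used for such comparisons follows. The function names and the interface (an algorithm runner that returns its per-step rewards, and the active optimal average reward at each step) are hypothetical; the paper's actual experiment code is not reproduced here:

```python
import numpy as np

def average_regret_curves(run_algorithm, n_runs, horizon, optimal_avg_reward):
    """Run an algorithm several times on independently drawn switching-MDPs
    and average the per-step regret curves.
    `run_algorithm(seed)` is assumed to return the sequence of collected
    rewards; `optimal_avg_reward[t]` is the optimal average reward of the
    configuration active at step t."""
    curves = np.zeros((n_runs, horizon))
    for i in range(n_runs):
        rewards = np.asarray(run_algorithm(seed=i))
        regret = np.cumsum(optimal_avg_reward - rewards)   # missed reward so far
        curves[i] = regret / np.arange(1, horizon + 1)     # average per-step regret
    return curves.mean(axis=0)
```

Plotting the returned curve against time makes the "bumps" at the change-points directly visible: the curve flattens while a configuration is being learned and rises again after each change.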
Figure (a) shows the average regret for the smaller number of changes and Figure (b) for the larger number. Clearly noticeable in both plots (at least for SWUcrl and our modification, Ucrl2RW) are the “bumps” in the regret curves at the time steps where the changes occur. This behaviour is expected, as it shows that the algorithms were learning the MDP configuration, indicated by the regret curves beginning to flatten, before a change to another MDP results in an ascent of the regret curves. Ucrl2R and Ucrl2RW give only slightly worse performance when the number of changes is small. However, even for a moderate number of changes, SWUcrl and our modification Ucrl2RW are observed to give better performance than Ucrl2R. In both cases, our proposed algorithm gives improved performance over Ucrl2RW.
6 Discussion and Further Directions
The theoretical performance guarantee and the experimental results demonstrate that the algorithm introduced in this article, SWUcrl, provides a competent solution to the task of regret minimization in MDPs with arbitrarily changing rewards and state-transition probabilities. We have also provided a sample complexity bound on the number of suboptimal steps taken by SWUcrl.
We conjecture that the sample complexity bound can be used to provide a variation-dependent regret bound, although the proof might present a few technical difficulties when handling the sliding-window aspect of the algorithm. A related question is to establish a link between the extent of allowable variation in rewards and state-transition probabilities and the minimal achievable regret, as was done recently for the problem of multi-armed bandits with non-stationary rewards in Besbes et al. (2014). Another direction is to refine the episode-stopping criterion so that a new policy is computed only when the currently employed policy performs below a suitable reference value.
References

Abbasi et al. [2013] Yasin Abbasi, Peter L. Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems 26, pages 2508–2516. Curran Associates, Inc., 2013.

Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, May 2002.

Bartlett and Tewari [2009] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 35–42, Arlington, Virginia, United States, 2009. AUAI Press.

Bertsekas and Tsitsiklis [1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Besbes et al. [2014] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems 27, pages 199–207. Curran Associates, Inc., 2014.

Burnetas and Katehakis [1997] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, February 1997.

Dick et al. [2014] Travis Dick, András György, and Csaba Szepesvári. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the International Conference on Machine Learning, pages 512–520, 2014.

Even-Dar et al. [2005] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems 17, pages 401–408. MIT Press, 2005.

Garivier and Moulines [2011] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT ’11, pages 174–188, Berlin, Heidelberg, 2011. Springer-Verlag.

Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, August 2010.

Nilim and El Ghaoui [2005] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, September 2005.

Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.

Xu and Mannor [2006] Huan Xu and Shie Mannor. The robustness-performance tradeoff in Markov decision processes. In NIPS, pages 1537–1544. MIT Press, 2006.

Yu and Mannor [2009a] Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In Proceedings of the IEEE Conference on Decision and Control, pages 2946–2953, December 2009.

Yu and Mannor [2009b] Jia Yuan Yu and Shie Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In 2009 International Conference on Game Theory for Networks, pages 314–322, May 2009.
Appendix I Proof of Lemma 1
Proof.
Divide the time steps into batches of equal size (with the possible exception of the last batch). For each of these batches we consider the maximal number of episodes contained in this batch. Obviously, the maximal number of episodes is obtained greedily if each episode is as short as possible. For each time step with given state-action counts in the window reaching back from it, the shortest possible episode starting at that step (according to the episode termination criterion) will consist of repeated visits to a fixed state-action pair contained in the window.
Accordingly, in a window of the given size, the number of episodes is largest if the state-action counts at the first step of the batch are all . For this case we know (cf. the respective lemma of Jaksch et al. [2010]) that the number of episodes within a batch is bounded accordingly. Summing up over all batches gives the claimed bound.
∎
Appendix II Technical Details for the proof of Lemma 2
A Proof of Lemma 2
Proof.
We prove this lemma by dividing the time horizon into batches (different from those used in the proof of Lemma 1) as follows. The first batch starts at the first time step, and each batch ends with the earliest episode termination after the batch size has reached the window size. That way, each episode is completely contained in one batch. As any episode can be at most as long as the window size, the size of each batch is bounded accordingly, and so is the number of batches.
Let us denote the set of episodes in batch b and, for each state-action pair, the number of its occurrences in the current batch at the time an episode starts. Furthermore, for each state-action pair, consider the number of its occurrences in the whole batch, with the initial counts set to zero.
We have
In the above, the first inequality follows from applying Proposition 1 with the respective batch quantities, while the second inequality follows from Jensen's inequality. ∎
B Proposition required to prove Lemma 2
Proposition 1.
For any nonnegative integers with the following properties
(10) 
(11) 
(12) 
(13) 
(14) 
it holds that,
Proof.
We now prove the proposition by induction over .
Base case:
The first equality is true because and the last inequality is true because

if , then , and and the RHS is nonnegative since all and are nonnegative integers.

if , then and using (12).
Inductive step: