I. Introduction
Reinforcement learning (RL) deals with the problem of optimal sequential decision making in an unknown environment. Sequential decision making in an environment with an unknown statistical model is typically modeled as a Markov decision process (MDP): at each time step, the decision maker takes an action based on the state of the environment, resulting in a probabilistic transition to the next state and a reward accrued by the decision maker depending on the current state and action. RL has widespread applications in many areas including robotics [kober2013reinforcement], resource allocation in wireless networks [5137416], healthcare [gottesman2019guidelines], and finance [bacoyannis2018idiosyncrasies].
In a stationary MDP, the unknown transition probabilities and reward functions are invariant with time. However, the ubiquitous presence of nonstationarity in real-world scenarios often limits the applicability of stationary reinforcement learning algorithms. Most existing works make no assumption on the nature of the nonstationarity, but instead require information about the maximum possible amount of change in the environment, either via a variation budget on the transition and reward functions, or via the number of times the environment changes. On the contrary, we consider a periodic MDP (PMDP) whose state transition probabilities and reward functions are unknown but periodic with a known period $N$. In this setting, we propose the PUCRL2 algorithm and analyse its regret.
Nonstationary RL has been extensively studied in varied scenarios [auer2008near, gajane2018sliding, li2019online, ortner2020variational, cheung2020reinforcement, fei2020dynamic, domingues2021kernel, mao2021near, zhou2020nonstationary, touati2020efficient, wei2021non]. The authors of [auer2008near] propose a restart version of the popular UCRL2 algorithm meant for stationary RL problems, which achieves $\tilde{O}(\ell^{1/3}T^{2/3})$ regret, where $T$ is the number of time steps, under the setting in which the MDP changes at most $\ell$ times. In the same setting, [gajane2018sliding] shows that UCRL2 with sliding windows achieves the same regret. In a time-varying environment, a more apposite measure of an algorithm's performance is the dynamic regret, which measures the difference between the reward accumulated by the online policy and that of the optimal offline nonstationary policy. This was first analysed in [li2019online] in a solely reward-varying environment. The authors of [ortner2020variational] propose the first variational dynamic regret bound of $\tilde{O}(V^{1/3}T^{2/3})$, where $V$ represents the total variation in the MDP. The work of [cheung2020reinforcement] provides the sliding-window UCRL2 with confidence widening, which achieves an $\tilde{O}((B_r+B_p)^{1/4}T^{3/4})$ dynamic regret, where $B_r$ and $B_p$ represent the maximum possible variation in the reward function and the transition kernel, respectively. They also propose a Bandit-over-RL (BORL) algorithm which tunes the UCRL2-based algorithm in the setting of unknown variation budgets. Further, in the model-free and episodic setting, [wei2021non] propose policy optimization algorithms and [fei2020dynamic] proposes RestartQ-UCB, which achieves a dynamic regret bound of $\tilde{O}(\Delta^{1/3}HT^{2/3})$, where $\Delta$ represents the amount of change in the MDP and $H$ represents the episode length. The paper [domingues2021kernel] studies a kernel-based approach for nonstationarity in MDPs with metric spaces. In the linear MDP case, [mao2021near] and [zhou2020nonstationary] provide optimal regret guarantees. Finally, the authors of [wei2021non] provide a black-box algorithm which turns any (near-)stationary algorithm into one that works in a nonstationary environment with optimal dynamic regret $\tilde{O}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$, where $L$ and $\Delta$ represent the number and the amount of changes of the environment, respectively.
Periodic MDPs have received only limited attention in the literature. The authors of [riis1965discounted] study them in the discounted reward setting, where a policy-iteration algorithm is proposed. The authors of [veugen1983numerical] propose the first state-augmentation method for converting a periodic MDP into a stationary one, and analyse the performance of various iterative methods for finding the optimal policy. Recently, [hu2014near] derive a corresponding value iteration algorithm suitable for periodic problems in the discounted reward case and provide near-optimal bounds for greedy periodic policies. To the best of our knowledge, RL in periodic MDPs has not been studied before.
In this paper, we make the following contributions:

We study a special form of nonstationarity where the unknown reward and transition functions vary periodically with a known period $N$.

We propose PUCRL2, a modification of UCRL2 which treats the periodic MDP as a stationary MDP with an augmented state space. We derive a static regret bound which has a linear dependence on the period $N$ and a sublinear dependence on the horizon $T$.

Numerical results show that PUCRL2 performs significantly better than competing algorithms.
II. Problem Formulation
A discrete-time periodic MDP (PMDP) is defined as the tuple $(\mathcal{S}, \mathcal{A}, N, (P_j)_{j=0}^{N-1}, (R_j)_{j=0}^{N-1})$. We consider a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$, with cardinalities $S$ and $A$ respectively. For each period index $j \in \{0,1,\ldots,N-1\}$, $P_j : \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ defines the transition probability function such that $P_j(\cdot|s,a)$ is the probability distribution of the next state given the current state-action pair, for all $(s,a)$ pairs, and $R_j : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$ denotes the reward function, where $R_j(s,a)$ is the mean reward for the current state-action pair, for all $(s,a)$ pairs. The number $N$ represents the period of the MDP, so that $P_{t+N} = P_t$ and $R_{t+N} = R_t$ for any time index $t$. The horizon length is $T$, and we assume that $N \leq T$.

Now, the PMDP can be transformed into a stationary MDP with an augmented state space (henceforth referred to as the AMDP). In this AMDP, we couple the period index and the states together to obtain an augmented state space $\bar{\mathcal{S}} = \mathcal{S} \times \{0,1,\ldots,N-1\}$; if the state of the original MDP is $s$ at time $t$, then the corresponding state in the AMDP is $\bar{s} = (s, t \bmod N)$, where $\bmod$ represents the modulo operator. Consequently, the (time-homogeneous) transition probability of the AMDP for current state $(s,j)$ and current action $a$ becomes:
$$\bar{P}\big((s',j') \,\big|\, (s,j), a\big) \;=\; \begin{cases} P_j(s'|s,a), & j' = (j+1) \bmod N,\\[2pt] 0, & \text{otherwise.}\end{cases}$$
The corresponding mean reward of the AMDP is given by $\bar{R}((s,j),a) = R_j(s,a)$. Obviously, under any deterministic stationary policy for the AMDP, each (state, period index) pair can only be revisited after a multiple of $N$ time steps. Thus, the PMDP becomes a stationary AMDP with a periodic transition matrix, as shown in Figure 1. Let $\rho^*$ denote the optimal time-averaged reward (average expected reward over a large number of time steps, taking a Cesaro limit) [puterman2014markov, Section 8.2.1] of the AMDP. In this paper, we seek to develop an RL algorithm so as to minimize the static regret with respect to this optimal average reward $\rho^*$. Let $\pi$ be any generic policy for the AMDP. Our problem is:
$$\min_{\pi} \;\; \Delta(T) \;:=\; T\rho^* - \sum_{t=1}^{T} r_t,$$
where $r_t$ is the reward accrued at time $t$ under $\pi$.
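To make the state augmentation concrete, here is a minimal NumPy sketch of the conversion from PMDP to AMDP; the array layout and the function name build_amdp are our own assumptions, not the paper's notation:

```python
import numpy as np

def build_amdp(P, R, N):
    """Convert a PMDP into its stationary AMDP.

    P: shape (N, S, A, S); P[j, s, a, s'] is the transition probability
       at period index j.  R: shape (N, S, A); mean rewards at index j.
    Augmented state (s, j) is flattened to index j * S + s.
    """
    _, S, A, _ = P.shape
    P_bar = np.zeros((S * N, A, S * N))
    R_bar = np.zeros((S * N, A))
    for j in range(N):
        j_next = (j + 1) % N
        for s in range(S):
            x = j * S + s
            R_bar[x] = R[j, s]
            # probability mass only on states carrying the next period index
            P_bar[x, :, j_next * S:(j_next + 1) * S] = P[j, s]
    return P_bar, R_bar
```

The resulting kernel P_bar is block-structured: each block row maps period index $j$ to $j+1 \bmod N$, which is exactly the periodic transition matrix of the AMDP.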
III. The proposed algorithm
In this section, we provide a nontrivial modification of the state-of-the-art UCRL2 algorithm [auer2008near] for the PMDP. Our proposed algorithm (Algorithm 1) is named PUCRL2. PUCRL2 estimates the mean reward and the transition kernel for each augmented state-action pair, while keeping in mind that transitions occur only to augmented states carrying the next period index, and that the probability of transitioning to all other augmented states is zero. Hence, the algorithm only estimates the nonzero transition probabilities $\hat{p}_t\big((s', (j+1) \bmod N) \,\big|\, (s,j), a\big)$ at any time $t$.

III-A. PUCRL2 algorithm
For each (state, period index)-action pair $((s,j),a)$ and episode start time $t_k$, the confidence region around the empirical estimates is given by:
$$\big| \tilde{r}\big((s,j),a\big) - \hat{r}_{t_k}\big((s,j),a\big) \big| \;\leq\; \sqrt{\frac{7 \log\big(2 S N A t_k / \delta\big)}{2 \max\{1, N_{t_k}((s,j),a)\}}} \qquad (1)$$
$$\big\| \tilde{p}\big(\cdot \,\big|\, (s,j),a\big) - \hat{p}_{t_k}\big(\cdot \,\big|\, (s,j),a\big) \big\|_{1} \;\leq\; \sqrt{\frac{14\, S N \log\big(2 A t_k / \delta\big)}{\max\{1, N_{t_k}((s,j),a)\}}} \qquad (2)$$
Like UCRL2, PUCRL2 proceeds in episodes. At the beginning of each episode, it computes estimates from the visits, transitions and rewards observed prior to the episode for each (state, period index)-action pair, which are stored in the counts $N_t(\cdot)$ and the empirical estimates $\hat{p}_t(\cdot)$ and $\hat{r}_t(\cdot)$, respectively. With high probability, the true AMDP lies within a confidence region computed around these estimates, as shown in Lemma 2. PUCRL2 then utilizes the confidence bounds (1) and (2) to find an optimistic MDP and policy using Modified EVI (Algorithm 2), adapted from the extended value iteration (EVI) algorithm of [auer2008near, Section 3.1.2]. This policy is used to take actions in the episode until the cumulative number of visits to some (state, period index)-action pair doubles, similar to the episode termination criterion of [auer2008near].
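As an illustration, below is a minimal NumPy sketch of the per-pair confidence widths (1)-(2) and of the doubling criterion; the function names and array layout are ours, not the paper's:

```python
import numpy as np

def confidence_widths(N_counts, t, S, N_period, A, delta):
    """Widths of the confidence region (1)-(2).

    N_counts: visit counts per (augmented state, action), shape (S*N_period, A).
    Returns (d_r, d_p): reward width and transition L1 width per pair.
    """
    n = np.maximum(1, N_counts)
    d_r = np.sqrt(7.0 * np.log(2 * S * N_period * A * t / delta) / (2.0 * n))
    d_p = np.sqrt(14.0 * S * N_period * np.log(2 * A * t / delta) / n)
    return d_r, d_p

def episode_finished(nu, N_counts, x, a):
    """Doubling criterion: end the episode once the within-episode count
    of the pair just visited reaches its count prior to the episode."""
    return nu[x, a] >= max(1, N_counts[x, a])
```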
III-B. Modified EVI
Extended value iteration is used in the class of UCRL algorithms to obtain an optimistic MDP model and policy from a high-probability confidence region. According to the convergence analysis of extended value iteration in [auer2008near, Section 3.1.3], aperiodicity is essential, i.e., the algorithm should not choose a policy with a periodic transition matrix. However, as discussed in Section II, the AMDP is periodic in nature. Hence, in order to guarantee convergence, we modify the EVI algorithm by applying an aperiodicity transformation (as in [puterman2014markov, Section 8.5.4]), shown in (3).
Thus, at each iteration, Modified EVI (Algorithm 2) applies a self-transition probability of $1-\alpha$, where $\alpha \in (0,1)$, to each (state, period index) pair. As shown in [puterman2014markov, Proposition 8.5.8], this transformation does not affect the average reward of any stationary policy.
$$\tilde{p}^{(\alpha)}\big(\bar{s}' \,\big|\, \bar{s}, a\big) \;=\; \alpha\, \tilde{p}\big(\bar{s}' \,\big|\, \bar{s}, a\big) + (1-\alpha)\,\mathbb{1}\{\bar{s}' = \bar{s}\} \qquad (3)$$
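The following sketch shows one Modified EVI iteration under transformation (3), using the standard inner maximization over the L1 confidence ball from extended value iteration [auer2008near]; function names and array layouts are our own assumptions:

```python
import numpy as np

def inner_max_p(p_hat, d_p, u):
    """Optimistic transition distribution: maximize the expected value of u
    over the L1 ball of radius d_p around the empirical distribution p_hat."""
    idx = np.argsort(u)[::-1]                  # states by decreasing value
    p = p_hat.copy()
    p[idx[0]] = min(1.0, p_hat[idx[0]] + d_p / 2.0)
    k = len(idx) - 1
    while p.sum() > 1.0 and k > 0:             # strip mass from worst states
        p[idx[k]] = max(0.0, 1.0 - (p.sum() - p[idx[k]]))
        k -= 1
    return p

def modified_evi_step(u, r_tilde, p_hat, d_p, alpha):
    """One Modified EVI iteration with the aperiodicity transformation (3):
    every augmented state keeps a self-loop of probability 1 - alpha.
    r_tilde: optimistic rewards (empirical mean plus the width from (1));
    p_hat, d_p: empirical kernels and L1 widths from (2).
    Iterate until span(u_next - u) < 1/sqrt(t_k)."""
    n_states, n_actions = r_tilde.shape
    u_next = np.empty(n_states)
    for x in range(n_states):
        q = np.empty(n_actions)
        for a in range(n_actions):
            p_opt = inner_max_p(p_hat[x, a], d_p[x, a], u)
            q[a] = r_tilde[x, a] + alpha * (p_opt @ u) + (1.0 - alpha) * u[x]
        u_next[x] = q.max()
    return u_next
```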
III-C. Analysis
Let $T(\bar{s}' \,|\, \bar{M}, \pi, \bar{s})$ denote the expected first hitting time of state $\bar{s}'$ in an AMDP $\bar{M}$, starting from state $\bar{s}$, under a stationary policy $\pi$. As in [auer2008near, Definition 1], the diameter of an AMDP is defined as:
$$D_A(\bar{M}) \;:=\; \max_{\bar{s} \neq \bar{s}'} \; \min_{\pi} \; T\big(\bar{s}' \,\big|\, \bar{M}, \pi, \bar{s}\big). \qquad (4)$$
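Since the diameter is defined through minimal expected hitting times, for small AMDPs it can be computed by solving a stochastic-shortest-path problem per target state. Below is a minimal sketch (our own construction, assuming the kernel layout of the earlier sketch and a communicating AMDP with finite diameter):

```python
import numpy as np

def amdp_diameter(P_bar, tol=1e-8, max_iter=100_000):
    """Diameter (4): max over ordered state pairs of the minimal expected
    hitting time, via value iteration for each target state.
    P_bar: transition kernel of shape (n, A, n)."""
    n = P_bar.shape[0]
    D = 0.0
    for target in range(n):
        h = np.zeros(n)                        # expected hitting times
        for _ in range(max_iter):
            # one step plus the best continuation; the target is absorbing
            h_new = 1.0 + (P_bar @ h).min(axis=1)
            h_new[target] = 0.0
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        D = max(D, h.max())
    return D
```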
Theorem 1.
With probability at least $1-\delta$, the regret of PUCRL2 is:
$$\Delta(T) \;=\; \tilde{O}\big(D_A\, S N \sqrt{A T}\big),$$
where $D_A$ is the diameter of the AMDP as defined in (4).
Proof.
See Appendix A. ∎
Remark.
The confidence bound (2) ignores the known sparsity of the transition function. If we incorporate that knowledge, we obtain the same regret bound. However, in our implementation of this variant, Modified EVI fails to converge in a few iterations. We leave this issue as future work.
IV. Numerical results
We compare the performance of PUCRL2 with three other algorithms: (i) UCRL2 [auer2008near], which provides optimal static regret in the stationary MDP setting, (ii) UCRL3 [bourel2020tightening], a recent improvement over UCRL2, and (iii) BORL [cheung2020reinforcement], a parameter-free algorithm for the nonstationary setting.
IV-A. Regret of BORL for the PMDP
The variation budget of [cheung2020reinforcement] for the rewards is defined as:
$$B_r \;:=\; \sum_{t=1}^{T-1} \max_{s,a} \big| r_{t+1}(s,a) - r_t(s,a) \big|.$$
For a PMDP, the same variation repeats in every period, so that
$$B_r \;=\; \Theta\!\left(\frac{T}{N}\right) \sum_{j=0}^{N-1} \max_{s,a} \big| R_{(j+1) \bmod N}(s,a) - R_j(s,a) \big| \;=\; \Theta(T).$$
Regret bounds of BORL and SWUCRL2-CW [cheung2020reinforcement] for nonstationary MDPs are derived in terms of the reward variation budget $B_r$ and a very similar variation budget $B_p$ on the transition kernels. However, for a PMDP, these two algorithms do not exploit the additional structure arising from periodicity. Since $B_r$ or $B_p$ turns out to be of the order $\Theta(T)$, the regret bound of BORL or SWUCRL2-CW becomes linear in $T$ for a PMDP.
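The linear growth of the budget is easy to verify numerically; here is a toy illustration with hypothetical numbers (a single cosine-varying reward with period $N$):

```python
import numpy as np

# A reward varying as a cosine with period N has a fixed total variation
# per period, so the budget B_r grows linearly in the horizon T.
N, T = 10, 10_000
r = 0.5 * (1 + np.cos(2 * np.pi * np.arange(T + 1) / N))  # r_t in [0, 1]
B_r = np.abs(np.diff(r)).sum()
print(B_r)  # about 2 * T / N = 2000 here, i.e., Theta(T)
```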
IV-B. Our experiment
Our synthetic dataset formulation is inspired by [cheung2020reinforcement]. We consider an MDP with two states and two actions. The variation in the rewards and in the transition function is modeled using cosine functions with period $N$. We fix the period $N$ and the horizon $T$, and compare the cumulative reward of the algorithms after averaging over independent runs. The results are shown in Figure 2. We clearly observe that PUCRL2 outperforms the other algorithms.
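For concreteness, a sketch of how such a periodic environment can be constructed is given below; the constants and the exact cosine expressions are hypothetical stand-ins, since the paper's specific values are not reproduced here:

```python
import numpy as np

# Hypothetical two-state, two-action PMDP with cosine variation of period N.
N, S, A = 20, 2, 2
var = 0.5 * (1 + np.cos(2 * np.pi * np.arange(N) / N))    # in [0, 1]

R = np.zeros((N, S, A))
P = np.zeros((N, S, A, S))
for j in range(N):
    R[j, 0, 0] = var[j]            # reward of action 0 in state 0 tracks the cosine
    R[j, 1, 1] = 1.0 - var[j]      # reward of action 1 in state 1 is in antiphase
    for s in range(S):
        for a in range(A):
            stay = 0.1 + 0.8 * var[j]     # periodic self-transition probability
            P[j, s, a, s] = stay
            P[j, s, a, 1 - s] = 1.0 - stay
# P and R can be passed to build_amdp from the Section II sketch.
```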
V. Conclusion
Periodic nonstationarity in Markov decision processes has been studied in this paper, where the state transition and reward functions vary periodically. Existing RL algorithms for nonstationary and stationary MDPs fail to perform optimally in this setting. We provide a new algorithm, PUCRL2, which outperforms competing algorithms. The static regret bound depends linearly on the diameter of the AMDP; its comparison with the maximum diameter of general nonstationary MDPs is left as our future work.
References
Appendix A: Proof of Theorem 1
The proof borrows some ideas from [auer2008near] and is divided into subsections. In Appendix A-A, we upper bound the total regret by removing the randomness in the accumulated rewards. The regret in the episodes where the true AMDP does not lie in the set of plausible AMDPs is bounded in Appendix A-B, and the regret under the assumption that it does is bounded in Appendix A-C. Finally, we complete the proof in Appendix A-D.
A-A. Splitting into episodes
As in [auer2008near, Section 4.1], using Hoeffding's inequality, we can decompose the regret as:
$$\Delta(T) \;\leq\; \sum_{(s,j),a} N_T\big((s,j),a\big)\Big(\rho^* - \bar{R}\big((s,j),a\big)\Big) + \sqrt{\tfrac{5}{8}\, T \log\!\big(\tfrac{8T}{\delta}\big)}$$
with probability at least $1 - \frac{\delta}{12\, T^{5/4}}$, where $N_T((s,j),a)$ is the count of visits to the (state, period index)-action pair $((s,j),a)$ after $T$ steps.

Let there be $m$ episodes in total; thus $N_T((s,j),a) = \sum_{k=1}^{m} \nu_k((s,j),a)$, where $\nu_k$ denotes the visit counts within episode $k$. The regret in each episode can be defined as $\Delta_k := \sum_{(s,j),a} \nu_k\big((s,j),a\big)\big(\rho^* - \bar{R}((s,j),a)\big)$. Hence,
$$\Delta(T) \;\leq\; \sum_{k=1}^{m} \Delta_k + \sqrt{\tfrac{5}{8}\, T \log\!\big(\tfrac{8T}{\delta}\big)}. \qquad (5)$$
A-B. Dealing with failing confidence regions
Lemma 2.
For any $t \leq T$, the probability that the true AMDP $M$ is not contained in the set of plausible AMDPs $\mathcal{M}(t)$ at time $t$ is at most $\frac{\delta}{15\, t^6}$, that is,
$$\Pr\big\{ M \notin \mathcal{M}(t) \big\} \;\leq\; \frac{\delta}{15\, t^6}.$$
Proof.
As in [auer2008near, Appendix C.1], we bound the deviation of the empirical transition function via a concentration inequality over the distinct events from $n$ samples [weissman2003inequalities]:
$$\Pr\Big\{ \big\|\hat{p}\big(\cdot|(s,j),a\big) - p\big(\cdot|(s,j),a\big)\big\|_1 \;\geq\; \epsilon \Big\} \;\leq\; 2^{SN} \exp\!\big(-n\epsilon^2/2\big).$$
As the state space has been augmented, we have $SN$ states and hence $2^{SN}$ events. Thus, setting
$$\epsilon \;=\; \sqrt{\frac{14\, SN \log\!\big(\frac{2 A t}{\delta}\big)}{\max\{1, n\}}},$$
we get
$$\Pr\left\{ \big\|\hat{p}\big(\cdot|(s,j),a\big) - p\big(\cdot|(s,j),a\big)\big\|_1 \;\geq\; \sqrt{\frac{14\, SN \log\!\big(\frac{2 A t}{\delta}\big)}{\max\{1,n\}}} \right\} \;\leq\; \frac{\delta}{20\, t^7 SNA}.$$
For the rewards, we use Hoeffding's inequality to bound the deviation of the empirical mean from the true mean given $n$ i.i.d. samples. Setting the deviation to $\sqrt{\frac{7 \log(2 SNA t/\delta)}{2\max\{1,n\}}}$, we get, for all $((s,j),a)$ pairs,
$$\Pr\left\{ \big|\hat{r}\big((s,j),a\big) - \bar{R}\big((s,j),a\big)\big| \;\geq\; \sqrt{\frac{7 \log\!\big(\frac{2 SNA t}{\delta}\big)}{2\max\{1,n\}}} \right\} \;\leq\; \frac{\delta}{60\, t^7 SNA}.$$
A union bound over all possible values of $n$, i.e., $n = 1, 2, \ldots, t$ (where $n$ denotes the number of visits to $((s,j),a)$), gives failure probabilities of at most $\frac{\delta}{20\, t^6 SNA}$ and $\frac{\delta}{60\, t^6 SNA}$, respectively. Summing these probabilities over all $SNA$ (state, period index)-action pairs, we obtain the claimed bound: $\frac{\delta}{20 t^6} + \frac{\delta}{60 t^6} = \frac{\delta}{15 t^6}$.
∎
Lemma 3.
With probability at least $1 - \frac{\delta}{12\, T^{5/4}}$, the regret incurred due to failing confidence regions is bounded as:
$$\sum_{k=1}^{m} \Delta_k\, \mathbb{1}\{M \notin \mathcal{M}(t_k)\} \;\leq\; \sqrt{T}. \qquad (6)$$
Proof.
Refer to [auer2008near, Section 4.2], with Lemma 2 used in place of [auer2008near, Appendix C.1]. ∎
A-C. Episodes with $M \in \mathcal{M}(t_k)$
By the assumption $M \in \mathcal{M}(t_k)$ and [auer2008near, Theorem 7], the optimistic average reward $\tilde{\rho}_k$ of the near-optimal policy $\tilde{\pi}_k$ chosen by Modified EVI (Algorithm 2) satisfies $\tilde{\rho}_k \geq \rho^* - \frac{1}{\sqrt{t_k}}$. Thus, we can write the regret of an episode as:
$$\Delta_k \;\leq\; \sum_{(s,j),a} \nu_k\big((s,j),a\big)\Big(\tilde{\rho}_k - \bar{R}\big((s,j),a\big)\Big) + \sum_{(s,j),a} \frac{\nu_k\big((s,j),a\big)}{\sqrt{t_k}}. \qquad (7)$$
Let us define $i$ to be the last iteration at which the convergence criterion holds and Modified EVI terminates; thus, as in [auer2008near, Section 4.3.1],
$$\big| u_{i+1}(\bar{s}) - u_i(\bar{s}) - \tilde{\rho}_k \big| \;\leq\; \frac{1}{\sqrt{t_k}} \quad \text{for all } \bar{s}. \qquad (8)$$
Combining the value iteration update $u_{i+1}(\bar{s}) = \tilde{r}\big(\bar{s}, \tilde{\pi}_k(\bar{s})\big) + \sum_{\bar{s}'} \tilde{p}^{(\alpha)}\big(\bar{s}'|\bar{s}, \tilde{\pi}_k(\bar{s})\big)\, u_i(\bar{s}')$ with (8), we get
$$\tilde{\rho}_k - \tilde{r}\big(\bar{s}, \tilde{\pi}_k(\bar{s})\big) \;\leq\; \sum_{\bar{s}'} \tilde{p}^{(\alpha)}\big(\bar{s}'|\bar{s}, \tilde{\pi}_k(\bar{s})\big)\, u_i(\bar{s}') - u_i(\bar{s}) + \frac{1}{\sqrt{t_k}}.$$
Thus, putting the above result in (7), and noting that $\tilde{r} - \bar{R} = (\tilde{r} - \hat{r}) + (\hat{r} - \bar{R})$ for $M \in \mathcal{M}(t_k)$, we get
$$\Delta_k \;\leq\; \boldsymbol{\nu}_k\big(\tilde{P}_k^{(\alpha)} - I\big)\mathbf{u}_i + \sum_{(s,j),a} \nu_k\big((s,j),a\big)\big(\tilde{r} - \bar{R}\big)\big((s,j),a\big) + 2\sum_{(s,j),a} \frac{\nu_k\big((s,j),a\big)}{\sqrt{t_k}}, \qquad (9)$$
where $\boldsymbol{\nu}_k$ is the row vector of within-episode visit counts under $\tilde{\pi}_k$, and $\tilde{P}_k^{(\alpha)}$ is the transition matrix of the optimistic, aperiodicity-transformed AMDP under $\tilde{\pi}_k$.
A-C.1. Bounding the first term of (9)
$$\boldsymbol{\nu}_k\big(\tilde{P}_k^{(\alpha)} - I\big)\mathbf{u}_i \;=\; \boldsymbol{\nu}_k\big(\tilde{P}_k^{(\alpha)} - P_k^{(\alpha)}\big)\mathbf{u}_i + \boldsymbol{\nu}_k\big(P_k^{(\alpha)} - I\big)\mathbf{u}_i, \qquad (10)$$
where $P_k^{(\alpha)}$ denotes the aperiodicity-transformed transition matrix of the true AMDP under $\tilde{\pi}_k$.
By the property of extended value iteration [auer2008near, Section 4.3.1], extended to Modified EVI,
$$\max_{\bar{s}} u_i(\bar{s}) - \min_{\bar{s}} u_i(\bar{s}) \;\leq\; D_{A,\alpha}, \qquad (11)$$
where $D_{A,\alpha}$ represents the diameter of the augmented MDP with the aperiodicity transformation. Since the rows of $\tilde{P}_k^{(\alpha)} - P_k^{(\alpha)}$ and of $P_k^{(\alpha)} - I$ each sum to zero, we can replace $\mathbf{u}_i$ by the centered vector $\mathbf{w}_i$ with entries
$$w_i(\bar{s}) \;:=\; u_i(\bar{s}) - \frac{\max_{\bar{s}'} u_i(\bar{s}') + \min_{\bar{s}'} u_i(\bar{s}')}{2}, \qquad (12)$$
such that it follows from (11) that $\|\mathbf{w}_i\|_\infty \leq \frac{D_{A,\alpha}}{2}$. According to [fruit2019exploration, Section 3.3.1], $D_{A,\alpha} \leq \frac{D_A}{\alpha}$. Hence, $\|\mathbf{w}_i\|_\infty \leq \frac{D_A}{2\alpha}$. Therefore,
$$\boldsymbol{\nu}_k\big(\tilde{P}_k^{(\alpha)} - P_k^{(\alpha)}\big)\mathbf{w}_i \;\leq\; D_A \sqrt{14\, SN \log\!\Big(\frac{2 A t_k}{\delta}\Big)} \sum_{(s,j),a} \frac{\nu_k\big((s,j),a\big)}{\sqrt{\max\{1, N_{t_k}((s,j),a)\}}}, \qquad (13)$$
where the last inequality uses the confidence bound (2) (applied to both the optimistic and the empirical kernels via the triangle inequality). We note that the aperiodicity transformation coefficient $\alpha$ cancels out and does not appear in the regret term.
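To see the cancellation explicitly: since both the optimistic and the true transformed kernels share the same self-loop term,
$$\boldsymbol{\nu}_k\big(\tilde{P}_k^{(\alpha)} - P_k^{(\alpha)}\big)\mathbf{w}_i \;=\; \boldsymbol{\nu}_k\big(\alpha\tilde{P}_k + (1-\alpha)I - \alpha P_k - (1-\alpha)I\big)\mathbf{w}_i \;=\; \alpha\,\boldsymbol{\nu}_k\big(\tilde{P}_k - P_k\big)\mathbf{w}_i,$$
and the factor $\alpha$ is absorbed by $\|\mathbf{w}_i\|_\infty \leq \frac{D_A}{2\alpha}$.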
Following the proof of the second term in [auer2008near, Section 4.3.2], the second term in (10) can be bounded as:
$$\sum_{k=1}^{m} \boldsymbol{\nu}_k\big(P_k^{(\alpha)} - I\big)\mathbf{w}_i \;\leq\; D_A \sqrt{\tfrac{5}{2}\, T \log\!\Big(\tfrac{8T}{\delta}\Big)} + D_A\, m \qquad (14)$$
with probability at least $1 - \frac{\delta}{12\, T^{5/4}}$, where $m \leq SNA \log_2\!\big(\frac{8T}{SNA}\big)$ is the number of episodes, as in [auer2008near, Appendix C.2].
A-C.2. Bounding the second term of (9)
$$\sum_{(s,j),a} \nu_k\big((s,j),a\big)\Big(\tilde{r}\big((s,j),a\big) - \bar{R}\big((s,j),a\big)\Big) \;\leq\; 2 \sqrt{\tfrac{7}{2}\log\!\Big(\tfrac{2\, SNA\, t_k}{\delta}\Big)} \sum_{(s,j),a} \frac{\nu_k\big((s,j),a\big)}{\sqrt{\max\{1, N_{t_k}((s,j),a)\}}}, \qquad (15)$$
where the last inequality uses the confidence bound (1).
A-D. Completing the Proof
We can bound the term $\sum_{k=1}^{m} \sum_{(s,j),a} \frac{\nu_k((s,j),a)}{\sqrt{\max\{1, N_{t_k}((s,j),a)\}}} \leq \big(\sqrt{2}+1\big)\sqrt{SNAT}$ as in [auer2008near, Section 4.3.3]. Also, noting that $t_k \leq T$, we can write the total episodic regret using (9), (13), (14) and (15), with probability at least $1 - \frac{\delta}{12\, T^{5/4}}$, as:
$$\sum_{k=1}^{m} \Delta_k\, \mathbb{1}\{M \in \mathcal{M}(t_k)\} \;\leq\; D_A\sqrt{\tfrac{5}{2}\, T\log\!\big(\tfrac{8T}{\delta}\big)} + D_A\, SNA\log_2\!\big(\tfrac{8T}{SNA}\big) + \Big( D_A\sqrt{14\, SN\log\!\big(\tfrac{2AT}{\delta}\big)} + 2\sqrt{\tfrac{7}{2}\log\!\big(\tfrac{2SNAT}{\delta}\big)} + 2 \Big)\big(\sqrt{2}+1\big)\sqrt{SNAT}. \qquad (16)$$
Further simplifications as in [auer2008near, Appendix C.4] yield the total regret as:
$$\Delta(T) \;\leq\; 34\, D_A\, SN \sqrt{A T \log\!\Big(\frac{T}{\delta}\Big)},$$
which, together with (5) and (6), establishes Theorem 1.