
Online Reinforcement Learning for Periodic MDP

We study learning in periodic Markov Decision Processes (MDPs), a special type of non-stationary MDP where both the state transition probabilities and reward functions vary periodically, under the average reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 varies linearly with the period and sub-linearly with the horizon length. Numerical results demonstrate the efficacy of PUCRL2.


I Introduction

Reinforcement learning (RL) deals with the problem of optimal sequential decision making in an unknown environment. Sequential decision making in an environment with an unknown statistical model is typically modeled as a Markov decision process (MDP), where the decision maker, at each time step, has to take an action based on the state of the environment, resulting in a probabilistic transition to the next state and a reward accrued by the decision maker depending on the current state and current action. RL has widespread applications in many areas, including robotics [kober2013reinforcement], resource allocation in wireless networks [5137416], healthcare [gottesman2019guidelines], and finance [bacoyannis2018idiosyncrasies].

In a stationary MDP, the unknown transition probabilities and reward functions are invariant with time. However, the ubiquitous presence of non-stationarity in real-world scenarios often limits the applicability of stationary reinforcement learning algorithms. Most existing works require knowledge of the maximum possible amount of change in the environment, either through a variation budget on the transition and reward functions or through the number of times the environment changes; on the other hand, they make no assumption on the structure of the non-stationarity. On the contrary, we consider a periodic MDP whose state transition probabilities and reward functions are unknown but periodic with a known period N. In this setting, we propose the PUCRL2 algorithm and analyse its regret.

Non-stationary RL has been extensively studied in varied scenarios [auer2008near, gajane2018sliding, li2019online, ortner2020variational, cheung2020reinforcement, fei2020dynamic, domingues2021kernel, mao2021near, zhou2020nonstationary, touati2020efficient, wei2021non]. The authors of [auer2008near] propose a restart version of the popular UCRL2 algorithm, originally designed for stationary RL problems, and show that it achieves sub-linear regret in the horizon length when the MDP changes at most a fixed number of times. In the same setting, [gajane2018sliding] shows that UCRL2 with sliding windows achieves the same regret. In a time-varying environment, a more apposite performance measure is the dynamic regret, which measures the difference between the reward accumulated by the online policy and that of the optimal offline non-stationary policy. This was first analysed in [li2019online] in an environment where only the rewards vary. The authors of [ortner2020variational] propose the first variational dynamic regret bound, expressed in terms of the total variation of the MDP. The work of [cheung2020reinforcement] provides sliding-window UCRL2 with confidence widening, whose dynamic regret is expressed in terms of the maximum possible variation of the reward function and of the transition kernel; they also propose a Bandit-over-RL (BORL) algorithm which tunes the UCRL2-based algorithm when the variation budgets are unknown. Further, in the model-free and episodic setting, [wei2021non] propose policy optimization algorithms, and [fei2020dynamic] propose RestartQ-UCB, whose dynamic regret bound depends on the amount of change in the MDP and on the episode length H. The paper [domingues2021kernel] studies a kernel-based approach for non-stationarity in MDPs with metric state spaces. In the linear MDP case, [mao2021near] and [zhou2020nonstationary] provide optimal regret guarantees. Finally, the authors of [wei2021non] provide a black-box algorithm which turns any (near-)stationary algorithm into one that works in a non-stationary environment with optimal dynamic regret, expressed in terms of the number and amount of changes of the environment.

Periodic MDPs have received only limited attention in the literature. The authors of [riis1965discounted] study them in the discounted reward setting, where a policy-iteration algorithm is proposed. The authors of [veugen1983numerical] propose the first state-augmentation method for converting a periodic MDP into a stationary one, and analyse the performance of various iterative methods for finding the optimal policy. Recently, [hu2014near] derive a corresponding value iteration algorithm suitable for periodic problems in the discounted reward case and provide near-optimal bounds for greedy periodic policies. To the best of our knowledge, RL in periodic MDPs has not been studied.

In this paper, we make the following contributions:

  • We study a special form of non-stationarity where the unknown reward and transition functions vary periodically with a known period N.

  • We propose PUCRL2, a modification of UCRL2 which treats the periodic MDP as a stationary MDP with an augmented state space. We derive a static regret bound which is linear in the period N and sub-linear in the horizon length T.

  • Numerical results show that PUCRL2 performs significantly better than competing algorithms.

II Problem Formulation

A discrete-time periodic MDP (PMDP) is defined as the tuple (𝒮, 𝒜, N, (P_j)_{j=0}^{N−1}, (r_j)_{j=0}^{N−1}). We consider a finite state space 𝒮 and a finite action space 𝒜, with cardinalities S and A respectively. For each period index j ∈ {0, 1, …, N−1}, P_j denotes the transition probability function, such that P_j(· | s, a) is the probability distribution of the next state given the current state-action pair (s, a), for all (s, a) pairs, and r_j denotes the reward function, where r_j(s, a) is the mean reward for the current state-action pair, for all (s, a) pairs. The number N represents the period of the MDP, so that P_{t+N} = P_t and r_{t+N} = r_t for any time index t. The horizon length is denoted by T, and we assume that T is large compared to N.

Now, the PMDP can be transformed into a stationary MDP with an augmented state space (henceforth referred to as the AMDP). In this AMDP, we couple the period index and the state together to obtain the augmented state space 𝒮 × {0, 1, …, N−1}; if the state of the original MDP is s at time t, then the corresponding state of the AMDP is (s, t mod N), where mod denotes the modulo operator. Consequently, the (time-homogeneous) transition probability of the AMDP for current state (s, j) and current action a becomes:

P̃((s′, j′) | (s, j), a) = P_j(s′ | s, a) if j′ = (j + 1) mod N, and 0 otherwise.

The corresponding mean reward of the AMDP is given by r̃((s, j), a) = r_j(s, a). Obviously, under any deterministic stationary policy for the AMDP, each (state, period index) pair can be visited only once every N time steps. Thus, the PMDP becomes a stationary AMDP with a periodic transition matrix, as shown in Figure (1). Let ρ* denote the optimal time-averaged reward (the average expected reward over a large number of time steps, taking a Cesaro limit) [puterman2014markov, Section 8.2.1] of the AMDP. In this paper, we seek to develop an RL algorithm that minimizes the static regret with respect to this optimal average reward ρ*. Let π be any generic learning policy for the AMDP. Our problem is to minimize the static regret of π, i.e., the gap between Tρ* and the expected cumulative reward collected by π over T steps.

Fig. 1: Augmented MDP with periodic states.
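To make the construction concrete, the sketch below builds the AMDP transition kernel and reward table from a PMDP given as arrays; the encoding of the augmented state (s, j) as the index s·N + j is an illustrative choice, not something prescribed in the paper.

   import numpy as np

   def augment_pmdp(P, r):
       # Build the stationary AMDP from a PMDP.
       #   P: array of shape (N, S, A, S) with P[j, s, a, s'] = P_j(s' | s, a)
       #   r: array of shape (N, S, A)    with r[j, s, a]     = r_j(s, a)
       # Augmented state (s, j) is encoded as the index s * N + j
       # (an illustrative encoding).
       N, S, A, _ = P.shape
       P_aug = np.zeros((S * N, A, S * N))
       r_aug = np.zeros((S * N, A))
       for j in range(N):
           jn = (j + 1) % N                       # next period index
           for s in range(S):
               x = s * N + j                      # augmented state (s, j)
               for a in range(A):
                   r_aug[x, a] = r[j, s, a]
                   for s_next in range(S):
                       # only states with period index (j + 1) mod N are reachable
                       P_aug[x, a, s_next * N + jn] = P[j, s, a, s_next]
       return P_aug, r_aug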

III The proposed algorithm

In this section, we provide a non-trivial modification of the state-of-the-art UCRL2 algorithm [auer2008near] for the PMDP. Our proposed Algorithm (1) is named PUCRL2. PUCRL2 estimates the mean reward and the transition kernel for each augmented state-action pair, while exploiting the fact that transitions occur only to augmented states with the next period index and that the probability of transitioning to any other augmented state is zero. Hence, the algorithm only needs to estimate the non-zero transition probabilities at any time.
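As an illustration of this point, the learner only needs sufficient statistics indexed by the period index; the minimal sketch below (with hypothetical variable names) stores transition counts of shape (N, S, A, S) rather than counts over the full augmented next-state space.

   import numpy as np

   class PeriodicCounts:
       # Sufficient statistics for a PUCRL2-style learner (hypothetical
       # variable names).  Because a transition from period index j can
       # only land on period index (j + 1) mod N, counts of shape
       # (N, S, A, S) suffice instead of counts over all S*N augmented
       # next states.
       def __init__(self, N, S, A):
           self.visits = np.zeros((N, S, A))        # visits to ((s, j), a)
           self.trans = np.zeros((N, S, A, S))      # next physical-state counts
           self.reward_sum = np.zeros((N, S, A))    # accumulated rewards

       def update(self, j, s, a, r, s_next):
           self.visits[j, s, a] += 1
           self.trans[j, s, a, s_next] += 1
           self.reward_sum[j, s, a] += r

       def estimates(self):
           n = np.maximum(1, self.visits)
           p_hat = self.trans / n[..., None]        # estimated P_j(s' | s, a)
           r_hat = self.reward_sum / n              # estimated r_j(s, a)
           return p_hat, r_hat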

III-A PUCRL2 algorithm

  Input: confidence parameter δ.
  Initialization:
  for phase k = 1,2,… do
      {starting time of episode k}
     1. Initialize episode k: set its start time, reset the per-episode visit counts to zero, and compute the empirical estimates of the rewards and transition probabilities from the counts accumulated so far.
      
      
     2. Update the confidence set: We define the confidence regions for the transition probability function and the reward function as:
(1)
(2)
Then, the set of plausible MDPs is the set of all MDP models such that (1) and (2) are satisfied for every (state, period index)-action pair.
     3. Optimistic Planning: Compute an optimistic MDP and policy using Modified Extended Value Iteration (Algorithm (2)).
     4. Execute Policies:
     while the episodic visit count of the current (state, period index)-action pair is less than its count prior to the episode do
        Draw an action for the current augmented state according to the computed policy; observe the reward and the next state.
        Update the episodic visit count and increment the time index.
     end while
  end for
Algorithm 1 PUCRL2

Like UCRL2, PUCRL2 proceeds in episodes. At the beginning of each episode, it computes estimates of the transition probabilities and mean rewards for each (state, period index)-action pair from the counts of visits, transitions and accumulated rewards observed prior to the episode. With high probability, the true AMDP lies within a confidence region computed around these estimates, as shown in Lemma (2). PUCRL2 then utilizes the confidence bounds (1) and (2) to find an optimistic MDP and policy using the Modified-EVI Algorithm (2), adapted from the extended value iteration (EVI) algorithm of [auer2008near, Section 3.1.2]. This policy is used to take actions during the episode until the cumulative number of visits to some (state, period index) pair doubles; this is similar to the doubling criterion for episode termination in [auer2008near].
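For concreteness, a sketch of how the confidence radii in (1) and (2) might be computed, using the standard UCRL2 constants (the exact constants used in the paper may differ); n_visits holds the per-pair visit counts prior to the current episode.

   import numpy as np

   def confidence_radii(n_visits, t, S_aug, A, delta):
       # UCRL2-style confidence radii for the AMDP estimates (a sketch
       # using the usual UCRL2 constants; the exact constants in (1)-(2)
       # may differ).  n_visits[x, a] is the number of visits to
       # augmented state x under action a prior to the current episode.
       n = np.maximum(1, n_visits)
       # reward radius, from Hoeffding's inequality
       rad_r = np.sqrt(7.0 * np.log(2 * S_aug * A * t / delta) / (2.0 * n))
       # transition radius (L1 ball), from Weissman et al.'s inequality
       rad_p = np.sqrt(14.0 * S_aug * np.log(2 * A * t / delta) / n)
       return rad_r, rad_p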

III-B Modified-EVI

Extended value iteration is used in the class of UCRL algorithms to obtain an optimistic MDP model and policy from a high-probability confidence region. According to the convergence criterion of extended value iteration in [auer2008near, Section 3.1.3], aperiodicity is essential, i.e., the algorithm should not choose a policy with a periodic transition matrix. However, as discussed in Section (II), the AMDP is periodic in nature. Hence, in order to guarantee convergence, we modify the EVI algorithm by applying an aperiodicity transformation (as in [puterman2014markov, Section 8.5.4]) in (3).

Thus, at each iteration, Modified-EVI (Algorithm (2)) applies an additional self-transition probability of 1 − α, for some α ∈ (0, 1), to each (state, period index) pair, scaling the remaining transition probabilities by α. As shown in [puterman2014markov, Proposition 8.5.8], this transformation does not affect the average reward of any stationary policy. A small illustrative sketch of this mixing step is given after Algorithm 2.

  Input:
  Initialization:
  for i = 0,1,2,… do
     
(3)
      
      if the span of the successive value-function differences falls below the required accuracy then
        Break the for loop.
      end if
  end for
Algorithm 2 Modified-EVI
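To illustrate the aperiodicity transformation used inside Modified-EVI, the following minimal sketch mixes the AMDP kernel with self-loops, assuming the standard form of the transformation from [puterman2014markov, Section 8.5.4]; the value of alpha is an illustrative choice.

   import numpy as np

   def aperiodicity_transform(P_aug, alpha=0.9):
       # Mix every transition distribution of the AMDP with a self-loop:
       #     P'(x' | x, a) = alpha * P(x' | x, a) + (1 - alpha) * 1{x' = x}
       # As in [puterman2014markov, Section 8.5.4], this makes the chain
       # aperiodic while leaving the average reward of every stationary
       # policy unchanged.  alpha = 0.9 is an illustrative choice.
       n_states = P_aug.shape[0]
       eye = np.eye(n_states)
       # P_aug has shape (X, A, X); broadcast the identity over the action axis
       return alpha * P_aug + (1.0 - alpha) * eye[:, None, :]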

III-C Analysis

Let T(x′ | M̃, π, x) denote the expected first hitting time of augmented state x′ in an AMDP M̃, starting from augmented state x, under a stationary policy π. As in [auer2008near, Definition 1], the diameter of an AMDP M̃ is defined as:

D(M̃) := max_{x ≠ x′} min_{π} E[T(x′ | M̃, π, x)]. (4)
Theorem 1.

With probability at least 1 − δ, the regret of PUCRL2 satisfies:

Proof.

See Appendix (A). ∎

Remark.

The confidence bound (1) ignores the known sparsity in the transition function. If we include that knowledge, we obtain the same regret bound. However, when implementing this variant, Modified-EVI fails to converge in a few instances. This issue is left as future work.

IV Numerical results

We compare the performance of PUCRL2 with three other algorithms: (i) UCRL2 [auer2008near], which provides optimal static regret in the stationary MDP setting; (ii) UCRL3 [bourel2020tightening], which is a recent improvement over UCRL2; and (iii) BORL [cheung2020reinforcement], which is a parameter-free algorithm for the non-stationary setting.

IV-A Regret of BORL for PMDP

The variation budget for the rewards, as defined in [cheung2020reinforcement], is

B_r = ∑_{t=1}^{T−1} max_{s,a} |r_{t+1}(s,a) − r_t(s,a)|.

Regret bounds of BORL and SW-UCRL [cheung2020reinforcement] for non-stationary MDPs are derived in terms of the reward variation budget B_r and a very similar variation budget B_p on the transition kernels. However, for a PMDP, these two algorithms do not exploit the additional structure arising from periodicity. Since B_r and B_p turn out to be of order T for a PMDP, the regret bounds of BORL and SW-UCRL become linear in T.
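As a rough supporting calculation (an addition here, assuming the rewards are not constant across period indices so that the per-period variation c is strictly positive), grouping the sum defining B_r by full periods gives

   B_r = ∑_{t=1}^{T−1} max_{s,a} |r_{t+1}(s,a) − r_t(s,a)|
       = ⌊T/N⌋ · ∑_{j=0}^{N−1} max_{s,a} |r_{(j+1) mod N}(s,a) − r_j(s,a)| + O(1)
       = ⌊T/N⌋ · c + O(1) = Θ(T),

so B_r scales with the number of elapsed periods, i.e., linearly with T; an analogous argument applies to B_p.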

Fig. 2: Cumulative reward for a 2-state, 2-action PMDP with N = 5 (above) and N = 15 (below).

IV-B Our experiment

Our synthetic data-set formulation is inspired by [cheung2020reinforcement]. We consider an MDP with two states and two actions. The variation in the rewards and transition functions is modeled using cosine functions of the period index. We run the experiment for periods N = 5 and N = 15, and compare the cumulative reward of the algorithms after averaging over several independent runs. The results are shown in Figure (2). We clearly observe that PUCRL2 outperforms the other algorithms.
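Since the exact cosine expressions and parameter values are not reproduced above, the following is only a hypothetical generator of such a periodically varying two-state, two-action PMDP; the amplitudes and offsets are illustrative choices, not the paper's.

   import numpy as np

   def make_cosine_pmdp(N, seed=0):
       # Hypothetical generator of a 2-state, 2-action periodic MDP whose
       # rewards and transition probabilities vary as cosines of the period
       # index.  Amplitudes and offsets below are illustrative, not the paper's.
       S, A = 2, 2
       rng = np.random.default_rng(seed)
       base_r = rng.uniform(0.2, 0.8, size=(S, A))
       base_p = rng.uniform(0.3, 0.7, size=(S, A))   # prob. of moving to state 0
       P = np.zeros((N, S, A, S))
       r = np.zeros((N, S, A))
       for j in range(N):
           phase = np.cos(2 * np.pi * j / N)
           r[j] = np.clip(base_r + 0.2 * phase, 0.0, 1.0)
           p0 = np.clip(base_p + 0.3 * phase, 0.05, 0.95)
           P[j, :, :, 0] = p0
           P[j, :, :, 1] = 1.0 - p0
       return P, r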

V Conclusion

Periodic non-stationarity in Markov Decision Processes has been studied in this paper, where the state transition and reward functions vary periodically. Existing RL algorithms for non-stationary and stationary MDPs fail to perform optimally in this setting. We provide a new algorithm called PUCRL2, which outperforms competing algorithms. The static regret bound depends linearly on the diameter of the AMDP; comparing this diameter with the maximum diameter of general non-stationary MDPs is left as future work.

References

Appendix A Proof of Theorem 1

The proof borrows some ideas from [auer2008near] and is divided into four parts. In Appendix (A-A), we upper bound the total regret by removing the randomness in the accumulated rewards. The regret in episodes where the true AMDP does not lie in the set of plausible AMDPs is bounded in Appendix (A-B), and the regret under the assumption that it does is bounded in Appendix (A-C). Finally, we complete the proof in Appendix (A-D).

A-A Splitting into episodes

As in [auer2008near, Section 4.1], using Hoeffding's inequality we can decompose the regret as:

with high probability, where the counts refer to the number of visits to each (state, period index)-action pair after T steps.

Let there be m episodes in total, so that the episodic visit counts sum to the total counts.

The regret incurred in each episode can be defined accordingly. Hence, the total regret decomposes as

(5)

A-B Dealing with failing confidence regions

Lemma 2.

For any t, the probability that the true AMDP M is not contained in the set of plausible AMDPs at time t is bounded as follows:

Proof.

As in [auer2008near, Section C.1], we bound the transition function estimates using the ℓ1-deviation concentration inequality over distinct events from a given number of samples [weissman2003inequalities]:

As the state space has been augmented, we have SN states and hence 2^{SN} distinct events.
Thus, setting the deviation equal to the confidence radius in (1),

we get,

For the rewards, we use Hoeffding's inequality to bound the deviation of the empirical mean from the true mean given i.i.d. samples:

Setting the deviation equal to the confidence radius in (2),

we get, for every (state, period index)-action pair,

A union bound over all possible values of the number of visits to a pair, i.e., 1, 2, …, t, gives

Summing these probabilities over all (state, period index)-action pairs, we obtain the claimed bound. ∎

Lemma 3.

With high probability, the regret incurred due to the failing confidence regions satisfies:

(6)
Proof.

See [auer2008near, Section 4.2], with Lemma (2) used in place of [auer2008near, Appendix C.1]. ∎

A-C Episodes where the true AMDP lies in the set of plausible AMDPs

By this assumption and [auer2008near, Theorem 7], the optimistic average reward of the near-optimal policy chosen by Modified-EVI (Algorithm (2)) falls short of the optimal average reward ρ* by at most a term that vanishes with the episode start time.

Thus, we can write the regret of an episode as:

(7)

Let us consider the last iteration at which the convergence criterion holds and Modified-EVI terminates; then, as in [auer2008near, Section 4.3.1],

(8)

for all augmented state-action pairs. Expanding the value-iteration update as in (3),

and substituting this into (8), we get

Thus, substituting the above result into (7), and using the preceding observations, we get

(9)

A-C1 Bounding

(10)

By the property of extended value iteration [auer2008near, Section 4.3.1], extended to Modified-EVI,

(11)

where the bound involves the diameter of the augmented MDP after the aperiodicity transformation.

Since the bound in (11) controls only the span of the value vector, we can replace it by a recentred version

(12)

such that it follows from (11) that its entries are bounded in magnitude.

Hence, the supremum norm of the recentred vector is at most half of the diameter of the transformed AMDP.

According to [fruit2019exploration, Section 3.3.1], the diameter of the aperiodicity-transformed AMDP is at most 1/α times the diameter D of the original AMDP. Hence, the supremum norm of the recentred vector is at most D/(2α).

Thus, the first term in (10) can be bounded as:

(13)

where the last inequality uses the transition confidence bound (1). We note that the aperiodicity transformation coefficient α cancels out and does not appear in the regret term.

Following the proof for the second term in [auer2008near, Section 4.3.2], the second term in (10) can be bounded as:

(14)

with high probability, where m is the number of episodes, bounded as in [auer2008near, Appendix C.2].

A-C2 Bounding

(15)

where the last inequality uses the reward confidence bound (2).

A-D Completing the Proof

Thus, using (9), (13), (14) and (15), we can write the total episodic regret, with high probability, as:

We can bound this term as in [auer2008near, Section 4.3.3]. Also, noting the bound on the number of episodes m, we obtain

(16)

Using (5), (6) and (16), with high probability we can bound the total regret as:

Further simplifications as in [auer2008near, Appendix C.4] yield the total regret bound stated in Theorem 1. ∎