# A Tractable Algorithm For Finite-Horizon Continuous Reinforcement Learning

We consider the finite-horizon continuous reinforcement learning problem. Our contribution is three-fold. First, we give a tractable algorithm based on optimistic value iteration for the problem. Next, we give a lower bound on regret of order $\Omega(T^{2/3})$ for any algorithm that discretizes the state space, improving the previous lower bound of $\Omega(T^{1/2})$ of Ortner and Ryabko for the same problem. Next, under the assumption that the rewards and transitions are Hölder continuous, we show that the upper bound on the discretization error is $\text{const} \cdot Ln^{-\alpha}T$. Finally, we give some simple experiments to validate our propositions.

## I Introduction

In most reinforcement learning algorithms the state and action spaces are assumed to be finite, but many real-world applications have continuous state or action spaces. Although the case of continuous state space has been considered before, for example by Ortner and Ryabko [1] and Lakshmanan et al. [2], there is no tractable algorithm whose regret has been analyzed. In this work we consider a simpler finite-horizon setting and give a tractable algorithm with a near-optimal regret bound.

Many assumptions on the reward and transition functions have been considered before, such as deterministic transitions in [3] or transition functions that are linear in state and action [4], [5], [6], [7]. Kakade et al. [8] considered the more general setting of PAC learning in RL with metric state spaces. Another result is by Osband and Roy [9], where bounds on expected regret were derived when the reward and transition functions belong to a class of functions characterized by parameters like the eluder dimension and the Kolmogorov dimension. In this paper we consider the most general assumption, namely that the reward function and the transition functions are Hölder continuous, as in [1] and [2]. We derive our bound on the discretization error without any further assumptions. By assuming unbiased sampling, we also derive an upper bound on the regret which is of order $T^{2/3}$ when both transition probabilities and reward function are Lipschitz continuous.

In recent work, Azar et al. [10] proposed a new optimistic value iteration algorithm, UCBVI. This is different from the UCRL algorithm of Jaksch et al. [11], upon which the algorithm of Ortner and Ryabko [1] was based. For episodic problems, it is known that the UCBVI algorithm has better regret bounds. We extend UCBVI to continuous state space problems using the state space aggregation used in [1]. It should also be noted that the algorithms in [1] and [2] are not tractable. Though we consider an easier episodic setting, our algorithm is tractable. We also show that the discretization error (from discretizing the state space into intervals) is bounded above by $\text{const} \cdot Ln^{-\alpha}T$ under the assumption that rewards and transitions are Hölder continuous.

For the infinite-horizon setting, in the one-dimensional case with Lipschitz rewards and transition functions, the bound by Ortner and Ryabko in [1] is of order $\tilde{O}(T^{3/4})$. If the transition functions are in addition smooth, an asymptotic regret bound of $\tilde{O}(T^{2/3})$ is shown in [2].

The best known lower bound for Lipschitz rewards and general transition functions is by Ortner and Ryabko [1] and is of order $\Omega(T^{1/2})$. We improve upon this by giving a lower bound of $\Omega(T^{2/3})$, matching the upper bound for algorithms that discretize the state space. We have also implemented the algorithm and give experimental results in one-dimensional and two-dimensional settings. The empirical performance is seen to match the theoretical results.

## II Terminology and Problem Definition

We define a Markov decision process (MDP) by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $r$ and transition probabilities $p$. We consider MDPs with a continuous state space and finitely many actions. For simplicity we derive results for a one-dimensional state space; they can be easily generalized to higher-dimensional state spaces. The random rewards given state $s$ and action $a$ are bounded in $[0,1]$ with mean $r(s,a)$. The probability of going to state $s'$ given state $s$ and action $a$ is given by the transition probability $p(s'|s,a)$. We make the following assumptions as in [1], [2]. They guarantee that rewards and transitions are close in close states.

###### Assumption 1.

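There are $L, \alpha > 0$ such that for any two states $s, s'$ and all actions $a$,

$$\left| r(s,a) - r(s',a) \right| \le L\,|s - s'|^{\alpha}. \tag{1}$$

###### Assumption 2.

There are $L, \alpha > 0$ such that for any two states $s, s'$ and all actions $a$,

$$\left\| p(\cdot|s,a) - p(\cdot|s',a) \right\|_1 \le L\,|s - s'|^{\alpha}. \tag{2}$$

For simplicity we assume that $L$ and $\alpha$ are the same for both assumptions.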

We consider the finite-horizon setting where the agent interacts with the environment for $H$ steps per episode. We denote by $[H]$ the set $\{1,\dots,H\}$. The policy during an episode is expressed as a mapping $\pi: \mathcal{S} \times [H] \to \mathcal{A}$. Let $x_{k,h}$ denote the state in step $h$ of episode $k$ and $\pi_k$ denote the policy for episode $k$. The value function of each state $s$ in episode $k$ from step $h$ when following the policy $\pi_k$ is defined by

$$V^{\pi_k}_h(s) = \mathbb{E}_{\pi_k}\left( \sum_{i=h}^{H} r\big(x_{k,i}, \pi_k(x_{k,i}, i)\big) \,\Big|\, x_{k,h} = s \right). \tag{3}$$

The optimal value function is defined as $V^*_h(s) = \sup_{\pi} V^{\pi}_h(s)$ for all $s \in \mathcal{S}$ and $h \in [H]$. The performance of the algorithm is measured by the regret incurred over all episodes, given by

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V^*_1(x_{k,1}) - V^{\pi_k}_1(x_{k,1}) \right). \tag{4}$$

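We discretize the continuous state space into $n$ intervals of length $1/n$ as in [1]. Let $I(s)$ denote the interval containing the state $s$. We define aggregate rewards and aggregate transition probabilities with respect to an interval $I_j$ as

$$p^{agg}(I_j|s,a) = \int_{I_j} p(ds'|s,a) \tag{5}$$

$$r^{agg}(I_j,a) = n \int_{I_j} r(s,a)\,ds. \tag{6}$$

Here $r^{agg}(I_j,a)$ can be interpreted as the mean reward in the interval $I_j$. The algorithm treats each interval as a single state, and the aggregate policy is $\pi^{agg}_k$. The aggregated value function following this policy is $V^{agg}_{\pi_k,h}$. At any state $s$ and step $i$, the mapping between the policies and the aggregated value function are given by

$$\pi_k(s,i) = \pi^{agg}_k(I(s), i) \tag{7}$$

$$V^{agg}_{\pi_k,h}(I) = \mathbb{E}_{\pi^{agg}_k}\left( \sum_{i=h}^{H} r^{agg}\big(I_{k,i}, \pi^{agg}_k(I_{k,i}, i)\big) \,\Big|\, I_{k,h} = I \right). \tag{8}$$

For the discretized MDP, the transitions are between intervals, so we define $P^{agg}(I|I',a)$ as the average of $p^{agg}(I|t,a)$ over $t \in I'$:

$$P^{agg}(I|I',a) = n \int_{I'} p^{agg}(I|t,a)\,dt \tag{9}$$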
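The discretization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the state space is taken to be $[0,1]$, the reward `toy_reward` is hypothetical, and the helper names `interval_index` and `aggregate_reward` do not come from the paper. The integral in (6) is approximated by a midpoint rule.

```python
import numpy as np

def interval_index(s: float, n: int) -> int:
    """Index j in {0, ..., n-1} of the interval I(s) containing s (state space assumed to be [0, 1])."""
    return min(int(s * n), n - 1)  # s = 1.0 is assigned to the last interval

def aggregate_reward(r, a: int, j: int, n: int, grid: int = 1000) -> float:
    """Approximate r_agg(I_j, a) = n * integral over I_j of r(s, a) ds, as in (6).

    With a midpoint rule on `grid` points, n * (1/n) * mean(r) is just the mean of r over I_j.
    """
    width = 1.0 / n
    midpoints = j * width + (np.arange(grid) + 0.5) * width / grid
    return float(np.mean([r(s, a) for s in midpoints]))

def toy_reward(s: float, a: int) -> float:
    """Hypothetical mean reward used only for this illustration."""
    return s if a == 0 else 1.0 - s

j = interval_index(0.3, n=4)                      # interval [0.25, 0.5)
r_agg_value = aggregate_reward(toy_reward, a=0, j=j, n=4)
```

For a reward that is linear in the state, as in this toy case, the aggregated reward is simply the reward at the midpoint of the interval.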

## III Algorithm

In UCRL-based algorithms [11], confidence sets are built around rewards and transition probabilities. Here, instead, we build the confidence set around the optimal value function of the discretized MDP, as in Azar et al. [10]. The algorithm proceeds similarly to UCBVI in [10]. The only difference is that here we use aggregated rewards, transition probabilities and value functions. The bonus (Algorithm 2), which is used in calculating the Q-values, is built from the empirical variance of the estimated next values. This relies on the Bernstein-Freedman concentration inequality for building the confidence sets.

In UCRL-based algorithms [11] we need to find the optimistic MDP and the optimal policy for that MDP. This step is not tractable when the state space is continuous, as we need knowledge of the bias span (Section 3 of [1]). Though this is not needed for finite-horizon problems, to the best of our knowledge the UCCRL algorithm is not computationally tractable. We note that in our algorithm, which is based on UCBVI, this step is not needed. Briefly, the algorithm consists of the following parts.

1. An initialization part, which also includes discretizing the continuous MDP based on the input parameters. This step is executed only once.

2. An iterative flow of three processes. The number of iterations is equal to the number of episodes $K$.

• Estimate the transition probabilities based on the history up to that iteration.

• Find the Q-values using a modified Bellman operator (which includes the bonus) according to the current transition probabilities. This part can be considered a simple dynamic programming (DP) problem for finding the Q-values.

• Execute the current (discrete) policy according to the Q-values found, and record the feedback given by the environment.
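The steps above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a state space $[0,1]$, takes the aggregated rewards as a known array, and swaps the variance-based Bernstein-Freedman bonus for a simpler Hoeffding-style bonus. The names `ucbvi_discretized`, `toy_step` and `toy_r_agg`, and the constant inside `log_term`, are assumptions made for the example.

```python
import numpy as np

def ucbvi_discretized(step, r_agg, H, K, delta=0.1, seed=0):
    """Optimistic value iteration on n intervals of [0, 1] (simplified sketch).

    step(s, a, rng) -> (reward, next_state) is a hypothetical environment hook;
    r_agg is an (n, A) array of known aggregated mean rewards.
    """
    n, A = r_agg.shape
    rng = np.random.default_rng(seed)
    counts = np.zeros((n, A, n))            # (interval, action, next interval) visit counts
    log_term = np.log(5 * n * A * K * H / delta)
    episode_returns = np.zeros(K)
    for k in range(K):
        # Estimate aggregated transition probabilities from the history so far.
        visits = counts.sum(axis=2)
        safe_visits = np.maximum(visits, 1)
        p_hat = counts / safe_visits[:, :, None]
        # Simplified Hoeffding-style bonus (an assumption, not the variance-based bonus).
        bonus = H * np.sqrt(log_term / safe_visits)
        # Backward dynamic programming with the bonus-augmented Bellman operator.
        Q = np.zeros((H + 1, n, A))
        V = np.zeros((H + 1, n))
        for h in range(H - 1, -1, -1):
            backup = r_agg + p_hat @ V[h + 1] + bonus
            Q[h] = np.where(visits > 0, np.minimum(backup, H), H)  # unvisited pairs stay optimistic
            V[h] = Q[h].max(axis=1)
        # Execute the greedy policy on the continuous state and record the feedback.
        s = rng.random()
        for h in range(H):
            i = min(int(s * n), n - 1)
            a = int(np.argmax(Q[h, i]))
            reward, s_next = step(s, a, rng)
            counts[i, a, min(int(s_next * n), n - 1)] += 1
            episode_returns[k] += reward
            s = s_next
    return episode_returns

def toy_step(s, a, rng):
    """Hypothetical environment: reward s for action 0, 1 - s for action 1; uniform next state."""
    reward = s if a == 0 else 1.0 - s
    return reward, rng.random()

n = 4
centers = (np.arange(n) + 0.5) / n
toy_r_agg = np.stack([centers, 1.0 - centers], axis=1)   # exact interval means for toy_step
returns = ucbvi_discretized(toy_step, toy_r_agg, H=3, K=50)
```

Each episode costs one pass of backward dynamic programming over `H` steps, `n` intervals and `A` actions, which is what makes this style of algorithm tractable.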

### III-A Bonus

This algorithm follows the heuristic principle known as optimism in the face of uncertainty. Algorithms in this paradigm employ optimism to guide exploration. In simple words, the algorithm assigns a higher bonus value to the action (given a state/interval) which it is most uncertain about. This encourages the exploration of that action the next time it visits the same state. As the number of times the action is chosen increases, the algorithm becomes more and more certain about it. Given a (state/interval, action) pair, the bonus in the algorithm depends on the variance of the value function of the next possible states: the larger the variance, the larger the uncertainty of taking that action.

Let us understand this using a simple example. Consider a discretized MDP with 4 possible actions. Given a particular interval, let the Q-values and the corresponding bonuses found by the algorithm be equal to those given in Figure 1. Without the bonus, the best action would be the one with the highest Q-value in that interval. But when the bonus is considered, a different action, one with a larger bonus, is the most likely to be chosen. The bonus orders the actions by decreasing uncertainty, and the combined effect of the Q-value and the bonus leads to the exploration of that action. Due to this exploration, the uncertainty about choosing it reduces.
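A small numerical version of this example may help. The Q-values and bonuses below are hypothetical numbers chosen for illustration; they are not the values shown in Figure 1.

```python
import numpy as np

# Hypothetical estimates for the 4 actions in one interval (not taken from Figure 1).
q_values = np.array([0.9, 0.7, 0.6, 0.5])    # Q-values without any bonus
bonuses = np.array([0.05, 0.4, 0.1, 0.2])    # larger bonus = more uncertainty

greedy_action = int(np.argmax(q_values))                 # choice when the bonus is ignored
optimistic_action = int(np.argmax(q_values + bonuses))   # choice when the bonus is added
uncertainty_order = np.argsort(-bonuses)                 # actions from most to least uncertain
```

With these numbers the greedy choice and the optimistic choice differ: the action with the largest bonus overtakes the action with the highest plain Q-value, so it is the one explored next.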

## IV Results

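The regret analysis of UCBVI-CRL is straightforward. The regret of the continuous MDP up to $K$ episodes is defined in equation (4). Adding and subtracting the terms $V^{agg*}_1(I_{k,1})$ and $V^{agg}_{\pi_k,1}(I_{k,1})$ gives

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( \big[ V^*_1(x_{k,1}) - V^{agg*}_1(I_{k,1}) + V^{agg}_{\pi_k,1}(I_{k,1}) - V^{\pi_k}_1(x_{k,1}) \big] + \big[ V^{agg*}_1(I_{k,1}) - V^{agg}_{\pi_k,1}(I_{k,1}) \big] \Big) = \Delta_{error} + \Delta_{agg},$$

where

$$\Delta_{error} = \sum_{k=1}^{K} \big[ V^*_1(x_{k,1}) - V^{agg*}_1(I_{k,1}) + V^{agg}_{\pi_k,1}(I_{k,1}) - V^{\pi_k}_1(x_{k,1}) \big], \qquad \Delta_{agg} = \sum_{k=1}^{K} \big[ V^{agg*}_1(I_{k,1}) - V^{agg}_{\pi_k,1}(I_{k,1}) \big].$$

Here $\Delta_{agg}$ is the regret of the discretized MDP and $\Delta_{error}$ is the error which is a consequence of discretizing the state space.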

### IV-A Bounding $\Delta_{error}$

###### Lemma 1.

Using the discretization technique mentioned above, $\Delta_{error}$ is bounded above by $5Ln^{-\alpha}T$.

###### Proof.

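Let us split the terms in $\Delta_{error}$ into two parts as

$$q_{k1} = V^*_1(x_{k,1}) - V^{agg*}_1(I_{k,1}) \quad \text{and} \quad q_{k2} = V^{agg}_{\pi_k,1}(I_{k,1}) - V^{\pi_k}_1(x_{k,1}),$$

$$\Delta_{error} \le \sum_{k=1}^{K} \big( |q_{k1}| + |q_{k2}| \big) \quad \text{(triangle inequality)}.$$

Consider

$$V^*_1(x_{k,1}) = \max_{a \in \mathcal{A}} \Big( r(x_{k,1},a) + \int_{\mathcal{S}} p(ds'|x_{k,1},a)\, V^*_2(s') \Big) \quad \text{(Chapter 4 of [12])},$$

$$V^{agg*}_1(I_{k,1}) = \max_{a \in \mathcal{A}} \Big( r^{agg}(I_{k,1},a) + \sum_{j=1}^{n} P^{agg}(I_j|I_{k,1},a)\, V^{agg*}_2(I_j) \Big) \quad \text{(Bellman equation)}.$$

Using the property $|\max f - \max g| \le \max |f - g|$,

$$|q_{k1}| \le \max_{a \in \mathcal{A}} \Big| \big[ r(x_{k,1},a) - r^{agg}(I_{k,1},a) \big] + \Big[ \int_{\mathcal{S}} p(ds'|x_{k,1},a)\, V^*_2(s') - \sum_{j=1}^{n} P^{agg}(I_j|I_{k,1},a)\, V^{agg*}_2(I_j) \Big] \Big|.$$

Let

$$l_{k1} = \max_{a \in \mathcal{A}} \big| r(x_{k,1},a) - r^{agg}(I_{k,1},a) \big| \quad \text{and} \quad l_{k2} = \max_{a' \in \mathcal{A}} \Big| \int_{\mathcal{S}} p(ds'|x_{k,1},a')\, V^*_2(s') - \sum_{j=1}^{n} P^{agg}(I_j|I_{k,1},a')\, V^{agg*}_2(I_j) \Big|.$$

Using the property $\max |f + g| \le \max |f| + \max |g|$,

$$|q_{k1}| \le l_{k1} + l_{k2}.$$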

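For any action $a$, $l_{k1}$ is bounded by

$$\begin{aligned}
\big| r(x_{k,1},a) - r^{agg}(I_{k,1},a) \big| &= \Big| r(x_{k,1},a) - n \int_{I_{k,1}} r(s,a)\,ds \Big| && \text{(from (6))} \\
&\le \Big| r(x_{k,1},a) - n \Big( \sum_{i=1}^{n} m_i \Delta x_i \Big) \Big| \\
&= \Big| r(x_{k,1},a) - n \Big( \sum_{i=1}^{n} r(s'_i,a) \frac{1}{n^2} \Big) \Big| \\
&\le \frac{1}{n} \sum_{i=1}^{n} \big| r(x_{k,1},a) - r(s'_i,a) \big| \le Ln^{-\alpha}.
\end{aligned}$$

The first inequality is obtained by replacing the integral with the lower Riemann sum, dividing $I_{k,1}$ into $n$ sub-intervals each of length $\Delta x_i = 1/n^2$, with $m_i$ being the infimum of $r(\cdot,a)$ in the $i$-th sub-interval. The equality that follows holds since the function $r$ is continuous w.r.t. $s$, so the infimum is attained as a minimum at some point $s'_i$ of the closed sub-interval. The last inequality follows from Assumption 1 and the length $1/n$ of $I_{k,1}$.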
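The bound $|r(s,a) - r^{agg}(I(s),a)| \le Ln^{-\alpha}$ can be checked numerically on a toy example. The sketch below assumes a state space $[0,1]$ and a hypothetical reward $r(s) = L\,|s - 0.5|^{\alpha}$, which is Hölder continuous with constant $L$ and exponent $\alpha$; the aggregated reward is approximated by averaging over a fine grid of midpoints.

```python
import numpy as np

L, alpha, n = 1.0, 0.5, 8      # hypothetical Hölder constant, exponent and number of intervals

def r(s):
    """Hypothetical reward, Hölder continuous with constant L and exponent alpha."""
    return L * np.abs(s - 0.5) ** alpha

fine = 2000                    # grid points per interval for the numerical average
gaps = []
for j in range(n):
    s_grid = (j + (np.arange(fine) + 0.5) / fine) / n   # midpoints inside interval I_j
    r_agg = r(s_grid).mean()                            # approximates n * integral over I_j of r
    gaps.append(np.max(np.abs(r(s_grid) - r_agg)))      # worst gap |r(s) - r_agg| within I_j

max_gap = max(gaps)
bound = L * n ** (-alpha)      # the bound L * n^(-alpha) on l_k1
```

In this toy case the largest gap occurs in the two intervals adjacent to $s = 0.5$, where the reward changes fastest, and it stays below the bound.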

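For any action $a$, $l_{k2}$ is bounded by

$$\begin{aligned}
l_{k2} &\le \Big| H \sum_{j=1}^{n} \int_{I_j} p(ds'|x_{k,1},a) - \sum_{j=1}^{n} P^{agg}(I_j|I_{k,1},a)\, V^{agg*}_2(I_j) \Big| && \text{(rewards} \in [0,1]\text{)} \\
&\le \sum_{j=1}^{n} \Big| H \int_{I_j} p(ds'|x_{k,1},a) - P^{agg}(I_j|I_{k,1},a)\, V^{agg*}_2(I_j) \Big|.
\end{aligned}$$

Writing $a := H$, $b := \int_{I_j} p(ds'|x_{k,1},a)$, $c := V^{agg*}_2(I_j)$ and $d := P^{agg}(I_j|I_{k,1},a)$, each summand has the form $|ab - cd| \le |(a+c)(b-d)| + |cb - da|$, so

$$\begin{aligned}
l_{k2} &\le \sum_{j=1}^{n} \Big[ \Big| \big( V^{agg*}_2(I_j) + H \big) \Big( \int_{I_j} p(ds'|x_{k,1},a) - P^{agg}(I_j|I_{k,1},a) \Big) \Big| + \Big| V^{agg*}_2(I_j) \int_{I_j} p(ds'|x_{k,1},a) - P^{agg}(I_j|I_{k,1},a)\, H \Big| \Big] \\
&\le \sum_{j=1}^{n} \Big[ 3H \Big| \int_{I_j} p(ds'|x_{k,1},a) - P^{agg}(I_j|I_{k,1},a) \Big| \Big] && (V^{agg*}_2(I_j) \le H-1).
\end{aligned}$$

From (9), replacing the integral with the lower Riemann sum over $n$ sub-intervals and using the fact that the infimum of a continuous function on a closed interval is attained as a minimum, we get

$$\begin{aligned}
l_{k2} &\le \sum_{j=1}^{n} \Big[ 3H \Big| \int_{I_j} p(ds'|x_{k,1},a) - n \Big( \sum_{i=1}^{n} p^{agg}(I_j|s'_i,a) \frac{1}{n^2} \Big) \Big| \Big] \\
&\le \sum_{j=1}^{n} \Big[ \frac{3H}{n} \sum_{i=1}^{n} \Big| \int_{I_j} p(ds'|x_{k,1},a) - p^{agg}(I_j|s'_i,a) \Big| \Big] \\
&= \sum_{j=1}^{n} \Big[ \frac{3H}{n} \sum_{i=1}^{n} \Big| \int_{I_j} p(ds'|x_{k,1},a) - \int_{I_j} p(ds'|s'_i,a) \Big| \Big] \\
&= \frac{3H}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \Big| \int_{I_j} p(ds'|x_{k,1},a) - \int_{I_j} p(ds'|s'_i,a) \Big| \\
&= \frac{3H}{n} \sum_{i=1}^{n} \big\| p(\cdot|x_{k,1},a) - p(\cdot|s'_i,a) \big\|_1 \\
&\le 3HLn^{-\alpha} && (x_{k,1}, s'_i \in I_{k,1}\ \forall i \text{, using (2))}.
\end{aligned}$$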

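Now we have,

$$|q_{k2}| = \Big| \mathbb{E}_{\pi_k}\Big( \sum_{h=1}^{H} \big( r(x_{k,h},\pi_k) - r^{agg}(I'_{k,h},\pi_k) \big) \Big) \Big| \quad \text{(using the definitions (3) and (8))}.$$

Hence $|q_{k2}|$ can be bounded by bounding the difference of the rewards inside the expectation, which is similar to bounding $l_{k1}$. Since each of the $H$ reward differences is at most $Ln^{-\alpha}$ for any action $a$, we have

$$|q_{k2}| \le HLn^{-\alpha}.$$

Combining $l_{k1} \le Ln^{-\alpha}$, $l_{k2} \le 3HLn^{-\alpha}$ and $|q_{k2}| \le HLn^{-\alpha}$, $\Delta_{error}$ is bounded by

$$\Delta_{error} \le \sum_{k=1}^{K} \big( |q_{k1}| + |q_{k2}| \big) \le \sum_{k=1}^{K} (4H+1)Ln^{-\alpha} \le 5Ln^{-\alpha}T \quad (\text{as } T \ge KH).$$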

## V Additional Result: Upper Bound on Regret

We derive a tighter upper bound on the regret for the continuous case (when using the proposed algorithm) when the transition estimates found by the algorithm are unbiased estimates of (9). The theorem below states this.

###### Theorem 1.

Let $M$ be an MDP with continuous state space $\mathcal{S}$, $A$ actions, known reward function and unknown transitions satisfying Assumptions 1 and 2, and suppose the transition estimates are unbiased estimates of (9). Then with probability at least $1-\delta$, the regret of UCBVI-CRL after $T$ steps is upper bounded by

$$5Ln^{-\alpha}T + 30L'\sqrt{nATH} + 2500H^2n^2AL^2 + 4H\sqrt{TL'} \tag{6}$$

where .

###### Proof.

The regret in the continuous setting can be split into $\Delta_{error}$ and $\Delta_{agg}$. The bound on $\Delta_{error}$ comes from Lemma 1 and the bound on $\Delta_{agg}$ is from Theorem 2 of [10].∎
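To get a feel for how the four terms of the bound trade off against the number of intervals $n$, the expression can be evaluated numerically. The sketch below transcribes the bound as printed in Theorem 1; the function name `regret_bound`, the treatment of $L'$ as a plain number passed in by the caller, and all parameter values are assumptions made for illustration.

```python
import math

def regret_bound(n, T, H, A, L, alpha, L_prime):
    """Evaluate the four terms of the bound in Theorem 1 for a given number of intervals n.

    L_prime is supplied by the caller as a number; this sketch assumes nothing about its form.
    """
    discretization = 5 * L * n ** (-alpha) * T
    aggregated = (30 * L_prime * math.sqrt(n * A * T * H)
                  + 2500 * H ** 2 * n ** 2 * A * L ** 2
                  + 4 * H * math.sqrt(T * L_prime))
    return discretization + aggregated

# Hypothetical parameter values: scan n to see where the terms balance.
params = dict(T=10 ** 12, H=5, A=2, L=1.0, alpha=1.0, L_prime=10.0)
candidates = range(1, 5001)
best_n = min(candidates, key=lambda n: regret_bound(n, **params))
```

The first term decreases with $n$ while the second and third increase, so the scan picks an intermediate value of $n$ once $T$ is large enough for the discretization term to matter.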

###### Corollary 1.

We set . It can be seen that when , setting first term dominates while the third term dominates when . Thus the optimal regret of is obtained by setting . This is better than the regret bound of order for the same in [1]. And in the Lipschitz case i.e., when we have regret of order .

###### Remark 1.

The dependence on $A$, $L$ and $H$ is not optimal. If the algorithms of [1] and [2] are adapted for finite-horizon setting We again wish to emphasize the fact that our algorithm is tractable, unlike the algorithms in the other two papers.

###### Remark 2.

The regret bound in [2] is also of order for the infinite horizon problem in the Lipschitz case. But there is an additional parameter which depends on the smoothness of the transition function. And only in the asymptotic case when the regret of is attained.

###### Remark 3.

For -dimensional state space we have states in the last three terms of equation (6). So setting we get a regret of which is better than the bound of in [1] for the case .

## VI Lower Bounds

We have the following theorem giving the lower bound.

###### Theorem 2.

For any algorithm using state-space discretization, any natural numbers and , the expected regret for after timesteps for the MDP with state , actions defined above is

$$\mathbb{E}(\Delta_T(A)) \ge L \cdot \sqrt{HA}\, T^{2/3} \tag{12}$$
###### Proof.

We note that any algorithm using state-space discretization also incurs a regret by choosing a wrong action due to discretization error. Hence we have

1. Regret in the discretized MDP

2. Regret due to the discretization error

The total regret is at least the maximum of these two. The first term is of order (see [11]). Now consider the MDP shown in Figure 2 with two actions in each state.

Let the algorithm discretize the state into intervals. Let us assume that there are two sub-intervals in each interval and the optimal action in each sub-interval is different. Assume for simplicity that all intervals are of equal size. Otherwise divide each interval into sub-intervals of size half of the least interval size and assume constant reward for the remaining length of the intervals.

The mean rewards for actions are as shown in Figure 2, i.e., they are fixed for a state. It can be seen that it is not possible to determine the optimal action for both these sub-intervals simultaneously. And the regret obtained due to discretization is of order as the rewards are Lipschitz continuous. Since this regret is incurred on average for at least half of the states visited, the total regret is of order . Balancing this and the regret in the discretized MDP by taking we get a regret of order $T^{2/3}$. ∎

###### Remark 4.

We note that algorithms like value iteration and non-stationary value iteration algorithms (Chapter 3 of [13]) which do not use state-space discretization cannot be used here as the probability transition function is not known.

###### Remark 5.

It is straightforward to show a regret bound of for infinite horizon setting where is the span of the optimal bias function.

## VII Experiments

We have performed our experiments on one-dimensional and two-dimensional state spaces. For the one-dimensional case we implemented the algorithm on a simple problem with with being the state space . In this problem, the reward functions for the actions are and respectively. So the optimal policy is, for a state in interval , the one which takes action if is odd and if is even. The reward was scaled to the range $[0,1]$ to meet the requirements of [10]. From the current state , after taking an action such that , the next state is sampled uniformly from the state space .

We can see from Figure 6 that the empirical regret converges faster for a larger number of intervals, for the same horizon length. The convergence can be better understood by comparing the points where the empirical regret approaches zero. The very first point, in every plot, where the empirical regret is zero is indicated by a dashed vertical line showing the episode number. The results match the intuition that, for the same problem, having more states improves the performance.

By keeping the number of intervals constant and varying the horizon length, it can be seen from the plots that the regret approaches zero more slowly as the horizon length increases. It can also be seen that the regret for a larger horizon length at a particular episode was higher than the regret at the same episode for a smaller horizon length. Thus the regret increases with the horizon length, as expected.

Next we conducted experiments on a two-dimensional problem with and a Lipschitz continuous reward function for the respective actions. The reward functions were and for actions and respectively. The analysis of this case is similar to the one-dimensional setting. The algorithm shows a similar trend for different horizon lengths with the same number of intervals (see Figure 5), i.e., the regret approaches zero more slowly for larger horizon lengths, as in the one-dimensional setting.

## VIII Conclusion

We considered the finite-horizon continuous reinforcement learning problem and gave an algorithm based on UCBVI for it. With the only assumption that the reward function and transition probabilities are Lipschitz continuous, we showed that the upper bound on the discretization error is of order $Ln^{-\alpha}T$. We have shown the matching lower bound under the assumption that the algorithm discretizes the state space of the MDP. In future we would like to show a similar bound without this assumption. Also, with the additional assumption that the sampling is unbiased, we proved that the regret is of order $T^{2/3}$ when using our algorithm.

We have also given some experimental results to validate our propositions. In future we would like to extend the algorithm to the infinite-horizon case. This seems to be difficult, as the UCBVI algorithm needs the horizon length as input. We also want to improve the dependence of the regret on the number of actions $A$, the Lipschitz constant $L$ and the horizon length $H$.

## References

• [1] R. Ortner and D. Ryabko, “Online regret bounds for undiscounted continuous reinforcement learning,” in Advances in Neural Information Processing Systems NIPS, vol. 25, pp. 1763–1772, 2012.
• [2] K. Lakshmanan, R. Ortner, and D. Ryabko, “Improved regret bounds for undiscounted continuous reinforcement learning,” in Proceedings of 32nd International Conference on Machine Learning (ICML), vol. 37, pp. 524–32, 2015.
• [3] A. Bernstein and N. Shimkin, “Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains,” Machine Learning, vol. 81, no. 3, pp. 359–397, 2010.
• [4] A. L. Strehl and M. L. Littman, “Online linear regression and its application to model-based reinforcement learning,” in Advances in Neural Information Processing Systems NIPS, vol. 20, pp. 1417–1424, 2008.
• [5] E. Brunskill, B. R. Leffler, L. Li, M. L. Littman, and N. Roy, “Provably efficient learning with typed parametric models,” Journal of Machine Learning Research, vol. 10, pp. 1955–1988, 2009.
• [6] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Learning Theory, 24th Annual Conference on Learning Theory COLT, JMLR Proceedings Track, vol. 19, pp. 1–26, 2011.
• [7] M. Ibrahmi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems NIPS, vol. 25, pp. 2645–2653, 2012.
• [8] S. Kakade, M. J. Kearns, and J. Langford, “Exploration in metric state spaces,” in Proceedings of 20th International Conference on Machine Learning ICML, pp. 306–312, 2003.
• [9] I. Osband and B. V. Roy, “Model-based reinforcement learning and the eluder dimension,” vol. Preprint, 2014.
• [10] M. G. Azar, I. Osband, and R. Munos, “Minimax regret bounds for reinforcement learning,” arXiv preprint arXiv:1703.05449, 2017.
• [11] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, pp. 1563–1600, 2010.
• [12] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov control processes: basic optimality criteria, vol. 30. Springer Science & Business Media, 2012.
• [13] O. Hernández-Lerma, Adaptive Markov control processes, vol. 79 of Applied Mathematical Sciences. Springer, 1989.