1 Introduction
Reinforcement Learning (RL) refers to machine learning (ML) techniques designed for sequential decision making, in which an agent must “learn” a policy that maximizes a reward (or minimizes a cost) criterion when some parameters of the model are not known in advance; cf.
Bertsekas [8], Sutton and Barto [43], Mohri et al. [34], Alpaydin [4], Tewari and Bartlett [46, 47], Ortner et al. [37]. Reinforcement learning is experiencing significant growth in recognition due to successful applications in many areas, cf. Wiering [52], Russo and Van Roy [39], Chang et al. [11], Neu et al. [36], Munos et al. [35], Szepesvári [44, 45], Filippi et al. [19], and Tewari and Bartlett [46, 47]. In this paper we consider the basic version of a probabilistic sequential decision system: the discrete-time, finite state and action Markovian decision process (MDP), cf. Dynkin and Yushkevich [17]. We first give a very brief survey of the state of the art in computing optimal data-driven (adaptive) policies for MDPs with unknown transition probabilities. We then compare the performance of the classic UCB policy of Burnetas and Katehakis [9] with a new policy developed herein, which we call MDP Deterministic Minimum Empirical Divergence (MDP-DMED), and with a method based on posterior sampling (MDP-PS). The MDP-DMED algorithm is inspired by the DMED method for the multi-armed bandit problem developed in Honda and Takemura [24, 25], and is based on estimating the optimal rates at which actions should be taken. The MDP-PS method is based on ideas of greedy posterior sampling that go back to Thompson [48], cf. Osband and Van Roy [38]. Indeed, many modern ideas of RL originate in work done for the multi-armed bandit problem, cf. Gittins [22], Gittins et al. [23], Auer et al. [6], Whittle [51], Weber [50], Villar et al. [49], Sonin [41], Sonin and Steinberg [42], Mahajan and Teneketzis [33], Katehakis and Veinott Jr. [30], Katehakis and Rothblum [29], Katehakis and Derman [28]. Some additional related work and areas of potential application are contained in Cowan and Katehakis [12], Cowan and Katehakis [14], Azar et al. [7], Katehakis et al. [31], Cowan and Katehakis [15], Abbeel and Ng [1], Ferreira et al. [18], Jaksch et al. [27], Asmussen and Glynn [5].
2 Formulation
A finite MDP is specified by a quadruple $(S, A, R, P)$, where $S$ is the state space, $A = \{A(x)\}_{x \in S}$ is the action space, with $A(x)$ being the set of admissible actions (or controls) in state $x$, $R = \{r(x, a)\}$ is the reward structure, and $P = \{p(y \mid x, a)\}$ is the transition law. Here $r(x, a)$ and $p(y \mid x, a)$ are, respectively, the one-step expected reward and the transition probability from state $x$ to state $y$ under action $a$. For extensions regarding state and action spaces and continuous time we refer to [13], Lerma [21], and Dynkin and Yushkevich [17], and references therein.
When all elements of $(S, A, R, P)$ are known, the model is said to be an MDP with complete information (CIMDP). In this case, optimal policies can be obtained via the appropriate version of the optimality equations, given the prevailing optimization criterion, state, action, and time conditions, and regularity assumptions, cf. Lerma [21], Dekker et al. [16], Dynkin and Yushkevich [17].
When some of the elements of $(S, A, R, P)$ are unknown, the model is said to be an MDP with incomplete or partial information (PIMDP).
For the body of the paper, we consider the following partial-information model: each transition probability vector $p(\cdot \mid x, a)$ is taken to be an element of the parameter space $\Theta$, that is, the space of all $|S|$-dimensional probability vectors with strictly positive components. The restriction that each transition probability be strictly positive is simply to ensure that, for any control policy, the resulting Markov chain is irreducible. Additionally, for the body of the paper we will take the reward structure $R$ to be known and constant. Unknown or probabilistic reward structures are to be considered in future work. Under this model, we define a sequence of state-valued random variables $\{X_t\}_{t \geq 0}$ representing the sequence of states of the MDP (taking $X_0 = x_0$ as a given initial state), and action-valued random variables $\{A_t\}_{t \geq 0}$, with $A_t$ the action taken by the controller at time $t$ when the MDP is in state $X_t$. It is convenient to define a control policy $\pi$ as a (potentially random) history-dependent sequence of actions such that $A_t \in A(X_t)$. We may then define the value of a policy as the total expected reward over a given horizon of action:

(1) $V_\pi(x_0, T) = \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} r(X_t, A_t)\Big].$
Let $\Pi$ be the set of all feasible MDP policies $\pi$. We are interested in policies that maximize the expected reward from the MDP, in particular policies that are capable of maximizing the expected reward irrespective of the initial uncertainty that exists about the underlying MDP dynamics (i.e., for all transition laws $P$ under consideration). It is convenient then to define $V^*(x_0, T) = \sup_{\pi \in \Pi} V_\pi(x_0, T)$, the optimal value under complete information. We may then define the “regret” as the expected loss due to ignorance of the underlying dynamics,

(2) $R_\pi(x_0, T) = V^*(x_0, T) - V_\pi(x_0, T).$
We are interested in Uniformly Fast (cf. Burnetas and Katehakis [9]) policies, those that achieve $R_\pi(x_0, T) = o(T^\alpha)$ for every $\alpha > 0$ and all feasible transition laws $P$. In this case, despite the controller’s initial lack of knowledge about the underlying dynamics, she can be assured that her expected loss due to ignorance grows not only sublinearly over time, but slower than any power of $T$. It is shown in Burnetas and Katehakis [9] that any uniformly fast policy has a strict lower bound of logarithmic asymptotic growth of regret, with a bound on the order coefficient in terms of the unknown transition law $P$ and the known reward structure $R$. Policies that achieve this lower bound are Asymptotically Optimal, cf. Burnetas and Katehakis [9]; see also Cowan and Katehakis [12], Cowan et al. [13], Burnetas and Katehakis [10], and references therein.
It is additionally convenient to define the following notation: with a given policy understood, we denote by $T_x(t)$ the number of times the MDP has been in state $x$ in the first $t$ periods; by $T_{x,a}(t)$ the number of times the MDP has been in state $x$ and action $a$ taken; and by $T_{x,a,y}(t)$ the number of times the MDP has transitioned from $x$ to $y$ under action $a$.
In the next subsection, we consider the case of the controller having complete information (the best possible case) and use this to motivate notation and machinery for the remainder of the paper. The body of the paper is devoted to presenting and discussing three control policies that are either provably asymptotically optimal, or at least appear to be. While no proofs are given, numerical experiments are presented demonstrating the efficacy of these policies.
2.1 The Optimal Policy Under Known Parameters
In this section, we consider the case of complete information, when $R$ and $P$ are known. In this case, it can be shown that there is a deterministic policy, one in which the action taken at any time depends only on the current state, that realizes the maximal long-term average expected reward. Letting $\Pi_D$ be the (finite) set of all such deterministic policies:

(3) $g^* = \max_{\pi \in \Pi_D} \lim_{T \to \infty} \frac{1}{T} V_\pi(x_0, T).$
That there is such an optimal deterministic policy is a classical result, cf. Dynkin and Yushkevich [17].
We may characterize this optimal policy in terms of the solution $(g, h)$ of the following system of optimality equations:

(4) $g + h(x) = \max_{a \in A(x)} \Big\{ r(x, a) + \sum_{y \in S} p(y \mid x, a)\, h(y) \Big\}, \quad x \in S.$
Given the solution $g$ and vector $h$ of the above equations, the asymptotically optimal policy can be characterized as: whenever in state $x$, take any action $a$ for which

(5) $a \in \operatorname*{arg\,max}_{a' \in A(x)} \Big\{ r(x, a') + \sum_{y \in S} p(y \mid x, a')\, h(y) \Big\}.$
We denote the set of such asymptotically optimal actions in state $x$ as $O(x)$. In general, $a^*$ should be taken to denote an action $a^* \in O(x)$.
The solution $g$ above represents the maximal long-term average expected reward. The vector $h$, i.e., $h(x)$ for any $x \in S$, represents in some sense the immediate value of being in state $x$ relative to the long-term average expected reward. The value $h(x)$ essentially encapsulates the future opportunities for value available due to being in state $x$.
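For concreteness, the optimality equations (4) can be solved numerically by relative value iteration. The Python sketch below is one minimal illustration, not the only method; the array layout and the normalization at a reference state 0 are our own illustrative choices:

```python
import numpy as np

def relative_value_iteration(r, P, tol=1e-9, max_iter=100000):
    """Solve g + h(x) = max_a { r(x,a) + sum_y P(y|x,a) h(y) }.

    r: (n_states, n_actions) one-step expected rewards.
    P: (n_states, n_actions, n_states) transition probabilities.
    Returns (g, h, policy), with h normalized so that h[0] = 0.
    """
    n_states = P.shape[0]
    h = np.zeros(n_states)
    g = 0.0
    for _ in range(max_iter):
        Q = r + P @ h            # Q[x, a] = r(x,a) + sum_y P(y|x,a) h(y)
        h_new = Q.max(axis=1)
        g = h_new[0]             # gain estimate, via the reference state 0
        h_new = h_new - g        # keep h[0] pinned at 0
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    Q = r + P @ h
    return g, h, Q.argmax(axis=1)
```

With every transition probability strictly positive, as assumed in the model above, the induced chains are irreducible and aperiodic and the iteration converges.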
It will be convenient in what is to follow to define the following notation:

(6) $L(x, a, q) = r(x, a) + \sum_{y \in S} q_y\, h(y),$ for a probability vector $q$ over $S$.
The function $L$ effectively represents the value of a given action in a given state, for a given transition vector: both the immediate reward, and the expected future value of whatever state the MDP transitions into. The value of an asymptotically optimal action $a^* \in O(x)$ for any state $x$ is thus given by $L(x, a^*, p(\cdot \mid x, a^*))$. It can be shown that the “expected loss” due to an asymptotically suboptimal action, taking action $a$ when the MDP is in state $x$, is effectively in the limit given by

(7) $\Delta(x, a) = \max_{a' \in A(x)} L\big(x, a', p(\cdot \mid x, a')\big) - L\big(x, a, p(\cdot \mid x, a)\big).$
In the general (partial or complete information) case, it is shown in [9] that the regret of a given policy $\pi$ can be expressed asymptotically as

(8) $R_\pi(x_0, T) = \sum_{x \in S} \sum_{a \in A(x)} \Delta(x, a)\, \mathbb{E}_\pi\big[T_{x,a}(T)\big] + O(1).$
Note, the above formula justifies the description of $\Delta(x, a)$ as the “average loss due to suboptimal activation of $a$ in state $x$”. Additionally, from the above it is clear that in the case of complete information, when $P$ is known and therefore the asymptotically optimal actions are computable, the total regret at any time is bounded by a constant. Any expected loss at time $T$ is due only to finite-horizon effects. In general, for the incomplete information case, we have the following bound due to [9]: for any uniformly fast policy $\pi$,

(9) $\liminf_{T \to \infty} \frac{R_\pi(x_0, T)}{\ln T} \ \geq\ \sum_{x \in S} \sum_{a \notin O(x)} \frac{\Delta(x, a)}{K(x, a)},$
where $K(x, a)$ represents the minimal Kullback-Leibler divergence between $p(\cdot \mid x, a)$ and any $q$ such that substituting $q$ for $p(\cdot \mid x, a)$ in the optimality equations renders $a$ the unique optimal action for $x$. Note, the Kullback-Leibler divergence between probability vectors $p$ and $q$ is given by $I(p, q) = \sum_y p_y \ln(p_y / q_y)$. Policies that achieve this lower bound, for all $P$, are referred to as Asymptotically Optimal.
3 The UCB Algorithm for MDPs Under Unknown Transition Distributions
The policy we present here is a simplified version of the UCB-MDP policy developed in Burnetas and Katehakis [9]. In this classical upper-confidence setting, at each time instance estimates of the value of each available action are computed based on the available data, inflated by a certain confidence term (based on the Kullback-Leibler divergence). The more data available on a given action, the tighter the confidence interval, and therefore the less the corresponding estimate is inflated.
At any time $t$, let $x = X_t$ be the current (given) state of the MDP. We construct the following estimators:

Transition Probability Estimators: for each state $x$ and action $a \in A(x)$, construct $\hat{p}_t(\cdot \mid x, a)$ based on the observed transition counts:
(10) $\hat{p}_t(y \mid x, a) = \dfrac{T_{x,a,y}(t) + \beta}{T_{x,a}(t) + |S|\,\beta},$ for a small constant $\beta > 0$. Note, the biasing terms (the $\beta$ in the numerator, the $|S|\beta$ in the denominator) serve to force the estimated transition probabilities away from $0$, and thus our estimates of $p(y \mid x, a)$ will be in $(0, 1)$.

“Good” Action Sets: construct the following subset of the available actions $A(x)$,
(11) $\hat{A}_t(x) = \big\{ a \in A(x) : T_{x,a}(t) \geq f(t) \big\},$ for a slowly growing threshold function $f$ (e.g., $f(t) = \ln^2 t$). The set $\hat{A}_t(x)$ represents the actions available from state $x$ that have been sampled frequently enough that the estimates of the associated transition probabilities should be “good”. In the limit, we expect that suboptimal actions will be taken only logarithmically often, and hence for sufficiently large $t$, $\hat{A}_t(x)$ will contain only actions that are truly optimal. If no actions have been taken sufficiently many times, we take $\hat{A}_t(x) = A(x)$ to prevent it from being empty.

Value Estimates: having constructed these estimators, we compute $\hat{g}_t$ and $\hat{h}_t$ as the solution to the optimality equations in Eq. (4), essentially treating the estimated probabilities as correct and computing the optimal values and policy for the resulting estimated MDP.
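The transition-probability estimate in Eq. (10) can be sketched in a few lines; here the smoothing constant `beta` stands in for the biasing terms:

```python
import numpy as np

def biased_transition_estimate(counts, beta=1.0):
    """counts: length-|S| vector of transition counts T_{x,a,y}(t).

    Adding beta to each count (hence |S|*beta to the total) keeps every
    estimated probability strictly inside (0, 1).
    """
    counts = np.asarray(counts, dtype=float)
    return (counts + beta) / (counts.sum() + beta * counts.size)
```

Even with no observations at all, the estimate is the uniform vector, so the estimated MDP always has an irreducible transition law.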
At this point, we implement the following UCB index-based decision rule: for each action $a \in A(x)$, we compute the following index:

(12) $u_t(x, a) = \sup\Big\{ L(x, a, q) : q \in \Theta,\ T_{x,a}(t)\, I\big(\hat{p}_t(\cdot \mid x, a), q\big) \leq \ln t \Big\},$

where $I(\cdot, \cdot)$ is the Kullback-Leibler divergence and $L$ is computed with the estimated $\hat{h}_t$, and take action

(13) $A_t \in \operatorname*{arg\,max}_{a \in A(x)} u_t(x, a).$
This is a natural extension of several classical KL-divergence based UCB policies for the multi-armed bandit problem, cf. Cowan and Katehakis [12], Burnetas and Katehakis [10], and references therein, taking the view of the $L$ function as the “value” of taking a given action in a given state, estimated with the current data. In Burnetas and Katehakis [9], a modified version of the above policy is in fact shown to be asymptotically optimal. The modification is largely for analytical benefit; however, the pure UCB index policy defined above shows excellent performance, cf. Figure 1. Further discussion of the performance of this policy is given in the Comparison of Performance section.
4 A DMED-Type Algorithm for MDPs Under Uncertain Transitions
In the classical DMED algorithm for multi-armed bandit problems, the decision process proceeds by successively estimating the asymptotically minimal rates with which suboptimal actions must be taken, and then attempting to take actions in such a way as to realize the estimated minimal rates. As applied to MDPs, we have the following relationship from [9]: for any uniformly fast policy $\pi$, for any state $x$ and suboptimal action $a \notin O(x)$,

(14) $\liminf_{t \to \infty} \frac{\mathbb{E}_\pi\big[T_{x,a}(t)\big]}{\ln t} \ \geq\ \frac{1}{K(x, a)},$
where $K(x, a)$ is, as before, the minimal Kullback-Leibler divergence between the true transition probability vector $p(\cdot \mid x, a)$ and any transition probability vector $q$ such that substituting $q$ for $p(\cdot \mid x, a)$ would render action $a$ uniquely optimal for state $x$.
Computing the function $K$ is not easy. We therefore consider the following substitute:

(15) $\tilde{K}(x, a) = \inf\Big\{ I\big(p(\cdot \mid x, a), q\big) : q \in \Theta,\ L(x, a, q) \geq L\big(x, a^*, p(\cdot \mid x, a^*)\big) \Big\}.$

The function $K(x, a)$ measures how far the transition vector associated with $x$ and $a$ must be perturbed (under the KL-divergence) to make $a$ the optimal action for $x$. The function $\tilde{K}(x, a)$ measures how far the transition vector associated with $x$ and $a$ must be perturbed (under the KL-divergence) to make the value of $a$, as measured by the $L$ function, at least the value of an optimal action $a^*$.
In this way, we have the following approximate MDP-DMED algorithm; see Honda and Takemura [24, 25] for a multi-armed bandit version of this policy.
At any time $t$, let $x = X_t$ be the current state; construct the estimators as in the UCB-MDP algorithm of Section 3, $\hat{p}_t(\cdot \mid x, a)$ and $\hat{A}_t(x)$, and utilize these to compute the estimated optimal values $\hat{g}_t$ and $\hat{h}_t$.
Let $\hat{a}^*_t$ be the estimated “best” action to take at time $t$. For each $a \in A(x)$, compute the discrepancies $D_t(x, a) = \ln t / \hat{\tilde{K}}_t(x, a) - T_{x,a}(t)$, where $\hat{\tilde{K}}_t$ is the plug-in estimate of $\tilde{K}$ under the estimated transition probabilities.
If $D_t(x, a) \leq 0$ for all $a \neq \hat{a}^*_t$, take $A_t = \hat{a}^*_t$; otherwise, take $A_t \in \operatorname*{arg\,max}_{a \neq \hat{a}^*_t} D_t(x, a)$.
Following this algorithm, we perpetually reduce the discrepancy between the number of times estimated-suboptimal actions have been taken and the estimated minimal rate at which those actions should be taken. The exchange from $K$ to $\tilde{K}$ sacrifices some performance in pursuit of computational simplicity; however, it also seems clear from computational experiments that DMED-MDP as above is not only computationally tractable, but also produces reasonable performance in terms of achieving small regret, cf. Figure 1. Further discussion of the performance of this policy is given in the Comparison of Performance section.
5 A Posterior Sampling Algorithm for MDPs
In this section we introduce a Posterior Sampling (Thompson-type) policy for MDPs, PS-MDP. This type of policy is also known as Thompson Sampling, or probability matching. The basic idea is to generate estimates of the unknown parameters (the transition probabilities) randomly, according to the posterior distribution of those parameters given the current data. In particular, PS-MDP proceeds in the following way:
At any time $t$, let $x = X_t$ be the current state of the MDP. As in UCB-MDP and DMED-MDP previously, construct the estimators $\hat{g}_t$ and $\hat{h}_t$. In addition, generate the following random vectors.
For each action $a \in A(x)$, let $\big(T_{x,a,1}(t), \ldots, T_{x,a,|S|}(t)\big)$ be the vector of observed transition counts from state $x$ to each state under action $a$. Generate the random vector $Q_t(x, a)$ according to

(16) $Q_t(x, a) \sim \mathrm{Dirichlet}\big(T_{x,a,1}(t) + 1, \ldots, T_{x,a,|S|}(t) + 1\big).$

The $Q_t(x, a)$ are distributed according to the joint posterior distribution of $p(\cdot \mid x, a)$ with a uniform prior.
At this point, define the following values as posterior estimates of the potential value of each action:

(17) $\tilde{u}_t(x, a) = L\big(x, a, Q_t(x, a)\big),$

computed with the estimated $\hat{h}_t$, and take action $A_t \in \operatorname*{arg\,max}_{a \in A(x)} \tilde{u}_t(x, a)$.
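The sampling step in Eqs. (16) and (17) is direct to implement: `numpy.random.Generator.dirichlet` with parameters `counts + 1` draws from the posterior under a uniform prior, and the policy then acts greedily on the sampled values. A minimal sketch, with illustrative array names:

```python
import numpy as np

def ps_mdp_action(counts, rewards, h_hat, rng=None):
    """One PS-MDP decision in the current state x.

    counts:  (n_actions, n_states) transition counts T_{x,a,y}(t).
    rewards: (n_actions,) one-step rewards r(x, a).
    h_hat:   (n_states,) estimated relative values.
    Draws q(.|x,a) ~ Dirichlet(counts[a] + 1) for each action, i.e. the
    posterior under a uniform prior, then acts greedily on the sampled
    values r(x,a) + sum_y q_y h_hat[y].
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = counts.shape[0]
    sampled_values = np.empty(n_actions)
    for a in range(n_actions):
        q = rng.dirichlet(counts[a] + 1.0)   # one posterior draw
        sampled_values[a] = rewards[a] + q @ h_hat
    return int(np.argmax(sampled_values))
```

Because the draws concentrate around the empirical transition frequencies as counts grow, the randomization in the action choice vanishes along frequently sampled state-action pairs.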
6 Comparison of Performance
In this section we discuss the results of our simulation tests of these policies on a small example with 3 states ($x = 1, 2, 3$) and 2 available actions ($a = 1, 2$) in each state. Below we show the transition probabilities, as well as the reward, under each action.
Transition probabilities under action $a = 1$ (rows: current state $x = 1, 2, 3$; columns: next state):

0.04 0.69 0.27
0.88 0.01 0.11
0.02 0.46 0.52

Transition probabilities under action $a = 2$:

0.28 0.68 0.04
0.26 0.33 0.41
0.43 0.35 0.22

Rewards $r(x, a)$ (rows: state $x$; columns: action $a$):

0.13 0.47
0.89 0.18
0.71 0.63
If these transition probabilities were known, the optimal policy for this MDP could be computed directly from the optimality equations (4), fixing the optimal action in each state.
We simulated each policy 100 times over a time horizon of 10,000, and for each time step we computed the mean regret as well as the variance. In Figure 1, we plot the mean regret over time for each policy, [1] PS, [2] UCB, and [3] DMED, along with a confidence interval over all sample paths. We can see that all policies seem to exhibit logarithmic growth of regret. There are a few interesting differences that the plot highlights, at least for these specific parameter values:
DMED-MDP has not only the highest finite-time regret, but also a large variance that seems to increase over time. This seems primarily due to the “epoch”-based nature of the policy, which results in exponentially long periods during which the policy may get trapped taking suboptimal actions, incurring large regret until the true optimal actions are discovered. The benefit of this epoch structure is that once the optimal actions are discovered, they are taken for exponentially long periods, to the exclusion of suboptimal actions.
PS-MDP seems to perform best, exhibiting the lowest finite-time regret as well as the tightest variance. This is largely in agreement with the performance of PS-type policies in other bandit problems, in which they are frequently asymptotically optimal, cf. Agrawal and Goyal [3, 2], Honda and Takemura [26], Kaufmann et al. [32], and references therein.
6.1 Policy Robustness: Inaccurate Priors
How do these policies respond to potentially “unlucky” or nonrepresentative streaks of data? Can these policies be fooled, and what are the resulting costs before they recover?
To test the robustness of these policies with respect to prior information, we “rigged” the first 60 actions and transitions, such that under the estimated transition probabilities the optimal policy would be to activate a suboptimal action in each state. In more detail, let $T_{x,a,y}$ be the number of times we transitioned from state $x$ to state $y$ under action $a$. Then we “rigged” the counts so that they started like so:
Counts under action $a = 1$ (rows: state $x$; columns: next state $y$):

8 1 1
1 1 8
8 1 1

Counts under action $a = 2$:

1 1 8
8 1 1
1 1 8
Under the resulting (bad) estimated transition probabilities, the estimated optimal policy chooses the suboptimal action in each state.
The subsequent performances of the MDP policies are plotted in Figure 2. All policies still appear to have logarithmic growth in regret, suggesting they can all “recover” from the initial bad estimates. It is striking, though, the extent to which the average regrets for DMED-MDP and PS-MDP are affected, increasing dramatically as a result, with PS-MDP demonstrating an increase in variance as well. The UCB-MDP policy, however, seems relatively stable: its average regret has barely increased, and it maintains a small variance. Empirically, this phenomenon appears common for the UCB-MDP policy under other extreme conditions.
Acknowledgments
We acknowledge support for this work from the National Science Foundation, NSF grant CMMI-1662629.
References
 Abbeel and Ng [2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM.
 Agrawal S and Goyal N. [2012] Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multiarmed bandit problem. In Conference on Learning Theory 39–1, Springer.

 Agrawal S and Goyal N. [2013] Agrawal, S. and Goyal, N. (2013). Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, 99–107.
 Alpaydin [2014] Alpaydin, E. (2014). Introduction to machine learning. MIT press.
 Asmussen and Glynn [2007] Asmussen, S. and Glynn, P. W. (2007). Stochastic simulation: algorithms and analysis, volume 57. Springer Science & Business Media.
 Auer et al. [2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235–256.
 Azar et al. [2017] Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449.
 Bertsekas [2019] Bertsekas, D. P. (2019). Reinforcement learning and optimal control. Athena Scientific, Belmont, Massachusetts.

 Burnetas and Katehakis [1997] Burnetas, A. N. and Katehakis, M. N. (1997). Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1), 222–255.
 Burnetas and Katehakis [1996] Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2), 122–142.
 Chang et al. [2006] Chang, M., Chow, S.C., and Pong, A. (2006). Adaptive design in clinical research: issues, opportunities, and recommendations. Journal of biopharmaceutical statistics, 16(3), 299–309.
 Cowan and Katehakis [2019] Cowan W., and M.N. Katehakis (2019). Exploration–exploitation policies with almost sure, arbitrarily slow growing asymptotic regret, DOI=10.1017/S0269964818000529. Probability in the Engineering and Informational Sciences, 1–23.
 Cowan et al. [2018] Cowan W., Honda Y. and M.N. Katehakis (2018). Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem, Journal of Machine Learning Research(JMLR), 18, 1–28.
 Cowan and Katehakis [2015] Cowan W., and M.N. Katehakis (2015). Asymptotically Optimal Sequential Experimentation Under Generalized Ranking. arXiv:1510.02041
 Cowan and Katehakis [2015] Cowan W., and M.N. Katehakis (2015). Multiarmed Bandits under General Depreciation and Commitment, Probability in the Engineering and Informational Sciences, 29 (1), 51–76.
 Dekker et al. [1994] Dekker, R., Hordijk, A., and Spieksma, F. M. (1994). On the relation between recurrence and ergodicity properties in denumerable Markov decision chains. Mathematics of Operations Research, 19, 3.
 Dynkin and Yushkevich [1979] Dynkin, E. and Yushkevich, A. (1979). Controlled Markov processes, volume 235. Springer.
 Ferreira et al. [2018] Ferreira, K. J., SimchiLevi, D., and Wang, H. (2018). Online network revenue management using thompson sampling. Operations research, 66(6), 1586–1602.
 Filippi et al. [2010] Filippi, S., Cappé, O., and Garivier, A. (2010). Optimism in reinforcement learning based on Kullback Leibler divergence. In 48th Annual Allerton Conference on Communication, Control, and Computing.
 Henderson et al. [2018] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep reinforcement learning that matters. In ThirtySecond AAAI Conference on Artificial Intelligence.
 Lerma [2012] Lerma, H.O. (2012). Adaptive Markov control processes, volume 79. Springer Science & Business Media.
 Gittins [1979] Gittins, J. (1979) Bandit processes and dynamic allocation indices (with discussion). J. Roy. Stat. Soc. Ser. B, 41:335–340.
 Gittins et al. [2011] John C. Gittins, Kevin Glazebrook, and Richard R. Weber (2011). Multi-armed Bandit Allocation Indices. John Wiley & Sons, West Sussex, U.K.
 Honda and Takemura [2010] Honda J. and Takemura A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In COLT, 67–79.
 Honda and Takemura [2011] Honda J. and Takemura A. (2011). An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85(3):361–391.
 Honda and Takemura [2013] Honda J. and Takemura A. (2013). Optimality of Thompson sampling for Gaussian bandits depends on priors. arXiv preprint arXiv:1311.1894.
 Jaksch et al. [2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr), 1563–1600.
 Katehakis and Derman [1986] Katehakis, Michael N. and C. Derman (1986). Computing optimal sequential allocation rules in clinical trials. Lecture NotesMonograph Series, 29 – 39.
 Katehakis and Rothblum [1996] Katehakis, Michael N. and Uriel G. Rothblum (1996). Finite state multiarmed bandit problems: Sensitivediscount, averagereward and averageovertaking optimality. The Annals of Applied Probability, 6, 1024–1034.
 Katehakis and Veinott Jr [1987] Katehakis, Michael N. and Arthur F Veinott Jr (1987). The multiarmed bandit problem: decomposition and computation. Math. Oper. Res., 12, 262 – 68.
 Katehakis et al. [2016] Katehakis, M., Smit, L. C., and Spieksma, F. (2016). A comparative analysis of the successive lumping and the lattice path counting algorithms. Journal of Applied Probability, 53(1), 106–120.
 Kaufmann et al. [2012] Kaufmann, E., Korda, N. and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, 199–213. Springer, Berlin, Heidelberg.
 Mahajan and Teneketzis [2008] Mahajan, A. and Teneketzis, D. (2008). Multiarmed bandit problems. In Foundations and Applications of Sensor Management, 121–151. Springer.
 Mohri et al. [2018] Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
 Munos et al. [2016] Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. (2016). Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems, 1054–1062.
 Neu et al. [2010] Neu, G., Antos, A., György, A., and Szepesvári, C. (2010). Online markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, 1804–1812.
 Ortner et al. [2014] Ortner, R., Ryabko, D., Auer, P., and Munos, R. (2014). Regret bounds for restless Markov bandits. Theoretical Computer Science, 558, 62–76.
 Osband and Van Roy [2017] Osband, I. and Van Roy, B. (2017). Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine LearningVolume 70, 2701–2710. JMLR. org.
 Russo and Van Roy [2014] Russo, D. J. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4), 1221–1243.
 Sonin [2008] Sonin, I.M. (2008). A generalized Gittins index for a Markov chain and its recursive calculation. Statistics & Probability Letters, 78, 1526 – 1533.
 Sonin [2011] Sonin, I.M. (2011). Optimal stopping of Markov chains and three abstract optimization problems. Stochastics An International Journal of Probability and Stochastic Processes, 83, 405 – 414.
 Sonin and Steinberg [2016] Sonin, Isaac M and Constantine Steinberg (2016). Continue, quit, restart probability model. Annals of Operations Research, 241, 295–318.
 Sutton and Barto [2018] Sutton, R. and Barto, A. (2018). Reinforcement learning: An introduction. MIT press.
 Szepesvári [2009] Szepesvári, C. (2009). Algorithms for reinforcement learning. Morgan and Claypool.
 Szepesvári [2010] Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1), 1–103.

 Tewari and Bartlett [2008a] Tewari, A. and Bartlett, P. (2008a). Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, 1505–1512.
 Tewari and Bartlett [2008b] Tewari, A. and Bartlett, P. (2008b). Optimistic linear programming gives logarithmic regret for irreducible MDPs. In J.C. Platt, D. Koller, Y. Singer and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, 1505–1512. NIPS, New York.
 Thompson [1933] Thompson W.R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294.
 Villar et al. [2015] Villar, Sofía S, Jack Bowden, and James Wason (2015). Multiarmed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30, 199–215.
 Weber [1992] Weber, R. (1992). On the Gittins index for multiarmed bandits. The Annals of Applied Probability, 2, 1024 – 1033.
 Whittle [1980] Whittle, P. (1980). Multiarmed bandits and the Gittins index. J. R. Statist. Soc. B, 42, 143 – 49.
 Wiering [2018] Wiering, M. (2018). Reinforcement learning: from methods to applications. Nieuw Archief voor Wiskunde, 5(19), 157–167.