1 Introduction
In problems involving sequential decision making under uncertainty, there exist at least two different objectives: a) optimize the online performance, and b) find an optimal behavior. In the context of Multi-Armed Bandits (MAB), these objectives correspond to: a) maximize cumulative reward, and b) identify the best arm. The first objective is the widely studied problem of cumulative regret minimization. The second is a Pure Exploration (PE) problem, where the agent has to efficiently gather information to identify the optimal arm. For Markov Decision Processes (MDPs), the first objective is the classical Reinforcement Learning (RL) problem. PE in MDPs, on the other hand, has not been explored in detail.
Our first contribution is an algorithm called Posterior Sampling for Pure Exploration (PSPE), for PE in fixed-horizon episodic MDPs. We define an objective similar to the notion of simple regret in MABs and analyze its convergence when using PSPE. In PSPE, the agent’s goal is to explore the MDP such that it maximizes the probability of following an optimal policy after some number of episodes (not necessarily known in advance). The following table captures PSPE’s relation to other algorithms.
Setting | Cumulative reward (RL) | Pure exploration (PE)
MAB | TS [5] | PTS [4]
MDP | PSRL [2] | PSPE
Thompson Sampling (TS) [5] is a Bayesian algorithm for maximizing the cumulative rewards received in bandits. The idea is to pull an arm according to its confidence (i.e., its probability of being the optimal arm). It maintains a prior distribution over bandit instances. At each step, it samples an instance of a bandit from the posterior and pulls its optimal arm. Pure exploration Thompson Sampling (PTS) [4] modifies TS by adding a resampling step. TS is not suitable for PE as it pulls the estimated best arm almost all the time, so it takes a very long time to ascertain that none of the other arms offer better rewards. The resampling step prevents pulling the estimated best arm too often and helps in achieving a higher confidence in fewer arm pulls. Posterior Sampling for Reinforcement Learning (PSRL) [2] extends TS to the RL problem on episodic fixed-horizon MDPs. It maintains a prior distribution over MDPs. At the beginning of each episode, it samples an MDP instance from the posterior, finds its optimal policy using dynamic programming and acts according to this policy for the duration of the episode. It updates the posterior with the rewards and transitions witnessed. For PE in MDPs, we propose PSPE, which adds a resampling step to PSRL.
In reality, however, agents may have a different objective: optimize online performance after a period of exploration during which regret is not counted. For instance, consider a student preparing for a series of tests. She would typically take a few practice tests to find out which areas she needs to improve upon. Based on the scores she obtains in these practice tests, she would formulate a strategy for maximizing her scores in the actual tests. Another example is a robot in a robotics competition. It typically has a few practice rounds before the evaluation rounds. In the practice rounds, the robot can freely explore the environment so as to maximize its score in the evaluation rounds.
For this new objective, we claim that the best strategy is to use PSPE during practice and switch to PSRL during evaluation. This is our second contribution. At the end of the practice phase, PSPE maximizes the probability of following an optimal policy. It essentially initializes the priors of PSRL so that they are concentrated near the true MDP. PSRL can thus leverage these priors to obtain near-optimal rewards.
2 Episodic Fixed Horizon MDP
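An episodic fixed-horizon MDP $M$ is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, R, P, H, \rho \rangle$. Here $\mathcal{S}$ and $\mathcal{A}$ are finite sets of states and actions respectively. The agent interacts with the MDP in episodes of length $H$. The initial state distribution is given by $\rho$. In each step $h = 1, \dots, H$ of an episode, the agent observes a state $s_h$ and performs an action $a_h$. It receives a reward $r_h$ sampled from the reward distribution $R(s_h, a_h)$ and transitions to a new state $s_{h+1}$ sampled from the transition probability distribution $P(s_h, a_h)$. The average reward received for a particular state-action pair is $\bar{R}(s, a) = \mathbb{E}[r \mid r \sim R(s, a)]$.
For fixed-horizon MDPs, a policy $\pi$ is a mapping from $\mathcal{S}$ and $\{1, \dots, H\}$ to $\mathcal{A}$. The value of a state $s$ and action $a$ at step $h$ under a policy $\pi$ is: $Q_\pi(s, a, h) = \mathbb{E}\left[\bar{R}(s, a) + \sum_{i=h+1}^{H} \bar{R}(s_i, \pi(s_i, i))\right]$. Let $V_\pi(s, h) = Q_\pi(s, \pi(s, h), h)$. A policy $\pi^*$ is an optimal policy for the MDP if $\pi^* \in \arg\max_\pi V_\pi(s, h)$ for all $s \in \mathcal{S}$ and $h = 1, \dots, H$. Let the set of optimal policies be $\Pi^*$. For an MDP $M$, let $\Pi_M$ be the set of optimal policies.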
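The optimal values and an optimal policy of a known fixed-horizon MDP can be computed by backward induction over the steps of an episode. The following is a minimal sketch, not the authors' implementation. The array layout (`P` of shape states x actions x states, `R_bar` of shape states x actions) and the function name are assumptions made for illustration, and ties between equally good actions are broken by `argmax`, so the sketch returns one optimal policy rather than the whole set of optimal policies.

```python
import numpy as np

def backward_induction(P, R_bar, H):
    """Solve a known fixed-horizon MDP by dynamic programming.

    P:     array (S, A, S), P[s, a, s2] = probability of moving from s to s2 under a
    R_bar: array (S, A), mean reward of each state-action pair
    H:     episode length
    Returns Q of shape (H, S, A) and a greedy policy of shape (H, S).
    Steps are 0-indexed here, so Q[0] corresponds to the first step of an episode.
    """
    S, A = R_bar.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)                # no reward is collected after the last step
    for h in range(H - 1, -1, -1):      # sweep from the last step back to the first
        Q[h] = R_bar + P @ V_next       # (S, A, S) @ (S,) -> (S, A)
        V_next = Q[h].max(axis=1)       # optimal value of each state at step h
    policy = Q.argmax(axis=2)           # one optimal action per (step, state)
    return Q, policy
```

Because the policy is indexed by both step and state, the same state can be assigned different actions early and late in an episode, which is what distinguishes the fixed-horizon setting from the stationary one.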
3 Posterior Sampling for Reinforcement Learning
Consider an MDP with $S$ states, $A$ actions and horizon length $H$. PSRL maintains a prior distribution on the set of MDPs $\mathcal{M}$, i.e., on the reward distribution $R$ (on $SA$ variables) and the transition probability distribution $P$ (on $S^2 A$ variables). At the beginning of each episode $t$, an MDP $M_t$ is sampled from the current posterior. Let $P_t$ and $R_t$ be the transition and reward distributions of $M_t$. The set of optimal policies $\Pi_{M_t}$ for this MDP can be found using dynamic programming, as $P_t$ and $R_t$ are known. The agent samples a policy $\pi_t$ from $\Pi_{M_t}$ and follows it for $H$ steps. The rewards and transitions witnessed during this episode are used to update the posteriors. Let $f$ be the prior density over the MDPs and $\mathcal{H}_t$ be the history of episodes seen until $t-1$. Let $s_{h,t}$ be the state observed, $a_{h,t}$ be the action performed and $r_{h,t}$ be the reward received at time $h$ in episode $t$.
Like TS, PSRL maintains a prior distribution over the model, in this case an MDP. At each episode, it samples a model from the posterior and acts greedily according to the sample. TS selects arms according to their posterior probability of being optimal, and PSRL selects policies according to their posterior probability of being optimal. The posterior can be computed and sampled from efficiently by a proper choice of conjugate prior distributions or by the use of Markov Chain Monte Carlo methods.
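As an illustration of the conjugate-prior route, the sketch below keeps a Dirichlet posterior over each row of the transition matrix and a Gaussian posterior over each mean reward, assuming a unit-variance Gaussian likelihood. The class name, the method names and the prior parameters `mu0` and `tau0` (prior mean and prior precision) are hypothetical choices for this example and are not taken from the text.

```python
import numpy as np

class TabularPosterior:
    """Conjugate posterior over a tabular MDP (illustrative sketch).

    Transitions: Dirichlet prior with a categorical likelihood.
    Rewards:     Gaussian prior N(mu0, 1/tau0) with a unit-variance Gaussian likelihood.
    """

    def __init__(self, S, A, mu0=0.0, tau0=1.0):
        self.trans_counts = np.ones((S, A, S))   # uniform Dirichlet(1, ..., 1) prior
        self.reward_sum = np.zeros((S, A))
        self.reward_n = np.zeros((S, A))
        self.mu0, self.tau0 = mu0, tau0          # assumed prior mean and precision

    def update(self, s, a, r, s_next):
        """Fold one observed transition (s, a, r, s_next) into the posterior."""
        self.trans_counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r
        self.reward_n[s, a] += 1

    def reward_posterior(self):
        """Posterior mean and precision of every mean reward."""
        prec = self.tau0 + self.reward_n
        mean = (self.tau0 * self.mu0 + self.reward_sum) / prec
        return mean, prec

    def sample(self, rng):
        """Draw one MDP (transition tensor and mean-reward matrix) from the posterior."""
        S, A, _ = self.trans_counts.shape
        P = np.empty((S, A, S))
        for s in range(S):
            for a in range(A):
                P[s, a] = rng.dirichlet(self.trans_counts[s, a])
        mean, prec = self.reward_posterior()
        R_bar = rng.normal(mean, 1.0 / np.sqrt(prec))
        return P, R_bar
```

Both updates are simple count and sum increments, so the cost of maintaining the posterior is negligible next to the cost of solving each sampled MDP.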
4 Posterior Sampling for Pure Exploration
PSRL is not suitable for PE because, after a certain point, it almost certainly follows the optimal policy and spends little effort refining its knowledge of other policies. PSPE modifies PSRL by adding a resampling step, which extends the Top-Two sampling idea of PTS to PSRL. This prevents it from following an estimated optimal policy too frequently.
The algorithm depends on a parameter $\beta$, where $0 < \beta < 1$, which controls how often an optimal policy of the sampled MDP is followed. At each episode $t$, PSPE samples an MDP $M_t$ and finds its set of optimal policies $\Pi_{M_t}$. With probability $\beta$ it follows a policy from this set. With probability $1 - \beta$ it resamples MDPs until a different set of policies $\Pi_{\tilde{M}_t}$ is obtained. It then follows a policy from the set $\Pi_{\tilde{M}_t}$ for $H$ steps. In the case of bandits, PSPE is equivalent to PTS. PSRL is the same as PSPE with $\beta = 1$.
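The selection rule for one episode can be sketched as follows. This is an illustrative simplification rather than the authors' code: the parameter is named `beta` here, `sample_policy` is a hypothetical callable that draws an MDP from the posterior and returns one of its optimal policies in a hashable form, and a single policy per sample is compared instead of the full set of optimal policies. The `max_resamples` cap is an added practical safeguard for the case where the posterior has nearly collapsed onto one policy.

```python
import random

def pspe_select(sample_policy, beta, rng, max_resamples=1000):
    """Choose the policy to follow in one episode.

    sample_policy: callable rng -> hashable policy (optimal for a posterior sample)
    beta:          probability of following the first sampled policy
    rng:           random.Random instance
    """
    first = sample_policy(rng)
    if rng.random() < beta:
        return first                      # behave exactly like PSRL this episode
    for _ in range(max_resamples):        # resampling step
        challenger = sample_policy(rng)
        if challenger != first:
            return challenger             # follow the first differing policy
    return first                          # fallback: no differing policy was drawn
```

A call such as `pspe_select(sampler, 0.5, random.Random(0))` would follow the first sampled policy about half of the time. Setting `beta` to 1 disables the resampling step entirely.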
5 Analysis
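Let $x_\pi(M) = |\Pi_M \cap \{\pi\}| / |\Pi_M|$. The confidence of policy $\pi$ after episode $t$ is: $\alpha(\pi, t) = \int_{M \in \mathcal{M}} x_\pi(M) f(M \mid \mathcal{H}_{t+1}) \, dM$. The mean episodic reward of a policy $\pi$ is: $\mu(\pi) = \sum_{s \in \mathcal{S}} \rho(s) V_\pi(s, 1)$. Let $\mu^* = \max_\pi \mu(\pi)$. The gap of a policy is $\Delta(\pi) = \mu^* - \mu(\pi)$ for $\pi \notin \Pi^*$. The simple regret after episode $t$ is: $r_t = \mu^* - \sum_\pi \alpha(\pi, t) \mu(\pi)$. Let $\Theta(t)$ be the confidence of the suboptimal policies: $\Theta(t) = \sum_{\pi \notin \Pi^*} \alpha(\pi, t)$. Since the confidences sum to one and $\mu(\pi) = \mu^*$ for every $\pi \in \Pi^*$, we rewrite $r_t$ as:
$$r_t = \mu^* \Big(1 - \sum_{\pi \in \Pi^*} \alpha(\pi, t)\Big) - \sum_{\pi \notin \Pi^*} \alpha(\pi, t) \mu(\pi) = \sum_{\pi \notin \Pi^*} \alpha(\pi, t) \Delta(\pi)$$
Upper and lower bounds for $r_t$ can be expressed in terms of $\Theta(t)$ and $\Delta(\pi)$:
$$\Theta(t) \min_{\pi \notin \Pi^*} \Delta(\pi) \le r_t \le \Theta(t) \max_{\pi \notin \Pi^*} \Delta(\pi)$$
$r_t$ is bounded above and below by $\Theta(t)$ asymptotically, up to constant factors. The convergence of $\Theta(t)$ dictates the convergence of $r_t$.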
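The quantities in this section can be approximated numerically from posterior samples. The sketch below estimates policy confidences as sample frequencies and then evaluates the simple regret together with its lower and upper bounds. The function names and the dictionary layout are assumptions for illustration, the mean episodic reward of each policy is taken as given, and a policy is treated as suboptimal whenever its mean episodic reward is below the maximum.

```python
from collections import Counter

def estimate_confidence(sampled_policies):
    """Confidence of each policy as its frequency among posterior samples."""
    counts = Counter(sampled_policies)
    n = len(sampled_policies)
    return {p: c / n for p, c in counts.items()}

def simple_regret(alpha, mu):
    """Best mean episodic reward minus the confidence-weighted mean reward."""
    mu_star = max(mu.values())
    return mu_star - sum(alpha[p] * mu[p] for p in alpha)

def regret_bounds(alpha, mu):
    """Lower and upper bounds: suboptimal confidence times the min and max gap.

    Assumes at least one policy in mu is suboptimal.
    """
    mu_star = max(mu.values())
    gaps = {p: mu_star - mu[p] for p in mu if mu[p] < mu_star}
    theta = sum(alpha.get(p, 0.0) for p in gaps)   # confidence of suboptimal policies
    return theta * min(gaps.values()), theta * max(gaps.values())
```

For example, with confidences 0.7, 0.2 and 0.1 on policies whose mean rewards are 1.0, 0.6 and 0.2, the simple regret is 0.16, which lies between the bounds 0.12 and 0.24.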
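We use results from the analysis of PTS [4] about the convergence of $\Theta(t)$.
There exist constants $\{\Gamma^*_\beta > 0 : \beta \in (0, 1)\}$ such that $\Gamma^* = \max_\beta \Gamma^*_\beta$ exists, $\beta^* = \arg\max_\beta \Gamma^*_\beta$ is unique for a given MDP and the following hold with probability $1$:
Under PSPE with parameter $\beta^*$, $\lim_{t \to \infty} -\frac{1}{t} \log \Theta(t) = \Gamma^*$. Under any algorithm, $\limsup_{t \to \infty} -\frac{1}{t} \log \Theta(t) \le \Gamma^*$.
Under PSPE with parameter $\beta \in (0, 1)$, $\lim_{t \to \infty} -\frac{1}{t} \log \Theta(t) = \Gamma^*_\beta$.
$\Gamma^* \le 2 \Gamma^*_{1/2}$ and $\frac{\Gamma^*}{\Gamma^*_\beta} \le \max\left\{\frac{\beta^*}{\beta}, \frac{1 - \beta^*}{1 - \beta}\right\}$.
$\Theta(t)$, the confidence of the suboptimal policies, cannot converge faster than $\exp(-t \Gamma^*)$. When using PSPE with parameter $\beta$, $\Theta(t)$ converges at a rate of $\exp(-t \Gamma^*_\beta)$ in the limit. When $\beta = \beta^*$, this rate of convergence is $\exp(-t \Gamma^*)$, which is optimal. When $\beta$ is close to $\beta^*$, $\Gamma^*_\beta$ is close to $\Gamma^*$. In particular, the choice of $\beta = 1/2$ is robust, as $\Gamma^*_{1/2}$ is at least half of $\Gamma^*$ for any MDP.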
6 Experiments
We compare the performance of PSPE with different values of $\beta$ and random exploration. To ease the procedure of computing posterior distributions and sampling MDPs from the posterior, we use suitable conjugate prior distributions. For the transition probabilities, we use a uniform Dirichlet prior and a categorical likelihood, and for the reward distribution, we use a Gaussian prior and a Gaussian likelihood with unit variance. We calculate the simple regret by sampling 1000 independent MDPs from the posterior and approximating the confidence $\alpha(\pi, t)$ of each policy using sample means. All the results are averaged across 50 trials.
Stochastic chains (Figure 1) are a family of MDPs which consist of a long chain of $N$ states. At each step, the agent can choose to go left or right. The left actions (indicated by thick lines) are deterministic, but the right actions (indicated by dotted lines) result in going right with probability $1 - 1/N$ or going left with probability $1/N$. The only two rewards in this MDP are obtained by choosing left in state $s_1$ and choosing right in state $s_N$. These rewards are drawn from a normal distribution with unit variance. Each episode is of length $H = N$. The agent begins each episode at state $s_1$. The optimal policy is to go right at every step to receive an expected reward of $(1 - 1/N)^{N-1}$. For the RL problem on these MDPs, dithering strategies like $\epsilon$-greedy or Boltzmann exploration are highly inefficient and could lead to regret that grows exponentially in chain length.
We consider a stochastic chain of length 10. The total number of deterministic policies for this MDP is $2^{10 \times 10}$. We plot the simple regret of PSPE for several values of $\beta$, and of random exploration, in Figure 2. For this MDP, $\beta^*$ appears to be close to the value of $\beta$ at which the simple regret converges fastest. As values of $\beta$ closer to $\beta^*$ have a faster rate of convergence, two of the remaining values of $\beta$ converge at a similar rate. For PSRL, which is $\beta = 1$, the convergence is much slower. Random exploration, however, is highly inefficient. This is because PSPE is able to achieve “Deep Exploration” [osband2016deep] whereas random exploration is not. Deep Exploration means that the algorithm selects actions which are oriented towards positioning the agent to gain useful information further down in the episode.
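A stochastic chain can be written down as a pair of arrays, which makes the expected reward of the always-right policy easy to check. The sketch below is built on stated assumptions rather than on values confirmed by this text: the right action succeeds with probability 1 - 1/N, the mean rewards are 0 for going left in the first state and 1 for going right in the last state (as in the chain of [3]), and the agent stays in place when it would otherwise step off either end. The function names are hypothetical.

```python
import numpy as np

LEFT, RIGHT = 0, 1

def stochastic_chain(N):
    """Transition tensor (N, 2, N) and mean-reward matrix (N, 2) of a chain MDP."""
    P = np.zeros((N, 2, N))
    R_bar = np.zeros((N, 2))                 # mean reward of LEFT in state 0 stays 0
    p_right = 1.0 - 1.0 / N                  # assumed success probability of RIGHT
    for s in range(N):
        left, right = max(s - 1, 0), min(s + 1, N - 1)
        P[s, LEFT, left] = 1.0               # LEFT is deterministic
        P[s, RIGHT, right] += p_right        # RIGHT succeeds ...
        P[s, RIGHT, left] += 1.0 - p_right   # ... or slips to the left
    R_bar[N - 1, RIGHT] = 1.0                # assumed mean reward at the far end
    return P, R_bar

def value_of_always_right(P, R_bar, H):
    """Expected episodic reward of choosing RIGHT at every step from state 0."""
    d = np.zeros(P.shape[0])
    d[0] = 1.0                               # state distribution at the first step
    total = 0.0
    for _ in range(H):
        total += d @ R_bar[:, RIGHT]         # expected reward collected this step
        d = d @ P[:, RIGHT, :]               # propagate the state distribution
    return total
```

Under these assumptions, a chain with N = 10 and H = 10 pays off only if the first nine right moves all succeed, so the expected reward of the always-right policy is 0.9 raised to the ninth power, roughly 0.387.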
7 Reinforcement Learning with Practice
In this section, we try to answer the question: “When does it make sense for the agent to use PSPE?”. Consider the following situation: The agent’s goal is to maximize the cumulative reward, but the rewards are accumulated only from the $T$th episode. The rewards obtained during episodes $1$ to $T-1$ are not used to evaluate the agent’s performance.
The first $T-1$ episodes can be considered as practice, where the agent gathers information so that it obtains near-optimal rewards from episode $T$. The agent may not know $T$ in advance. It will be told at the beginning of the $T$th episode that its performance is being evaluated. It is not entirely apparent which strategy the agent should use during practice. The agent could ignore the fact that the rewards accumulated during practice do not matter and always use a reward-maximizing strategy such as PSRL. We argue that the best strategy is to use PSPE during practice and switch to PSRL during evaluation. Intuitively, having lower simple regret after practice should result in lower cumulative regret during evaluation. Since PSPE with parameter $\beta$ reaches a lower simple regret faster than PSRL, an optimal policy of a sampled MDP will be an optimal policy of the true MDP with high probability. Hence, we claim that lower regret is incurred by PSRL in the evaluation when PSPE is used during practice.
As before, we consider a stochastic chain of length 10. We let the practice phase last for a range of lengths, increasing in steps of 10 episodes. During practice, the agents use PSPE. After practice, the agents use PSRL for 1000 episodes. The cumulative regret of these agents after the 1000 episodes is plotted against the simple regret at the end of practice. The simple and cumulative regrets are highly correlated, as shown in Figure 3.
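The practice-then-evaluation protocol is easiest to illustrate in the bandit special case, where PSPE reduces to PTS and PSRL reduces to TS. The sketch below is a toy illustration and not the experiment reported here: it assumes Gaussian arms with unit variance and an N(0, 1) prior on each mean, the function names are hypothetical, and the cap on resampling is an added safeguard. Regret is accumulated only once evaluation starts.

```python
import numpy as np

def sample_best_arm(sums, counts, rng):
    """Draw one bandit from the posterior and return its best arm."""
    prec = 1.0 + counts                       # assumed N(0, 1) prior, unit-variance rewards
    theta = rng.normal(sums / prec, 1.0 / np.sqrt(prec))
    return int(np.argmax(theta))

def choose_arm(sums, counts, beta, rng, max_resamples=100):
    """With probability beta play the sampled best arm, otherwise resample a challenger."""
    first = sample_best_arm(sums, counts, rng)
    if rng.random() < beta:
        return first
    for _ in range(max_resamples):
        other = sample_best_arm(sums, counts, rng)
        if other != first:
            return other
    return first

def practice_then_evaluate(means, n_practice, n_eval, beta, seed=0):
    """Use resampling with parameter beta in practice, then beta = 1 in evaluation.

    Returns the cumulative regret incurred during the evaluation phase only.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    sums, counts = np.zeros(len(means)), np.zeros(len(means))
    regret = 0.0
    for t in range(n_practice + n_eval):
        evaluating = t >= n_practice
        arm = choose_arm(sums, counts, 1.0 if evaluating else beta, rng)
        sums[arm] += rng.normal(means[arm], 1.0)
        counts[arm] += 1
        if evaluating:
            regret += means.max() - means[arm]   # practice pulls are never charged
    return regret
```

Calling `practice_then_evaluate(means, n_practice, n_eval, beta=1.0)` gives the baseline that ignores the practice phase and maximizes reward throughout, so the two strategies can be compared by averaging the returned regret over many seeds.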
8 Conclusion and Future Work
In this paper, we present PSPE, a Bayesian algorithm for the pure exploration problem in episodic fixed-horizon MDPs. PSPE combines the Top-Two sampling procedure of PTS with PSRL. We define a notion of simple regret and show that it converges at an optimal exponential rate when using PSPE. Using stochastic chain MDPs, we compare the convergence of simple regret for PSPE with various values of the parameter $\beta$. We also define the practical problem of Reinforcement Learning with practice. We empirically show that a combination of PSPE and PSRL can offer a feasible solution for this problem. We intend to further explore the problem of RL with practice and provide theoretical guarantees in the case of bandits and MDPs.
PSPE requires solving MDPs through dynamic programming at each step. An alternative approach, which avoids solving sampled MDPs, is value function sampling [1]. Using value function sampling approaches to achieve pure exploration remains an open research direction.
References

[1] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In AAAI Conference on Artificial Intelligence, pages 761–768, 1998.
[2] I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
[3] I. Osband and B. Van Roy. Why is posterior sampling better than optimism for reinforcement learning. arXiv preprint arXiv:1607.00215, 2016.
[4] D. Russo. Simple Bayesian algorithms for best arm identification. In Twenty-Ninth Annual Conference on Learning Theory, pages 1417–1418, 2016.
[5] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.