Reinforcement learning (RL) (Sutton and Barto, 1998) studies the problem of learning in sequential decision-making problems where the dynamics of the environment are unknown, but can be learned by performing actions and observing their outcomes in an online fashion. A sample-efficient RL agent must trade off the exploration needed to collect information about the environment against the exploitation of the experience gathered so far to gain as much reward as possible. In this paper, we focus on the regret framework in infinite-horizon average-reward problems (Jaksch et al., 2010), where the exploration-exploitation performance is evaluated by comparing the rewards accumulated by the learning agent and by an optimal policy. Jaksch et al. (2010) showed that it is possible to efficiently solve the exploration-exploitation dilemma using the optimism in the face of uncertainty
(OFU) principle. OFU methods build confidence intervals on the dynamics and reward (i.e., construct a set of plausible MDPs), and execute the optimal policy of the “best” MDP in the confidence region (e.g., Jaksch et al., 2010; Bartlett and Tewari, 2009; Fruit et al., 2017; Talebi and Maillard, 2018; Fruit et al., 2018). An alternative approach is posterior sampling (PS) (Thompson, 1933), which maintains a posterior distribution over MDPs and, at each step, samples an MDP and executes the corresponding optimal policy (e.g., Osband et al., 2013; Abbasi-Yadkori and Szepesvári, 2015; Osband and Roy, 2017; Ouyang et al., 2017a; Agrawal and Jia, 2017).
Weakly-communicating MDPs and misspecified states. One of the main limitations of UCRL (Jaksch et al., 2010) and optimistic PSRL (Agrawal and Jia, 2017) is that they require the MDP to be communicating, so that its diameter (i.e., the length of the longest path among all shortest paths between any pair of states) is finite. While assuming that all states are reachable may seem a reasonable assumption, it is rarely verified in practice. In fact, it requires a designer to carefully define a state space that contains all reachable states (otherwise it may not be possible to learn the optimal policy) but excludes unreachable states (otherwise the resulting MDP would be non-communicating). This requires a considerable amount of prior knowledge about the environment. Consider a problem where we learn from images, e.g., the Atari Breakout game (Mnih et al., 2015). The state space is the set of “plausible” configurations of the brick wall, ball and paddle positions. The situation in which the wall has a hole in the middle is a valid state (e.g., as an initial state), but it cannot be observed/reached starting from a dense wall (see Fig. 0(a)). As such, it should be removed to obtain a “well-designed” state space. While it may be possible to design a suitable set of “reachable” states that define a communicating MDP, this is often a difficult and tedious task, sometimes even impossible. Now consider a continuous domain, e.g., the Mountain Car problem (Moore, 1990). The state is described by the position and velocity of the car along the $x$-axis. The state space of this domain is usually defined as the Cartesian product of the position and velocity ranges. Unfortunately, this set contains configurations that are not physically reachable, as shown in Fig. 0(b). The dynamics of the system is constrained by the evolution equations: the car cannot go arbitrarily fast. At the leftmost position, the speed cannot exceed a small value, since that position can only be reached with non-positive velocity.
To have a higher velocity, the car would need to acquire momentum from further left, which is impossible by design (the position is already at the left boundary of its domain). The maximal reachable speed at each position can be attained by applying the maximum acceleration at every time step starting from the leftmost position. This identifies the curve reported in Fig. 0(b), which denotes the boundary of the unreachable region. Note that other states may not be reachable either. Whenever the state space is misspecified or the MDP is weakly communicating (i.e., its diameter is infinite), OFU-based algorithms (e.g., UCRL
) optimistically attribute large reward and non-zero probability to reaching states that have never been observed, and thus they tend to repeatedly attempt to explore unreachable states. This results in poor performance and linear regret. A first attempt to overcome this major limitation is Regal.C (Bartlett and Tewari, 2009) (Fruit et al. (2018) recently proposed SCAL, an implementable efficient version of Regal.C), which requires prior knowledge of an upper bound on the span (i.e., range) of the optimal bias function $h^*$. The optimism of UCRL is then “constrained” to policies whose bias has span smaller than this bound. This implicitly “removes” non-reachable states, whose large optimistic reward would cause the span to become too large. Unfortunately, an accurate knowledge of the bias span may not be easier to obtain than designing a well-specified state space. Bartlett and Tewari (2009) proposed an alternative algorithm – Regal.D – that leverages the doubling trick (Auer et al., 1995) to avoid any prior knowledge on the span. Nonetheless, we recently noticed a major flaw in the proof of (Bartlett and Tewari, 2009, Theorem 3) that questions the validity of the algorithm (see App. A for further details). PS-based algorithms also suffer from similar issues (we notice that the problem of weakly-communicating MDPs and misspecified states does not arise in the more restrictive setting of finite horizon (e.g., Osband et al., 2013), since exploration is directly tailored to the states that are reachable within the known horizon, nor under the assumption of the existence of a recurrent state (e.g., Gopalan and Mannor, 2015)). To the best of our knowledge, the only regret guarantees available in the literature for this setting are (Abbasi-Yadkori and Szepesvári, 2015; Ouyang et al., 2017b; Theocharous et al., 2017). However, the counter-example of Osband and Roy (2016) seems to invalidate the result of Abbasi-Yadkori and Szepesvári (2015). On the other hand, Ouyang et al.
(2017b) and Theocharous et al. (2017) present PS algorithms with expected Bayesian regret scaling linearly with an upper bound $H$ on the optimal bias spans of all the MDPs that can be drawn from the prior distribution ((Ouyang et al., 2017b, Asm. 1) and (Theocharous et al., 2017, Sec. 5)). In (Ouyang et al., 2017b, Remark 1), the authors claim that their algorithm does not require the knowledge of $H$ to derive the regret bound. However, in App. B we show on a very simple example that for most continuous prior distributions (e.g., uninformative priors like the Dirichlet), it is very likely that no finite upper bound $H$ exists, implying that the regret bound may not hold (similarly for (Theocharous et al., 2017)). As a result, similarly to Regal.C, the prior distribution should encode prior knowledge on the bias span to avoid poor performance.
In this paper, we present TUCRL, an algorithm designed to trade off exploration and exploitation in weakly-communicating and multi-chain MDPs (e.g., MDPs with misspecified states) without any prior knowledge and under the only assumption that the agent starts from a state in a communicating subset of the MDP (Sec. 3). In communicating MDPs, TUCRL eventually (after a finite number of steps) performs as UCRL, thus achieving problem-dependent logarithmic regret. When the true MDP is weakly communicating, we prove that TUCRL achieves a regret growing as the square root of the horizon, with polynomial dependency on the MDP parameters. We also show that it is not possible to design an algorithm achieving logarithmic regret in weakly-communicating MDPs without incurring an exponential dependence on the MDP parameters (see Sec. 5). TUCRL is the first computationally tractable algorithm in the OFU literature that is able to adapt to the nature of the MDP without any prior knowledge. The theoretical findings are supported by experiments on several domains (see Sec. 4).
We consider a finite weakly-communicating Markov decision process (Puterman, 1994, Sec. 8.3) $M = (\mathcal S, \mathcal A, r, p)$ with a set of states $\mathcal S$ and a set of actions $\mathcal A$. Each state-action pair $(s,a)$ is characterized by a reward distribution with mean $r(s,a)$ and bounded support, as well as a transition probability distribution $p(\cdot \mid s,a)$ over next states. In a weakly-communicating MDP, the state space can be partitioned into two subspaces (Puterman, 1994, Section 8.3.1): a communicating set of states (denoted $S^{\text{C}}$ in the rest of the paper), with each state in $S^{\text{C}}$ accessible –with non-zero probability– from any other state in $S^{\text{C}}$ under some stationary deterministic policy, and a –possibly empty– set of states that are transient under all policies (denoted $S^{\text{T}}$). We also denote by $S = |\mathcal S|$ and $A = |\mathcal A|$ the number of states and actions, and by $\Gamma$ the maximum support of the transition probabilities $p(\cdot \mid s,a)$. The sets $S^{\text{C}}$ and $S^{\text{T}}$ form a partition of $\mathcal S$, i.e., $S^{\text{C}} \cup S^{\text{T}} = \mathcal S$ and $S^{\text{C}} \cap S^{\text{T}} = \emptyset$. A deterministic policy $\pi: \mathcal S \to \mathcal A$ maps states to actions and has an associated long-term average reward (or gain) and a bias function defined as
$$g^\pi(s) := \lim_{T\to\infty} \frac{1}{T}\,\mathbb E^\pi_s\Big[\sum_{t=1}^{T} r(s_t, \pi(s_t))\Big], \qquad h^\pi(s) := \underset{T\to\infty}{\text{C-lim}}\ \mathbb E^\pi_s\Big[\sum_{t=1}^{T} \big(r(s_t,\pi(s_t)) - g^\pi(s_t)\big)\Big],$$
where the bias $h^\pi(s)$ measures the expected total difference between the rewards accumulated by $\pi$ starting from $s$ and the stationary reward, in Cesaro limit (denoted C-lim; for policies whose associated Markov chain is aperiodic, the standard limit exists). Accordingly, the difference of bias values $h^\pi(s) - h^\pi(s')$ quantifies the (dis-)advantage of starting in state $s$ rather than $s'$. In the following, we drop the dependency on $\pi$ whenever clear from the context and denote by $sp(h^\pi) := \max_s h^\pi(s) - \min_s h^\pi(s)$ the span of the bias function. In weakly communicating MDPs, any optimal policy $\pi^*$ has constant gain, i.e., $g^{\pi^*}(s) = g^*$ for all $s \in \mathcal S$. Finally, we denote by $D$, resp. $D^{\text{C}}$, the diameter of $M$, resp. the diameter of the communicating part of $M$ (i.e., $M$ restricted to the set $S^{\text{C}}$):
$$D := \max_{(s,s') \in \mathcal S \times \mathcal S,\, s \neq s'} \tau(s \to s'), \qquad D^{\text{C}} := \max_{(s,s') \in S^{\text{C}} \times S^{\text{C}},\, s \neq s'} \tau(s \to s'),$$
where $\tau(s \to s') := \min_\pi \mathbb E^\pi[\text{number of steps to reach } s' \text{ from } s]$ is the expected time of the shortest path from $s$ to $s'$ in $M$.
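As an illustration, when the transitions happen to be deterministic, the expected shortest-path times $\tau(s \to s')$ reduce to graph distances, so the diameter (and its infiniteness in the weakly-communicating case) can be computed by breadth-first search. The sketch below relies on this simplifying assumption; the function name and input encoding are ours.

```python
from collections import deque

def diameter_deterministic(num_states, transitions):
    """Diameter of an MDP with deterministic transitions.

    `transitions[s]` lists the states reachable from `s` in one step
    (one entry per action).  For deterministic dynamics the expected
    shortest-path time is just the graph distance, so the diameter is
    the largest all-pairs shortest-path length; it is infinite as soon
    as some state cannot reach another.
    """
    diam = 0
    for source in range(num_states):
        # Breadth-first search from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            s = queue.popleft()
            for s_next in transitions[s]:
                if s_next not in dist:
                    dist[s_next] = dist[s] + 1
                    queue.append(s_next)
        if len(dist) < num_states:
            return float("inf")  # some state is unreachable: D is infinite
        diam = max(diam, max(dist.values()))
    return diam
```

For instance, a three-state cycle has diameter 2, while adding an absorbing state that cannot reach the others makes the diameter infinite.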
Learning problem. Let $M^*$ be the true (unknown) weakly-communicating MDP. We consider the learning problem where $\mathcal S$ and $\mathcal A$ are known, while the sets $S^{\text{C}}$ and $S^{\text{T}}$, the rewards $r$ and the transition probabilities $p$ are unknown and need to be estimated online. We evaluate the performance of a learning algorithm $\mathfrak A$ after $T$ time steps by its cumulative regret $\Delta(\mathfrak A, T) := T g^* - \sum_{t=1}^T r_t(s_t, a_t)$. Furthermore, we state the following assumption.
Assumption 1. The initial state $s_1$ belongs to the communicating set of states $S^{\text{C}}$.
While this assumption somehow restricts the scenario we consider, it is fairly common in practice. For example, all the domains that are characterized by the presence of a resetting distribution (e.g., episodic problems) satisfy this assumption (e.g., mountain car, cart pole, Atari games, taxi, etc.).
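The cumulative regret defined above is straightforward to compute from a logged trajectory; a minimal sketch (the function name is ours):

```python
def cumulative_regret(g_star, rewards):
    """Cumulative regret after T = len(rewards) steps.

    Compares T * g_star (the total reward an optimal policy would
    accumulate on average) with the rewards actually collected by the
    learning agent along its trajectory.
    """
    return len(rewards) * g_star - sum(rewards)
```

An agent matching the optimal gain on every step incurs zero regret, e.g., `cumulative_regret(0.5, [1, 0, 1, 0])` evaluates to `0.0`.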
Multi-chain MDPs. While we consider weakly-communicating MDPs for ease of notation, all our results extend to the more general case of multi-chain MDPs (in the case of misspecified states, we implicitly define a multi-chain MDP, where each non-reachable state has a self-loop dynamics and defines a “singleton” communicating subset). In this case, there may be multiple communicating and transient sets of states, and the optimal gain may be different in each communicating subset. We then define $S^{\text{C}}$ as the set of states that are accessible –with non-zero probability– from $s_1$ ($s_1$ included) under some stationary deterministic policy, and $S^{\text{T}}$ as the complement of $S^{\text{C}}$ in $\mathcal S$, i.e., $S^{\text{T}} = \mathcal S \setminus S^{\text{C}}$. With these new definitions of $S^{\text{C}}$ and $S^{\text{T}}$, Asm. 1 needs to be reformulated as follows:
Assumption 1 for Multi-chain MDPs. The initial state $s_1$ is accessible –with non-zero probability– from any other state in $S^{\text{C}}$ under some stationary deterministic policy. Equivalently, $S^{\text{C}}$ is a communicating set of states.
Note that the states belonging to $S^{\text{T}}$ can either be transient or belong to other communicating subsets of the MDP disjoint from $S^{\text{C}}$. This does not really matter, because the states in $S^{\text{T}}$ will never be visited by definition. As a result, the regret is still defined as before, where the learning performance is compared to the optimal gain associated with the communicating set of states $S^{\text{C}}$.
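Under the multi-chain convention just described, $S^{\text{C}}$ is simply the set of states reachable from $s_1$ in the support graph of the dynamics, which a graph traversal recovers directly. A sketch, with an input encoding of our own choosing:

```python
def communicating_set(s1, support):
    """States accessible with non-zero probability from state `s1`.

    `support[s][a]` is the collection of states with non-zero
    transition probability from (s, a).  The returned set corresponds
    to S^C in the multi-chain convention (s1 included); its complement
    in the state space plays the role of S^T.
    """
    reachable = {s1}
    frontier = [s1]
    while frontier:
        s = frontier.pop()
        # Union over all actions of the one-step reachable states.
        for successors in support[s].values():
            for s_next in successors:
                if s_next not in reachable:
                    reachable.add(s_next)
                    frontier.append(s_next)
    return reachable
```

Note that reachability alone does not verify Asm. 1 (which also requires the return paths to `s1`); it only identifies which states can ever be visited.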
3 Truncated Upper-Confidence for Reinforcement Learning (TUCRL)
In this section we introduce Truncated Upper-Confidence for Reinforcement Learning (TUCRL), an optimistic online RL algorithm that efficiently balances exploration and exploitation to learn in non-communicating MDPs without prior knowledge (Fig. 2).
Similar to UCRL, at the beginning of each episode $k$, TUCRL constructs confidence intervals $B^r_k(s,a)$ and $B^p_k(s,a)$ for the reward and the dynamics of the MDP. These intervals are centered at the empirical means $\hat r_k(s,a)$ and $\hat p_k(\cdot \mid s,a)$ computed from the $N_k(s,a)$ visits to $(s,a)$ before episode $k$, and their width follows from empirical Bernstein bounds based on the empirical variances of the rewards and transitions. The set of plausible MDPs associated with the confidence intervals is then $\mathcal M_k := \{M = (\mathcal S, \mathcal A, \tilde r, \tilde p) : \tilde r(s,a) \in B^r_k(s,a),\ \tilde p(\cdot \mid s,a) \in B^p_k(s,a)\}$. UCRL is optimistic w.r.t. the confidence intervals: for all states that have never been visited, the optimistic reward is set to its maximal value, while all transitions to such states are set to the largest value compatible with $B^p_k$. Unfortunately, some of the never-visited states may actually be unreachable (i.e., belong to $S^{\text{T}}$), and UCRL would uniformly explore the policy space with the hope that at least one policy reaches those (optimistically desirable) states. TUCRL addresses this issue by first constructing empirical estimates of $S^{\text{C}}$ and $S^{\text{T}}$ (i.e., the sets of communicating and transient states of the true MDP) using the states that have been visited so far, that is $S^{\text{C}}_k := \{s : N_k(s) > 0\} \cup \{s_{t_k}\}$ and $S^{\text{T}}_k := \mathcal S \setminus S^{\text{C}}_k$, where $t_k$ is the starting time of episode $k$.
In order to avoid optimistic exploration attempts towards unreachable states, we could simply execute UCRL on $S^{\text{C}}_k$, which is guaranteed to contain only states in the communicating set (since by Asm. 1, $s_1 \in S^{\text{C}}$). Nonetheless, this algorithm could under-explore state-action pairs that would allow discovering other states in $S^{\text{C}}$, thus getting stuck in a subset of the communicating states of the MDP and suffering linear regret. While the states in $S^{\text{C}}_k$ are guaranteed to be in the communicating subset, it is not possible to know whether states in $S^{\text{T}}_k$ are actually reachable from $S^{\text{C}}_k$ or not. TUCRL therefore first “guesses” a lower bound on the probability of transition from states in $S^{\text{C}}_k$ to states in $S^{\text{T}}_k$, and whenever the maximum transition probability compatible with the confidence intervals is below this lower bound, it assumes that such a transition is not possible. This strategy is based on the intuition that a transition either does not exist or should have a sufficiently “big” mass. However, these transitions should be periodically reconsidered in order to avoid under-exploration issues. More formally, let $(\rho_t)_t$ be a non-increasing sequence to be defined later. For all $s \in S^{\text{C}}_k$, $a \in \mathcal A$ and $\bar s \in S^{\text{T}}_k$, the empirical mean and variance of the transition to $\bar s$ are zero (i.e., this transition has never been observed so far), so the largest (most optimistic) probability of transition from $(s,a)$ to $\bar s$ reduces to the deviation term of the confidence interval. TUCRL compares this quantity to $\rho_{t_k}$ and forces all transition probabilities below the threshold to zero, while the confidence intervals of transitions to states that have already been explored (i.e., in $S^{\text{C}}_k$) are preserved unchanged. This corresponds to constructing an alternative, truncated confidence interval
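The truncation rule can be sketched as follows: for a transition never observed from $(s,a)$, the optimistic probability equals the deviation term of the confidence interval alone, and it is clipped to zero when it falls below the threshold. The constants below are illustrative placeholders, not the ones of the actual bound.

```python
import math

def truncated_transition_upper_bound(n_visits, t, n_states, rho,
                                     delta=0.05):
    """Largest plausible probability of reaching a never-observed state.

    For a pair (s, a) visited `n_visits` times, the empirical mean and
    variance of the transition to an unvisited state are zero, so a
    Bernstein-style interval leaves only a lower-order deviation term,
    roughly O(log(t / delta) / n_visits).  TUCRL truncates this term to
    zero whenever it falls below the threshold `rho` (the constants
    here are illustrative, not those of the paper).
    """
    if n_visits == 0:
        return 1.0  # nothing observed yet: any transition is plausible
    beta = 7 * math.log(2 * n_states * t / delta) / (3 * n_visits)
    beta = min(beta, 1.0)
    return beta if beta >= rho else 0.0
```

With this rule, a heavily visited pair with no observed transition to $\bar s$ yields an optimistic probability of exactly zero, while a barely visited pair keeps the transition fully enabled.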
Given the truncated confidence intervals, TUCRL (implicitly) constructs the corresponding set of plausible MDPs $\widetilde{\mathcal M}_k$ and then solves the optimistic optimization problem
The resulting algorithm follows the same structure as UCRL and is shown in Fig. 2. The episode stopping condition at line 4 is slightly modified w.r.t. UCRL: it guarantees that at least one action is always executed, and it forces an episode to terminate as soon as a state that was previously in $S^{\text{T}}_k$ is visited. This minor change guarantees that the states that were not considered reachable at the beginning of the episode are visited at most once within it. The algorithm also needs minor modifications to the extended value iteration (EVI) algorithm used to solve (5) to guarantee both efficiency and convergence. All technical details are reported in App. C.
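The modified stopping condition can be sketched as a simple predicate evaluated after each step (the function and argument names are ours; `nu_sa` and `N_sa` are the within-episode and pre-episode visit counts of the pair just executed):

```python
def should_stop_episode(nu_sa, N_sa, current_state,
                        transient_at_start, steps_in_episode):
    """TUCRL episode stopping rule (sketch).

    Stop either when the within-episode count of the executed pair
    matches its count at the episode start (UCRL's doubling condition),
    or as soon as the agent reaches a state that was in the estimated
    transient set when the episode began; at least one action is
    always executed first.
    """
    if steps_in_episode < 1:
        return False  # always execute at least one action
    if current_state in transient_at_start:
        return True   # a previously "unreachable" state was just visited
    return nu_sa >= max(1, N_sa)
```

The second clause is the TUCRL-specific addition: it forces replanning (and an update of the estimated sets) immediately after discovering a state outside the current estimate of the communicating set.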
In practice, we set $\rho_t$ so that the condition to remove a transition reduces to a minimum-visit requirement on $N_k(s,a)$. This means that transitions to unvisited states are only enabled from state-action pairs that have been poorly visited so far, while if the state-action pair has already been tried often and yet no transition to $\bar s \in S^{\text{T}}_k$ has been observed, then it is assumed that $\bar s$ is not reachable from $(s,a)$. When the number of visits to $(s,a)$ is large, the transitions to unvisited states should be discarded: if such a transition actually exists, it is most likely extremely small, and it is worth exploring other parts of the MDP first. Symmetrically, when the number of visits to $(s,a)$ is small, the transitions to unvisited states should be enabled, because such transitions are still quite plausible and the algorithm should try to explore the outcome of taking action $a$ in $s$ and possibly reach states in $S^{\text{T}}_k$. We denote by $K_k$ the set of state-action pairs that are not sufficiently explored.
3.1 Analysis of TUCRL
We prove that the regret of TUCRL is bounded as follows.
Theorem 1. For any weakly communicating MDP $M$, with probability at least $1 - \delta$, it holds that for any $T \ge 1$ the regret of TUCRL is bounded as
The first term in the regret bound shows the ability of TUCRL to adapt to the communicating part of the true MDP, scaling with the communicating diameter $D^{\text{C}}$ and the parameters $S^{\text{C}}$ and $\Gamma^{\text{C}}$ of the communicating part. The second term corresponds to the regret incurred in the early stage, where the regret grows linearly. When $M$ is communicating, we match the square-root term of UCRL (first term), while the second term is bigger than the one appearing in UCRL by a multiplicative factor (ignoring logarithmic terms, see Sec. 5).
We now provide a sketch of the proof of Thm. 1 (the full proof is reported in App. D). In order to preserve readability, all the following inequalities should be interpreted up to minor approximations and as holding with high probability.
Let $\Delta_k := \sum_{s,a} \nu_k(s,a)\,(g^* - r(s,a))$ be the regret incurred in episode $k$, where $\nu_k(s,a)$ is the number of visits to $(s,a)$ in episode $k$. We decompose the regret as
where the first term defines the length of a full exploratory phase, during which the agent may suffer linear regret.
Optimism. The first technical difficulty is that, whenever some transitions are disabled, the set of plausible MDPs may actually be biased and not contain the true MDP $M$. This requires proving that TUCRL (i.e., the gain of the solution returned by EVI) is always optimistic despite the “wrong” confidence intervals. The following lemma helps to identify the possible scenarios that TUCRL can produce (see App. D.2); notice that $M \in \mathcal M_k$ holds w.h.p. since $\mathcal M_k$ is obtained using the non-truncated confidence intervals.
Lemma 1. Let episode $k$ be such that $S^{\text{T}}_k \neq \emptyset$ and $M \in \mathcal M_k$. Then, either $M \in \widetilde{\mathcal M}_k$ (case I), or there exists a state-action pair for which transitions to $S^{\text{T}}_k$ are allowed (case II).
This result basically excludes the case where $S^{\text{T}}_k \neq \emptyset$ (i.e., some states have not been reached yet) and yet no transition from $S^{\text{C}}_k$ to them is enabled. We start by noticing that when $S^{\text{T}}_k = \emptyset$, the true MDP $M$ belongs to $\widetilde{\mathcal M}_k$ w.h.p. by construction of the confidence intervals. Similarly, if all the states in $S^{\text{T}}_k$ are indeed unreachable, then $M \in \widetilde{\mathcal M}_k$ w.h.p., since TUCRL only truncates transitions that are indeed forbidden in $M$ itself. In both cases, we can use the same arguments as in (Jaksch et al., 2010) to prove optimism. In case II, the optimistic gain of any state in $S^{\text{T}}_k$ is set to its maximal value and, since there exists a path from $S^{\text{C}}_k$ to $S^{\text{T}}_k$, the gain of the solution returned by EVI is maximal as well, which makes it trivially optimistic. As a result, we can conclude that the optimistic gain dominates $g^*$ (up to the precision of EVI).
Per-episode regret. After bounding the optimistic reward w.r.t. $g^*$, the only part left to bound the per-episode regret is the difference between the optimistic gain and the collected rewards. Similar to UCRL, we could use the (optimistic) optimality equation and rewrite this term through a vector $\tilde h_k$, a shifted version of the vector returned by EVI at episode $k$, and then proceed by bounding the difference between optimistic and true quantities using standard concentration inequalities. Nonetheless, we would be left with the problem of bounding the range of the optimistic vector $\tilde h_k$ over the whole state space, i.e., $sp(\tilde h_k)$. While in communicating MDPs it is possible to bound this quantity by the diameter of the MDP (Jaksch et al., 2010, Sec. 4.3), in weakly-communicating MDPs $D = \infty$, thus making this result uninformative. As a result, we need to restrict our attention to the subset of communicating states $S^{\text{C}}_k$, where the diameter is finite. We then split the per-step regret over states depending on whether they are explored enough or not, i.e., whether the state-action pair belongs to $K_k$. We start by focusing on the poorly visited state-action pairs, i.e., $(s,a) \in K_k$. In this case, TUCRL may suffer the maximum per-step regret, but the number of times this event happens is cumulatively “small” (App. D.4.1):
Lemma 2. For any $T \ge 1$ and any sequence of states and actions, we have:
For the sufficiently explored state-action pairs (i.e., those outside $K_k$), the per-episode regret can be bounded as in Eq. 6, but now restricted to $S^{\text{C}}_k$, so that,
Since the stopping condition guarantees that all but the last state visited in the episode belong to $S^{\text{C}}_k$, we can first restrict the outer summation to states in $S^{\text{C}}_k$. Furthermore, for all sufficiently explored state-action pairs, the optimistic transition probability to states in $S^{\text{T}}_k$ is forced to zero, thus reducing the inner summation as well. We are then left with providing a bound for the range of $\tilde h_k$ restricted to the states in $S^{\text{C}}_k$. We recall that EVI run on a set of plausible MDPs $\mathcal M$ returns a function whose range is controlled by the expected shortest-path times in the extended MDP associated with $\mathcal M$. Furthermore, since $M \in \mathcal M_k$ w.h.p., the shortest-path times in $\mathcal M_k$ between states in $S^{\text{C}}$ are at most $D^{\text{C}}$. Unfortunately, since $M$ may not belong to $\widetilde{\mathcal M}_k$, the bound on the shortest path in $\mathcal M_k$ may not directly translate into a bound for the shortest path in $\widetilde{\mathcal M}_k$, thus preventing us from bounding the range of $\tilde h_k$ even on the subset of states in $S^{\text{C}}_k$. Nonetheless, in App. E we show that a minor modification to the confidence intervals of $\widetilde{\mathcal M}_k$ makes the shortest paths between any two states equivalent in both sets of plausible MDPs, thus providing a bound of order $D^{\text{C}}$ on the range of $\tilde h_k$ over $S^{\text{C}}_k$ (note that there is not a single way to modify the confidence intervals to keep this range under control: in App. F we present an alternative modification for which the shortest paths between any two states are not equal but smaller than in $\mathcal M_k$, thus ensuring the same bound). The final regret bound in Thm. 1 is then obtained by combining all the different terms.
4 Numerical Experiments
In this section, we present experiments to validate the theoretical findings of Sec. 3. We compare TUCRL against UCRL and SCAL (to the best of our knowledge, there exists no implementable algorithm to solve the optimization step of Regal and Regal.D). We first consider the taxi problem (Dietterich, 2000) implemented in OpenAI Gym (Brockman et al., 2016); the code is available on GitHub. Even such a simple domain contains misspecified states, since the state space is constructed as the outer product of the taxi position, the passenger position and the destination. This leads to states that cannot be reached from any possible starting configuration (all the starting states belong to $S^{\text{C}}$): a non-negligible fraction of the states in $\mathcal S$ is non-reachable. In Fig. 3(left) we compare the regret of UCRL, SCAL and TUCRL when the misspecified states are present (top) and when they are removed (bottom). In the presence of misspecified states (top), the regret of UCRL clearly grows linearly with $T$, while TUCRL is able to learn, as expected. On the other hand, when the MDP is communicating (bottom), TUCRL performs similarly to UCRL. The small loss in performance is most likely due to the initial exploration phase, during which the confidence intervals on the transition probabilities used by UCRL are tighter than those used by TUCRL: TUCRL uses a “loose” bound on the $\ell_1$-norm of the transition probabilities, while UCRL uses separate bounds, one for every possible next state. Finally, SCAL outperforms TUCRL by exploiting prior knowledge on the bias span.
We further study the regret of TUCRL in the simple three-state domain introduced in (Fruit et al., 2018) (see App. H for details), with different reward distributions (uniform instead of Bernoulli). The environment is composed of only three states, with one action per state except in one state where two actions are available. As a result, the agent only has the choice between two possible policies. Fig. 3(right) shows the cumulative regret achieved by TUCRL and SCAL (with different upper bounds on the bias span) when the diameter is infinite (we omit UCRL, since it suffers linear regret). Both SCAL and TUCRL quickly achieve sub-linear regret, as predicted by the theory. However, SCAL and TUCRL seem to achieve different growth rates in regret: while SCAL appears to reach a logarithmic growth, the regret of TUCRL seems to grow as $\sqrt{T}$, with periodic “jumps” that are increasingly distant (in time) from each other. This can be explained by the way the algorithm works: while most of the time TUCRL is optimistic on the restricted state space $S^{\text{C}}_k$, it periodically allows transitions to the set $S^{\text{T}}_k$, which is indeed not reachable. Enabling these transitions triggers aggressive exploration during an entire episode. The policy played is then sub-optimal, creating a “jump” in the regret. At the end of this exploratory episode, the transitions to $S^{\text{T}}_k$ are truncated to zero again and the regret stops increasing until the enabling condition occurs again (the time between two consecutive exploratory episodes grows quadratically). The cumulative regret incurred during exploratory episodes can be bounded by the term plotted in green in Fig. 3(right). In Lem. 2 we proved that this term is always bounded by a quantity of order $\sqrt{T}$. Therefore, it is not surprising to observe a $\sqrt{T}$ increase of both the green and red curves. Unfortunately, the growth rate of the regret will keep increasing as $\sqrt{T}$ and will never become logarithmic, unlike SCAL (or UCRL when the MDP is communicating). This is because the enabling condition keeps being triggered, no matter how large $T$ is.
In Sec. 5 we show that this is not just a drawback specific to TUCRL, but it is rather an intrinsic limitation of learning in weakly-communicating MDPs.
5 Exploration-exploitation dilemma with infinite diameter
In this section we further investigate the empirical difference between SCAL and TUCRL and prove an impossibility result characterising the exploration-exploitation dilemma when the diameter is allowed to be infinite and no prior knowledge on the optimal bias span is available.
We first recall that the expected regret of UCRL (with a suitably chosen input parameter) after $T$ time steps and for any finite MDP can be bounded in several ways:
where the gap in gain, the numerical constants (independent of $T$) and the measure of the “mixing time” of the involved policies are problem-dependent quantities. The three different bounds lead to three different growth rates for the regret as a function of $T$ (see Fig. 4): 1) for small $T$, the expected regret is linear in $T$; 2) for intermediate values of $T$, the expected regret grows as $\sqrt{T}$; 3) finally, for $T$ large enough, the increase in regret is only logarithmic in $T$. These different “regimes” can be observed empirically (see (Fruit et al., 2018, Fig. 5, 12)). Using (7), it is easy to show that the time it takes for UCRL to achieve sub-linear regret is at most polynomial in the parameters of the MDP. We say that an algorithm is efficient when it achieves sub-linear regret after a number of steps that is polynomial in the parameters of the MDP (UCRL is thus efficient). We now show with an example that, without prior knowledge, no efficient learning algorithm can achieve logarithmic regret when the true MDP has infinite diameter.
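The three regimes can be visualised by comparing the three bounds directly; the coefficients below are placeholders for the problem-dependent constants of (7), and the function is a sketch of ours, not part of any algorithm:

```python
import math

def regret_regime(t, lin_coef, sqrt_coef, log_coef):
    """Return which of the three UCRL regret bounds is tightest at time t.

    The expected regret is bounded by the minimum of a linear term
    lin_coef * t, a square-root term sqrt_coef * sqrt(t) and a
    logarithmic term log_coef * log(t); which one dominates determines
    the observed growth regime.
    """
    bounds = {
        "linear": lin_coef * t,
        "sqrt": sqrt_coef * math.sqrt(t),
        "log": log_coef * math.log(max(t, 2)),
    }
    return min(bounds, key=bounds.get)
```

With small linear and large square-root/logarithmic coefficients, the tightest bound moves from linear to square-root to logarithmic as $t$ grows, matching the three empirical regimes.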
We consider a family of weakly-communicating MDPs, represented in Fig. 4(right). Every MDP instance in the family is characterised by a specific value of the probability of transitioning from the initial communicating part of the MDP to the other part. When this probability is zero, the optimal policy and the optimal gain are confined to the initial communicating part, while for any positive value the optimal policy eventually crosses to the other part and achieves a strictly higher gain. We assume that the learning agent knows that the true MDP belongs to this family but does not know the value of the transition probability. We assume that all rewards are deterministic and that the agent starts in the state coloured in grey.
Let $\alpha$ and $\beta$ be positive real numbers and $f$ a function of the horizon $T$. There exists no learning algorithm (with known horizon $T$) satisfying both
for all , there exists