While learning in an unknown environment, a reinforcement learning (RL) agent must trade off the exploration needed to collect information about the dynamics and reward, and the exploitation of the experience gathered so far to gain as much reward as possible. The performance of an online learning agent is usually measured in terms of cumulative regret, which compares the rewards accumulated by the agent with the rewards accumulated by an optimal agent. A popular strategy to deal with the exploration-exploitation dilemma (i.e., minimize regret) is to follow the optimism in the face of uncertainty (OFU) principle.
Optimistic approaches have been widely studied in the context of stochastic multi-armed bandit (MAB) problems. In this setting, OFU-based algorithms maintain optimistic estimates of the expected reward of each action $a$ (i.e., arm), and play the action with the highest optimistic estimate (see e.g., Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2018). These optimistic estimates are usually obtained by adding a high probability confidence bound $b(a)$ to the empirical average reward $\hat{r}(a)$, i.e., $\hat{r}(a) + b(a)$. The confidence bound plays the role of an exploration bonus: the higher $b(a)$, the more likely $a$ will be explored. As an example, based on Hoeffding's inequality, the Upper-Confidence Bound (UCB) algorithm uses $b(a) = \Theta\big(r_{\max}\sqrt{\ln(t)/N(a)}\big)$, where $N(a)$ is the total number of times action $a$ has been played before and all rewards are assumed to lie between $0$ and $r_{\max}$ with probability $1$. UCB can be shown to achieve nearly-optimal regret guarantees.
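To make the role of the bonus concrete, the following is a minimal sketch of a UCB-style index policy on Bernoulli arms. The constant 2 inside the square root, the helper names `ucb_bonus` and `run_ucb`, and the Bernoulli reward model are assumptions made for illustration only; they are not taken from the works cited above.

```python
import math
import random


def ucb_bonus(t, n_pulls, r_max=1.0):
    """Hoeffding-style bonus of order r_max * sqrt(ln(t) / N(a)).

    The constant 2 is an illustrative choice, not a prescribed value.
    """
    if n_pulls == 0:
        return float("inf")  # an arm that was never played is maximally uncertain
    return r_max * math.sqrt(2.0 * math.log(t) / n_pulls)


def run_ucb(arm_means, horizon, seed=0):
    """Play Bernoulli arms with a UCB index; return the number of pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):

        def index(a):
            # optimistic estimate: empirical mean plus exploration bonus
            mean = sums[a] / counts[a] if counts[a] > 0 else 0.0
            return mean + ucb_bonus(t, counts[a])

        arm = max(range(n_arms), key=index)
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


pulls = run_ucb([0.2, 0.8], horizon=2000)
```

In this toy run the arm with the larger mean should receive most of the pulls, since the bonus of the other arm shrinks as its count grows.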
Strehl and Littman (2008) later generalized the idea of enforcing exploration by using a bonus on the reward to the RL framework. They analysed the infinite-horizon $\gamma$-discounted setting and introduced the Model Based Interval Estimation with Exploration Bonus (MBIE-EB) algorithm. MBIE-EB plays the optimal policy of the empirically estimated MDP where for each state-action pair $(s,a)$, a bonus $b(s,a)$ is added to the empirical average reward $\hat{r}(s,a)$, i.e., the immediate reward associated to $(s,a)$ is $\hat{r}(s,a) + b(s,a)$. Unlike in MAB where the optimal arm is the one with maximal immediate reward, the goal of RL is to find a policy maximizing the cumulative reward, i.e., the $Q$-function. Therefore, the bonus needs to account for the uncertainty in both the rewards and transition probabilities, and so $b(s,a) = \Theta\Big(\frac{r_{\max}}{1-\gamma}\sqrt{\ln(t)/N(s,a)}\Big)$ where $\frac{r_{\max}}{1-\gamma}$ is the range of the $Q$-function. Strehl and Littman (2008) also derived PAC guarantees on the sample complexity of MBIE-EB. More recently, count-based methods (e.g., Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017; Martin et al., 2017) tried to combine the idea of MBIE-EB with Deep RL (DRL) techniques to achieve a good exploration-exploitation trade-off in high dimensional problems. The exploration bonus usually used has a similar form $\beta/\sqrt{N(s,a)}$, where $\beta$ is now a hyper-parameter tuned for the specific task at hand, and the visit count $N(s,a)$ is approximated using discretization (e.g., hashing) or density estimation methods.
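As an illustration of the count-based bonuses mentioned above, here is a minimal sketch in which continuous states are discretized by a simple grid hash and the bonus is $\beta/\sqrt{N}$. The class name `HashedCountBonus`, the grid discretization, and the default values of `beta` and `cell_size` are hypothetical choices for this sketch, not the scheme of any of the cited methods.

```python
import math
from collections import defaultdict


class HashedCountBonus:
    """Count-based bonus beta / sqrt(N), with N counted per grid cell and action."""

    def __init__(self, beta=0.1, cell_size=0.5):
        self.beta = beta            # hyper-parameter scaling the bonus
        self.cell_size = cell_size  # width of each grid cell (assumed discretization)
        self.counts = defaultdict(int)

    def _key(self, state, action):
        # map a continuous state (tuple of floats) to the grid cell containing it
        cell = tuple(math.floor(x / self.cell_size) for x in state)
        return (cell, action)

    def update(self, state, action):
        self.counts[self._key(state, action)] += 1

    def bonus(self, state, action):
        n = self.counts.get(self._key(state, action), 0)
        return self.beta / math.sqrt(max(n, 1))
```

The bonus would typically be added to the observed reward before a learning update; states falling in the same cell share one visit count.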
Exploration bonuses have also been successfully applied to finite-horizon problems (Azar et al., 2017; Kakade et al., 2018; Jin et al., 2018). In this setting, the planning horizon $H$ is known to the learning agent and the range of the $Q$-function is $r_{\max}H$. A natural choice for the bonus is then $b(s,a) = \Theta\big(r_{\max}H\sqrt{\ln(t)/N(s,a)}\big)$. UCBVI_1 introduced by Azar et al. (2017) uses such a bonus and achieves near-optimal regret guarantees. Extensions of UCBVI_1 exploiting the variance instead of the range of the $Q$-function achieve a better regret bound (Azar et al., 2017; Kakade et al., 2018; Jin et al., 2018).
Both the finite horizon setting and infinite horizon discounted setting assume that there exists an intrinsic horizon (respectively $H$ and $\frac{1}{1-\gamma}$) known to the learning agent. Unfortunately, in many common RL problems it is not clear how to define $H$ or $\frac{1}{1-\gamma}$, and it is often desirable to set them as big as possible (e.g., in episodic problems, the time to the goal is not known in advance and is random in general). As $H$ tends to infinity the regret (of UCBVI_1, etc.) will become linear, while as $\gamma$ tends to $1$ the sample complexity (of MBIE-EB, etc.) tends to infinity (not to mention the numerical instabilities that may arise). In this paper we focus on the much more natural infinite horizon undiscounted setting (Puterman, 1994, Chap. 8), which generalizes the two previous settings to the case where $H \to +\infty$ and $\gamma \to 1$ respectively. Several algorithms implementing the OFU principle in the infinite horizon undiscounted case have been proposed in the literature (e.g., Jaksch et al., 2010; Ortner and Ryabko, 2012; Fruit et al., 2017, 2018b; Talebi and Maillard, 2018), but none of these approaches exploits the idea of an exploration bonus. Instead, they all construct an extended MDP (sometimes called bounded-parameter MDP) with continuous action space, which can be interpreted as the concatenation of all possible MDPs compatible with some high probability confidence bounds on the transition model, among which is the true MDP. The policy executed by the algorithm is the optimal policy of the extended MDP. UCRL (Jaksch et al., 2010) achieves a regret of order $\widetilde{O}(r_{\max}D\sqrt{\Gamma SAT})$ after $T$ time steps, where $D$, $\Gamma$, $S$ and $A$ are respectively the diameter of the true MDP, the maximum number of reachable next states from any state, the number of states and the number of actions. (The original bound of Jaksch et al. (2010) has $\sqrt{S}$ instead of $\sqrt{\Gamma}$, but $\sqrt{\Gamma}$ can be easily achieved by replacing Hoeffding's inequality with the empirical Bernstein inequality for transition probabilities.)
Fruit et al. (2018b) showed an improved bound for SCAL when an upper bound on the optimal bias span is known to the learning agent. Although such algorithms can be efficiently implemented in the tabular case, it is difficult to extend them to more scalable approaches like DRL. In contrast, as already mentioned, the exploration bonus approach is simpler to adapt to large-scale problems and inspired count-based methods in DRL.
In this paper we introduce and analyse SCAL+, the first algorithm that relies on an exploration bonus to efficiently balance exploration and exploitation in the infinite-horizon undiscounted setting. All the exploration bonuses that were previously introduced in the RL literature explicitly depend on $\gamma$ or $H$, which are known to the learning agent. In the infinite-horizon undiscounted case, there is no predefined parameter informing the agent about the range of the $Q$-function. This makes the design of an exploration bonus very challenging. To overcome this limitation, we make the same assumption as Bartlett and Tewari (2009) and Fruit et al. (2018b), i.e., we assume that the agent knows an upper bound $c$ on the span (i.e., range) of the optimal bias (i.e., value function). The exploration bonus used by SCAL+ is thus $b(s,a) = \Theta\big(\max\{c, r_{\max}\}\sqrt{\ln(t)/N(s,a)}\big)$. In comparison, state-of-the-art algorithms in the infinite horizon undiscounted setting like UCRL or SCAL can, to a certain extent, be interpreted as virtually using an exploration bonus of order $r_{\max}D\sqrt{\Gamma\ln(t)/N(s,a)}$ and $\max\{c, r_{\max}\}\sqrt{\Gamma\ln(t)/N(s,a)}$ respectively. This is bigger by a multiplicative factor $\sqrt{\Gamma}$. As a result, to the best of our knowledge, SCAL+ achieves a “tighter” optimism than any other existing algorithm in the infinite horizon undiscounted setting and is therefore less prone to over-exploration.
To further illustrate the generality of the exploration bonus approach, we also present C-SCAL+, an extension of SCAL+ to MDPs with continuous state space but finite action space. As in Ortner and Ryabko (2012) and Lakshmanan et al. (2015), we require the reward and transition functions to be Hölder continuous with parameters and . C-SCAL+ is also the first implementable algorithm for continuous problems with theoretical guarantees (existing algorithms with theoretical guarantees such as UCCRL (Ortner and Ryabko, 2012) cannot be implemented). C-SCAL+ combines the idea of SCAL+ with state aggregation. Compared to SCAL+, the exploration bonus contains an additional term due to the discretization: for any aggregated state , .
The main result of the paper is summarized in Thm. 1: For any MDP with $S$ states, $A$ actions and $\Gamma$ next states, the regret of SCAL+ is bounded with high probability by For any “smooth” MDP with smoothness parameters and , $d$-dimensional state space and $A$ actions, the regret of C-SCAL+ is bounded with high probability by The regret bound of SCAL+ (resp. C-SCAL+) matches the one of SCAL (resp. UCCRL). Surprisingly, the tighter optimism introduced by SCAL+ compared to SCAL and UCRL is not reflected in the final regret bound with the current statistical analysis ($\sqrt{\Gamma}$ appears in the bound despite not being included in the bonus). We isolate and discuss where the term $\sqrt{\Gamma}$ appears in the proof sketch of Sec. 3.4. While Azar et al. (2017), Kakade et al. (2018) and Jin et al. (2018) managed to remove the term $\sqrt{\Gamma}$ in the finite horizon setting, it remains an open question whether their result can be extended to the infinite horizon case (for example, the two definitions of regret do not match and differ by a linear term) or whether it is an intrinsic difficulty of the setting. Finally, SCAL+ and C-SCAL+ are very appealing due to their simplicity and flexibility of implementation, since the planning is performed on the empirical MDP (rather than on a much more complex extended MDP). This change of paradigm results in a more computationally efficient planning compared to UCRL and SCAL, as explained in Sec. 3.1.
2.1 Markov Decision Processes
We consider a weakly-communicating MDP (Puterman, 1994, Sec. 8.3) $M = (\mathcal{S}, \mathcal{A}, r, p)$ with a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$. (In a weakly-communicating MDP, the set $\mathcal{S}$ can be decomposed into two subsets: a communicating set in which for any pair of states $s, s'$ there exists a policy that has a non-zero probability to reach $s'$ starting from $s$, and a set of states that are transient under all policies.) For the sake of clarity, here we consider a finite MDP, but all the stated concepts extend to the case of continuous state space (see e.g., Ortner and Ryabko, 2012).
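Each state-action pair $(s,a)$ is characterized by a reward distribution with mean $r(s,a)$ and support in $[0, r_{\max}]$, as well as a transition probability distribution $p(\cdot|s,a)$ over next states. We denote by $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$ the number of states and actions, and by $\Gamma = \max_{s,a}\|p(\cdot|s,a)\|_0$ the maximum support of all transition probabilities. A stationary Markov randomized policy $\pi$ maps states to distributions over actions. The set of stationary randomized (resp. deterministic) policies is denoted by $\Pi^{SR}$ (resp. $\Pi^{SD}$). Any policy $\pi \in \Pi^{SR}$ has an associated long-term average reward (or gain) $g^\pi(s)$ and a bias function $h^\pi(s)$ defined as
$$ g^\pi(s) := \lim_{T\to+\infty} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} r(s_t,a_t)\right], \qquad h^\pi(s) := \underset{T\to+\infty}{C\text{-}\lim}\; \mathbb{E}\left[\sum_{t=1}^{T} \big(r(s_t,a_t) - g^\pi(s_t)\big)\right], $$
where the expectation is taken over trajectories with $s_1 = s$ and $a_t \sim \pi(s_t)$, and the bias $h^\pi(s)$ measures the expected total difference between the reward and the stationary reward in Cesaro-limit (denoted $C\text{-}\lim$; for policies with an aperiodic chain, the standard limit exists). Accordingly, the difference of bias values $h^\pi(s) - h^\pi(s')$ quantifies the (dis-)advantage of starting in state $s$ rather than $s'$, and we denote by $sp(h^\pi) := \max_s h^\pi(s) - \min_s h^\pi(s)$ the span of the bias function. In weakly communicating MDPs, any optimal policy $\pi^* \in \arg\max_\pi g^\pi(s)$ has constant gain, i.e., $g^{\pi^*}(s) = g^*$ for all $s \in \mathcal{S}$. Moreover, there exists a policy $\pi^* \in \arg\max_\pi g^\pi(s)$ for which $(g^*, h^*) = (g^{\pi^*}, h^{\pi^*})$ satisfy the optimality equation
$$ h^* + g^* e = L h^*, $$
where $e = (1, \dots, 1)^\top$ and $L$ is the optimal Bellman operator:
$$ \forall v \in \mathbb{R}^S, \quad Lv(s) := \max_{a \in \mathcal{A}} \big\{ r(s,a) + p(\cdot|s,a)^\top v \big\}. $$
Note that $h^*$ is finite, i.e., $sp(h^*) < +\infty$. Finally, $D := \max_{s \neq s'} \tau(s \to s')$ denotes the diameter of $M$, where $\tau(s \to s')$ is the minimal expected number of steps needed to reach $s'$ from $s$ in $M$ (under any policy).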
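The optimality equation and the span can be checked numerically on a toy problem. The sketch below applies relative value iteration with the optimal Bellman operator to a hypothetical two-state, two-action MDP; the arrays `r_example` and `p_example`, the function names, and the fixed iteration count are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def bellman_operator(r, p, v):
    """Optimal Bellman operator: (Lv)(s) = max_a { r(s,a) + p(.|s,a)^T v }."""
    return np.max(r + p @ v, axis=1)


def span(v):
    """Span of a vector: max(v) - min(v)."""
    return float(np.max(v) - np.min(v))


def relative_value_iteration(r, p, n_iter=1000, ref_state=0):
    """Iterate v <- Lv - Lv(ref_state) e; return (gain estimate, bias estimate)."""
    v = np.zeros(r.shape[0])
    gain = 0.0
    for _ in range(n_iter):
        lv = bellman_operator(r, p, v)
        gain = lv[ref_state]      # v[ref_state] == 0, so (Lv - v)(ref_state) = Lv(ref_state)
        v = lv - lv[ref_state]    # renormalise so that v[ref_state] stays at 0
    return gain, v


# Hypothetical toy MDP: r[s, a] and p[s, a, s'].
r_example = np.array([[0.0, 0.0],
                      [1.0, 0.0]])
p_example = np.array([[[1.0, 0.0], [0.5, 0.5]],   # state 0: stay / try to move to state 1
                      [[0.0, 1.0], [1.0, 0.0]]])  # state 1: stay (reward 1) / go back

g_star, h_star = relative_value_iteration(r_example, p_example)
```

On this toy MDP the iteration settles on a gain close to 1 and a bias with span close to 2, which can be verified by hand from the optimality equation.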
2.2 Planning under span constraint
In this section we introduce and analyse the problem of planning under bias span constraint, i.e., by imposing that $sp(h^\pi) \le c$ for any policy $\pi$. This problem is at the core of the proposed algorithms (SCAL+ and C-SCAL+) for exploration-exploitation. Formally, we define the optimization problem:
$$ g^*_c(M) := \sup_{\pi \in \Pi_c(M)} g^\pi, \qquad \Pi_c(M) := \big\{ \pi \in \Pi^{SR} : sp(h^\pi) \le c \,\wedge\, sp(g^\pi) = 0 \big\}, \tag{3} $$
where $M$ is any MDP (with discrete or continuous state space) s.t. $\Pi_c(M) \neq \emptyset$. (Fruit et al. (2018b, Lem. 2) showed that there may not exist a deterministic optimal policy for problem 3.) This problem is a slight variation of the bias-span constrained problem considered by Bartlett and Tewari (2009), Ortner and Ryabko (2012) and Lakshmanan et al. (2015), for which no known solution is available. On the other hand, problem 3 has been widely analysed by Fruit et al. (2018b).
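Problem 3 can be solved using ScOpt (Fruit et al., 2018b), a version of (relative) value iteration (Puterman, 1994; Bertsekas, 1995), where the optimal Bellman operator is modified to return value functions with span bounded by $c$, and the stopping condition is tailored to return a constrained-greedy policy with near-optimal gain. Given $v \in \mathbb{R}^S$ and $c \ge 0$, we define the value operator $T_c : \mathbb{R}^S \to \mathbb{R}^S$ as
$$ T_c v(s) := \Gamma_c L v(s) = \begin{cases} Lv(s) & \text{if } s \in \overline{\mathcal{S}}(c,v), \\ c + \min_{s'} Lv(s') & \text{otherwise}, \end{cases} \tag{4} $$
where $\overline{\mathcal{S}}(c,v) := \{ s \in \mathcal{S} : Lv(s) \le \min_{s'} Lv(s') + c \}$ and $\Gamma_c$ is the span constraint projection operator (see Fruit et al., 2018b, App. D, for details). In other words, operator $T_c$ applies a span truncation to the one-step application of $L$, which guarantees that $sp(T_c v) \le c$. Given a vector $v_0 \in \mathbb{R}^S$ and a reference state $\bar{s}$, ScOpt implements relative value iteration where $L$ is replaced by $T_c$: $v_{n+1} = T_c v_n - T_c v_n(\bar{s})\, e$, with $e$ the all-ones vector. We can now state the convergence guarantees of ScOpt (see Fruit et al., 2018b, Lem. 8 and Thm. 10).
Proposition 2.2. Assume that I) the optimal Bellman operator $L$ is a $\gamma$-span-contraction; II) all policies are unichain; III) operator $T_c$ is globally feasible at any vector $v \in \mathbb{R}^S$ such that $sp(v) \le c$, i.e., for all $s \in \mathcal{S}$, $\min_{a \in \mathcal{A}} \{ r(s,a) + p(\cdot|s,a)^\top v \} \le \min_{s'} Lv(s') + c$. Then:
Optimality equation: there exists a solution $(g^+, h^+) \in \mathbb{R} \times \mathbb{R}^S$ to the optimality equation $T_c h^+ = h^+ + g^+ e$. Moreover, any solution $(g^+, h^+)$ satisfies $g^+ = g^*_c$.
Convergence: for any initial vector $v_0 \in \mathbb{R}^S$, ScOpt converges to a solution $h^+$ of the optimality equation, and $\lim_{n \to +\infty} T_c^{n+1} v_0 - T_c^n v_0 = g^+ e$.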
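The span truncation and the relative value iteration described above can be sketched as follows. This is only an illustration under simplifying assumptions: the iteration runs for a fixed number of steps instead of using ScOpt's stopping condition, it does not return a policy, and it does not check the three assumptions (contraction, unichain, global feasibility). The toy MDP `r_example`, `p_example` and all function names are hypothetical.

```python
import numpy as np


def bellman_operator(r, p, v):
    """Optimal Bellman operator: (Lv)(s) = max_a { r(s,a) + p(.|s,a)^T v }."""
    return np.max(r + p @ v, axis=1)


def span(v):
    return float(np.max(v) - np.min(v))


def truncated_operator(r, p, v, c):
    """Span-truncated operator: cap every entry of Lv at min(Lv) + c."""
    lv = bellman_operator(r, p, v)
    return np.minimum(lv, np.min(lv) + c)


def scopt_sketch(r, p, c, n_iter=500, ref_state=0):
    """Relative value iteration with the truncated operator (fixed iteration count)."""
    v = np.zeros(r.shape[0])
    gain = 0.0
    for _ in range(n_iter):
        tv = truncated_operator(r, p, v, c)
        gain = tv[ref_state]      # v[ref_state] == 0 after renormalisation
        v = tv - tv[ref_state]
    return gain, v


# Hypothetical toy MDP: r[s, a] and p[s, a, s'].
r_example = np.array([[0.0, 0.0],
                      [1.0, 0.0]])
p_example = np.array([[[1.0, 0.0], [0.5, 0.5]],
                      [[0.0, 1.0], [1.0, 0.0]]])

gain_tight, bias_tight = scopt_sketch(r_example, p_example, c=1.0)
gain_loose, bias_loose = scopt_sketch(r_example, p_example, c=5.0)
```

With a loose constraint the truncation is inactive and the iteration behaves like standard relative value iteration; with a tight constraint the returned vector has span at most $c$ and the associated gain is smaller.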
2.3 Learning Problem
Let $M^*$ be the true unknown MDP. We consider the learning problem where $\mathcal{S}$, $\mathcal{A}$ and $r_{\max}$ are known, while rewards $r$ and transition probabilities $p$ are unknown and need to be estimated on-line. We evaluate the performance of a learning algorithm $\mathfrak{A}$ after $T$ time steps by its cumulative regret: $\Delta(\mathfrak{A}, T) := T g^* - \sum_{t=1}^{T} r_t(s_t, a_t)$. Finally, we assume that the algorithm is provided with the knowledge of a constant $c > 0$ such that $sp(h^*) \le c$. This assumption has been widely used in the literature (see e.g., Ortner, 2008; Ortner and Ryabko, 2012; Fruit et al., 2018b) and, as shown by Fruit et al. (2018a), it is necessary in order to achieve a logarithmic regret bound in weakly-communicating MDPs.
3 SCAL+: SCAL with exploration bonus
In this section, we introduce SCAL+, the first online RL algorithm in the infinite horizon undiscounted setting that leverages an exploration bonus to achieve provably good regret guarantees. Similarly to SCAL (Fruit et al., 2018b), SCAL+ takes advantage of the prior knowledge on the optimal bias span through the use of ScOpt. In Sec. 3.1 we present the details of SCAL+ and we give an explicit formula for the exploration bonus. We then show that all the conditions of Prop. 2.2 are satisfied for SCAL+, meaning that ScOpt can be used. Finally, we justify the choice of the bonus by showing that SCAL+ is gain-optimistic (Sec. 3.3) and we conclude this section with the regret guarantees of SCAL+ (Thm. 3.4) and a sketch of the regret proof.
3.1 The algorithm
SCAL+ is a variant of SCAL that uses ScOpt to (approximately) solve (3) on MDP at the beginning of each episode (see Fig. 1). (The algorithm is reported in its general form, which applies to both finite and continuous MDPs.) Before defining we need to introduce some notation and an intermediate MDP .
Denote by the starting time of episode , the number of observations of 3-tuples before episode ( excluded) and . As in UCRL, we define the empirical averages and by:
The exploration bonus is defined by aggregating the uncertainty on the reward and transition functions:
where is derived from Hoeffding-Azuma inequality. The application of ScOpt to the MDP defined by will not lead to a solution of problem 3 in general since none of the three assumptions of Prop. 2.2 is met. To satisfy the first and second assumptions, we introduce MDP where for all , is an arbitrary reference state and
is a biased (but asymptotically consistent) estimator of the probability of transition .
To satisfy the third assumption, we define the augmented MDP obtained by duplicating every action in with transition probability unchanged and reward set to . Formally, and for the sake of clarity, any pair is denoted by . We then define and . In the next section we will verify that satisfies all the assumptions of Prop. 2.2.
Note that the policy returned by ScOpt takes action in the augmented set . The projection on is simply , for all (we use the same notation for the two policies). is executed until the episode ends i.e., until the number of visits in at least one state-action pair has doubled (see Fig. 1).
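The bookkeeping behind the episodic scheme can be sketched as follows: empirical averages are computed from raw counts, and an episode stops once the visits to some state-action pair have doubled. The array layout, the function names, and the `max(N, 1)` convention for pairs that were never visited are assumptions made for this illustration; the exploration bonus and the perturbed transition estimator are deliberately left out.

```python
import numpy as np


def empirical_estimates(reward_sums, transition_counts):
    """Empirical mean rewards and transition frequencies from raw counts.

    reward_sums[s, a]            : sum of rewards observed in (s, a)
    transition_counts[s, a, s2]  : number of observed transitions (s, a) -> s2
    """
    n = transition_counts.sum(axis=2)   # visit counts N(s, a)
    denom = np.maximum(n, 1)            # avoid division by zero for unvisited pairs
    r_bar = reward_sums / denom
    p_bar = transition_counts / denom[:, :, None]
    return n, r_bar, p_bar


def episode_should_end(n_before, n_during):
    """Doubling rule: stop once, for some (s, a), the visits made during the
    episode reach the count available at the start of the episode (at least 1)."""
    return bool(np.any(n_during >= np.maximum(n_before, 1)))
```

A learning loop would call `empirical_estimates` at the start of each episode to build the planning MDP, then execute the planned policy while updating `n_during` until `episode_should_end` returns `True`.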
SCAL+ only requires planning on an empirical MDP with exploration bonus rather than an extended MDP (with continuous action space). This removes the burden of computing the best probability in a confidence interval, which has a worst-case computational complexity linear in the number of states (Jaksch et al., 2010, Sec. 3.1.2). Therefore, SCAL+ is not only simpler to implement but also less computationally demanding. Furthermore, removing the optimistic step on the transition probabilities allows the exploration bonus scheme to be easily adapted to any MDP that can be efficiently solved (e.g., continuous smooth MDPs).
3.2 Requirements for ScOpt
We show that the three assumptions of Prop. 2.2 required from ScOpt to solve (3) for are satisfied. The arguments are similar to those used by Fruit et al. (2018b, Sec. 6) for SCAL. We denote by , and the optimal Bellman operators of , and respectively. Similarly, we denote by , and the truncated Bellman operators (Eq. 4) of , and respectively.
Contraction. The small bias in the definition of ensures that the “attractive” state is reached with non-zero probability from any state-action pair implying that the ergodic coefficient of defined as is smaller than and thus (the Bellman operator of ) is -contractive (Puterman, 1994, Thm. 6.6.6).
Unichain. By construction, the attractive state necessarily belongs to all recurrent classes of all policies implying that is unichain (i.e., all policies are unichain).
Global feasibility. Let such that and let be such that . For all we have:
and . Therefore, for all implying that is globally feasible at .
3.3 Optimistic Exploration Bonus
All algorithms relying on the OFU principle (e.g., UCRL, Opt-PSRL, SCAL, etc.) have the property that the optimal gain of the MDP used for planning is an upper bound on the optimal gain of the true MDP . This is a key step in deriving regret guarantees. If we want to use the same proof technique for SCAL+, we also have to ensure that the policy is gain-optimistic (up to an - accuracy), i.e., . The exploration bonus was tailored to enforce this property. To prove gain-optimism we rely on the following proposition which is a direct consequence of Fruit et al. (2018b, Lem. 8): [Dominance] If there exists satisfying then . By induction, using the monotonicity and linearity of (Fruit et al., 2018b, Lemma 16 (a) & (c)), we have that . By Prop. 2.2, . Taking the limit when tends to infinity in the previous inequality yields: . Recall that the optimal gain and bias of the true MDP satisfy the optimality equation (Sec. 2.1). Since in addition (by assumption), we also have and so . According to Prop. 3.3, it is sufficient to show that to prove optimism. Fruit et al. (2018b, Lemma 15) also showed that the span projection (see Eq. 4) is monotone implying that a sufficient condition for to hold is to have . With our choice of bonus, this inequality holds with high probability (w.h.p) as a consequence of the following lemma: For all and , with probability at least , for any we have: . Hoeffding-Azuma inequality implies that with probability at least , for all and for all pairs , and . Finally, we also need to take into account the small bias introduced by compared to which is not bigger than by definition. Denote by the optimal Bellman operator of . A direct implication of Lem. 3.3 is that w.h.p. Since by definition and it is immediate to see that implying that w.h.p. As a result, we have the following desired property: For all and , with probability at least , .
Note that the argument used in this section to prove optimism (Lem. 3.3) significantly differs from the one used by Jaksch et al. (2010, UCRL) and Fruit et al. (2018b, SCAL). UCRL and SCAL compute a (nearly) optimal policy of an extended MDP that “contains” the true MDP (w.h.p.). This immediately implies that the gain of the extended MDP is bigger than (analogue property of Lem. 3.3). The main advantage of our argument compared to theirs is that it allows for a tighter optimism. To see why, note that the exploration bonus quantifies by how much is bigger than and approximately scales as . In contrast, UCRL and SCAL use an optimistic Bellman operator such that is bigger than by respectively (UCRL) and (SCAL). In other words, the optimism in SCAL+ is tighter by a multiplicative factor . A natural next step would be to investigate whether our argument could be extended to UCRL and SCAL in order to save for the optimism. We keep this open question for future work.
3.4 Regret Analysis of SCAL+
We now give the main result of this section: For any weakly communicating MDP such that , with probability at least it holds that for any , the regret of SCAL+ is bounded as
Unlike SCAL, SCAL+ does not have a regret scaling with , implying that whenever , SCAL+ performs worse than UCRL. SCAL builds an extended MDP that contains the true MDP, and therefore the shortest path in the extended MDP is shorter than the shortest path in the true MDP, implying that with being the solution returned by extended value iteration (Thm. 4 of Bartlett and Tewari (2009)). Letting be the solution returned by ScOpt on , it is not clear how to bound other than using the prior knowledge . This open question seems closely related to the one of Sec. 3.3 (i.e., how to have a tighter optimism in UCRL) and we also leave it for future work.
Proof sketch. We now provide a sketch of the main steps of the proof of Thm. 3.4 (the full proof is reported in App. 3.4). In order to preserve readability, in the following, all inequalities should be interpreted up to minor approximations and in high probability. Let be the number of visits in during episode and be the total number of episodes. By using the optimism proved in Sec. 3.3, we can decompose the regret as:
where and is the solution of ScOpt. The stopping condition of ScOpt applied to is such that (after manipulation): By plugging this inequality into (7) we obtain two terms: and . We can further decompose the scalar product as . The second term is negligible in the final regret since it is of order when summed over , and episodes (Fruit et al., 2018b, Eq. 56). On the other hand, the term is the dominant term of the regret and represents the error of using the estimated in place of in a step of value iteration. As shown in Sec. 3.3, we can start bounding the error of using in place of by . The remaining term is thus . Since depends on , we cannot apply Hoeffding-Azuma inequality as done in Sec. 3.3 for the design of . Instead we use a worst-case approach and bound separately and which will introduce a factor (by using Bernstein-Freedman inequality instead of Hoeffding-Azuma inequality). It is worth pointing out that only appears due to statistical fluctuations that we cannot control, and not from the optimism (i.e., exploration bonus) that is explicitly encoded in the algorithm. Concerning the reward, as shown in Sec. 3.3, we have that . As a consequence, we can approximately write that:
The proof follows by noticing that , thus all the remaining terms can be bounded as in (Fruit et al., 2018b).
Remarks. Given the fact that the optimism in SCAL+ is tighter than in SCAL by a factor , one might have expected to get a regret bound scaling as instead of , thus matching the lower bound of Jaksch et al. (2010) in terms of the dependency in . (From an algorithmic perspective we achieve the optimal dependence on , although this is not reflected in the regret bound.) Unfortunately, such a bound seems difficult to achieve with SCAL+ (and even SCAL) for the reason explained above.
On the other hand, Azar et al. (2017) and Kakade et al. (2018) achieved such an optimal dependence in finite-horizon problems. The main issue in extending such results is the different definition of the regret: their regret is defined as the difference between the value function at episode $k$ and the optimal one. It is not clear how to map their definition to ours without introducing a linear term in $T$. Concerning infinite-horizon undiscounted problems, Agrawal and Jia (2017) claimed to have obtained the optimal dependence in their optimistic posterior sampling approach. To achieve such a goal, they exploited the fact that . Unfortunately, as explained above, it is not possible to achieve such tight concentration by using a worst-case argument, as they do. As a result, optimistic PSRL would have a regret scaling as , while the improved bound of Agrawal and Jia (2017) should rather be considered a conjecture (the problem has been acknowledged by the authors via personal communication).
4 C-SCAL+: SCAL+ for continuous state space
We now consider an MDP with continuous state space and discrete action space . In general, it is impossible to approximate an arbitrary function with only a finite number of samples. As a result, we introduce the same smoothness assumption as Ortner and Ryabko (2012) (Hölder continuity):
There exist s.t. for any two states and any action :
4.1 The algorithm
In order to apply SCAL+ to a continuous problem, a natural idea is to discretize the state space as is done by Ortner and Ryabko (2012). We therefore partition into intervals defined as and for . The set of “aggregated” states is then (). As can be expected, we will see that the number of intervals will play a central role in the regret. Note that the terms and defined in Sec. 3 are still well-defined for and lying in but are except for a finite number of and (see Def. 4.2). For any subset , the sum is also well-defined as long as the collection contains only a finite number of non-zero elements. We can therefore define the aggregated counts, rewards and transition probabilities for all as: ,
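The state aggregation step can be sketched as below. Since the interval definitions are not legible in this section, the sketch assumes a one-dimensional state space $[0,1]$ split uniformly into closed-on-the-right intervals, with the first interval also containing $0$; this choice, the 0-based indexing and the function names are assumptions made for illustration.

```python
import math


def interval_index(s, n_intervals):
    """0-based index of the interval containing s, for s in [0, 1].

    Assumed partition: I_1 = [0, 1/S] and I_k = ((k-1)/S, k/S] for k = 2..S.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("state must lie in [0, 1]")
    if s == 0.0:
        return 0
    # ceil maps ((k-1)/S, k/S] to k; the min guards against rounding above S
    return min(math.ceil(s * n_intervals) - 1, n_intervals - 1)


def aggregate_counts(visited_states, n_intervals):
    """Aggregated visit counts: number of visited states falling in each interval."""
    counts = [0] * n_intervals
    for s in visited_states:
        counts[interval_index(s, n_intervals)] += 1
    return counts
```

Aggregated rewards and transition frequencies could be accumulated in the same way, by summing the per-state quantities over all visited states that fall in the same interval.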
Similarly to the discrete case, we define the exploration bonus of an aggregated state as
While is defined as in (5) on the discrete “aggregated” MDP, the main difference with the discrete bonus (5) is an additional term that accounts for the fact that the states that we aggregate are not completely identical but have parameters that differ by at most . We pick an arbitrary reference aggregated state and define the “aggregated” (discrete) analogue of defined in Sec. 3, where and