# Revisiting Exploration-Conscious Reinforcement Learning

The objective of Reinforcement Learning is to learn an optimal policy by performing actions and observing their long term consequences. Unfortunately, acquiring such a policy can be a hard task. More severely, since one cannot tell if a policy is optimal, there is a constant need for exploration. This is known as the Exploration-Exploitation trade-off. In practice, this trade-off is resolved by using some inherent exploration mechanism, such as the ϵ-greedy exploration, while still trying to learn the optimal policy. In this work, we take a different approach. We define a surrogate optimality objective: an optimal policy with respect to the exploration scheme. As we show throughout the paper, although solving this criterion does not necessarily lead to an optimal policy, the problem becomes easier to solve. We continue by analyzing this notion of optimality, devise algorithms derived from this approach, which reveal connections to existing work, and test them empirically on tabular and deep Reinforcement Learning domains.


## 1 Introduction

The main goal of Reinforcement Learning (RL) (Sutton et al., 1998) is to find an optimal policy for a given decision problem. A major difficulty arises due to the Exploration-Exploitation tradeoff, which characterizes the omnipresent tension between exploring new actions and exploiting the knowledge acquired so far. A considerable line of work has been devoted to dealing with this tradeoff. Algorithms that explicitly balance exploration and exploitation were developed for tabular RL (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Jaksch et al., 2010; Osband et al., 2013). However, generalizing these results to approximate RL, i.e., when using function approximation, remains an open problem. On the practical side, recent works combined more advanced exploration schemes in approximate RL (e.g., Bellemare et al. (2016); Fortunato et al. (2017)), inspired by the theory of tabular RL. Nonetheless, even in the presence of more advanced mechanisms, ϵ-greedy exploration is still applied (Bellemare et al., 2017; Dabney et al., 2018; Osband et al., 2016). More generally, the traditional and simpler ϵ-greedy scheme (Sutton et al., 1998; Asadi & Littman, 2016) in discrete RL, and Gaussian action noise in continuous RL, are still very useful and popular in practice (Mnih et al., 2015, 2016; Silver et al., 2014; Schulman et al., 2017; Horgan et al., 2018), especially due to their simplicity.

These types of exploration schemes share common properties. First, they all fix some exploration parameter beforehand, e.g., $\epsilon$, the ‘inverse temperature’ $\beta$, or the action variance $\sigma$ for the ϵ-greedy, soft-max and Gaussian exploration schemes, respectively. By doing so, the balance between exploration and exploitation is set. Second, they all explore using a random policy, and exploit using the current estimate of the optimal policy. In this work, we follow a different approach when using these fixed exploration schemes: exploiting by using an estimate of the optimal policy w.r.t. the exploration mechanism.

Exploration-Consciousness is the main reason for the improved performance of on-policy methods like Sarsa and Expected-Sarsa (Van Seijen et al., 2009) over Q-learning during training (Sutton et al., 1998)[Example 6.6: Cliff Walking]. Imagine a simple Cliff-Walking problem: the goal of the agent is to reach the end without falling off the cliff, where the optimal policy is to go alongside the cliff. While using a fixed exploration scheme, playing a near-optimal policy which goes alongside the cliff will lead to significantly sub-optimal performance. This, in turn, will hurt the acquisition of new experience needed to learn the optimal policy. However, learning to act optimally w.r.t. the exploration scheme can mitigate this difficulty; the agent learns to reach the goal while keeping a safe enough distance from the cliff.

In the past, tabular q-learning-like exploration-conscious algorithms were suggested (John, 1994; Littman et al., 1997; Van Seijen et al., 2009). Here we take a different approach, and focus on exploration conscious policies. The main contributions of this work are as follows:

• We define exploration-conscious optimization criteria, for discrete and continuous action spaces. The criteria are interpreted as finding an optimal policy within a restricted set of policies. Both, we show, can be reduced to solving a surrogate MDP. The surrogate MDP approach, to the best of our knowledge, is a new one, and serves us repeatedly in this work.

• We formalize a bias-error sensitivity tradeoff. The solutions are biased w.r.t. the optimal policy, yet, are less sensitive to approximation errors.

• We establish two fundamental approaches to practically solve Exploration-Conscious optimization problems. Based on these, we formulate algorithms in discrete and continuous action spaces, and empirically test the algorithms on Atari and MuJoCo domains.

## 2 Preliminaries

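Our framework is the infinite-horizon discounted Markov Decision Process (MDP). An MDP is defined as the 5-tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma)$ (Puterman, 1994), where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a compact action space, $P\equiv P(s'\mid s,a)$ is a transition kernel, $r\equiv r(s,a)$ is a bounded reward function, and $\gamma\in[0,1)$. Let $\pi:\mathcal{S}\to\mathcal{P}(\mathcal{A})$ be a stationary policy, where $\mathcal{P}(\mathcal{A})$ is a probability distribution on $\mathcal{A}$, and denote $\Pi$ as the set of deterministic policies, $\pi\in\Pi:\mathcal{S}\to\mathcal{A}$. Let $v^\pi$ be the value of a policy $\pi$, defined in state $s$ as $v^\pi(s)\equiv\mathbb{E}^\pi\big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\mid s_0=s\big]$, where $a_t\sim\pi(s_t)$, and $\mathbb{E}^\pi$ denotes expectation w.r.t. the distribution induced by $\pi$ and conditioned on the event $\{s_0=s\}$. It is known that $v^\pi=\sum_{t=0}^{\infty}\gamma^t(P^\pi)^t r^\pi=(I-\gamma P^\pi)^{-1}r^\pi$, with the component-wise values $[P^\pi]_{s,s'}\equiv\mathbb{E}_{a\sim\pi}[P(s'\mid s,a)]$ and $[r^\pi]_s\equiv\mathbb{E}_{a\sim\pi}[r(s,a)]$. Furthermore, the $q$-function of $\pi$ is given by $q^\pi(s,a)=r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)v^\pi(s')$, and represents the value of taking an action $a$ from state $s$ and then using the policy $\pi$.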

Usually, the goal is to find $\pi^*$ yielding the optimal value, and the optimal value is $v^*=v^{\pi^*}$. It is known that an optimal deterministic policy always exists (Puterman, 1994). To achieve this goal the following classical operators are defined (with equalities holding component-wise):

$$T^\pi v=r^\pi+\gamma P^\pi v,\qquad Tv=\max_\pi T^\pi v, \tag{1}$$

$$\mathcal{G}(v)=\{\pi: T^\pi v=Tv\}, \tag{2}$$

where $T^\pi$ is a linear operator, $T$ is the optimal Bellman operator and both $T^\pi$ and $T$ are $\gamma$-contraction mappings w.r.t. the max norm. It is known that the unique fixed points of $T^\pi$ and $T$ are $v^\pi$ and $v^*$, respectively. $\mathcal{G}(v)$ is the standard set of 1-step greedy policies w.r.t. $v$. Furthermore, given $v^*$, the set $\mathcal{G}(v^*)$ coincides with that of stationary optimal policies. It is also useful to define the $q$-optimal Bellman operator, which is a $\gamma$-contraction, with fixed point $q^*$.

$$T^q q(s,a)=r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\max_{a'}q(s',a'). \tag{3}$$
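As an illustration of how the contraction property of the $q$-optimal Bellman operator (3) is used in practice, the sketch below repeatedly applies the operator to a tabular $q$ until it stops changing. The two-state MDP, its rewards, and the function names are hypothetical and chosen only to keep the example small.

```python
# Toy MDP (hypothetical): 2 states, 2 actions.
# P[s][a] is a list of (next_state, probability); R[s][a] is the reward r(s, a).
P = {
    0: {0: [(0, 1.0)], 1: [(1, 0.9), (0, 0.1)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
GAMMA = 0.9


def bellman_q(q):
    """Apply the q-optimal Bellman operator (3) once and return the new table."""
    return {
        s: {
            a: R[s][a] + GAMMA * sum(p * max(q[s2].values()) for s2, p in P[s][a])
            for a in P[s]
        }
        for s in P
    }


def q_value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate the operator until the max-norm change drops below tol."""
    q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(max_iter):
        new_q = bellman_q(q)
        gap = max(abs(new_q[s][a] - q[s][a]) for s in P for a in P[s])
        q = new_q
        if gap < tol:
            break
    return q


q_star = q_value_iteration()
# Greedy policy w.r.t. the (approximate) fixed point.
greedy = {s: max(q_star[s], key=q_star[s].get) for s in q_star}
```

Because the operator is a $\gamma$-contraction in the max norm, the change between successive iterates shrinks geometrically, which is why a simple tolerance check suffices as a stopping rule here.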

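In this work, the use of mixture policies is abundant. We denote the $\alpha\in[0,1]$-convex mixture of policies $\pi_1$ and $\pi_2$ by $\pi^\alpha(\pi_1,\pi_2)\equiv(1-\alpha)\pi_1+\alpha\pi_2$. Importantly, $\pi^\alpha(\pi_1,\pi_2)$ can be interpreted as a stochastic policy s.t. w.p. $(1-\alpha)$ the agent acts with $\pi_1$ and w.p. $\alpha$ acts with $\pi_2$.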
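The two readings of a mixture policy (a single blended distribution, or a coin flip between two policies) can be sketched as follows. The dictionary-based policy representation and the names `mix` and `sample_action` are assumptions made for this example only.

```python
import random


def mix(pi_1, pi_2, alpha):
    """Return the alpha-convex mixture (1 - alpha) * pi_1 + alpha * pi_2.

    Each policy maps a state to a dict {action: probability}.
    """
    mixed = {}
    for s in pi_1:
        actions = set(pi_1[s]) | set(pi_2[s])
        mixed[s] = {
            a: (1 - alpha) * pi_1[s].get(a, 0.0) + alpha * pi_2[s].get(a, 0.0)
            for a in actions
        }
    return mixed


def sample_action(pi_1, pi_2, alpha, s, rng):
    """Two-stage sampling: use pi_2 w.p. alpha, otherwise pi_1, then draw an action."""
    source = pi_2 if rng.random() < alpha else pi_1
    actions = list(source[s])
    weights = [source[s][a] for a in actions]
    return rng.choices(actions, weights=weights)[0]


# Hypothetical example: a deterministic policy mixed with a uniform one.
pi_det = {"s0": {"left": 1.0}}
pi_unif = {"s0": {"left": 0.5, "right": 0.5}}
pi_mix = mix(pi_det, pi_unif, alpha=0.2)  # left: 0.9, right: 0.1
```

Both views induce the same action distribution in every state; the two-stage view is the one that matches how exploration noise is usually injected at run time.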

## 3 The α-optimal criterion

In this section, we define the notion of an α-optimal policy w.r.t. a policy, $\pi_0$. We then claim that finding an α-optimal policy can be done by solving a surrogate MDP. We continue by defining the surrogate MDP, and analyze some basic properties of the α-optimal policy.

Let $\alpha\in[0,1]$. We define $\pi^*_{\alpha,\pi_0}$ to be the α-optimal policy w.r.t. $\pi_0$; it is contained in the following set,

$$\pi^*_{\alpha,\pi_0}\in\arg\max_{\pi'\in\Pi}\mathbb{E}^{\pi^\alpha(\pi',\pi_0)}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\Big], \tag{4}$$

or, $\pi^*_{\alpha,\pi_0}\in\arg\max_{\pi'\in\Pi}v^{\pi^\alpha(\pi',\pi_0)}$, where $a_t\sim\pi^\alpha(\pi',\pi_0)$ and $\pi^\alpha(\pi',\pi_0)=(1-\alpha)\pi'+\alpha\pi_0$ is the α-convex mixture of $\pi'$ and $\pi_0$, and thus a probability distribution. For brevity, we omit the subscript $\pi_0$, and denote the α-optimal policy by $\pi^*_\alpha$ throughout the rest of the paper. The α-optimal value (w.r.t. $\pi_0$) is $v^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$, the value of the policy $\pi^\alpha(\pi^*_\alpha,\pi_0)$. In the following, we will see the problem is equivalent to solving a surrogate MDP, for which an optimal deterministic policy is known to exist. Thus, there is no loss in optimizing over the set of deterministic policies $\Pi$.

Optimization problem (4) can be viewed as optimizing over a restricted set of policies: all policies that are a convex combination of $\pi_0$ with a fixed $\alpha$. Naturally, we can consider in (4) a state-dependent $\alpha(s)$ as well, and some of the results in this work will consider this scenario. In other words, $\pi^*_\alpha$ is the best policy an agent can act with, if it plays w.p. $(1-\alpha)$ according to $\pi^*_\alpha$, and w.p. $\alpha$ according to $\pi_0$, where $\pi_0$ can be any policy. The relation to the ϵ-greedy exploration setup becomes clear when $\pi_0$ is a uniform distribution on the actions, and we set $\epsilon$ instead of $\alpha$. Then, $\pi^*_\alpha$ is optimal w.r.t. the ϵ-greedy exploration scheme; the policy would have the largest accumulated reward, relative to all other policies, when acting in an ϵ-greedy fashion w.r.t. it.

We choose to name the policy the α- and not the ϵ-optimal policy to prevent confusion with other frameworks. The ϵ-optimal policy is a notation used in the context of PAC-MDP type of analysis (Strehl et al., 2009), and has a different meaning than the objective in this work (4).

### 3.1 The α-optimal Bellman operator, α-optimal policy and policy improvement

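In the previous section, we defined the α-optimal policy and the α-optimal value, $\pi^*_\alpha$ and $v^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$, respectively. We start this section by observing that problem (4) can be viewed as solving a surrogate MDP, denoted by $\mathcal{M}_\alpha$. We define the Bellman operators of the surrogate MDP, and use them to prove an important improvement property.

Define the surrogate MDP as $\mathcal{M}_\alpha=(\mathcal{S},\mathcal{A},P_\alpha,r_\alpha,\gamma)$, where

$$\forall a\in\mathcal{A},\quad r_\alpha(s,a)=(1-\alpha)r(s,a)+\alpha r^{\pi_0}(s),\qquad P_\alpha(s'\mid s,a)=(1-\alpha)P(s'\mid s,a)+\alpha P^{\pi_0}(s'\mid s), \tag{5}$$

are its reward and dynamics, and the rest of its ingredients are the same as in $\mathcal{M}$. We denote the value of a policy $\pi$ on $\mathcal{M}_\alpha$ by $v^\pi_\alpha$, and the optimal value on $\mathcal{M}_\alpha$ by $v^*_\alpha$. The following simple lemma relates the value of a policy $\pi$, measured on $\mathcal{M}$ and $\mathcal{M}_\alpha$ (see proof in Appendix D).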

###### Lemma 1.

For any policy $\pi$, $v^\pi_\alpha=v^{\pi^\alpha(\pi,\pi_0)}$. Thus, an optimal policy on $\mathcal{M}_\alpha$ is the α-optimal policy $\pi^*_\alpha$.
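The lemma can be checked numerically on a small example. The sketch below builds the surrogate reward and dynamics from (5), evaluates a policy on the surrogate MDP, and compares it with the value of the corresponding mixture policy on the original MDP. The toy MDP, the uniform exploration policy, and all function names are assumptions made for illustration.

```python
GAMMA = 0.9
ALPHA = 0.3
N_S, N_A = 2, 2
# Hypothetical toy MDP: P[s][a][s2] transition probabilities, R[s][a] rewards.
P = [[[1.0, 0.0], [0.1, 0.9]], [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 1.0], [0.0, 2.0]]
PI_0 = [[0.5, 0.5], [0.5, 0.5]]  # exploration policy pi_0(a | s), uniform here


def surrogate(P, R, pi_0, alpha):
    """Build (P_alpha, r_alpha) of the surrogate MDP following (5)."""
    r0 = [sum(pi_0[s][a] * R[s][a] for a in range(N_A)) for s in range(N_S)]
    p0 = [[sum(pi_0[s][a] * P[s][a][s2] for a in range(N_A)) for s2 in range(N_S)]
          for s in range(N_S)]
    r_a = [[(1 - alpha) * R[s][a] + alpha * r0[s] for a in range(N_A)]
           for s in range(N_S)]
    p_a = [[[(1 - alpha) * P[s][a][s2] + alpha * p0[s][s2] for s2 in range(N_S)]
            for a in range(N_A)] for s in range(N_S)]
    return p_a, r_a


def evaluate(P, R, pi, iters=2000):
    """Value of a stochastic policy pi(a | s) by iterating v <- r_pi + gamma * P_pi v."""
    v = [0.0] * N_S
    for _ in range(iters):
        v = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2]
                                                    for s2 in range(N_S)))
                 for a in range(N_A)) for s in range(N_S)]
    return v


pi = [[0.0, 1.0], [0.0, 1.0]]  # deterministic policy: always action 1
mixture = [[(1 - ALPHA) * pi[s][a] + ALPHA * PI_0[s][a] for a in range(N_A)]
           for s in range(N_S)]

P_alpha, r_alpha = surrogate(P, R, PI_0, ALPHA)
v_on_surrogate = evaluate(P_alpha, r_alpha, pi)  # value of pi on M_alpha
v_of_mixture = evaluate(P, R, mixture)           # value of the mixture on M
```

The two value vectors agree up to floating-point error, which is the content of the lemma: the exploration noise can be moved from the policy into the environment without changing the value.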

The fixed-policy and optimal Bellman operators of $\mathcal{M}_\alpha$ are denoted by $T^\pi_\alpha$ and $T_\alpha$, respectively. Again, for brevity we omit $\pi_0$ from the definitions. Notice that $T^\pi_\alpha$ and $T_\alpha$ are $\gamma$-contractions, being Bellman operators of a $\gamma$-discounted MDP. The following proposition relates $T^\pi_\alpha$ and $T_\alpha$ to the Bellman operators of the original MDP, $\mathcal{M}$. Furthermore, it stresses a non-trivial relation between the α-optimal policy $\pi^*_\alpha$ and the α-optimal value, $v^*_\alpha$.

###### Proposition 2.

The following claims hold for any policy :

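1. $T^\pi_\alpha=(1-\alpha)T^\pi+\alpha T^{\pi_0}$, with fixed point $v^\pi_\alpha=v^{\pi^\alpha(\pi,\pi_0)}$.

2. $T_\alpha=(1-\alpha)T+\alpha T^{\pi_0}$, with fixed point $v^*_\alpha=v^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$.

3. An α-optimal policy is an optimal policy of $\mathcal{M}_\alpha$ and is greedy w.r.t. $v^*_\alpha$, i.e., $\pi^*_\alpha\in\mathcal{G}(v^*_\alpha)$.

In previous works, e.g. (Asadi & Littman, 2016), the operator $(1-\epsilon)T+\epsilon T^{\pi_0}$ was referred to as the ϵ-greedy operator. Proposition 2 shows this operator is $T_\alpha$ (with $\alpha=\epsilon$), the optimal Bellman operator of the defined surrogate MDP $\mathcal{M}_\alpha$. This proposition leads to the following important property.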

###### Proposition 3.

Let $\alpha\in[0,1)$, $\beta\in[0,\alpha]$, $\pi_0$ be a policy, and $\pi^*_\alpha$ be the α-optimal policy w.r.t. $\pi_0$. Then, $v^{\pi_0}\le v^{\pi^\alpha(\pi^*_\alpha,\pi_0)}\le v^{\pi^\beta(\pi^*_\alpha,\pi_0)}$, with equality iff $v^{\pi_0}=v^*$.

The first relation, that $v^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$ is better than $v^{\pi_0}$, is trivial and holds by definition (4). The non-trivial statement is the second one. It asserts that given $\pi^*_\alpha$, it is worthwhile to use the mixture policy $\pi^\beta(\pi^*_\alpha,\pi_0)$ with $\beta\le\alpha$, i.e., to use $\pi_0$ with smaller probability. Specifically, better performance, compared to $\pi^\alpha(\pi^*_\alpha,\pi_0)$, is assured when using the deterministic policy $\pi^*_\alpha$, by setting $\beta=0$.

In Section 6, we demonstrate the empirical consequences of the improvement property, which, to our knowledge, has not yet been stated. Furthermore, the improvement property is unique to the defined optimization criterion (4). We will show that alternative definitions of exploration-conscious criteria do not necessarily have this property. Moreover, one can use Proposition 3 to generalize the notion of the 1-step greedy policy (2), as was done in Efroni et al. (2018) with multiple-step greedy improvement. We leave studying this generalization and its Policy Iteration scheme for future work, and focus on solving (4) a single time.
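The improvement property can be observed on a toy problem: after solving for the α-optimal policy, playing the exploration policy with a smaller probability does not decrease the value. The sketch below is an illustration under assumed dynamics and rewards (a hypothetical two-state MDP with a uniform exploration policy), not the experimental setup of this work.

```python
GAMMA, ALPHA = 0.9, 0.4
N_S, N_A = 2, 2
P = [[[1.0, 0.0], [0.1, 0.9]], [[1.0, 0.0], [0.0, 1.0]]]  # hypothetical toy MDP
R = [[0.0, 1.0], [0.0, 2.0]]
PI_0 = [[0.5, 0.5], [0.5, 0.5]]  # uniform exploration policy


def backup(v, s, a):
    """One-step lookahead r(s, a) + gamma * sum_s' P(s' | s, a) v(s')."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))


def evaluate(pi, iters=2000):
    """Value of a stochastic policy pi(a | s) on the original MDP."""
    v = [0.0] * N_S
    for _ in range(iters):
        v = [sum(pi[s][a] * backup(v, s, a) for a in range(N_A)) for s in range(N_S)]
    return v


def alpha_optimal_policy(alpha, iters=2000):
    """Value iteration with (1 - alpha) * max + alpha * E_{pi_0}, then act greedily."""
    v = [0.0] * N_S
    for _ in range(iters):
        v = [(1 - alpha) * max(backup(v, s, a) for a in range(N_A))
             + alpha * sum(PI_0[s][a] * backup(v, s, a) for a in range(N_A))
             for s in range(N_S)]
    return [max(range(N_A), key=lambda a: backup(v, s, a)) for s in range(N_S)]


def mixture(greedy, beta):
    """Mixture of a deterministic policy with pi_0, as a table of probabilities."""
    return [[(1 - beta) * (1.0 if a == greedy[s] else 0.0) + beta * PI_0[s][a]
             for a in range(N_A)] for s in range(N_S)]


pi_star_alpha = alpha_optimal_policy(ALPHA)
# beta = 1 plays pi_0 only; beta = 0 plays the alpha-optimal policy deterministically.
values = {beta: evaluate(mixture(pi_star_alpha, beta)) for beta in (1.0, ALPHA, 0.2, 0.0)}
```

In this example the values increase component-wise as the mixing weight goes from 1 down to 0, which is the ordering the proposition describes.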

### 3.2 Performance bounds in the presence of approximations

We now consider an approximate setting and quantify a bias-error sensitivity tradeoff in $v^{\pi^\alpha(\hat{\pi}^*_\alpha,\pi_0)}$, where $\hat{\pi}^*_\alpha$ is an approximated α-optimal policy. We formalize an intuitive argument: as $\alpha$ increases, the bias relative to the optimal policy increases. Yet, the sensitivity to errors decreases, since the agent uses $\pi_0$ w.p. $\alpha$ regardless of errors.

###### Definition 1.

Let $v^*$ be the optimal value of an MDP, $\mathcal{M}$. We define $L(s)\equiv v^*(s)-T^{\pi_0}v^*(s)$ to be the Lipschitz constant w.r.t. $\pi_0$ of the MDP at state $s$. We further define the upper bound on the Lipschitz constant $L\equiv\max_s L(s)$.

Definition 1 defines the ‘Lipschitz’ property of the optimal value, $v^*$. Intuitively, $L(s)$ quantifies a degree of ‘smoothness’ of the optimal value. A small value of $L(s)$ indicates that if one acts according to $\pi_0$ once and then continues playing the optimal policy from state $s$, a great loss will not occur. Large values of $L(s)$ indicate that using $\pi_0$ from state $s$ leads to an irreparable outcome (e.g., falling off a cliff). The following theorem formalizes a bias-error sensitivity tradeoff. As $\alpha$ increases, the bias increases, while the sensitivity to errors decreases (see proof in Appendix G).

###### Theorem 4.

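Let $\alpha\in[0,1]$. Assume $\hat{v}^*_\alpha$ is an approximate α-optimal value s.t. $\|v^*_\alpha-\hat{v}^*_\alpha\|\le\delta$ for some $\delta\ge 0$. Let $\hat{\pi}^*_\alpha$ be the greedy policy w.r.t. $\hat{v}^*_\alpha$, $\hat{\pi}^*_\alpha\in\mathcal{G}(\hat{v}^*_\alpha)$. Then, the performance relative to the optimal policy is bounded by,

$$\big\|v^*-v^{\pi^\alpha(\hat{\pi}^*_\alpha,\pi_0)}\big\|\le\underbrace{\frac{\alpha L}{1-\gamma}}_{\text{Bias}}+\underbrace{\frac{2(1-\alpha)\gamma\delta}{1-\gamma}}_{\text{Sensitivity}}.$$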

When the bias of the α-optimal value relative to the optimal one is small, solving (4) does not lead to a great loss relative to the optimal performance. The bias can be bounded by the ‘Lipschitz’ property of the MDP. For a state-dependent $\alpha(s)$, the bias bound changes to be dependent on $\max_s\alpha(s)L(s)$. This highlights the importance of prior knowledge when using (4). Choosing $\pi_0$ (possibly state-wise) s.t. $L(s)$ is small allows the use of a bigger $\alpha$ while the bias stays small. The sensitivity term upper bounds the performance of $\pi^\alpha(\hat{\pi}^*_\alpha,\pi_0)$ relative to the α-optimal value, and is less sensitive to errors as $\alpha$ increases.

The bias term is derived by using the structure of $\mathcal{M}_\alpha$, and is not a direct application of the Simulation Lemma (Kearns & Singh, 2002; Strehl et al., 2009); applying it would lead to a different bias term. The sensitivity term generalizes (Bertsekas & Tsitsiklis, 1995)[Proposition 6.1] by using a modified proof technique. There, the $(1-\alpha)$ factor does not exist.
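To make the tradeoff concrete, the two terms of the bound can be tabulated for a few values of the mixing weight. The numbers below are hypothetical and serve only to show the direction in which each term moves; the function names are likewise assumptions for this sketch.

```python
def bias_term(alpha, lipschitz, gamma):
    """Bias part of the bound: alpha * L / (1 - gamma)."""
    return alpha * lipschitz / (1.0 - gamma)


def sensitivity_term(alpha, delta, gamma):
    """Sensitivity part of the bound: 2 * (1 - alpha) * gamma * delta / (1 - gamma)."""
    return 2.0 * (1.0 - alpha) * gamma * delta / (1.0 - gamma)


def performance_bound(alpha, lipschitz, delta, gamma):
    """Sum of the bias and sensitivity terms."""
    return bias_term(alpha, lipschitz, gamma) + sensitivity_term(alpha, delta, gamma)


# Hypothetical numbers, chosen only to show the trend in alpha.
GAMMA, L, DELTA = 0.9, 0.5, 1.0
table = [(a, bias_term(a, L, GAMMA), sensitivity_term(a, DELTA, GAMMA))
         for a in (0.0, 0.1, 0.5, 1.0)]
```

With these numbers the bias grows linearly from zero while the sensitivity shrinks linearly to zero, so the mixing weight that minimizes the bound depends on how the Lipschitz constant compares with the approximation error.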

## 4 Exploration-Conscious Continuous Control

The α-greedy approach from Section 3 relies on an exploration mechanism which is fixed beforehand: $\alpha$ and $\pi_0$ are fixed, and an optimal policy w.r.t. them is calculated (4). However, in continuous control RL algorithms, such as DDPG and PPO (Lillicrap et al., 2015; Schulman et al., 2017), a different approach is used. Usually, a policy is learned, and the exploration noise is injected by perturbing the policy, e.g., by adding Gaussian noise to it.

We start this section by defining an exploration-conscious optimality criterion that captures such perturbation for the simple case of Gaussian noise. Then, results from Section 3 are adapted to the newly defined criterion, while highlighting commonalities and differences relatively to (4). As in Section 3, we define an appropriate surrogate MDP and we show it can be solved by the usual machinery of Bellman operators. Unlike Section 3, we show that improvement when decreasing the stochasticity does not generally hold. In Appendix J, we prove a bias - error sensitivity tradeoff for the simpler case of optimal policy w.r.t. continuous uniform noise. Yet, we believe that a similar in spirit result can be derived for the class of optimal Gaussian policies.

Instead of restricting the set of policies to the one defined in (4), we restrict our set of policies to be the set of Gaussian policies with a fixed variance. Formally, we wish to find the optimal deterministic policy in this set,

$$\mu^*_\sigma\in\arg\max_{\mu\in\Pi}\mathbb{E}^{\pi_{\mu,\sigma}}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\Big], \tag{6}$$

where $a_t\sim\pi_{\mu,\sigma}(\cdot\mid s_t)$, and $\pi_{\mu,\sigma}(\cdot\mid s)=\mathcal{N}(\mu(s),\sigma)$ is a Gaussian policy with mean $\mu(s)$ and a fixed variance $\sigma$. We name $\mu^*_\sigma$ and $\pi_{\mu^*_\sigma,\sigma}$ the mean and σ-optimal policy, respectively. As in (4), we show in the following that solving (6) is equivalent to solving a surrogate MDP. Thus, an optimal policy can always be found in the deterministic class of policies $\Pi$; a mixture of Gaussians would not lead to better performance in (6).

Similarly to (5), we define a surrogate MDP w.r.t. the Gaussian noise and relate it to values of Gaussian policies on the original MDP $\mathcal{M}$. Then, we characterize its Bellman operators and thus establish it can be solved using Dynamic Programming. Define the surrogate MDP as $\mathcal{M}_\sigma=(\mathcal{S},\mathcal{A},P_\sigma,r_\sigma,\gamma)$. For every $a\in\mathcal{A}$,

$$r_\sigma(s,a)=\int_{\mathcal{A}}\mathcal{N}(a';a,\sigma)\,r(s,a')\,da',\qquad P_\sigma(s'\mid s,a)=\int_{\mathcal{A}}\mathcal{N}(a';a,\sigma)\,P(s'\mid s,a')\,da', \tag{7}$$

are its reward and dynamics, and we denote the value of a policy $\mu$ on $\mathcal{M}_\sigma$ by $v^\mu_\sigma$. The following results correspond to Lemma 1 and Proposition 2 for the class of Gaussian policies.
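The smoothed reward in (7) can be approximated numerically for a one-dimensional action. The sketch below uses a plain trapezoidal rule and a hypothetical reward with a "cliff" to show how Gaussian smoothing penalizes actions that sit right at the edge. The reward function, the integration range, and the function names are assumptions made for this example; the `var` argument is the variance, matching the role of $\sigma$ in the text.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var); var is the variance, matching the text's sigma."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def smoothed_reward(reward, a, var, half_width=8.0, n=4001):
    """Approximate r_sigma(a) = integral of N(a'; a, var) * reward(a') da'.

    Uses the trapezoidal rule on [a - half_width * std, a + half_width * std].
    """
    std = math.sqrt(var)
    lo, hi = a - half_width * std, a + half_width * std
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * gaussian_pdf(x, a, var) * reward(x)
    return total * h


def cliff_reward(a):
    """Hypothetical 1-D reward: grows with a until a 'cliff' at a = 1."""
    return a if a <= 1.0 else -10.0


at_edge = smoothed_reward(cliff_reward, a=1.0, var=0.04)
backed_off = smoothed_reward(cliff_reward, a=0.5, var=0.04)
# Sanity check against a closed form: E[-x^2] = -(a^2 + var) for x ~ N(a, var).
quadratic = smoothed_reward(lambda x: -x * x, a=0.5, var=0.04)
```

Without noise the edge action `a = 1.0` has the highest reward, but under the smoothed reward the backed-off action is preferable, mirroring the Cliff-Walking intuition from the introduction.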

###### Lemma 5.

For any policy $\mu$, $v^\mu_\sigma=v^{\pi_{\mu,\sigma}}$. Thus, an optimal policy on $\mathcal{M}_\sigma$ is the mean optimal policy $\mu^*_\sigma$.

###### Proposition 6.

Let be a mixture of Gaussian policies. Then, the following holds:

1. , with fixed point .

2. , with fixed point .

3. The mean -optimal policy is an optimal policy of and,

Surprisingly, given a σ-optimal policy mean $\mu^*_\sigma$, an improvement is not assured when lowering the stochasticity by decreasing $\sigma$ in $\pi_{\mu^*_\sigma,\sigma}$. This is in contrast to Proposition 3 and highlights its uniqueness (proof in Appendix I).

###### Proposition 7.

Let and let be the mean -optimal policy. There exists an MDP s.t .

In Appendix J, we prove a bias - error sensitivity tradeoff (21) (as in Theorem 4) for the class of optimal policies w.r.t. uniform noise, while establishing similar results to the ones in this section. We believe a bias - error sensitivity tradeoff exists for the class of optimal policies w.r.t. Gaussian noise as well, and leave the details for future work.

## 5 Algorithms

In this section, we offer two fundamental approaches to solve exploration conscious criteria using sample-based algorithms: the Expected and Surrogate approaches. For both, we formulate converging, q-learning-like, algorithms. Next, by adapting DDPG, we show the two approaches can be used in exploration-conscious continuous control as well.

Consider any fixed exploration scheme. Generally, these schemes operate in two stages: (i) Choose a greedy action, $a_{\text{chosen}}$. (ii) Based on $a_{\text{chosen}}$ and some randomness generator, choose an action to be applied on the environment, $a_{\text{env}}$. E.g., for ϵ-greedy exploration, w.p. $1-\epsilon$ the agent acts with $a_{\text{chosen}}$, otherwise, with a random uniform policy. In RL, the common update rules use $a_{\text{env}}$, and the saved experience is $(s,a_{\text{env}},r,s')$. In the following we motivate the use of $a_{\text{chosen}}$, and view the data as $(s,a_{\text{chosen}},a_{\text{env}},r,s')$.

The two approaches characterized in the following are based on two inequivalent ways to define the $q$-function. For the Expected approach the $q$-function is defined as usual: $q^\pi(s,a)$ represents the value obtained when taking an action $a=a_{\text{env}}$ and then acting with $\pi$, meaning $a$ is the action chosen in step (ii). Alternatively, for the Surrogate approach, the $q$-function is defined on the ‘Surrogate’ MDP, i.e., the exploration is viewed as stochasticity of the environment. Then, $q(s,a)$ is the value obtained when $a$ is the action of step (i), i.e., choosing action $a=a_{\text{chosen}}$.
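The two-stage view of exploration can be sketched for the ϵ-greedy case as follows. The function names and the `env_step` callable are hypothetical; the point is only that both the greedy action and the executed action are recorded.

```python
import random


def epsilon_greedy_step(q_row, epsilon, rng):
    """Return (a_chosen, a_env) for one state.

    q_row: dict {action: q-value} for the current state.
    Stage (i): a_chosen is the greedy action.
    Stage (ii): w.p. 1 - epsilon play a_chosen, otherwise a uniform action.
    """
    a_chosen = max(q_row, key=q_row.get)
    if rng.random() < epsilon:
        a_env = rng.choice(list(q_row))
    else:
        a_env = a_chosen
    return a_chosen, a_env


def make_transition(s, q_row, epsilon, env_step, rng):
    """Build the 5-tuple (s, a_chosen, a_env, r, s_next).

    env_step is a hypothetical callable (s, a) -> (r, s_next).
    """
    a_chosen, a_env = epsilon_greedy_step(q_row, epsilon, rng)
    r, s_next = env_step(s, a_env)
    return (s, a_chosen, a_env, r, s_next)
```

Storing both actions costs one extra field per transition and lets the same buffer serve either way of defining the $q$-function.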

### 5.1 Exploration Conscious Q-Learning

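We focus on solving the α-optimal policy (4), and formulate $q$-learning-like algorithms using the two aforementioned approaches. The Expected α-optimal $q$-function is,

$$q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}(s,a)\triangleq r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,v^*_\alpha(s'). \tag{8}$$

Indeed, $q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$ is the usually defined $q$-function of the policy $\pi^\alpha(\pi^*_\alpha,\pi_0)$ on an MDP $\mathcal{M}$. Here, the action $a$ represents the actual performed action, $a_{\text{env}}$. By relating $q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$ to $v^*_\alpha$ it can be easily verified that $q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$ satisfies the fixed point equation (see Appendix K),

$$\begin{aligned}q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}(s,a)={}&r(s,a)+\gamma(1-\alpha)\sum_{s'}P(s'\mid s,a)\max_{a'}q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}(s',a')\\&+\gamma\alpha\sum_{s',a'}P(s'\mid s,a)\,\pi_0(a'\mid s')\,q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}(s',a').\end{aligned} \tag{9}$$

Alternatively, consider the optimal $q$-function of the surrogate MDP $\mathcal{M}_\alpha$ (5). It satisfies the fixed-point equation

$$q^*_\alpha(s,a)\triangleq r_\alpha(s,a)+\gamma\sum_{s'}P_\alpha(s'\mid s,a)\max_{a'}q^*_\alpha(s',a').$$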

The following lemma formalizes the relation between the two $q$-functions, and shows they are related by a function of the state, and not of the action.

###### Lemma 8.

The α-optimal policy $\pi^*_\alpha$ is also an optimal policy of $\mathcal{M}_\alpha$ (Lemma 1). Thus, it is greedy w.r.t. $q^*_\alpha$, the optimal $q$ of $\mathcal{M}_\alpha$. By Proposition 2.3 it is also greedy w.r.t. $q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$, i.e.,

$$\pi^*_\alpha(s)\in\arg\max_{a'}q^*_\alpha(s,a')=\arg\max_{a'}q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}(s,a').$$

Lemma 8 describes this fact by different means; the two $q$-functions are related by a function of the state and, thus, the greedy action w.r.t. each is equal. Furthermore, it stresses the fact that the two $q$-functions are not equal.
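One way to see why the two $q$-functions differ only by a state-dependent term is to expand the surrogate reward and dynamics. The following derivation is a sketch that uses only the surrogate definitions (5), the definition (8), and the fact that $v^*_\alpha(s')=\max_{a'}q^*_\alpha(s',a')$; it is offered as an illustration and is not a verbatim restatement of the lemma.

```latex
\begin{aligned}
q^*_\alpha(s,a)
 &= r_\alpha(s,a) + \gamma\sum_{s'} P_\alpha(s'\mid s,a)\, v^*_\alpha(s') \\
 &= (1-\alpha)\Big[r(s,a) + \gamma\sum_{s'} P(s'\mid s,a)\, v^*_\alpha(s')\Big]
  + \alpha\Big[r^{\pi_0}(s) + \gamma\sum_{s'} P^{\pi_0}(s'\mid s)\, v^*_\alpha(s')\Big] \\
 &= (1-\alpha)\, q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}(s,a)
  + \alpha \sum_{a'} \pi_0(a'\mid s)\, q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}(s,a').
\end{aligned}
```

The second term in the last line does not depend on $a$, so for $\alpha<1$ maximizing either $q$-function over actions selects the same action, even though the two functions take different values.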

Before describing the algorithms, we define the following notation for any $q$,

$$v(s)=\max_{a'}q(s,a'),\qquad v^{\pi}(s)=\sum_{a'}\pi(a'\mid s)\,q(s,a').$$

We now describe the Expected α-Q-learning algorithm (see Algorithm 1), also given in (John, 1994; Littman et al., 1997), and re-interpret it in light of the previous discussion.

The fixed point equation (9) leads us to define an operator on $q$-functions whose fixed point is $q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$. Expected α-Q-learning (Alg. 1) is a Stochastic Approximation (SA) algorithm based on this operator. Given a sample of the form $(s_t,a_{\text{env}},r_t,s_{t+1})$, it updates $q(s_t,a_{\text{env}})$ by

$$(1-\eta_t)\,q(s_t,a_{\text{env}})+\eta_t\Big(r_t+\gamma\big((1-\alpha)v(s_{t+1})+\alpha v^{\pi_0}(s_{t+1})\big)\Big). \tag{10}$$

Its convergence proof is standard and follows by showing that the operator is a $\gamma$-contraction and using (Bertsekas & Tsitsiklis, 1995)[Proposition 4.4] (see proof in Appendix K.1).

We now turn to describe an alternative algorithm, which operates on the surrogate MDP, $\mathcal{M}_\alpha$, and converges to $q^*_\alpha$. Naively, given a sample $(s_t,a_{\text{chosen}},r_t,s_{t+1})$, regular $q$-learning on $\mathcal{M}_\alpha$ can be used by updating $q(s_t,a_{\text{chosen}})$ as,

$$(1-\eta_t)\,q(s_t,a_{\text{chosen}})+\eta_t\big(r_t+\gamma v(s_{t+1})\big). \tag{11}$$

Yet, this approach does not utilize meaningful knowledge: when the exploration policy $\pi_0$ is played, the sample can be used to update all the action entries from the current state. These entries are also affected by the policy $\pi_0$. In fact, we cannot prove the convergence of the naive update based on current techniques; if the greedy action is repeatedly chosen, visiting all $(s,a)$ pairs ‘infinitely often’ cannot be guaranteed.

This reasoning leads us to formulate Surrogate α-Q-learning (see Algorithm 2). Surrogate α-Q-learning updates two $q$-functions. The first has the same update as in Expected α-Q-learning, and thus converges (w.p. 1) to $q^{\pi^\alpha(\pi^*_\alpha,\pi_0)}$. The second updates the chosen greedy action using equation (11), when the exploration policy is not played. By bootstrapping on the first $q$-function, the algorithm updates all other actions of the second one when the exploration policy is played. Using (Singh et al., 2000)[Lemma 1], the convergence of Surrogate α-Q-learning to $q^*_\alpha$ is established (see proof in Appendix K.2). Interestingly, and unlike other $q$-learning algorithms (e.g., Expected α-Q-learning, Q-learning, etc.), Surrogate α-Q-learning updates the entire action set given a single sample. For completeness, we state the convergence result for both algorithms.
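The difference between updates (10) and (11) is easiest to see side by side: they index the table by different actions and bootstrap on different targets. The sketch below implements a single step of each on dictionary-based tables; the table contents, step size, and function names are assumptions for illustration, and the full Surrogate algorithm (which also updates the remaining actions) is not reproduced here.

```python
def expected_update(q, pi_0, s, a_env, r, s_next, alpha, gamma, eta):
    """Update (10): indexed by the action actually played, a_env."""
    v_max = max(q[s_next].values())
    v_pi0 = sum(pi_0[s_next][a] * q[s_next][a] for a in q[s_next])
    target = r + gamma * ((1 - alpha) * v_max + alpha * v_pi0)
    q[s][a_env] = (1 - eta) * q[s][a_env] + eta * target


def naive_surrogate_update(q, s, a_chosen, r, s_next, gamma, eta):
    """Update (11): plain Q-learning target, but indexed by the greedy action a_chosen."""
    target = r + gamma * max(q[s_next].values())
    q[s][a_chosen] = (1 - eta) * q[s][a_chosen] + eta * target


# Hypothetical tables for a two-state, two-action problem.
q_exp = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 3.0}}
q_sur = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 3.0}}
uniform = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

# One transition where exploration overrode the greedy choice (a_chosen=1, a_env=0).
expected_update(q_exp, uniform, s=0, a_env=0, r=1.0, s_next=1,
                alpha=0.2, gamma=0.9, eta=0.5)
naive_surrogate_update(q_sur, s=0, a_chosen=1, r=1.0, s_next=1, gamma=0.9, eta=0.5)
```

After this single transition the Expected table has changed the entry of the executed action, while the Surrogate table has changed the entry of the greedy action, even though both saw the same reward and next state.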

###### Theorem 9.

Consider the processes described in Alg.  1, 2. Assume satisfies , , , and , where . Then, for both 1, 2 the sequence converges w.p. 1 to , and for 2, converges w.p. 1 to .

### 5.2 Continuous Control

Building on the two approaches for solving Exploration Conscious criteria, we suggest two techniques to find an optimal Gaussian policy (6) using gradient based Deep RL (DRL) algorithms, and specifically, DDPG (Lillicrap et al., 2015). Nonetheless, the techniques are generalizable to other actor-critic, DRL algorithms (Schulman et al., 2017).

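Assume we wish to find an optimal Gaussian policy by parameterizing its mean $\mu_\phi$. Nachum et al. (2018)[Eq. 13] showed the gradient of the value w.r.t. $\phi$ is similar to Silver et al. (2014),

$$\nabla_\phi v^{\pi_{\mu,\sigma}}=\int_{\mathcal{S}}\partial_a q_\sigma^{\pi_{\mu,\sigma}}(s,a)\,\nabla_\phi\mu_\phi(s)\,d\rho^{\pi_{\mu,\sigma}}(s), \tag{12}$$

where, in light of the previous section, we interpret $q_\sigma^{\pi_{\mu,\sigma}}$ as the $q$-function of the surrogate MDP $\mathcal{M}_\sigma$ (7). Furthermore, we have the following relation between the surrogate and expected $q$-functions, $q_\sigma^{\pi_{\mu,\sigma}}(s,a)=\int_{\mathcal{A}}\mathcal{N}(b\mid a,\sigma)\,q^{\pi_{\mu,\sigma}}(s,b)\,db$, from which it is easy to verify that (see Appendix K.3),

$$\nabla_a q_\sigma^{\pi_{\mu,\sigma}}(s,a)=\int_{\mathcal{A}}\mathcal{N}(b\mid a,\sigma)\,\nabla_b q^{\pi_{\mu,\sigma}}(s,b)\,db. \tag{13}$$

Thus, we can update the actor in two inequivalent ways, by using gradients of the surrogate MDP’s $q$-function (12), or by using gradients of the expected $q$-function (13).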

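The updates of the critic, $q_\sigma^{\pi_{\mu,\sigma}}$ or $q^{\pi_{\mu,\sigma}}$, can be done using the same notion that led to the two forms of updates in (10)-(11). When using Gaussian noise, one performs the two stages defined in Section 5, where $a_{\text{chosen}}$ is the output of the actor $\mu_\phi(s)$, and $a_{\text{env}}\sim\mathcal{N}(a_{\text{chosen}},\sigma)$. Then, the sample $(s,a_{\text{chosen}},a_{\text{env}},r,s')$ is obtained by interacting with the environment. Based on the fixed-policy TD-error defined in (11), we define the following loss function, for learning $q^\theta_\sigma$, the $q$-function of the fixed policy $\mu_\phi$ over $\mathcal{M}_\sigma$,

$$\Big(q^\theta_\sigma(s,a_{\text{chosen}})-r-\gamma\,q^{\theta^-}_\sigma\big(s',\mu_{\phi^-}(s')\big)\Big)^2.$$

On the other hand, we can define a loss function derived from the fixed-policy TD-error defined in (10), for learning $q^\theta$, the $q$-function of the Gaussian policy with mean $\mu_\phi$ and variance $\sigma$ over $\mathcal{M}$,

$$\Big(q^\theta(s,a_{\text{env}})-r-\gamma\int_{\mathcal{A}}\mathcal{N}\big(b\mid\mu_{\phi^-}(s'),\sigma\big)\,q^{\theta^-}(s',b)\,db\Big)^2.$$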
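The two critic targets differ only in how they bootstrap: the surrogate critic evaluates its target network at the actor's mean action, while the expected critic averages its target network over Gaussian-perturbed actions. The sketch below computes both targets for a one-dimensional action, estimating the integral by Monte Carlo sampling. The stand-in target functions, the sample count, and all names are hypothetical; a real implementation would use neural networks and mini-batches. For brevity the same stand-in function plays the role of both target critics.

```python
import random


def surrogate_target(r, gamma, q_sigma_target, mu_target, s_next):
    """TD target of the surrogate critic: bootstrap at the actor's mean action."""
    return r + gamma * q_sigma_target(s_next, mu_target(s_next))


def expected_target(r, gamma, q_target, mu_target, s_next, var, rng, n_samples=20000):
    """TD target of the expected critic: average q over b ~ N(mu(s'), var)."""
    mean = mu_target(s_next)
    std = var ** 0.5
    avg = sum(q_target(s_next, rng.gauss(mean, std)) for _ in range(n_samples)) / n_samples
    return r + gamma * avg


def squared_td_loss(prediction, target):
    """Squared TD error for a single sample."""
    return (prediction - target) ** 2


def mu_target(s):
    """Hypothetical target actor: always outputs the action 0.5."""
    return 0.5


def q_target(s, b):
    """Hypothetical target critic: a quadratic in the action."""
    return -b * b


rng = random.Random(0)
y_expected = expected_target(1.0, 0.9, q_target, mu_target, s_next=0.0, var=0.04, rng=rng)
y_surrogate = surrogate_target(1.0, 0.9, q_target, mu_target, s_next=0.0)
```

In this toy setting the expected target is close to $1+0.9\cdot(-(0.25+0.04))$, whereas the surrogate target is exactly $1+0.9\cdot(-0.25)$; the gap is the effect of averaging over the action noise.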

## 6 Experiments

In this section, we test the theory and algorithms suggested in this work (an implementation of the proposed algorithms can be found at https://github.com/shanlior/ExplorationConsciousRL). In all experiments we used . The tested DRL algorithms in this section (see Appendix B) are simple variations of DDQN (Van Hasselt et al., 2016) and DDPG (Lillicrap et al., 2015), without any parameter tuning, and based on Section 5. For example, for the surrogate approach in both DDQN and DDPG we merely save $a_{\text{chosen}}$ instead of $a_{\text{env}}$ in the replay buffer (see Section 5 for definitions of $a_{\text{chosen}}$ and $a_{\text{env}}$).

We observe significantly improved empirical performance, both in training and evaluation, for both the surrogate and expected approaches relative to the baseline performance. The improved training performance is predictable; the learned policy is optimal w.r.t. the noise which is being played. In a large portion of the results, the exploration-conscious criterion also leads to better performance in evaluation.

### 6.1 Exploration Consciousness with Prior Knowledge

We use an adaptation of the Cliff-Walking maze (Sutton et al., 1998) we term T-Cliff-Walking (see Appendix C). The agent starts at the bottom-left side of a maze, and needs to get to the bottom-right side goal state with value . If the agent falls off the cliff, the episode terminates with reward . When the agent visits any of the first three steps on top of the cliff, it gets a reward of .

We tested Expected α-Q-learning and Surrogate α-Q-learning, and compared their performance to Q-learning in the presence of ϵ-greedy exploration. Figure 1 stresses the typical behaviour of the α-optimality criterion. It is easier to approximate $\pi^*_\alpha$ than the optimal policy. Further, by being exploration-conscious, the value of the approximated policy improves faster using the α-optimal algorithms; it learns faster which regions to avoid. As Theorem 4 suggests, the value of the learned policy is biased w.r.t. $v^*$. Next, as suggested by Proposition 3, acting greedily w.r.t. the approximated value attains better performance. Such improvement is not guaranteed while the value has not yet converged to $v^*_\alpha$. However, the empirical results suggest that if the agent performs well over the mixture policy, it is worth using the greedy policy.

We show that it is possible to incorporate prior knowledge to decrease the bias caused by being Exploration-Conscious. The T-Cliff-Walking example demands high exploration because of the bottleneck state between the two sides of the maze. The α-optimal policy in such a case is to stay at the left part of the maze. We used the prior knowledge that close to the barrier $L(s)$ is high. The knowledge was injected through the choice of $\alpha$, i.e., we chose a state-wise exploration scheme with one value of $\alpha(s)$ in the passage and the two states around it, and another elsewhere, for all three algorithms. The results in Figure 1 suggest that using prior knowledge to set $\alpha(s)$ can increase the performance by reducing the bias. In contrast, such prior knowledge does not help the baseline Q-learning.

### 6.2 Exploration Consciousness in Atari

We tested the α-optimal criterion in the more complex function approximation setting (see Appendix Alg. 3, 4). We used five Atari 2600 games (5) from the ALE (Bellemare et al., 2013). We chose games that resemble the Cliff Walking scenario, where the wrong choice of action can lead to a sudden termination of the episode. Thus, being unaware of the exploration strategy can lead to poor training results. We used the same deep neural network as in DQN (Mnih et al., 2015), using the OpenAI Baselines implementation (Dhariwal et al., 2017), without any parameter tuning, except for the update equations. We chose to use the Double-DQN variant of DQN (Van Hasselt et al., 2016) for simplicity and generality. Nonetheless, changing the optimality criterion is orthogonal to any of the suggested add-ons to DQN (Hessel et al., 2017). We used one value of the exploration parameter in the train phase, and a smaller one in the evaluation phase. For the surrogate version, we used a naive implementation based on equation (11).

Table 1 shows that our method improves upon using the optimal criterion. That is, while bias exists, the algorithm still converges to a better policy. This result holds both in the exploratory training regime and in the evaluation regime. Again, acting greedily w.r.t. the approximation of the α-optimal policy proved beneficial: the evaluation phase results surpass the train phase results, as shown in the table and in the training figures in Appendix (2). The evaluation is usually done with a small $\epsilon$. Proposition 3 puts formal grounds for using a smaller $\alpha$ in the evaluation phase than in the training phase; improvement is assured. Being accurate is extremely important in most Atari games, so Exploration-Consciousness can also hurt the performance. Still, one can use prior knowledge to overcome this obstacle.

### 6.3 Exploration Consciousness in MuJoCo

We tested the Expected σ-DDPG (5) and Surrogate σ-DDPG (6) on continuous control tasks from the MuJoCo environment (Todorov et al., 2012). We used the OpenAI implementation of DDPG as the baseline, where we only changed the update equations to match our proposed algorithms. We used the default hyper-parameters, and independent Gaussian noise with a fixed $\sigma$, for all tasks and algorithms. The results in Table 2 were averaged over 10 different seeds. The performance of the σ-optimal variants surpassed the baseline DDPG for most of the training and test results. Interestingly, although improvement is not guaranteed (Proposition 7), the σ-optimal policy improved when using $\mu^*_\sigma$ deterministically, i.e., in the test phase. This suggests that improvement can be expected in certain scenarios, although it is not guaranteed in general. We also found that the training process was faster using the σ-optimal algorithms, as can be seen in the learning curves in Appendix 3. Interestingly, again, the surrogate approach proved superior.

## 7 Relation to existing work

Lately, several works have tackled the exploration problem for deep RL. In some, like Bootstrapped-DQN (see appendix [D.1] in (Osband et al., 2016)), the authors still employ an ϵ-greedy mechanism on top of their methods. Moreover, methods like Distributional-DQN (Bellemare et al., 2017; Dabney et al., 2018) and the state-of-the-art Ape-X DQN (Horgan et al., 2018) still use ϵ-greedy and Gaussian noise, for discrete and continuous actions, respectively. Hence, the α-optimal criterion is applicable to all the above works by using the simple techniques described in Section 5.

Existing on-policy methods produce variants of Exploration-Consciousness. In TRPO and A3C (Schulman et al., 2015; Mnih et al., 2016), the exploration is implicitly injected into the agent policy through entropy regularization, and the agent improves upon the value of the explorative policy. A simple derivation shows the ϵ-greedy and the Gaussian approaches are both equivalent to regularizing the entropy to be higher than a certain value by setting $\alpha$ or $\sigma$ appropriately.

Expected α-Q-learning highlights a relation to algorithms analysed in (John, 1994; Littman et al., 1997) and to Expected-Sarsa (ES) (Van Seijen et al., 2009). The focus of (John, 1994; Littman et al., 1997) is exploration-conscious $q$-based methods. In ES, when setting the ‘estimation policy’ (Van Seijen et al., 2009) to be the α-mixture of the greedy policy and $\pi_0$, we get similar updating equations as in lines 6-7, and similarly to (John, 1994; Littman et al., 1997). However, in ES $\alpha$ decays to zero, and the optimal policy is obtained in the infinite time limit. In (Nachum et al., 2018), the authors offer a gradient based mechanism for updating the mean and variance of the actor. Here, we offer and analyze the approach of setting $\alpha$ and $\sigma$ to a constant value. This would be of interest especially when a ‘good’ mechanism for decaying $\alpha$ and $\sigma$ is lacking; the decay mechanism is usually chosen by trial-and-error, and it is not clear how it should be set.

Lastly, (