1 Introduction
In Reinforcement Learning (RL), data-efficiency and learning speed are paramount. Indeed, when interacting with robots, humans, or the real world, data can be extremely scarce and expensive to collect. Improving data-efficiency is of the utmost importance to apply RL to interesting and realistic applications. Learning from few data is relatively easier to achieve when rewards are dense, as these can be used to guide exploration. In most realistic problems however, defining dense reward functions is non-trivial and requires expert knowledge and much fine-tuning. In some cases (e.g. when dealing with humans), definitions for dense rewards are unclear and remain an open problem. This greatly hinders the applicability of RL to many interesting problems.
† This paper extends previous work published as Morere and Ramos (2018).
It appears more natural to reward robots only when reaching a goal, termed goal-only rewards, which are trivial to define Reinke et al. (2017). Goal-only rewards, defined as unit reward for reaching a goal and zero elsewhere, cause classic exploration techniques based on random walks, such as $\epsilon$-greedy and control input noise Schulman et al. (2015), or optimistic initialization, to become highly inefficient. For example, Boltzmann exploration Kaelbling et al. (1996) requires training time exponential in the number of states Osband et al. (2014). Such data requirements are unacceptable in real-world applications. Most solutions to this problem rely on redesigning rewards to avoid dealing with the problem of exploration. Reward shaping helps learning Ng et al. (1999), and translating rewards to negative values triggers optimism in the face of uncertainty Kearns and Singh (2002); Brafman and Tennenholtz (2002); Jaksch et al. (2010). This approach suffers from two shortcomings: proper reward design is difficult and requires expert knowledge; improper reward design often degenerates to unexpected learned behaviour.
Intrinsic motivation proposes a different approach to exploration by defining an additional guiding reward; see Figure 1 (left). The exploration reward is typically added to the original reward, which makes rewards dense from the agent’s perspective. This approach has had many successes Bellemare et al. (2016); Fox et al. (2018) but suffers from several limitations. For example, the weighting between exploration and exploitation must be chosen before learning and remain fixed. Furthermore, in the model-free setting, state-action value functions are learned from non-stationary targets mixing exploration and exploitation, hence making learning less data-efficient.
To solve the problem of data-efficient exploration in goal-only reward settings, we propose to leverage advances in multi-objective RL Roijers et al. (2013). We formulate exploration as one of the core objectives of RL by explicitly integrating it into the loss being optimized. Following the multi-objective RL framework, agents optimize for both exploration and exploitation as separate objectives. This decomposition can be seen as two different RL agents, as shown in Figure 1 (right). Contrary to most intrinsic RL approaches, this formulation keeps the exploration-exploitation trade-off at a policy level, as in traditional RL. This allows for several advantages: (i) Weighting between objectives can be adapted while learning, and strategies can be developed to change exploration online; (ii) Exploration can be stopped at any time at no extra cost, yielding purely exploitative behaviour immediately; (iii) Inspection of exploration status is possible, and experimenters can easily generate trajectories for exploration or exploitation only.
Our contributions are the following:

We propose a framework based on multi-objective RL for treating exploration as an explicit objective, making it core to the optimization problem.

This framework is experimentally shown to perform better than classic additive exploration bonuses on several key exploration characteristics.

Drawing inspiration from the fields of bandits and Bayesian optimization, we give strategies for taking advantage of and tuning the exploration-exploitation balance online. These strategies achieve a degree of control over agent exploration that was previously unattainable with classic additive intrinsic rewards.

We present a data-efficient model-free RL method (EMU-Q) for continuous state-action goal-only MDPs based on the proposed framework, guiding exploration towards regions of higher value-function uncertainty.

EMU-Q is experimentally shown to outperform classic exploration techniques and other methods with additive intrinsic rewards on a continuous control benchmark.
In the following, Section 2 reviews background on Markov decision processes, intrinsic motivation RL, multi-objective RL and related work. Section 3 defines a framework for explicit exploration-exploitation balance at a policy level, based on multi-objective RL. Section 4 presents advantages and strategies for controlling this balance during the agent learning process. Section 5 formulates EMU-Q, a model-free data-efficient RL method for continuous state-action goal-only MDPs, based on the proposed framework. Section 6 presents experiments that evaluate EMU-Q’s exploration capabilities on classic RL problems and a simulated robotic manipulator. EMU-Q is further evaluated against other intrinsic RL methods on a continuous control benchmark. We conclude with a summary in Section 7.
2 Preliminaries
This section reviews the basics of Markov decision processes, intrinsic motivation RL, multi-objective RL and related work.
2.1 Markov Decision Processes
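A Markov decision process (MDP) is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$. $\mathcal{S}$ and $\mathcal{A}$ are spaces of states and actions respectively. The transition function $T$ encodes the probability to transition to state $s'$ when executing action $a$ in state $s$, i.e. $T(s, a, s') = p(s' \mid s, a)$. The reward distribution $R$ of support $\mathcal{S} \times \mathcal{A} \times \mathcal{S}$ defines the reward $r$ associated with transition $(s, a, s')$. In the simplest case, goal-only rewards are deterministic: unit rewards are given for absorbing goal states, optional negative unit rewards are given for penalized absorbing states, and zero reward is given elsewhere. $\gamma$ is a discount factor. Solving an MDP is equivalent to finding the optimal policy $\pi^*$ starting from $s_0$:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{T, R, \pi}\left[\sum_{i=0}^{\infty} \gamma^i r_i\right], \tag{1}$$
with $a_i \sim \pi(s_i)$, $s_{i+1} \sim T(s_i, a_i, \cdot)$, and $r_i \sim R(s_i, a_i, s_{i+1})$. Model-free RL learns an action-value function $Q$, which encodes the expected long-term discounted value of a state-action pair:
$$Q(s, a) = \mathbb{E}_{T, R, \pi}\left[\sum_{i=0}^{\infty} \gamma^i r_i\right], \quad s_0 = s,\ a_0 = a. \tag{2}$$
Equation 2 can be rewritten recursively, which is known as the Bellman equation:
$$Q(s, a) = \mathbb{E}_{R}\left[R(s, a, s')\right] + \gamma\, \mathbb{E}_{s', a' \mid s, a}\left[Q(s', a')\right], \tag{3}$$
with $s' \sim p(s' \mid s, a)$ and $a' \sim \pi(s')$, which is used to iteratively refine models of $Q$ based on transition data.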
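As a minimal illustration of how the Bellman equation (Equation 3) is used to refine an action-value model from transition data, the sketch below performs one sample-based backup on a tabular value function. It uses the Q-learning variant, in which the next action is chosen greedily. The function name `q_learning_update` and the values of the learning rate `alpha` and discount `gamma` are illustrative assumptions, not settings taken from this paper.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, terminal=False):
    """One sample-based Bellman backup for a tabular action-value function."""
    # Absorbing (terminal) states have no future value to bootstrap from.
    future = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * future
    # Move the current estimate a fraction alpha towards the sampled target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Goal-only toy transition: unit reward when entering the absorbing goal state.
Q = defaultdict(float)
actions = ["left", "right"]
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions, terminal=True)
# A zero-reward transition into state 0 now inherits part of that value.
q_learning_update(Q, s=-1, a="right", r=0.0, s_next=0, actions=actions)
```

With goal-only rewards, value only propagates backwards from the goal one backup at a time, which is why undirected exploration is so slow in this setting.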
2.2 Intrinsic RL
While classic RL typically carries out exploration by adding randomness at a policy level (e.g. random action, posterior sampling), intrinsic RL focuses on augmenting rewards with an exploration bonus. This approach was presented in Chentanez et al. (2005), in which agents aim to maximize a total reward $r^+$ for transition $(s, a, s')$:
$$r^+(s, a, s') = r(s, a, s') + \xi\, r^e(s, a, s'), \tag{4}$$
where $r^e$ is the exploration bonus and $\xi$ a user-defined parameter weighting exploration. The second term encourages agents to select state-action pairs for which they previously received high exploration bonuses. The definition of $r^e$ has been the focus of much recent theoretical and applied work; examples include model prediction error Stadie et al. (2015) or information gain Little and Sommer (2013).
While this formulation enables exploration in well behaved scenarios, it suffers from multiple limitations:

Exploration bonuses are designed to reflect the information gain at a given time of the learning process. They are initially high, and typically decrease after more transitions are experienced, making them a non-stationary target. Updating $Q$ with non-stationary targets results in higher data requirements, especially when environment rewards are stationary.

The exploration bonus given for reaching new areas of the state-action space persists in the estimate of $Q$. As a consequence, agents tend to over-explore and may be stuck oscillating between neighbouring states.

There is no dynamic control over the exploration-exploitation balance, as changing parameter $\xi$ only affects future total rewards. Furthermore, it would be desirable to be able to generate trajectories for pure exploration or pure exploitation, as these two quantities may conflict.
This work presents a framework for enhancing intrinsic exploration, which does not suffer from the previously stated limitations.
2.3 Multi-Objective RL
Multi-objective RL seeks to learn policies solving multiple competing objectives by learning how to solve for each objective individually Roijers et al. (2013). In multi-objective RL, the reward function describes a vector of $n$ rewards instead of a scalar. The value function also becomes a vector $\mathbf{Q}$ defined as
$$\mathbf{Q}(s, a) = \mathbb{E}_{T, \mathbf{R}, \pi}\left[\sum_{i=0}^{\infty} \gamma^i \mathbf{r}_i\right], \tag{5}$$
where $\mathbf{r}_i$ is the vector of rewards at step $i$ in which each coordinate corresponds to one objective. For simplicity, the overall objective is often expressed as the sum of all individual objectives; $\mathbf{Q}$ can be converted to a scalar state-action value function with a linear scalarization function: $Q_{\boldsymbol{\omega}}(s, a) = \boldsymbol{\omega}^\top \mathbf{Q}(s, a)$, where $\boldsymbol{\omega}$ are weights governing the relative importance of each objective.
The advantage of the multi-objective RL formulation is to allow learning policies for all combinations of $\boldsymbol{\omega}$, even if the balance between each objective is not explicitly defined prior to learning. Moreover, if $\boldsymbol{\omega}$ is a function of time, policies for new values of $\boldsymbol{\omega}$ are available without additional computation. Conversely, with traditional RL methods, a pass through the whole dataset of transitions would be required.
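The sketch below illustrates why a new objective weighting requires no additional learning under linear scalarization: the per-objective values are stored as a vector, and the weights are only applied when an action is selected. The names `scalarize` and `greedy_action`, as well as the toy values, are hypothetical and chosen purely for illustration.

```python
def scalarize(q_vector, weights):
    """Linear scalarization: weighted sum of per-objective values."""
    return sum(w * q for w, q in zip(weights, q_vector))


def greedy_action(Q_vec, state, actions, weights):
    """Greedy action under the scalarized value for the given weights."""
    return max(actions, key=lambda a: scalarize(Q_vec[(state, a)], weights))


# Toy vector-valued table with two objectives per state-action pair.
Q_vec = {("s0", "a0"): (1.0, 0.0), ("s0", "a1"): (0.2, 2.0)}
actions = ["a0", "a1"]

# The same learned table yields different policies for different weights.
first_choice = greedy_action(Q_vec, "s0", actions, weights=(1.0, 0.1))
second_choice = greedy_action(Q_vec, "s0", actions, weights=(1.0, 1.0))
```

Here the first weighting favours `a0` and the second favours `a1`, although `Q_vec` itself is never modified.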
2.4 Related Work
Enhancing exploration with additional rewards can be traced back to the work of Storck et al. (1995) and Meuleau and Bourgine (1999), in which information acquisition is dealt with in an active manner. This type of exploration was later termed intrinsic motivation and studied in Chentanez et al. (2005). This field has recently received much attention, especially in the context of very sparse or goal-only rewards Reinke et al. (2017); Morere and Ramos (2018) where traditional reward functions give too little guidance to RL algorithms.
Extensive intrinsic motivation RL work has focused on domains with simple or discrete spaces, proposing various definitions for exploration bonuses. Starting from a review of intrinsic motivation in psychology, the work of Oudeyer and Kaplan (2008) presents a definition based on information theory. Maximizing predicted information gain from taking specific actions is the focus of Little and Sommer (2013), applied to learning in the absence of external reward feedback. Using approximate value function variance as an exploration bonus was proposed in Osband et al. (2016). In the context of model-based RL, exploration based on model learning progress Lopes et al. (2012) and model prediction error Stadie et al. (2015); Pathak et al. (2017) was proposed. State visitation counts have been widely investigated, in which an additional model counting previous state-action pair occurrences guides agents towards less visited regions. Recent successes include Bellemare et al. (2016); Fox et al. (2018). An attempt at generalizing counter-based exploration to continuous state spaces was made in Nouri and Littman (2009), by using regression trees to achieve multi-resolution coverage of the state space. Another attempt at scaling visitation counters to large and continuous state spaces was made in Bellemare et al. (2016) by using density models.
Little work has attempted to extend intrinsic exploration to continuous action spaces. A policy gradient RL method was presented in Houthooft et al. (2016). A generalization of visitation counters is proposed in Fox et al. (2018), and interpreted as exploration values. Exploration values are also presented as an alternative to additive rewards in Szita and Lőrincz (2008), where exploration balance at a policy level is mentioned.
Most of these methods typically suffer from high data requirements. One of the reasons for such requirements is that exploration is treated as an ad-hoc problem instead of being the focus of the optimization method. More principled ways to deal with exploration can be found in other related fields. In bandits, the balance between exploration and exploitation is central to the formulation Kuleshov and Precup (2014). For example, with upper confidence bound Auer et al. (2002), actions are selected based on the balance between action values and a visitation term measuring the variance in the estimate of the action value. In the bandits setting, the balance is defined at a policy level, and the exploration term is not incorporated into action values as in intrinsic RL.
Similarly to bandits, Bayesian Optimization Jones et al. (1998) (BO) brings exploration to the core of its framework, extending the problem to continuous action spaces. BO provides a data-efficient approach for finding the optimum of an unknown objective. Exploration is achieved by building a probabilistic model of the objective from samples, and exploiting its posterior variance information. An acquisition function such as UCB Cox and John (1992) balances exploration and exploitation, and is at the core of the optimization problem. BO was successfully applied to direct policy search Brochu et al. (2010); Wilson et al. (2014) by searching over the space of policy parameters, casting RL into a supervised learning problem. Searching the space of policy parameters is however not data-efficient, as recently acquired step information is not used to improve exploration. Furthermore, using BO as global search over policy parameters greatly restricts parameter dimensionality, hence typically imposes using few expressive and hand-crafted features.
In both bandits and BO formulations, exploration is brought to a policy level where it is a central goal of the optimization process. In this work, we treat exploration and exploitation as two distinct objectives to be optimized. Multi-objective RL Roijers et al. (2013) provides tools which we utilize for defining these two distinct objectives, and balancing them at a policy level. Multi-objective RL allows for making exploration central to the optimization process. While there exist multi-objective RL methods to find several viable objective weightings, such as finding Pareto fronts Perny and Weng (2010), our work focuses on two well-defined objectives whose weighting changes during learning. As such, we are mostly interested in the ability to change the relative importance of objectives without requiring retraining.
Modelling state-action values using a probabilistic model enables reasoning about the whole distribution instead of just its expectation, giving opportunities for better exploration strategies. Bayesian Q-learning Dearden et al. (1998) was first proposed to provide value function posterior information in the tabular case, then extended to more complicated domains by using Gaussian processes to model the state-action value function Engel et al. (2005). In that work, the authors also discuss decomposition of returns into several terms separating intrinsic and extrinsic uncertainty, which could later be used for exploration. Distributions over returns were proposed to design risk-sensitive algorithms Morimura et al. (2010), and approximated to enhance RL stability in Bellemare et al. (2017). In recent work, Bayesian linear regression is combined with a deep network to provide a posterior on Q-values Azizzadenesheli et al. (2018). Thompson sampling is then used for action selection, but can only guarantee local exploration. Indeed, if all actions were experienced in a given state, the uncertainty of Q in this state is not sufficient to drive the agent towards unexplored regions.
To the best of our knowledge, there exists no model-free RL framework treating exploration as a core objective. We present such a framework, building on theory from multi-objective RL, bandits and BO. We also present EMU-Q, a solution to exploration based on the proposed framework in fully continuous goal-only domains, relying on reducing the posterior variance of value functions.
This paper extends our earlier work Morere and Ramos (2018). It formalizes a new framework for treating exploration and exploitation as two objectives, provides strategies for online exploration control and new experimental results.
3 Explicit Balance for Exploration and Exploitation
Traditional RL aims at finding a policy maximizing the expected sum of future discounted rewards, as formulated in Equation 1. Exploration is then typically achieved by adding a perturbation to rewards or behaviour policies in an ad-hoc way. We propose making the trade-off between exploration and exploitation explicit and at a policy level, by formulating exploration as a multi-objective RL problem.
3.1 Framework Overview
Multi-objective RL extends the classic RL framework by allowing value functions or policies to be learned for individual objectives. Exploitation and exploration are two distinct objectives for RL agents, for which separate value functions $Q$ and $U$ (respectively) can be learned. Policies then need to make use of information from two separate models for $Q$ and $U$. While the exploitation value function $Q$ is learned from external rewards, the exploration value function $U$ is modelled using exploration rewards.
Aiming to define policies which combine exploration and exploitation, we draw inspiration from Bayesian Optimization Brochu et al. (2010), which seeks to find the maximum of an expensive function using very few samples. It relies on an acquisition function to determine the most promising locations to sample next, based on model posterior mean and variance. The Upper-Confidence Bounds (UCB) acquisition function Cox and John (1992) is popular for its explicit balance between exploitation and exploration controlled by parameter $\kappa$. Adapting UCB to our framework leads to policies balancing $Q$ and $U$. Contrary to most intrinsic RL approaches, our formulation keeps the exploration-exploitation trade-off at a policy level, as in traditional RL. This allows for adapting the exploration-exploitation balance during the learning process without sacrificing data-efficiency, as would be the case with a balance at a reward level. Furthermore, policy level balance can be used to design methods to control the agent’s learning process, e.g. stop exploration after a budget is reached, or encourage more exploration if the agent converged to a sub-optimal solution; see Section 4. Lastly, generating trajectories resulting only from exploration or exploitation grants experimenters insight over learning status.
3.2 Exploration Values
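We propose to redefine the objective optimized by RL methods to incorporate both exploration and exploitation at its core. To do so, we consider the following expected balanced return $D^\pi$ for policy $\pi$:
$$D^\pi(s, a) = \mathbb{E}_{T, R, R^e, \pi}\left[\sum_{i=0}^{\infty} \gamma^i \left(r_i + \kappa\, r^e_i\right)\right], \tag{6}$$
where we introduced exploration rewards $r^e_i \sim R^e(s_i, a_i, s_{i+1})$ and parameter $\kappa$ governing exploration-exploitation balance. Note that we recover Equation 1 by setting $\kappa$ to $0$, hence disabling exploration.
Equation 6 can be further decomposed into
$$D^\pi(s, a) = \mathbb{E}_{T, R, \pi}\left[\sum_{i=0}^{\infty} \gamma^i r_i\right] + \kappa\, \mathbb{E}_{T, R^e, \pi}\left[\sum_{i=0}^{\infty} \gamma^i r^e_i\right] \tag{7}$$
$$= Q(s, a) + \kappa\, U(s, a), \tag{8}$$
where we have defined the exploration state-action value function $U$, akin to $Q$. Exploration behaviour is achieved by maximizing the expected discounted exploration return $U$. Note that, if $r^e$ depends on $Q$, then $U$ is a function of $Q$. For clarity, we omit this potential dependency in notations.
Bellman-type updates for both $Q$ and $U$ can be derived by unrolling the first term in both sums:
$$D^\pi(s, a) = \mathbb{E}_{R}\left[R(s, a, s')\right] + \mathbb{E}_{T, R, \pi}\left[\sum_{i=1}^{\infty} \gamma^i r_i\right] + \kappa \left(\mathbb{E}_{R^e}\left[R^e(s, a, s')\right] + \mathbb{E}_{T, R^e, \pi}\left[\sum_{i=1}^{\infty} \gamma^i r^e_i\right]\right) \tag{9}$$
$$= \mathbb{E}_{R}\left[R(s, a, s')\right] + \gamma\, \mathbb{E}_{s', a' \mid s, a}\left[Q(s', a')\right] + \kappa \left(\mathbb{E}_{R^e}\left[R^e(s, a, s')\right] + \gamma\, \mathbb{E}_{s', a' \mid s, a}\left[U(s', a')\right]\right). \tag{10}$$
By identification we recover the update for $Q$ given by Equation 3 and the following update for $U$:
$$U(s, a) = \mathbb{E}_{R^e}\left[R^e(s, a, s')\right] + \gamma\, \mathbb{E}_{s', a' \mid s, a}\left[U(s', a')\right], \tag{11}$$
which is similar to that of $Q$. Learning both $Q$ and $U$ can be seen as combining two agents to solve separate MDPs for goal reaching and exploration, as shown in Figure 1 (right). This formulation is general in that any reinforcement learning algorithm can be used to learn $Q$ and $U$, combined with any exploration bonus. Both state-action value functions can be learned from transition data using existing RL algorithms.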
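The following sketch shows one way the two value functions could be learned side by side from the same transition, using a tabular temporal-difference backup for each. This is only an illustration under stated assumptions: the function names `td_update` and `update_both`, the tabular representation, and the values of `alpha` and `gamma` are hypothetical, and any other RL algorithm could take their place.

```python
from collections import defaultdict


def td_update(table, s, a, reward, s_next, a_next, alpha, gamma, terminal=False):
    """Sample-based Bellman backup shared by both value functions."""
    future = 0.0 if terminal else table[(s_next, a_next)]
    table[(s, a)] += alpha * (reward + gamma * future - table[(s, a)])


def update_both(Q, U, transition, r_explore, alpha=0.5, gamma=0.9):
    """Update exploitation values Q and exploration values U separately."""
    s, a, r, s_next, a_next, terminal = transition
    # Q only ever sees the (stationary) environment reward.
    td_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal)
    # U only ever sees the exploration reward.
    td_update(U, s, a, r_explore, s_next, a_next, alpha, gamma, terminal)


Q, U = defaultdict(float), defaultdict(float)
# Transition into an absorbing goal state, with a negative exploration reward.
update_both(Q, U, transition=(0, "a", 1.0, 1, "a", True), r_explore=-1.0)
```

Because the two tables never share a target, the exploration reward does not leak into the exploitation values.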
3.3 Exploration Rewards
The presented formulation is independent from the choice of exploration rewards, hence many reward definitions from the intrinsic RL literature can directly be applied here.
Note that in the special case $R^e = 0$ for all states and actions, we recover exploration values from DORA Fox et al. (2018), and if state and action spaces are discrete, we recover visitation counters Bellemare et al. (2016).
Another approach to define $R^e$ consists in considering the amount of exploration left at a given state. The exploration reward for a transition is then defined as the amount of exploration in the resulting state of the transition, to favour transitions that result in discovery. It can be computed by taking an expectation over all actions:
$$R^e(s') = \mathbb{E}_{a' \sim \mathcal{U}(\mathcal{A})}\left[\sigma(s', a')\right]. \tag{12}$$
We defined a function $\sigma$ accounting for the uncertainty associated with a state-action pair. This formulation favours transitions that arrive at states of higher uncertainty. An obvious choice for $\sigma$ is the variance of $Q$-values, to guide exploration towards parts of the state-action space where $Q$ values are uncertain. This formulation is discussed in Section 5. Another choice for $\sigma$ is to use visitation count or its continuous equivalent. Compared to classic visitation counts, this formulation focuses on visitations of the resulting transition state $s'$ instead of on the state-action pair of origin $(s, a)$.
Exploration rewards are often constrained to negative values so that, by combining an optimistic model for $U$ with negative rewards, optimism in the face of uncertainty guarantees efficient exploration Kearns and Singh (2002). The resulting model creates a gradient of $U$ values; trajectories generated by following this gradient reach unexplored areas of the state-action space. With continuous actions, Equation 12 might not have a closed-form solution and the expectation can be estimated with approximate integration or sampling techniques. In domains with discrete actions however, the expectation is replaced by a sum over all possible actions.
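For discrete actions, the expectation in Equation 12 reduces to an average over actions. The sketch below computes such a reward from a table of visitation counts. The uncertainty function `1 / (1 + count)` and the shift by its maximum value (which keeps rewards non-positive) are assumptions made for illustration only, not the definitions used in this paper.

```python
from collections import Counter


def uncertainty(counts, s, a):
    """Hypothetical uncertainty measure: decays with the visitation count."""
    return 1.0 / (1.0 + counts[(s, a)])


def exploration_reward(counts, s_next, actions, sigma_max=1.0):
    """Average uncertainty over actions in the resulting state, shifted to be <= 0."""
    expected = sum(uncertainty(counts, s_next, a) for a in actions) / len(actions)
    return expected - sigma_max


counts = Counter()
actions = ["left", "right"]
# Arriving in a never-visited state yields the highest possible reward (zero).
reward_unvisited = exploration_reward(counts, "s1", actions)
# After trying one action there, arriving in that state is less rewarding.
counts[("s1", "left")] += 1
reward_visited = exploration_reward(counts, "s1", actions)
```

The reward depends on the state the transition arrives in, not on the state-action pair it started from.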
3.4 Action Selection
Goal-only rewards are often defined as deterministic, as they simply reflect goal and penalty states. Because our framework handles exploration in a deterministic way, we simply focus on deterministic policies. Although exploration values $U$ remain non-stationary (because exploration rewards $r^e$ are), exploitation values $Q$ are learned from stationary environment rewards. This makes learning policies for exploitation easier.
Following the definition in Equation 6, actions are selected to maximize the expected balanced return $D^\pi$ at a given state $s$:
$$\pi(s) = \arg\max_{a} D^\pi(s, a) = \arg\max_{a}\ Q(s, a) + \kappa\, U(s, a). \tag{13}$$
Notice the similarity between the policy given in Equation 13 and UCB acquisition functions from the Bayesian optimization and bandits literature. No additional exploration term is needed, as this policy explicitly balances exploration and exploitation with parameter $\kappa$. This parameter can be tuned at any time to generate trajectories for pure exploration or exploitation, which can be useful to assess agent learning status. Furthermore, strategies can be devised to control $\kappa$ manually or automatically during the learning process. We propose a few strategies in Section 4.
The policy from Equation 13 can further be decomposed into present and future terms:
$$\pi(s) = \arg\max_{a}\ \underbrace{\mathbb{E}_{R}\left[R(s, a, s')\right] + \kappa\, \mathbb{E}_{R^e}\left[R^e(s, a, s')\right]}_{\text{myopic}} + \underbrace{\gamma\, \mathbb{E}_{s' \mid s, a}\left[\mathbb{E}_{a' \sim \pi(s')}\left[Q(s', a') + \kappa\, U(s', a')\right]\right]}_{\text{future}}, \tag{14}$$
where the term denoted future is effectively the discounted expected balanced return of the next state-action pair, $\gamma\, \mathbb{E}_{s', a'}\left[D^\pi(s', a')\right]$. This decomposition highlights the link between this framework and other active learning methods; by setting $\gamma$ to $0$, only the myopic term remains, and we recover the traditional UCB acquisition function from bandits or Bayesian optimization. This decomposition can be seen as an extension of these techniques to a non-myopic setting. Indeed, future discounted exploration and exploitation are also considered within the action selection process. Drawing this connection opens up new avenues for leveraging exploration techniques from the bandits literature.
The method presented in this section for explicitly balancing exploration and exploitation at a policy level is concisely summed up in Algorithm 1. The method is general enough that it allows learning both $Q$ and $U$ with any RL algorithm, and does not make assumptions on the choice of exploration reward used. Section 5 presents a practical method implementing this framework, while the next section presents advantages and strategies for controlling exploration balance during the agent learning process.
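The policy-level balance described in this section can be sketched as a single action-selection function that combines the two learned value tables at decision time. The function name `balanced_policy`, the tabular representation and the toy values are illustrative assumptions, and this sketch is not a reproduction of Algorithm 1.

```python
def balanced_policy(Q, U, state, actions, kappa):
    """Pick the action maximizing exploitation value plus kappa times exploration value."""
    return max(actions, key=lambda a: Q[(state, a)] + kappa * U[(state, a)])


# Toy values: "stay" looks slightly better for the task, "move" is less explored.
Q = {("s", "stay"): 0.5, ("s", "move"): 0.4}
U = {("s", "stay"): -1.0, ("s", "move"): -0.2}
actions = ["stay", "move"]

# kappa = 0 disables exploration: the choice depends on Q only.
exploit_action = balanced_policy(Q, U, "s", actions, kappa=0.0)
# A positive kappa trades some task value for exploration value.
explore_action = balanced_policy(Q, U, "s", actions, kappa=1.0)
```

Since the balance parameter only enters at action selection, changing it between two decisions requires no update to either table.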
4 Preliminary Experiments on Classic RL Problems
In this section, a series of preliminary experiments on goalonly classic RL domains is presented to highlight the advantages of exploration values over additive rewards. Strategies for taking advantage of variable exploration rates are then provided.
The comparisons make use of goal-only versions of simple and fully discrete domains. We compare all methods using strictly the same learning algorithm and reward bonuses. Learning algorithms are tabular implementations of Q-Learning with learning rate fixed to . Reward bonuses are computed from a table of state-action visitation counts, where experiencing a state-action pair for the first time grants reward and revisiting yields reward. We denote by additive reward a learning algorithm where reward bonuses are used as in classic intrinsic RL (Equation 4), and by exploration values reward bonuses used as in the proposed framework (Equation 11 and action selection defined by Equation 13). A Q-learning agent with no reward bonuses and $\epsilon$-greedy exploration is displayed as a baseline.
Problem 1: The Cliff Walking domain Sutton et al. (1998) is adapted to the goal-only setting: negative unit rewards are given for falling off the cliff (triggering agent teleportation to the starting state), and positive unit rewards for reaching the terminal goal state. Transitions allow the agent to move in four cardinal directions, where a random direction is chosen with low probability.
Problem 2: The traditional Taxi domain Dietterich (2000) is also adapted to the goal-only setting. This domain features a grid-world with walls and four special locations. In each episode, the agent starts randomly and two of the special locations are denoted as passenger and destination. The goal is for the agent to move to the passenger’s location, pick up the passenger, drive them to the destination, and drop them off. A unit reward is given for dropping off the passenger at the destination (ending the episode), and rewards are given for actions pickup and dropoff in wrong locations.
4.1 Analyzing the Advantages of an Explicit Exploration-Exploitation Balance
We first present simple pathological cases in which using exploration values provides advantages over additive rewards for exploration, on the Cliff Walking domain.
The first two experiments show that exploration values allow for direct control over exploration, such as stopping and continuing exploration. Stopping exploration after a budget is reached is simulated by setting exploration parameters (e.g. $\kappa$) to $0$ after a fixed number of episodes and stopping model learning. While exploration values maintain high performance after exploration stops, returns achieved with additive rewards drop dramatically and yield a degenerate policy. When exploration is enabled once again, the two methods continue improving at a similar rate; see Figures 2a and 2b. Note that when exploration is disabled, there is a jump in returns with exploration values, as performance for pure exploitation is evaluated. However, it is never possible to be sure a policy is generated from pure exploitation when using additive rewards, as parts of bonus exploration rewards are encoded within learned Q-values.
The third experiment demonstrates that stochastic transitions with a higher probability of random action lead to increased return variance and poor performance with additive rewards, while exploration values only seem mildly affected. As shown in Figure 2c, even $\epsilon$-greedy appears to solve the task, suggesting stochastic transitions provide additional random exploration. It is unclear why the additive rewards method is affected negatively.
Lastly, the fourth experiment shows that environment reward magnitude is paramount to achieving good performance with additive rewards; see Figure 2d. Even though exploration parameters balancing environment and exploration bonus rewards are scaled to maintain equal amplitude between the two terms, additive rewards suffer from degraded performance. This is due to two reward quantities being incorporated into a single model for Q, which also needs to be initialized optimistically with respect to both quantities. When the two types of rewards have different amplitudes, this causes a problem. Exploration values do not suffer from this drawback as separate models are learned based on these two quantities, hence resulting in unchanged performance.
4.2 Automatic Control of Exploration-Exploitation Balance
We now present strategies for automatically controlling the exploration-exploitation balance during the learning process. The following experiments also make use of the Taxi domain.
Exploration parameter $\kappa$ is decreased over time according to a decay schedule, in which a decay parameter governs the decay rate. Higher values of the decay parameter result in reduced exploration after only a few episodes, whereas lower values translate to almost constant exploration. Results displayed in Figure 3 indicate that quickly decreasing exploration leads to fast convergence to returns relatively close to the maximum return, as shown for high decay values. However, choosing a more moderate value first results in lower performance, but enables finding a policy with higher returns later. Such behaviour is more visible with very small decay values, which correspond to an almost constant $\kappa$.
| Method | Times target reached | Episodes to target | Performance after target |
| --- | --- | --- | --- |
| $\epsilon$-greedy | 0/10 | – | – |
| Exploration values | 9/10 | 111.33 | 0.08 (4.50) |
| Additive rewards | 9/10 | 242.11 | 33.26 (67.41) |
We now show how direct control over exploration parameter $\kappa$ can be taken advantage of to stop learning automatically once a predefined target is met. On the Taxi domain, exploration is first stopped after an exploration budget is exhausted. Results comparing additive rewards to exploration values for three different episode budgets are given in Figure 4a. These clearly show that when stopping exploration after the budget is reached, exploration value agents can generate purely exploiting trajectories achieving near-optimal return, whereas additive reward agents fail to converge on an acceptable policy.
Lastly, we investigate stopping exploration automatically once a target return is reached. After each learning episode, several test episodes with pure exploitation are run to score the current policy. If all test episodes score returns above a predefined threshold, the target return is reached and exploration stops. Results for this experiment are shown in Figure 4b and Table 1. Compared to additive rewards, exploration values reach the target faster and display better performance after the target is reached.
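The automatic stopping rule described above can be sketched as a small training loop. The callables `learn_episode` and `test_episode`, the number of test episodes and the episode cap are placeholders introduced for illustration; the values used in the experiments are not reproduced here.

```python
def target_reached(test_returns, target_return):
    """True when every pure-exploitation test episode scores above the target."""
    return all(ret > target_return for ret in test_returns)


def learn_until_target(learn_episode, test_episode, kappa, target_return,
                       n_tests=5, max_episodes=1000):
    """Run learning episodes, switching exploration off once the target is met."""
    for episode in range(1, max_episodes + 1):
        learn_episode(kappa)  # act with the current exploration weight
        # Test episodes are assumed to act with pure exploitation.
        test_returns = [test_episode() for _ in range(n_tests)]
        if target_reached(test_returns, target_return):
            return 0.0, episode  # exploration weight set to zero
    return kappa, max_episodes


# Stub environment: the test return simply grows with the number of episodes.
progress = {"episodes": 0}


def stub_learn(kappa):
    progress["episodes"] += 1


def stub_test():
    return float(progress["episodes"])


final_kappa, stop_episode = learn_until_target(
    stub_learn, stub_test, kappa=1.0, target_return=2.5, n_tests=3, max_episodes=10)
```

With additive rewards there is no equivalent switch, because the exploration bonus is already mixed into the learned values.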
Exploration values were experimentally shown to provide exploration advantages over additive rewards on simple RL domains. The next section presents an algorithm built on the proposed framework which extends to fully continuous state and action spaces and is applicable to more advanced problems.
5 EMU-Q: Exploration by Minimizing Uncertainty of Q Values
Following the framework defined in Section 3, we propose learning exploration values with a specific reward driving trajectories towards areas of the state-action space where the agent’s uncertainty of $Q$ values is high.
5.1 Reward Definition
Modelling $Q$-values with a probabilistic model gives access to variance information representing model uncertainty in expected discounted returns. Because the probabilistic model is learned from expected discounted returns, discounted return variance is not considered. Hence the model variance only reflects epistemic uncertainty, which can be used to drive exploration. This formulation was explored in EMU-Q Morere and Ramos (2018), extending Equation 12 as follows:
$$R^e(s') = \mathbb{E}_{a' \sim \mathcal{U}(\mathcal{A})}\left[\mathbb{V}\left[Q(s', a')\right]\right] - \mathbb{V}_{\max}, \tag{15}$$
where $\mathbb{V}_{\max}$ is the maximum possible variance of $Q$, guaranteeing always negative rewards. In practice, $\mathbb{V}_{\max}$ depends on the model used to learn $Q$ and its hyperparameters, and can often be computed analytically. In Equation 15, the variance operator $\mathbb{V}$ computes the epistemic uncertainty of $Q$ values, that is, it assesses how confident the model is that it can predict $Q$ values correctly. Note that the MDP stochasticity emerging from transitions, rewards and policy is absorbed by the expectation operator in Equation 2, and so no assumptions are required on the MDP components in this reward definition.
5.2 Bayesian Linear Regression for Q-Learning
We now seek to obtain a model-free RL algorithm able to explore with few environment interactions, and providing a full predictive distribution on state-action values to fit the exploration reward definition given by Equation 15. Kernel methods such as Gaussian Process TD (GPTD) Engel et al. (2005) and Least-Squares TD (LSTD) Lagoudakis and Parr (2003) are among the most data-efficient model-free techniques. While the former suffers from prohibitive computation requirements, the latter offers an appealing trade-off between data-efficiency and complexity. We now derive a Bayesian RL algorithm that combines the strengths of both kernel methods and LSTD.
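The distribution of long-term discounted exploitation returns $D$ can be defined recursively as:
$$D(s, a) = R(s, a, s') + \gamma D(s', a'), \tag{16}$$
which is an equality in the distributions of the two sides of the equation. Note that so far, no assumptions are made on the nature of the distribution of returns. Let us decompose the discounted return $D$ into its mean $Q$ and a random zero-mean residual $q$ so that $D(s, a) = Q(s, a) + q(s, a)$. Substituting and rearranging Equation 16 yields
$$R(s, a, s') + \gamma Q(s', a') = Q(s, a) + \big(q(s, a) - \gamma q(s', a')\big). \tag{17}$$
The only extrinsic uncertainty left in this equation comes from the reward distribution $R$ and the residuals $q$. Assuming rewards are disturbed by zero-mean Gaussian noise implies the difference of residuals $q(s, a) - \gamma q(s', a')$ is Gaussian with zero mean and precision $\beta$. By modelling $Q$ as a linear function of a feature map $\phi_{s,a}$ so that $Q(s, a) = w^T \phi_{s,a}$, estimation of state-action values becomes a linear regression problem of target $t = R(s, a, s') + \gamma Q(s', a')$ and weights $w$. The likelihood function takes the form
$$p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{i=1}^{N} \mathcal{N}\big(t_i \mid w^T \phi_{s_i, a_i}, \beta^{-1}\big), \tag{18}$$
where independent transitions are denoted as $x_i = (s_i, a_i, r_i, s'_i, a'_i)$. We now treat the weights as random variables with zero-mean Gaussian prior $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$. The weight posterior distribution is
$$p(w \mid \mathbf{t}) = \mathcal{N}(w \mid m_Q, S), \tag{19}$$
$$m_Q = \beta S \Phi_{s,a}^T (\mathbf{r} + \gamma \mathbf{Q}'), \tag{20}$$
$$S = (\alpha I + \beta \Phi_{s,a}^T \Phi_{s,a})^{-1}, \tag{21}$$
where $\Phi_{s,a} = \{\phi_{s_i, a_i}\}_{i=1}^{N}$, $\mathbf{Q}' = \{Q(s'_i, a'_i)\}_{i=1}^{N}$, and $\mathbf{r} = \{r_i\}_{i=1}^{N}$. The predictive distribution is also Gaussian, yielding
$$Q(s, a) = \phi_{s,a}^T m_Q, \tag{22}$$
$$\sigma^2(s, a) = \beta^{-1} + \phi_{s,a}^T S \phi_{s,a}. \tag{23}$$
The predictive variance $\sigma^2(s, a)$ encodes the intrinsic uncertainty in $Q(s, a)$, due to the subjective understanding of the MDP’s model; it is used to compute $R^e$ in Equation 15.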
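The batch form of this model can be sketched with NumPy as below. The sketch assumes the standard Bayesian linear regression results of Bishop (2006), namely a posterior covariance $S = (\alpha I + \beta \Phi^T \Phi)^{-1}$, posterior mean $m = \beta S \Phi^T \mathbf{t}$, and predictive variance $\beta^{-1} + \phi^T S \phi$; function names are illustrative and the bootstrapped targets are taken as given.

```python
import numpy as np


def blr_posterior(Phi, targets, alpha, beta):
    """Posterior mean and covariance of the weights.

    Phi: (N, M) feature matrix, one row per transition.
    targets: (N,) regression targets, e.g. r + gamma * Q(s', a').
    alpha: prior precision, beta: noise precision.
    """
    n_features = Phi.shape[1]
    S = np.linalg.inv(alpha * np.eye(n_features) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ targets
    return m, S


def blr_predict(phi, m, S, beta):
    """Predictive mean and variance at a single feature vector phi of shape (M,)."""
    mean = phi @ m
    var = 1.0 / beta + phi @ S @ phi
    return mean, var
```

With no data, the covariance reduces to the prior $\alpha^{-1} I$, so the predictive variance is largest in regions the agent has never visited and shrinks as transitions accumulate.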
The derivation for $U$ is similar, replacing $r$ with $r^e$ and $Q$ with $U$. Note that because $S$ does not depend on rewards, it can be shared by both models. Hence, with $\mathbf{U}' = \{U(s'_i, a'_i)\}_{i=1}^{N}$,
$$m_U = \beta S \Phi_{s,a}^T (\mathbf{r}^e + \gamma \mathbf{U}'). \tag{24}$$
This model gracefully adapts to iterative updates at each step, by substituting the current prior with the previous posterior. Furthermore, the Sherman-Morrison equality is used to compute rank-1 updates of matrix $S$ with each new data point $\phi_{s,a}$:
$$S_{t+1} = S_t - \beta \frac{(S_t \phi_{s,a})(\phi_{s,a}^T S_t)}{1 + \beta \phi_{s,a}^T S_t \phi_{s,a}}. \tag{25}$$
This update only requires a matrix-to-vector multiplication and saves the cost of inverting a matrix at every step. Hence the complexity cost is reduced from $O(M^3)$ to $O(M^2)$ in the number of features $M$. An optimized implementation of EMUQ is given in Algorithm 2.
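A rank-1 update of this kind can be written in a few lines. The sketch below assumes the posterior precision grows by $\beta \phi \phi^T$ with each new feature vector $\phi$ and that the covariance matrix is symmetric; the function name is illustrative.

```python
import numpy as np


def sherman_morrison_update(S, phi, beta):
    """Return the covariance after adding beta * phi phi^T to the precision.

    S: (M, M) symmetric covariance, phi: (M,) new feature vector.
    Costs O(M^2) instead of the O(M^3) of a full matrix inversion.
    """
    S_phi = S @ phi  # single matrix-vector product
    # For symmetric S, phi^T S equals (S phi)^T, so the outer product suffices.
    return S - beta * np.outer(S_phi, S_phi) / (1.0 + beta * phi @ S_phi)
```

Starting from the prior covariance $\alpha^{-1} I$ and applying this update once per transition should reproduce the batch inverse up to floating-point error.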
End-of-episode updates for $m_Q$ and $m_U$ (line 15 onward) are analogous to policy iteration and, although not mandatory, greatly improve convergence speed. Note that because $R^e$ is a non-stationary target, recomputing it after each episode with the updated posterior on $Q$ provides the model on $U$ with more accurate targets, thereby improving learning speed.
5.3 Kernel Approximation Features for RL
We presented a simple method to learn $Q$ and $U$ as linear functions of state-action features. While powerful when using a good feature map, linear models typically require experimenters to define meaningful features on a problem-specific basis. In this section, we introduce random Fourier features (RFF) Rahimi and Recht (2008), a kernel approximation technique which allows linear models to enjoy the expressivity of kernel methods. It should be noted that these features are different from Fourier basis features Konidaris et al. (2011) (detailed in supplementary material), which do not approximate kernel functions. Although RFF were recently used to learn policy parametrizations Rajeswaran et al. (2017), to the best of our knowledge, this is the first time RFF are applied to the value function approximation problem in RL.
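For any shift-invariant kernel, which can be written as $k(\tau)$ with $\tau = x - x'$, a representation based on the Fourier transform can be computed with Bochner’s theorem Gihman and Skorohod (1974).
Theorem 1 (Bochner’s Theorem) Any shift-invariant kernel $k(\tau)$, $\tau \in \mathbb{R}^D$, with a positive finite measure $d\mu(\omega)$ can be represented in terms of its Fourier transform as
$$k(\tau) = \int_{\mathbb{R}^D} e^{-i \omega \cdot \tau} \, d\mu(\omega). \tag{26}$$
Assuming measure $\mu$ has a density $p(\omega)$, $p$ is the spectral density of $k$ and we have
$$k(\tau) = \int_{\mathbb{R}^D} e^{-i \tau \cdot \omega} p(\omega) \, d\omega \approx \frac{1}{M} \sum_{j=1}^{M} e^{-i \tau \cdot \omega_j} = \langle \phi(x), \phi(x') \rangle, \tag{27}$$
where $p$ is the spectral density of $k$, $\phi(x)$ is an approximate feature map, and $M$ the number of spectral samples from $p$. In practice, the feature map approximating $k(x, x')$ is
$$\phi(x) = \frac{1}{\sqrt{M}} \big[\cos(x^T \omega_1), \ldots, \cos(x^T \omega_M), \sin(x^T \omega_1), \ldots, \sin(x^T \omega_M)\big], \tag{28}$$
where the imaginary part was set to zero, as required for real kernels. In the case of the RBF kernel defined as $k(x, x') = \exp(-\frac{1}{2\sigma^2} \|x - x'\|_2^2)$, the kernel spectral density is Gaussian, $p = \mathcal{N}(0, \sigma^{-2} I)$. Feature maps can be computed by drawing $M$ samples from $p$ one time only, and computing Equation 28 on new inputs $x$ using these samples. Resulting features are not domain-specific and require no feature engineering. Users only need to choose a kernel that represents adequate distance measures in the state-action space, and can benefit from numerous kernels already provided by the literature. Using these features in conjunction with Bayesian linear regression provides an efficient method to approximate a Gaussian process.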
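The construction can be checked numerically with a short sketch. It assumes an RBF kernel with lengthscale $\sigma$, whose spectral density is a zero-mean Gaussian with standard deviation $1/\sigma$ per dimension, and uses plain Monte Carlo sampling of frequencies; function names are illustrative.

```python
import numpy as np


def sample_rff_frequencies(input_dim, n_samples, lengthscale, rng):
    """Draw spectral samples for an RBF kernel: omega ~ N(0, lengthscale^-2 I)."""
    return rng.normal(0.0, 1.0 / lengthscale, size=(n_samples, input_dim))


def rff_features(X, omegas):
    """Map inputs X of shape (d,) or (n, d) to 2M cosine and sine features."""
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])


def rbf_kernel(x, y, lengthscale):
    """Exact RBF kernel value, used here only as a reference."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))
```

The inner product `rff_features(x, omegas) @ rff_features(y, omegas)` should approach `rbf_kernel(x, y, lengthscale)` as the number of frequency samples grows. Frequencies are drawn once and reused for all inputs.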
As the number of features increases, kernel approximation error decreases Sutherland and Schneider (2015); approximating popular shift-invariant kernels to within can be achieved with only features. Additionally, sampling frequencies according to a quasi-random sampling scheme (used in our experiments) reduces kernel approximation error compared to classic Monte Carlo sampling with the same number of features Yang et al. (2014).
EMUQ with RFF combines the ease of use and expressivity of kernel methods brought by RFF with the convergence properties and speed of linear models.
5.3.1 Comparison of Random Fourier Features and Fourier Basis Features
Figure 5: Return mean and standard deviation for Q-learning with random Fourier features (RFF) or Fourier basis features on SinglePendulum (a), MountainCar (b), and DoublePendulum (c) domains with classic rewards. Results are computed using classic Q-learning with an $\epsilon$-greedy policy, and averaged over 20 runs for each method.
For completeness, a comparison between RFF and the better-known Fourier basis features Konidaris et al. (2011) is provided on classic RL domains using Q-learning. A short overview of Fourier basis features is given in Appendix A.
Three relatively simple environments were considered: SinglePendulum, MountainCar and DoublePendulum (details on these environments are given in Section 6). The same Q-learning algorithm was used for both methods, with equal parameters. As few as random Fourier features are sufficient in these domains, while the order of Fourier basis was set to for SinglePendulum and MountainCar and to for DoublePendulum. The higher state and action space dimensions of DoublePendulum make using Fourier basis features prohibitively expensive, as the number of generated features increases exponentially with space dimensions. For example, in DoublePendulum, Fourier basis features of order lead to more than features.
Results displayed in Figure 5 show that RFF outperforms Fourier basis features both in terms of learning speed and asymptotic performance, while using fewer features. In DoublePendulum, the number of Fourier basis features seems insufficient to solve the problem, even though it is an order of magnitude higher than that of RFF.
6 Experiments
EMUQ’s exploration performance is qualitatively and quantitatively evaluated on a toy chain MDP example, widely used continuous control domains and a robotic manipulator problem. Experiments aim at measuring exploration capabilities in domains with goal-only rewards. Unless specified otherwise, domains feature one absorbing goal state with positive unit reward, and potential penalizing absorbing states of reward of . All other rewards are zero, resulting in very sparse reward functions, and rendering guidance from reward gradient information inapplicable.
6.1 Synthetic Chain Domain
We investigate EMUQ’s exploration capabilities on a classic domain known to be hard to explore. It is composed of a chain of states and two actions, displayed in Figure (a). Action right (dashed) has probability to move right and probability to move left. Action left (solid) is deterministic.
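For readers who want to reproduce the setting, a chain of this kind can be sketched as below. Several details are assumptions made for illustration only: the success probability of the right action is exposed as a free parameter `p_right`, the agent starts at the leftmost state, the goal is the rightmost state, and the chain is clipped at its left end.

```python
import random


class ChainMDP:
    """Minimal chain domain with goal-only rewards (illustrative sketch)."""

    LEFT, RIGHT = 0, 1

    def __init__(self, n_states, p_right, seed=None):
        self.n_states = n_states
        self.p_right = p_right  # assumed parameter: chance that RIGHT moves right
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0  # assumed start: leftmost state
        return self.state

    def step(self, action):
        # RIGHT succeeds with probability p_right, otherwise the agent moves left.
        # LEFT always moves left (deterministic).
        if action == self.RIGHT and self.rng.random() < self.p_right:
            move = 1
        else:
            move = -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1  # assumed goal: rightmost state
        reward = 1.0 if done else 0.0           # goal-only reward
        return self.state, reward, done
```

Because every failed right move undoes progress, undirected random exploration rarely reaches the far end of a long chain, which is what makes the domain a useful stress test.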
6.1.1 Goal-only Rewards
We first consider the case of goal-only rewards, where the goal state yields unit reward and all other transitions result in nil reward. Classic exploration such as $\epsilon$-greedy was shown to have exponential regret with the number of states in this domain Osband et al. (2014). Achieving better performance on this domain is therefore essential to any advanced exploration technique. We compare EMUQ to $\epsilon$-greedy exploration for increasing chain lengths, in terms of number of steps before the goal state is found. Results in Figure (b) illustrate the exponential regret of $\epsilon$-greedy while EMUQ achieves much lower exploration time, scaling linearly with chain length.
6.1.2 Semi-Sparse Rewards
We now investigate the impact of reward structure by decreasing the chain domain’s reward sparsity. In this experiment only, agents are given additional rewards with probability for every non-goal state, effectively guiding them towards the goal state (goal-only rewards are recovered for ). The average number of steps before the goal is reached as a function of is compared for $\epsilon$-greedy and EMUQ in Figure (c). Results show that $\epsilon$-greedy performs very poorly for high , but improves as guiding reward density increases. Conversely, EMUQ seems unaffected by reward density and performs equally well for all values of . When , agents receive reward in every non-goal state, and $\epsilon$-greedy performs similarly to EMUQ.
6.2 Classic Control
EMUQ is further evaluated on more challenging RL domains. These feature fully continuous state and action spaces, and are adapted to the goal-only reward setting. In these standard control problems Brockman et al. (2016), classic exploration methods are unable to reach goal states.




6.2.1 Exploration Behaviour on Goal-only MountainCar
We first provide intuition behind what EMUQ learns and illustrate its typical behaviour on a continuous goal-only version of MountainCar. In this domain, the agent needs to drive an underactuated car up a hill by building momentum. The state space consists of car position and velocity, and actions ranging from to describing car wheel torque (absolute value) and direction (sign). The agent is granted a unit reward for reaching the top of the right hill, and zero elsewhere.
Figure 7 displays the state-action exploration value function at different stages of learning, overlaid by the state-space trajectories followed during learning. The first episode (yellow line) exemplifies action babbling, and the car does not exit the valley (around ). On the next episode (black line), the agent finds sequences of actions that allow exiting the valley and exploring further areas of the state-action space. Lastly, in episode three (white line), the agent finds the goal (). This is done by adopting a strategy that quickly leads to unexplored areas, as shown by the increased gap between white lines. The exploration value function reflects high uncertainty about unexplored areas (yellow), which shrink as more data is gathered, and low and decreasing uncertainty for often visited areas such as starting states (purple). The exploration value function also features a gradient which can be followed from any state to find new areas of the state-action space to explore. Figure (d) shows that EMUQ’s exploration capabilities enable it to find the goal state within one or two episodes.
6.2.2 Continuous control benchmark
We now compare our algorithm on the complete benchmark of continuous control goal-only tasks. All domains make use of OpenAI Gym Brockman et al. (2016), and are modified to feature goal-only rewards and continuous state-action spaces with dimensions detailed in Table 2. More specifically, the domains considered are MountainCar and the following:

SinglePendulum: The agent needs to balance an underactuated pendulum upwards by controlling a motor’s torque at the base of the pendulum. A unit reward is granted when the pole (of angle with vertical ) is upwards: .

DoublePendulum: Similarly to SinglePendulum, the agent’s goal is to balance a double pendulum upwards. Only the base joint can be controlled while the joint between the two segments moves freely. The agent is given a unit reward when the tip of the pendulum is close to the tallest point it can reach: within a distance .

CartpoleSwingUp: This domain features a single pole mounted on a cart. The goal is to balance the pole upwards by controlling the torque of the underactuated cart’s wheels. Driving the cart too far off the centre () results in episode failure with reward , and managing to balance the pole (, with the pole angle with the vertical axis) yields unit reward. Note that contrary to classic Cartpole, this domain starts with the pole hanging down and episodes terminate when balance is achieved.

LunarLander: The agent controls a landing pod by applying lateral and vertical thrust, which needs to be landed on a designated platform. A positive unit reward is given for reaching the landing pad within distance of its center point, and a negative unit reward is given for crashing or exiting the flying area.
Reacher: A robotic manipulator composed of two segments can be actuated at each of its two joints to reach a predefined position in a twodimensional space. Bringing the manipulator tip within a distance of a random target results in a unit reward.

Hopper: This domain features a single leg robot composed of three segments, which needs to propel itself to a predefined height. A unit reward is given for successfully jumping to height , and a negative unit reward when the leg falls past angle with the vertical axis.
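Table 2: State and action space dimensions and algorithm parameters for each domain.
| Domain | State dim. | Action dim. | State lengthscale | Action lengthscale | Features $M$ | $\alpha$ | $\beta$ |
|---|---|---|---|---|---|---|---|
| SinglePendulum | 3 | 1 | 0.3 | 0.3 | 300 | 0.001 | 1.0 |
| MountainCar | 2 | 1 | 0.3 | 10 | 300 | 0.1 | 1.0 |
| DoublePendulum | 6 | 1 | 0.3 | 0.3 | 500 | 0.01 | 1.0 |
| CartpoleSwingUp | 4 | 1 | 0.8 | 1.0 | 500 | 0.01 | 1.0 |
| LunarLander | 8 | 2 | 0.5 | 0.3 | 500 | 0.01 | 1.0 |
| Reacher | 11 | 2 | 0.3 | 0.3 | 500 | 0.001 | 1.0 |
| Hopper | 11 | 3 | 0.3 | 0.3 | 500 | 0.01 | 1.0 |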
Most methods in the sparse rewards literature address domains with discrete state and/or action spaces, making it hard to find baselines to compare EMUQ to. Furthermore, classic exploration techniques such as $\epsilon$-greedy fail on these domains. We compare our algorithm to three baselines: VIME, DORA and RFFQ. VIME Houthooft et al. (2016) defines exploration as maximizing information gain about the agent’s belief of environment dynamics. DORA Fox et al. (2018), which we run on discretized action spaces, extends visitation counts to continuous state spaces. Both VIME and DORA use additive rewards, as opposed to EMUQ which uses exploration values. Q-learning with $\epsilon$-greedy exploration and RFF is denoted RFFQ. Because it would fail in domains with goal-only rewards, it is run with classic rewards; see Brockman et al. (2016) for details on classic rewards.
We are interested in comparing exploration performance, favouring fast discovery of goal states. To reflect exploration performance, we measure the number of episodes required before the first positive goal reward is obtained. This metric reflects how long pure exploration is required before goal-reaching information can be exploited to refine policies, and hence directly reflects exploration capabilities. Parameter is set to for all domains and episodes are capped at steps. State spaces are normalized, and random Fourier features approximating squared exponential kernels are used for both state and action spaces with EMUQ and RFFQ. The state and action kernel lengthscales are denoted as and respectively. Exploration and exploitation trade-off parameter is set to for all experiments. Other algorithm parameters were manually fixed to reasonable values given in Table 2.
Table 3: Success rate and number of episodes to goal on the continuous control benchmark.
| Domain | EMUQ success | EMUQ episodes to goal | VIME success | VIME episodes to goal | DORA (discrete) success | DORA (discrete) episodes to goal | RFFQ (classic reward) success | RFFQ (classic reward) episodes to goal |
|---|---|---|---|---|---|---|---|---|
| SinglePendulum | 100% | 1.80 (1.07) | 95% | 2.05 (2.04) | 35% | 3.00 (4.11) | 100% | 1.0 (0.00) |
| MountainCar | 100% | 2.95 (0.38) | 65% | 5.08 (2.43) | 0% | – | 100% | 8.6 (8.05) |
| DoublePendulum | 100% | 1.10 (0.30) | 90% | 3.61 (2.75) | 0% | – | 100% | 4.20 (2.25) |
| CartpoleSwingUp | 90% | 12.40 (16.79) | 65% | 3.23 (2.66) | 35% | 48.71 (30.44) | 100% | 9.70 (12.39) |
| LunarLander | 100% | 28.75 (29.57) | 75% | 4.47 (2.47) | 30% | 35.17 (31.38) | 95% | 19.15 (24.06) |
| Reacher | 100% | 19.70 (20.69) | 95% | 3.68 (2.03) | 35% | 1.00 (0.00) | 95% | 26.55 (25.58) |
| Hopper | 60% | 52.85 (39.32) | 40% | 5.62 (3.35) | 20% | 30.50 (11.80) | 80% | 41.15 (35.72) |
Results displayed in Table 3 indicate that EMUQ is more consistent than VIME or DORA in finding goal states on all domains, illustrating better exploration capabilities. The average number of episodes to reach goal states is computed only on successful runs. EMUQ displays better goal finding on lower-dimensional domains, while VIME tends to find goals faster on higher-dimensional domains but fails on more occasions. Observing similar results between EMUQ and RFFQ confirms that EMUQ can deal with goal-only rewards without sacrificing performance.
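The two quantities reported per method, success rate and mean episodes to goal over successful runs only, can be computed as in the sketch below. It assumes that, under goal-only rewards, an episode reached the goal exactly when its return is positive; function names and the run representation are illustrative.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def episodes_to_first_goal(episode_returns: Sequence[float]) -> Optional[int]:
    """Return the 1-based index of the first episode with positive return, if any."""
    for index, ret in enumerate(episode_returns, start=1):
        if ret > 0:
            return index
    return None  # the goal was never reached in this run


def summarize_runs(first_goal_episodes: Sequence[Optional[int]]) -> Tuple[float, Optional[float]]:
    """Success rate over all runs and mean episodes to goal over successful runs."""
    successes = [e for e in first_goal_episodes if e is not None]
    success_rate = len(successes) / len(first_goal_episodes)
    mean_episodes = mean(successes) if successes else None
    return success_rate, mean_episodes
```

Averaging over successful runs only means a method with a low success rate can still show a small episode count, so the two columns need to be read together.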
6.3 Jaco Manipulator


In this final experiment, we show the applicability of EMUQ to realistic problems by demonstrating its efficacy on an advanced robotics simulator. In this robotics problem, the agent needs to learn to control a Jaco manipulator solely from observing joint configuration. Given a position in the 3D space, the agent’s goal is to bring the manipulator fingertips to this goal location by sending torque commands to each of the manipulator joints; see Figure (a). Designing such target-reaching policies is also known as inverse kinematics for robotic arms, and has been studied extensively. Instead, we focus here on learning a mapping from joint configuration to joint torques on a damaged manipulator. When a manipulator is damaged, previously computed inverse kinematics are no longer valid, so being able to learn a new target-reaching policy is important.
We model damage by immobilizing four of the arm joints, making previous inverse kinematics invalid. The target position is chosen randomly to form locations across the reachable space. Episodes terminate with unit reward when the target is reached within steps; zero rewards are given otherwise. We compare EMUQ and RFFQ on this domain, both using random Fourier features approximating an RBF kernel. Parameters and were manually selected to acceptable values of and respectively. Figure (b) displays results averaged over runs. The difference in number of episodes solved shows EMUQ learns and manages to complete the task more consistently than RFFQ. This confirms that directed exploration is beneficial, even in more realistic robotics scenarios.
7 Conclusion
We proposed a novel framework for exploration in RL domains with very sparse or goal-only rewards. The framework makes use of multi-objective RL to define exploration and exploitation as two key objectives, bringing the balance between the two at a policy level. This formulation has several advantages over traditional exploration methods. It allows direct and online control over exploration, without additional computation or training. Strategies for such control were shown to experimentally outperform classic intrinsic RL in several respects. We demonstrated scalability to continuous state-action spaces by presenting EMUQ, a method based on our framework, guiding exploration towards regions of higher value-function uncertainty. EMUQ was experimentally shown to outperform classic exploration techniques and other intrinsic RL methods on a continuous control benchmark and on a robotic manipulator.
As future work, we would like to investigate how exploration as multi-objective RL can be brought to other types of RL methods such as policy gradient. This extension would enable control over exploration in domains with larger state-action spaces, and potentially numerous real-life applications. Other interesting extensions include bringing the online control over exploration achieved by this work to lifelong RL, where it would be beneficial. Indeed, exploration can be tuned down in critical situations where high performance is necessary, or increased when learning new behaviours is required.
Appendix A Fourier Basis Features
Fourier basis features are described in Konidaris et al. (2011) as a linear function approximation based on Fourier series decomposition. Formally, the order-$n$ feature map for state $s$ is defined as follows:
$$\phi_i(s) = \cos(\pi c_i^T s), \tag{29}$$
where $c_i = [c_1, \ldots, c_{d_S}]$ is drawn from the Cartesian product of all $c_j \in \{0, \ldots, n\}$ for $j = 1, \ldots, d_S$, with $d_S$ the state space dimension. Note that Fourier basis features do not scale well. Indeed, the number of features generated is exponential in the state space dimension.
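The exponential growth is easy to see in code. The sketch below follows the cosine form of Konidaris et al. (2011), with one feature per integer coefficient vector in $\{0, \ldots, n\}^{d}$, and assumes states are scaled to the unit hypercube; function names are illustrative.

```python
import itertools

import numpy as np


def fourier_basis_coefficients(state_dim, order):
    """All integer coefficient vectors in {0, ..., order}^state_dim."""
    return np.array(list(itertools.product(range(order + 1), repeat=state_dim)))


def fourier_basis_features(state, coefficients):
    """Cosine features cos(pi * c . s) for a state assumed to lie in [0, 1]^d."""
    return np.cos(np.pi * coefficients @ np.asarray(state, dtype=float))
```

The number of features is $(n + 1)^{d}$, so an order-3 basis gives 16 features in two dimensions but 4096 in six, which is why the approach becomes expensive on higher-dimensional domains.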
While Fourier basis features approximate value functions with periodic basis functions, random Fourier features are designed to approximate a kernel function with similar basis functions. As such, they allow recovering properties of kernel methods in the limit of an infinite number of features. Additionally, random Fourier features scale better with higher dimensions.
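Appendix B Derivation of Bayesian linear regression for Q-learning
The likelihood function is defined as follows
$$p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{i=1}^{N} \mathcal{N}\big(t_i \mid w^T \phi_{s_i, a_i}, \beta^{-1}\big), \tag{30}$$
where independent transitions are denoted $x_i = (s_i, a_i, r_i, s'_i, a'_i)$ and targets are $t_i = r_i + \gamma Q(s'_i, a'_i)$. We treat the linear regression weights $w$ as random variables and introduce a Gaussian prior
$$p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I). \tag{31}$$
The weight posterior can be computed analytically with Bayes’ rule, resulting in a normal distribution
$$p(w \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{x}, w, \beta) \, p(w) = \mathcal{N}(w \mid m_Q, S). \tag{32}$$
Expressions for the mean $m_Q$ and covariance $S$ follow from general results of products of normal distributions Bishop (2006):
$$m_Q = \beta S \Phi_{s,a}^T (\mathbf{r} + \gamma \mathbf{Q}'), \tag{33}$$
$$S = (\alpha I + \beta \Phi_{s,a}^T \Phi_{s,a})^{-1}, \tag{34}$$
where $\Phi_{s,a} = \{\phi_{s_i, a_i}\}_{i=1}^{N}$, $\mathbf{Q}' = \{Q(s'_i, a'_i)\}_{i=1}^{N}$, and $\mathbf{r} = \{r_i\}_{i=1}^{N}$. The predictive distribution can be obtained by weight marginalization and is also normal
$$p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, w, \beta) \, p(w \mid \mathbf{t}) \, dw = \mathcal{N}\big(t \mid Q(s, a), \sigma^2(s, a)\big). \tag{35}$$
Expressions for $Q$ and $\sigma^2$ follow from general results Bishop (2006), yielding
$$Q(s, a) = \phi_{s,a}^T m_Q, \tag{36}$$
$$\sigma^2(s, a) = \beta^{-1} + \phi_{s,a}^T S \phi_{s,a}. \tag{37}$$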
References
 Finitetime analysis of the multiarmed bandit problem. Machine learning. Cited by: §2.4.
 Efficient exploration through Bayesian deep Qnetworks. arXiv preprint arXiv:1802.04412. Cited by: §2.4.
 A distributional perspective on reinforcement learning. In International Conference on Machine Learning, Cited by: §2.4.
 Unifying countbased exploration and intrinsic motivation. In Neural Information Processing Systems, Cited by: §1, §2.4, §3.3.
 Pattern recognition and machine learning. Technical report Cited by: Appendix B.
 R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research. Cited by: §1.
 A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599. Cited by: §2.4, §3.1.
 OpenAI Gym. External Links: arXiv:1606.01540 Cited by: §6.2.2, §6.2.2, §6.2.
 Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, Cited by: Figure 1, §2.2, §2.4.
 A statistical method for global optimization. In International Conference on Systems, Man and Cybernetics, Cited by: §2.4, §3.1.

 Bayesian Q-learning. In Association for the Advancement of Artificial Intelligence, Cited by: §2.4.
 Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research. Cited by: §4.
 Reinforcement learning with Gaussian processes. In International Conference on Machine Learning, Cited by: §2.4, §5.2.
 DORA the explorer: Directed outreaching reinforcement actionselection. In International Conference on Learning Representations, Cited by: §1, §2.4, §2.4, §3.3, §6.2.2.
 The theory of stochastic processes, vol. i. Cited by: §5.3.
 Vime: Variational information maximizing exploration. In Neural Information Processing Systems, Cited by: §2.4, §6.2.2.
 Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research. Cited by: §1.
 Efficient global optimization of expensive blackbox functions. Journal of Global optimization. Cited by: §2.4.
 Reinforcement learning: A survey. Journal of artificial intelligence research. Cited by: §1.
 Nearoptimal reinforcement learning in polynomial time. Machine Learning. Cited by: §1, §3.3.
 Value function approximation in reinforcement learning using the Fourier basis.. In Association for the Advancement of Artificial Intelligence, Cited by: Appendix A, §5.3.1, §5.3.
 Algorithms for multiarmed bandit problems. arXiv preprint arXiv:1402.6028. Cited by: §2.4.
 Leastsquares policy iteration. Journal of Machine Learning Research. Cited by: §5.2.
 Learning and exploration in actionperception loops. Frontiers in neural circuits. Cited by: §2.2, §2.4.
 Exploration in modelbased reinforcement learning by empirically estimating learning progress. In Neural Information Processing Systems, Cited by: §2.4.
 Exploration of multistate environments: Local measures and backpropagation of uncertainty. Machine Learning. Cited by: §2.4.
 Bayesian RL for goalonly rewards. In Conference on Robot Learning, Cited by: §2.4, §2.4, §5.1, footnote.
 Nonparametric return distribution approximation for reinforcement learning. In International Conference on Machine Learning, Cited by: §2.4.
 Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, Cited by: §1.
 Multiresolution exploration in continuous spaces. In Neural Information Processing Systems, Cited by: §2.4.
 Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, Cited by: §2.4.
 Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635. Cited by: §1, Figure 6, §6.1.1.
 How can we define intrinsic motivation?. In International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, Cited by: §2.4.
 Curiositydriven exploration by selfsupervised prediction. In International Conference on Machine Learning, Cited by: §2.4.
 On finding compromise solutions in multiobjective Markov decision processes.. In European Conference on Artificial Intelligence, Cited by: §2.4.
 Random features for largescale kernel machines. In Neural Information Processing Systems, Cited by: §5.3.
 Towards generalization and simplicity in continuous control. In Neural Information Processing Systems, Cited by: §5.3.
 Average reward optimization with multiple discounting reinforcement learners. In International Conference on Neural Information Processing, Cited by: §1, §2.4.
 A survey of multiobjective sequential decisionmaking. Journal of Artificial Intelligence Research. Cited by: §1, §2.3, §2.4.
 Trust region policy optimization. In International Conference on Machine Learning, Cited by: §1.
 Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814. Cited by: §2.2, §2.4.

 Reinforcement driven information acquisition in non-deterministic environments. In International Conference on Artificial Neural Networks, Cited by: §2.4.
 On the error of random Fourier features. In Conference on Uncertainty in Artificial Intelligence, Cited by: §5.3.
 Reinforcement learning: An introduction. MIT press. Cited by: §4.
 The many faces of optimism: A unifying approach. In International Conference on Machine Learning, Cited by: §2.4.
 Using trajectory data to improve Bayesian optimization for reinforcement learning. Journal of Machine Learning Research. Cited by: §2.4.
 QuasiMonte Carlo feature maps for shiftinvariant kernels. In International Conference on Machine Learning, Cited by: §5.3.