Exploration in Feature Space for Reinforcement Learning

10/05/2017 ∙ by Suraj Narayanan Sasikumar, et al. ∙ 0

The infamous exploration-exploitation dilemma is one of the oldest and most important problems in reinforcement learning (RL). Deliberate and effective exploration is necessary for RL agents to succeed in most environments. However, until very recently even very sophisticated RL algorithms employed simple, undirected exploration strategies in large-scale RL tasks. We introduce a new optimistic count-based exploration algorithm for RL that is feasible in high-dimensional MDPs. The success of RL algorithms in these domains depends crucially on generalization from limited training experience. Function approximation techniques enable RL agents to generalize in order to estimate the value of unvisited states, but at present few methods have achieved generalization about the agent's uncertainty regarding unvisited states. We present a new method for computing a generalized state visit-count, which allows the agent to estimate the uncertainty associated with any state. In contrast to existing exploration techniques, our ϕ-pseudocount achieves generalization by exploiting the feature representation of the state space that is used for value function approximation. States that have less frequently observed features are deemed more uncertain. The resulting ϕ-Exploration-Bonus algorithm rewards the agent for exploring in feature space rather than in the original state space. This method is simpler and less computationally expensive than some previous proposals, and achieves near state-of-the-art results on high-dimensional RL benchmarks. In particular, we report world-class results on several notoriously difficult Atari 2600 video games, including Montezuma's Revenge.





1.1 Reinforcement Learning

Machine Learning is a field in computer science that allows computers to dynamically generate novel algorithms that otherwise cannot be explicitly programmed. These algorithms, called hypotheses, generalize patterns and regularities from observed real-world data using statistical techniques (Bishop, 2007). Reinforcement Learning (RL) is a field of machine learning that deals with optimal sequential decision making in an unknown environment with no explicitly labelled training data. The RL framework is one of the fundamental models that best describes how intelligent beings interact with their world to achieve a goal. An RL algorithm is given agency to interact with its surroundings, and is aptly called an agent. The world with which the agent interacts is called its environment. The unsupervised nature of RL algorithms means that the agent has to develop a policy for acting in an unknown environment by trial-and-error (Sutton and Barto, 1998). In every such interaction the agent performs an action on the environment and receives a percept. The percept consists of the current configuration of the environment, called state, and a scalar feedback signal, called reward. The reward signal indicates how good the sequence of actions of the agent was. The goal of an RL agent is based on the concept of the reward hypothesis:

Definition 1 (Reward Hypothesis).

(Sutton, 1999) Any notion of a goal or purpose of an intelligent agent can be described as the maximization of expected cumulative reward.

The existence of an extrinsic feedback signal makes RL algorithms also somewhat supervised in nature - thus RL algorithms are in some sense both supervised and unsupervised (Barto and Dietterich, 2004).

As an example, consider an agent playing a car racing game in which the goal is to reach the finish line as soon as possible. To model the goal as a cumulative reward maximization problem, we give the agent a negative reward every time step, thereby incentivizing the agent to reach the finish line as quickly as possible. This example illustrates how an objective can be modelled as the maximization of expected cumulative reward, and the goal of an agent as a sequence of actions that achieves it. The interaction between agent and environment continues until the agent converges to an optimal sequence of actions for each state in the environment. This interaction is called the agent-environment interaction cycle, as illustrated in Figure 1.1. Each iteration of the interaction is called a time-step, often denoted by the subscript $t$ to distinguish states, actions, and percepts between time-steps.

Figure 1.1: The agent-environment interaction cycle (Sutton and Barto, 1998)
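The interaction cycle and the car racing example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RaceTrack` class, its one-dimensional track, and the two policies are hypothetical and do not come from the thesis; only the reward of -1 per time-step mirrors the example above.

```python
import random


class RaceTrack:
    """Hypothetical 1-D track: the car starts at position 0 and finishes at `length`."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves the car one cell forward, action 0 leaves it in place.
        self.position = min(self.length, self.position + action)
        done = self.position == self.length
        reward = -1.0  # every time-step costs one unit of reward
        return self.position, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run the agent-environment loop once and return the cumulative reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent acts on the environment
        state, reward, done = env.step(action)  # environment returns a percept
        total += reward
        if done:
            break
    return total


rng = random.Random(0)
fast = run_episode(RaceTrack(), lambda s: 1)                   # always accelerate
slow = run_episode(RaceTrack(), lambda s: rng.choice([0, 1]))  # dither at random
```

The policy that always moves forward collects the larger (less negative) cumulative reward, which is how the per-step penalty encodes the goal of finishing quickly.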

1.2 The Exploration/Exploitation Dilemma

In an online decision-making setting such as the reinforcement learning problem, an agent is faced with two choices: explore or exploit. The term exploration in an active learning system is defined as the process of deliberately taking a non-greedy action with the sole aim of gathering more information about the environment. Exploration plays a fundamental role in reinforcement learning algorithms. It is born out of the notion that an optimal long-term policy might involve short-term sacrifices. Alternatively, exploitation is the act of taking the best possible action given the current information about the environment. A central challenge in reinforcement learning is to find the sweet spot between exploration and exploitation, i.e., to figure out when to explore and when to exploit. This problem is known as the exploration-exploitation dilemma.

At present there are a number of provably efficient exploration methods that are effective in environments with low-dimensional state-action spaces. Most of the exploration algorithms which enjoy strong theoretical guarantees implement the so-called "Optimism in the Face of Uncertainty" (OFU) principle. This heuristic encourages the agent to be optimistic about the reward it might attain in less explored parts of the environment. The agent seeks out states with higher associated uncertainty, and in doing so reduces its uncertainty in a very efficient way. Many algorithms that implement this heuristic do so by adding an exploration bonus to the agent’s reward signal. This bonus is usually a function of a state visit-count; the agent receives higher exploration bonuses for exploring less frequently visited states (about which it is less certain).

Unfortunately, these algorithms do not scale well to high-dimensional environments. In these domains, the agent can only visit a small portion of the state space while it is training. The visit-count for most states is always zero, even after training is finished. Nearly all states will be assigned the same exploration bonus throughout training. This renders the bonus useless as a tool for efficient exploration. All unvisited states appear to the agent as equally uncertain. This problem arises because these count-based OFU algorithms fail to generalise the agent’s uncertainty from one context to another. Even if an unvisited state has very similar features to a frequently visited one, the agent will treat the former as a complete unknown. Consequently even the sophisticated algorithms that are suitable for the high-dimensional setting – e.g. those that use deep neural networks for policy evaluation – tend to use simple, inefficient exploration strategies.

Success in the high-dimensional setting demands that the agent represent the state space in a way that allows generalisation about uncertainty. This sort of generalisation would allow that the agent’s uncertainty be lower for states with familiar features, and higher for states with novel features, even if those exact states haven’t been visited. What we require, then, is an efficient method for computing a suitable similarity measure for states. That is the key challenge addressed in this thesis.

1.3 Summary of Contributions

This thesis presents a new count-based exploration algorithm that is feasible in environments with large state-action spaces. It can be combined with any value-based RL algorithm that uses linear function approximation (LFA). The principal contribution is a new method for computing generalised visit-counts. Following Bellemare et al. (2016), we construct a visit-density model in order to measure the similarity between states. Our approach departs from theirs in that we do not construct our density model over the raw state space. Instead, we exploit the feature map that is used for value function approximation, and construct a density model over the transformed feature space. This model assigns higher probability to state feature vectors that share features with visited states. Generalised visit-counts are then computed from these probabilities; states with frequently observed features are assigned higher counts. These counts serve as a measure of the uncertainty associated with a state. Exploration bonuses are then computed from these counts in order to encourage the agent to visit regions of the state-space with less familiar features.

Our density model can be trivially derived from any feature map used for LFA, regardless of the application domain, and requires little or no additional design. In contrast to existing algorithms, there is no need to perform a special dimensionality reduction of the state space in order to compute our generalised visit-counts. Our method uses the same lower-dimensional feature representation to estimate value and to estimate uncertainty. This makes it simpler to implement and less computationally expensive than some existing proposals. Our evaluation demonstrates that this simple approach achieves near state-of-the-art performance on high-dimensional RL benchmarks.

2.1 Classical Reinforcement Learning

In Classical RL (CRL), the environment is assumed to be fully observable, ergodic, and every state has the Markov property. The branch of reinforcement learning where these assumptions are lifted is called General Reinforcement Learning (GRL) (Hutter, 2005).

Definition 2 (Markov property).

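Future states are only dependent on the current state and action, and are independent of the history of percepts. Formally,

$$\Pr(s_{t+1} = s' \mid s_t, a_t) = \Pr(s_{t+1} = s' \mid h_{t-1}, s_t, a_t)$$

for all $s' \in \mathcal{S}$, and histories $h_{t-1}$ of states, actions, and rewards preceding time-step $t$.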

A Markov Decision Process (MDP) captures the above assumptions about the environment, and so in the CRL context the environment is modelled as an MDP (Puterman, 1994). Thus, the CRL problem now reduces to the problem of finding an optimal policy for an unknown MDP.

Definition 3 (Markov Decision Process).

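A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ representative of a fully-observable environment in which all states are Markov.

  • $\mathcal{S}$ is a finite set of states

  • $\mathcal{A}$ is a finite set of actions

  • $\mathcal{P}^{a}_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ are the transition probabilities

  • $\mathcal{R}^{a}_{ss'} = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$ is the expected value of the reward resulting from the transition

  • $\gamma \in [0, 1]$ is the discount factor which weights the relative importance of immediate rewards to future rewards.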

If the dynamics (transition and reward distributions) of the MDP are known, then we can use dynamic programming methods to directly plan on the MDP to find an optimal policy. In the RL context, in which the system dynamics are unknown, we have to use iterative RL algorithms such as TD-learning (Sutton, 1988) to find a good policy asymptotically. (Footnote 1: Asymptotic analysis is one of the few theoretical tools we have to analyse RL algorithms in a domain-agnostic way.)

Definition 4 (Policy).

A policy may be deterministic or stochastic. A deterministic policy is a mapping from the states to actions.

$$\pi : \mathcal{S} \to \mathcal{A}$$

A stochastic policy is a probability distribution over the set of actions given a state.

$$\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$$


The most common way to characterize the quality of a given policy is to define a function that computes how valuable it is to follow the policy from a given state (or state-action pair). This notion of value is expressed in terms of future rewards the agent could expect, if it had chosen to follow the given policy.

Definition 5 (State-Value Function).

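The state-value function, $V^{\pi}$, is a mapping from states to $\mathbb{R}$. The value of a state $s$ under policy $\pi$ is the expected discounted cumulative reward given that the agent starts in state $s$ and follows policy $\pi$ thereafter.

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right]$$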

Definition 6 (Action-Value Function).

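The action-value function, $Q^{\pi}$, is a mapping from state-action pairs to $\mathbb{R}$. The action-value of the state-action pair $(s, a)$ under policy $\pi$ is the expected discounted cumulative reward given that the agent starts in state $s$, takes action $a$, and follows policy $\pi$ thereafter.

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$$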
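Both value functions are expectations of the same quantity, the discounted cumulative reward of a trajectory. As a small illustration (the function name and the example reward sequences are hypothetical), the discounted return of one sampled reward sequence can be computed by accumulating backwards:

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for one sampled reward sequence."""
    g = 0.0
    # Work backwards so each step applies g = r + gamma * g exactly once.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

Averaging such returns over many trajectories that start in the same state (or state-action pair) and follow the same policy gives a Monte Carlo estimate of the corresponding value.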

Bellman Equations

Bellman equations form the basis for how to compute, approximate, and learn value functions in the RL setup (Sutton and Barto, 1998). They arise naturally from the structure of an MDP by capturing the recursive relationship between the value of a state and the value of its successor states. The two Bellman equations for the state-values and action-values can be defined as follows.

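Definition 7 (Bellman Equation for state-value function of an MDP).

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \mathcal{P}^{a}_{ss'} \left[\mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s')\right]$$

where $\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{ss'}$ denote the transition probabilities and expected rewards of the MDP (Definition 3).

Definition 8 (Bellman Equation for action-value function of an MDP).

$$Q^{\pi}(s, a) = \sum_{s'} \mathcal{P}^{a}_{ss'} \left[\mathcal{R}^{a}_{ss'} + \gamma \sum_{a'} \pi(a' \mid s') Q^{\pi}(s', a')\right]$$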

We can now use the value function to define a partial ordering over policies. A policy is said to be better than another when the expected return of one policy is greater than or equal to the other for all states. Formally, $\pi \geq \pi'$ if and only if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all $s \in \mathcal{S}$. From the imposed partial ordering it has been shown that there exists at least one policy, $\pi^{*}$, such that $\pi^{*} \geq \pi$ for all policies $\pi$, although it might not be unique (Bertsekas and Tsitsiklis, 1996). The Bellman Optimality Equations provide a mathematical framework for talking about the optimal policy just by replacing the sum over actions with a $\max$ operator. Intuitively, this represents a policy that is greedy with respect to the value of its successor states.

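Definition 9 (Bellman Optimality Equation for state-values).

$$V^{*}(s) = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[\mathcal{R}^{a}_{ss'} + \gamma V^{*}(s')\right]$$

Definition 10 (Bellman Optimality Equation for action-values).

$$Q^{*}(s, a) = \sum_{s'} \mathcal{P}^{a}_{ss'} \left[\mathcal{R}^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a')\right]$$

Here $\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{ss'}$ are the transition probabilities and expected rewards of the MDP (Definition 3).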

For finite MDPs with known environment dynamics, the Bellman Optimality Equations have a unique solution. Unfortunately in the RL setup we deal with an unknown MDP. Thus, almost all of the RL algorithms approximate the Bellman Optimality Equations for an unknown MDP and try to iteratively find an optimal policy asymptotically.
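When the dynamics are known, the Bellman optimality equation for state-values can be solved by repeatedly applying it as an update, which is the value iteration algorithm. The sketch below assumes a hypothetical two-state MDP stored as nested lists `P[a][s][s2]` and `R[a][s][s2]`; this layout and the toy numbers are illustrative choices, not something specified in the thesis.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Solve V*(s) = max_a sum_s2 P[a][s][s2] * (R[a][s][s2] + gamma * V*(s2))."""
    n_actions, n_states = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        # One-step lookahead values Q[s][a] computed from the current estimate V.
        Q = [[sum(P[a][s][t] * (R[a][s][t] + gamma * V[t]) for t in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            policy = [q.index(max(q)) for q in Q]  # greedy action in each state
            return V_new, policy
        V = V_new


# Hypothetical MDP: state 0 is "on the track", state 1 is an absorbing finish line.
# Action 0 ("wait") stays in place, action 1 ("go") moves to the finish line.
P = [[[1.0, 0.0], [0.0, 1.0]],   # wait
     [[0.0, 1.0], [0.0, 1.0]]]   # go
R = [[[-1.0, 0.0], [0.0, 0.0]],  # waiting costs -1
     [[0.0, -1.0], [0.0, 0.0]]]  # going costs -1 once
V_star, pi_star = value_iteration(P, R, gamma=0.9)
```

In the RL setting `P` and `R` are not available, which is why the algorithms discussed next estimate values from sampled experience instead.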

2.1.1 Reinforcement Learning Algorithms

The fundamental difference between an RL problem and a planning problem is the knowledge of the environment dynamics. In a planning problem the model of the environment is already known and the problem boils down to finding an optimal policy in the environment. In an RL problem, the agent is dropped into an environment whose dynamics are unknown. This makes reinforcement learning a hard problem. This distinction gives rise to two categories of RL algorithms, namely model-based and model-free.

Algorithms that learn a model of the environment and then plan within the learned model are called model-based RL algorithms. These algorithms learn the transition probabilities ($\mathcal{P}$) and reward functions ($\mathcal{R}$) of the MDP by iteratively simulating the environment and updating the simulation to better represent the true environment. This approach to solving unknown MDPs is computationally intensive, especially in large or continuous problems. Value iteration and policy iteration are two dynamic programming algorithms that take a planning-based approach to the RL problem. On the other hand, model-free algorithms directly learn the optimal policy using an intermediary quantity (usually the value-function).

Generalized Policy Iteration (GPI)

The overarching theme of almost all value-function based CRL algorithms is the back-and-forth between two interacting processes, prediction and control, eventually resulting in convergence. Prediction refers to policy-evaluation where the value-function is estimated for the current policy. Control on the other hand aims to find a policy that is greedy with respect to the current value-function (state-value or action-value).

Figure 2.1: Generalized Policy Iteration (Sutton and Barto, 1998)

The ‘Prediction and Control’ process converges when it produces no significant change, that is, the value-function is consistent with the current policy and the policy is greedy with respect to the current value-function.
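The alternation between prediction and control can be illustrated with classical policy iteration on a known model. The following is a minimal sketch, assuming the same hypothetical nested-list layout `P[a][s][s2]`, `R[a][s][s2]` and a toy two-state MDP; none of these names come from the thesis.

```python
def evaluate_policy(P, R, policy, gamma, tol=1e-10):
    """Prediction: estimate V for the current deterministic policy."""
    n_states = len(policy)
    V = [0.0] * n_states
    while True:
        V_new = [sum(P[policy[s]][s][t] * (R[policy[s]][s][t] + gamma * V[t])
                     for t in range(n_states))
                 for s in range(n_states)]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


def improve_policy(P, R, V, gamma):
    """Control: return the policy that is greedy with respect to V."""
    n_actions, n_states = len(P), len(V)

    def q(s, a):
        return sum(P[a][s][t] * (R[a][s][t] + gamma * V[t]) for t in range(n_states))

    return [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]


def policy_iteration(P, R, gamma):
    policy = [0] * len(P[0])
    while True:
        V = evaluate_policy(P, R, policy, gamma)
        new_policy = improve_policy(P, R, V, gamma)
        if new_policy == policy:  # policy is greedy w.r.t. its own value function
            return policy, V
        policy = new_policy


# Hypothetical MDP: action 0 waits in state 0 (cost -1), action 1 moves to the
# absorbing state 1 (cost -1 once).
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
R = [[[-1.0, 0.0], [0.0, 0.0]], [[0.0, -1.0], [0.0, 0.0]]]
pi_gpi, V_gpi = policy_iteration(P, R, gamma=0.9)
```

The loop stops exactly when the two processes no longer change each other, which is the convergence condition described above.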

Temporal Difference Learning

Temporal Difference learning (TD learning) is a common RL algorithm; it is a model-free algorithm that combines Monte Carlo methods with the ideas from dynamic programming. TD learning allows the agent to directly learn from its experience of the environment. Following the GPI theme, we need a strategy for prediction and control. In TD prediction we use the sampling of Monte Carlo methods and bootstrapping (updating from an existing estimate) of DP algorithms to estimate the current value-function.

Definition 11 (Update formula for state-value function).

$$V(s_t) \leftarrow V(s_t) + \alpha \left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$

where $\alpha$ is the learning rate. TD(0) is a TD learning algorithm that updates state-values after each time-step, so the learning process is fast and on-line. The target for the TD(0) update formula uses the existing estimate of $V(s_{t+1})$, hence we say the algorithm bootstraps.

As the agent interacts with the environment more, TD learning is able to generate a better estimate of the value-functions. In the limit, if each state (or state-action pair) is visited infinitely often with some additional constraints on the learning rates, convergence to the true value-function is guaranteed (Bertsekas and Tsitsiklis, 1996).
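A tabular TD(0) prediction step is short enough to write out in full. The sketch below is illustrative only: the three-state chain, the fixed episode, and the constant learning rate are hypothetical choices made so that the true values (-3, -2, -1) are easy to check by hand.

```python
def td0_evaluate(episodes, n_states, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = [0.0] * (n_states + 1)  # the extra slot is a terminal state with value 0
    for episode in episodes:
        for s, r, s_next in episode:
            td_target = r + gamma * V[s_next]  # bootstraps on the current estimate
            V[s] += alpha * (td_target - V[s])
    return V[:n_states]


# Hypothetical chain 0 -> 1 -> 2 -> terminal (index 3), reward -1 per step.
episode = [(0, -1.0, 1), (1, -1.0, 2), (2, -1.0, 3)]
V_td = td0_evaluate([episode] * 500, n_states=3)
```

Because every state of this tiny chain is visited in every episode, the estimates approach the true values; the scaling problems discussed later arise precisely when such repeated visits are impossible.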

In TD control we want to optimize the value-function of an unknown environment. There are two classes of policy control methods, namely, on-policy and off-policy. On-policy control uses the policy derived from the current value-function estimate to update the future estimates. Alternatively, off-policy control uses a policy that is greedy with respect to the current value-function to estimate future value-functions. SARSA (on-policy) and Q-learning (off-policy) are two popular TD control algorithms that are known to learn an MDP asymptotically (Sutton and Barto, 1998; Watkins and Dayan, 1992).

The important concept of why we are able to do model-free TD control lies in the fact that we use (state,action)-value functions instead of state-value functions.

Definition 12 (Greedy policy control).

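Policy improvement is done by considering a new policy, $\pi'$, which is greedy with respect to the current value-function. For the state-value function, this requires the transition probabilities $\mathcal{P}^{a}_{ss'}$ and expected rewards $\mathcal{R}^{a}_{ss'}$ of the MDP:

$$\pi'(s) = \arg\max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[\mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s')\right]$$

For the action-value function:

$$\pi'(s) = \arg\max_{a} Q^{\pi}(s, a)$$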

From the above policy improvement equations we can see that in order to be greedy with respect to the state-value function, we require the model of the MDP. In contrast, if the policy is greedy with respect to the action-value function, the model dynamics of the MDP is not needed, and hence, it is model-free. Thus, optimizing action-value functions to learn the optimal policy is at the heart of all model-free TD control algorithms.
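The two TD control updates and the model-free greedy step can be sketched as follows. The function names, the dictionary-based table, and the numbers in the usage lines are hypothetical; the update rules themselves are the standard on-policy (SARSA) and off-policy (Q-learning) forms.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action actually chosen in the next state."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: the target uses the greedy action in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Greedy control needs only Q(s, .); no transition or reward model appears."""
    return max(actions, key=lambda a: Q[(s, a)])


Q = defaultdict(float)
Q[(1, 0)], Q[(1, 1)] = 2.0, 5.0
sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1], alpha=0.5, gamma=0.9)
best = greedy_action(Q, 0, [0, 1])
```

Note that `greedy_action` never touches the MDP dynamics, which is the point made above about why action-values make model-free control possible.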

Challenges and Drawbacks

All the Classical Reinforcement Learning algorithms that we discussed above can be categorized as tabular algorithms. That is, the algorithms use a table data structure to associate each state (or state-action pair) with its current value estimate. As the agent interacts with the environment and gains experience, the table values are updated with better estimates of its value.

The main drawback of such a method is that it scales poorly. When the state-space is very large or continuous, the fundamental requirement that the agent visits each state (or state-action pair) multiple times (or infinitely often) is not satisfied; states are at most visited once. An agent following a policy derived from these value estimates would do no better than a random policy. Moreover, the table size grows with the number of states, making storage infeasible for problems with large/continuous state-space.

A common approach to solving this problem is to find a way to generalize the value-function from the limited experience of the agent (Sutton and Barto, 1998). That is, we want to approximate the value-function for an unseen state (or state-action pair) from the example values it has observed so far. Function approximation is a generalization technique that does exactly this; it takes in observed values of a desired function and attempts to generalize an approximation of the function.

2.1.2 Function Approximation

Function approximation (FA) is an instance of supervised learning (Sutton and Barto, 1998). It is viewed as a class of techniques used to approximate functions by using example values of the desired function. In the RL context tabular methods become infeasible in large or continuous state spaces. This challenge is mitigated by employing FA techniques to predict the value-function at unseen states. However, not all FA methods are applicable to the RL setting. We require a training method which can learn efficiently from on-line, non-i.i.d. data, and also handle non-stationary target functions. The following are some of the function approximators that are used in the RL context.

  • Gradient-Descent Methods

    • Artificial Neural Networks

    • Linear Combination of Features

  • State Aggregation

    • k-Nearest Neighbors

    • Soft Aggregation

State aggregation is a method of generalizing function approximation in which states are grouped based on a criterion and then value is estimated as an attribute of the group. When a state is re-visited the value corresponding to the state’s group gets updated.
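As a minimal sketch of state aggregation, assume integer-valued states grouped into fixed-width bins. The grouping criterion, the bin width, and the function names are hypothetical and chosen only to show how an update to one state changes the estimate for every state in its group.

```python
from collections import defaultdict


def group_of(state, width=10):
    # Hypothetical criterion: integer states are grouped into bins of `width`.
    return state // width


def update_group_value(values, state, target, alpha=0.5):
    """Move the value of the state's group towards the observed target."""
    g = group_of(state)
    values[g] += alpha * (target - values[g])


def estimate(values, state):
    """The value estimate of a state is the value stored for its group."""
    return values[group_of(state)]


values = defaultdict(float)
update_group_value(values, state=3, target=10.0)
same_group = estimate(values, 7)    # state 7 was never visited but shares a group
other_group = estimate(values, 12)  # state 12 lies in a different group
```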

Linear combination of features, also known as Linear Function Approximation (LFA), is essentially a linear mapping from the state space (of dimensionality $N$) to a feature space of dimension $M$, where often $M < N$. Each basis function of the feature space is a mapping from the state space to a real-valued number that represents some feature of the state-space.

Definition 13 (Linear-Approximate state(action)-value function).

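The approximate state-value function of a state $s$ under a policy $\pi$ is given by:

$$\hat{V}^{\pi}(s) = \theta^{\top} \phi(s) = \sum_{i=1}^{M} \theta_i \phi_i(s)$$

Where $\phi(s) = (\phi_1(s), \ldots, \phi_M(s))^{\top}$, $\phi : \mathcal{S} \to \mathbb{R}^{M}$ is a feature map, and $\theta \in \mathbb{R}^{M}$ is the parameter vector.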

LFA has sound theoretical guarantees and also is very efficient in terms of both data and computation (Sutton and Barto, 1998), making it a good candidate for the implementation of our algorithm.
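To make the linear form concrete, the sketch below pairs it with a semi-gradient TD(0) update. The choice of semi-gradient TD(0) is an assumption made for illustration (this section does not specify a training rule), and the two-dimensional feature vectors are hypothetical.

```python
def v_hat(theta, phi_s):
    """Linear value estimate: the inner product of theta and phi(s)."""
    return sum(t * f for t, f in zip(theta, phi_s))


def td0_lfa_update(theta, phi_s, r, phi_next, alpha, gamma):
    """Semi-gradient TD(0): theta <- theta + alpha * delta * phi(s)."""
    delta = r + gamma * v_hat(theta, phi_next) - v_hat(theta, phi_s)
    return [t + alpha * delta * f for t, f in zip(theta, phi_s)]


theta = [0.0, 0.0]
phi_visited = [1.0, 0.5]  # hypothetical features of the visited state
phi_next = [1.0, 0.6]     # hypothetical features of its successor
theta = td0_lfa_update(theta, phi_visited, r=1.0, phi_next=phi_next,
                       alpha=0.1, gamma=0.9)
unseen_value = v_hat(theta, [1.0, 0.6])  # estimate for a state never updated
```

A single update on one state changes the estimate for every state with overlapping features, which is the generalisation about value that the rest of this section relies on.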

As mentioned previously, FA can be regarded as a technique to develop a generalization regarding value. In order to have a good capacity to generalize, a function approximator must have relevant data about the state-space. Consider a pathological case in which the agent does not explore at all: as a result the only data available for FA would be concentrated in one region of the state space. This results in the estimation of values of unseen states being highly biased. In order to avoid this problem we have to make sure that the agent visits most regions of the state-space, that is, the agent has to explore the state-space efficiently. The main goal of this thesis is to address the problem of how to explore efficiently in large state-spaces.

2.2 Exploration Strategies for Reinforcement Learning

In Section 1.2 we described the exploration/exploitation dilemma, which is a fundamental problem in RL. All exploration strategies attempt to manage the trade-off between these two often opposed objectives. The simplest and most widely-used exploration strategy is known as $\epsilon$-greedy. At each time-step the agent chooses a greedy action with probability $1 - \epsilon$ and with probability $\epsilon$ the agent chooses a completely random action. To ensure that the policy converges to the optimal policy it has to satisfy the GLIE assumptions (Singh et al., 2000):

Definition 14 (Greedy in the Limit with Infinite Exploration).

A policy is GLIE if it satisfies the following two assumptions.

  • Each action is taken infinitely often in every state that is visited infinitely often,

    $$\lim_{t \to \infty} N_t(s, a) = \infty$$

    Where $N_t(s, a)$ is the number of times action $a$ has been chosen in state $s$ up to time-step $t$.

  • In the limit, the learning policy is greedy with respect to the Q-value function with probability $1$.

For example, $\epsilon$-greedy satisfies the GLIE assumptions when $\epsilon$ is annealed to zero. A common way to do this is by setting $\epsilon_t = 1/t$.
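A minimal sketch of $\epsilon$-greedy action selection with an annealed exploration rate is given here. The schedule $\epsilon_t = 1/t$ is used as an assumed example of annealing to zero, and the function names and Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def annealed_epsilon(t):
    """Assumed schedule epsilon_t = 1 / t for time-steps t >= 1."""
    return 1.0 / t


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
early = [epsilon_greedy(q_values, annealed_epsilon(1), rng) for _ in range(20)]
late = epsilon_greedy(q_values, 0.0, rng)  # the limiting, purely greedy behaviour
```

Note that the random branch ignores everything the agent has learned, which is why this strategy is classed as undirected.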

In small, finite MDPs $\epsilon$-greedy satisfies the GLIE assumptions, but when the state-action space is large/continuous the first GLIE assumption is violated and hence the convergence guarantee is lost. $\epsilon$-greedy is a naïve approach to solving the exploration problem, but we still use it in large MDPs because of its low resource requirements when compared with alternatives (Bellemare et al., 2016). In this thesis we propose a novel exploration strategy that improves upon $\epsilon$-greedy, and provides state-of-the-art results in large problems with low computational overhead.

We now provide an exposition of various exploration strategies, their foundational principles, and an analysis of recent breakthroughs in the field of exploration.

2.2.1 Taxonomy of Exploration Strategies

The exploration-exploitation dilemma is still an open problem, but researchers have made significant inroads into understanding the nature of the problem. Sebastian Thrun classified exploration techniques into two families of exploration schemes, directed and undirected (Thrun, 1992). Undirected exploration strategies do not use any information from the environment to make an informed exploratory action; they predominantly rely on randomness to do exploration. Softmax methods and $\epsilon$-greedy are examples of undirected exploration techniques. The softmax action is sampled from the Boltzmann distribution

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a)/\tau\right)}{\sum_{b \in \mathcal{A}} \exp\left(Q(s, b)/\tau\right)}$$

where $\tau > 0$ is a temperature parameter.
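Softmax action selection over a Boltzmann distribution can be sketched as follows, assuming a temperature parameter `tau` that controls how strongly the action-values are favoured. The function names and example values are hypothetical.

```python
import math
import random


def boltzmann_probs(q_values, tau):
    """Probabilities proportional to exp(Q / tau) for each action."""
    m = max(q_values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_action(q_values, tau, rng):
    """Sample an action index from the Boltzmann distribution."""
    weights = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=weights)[0]


uniform = boltzmann_probs([1.0, 1.0], tau=1.0)  # equal values, equal probabilities
sharp = boltzmann_probs([1.0, 2.0], tau=0.01)   # low temperature, nearly greedy
action = softmax_action([0.0, 5.0], tau=0.01, rng=random.Random(0))
```

A high temperature makes the choice close to uniform and a low temperature makes it close to greedy; in both cases the randomness is not informed by how often an action has been tried.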

On the other hand, directed exploration strategies use the knowledge about the learning process to form an exploration-specific heuristic for action selection. This heuristic directs the agent to take those actions that maximize the information gain about the environment. The exploration algorithm introduced in this thesis falls into the category of directed exploration algorithms. In order to put it into context, we first present an overview of the existing directed exploration strategies used in the literature.

2.3 The Optimism in the Face of Uncertainty Principle

In the following chapter we present our directed exploration method, which implements the principle of "Optimism in the Face of Uncertainty" (OFU) as a heuristic for exploration. In this section we review existing work on the OFU heuristic. The principle is succinctly captured in Osband and Van Roy (2016):

"When at a state, the agent assigns to each action an optimistically biased while statistically plausible estimate of future value and selects the action with the greatest estimate."

OFU is a heuristic to direct exploratory actions. OFU directs the agent to take actions which have more uncertain value estimates. Instead of greedily taking the action that has the highest estimated value, the agent is encouraged to take actions which have a high probability of being optimal. To see that an apparently suboptimal action may indeed have a high probability of being optimal, let us take an example. Suppose that the agent has taken an action $a_1$ very often from a particular state $s$, and suppose that $a_1$ also currently has the highest value-estimate $\hat{Q}(s, a_1)$ among the available actions. Now consider an alternative action $a_2$ that has only been tried once from the state $s$, and suppose that the reward received was lower than $\hat{Q}(s, a_1)$. Action $a_1$ has higher estimated value, but having tried it many times, the agent’s uncertainty about its value is quite low. In contrast, the uncertainty about the value of the alternative action $a_2$ is very high, since it has been taken so rarely. Thus, while the current estimate $\hat{Q}(s, a_2)$ may be lower than $\hat{Q}(s, a_1)$, there is a good chance that the agent was unlucky when taking $a_2$ the first time, and that the true action-value $Q(s, a_2)$ is much higher than both estimates. Thus it may be that $a_2$ has a higher probability of being the optimal action than does $a_1$, especially if their estimated values are quite close. The OFU heuristic would bias the agent toward taking action $a_2$ instead of the greedy action $a_1$. An agent following this heuristic will behave as if it is optimistic about action $a_2$, or more precisely, about its true action-value $Q(s, a_2)$. This optimism drives the agent to explore regions of the environment about which it is more uncertain.

2.3.1 OFU using Count-Based Exploration Bonuses

Most of the exploration algorithms that enjoy strong theoretical efficiency guarantees implement the OFU heuristic. Many do so by augmenting the estimated value of a state(-action pair) with an exploration bonus that quantifies the uncertainty in that value estimate. An agent which acts greedily with respect to this augmented value function will be biased to take actions with higher associated uncertainty. Most of these algorithms are tabular and count-based in that they compute their exploration bonuses using a table of state(-action) visit-counts. The visit-count serves as an approximate measure of the uncertainty associated with a state(-action), because more novel state(-action) pairs will have lower visit-counts. State(-actions) with lower visit counts are assigned higher exploration bonuses. This drives the agent to behave optimistically and explore less frequently visited regions of the environment, which may yet prove to have higher value than familiar regions. Moreover, even if those regions turn out to yield little reward when explored, the agent will have greatly reduced its uncertainty about those regions. Indeed, the reduction in uncertainty would be much smaller if the agent were to take an action that had already been tried many times. The OFU heuristic is therefore a win-win approach for the agent. OFU algorithms are more efficient than undirected exploration strategies like $\epsilon$-greedy because the agent avoids actions that yield neither large rewards nor large reductions in uncertainty (Osband et al., 2016).

2.3.2 Tabular Count-based Exploration Algorithms

One of the best known OFU methods is the UCB1 bandit algorithm, which selects an action $a$ that maximises an upper confidence bound $\hat{Q}_t(a) + \sqrt{\frac{2 \log t}{N(a)}}$, where $\hat{Q}_t(a)$ is the estimated mean reward and $N(a)$ is the visit-count (Lai and Robbins, 1985). The dependence of the bonus term on the inverse square-root of the visit-count is justified using Chernoff bounds. In the MDP setting, the tabular OFU algorithm most closely resembling our method is Model-Based Interval Estimation with Exploration Bonuses (MBIE-EB) (Strehl and Littman, 2008). (Footnote 2: To the best of our knowledge, the first work to use exploration bonuses in the MDP setting was the Dyna-Q+ algorithm, in which the bonus is a function of the recency of visits to a state, rather than the visit-count (Sutton, 1990).) Empirical estimates $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$ of the transition and reward functions are maintained, and $\hat{\mathcal{R}}(s, a)$ is augmented with a bonus term $\frac{\beta}{\sqrt{N(s, a)}}$, where $N(s, a)$ is the state-action visit-count, and $\beta$ is a theoretically derived constant. The Bellman optimality equation for the augmented action-value function is

$$\tilde{Q}(s, a) = \hat{\mathcal{R}}(s, a) + \frac{\beta}{\sqrt{N(s, a)}} + \gamma \sum_{s'} \hat{\mathcal{P}}(s' \mid s, a) \max_{a'} \tilde{Q}(s', a')$$

Here the dependence of the bonus on the inverse square-root of the visit-count is provably optimal (Kolter and Ng, 2009). This equation can be solved using any MDP solution method.
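Both count-based bonuses can be sketched in a few lines. The sketch assumes the standard UCB1 rule (estimated mean plus $\sqrt{2 \log t / N(a)}$, with untried arms played first) and an MBIE-EB style bonus $\beta / \sqrt{N(s, a)}$; the function names, the value of `beta`, and the example numbers are hypothetical.

```python
import math


def ucb1_action(mean_rewards, counts, t):
    """Pick the arm maximising mean + sqrt(2 * log(t) / count)."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # an untried arm has unbounded uncertainty, so try it first
    return max(range(len(counts)),
               key=lambda a: mean_rewards[a] + math.sqrt(2.0 * math.log(t) / counts[a]))


def exploration_bonus(count, beta):
    """MBIE-EB style bonus beta / sqrt(N(s, a)); assumes count >= 1."""
    return beta / math.sqrt(count)


# Arm 0 looks better on average but has been tried 100 times; arm 1 only once.
choice = ucb1_action(mean_rewards=[0.6, 0.5], counts=[100, 1], t=101)
rare = exploration_bonus(1, beta=0.5)
common = exploration_bonus(100, beta=0.5)
```

With these numbers the rarely tried arm is selected even though its estimated mean is lower, which mirrors the two-action example used to motivate OFU.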

While tabular OFU algorithms perform well in practice on small MDPs (Strehl and Littman, 2004), their sample complexity becomes prohibitive for larger problems (Bellemare et al., 2016). The sample complexity of an algorithm is a bound on the number of timesteps at which the agent is not taking an $\epsilon$-optimal action with high probability (Kakade, 2003). Loosely speaking, it measures the amount of experience the agent must have before one can be confident it is basically performing optimally. MBIE-EB, for example, has a sample complexity bound of $\tilde{O}\left(\frac{|\mathcal{S}|^{2} |\mathcal{A}|}{\epsilon^{3} (1 - \gamma)^{6}}\right)$. In the high-dimensional setting – where the agent cannot hope to visit every state during training – this bound offers no guarantee that the trained agent will perform well. The prohibitive complexity of these tabular OFU algorithms is due in part to the fact that a table of visit-counts is not useful if the state-action space is too large. Since the agent will only visit a small fraction of that space, the visit-count for most states will always be zero. These algorithms are therefore unable to usefully compare the novelty of two unvisited states. All unvisited states have the same visit-count, and hence the same exploration bonus. The optimistic agent will treat them all as equally novel and equally appealing.

2.3.3 Generalized Visit-counts for Exploration in Large MDPs

Tabular OFU algorithms fail on high-dimensional problems because they do not allow for generalization across the state space regarding uncertainty. Every unvisited state is treated as entirely novel, regardless of any similarity between the unvisited states and the visited states in the history. In order to explore efficiently in large domains, the agent must be able to make use of the fact that some unvisited states share many features with visited states, while others share very few. If an unvisited state has almost exactly the same features as a very frequently visited one, then it should not be considered to be as uncertain as a state with unfamiliar features. An effective OFU method for these problems would not just encourage the agent to visit unvisited states, but rather would drive the agent to visit states with novel or uncommon features. We discuss this issue further in Section 3.1.1.

Several very recent extensions of count-based exploration methods have achieved this sort of generalisation regarding uncertainty, and have produced impressive results on high-dimensional RL benchmarks. These algorithms closely resemble MBIE-EB, but they replace the state-action visit-count with a generalised visit-count which quantifies the similarity of a state to previously visited states. Bellemare et al. (2016) construct a Context Tree Switching (CTS) density model over the state space such that higher probability is assigned to states that are more similar to visited states (Veness et al., 2012). A state pseudocount is then derived from this density. A subsequent extension of this work replaces the CTS density model with a neural network (Ostrovski et al., 2017). Another recent proposal uses locality sensitive hashing (LSH) to cluster similar states, and the number of visited states in a cluster serves as a generalised visit-count (Tang et al., 2016). As in the MBIE-EB algorithm, these counts are used to compute exploration bonuses. These three algorithms outperform random strategies, and are currently the leading exploration methods in large discrete domains where exploration is hard.

Before presenting our optimistic count-based exploration method in the following chapter, we now briefly canvass two alternative frameworks for directed exploration, and discuss their limitations.

2.4 Bayes-Adaptive RL

In the Bayesian approach to model-based reinforcement learning, we maintain a posterior distribution over the possible models of the environment given the experience of the agent (Dearden et al., 1998). Bayesian inference is used to update the posterior with new information as the agent interacts with the environment, and also to incorporate the agent’s prior distribution over the transition models.
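For a finite MDP, one common concrete choice is an independent Dirichlet prior over the next-state distribution of each state-action pair, in which case the Bayesian update reduces to incrementing a count. The text does not commit to a particular prior, so the Dirichlet assumption, the class name, and the uniform `prior` parameter below are illustrative only.

```python
class DirichletTransitionPosterior:
    """Posterior over transition models, one Dirichlet per (state, action).

    Assumption for this sketch: a symmetric Dirichlet prior with
    concentration `prior` on every next-state distribution.
    """

    def __init__(self, n_states, n_actions, prior=1.0):
        self.alpha = [
            [[prior] * n_states for _ in range(n_actions)]
            for _ in range(n_states)
        ]

    def update(self, s, a, s_next):
        # Conjugate update: observing (s, a, s_next) adds one pseudo-observation.
        self.alpha[s][a][s_next] += 1.0

    def mean(self, s, a):
        # Posterior mean of the next-state distribution for (s, a).
        total = sum(self.alpha[s][a])
        return [x / total for x in self.alpha[s][a]]


post = DirichletTransitionPosterior(n_states=2, n_actions=1)
post.update(0, 0, 1)
post.update(0, 0, 1)
```

The table of Dirichlet parameters plays the role of the information state: it summarises everything the agent has learned about the dynamics so far.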

Since the posterior is maintained over all possible models we can now talk about the uncertainty pertaining to what is the best action to take. This uncertainty is modelled as a Markov Decision Process defined over a set of hyper-states. A hyper-state acts as an information state which summarizes the information accumulated so far. This augmented MDP, often referred to as the Bayes-Adaptive MDP (BAMDP), can be solved with standard RL algorithms  (Duff, 2002). In this framework an agent acting greedily in the BAMDP whilst updating the posterior acts optimally (according to its prior belief) in the original MDP. The Bayes-optimal policy for the unknown environment is the optimal policy of the BAMDP, thereby providing an elegant solution to the exploration-exploitation trade-off.

Unfortunately, the cardinality of the hyper-states grows exponentially with the planning horizon thereby rendering exact solution to the BAMDP computationally intractable for large problems  (Duff, 2002).

2.5 Intrinsic Motivation

The final directed exploration heuristic that we discuss is born out of the so-called intrinsic motivation framework. There appears to be a growing scientific consensus in developmental psychology that human beings, from infants to adults, develop their understanding of the world using certain cognitive systems such as intuitive theories, social-structures, spatial systems, etc.  (Spelke and Kinzler, 2007; Lake et al., 2016). During curiosity-driven, creative, or risk-taking activities, rational agents use this understanding to generate intrinsic goals. Accomplishing these intrinsic goals leads to the accumulation of intrinsic rewards, thereby exhibiting an innate desire to explore, manipulate, or probe their environment  (Oudeyer, 2007).

Drawing parallels to reinforcement learning, the goal of a traditional RL agent is to maximize its expected cumulative reward. This behaviour is extrinsically motivated since the reward signal is external to an agent. We say that an agent is intrinsically motivated if it has intrinsic goals and rewards. In the context of exploration for RL, the aim of the intrinsic motivation approach is to use intrinsic reward as a heuristic that assigns an exploratory value to the agent’s actions. For example, an agent may receive intrinsic rewards for visiting novel parts of the environment that need further exploration  (Thrun, 1992).

Many formulations that quantify the exploratory value of an action have been put forth, and most of them augment the environment’s reward function so as to motivate directed exploration. Schmidhuber (2010) proposed a measure for intrinsic motivation by taking into account the improvement a learning algorithm effected on its predictive world model. This measure tracks the progress of an agent’s ability to better compress the history of states and actions (Steunebrink et al., 2013). Another framework for intrinsically motivated learning is to maximize the mutual information. An intrinsic reward measure called empowerment is formulated by searching for the maximal mutual information (Mohamed and Rezende, 2015). The notion of maximizing information gain was demonstrated in a humanoid robot by the introduction of artificial curiosity (Schmidhuber, 1991) as an intrinsic goal (Frank et al., 2014).

These formulations have some major drawbacks which hinder their suitability as exploration heuristics. Firstly, they fail to provide any strong theoretical guarantees of efficient exploration. Leike (2016) pointed out that since none of these heuristics take into account the reward structure of the problem, they do not distinguish between regions of high and low expected reward. Secondly, these algorithms require that we maintain the environment dynamics of the underlying MDP, which prevents us from easily integrating them with model-free algorithms. Another major drawback is the computational overhead associated with calculating the heuristic. For problems with large state/action spaces, computing the intrinsic reward becomes intractable for many heuristics  (Bellemare et al., 2016). Most problems of interest have extremely large state spaces, and hence the intrinsic motivation heuristic is currently impractical as an exploration strategy in these domains.

3.1 Drawbacks of Existing Exploration Methods for Large MDPs

We introduced count-based exploration strategies for large MDPs in Section 2.3.3. Even though they are the current state-of-the-art exploration algorithms in these domains, we consider that there are some potential drawbacks to their common approach to estimating novelty. The motivation for our algorithm arises from trying to avoid these drawbacks.

3.1.1 Choosing a Novelty Measure

The aforementioned algorithms compute a generalized visit-count. This generalized count is a novelty measure that quantifies the (dis)similarity of a state to those in the history. These algorithms drive the agent towards regions of the state space with high novelty. However, the effectiveness of these novelty measures depends on the way in which they measure the similarity between states. If this similarity measure is not chosen in a principled way, states may be deemed similar in ways that are not relevant to the given problem. Let us explore this issue by taking an example.

Example 1 (Confounded novelty).

Alice is a foodie. She wants to explore the myriad restaurants that are open in her city. Suppose that Alice’s novelty measure treats restaurants as similar if they are geographically close. Alice consults her novelty measure to choose a restaurant she has not tried yet, and it returns a Chinese restaurant in a distant suburb that she has not visited before. Alice scratches her head thinking: ‘I have been to a tonne of Chinese restaurants; if only my novelty measure understood that and suggested a different cuisine!’ Unfortunately, her novelty measure considers this restaurant very dissimilar from the Chinese restaurants she has visited, simply because it is geographically distant from them.

The problem here is that Alice’s novelty measure does not know anything about which features matter when evaluating the novelty of a restaurant. Let us now look at an example from the recent exploration literature where this problem can be clearly observed.

Inappropriate Novelty Measures in Practice

The problems that can arise from an unprincipled choice of novelty measure are well illustrated in the experimental evaluation of Stadie et al. (2015). Their algorithm uses an autoencoder to encode the state-space into a lower dimensional representation. The encoding is then fed into a model dynamics prediction neural network which estimates the novelty by providing an error-based bonus. This method, called Model Prediction Exploration Bonuses (MP-EB), uses an error-based estimator and is different from the visit-density model of Bellemare et al. (2016), but they both estimate novelty. To generalize regarding value they use the DQN network, and so we refer to their algorithm as DQN+MP-EB.

Figure 3.1: Q*bert Level 1
Figure 3.2: Q*bert Level 2
Figure 3.3: Two levels of the Atari2600 game Q*bert

During empirical evaluation of their algorithm an anomaly was detected in the game Q*bert (in which the goal of the agent is to jump on all the cubes without falling off the edge or being captured) from the Arcade Learning Environment (ALE) benchmarking suite. (ALE is a performance evaluation platform consisting of Atari 2600 games. It is considered the standard performance test bed for RL algorithms. We discuss ALE in depth in Chapter 4.) The DQN+MP-EB algorithm scored lower than the baseline algorithm, DQN+$\epsilon$-greedy. They attributed this anomaly to the fact that during each level change of Q*bert, the color of the game changes dramatically, but neither the objective nor the structure of the level changes (Figure 3.3). When their agent reached level 2 (Figure 3.3), it perceived the state to be completely novel because MP-EB is sensitive to color. This tricked MP-EB into assigning a high exploration bonus to all the states even though the action-values of the states had not changed. Hence the policy of the agent was impacted adversely.

The pathology of DQN+MP-EB in the Q*bert game highlights a serious problem with current novelty estimators: they do not take into account the relevance of a state to the task an agent is trying to accomplish. We argue that a measure of novelty should not just be an arbitrary generalized representation of how many times an agent has visited a state, but should ideally be a measure of dissimilarity in facets that are relevant to the agent’s goal. Two states can be different in many ways; the challenge is to find a similarity metric which is effective in achieving the agent’s goal optimally. In Example 1, Alice’s novelty measure did not know that suggesting a restaurant with a different cuisine would be more relevant to her task, and so it naively suggested a geographically distant unvisited restaurant.

3.1.2 Separate Generalization Methods for Value and Uncertainty

We contend that this deficiency is not peculiar to MP-EB, but rather that it may arise whenever the novelty measure is not designed to be task-relevant. Indeed, all of the aforementioned algorithms which compute a novelty measure share a common structure which leaves them vulnerable to this problem. Each algorithm has two quite unrelated components: a value estimator (an RL algorithm which performs policy evaluation), and a novelty estimator. Each component involves an entirely separate generalization method. The value estimator makes use of a feature representation of the state space in order to generalize about value. The novelty estimator separately utilizes a different, exploration-specific state space representation to measure the similarity between states. For example, the #Exploration algorithm of Tang et al. (2016) uses the DQN algorithm for value estimation. In order to estimate novelty, however, #Exploration maps the state space into a lower-dimensional representation using locality sensitive hashing. The similarity measure induced by the choice of hash codes is unlikely to resemble that which is induced by the features learnt by DQN. The DQN-CTS-EB algorithm of Bellemare et al. (2016) has a similar structure: DQN is used to estimate value, but the CTS density model is used to estimate novelty. Again, it is not obvious that there should be much in common between the two similarity measures induced by these different state space representations. One might think that this is natural; after all, each representation is used for a different purpose. However, there are two questions we can ask here. Firstly, is there redundant computation due to performing a dimensionality reduction of the same state-space twice? If so, can we reuse the same state space representation for both value and novelty estimation? We address these questions in the following section.

Before moving on we should note that the concerns we express in this section have already been raised in the literature. In their empirical evaluation Bellemare et al. (2016) observed that their value estimator (DQN) was learning at a much slower rate than their CTS density model (their novelty measure). The authors attribute this mismatch to the incompatibility between novelty and value estimators. They further go on to suggest that designing density models to be compatible with value function would be beneficial and a promising research direction.

The drawbacks we presented in this section suggest that there may be much room for improvement in the design of novelty estimators for exploration. In the following sections we describe our technique for estimating novelty by factoring in the insights we gained from analyzing these drawbacks. We first provide a solid footing for some of the assumptions that we made while designing the algorithm. We then go on to present our core exploration algorithm, and then combine it with a model-free RL algorithm (SARSA($\lambda$)). In the coming chapters we present empirical evidence that our RL algorithm achieves world-leading results on the ALE benchmarking suite.

3.2 Estimating Novelty in Feature Space

3.2.1 Motivation

Which representation of the state space is appropriate for novelty estimation? Intuitively, if we use some parameters to determine the value of an object, then naturally, two such objects are considered dissimilar only if they differ in these parameters. Analogously, if the agent is using certain features to determine the value of a state, then two such states should be considered dissimilar only if they differ in those value-relevant features. This motivates us to construct a similarity measure that exploits the feature representation that is used for value function approximation. These features are explicitly designed to be relevant for estimating value. If they were not, they would not permit a good approximation to the true value function. This sets our method apart from the approaches described in Section 2.3.3, which measure novelty with respect to a separate, exploration-specific representation of the state space, one that bears no relation to the value function or the reward structure of the MDP. We argue that measuring novelty in feature space is a simpler and more principled approach, and hypothesise that more efficient exploration will result. Our proposal ensures that generalization regarding novelty is done in the same space as generalization regarding value. Figure 3.4 illustrates the basic structure of our proposed novelty estimator.

Figure 3.4: Novelty Measure in Feature Space

Let us make the idea more concrete with our running example.

Example 2 (Value-relevant exploration).

After Alice’s disappointing restaurant visit last time, she tweaked her novelty estimator such that it now generalizes based on value-relevant features like the type of cuisine, the star rating, and the other features that truly determine the quality of Alice’s dining experience. When Alice is ready to try something new, she can rest assured that it’s going to be something novel in a way that is meaningful.

3.2.2 Design Decisions

Our exploration strategy, henceforth known as $\phi$-exploration bonus ($\phi$-EB), can be thought of as exploration in the feature space. This makes the existence of a feature map crucial to our strategy. Therefore we require that our algorithm be compatible with Linear Function Approximation (LFA). Before the advent of neural networks and subsequently DQN, large RL problems used linear function approximation to estimate the value of a state. Our decision to use LFA as our value prediction module has the following desirable benefits:

  • Domain Independence: The visit-density models that we have seen so far (MP-EB, CTS-EB, PixelCNN, etc.) are designed to work with RGB pixel values from a video input. Though there are many domains that use video input to train the agent, there are equally many other domains that have nothing to do with a video input. For example, reinforcement learning is used in the financial sector to optimize portfolios, asset allocations, and trading systems (Moody and Saffell, 2001). Therefore developing a visit-density model that is domain independent is a key challenge. Our $\phi$-EB method estimates the novelty using the same features that LFA uses to approximate the value function. This allows our exploration strategy to be compatible with any value-based RL algorithm that uses LFA.

  • Indirect dependence on LFA: LFA is essentially a linear combination of features. The only requirement $\phi$-EB has is the existence of a feature map, which is implicitly satisfied with LFA. Because of this indirect dependence on LFA, we hypothesise that it is possible to extend $\phi$-EB to be compatible with value-networks that perform representation learning as well (e.g., DQN). Due to resource and time constraints we do not pursue empirical evidence for this claim, but rather leave this as a possible future extension of our research.

  • Single point of change: The best way to assess the performance impact of changes to a system is to confine the change to a single module and then run performance tests. Following this principle, we note that SARSA($\lambda$) is a value-based RL algorithm which uses LFA for value prediction and $\epsilon$-greedy for exploration (Sutton and Barto, 1998). SARSA($\lambda$) has been extensively studied and validated. Therefore showcasing the performance gains achieved by replacing $\epsilon$-greedy with our $\phi$-EB exploration strategy provides sound empirical evidence for the efficacy of our algorithm.

One drawback of using LFA for value prediction is that it requires a set of hand-crafted features. This is easily mitigated by choosing the Arcade Learning Environment (ALE) as our evaluation platform (Bellemare et al., 2013), combined with the Blob-PROST feature set (Liang et al., 2015). Using Blob-PROST as our feature set has an added advantage. Blob-PROST is designed to mimic the features learned by DQN, thus making our algorithm comparable with those using DQN for representation learning and value prediction. We discuss the ALE and Blob-PROST in depth in Chapter 4.

3.3 The $\phi$-EB Algorithm

The main original contribution of this work is a method for estimating novelty in feature space. The challenge is to do so without explicitly computing the distance between each new feature vector and all the feature vectors in the history. That approach quickly becomes infeasible because the cost of computing all these distances grows with the size of the history. Our method instead constructs a density model over feature space that assigns higher probability to states that share more features with more frequently observed states. In order to formally describe our method we first introduce some notation.

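  • $\phi : \mathcal{S} \to \mathcal{T} \subseteq \mathbb{R}^M$, The feature map used in LFA. Maps the state space into an $M$-dimensional feature space, $\mathcal{T}$.

  • $\phi_t$, Feature vector observed at time $t$, whose $i$-th component is denoted by $\phi_{t,i}$.

  • $\phi_{1:t}$, Sequence of feature vectors observed after $t$ timesteps.

  • $\phi_{1:t}\phi$, Sequence where $\phi_{1:t}$ is followed by $\phi$.

  • $\mathcal{T}^*$, Set of all finite sequences of feature vectors.

  • $\rho$, The sequential density model (SDM) that maps a finite sequence of feature vectors in $\mathcal{T}^*$ to a probability distribution over $\mathcal{T}$.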

We will now present the key component of our algorithm that allows us to estimate novelty in feature space.

3.3.1 Feature Visit-Density

Definition 15 (Feature visit-density).

The feature visit-density $\rho_t(\phi)$ at time $t$ is a probability distribution over the feature space $\mathcal{T}$, representing the probability of observing the feature vector $\phi$ after observing the sequence $\phi_{1:t}$. It is modelled as a product of independent factor distributions $\rho_t^i(\phi_i)$ over individual features $\phi_i$:

$$\rho_t(\phi) = \prod_{i=1}^{M} \rho_t^i(\phi_i)$$

This density model induces a similarity measure on the feature space. Loosely speaking, feature vectors that share component features are deemed similar. This enables us to use $\rho_t(\phi)$ as a novelty measure for states, because it represents the frequency with which features are observed in the history. When confronted with a new state, we are able to estimate how frequently its component features have occurred in the history. If $\phi(s)$ has more novel component features, $\rho_t(\phi)$ will be lower. By using a density model we are therefore able to measure novelty in a way that usefully generalizes the agent’s uncertainty across the state space. To illustrate this, let us consider an example.

Example 3.

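Suppose we use a 3-D binary feature map and that after 3 timesteps the history of observed feature vectors is $\phi_{1:3} = (0,1,0), (0,1,0), (0,1,0)$. Let us estimate the feature visit densities of two unobserved feature vectors $\phi' = (1,1,0)$, and $\phi'' = (1,0,1)$. Using the KT estimator for the factor models, we have $\rho_3(\phi') = \rho_3^1(1) \cdot \rho_3^2(1) \cdot \rho_3^3(0) = \frac{1}{8} \cdot \frac{7}{8} \cdot \frac{7}{8} \approx 0.1$, and $\rho_3(\phi'') = \rho_3^1(1) \cdot \rho_3^2(0) \cdot \rho_3^3(1) = \left(\frac{1}{8}\right)^3 \approx 0.002$. Note that $\rho_3(\phi') > \rho_3(\phi'')$ because the component features of $\phi'$ are more similar to those in the history. As desired, our novelty measure generalizes across the state space.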

Each factor distribution is modelled using a count-based estimator. A naive option would be to use the empirical estimator, which is the ratio of the number of times a feature has occurred to the total number of timesteps. Another class of count-based estimators are the Dirichlet estimators, which enjoy strong theoretical guarantees (Hutter, 2013). We use the Krichevsky-Trofimov (KT) estimator, which is a Dirichlet-like estimator that is simple, easy to implement, scalable, and data efficient (Krichevsky and Trofimov, 1981). If $N_t(\phi_i)$ is the number of times the feature $\phi_i$ has been observed, then the KT estimator is given by:

$$\rho_t^i(\phi_i) = \frac{N_t(\phi_i) + \frac{1}{2}}{t + 1}$$
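To make the factorised density concrete, the following sketch computes it with KT factors for a small binary history. It rescans the history on every call, which is only sensible for illustration; the function names and the toy history are assumptions of this example, and exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def kt_factor(count, t):
    """KT estimate (count + 1/2) / (t + 1) for a value seen `count` times in t steps."""
    return (Fraction(count) + Fraction(1, 2)) / (t + 1)


def feature_visit_density(history, phi):
    """Product of independent KT factors, one per feature component of `phi`."""
    t = len(history)
    density = Fraction(1)
    for i, value in enumerate(phi):
        # How often component i took this value in the history.
        count = sum(1 for past in history if past[i] == value)
        density *= kt_factor(count, t)
    return density


# Hypothetical history: the same 3-D binary feature vector observed three times.
history = [(0, 1, 0)] * 3
familiar = feature_visit_density(history, (1, 1, 0))   # shares two components
unfamiliar = feature_visit_density(history, (1, 0, 1))  # shares none
```

Here `familiar` evaluates to 49/512 and `unfamiliar` to 1/512, so the vector that shares more components with the history receives the higher density, even though neither vector has been observed.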

Using independent factor distributions to model the probability of each feature component inherently assumes that the features are independently distributed. This is not always the case: in video-input based domains such as the ALE, many features are strongly correlated. This does not mean that we cannot use fully factorized distributions. One of the early assumptions made by Bellemare et al. (2016) about the density model is that the states are independently distributed. This allowed them to factorize the states, and model each factor using a position-dependent CTS density model (CTS is a Bayesian variable-order Markov model). Moreover, our empirical evaluations show that we achieve world-leading results in hard exploration games, suggesting that independent factored distributions produce good novelty measures. Thus, by precedent and by empirical data, the independence assumption on the features is a well-justified trade-off that makes the computation of novelty fast and data efficient.

3.3.2 The $\phi$-pseudocount

Here we adopt a recently proposed method for computing generalised visit-counts from density models (Bellemare et al., 2016). By analogy with the pseudocounts presented in that work, we derive two $\phi$-pseudocounts from our feature visit-density. Both variants presented generalize the same quantity, the state visitation count function $N_t(s)$. The expression given in the following definition is derived in Bellemare et al. (2016). We emphasize that our approach constitutes a departure from theirs, because while they derive pseudocounts from a state visit-density model, we do so using a feature visit-density model.

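Definition 16 ($\phi$-pseudocount).

Let $\rho_t'(\phi)$ (also called the recoding probability) be the probability that the feature visit-density model would assign to $\phi$ if it were observed one more time. Then the $\phi$-pseudocount for a state $s$ is given by:

$$\hat{N}_t^{\phi}(s) = \frac{\rho_t(\phi(s)) \left(1 - \rho_t'(\phi(s))\right)}{\rho_t'(\phi(s)) - \rho_t(\phi(s))}$$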
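A quick sanity check on the pseudocount expression from Bellemare et al. (2016): if the density model is the plain empirical estimator, so that a vector seen n times in t steps has density n/t before and (n+1)/(t+1) after one more observation, the pseudocount recovers n exactly. The sketch below verifies this with exact fractions; the function name and the guard against a non-increasing density are assumptions of the example.

```python
from fractions import Fraction


def pseudocount(rho, rho_prime):
    """Pseudocount rho * (1 - rho') / (rho' - rho) from density and recoding probability."""
    if rho_prime <= rho:
        # The expression is only meaningful when observing phi raises its density.
        raise ValueError("recoding probability must exceed the density")
    return rho * (1 - rho_prime) / (rho_prime - rho)


# Empirical estimator: seen n = 3 times in t = 10 steps.
n, t = 3, 10
recovered = pseudocount(Fraction(n, t), Fraction(n + 1, t + 1))
```

With the KT factors used in this work the result is a generalised count rather than an exact one, which is what lets unvisited states receive different counts.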

3.3.3 The $\phi$-Exploration Bonus Algorithm ($\phi$-EB)

Equipped with all the tools necessary for the construction of an exploration bonus, we now proceed to define the $\phi$-EB algorithm. We provide a high-level flow chart for the construction of the bonus in Figure 3.5, and the corresponding pseudo-code in Algorithm 1. Having defined the $\phi$-pseudocount (a generalised visit-count), we follow traditional count-based exploration algorithms by computing an exploration bonus that depends on this count. The functional form of the bonus is the same as in MBIE-EB; we merely replace the empirical state-visit count with our $\phi$-pseudocount.

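Definition 17 ($\phi$-exploration bonus).

The exploration bonus for a state-action pair $(s,a)$ at time $t$ is

$$\mathcal{R}_t^{\phi}(s,a) = \frac{\beta}{\sqrt{\hat{N}_t^{\phi}(s)}}$$

where $\beta$ is a hyper-parameter that controls the agent's level of optimism.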

Loosely speaking, the hyper-parameter $\beta$ can be viewed as a knob that tunes the agent’s confidence in its estimate of the true action-value function. Higher values of $\beta$ make the agent under-confident about value, and result in too much exploration. Very low values do not encourage enough exploration because the exploration bonus is too small to dissuade the agent from acting greedily with respect to its current value estimates. In both scenarios the final policy of the agent is affected adversely. The goal is to find a value that gives good results across domains. We performed a coarse parameter sweep among the games in the ALE evaluation platform to select the best-performing value of $\beta$. Further details regarding the selection of this value are discussed in Section 5.2.1.

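Figure 3.5: Flow Chart for computing the exploration bonus of $\phi$-EB
Input: Density Model $\rho_t$
function FeatureVisitDensity($\phi$)
    return $\prod_{i=1}^{M} \rho_t^i(\phi_i)$
end function
Input: Feature Visit Count $N_t$; Density Model $\rho_t$; Current Timestep $t$
function UpdateFeatureVisitDensity($\phi$)
    for $i = 1$ to $M$ do
        $N_{t+1}(\phi_i) \leftarrow N_t(\phi_i) + 1$
        $\rho_{t+1}^i(\phi_i) \leftarrow \left(N_{t+1}(\phi_i) + \frac{1}{2}\right) / (t + 2)$
    end for
end function
function PseudoCount($\rho$, $\rho'$)
    return $\rho (1 - \rho') / (\rho' - \rho)$
end function
Input: LFA Feature Map $\phi$; Exploration Coefficient $\beta$
function ExplorationBonus($s$)
    $\rho \leftarrow$ FeatureVisitDensity($\phi(s)$)
    UpdateFeatureVisitDensity($\phi(s)$)
    $\rho' \leftarrow$ FeatureVisitDensity($\phi(s)$)
    $\hat{N} \leftarrow$ PseudoCount($\rho$, $\rho'$)
    return $\beta / \sqrt{\hat{N}}$
end function
Algorithm 1 $\phi$-exploration bonus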
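The bonus computation can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' implementation (which is written in C++): the class name, the dictionary of counts keyed by (feature index, feature value), and the use of log-densities to avoid numerical underflow when the feature vector is long are all assumptions made here.

```python
import math


class PhiExplorationBonus:
    """Count-based exploration bonus over discrete feature vectors (sketch)."""

    def __init__(self, beta):
        self.beta = beta   # exploration coefficient, supplied by the caller
        self.t = 0         # number of feature vectors observed so far
        self.counts = {}   # (feature index, feature value) -> times observed

    def _log_density(self, phi):
        # Log of the product of KT factors (N + 1/2) / (t + 1), one per feature.
        return sum(
            math.log((self.counts.get((i, v), 0) + 0.5) / (self.t + 1))
            for i, v in enumerate(phi)
        )

    def _update(self, phi):
        for i, v in enumerate(phi):
            self.counts[(i, v)] = self.counts.get((i, v), 0) + 1
        self.t += 1

    def pseudocount(self, phi):
        log_rho = self._log_density(phi)        # density before the update
        self._update(phi)
        log_rho_prime = self._log_density(phi)  # recoding probability
        # rho (1 - rho') / (rho' - rho), rewritten as (1 - rho') / (rho'/rho - 1).
        return (1.0 - math.exp(log_rho_prime)) / math.expm1(log_rho_prime - log_rho)

    def bonus(self, phi):
        return self.beta / math.sqrt(self.pseudocount(phi))


eb = PhiExplorationBonus(beta=1.0)       # beta = 1.0 is an arbitrary demo value
first_count = eb.pseudocount((0, 1, 0))
repeat_bonus = eb.bonus((0, 1, 0))       # same vector again: now familiar
novel_bonus = eb.bonus((1, 0, 1))        # shares no component with the history
```

In this run the vector with entirely unfamiliar components earns a larger bonus than the repeated one, which is the behaviour the bonus is designed to produce.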

3.3.4 LFA with $\phi$-EB

One of the advantages of developing our algorithm for use with LFA is that our exploration strategy is compatible with all value-based RL algorithms that use LFA. As we will see, our empirical performance across a range of environments suggests that one can plug our exploration strategy with little to no modification into any of these algorithms and expect considerable gains in exploration efficiency. In our empirical evaluation we use SARSA($\lambda$) with replacing traces as our value-based reinforcement learning algorithm. Algorithm 2 presents the pseudo-code for a generic RL algorithm that uses the augmented reward for updating the parameters $\theta$ of the approximate action-value function $\hat{Q}_\theta$.

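Input: LFA Feature Map $\phi$; Training Horizon $t_{\text{end}}$
Initialize $\theta$ arbitrarily; choose an arbitrary initial state $s$ and action $a$; $t \leftarrow 0$
while $t < t_{\text{end}}$ do
    $(r, s') \leftarrow$ Act($a$)
    $r^+ \leftarrow r +$ ExplorationBonus($s$)
    $a' \leftarrow$ NextAction($s'$, $\theta$)
    $\theta \leftarrow$ UpdateTheta($\theta$, $\phi$, $s$, $a$, $r^+$, $s'$, $a'$)
    $s \leftarrow s'$; $a \leftarrow a'$; $t \leftarrow t + 1$
end while
return $\theta$
Footnote: The functions NextAction and UpdateTheta are specific to the underlying value-based RL algorithm used, hence left unspecified. Act performs action $a$ in the environment.
Algorithm 2 LFA with $\phi$-EB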
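The control flow of such a generic loop can be sketched as a function that takes the algorithm-specific pieces as callables. Everything about the signatures below is an assumption for illustration, including the choice to compute the bonus for the state in which the action was taken; the source leaves NextAction and UpdateTheta unspecified.

```python
def run_lfa_with_bonus(s0, a0, theta, horizon,
                       act, exploration_bonus, next_action, update_theta):
    """Generic value-based RL loop trained on the bonus-augmented reward.

    act(s, a)                 -> (reward, next_state)      [environment step]
    exploration_bonus(s)      -> float                     [e.g. a count-based bonus]
    next_action(s, theta)     -> action                    [policy of the RL algorithm]
    update_theta(theta, s, a, r_plus, s_next, a_next) -> new theta
    """
    s, a = s0, a0
    for _ in range(horizon):
        r, s_next = act(s, a)
        r_plus = r + exploration_bonus(s)   # augmented reward
        a_next = next_action(s_next, theta)
        theta = update_theta(theta, s, a, r_plus, s_next, a_next)
        s, a = s_next, a_next
    return theta


# Stub components that only exercise the control flow.
theta = run_lfa_with_bonus(
    s0=0, a0=0, theta=0.0, horizon=4,
    act=lambda s, a: (1.0, s + 1),
    exploration_bonus=lambda s: 0.5,
    next_action=lambda s, theta: 0,
    update_theta=lambda theta, s, a, r_plus, s_next, a_next: theta + r_plus,
)
```

Because the learner only ever sees the augmented reward, the exploration strategy can be swapped without touching the value-update code, which matches the single-point-of-change argument made in Section 3.2.2.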

3.3.5 Complexity Analysis

Time Complexity

From Algorithm 1 it is easy to see that a call to ExplorationBonus has a worst-case time complexity of $O(M)$, where $M$ is the dimension of the feature space. This suggests that the time needed to compute the novelty of a state is independent of the dimension of the state space. Also, more often than not, the dimension of the feature space is far smaller than that of the state space. Therefore, our algorithm generates significant savings in computation over other density models whose time complexity scales with the number of states. In practice, for a binary feature set like Blob-PROST we process only those features that have been observed before. This is achieved by maintaining a single prototypical factor density estimator for all previously unseen features. We discuss the implementation-specific details in depth in Chapter 4.
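One plausible reading of the prototypical estimator, sketched below under stated assumptions, is that every feature that has never been active shares the same KT factor, so all such features can be handled in a single term. The function name, the representation of a binary feature vector as the set of its active indices, and the `seen_counts` dictionary are hypothetical; the actual implementation may differ.

```python
import math


def sparse_log_density(active, seen_counts, t, num_features):
    """Log feature visit-density of a binary vector given by its active indices.

    active:       set of indices whose feature value is 1 in the query vector
    seen_counts:  index -> number of past timesteps at which that feature was 1
                  (only features that have been active at least once appear)
    t:            number of timesteps observed so far
    num_features: total dimension M of the binary feature space
    """
    denom = t + 1.0
    # Prototype for never-seen features: value 0 observed t times, value 1 never.
    proto_off = math.log((t + 0.5) / denom)
    proto_on = math.log(0.5 / denom)

    n_unseen = num_features - len(seen_counts)
    n_unseen_active = sum(1 for i in active if i not in seen_counts)
    log_rho = (n_unseen - n_unseen_active) * proto_off + n_unseen_active * proto_on

    # Only previously seen features need individual treatment.
    for i, n_on in seen_counts.items():
        n = n_on if i in active else t - n_on
        log_rho += math.log((n + 0.5) / denom)
    return log_rho


# Feature 1 was active at all 3 timesteps; features 0 and 2 were never active.
log_a = sparse_log_density({0, 1}, {1: 3}, t=3, num_features=3)
log_b = sparse_log_density({0, 2}, {1: 3}, t=3, num_features=3)
```

The loop runs over seen features and the membership test over active ones, so the cost no longer depends directly on the full dimension M when most features have never been observed.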

Space Complexity

Looking at Algorithm 1, we analyze which objects must persist across iterations in order to calculate the exploration bonus. Clearly the factor density estimators $\rho_t^i$ and the feature visit counts $N_t$ are needed to evaluate and update the feature visit-density. Therefore our algorithm has a worst-case space complexity of $O(M)$. Again, because the features in Blob-PROST are binary valued, the KT estimator can be defined recursively. This allows the factor densities to be updated online without maintaining the feature visit counts $N_t$. We discuss this further in Chapter 4.
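For a binary feature, the recursion follows directly from the KT formula: if p is the current estimate of the probability that the feature is 1 after t steps, then (t + 1) p equals the count plus one half, so the next estimate needs only p, t, and the new observation. The helper below is a minimal sketch of that identity; its name and signature are assumptions, not the authors' code.

```python
def kt_recursive_update(p_on, t, x):
    """KT estimate of P(feature = 1) after observing x in {0, 1} at step t + 1.

    Uses (N + x + 1/2) / (t + 2) with N + 1/2 = p_on * (t + 1),
    so no explicit visit count has to be stored.
    """
    return (p_on * (t + 1) + x) / (t + 2)


# Start from the KT prior 1/2 at t = 0 and observe the sequence 1, 1, 0.
p = 0.5
for step, x in enumerate((1, 1, 0)):
    p = kt_recursive_update(p, step, x)
```

After these three observations `p` equals 0.625, the same value as the direct formula (2 + 1/2) / (3 + 1).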

3.4 Summary

In this chapter we have presented the main contribution of our research. Motivated by the drawbacks of current state-of-the-art exploration algorithms, we introduced our novel exploration algorithm called $\phi$-EB. We then provided an exposition of the various components of the algorithm and analysed its time and space complexity.

Now that we have presented our algorithm, we move on to implementation aspects. The next chapter provides a detailed overview of the evaluation test-bed, the software architecture, and the implementation challenges faced during the Research & Development of the algorithm.

4.1 Software Architecture

Our implementation goal is to develop an RL software agent that uses $\phi$-EB as its exploration strategy. We present the high-level design of the algorithm in Figure 4.1. The presented diagram is analogous to the Agent-Environment interaction cycle (Figure 1.1), but with more granularity. From an exploration-centric standpoint, we first provide a concise overview of the components presented in the architecture, and then an exposition on the implementation details for $\phi$-EB.

Figure 4.1: Agent-Environment interaction framework for SARSA+$\phi$-EB. (Boxes with a tiny row and column, on the top and left edges respectively, denote objects stored in RAM. They can persist across cycles and episodes. Dotted arrows with an instruction on them denote operations on such objects.)

4.1.1 Modular Overview


We use SARSA with replacing traces (Sutton and Barto, 1998) as our learning algorithm. This decision was driven primarily by two factors. First, using the Blob-PROST feature set (discussed in Section 5.1.2) meant that we are locked into the framework provided by Liang et al. (2015). In our case this is in fact desirable. Replacing the exploration module of an open source, peer-reviewed and published implementation with our own exploration module enhances the credibility of any performance gains that result. Second, we need a learning algorithm that works well with Linear Function Approximation (LFA). When coupled with LFA, SARSA has better convergence guarantees than Q-learning (Melo et al., 2008). Hence, SARSA is a suitable value estimation algorithm for our agent.

Exploration Strategy

This component is our ϕ-EB exploration strategy that was proposed in Chapter 3. We implement it in C++11 (a major revision of C++, chosen because it makes several useful additions to the core language libraries). C++ offers a significant edge over other languages in terms of efficiency and control over memory management. Due to the high-dimensional nature of our problem, we need to extract as much performance as possible from our code, so implementing the exploration strategy in C++ is critical to the empirical success of our algorithm. Moreover, the lock-in with the framework provided by Liang et al. (2015) meant that there was no compelling reason to choose a different programming language. In the coming sections we provide a detailed look at the design and implementation of ϕ-EB.


We use LFA to generalize the action-value function for unknown state-action pairs. For further discussion on LFA we refer the reader to Section 2.1.2. LFA uses the Blob-PROST feature set from Liang et al. (2015) to approximate the action-value function.

Feature Map

We consider the feature map to be an integral part of the agent. The ability to discern different features of a state is imperative to generalization regarding value, and by extension, to exploration. Our agent explores in the feature space, and we want to use LFA for value prediction. This necessitated an efficient and effective feature set. The Blob-PROST feature set from Liang et al. (2015) is the best feature set available to date for the Arcade Learning Environment (ALE) evaluation platform. See Section 5.1.2 for more details on Blob-PROST.


We chose the Arcade Learning Environment as our evaluation platform for the following reasons.

  • ALE contains many games which vary in degree of exploration hardness. This allows us to test the efficacy of our algorithm on a broad spectrum of games (Bellemare et al., 2016).

  • ALE is widely accepted as the standard for testing RL algorithms. The vast majority of exploration-specific research published since Bellemare et al. (2013) has adopted the ALE platform to report empirical results (Mnih et al., 2013, 2015; Stadie et al., 2015; Osband et al., 2016; Bellemare et al., 2016; Tang et al., 2016; Ostrovski et al., 2017). Therefore, in order to compare and contrast our results with existing research, it is crucial that we choose ALE as our evaluation platform.

See Section 5.1.1 for more details on the Arcade Learning Environment.

4.1.2 Agent-Environment Work-flow

We want to seamlessly integrate ϕ-EB into the control module, so understanding what happens in an agent's cycle from an implementation perspective is critical. Figure 4.1 also doubles as a work-flow diagram for our agent, SARSA+ϕ-EB. Following the usual agent-environment interaction process, at timestep $t$ the agent performs an action $a_t$ on the environment and receives an extrinsic reward $r_{t+1}$. The agent also observes the new state of the environment, $s_{t+1}$. Inside the agent, the Blob-PROST feature map consumes the current state and returns a feature vector $\phi$. The feature vector is then used by the LFA module to do value prediction. The ϕ-EB module uses the feature vector to generate the exploration bonus $\mathcal{R}^{\phi}$ (here we can see that generalization regarding value and novelty are being done in the same space). SARSA updates the parameters of LFA with the TD update and chooses the next action optimistically.


In the exploration strategy module, the feature visit-density $\rho$ is a hash map that persists in memory across cycles and episodes. Each entry of $\rho$ is a key-value pair mapping an individual feature $\phi_i$ to its corresponding factor distribution $\rho^i$. In the ϕ-EB module shown in Figure 4.1, the flow of control is as follows: compute $\rho_t(\phi)$ as the product of factors, update $\rho$ with the observation $\phi$, then compute the density of $\phi$ again to obtain $\rho'_t(\phi)$. Next, calculate the pseudo-count $\hat{N}^{\phi}$ and subsequently the exploration bonus $\mathcal{R}^{\phi}$. The bonus is treated as an intrinsic reward and is sent to the control module, SARSA.


The prediction module (LFA) approximates the next-state action-value function as a linear function of the feature vector, using the parameter vector $\theta$. LFA sends the next-state $Q$-value to the control module, SARSA. The parameter vector $\theta$ is also an object that is saved in memory and persists across cycles and episodes.

SARSA (for brevity we have left out the discussion on eligibility traces)

All the results from the various modules flow into the control module SARSA. The control module essentially has two tasks.

  • Choose the next action
    The next action is chosen by being greedy with respect to the next-state action-values obtained from LFA.

  • Update of LFA
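    First we augment the extrinsic reward $r$ with the intrinsic reward $\mathcal{R}^{\phi}$ obtained from the ϕ-EB module:

    $$r^{+} = r + \mathcal{R}^{\phi}$$

    The augmented reward $r^{+}$, the next state action-value $Q(s_{t+1}, a_{t+1})$, and the current state action-value $Q(s_t, a_t)$, both from LFA, are used to calculate the TD error:

    $$\delta = r^{+} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

    where $\gamma$ is the discount factor. Next, $\theta$ is updated using the usual TD update formula:

    $$\theta \leftarrow \theta + \alpha \, \delta \, \nabla_{\theta} Q(s_t, a_t)$$

    where $\alpha$ is the learning rate. With LFA, the gradient $\nabla_{\theta} Q(s_t, a_t)$ is simply the feature vector.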

Now that we have a clear idea of the surrounding infrastructure, let's move on to the implementation details of ϕ-EB.

4.2 Implementation Details

4.2.1 Feature Visit-Density

The central data structure that stores the factor distribution of each individual feature is an unordered_map (essentially a hash map) called fvd_map. Each entry is a key-value pair mapping an individual feature to its corresponding factor distribution. This gives us constant-time look-up of the factor distribution of any feature. At first glance, given the theoretical formulation of feature visit-density (Section 3.3.1 and its accompanying algorithm), the implementation looks straightforward. Unfortunately that is not the case. We need to take into account certain implementation-specific aspects that are often subsumed by the mathematical formulation. The following are some of the important implementation details that need to be considered when computing the feature visit-density.

  • Sparse Feature Vector
    In practice the feature vector $\phi$ is the list of features that are active in the current timestep. Most of the time the set of observed features lies in a vastly smaller subspace of the feature space. Therefore, iterating up to $M$, the dimension of the feature space, to compute the product of the factor distributions is quite wasteful. To overcome this we maintain a prototype function (in this context, a function that creates an object of a specified type; here, a KT estimator which has seen $t$ zeros) that computes the KT-estimate of observing a feature given that it has never been observed in $t$ timesteps. Whenever a new feature is observed, it is added to fvd_map with the current value of the prototype. If $n_t$ is the total number of features observed up to timestep $t$, then we can compute the feature visit-density in $O(n_t)$ time.

  • Numerical Stability
    Experience has taught us that when dealing with probabilities, innocent-looking formulas such as ours can be deceiving. Since we are taking a product of many probabilities, the result is bound to underflow numerically. In our implementation, rather than computing $\rho(\phi) = \prod_i \rho^i(\phi_i)$ we compute $\log \rho(\phi) = \sum_i \log \rho^i(\phi_i)$. This allows us to perform probability calculations safely without worrying about underflow.

  • Inactive Features
    During the evaluation of the feature visit-density we need to consider the factor distributions of the features that are inactive but previously observed. Since we have already observed $\phi$, we can identify the features in fvd_map that are not active. The probability stored in fvd_map against some feature $\phi_i$ is the probability of $\phi_i$ being active. Assuming binary-valued features, $\phi_i \in \{0, 1\}$, the probability of $\phi_i$ not being active is given by $1 - $ fvd_map[$\phi_i$]. Therefore, when evaluating the feature visit-density for $\phi$ we should also factor in the probability of the inactive features not occurring.
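To make the numerical stability point concrete, the sketch below contrasts a naive product of factor probabilities with the equivalent sum of logarithms. The function names are hypothetical and the factor values are invented for illustration only.

```cpp
#include <cmath>
#include <vector>

// Naive product of factor probabilities. With thousands of factors well
// below 1, the running product drops below the smallest representable
// double and collapses to zero.
double productOfFactors(const std::vector<double>& factors) {
    double p = 1.0;
    for (double f : factors) p *= f;
    return p;
}

// The same quantity in log space: a sum of logs stays comfortably within
// the range of a double.
double logProductOfFactors(const std::vector<double>& factors) {
    double logP = 0.0;
    for (double f : factors) logP += std::log(f);
    return logP;
}
```

For example, with 2000 factors of 0.5 each, the product underflows to zero while the log-space sum is simply 2000 times log(0.5).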

Algorithm 3 presents the implementation for computing the feature visit-density with all of the above-mentioned optimizations and requirements. One key observation is that we return the log-probability. This facilitates the further log-based probability computations that occur in other modules.

Input: Current Timestep t
function KTPrototype()
    return (1/2) / (t + 1)
end function

Input: Factor Distribution Map fvd_map
function LogFeatureVisitDensity(ϕ)
    logρ ← 0
    for i = 1 to size(ϕ) do
        if ϕ_i ∉ fvd_map.keys then   // look-up
            fvd_map[ϕ_i] = KTPrototype()
        end if
        logρ ← logρ + log(fvd_map[ϕ_i])
    end for
    /* Inactive features */
    for j = 1 to size(fvd_map.keys) do
        if fvd_map.keys[j] ∉ ϕ then   // look-up with flag trick
            logρ ← logρ + log(1 − fvd_map[fvd_map.keys[j]])
        end if
    end for
    return logρ
end function
Algorithm 3 Implementation of Feature Visit Density
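The following C++ sketch shows one way the computation in Algorithm 3 could look with a std::unordered_map. The integer feature key, the function names, and the use of a std::unordered_set to detect inactive features are assumptions for illustration; in particular, the set stands in for the "flag trick" mentioned in the pseudocode, whose details are not given here.

```cpp
#include <cmath>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using FeatureId = long;  // hypothetical key type for a feature
using FvdMap = std::unordered_map<FeatureId, double>;

// KT estimate for a feature that has been inactive for all t timesteps.
double ktPrototype(long t) { return 0.5 / (static_cast<double>(t) + 1.0); }

// Log feature visit-density of the active-feature list at timestep t.
// Features seen for the first time are inserted with the prototype value.
double logFeatureVisitDensity(FvdMap& fvdMap,
                              const std::vector<FeatureId>& active, long t) {
    std::unordered_set<FeatureId> activeSet(active.begin(), active.end());
    for (FeatureId f : activeSet) {
        if (fvdMap.find(f) == fvdMap.end()) fvdMap[f] = ktPrototype(t);
    }
    double logRho = 0.0;
    for (const auto& entry : fvdMap) {
        const bool isActive = activeSet.count(entry.first) > 0;
        // Stored value is P(active); inactive features contribute 1 - P.
        logRho += std::log(isActive ? entry.second : 1.0 - entry.second);
    }
    return logRho;
}
```

The cost is linear in the number of features observed so far, rather than in the dimension of the full feature space.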

4.2.2 Updating Factor Densities

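Recall that we use the Krichevsky-Trofimov (KT) estimator to compute the factor densities. Given a sequence of symbols, the KT-estimator computes the probability of the next symbol. For a binary symbol-set, the KT-estimator is given by

$$P(\text{next symbol} = 1) = \frac{a + \frac{1}{2}}{a + b + 1}$$

where $a$ is the number of 1's seen so far in the sequence, and $b$ is the number of 0's seen so far.

The Blob-PROST feature set is binary valued, making the use of KT-estimators ideal. Therefore, our factor density for a feature being active is given by

$$\rho^i_t(\phi_i = 1 \mid \phi_i^{1:t}) = \frac{N_t(\phi_i) + \frac{1}{2}}{t + 1}$$

and the probability of the feature being inactive is

$$\rho^i_t(\phi_i = 0 \mid \phi_i^{1:t}) = 1 - \rho^i_t(\phi_i = 1 \mid \phi_i^{1:t}) = \frac{t - N_t(\phi_i) + \frac{1}{2}}{t + 1}$$

where $N_t(\phi_i)$ is the number of times feature $\phi_i$ has been seen active, and $\phi_i^{1:t}$ is the complete sequence of past observations for feature $\phi_i$.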

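The factor density equation is neat and simple, but it requires that we maintain a count for each feature. This is an unnecessary overhead and we can do better. We now propose an update formula for the factor density and derive it. Throughout, $\rho^i_t$ is shorthand for the stored probability that feature $\phi_i$ is active.

Proposition 1 (Update formula for the KT-estimate $\rho^i_t$).

The factor distribution at timestep $t+1$ for feature $\phi_i$ can be updated using the following update formula, where $\phi_i \in \{0, 1\}$ is the value of the feature observed at timestep $t+1$.

$$\rho^i_{t+1} = \frac{\rho^i_t \, (t + 1) + \phi_i}{t + 2}$$

From the equation for the KT-estimate $\rho^i_t$ we have,

$$\rho^i_t = \frac{N_t(\phi_i) + \frac{1}{2}}{t + 1} \quad \Longrightarrow \quad N_t(\phi_i) + \tfrac{1}{2} = \rho^i_t \, (t + 1)$$

where $N_t(\phi_i)$ is the number of times feature $\phi_i$ has been active in the first $t$ timesteps. In the next timestep $t+1$, depending on the value of $\phi_i$ we have two cases.

  • Case 1: Feature $\phi_i$ is active, i.e., $\phi_i = 1$
    The KT-estimate can be written as,

    $$\rho^i_{t+1} = \frac{N_t(\phi_i) + 1 + \frac{1}{2}}{t + 2}$$

    (Since $N_{t+1}(\phi_i) = N_t(\phi_i) + 1$)
  • Case 2: Feature $\phi_i$ is inactive, i.e., $\phi_i = 0$
    The KT-estimate can be written as,

    $$\rho^i_{t+1} = \frac{N_t(\phi_i) + \frac{1}{2}}{t + 2}$$

    (Since $N_{t+1}(\phi_i) = N_t(\phi_i)$)

In both cases, the value of $\phi_i$ decides the existence of an additional term in the numerator. Therefore, by observation, we can combine the two cases as follows.

$$\rho^i_{t+1} = \frac{N_t(\phi_i) + \frac{1}{2} + \phi_i}{t + 2}$$

Therefore, substituting $N_t(\phi_i) + \tfrac{1}{2} = \rho^i_t \, (t + 1)$, we get,

$$\rho^i_{t+1} = \frac{\rho^i_t \, (t + 1) + \phi_i}{t + 2}$$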
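As a quick sanity check of the update formula, consider a small worked example with invented numbers: a feature that has been active once in the first three timesteps, so $t = 3$ and $N_3(\phi_i) = 1$. The count-free update should agree with the count-based KT-estimate in both cases.

```latex
\begin{align*}
\rho^i_3 &= \frac{N_3(\phi_i) + \tfrac{1}{2}}{3 + 1} = \frac{1.5}{4} = 0.375 \\[4pt]
\text{Active at } t+1 \ (\phi_i = 1):\quad
\rho^i_4 &= \frac{0.375 \cdot 4 + 1}{5} = \frac{2.5}{5} = 0.5,
&\text{directly: } \frac{2 + \tfrac{1}{2}}{5} &= 0.5 \\[4pt]
\text{Inactive at } t+1 \ (\phi_i = 0):\quad
\rho^i_4 &= \frac{0.375 \cdot 4 + 0}{5} = \frac{1.5}{5} = 0.3,
&\text{directly: } \frac{1 + \tfrac{1}{2}}{5} &= 0.3
\end{align*}
```

In both cases the stored probability and the timestep are enough to recover the new estimate, so no per-feature count needs to be kept.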

Algorithm 4 presents the algorithm for updating the factor distributions. It uses the update formula presented in Proposition 1 to efficiently update the factor distributions. In the implementation, the update is performed in two parts with linear time complexity, rather than with a naive double-loop search.

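Input: Factor Distribution Map fvd_map; Current Timestep t
function Update(ϕ)
    for j = 1 to size(fvd_map.keys) do
        fvd_map[fvd_map.keys[j]] ← fvd_map[fvd_map.keys[j]] · (t + 1) / (t + 2)
    end for
    for i = 1 to size(ϕ) do
        fvd_map[ϕ_i] ← fvd_map[ϕ_i] + 1 / (t + 2)
    end for
end function
Algorithm 4 Factor Distribution Update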

4.2.3 Exploration Bonus

This is the entry point for our exploration strategy ϕ-EB. Due to the modular design of our algorithm, this function mostly acts as a hub that calls other functions sequentially to get the data required to calculate the exploration bonus. Algorithm 5 presents the implementation for calculating the exploration bonus. Note that the probabilities are kept in log space to avoid numerical stability issues.

Input: Exploration Coefficient β
function ExplorationBonus(ϕ)
    logρ ← LogFeatureVisitDensity(ϕ)
    Update(ϕ)
    logρ′ ← LogFeatureVisitDensity(ϕ)
    N ← (1 − exp(logρ′)) / (exp(logρ′ − logρ) − 1)   // Pseudo-count
    return β / √N   // Exploration Bonus
end function
Algorithm 5 Exploration Bonus
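The last two steps of Algorithm 5 can be sketched in C++ as below. The sketch assumes the pseudo-count definition of Bellemare et al. (2016), N = rho (1 - rho') / (rho' - rho), rewritten so that only log densities are needed, and a bonus of the form beta / sqrt(N). The function names are hypothetical, and the code assumes rho' > rho so that the pseudo-count is positive.

```cpp
#include <cmath>

// Pseudo-count from the log density before (logRho) and after (logRhoPrime)
// the update: N = rho (1 - rho') / (rho' - rho) = (1 - rho') / (rho'/rho - 1).
// Assumes logRhoPrime > logRho.
double pseudoCount(double logRho, double logRhoPrime) {
    // expm1(x) computes exp(x) - 1 accurately when x is close to zero.
    const double ratioMinusOne = std::expm1(logRhoPrime - logRho);
    return (1.0 - std::exp(logRhoPrime)) / ratioMinusOne;
}

// Exploration bonus beta / sqrt(N) for exploration coefficient beta.
double explorationBonus(double logRho, double logRhoPrime, double beta) {
    return beta / std::sqrt(pseudoCount(logRho, logRhoPrime));
}
```

For instance, a density of 0.25 before the update and 0.4 after it gives a pseudo-count of 1, so the bonus equals the exploration coefficient itself.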

4.2.4 Action Selection

In the early stages of the project, our agent was facing some inexplicable issues. It had really slow learning progress, and was getting stuck with a single action for long periods of time. Fortunately, we had rich logs that helped us in identifying a pattern to the problem.

We observed that during the initial training cycles, the value predictions from LFA had very high variance due to a lack of samples. When an abnormally high Q-value occurred, our greedy optimistic agent kept taking the same action over and over again in a loop. We initially thought that the decay of the corresponding exploration bonus, coupled with the increase in optimistic estimates for other states, would lead the agent to break out of the loop. Although the agent eventually did get out of the loop, this happened only after an exorbitantly large number of episodes. From the logs we observed that each TD update effected only a small change, which is why it took so many episodes to overcome an abnormally high Q-value.

If we were using ε-greedy as the exploration strategy this would not be a problem. With ε-greedy, the agent takes more exploratory actions in the initial training cycles. Even if LFA produces highly varying Q-values initially, the agent doesn't get stuck for more than a few cycles. Thus, random exploration at the beginning helps stabilize the action-values predicted by LFA.

Our goal is to replace ε-greedy with our intrinsically motivated exploration strategy ϕ-EB. Unfortunately, removing ε-greedy means that the agent's policy is now deterministic and suffers from the debilitating side-effect described above. To solve this issue we experimented with two approaches.

  • Combine ϕ-EB with ε-greedy
    A similar problem was reported by Bellemare et al. (2016). Their solution was to use ε-greedy, not as an exploration strategy, but as a tool to introduce stochasticity into the agent's policy. During the initial training cycles, when the LFA estimates have high variance, taking a purely random action allows the agent to get out of the greedy action loop. In this experiment ε-greedy is implemented in the usual way: with probability ε take a random action, and a greedy optimistic action otherwise. Algorithm 6 presents the implementation.

  • Combine ϕ-EB with Boltzmann distributed action selection
    One motivation for our research is to make sure that the agent does not take purely random actions. The approach from Bellemare et al. (2016) described above introduces purely random actions. We present an alternate approach which introduces stochasticity but in a directed manner.

    We split our optimistic Q function into two functions, $Q$ and $Q^{\phi}$. $Q$ is trained using the extrinsic reward, whereas $Q^{\phi}$ is trained on the exploration bonus from ϕ-EB (when using LFA, training is done on the LFA parameters, so we essentially maintain two sets of parameters, $\theta$ and $\theta^{\phi}$). The motivation here is that we now have a value function that directs the exploratory actions of the agent. For action selection we construct the optimistic value function as the summed value function $Q^{+} = Q + Q^{\phi}$. During action selection, with probability $1 - \epsilon$ the agent takes the action that is greedy with respect to $Q^{+}$; otherwise the agent takes a Boltzmann distributed random action. The Boltzmann distribution is constructed from the $Q^{\phi}$ values using discrete_distribution from the C++ standard library. Hence the selected random action is more likely to be an action that has higher exploratory value. Algorithm 7 presents the implementation for this approach.

    Theoretically, the only difference between the above two approaches is the action selection process during exploration. The first approach takes a uniformly random action, whereas the second one takes a Boltzmann-distributed random action. Therefore during the implementation of the learning algorithm we implement the second approach, and swap the action selection process with the first for experimentation.

    function NextAction(Q⁺)
        /* Q⁺ contains Q-values for some state. */
        /* rand(0,1) generates random number between 0,1 */
        if rand(0,1) < ε then
            /* randInt(1,x) generates a uniformly random integer between 1,x */
            a ← randInt(1, |A|)
            return a   // random action
        end if
        return argmax_a Q⁺[a]   // Greedy Optimistic action
    end function
    Algorithm 6 Action Selection: ε-Greedy
    function NextActionBoltz(Q⁺, Q^ϕ)
        /* Q⁺ and Q^ϕ contain Q-values for some state. */
        /* rand(0,1) generates random number between 0,1 */
        if rand(0,1) < ε then
            /* boltzDistInt(W,1,x) generates an integer between 1,x that is Boltzmann distributed according to W */
            a ← boltzDistInt(Q^ϕ, 1, |A|)
            return a   // Boltzmann distributed random action
        end if
        return argmax_a Q⁺[a]   // Greedy Optimistic action
    end function
    Algorithm 7 Action Selection: Boltzmann Distributed
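The Boltzmann distributed selection can be sketched with std::discrete_distribution as follows. The function names, the temperature parameter, and the choice of std::mt19937 as the random engine are assumptions for illustration; the text above does not specify them.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Boltzmann (softmax) weights over the exploratory values. The maximum is
// subtracted before exponentiating to avoid overflow.
std::vector<double> boltzmannWeights(const std::vector<double>& qBonus,
                                     double temperature) {
    const double maxQ = *std::max_element(qBonus.begin(), qBonus.end());
    std::vector<double> w(qBonus.size());
    for (std::size_t i = 0; i < qBonus.size(); ++i) {
        w[i] = std::exp((qBonus[i] - maxQ) / temperature);
    }
    return w;
}

// With probability epsilon, sample an action from the Boltzmann distribution
// over qBonus; otherwise act greedily on the sum qExtrinsic + qBonus.
std::size_t nextActionBoltz(const std::vector<double>& qExtrinsic,
                            const std::vector<double>& qBonus,
                            double epsilon, double temperature,
                            std::mt19937& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    if (unit(rng) < epsilon) {
        const std::vector<double> w = boltzmannWeights(qBonus, temperature);
        std::discrete_distribution<int> dist(w.begin(), w.end());
        return static_cast<std::size_t>(dist(rng));
    }
    std::size_t best = 0;
    for (std::size_t a = 1; a < qExtrinsic.size(); ++a) {
        if (qExtrinsic[a] + qBonus[a] > qExtrinsic[best] + qBonus[best]) best = a;
    }
    return best;
}
```

Replacing the Boltzmann branch with a uniform integer draw over the actions recovers the ε-greedy variant, which is the only difference between the two approaches.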

    4.2.5 SARSA(λ)+ϕ-EB

    Now that we have all the modules necessary for learning, we present the implementation for our agent below.

    Algorithm: Reinforcement Learning with SARSA(λ) and ϕ-EB exploration
    Input: Feature Map ϕ; Training Horizon T
    /* Each are an |A| × n_t matrix, where n_t is the number of features observed until timestep t. */
    Initialize θ, θ^ϕ arbitrarily
    Initial state s
    Q ← θ ϕ(s)   // Vector containing Q-values
    Q^ϕ ← θ^ϕ ϕ(s)   // Vector containing Q^ϕ-values
    a ← NextActionBoltz(Q + Q^ϕ, Q^ϕ)
    while t < T do
        /* Re-estimate Q-values with updated θ values. */
        Q ← θ ϕ(s);  Q^ϕ ← θ^ϕ ϕ(s)
        r, s′ ← Act(a)   // Perform action in ALE.
        R^ϕ ← ExplorationBonus(ϕ(s′))   // Compute Intrinsic reward.
        /* Predict next state Q-values. */
        Q′ ← θ ϕ(s′);  Q′^ϕ ← θ^ϕ ϕ(s′)
        /* Boltzmann distributed action selection. */
        a′ ← NextActionBoltz(Q′ + Q′^ϕ, Q′^ϕ)
        /* Alternatively: ε-greedy action selection. */
        // a′ ← NextAction(Q′ + Q′^ϕ)
        /* TD update */
        δ ← r + γ Q′[a′] − Q[a];  θ[a] ← θ[a] + α δ ϕ(s)
        δ^ϕ ← R^ϕ + γ Q′^ϕ[a′] − Q^ϕ[a];  θ^ϕ[a] ← θ^ϕ[a] + α δ^ϕ ϕ(s)
        s ← s′;  a ← a′;  t ← t + 1
    end while
    return θ, θ^ϕ

    We have removed the details regarding eligibility traces for brevity and clarity. The algorithm above is the final version that we have implemented, and for which empirical results are presented. As shown, it uses the Boltzmann distributed action selection approach. We can disable the NextActionBoltz call inside the loop and enable the commented-out alternative below it to use the ε-greedy action selection approach. In the next chapter we discuss the experimental framework we used to perform the empirical evaluation. We then showcase the state-of-the-art results that our algorithm achieves.

    Chapter 5 Empirical Evaluation

    ‘What can be asserted without evidence can also be dismissed without evidence.’

    Christopher Hitchens

    Empirical evidence is one of the fundamental requirements for validating any scientific hypothesis. In order to validate the efficacy of our exploration algorithm, this chapter showcases empirical results that represent a significant improvement over existing algorithms.

    In Section 5.1 we discuss the evaluation platform and the feature set that we used to evaluate our exploration strategy. Section 5.1.1 introduces the Arcade Learning Environment (ALE) as our evaluation platform. We provide justification for choosing ALE as an environment for our agent. Further, in Section 5.1.2 we introduce the Blob-PROST feature set, and the benefits our agent enjoys from using it.

    In Section 5.2 we provide the foundations needed to evaluate our algorithm. We discuss the aspects that need to be considered in choosing a particular game for evaluation. Further, we describe the parameters for empirical evaluation, such as the number of trials, training frames, etc.

    Lastly, in Section 5.3 we discuss the results of our various experiments. We first compare the empirical performance of the two action selection processes discussed in Section 4.2.4. Then we compare the learning performance of our agent with SARSA+ε-greedy. Finally, we compare the evaluation scores for our agent with other leading algorithms.

    5.1 Evaluation Framework

    5.1.1 Arcade Learning Environment (ALE)

    The Arcade Learning Environment (ALE) (Bellemare et al., 2013) is a software framework that interfaces with the Stella emulator (Mott and Team, 1996) for the Atari 2600 games (Montfort and Bogost, 2009). The Atari 2600 platform contains hundreds of games spanning many genres, such as sports, puzzle, action, adventure, arcade, and strategy (Figure 5.1). Some of the games are quite challenging for human players (Bellemare et al., 2013). Due to the diverse nature of the games, a learning algorithm that can play the entire gamut of Atari 2600 games can be considered generally competent. The goal of the ALE framework is to provide a platform for AI researchers to test their learning algorithms for general competence, share empirical data with the research community, and further the goal of achieving artificial general intelligence (Bellemare et al., 2013).

    Figure 5.1: Game screens from 55 Atari 2600 games (Defazio and Graepel, 2014).
    Figure 5.2: High level working of ALE for an RL algorithm.

    The ALE contains a game-layer to facilitate reinforcement learning. The game-layer takes in the action from the agent and validates if it is one of the predefined </