1.1 Reinforcement Learning
Machine Learning is a field in computer science that allows computers to dynamically generate novel algorithms that otherwise cannot be explicitly programmed. These algorithms, called hypotheses, generalize patterns and regularities from observed real-world data using statistical techniques (Bishop, 2007). Reinforcement Learning (RL) is a field of machine learning that deals with optimal sequential decision making in an unknown environment with no explicitly labelled training data. The RL framework is one of the fundamental models that best describes how intelligent beings interact with their world to achieve a goal. An RL algorithm is given agency to interact with its surroundings, and is aptly called an agent. The world with which the agent interacts is called its environment. The unsupervised nature of RL algorithms means that the agent has to develop a policy for acting in an unknown environment by trial-and-error (Sutton and Barto, 1998). In every such interaction the agent performs an action on the environment and receives a percept. The percept consists of the current configuration of the environment, called state, and a scalar feedback signal, called reward. The reward signal indicates how good the sequence of actions of the agent was. The goal of an RL agent is based on the concept of the reward hypothesis:
Definition 1 (Reward Hypothesis).
(Sutton, 1999) Any notion of a goal or purpose of an intelligent agent can be described as the maximization of expected cumulative reward.
The existence of an extrinsic feedback signal makes RL algorithms also somewhat supervised in nature; thus RL algorithms are in some sense both supervised and unsupervised (Barto and Dietterich, 2004).
As an example, consider an agent playing a car racing game in which the goal is to reach the finish line as soon as possible. To model the goal as a cumulative reward maximization problem, we give the agent a negative reward every timestep, thereby incentivizing the agent to reach the finish line as quickly as possible. This example illustrates how an objective can be modelled as the maximization of expected cumulative reward, and the goal of an agent as a sequence of actions that achieves it. The interaction between agent and environment continues until the agent converges to an optimal sequence of actions for each state in the environment. This interaction is called the agent-environment interaction cycle, as illustrated in Figure 1.1. Each iteration of the interaction is called a timestep, often denoted by the subscript $t$ to distinguish states, actions, and percepts between timesteps.
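The interaction cycle and the per-timestep penalty can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RaceTrack` environment, its one-dimensional track, and both policies are hypothetical stand-ins for the racing game, not part of the original work.

```python
import random


class RaceTrack:
    """Hypothetical 1-D track: start at position 0, finish at `length`."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is 1 (move forward) or 0 (stay put).
        self.position = min(self.length, self.position + action)
        done = self.position == self.length
        reward = -1  # every timestep costs one unit of reward
        return self.position, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run the agent-environment cycle once and return the cumulative reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_forward(state):
    return 1


rng = random.Random(0)


def random_policy(state):
    return rng.choice([0, 1])
```

Under this reward scheme the policy that always moves forward collects the largest possible return, since every wasted timestep subtracts one more unit of reward.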
1.2 The Exploration/Exploitation Dilemma
In an online decision-making setting such as the reinforcement learning problem, an agent is faced with two choices: explore or exploit. The term exploration in an active learning system is defined as the process of deliberately taking a non-greedy action with the sole aim of gathering more information about the environment. Exploration plays a fundamental role in reinforcement learning algorithms. It is born out of the notion that an optimal long-term policy might involve short-term sacrifices. Alternatively, exploitation is the act of taking the best possible action given the current information about the environment. A central challenge in reinforcement learning is to find the sweet spot between exploration and exploitation, i.e., to figure out when to explore and when to exploit. This problem is known as the exploration-exploitation dilemma.
At present there are a number of provably efficient exploration methods that are effective in environments with low-dimensional state-action spaces. Most of the exploration algorithms which enjoy strong theoretical guarantees implement the so-called "Optimism in the Face of Uncertainty" (OFU) principle. This heuristic encourages the agent to be optimistic about the reward it might attain in less explored parts of the environment. The agent seeks out states with higher associated uncertainty, and in doing so reduces its uncertainty in a very efficient way. Many algorithms that implement this heuristic do so by adding an exploration bonus to the agent’s reward signal. This bonus is usually a function of a state visit-count; the agent receives higher exploration bonuses for exploring less frequently visited states (about which it is less certain).
Unfortunately, these algorithms do not scale well to high-dimensional environments. In these domains, the agent can only visit a small portion of the state space while it is training. The visit-count for most states is always zero, even after training is finished. Nearly all states will be assigned the same exploration bonus throughout training. This renders the bonus useless as a tool for efficient exploration. All unvisited states appear to the agent as equally uncertain. This problem arises because these count-based OFU algorithms fail to generalise the agent’s uncertainty from one context to another. Even if an unvisited state has very similar features to a frequently visited one, the agent will treat the former as a complete unknown. Consequently, even the sophisticated algorithms that are suitable for the high-dimensional setting (e.g. those that use deep neural networks for policy evaluation) tend to use simple, inefficient exploration strategies.
Success in the high-dimensional setting demands that the agent represent the state space in a way that allows generalisation about uncertainty. This sort of generalisation would make the agent’s uncertainty lower for states with familiar features, and higher for states with novel features, even if those exact states haven’t been visited. What we require, then, is an efficient method for computing a suitable similarity measure for states. That is the key challenge addressed in this thesis.
1.3 Summary of Contributions
This thesis presents a new count-based exploration algorithm that is feasible in environments with large state-action spaces. It can be combined with any value-based RL algorithm that uses linear function approximation (LFA). The principal contribution is a new method for computing generalised visit-counts. Following Bellemare et al. (2016), we construct a visit-density model in order to measure the similarity between states. Our approach departs from theirs in that we do not construct our density model over the raw state space. Instead, we exploit the feature map that is used for value function approximation, and construct a density model over the transformed feature space. This model assigns higher probability to state feature vectors that share features with visited states. Generalised visit-counts are then computed from these probabilities; states with frequently observed features are assigned higher counts. These counts serve as a measure of the uncertainty associated with a state. Exploration bonuses are then computed from these counts in order to encourage the agent to visit regions of the state-space with less familiar features.
Our density model can be trivially derived from any feature map used for LFA, regardless of the application domain, and requires little or no additional design. In contrast to existing algorithms, there is no need to perform a special dimensionality reduction of the state space in order to compute our generalised visit-counts. Our method uses the same lower-dimensional feature representation to estimate value and to estimate uncertainty. This makes it simpler to implement and less computationally expensive than some existing proposals. Our evaluation demonstrates that this simple approach achieves near state-of-the-art performance on high-dimensional RL benchmarks.
2.1 Classical Reinforcement Learning
In Classical RL (CRL), the environment is assumed to be fully observable and ergodic, and every state is assumed to have the Markov property. The branch of reinforcement learning where these assumptions are lifted is called General Reinforcement Learning (GRL) (Hutter, 2005).
Definition 2 (Markov property).
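Future states are only dependent on the current state and action, and are independent of the history of percepts. Formally,
$$P(s_{t+1} = s' \mid s_t, a_t) = P(s_{t+1} = s' \mid s_1, a_1, \ldots, s_t, a_t)$$
for all $s' \in \mathcal{S}$, and histories $s_1, a_1, \ldots, s_t, a_t$.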
A Markov Decision Process (MDP) captures the above assumptions about the environment, and so in the CRL context the environment is modelled as an MDP (Puterman, 1994). Thus, the CRL problem now reduces to the problem of finding an optimal policy for an unknown MDP.
Definition 3 (Markov Decision Process).
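A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ representative of a fully-observable environment in which all states are Markov.
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $P(s' \mid s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ are the transition probabilities
- $R(s, a, s')$ is the expected value of the reward resulting from the transition $(s, a, s')$
- $\gamma \in [0, 1]$ is the discount factor which weights the relative importance of immediate rewards to future rewards.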
If the dynamics (transition and reward distributions) of the MDP are known, then we can use dynamic programming methods to directly plan on the MDP to find an optimal policy. In the RL context, in which the system dynamics are unknown, we have to use iterative RL algorithms such as TD-learning (Sutton, 1988) to find a good policy asymptotically.¹
¹ Asymptotic analysis is one of the few theoretical tools we have to analyse RL algorithms in a domain-agnostic way.
Definition 4 (Policy).
A policy may be deterministic or stochastic. A deterministic policy is a mapping from the states to actions, $\pi : \mathcal{S} \to \mathcal{A}$.
A stochastic policy is a probability distribution over the set of actions given a state, $\pi(a \mid s) = P(a_t = a \mid s_t = s)$.
Value
The most common way to characterize the quality of a given policy is to define a function that computes how valuable it is to follow the policy from a given state (or state-action pair). This notion of value is expressed in terms of future rewards the agent could expect, if it had chosen to follow the given policy.
Definition 5 (State-Value Function).
The state-value function, $V^{\pi} : \mathcal{S} \to \mathbb{R}$, is a mapping from states to $\mathbb{R}$. The value of a state $s$ under policy $\pi$ is the expected discounted cumulative reward given that the agent starts in state $s$ and follows policy $\pi$ thereafter.
Definition 6 (Action-Value Function).
The action-value function, $Q^{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, is a mapping from state-action pairs to $\mathbb{R}$. The action-value of the state-action pair $(s, a)$ under policy $\pi$ is the expected discounted cumulative reward given that the agent starts in state $s$, takes action $a$, and follows policy $\pi$ thereafter.
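For concreteness, the two verbal definitions above can be written as expectations over discounted returns. The indexing convention below (reward $r_{t+k+1}$ received after the action at timestep $t+k$) follows Sutton and Barto (1998) and is an assumption about notation rather than a quotation from this thesis.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right].
```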
Bellman Equations
Bellman equations form the basis for how to compute, approximate, and learn value functions in the RL setup (Sutton and Barto, 1998). They arise naturally from the structure of an MDP by capturing the recursive relationship between the value of a state and the value of its successor states. The two Bellman equations for the state-values and action-values can be defined as follows.
Definition 7 (Bellman Equation for state-value function of an MDP).
$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$$
Definition 8 (Bellman Equation for action-value function of an MDP).
$$Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^{\pi}(s', a') \right]$$
We can now use the value function to define a partial ordering over policies. A policy is said to be better than another when the expected return of one policy is greater than or equal to the other for all states. Formally, $\pi \geq \pi'$ if and only if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all $s \in \mathcal{S}$. From the imposed partial ordering it has been shown that there exists at least one policy, $\pi^{*}$, such that $\pi^{*} \geq \pi$ for all policies $\pi$, although it might not be unique (Bertsekas and Tsitsiklis, 1996). The Bellman Optimality Equations provide a mathematical framework for talking about the optimal policy just by replacing the sum over actions with a $\max$ operator. Intuitively, this represents a policy that is greedy with respect to the value of its successor states.
Definition 9 (Bellman Optimality Equation for state-values).
$$V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]$$
Definition 10 (Bellman Optimality Equation for action-values).
$$Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]$$
For finite MDPs with known environment dynamics, the Bellman Optimality Equations have a unique solution. Unfortunately, in the RL setup we deal with an unknown MDP. Thus, almost all RL algorithms approximate the Bellman Optimality Equations for an unknown MDP and try to find an optimal policy iteratively, converging only asymptotically.
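When the dynamics are known, the Bellman optimality backup can simply be iterated until the values stop changing. The sketch below is a minimal value iteration routine under stated assumptions: the three-state chain, the action names, and the dictionary layout `P[s][a] = [(probability, next_state, reward), ...]` are all hypothetical choices made for illustration.

```python
def backup(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-10):
    """Iterate the Bellman optimality backup until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(P, V, gamma=0.9):
    """Policy that is greedy with respect to the successor-state values."""
    return {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma)) for s in P}


# Hypothetical chain 0 -> 1 -> 2; state 2 is absorbing with zero reward.
P = {
    0: {"stay": [(1.0, 0, -1.0)], "go": [(1.0, 1, -1.0)]},
    1: {"stay": [(1.0, 1, -1.0)], "go": [(1.0, 2, -1.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},
}
V = value_iteration(P)
policy = greedy_policy(P, V)
```

Note that both functions read `P` directly; this is exactly the knowledge an RL agent lacks, which is why the algorithms discussed later estimate values from sampled experience instead.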
2.1.1 Reinforcement Learning Algorithms
The fundamental difference between an RL problem and a planning problem is the knowledge of the environment dynamics. In a planning problem the model of the environment is already known and the problem boils down to finding an optimal policy in the environment. In an RL problem, the agent is dropped into an environment whose dynamics are unknown. This makes reinforcement learning a hard problem. This distinction gives rise to two categories of RL algorithms, namely model-based and model-free.
Algorithms that learn a model of the environment and then plan within the learned model are called model-based RL algorithms. These algorithms learn the transition probabilities ($P$) and reward functions ($R$) of the MDP by iteratively simulating the environment and updating the simulation to better represent the true environment. This approach to solving unknown MDPs is computationally intensive, especially in large or continuous problems. Value iteration and policy iteration are two dynamic programming algorithms that take a planning-based approach to the RL problem. On the other hand, model-free algorithms directly learn the optimal policy using an intermediary quantity (usually the value-function).
Generalized Policy Iteration (GPI)
The overarching theme of almost all value-function based CRL algorithms is the back-and-forth between two interacting processes, prediction and control, eventually resulting in convergence. Prediction refers to policy-evaluation where the value-function is estimated for the current policy. Control on the other hand aims to find a policy that is greedy with respect to the current value-function (state-value or action-value).
The ‘Prediction and Control’ process converges when it produces no significant change, that is, the value-function is consistent with the current policy and the policy is greedy with respect to the current value-function.
Temporal Difference Learning
Temporal Difference learning (TD learning) is a common RL algorithm; it is a model-free algorithm that combines Monte Carlo methods with the ideas from dynamic programming. TD learning allows the agent to directly learn from its experience of the environment. Following the GPI theme, we need a strategy for prediction and control. In TD prediction we use the sampling of Monte Carlo methods and bootstrapping (updating from an existing estimate) of DP algorithms to estimate the current value-function.
Definition 11 (Update formula for state-value function).
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$
TD(0) is a TD learning algorithm that updates state-values after each timestep, so the learning process is fast and online. The target for the TD(0) update formula uses the existing estimate of $V(s_{t+1})$, hence we say the algorithm bootstraps.
As the agent interacts with the environment more, TD learning is able to generate a better estimate of the value-functions. In the limit, if each state (or state-action pair) is visited infinitely often with some additional constraints on the learning rates, convergence to the true value-function is guaranteed (Bertsekas and Tsitsiklis, 1996).
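A tabular TD(0) prediction loop is short enough to show in full. The following is a sketch, assuming the standard update with step size $\alpha$ and discount $\gamma$; the episode format and the two-state deterministic chain are hypothetical and chosen so that the true values are easy to check by hand.

```python
def td0_prediction(episodes, alpha=0.1, gamma=1.0):
    """Estimate state-values for the policy that generated `episodes`.

    Each episode is a list of (state, reward, next_state) transitions;
    a next_state of None marks termination.
    """
    V = {}
    for episode in episodes:
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
            td_target = r + gamma * v_next  # bootstraps from the current estimate
            td_error = td_target - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * td_error
    return V


# Hypothetical deterministic chain 0 -> 1 -> terminal, reward -1 per step.
episode = [(0, -1.0, 1), (1, -1.0, None)]
V = td0_prediction([episode] * 200, alpha=0.5)
```

With an undiscounted return the true values are $-2$ for state 0 and $-1$ for state 1, and the estimates approach them as the same experience is replayed.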
In TD control we want to optimize the value-function of an unknown environment. There are two classes of policy control methods, namely, on-policy and off-policy. On-policy control uses the policy derived from the current value-function estimate to update the future estimates. Alternatively, off-policy control uses a policy that is greedy with respect to the current value-function to estimate future value-functions. SARSA (on-policy) and Q-learning (off-policy) are two popular TD control algorithms that are known to learn an MDP asymptotically (Sutton and Barto, 1998; Watkins and Dayan, 1992).
The reason we are able to do model-free TD control is that we use (state, action)-value functions instead of state-value functions.
Definition 12 (Greedy policy control).
Policy improvement is done by considering a new policy, $\pi'$, which is greedy with respect to the current value-function.
$$\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$$
$$\pi'(s) = \arg\max_{a} Q^{\pi}(s, a)$$
From the above policy improvement equations we can see that in order to be greedy with respect to the state-value function, we require the model of the MDP. In contrast, if the policy is greedy with respect to the action-value function, the model dynamics of the MDP are not needed, and hence, it is model-free. Thus, optimizing action-value functions to learn the optimal policy is at the heart of all model-free TD control algorithms.
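The contrast between the two forms of greedy improvement shows up directly in the function signatures. In the sketch below, which assumes dictionary-based tables and a hypothetical two-action problem, `greedy_from_v` cannot be written without the transition model `P`, whereas `greedy_from_q` and the standard Q-learning update (Watkins and Dayan, 1992) use only sampled transitions.

```python
from collections import defaultdict


def greedy_from_q(Q, state, actions):
    # Model-free: only the learned action-values are needed.
    return max(actions, key=lambda a: Q[(state, a)])


def greedy_from_v(V, state, actions, P, gamma=0.9):
    # Model-based: needs P[(state, action)] = [(prob, next_state, reward), ...].
    def one_step_return(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(state, a)])

    return max(actions, key=one_step_return)


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Off-policy target: reward plus the greedy value of the successor state.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


actions = ["left", "right"]
Q = defaultdict(float)
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)

# The same decision made from state-values requires the (hypothetical) model.
P = {(0, "left"): [(1.0, 0, 0.0)], (0, "right"): [(1.0, 1, 1.0)]}
V = {0: 0.0, 1: 0.0}
```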
Challenges and Drawbacks
All the Classical Reinforcement Learning algorithms that we discussed above can be categorized as tabular algorithms. That is, the algorithms use a table data structure to associate each state (or state-action pair) with its current value estimate. As the agent interacts with the environment and gains experience, the table values are updated with better estimates of its value.
The main drawback of such a method is that it scales poorly. When the state-space is very large or continuous, the fundamental requirement that the agent visits each state (or state-action pair) multiple times (or infinitely often) is not satisfied; most states are visited at most once. An agent following a policy derived from these value estimates would do no better than a random policy. Moreover, the table size grows with the number of states, making storage infeasible for problems with large or continuous state-spaces.
A common approach to solving this problem is to find a way to generalize the value-function from the limited experience of the agent (Sutton and Barto, 1998). That is, we want to approximate the value-function for an unseen state (or state-action pair) from the example values it has observed so far. Function approximation is a generalization technique that does exactly this; it takes in observed values of a desired function and attempts to generalize an approximation of the function.
2.1.2 Function Approximation
Function approximation (FA) is an instance of supervised learning (Sutton and Barto, 1998). It is viewed as a class of techniques used to approximate functions by using example values of the desired function. In the RL context tabular methods become infeasible in large or continuous state spaces. This challenge is mitigated by employing FA techniques to predict the value-function at unseen states. However, not all FA methods are applicable to the RL setting. We require a training method which can learn efficiently from online, non-i.i.d. data, and also handle non-stationary target functions. The following are some of the function approximators that are used in the RL context.
- Gradient-Descent Methods
  - Artificial Neural Networks
  - Linear Combination of Features
- State Aggregation
- k-Nearest Neighbors
- Soft Aggregation
State aggregation is a method of generalizing function approximation in which states are grouped based on a criterion and then value is estimated as an attribute of the group. When a state is revisited the value corresponding to the state’s group gets updated.
Linear combination of features, also known as Linear Function Approximation (LFA), is essentially a linear mapping from the state space (of dimensionality $N$) to a feature space of dimension $M$, where often $M < N$. Each basis function of the feature space is a mapping from the state space to a real-valued number that represents some feature of the state-space.
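Definition 13 (Linear-Approximate state(-action)-value function).
The approximate state-value function of a state $s$ under a policy $\pi$ is given by:
$$\hat{V}^{\pi}(s) = \theta^{\top} \phi(s) = \sum_{i=1}^{M} \theta_i \phi_i(s)$$
Where $\phi(s) \in \mathbb{R}^{M}$, $\phi : \mathcal{S} \to \mathbb{R}^{M}$ is a feature map, and $\theta \in \mathbb{R}^{M}$ is the parameter vector.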
LFA has sound theoretical guarantees and is also very efficient in terms of both data and computation (Sutton and Barto, 1998), making it a good candidate for the implementation of our algorithm.
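A linear value estimate is just an inner product between a parameter vector and a feature vector, which keeps the learning update equally simple. The sketch below pairs it with a semi-gradient TD(0) step, a standard choice that is assumed here rather than taken from this thesis; the polynomial feature map `phi` is a hypothetical example for a scalar state.

```python
def phi(state):
    """Hypothetical feature map for a scalar state: bias, state, state squared."""
    return [1.0, state, state * state]


def v_hat(theta, state):
    # Linear value estimate: inner product of parameters and features.
    return sum(t * f for t, f in zip(theta, phi(state)))


def semi_gradient_td0_step(theta, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Return the parameter vector after one semi-gradient TD(0) update."""
    target = r + (0.0 if terminal else gamma * v_hat(theta, s_next))
    td_error = target - v_hat(theta, s)
    # The gradient of theta . phi(s) with respect to theta is phi(s) itself.
    return [t + alpha * td_error * f for t, f in zip(theta, phi(s))]


theta = [0.0, 0.0, 0.0]
theta = semi_gradient_td0_step(theta, s=0.5, r=1.0, s_next=1.0)
```

Because the update moves every parameter in proportion to its feature value, a single observed transition also changes the estimate for every other state that shares active features with it, which is the generalization that tabular methods lack.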
As mentioned previously, FA can be regarded as a technique to develop a generalization regarding value. In order to have a good capacity to generalize, a function approximator must have relevant data about the state-space. Consider a pathological case in which the agent does not explore at all: as a result the only data available for FA would be concentrated in one region of the state space. This results in the estimation of values of unseen states being highly biased. In order to avoid this problem we have to make sure that the agent visits most regions of the state-space, that is, the agent has to explore the state-space efficiently. The main goal of this thesis is to address the problem of how to explore efficiently in large state-spaces.
2.2 Exploration Strategies for Reinforcement Learning
In Section 1.2 we described the exploration/exploitation dilemma, which is a fundamental problem in RL. All exploration strategies attempt to manage the trade-off between these two often opposed objectives. The simplest and most widely-used exploration strategy is known as $\epsilon$-greedy. At each timestep the agent chooses a greedy action with probability $1 - \epsilon$, and with probability $\epsilon$ the agent chooses a completely random action. To ensure that the policy converges to the optimal policy it has to satisfy the GLIE assumptions (Singh et al., 2000):
Definition 14 (Greedy in the Limit with Infinite Exploration).
A policy is GLIE if it satisfies the following two assumptions.
- Each action is taken infinitely often in every state that is visited infinitely often, $\lim_{t \to \infty} N_t(s, a) = \infty$, where $N_t(s, a)$ is the number of times action $a$ has been chosen in state $s$ up to timestep $t$.
- In the limit, the learning policy is greedy with respect to the Q-value function with probability $1$.
For example, $\epsilon$-greedy satisfies the GLIE assumptions when $\epsilon$ is annealed to zero. A common way to do this is by setting $\epsilon_t = 1/t$.
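An annealed version of this strategy fits in a few lines. The sketch assumes the schedule $\epsilon_t = 1/t$, which is one common choice and is used here purely for illustration; the action names and value estimates are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values maps each action to its current value estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))  # explore: uniformly random action
    return max(q_values, key=q_values.get)  # exploit: greedy action


def annealed_epsilon(t):
    # Assumed schedule eps_t = 1/t for timesteps t = 1, 2, ...
    return 1.0 / t


rng = random.Random(0)
q = {"left": 0.2, "right": 0.7}
actions = [epsilon_greedy(q, annealed_epsilon(t), rng) for t in range(1, 1001)]
```

Early timesteps are almost entirely random, while late timesteps are almost entirely greedy, so the exploratory actions are never informed by what the agent has already learned.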
In small, finite MDPs $\epsilon$-greedy satisfies the GLIE assumptions, but when the state-action space is large or continuous the first GLIE assumption is violated and hence the convergence guarantee is lost. $\epsilon$-greedy is a naïve approach to solving the exploration problem, but we still use it in large MDPs because of its low resource requirements when compared with alternatives (Bellemare et al., 2016). In this thesis we propose a novel exploration strategy that improves upon $\epsilon$-greedy, and provides state-of-the-art results in large problems with low computational overhead.
We now provide an exposition of various exploration strategies, their foundational principles, and an analysis of recent breakthroughs in the field of exploration.
2.2.1 Taxonomy of Exploration Strategies
The exploration-exploitation dilemma is still an open problem, but researchers have made significant inroads into understanding the nature of the problem. Sebastian Thrun classified exploration techniques into two families of exploration schemes, directed and undirected (Thrun, 1992). Undirected exploration strategies do not use any information from the environment to make an informed exploratory action; they predominantly rely on randomness to do exploration. Softmax methods and $\epsilon$-greedy are examples of undirected exploration techniques. The softmax action is sampled from the Boltzmann distribution
$$\pi(a \mid s) = \frac{\exp\left(Q(s, a) / \tau\right)}{\sum_{b \in \mathcal{A}} \exp\left(Q(s, b) / \tau\right)},$$
where $\tau > 0$ is a temperature parameter.
On the other hand, directed exploration strategies use the knowledge about the learning process to form an exploration-specific heuristic for action selection. This heuristic directs the agent to take those actions that maximize the information gain about the environment. The exploration algorithm introduced in this thesis falls into the category of directed exploration algorithms. In order to put it into context, we first present an overview of the existing directed exploration strategies used in the literature.
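Returning briefly to the undirected softmax strategy mentioned above, the sketch below samples an action from a Boltzmann distribution over value estimates. The temperature parameter name and the example values are assumptions made for illustration.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature=1.0):
    """Softmax over action-value estimates; higher temperature is closer to uniform."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def softmax_action(q_values, temperature=1.0, rng=random):
    probs = boltzmann_probabilities(q_values, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]


probs = boltzmann_probabilities({"left": 1.0, "right": 2.0}, temperature=1.0)
hot = boltzmann_probabilities({"left": 1.0, "right": 2.0}, temperature=100.0)
```

The probabilities depend only on the current value estimates, not on how often each action has been tried, which is what makes the strategy undirected.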
2.3 The Optimism in the Face of Uncertainty Principle
In the following chapter we present our directed exploration method, which implements the principle of "Optimism in the Face of Uncertainty" (OFU) as a heuristic for exploration. In this section we review existing work on the OFU heuristic. The principle is succinctly captured in Osband and Van Roy (2016):
"When at a state, the agent assigns to each action an optimistically biased while statistically plausible estimate of future value and selects the action with the greatest estimate."
OFU is a heuristic to direct exploratory actions. OFU directs the agent to take actions which have more uncertain value estimates. Instead of greedily taking the action that has the highest estimated value, the agent is encouraged to take actions which have a high probability of being optimal. To see that an apparently suboptimal action may indeed have a high probability of being optimal, let us take an example. Suppose that the agent has taken an action $a_1$ very often from a particular state $s$, and suppose that $a_1$ also currently has the highest value-estimate among the available actions. Now consider an alternative action $a_2$ that has only been tried once from the state $s$, and suppose that the reward received was lower than $\hat{Q}(s, a_1)$. Action $a_1$ has higher estimated value, but having tried it many times, the agent’s uncertainty about its value is quite low. In contrast, the uncertainty about the value of the alternative action $a_2$ is very high, since it has been taken so rarely. Thus, while the current estimate $\hat{Q}(s, a_2)$ may be lower than $\hat{Q}(s, a_1)$, there is a good chance that the agent was unlucky when taking $a_2$ the first time, and that the true action-value $Q(s, a_2)$ is much higher than both estimates. Thus it may be that $a_2$ has a higher probability of being the optimal action than does $a_1$, especially if their estimated values are quite close. The OFU heuristic would bias the agent toward taking action $a_2$ instead of the greedy action $a_1$. An agent following this heuristic will behave as if it is optimistic about action $a_2$, or more precisely, about its true action-value $Q(s, a_2)$. This optimism drives the agent to explore regions of the environment about which it is more uncertain.
2.3.1 OFU using Count-Based Exploration Bonuses
Most of the exploration algorithms that enjoy strong theoretical efficiency guarantees implement the OFU heuristic. Many do so by augmenting the estimated value of a state (or state-action pair) with an exploration bonus that quantifies the uncertainty in that value estimate. An agent which acts greedily with respect to this augmented value function will be biased to take actions with higher associated uncertainty. Most of these algorithms are tabular and count-based in that they compute their exploration bonuses using a table of state(-action) visit-counts. The visit-count serves as an approximate measure of the uncertainty associated with a state (or state-action pair), because more novel states (or state-action pairs) will have lower visit-counts. States (or state-action pairs) with lower visit-counts are assigned higher exploration bonuses. This drives the agent to behave optimistically and explore less frequently visited regions of the environment, which may yet prove to have higher value than familiar regions. Moreover, even if those regions turn out to yield little reward when explored, the agent will have greatly reduced its uncertainty about those regions. Indeed, the reduction in uncertainty would be much smaller if the agent were to take an action that had already been tried many times. The OFU heuristic is therefore a win-win approach for the agent. OFU algorithms are more efficient than undirected exploration strategies like $\epsilon$-greedy because the agent avoids actions that yield neither large rewards nor large reductions in uncertainty (Osband et al., 2016).
2.3.2 Tabular Count-based Exploration Algorithms
One of the best known OFU methods is the UCB1 bandit algorithm, which selects an action $a$ that maximises an upper confidence bound $\hat{Q}_t(a) + \sqrt{\frac{2 \log t}{N(a)}}$, where $\hat{Q}_t(a)$ is the estimated mean reward and $N(a)$ is the visit-count (Lai and Robbins, 1985). The dependence of the bonus term on the inverse square-root of the visit-count is justified using Chernoff bounds. In the MDP setting, the tabular OFU algorithm most closely resembling our method is Model-Based Interval Estimation with Exploration Bonuses (MBIE-EB) (Strehl and Littman, 2008).² Empirical estimates $\hat{P}$ and $\hat{R}$ of the transition and reward functions are maintained, and $\hat{R}(s, a)$ is augmented with a bonus term $\frac{\beta}{\sqrt{N(s, a)}}$, where $N(s, a)$ is the state-action visit-count, and $\beta \in \mathbb{R}$ is a theoretically derived constant. The Bellman optimality equation for the augmented action-value function is
$$\tilde{Q}(s, a) = \hat{R}(s, a) + \frac{\beta}{\sqrt{N(s, a)}} + \gamma \sum_{s'} \hat{P}(s' \mid s, a) \max_{a'} \tilde{Q}(s', a')$$
Here the dependence of the bonus on the inverse square-root of the visit-count is provably optimal (Kolter and Ng, 2009). This equation can be solved using any MDP solution method.
² To the best of our knowledge, the first work to use exploration bonuses in the MDP setting was the Dyna+ algorithm, in which the bonus is a function of the recency of visits to a state, rather than the visit-count (Sutton, 1990).
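Both bonuses are cheap to compute from a table of counts. The sketch below assumes the standard UCB1 bound $\sqrt{2 \log t / N(a)}$ and an inverse square-root bonus of the MBIE-EB form; the value `beta=1.0`, the arm names, and the numbers are hypothetical and mirror the two-action example from Section 2.3.

```python
import math


def ucb1_action(means, counts, t):
    """Pick the arm maximising mean + sqrt(2 log t / n); untried arms go first."""
    for arm, n in counts.items():
        if n == 0:
            return arm
    return max(means, key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))


def exploration_bonus(count, beta=1.0):
    # Inverse square-root bonus; beta=1.0 is an arbitrary illustrative value.
    return beta / math.sqrt(count)


# a1 has been tried often and looks best; a2 has been tried once and looks worse.
means = {"a1": 0.60, "a2": 0.50}
counts = {"a1": 100, "a2": 1}
choice = ucb1_action(means, counts, t=101)
```

Although `a1` has the higher estimated mean, the large confidence term for the rarely tried `a2` makes it the optimistic choice.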
While tabular OFU algorithms perform well in practice on small MDPs (Strehl and Littman, 2004), their sample complexity becomes prohibitive for larger problems (Bellemare et al., 2016). The sample complexity of an algorithm is a bound on the number of timesteps at which the agent is not taking an optimal action with high probability (Kakade, 2003). Loosely speaking, it measures the amount of experience the agent must have before one can be confident it is basically performing optimally. MBIE-EB, for example, has a sample complexity bound of $\tilde{O}\left(\frac{|\mathcal{S}|^{2} |\mathcal{A}|}{\epsilon^{3} (1 - \gamma)^{6}}\right)$. In the high-dimensional setting, where the agent cannot hope to visit every state during training, this bound offers no guarantee that the trained agent will perform well. The prohibitive complexity of these tabular OFU algorithms is due in part to the fact that a table of visit-counts is not useful if the state-action space is too large. Since the agent will only visit a small fraction of that space, the visit-count for most states will always be zero. These algorithms are therefore unable to usefully compare the novelty of two unvisited states. All unvisited states have the same visit-count, and hence the same exploration bonus. The optimistic agent will treat them all as equally novel and equally appealing.
2.3.3 Generalized Visit-counts for Exploration in Large MDPs
Tabular OFU algorithms fail on high-dimensional problems because they do not allow for generalization across the state space regarding uncertainty. Every unvisited state is treated as entirely novel, regardless of any similarity between the unvisited states and the visited states in the history. In order to explore efficiently in large domains, the agent must be able to make use of the fact that some unvisited states share many features with visited states, while others share very few. If an unvisited state has almost exactly the same features as a very frequently visited one, then it should not be considered to be as uncertain as a state with unfamiliar features. An effective OFU method for these problems would not just encourage the agent to visit unvisited states, but rather would drive the agent to visit states with novel or uncommon features. We discuss this issue further in Section 3.1.1.
Several very recent extensions of count-based exploration methods have achieved this sort of generalisation regarding uncertainty, and have produced impressive results on high-dimensional RL benchmarks. These algorithms closely resemble MBIE-EB, but they replace the state-action visit-count with a generalised visit-count which quantifies the similarity of a state to previously visited states. Bellemare et al. (2016) construct a Context Tree Switching (CTS) density model over the state space such that higher probability is assigned to states that are more similar to visited states (Veness et al., 2012). A state pseudo-count is then derived from this density. A subsequent extension of this work replaces the CTS density model with a neural network (Ostrovski et al., 2017). Another recent proposal uses locality sensitive hashing (LSH) to cluster similar states, and the number of visited states in a cluster serves as a generalised visit-count (Tang et al., 2016). As in the MBIE-EB algorithm, these counts are used to compute exploration bonuses. These three algorithms outperform random strategies, and are currently the leading exploration methods in large discrete domains where exploration is hard.
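To make the idea of deriving a count from a density concrete, the sketch below uses the pseudo-count formula of Bellemare et al. (2016), $\hat{N}(s) = \rho(s)\,(1 - \rho'(s)) / (\rho'(s) - \rho(s))$, where $\rho$ is the probability the model assigns to $s$ and $\rho'$ is the probability it would assign after observing $s$ once more. The `EmpiricalDensity` class is a hypothetical toy model with no generalisation, chosen only because the pseudo-count then recovers the true visit-count exactly; it is not the CTS model.

```python
from fractions import Fraction


def pseudo_count(rho, rho_prime):
    """Pseudo-count from the density before (rho) and after (rho_prime) one more visit."""
    return rho * (1 - rho_prime) / (rho_prime - rho)


class EmpiricalDensity:
    """Toy density model: the relative visit frequency of each state."""

    def __init__(self):
        self.counts = {}
        self.total = 0

    def update(self, state):
        self.counts[state] = self.counts.get(state, 0) + 1
        self.total += 1

    def prob(self, state):
        return Fraction(self.counts.get(state, 0), self.total)

    def recoding_prob(self, state):
        # Probability the model would assign to `state` after seeing it once more.
        return Fraction(self.counts.get(state, 0) + 1, self.total + 1)


model = EmpiricalDensity()
for s in ["A", "A", "B", "A"]:
    model.update(s)
n_hat_a = pseudo_count(model.prob("A"), model.recoding_prob("A"))
n_hat_b = pseudo_count(model.prob("B"), model.recoding_prob("B"))
```

A density model that shares probability mass between similar states would instead return a non-zero pseudo-count for an unvisited state that resembles visited ones, which is the generalisation the tabular count cannot provide.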
Before presenting our optimistic count-based exploration method in the following chapter, we now briefly canvass two alternative frameworks for directed exploration, and discuss their limitations.
2.4 Bayes-Adaptive RL
In the Bayesian approach to model-based reinforcement learning, we maintain a posterior distribution over the possible models of the environment given the experience of the agent (Dearden et al., 1998). Bayesian inference is used to update the posterior with new information as the agent interacts with the environment, and also to incorporate the agent’s prior distribution over the transition models.
Since the posterior is maintained over all possible models, we can now reason about the uncertainty regarding which action is best. This uncertainty is modelled as a Markov Decision Process defined over a set of hyperstates. A hyperstate acts as an information state which summarizes the information accumulated so far. This augmented MDP, often referred to as the Bayes-Adaptive MDP (BAMDP), can be solved with standard RL algorithms (Duff, 2002). In this framework an agent acting greedily in the BAMDP whilst updating the posterior acts optimally (according to its prior belief) in the original MDP. The Bayes-optimal policy for the unknown environment is the optimal policy of the BAMDP, thereby providing an elegant solution to the exploration-exploitation trade-off.
Unfortunately, the number of hyperstates grows exponentially with the planning horizon, rendering exact solution of the BAMDP computationally intractable for large problems (Duff, 2002).
2.5 Intrinsic Motivation
The final directed exploration heuristic that we discuss is born out of the so-called intrinsic motivation framework. There appears to be a growing scientific consensus in developmental psychology that human beings, from infants to adults, develop their understanding of the world using certain cognitive systems such as intuitive theories, social structures, spatial systems, etc. (Spelke and Kinzler, 2007; Lake et al., 2016). During curiosity-driven, creative, or risk-taking activities, rational agents use this understanding to generate intrinsic goals. Accomplishing these intrinsic goals leads to the accumulation of intrinsic rewards, thereby exhibiting an innate desire to explore, manipulate, or probe their environment (Oudeyer, 2007).
Drawing parallels to reinforcement learning, the goal of a traditional RL agent is to maximize its expected cumulative reward. This behaviour is extrinsically motivated since the reward signal is external to an agent. We say that an agent is intrinsically motivated if it has intrinsic goals and rewards. In the context of exploration for RL, the aim of the intrinsic motivation approach is to use intrinsic reward as a heuristic that assigns an exploratory value to the agent’s actions. For example, an agent may receive intrinsic rewards for visiting novel parts of the environment that need further exploration (Thrun, 1992).
Many formulations that quantify the exploratory value of an action have been put forth, and most of them augment the environment’s reward function so as to motivate directed exploration. Schmidhuber (2010) proposed a measure for intrinsic motivation by taking into account the improvement a learning algorithm effected on its predictive world model. This measure tracks the progress of an agent’s ability to better compress the history of states and actions (Steunebrink et al., 2013). Another framework for intrinsically motivated learning is to maximize the mutual information. An intrinsic reward measure called empowerment is formulated by searching for the maximal mutual information (Mohamed and Rezende, 2015). The notion of maximizing information gain was demonstrated in a humanoid robot by the introduction of artificial curiosity (Schmidhuber, 1991) as an intrinsic goal (Frank et al., 2014).
These formulations have some major drawbacks which hinder their suitability as exploration heuristics. Firstly, they fail to provide any strong theoretical guarantees of efficient exploration. Leike (2016) pointed out that since none of these heuristics take into account the reward structure of the problem, they do not distinguish between regions of high and low expected reward. Secondly, these algorithms require that we maintain a model of the environment dynamics of the underlying MDP, which prevents us from easily integrating them with model-free algorithms. Another major drawback is the computational overhead associated with calculating the heuristic. For problems with large state/action spaces, computing the intrinsic reward becomes intractable for many heuristics (Bellemare et al., 2016). Most problems of interest have extremely large state spaces, and hence the intrinsic motivation heuristic is currently impractical as an exploration strategy in these domains.
3.1 Drawbacks of Existing Exploration Methods for Large MDPs
We introduced count-based exploration strategies for large MDPs in Section 2.3.3. Even though they are the current state-of-the-art exploration algorithms in these domains, we consider that there are some potential drawbacks to their common approach to estimating novelty. The motivation for our algorithm arises from trying to avoid these drawbacks.
3.1.1 Choosing a Novelty Measure
The aforementioned algorithms compute a generalized visit-count. This generalized count is a novelty measure that quantifies the (dis)similarity of a state to those in the history. These algorithms drive the agent towards regions of the state space with high novelty. However, the effectiveness of these novelty measures depends on the way in which they measure the similarity between states. If this similarity measure is not chosen in a principled way, states may be deemed similar in ways that are not relevant to the given problem. Let us explore this issue with an example.
Example 1 (Confounded novelty).
Alice is a foodie. She wants to explore the myriad restaurants that are open in her city. Suppose that Alice’s novelty measure treats restaurants as similar if they are geographically close. Alice consults her novelty measure to choose a restaurant she has not tried yet, and it returns a Chinese restaurant in a distant suburb that she has not visited before. Alice scratches her head thinking: ‘I have been to a tonne of Chinese restaurants; if only my novelty measure understood that and suggested a different cuisine!’ Unfortunately, her novelty measure considers this restaurant very dissimilar from the Chinese restaurants she has visited, simply because it is geographically distant from them.
The problem here is that Alice’s novelty measure does not know anything about which features matter when evaluating the novelty of a restaurant. Let us now look at an example from the recent exploration literature where this problem can be clearly observed.
Inappropriate Novelty Measures in Practice
The problems that can arise from an unprincipled choice of novelty measure are well illustrated in the experimental evaluation of Stadie et al. (2015). Their algorithm uses an autoencoder to encode the state space into a lower-dimensional representation. The encoding is then fed into a neural network that predicts the model dynamics, and its prediction error provides an error-based exploration bonus that serves as the novelty estimate. This method, called Model Prediction Exploration Bonuses (MPEB), uses an error-based estimator and differs from the visit-density model of Bellemare et al. (2016), but both estimate novelty. To generalize regarding value they use the DQN network, and so we refer to their algorithm as DQN+MPEB.
During empirical evaluation of their algorithm, an anomaly was detected in the game Q*bert (in which the goal of the agent is to jump on all the cubes without falling off the edge or being captured) from the Arcade Learning Environment (ALE) benchmarking suite. The ALE is a performance evaluation platform consisting of Atari 2600 games, and is considered the standard performance test bed for RL algorithms; we discuss it in depth in Chapter 4. The DQN+MPEB algorithm scored lower than the baseline algorithm, DQN+ε-greedy. The authors attributed this anomaly to the fact that at each level change of Q*bert the color of the game changes dramatically, but neither the objective nor the structure of the level changes (Figure 3.3). When their agent reached level 2 (Figure 3.3), it perceived the state to be completely novel because MPEB is sensitive to color. This tricked MPEB into assigning a high exploration bonus to all the states even though the action-values of the states had not changed. Hence the policy of the agent was adversely affected.
The pathology of DQN+MPEB in the Q*bert game highlights a serious problem with current novelty estimators: they do not take into account the relevance of a state to the task an agent is trying to accomplish. We argue that a measure of novelty should not just be an arbitrary generalized representation of how many times an agent has visited a state, but should ideally be a measure of dissimilarity in facets that are relevant to the agent’s goal. Two states can be different in many ways; the challenge is to find a similarity metric which is effective in achieving the agent’s goal optimally. In Example 1, Alice’s novelty measure did not know that suggesting a restaurant with a different cuisine would be more relevant to her task, and so it naively suggested a geographically distant unvisited restaurant.
3.1.2 Separate Generalization Methods for Value and Uncertainty
We contend that this deficiency is not peculiar to MPEB, but rather that it may arise whenever the novelty measure is not designed to be task-relevant. Indeed, all of the aforementioned algorithms which compute a novelty measure share a common structure which leaves them vulnerable to this problem. Each algorithm has two quite unrelated components: a value estimator (an RL algorithm which performs policy evaluation), and a novelty estimator. Each component involves an entirely separate generalization method. The value estimator makes use of a feature representation of the state space in order to generalize about value. The novelty estimator separately utilizes a different, exploration-specific state space representation to measure the similarity between states. For example, the #Exploration algorithm of Tang et al. (2016) uses the DQN algorithm for value estimation. In order to estimate novelty, however, #Exploration maps the state space into a lower-dimensional representation using locality sensitive hashing. The similarity measure induced by the choice of hash codes is unlikely to resemble that which is induced by the features learnt by DQN. The DQN-CTS-EB algorithm of Bellemare et al. (2016) has a similar structure: DQN is used to estimate value, but the CTS density model is used to estimate novelty. Again, it is not obvious that there should be much in common between the two similarity measures induced by these different state space representations. One might think that this is natural; after all, each representation is used for a different purpose. However, there are two questions we can ask here. Firstly, is there redundant computation due to performing a dimensionality reduction of the same state space twice? Secondly, if so, can we reuse the same state space representation for both value and novelty estimation? We address these questions in the following section.
Before moving on we should note that the concerns we express in this section have already been raised in the literature. In their empirical evaluation, Bellemare et al. (2016) observed that their value estimator (DQN) was learning at a much slower rate than their CTS density model (their novelty measure). The authors attribute this mismatch to the incompatibility between novelty and value estimators. They go on to suggest that designing density models to be compatible with the value function would be beneficial and a promising research direction.
The drawbacks we presented in this section suggest that there may be much room for improvement in the design of novelty estimators for exploration. In the following sections we describe our technique for estimating novelty, factoring in the insights we gained from analyzing these drawbacks. We first provide a solid footing for some of the assumptions that we made while designing the algorithm. We then go on to present our core exploration algorithm, and then combine it with a model-free RL algorithm (SARSA(λ)). In the coming chapters we present empirical evidence that our RL algorithm achieves world-leading results on the ALE benchmarking suite.
3.2 Estimating Novelty in Feature Space
3.2.1 Motivation
Which representation of the state space is appropriate for novelty estimation? Intuitively, if we use some parameters to determine the value of an object, then two such objects should be considered dissimilar only if they differ in these parameters. Analogously, if the agent is using certain features to determine the value of a state, then two such states should be considered dissimilar only if they differ in those value-relevant features. This motivates us to construct a similarity measure that exploits the feature representation that is used for value function approximation. These features are explicitly designed to be relevant for estimating value. If they were not, they would not permit a good approximation to the true value function. This sets our method apart from the approaches described in Section 2.3.3, which measure novelty with respect to a separate, exploration-specific representation of the state space, one that bears no relation to the value function or the reward structure of the MDP. We argue that measuring novelty in feature space is a simpler and more principled approach, and hypothesise that more efficient exploration will result. Our proposal ensures that generalization regarding novelty is done in the same space as generalization regarding value. Figure 3.4 illustrates the basic structure of our proposed novelty estimator.
Let us make the idea more concrete with our running example.
Example 2 (Value-relevant exploration).
After Alice’s disappointing restaurant visit last time, she tweaked her novelty estimator such that it now generalizes based on value-relevant features like the type of cuisine, the star rating, and the other features that truly determine the quality of Alice’s dining experience. When Alice is ready to try something new, she can rest assured that it is going to be something novel in a way that is meaningful.
3.2.2 Design Decisions
Our exploration strategy, henceforth known as exploration bonus (EB), can be thought of as exploration in the feature space. This makes the existence of a feature map crucial to our strategy. Therefore we require that our algorithm be compatible with Linear Function Approximation (LFA). Before the advent of neural networks and subsequently DQN, large RL problems used linear function approximation to estimate the value of a state. Our decision to use LFA as our value prediction module has the following desirable benefits:

Domain Independence: The visit-density models that we have seen so far (MPEB, CTS-EB, PixelCNN, etc.) are designed to work with RGB pixel values from a video input. Though there are many domains that use video input to train the agent, there are equally many other domains that have nothing to do with a video input. For example, reinforcement learning is used in the financial sector to optimize portfolios, asset allocations, and trading systems (Moody and Saffell, 2001). Therefore developing a visit-density model that is domain independent is a key challenge. Our EB method estimates the novelty using the same features that LFA uses to approximate the value function. This allows our exploration strategy to be compatible with any value-based RL algorithm that uses LFA.

Indirect dependence on LFA: LFA is essentially a linear combination of features. The only requirement EB has is the existence of a feature map, which is implicitly satisfied with LFA. Because of this indirect dependence on LFA, we hypothesize that it is possible to extend EB to be compatible with value networks that perform representation learning as well (e.g., DQN). Due to resource and time constraints we do not pursue empirical evidence for this claim, but rather leave this as a possible future extension of our research.

Single point of change: The best way to assess the performance impact of changes to a system is to confine the change to a single module and then run performance tests. Following this principle, we build on SARSA(λ), a value-based RL algorithm which uses LFA for value prediction and ε-greedy for exploration (Sutton and Barto, 1998). SARSA(λ) has been extensively studied and validated. Therefore showcasing the performance gains achieved by replacing ε-greedy with our EB exploration strategy provides sound empirical evidence for the efficacy of our algorithm.
One drawback of using LFA for value prediction is that it requires a set of handcrafted features. This is easily mitigated by choosing the Arcade Learning Environment (ALE) as our evaluation platform (Bellemare et al., 2013), combined with the Blob-PROST feature set (Liang et al., 2015). Using Blob-PROST as our feature set has an added advantage: Blob-PROST is designed to mimic the features learned by DQN, thus making our algorithm comparable with those using DQN for representation learning and value prediction. We discuss the ALE and Blob-PROST in depth in Chapter 4.
3.3 The EB Algorithm
The main original contribution of this work is a method for estimating novelty in feature space. The challenge is to do so without explicitly computing the distance between each new feature vector and all the feature vectors in the history. That approach quickly becomes infeasible because the cost of computing all these distances grows with the size of the history. Our method instead constructs a density model over feature space that assigns higher probability to states that share more features with more frequently observed states. In order to formally describe our method we first introduce some notation.
Notation

$\phi : \mathcal{S} \to \mathcal{T} \subseteq \mathbb{R}^M$, The feature map used in LFA. Maps the state space $\mathcal{S}$ into an $M$-dimensional feature space, $\mathcal{T}$.

$\phi_t$, Feature vector observed at time $t$, whose $i$-th component is denoted by $\phi_{t,i}$.

$\phi_{1:t}$, Sequence of feature vectors observed after $t$ timesteps.

$\phi_{1:t}\phi$, Sequence where $\phi_{1:t}$ is followed by $\phi$.

$\mathcal{T}^*$, Set of all finite sequences of feature vectors.

$\rho$, The sequential density model (SDM) that maps a finite sequence of feature vectors to a probability distribution.
We will now present the key component of our algorithm that allows us to estimate novelty in feature space.
3.3.1 Feature Visit-Density
Definition 15 (Feature visit-density).
The feature visit-density $\rho_t(\phi)$ at time $t$ is a probability distribution over the feature space $\mathcal{T}$, representing the probability of observing the feature vector $\phi$ after observing the sequence $\phi_{1:t}$. It is modelled as a product of independent factor distributions $\rho_t^i(\phi_i)$ over the individual features $\phi_i$:
$$\rho_t(\phi) = \prod_{i=1}^{M} \rho_t^i(\phi_i)$$
This density model induces a similarity measure on the feature space. Loosely speaking, feature vectors that share component features are deemed similar. This enables us to use $\rho_t(\phi)$ as a novelty measure for states, because it represents the frequency with which features are observed in the history. When confronted with a new state $s$, we are able to estimate how frequently its component features have occurred in the history. If $\phi(s)$ has more novel component features, $\rho_t(\phi(s))$ will be lower. By using a density model we are therefore able to measure novelty in a way that usefully generalizes the agent’s uncertainty across the state space. To illustrate this, let us consider an example.
Example 3.
Suppose we use a 3D binary feature map and that after 3 timesteps the history of observed feature vectors is . Let us estimate the feature visit densities of two unobserved feature vectors , and . Using the KT estimator for the factor models, we have , and . Note that because the component features of are more similar to those in the history. As desired, our novelty measure generalizes across the state space.
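As an illustrative instance of this kind of calculation, the following worked equations use a history and two query vectors that are assumptions chosen for illustration. Each factor is a KT estimate of the form $(N + \tfrac{1}{2})/(t + 1)$, where $N$ is the number of times the queried value of that feature component appears in a history of length $t$.

```latex
\begin{align*}
\phi_{1:3} &= (0,1,0),\ (0,1,0),\ (0,1,0), \qquad \nu_1 = (1,1,0), \quad \nu_2 = (1,0,1), \\
\rho_3(\nu_1) &= \frac{0 + \frac{1}{2}}{4} \cdot \frac{3 + \frac{1}{2}}{4} \cdot \frac{3 + \frac{1}{2}}{4}
              = \frac{1}{8} \cdot \frac{7}{8} \cdot \frac{7}{8} = \frac{49}{512} \approx 0.096, \\
\rho_3(\nu_2) &= \frac{0 + \frac{1}{2}}{4} \cdot \frac{0 + \frac{1}{2}}{4} \cdot \frac{0 + \frac{1}{2}}{4}
              = \frac{1}{512} \approx 0.002.
\end{align*}
```

In this assumed instance $\rho_3(\nu_1) > \rho_3(\nu_2)$, because $\nu_1$ agrees with the history on two of its three components while $\nu_2$ agrees on none.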
Each factor distribution is modelled using a count-based estimator. A naive option would be to use the empirical estimator, which is the ratio of the number of times a feature has occurred to the total number of timesteps. Another class of count-based estimators are the Dirichlet estimators, which enjoy strong theoretical guarantees (Hutter, 2013). We use the Krichevsky-Trofimov (KT) estimator, which is a Dirichlet-like estimator that is simple, easy to implement, scalable, and data efficient (Krichevsky and Trofimov, 1981). If $N_t(\phi_i)$ is the number of times the feature $\phi_i$ has been observed in the first $t$ timesteps, then the KT estimator is given by:
$$\rho_t^i(\phi_i) = \frac{N_t(\phi_i) + \frac{1}{2}}{t + 1}$$
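A minimal sketch of the KT estimator and of the factored density it induces is given below, assuming binary features stored as a dense vector. The function names `kt_estimate` and `feature_visit_density` and the `ones_count` bookkeeping are hypothetical choices for illustration, not names taken from the original implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// KT estimate for a feature value that has occurred `count` times in `t`
// timesteps: (count + 1/2) / (t + 1).
double kt_estimate(std::size_t count, std::size_t t) {
    assert(count <= t);
    return (static_cast<double>(count) + 0.5) / (static_cast<double>(t) + 1.0);
}

// Feature visit-density of a binary feature vector `phi`, where
// ones_count[i] is the number of past timesteps in which feature i was 1.
// (Assumed representation: dense vectors, binary features.)
double feature_visit_density(const std::vector<int>& phi,
                             const std::vector<std::size_t>& ones_count,
                             std::size_t t) {
    assert(phi.size() == ones_count.size());
    double rho = 1.0;
    for (std::size_t i = 0; i < phi.size(); ++i) {
        // Number of times the queried *value* phi[i] was seen at position i.
        const std::size_t n = (phi[i] == 1) ? ones_count[i] : t - ones_count[i];
        rho *= kt_estimate(n, t);  // product of independent factors
    }
    return rho;
}
```

With an assumed history in which only the second of three features was active at each of three timesteps (`ones_count = {0, 3, 0}`, `t = 3`), the vector `{1, 1, 0}` receives a higher density than `{1, 0, 1}`, since it shares more component values with the history.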
Using independent factor distributions for modelling the probability of each feature component inherently assumes that the features are independently distributed. This is not always the case: in video-input-based domains such as the ALE, many features are strongly correlated. This does not mean that we cannot use fully factorized distributions. One of the early assumptions made by Bellemare et al. (2016) about the density model is that the states are independently distributed. This allowed them to factorize the states, and model each factor using a position-dependent CTS density model (a Bayesian variable-order Markov model). Moreover, our empirical evaluations show that we achieve world-leading results in hard exploration games, suggesting that independent factored distributions produce good novelty measures. Thus, by precedent and by empirical data, the independence assumption on the features is a well-justified trade-off that makes the computation of novelty fast and data efficient.
3.3.2 The pseudocount
Here we adopt a recently proposed method for computing generalised visit-counts from density models (Bellemare et al., 2016). By analogy with the pseudocounts presented in that work, we derive two pseudocounts from our feature visit-density. Both variants presented generalize the same quantity, the state visitation count function $N_t(s)$. The expression given in the following definition is derived in Bellemare et al. (2016). We emphasize that our approach constitutes a departure from theirs, because while they derive pseudocounts from a state visit-density model, we do so using a feature visit-density model.
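Definition 16 (pseudocount).
Let $\rho'_t(\phi)$ (also called the recoding probability) be the probability that the feature visit-density model would assign to $\phi$ if it were observed one more time. Then the pseudocount for a state $s$ with feature vector $\phi = \phi(s)$ is given by:
$$\hat{N}_t(s) = \frac{\rho_t(\phi)\,\bigl(1 - \rho'_t(\phi)\bigr)}{\rho'_t(\phi) - \rho_t(\phi)}$$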
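The pseudocount expression from Bellemare et al. (2016) can be sketched as a small function of the density before and after the extra observation. The name `pseudo_count` is a hypothetical choice. A useful sanity check is that, if the density were the plain empirical estimator (count divided by number of timesteps), the formula would recover the true visit count exactly.

```cpp
#include <cassert>

// Pseudocount from the density `rho` assigned before, and the recoding
// probability `rho_prime` assigned after, one more observation of the same
// feature vector: rho * (1 - rho_prime) / (rho_prime - rho).
double pseudo_count(double rho, double rho_prime) {
    // The extra observation must strictly increase the assigned probability,
    // otherwise the expression is undefined or negative.
    assert(rho > 0.0 && rho_prime > rho && rho_prime < 1.0);
    return rho * (1.0 - rho_prime) / (rho_prime - rho);
}
```

For instance, with an empirical estimator after 3 visits in 10 timesteps, `rho = 3/10` and `rho_prime = 4/11`, and `pseudo_count` returns 3 up to floating-point rounding.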
3.3.3 The Exploration Bonus algorithm EB
Equipped with all the tools necessary for the construction of an exploration bonus, we now proceed to define the EB algorithm. We provide a high-level flowchart for the construction of the bonus in Figure 3.5, together with the corresponding pseudocode. Having defined the pseudocount (a generalised visit-count), we follow traditional count-based exploration algorithms by computing an exploration bonus that depends on this count. The functional form of the bonus is the same as in MBIE-EB; we merely replace the empirical state-visit count with our pseudocount.
Definition 17 (exploration bonus).
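The exploration bonus for a state-action pair $(s, a)$ at time $t$ is
$$R_t(s, a) = \frac{\beta}{\sqrt{\hat{N}_t(s)}}$$
where $\hat{N}_t(s)$ is the pseudocount and $\beta$ is a hyperparameter that controls the agent’s level of optimism.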
Loosely speaking, the hyperparameter $\beta$ can be viewed as a knob that tunes the agent’s confidence in its estimate of the true action-value function. Higher values of $\beta$ make the agent underconfident about value, and result in too much exploration. Very low values do not encourage enough exploration, because the exploration bonus is too small to dissuade the agent from acting greedily with respect to its current value estimates. In both scenarios the final policy of the agent is adversely affected. The goal is to find a value that gives good results across domains. We performed a coarse parameter sweep among the games in the ALE evaluation platform and selected the best-performing value of $\beta$. Further details regarding the selection of this value are discussed in Section 5.2.1.
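The effect of the optimism hyperparameter can be sketched with two small helpers, assuming the MBIE-EB functional form of a bonus that scales with the inverse square root of the pseudocount. The function names and any numeric values used with them are hypothetical and do not reflect the value chosen in the parameter sweep.

```cpp
#include <cassert>
#include <cmath>

// MBIE-EB-style exploration bonus: beta / sqrt(pseudo_count).
double exploration_bonus(double beta, double pseudo_count) {
    assert(beta >= 0.0 && pseudo_count > 0.0);
    return beta / std::sqrt(pseudo_count);
}

// Reward used for learning: extrinsic reward plus the exploration bonus.
double augmented_reward(double extrinsic, double beta, double pseudo_count) {
    return extrinsic + exploration_bonus(beta, pseudo_count);
}
```

A larger `beta` scales every bonus up and pushes the agent toward exploration, while the bonus for any fixed `beta` shrinks as the pseudocount of a state grows.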
3.3.4 LFA with EB
One of the advantages of developing our algorithm for use with LFA is that our exploration strategy is compatible with all value-based RL algorithms that use LFA. As we will see, our empirical performance across a range of environments suggests that one can plug our exploration strategy with little to no modification into any of these algorithms and expect considerable gains in exploration efficiency. In our empirical evaluation we use SARSA(λ) with replacing traces as our value-based reinforcement learning algorithm. The accompanying pseudocode describes a generic RL algorithm that uses the augmented reward (the extrinsic reward plus the exploration bonus) for updating the parameters of the approximate action-value function.
3.3.5 Complexity Analysis
Time Complexity
From the pseudocode it is easy to see that a call to ExplorationBonus has a worst-case time complexity of $O(M)$, where $M$ is the dimension of the feature space. This suggests that the time needed to compute the novelty of a state is independent of the dimension of the state space. Also, more often than not, the dimension of the feature space is far smaller than that of the state space. Therefore, our algorithm generates significant savings in computation over other density models whose time complexity scales with the number of states. In practice, for a binary feature set like Blob-PROST we process only those features that have been observed before. This is achieved by maintaining a single prototypical factor density estimator for all previously unseen features. We discuss the implementation-specific details in depth in Chapter 4.
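One way the prototype idea could be realized is sketched below. This is an assumption-laden illustration rather than the original implementation: the class name `SparseDensity`, the use of log-probabilities to avoid underflow, and the closed-form treatment of never-seen features are all choices made here. Only features that have been active at least once get a map entry, and every other feature shares the single prototype estimate $\tfrac{1}{2}/(t+1)$.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical sparse feature visit-density over M binary features.
class SparseDensity {
public:
    explicit SparseDensity(std::size_t num_features) : m_(num_features) {}

    // log rho_t(phi), where `active` lists the indices of the features set to 1.
    double log_density(const std::vector<std::size_t>& active) const {
        const double denom = static_cast<double>(t_) + 1.0;
        const double proto = 0.5 / denom;  // P(feature = 1) if never active
        const std::unordered_set<std::size_t> on(active.begin(), active.end());
        double log_rho = 0.0;
        std::size_t seen_and_on = 0;
        // Previously seen features: one pass over the map entries.
        for (const auto& kv : ones_) {
            const double p_on = (static_cast<double>(kv.second) + 0.5) / denom;
            const bool is_on = on.count(kv.first) > 0;
            log_rho += std::log(is_on ? p_on : 1.0 - p_on);
            if (is_on) ++seen_and_on;
        }
        // Never-seen features are handled in closed form via the prototype.
        const std::size_t new_on = on.size() - seen_and_on;
        const std::size_t unseen = m_ - ones_.size();
        assert(new_on <= unseen);
        log_rho += static_cast<double>(new_on) * std::log(proto);
        log_rho += static_cast<double>(unseen - new_on) * std::log(1.0 - proto);
        return log_rho;
    }

    // Record one observed feature vector.
    void update(const std::vector<std::size_t>& active) {
        const std::unordered_set<std::size_t> on(active.begin(), active.end());
        for (std::size_t i : on) ++ones_[i];
        ++t_;
    }

private:
    std::size_t m_;                                      // feature-space dimension M
    std::size_t t_ = 0;                                  // timesteps observed so far
    std::unordered_map<std::size_t, std::size_t> ones_;  // times each feature was active
};
```

The cost of `log_density` grows with the number of map entries plus the number of currently active features, not with the full dimension of the feature space.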
Space Complexity
We now examine the pseudocode to analyze which objects need to be persisted across iterations in order to calculate the exploration bonus. Clearly the factor density estimators $\rho^i$ and the feature visit counts $N_t$ are needed to evaluate and update the feature visit-density. Therefore our algorithm has a worst-case space complexity of $O(M)$. Again, because the features in Blob-PROST are binary valued, the KT estimator can be defined recursively. This allows the factor density to be updated online without the need to maintain a feature visit count. We discuss this further in Chapter 4.
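A recursive form of the KT update for a single binary feature can be sketched as follows. This derivation is an assumption about what such a recursion could look like, not the form used in the original implementation. If $p_t = (N + \tfrac{1}{2})/(t+1)$ is the current estimate that the feature is active, then after observing $x \in \{0, 1\}$ the new estimate is $(p_t\,(t+1) + x)/(t+2)$, which needs only $p_t$ and $t$. The function name `kt_recursive_update` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Recursive KT update for one binary feature.
// `p_on` is the KT estimate that the feature is active after `t` timesteps;
// `active` is the observation at timestep t + 1. No visit count is stored,
// because p_on * (t + 1) already equals (count + 1/2).
double kt_recursive_update(double p_on, std::size_t t, bool active) {
    assert(p_on > 0.0 && p_on < 1.0);
    const double x = active ? 1.0 : 0.0;
    return (p_on * (static_cast<double>(t) + 1.0) + x) /
           (static_cast<double>(t) + 2.0);
}
```

Starting from the prior estimate of one half at `t = 0`, observing the feature as active gives 0.75, and then observing it as inactive gives 0.5, matching the count-based formula in both steps.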
3.4 Summary
In this chapter we have presented the main contribution of our research. Motivated by the drawbacks of current state-of-the-art exploration algorithms, we introduced our novel exploration algorithm called EB. We then described the various components of the algorithm and analysed its time and space complexity.
Now that we have presented our algorithm, we move on to implementation aspects. The next chapter provides a detailed overview of the evaluation testbed, the software architecture, and the implementation challenges faced during the Research & Development of the algorithm.
4.1 Software Architecture
Our implementation goal is to develop an RL software agent that uses EB as its exploration strategy. We present the high-level design of the algorithm in Figure 4.1. The presented diagram is analogous to the agent-environment interaction cycle (Figure 1.1), but with more granularity. From an exploration-centric standpoint, we first provide a concise overview of the components presented in the architecture, and then an exposition of the implementation details for EB.
4.1.1 Modular Overview
Control
We use SARSA with replacing traces (Sutton and Barto, 1998) as our learning algorithm. This decision was driven primarily by two factors. First, using the Blob-PROST feature set (discussed in Section 5.1.2) meant that we are locked into the framework provided by Liang et al. (2015). In our case this is in fact desirable. Replacing the exploration module of an open-source, peer-reviewed and published implementation with our own exploration module enhances the credibility of any performance gains that result. Second, we need a learning algorithm that works well with Linear Function Approximation (LFA). When coupled with LFA, SARSA has better convergence guarantees than Q-learning (Melo et al., 2008). Hence, SARSA is a suitable value estimation algorithm for our agent.
Exploration Strategy
This component is our EB exploration strategy that was proposed in Chapter 3. We implement it using the C++11 programming language. (C++11 is a major revision of C++; this particular version was chosen because it makes several useful additions to the core language libraries.) C++ offers a significant edge over other languages in terms of efficiency and control over memory management. Due to the high-dimensional nature of our problem, we need to extract as much performance as possible from our code. Therefore implementing the exploration strategy in C++ is critical to the empirical success of our algorithm. Moreover, being locked into the framework provided by Liang et al. (2015) meant that there was no compelling reason to choose a different programming language. In the coming sections we provide a detailed look at the design and implementation of EB.
Prediction
We use LFA to generalize the action-value function for unknown state-action pairs. For further discussion on LFA we refer the reader to Section 2.1.2. LFA uses the Blob-PROST feature set from Liang et al. (2015) to approximate the action-value function.
Feature Map
We consider the feature map to be an integral part of the agent. The ability to discern different features of a state is imperative to generalization regarding value, and by extension, to exploration. Our agent explores in the feature space, and we want to use LFA for value prediction. This necessitated an efficient and effective feature set. The Blob-PROST feature set from Liang et al. (2015) is the best feature set available to date for the Arcade Learning Environment (ALE) evaluation platform. More details on Blob-PROST are given in Section 5.1.2.
Environment
We chose the Arcade Learning Environment as our evaluation platform for the following reasons.

ALE contains many games which vary in degree of exploration hardness. This allows us to test the efficacy of our algorithm on a broad spectrum of games (Bellemare et al., 2016).

ALE is widely accepted as the standard for testing RL algorithms. The vast majority of exploration-specific research published since Bellemare et al. (2013) has adopted the ALE platform to report empirical results (Mnih et al., 2013, 2015; Stadie et al., 2015; Osband et al., 2016; Bellemare et al., 2016; Tang et al., 2016; Ostrovski et al., 2017). Therefore, in order to compare and contrast our results with existing research, it is crucial that we choose ALE as our evaluation platform.
More details on the Arcade Learning Environment are given in Section 5.1.1.
4.1.2 Agent-Environment Workflow
We want to seamlessly integrate EB into the control module, so understanding what happens in an agent’s cycle from an implementation perspective is critical. Figure 4.1 also doubles as a workflow diagram for our agent, SARSA+EB. Following the usual agent-environment interaction process, at timestep $t$ the agent performs an action $a$ on the environment and receives an extrinsic reward $r$. The agent also observes the new state of the environment, $s'$. Inside the agent, the Blob-PROST feature map consumes the current state and returns a feature vector $\phi$. The feature vector is then used by the LFA module to do value prediction. The EB module uses the stored feature vector $\phi$ to generate the exploration bonus $R$; generalization regarding value and novelty is therefore done in the same space. SARSA updates the parameters of LFA with the TD update and chooses the next action optimistically.
EB
In the exploration strategy module, the feature visit-density $\rho$ is a hash map that persists in memory across cycles and episodes. Each entry of $\rho$ is a key-value pair mapping an individual feature $\phi_i$ to its corresponding factor distribution $\rho^i$. In the EB module shown in Figure 4.1, the flow of control is as follows: compute $\rho_t(\phi)$ as a product of factors, update $\rho$ with the observation $\phi$, then compute the density of $\phi$ again to obtain the recoding probability $\rho'_t(\phi)$. Next, calculate the pseudocount $\hat{N}_t$ and subsequently the exploration bonus $R$. The bonus is treated as an intrinsic reward and is sent to the control module, SARSA.
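The flow of control described above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the struct name `EbModule`, the dense binary input vector, and storing per-feature counts (rather than factor distribution objects) in the hash map are all hypothetical simplifications.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical EB module: a hash map from feature index to the number of
// timesteps in which that feature was active, persisted across calls.
struct EbModule {
    std::unordered_map<std::size_t, std::size_t> ones;  // per-feature counts
    std::size_t t = 0;                                  // timesteps observed
    double beta;                                        // optimism hyperparameter

    explicit EbModule(double b) : beta(b) {}

    // Product of KT factor estimates for a dense binary vector phi.
    double density(const std::vector<int>& phi) const {
        double rho = 1.0;
        for (std::size_t i = 0; i < phi.size(); ++i) {
            const auto it = ones.find(i);
            const std::size_t n_on = (it == ones.end()) ? 0 : it->second;
            const std::size_t n = (phi[i] == 1) ? n_on : t - n_on;
            rho *= (static_cast<double>(n) + 0.5) / (static_cast<double>(t) + 1.0);
        }
        return rho;
    }

    // One pass: rho -> update -> rho' -> pseudocount -> bonus.
    double step(const std::vector<int>& phi) {
        const double rho = density(phi);
        for (std::size_t i = 0; i < phi.size(); ++i)
            if (phi[i] == 1) ++ones[i];
        ++t;
        const double rho_prime = density(phi);  // recoding probability
        assert(rho_prime > rho);
        const double n_hat = rho * (1.0 - rho_prime) / (rho_prime - rho);
        return beta / std::sqrt(n_hat);
    }
};
```

Calling `step` repeatedly with the same feature vector yields a shrinking bonus, while a vector whose components differ from the history receives a larger one.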
LFA
The prediction module (LFA) approximates the next-state action-value function as a linear function of the feature vector, using the parameter vector $\theta$. LFA sends the next-state value to the control module, SARSA. The parameter vector $\theta$ is also an object that is saved in memory and persists across cycles and episodes.
SARSA (footnote 5: For brevity we have left out the discussion on eligibility traces.)
All the results from the various modules flow into the control module SARSA. The control module essentially has two tasks.

Choose the next action
The next action is chosen by being greedy with respect to the next-state action-value obtained from LFA.
Update of LFA
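First we augment the extrinsic reward with the intrinsic reward obtained from the EB module. The augmented reward $\tilde{r}_{t+1}$, the next-state action-value $Q(s_{t+1}, a_{t+1})$, and the current-state action-value $Q(s_t, a_t)$, both from LFA, are used to calculate the TD error
$$\delta_t = \tilde{r}_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),$$
where $\gamma$ is the discount factor. Next, the parameter vector $\theta$ is updated using the usual TD update formula
$$\theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_{\theta} Q(s_t, a_t),$$
where $\alpha$ is the learning rate.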
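The update task of the control module can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not our actual implementation: the names `q_value` and `sarsa_update`, the dictionary keyed by (feature, action) pairs, and the default discount factor and learning rate are all hypothetical, and eligibility traces are left out as in the text.

```python
from collections import defaultdict


def q_value(theta, active_features, action):
    """Linear Q-value for binary features: sum the weights of active features."""
    return sum(theta[(f, action)] for f in active_features)


def sarsa_update(theta, phi, action, r_ext, r_int, phi_next, next_action,
                 gamma=0.99, alpha=0.1):
    """One SARSA step on the augmented reward; returns the TD error."""
    reward = r_ext + r_int                       # extrinsic reward plus bonus
    delta = (reward
             + gamma * q_value(theta, phi_next, next_action)
             - q_value(theta, phi, action))      # TD error
    for f in phi:
        # For a linear Q over binary features, the gradient with respect
        # to each active weight is 1, so each active weight moves by
        # alpha * delta.
        theta[(f, action)] += alpha * delta
    return delta


theta = defaultdict(float)                       # all weights start at zero
delta = sarsa_update(theta, phi={"f1", "f2"}, action=0, r_ext=1.0, r_int=0.5,
                     phi_next={"f2"}, next_action=1)
```

Because only the weights of active features are touched, the cost of one update grows with the number of active features rather than with the size of the feature space.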
Now that we have a clear idea of the surrounding infrastructure, let's move on to the implementation details of EB.
4.2 Implementation Details
4.2.1 Feature VisitDensity
The central data structure that stores the factor distribution of each individual feature is an unordered_map (footnote 6: essentially a hash map) called fvd_map. Each entry is a key-value pair mapping an individual feature to its corresponding factor distribution. This gives us constant-time lookup of the factor distribution of any feature. At first glance, given the theoretical formulation of the feature visit-density (Section 3.3.1), the implementation looks straightforward. Unfortunately that is not the case. We need to take into account certain implementation-specific aspects that are often subsumed by the mathematical formulation. The following are some of the important implementation details that need to be considered when computing the feature visit-density.

Sparse Feature Vector
In practice the feature vector $\phi$ is the list of features that are active in the current timestep. Most of the time the set of observed features lies in a vastly smaller subspace of the feature space. Therefore, iterating over the entire feature space to compute the product of the factor distributions is quite wasteful. To overcome this we maintain a prototype function (footnote 7: In this context, a prototype function creates an object of a specified type; here, a KT estimator that has seen $t$ zeros) that computes the KT-estimate of observing a feature given that it has never been observed in $t$ timesteps. Whenever a new feature is observed, it is added to fvd_map with the current value of the prototype. If $n$ is the total number of features observed until timestep $t$, then we can compute the feature visit-density in $O(n)$ time.
Numerical Stability
Experience has taught us that when dealing with probabilities, innocent-looking formulas such as ours can be deceiving. Since we are taking a product of many probabilities, the result is bound to underflow numerically. In our implementation, rather than computing the product of the factors, $\prod_i \rho^i$, we compute the sum of their logarithms, $\sum_i \log \rho^i$. This allows us to safely perform probability calculations without the worry of underflow.
Inactive Features
During the evaluation of the feature visit-density we need to consider the factor distributions for the features that are inactive but previously observed. Since we have already observed the active features $\phi$, we can identify the features in fvd_map that are not active. The probability stored in fvd_map against some feature $\phi_i$ is the probability of $\phi_i$ being active. Assuming binary-valued features, the probability of $\phi_i$ not being active is given by $1 - \text{fvd\_map}[\phi_i]$. Therefore, when evaluating the feature visit-density for $\phi$ we should also factor in the probability of the inactive features not occurring.
The corresponding algorithm presents the implementation for computing the feature visit-density with all of the above-mentioned optimizations and requirements. One key observation is that we return the log-probability. This is done to facilitate the further log-based probability computations that occur in other modules.
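The three considerations above (sparse features, log space, inactive features) can be combined as in the following minimal sketch. It is written in Python, with a `dict` standing in for the `unordered_map`; the function names are hypothetical, and skipping features that have never been observed and are inactive now is an assumption of this sketch rather than a statement about our implementation.

```python
import math


def kt_prototype(t):
    """KT estimate that a feature is active after t timesteps of never firing."""
    return 0.5 / (t + 1)


def log_feature_visit_density(fvd_map, active, t):
    """Log-probability of the active-feature set under the factorised model.

    fvd_map: dict mapping feature -> probability that the feature is active.
    active:  set of features active in the current observation.
    t:       number of timesteps observed so far.
    """
    log_rho = 0.0
    for f in active:
        # Features seen for the first time fall back to the prototype value.
        log_rho += math.log(fvd_map.get(f, kt_prototype(t)))
    for f, p_active in fvd_map.items():
        if f not in active:
            # Previously seen but inactive now: probability of not firing.
            log_rho += math.log1p(-p_active)
    return log_rho


fvd_map = {"a": 0.5, "b": 0.25}
log_rho = log_feature_visit_density(fvd_map, active={"a", "c"}, t=3)
```

In this example the density is the product of 0.5 (feature `a` active), 0.125 (the prototype for the unseen feature `c`), and 0.75 (feature `b` inactive), accumulated as a sum of logarithms.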
4.2.2 Updating Factor Densities
Recall that we use the Krichevsky-Trofimov (KT) estimator to compute the factor densities. Given a sequence of symbols, the KT-estimator computes the probability of the next symbol. For a binary symbol set, the KT-estimator is given by
$$P(x_{n+1} = 1 \mid x_{1:n}) = \frac{a + \tfrac{1}{2}}{a + b + 1},$$
where $a$ is the number of 1's seen so far in the sequence, and $b$ is the number of 0's seen so far.
The Blob-PROST feature set is binary valued, making the use of KT-estimators ideal. Therefore, our factor density for a feature being active is given by
$$\rho^i_t(1 \mid \phi^i_{1:t}) = \frac{N_t(\phi_i) + \tfrac{1}{2}}{t + 1},$$
and the probability of the feature being inactive is
$$\rho^i_t(0 \mid \phi^i_{1:t}) = 1 - \rho^i_t(1 \mid \phi^i_{1:t}) = \frac{t - N_t(\phi_i) + \tfrac{1}{2}}{t + 1},$$
where $N_t(\phi_i)$ is the number of times feature $\phi_i$ has been seen, and $\phi^i_{1:t}$ is the complete sequence of past observations for feature $\phi_i$.
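As a small worked example (the numbers are illustrative and not taken from our experiments), consider a feature that was active in 3 of the first 9 observations. Under the binary KT-estimator the probabilities of it being active or inactive at the next timestep are:

```latex
\[
\Pr(\text{active}) = \frac{3 + \tfrac{1}{2}}{9 + 1} = 0.35,
\qquad
\Pr(\text{inactive}) = \frac{6 + \tfrac{1}{2}}{9 + 1} = 0.65 .
\]
```

A feature that was never active in those 9 observations would instead receive $\tfrac{1}{2} / 10 = 0.05$, which is why an unseen feature still has non-zero probability.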
The factor density equation is neat and simple, but it requires that we maintain a count for each feature. This is an unnecessary overhead and we can do better. We now propose an update formula for the factor density and derive it.
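Proposition 1 (Update formula for the KT-estimate).
The factor distribution $\rho^i_{t+1}$ at timestep $t+1$ for feature $\phi_i$ can be updated using the following update formula.
$$\rho^i_{t+1} = \frac{\rho^i_t\,(t + 1) + \phi_i}{t + 2}$$
where $\phi_i \in \{0, 1\}$ is the value of the feature observed at timestep $t+1$.
Proof.
From the equation for KT-estimates of $\rho^i_t$ we have,
$$\rho^i_t = \frac{N_t(\phi_i) + \tfrac{1}{2}}{t + 1}, \quad \text{i.e.} \quad N_t(\phi_i) + \tfrac{1}{2} = \rho^i_t\,(t + 1) \qquad (1)$$
In the next timestep $t+1$, depending on the value of $\phi_i$ we have two cases.

Case 1: Feature is active, i.e., $\phi_i = 1$
The KT-estimate can be written as,
$$\rho^i_{t+1} = \frac{N_t(\phi_i) + 1 + \tfrac{1}{2}}{t + 2} \quad (\text{since } N_{t+1}(\phi_i) = N_t(\phi_i) + 1) \qquad (2)$$
Case 2: Feature is inactive, i.e., $\phi_i = 0$
The KT-estimate can be written as,
$$\rho^i_{t+1} = \frac{N_t(\phi_i) + \tfrac{1}{2}}{t + 2} \quad (\text{since } N_{t+1}(\phi_i) = N_t(\phi_i)) \qquad (3)$$
In both cases, from Eq. (2) and (3) we can see that the value of $\phi_i$ decides the existence of an additional term. Therefore by observation we can combine the two cases as follows.
$$\rho^i_{t+1} = \frac{N_t(\phi_i) + \phi_i + \tfrac{1}{2}}{t + 2}$$
Therefore, substituting Eq. (1) into the combined expression we get,
$$\rho^i_{t+1} = \frac{\rho^i_t\,(t + 1) + \phi_i}{t + 2}$$
∎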
The corresponding algorithm presents the procedure for updating the factor distributions. It uses the update formula presented in Proposition 1 to efficiently update the factor distributions. In the implementation the update is performed in two parts with linear time complexity, rather than with a naive double-loop search.
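A count-free, two-pass update of this kind can be sketched as follows. This is a hedged Python illustration rather than our implementation: it assumes the factor at timestep $t$ equals (count + 1/2)/(t + 1), so that one more observation gives (rho * (t + 1) + phi_i)/(t + 2), and the names `kt_update` and `update_factor_distributions` are hypothetical.

```python
def kt_update(rho, t, is_active):
    """Count-free KT update: rho_next = (rho * (t + 1) + phi_i) / (t + 2)."""
    return (rho * (t + 1) + (1.0 if is_active else 0.0)) / (t + 2)


def update_factor_distributions(fvd_map, active, t):
    """Update every stored factor after observing `active` at timestep t.

    Pass 1 touches the active features (new ones start from the prototype);
    pass 2 touches the stored features that were inactive.
    """
    prototype = 0.5 / (t + 1)                    # KT estimate after t zeros
    for f in active:
        fvd_map[f] = kt_update(fvd_map.get(f, prototype), t, True)
    for f in fvd_map:
        # Only values are reassigned here, so the dict size never changes
        # while it is being iterated.
        if f not in active:
            fvd_map[f] = kt_update(fvd_map[f], t, False)


fvd_map = {}
history = [{"a"}, {"a", "b"}, set(), {"a"}]      # active sets per timestep
for t, active in enumerate(history):
    update_factor_distributions(fvd_map, active, t)
```

After these four observations the stored values agree with the explicit count form: feature `a` was active 3 times, giving (3 + 1/2)/5, and feature `b` once, giving (1 + 1/2)/5, even though no counts were stored. Keeping `active` as a set makes each membership test constant time, so both passes together are linear in the number of stored features.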
4.2.3 Exploration Bonus
This is the entry point for our exploration strategy EB. Due to the modular design of our algorithm, this function mostly acts as a hub that calls other functions sequentially to get the data required to calculate the exploration bonus. The corresponding algorithm presents the implementation for calculating the exploration bonus. Note that the probabilities are kept in log space to avoid numerical instability.
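The log-space computation can be illustrated with the sketch below. It assumes the pseudo-count of Bellemare et al. (2016), N = rho (1 - rho') / (rho' - rho), and a bonus of the form beta / sqrt(N); both the exact bonus form and the default value of `beta` are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def log_pseudo_count(log_rho, log_rho_next):
    """log of N = rho * (1 - rho') / (rho' - rho), from log-densities.

    Requires log_rho < log_rho_next < 0, i.e. the density of the observed
    features increased after the update and is below one.
    """
    log_one_minus_next = math.log1p(-math.exp(log_rho_next))
    # log(rho' - rho) = log(rho') + log(1 - rho / rho')
    log_gain = log_rho_next + math.log1p(-math.exp(log_rho - log_rho_next))
    return log_rho + log_one_minus_next - log_gain


def exploration_bonus(log_rho, log_rho_next, beta=0.05):
    """Bonus beta / sqrt(N), evaluated as beta * exp(-0.5 * log N)."""
    return beta * math.exp(-0.5 * log_pseudo_count(log_rho, log_rho_next))


bonus = exploration_bonus(math.log(0.1), math.log(0.2), beta=1.0)
```

With rho = 0.1 and rho' = 0.2 the pseudo-count is 0.1 * 0.8 / 0.1 = 0.8. The same functions remain usable when the densities themselves are far too small to represent as floating-point numbers, since only their logarithms and their ratio are ever exponentiated.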
4.2.4 Action Selection
In the early stages of the project, our agent was facing some inexplicable issues. It had really slow learning progress, and was getting stuck with a single action for long periods of time. Fortunately, we had rich logs that helped us in identifying a pattern to the problem.
We observed that during the initial training cycles, the value predictions from LFA had very high variance due to a lack of samples. Whenever there was an abnormally high Q-value, our greedy optimistic agent kept taking the same action over and over again in a loop. We initially thought that the decay in the corresponding exploration bonus, coupled with an increase in the optimistic estimates for other states, would lead to the agent breaking out of the loop. Although the agent eventually got out of the loop, it did so only after an exorbitantly large number of episodes. From the logs we observed that each TD update only effected a small change, which is why it took a large number of episodes to overcome an abnormally high Q-value.
If we were using $\epsilon$-greedy as the exploration strategy this would not be a problem. With $\epsilon$-greedy, the agent takes more exploratory actions in the initial training cycles. Even if LFA produces highly varying Q-values initially, the agent doesn't get stuck for more than a few cycles. Thus, random exploration at the beginning helps stabilize the action-values predicted by LFA.
Our goal is to replace $\epsilon$-greedy with our intrinsically motivated exploration strategy EB. Unfortunately, removing $\epsilon$-greedy means that the agent's policy is now deterministic and has the above debilitating side effect. To solve this issue we experimented with two approaches.

Combine EB with $\epsilon$-greedy
A similar problem was reported by Bellemare et al. (2016). Their solution was to use $\epsilon$-greedy, not as an exploration strategy, but as a tool to introduce stochasticity in the agent's policy. During the initial training cycles, when there is high variance in the LFA estimates, taking a purely random action allows the agent to get out of the greedy action loop. In this experiment $\epsilon$-greedy is implemented in the usual way: with probability $\epsilon$ take a random action, and a greedy optimistic action otherwise. The corresponding algorithm presents the implementation.
Combine EB with Boltzmann distributed action selection
One motivation for our research is to make sure that the agent does not take purely random actions. The approach from Bellemare et al. (2016) described above introduces purely random actions. We present an alternative approach which introduces stochasticity, but in a directed manner. We split our optimistic Q-function into two functions, an extrinsic one and an exploratory one. The extrinsic function is trained using the extrinsic reward, whereas the exploratory function is trained on the exploration bonus from EB (footnote 8: When using LFA, training is done on the LFA parameters. Therefore we essentially maintain two sets of parameters, one for each function). The motivation here is that we now have a value function that directs the exploratory actions of the agent. For action selection we construct the optimistic value function as the sum of the two value functions. During action selection, with probability $1 - \epsilon$ the agent takes the action that is greedy with respect to the summed value function; otherwise the agent takes a Boltzmann-distributed random action. The Boltzmann distribution is constructed from the exploratory values using discrete_distribution from the standard library. Hence the selected random action is more likely to be an action that has higher exploratory value. The corresponding algorithm presents the implementation for this approach.
Theoretically, the only difference between the above two approaches is the action selection process during exploration. The first approach takes a uniformly random action, whereas the second one takes a Boltzmann-distributed random action. Therefore during the implementation of the learning algorithm we implement the second approach, and swap the action selection process with the first for experimentation.
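The two action selection processes can be contrasted in a short sketch. It is written in Python, with `random.choices` standing in for the `discrete_distribution` mentioned above. The function names, the `temperature` parameter, and the convention that the exploratory branch is taken with probability `epsilon` are assumptions made for illustration.

```python
import math
import random


def epsilon_greedy(q_sum, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_sum))
    return max(range(len(q_sum)), key=q_sum.__getitem__)


def boltzmann_action(q_explore, temperature=1.0, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    top = max(q_explore)                         # subtract max for stability
    weights = [math.exp((q - top) / temperature) for q in q_explore]
    return rng.choices(range(len(q_explore)), weights=weights)[0]


def next_action_boltz(q_extrinsic, q_explore, epsilon, rng=random):
    """Greedy on the summed values, otherwise a Boltzmann exploratory draw."""
    if rng.random() < epsilon:
        return boltzmann_action(q_explore, rng=rng)
    q_sum = [e + x for e, x in zip(q_extrinsic, q_explore)]
    return max(range(len(q_sum)), key=q_sum.__getitem__)
```

The only difference between the two selectors is the exploratory branch: `epsilon_greedy` ignores the values entirely there, while `next_action_boltz` weights each action by its exploratory value, so swapping one for the other leaves the rest of the learning loop unchanged.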
4.2.5 SARSA($\lambda$)+EB
Now that we have all the modules necessary for learning, we present the implementation of our agent in the algorithm below.

Algorithm: SARSA($\lambda$)+EB
Input: feature map $\phi$; training horizon
    /* Each parameter set is a matrix. */
    Initialize the extrinsic and exploratory LFA parameters arbitrarily
    Observe the initial state
    Compute the extrinsic action-values       // Vector containing the extrinsic values
    Compute the exploratory action-values     // Vector containing the exploratory values
    Choose the first action with NextActionBoltz
    while the training horizon is not reached do
        /* Re-estimate the action-values with the updated parameter values. */
        Act                                   // Perform action in ALE.
        ExplorationBonus                      // Compute intrinsic reward.
        /* Predict next-state action-values. */
        /* Boltzmann-distributed action selection. */
        NextActionBoltz
        /* Alternatively: $\epsilon$-greedy action selection. */
        // NextAction
        /* TD update */
    end while
    return the learned parameters

(Footnote 8: $n$ is the number of features observed until timestep $t$.)

We have removed the details regarding eligibility traces for brevity and clarity. The algorithm above is the final version that we have implemented, and for which empirical results are presented. In the algorithm shown we use the Boltzmann-distributed action selection approach. We can disable the NextActionBoltz line and enable the commented-out alternative below it to use the $\epsilon$-greedy action selection approach.

In the next chapter we discuss the experimental evaluation framework we used to perform the empirical evaluation. Then we showcase the state-of-the-art results that our algorithm achieves.

Chapter 5 Empirical Evaluation
’What can be asserted without evidence can also be dismissed without evidence.’
Christopher Hitchens
Empirical evidence is one of the fundamental requirements for validating any scientific hypothesis. In order to validate the efficacy of our exploration algorithm, this chapter showcases empirical results that represent a significant improvement over existing algorithms.
In Section 5.1 we talk about the evaluation platform and the feature set that we used to evaluate our exploration strategy. Section 5.1.1 introduces the Arcade Learning Environment (ALE) as our evaluation platform. We provide justification for choosing ALE as an environment for our agent. Further, in Section 5.1.2 we introduce the Blob-PROST feature set, and the benefits our agent enjoys from using it.
In Section 5.2 we provide the foundations needed to evaluate our algorithm. We discuss the aspects that need to be considered in choosing a particular game for evaluation. Further, we talk about the parameters for empirical evaluation, such as the number of trials, training frames, etc.
Lastly, in Section 5.3 we discuss the results of our various experiments. We compare the empirical performance of the two action selection processes discussed in Section 4.2.4. Then we compare the learning performance of our agent with SARSA+$\epsilon$-greedy. Finally, we compare the evaluation scores of our agent with other leading algorithms.
5.1 Evaluation Framework
5.1.1 Arcade Learning Environment (ALE)
The Arcade Learning Environment (ALE) (Bellemare et al., 2013) is a software framework that interfaces with the Stella emulator (Mott and Team, 1996) for the Atari 2600 games (Montfort and Bogost, 2009). The Atari 2600 platform contains hundreds of games that vary in many aspects of gameplaying such as sports, puzzle, action, adventure, arcade, strategy etc. (Figure 5.1). Some of the games are quite challenging for human players (Bellemare et al., 2013). Due to the diverse nature of the games, a learning algorithm that can play the entire gamut of the Atari 2600 games can be considered to be generally competent. The goal of the ALE framework is to provide a platform for AI researchers to test their learning algorithm for general competence, share empirical data with the research community, and further the goal of achieving artificial general intelligence (Bellema