# Unsupervised Basis Function Adaptation for Reinforcement Learning

When using reinforcement learning (RL) algorithms to evaluate a policy it is common, given a large state space, to introduce some form of approximation architecture for the value function (VF). The exact form of this architecture can have a significant effect on the accuracy of the VF estimate, however, and determining a suitable approximation architecture can often be a highly complex task. Consequently there is a large amount of interest in the potential for allowing RL algorithms to adaptively generate (i.e. to learn) approximation architectures. We investigate a method of adapting approximation architectures which uses feedback regarding the frequency with which an agent has visited certain states to guide which areas of the state space to approximate with greater detail. We introduce an algorithm based upon this idea which adapts a state aggregation approximation architecture on-line. Assuming S states, we demonstrate theoretically that - provided the following relatively non-restrictive assumptions are satisfied: (a) the number of cells X in the state aggregation architecture is of order √S log₂ S or greater, (b) the policy and transition function are close to deterministic, and (c) the prior for the transition function is uniformly distributed - our algorithm can guarantee, assuming we use an appropriate scoring function to measure VF error, error which is arbitrarily close to zero as S becomes large. It is able to do this despite having only O(X log₂ S) space complexity (and negligible time complexity). We conclude by generating a set of empirical results which support the theoretical results.


## 1 Introduction

Traditional reinforcement learning (RL) algorithms such as TD(λ) (Sutton, 1988) or Q-learning (Watkins and Dayan, 1992) can generate optimal policies when dealing with small state and action spaces. However, when environments are complex (with large or continuous state or action spaces), using such algorithms directly becomes too computationally demanding. As a result it is common to introduce some form of architecture with which to approximate the value function (VF), for example a parametrised set of functions (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). One issue when introducing VF approximation, however, is that the accuracy of the algorithm’s VF estimate, and as a consequence its performance, is highly dependent upon the exact form of the architecture chosen (it may be, for example, that no element of the chosen set of parametrised functions closely fits the VF). Accordingly, a number of authors have explored the possibility of allowing the approximation architecture to be learned by the agent, rather than pre-set manually by the designer—see Busoniu et al. (2010) for an overview. It is hoped that, by doing this, we can design algorithms which will perform well within a more general class of environment whilst requiring less explicit input from designers.1 Introducing the ability to adapt an approximation architecture is in some ways similar to simply adding additional parameters to an approximation architecture. However separating parameters into two sets, those adjusted by the underlying RL algorithm and those adjusted by the adaptation method, permits us scope to, amongst other things, specify two distinct update rules.

A simple and perhaps, as yet, under-explored method of adapting an approximation architecture involves using an estimate of the frequency with which an agent has visited certain states to determine which states should have their values approximated in greater detail. We might be interested in such methods since, intuitively, we would suspect that areas which are visited more regularly are, for a number of reasons, more “important” in relation to determining a policy. Such a method can be contrasted with the more commonly explored method of explicitly measuring VF error and using this error as feedback to adapt an approximation architecture. We will refer to methods which adapt approximation architectures using visit frequency estimates as being unsupervised in the sense that no direct reference is made to reward or to any estimate of the VF.

Our intention in this article is to provide—in the setting of problems with large or continuous state spaces, where reward and transition functions are unknown, and where our task is to maximise reward—an exploration of unsupervised methods along with a discussion of their potential merits and drawbacks. We will do this primarily by introducing an algorithm, PASA, which represents an attempt to implement an unsupervised method in a manner which is as simple and obvious as possible. The algorithm will form the principal focus of our theoretical and experimental analysis.

It will turn out that unsupervised techniques have a number of advantages which may not be offered by other more commonly used methods of adapting approximation architectures. In particular, we will argue that unsupervised methods have (a) low computational overheads and (b) a tendency to require less sampling in order to converge. We will also argue that the methods can, under suitable conditions, (c) decrease VF error, in some cases significantly, with minimal input from the designer, and, as a consequence, (d) boost performance. The methods will be most effective in environments which satisfy certain conditions, however these conditions are likely to be satisfied by many of the environments we encounter most commonly in practice. The fact that unsupervised methods are cheap and simple, yet still have significant potential to enhance performance, makes them appear a promising, if perhaps somewhat overlooked, means of adapting approximation architectures.

### 1.1 Article overview

Our article is structured as follows. Following some short introductory sections we will offer an informal discussion of the potential merits of unsupervised methods in order to motivate and give a rationale for our exploration (Section 1.5). We will then propose (in Section 2) our new algorithm, PASA, short for “Probabilistic Adaptive State Aggregation”. The algorithm is designed to be used in conjunction with SARSA, and adapts a state aggregation approximation architecture on-line.

Section 3 is devoted to a theoretical analysis of the properties of PASA. Sections 3.1 to 3.3 relate to finite state spaces. We will demonstrate in Section 3.1 that PASA has a time complexity (considered as a function of the state and action space sizes, S and A) of the same order as its SARSA counterpart. It has space complexity of O(X log₂ S), where X is the number of cells in the state aggregation architecture, compared to O(XA) for its SARSA counterpart. This means that PASA is computationally cheap: it does not carry significant computational cost beyond SARSA with fixed state aggregation.

In Section 3.2 we investigate PASA in the context where an agent’s policy is held fixed and prove that the algorithm converges. This implies that, unlike for non-linear architectures in general, SARSA combined with PASA will have the same convergence properties as SARSA with a fixed linear approximation architecture (i.e. the VF estimate may, assuming the policy is updated, “chatter”, or fail to converge, but will never diverge).

In Section 3.3 we will use PASA’s convergence properties to obtain a theorem, again where the policy is held fixed, regarding the impact PASA will have on VF error. This theorem guarantees that VF error will be arbitrarily low as measured by routinely used scoring functions provided certain conditions are met, conditions which require primarily that the agent spends a large amount of the time in a small subset of the state space. This result permits us to argue informally that PASA will also, assuming the policy is updated, improve performance given similar conditions.

In Section 3.4 we extend the finite state space concepts to continuous state spaces. We will demonstrate that, assuming we employ an initial arbitrarily complex discrete approximation of the agent’s continuous input, all of our discrete case results have a straightforward continuous state space analogue, such that PASA can be used to reduce VF error (at low computational cost) in a manner substantially equivalent to the discrete case.

In Section 3.5 we outline some examples to help illustrate the types of environments in which our stated conditions are likely to be satisfied. We will see that, even for apparently highly unstructured environments where prior knowledge of the transition function is largely absent, the necessary conditions potentially exist to guarantee that employing PASA will result in low VF error. In a key example, we will show that for environments with large state spaces and where there is no prior knowledge of the transition function, PASA will permit SARSA to generate a VF estimate with error which is arbitrarily low with arbitrarily high probability provided the transition function and policy are sufficiently close to deterministic and the algorithm has on the order of √S log₂ S cells available in its adaptive state aggregation architecture.

To corroborate our theoretical analysis, and to further address the more complex question of whether PASA will improve overall performance, we outline some experimental results in Section 4. We explore three different types of environment: a GARNET environment,2 An environment with a discrete state space where the transition function is deterministic and generated uniformly at random. For more details refer to Sections 3.5 and 4.1. a “Gridworld” type environment, and an environment representative of a logistics problem.

Our experimental results suggest that PASA, and potentially, by extension, techniques based on similar principles, can significantly boost performance when compared to SARSA with fixed state aggregation. The addition of PASA improved performance in all of our experiments, and regularly doubled or even tripled the average reward obtained. Indeed, in some of the environments we tested, PASA was also able to outperform SARSA with no state abstraction, the potential reasons for which we discuss in Section 4.4. This is despite minimal input from the designer with respect to tailoring the algorithm to each distinct environment type.3 For each problem, with the exception of the number of cells available to the state aggregation architecture, the PASA parameters were left unchanged. Furthermore, in each case the additional processing time and resources required by PASA are measured and shown to be minimal, as predicted.

### 1.2 Related works

The concept of using visit frequencies in an unsupervised manner is not completely new; however, it remains relatively unexplored compared to methods which seek to measure the error in the VF estimate explicitly and then use this error as feedback. We are aware of only three papers in the literature which investigate a method similar in concept to the one that we propose, though the algorithms analysed in these three papers differ from PASA in some key respects.

Moreover there has been little by way of theoretical analysis of unsupervised techniques. The results we derive in relation to the PASA algorithm are all original, and we are not aware of any published theoretical analysis which is closely comparable.

In the first of the three papers just mentioned, Menache et al. (2005) provide a brief evaluation of an unsupervised algorithm which uses the frequency with which an agent has visited certain states to fit the centroid and scale parameters of a set of Gaussian basis functions. Their study was limited to an experimental analysis, and to the setting of policy evaluation. The unsupervised algorithm was not the main focus of their paper, but rather was used to provide a comparison with two more complex adaptation algorithms which used information regarding the VF as feedback.4 Their paper actually found the unsupervised method performed unfavourably compared to the alternative approaches they proposed. However they tested performance in only one type of environment, a type of environment which we will argue is not well suited to the methods we are discussing here (see Section 3.5).

In the second paper, Nouri and Littman (2009) examined using a regression tree approximation architecture to approximate the VF for continuous multi-dimensional state spaces. Each node in the regression tree represents a unique and disjoint subset of the state space. Once a particular node has been visited a fixed number of times, the subset it represents is split (“once-and-for-all”) along one of its dimensions, thereby creating two new tree nodes. The manner in which the VF estimate is updated5 The paper proposes more than one algorithm. We refer here to the fitted Q-iteration algorithm. is such that incentive is given to the agent to visit areas of the state space which are relatively unexplored. The most important differences between their algorithm and ours are that, in their algorithm, (a) cell-splits are permanent, i.e. once new cells are created, they are never re-merged and (b) a permanent record is kept of each state visited (this helps the algorithm to calculate the number of times newly created cells have already been visited). With reference to (a), the capacity of PASA to re-adapt is, in practice, one of its critical elements (see Section 3). With reference to (b), the fact that PASA does not retain such information has important implications for its space complexity. The paper also provides theoretical results in relation to the optimality of their algorithm. Their guarantees apply in the limit of arbitrarily precise VF representation, and are restricted to model-based settings (where reward and transition functions are known). In these and other aspects their analysis differs significantly from our own.

In the third paper, which is somewhat similar in approach and spirit to the second (and which also considers continuous state spaces), Bernstein and Shimkin (2010) examined an algorithm wherein a set of kernels are progressively split (again “once-and-for-all”) based on the visit frequency for each kernel. Their algorithm also incorporates knowledge of uncertainty in the VF estimate, to encourage exploration. The same two differences to PASA (a) and (b) listed in the paragraph above also apply to this algorithm. Another key difference is that their algorithm maintains a distinct set of kernels for each action, which implies increased algorithm space complexity. The authors provide a theoretical analysis in which they establish a linear relationship between policy-mistake count6 Defined, in essence, as the number of time steps in which the algorithm executes a non-optimal policy. and maximum cell size in an approximation of a continuous state space.7 See, in particular, their Theorems 4 and 5. The results they provide are akin to other PAC (“probably approximately correct”) analyses undertaken by several authors under a range of varying assumptions—see, for example, Strehl et al. (2009) or, more recently, Jin et al. (2018). Their theoretical analysis differs from ours in many fundamental respects. Unlike our theoretical results in Section 3, they have the advantage that they are not dependent upon characteristics of the environment and pertain to performance, not just VF error. However, similar to Nouri and Littman (2009) above, they carry the significant limitation that there is no guarantee of arbitrarily low policy-mistake count in the absence of an arbitrarily precise approximation architecture, which is equivalent in this context to arbitrarily large computational resources.8 Our results, in contrast, provide guarantees relating to maximally reduced VF error under conditions where resources may be limited.

There is a much larger body of work less directly related to this article, but which has central features in common, and is therefore worth mentioning briefly. Two important threads of research can be identified.

Second, given that the PASA algorithm functions by updating a state aggregation architecture, it is worth noting that a number of principally theoretical works exist in relation to state aggregation methods. These works typically address the question of how states in a Markov decision process (MDP) can be aggregated, usually based on “closeness” of the transition and reward function for groups of states, such that the MDP can be solved efficiently. Examples of papers on this topic include Hutter (2016) and Abel et al. (2017) (the results of the former apply with generality beyond just MDPs). Notwithstanding being focussed on the question of how to create effective state aggregation approximation architectures, these works differ fundamentally from ours in terms of their assumptions and overall objective. Though there are exceptions—see, for example, Ortner (2013)11 This paper explores the possibility of aggregating states based on learned estimates of the transition and reward function, and as such the techniques it explores differ quite significantly from those we are investigating.—the results typically assume knowledge of the MDP (i.e. the environment) whereas our work assumes no such knowledge. Moreover the techniques analysed often use the VF, or a VF estimate, to generate a state aggregation, which is contrary to the unsupervised nature of the approaches we are investigating.

### 1.3 Formal framework

We assume that we have an agent which interacts with an environment over a sequence of iterations t ∈ {1, 2, 3, …}.12 The formal framework we assume in this article is a special case of a Markov decision process. For more general MDP definitions see, for example, Chapter 2 of Puterman (2014). We will assume throughout this article (with the exception of Section 3.4) that we have a finite set 𝒮 of states of size S (Section 3.4 relates to continuous state spaces and contains its own formal definitions where required). We also assume we have a discrete set 𝒜 of actions of size A. Since 𝒮 and 𝒜 are finite, we can, using arbitrarily assigned indices, label each state s_i (1 ≤ i ≤ S) and each action a_j (1 ≤ j ≤ A).

For each t the agent will be in a particular state s^(t) and will take a particular action a^(t). Each action is taken according to a policy π, whereby the probability the agent takes action a_j in state s_i is denoted as π(a_j|s_i).

The transition function P defines how the agent’s state evolves over time. If the agent is in state s_i and takes an action a_j in iteration t, then the probability it will transition to the state s_{i′} in iteration t + 1 is given by P(s_{i′}|s_i, a_j). The transition function must be constrained such that ∑_{i′=1}^{S} P(s_{i′}|s_i, a_j) = 1 for all i and j.

Denote as 𝒫 the space of all probability distributions defined on the real line. The reward function R is a mapping from each state-action pair (s_i, a_j) to a real-valued random variable R(s_i, a_j), where each R(s_i, a_j) is defined by a cumulative distribution function F_{R(s_i,a_j)} ∈ 𝒫, such that if the agent is in state s_i and takes action a_j in iteration t, then it will receive a real-valued reward r^(t) in that iteration distributed according to F_{R(s_i,a_j)}. Some of our key results will require that the magnitude of the expected value of R(s_i, a_j) is bounded above by a single constant for all i and j, in which case we use R_m to denote the maximum magnitude of the expected value of R(s_i, a_j) over all i and j.

Prior to the point at which an agent begins interacting with an environment, both P and R are taken as being unknown. However we may assume in general that we are given a prior distribution for both. Our overarching objective is to design an algorithm to adjust π during the course of the agent’s interaction with its environment so that total reward is maximised over some interval (for example, in the case of our experiments in Section 4, this will be a finite interval towards the end of each trial).
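The interaction model just described can be sketched as a simple simulation loop. This is only an illustrative sketch: the state and action space sizes, the environment arrays P and R, and the policy pi below are all hypothetical placeholders, not quantities defined by the formal framework.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 10, 3  # hypothetical state and action space sizes

# Environment (unknown to the agent): P[i, j] is a probability
# distribution over next states, R[i, j] an expected reward for (s_i, a_j).
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.standard_normal((S, A))

# A fixed stochastic policy: pi[i, j] = pi(a_j | s_i).
pi = rng.dirichlet(np.ones(A), size=S)

total_reward = 0.0
s = 0
for t in range(1000):
    a = rng.choice(A, p=pi[s])     # draw action from pi(. | s)
    total_reward += R[s, a]        # collect reward (expected value, for simplicity)
    s = rng.choice(S, p=P[s, a])   # transition according to P(. | s, a)
```

In a real trial the rewards would be drawn from the distributions F_{R(s_i,a_j)} rather than taken as their expected values; the loop structure is otherwise the same.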

### 1.4 Scoring functions

Whilst our overarching objective is to maximise performance, an important step towards achieving this objective involves reducing error in an algorithm’s VF estimate. This is based on the assumption that more accurate VF estimates will lead to better directed policy updates, and therefore better performance. A large part of our theoretical analysis in Section 3 will be directed at assessing the extent to which VF error will be reduced under different circumstances.

Error in a VF estimate for a fixed policy is typically measured using a scoring function. It is possible to define many different types of scoring function, and in this section we will describe some of the most commonly used types.13 Sutton and Barto (2018) provide a detailed discussion of different methods of scoring VF estimates. We first need a definition of the VF itself. We formally define the value function Q^π_γ for a particular policy π, which maps each of the S × A state-action pairs to a real value, as follows:

 Q^\pi_\gamma(s_i, a_j) \coloneqq \mathbb{E}\left(\sum_{t=1}^{\infty} \gamma^{t-1} R\big(s^{(t)}, a^{(t)}\big) \,\middle|\, s^{(1)} = s_i,\ a^{(1)} = a_j\right),

where the expectation is taken over the distributions of R, P and π (i.e. for particular instances of P and R, not over their prior distributions) and where γ ∈ [0, 1) is known as a discount factor. We will sometimes omit the subscript γ. We have used superscript brackets to indicate dependency on the iteration t. Initially the VF is unknown.

Suppose that Q̂ is an estimate of the VF. One commonly used scoring function is the squared error in the VF estimate for each state-action pair, weighted by some arbitrary function w which satisfies w(s_i, a_j) ≥ 0 for all i and j. We will refer to this as the mean squared error (MSE):

 \mathrm{MSE}_\gamma \coloneqq \sum_{i=1}^{S} \sum_{j=1}^{A} w(s_i, a_j) \big(Q^\pi_\gamma(s_i, a_j) - \hat{Q}(s_i, a_j)\big)^2. \qquad (1)

Note that the true VF Q^π_γ, which is unknown, appears in (1). Many approximation architecture adaptation algorithms use a scoring function as a form of feedback to help guide how the approximation architecture should be updated. In such cases it is important that the score is something which can be measured by the algorithm. In that spirit, another commonly used scoring function (which, unlike MSE, is not a function of Q^π_γ) uses T^π, the Bellman operator, to obtain an approximation of the MSE. This scoring function we denote as L_γ. It is a weighted sum of the Bellman error at each state-action pair:14 Note that this scoring function also depends on a discount factor γ, inherited from the Bellman error definition. It is effectively analogous to the constant γ used in the definition of MSE.

 L_\gamma \coloneqq \sum_{i=1}^{S} \sum_{j=1}^{A} w(s_i, a_j) \big(T^\pi \hat{Q}(s_i, a_j) - \hat{Q}(s_i, a_j)\big)^2,

where:

 T^\pi \hat{Q}(s_i, a_j) \coloneqq \mathbb{E}\big[R(s_i, a_j)\big] + \gamma \sum_{i'=1}^{S} P(s_{i'}|s_i, a_j) \sum_{j'=1}^{A} \pi(a_{j'}|s_{i'}) \hat{Q}(s_{i'}, a_{j'}).

The value L_γ still relies on an expectation within the squared term, and hence there may still be challenges estimating L_γ empirically. A third alternative scoring function L̃_γ, which steps around this problem, can be defined as follows:

 \tilde{L}_\gamma \coloneqq \sum_{i=1}^{S} \sum_{j=1}^{A} w(s_i, a_j) \sum_{i'=1}^{S} P(s_{i'}|s_i, a_j) \sum_{j'=1}^{A} \pi(a_{j'}|s_{i'}) \int_{\mathbb{R}} \big(r + \gamma \hat{Q}(s_{i'}, a_{j'}) - \hat{Q}(s_i, a_j)\big)^2 \, dF_{R(s_i, a_j)}(r).

These three different scoring functions are arguably the most commonly used scoring functions, and we will state results in Section 3 in relation to all three. Scoring functions which involve a projection onto the space of possible VF estimates are also commonly used. We will not consider such scoring functions explicitly, however our results below will apply to these error functions, since, for the architectures we consider, scoring functions with and without a projection are equivalent.
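For concreteness, the first two scoring functions can be computed directly when (hypothetical) model arrays are available. The sketch below assumes deterministic rewards, so the expectation over rewards in the Bellman operator is trivial; all array names and sizes are illustrative placeholders.

```python
import numpy as np

def mse_score(Q, Q_hat, w):
    """Weighted squared VF error: sum_ij w[i,j] * (Q[i,j] - Q_hat[i,j])^2."""
    return float(np.sum(w * (Q - Q_hat) ** 2))

def bellman_score(Q_hat, P, R, pi, w, gamma):
    """Weighted squared Bellman error, assuming deterministic rewards
    R[i,j] so that E[R(s_i, a_j)] = R[i,j]."""
    V = (pi * Q_hat).sum(axis=1)   # V[i'] = sum_j' pi(a_j'|s_i') Q_hat[i',j']
    T_Q = R + gamma * P.dot(V)     # Bellman operator applied to Q_hat
    return float(np.sum(w * (T_Q - Q_hat) ** 2))

# Hypothetical arrays for S = 3 states and A = 2 actions.
rng = np.random.default_rng(0)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[i, j, i'] = P(s_i' | s_i, a_j)
R = rng.standard_normal((S, A))
pi = rng.dirichlet(np.ones(A), size=S)
w = np.full((S, A), 1 / (S * A))             # uniform weighting

Q_hat = rng.standard_normal((S, A))
score = bellman_score(Q_hat, P, R, pi, w, gamma=0.9)
```

Note that computing either score exactly requires quantities (Q^π, or P and R) which are unknown to the agent; this is precisely the measurement problem discussed above.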

We will need to consider some special cases of the weighting function w. Towards that end we define what we will term the stable state probability vector ψ, of dimension S, as follows:

 \psi_i \coloneqq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}\{s^{(t)} = s_i\},

where 𝕀 is the indicator function for a logical statement X, such that 𝕀{X} = 1 if X is true. The value ψ_i represents the proportion of the time the agent will spend in the state s_i as T → ∞ provided it follows the fixed policy π. In the case where a transition matrix obtained from P and π is irreducible and aperiodic, ψ will be the stationary distribution associated with that matrix. None of the results in this paper relating to finite state spaces require that a transition matrix obtained from P and π be irreducible; however, in order to avoid possible ambiguity, we will assume unless otherwise stated that ψ, whenever referred to, is the same for all initial states s^(1).
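In practice ψ is exactly the quantity an unsupervised method can estimate cheaply, by simple visit counting over an observed trajectory. A minimal sketch (the trajectory below is a hypothetical stand-in for the states an agent actually visits):

```python
import numpy as np

def estimate_psi(trajectory, S):
    """Empirical estimate of psi: the fraction of iterations spent in
    each state, (1/T) * sum_t I{s(t) = s_i}."""
    counts = np.bincount(trajectory, minlength=S)
    return counts / len(trajectory)

# Hypothetical observed trajectory over S = 4 states.
traj = [0, 0, 1, 0, 2, 0, 0, 1]
psi = estimate_psi(traj, S=4)  # -> array([0.625, 0.25, 0.125, 0.])
```

The estimate converges to ψ as the trajectory grows (under the stationarity assumption above), and requires only one counter per state.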

Perhaps the most natural, and also most commonly used, weighting coefficient is w(s_i, a_j) = ψ_i π(a_j|s_i), such that each error term is weighted in proportion to how frequently the particular state-action pair occurs (Menache et al., 2005; Yu and Bertsekas, 2009; Di Castro and Mannor, 2010). A slightly more general set of weightings is made up of those which satisfy w(s_i, a_j) = ψ_i w̃(s_i, a_j), where w̃(s_i, a_j) ≥ 0 and ∑_{j=1}^{A} w̃(s_i, a_j) = 1 for all i and j. All of our theoretical results will require that w(s_i, a_j) = ψ_i w̃(s_i, a_j), and some will also require that w̃(s_i, a_j) = π(a_j|s_i).15 It is worth noting that weighting by ψ and π is not necessarily the only valid choice for w. It would be possible, for example, to set a uniform weighting over all i and j, depending on the purpose for which the scoring function has been defined.
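This natural weighting is just an element-wise product of ψ with the policy's action probabilities. A small sketch with hypothetical values:

```python
import numpy as np

# Hypothetical stable state probabilities psi and policy pi for
# S = 3 states and A = 2 actions.
psi = np.array([0.5, 0.3, 0.2])
pi = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])

# w[i, j] = psi_i * pi(a_j | s_i): each state-action pair is weighted
# by how often it occurs under the fixed policy.
w = psi[:, None] * pi
```

Because each row of pi sums to one and psi sums to one, these weights sum to one over all state-action pairs.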

### 1.5 A motivating discussion

The principle we are exploring in this article is that frequently visited states should have their values approximated with greater precision. Why would we employ such a strategy? There is a natural intuition which says that states which the agent is visiting frequently are more important, either because they are intrinsically more prevalent in the environment, or because the agent is behaving in a way that makes them more prevalent, and should therefore be more accurately represented.

However it may be possible to pinpoint reasons related to efficient algorithm design which might make us particularly interested in such approaches. The thinking behind unsupervised approaches from this perspective can be summarised (informally) in the set of points which we now outline. Our arguments are based principally around the objective of minimising VF error (we will focus our arguments on MSE, though similar points could be made in relation to L_γ or L̃_γ). We will note at the end of this section, however, circumstances under which the arguments will potentially translate to benefits where policies are updated as well.

It will be critical to our arguments that the scoring function is weighted by ψ. Accordingly we begin by assuming that, in measuring VF error using MSE, we adopt w(s_i, a_j) = ψ_i w̃(s_i, a_j), where w̃ is stored by the algorithm and is not a function of the environment (for example, w̃(s_i, a_j) = π(a_j|s_i) or w̃(s_i, a_j) = 1/A for all i and j). Now consider:

1. Our goal is to find an architecture which will permit us to generate a VF estimate with low error. We can see, referring to equation (1), that we have a sum of terms of the form:

 \psi_i \tilde{w}(s_i, a_j) \big(Q^\pi(s_i, a_j) - \hat{Q}(s_i, a_j)\big)^2.

Suppose Q̂_MSE represents the value of Q̂ for which MSE is minimised subject to the constraints of a particular architecture. Assuming we can obtain a VF estimate Q̂ ≈ Q̂_MSE (e.g. using a standard RL algorithm), each term in (1) will be of the form:

 \psi_i \tilde{w}(s_i, a_j) \big(Q^\pi(s_i, a_j) - \hat{Q}_{\mathrm{MSE}}(s_i, a_j)\big)^2.

In order to reduce MSE we will want to focus on ensuring that our architecture avoids the occurrence of large terms of this form. A term may be large either because ψ_i is large, because w̃(s_i, a_j) is large, or because Q^π(s_i, a_j) − Q̂_MSE(s_i, a_j) has large magnitude. It is likely that any adaptation method we propose will involve directly or indirectly sampling one or more of these quantities in order to generate an estimate which can then be provided as feedback to update the architecture. Since w̃ is assumed to be already stored by the algorithm, we focus our attention on the other two factors.

2. Whilst both ψ_i and Q^π(s_i, a_j) − Q̂_MSE(s_i, a_j) influence the size of each term, in a range of important circumstances generating an accurate estimate of ψ_i will be easier and cheaper than generating an accurate estimate of Q^π(s_i, a_j) − Q̂_MSE(s_i, a_j). We would argue this for three reasons:

1. An estimate of Q^π(s_i, a_j) − Q̂_MSE(s_i, a_j) can only be generated with accuracy once an accurate estimate of Q^π exists. The latter will typically be generated by the underlying RL algorithm, and may require a substantial amount of training time to generate, particularly if γ is close to one;16 Whilst the underlying RL algorithm will store an estimate of Q^π, having an estimate of Q^π is not the same as having an estimate of Q^π − Q̂_MSE. If we want to estimate the latter, we should consider it in general as being estimated from scratch. The distinction is explored, for example, from a gradient descent perspective in Baird (1995). See also Chapter 11 in Sutton and Barto (2018).

2. The value Q^π(s_i, a_j) may also depend on trajectories followed by the agent consisting of many states and actions (again particularly if γ is near one), and it may take many sample trajectories and therefore a long training time to obtain a good estimate, even once π is known;

3. For each single value ψ_i there are A terms containing distinct values for Q^π(s_i, a_j) − Q̂_MSE(s_i, a_j) in the MSE. This suggests that ψ_i can be more quickly estimated in cases where w̃(s_i, a_j) > 0 for more than one index j. Furthermore, the space required to store an estimate, if required, is reduced by a factor of A.

3. If we accept that it is easier and quicker to estimate ψ_i than Q^π(s_i, a_j) − Q̂_MSE(s_i, a_j), we need to ask whether measuring the former and not the latter will provide us with sufficient information in order to make helpful adjustments to the approximation architecture. If ψ_i is roughly the same value for all i, then our approach may not work. However in practice there are many environments which (in some cases subject to the policy) are such that there will be a large amount of variance in the terms of ψ, with the implication that ψ can provide critical feedback with respect to reducing MSE. This will be illustrated most clearly through examples in Section 3.5.

4. Finally, from a practical, implementation-oriented perspective we note that, for fixed π, the value Q^π(s_i, a_j) − Q̂(s_i, a_j) is a function of the approximation architecture. This is not the case for ψ_i. If we determine our approximation architecture with reference to VF error, we may find it more difficult to ensure our adaptation method converges.17 This is because we are likely to adjust the approximation architecture so that the approximation architecture is capable of more precision for state-action pairs where the error is large. But, in doing this, we will presumably remove precision from other state-action pairs, resulting in increasing error for these pairs, which could then result in us re-adjusting the architecture to give more precision to these pairs. This could create cyclical behaviour. This could force us, for example, to employ a form of gradient descent (thereby, amongst other things,18 Gradient descent using the Bellman error is also known to be slow to converge and may require additional computational resources (Baird, 1995). limiting us to architectures expressible via differential parameters, and forcing architecture changes to occur gradually) or to make “once-and-for-all” changes to the approximation architecture (removing any subsequent capacity for our architecture to adapt, which is critical if we expect, under more general settings, the policy to change with time).19 As we saw in Section 1.2, most methods which use VF feedback explored to date in the literature do indeed employ one of these two approaches.

To summarise: in many important instances, visit frequency may lose little compared to other metrics when used to assess the importance of an area of the VF, and the simplicity of unsupervised methods allows for fast calculation and flexible implementation.

The above points focus on the problem of policy evaluation. All of our arguments will extend, however, to the policy learning setting, provided that our third point above consistently holds as each update is made. Whether this is the case will depend primarily on the type of environment with which the agent is interacting. This will be explored further in Section 3.5 and Section 4.

Having discussed, albeit informally, some of the potential advantages of unsupervised approaches to adapting approximation architectures, we would now like to implement these ideas in an algorithm. This will let us test them theoretically and empirically in a more precise, rigorous setting.

## 2 The PASA algorithm

Our Probabilistic Adaptive State Aggregation (PASA) algorithm is designed to work in conjunction with SARSA (though certainly there may be potential to use it alongside other, similar, RL algorithms). In effect PASA provides a means of allowing a state aggregation approximation architecture to be adapted on-line. In order to describe in detail how the algorithm functions it will be helpful to initially provide a brief review of SARSA, and of state aggregation approximation architectures.

### 2.1 SARSA with fixed state aggregation

In its tabular form SARSA (short for "state-action-reward-state-action"; first proposed by Rummery and Niranjan, 1994) stores an array Q̂ of dimension S × A. SARSA has a more general formulation, SARSA(λ), which incorporates an eligibility trace; any reference here to SARSA should be interpreted as a reference to SARSA(0). The algorithm performs an update to this array in each iteration as follows:

 \hat{Q}^{(t+1)}(s^{(t)},a^{(t)}) = \hat{Q}^{(t)}(s^{(t)},a^{(t)}) + \Delta\hat{Q}^{(t)}(s^{(t)},a^{(t)}),

where (noting that the discount parameter appearing in equation (2) is a parameter of the algorithm, distinct from the corresponding parameter used in the scoring function definitions, though a correspondence between the two will be made clearer below):

 \Delta\hat{Q}^{(t)}(s^{(t)},a^{(t)}) = \eta\left(R(s^{(t)},a^{(t)}) + \gamma\hat{Q}^{(t)}(s^{(t+1)},a^{(t+1)}) - \hat{Q}^{(t)}(s^{(t)},a^{(t)})\right) (2)

and where η is a fixed step size parameter. (In the literature, η is generally permitted to change over time, i.e. η = η^{(t)}; however throughout this article we assume η is a fixed value.) In the tabular case, SARSA has some well known and helpful convergence properties (Bertsekas and Tsitsiklis, 1996).
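As a concrete reference point, the tabular update in equation (2) can be sketched in a few lines. This is a minimal illustration; the function and parameter names (`sarsa_update`, `eta`, `gamma`) are our own, not part of the formal development:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    """One tabular SARSA(0) update, following equation (2).

    Q is an S x A array of action-value estimates; eta is the fixed
    step size and gamma the discount factor."""
    # Temporal-difference term: R + gamma * Q(s', a') - Q(s, a).
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += eta * delta
    return Q
```

Note that only the single entry for the visited pair (s, a) changes; all other entries are untouched, which is what makes the tabular form expensive in space for large S.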

It is possible to use different types of approximation architecture in conjunction with SARSA. Parametrised value function approximation involves generating an approximation of the VF using a parametrised set of functions. The approximate VF is denoted Q̂, and, assuming we are approximating over the state space only and not the action space, this function is parametrised by a matrix θ of dimension D × A (where, by assumption, D ≪ S). Such an approximation architecture is linear if Q̂ can be expressed in the form Q̂(s, a_j) = φ_j(s)ᵀθ_j, where θ_j is the jth column of θ and φ_j(s) is a fixed vector of dimension D for each pair (s, a_j). The D distinct vectors of dimension S obtained by evaluating each component of φ_j at every state are called basis functions. It is common to assume that φ_j = φ for all j, in which case we have only D distinct basis functions, and Q̂(s, a_j) = φ(s)ᵀθ_j. If we assume that the approximation architecture being adapted is linear then the method of adapting an approximation architecture is known as basis function adaptation. Hence we refer to the adaptation of a linear approximation architecture using an unsupervised approach as unsupervised basis function adaptation.

Suppose that Ξ is a partition of the state space, containing X elements, where we refer to each element as a cell. Indexing the cells using k, where 1 ≤ k ≤ X, we will denote X_k as the set of states in the kth cell. A state aggregation approximation architecture (see, for example, Singh et al., 1995, and Whiteson et al., 2007) is a simple linear parametrised approximation architecture which can be defined using any such partition Ξ. The parametrised VF approximation is expressed in the following form: Q̂(s, a_j) = Σ_{k=1}^{X} I{s ∈ X_k} θ_{kj}.

SARSA can be extended to operate in conjunction with a state aggregation approximation architecture if we update θ in each iteration as follows (this algorithm is a special case of a more general variant of the SARSA algorithm, one which employs stochastic semi-gradient descent and which can be applied to any set of linear basis functions):

 \theta^{(t+1)}_{kj} = \theta^{(t)}_{kj} + \eta\, I\{s^{(t)} \in X_k\}\, I\{a^{(t)} = a_j\}\left(R(s^{(t)},a^{(t)}) + \gamma d^{(t)} - \theta^{(t)}_{kj}\right), (3)

where:

 d^{(t)} \coloneqq \sum_{k'=1}^{X} \sum_{j'=1}^{A} I\{s^{(t+1)} \in X_{k'}\}\, I\{a^{(t+1)} = a_{j'}\}\, \theta^{(t)}_{k'j'}. (4)

We will say that a state aggregation architecture is fixed if Ξ (which in general can be a function of t) is the same for all t. For convenience we will refer to SARSA with fixed state aggregation as SARSA-F. We will assume (unless we explicitly state that the policy is held fixed) that SARSA updates its policy by adopting the ε-greedy policy at each iteration t.

Given a state aggregation approximation architecture, if the policy is held fixed then the value θ generated by SARSA can be shown to converge; this can be shown, for example, using much more general results from the theory of stochastic approximation algorithms, and is examined more formally in Section 3. Note that the same is true for SARSA when used in conjunction with any linear approximation architecture. Approximation architectures which are non-linear, by way of contrast, cannot be guaranteed to converge even when a policy is held fixed, and may in fact diverge. Often the employment of a non-linear architecture will demand additional measures be taken to ensure stability (see, for example, Mnih et al., 2015). Given that the underlying approximation architecture is linear, unsupervised basis function adaptation methods typically do not require any such additional measures. If, on the other hand, we allow the policy to be updated, then this convergence guarantee begins to erode. In particular, any policy update method based on periodically switching to an ε-greedy policy will not, in general, converge. However, whilst the values θ (and hence Q̂) generated by SARSA with fixed state aggregation may oscillate, they will remain bounded (Gordon, 1996, 2001).
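The aggregated update in equations (3) and (4) reduces, in code, to updating a single entry of θ per iteration. The sketch below is ours, not a prescribed implementation; in particular, the precomputed `cell_of` mapping from states to cell indices is an illustrative device:

```python
import numpy as np

def aggregated_sarsa_update(theta, cell_of, s, a, r, s_next, a_next,
                            eta=0.1, gamma=0.9):
    """One SARSA update under a state aggregation architecture,
    following equations (3) and (4). theta is an X x A array and
    cell_of maps each state index to its cell index."""
    k, k_next = cell_of[s], cell_of[s_next]
    # Equation (4): d is the estimate stored for the successor pair.
    d = theta[k_next, a_next]
    # Equation (3): only the entry for the visited (cell, action) changes.
    theta[k, a] += eta * (r + gamma * d - theta[k, a])
    return theta
```

The indicator functions in equation (3) appear here implicitly: looking up `cell_of[s]` and updating only `theta[k, a]` is equivalent to multiplying by I{s ∈ X_k} I{a = a_j}.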

### 2.2 The principles of how PASA works

PASA is an attempt to implement the idea of unsupervised basis function adaptation in a manner which is as simple and obvious as possible without compromising computational efficiency. The underlying idea of the algorithm is to make the VF representation comparatively detailed for frequently visited regions of the state space whilst allowing the representation to be coarser over the remainder of the state space. It will do this by progressively updating Ξ. Whilst the partition progressively changes, it will always contain a fixed number of cells X. We will refer to SARSA combined with PASA as SARSA-P (to distinguish it from SARSA-F described above).

The algorithm is set out in Algorithms 1 and 2. Before describing the precise details of the algorithm, however, we will attempt to describe informally how it works. PASA will update Ξ only infrequently. We must choose the value of an update-interval parameter, which in practice will be large (a suitable value is chosen for our experiments in Section 4). In each iteration at which t is a multiple of this interval, PASA will update Ξ; otherwise Ξ remains fixed.

PASA updates Ξ as follows. We must first define a fixed set of X_0 base cells, with X_0 < X, which together form a partition Ξ_0 of the state space. Suppose we have an estimate of how frequently the agent visits each of these base cells based on its recent behaviour. We can define a new partition Ξ_1 by "splitting" the most frequently visited cell into two cells containing a roughly equal number of states (the notion of a cell "split" is described more precisely below). If we now have a similar visit frequency estimate for each of the cells in the newly created partition, we could again split the most frequently visited cell, giving us yet another partition Ξ_2. If we repeat this process a total of X − X_0 times, then we will have generated a partition of the state space with X cells. Moreover, provided our visit frequency estimates are accurate, those areas of the state space which are visited more frequently will have a more detailed representation of the VF.

For this process to work effectively, PASA needs to have access to an accurate estimate of the visit frequency of each cell at each stage of the splitting process. We could, at first glance, provide this by storing an estimate of the visit frequency of every individual state. We could then estimate cell visit frequencies by summing the estimates for individual states as required. However S is, by assumption, very large, and storing S distinct real values is implicitly difficult or impossible. Accordingly, PASA instead stores an estimate of the visit frequency of each base cell, and an estimate of the visit frequency of one of the two cells defined each time a cell is split. This allows PASA to calculate an estimate of the visit frequency of every cell at every stage of the process described in the paragraph above whilst storing only X distinct values. It does this by subtracting certain estimates from others (also described in more detail below).

There is a trade-off involved when estimating visit frequencies in such a way. Suppose that at some update the partition Ξ is replaced by a new partition Ξ′. The visit frequency estimate for a cell in Ξ′ is only likely to be accurate if the same cell was an element of Ξ, or if the cell is a union of cells which were elements of Ξ. Cells in Ξ′ which do not fall into one of these categories will need time for an accurate estimate of visit frequency to be obtained. The consequence is that it may take longer for the algorithm to converge (assuming a fixed policy) than would be the case if an estimate of the visit frequency of every state were available. This will be shown more clearly in Section 3.2. The impact of this trade-off in practice, however, does not appear to be significant.
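The splitting principle described above can be sketched as follows, under the idealised assumption that exact per-state visit frequencies are available. PASA itself only maintains cell-level estimates (as Sections 2.3 and 2.4 describe), so this is an illustration of the principle, not the algorithm proper; all names are ours:

```python
def greedy_split_partition(base_cells, visit_freq, X):
    """Repeatedly split the most frequently visited cell until X
    cells exist. base_cells is a list of lists of state indices;
    visit_freq maps a state index to its (assumed known) visit
    frequency."""
    cells = [sorted(c) for c in base_cells]
    while len(cells) < X:
        # Find the cell with the highest total visit frequency.
        i = max(range(len(cells)),
                key=lambda j: sum(visit_freq[s] for s in cells[j]))
        cell = cells[i]
        if len(cell) < 2:          # never split a singleton cell
            break
        mid = len(cell) // 2       # split near the middle of the cell
        cells[i], new = cell[:mid], cell[mid:]
        cells.append(new)
    return cells
```

With frequencies concentrated on a few states, the resulting partition gives those states small (eventually singleton) cells while rarely visited states share coarse cells, which is exactly the behaviour the paragraph above motivates.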

### 2.3 Some additional terminology relating to state aggregation

In this subsection we will introduce some formal concepts, including the concept of “splitting” a cell, which will allow us, in the next subsection, to formally describe the PASA algorithm.

Our formalism is such that the state space is finite. (In the case of continuous state spaces we assume that we have a finite set of "atomic cells" which are analogous to the finite set of states discussed here; see Section 3.4.) This means that, for any problem, we can arbitrarily index each state from s_1 to s_S. Suppose we have a partition defined on the state space with X_0 elements. We will say that the partition is ordered if every cell can be expressed as an interval of the form:

 X_{j,0} \coloneqq \{s_i : L_{j,0} \le i \le U_{j,0}\},

where L_{j,0} and U_{j,0} are integers and L_{j,0} ≤ U_{j,0}. Starting with an ordered partition Ξ_0, we can formalise the notion of splitting one of its cells X_{j,0}, via which we can create a new partition Ξ_1. The new partition will be such that:

 X_1 = X_0 + 1,
 X_{j,1} = \{s_i : L_{j,0} \le i \le L_{j,0} + \lfloor (U_{j,0} - L_{j,0} - 1)/2 \rfloor\},
 X_{X_0+1,1} = \{s_i : L_{j,0} + \lfloor (U_{j,0} - L_{j,0} - 1)/2 \rfloor < i \le U_{j,0}\},
 X_{k,1} = X_{k,0} \text{ for } k \notin \{j, X_0+1\}.

The effect is that we are splitting the interval associated with X_{j,0} as near to the "middle" of the cell as possible. This creates two new intervals: the lower interval replaces the existing cell, and the upper interval becomes a new cell (with index X_0 + 1). The new partition is also an ordered partition. Note that the splitting procedure is only defined for cells with cardinality of two or more. For the remainder of this subsection our discussion will assume that this condition holds every time a particular cell is split. When we apply the procedure in practice we will take measures to ensure this condition is always satisfied.
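Under the interval representation just defined, a split is a short operation. The sketch below treats each cell as an inclusive index pair (L, U), which is our own representational choice:

```python
def split_cell(cells, j):
    """Split cell j of an ordered partition, per the formal definition
    above. Each cell is an inclusive index interval (L, U); the lower
    half keeps index j and the upper half is appended as a new cell."""
    L, U = cells[j]
    assert U > L, "only cells with two or more states can be split"
    m = L + (U - L - 1) // 2      # last state index of the lower half
    cells[j] = (L, m)
    cells.append((m + 1, U))
    return cells
```

Because the two halves are again intervals, the result is still an ordered partition and can itself be split, which is what allows the procedure to be reapplied recursively.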

Starting with any initial ordered partition, we can recursively reapply this splitting procedure as many times as we like. Note that each time a split occurs, we specify the index of the cell we are splitting. This means, given an initial ordered partition Ξ_0 (with X_0 cells), we can specify a final partition by providing a vector ρ of integers, or split vector, which is a list of the indices of cells to split. The split vector must be such that, for each k, we have the constraint that ρ_k ≤ X_0 + k − 1 (so that each element of ρ refers to a valid cell index). Assuming we want a partition composed of X cells exactly, then ρ must be of dimension X − X_0.

We require one more definition. For each partition Ξ_k defined above, where 0 ≤ k ≤ X − X_0, we introduce a collection of subsets of the state space denoted Ξ̄_k. Each element of Ξ̄_k is defined as follows:

 \bar{X}_{j,k} \coloneqq \begin{cases} \{s_i : s_i \in X_{j,0}\} & \text{if } 1 \le j \le X_0, \\ \{s_i : s_i \in X_{j,\,j-X_0}\} & \text{if } X_0 < j \le X_0 + k. \end{cases}

The effect of the definition is that, for 1 ≤ j ≤ X_0, we simply have X̄_{j,k} = X_{j,0} for all k, whilst for j > X_0, X̄_{j,k} will contain all of the states which are contained in X_{j,j−X_0}, which is the first cell created during the overall splitting process which had an index of j. Note that Ξ̄_k is not a partition, with the single exception of Ξ̄_0, which is equal to Ξ_0. The notation just outlined will be important when we set out the manner in which PASA estimates the frequency with which different cells are visited.

### 2.4 Details of the algorithm

We now add the necessary final details to formally define the PASA algorithm. We assume we have a fixed ordered partition Ξ_0 containing X_0 cells; in general, therefore, Ξ_0 is a parameter of PASA. The manner in which Ξ_0 is constructed does not need to be prescribed as part of the PASA algorithm. (The reason we do not simply take X_0 = 1 is that a larger X_0 can help to ensure that the values stored by PASA tend to remain more stable. In practice, it often makes sense to simply take Ξ_0 to be the ordered partition consisting of X_0 cells which are as close as possible to equal size. See Section 4.) PASA stores a split vector ρ of dimension X − X_0. This vector in combination with Ξ_0 defines a partition, which will represent the state aggregation architecture used by the underlying SARSA algorithm. The vector ρ, and correspondingly the partition, will be updated at fixed intervals, governed by the update-interval parameter noted above. The interval between updates permits PASA to learn visit frequency estimates, which will be used when updating ρ. We assume that each ρ_k is initialised so that no attempt will be made to split a cell containing only one state (a singleton cell).

Recall that we used X_k to denote a cell in a state aggregation architecture in Section 2.1. We will use the convention that X_k \coloneqq X_{k,X−X_0}. We also adopt the analogous shorthand Ξ \coloneqq Ξ_{X−X_0} and X̄_j \coloneqq X̄_{j,X−X_0}.

To assist in updating ρ, the algorithm will store a vector ū of real values of dimension X (initialised as a vector of zeroes). We update ū in each iteration as follows (i.e. using a simple stochastic approximation algorithm):

 \bar{u}^{(t+1)}_j = \bar{u}^{(t)}_j + \varsigma\left(I\{s^{(t)} \in \bar{X}_j\} - \bar{u}^{(t)}_j\right), (5)

where ς is a constant step size parameter. In this way, ū will record the approximate frequency with which each of the sets in Ξ̄ has been visited by the agent. (Hence, when estimating how frequently the agent visits certain sets of states, the PASA algorithm implicitly weights recent visits more heavily, using a series of coefficients which decay geometrically. The rate of this decay depends on ς.) We also store an X dimensional boolean vector Σ. This keeps track of whether a particular cell has only one state, as we do not want the algorithm to try to split singleton cells.
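Equation (5) is a standard exponentially weighted (geometric decay) estimate; a minimal sketch, with our own names for the step size and the set membership test:

```python
def update_visit_estimates(u_bar, visited_sets, step=0.001):
    """Equation (5): for each tracked set, nudge its visit frequency
    estimate towards 1 if the current state lies in the set and
    towards 0 otherwise. visited_sets is the collection of indices j
    with the current state in the j-th tracked set; step plays the
    role of the constant step size."""
    for j in range(len(u_bar)):
        indicator = 1.0 if j in visited_sets else 0.0
        u_bar[j] += step * (indicator - u_bar[j])
    return u_bar
```

Each estimate is a geometrically decaying average of past indicator values, so a smaller step size gives a longer effective memory, matching the remark above about the decay rate depending on ς.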

To update ρ, the PASA algorithm periodically (at the fixed update interval described above) performs a sequence of operations. A temporary copy of ū is made, which we call u. The vector u is intended to estimate the approximate frequency with which each of the cells in the successive partitions Ξ_0, Ξ_1, … has been visited by the agent. The elements of u will be updated as part of the sequence of operations which we will presently describe. We set the entries u_j to ū_j for 1 ≤ j ≤ X_0 at the start of the sequence (the remaining entries can be set to zero). At each stage 1 ≤ k ≤ X − X_0 of the sequence we update ρ_k as follows:

 \rho_k = \begin{cases} j & \text{if } (1 - \Sigma_{\rho_k})\, u_{\rho_k} < u_j - \vartheta, \\ \rho_k & \text{otherwise,} \end{cases}

where:

 j = \operatorname{argmax}_i \{u_i : i \le X_0 + k - 1,\ \Sigma_i = 0\}

(if multiple indices satisfy the function, we take the lowest index) and where ϑ is a constant designed to ensure that a (typically small) threshold must be exceeded before ρ_k is adjusted. In this way, in each step in the sequence the non-singleton cell with the highest value u_i (over the range i ≤ X_0 + k − 1, and subject to the threshold ϑ) will be identified, via the update to ρ_k, as the next cell to split. In each step of the sequence we also update u and Σ:

 u_{X_0+k} = \bar{u}_{X_0+k}, \qquad u_{\rho_k} = u_{\rho_k} - u_{X_0+k}, \qquad \Sigma_j = I\{|X_{j,k}| \le 1\} \ \text{for } 1 \le j \le X_0 + k - 1.

The reason we update u as shown above is that each time the operation is applied we thereby obtain an estimate of the visit frequency of X_{ρ_k,k}, which is the freshly updated value of u_{ρ_k}, and an estimate of the visit frequency of the cell X_{X_0+k,k}, which is ū_{X_0+k} (since X̄_{X_0+k} = X_{X_0+k,k} at step k). This is shown visually in Figure 1.

Once ρ has been generated, we implicitly have a new partition Ξ. The PASA algorithm is outlined in Algorithm 1. Note that the algorithm calls a procedure called Split, which is outlined in Algorithm 2. Algorithm 1 operates such that the cell splitting process (to generate each Ξ_k) occurs concurrently with the update to ρ, such that, as each element of ρ is updated, a corresponding temporary partition is constructed. Also note that the algorithm makes reference to objects L and U. To avoid storing each L_{j,k} and U_{j,k} for every stage k, we instead recursively update L and U such that L_j = L_{j,k} and U_j = U_{j,k} at the kth stage of the splitting process. A diagram illustrating the main steps is at Figure 1.
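Putting the pieces together, the ρ-update sequence can be sketched as below. This is our own compressed rendering, not Algorithms 1 and 2 themselves: cells are represented as inclusive index intervals, singleton detection stands in for Σ, and the subtraction bookkeeping on u follows the description above:

```python
def pasa_rho_update(rho, u_bar, base_cells, threshold=0.01):
    """Sketch of the rho-update sequence. At each stage k, the
    non-singleton cell with the highest estimated visit frequency
    replaces rho[k], provided the current choice is beaten by more
    than the threshold. u_bar holds the tracked visit frequency
    estimates; base_cells is the ordered base partition as inclusive
    index intervals."""
    X0 = len(base_cells)
    cells = list(base_cells)                       # working partition
    u = [u_bar[j] for j in range(X0)] + [0.0] * len(rho)
    for k in range(len(rho)):
        singleton = [L == U for (L, U) in cells]
        candidates = [i for i in range(len(cells)) if not singleton[i]]
        j = max(candidates, key=lambda i: u[i])    # best cell to split
        cur = rho[k]
        if singleton[cur] or u[cur] < u[j] - threshold:
            rho[k] = j
        # Split the chosen cell near its middle.
        L, U = cells[rho[k]]
        m = L + (U - L - 1) // 2
        cells[rho[k]] = (L, m)
        cells.append((m + 1, U))
        # Bookkeeping on u: the new cell's estimate comes from u_bar,
        # and is subtracted from its parent's estimate.
        u[X0 + k] = u_bar[X0 + k]
        u[rho[k]] -= u[X0 + k]
    return rho, cells
```

The threshold term means ρ only changes when a competing cell is clearly busier, which keeps the partition (and hence the SARSA architecture) stable between updates.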