Successor Features Support Model-based and Model-free Reinforcement Learning

01/31/2019 ∙ by Lucas Lehnert, et al. ∙ Brown University

One key challenge in reinforcement learning is the ability to generalize knowledge in control problems. While deep learning methods have been successfully combined with model-free reinforcement-learning algorithms, how to perform model-based reinforcement learning in the presence of approximation errors still remains an open problem. Using successor features, a feature representation that predicts a temporal constraint, this paper presents three contributions: First, it shows how learning successor features is equivalent to model-free learning. Then, it shows how successor features encode model reductions that compress the state space by creating state partitions of bisimilar states. Using this representation, an intelligent agent is guaranteed to accurately predict future reward outcomes, a key property of model-based reinforcement-learning algorithms. Lastly, it presents a loss objective and prediction error bounds showing that accurately predicting value functions and reward sequences is possible with an approximation of successor features. On finite control problems, we illustrate how minimizing this loss objective results in approximate bisimulations. The results presented in this paper provide a novel understanding of representations that can support model-free and model-based reinforcement learning.


1 Introduction

Research in reinforcement learning (RL) (Sutton and Barto, 1998; Kaelbling et al., 1996) devises algorithms for computing an action-selection strategy, also called a policy, that maximizes a reward objective in a control problem. In a control problem, also called an environment

, an agent uses its policy to choose an action at the current state to cause a transition between states and generate a reward, a single scalar number. One key assumption—Markovianness—states that rewards and transitions depend only on the current state but are otherwise assumed to be arbitrary. RL algorithms can be coarsely classified into model-free algorithms or model-based algorithms. Model-free RL algorithms arrive at an optimal policy through continued trial-and-error interactions with the control problem while simultaneously improving an intermediate policy 

(Watkins and Dayan, 1992; Rummery, 1995; Mnih et al., 2013)

, perhaps indirectly through value estimates. Model-based RL algorithms estimate the transition and reward function and use this learned model to compute the optimal policy 

(Sutton, 1990; Brafman and Tennenholtz, 2002). While algorithms such as Dyna (Sutton, 1990) also use model-free TD-learning, this paper considers a “strict” form of model-based RL where the optimal policy is computed explicitly from a learned model, rather than using a model to speed up or inform otherwise model-free learning. One shortcoming of model-based RL is that approximation errors of a single-step transition model compound for predictions over multiple time steps. At each time step, the transition model only outputs an approximate state, and this approximate state is then re-used to predict the state for the next time step. As a result, the predicted state sequence diverges from the true state sequence as the number of time steps increases (Talvitie, 2017). Further, a predicted state sequence may not contain actual states, rendering the prediction of future reward outcomes difficult.

Dayan (1993) presents the successor representation (SR), a representation that encodes each state in terms of the visitation frequencies over future states. The SR has also been extended to successor features (SFs) (Barreto et al., 2017; Lehnert et al., 2017)

by first encoding the state into a feature vector and then estimating SFs to predict visitation frequencies over future states. Because the SR encodes future state visitation frequencies, the SR also encodes information about the transition dynamics of a control problem and can be seen as an intermediate between model-based and model-free RL algorithms 

(Momennejad et al., 2017; Russek et al., 2017).

In this paper, we present a novel perspective on how SFs relate to model-free and model-based RL. First, we show that learning SFs is equivalent to learning value functions directly from trial-and-error interactions with the environment. Hence, learning SFs is akin to model-free RL. Then, we show that if SFs are used to learn a state representation, then this state representation encodes a compression of the transition and reward function, also called a model reduction (Givan et al., 2003). Rather than just representing one control policy that can be improved, model reductions, and thus SFs, encode a representation of the control task itself, a key property of model-based RL algorithms. This approach provides a novel perspective on model-based RL: Rather than finding approximations of the one-step transition and reward function, a state representation is learned that compresses the dynamics of a control task into a linear action model (Yao and Szepesvári, 2012). Because the state space is compressed using a model reduction, the linear action model is guaranteed to produce the same reward sequences as the original control problem given any arbitrary action sequence. If no approximation errors are present, the linear action model will simulate reward sequences in the same way as the original control problem. If approximation errors are present, the learned linear action model resembles a “softened” model reduction and produces reward sequences that approximately resemble reward sequences produced by the original control problem. For this case, we present prediction error bounds that are linear in the objective function used to train the model.

2 State Representations in Reinforcement Learning

A Markov decision process (MDP) is a tuple $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $p$, a reward function $r$, and a discount factor $\gamma \in [0, 1)$. For finite state and action spaces, the transition and reward functions can also be written in matrix or vector notation as a left-stochastic state-to-state transition matrix $P^a$ and an expected reward vector $r^a$ of dimension $|\mathcal{S}|$. Each entry of $r^a$ is set to $\mathbb{E}_{s'}[r(s, a, s')]$, where the expectation is computed over next states $s'$. This paper will consider both finite ("discrete") and uncountably infinite ("continuous") state spaces.

A policy $\pi(a \mid s)$ specifies the probabilities with which actions are selected at any given state (and $\sum_a \pi(a \mid s) = 1$). If a policy $\pi$ is used, the transition function and expected rewards generated by this policy are denoted by $P^\pi$ and $r^\pi$, respectively. The value function

$V^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s \right]$   (1)

predicts the expected discounted return, where the expectation is computed over all infinite-length trajectories that start in state $s$ and select actions according to the policy $\pi$. The value function can also be written as a vector $v^\pi$ of dimension $|\mathcal{S}|$. The action-conditional Q-function is defined similarly as

$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a \right]$   (2)
$\phantom{Q^\pi(s, a)} = r(s, a) + \gamma \mathbb{E}_{s'}\!\left[ \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \right],$   (3)

where Eq. (3) is the usual Bellman fixed point (Puterman, 1994). The expectation in Eq. (2) ranges over all infinite-length trajectories generated by the policy $\pi$ but that start at state $s$ with action $a$. The expected reward is denoted with $r(s, a) = \mathbb{E}_{s'}[r(s, a, s')]$. A basis function (Sutton, 1996; Konidaris et al., 2011) is a function mapping states or state–action pairs to a real-valued vector. Specifically, a state-conditional basis function $\phi: \mathcal{S} \to \mathbb{R}^n$ maps a state $s$ to a column vector $\phi(s)$, and a state–action-conditional basis function $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^m$ maps a state–action pair $(s, a)$ to a column vector $\phi(s, a)$. Basis functions transform the state or state–action space so that a certain objective can be represented effectively, for example, linear approximations of the Q-function with

$Q(s, a) \approx \theta^\top \phi(s, a).$   (4)

For finite action spaces, state-conditional basis functions can be used to construct a state–action-conditional basis function by constructing a feature vector of dimension $n|\mathcal{A}|$ (the dimension of the vector $\phi(s)$ is denoted with $n$), placing the vector $\phi(s)$ into the entries corresponding to action $a$, and setting all other entries to zero. Specifically, for each state $s$,

$\phi(s, a_1) = [\phi(s)^\top, 0^\top, \dots, 0^\top]^\top,$   (5)
$\phi(s, a_2) = [0^\top, \phi(s)^\top, \dots, 0^\top]^\top,$   (6)
$\quad\vdots$   (7)
$\phi(s, a_{|\mathcal{A}|}) = [0^\top, \dots, 0^\top, \phi(s)^\top]^\top.$   (8)
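
As a concrete illustration of this block construction, the following is a minimal sketch (not code from the paper; the helper name is made up):

```python
import numpy as np

def state_action_features(phi_s: np.ndarray, action: int, num_actions: int) -> np.ndarray:
    """Place phi(s) into the block of a length n*|A| vector that belongs to `action`."""
    n = phi_s.shape[0]
    phi_sa = np.zeros(n * num_actions)
    phi_sa[action * n:(action + 1) * n] = phi_s
    return phi_sa

# Example: a 3-dimensional state feature and two actions yield a 6-dimensional
# state-action feature whose non-zero block is selected by the action index.
phi_s = np.array([0.2, 0.5, 0.3])
print(state_action_features(phi_s, action=0, num_actions=2))  # [0.2 0.5 0.3 0.  0.  0. ]
print(state_action_features(phi_s, action=1, num_actions=2))  # [0.  0.  0.  0.2 0.5 0.3]
```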

Dayan (1993) introduced the successor representation (SR), a feature representation for finite state and action spaces that predicts visitation frequencies over future states. Suppose each state $s$ in a finite state space is written as a one-hot vector $e_s$ of dimension $|\mathcal{S}|$; then for any fixed policy $\pi$ the SR is defined as $\psi^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t e_{s_t} \,\middle|\, s_0 = s \right]$ (rather than writing the SR using an indicator function, we use one-hot bit vectors $e_s$, which is equivalent). In matrix notation, the SR for a particular policy $\pi$ can be written as

$\Psi^\pi = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t = (I - \gamma P^\pi)^{-1}.$   (9)

Intuitively, the matrix $\Psi^\pi$ describes the discounted visitation frequencies of all future states. A column of $\Psi^\pi$ then contains a marginal probability (over time steps) of reaching a specific state, where the number of time steps needed to reach a state follows a geometric distribution with parameter $\gamma$. Similar to value functions, the SR obeys a recursive identity

$\Psi^\pi = I + \gamma P^\pi \Psi^\pi,$   (10)

where $P^\pi$ has a dependency on the policy $\pi$ (Lehnert et al., 2017). Barreto et al. (2017) generalize the SR to successor features (SFs) by assuming that state–action pairs are represented with a basis function $\phi$. The SF at state $s$ for selecting action $a$ is defined as

$\psi^\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a \right],$   (11)

where the expectation ranges over all infinite-length trajectories generated by the policy $\pi$ but that start at state $s$ with action $a$. SFs and the SR are closely tied to value functions. If the expected rewards are written out in a vector

$w = \left[ r(s_1, a_1), \dots, r(s_{|\mathcal{S}|}, a_{|\mathcal{A}|}) \right]^\top,$   (12)

then the expected reward function can be parametrized using $w$ and the one-hot bit vector (state–action-conditioned) basis function with

$r(s, a) = \phi(s, a)^\top w.$   (13)

With this parametrization of the reward function, substituting Eq. (13) into Eq. (2) shows that SFs form an exact basis function for Q-values (Barreto et al., 2017):

$Q^\pi(s, a) = \psi^\pi(s, a)^\top w.$   (14)

A similar connection holds between state-conditional value functions and the SR.
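
As a small numerical illustration of Eq. (14) (a sketch using made-up random dynamics, not an experiment from the paper), the SF matrix of the one-hot state–action basis can be computed in closed form and compared against iterative policy evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_s, num_a, gamma = 4, 2, 0.9

# Made-up MDP: p[a, s, s'] are transition probabilities, r[s, a] expected rewards.
p = rng.random((num_a, num_s, num_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((num_s, num_a))
pi = rng.random((num_s, num_a))
pi /= pi.sum(axis=1, keepdims=True)            # a fixed policy pi(a|s)

# State-action transition matrix P[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s').
idx = lambda s, a: s * num_a + a
P = np.zeros((num_s * num_a, num_s * num_a))
for s in range(num_s):
    for a in range(num_a):
        for s2 in range(num_s):
            for a2 in range(num_a):
                P[idx(s, a), idx(s2, a2)] = p[a, s, s2] * pi[s2, a2]

# With one-hot state-action features, the SF matrix is the SR over (s, a) pairs:
# Psi = sum_t gamma^t P^t = (I - gamma * P)^{-1}   (cf. Eqs. (9)-(11)).
Psi = np.linalg.inv(np.eye(num_s * num_a) - gamma * P)

# Reward weights w hold the expected one-step rewards (Eqs. (12)-(13)),
# and Eq. (14) states that Q^pi = Psi @ w.
w = np.array([r[s, a] for s in range(num_s) for a in range(num_a)])
q_from_sf = Psi @ w

# Compare against iterative policy evaluation of Q^pi.
q_iter = np.zeros(num_s * num_a)
for _ in range(2000):
    q_iter = w + gamma * P @ q_iter
print(np.allclose(q_from_sf, q_iter, atol=1e-6))  # True
```

Here the SF matrix is exactly the SR over state–action pairs, so Eq. (14) reduces to ordinary policy evaluation.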

3 Model-Free Learning

Because SFs are linear in the value function, algorithms that learn SFs can be derived similarly to linear Q-learning (Sutton and Barto, 1998, Chapter 8.4)

. In linear Q-learning, stochastic gradient descent is used to optimize the Mean Squared Value Error

$\mathcal{L}_Q(\theta) = \mathbb{E}\left[ \left( y_Q - \theta^\top \phi(s, a) \right)^2 \right],$   (15)

where $y_Q = r + \gamma \max_{a'} \theta^\top \phi(s', a')$ and $r = r(s, a, s')$. The expectation in Eq. (15) is computed with respect to some distribution with which transitions are sampled. When computing a gradient of $\mathcal{L}_Q$, the target $y_Q$ is considered a constant. For each transition $(s, a, r, s')$, linear Q-learning performs the update rule

$\theta \leftarrow \theta + \alpha_Q \left( y_Q - \theta^\top \phi(s, a) \right) \phi(s, a),$   (16)

where $\alpha_Q$ is a learning rate. The term $y_Q - \theta^\top \phi(s, a)$ is called the TD-error. Similar to Lehnert et al. (2017), an SF-learning algorithm can be derived by defining the Mean Squared SF Error

$\mathcal{L}_{\mathrm{SF}}(F) = \mathbb{E}\left[ \left\| y_{\mathrm{SF}} - \psi(s, a) \right\|_2^2 \right].$   (17)

Because the SF is a vector of dimension $n|\mathcal{A}|$, the target

$y_{\mathrm{SF}} = \phi(s, a) + \gamma \psi(s', a^*)$   (18)

is also a vector of dimension $n|\mathcal{A}|$. The action selected at the next time step is computed greedily with respect to the Q-value estimate:

$a^* = \arg\max_{a'} \psi(s', a')^\top w,$   (19)

where the weight vector $w$ is the reward-model parameter vector from Eq. (13). We make the assumption that SFs are approximated linearly using the basis function $\phi$ and that

$\psi(s, a) = F^\top \phi(s, a),$   (20)

where $F$ is a square matrix. Computing the gradient of $\mathcal{L}_{\mathrm{SF}}$ with respect to $F$ results in an update rule similar to linear Q-learning:

$F \leftarrow F + \alpha_{\mathrm{SF}} \, \phi(s, a) \left( \phi(s, a) + \gamma F^\top \phi(s', a^*) - F^\top \phi(s, a) \right)^\top.$   (21)

We call the error term $\phi(s, a) + \gamma F^\top \phi(s', a^*) - F^\top \phi(s, a)$ the SF-error. The update in Eq. (21) is similar to the TD update rules presented by Dayan (1993), Barreto et al. (2017), and Lehnert et al. (2017). Similarly, the reward model is optimized using the loss objective $\mathbb{E}\left[ \left( r - \phi(s, a)^\top w \right)^2 \right]$, resulting in the update rule

$w \leftarrow w + \alpha_r \left( r - \phi(s, a)^\top w \right) \phi(s, a).$   (22)

Algorithm 1 summarizes the algorithm, which we refer to as SF-learning.

1:  loop
2:     Collect transition $(s, a, r, s')$ using control policy $\pi_c$.
3:     $a^* \leftarrow \arg\max_{a'} \left( F^\top \phi(s', a') \right)^\top w$
4:     $F \leftarrow F + \alpha_{\mathrm{SF}} \, \phi(s, a) \left( \phi(s, a) + \gamma F^\top \phi(s', a^*) - F^\top \phi(s, a) \right)^\top$
5:     $w \leftarrow w + \alpha_r \left( r - \phi(s, a)^\top w \right) \phi(s, a)$
6:  end loop
Algorithm 1 SF-learning
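
The following is a minimal sketch of a single SF-learning iteration for the linear parametrization above, assuming the update rules as given in Eqs. (19), (21), and (22); the function and variable names are this sketch's own:

```python
import numpy as np

def phi_sa(phi_s, a, num_actions):
    """One-hot block state-action features (cf. Eqs. (5)-(8))."""
    n = phi_s.shape[0]
    out = np.zeros(n * num_actions)
    out[a * n:(a + 1) * n] = phi_s
    return out

def sf_learning_step(F, w, phi_s, a, reward, phi_s_next, num_actions,
                     gamma=0.9, alpha_sf=0.1, alpha_r=0.1):
    """One SF-learning update for a single transition (s, a, r, s')."""
    x = phi_sa(phi_s, a, num_actions)
    # Greedy next action with respect to the SF-based Q-estimate (Eq. (19)).
    q_next = [(F.T @ phi_sa(phi_s_next, b, num_actions)) @ w for b in range(num_actions)]
    a_star = int(np.argmax(q_next))
    x_next = phi_sa(phi_s_next, a_star, num_actions)
    # SF-error and update of the square matrix F (Eq. (21)).
    sf_error = x + gamma * (F.T @ x_next) - F.T @ x
    F = F + alpha_sf * np.outer(x, sf_error)
    # Reward-model update (Eq. (22)).
    w = w + alpha_r * (reward - x @ w) * x
    return F, w, a_star   # a control policy (e.g. epsilon-greedy) can act on a_star
```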

Both linear Q-learning and SF-learning are off-policy learning algorithms, because a control policy $\pi_c$ is used to select actions while the value function of a different, optimal policy is learned. Suppose $\pi_c$ is a function of the state $s$, the Q-value estimates at state $s$, and the time step $t$. For example, $\epsilon$-greedy exploration represents such a policy, because with probability $1 - \epsilon$ the action greedy with respect to the current Q-value estimates is selected. Further, the parameter $\epsilon$ could be annealed to zero as the time step increases. For this class of control policies, linear Q-learning and SF-learning produce the same sequence of value functions, assuming that SF-learning is provided with an optimal reward model.

Theorem 1 (SF-learning and Q-learning Equivalence).

Consider an MDP $M$ and a basis function $\phi$ such that, for some weight vector $w$,

$\forall (s, a): \quad r(s, a) = \phi(s, a)^\top w.$   (23)

Suppose linear Q-learning and SF-learning are run in $M$ starting at state $s$ using a control policy $\pi_c$. If $\theta_1 = F_1 w$, the reward model of SF-learning is initialized to $w$, and both algorithms use the same learning rate, then for every time step $t$,

$\theta_t = F_t w.$   (24)

As a result, linear Q-learning and SF-learning will produce the same Q-value function sequence and trajectory.

Appendix A presents a formal proof of Theorem 1. Note that this result also applies to fixed control policies that do not change with the time step or the current value estimates—the policy can depend on a state $s$, Q-values, and the time step $t$, but does not have to. If both algorithms produce the same sequence of value functions up to time step $t$, then they will choose the same actions with equal probabilities at the next time step. If SF-learning is not initialized with the correct reward weight vector and is required to learn a reward model, then the two algorithms will not produce identical value function sequences and behaviour. Theorem 1 points out the key distinction between SF-learning and Q-learning: Q-learning samples the reward function by incorporating the reward of each transition directly into its Q-value estimates, while SF-learning first builds a reward model. However, if the reward model is initialized to the correct “ground truth” reward function, SF-learning becomes identical to Q-learning and, most importantly, searches for the optimal policy in the same fashion. In this light, learning SFs is akin to model-free RL.

4 Successor Features Encode Model-based Representations

The previous section shows how learning SFs is closely related to learning value functions from temporal difference errors, given a fixed basis function. In this section, the learning problem is changed by including the basis function in the set of parameters the agent aims to learn. Specifically, we consider state-conditional basis functions that can produce accurate SFs and one-step reward predictions. We will show that these two criteria are sufficient to characterize and learn a feature representation that resembles a model reduction. Section 4.1 reviews model reductions and Section 4.2 presents how SFs can be related to model reductions. Section 4.3 shows how model reductions can be approximated by learning SFs, and Section 4.4 presents two experiments on finite state–action MDPs to illustrate how model reductions can be approximated using gradient-based optimization techniques.

4.1 Model Reductions

[Figure 1(a): a three-column grid world with 30 rows, shown together with its compressed three-state MDP.]
(a) Grid World
(b) Optimization Objective
(c) Iteration 0
(d) Iteration 300
(e) Iteration 700
(f) Iteration 1000
Figure 1: Model features in the column grid world example (similar to “upworld” (Abel et al., 2017)). The agent can move up, down, left, or right, and will always receive a +1 reward for selecting an action in the red column, and a zero reward otherwise. Model reductions collapse each column into a single state (right grid in 1(a)). This three-state MDP captures all relevant dynamics: the +1 reward state is distinct from the remaining states, which describe the distance to the positive reward state. The row index is not needed for evaluating an arbitrary policy or for predicting reward sequences. Our goal is to optimize an initially random feature representation (Figure 1(c)) so that bisimilar states are assigned approximately the same feature vector (Figure 1(f)). We accomplish this by using a SF model to construct a loss objective and minimizing this objective using gradient descent (Figure 1(b)). The dots in Figure 1(b) correspond to the different feature representations shown in Figures 1(c) through 1(f) obtained during training.

A model reduction (Givan et al., 2003) is a partitioning of the state space such that two states of the same partition are equivalent in terms of their one-step rewards as well as the states reachable within one time step. The example grid world shown in Figure 1 illustrates the intuition behind model reductions. In this MDP, each column forms a state partition because two criteria are satisfied:

  1. the one-step rewards are the same, and

  2. for two states of the same partition, the distributions over next state partitions are identical.

The compressed MDP retains all information necessary to predict future reward outcomes, because the columns describe the distance, in terms of time steps, to the +1 reward. For example, suppose a model-based agent is in the left (blue) column and wants to predict the reward sequence when moving right twice. If the state space is partitioned as shown in Figure 1(a), then the agent can predict the correct reward sequence using only state clusters. If the blue and green columns were merged into one partition, then the agent would not be able to predict the correct reward sequence. In that case, the state space is compressed into two partitions and the information about the distance to the right column is lost. Consequently, an agent placed into the left column and moving right could not distinguish whether the reward occurs after one or two time steps using only the transitions and rewards between state partitions. States that are merged into the same partition by a model reduction are called behaviourally equivalent or bisimilar.

Bisimilarity can be formally defined as an equivalence relation between different states, which induces a partitioning of the state space. The set of state partitions are formally defined as follows.

Definition 1 (Quotient/Partition Set).

Let $X$ be a set and $\sim$ an equivalence relation defined on $X$. Then, the quotient or partition set is

$X/\!\sim \; = \left\{ [x]_\sim \,\middle|\, x \in X \right\}, \quad \text{where } [x]_\sim = \left\{ x' \in X \,\middle|\, x \sim x' \right\}.$   (25)

Each element $B \in \mathcal{S}/\!\sim$ is a state partition, a subset of the state space, and $\mathcal{S}/\!\sim$ is a set of partitions. Bisimilarity can be formally defined as follows.

Definition 2 (Bisimilarity).

For an MDP $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, where $p(\cdot \mid s, a)$ is either a probability distribution over states if $\mathcal{S}$ is discrete or a density function if $\mathcal{S}$ is continuous, an equivalence relation $\sim$ on $\mathcal{S}$ is a bisimulation if,

$s \sim s' \iff \forall a \in \mathcal{A}: \; r(s, a) = r(s', a) \;\text{ and }\; \forall B \in \mathcal{S}/\!\sim: \; P(B \mid s, a) = P(B \mid s', a),$   (26)

where for discrete state spaces $\mathcal{S}$,

$P(B \mid s, a) = \sum_{s'' \in B} p(s'' \mid s, a),$   (27)

and for continuous state spaces $\mathcal{S}$,

$P(B \mid s, a) = \int_{B} p(s'' \mid s, a) \, ds''.$   (28)

The condition (26) states that two states can only be bisimilar if they produce the same one-step rewards and if they transition to the same partitions with the same probability. Note that this condition on the transition dynamics is recursive: Two bisimilar states $s$ and $s'$ have to transition with equal probability to a partition $B$ that itself can only consist of bisimilar states. For discrete state spaces, Definition 2 is identical to the definition presented by Li et al. (2006). For continuous state spaces, Eq. (28) generalizes the same idea by integrating the transition function over the state partition $B$.
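
To make Definition 2 concrete for the discrete case, the following sketch (not an algorithm from the paper) computes the coarsest bisimulation of a small finite MDP by repeatedly splitting blocks of states that disagree on one-step rewards or on the probability mass they send into the current blocks; on a miniature two-row version of the column world from Figure 1 it recovers one partition per column:

```python
import numpy as np

def bisimulation_partition(p, r):
    """Coarsest bisimulation of a finite MDP with p[a, s, s'] and r[s, a]."""
    num_a, num_s, _ = p.shape
    # Start by grouping states with identical one-step reward vectors.
    labels, block = {}, []
    for s in range(num_s):
        block.append(labels.setdefault(tuple(np.round(r[s], 8)), len(labels)))
    while True:
        num_blocks = max(block) + 1
        sig, new_block = {}, []
        for s in range(num_s):
            # Signature: rewards plus probability mass sent into every current block.
            mass = tuple(round(sum(p[a, s, s2] for s2 in range(num_s) if block[s2] == b), 8)
                         for a in range(num_a) for b in range(num_blocks))
            new_block.append(sig.setdefault((tuple(np.round(r[s], 8)), mass), len(sig)))
        if new_block == block:
            return block
        block = new_block

# Miniature column world: 2 rows x 3 columns, deterministic left/right moves,
# +1 reward for selecting any action in the rightmost column.
num_rows, num_cols, num_a = 2, 3, 2
num_s = num_rows * num_cols
p = np.zeros((num_a, num_s, num_s))
r = np.zeros((num_s, num_a))
for row in range(num_rows):
    for col in range(num_cols):
        s = row * num_cols + col
        r[s, :] = 1.0 if col == num_cols - 1 else 0.0
        for a, d in enumerate((-1, +1)):
            p[a, s, row * num_cols + min(max(col + d, 0), num_cols - 1)] = 1.0

print(bisimulation_partition(p, r))  # [0, 1, 2, 0, 1, 2] -- one partition per column
```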

4.2 Connection to Successor Features

As discussed before, a state-conditional basis function maps states into some feature space and thus establishes a many-to-one relation between states and features. Model reductions can be encoded through a basis function $\phi$ by constraining $\phi$ such that two states can only be assigned the same feature vector if they are bisimilar. For example, the feature representation could be constructed by partitioning the state space into $n$ partitions and then assigning each partition a unique $n$-dimensional one-hot bit vector. We introduce model features, which assign identical feature vectors to states that are bisimilar.

Definition 3 (Model Features).

A feature representation $\phi$ is a model feature if and only if, for a bisimulation relation $\sim$,

$\forall s, s' \in \mathcal{S}: \quad \phi(s) = \phi(s') \implies s \sim s'.$   (29)

Because of their relationship to bisimulation, model features are designed to be rich enough to allow the prediction of future reward outcomes while removing any other information from the state space. Note that, in contrast to Abel et al. (2018), the work presented in this paper does not attempt to compute a maximal compression of the state space. Hence, if states are encoded as one-hot bit vectors, then the identity map provides a trivial model feature for finite state spaces (each partition consists of a singleton set).

Suppose $\phi: \mathcal{S} \to \{e_1, \dots, e_n\}$ is a feature representation, where $e_i$ is a one-hot bit vector of dimension $n$ with the $i$th entry being set to 1. In that case, the representation $\phi$ partitions the state space by assigning the same one-hot bit vector to states of the same partition. Implicitly, the function $\phi$ encodes an equivalence relation $\sim_\phi$ such that

$s \sim_\phi s' \iff \phi(s) = \phi(s').$   (30)

Definition 2 states that a bisimulation relation can only relate states with equal one-step expected rewards. Suppose $\phi$ is such that, for every state $s$ and action $a$,

$r(s, a) = \phi(s)^\top w_a,$   (31)

then each entry of the vector $w_a$ contains the expected reward value associated with a particular state partition and action $a$, because $\phi(s)$ is a one-hot bit vector. Thus, for any two states $s$ and $s'$ with $\phi(s) = \phi(s')$, we have that

$r(s, a) = \phi(s)^\top w_a = \phi(s')^\top w_a = r(s', a),$   (32)

because both $s$ and $s'$ are assigned the same feature vector. If $\phi$ satisfies Eq. (31), then the resulting equivalence relation $\sim_\phi$ satisfies the reward condition in Definition 2.

SFs encode the condition on the transition dynamics of bisimulation relations. For two states $s$ and $s'$ to be bisimilar, they have to transition to the same state partitions with equal probability. Repeating from Definition 2, the state space needs to be partitioned such that, for any two such states $s$ and $s'$, every action $a$, and every partition $B$,

$P(B \mid s, a) = P(B \mid s', a).$   (33)

The probability $P(B \mid s, a)$ describes the marginal probability of transitioning from a state $s$ into any state that lies in the state partition $B$. To relate these probabilities to SFs, we first observe that the probability of transitioning from partition $B$ to partition $B'$ can be written as the marginal

$P(B' \mid B, a) = \int_{B} P(B' \mid s, a) \, \omega(s) \, ds,$   (34)

where $\omega(s)$ is the probability of encountering a state $s$ in the state partition $B$ (for discrete state spaces, the integral is replaced with a sum). The distribution $\omega$ can be understood as a visitation distribution over states generated by a particular policy. This paper refers to $\omega$ as a weighting function.

Definition 4 (Weighting Function).

For an MDP $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, let $\sim$ be an equivalence relation on the state space $\mathcal{S}$. If $\mathcal{S}$ is discrete, then a weighting function $\omega: \mathcal{S} \to [0, 1]$ is defined such that, for every partition $B \in \mathcal{S}/\!\sim$,

$\sum_{s \in B} \omega(s) = 1.$   (35)

If $\mathcal{S}$ is continuous, then a weighting function $\omega: \mathcal{S} \to [0, \infty)$ is defined such that, for every partition $B \in \mathcal{S}/\!\sim$,

$\int_{B} \omega(s) \, ds = 1.$   (36)
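
For the discrete case, the marginalization in Eq. (34) can be illustrated with the following toy sketch (not code from the paper); because the column partition of the miniature grid world below is a bisimulation, the partition-to-partition probability comes out the same for any weighting function over the start partition:

```python
import numpy as np

def state_to_partition_prob(p, block, s, a, b_to):
    """P(B' | s, a): probability of landing in partition b_to when taking a in s."""
    return sum(p[a, s, s2] for s2 in range(p.shape[1]) if block[s2] == b_to)

def partition_to_partition_prob(p, block, omega, b_from, a, b_to):
    """Eq. (34), discrete case: weight the state-to-partition probabilities by omega."""
    states = [s for s in range(p.shape[1]) if block[s] == b_from]
    return sum(omega[s] * state_to_partition_prob(p, block, s, a, b_to) for s in states)

# Miniature column world (2 rows x 3 columns), actions 0=left, 1=right; moves
# succeed with probability 0.8 and the agent stays in place otherwise.
num_rows, num_cols, num_a = 2, 3, 2
num_s = num_rows * num_cols
p = np.zeros((num_a, num_s, num_s))
for row in range(num_rows):
    for col in range(num_cols):
        s = row * num_cols + col
        for a, d in enumerate((-1, +1)):
            p[a, s, row * num_cols + min(max(col + d, 0), num_cols - 1)] += 0.8
            p[a, s, s] += 0.2
block = [0, 1, 2, 0, 1, 2]        # one partition per column (a bisimulation)

# Two different weighting functions over the middle-column partition give the
# same probability of moving right into the last column.
for omega in ({1: 0.5, 4: 0.5}, {1: 0.9, 4: 0.1}):
    print(partition_to_partition_prob(p, block, omega, b_from=1, a=1, b_to=2))
    # 0.8 in both cases (up to floating-point rounding)
```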


Figure 2: Weighting functions determine partition-to-partition transition probabilities. Suppose the state space is a bounded interval in $\mathbb{R}$ and a feature representation clusters states into four partitions as shown above. Each partition is labelled with one feature vector from the set $\{e_1, \dots, e_4\}$. The schematic plots the density function over next states when selecting action $a$ at state $s$ (blue area) and the density function over the start-state partition (orange area). Integrating the blue density over a partition gives the probability of transitioning from $s$ into that partition; additionally weighting the start states within a partition gives the partition-to-partition transition probability. If a transition between states is mapped to a transition between feature vectors, then the probability with which this transition in feature space occurs depends on the probability with which states are sampled from the start partition. The weighting function models this probability distribution.

Figure 2 shows a schematic of how the different probability distributions relate to one another. Suppose we define the SF

(37)

then each row contains the SF associated with all states belonging to the same partition. The right-hand side of the fixed-point equation in Eq. (37) computes the expected SF across all actions with respect to a policy $\pi$:

(38)

where the policy matrix has the action-selection probabilities of $\pi$ along its diagonal. This dependency of the SF representation on $\pi$ will be discussed at the end of this section.

Because $\phi(s)$ is a one-hot bit vector, we can construct a partition-to-partition transition matrix such that

(39)

Leveraging the property that $\phi(s)$ is a one-hot bit vector, this matrix encodes the visitation frequencies over state partitions, similar to how the SR matrix encodes the visitation frequencies over states. This property stems from the fact that the matrix is a function of the feature representation $\phi$. Previously we have shown that the SR satisfies a recursive identity (see also Eq. (10)), and similarly we can write for each matrix

(40)

Interestingly, because the matrices are used to compute a SF with respect to the transition dynamics of the MDP, the feature representation is constructed such that

(41)

Eq. (41) implies that we could pick, for any state $s$, a weighting function that is a Dirac delta function centered at $s$, and

(42)

This means that two states $s$ and $s'$ of the same state partition will transition to each state partition with equal probability. To see why the partition-to-partition transition probabilities become independent of the weighting function, we observe that the SF describes the visitation frequencies over future state partitions. If two states $s$ and $s'$ are assigned to the same partition, then their SFs agree for every action. That is, both states $s$ and $s'$ must have identical future state-partition visitation frequencies and thus identical state-to-partition transition probabilities. Because these state-to-partition transition probabilities are equal for all states of the same partition, the integral in Eq. (42) evaluates to the same value for any weighting function. The following lemma proves lines (41) and (42) formally.

Lemma 1.

For an MDP $M$, let $\phi: \mathcal{S} \to \{e_1, \dots, e_n\}$ be a feature representation, where $e_i$ is a one-hot bit vector of dimension $n$ with entry $i$ being set to 1. If

(43)

then

  1. the partition-to-partition transition probabilities can be written, for each action, using a left-stochastic transition matrix, and

  2. these transition probabilities do not depend on the choice of weighting function; any two arbitrary weighting functions yield the same probabilities.

Appendix A presents a formal proof of Lemma 1. Using this lemma and the previously presented arguments, we can prove the main theorem which shows that SFs and a one-step reward predictor are sufficient to construct model features and thus model reductions.

Theorem 2.

For an MDP $M$, let $\phi: \mathcal{S} \to \{e_1, \dots, e_n\}$ be a feature representation, where $e_i$ is a one-hot bit vector of dimension $n$ with the $i$th entry being set to 1. If the feature representation satisfies, for some exploratory policy $\pi$,

(44)

then $\phi$ is a model feature and any two states $s$ and $s'$ are bisimilar if $\phi(s) = \phi(s')$.

A formal proof of Theorem 2 is listed in Appendix A. The SF

(45)

is action-conditional, meaning the SF depends on the policy $\pi$ only after the first time step, because the expectation in Eq. (45) ranges over all trajectories that start at state $s$ with action $a$; the first action $a$ is not selected according to $\pi$. This dependency of the first time step on an action is sufficient to encode the state-to-partition transition probabilities in the per-action transition matrices. Because the feature representation is conditioned only on the state and is constrained to predict action-conditional SFs, it is independent of the policy $\pi$. In fact, any exploratory policy can be used to construct SFs and model features. By Lemma 1, the partition-to-partition transition probabilities can be computed analytically with (the proof of Lemma 1 in Appendix A shows formally why the required inverse exists)

(46)

If a one-hot bit vector feature representation can be constructed satisfying the conditions outlined in Theorem 2, then in principle any policy could be used to construct model features. However, if model features are only approximated, then a dependency on the policy $\pi$ becomes visible to some extent. The following sections explore how model features can be approximated and illustrate that, in the presence of approximation errors, a model-feature approximation produces weaker predictions for policies that are increasingly dissimilar to the policy used to construct the SFs.

4.3 Approximate Model Reductions

This section generalizes the previously discussed model to arbitrary basis functions and shows how model features can be approximated. Rather than reasoning about approximate bisimulations in the form of bisimulation metrics (Ferns et al., 2004, 2011), we consider two key properties to characterize approximations of model features:

  1. the ability to predict reward sequences given any arbitrary start state and action sequence, and

  2. the ability to predict the value function of any arbitrary policy.

We will present our results for arbitrary state spaces and introduce two small prediction-error quantities we desire to minimize, one for the one-step rewards and one for the SFs:

(47)
and (48)

where the following assumption is made about the transition matrices $F_a$:

Assumption 1.

Assume that for every action $a$, there exists a matrix $F_a$ of bounded norm such that feature-to-feature transitions are predicted linearly by $F_a$.

In contrast to the one-hot bit vector model discussed previously, Assumption 1 is required because, for an arbitrary feature representation, it may not be possible to predict the feature-to-feature transitions linearly. While the following results are stated for finite action spaces, they can also be extended to arbitrary action spaces by assuming functions of the action rather than a finite set of reward vectors and transition matrices.

4.3.1 Rollout Predictions

Given a start state $s$, the expected future reward after selecting actions according to an action sequence $a_1, \dots, a_t$ can be approximated with

(49)

where the expectation is over all possible $t$-step trajectories starting in $s$ and following the given action sequence $a_1, \dots, a_t$.
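
A minimal sketch of such a rollout prediction, assuming (in the spirit of Assumption 1) that a learned matrix per action propagates the expected feature vector one step forward and a learned weight vector per action predicts the one-step expected reward; all inputs are hypothetical learned quantities:

```python
import numpy as np

def predict_reward_rollout(phi_s, action_seq, F, w):
    """Predict the expected reward at each step of an action sequence.

    phi_s      -- feature vector of the start state
    action_seq -- actions a_1, ..., a_T
    F[a]       -- matrix propagating the expected feature vector for action a
    w[a]       -- weights such that the one-step reward r(s, a) ~ w[a] @ phi(s)
    """
    rewards, feat = [], np.array(phi_s, dtype=float)
    for a in action_seq:
        rewards.append(w[a] @ feat)    # reward predicted from the current feature
        feat = F[a].T @ feat           # expected feature after taking action a
    return np.array(rewards)
```

The rollout stays entirely in the learned feature space; no states are reconstructed along the way.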

Theorem 3 (Rollout Bound).

Consider an MDP $M$ and assume that the reward and SF prediction errors are defined as in Eq. (47) and Eq. (48); then, for every start state, action sequence, and rollout length $t$,

(50)

If $t = 1$, then the bound in Eq. (50) is equal to the one-step reward prediction error, because the transition model is not used to predict one-step rewards and cannot influence the prediction error. SFs predict visitation frequencies over future transitions and are thus discounted multi-step models. Because a multi-step model is used for multi-step predictions, the second term in Eq. (50), which describes the prediction error induced by approximation errors in the transition model, is linear in the SF prediction error and the rollout length $t$.

This result stands in contrast to the usual approach of approximating a one-step transition model. The drawback of directly approximating transition functions is that approximation errors compound when making multi-step predictions, because predictions about the state at the next time step are made given only an approximation of the state at the current time step. Hence, existing error bounds are exponential in the rollout length (Asadi et al., 2018, Theorem 1; Talvitie, 2017, Lemma 2). This dependency does not appear in Theorem 3 because the SF prediction error directly upper bounds the error of a discounted multi-step transition model. Our model is free to construct any feature space that produces low prediction errors. Hence, if the SF prediction error is low enough, Theorem 3 shows that the learned feature space can support predictions about multi-step transitions.

4.3.2 Value Predictions

The following theorem states that an approximation of model features can also be used to represent the Q-value function of every action. Because value functions are a discounted sum of expected future rewards, the suitability of model features for Q-value prediction follows from the previous discussion.

Theorem 4 (Approximate Model Features).

Consider an MDP $M$ and assume that the reward and SF prediction errors are defined as in Eq. (47) and Eq. (48); then

(51)

The value error bound in Eq. (51) is also linear in the SF prediction error and, similarly to the rollout bound in Eq. (50), this linear dependency stems from the fact that the SF error bounds the prediction error of a discounted multi-step model. Note that if approximation errors are very high, then both error bounds in Theorem 3 and Theorem 4 can grow to a value larger than the reward or possible value range of the MDP at hand.

4.4 Learning Model Features

Using the previously presented results, model features, and thus model reductions, can be approximated via (stochastic) gradient descent on the loss objective

(52)

where the expectation ranges over all possible transitions that are sampled as training data. For all experiments, we assume that the policy $\pi$ selects actions uniformly at random. The gradient of the loss is computed with respect to all parameters, and the feature representation $\phi$ could be any arbitrary function parametrized by some weight vector. The experiments presented in this section focus on finite state and action spaces; hence

$\phi(s) = \Phi^\top e_s,$   (53)

where $e_s$ is a one-hot bit vector representation of the state $s$, and $\Phi$ is a real-valued matrix. For stochastic gradient descent, a transition data set is collected and each gradient update is computed by sampling a (sub)set of transitions from the entire transition data set.
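
Because the exact loss in Eq. (52) is not reproduced here, the following is only a sketch consistent with the description above: per transition it combines a one-step reward error with a SF error whose bootstrapped target is held fixed (cf. Eq. (56)), and it takes a plain gradient step on tabular parameters $\Phi$, $F_a$, and $w_a$ (all names are this sketch's own):

```python
import numpy as np

def model_feature_update(Phi, F, w, s, a, reward, s_next, a_next, gamma=0.9, lr=0.05):
    """One stochastic gradient step on a per-transition loss of the form
    (r - w_a . phi(s))^2 + || y - F_a^T phi(s) ||^2, where the SF target
    y = phi(s) + gamma * F_{a'}^T phi(s') is held fixed (no gradient through it).

    Phi -- (num_states x n) feature matrix, F -- dict of (n x n) matrices,
    w   -- dict of n-vectors; s and s_next index rows of Phi."""
    x, x_next = Phi[s], Phi[s_next]
    y = x + gamma * (F[a_next].T @ x_next)       # bootstrapped SF target (constant)
    r_err = reward - w[a] @ x                    # one-step reward error
    sf_err = y - F[a].T @ x                      # SF prediction error
    grad_phi = r_err * w[a] + F[a] @ sf_err      # descent direction for phi(s); factor 2 folded into lr
    w[a] += lr * r_err * x
    F[a] += lr * np.outer(x, sf_err)
    Phi[s] += lr * grad_phi
    return Phi, F, w
```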

To assess how well a learned feature representation can be used to predict the value function of an arbitrary policy, we fix a particular policy $\pi$ that is defined on the state space and record the value error for each such policy. Note that for estimating the value function, only the feature transition and reward models $F_a$ and $w_a$ are used. The state-conditional value function can be expressed as a vector

(54)

Algorithm 2 outlines how this value vector is computed, which is similar to performing value iteration but with a fixed policy defined on the original state space $\mathcal{S}$. Line 4 in Algorithm 2 first computes the value function defined on the original state space, because each entry of the resulting vector corresponds to the value of some state $s$. The pseudo-inverse of the feature matrix $\Phi$ is then used to solve the resulting linear system for the feature-space value weights. In all our experiments we found that the left Moore-Penrose pseudo-inverse of $\Phi$ exists, because the rows of the matrix span all $n$ dimensions, where $n$ is the dimension of the learned feature representation. The following sections will present empirical results on two example MDPs and discuss why the row space of $\Phi$ spans all $n$ dimensions.

1:  Given $\Phi$, $\{F_a\}_a$, $\{w_a\}_a$, and policy matrices $\{\Pi_a\}_a$.
2:  repeat
3:     $v \leftarrow \sum_a \Pi_a \left( \Phi w_a + \gamma \Phi F_a u \right)$
4:     $u \leftarrow \Phi^{+} v$   {$\Phi^{+}$ is the pseudo-inverse}
5:  until $v$ converges
Algorithm 2 Feature Policy Evaluation
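
A sketch of Algorithm 2 under the reconstruction above (the variable names and the exact iteration are this sketch's reading of the procedure, not verbatim from the paper): a feature-space value vector is repeatedly pushed through the feature model to obtain per-state values, which are mapped back to feature space with the pseudo-inverse of $\Phi$:

```python
import numpy as np

def feature_policy_evaluation(Phi, F, w, pi, gamma=0.9, tol=1e-8, max_iter=10000):
    """Evaluate a policy pi defined on the original states using only the feature model.

    Phi -- (num_states x n) feature matrix, F[a] -- (n x n), w[a] -- (n,),
    pi  -- (num_states x num_actions) action-selection probabilities."""
    num_s, n = Phi.shape
    Phi_pinv = np.linalg.pinv(Phi)
    u = np.zeros(n)                     # feature-space value weights
    for _ in range(max_iter):
        # v(s) = sum_a pi(a|s) * [ w_a . phi(s) + gamma * u . (F_a^T phi(s)) ]
        v = np.zeros(num_s)
        for a in range(pi.shape[1]):
            v += pi[:, a] * (Phi @ w[a] + gamma * (Phi @ F[a]) @ u)
        u_new = Phi_pinv @ v            # solve Phi u = v in the least-squares sense
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return v, u                         # per-state values and feature-space weights
```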

4.4.1 Column Grid World

We tested our implementation on the column world shown in Figure 1. In column world, transitions are deterministic and the gradient is computed for each transition individually. Because the feature representation is linear in a one-hot state representation (Eq. (53)), the feature matrix $\Phi$ is obtained by minimizing the loss objective

(55)

The normalization factor appears in Eq. (55) because the L2 matrix norm sums the L2 norms of the rows rather than averaging over all rows. Optimizing the loss objective in Eq. (55) with a gradient optimizer is equivalent to performing batch gradient descent on the loss objective in Eq. (52).

Figure 1(c) shows the initial feature representation, which was sampled uniformly at random. Figures 1(d), 1(e), and 1(f) show how the feature representation evolves as the loss objective is optimized. One can observe that after 1000 gradient steps, the learned feature representation assigns approximately the same feature vector to bisimilar states. In that sense, the learned feature representation is an approximation of a model reduction. Figure 3 plots the value error for a range of $\epsilon$-greedy policies at each gradient iteration. Each $\epsilon$-greedy policy selects actions uniformly at random with probability $\epsilon$ and otherwise selects the optimal action. Given the discount factor setting, the value function can only span a bounded interval. One can observe that after around 400 iterations the value error of each tested policy is comparably small. During training, uniform random action selection ($\epsilon = 1$) has the lowest prediction error, and prediction errors increase as $\epsilon$ tends to zero and the policy becomes more similar to the optimal policy. This behaviour is expected because the SF representation was trained for uniform random action selection and approximation errors limit the learned representation's ability to generalize to different policies. In this experiment, we found that Assumption 1 may not hold during training; however, the value error was comparably small and Algorithm 2 always converged to a solution. Appendix B lists all hyper-parameters needed to reproduce this experiment.

Figure 3: Column world value error for different $\epsilon$-greedy policies.

Because we use a linear action model, feature-to-feature transitions are modelled with a matrix multiplication. If the feature representation maps states to one-hot bit vectors, then any particular transition function can be represented with stochastic transition matrices. Hence, if the learned cluster centers in Figure 1(f) were all to fall on one-hot bit vectors, then this solution would have a zero loss value. Because the solution learned for column world is approximate, the feature centers in Figure 1(f) only form a linearly independent set and the learned feature vectors do not fall exactly on the cluster centers. However, the number of bisimilar partitions still appears in the form of the feature dimension of the learned solution, because in three dimensions one can represent at most three linearly independent vectors. Because the feature representation is approximately clustered into three linearly independent vectors, and because it is initialized uniformly at random, the feature matrix $\Phi$ has a row space spanning all three dimensions. A similar argument applies to the following puddle world experiments.

4.4.2 Puddle World Experiments

Puddle world (Boyan and Moore, 1995) is a grid-world navigation task where the agent has to navigate to a goal location to collect a +1 reward while avoiding a puddle. Figure 4(a) shows a map of puddle world. Entering a puddle grid cell results in a negative reward for each such transition. The agent selects one out of four actions to move up, down, left, or right. Transitions are successful with 90% probability; with 10% probability, the agent instead transitions to a cell orthogonal to the intended direction or does not move.

The only perfect model reduction on the state space of puddle world is the identity map. Suppose two adjacent states are clustered. Then, the compressed state space would lose the exact position information in the grid, rendering accurate predictions about expected future rewards impossible if only the compressed feature model is used.

Stochastic gradient descent was used to learn a feature representation for puddle world. To further stabilize training, the SF target was assumed to be fixed and no gradients were computed through it. Given a transition data set, the following expectation was sampled:

(56)

where the bootstrapped term is the SF target (similar to the SF-learning algorithm). The transition data set for puddle world was generated by iterating over the state space and sampling one transition starting at each state, until a data set of 1000 transitions was reached. Actions were sampled uniformly at random to generate each transition.

(a) Puddle World. The agent has to navigate to the green goal cell and avoid the orange puddle.
(b) Feature Clustering at initialization.
(c) Feature Clustering after 40000 gradient updates.
(d) Feature Clustering after 200000 gradient updates.
Figure 4: Puddle world map and the learned partitioning of the state space.
(a) Value function prediction errors for different $\epsilon$-greedy policies. Value errors can be at most 10.
(b) Rollout prediction errors for randomly sampled action sequences. Each curve plots the average over 100 randomly sampled state and action sequence pairs.
Figure 5: Puddle World Prediction Errors.
(a) Initialization
(b) 20000 Update Steps
(c) 400000 Update Steps
Figure 6: Example rollouts on puddle world.

To assess which representations approximate model features find, we constrained the feature representation to 80 dimensions, effectively forcing the algorithm to over-compress the state space into 80 linearly independent features and to introduce some approximation error. During different stages of training, the feature vector for each state was computed and this set of vectors was clustered into 80 clusters using k-means clustering

(Bishop, 2006). Figures 4(b), 4(c), and 4(d) illustrate the found clustering by linking states together that share the same closest cluster centroid. Figure 4(b) shows the clustering obtained for the feature initialization, where feature vectors are sampled uniformly at random. One can observe that through training (Figures 4(c) and 4(d)), clusters that contain states that are far apart in the grid are broken up, and states that are close together in the grid are mapped to feature vectors that are close together in terms of their Euclidean (L2) distance. As expected, the approximate feature representation attempts to preserve the position in the grid in order to ensure accurate predictions of future reward sequences and thus resembles an approximate model reduction.

Overall, stochastic gradient descent was performed on the loss objective for 200000 steps. Figure 5 shows the prediction errors for the learned feature representation at different stages of training. Initially, value errors are high but are reduced significantly during training (Figure 5(a)). Similar to the column world experiments, uniform random action selection ($\epsilon = 1$) produces the lowest value errors, because the SFs are trained for this policy. As $\epsilon$ is decreased and the policy becomes more similar to the (greedy) optimal policy, value errors increase because, to some extent, approximation errors limit the learned representation's ability to generalize to arbitrarily different policies. However, at the end of training, all tested policies can be predicted with comparably low value errors. Note that Figure 5(a) plots, for each policy, the highest absolute value difference (the maximum is computed over all 100 states). Figure 5(b) plots the expected reward prediction error for 200-step action sequences. For each curve, 100 state and action sequence pairs were sampled uniformly at random and the expected reward was computed for each time step using the feature model. At initialization, expected reward prediction errors are at a comparably high level and then decrease to about 0.025 as training progresses. Because model features directly attempt to approximate a model reduction and effectively learn a multi-step model, the reward prediction error curves move down across the entire rollout horizon as training progresses. The discount factor discounts the feature prediction error at step $t$ with $\gamma^t$. Increasing the discount factor towards one then results in the model assigning more weight to predictions that range over longer time horizons. In this experiment, the discount factor is such that a 200-step rollout length is sufficient to estimate the discounted value function: if rewards that lie more than 200 time steps into the future were predicted uniformly at random within the reward range, then the value error could only be influenced by a negligible amount. If the goal is to predict rewards further than 200 time steps into the future, then one would have to further increase the discount factor $\gamma$.

Figure 6 shows three example rollouts at three different stages during training. We only evaluate expected future rewards. Because the transition function is stochastic, the reward prediction curves in Figure 6 are averaged over multiple reward values and are thus smoothed. All hyper-parameters for the puddle world experiment are documented in Appendix B.

5 Discussion

We demonstrate how SFs can be used for model-based RL by learning a feature representation and optimizing over the space of all possible value functions. Given a fixed basis function, Section 3 shows how learning SFs is equivalent to learning value functions through Q-learning. However, Q-learning does not only evaluate a particular control policy but also searches for the optimal policy. For finite state and action spaces, Strehl et al. (2006) show that such model-free algorithms have a sample complexity exponential in the size of the state space and converge much more slowly than model-based algorithms such as RMax (Strehl et al., 2009). A similar argument applies to learning SFs. For finite state and action spaces and given a fixed policy $\pi$, the SFs for this policy can be directly computed with a matrix inversion and a matrix multiplication (cf. Eq. (9)).