1 Introduction
Research in reinforcement learning (RL) (Sutton and Barto, 1998; Kaelbling et al., 1996) devises algorithms for computing an action-selection strategy, also called a policy, that maximizes a reward objective in a control problem. In a control problem, also called an environment, an agent uses its policy to choose an action at the current state to cause a transition between states and generate a reward, a single scalar number. One key assumption, Markovianness, states that rewards and transitions depend only on the current state and action but may otherwise be arbitrary. RL algorithms can be coarsely classified into model-free and model-based algorithms. Model-free RL algorithms arrive at an optimal policy through continued trial-and-error interactions with the control problem while simultaneously improving an intermediate policy
(Watkins and Dayan, 1992; Rummery, 1995; Mnih et al., 2013), perhaps indirectly through value estimates. Model-based RL algorithms estimate the transition and reward functions and use this learned model to compute the optimal policy (Sutton, 1990; Brafman and Tennenholtz, 2002). While algorithms such as Dyna (Sutton, 1990) also use model-free TD-learning, this paper considers a "strict" form of model-based RL where the optimal policy is computed explicitly from a learned model, rather than using a model to speed up or inform otherwise model-free learning. One shortcoming of model-based RL is that approximation errors of a single-step transition model compound for predictions over multiple time steps. At each time step, the transition model only outputs an approximate state, and this approximate state is then reused to predict the state for the next time step. As a result, the predicted state sequence diverges from the true state sequence as the number of time steps increases (Talvitie, 2017). Further, a predicted state sequence may not contain actual states, rendering the prediction of future reward outcomes difficult.

Dayan (1993) presents the successor representation (SR), a representation that encodes each state in terms of the visitation frequencies over future states. The SR has also been extended to successor features (SFs) (Barreto et al., 2017; Lehnert et al., 2017)
by first encoding each state into a feature vector and then estimating SFs to predict visitation frequencies over future feature vectors. Because the SR encodes future state visitation frequencies, it also encodes information about the transition dynamics of a control problem and can be seen as intermediate between model-based and model-free RL algorithms (Momennejad et al., 2017; Russek et al., 2017).

In this paper, we present a novel perspective on how SFs relate to model-free and model-based RL. First, we show that learning SFs is equivalent to learning value functions directly from trial-and-error interactions with the environment; hence, learning SFs is akin to model-free RL. Then, we show that if SFs are used to learn a state representation, this state representation encodes a compression of the transition and reward functions, also called a model reduction (Givan et al., 2003). Rather than representing just one control policy that can be improved, model reductions, and thus SFs, encode a representation of the control task itself, a key property of model-based RL algorithms. This approach provides a novel perspective on model-based RL: rather than finding approximations of the one-step transition and reward functions, a state representation is learned that compresses the dynamics of a control task into a linear action model (Yao and Szepesvári, 2012). Because the state space is compressed using a model reduction, the linear action model is guaranteed to produce the same reward sequences as the original control problem for any arbitrary action sequence. If no approximation errors are present, the linear action model simulates reward sequences exactly as the original control problem does. If approximation errors are present, the learned linear action model resembles a "softened" model reduction and produces reward sequences that approximately match those of the original control problem. For this case we present prediction-error bounds that are linear in the objective function used to train the model.
2 State Representations in Reinforcement Learning
A Markov decision process (MDP) is a tuple $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $p$, a reward function $r$, and a discount factor $\gamma \in [0, 1)$. For finite state and action spaces, the transition and reward functions can also be written in matrix or vector notation as a left-stochastic state-to-state transition matrix $P_a$ and an expected reward vector $\mathbf{r}_a$ of dimension $|\mathcal{S}|$. Each entry of $\mathbf{r}_a$ is set to $\mathbb{E}_{s'}\left[ r(s, a, s') \right]$, where the expectation is computed over next states $s'$. This paper will consider both finite ("discrete") and uncountably infinite ("continuous") state spaces.
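For concreteness, this matrix notation can be written down directly for a small finite MDP; the following sketch uses a hypothetical two-state, two-action chain (not an example from this paper) and numpy arrays.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP in matrix notation. P[a][i, j]
# holds the probability of moving from state i to state j under action
# a (rows sum to one; the paper's left-stochastic matrix P_a is the
# transpose of this layout). r[a][i] is the expected reward E[r(i, a, s')].
n_states, n_actions = 2, 2

P = np.zeros((n_actions, n_states, n_states))
P[0] = np.array([[0.9, 0.1],   # action 0: mostly stay put
                 [0.1, 0.9]])
P[1] = np.array([[0.1, 0.9],   # action 1: mostly switch states
                 [0.9, 0.1]])

r = np.zeros((n_actions, n_states))
r[1, 0] = 1.0  # selecting action 1 in state 0 pays an expected +1

assert np.allclose(P.sum(axis=2), 1.0)  # each row is a distribution
```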
A policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ specifies the probabilities with which actions are selected at any given state (with $\sum_{a} \pi(s, a) = 1$ for every state $s$). If a policy $\pi$ is used, the transition function and expected rewards generated by this policy are denoted by $P_\pi$ and $\mathbf{r}_\pi$, respectively. The value function

(1)  $V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \,\middle|\, s_1 = s \right]$

predicts the expected discounted return, where the expectation is computed over all infinite-length trajectories that start in state $s$ and select actions according to the policy $\pi$. The value function can also be written as a vector $\mathbf{v}^\pi$. The action-conditional Q-function is defined similarly as
(2)  $Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \,\middle|\, s_1 = s, a_1 = a \right]$

(3)  $\phantom{Q^\pi(s, a)} = r(s, a) + \gamma \mathbb{E}_{s'}\left[ \sum_{a'} \pi(s', a') Q^\pi(s', a') \right],$

where Eq. (3) is the usual Bellman fixed point (Puterman, 1994). The expectation in Eq. (2) ranges over all infinite-length trajectories generated by the policy $\pi$ but that start at state $s$ with action $a$. The expected reward is denoted with $r(s, a) = \mathbb{E}_{s'}\left[ r(s, a, s') \right]$. A basis function (Sutton, 1996; Konidaris et al., 2011) is a function mapping states or state–action pairs to a real-valued vector. Specifically, a state-conditional basis function maps a state $s$ to a column vector $\boldsymbol{\phi}(s)$, and a state–action-conditional basis function maps a state–action pair $(s, a)$ to a column vector $\boldsymbol{\phi}(s, a)$. Basis functions transform the state or state–action space so that a certain objective can be represented effectively, for example, to express linear approximations of the Q-function with
(4)  $Q(s, a) \approx \boldsymbol{\phi}(s, a)^\top \boldsymbol{\theta}.$
For finite action spaces, state-conditional basis functions can be used to construct a state–action-conditional basis function by constructing a feature vector of dimension $|\mathcal{A}| \cdot \dim(\boldsymbol{\phi})$,^{1} placing the vector $\boldsymbol{\phi}(s)$ into the entries corresponding to action $a$, and setting all other entries to zero. Specifically, for each state $s$,

(5)  $\boldsymbol{\phi}(s, a_1) = \left[ \boldsymbol{\phi}(s)^\top, \mathbf{0}^\top, \ldots, \mathbf{0}^\top \right]^\top$

(6)  $\boldsymbol{\phi}(s, a_2) = \left[ \mathbf{0}^\top, \boldsymbol{\phi}(s)^\top, \ldots, \mathbf{0}^\top \right]^\top$

(7)  $\qquad \vdots$

(8)  $\boldsymbol{\phi}(s, a_{|\mathcal{A}|}) = \left[ \mathbf{0}^\top, \ldots, \mathbf{0}^\top, \boldsymbol{\phi}(s)^\top \right]^\top$

^{1}The dimension of the vector $\boldsymbol{\phi}$ is denoted with $\dim(\boldsymbol{\phi})$.
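A minimal sketch of this block construction (the function name is ours):

```python
import numpy as np

def state_action_features(phi_s, action, n_actions):
    """Build phi(s, a) from phi(s) as in Eqs. (5)-(8): copy phi(s)
    into the block of entries belonging to the given action and
    leave every other block zero."""
    n = phi_s.shape[0]
    phi_sa = np.zeros(n * n_actions)
    phi_sa[action * n:(action + 1) * n] = phi_s
    return phi_sa

phi = np.array([0.5, 0.2, 0.3])          # a 3-dimensional state feature
print(state_action_features(phi, 1, 2))  # -> [0.  0.  0.  0.5 0.2 0.3]
```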
Dayan (1993) introduced the successor representation (SR), a feature representation for finite state and action spaces that predicts visitation frequencies over future states. Suppose each state $s$ in a finite state space is written as a one-hot vector $\mathbf{e}_s$ of dimension $|\mathcal{S}|$; then, for any fixed policy $\pi$, the SR is defined as $\boldsymbol{\psi}^\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} \mathbf{e}_{s_t} \,\middle|\, s_1 = s \right]$.^{2} In matrix notation, the SR for a particular policy $\pi$ can be written as

(9)  $\Psi_\pi = \sum_{t=1}^{\infty} \gamma^{t-1} P_\pi^{t-1} = \left( I - \gamma P_\pi \right)^{-1}.$

^{2}Rather than writing the SR using an indicator function, we use one-hot bit vectors $\mathbf{e}_s$, which is equivalent.
Intuitively, the matrix $\Psi_\pi$ describes a discounted visitation frequency of all future states. A column of $\Psi_\pi$ then contains a marginal probability (over time steps) of reaching a specific state, where the number of time steps needed to reach a state follows a geometric distribution with parameter $\gamma$. Similar to value functions, the SR obeys a recursive identity

(10)  $\Psi_\pi = I + \gamma P_\pi \Psi_\pi,$
where $\Psi_\pi$ has a dependency on the policy $\pi$ (Lehnert et al., 2017). Barreto et al. (2017) generalize the SR to successor features (SFs) by assuming that state–action pairs are represented with a basis function $\boldsymbol{\phi}$. The SF at state $s$ for selecting action $a$ is defined as
(11)  $\boldsymbol{\psi}^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} \boldsymbol{\phi}(s_t, a_t) \,\middle|\, s_1 = s, a_1 = a \right],$
where the expectation ranges over all infinite-length trajectories generated by the policy $\pi$ but that start at state $s$ with action $a$. SFs and the SR are closely tied to value functions. If the expected rewards are written out in a vector

(12)  $\mathbf{w} = \left[ \ldots, r(s, a), \ldots \right]^\top \in \mathbb{R}^{|\mathcal{S}| \cdot |\mathcal{A}|},$
then the expected reward function can be parametrized using $\mathbf{w}$ and the one-hot bit vector (state–action-conditional) basis function $\boldsymbol{\phi}$ with

(13)  $r(s, a) = \boldsymbol{\phi}(s, a)^\top \mathbf{w}.$
With this parametrization of the reward function, substituting Eq. (13) into Eq. (2) shows that SFs form an exact basis function for Q-values (Barreto et al., 2017):

(14)  $Q^\pi(s, a) = \boldsymbol{\psi}^\pi(s, a)^\top \mathbf{w}.$
A similar connection holds between state-conditional value functions and the SR.
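These identities are easy to verify numerically for a finite MDP. The sketch below, using a hypothetical three-state chain with a row-stochastic transition matrix, computes the SR via Eq. (9) and checks that multiplying it with the reward vector reproduces the value function, which is the vector form of the connection stated above.

```python
import numpy as np

def successor_representation(P_pi, gamma):
    """SR in matrix form, Eq. (9): Psi = (I - gamma * P_pi)^{-1}."""
    return np.linalg.inv(np.eye(P_pi.shape[0]) - gamma * P_pi)

# Hypothetical 3-state chain under some fixed policy; the last state
# is absorbing and pays a reward of 1.
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])
gamma = 0.9

Psi = successor_representation(P_pi, gamma)
v_sr = Psi @ r_pi  # value computed from the SR
v_bellman = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
assert np.allclose(v_sr, v_bellman)  # matches the Bellman solution
```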
3 Model-Free Learning
Because Q-values are linear functions of SFs (Eq. (14)), algorithms that learn SFs can be derived similarly to linear Q-learning (Sutton and Barto, 1998, Chapter 8.4). In linear Q-learning, stochastic gradient descent is used to optimize the Mean Squared Value Error

(15)  $\mathcal{L}_Q(\boldsymbol{\theta}) = \mathbb{E}\left[ \left( y_Q - \boldsymbol{\phi}(s, a)^\top \boldsymbol{\theta} \right)^2 \right],$

where $y_Q = r + \gamma \max_{a'} \boldsymbol{\phi}(s', a')^\top \boldsymbol{\theta}$ and $\boldsymbol{\theta}$ is the Q-value weight vector. The expectation in Eq. (15) is computed with respect to some distribution with which transitions $(s, a, r, s')$ are sampled. When computing a gradient of $\mathcal{L}_Q$, the target $y_Q$ is considered a constant. For each transition $(s, a, r, s')$, linear Q-learning performs the update rule

(16)  $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha_Q \left( r + \gamma \max_{a'} \boldsymbol{\phi}(s', a')^\top \boldsymbol{\theta} - \boldsymbol{\phi}(s, a)^\top \boldsymbol{\theta} \right) \boldsymbol{\phi}(s, a),$
where $\alpha_Q$ is a learning rate. The term $r + \gamma \max_{a'} \boldsymbol{\phi}(s', a')^\top \boldsymbol{\theta} - \boldsymbol{\phi}(s, a)^\top \boldsymbol{\theta}$ is called the TD-error. Similar to Lehnert et al. (2017), an SF-learning algorithm can be derived by defining the Mean Squared SF Error

(17)  $\mathcal{L}_{\text{SF}}(F) = \mathbb{E}\left[ \left\| \mathbf{y}_{\text{SF}} - \boldsymbol{\psi}(s, a) \right\|_2^2 \right].$
Because the SF is a vector of dimension $\dim(\boldsymbol{\phi})$, the target

(18)  $\mathbf{y}_{\text{SF}} = \boldsymbol{\phi}(s, a) + \gamma \boldsymbol{\psi}(s', a^*)$

is also a vector of dimension $\dim(\boldsymbol{\phi})$. The action $a^*$ selected at the next time step is computed greedily with respect to the Q-value estimate:

(19)  $a^* = \arg\max_{a'} \boldsymbol{\psi}(s', a')^\top \mathbf{w},$
where the weight vector $\mathbf{w}$ is the reward-model parameter vector from Eq. (13). We make the assumption that SFs are approximated linearly using the basis function $\boldsymbol{\phi}$, so that

(20)  $\boldsymbol{\psi}(s, a) = F \boldsymbol{\phi}(s, a),$
where $F$ is a square matrix. Computing the gradient of $\mathcal{L}_{\text{SF}}$ with respect to $F$ results in an update rule similar to linear Q-learning:

(21)  $F \leftarrow F + \alpha_{\text{SF}} \left( \boldsymbol{\phi}(s, a) + \gamma F \boldsymbol{\phi}(s', a^*) - F \boldsymbol{\phi}(s, a) \right) \boldsymbol{\phi}(s, a)^\top.$
We call the error term $\boldsymbol{\phi}(s, a) + \gamma F \boldsymbol{\phi}(s', a^*) - F \boldsymbol{\phi}(s, a)$ the SF-error. The iterate in Eq. (21) is similar to the TD update rules presented by Dayan (1993), Barreto et al. (2017), and Lehnert et al. (2017). Similarly, the reward model is optimized using the loss objective $\mathbb{E}\left[ \left( r - \boldsymbol{\phi}(s, a)^\top \mathbf{w} \right)^2 \right]$, resulting in the update rule

(22)  $\mathbf{w} \leftarrow \mathbf{w} + \alpha_r \left( r - \boldsymbol{\phi}(s, a)^\top \mathbf{w} \right) \boldsymbol{\phi}(s, a).$

Algorithm 1 summarizes the algorithm, which we refer to as SF-learning.
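A sketch of the inner loop of SF-learning, implementing the updates in Eqs. (19), (21), and (22); the function signature and learning-rate defaults are our own choices.

```python
import numpy as np

def sf_learning_step(F, w, phi_sa, reward, phi_next, gamma,
                     alpha_sf=0.1, alpha_r=0.1):
    """One SF-learning update for a transition (s, a, r, s').
    phi_sa is the basis vector phi(s, a); phi_next is a list holding
    phi(s', a') for every action a'."""
    # Greedy next action w.r.t. Q(s', a') = psi(s', a')^T w, Eq. (19).
    a_star = int(np.argmax([(F @ p) @ w for p in phi_next]))

    # SF-error and SF-matrix update, Eq. (21).
    sf_error = phi_sa + gamma * (F @ phi_next[a_star]) - F @ phi_sa
    F = F + alpha_sf * np.outer(sf_error, phi_sa)

    # Reward-model update, Eq. (22).
    w = w + alpha_r * (reward - phi_sa @ w) * phi_sa
    return F, w
```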
Both linear Q-learning and SF-learning are off-policy learning algorithms, because a control policy is used to select actions while the value function of a different, optimal policy is learned. Suppose the control policy $\pi$ is a function of the current state $s$, the Q-value estimates at state $s$, and the time step $t$. For example, $\epsilon$-greedy exploration represents such a policy, because with probability $1 - \epsilon$ the action greedy with respect to the current Q-value estimates is selected. Further, the parameter $\epsilon$ could be annealed to zero as the time step increases. For this class of control policies, linear Q-learning and SF-learning produce the same sequence of value functions, assuming that SF-learning is provided with an optimal reward model.
Theorem 1 (SF-Learning and Q-Learning Equivalence).
Consider an MDP $M$ and a basis function $\boldsymbol{\phi}$ such that, for some vector $\mathbf{w}^*$,

(23)  $\forall s, a: \quad r(s, a) = \boldsymbol{\phi}(s, a)^\top \mathbf{w}^*.$

Suppose linear Q-learning and SF-learning are run on $M$ starting at state $s$ using a control policy $\pi$. If $\boldsymbol{\theta}_1 = F_1^\top \mathbf{w}^*$, $\mathbf{w}_1 = \mathbf{w}^*$, and $\alpha_Q = \alpha_{\text{SF}}$, then

(24)  $\forall t: \quad \boldsymbol{\theta}_t = F_t^\top \mathbf{w}^*.$

As a result, linear Q-learning and SF-learning will produce the same Q-value function sequence and trajectory.
Appendix A presents a formal proof of Theorem 1. Note that this result also applies to fixed control policies that do not change with the time step or the current value estimates: the policy can depend on the state $s$, the Q-values, and the time step $t$, but does not have to. If both algorithms produce the same sequence of value functions up to time step $t$, then they will choose the same actions with equal probabilities at the next time step. If SF-learning is not initialized with the correct reward weight vector and is required to learn a reward model, then the two algorithms will not produce identical value-function sequences and behaviour. Theorem 1 points out the key distinction between SF-learning and Q-learning: Q-learning samples the reward function directly by incorporating the reward of each transition into its Q-value estimates, while SF-learning first builds a reward model. However, if the reward model is initialized to the correct "ground truth" reward function, SF-learning becomes identical to Q-learning and, most importantly, searches for the optimal policy in the same fashion. In this light, learning SFs is akin to model-free RL.
4 Successor Features Encode Model-Based Representations
The previous section shows how learning SFs is closely related to learning value functions from temporal-difference errors, given a fixed basis function. In this section, the learning problem is changed by including the basis function in the set of parameters the agent aims to learn. Specifically, we consider state-conditional basis functions that can produce accurate SFs and one-step reward predictions. We will show that these two criteria are sufficient to characterize and learn a feature representation that resembles a model reduction. Section 4.1 reviews model reductions and Section 4.2 presents how SFs can be related to model reductions. Section 4.3 shows how model reductions can be approximated by learning SFs, and Section 4.4 presents two experiments on finite state–action MDPs to illustrate how model reductions can be approximated using gradient-based optimization techniques.
4.1 Model Reductions
A model reduction (Givan et al., 2003) is a partitioning of the state space such that two states of the same partition are equivalent in terms of their one-step rewards as well as the states reachable within one time step. The example grid world shown in Figure 1 illustrates the intuition behind model reductions. In this MDP, each column forms a state partition because two criteria are satisfied:

1. the one-step rewards are the same, and

2. for two states of the same partition, the distributions over next state partitions are identical.
The compressed MDP retains all information necessary to predict future reward outcomes, because the columns describe the distance, in time steps, to the +1 reward. For example, suppose a model-based agent is in the left (blue) column and wants to predict the reward sequence when moving right twice. If the state space is partitioned as shown in Figure 1(a), then the agent can predict the correct reward sequence using only state clusters. If the blue and green columns were merged into one partition, then the agent would not be able to predict the correct reward sequence: the state space would be compressed into two partitions, and the information about the distance to the right column would be lost. Consequently, an agent placed into the left column and moving right could not distinguish whether the reward occurs in one or two time steps using only the transitions and rewards between state partitions. States that are merged into the same partition by a model reduction are called behaviourally equivalent or bisimilar.
Bisimilarity can be formally defined as an equivalence relation between states, which induces a partitioning of the state space. The set of state partitions is formally defined as follows.
Definition 1 (Quotient/Partition Set).
Let $\mathcal{S}$ be a set and $\sim$ an equivalence relation defined on $\mathcal{S}$. Then, the quotient or partition set is

(25)  $\mathcal{S} / \!\sim\; = \left\{ [s] \,\middle|\, s \in \mathcal{S} \right\}, \quad \text{where } [s] = \left\{ \tilde{s} \in \mathcal{S} \,\middle|\, \tilde{s} \sim s \right\}.$

Each element $\Phi \in \mathcal{S}/\!\sim$ is a state partition, a subset of the state space, and $\mathcal{S}/\!\sim$ is a set of partitions. Bisimilarity can be formally defined as follows.
Definition 2 (Bisimilarity).
For an MDP $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, where $p(s, a, \cdot)$ is either a probability distribution over states if $\mathcal{S}$ is discrete or a density function if $\mathcal{S}$ is continuous, an equivalence relation $\sim$ is a bisimulation if

(26)  $s \sim \tilde{s} \iff \forall a \in \mathcal{A}: \; r(s, a) = r(\tilde{s}, a) \;\text{ and }\; \forall \Phi \in \mathcal{S}/\!\sim: \; p(s, a, \Phi) = p(\tilde{s}, a, \Phi),$

where, for discrete state spaces,

(27)  $p(s, a, \Phi) = \sum_{s' \in \Phi} p(s, a, s'),$

and, for continuous state spaces,

(28)  $p(s, a, \Phi) = \int_{\Phi} p(s, a, s') \, ds'.$
Condition (26) states that two states can only be bisimilar if they produce the same one-step rewards and if they transition to the same partitions with the same probabilities. Note that this condition on the transition dynamics is recursive: two bisimilar states $s$ and $\tilde{s}$ have to transition with equal probability into a partition $\Phi$ that itself can only consist of bisimilar states. For discrete state spaces, Definition 2 is identical to the definition presented by Li et al. (2006). For continuous state spaces, Eq. (28) generalizes the same idea by integrating the transition function over the state partition $\Phi$.
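For a finite MDP, the coarsest bisimulation of Definition 2 can be computed by iterated partition refinement. The sketch below is a straightforward, unoptimized implementation of this idea under our own naming; it starts from reward-equivalent blocks and keeps splitting until the block transition probabilities of Eq. (27) agree.

```python
import numpy as np

def coarsest_bisimulation(P, r, decimals=8):
    """Partition refinement for a finite MDP (Definition 2).
    P: (A, S, S) row-stochastic transition tensor, r: (A, S) rewards.
    Returns an integer partition label per state."""
    n_actions, n_states, _ = P.shape
    # Start from reward equivalence: states with identical one-step
    # rewards for every action share a block.
    _, labels = np.unique(np.round(r.T, decimals), axis=0,
                          return_inverse=True)
    while True:
        blocks = np.unique(labels)
        # Probability mass each state sends into every current block,
        # per action (Eq. (27)); shape (A, S, B).
        mass = np.stack([P[:, :, labels == b].sum(axis=2) for b in blocks],
                        axis=2)
        # Signature per state: current block, rewards, and block masses.
        sigs = np.concatenate(
            [labels[:, None].astype(float), r.T,
             mass.transpose(1, 0, 2).reshape(n_states, -1)], axis=1)
        _, new_labels = np.unique(np.round(sigs, decimals), axis=0,
                                  return_inverse=True)
        if len(np.unique(new_labels)) == len(blocks):
            return new_labels  # fixed point: partition is bisimilar
        labels = new_labels
```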
4.2 Connection to Successor Features
As discussed before, a state-conditional basis function maps states into some feature space and thus establishes a many-to-one relation between states and features. Model reductions can be encoded through a basis function $\boldsymbol{\phi}$ by constraining $\boldsymbol{\phi}$ such that two states can only be assigned the same feature vector if they are bisimilar. For example, the feature representation could be constructed by partitioning the state space into $n$ partitions and then assigning each partition a unique $n$-dimensional one-hot bit vector. We introduce model features, which assign identical feature vectors to states that are bisimilar.
Definition 3 (Model Features).
A feature representation $\boldsymbol{\phi}$ is a model feature if and only if, for a bisimulation relation $\sim$,

(29)  $\forall s, \tilde{s}: \quad \boldsymbol{\phi}(s) = \boldsymbol{\phi}(\tilde{s}) \iff s \sim \tilde{s}.$
Because of their relationship to bisimulation, model features are designed to be rich enough to allow the prediction of future reward outcomes while removing any other information from the state space. Note that, in contrast to Abel et al. (2018), the work presented in this paper does not attempt to compute a maximal compression of the state space. Hence, if states are encoded as one-hot bit vectors, then the identity map provides a trivial model feature for finite state spaces (each partition consists of a singleton set).
Suppose $\boldsymbol{\phi}$ is a feature representation and each $\boldsymbol{\phi}(s)$ is a one-hot bit vector of dimension $n$ with the $i$th entry being set to 1. In that case, the representation partitions the state space by assigning the same one-hot bit vector to states of the same partition. Implicitly, the function $\boldsymbol{\phi}$ encodes an equivalence relation such that

(30)  $s \sim \tilde{s} \iff \boldsymbol{\phi}(s) = \boldsymbol{\phi}(\tilde{s}).$
Definition 2 states that a bisimulation relation can only relate states with equal one-step expected rewards. Suppose, for each action $a$, a vector $\mathbf{w}_a$ is such that

(31)  $\forall s: \quad r(s, a) = \boldsymbol{\phi}(s)^\top \mathbf{w}_a;$

then each entry of the vector $\mathbf{w}_a$ contains the expected reward value associated with a particular state partition and action $a$, because $\boldsymbol{\phi}(s)$ is a one-hot bit vector. Thus we have that

(32)  $\boldsymbol{\phi}(s) = \boldsymbol{\phi}(\tilde{s}) \implies r(s, a) = \boldsymbol{\phi}(s)^\top \mathbf{w}_a = \boldsymbol{\phi}(\tilde{s})^\top \mathbf{w}_a = r(\tilde{s}, a),$

because both $s$ and $\tilde{s}$ are assigned the same feature vector. If $\boldsymbol{\phi}$ satisfies Eq. (31), then the resulting equivalence relation satisfies the reward condition in Definition 2.
SFs encode the condition on the transition dynamics of bisimulation relations. For two states $s$ and $\tilde{s}$ to be bisimilar, they have to transition to the same state partitions with equal probabilities. Repeating from Definition 2, the state space needs to be partitioned such that, for any two such states $s$ and $\tilde{s}$,

(33)  $\forall a \in \mathcal{A}, \; \forall \Phi \in \mathcal{S}/\!\sim: \quad p(s, a, \Phi) = p(\tilde{s}, a, \Phi).$

The probability $p(s, a, \Phi_j)$ describes the marginal of transitioning from a state $s$ into any state that lies in the state partition labelled with the one-hot vector $\mathbf{e}_j$. To relate these probabilities to SFs, we first observe that the probability of transitioning from partition $\Phi_i$ to partition $\Phi_j$ can be written as the marginal

(34)  $p(\Phi_i, a, \Phi_j) = \int_{\Phi_i} p(s, a, \Phi_j) \, \omega(s) \, ds,$

where $\omega(s)$ is the probability of encountering a state $s$ within the state partition $\Phi_i$. The distribution $\omega$ can be understood as a visitation distribution over states generated by a particular policy. This paper refers to $\omega$ as a weighting function.
Definition 4 (Weighting Function).
For an MDP $M$, let $\sim$ be an equivalence relation on the state space $\mathcal{S}$. If $\mathcal{S}$ is discrete, then a weighting function $\omega$ is defined such that

(35)  $\forall \Phi \in \mathcal{S}/\!\sim: \quad \sum_{s \in \Phi} \omega(s) = 1.$

If $\mathcal{S}$ is continuous, then a weighting function is defined such that

(36)  $\forall \Phi \in \mathcal{S}/\!\sim: \quad \int_{\Phi} \omega(s) \, ds = 1.$
Figure 2 shows a schematic of how the different probability distributions relate to one another. Suppose we define the SF through the fixed point

(37)  $\boldsymbol{\psi}^\pi(s, a) = \boldsymbol{\phi}(s) + \gamma \, \mathbb{E}_{s' \sim p(s, a, \cdot)}\left[ \boldsymbol{\psi}^\pi(s') \right];$

then, because $\boldsymbol{\phi}$ is a one-hot representation, the SFs can be collected into a matrix in which each row contains the SF associated with all states belonging to the same partition. The right-hand side of the fixed-point equation in line (37) computes the expected SF across all actions with respect to a policy $\pi$:

(38)  $\boldsymbol{\psi}^\pi(s') = \sum_{a'} \pi(s', a') \, \boldsymbol{\psi}^\pi(s', a'),$

which, in matrix notation, corresponds to multiplication by a matrix with the action-selection probabilities along its diagonal. This dependency of the SF representation on $\pi$ will be discussed at the end of this section.
Because $\boldsymbol{\phi}(s)$ is a one-hot bit vector, we can construct a partition-to-partition transition matrix $M_a$ such that

(39)  $\forall s: \quad M_a \boldsymbol{\phi}(s) = \mathbb{E}_{s' \sim p(s, a, \cdot)}\left[ \boldsymbol{\phi}(s') \right].$

Leveraging the property that $\boldsymbol{\phi}(s)$ is a one-hot bit vector, the matrices $M_a$ encode the visitation frequencies over state partitions, similar to how the SR matrix encodes the visitation frequencies over states. This property stems from the fact that $M_a$ is a function of the feature representation $\boldsymbol{\phi}$. Previously we have shown that $\Psi_\pi = I + \gamma P_\pi \Psi_\pi$ (see also Eq. (10)), and similarly we can write for each matrix $M_a$

(40)  $\Psi_a = I + \gamma \, \Psi_\pi M_a,$

where $\Psi_a$ collects the action-conditional SFs of the partitions and $\Psi_\pi$ is the corresponding expected SF matrix under $\pi$.
Interestingly, because the matrices $M_a$ are used to compute a SF with respect to the transition dynamics of the MDP, the feature representation is constructed such that, for any weighting function $\omega$,

(41)  $\mathbf{e}_j^\top M_a \mathbf{e}_i = \int_{\Phi_i} p(s, a, \Phi_j) \, \omega(s) \, ds = p(\Phi_i, a, \Phi_j).$

Eq. (41) implies that we could pick, for any state $\tilde{s} \in \Phi_i$, a weighting function that is a Dirac delta function centered on $\tilde{s}$, and

(42)  $p(\Phi_i, a, \Phi_j) = \int_{\Phi_i} p(s, a, \Phi_j) \, \delta_{\tilde{s}}(s) \, ds = p(\tilde{s}, a, \Phi_j).$

This means two states $s$ and $\tilde{s}$ of the same state partition will transition to state partitions with equal probability, $p(s, a, \Phi_j) = p(\tilde{s}, a, \Phi_j)$. To see why the partition-to-partition transition probabilities become independent of $\omega$, we observe that the SF describes the visitation frequencies over future state partitions. If two states $s$ and $\tilde{s}$ are assigned to the same partition, then $\boldsymbol{\psi}^\pi(s, a) = \boldsymbol{\psi}^\pi(\tilde{s}, a)$ for every action $a$. That is, both states $s$ and $\tilde{s}$ must have identical future state-partition visitation frequencies, and thus $p(s, a, \Phi_j) = p(\tilde{s}, a, \Phi_j)$. Because these state-to-partition transition probabilities are equal for all states of the same partition, the integral in Eq. (42) evaluates to the same value for any $\omega$. The following lemma proves lines (41) and (42) formally.
Lemma 1.
For an MDP $M$, let $\boldsymbol{\phi}$ be a feature representation such that each $\boldsymbol{\phi}(s)$ is a one-hot bit vector of dimension $n$ with entry $i$ being set to 1. If, for every state $s$ and action $a$,

(43)  $\boldsymbol{\psi}^\pi(s, a) = \Psi_a \boldsymbol{\phi}(s),$

then

1. $\forall s, a: \; \mathbb{E}_{s' \sim p(s, a, \cdot)}\left[ \boldsymbol{\phi}(s') \right] = M_a \boldsymbol{\phi}(s)$, where $M_a$ is a left-stochastic matrix, and

2. $p_\omega(\Phi_i, a, \Phi_j) = p_{\tilde{\omega}}(\Phi_i, a, \Phi_j)$, where $\omega$ and $\tilde{\omega}$ are arbitrary weighting functions.
Appendix A presents a formal proof of Lemma 1. Using this lemma and the arguments presented above, we can prove the main theorem, which shows that SFs and a one-step reward predictor are sufficient to construct model features and thus model reductions.
Theorem 2.
For an MDP $M$, let $\boldsymbol{\phi}$ be a feature representation such that each $\boldsymbol{\phi}(s)$ is a one-hot bit vector of dimension $n$ with the $i$th entry being set to 1. If the feature representation satisfies, for some exploratory policy $\pi$,

(44)  $\forall s, a: \quad r(s, a) = \boldsymbol{\phi}(s)^\top \mathbf{w}_a \quad \text{and} \quad \boldsymbol{\psi}^\pi(s, a) = \Psi_a \boldsymbol{\phi}(s),$

then $\boldsymbol{\phi}$ is a model feature, and any two states $s$ and $\tilde{s}$ are bisimilar if $\boldsymbol{\phi}(s) = \boldsymbol{\phi}(\tilde{s})$.
A formal proof of Theorem 2 is listed in Appendix A. The SF

(45)  $\boldsymbol{\psi}^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} \boldsymbol{\phi}(s_t) \,\middle|\, s_1 = s, a_1 = a \right]$

is action-conditional, meaning the SF depends on the policy $\pi$ only after the first time step, because the expectation in Eq. (45) ranges over all trajectories that start at state $s$ with action $a$; action $a$ is not selected according to $\pi$. This dependency of the first time step on an action is sufficient to encode the state-to-partition transition probabilities in the matrices $M_a$. Because the feature representation is conditioned only on the state and is constrained to predict action-conditional SFs, it is independent of the policy $\pi$. In fact, any exploratory policy can be used to construct SFs and model features. By Lemma 1, the partition-to-partition transition probabilities can be computed analytically with^{3}

(46)  $M_a = \frac{1}{\gamma} \, \Psi_\pi^{-1} \left( \Psi_a - I \right).$

^{3}The proof of Lemma 1 in Appendix A shows formally why the inverse of $\Psi_\pi$ exists.
If a one-hot bit vector feature representation can be constructed satisfying the conditions outlined in Theorem 2, then in principle any policy could be used to construct model features. However, if model features are approximated, then a dependency on $\pi$ becomes visible to some extent. The following sections explore how model features can be approximated and illustrate that, in the presence of approximation errors, a model-feature approximation produces weaker predictions for policies that are increasingly dissimilar to $\pi$.
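Under the reconstruction of Eq. (46) above, recovering $M_a$ from estimated SF matrices is a single linear solve; a minimal sketch (names ours):

```python
import numpy as np

def partition_transitions(Psi_a, Psi_pi, gamma):
    """Recover the partition-to-partition transition matrix M_a from
    the action-conditional SF matrix Psi_a and the expected SF matrix
    Psi_pi by rearranging Psi_a = I + gamma * Psi_pi @ M_a, Eq. (46)."""
    n = Psi_pi.shape[0]
    return np.linalg.inv(Psi_pi) @ (Psi_a - np.eye(n)) / gamma
```

When the conditions of Theorem 2 hold exactly, the recovered matrices have non-negative entries and columns that sum to one, matching the left-stochastic matrices of Lemma 1.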
4.3 Approximate Model Reductions
This section generalizes the previously discussed model to arbitrary basis functions and shows how model features can be approximated. Rather than reasoning about approximate bisimulations in the form of bisimulation metrics (Ferns et al., 2004, 2011), we consider two key properties to characterize approximations of model features:

1. the ability to predict reward sequences given any arbitrary start state and action sequence, and

2. the ability to predict the value function of any arbitrary policy $\pi$.

We present our results for arbitrary state spaces and introduce two small values $\varepsilon_r$ and $\varepsilon_\psi$ that we desire to minimize:

(47)  $\forall s, a: \quad \left| r(s, a) - \boldsymbol{\phi}(s)^\top \mathbf{w}_a \right| \le \varepsilon_r$

and

(48)  $\forall s, a: \quad \left\| \boldsymbol{\psi}^\pi(s, a) - \Psi_a \boldsymbol{\phi}(s) \right\|_2 \le \varepsilon_\psi,$
where the following assumption is made about the matrices $M_a$:
Assumption 1.
Assume that, for every action $a$, there exists a matrix $M_a$ such that $\forall s: \; \mathbb{E}_{s' \sim p(s, a, \cdot)}\left[ \boldsymbol{\phi}(s') \right] = M_a \boldsymbol{\phi}(s)$.
In contrast to the one-hot bit vector model discussed previously, Assumption 1 is required because, for an arbitrary feature representation, it may not be possible to predict the feature-to-feature transitions linearly. While the following results are stated for finite action spaces, they can also be extended to arbitrary action spaces by assuming a function $a \mapsto \mathbf{w}_a$ rather than a finite set of vectors, and similarly for the matrices $M_a$.
4.3.1 Rollout Predictions
Given a start state $s$, the expected reward after selecting actions according to the sequence $a_1, \ldots, a_t$ can be approximated with

(49)  $\mathbb{E}\left[ r(s_t, a_t) \,\middle|\, s_1 = s, a_1, \ldots, a_t \right] \approx \mathbf{w}_{a_t}^\top M_{a_{t-1}} \cdots M_{a_1} \boldsymbol{\phi}(s),$

where the expectation is over all possible $t$-step trajectories starting in $s$ and following the given action sequence $a_1, \ldots, a_t$.
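Evaluating Eq. (49) along an entire action sequence amounts to repeated matrix-vector products; a sketch (names ours):

```python
import numpy as np

def predict_reward_sequence(phi_s, actions, M, w):
    """Approximate E[r(s_t, a_t)] for each step of an action sequence
    via Eq. (49): the start feature is pushed through the linear action
    models M[a], and a one-step reward is read out with w[a] before
    every transition."""
    rewards, x = [], phi_s.copy()
    for a in actions:
        rewards.append(w[a] @ x)  # predicted expected reward at this step
        x = M[a] @ x              # expected feature vector one step later
    return np.array(rewards)
```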
Theorem 3 (Rollout Bound).
For an MDP $M$, a feature representation $\boldsymbol{\phi}$ satisfying Eqs. (47) and (48), and matrices $M_a$ satisfying Assumption 1, for every start state $s$ and every action sequence $a_1, \ldots, a_t$,

(50)  $\left| \mathbb{E}\left[ r(s_t, a_t) \right] - \mathbf{w}_{a_t}^\top M_{a_{t-1}} \cdots M_{a_1} \boldsymbol{\phi}(s) \right| \le \varepsilon_r + (t - 1) \, \frac{1 + \gamma}{\gamma} \, \varepsilon_\psi \max_a \left\| \mathbf{w}_a \right\|_2.$

If $\varepsilon_\psi = 0$, then the bound in Eq. (50) is equal to $\varepsilon_r$, because the transition model is not used to predict one-step rewards and cannot influence the prediction error. SFs predict visitation frequencies over future transitions and are thus discounted multi-step models. Because a multi-step model is used for multi-step predictions, the second term in Eq. (50), which describes the prediction error induced by approximation errors in the transition model, is linear in $\varepsilon_\psi$ and the rollout length $t$.
This result stands in contrast to the usual approach of approximating a one-step transition model. The drawback of directly approximating transition functions is that approximation errors compound when making multi-step predictions, because predictions about the state at time step $t$ are made given only an approximation of the state at time step $t - 1$. Hence, existing error bounds are exponential in the rollout length (Asadi et al., 2018, Theorem 1; Talvitie, 2017, Lemma 2). This dependency does not appear in Theorem 3 because $\varepsilon_\psi$ directly upper-bounds the error of a discounted multi-step transition model. Our model is free to construct any feature space that produces low prediction errors. Hence, if $\varepsilon_\psi$ is low enough, then Theorem 3 shows that the learned feature space can support predictions about multi-step transitions.
4.3.2 Value Predictions
The following theorem states that an approximation of model features can also be used to represent Q-value functions such that $Q^\pi(s, a) \approx \boldsymbol{\phi}(s)^\top \mathbf{q}_a$ for every action $a$. Because value functions are a discounted sum of expected future rewards, the suitability of model features for Q-value prediction follows from the previous discussion.
Theorem 4 (Approximate Model Features).
For an MDP $M$, a feature representation $\boldsymbol{\phi}$ satisfying Eqs. (47) and (48), and matrices $M_a$ satisfying Assumption 1, for every policy $\pi$ there exist vectors $\mathbf{q}_a$ such that, for every state $s$ and action $a$,

(51)  $\left| Q^\pi(s, a) - \boldsymbol{\phi}(s)^\top \mathbf{q}_a \right| \le \frac{\varepsilon_r}{1 - \gamma} + \frac{(1 + \gamma) \, \varepsilon_\psi \max_a \left\| \mathbf{w}_a \right\|_2}{\gamma (1 - \gamma)}.$

The value-error bound in Eq. (51) is also linear in $\varepsilon_\psi$ and, similarly to the rollout bound in Eq. (50), this linear dependency stems from the fact that $\varepsilon_\psi$ bounds the prediction error of a discounted multi-step model. Note that, if approximation errors are very high, both error bounds in Theorem 3 and Theorem 4 can grow to a value larger than the reward or value range of the MDP at hand.
4.4 Learning Model Features
Using the previously presented results, model features, and thus model reductions, can be approximated via (stochastic) gradient descent on the loss objective

(52)  $\mathcal{L} = \mathbb{E}_{(s, a, r, s')}\left[ \left( r - \boldsymbol{\phi}_{\boldsymbol{\theta}}(s)^\top \mathbf{w}_a \right)^2 + \left\| F_a \boldsymbol{\phi}_{\boldsymbol{\theta}}(s) - \left( \boldsymbol{\phi}_{\boldsymbol{\theta}}(s) + \gamma \, \boldsymbol{\psi}(s') \right) \right\|_2^2 \right],$

where the expectation ranges over all possible transitions that are sampled as training data. For all experiments, we assume that $\pi$ selects actions uniformly at random. The gradient of $\mathcal{L}$ is computed with respect to all parameters $\boldsymbol{\theta}$, $\mathbf{w}_a$, and $F_a$, and $\boldsymbol{\phi}_{\boldsymbol{\theta}}$ could be any arbitrary function parametrized by some weight vector $\boldsymbol{\theta}$. The experiments presented in this section focus on finite state and action spaces; hence

(53)  $\boldsymbol{\phi}_{\boldsymbol{\theta}}(s) = \Theta \mathbf{e}_s,$

where $\mathbf{e}_s$ is a one-hot bit vector representation of the state $s$, and $\Theta$ is a real-valued matrix. For stochastic gradient descent, a transition data set is collected, and each gradient update is computed by sampling a (sub)set of transitions from the entire transition data set.
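A sketch of evaluating this objective on a minibatch under the linear parametrization of Eq. (53); names and the batch layout are ours, and gradients with respect to $\Theta$, $\mathbf{w}_a$, and $F_a$ can then be taken with any autodiff tool (whether the gradient flows through the SF target differs between Eq. (52) and the stabilized variant used in Eq. (56) below).

```python
import numpy as np

def model_feature_loss(Theta, W, F, batch, gamma):
    """Monte-Carlo estimate of the loss in Eq. (52) under the linear
    parametrization phi(s) = Theta e_s of Eq. (53). batch holds
    transitions (s, a, r, s2, a2) with integer state indices; W[a] are
    reward weights and F[a] the linear SF models."""
    total = 0.0
    for s, a, r, s2, a2 in batch:
        phi, phi2 = Theta[:, s], Theta[:, s2]
        reward_err = (r - phi @ W[a]) ** 2           # reward term of Eq. (52)
        target = phi + gamma * F[a2] @ phi2          # one-step SF target
        sf_err = np.sum((F[a] @ phi - target) ** 2)  # SF term of Eq. (52)
        total += reward_err + sf_err
    return total / len(batch)
```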
To assess how well a learned feature representation can be used to predict the value function of an arbitrary policy, we fix a particular policy $\pi$ that is defined on the state space $\mathcal{S}$ and record the value error $\max_s \left| V^\pi(s) - \hat{V}^\pi(s) \right|$ for each policy $\pi$. Note that, for estimating $\hat{V}^\pi$, only the feature transition and reward models $F_a$ and $\mathbf{w}_a$ are used. The state-conditional value function can be expressed as a vector

(54)  $\hat{\mathbf{v}}^\pi = \Phi \mathbf{v}_\phi^\pi,$

where $\Phi$ is the matrix whose rows are the feature vectors $\boldsymbol{\phi}(s)^\top$. Algorithm 2 outlines how $\mathbf{v}_\phi^\pi$ is computed, which is similar to performing value iteration but with a fixed policy defined on the original state space $\mathcal{S}$. Line 4 in Algorithm 2 first computes the value function $\mathbf{v}^\pi$, which is defined on the original state space, because each entry of the vector corresponds to the value of some state $s$. The pseudo-inverse is used to solve the equation $\Phi \mathbf{v}_\phi^\pi = \mathbf{v}^\pi$ for the variable $\mathbf{v}_\phi^\pi$. In all our experiments, we found that the left Moore–Penrose pseudo-inverse of $\Phi$ exists, because the rows of the matrix span all $n$ dimensions, where $n$ is the dimension of the learned feature representation. The following sections present empirical results on two example MDPs and discuss why the row space of $\Phi$ spans all $n$ dimensions.
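The sketch below is our reconstruction of this evaluation procedure as described, not a transcription of Algorithm 2 itself: a fixed-policy value iteration that uses only the learned feature model and projects back into feature space with the pseudo-inverse at every sweep.

```python
import numpy as np

def evaluate_policy_with_feature_model(Phi, M, W, pi, gamma, n_iter=1000):
    """Fixed-policy value iteration with a learned feature model.
    Phi: (S, n) matrix whose rows are phi(s)^T; M[a], W[a]: learned
    transition/reward models; pi: (S, A) action probabilities of the
    evaluated policy on the original state space."""
    n_states, n = Phi.shape
    n_actions = pi.shape[1]
    v_phi = np.zeros(n)  # value weights in feature space
    for _ in range(n_iter):
        # Backup on the original state space: entry s of v is the value
        # of state s under pi, as predicted by the feature model.
        q = np.stack([Phi @ W[a] + gamma * Phi @ M[a].T @ v_phi
                      for a in range(n_actions)], axis=1)  # (S, A)
        v = (pi * q).sum(axis=1)
        # Project back into feature space with the pseudo-inverse.
        v_phi = np.linalg.pinv(Phi) @ v
    return Phi @ v_phi  # state-conditional value estimate, Eq. (54)
```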
4.4.1 Column Grid World
We tested our implementation on the column world shown in Figure 1. In column world, transitions are deterministic and the gradient is computed for each transition individually. Because the feature representation is linear in a one-hot state representation (Eq. (53)), the feature matrix $\Theta$ is obtained by minimizing the loss objective

(55)  $\mathcal{L}(\Theta, \{\mathbf{w}_a\}, \{F_a\}) = \frac{1}{|\mathcal{S}|} \sum_{a} \left( \left\| \mathbf{r}_a - \Phi \mathbf{w}_a \right\|_2^2 + \left\| \Phi F_a^\top - \left( \Phi + \gamma P_a \Phi F_\pi^\top \right) \right\|_2^2 \right).$

The factor $\frac{1}{|\mathcal{S}|}$ appears in Eq. (55) because the L2 matrix norm sums the L2 norm of each row rather than averaging over all rows. Optimizing the loss objective in Eq. (55) with a gradient optimizer is equivalent to performing batch gradient descent on the loss objective in Eq. (52).
Figure 1(c) shows the initial feature representation, which was sampled uniformly at random. Figures 1(d), 1(e), and 1(f) show how the feature representation evolves as the loss objective is optimized. One can observe that, after 1000 gradient steps, the learned feature representation assigns approximately the same feature vector to bisimilar states. In that sense, the learned feature representation is an approximation of a model reduction. Figure 3 plots the value error for a range of $\epsilon$-greedy policies at each gradient iteration. Each $\epsilon$-greedy policy selects actions uniformly at random with probability $\epsilon$ and otherwise selects the optimal action. Given the discount factor $\gamma$, the value function can span the interval $\left[ 0, \frac{1}{1 - \gamma} \right]$. One can observe that, after around 400 iterations, the value error of each tested policy is comparably small. During training, uniform random action selection ($\epsilon = 1$) has the lowest prediction error, and prediction errors increase as $\epsilon$ tends to zero and the policy becomes more similar to the optimal policy. This behaviour is expected, because the SF representation was trained for uniform random action selection, and approximation errors limit the learned representation's ability to generalize to different policies. In this experiment, we found that Assumption 1 may not hold during training; however, the value error was comparably small and the optimization always converged to a solution. Appendix B lists all hyperparameters needed to reproduce this experiment.
Because we use a linear action model, feature-to-feature transitions are modelled with a matrix multiplication. If the feature representation maps states to one-hot bit vectors, then any particular transition function can be represented with stochastic transition matrices $M_a$. Hence, if the learned cluster centers in Figure 1(f) all fell on some one-hot bit vector, then this solution would have a zero loss value. Because the solution learned for column world is approximate, the feature centers in Figure 1(f) only form a linearly independent set, and the learned feature vectors do not fall exactly on the cluster centers. However, the number of bisimilar partitions still appears in the form of the feature dimension of the learned solution, because in three dimensions one can represent at most three linearly independent vectors. Because the feature representation is approximately clustered into three linearly independent vectors, and because it is initialized uniformly at random, the feature matrix $\Theta$ has a row space spanning all three dimensions. A similar argument applies to the following puddle world experiments.

4.4.2 Puddle World Experiments
Puddle world (Boyan and Moore, 1995) is a grid-world navigation task where the agent has to navigate to a goal location to collect a +1 reward while avoiding a puddle. Figure 4(a) shows a map of puddle world. Entering a puddle grid cell results in a negative reward for each transition. The agent selects one out of four actions to move up, down, left, or right. Transitions are successful with 90% probability; with 10% probability, the agent transitions to a cell orthogonal to the intended direction or does not move.
The only perfect model reduction of the state space of puddle world is the identity map. Suppose two adjacent states were clustered together. Then the compressed state space would lose the exact position information in the grid, rendering accurate predictions of expected future rewards impossible if only the compressed feature model is used.
Stochastic gradient descent was used to learn a feature representation for puddle world. To further stabilize training, the SF target was assumed to be fixed and no gradients are computed through it. Given a transition data set, the following expectation was sampled:

(56)  $\mathcal{L} = \mathbb{E}_{(s, a, r, s')}\left[ \left( r - \boldsymbol{\phi}_{\boldsymbol{\theta}}(s)^\top \mathbf{w}_a \right)^2 + \left\| F_a \boldsymbol{\phi}_{\boldsymbol{\theta}}(s) - \mathbf{y}_{\text{SF}} \right\|_2^2 \right],$

where $\mathbf{y}_{\text{SF}}$ is the SF target (similar to the SF-learning algorithm). The transition data set for puddle world was generated by iterating over the state space and sampling one transition starting at each state, until a transition data set of 1000 transitions was reached. Actions were sampled uniformly at random to generate a transition.
To assess which representations approximate model features find, we constrained the feature representation to 80 dimensions, effectively forcing the algorithm to over-compress the state space into 80 linearly independent features and to introduce some approximation error. During different stages of training, the feature vector for each state was computed, and this set of vectors was clustered into 80 clusters using k-means clustering (Bishop, 2006). Figures 4(b), 4(c), and 4(d) illustrate the found clustering by linking states together that have the same closest cluster centroid in common. Figure 4(b) shows the clustering obtained for the feature initialization, where feature vectors are sampled uniformly at random. One can observe that, through training (Figures 4(c) and 4(d)), clusters that contain states that are far apart in the grid are broken up, and states that are close together in the grid are mapped to feature vectors that are close together in Euclidean norm. As expected, the approximate feature representation attempts to preserve the position in the grid in order to ensure accurate predictions of future reward sequences, and thus resembles an approximate model reduction.

Overall, stochastic gradient descent was performed on the loss objective for 200,000 steps. Figure 5 shows the prediction errors for the learned feature representation at different stages of training. Initially, value errors are high, but they are reduced significantly during training (Figure 5(a)). Similar to the column world experiments, uniform random action selection ($\epsilon = 1$) produces the lowest value errors, because the SFs are trained for this policy. As $\epsilon$ is decreased and the policy becomes more similar to the (greedy) optimal policy, value errors increase, because approximation errors limit, to some extent, the learned representation's ability to generalize to arbitrarily different policies. However, at the end of training, all policies can be predicted with comparably low value errors. Note that Figure 5(a) plots, for each policy, the highest absolute value difference (the maximum is computed over all 100 states). Figure 5(b) plots the expected reward prediction error for 200-step action sequences. For each curve, 100 state and action sequence pairs were sampled uniformly at random, and the expected reward was computed for each time step using the feature model. At initialization, expected reward prediction errors are at a comparably high level and then decrease to about 0.025 as training progresses. Because model features directly attempt to approximate a model reduction and effectively learn a multi-step model, the reward-prediction-error curves move down horizontally as training progresses. The discount factor discounts the $t$-step feature prediction errors with $\gamma^t$. Increasing the discount factor towards one then results in the model assigning more weight to predictions that range over longer time steps. In this experiment, the chosen discount factor, and thus a 200-step rollout length, is sufficient to estimate a discounted value function: if rewards that lie more than 200 time steps into the future were predicted at random within the reward range, then the value error could be influenced by at most $\gamma^{200}$ times a constant proportional to the reward range. If the goal is to predict rewards further than 200 time steps into the future, then one would have to further increase the discount factor $\gamma$.
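The cluster analysis behind Figures 4(b)-4(d) can be reproduced with any off-the-shelf k-means implementation; a sketch assuming scikit-learn is available and that the feature matrix holds one learned feature vector per state (random stand-in data here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Phi: one feature vector per grid state; in the experiment these are
# the trained 80-dimensional features of the 100 puddle world states.
Phi = np.random.rand(100, 80)

# Group states whose feature vectors are close in Euclidean norm.
labels = KMeans(n_clusters=80, n_init=10).fit_predict(Phi)

# States sharing a label are the ones linked together in Figure 4.
for cluster in range(3):
    print("cluster", cluster, "->", np.where(labels == cluster)[0])
```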
Figure 6 shows three example rollouts at three different stages during training. We only evaluate expected future rewards. Because the transition function is stochastic, the reward-prediction curves in Figure 6 are averaged over multiple reward values and are thus smoothed. All hyperparameters for the puddle world experiment are documented in Appendix B.
5 Discussion
We demonstrate how SFs can be used for model-based RL by learning a feature representation and optimizing over the space of all possible value functions. Given a fixed basis function, Section 3 shows how learning SFs is equivalent to learning value functions through Q-learning. However, Q-learning does not only evaluate a particular control policy, but also searches for the optimal policy. For finite state and action spaces, Strehl et al. (2006) show that such model-free algorithms have a sample complexity exponential in the size of the state space and converge much more slowly than model-based algorithms such as R-Max (Strehl et al., 2009). A similar argument applies to learning SFs. For finite state and action spaces and given a fixed policy $\pi$, the SFs for this policy can be directly computed with $\Psi_\pi = (I - \gamma P_\pi)^{-1}$ and Eq. (46), which are both at most $O(|\mathcal{S}|^3)$ operations.^{4}