1 Introduction
Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton & Barto, 1998) studies the problem of computing an optimal control strategy using one-step interactions sampled from an environment. For each selected action, the environment also provides a reward, a single scalar number. The goal is to compute a control strategy, also called a policy, that maximizes the cumulative reward received while interacting with the environment. One challenge in this setting is transferring knowledge from one environment to another when only the reward specification changes and the rest of the environment stays fixed. In this paper, we consider the approach presented by Barreto et al. (2016), which uses Successor Features (SF) to compute a representation of the environment that can be transferred across different reward functions. We present an implementation of this method and show that while learning a SF representation has significant benefits for transfer, it also has some fundamental limitations.
2 Background
We consider a Markov Decision Process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$ with a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. The transition function $p$ specifies with $p(s, a, s')$ the probability of transitioning from a state $s$ to a state $s'$ when selecting an action $a$. For every such transition, the reward is specified by the reward function $r(s, a, s')$. Further, we assume a discount factor $\gamma$ that weights the tradeoff between immediate and long-term rewards.
Let $\pi$ be a policy that specifies the distribution with which actions are selected, conditioned on the current state. The Q-function of this policy is defined as
$$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s, a_1 = a, \pi \right], \tag{1}$$
where the expectation is over all possible infinite-length trajectories in $M$ and $r_t$ is the reward at time step $t$.
Several algorithms have been developed to estimate a Q-function; however, one important question is how to represent the current Q-function estimate. For example, suppose an MDP $M$ has $n$ states and $m$ actions; then an estimate of the Q-function can be stored in a vector $\theta$ of dimension $nm$:
$$\theta = \left[ Q(s_1, a_1), \dots, Q(s_n, a_m) \right]^{\top}. \tag{2}$$
To compute the Q-value for a state-action pair $(s, a)$, a basis function $\phi$ with
$$Q(s, a) = \phi_{s,a}^{\top} \theta \tag{3}$$
can be used, where $\phi_{s,a}$ is a one-hot bit vector of dimension $nm$. Basis functions can also be generalized to have different forms to further improve the scalability of different learning algorithms (Sutton, 1996; Konidaris et al., 2011).
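The tabular representation described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `one_hot_basis`, the row-major indexing of state-action pairs, and the example values are assumptions made here, not details taken from the paper.

```python
import numpy as np

def one_hot_basis(s, a, n_states, n_actions):
    """Return the one-hot vector of dimension n*m that tabulates the pair (s, a).

    The row-major index s * n_actions + a is an assumed convention.
    """
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

n_states, n_actions = 3, 2

# A hypothetical Q-table, flattened into the weight vector theta.
q_table = np.arange(n_states * n_actions, dtype=float).reshape(n_states, n_actions)
theta = q_table.reshape(-1)

# The Q-value of (s, a) is the inner product of the basis vector and theta.
q_value = one_hot_basis(2, 1, n_states, n_actions) @ theta
```

With a one-hot basis the inner product simply selects one entry of the weight vector, which is why this representation can store any Q-function exactly.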
3 Learning Successor Features for Transfer
Dayan (1993) presented Successor Features (SFs), a particular type of basis function that represents a state as a feature vector such that under a given policy the feature representation is similar to the feature representation of its successor states. The idea originates from the Bellman fixed-point equation,
$$Q^{\pi}(s, a) = \mathbb{E}\left[ r(s, a, s') + \gamma Q^{\pi}(s', a') \,\middle|\, s, a, \pi \right], \tag{4}$$
where $s'$ is the sampled next state and $a'$ is the sampled next action at state $s'$. If the Q-function is approximated linearly, then
$$\phi_{s,a}^{\top} \theta = \mathbb{E}\left[ r(s, a, s') + \gamma \phi_{s',a'}^{\top} \theta \,\middle|\, s, a, \pi \right]. \tag{5}$$
Note that, depending on the choice of basis function, (5) may not hold exactly because we only estimate a linear approximation of the true Q-function. The objective of finding a good SF representation is to find a basis function such that (5) holds as exactly as possible.
Barreto et al. (2016) revisited this approach in the context of transferring a feature representation within a set of MDPs where only the reward function varies. While various approaches to this problem have been presented (see Taylor & Stone (2009) for a survey), Barreto et al. approach this transfer problem by learning a feature representation that is descriptive of the entire set of MDPs and can be used for transfer across different reward functions.
Intuitively, the Q-function combines information about the reward function itself, as well as the temporal ordering of the received rewards. This temporal ordering is induced by the current policy and the transition dynamics of the MDP that determine which trajectories are generated.
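For transfer, Barreto et al. present an approach that isolates the reward function from the Q-function. They define a basis function $\phi$ to parametrize the reward function with
$$r(s, a, s') = \phi_{s,a}^{\top} w. \tag{6}$$
Since (6) is stated as a strict equality, the assumption is made that $\phi$ is not too restrictive and the reward function can be represented exactly. Using this assumption, Barreto et al. rewrite the Q-function as
$$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} \phi_t^{\top} w \,\middle|\, s_1 = s, a_1 = a, \pi \right] = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} \phi_t \,\middle|\, s_1 = s, a_1 = a, \pi \right]^{\top} w, \tag{7}$$
where $\phi_t$ is the reward feature at time step $t$ for a trajectory started at $(s, a)$. Suppose $\phi$ is a basis function that tabulates the state-action space, i.e. $\phi_{s,a}$ is a one-hot bit vector of dimension $nm$. In this case, the weight vector $w$ can be thought of as the full reward model written out as a vector. This means (7) can be interpreted as a separation of the Q-function into a (linear) factor describing rewards only and a (linear) factor describing the ordering with which rewards are observed. Hence, Barreto et al. propose to learn a Successor Feature $\psi^{\pi}$ satisfying
$$\psi^{\pi}_{s,a} = \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} \phi_t \,\middle|\, s_1 = s, a_1 = a, \pi \right], \quad \text{so that} \quad Q^{\pi}(s, a) = (\psi^{\pi}_{s,a})^{\top} w. \tag{8}$$
In addition, they also present policy improvement theorems similar to the usual dynamic programming improvement theorems (Sutton & Barto, 1998).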
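To make the separation into a reward factor and an ordering factor concrete, the following NumPy sketch computes the SF of a fixed policy in closed form for a small random MDP with tabular reward features, and checks that the product of SF and reward weights reproduces the Q-function obtained by ordinary iterative policy evaluation. The MDP, the uniform policy, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def state_action_transitions(P, pi):
    """Build the (n*m) x (n*m) matrix with entries P[s, a, s'] * pi[s', a']."""
    n, m, _ = P.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(n * m, n * m)

def successor_features(P, pi, gamma):
    """Closed-form SF for tabular one-hot reward features.

    Row (s, a) of the result is the expected discounted sum of one-hot
    features along trajectories started at (s, a) and following pi.
    """
    P_pi = state_action_transitions(P, pi)
    d = P_pi.shape[0]
    return np.linalg.solve(np.eye(d) - gamma * P_pi, np.eye(d))

rng = np.random.default_rng(0)
n, m, gamma = 4, 2, 0.9

# Hypothetical random MDP: transition probabilities and tabular reward weights.
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)
pi = np.full((n, m), 1.0 / m)   # uniform random policy
w = rng.random(n * m)           # reward of each state-action pair

# Q-function obtained from the SF factorization.
Psi = successor_features(P, pi, gamma)
q_from_sf = Psi @ w

# Reference: iterative policy evaluation with the Bellman equation.
P_pi = state_action_transitions(P, pi)
q_iter = np.zeros(n * m)
for _ in range(2000):
    q_iter = w + gamma * P_pi @ q_iter
```

Changing `w` while keeping `Psi` fixed yields the Q-function of the same policy under a different reward function without any further learning, which is the property the transfer approach relies on.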
3.1 Algorithm Derivation
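Similar to Fitted Q-iteration (Antos et al., 2006), DQN (Mnih et al., 2015), and the method outlined by Zhang et al. (2016), we derive a learning algorithm that fits a reward model and SF model by simultaneously minimizing two loss functions. The reward model is fitted by minimizing the reward loss
$$L_r(w) = \mathbb{E}\left[ \left( r - \phi_{s,a}^{\top} w \right)^2 \right], \tag{9}$$
where the expectation is with respect to some visitation distribution over the state-action space $\mathcal{S} \times \mathcal{A}$, and where the scalar $r$ is the reward received for a particular transition.
The SF is learned by first estimating a target
$$y_{s,a,s'} = \phi_{s,a} + \gamma \psi_{s',a'} \tag{10}$$
for every collected transition $(s, a, r, s')$, where $a'$ is the next action selected at $s'$. For computing this target, the SF estimate $\psi$ of the previous update iteration is used. Unlike Mnih et al.’s Deep Q-learning, the target is a vector and not a single scalar variable. For learning a SF representation, the loss objective
$$L_{\psi}(F) = \mathbb{E}\left[ \left\| \psi_{s,a} - y_{s,a,s'} \right\|^2 \right] \tag{11}$$
is used, where $F$ denotes the parameters of the SF model. For a SF that is linear in the basis function, $\psi_{s,a} = F \phi_{s,a}$, the gradient of (11) with respect to the parameters $F$ is
$$\nabla_F L_{\psi}(F) = 2\, \mathbb{E}\left[ \left( \psi_{s,a} - y_{s,a,s'} \right) \phi_{s,a}^{\top} \right], \tag{12}$$
which is similar to the gradient used by Deep Q-learning with the distinction that (12) is a matrix rather than a vector, and (11) is defined on the SF $\psi$, rather than Q-values.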
Algorithm 1 outlines the implemented SF learning method. Learning is stabilized by sampling a batch of transitions and using the entire batch to make a gradient descent update.
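As a rough illustration of such a batch update, the sketch below performs one gradient step on a squared reward loss and one on a squared SF loss with a frozen bootstrap target. It is not the paper's Algorithm 1: it assumes a SF that is linear in the reward features with a weight matrix `F`, uses plain gradient descent instead of Adagrad, ignores terminal states, and all names and step sizes are hypothetical.

```python
import numpy as np

def fitted_sf_step(phi, r, phi_next, w, F, F_prev, gamma, lr_w, lr_F):
    """One batch gradient step on the reward loss and the SF loss.

    phi      : (B, d) reward features of the visited state-action pairs
    r        : (B,)   observed rewards
    phi_next : (B, d) reward features of the next state-action pairs
    F_prev   : (d, d) SF weights of the previous iteration (target network)
    """
    batch_size = phi.shape[0]

    # Reward model: gradient of the mean squared error between phi^T w and r.
    reward_error = phi @ w - r
    grad_w = 2.0 * phi.T @ reward_error / batch_size

    # SF target: current feature plus discounted SF of the next pair.
    y = phi + gamma * phi_next @ F_prev.T

    # SF model: gradient of the mean squared error between F phi and y.
    sf_error = phi @ F.T - y
    grad_F = 2.0 * sf_error.T @ phi / batch_size

    return w - lr_w * grad_w, F - lr_F * grad_F

# Hypothetical two-pair cycle: pair 0 leads to pair 1 and pair 1 leads to pair 0.
phi = np.eye(2)
phi_next = np.eye(2)[[1, 0]]
r = np.array([1.0, 0.0])
gamma = 0.5

w, F = np.zeros(2), np.zeros((2, 2))
for _ in range(200):
    w, F = fitted_sf_step(phi, r, phi_next, w, F, F.copy(), gamma, 0.5, 0.5)

# Q-values recovered from the learned SF and reward weights.
q = phi @ F.T @ w
```

In this toy cycle the learned quantities can be checked by hand: the Q-values must satisfy $Q_0 = 1 + \gamma Q_1$ and $Q_1 = \gamma Q_0$, which gives $Q_0 = 4/3$ and $Q_1 = 2/3$ for $\gamma = 0.5$.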
4 Experiments: Grid World
Algorithm 1 is first evaluated on a grid world navigation task with four actions: up, down, left, or right. Transitions are stochastic and with a 5% probability the agent moves sideways. Rewards are set to 1 for entering the goal cell (terminal state) in the top right corner, and otherwise a zero reward is given. Every episode is started in the bottom right corner and the discount factor is set to . Actions are selected using an $\varepsilon$-greedy policy with respect to the current Q-value estimates: with probability $\varepsilon$ actions are selected uniformly at random and with probability $1 - \varepsilon$ the action with the highest Q-value estimate is used.
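The action selection rule described above can be sketched as follows. The function name `epsilon_greedy` and the example Q-values are placeholders; the exploration probability used in the experiments is not reproduced here.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.9, 0.2])  # hypothetical estimates for four actions

greedy_action = epsilon_greedy(q_values, 0.0, rng)   # never explores
explored = [epsilon_greedy(q_values, 1.0, rng) for _ in range(100)]  # always explores
```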
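We compare our Fitted SF implementation against a Fitted Q-iteration implementation. To ensure a fair comparison, Fitted Q-iteration is identical to Fitted SF except that Fitted Q-iteration minimizes the loss objective
$$L_q(\theta) = \mathbb{E}\left[ \left( \phi_{s,a}^{\top} \theta - y_{s,a,r,s'} \right)^2 \right], \tag{13}$$
where the target is set to
$$y_{s,a,r,s'} = r + \gamma V(s'). \tag{14}$$
The value estimate is $V(s') = \max_{a'} Q(s', a')$, and $Q$ is the Q-function estimate of the previous iteration.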
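A minimal sketch of how the regression targets in (14) might be computed for a batch is given below. It assumes the usual Fitted Q-iteration convention that the value of the next state is the maximum over the previous iteration's Q-values, and that the bootstrap term is dropped at terminal states; both are assumptions made for this illustration, as are all names.

```python
import numpy as np

def fitted_q_targets(r, q_prev_next, gamma, terminal):
    """Compute scalar targets r + gamma * V(s') for a batch of transitions.

    q_prev_next : (B, m) previous-iteration Q-values at the next states
    terminal    : (B,)   True where the transition entered a terminal state
    """
    v_next = q_prev_next.max(axis=1)
    # Assumption: no bootstrapping from terminal states.
    return r + gamma * np.where(terminal, 0.0, v_next)

r = np.array([0.0, 1.0])
q_prev_next = np.array([[1.0, 2.0], [3.0, 4.0]])
terminal = np.array([False, True])
targets = fitted_q_targets(r, q_prev_next, 0.5, terminal)
```

In contrast to the SF target, each of these targets is a single scalar per transition.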
In all experiments, the Q-function in Fitted Q-iteration uses a basis function tabulating the state-action space and the weight vector $\theta$ is learned as described in (2). Further, the basis function $\phi$ used for estimating the reward model (6) also tabulates the state-action space; that is, the reward model can always exactly represent the true reward function. The SF representation is learned as a linear transform on the tabular basis function $\phi$ with a weight matrix $F$:
$$\psi_{s,a} = F \phi_{s,a}. \tag{15}$$
Because all basis functions are chosen to be tabular, and SFs are linear in a tabular one-hot basis function, neither algorithm is constrained in its representation and both can always capture the true value function, reward model, and successor features.
4.1 Single Task Learning
Figure 1 compares the performance of the Fitted SF algorithm against Fitted Q-iteration. Both algorithms converge to a good solution and can perform the navigation task in few steps at the end of training (note that the control policy was constrained to remain $\varepsilon$-greedy throughout). The Fitted SF algorithm converges more slowly, which can be explained by the fact that it has to learn a full reward model before it can form good Q-value estimates. Figure 4 shows that the Fitted SF algorithm robustly minimizes both its loss objectives.
Evolution of the loss objectives. Fitted SF minimizes its two loss objectives using the Adagrad gradient descent optimizer implemented in TensorFlow (Abadi et al., 2015). A learning rate of performed best for the loss objective and a learning rate of performed best for the loss objective . The Fitted Q-iteration implementation performed best with a learning rate of . Otherwise TensorFlow’s default parameters were used.
4.2 Multi Task Learning
The Fitted SF algorithm was also tested in two transfer settings where the start and goal locations are changed periodically between a fixed set of different locations. Changing the goal location is equivalent to changing the reward function while holding the transition dynamics fixed.
Transfer with Slight Reward Changes
Figure (a) compares the episode length of the Fitted Q-iteration implementation and Fitted SF implementation when start and goal locations are moved by one grid cell. Once the reward function is changed, the weight parameter of the Fitted SF algorithm is reinitialized to zero. For Fitted Q-iteration the trained weights are kept after every reward function change. While initial training is slower for the Fitted SF algorithm, a change in reward function degrades its performance significantly less than that of Fitted Q-iteration, demonstrating the robustness of the Fitted SF algorithm. Figure (b) compares two different resetting strategies for the Fitted SF learning algorithm: in one run all weights are reinitialized after a reward function change, while in the other the learned SF is kept between training rounds. One can see that keeping the SF weight matrix boosts performance significantly. This supports the assumption presented by Barreto et al.
Transfer with Significant Reward Changes
To further test if SFs can be used for transfer between different domains, both algorithms are evaluated again on the same grid world, but the goal location is rotated through all four corners of the grid. The start location is always the corner diagonally across the grid from the goal. Changing start and goal locations in this way causes the reward function and the optimal policy to change more significantly.
To further stabilize learning and ensure sufficient exploration, both algorithms select actions using an $\varepsilon$-greedy policy. The probability $\varepsilon$ is annealed from 1.0 to 0.1 as a function of the episode index, which is reset to zero after every reward function change. Ensuring sufficient exploration allows the Fitted SF algorithm to efficiently re-estimate its reward model.
Figure 4.2 compares the episode length of both algorithms over several repeats of the four goal locations. The ordering of the different goal locations is not changed during the experiment. One can see that the change in reward function has an impact on both algorithms, but the Fitted SF algorithm outperforms Fitted Q-iteration significantly. Table 1 confirms this by comparing the average episode length across all episodes. Figure 12 shows how the loss functions of the Fitted SF algorithm evolve during the experiment. Updates were done only every 100 steps (each gradient update used a batch of 100 transitions). As expected, the reward loss does not decrease steadily but oscillates instead. However, the estimates seem to be good enough to achieve a significant performance difference over Fitted Q-iteration. Interestingly, the SF loss oscillates during training between very low and high values.
Figure 9 shows a failure setting of the Fitted SF algorithm: if $\varepsilon$ is held fixed and is not annealed, only the first optimal policy and the first reward function are learned and then preserved across all subsequent changes. As a result, one can see a learning curve for the first 400 episodes and then Fitted SF hits the episode timeout of 200 steps for the next reward configuration. If a reward function similar to the first is presented to the agent again, Fitted SF solves this problem easily because it reuses the weights it has learned at the beginning of the experiment. In other words, Fitted SF is not able to transfer the solution learned in the first 400 episodes to the other tested reward functions.
5 Discussion
The goal of using SFs is to capture a feature set common to a set of MDPs and this idea seems to perform well for transfer between these MDPs. Interestingly, Figure (a) shows that the SF loss objective oscillates despite the fact that the algorithm recovers a near-optimal policy quickly.
To get a better understanding why the loss objective oscillates, consider the transfer example shown in Figure 13. In this example, the two MDPs have two actions and deterministic transitions indicated by arrows. Rewards are indicated by the arrow labels and the two MDPs only differ in reward for two specific transitions. This difference in reward causes the optimal policy for each MDP to be different: The policy , which only selects action , is optimal in the first MDP; the policy , which selects action at state and action elsewhere, is optimal in the second MDP. The left side of Figure 13 shows the successor feature for both optimal policies, which is different for the two MDPs. This difference is caused because SFs are constrained to be similar to features the agent sees in the future. However, which features are seen is governed by the (optimal) policy. This highlights a key limitation of using Successor Features for transfer: the learned representation is not transferrable between optimal policies. When solving a previously unseen MDP, a learned SF representation can only be used to initialize the search for an optimal policy and the agent still has to adjust the SF representation to the policy that is only optimal in the current MDP.
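The policy dependence of SFs can also be checked numerically. The sketch below uses a hypothetical two-state, two-action deterministic MDP (not the one shown in Figure 13) in which action 0 stays in the current state and action 1 switches to the other state, and computes the SF in closed form for two different deterministic policies. All names and the discount factor of 0.9 are assumptions made for this illustration.

```python
import numpy as np

def successor_features(P, pi, gamma):
    """Closed-form SF for tabular one-hot features; row (s, a) is psi for that pair."""
    n, m, _ = P.shape
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n * m, n * m)
    return np.linalg.solve(np.eye(n * m) - gamma * P_pi, np.eye(n * m))

# Hypothetical deterministic MDP: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, 1 - s] = 1.0

pi_stay = np.array([[1.0, 0.0], [1.0, 0.0]])    # always selects action 0
pi_switch = np.array([[0.0, 1.0], [0.0, 1.0]])  # always selects action 1
gamma = 0.9

psi_stay = successor_features(P, pi_stay, gamma)
psi_switch = successor_features(P, pi_switch, gamma)

# SF of the pair (state 0, action 0) under each policy.
print(psi_stay[0])
print(psi_switch[0])
```

Under the first policy the agent revisits the pair (state 0, action 0) forever, so all discounted feature mass falls on that pair. Under the second policy the same starting pair is followed by an alternation between the two switching pairs, so the two SF vectors differ even though the transition dynamics are identical.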
The fact that the SF representation has to be relearned for each individual MDP can be seen in our experiments. In Figure (a) this relearning contributes to the oscillations of the SF loss objective. In the failure case shown in Figure 9 the SF representation does not transfer at all and instead represents an initialization that the gradient optimizer cannot use to adjust to the new reward function. This behaviour is not surprising because in this experiment the goal location was changed to a different corner in the grid, causing the optimal policy to change significantly. In the positive test case shown in Figure 4.2 this is mitigated by first resetting the policy to uniformly random exploration (by annealing $\varepsilon$ from 1.0 to 0.1), which can be thought of as smoothing the transitions between different reward functions.
This result also agrees with the first transfer experiment shown in Figure 7. Because the reward function and optimal policy are only changed slightly, the SF representations corresponding to each optimal policy and reward function are likely to be very similar. As a result, the algorithm can adjust to the new reward function very quickly. Barreto et al. also presented empirical results using a variation of Generalized Value Iteration (Sutton & Barto, 1998) on a version of Puddle World (Sutton, 1996) where the location of the puddle changed slightly. Their experiment, which shows a significant performance boost from transferring a SF representation, is similar to our slight reward change test case because the changes in the reward function did not cause a drastic change in the optimal policy.
6 Conclusion
The presented empirical results demonstrate both an advantage and a disadvantage of transferring SFs between MDPs that only differ in reward function. While we were able to show a significant performance boost by using this approach, we also highlighted that the learned feature representation depends on the policy it is learned for. Hence, SF representations are an unsuitable choice in this context because one is typically interested in transferring knowledge between tasks with different optimal policies.
The fact that transferring a SF representation between tasks gives a significant boost in learning speed also suggests that learning a transferrable feature representation might be an interesting direction to pursue. However, such a feature representation needs to be independent of the task’s optimal policy.
References

Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
Antos et al. (2006) Antos, András, Szepesvári, Csaba, and Munos, Rémi. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In International Conference on Computational Learning Theory, pp. 574–588. Springer, 2006.
Barreto et al. (2016) Barreto, André, Munos, Rémi, Schaul, Tom, and Silver, David. Successor features for transfer in reinforcement learning. CoRR, abs/1606.05312, 2016. URL http://arxiv.org/abs/1606.05312.
Dayan (1993) Dayan, Peter. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
Kaelbling et al. (1996) Kaelbling, Leslie Pack, Littman, Michael L., and Moore, Andrew W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
Konidaris et al. (2011) Konidaris, George, Osentoski, Sarah, and Thomas, Philip. Value function approximation in reinforcement learning using the Fourier basis. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 380–385, August 2011.
Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Sutton (1996) Sutton, Richard S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, pp. 1038–1044, 1996.
Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. A Bradford Book. MIT Press, Cambridge, MA, 1 edition, 1998.
Taylor & Stone (2009) Taylor, Matthew E. and Stone, Peter. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
Zhang et al. (2016) Zhang, Jingwei, Springenberg, Jost Tobias, Boedecker, Joschka, and Burgard, Wolfram. Deep reinforcement learning with successor features for navigation across similar environments. arXiv preprint arXiv:1612.05533, 2016.