Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton & Barto, 1998) studies the problem of computing an optimal control strategy using one-step interactions sampled from an environment. For each selected action, the environment also provides a reward, a single scalar number. The goal is to compute a control strategy, also called a policy, that maximizes the cumulative reward received while interacting with the environment. One challenge in this setting is transferring knowledge about one environment to another when only the reward specification changes but the remaining specification of the environment stays fixed. In this paper, we consider the approach presented by Barreto et al. (2016), which uses Successor Features (SF) to compute a representation of the environment that can be transferred across different reward functions. We present an implementation of this method and show that while learning a SF representation has significant benefits for transfer, it also has some fundamental limitations.
We consider a Markov Decision Process (MDP) with a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. The transition function specifies with $p(s'|s,a)$ the probability of transitioning from a state $s$ to a state $s'$ when selecting an action $a$. For every such transition, the reward is specified by the reward function $r(s,a,s')$. Further, we assume a discount factor $\gamma \in [0,1)$ that weighs the tradeoff between immediate and long-term rewards.
Let $\pi(a|s)$ be a policy that specifies the distribution with which actions are selected, conditioned on the current state. The Q-function of this policy is defined as
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a \right],$$
where the expectation is over all possible infinite-length trajectories in the MDP and $r_t$ is the reward at time step $t$.
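As a concrete illustration of this definition, a sampled trajectory yields a Monte Carlo estimate of the Q-value by summing discounted rewards. The helper below is a hypothetical sketch (a finite truncation of the infinite sum), not part of the paper's implementation:

```python
# Sketch: estimate Q^pi(s, a) from one sampled trajectory as the
# discounted sum of rewards r_t (finite truncation of the infinite sum).

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

Averaging `discounted_return` over many trajectories starting from the same state-action pair approximates the expectation in the definition above.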
Several algorithms have been developed to estimate a Q-function; however, one important question is how to represent a current Q-function estimate. For example, suppose the state space of an MDP consists of $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions; then an estimate of the Q-function can be stored in a weight vector $\theta$ of dimension $|\mathcal{S}||\mathcal{A}|$:
$$Q(s,a) = \phi(s,a)^\top \theta. \qquad (2)$$
To compute the Q-value for a state-action pair $(s,a)$, a basis function $\phi$ can be used, where $\phi(s,a)$ is a one-hot bit vector of dimension $|\mathcal{S}||\mathcal{A}|$. Basis functions can also be generalized to have different forms to further improve the scalability of different learning algorithms (Sutton, 1996; Konidaris et al., 2011).
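The tabular case above can be sketched in a few lines of plain Python (hypothetical helper names; with a one-hot basis, the inner product reduces to a table lookup):

```python
# Sketch of a tabular one-hot basis function. For an MDP with n_states
# states and n_actions actions, phi(s, a) is a one-hot vector of
# dimension n_states * n_actions, and Q(s, a) = phi(s, a)^T theta.

def one_hot_basis(s, a, n_states, n_actions):
    """Return the one-hot feature vector for state-action pair (s, a)."""
    phi = [0.0] * (n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def q_value(s, a, theta, n_states, n_actions):
    """Q(s, a) = phi(s, a)^T theta; with a one-hot phi this is a lookup."""
    phi = one_hot_basis(s, a, n_states, n_actions)
    return sum(p * t for p, t in zip(phi, theta))
```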
3 Learning Successor Features for Transfer
Dayan (1993) presented Successor Features (SFs), a particular type of basis function that represents a state as a feature vector such that, under a given policy, the feature representation is similar to the feature representation of its successor states. The idea originates from the Bellman fixed-point equation,
$$Q^\pi(s,a) = \mathbb{E}\left[ r(s,a,s') + \gamma Q^\pi(s',a') \right],$$
where $s'$ is the sampled next state and $a' \sim \pi(\cdot|s')$ is the sampled next action at state $s'$. If the Q-function is approximated linearly, then
$$\phi(s,a)^\top \theta = \mathbb{E}\left[ r(s,a,s') + \gamma \phi(s',a')^\top \theta \right]. \qquad (5)$$
Note that, depending on the choice of basis function, (5) may not hold exactly because we only estimate a linear approximation of the true Q-function. The objective of finding a good SF representation is to find a basis function $\phi$ such that (5) holds as exactly as possible.
Barreto et al. (2016) revisited this approach in the context of transferring a feature representation within a set of MDPs where only the reward function varies. While various approaches to this transfer problem have been presented (see Taylor & Stone (2009) for a survey), Barreto et al. address it by learning a feature representation that is descriptive of the entire set of MDPs and can be used for transfer across different reward functions.
Intuitively, the Q-function combines information about the reward function itself, as well as the temporal ordering of the received rewards. This temporal ordering is induced by the current policy and the transition dynamics of the MDP that determine which trajectories are generated.
For transfer, Barreto et al. present an approach that isolates the reward function from the Q-function. They define a basis function $\phi$ to parametrize the reward function with
$$r(s,a,s') = \phi(s,a,s')^\top w. \qquad (6)$$
Since (6) is stated as a strict equality, the assumption is made that $\phi$ is not too restrictive and the reward function can be represented exactly. Using this assumption, Barreto et al. rewrite the Q-function as
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t \phi_t^\top w \right] = \psi^\pi(s,a)^\top w, \qquad (7)$$
where $\phi_t$ is the reward feature at time step $t$ for a trajectory started at $(s,a)$. Suppose $\phi$ is a basis function that tabulates the state-action space, i.e. $\phi(s,a)$ is a one-hot bit vector of dimension $|\mathcal{S}||\mathcal{A}|$. In this case, the weight vector $w$ can be thought of as the full reward model written out as a vector. This means (7) can be interpreted as a separation of the Q-function into a (linear) factor $w$ describing rewards only and a (linear) factor $\psi^\pi$ describing the ordering with which rewards are observed. Hence, Barreto et al. propose to learn a Successor Feature $\psi^\pi$ satisfying
$$\psi^\pi(s,a) = \mathbb{E}\left[ \phi_t + \gamma \psi^\pi(s',a') \right].$$
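The decomposition of the Q-function into a reward factor and a temporal-ordering factor can be checked numerically on a toy example. The sketch below assumes a hypothetical deterministic two-state chain (state 0 steps to state 1, which is absorbing), a one-hot reward feature over states, and reward weights $w = [0, 1]$; it verifies that the discounted sum of rewards equals the SF-based value $\psi^\top w$:

```python
# Sanity check of Q = psi^T w on an assumed toy chain (not from the
# paper): state 0 -> state 1 -> state 1 -> ..., reward feature phi is a
# one-hot over states, and r = phi^T w with w = [0, 1].

GAMMA = 0.5
W = [0.0, 1.0]                      # reward weights: only state 1 pays off

def phi(s):
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]

def rollout_states(s0, horizon=60):
    s, states = s0, []
    for _ in range(horizon):
        states.append(s)
        s = min(s + 1, 1)           # deterministic: 0 -> 1 -> 1 -> ...
    return states

def q_from_rewards(s0):
    """Q as the discounted sum of rewards r_t = phi(s_t)^T w."""
    return sum(GAMMA ** t * sum(f * w for f, w in zip(phi(s), W))
               for t, s in enumerate(rollout_states(s0)))

def q_from_sf(s0):
    """Q as psi^T w, with psi the discounted sum of reward features."""
    psi = [0.0, 0.0]
    for t, s in enumerate(rollout_states(s0)):
        psi = [p + GAMMA ** t * f for p, f in zip(psi, phi(s))]
    return sum(p * w for p, w in zip(psi, W))
```

Both computations agree, illustrating that the SF collects the discounted reward features so that changing only $w$ re-prices the same feature stream.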
In addition, they also present policy improvement theorems similar to the usual dynamic programming improvement theorems (Sutton & Barto, 1998).
3.1 Algorithm Derivation
Using this decomposition, we derive a learning algorithm that fits a reward model and a SF model by simultaneously minimizing two loss functions. The reward model is fitted by minimizing the reward loss
$$L_w = \mathbb{E}_\mu\!\left[ \big( \phi(s,a)^\top w - r \big)^2 \right],$$
where the expectation is with respect to some visitation distribution $\mu$ over the state-action space, and where the scalar $r$ is the reward received for a particular transition.
The SF is learned by first estimating a target
$$y = \phi(s,a) + \gamma \psi^-(s',a')$$
for every collected transition $(s,a,r,s',a')$. For computing this target, the SF estimate $\psi^-$ of the previous update iteration is used. Unlike Mnih et al.'s Deep Q-learning, the target $y$ is a vector and not a single scalar variable. For learning a SF representation, the loss objective
$$L_\psi = \mathbb{E}_\mu\!\left[ \left\| y - \psi(s,a) \right\|_2^2 \right] \qquad (11)$$
is used. The gradient of (11) with respect to the parameters $\theta_\psi$ of $\psi$ is
$$\nabla_{\theta_\psi} L_\psi = -2\, \mathbb{E}_\mu\!\left[ \big( y - \psi(s,a) \big)^\top \nabla_{\theta_\psi} \psi(s,a) \right].$$
Algorithm 1 outlines the implemented SF learning method. Learning is stabilized by sampling a batch of transitions and using the entire batch to make a gradient descent update.
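A minimal, plain-Python sketch of one such batch update is given below (hypothetical names; the actual implementation is tabular and uses TensorFlow, so this stand-in holds the SF as a per-state-action table and applies the two gradient steps directly):

```python
# Sketch of one Fitted SF batch update: a gradient step on the reward
# loss for w, and a gradient step on the SF loss using a target built
# from the previous iteration's SF estimate psi_prev.

def sf_batch_update(batch, psi, psi_prev, w, phi, gamma, lr_w, lr_psi):
    """batch: list of (s, a, r, s2, a2) transitions; psi, psi_prev:
    dicts mapping (s, a) -> feature list; w: reward weights; phi: basis
    function. Returns the updated (psi, w)."""
    for (s, a, r, s2, a2) in batch:
        f = phi(s, a)
        # Reward loss gradient: d/dw (phi^T w - r)^2 = 2 (phi^T w - r) phi
        err = sum(fi * wi for fi, wi in zip(f, w)) - r
        for i in range(len(w)):
            w[i] -= lr_w * 2.0 * err * f[i]
        # SF target uses the previous iteration's estimate psi_prev
        y = [fi + gamma * pi for fi, pi in zip(f, psi_prev[(s2, a2)])]
        # SF loss gradient (tabular case): move psi(s, a) toward y
        psi[(s, a)] = [p - lr_psi * 2.0 * (p - yi)
                       for p, yi in zip(psi[(s, a)], y)]
    return psi, w
```

Holding `psi_prev` fixed while updating `psi` over the batch mirrors the stabilized update described above.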
4 Experiments: Grid World
Algorithm 1 is first evaluated on a grid world navigation task with four actions: up, down, left, and right. Transitions are stochastic: with a 5% probability the agent moves sideways. Rewards are set to 1 for entering the goal cell (a terminal state) in the top right corner, and otherwise a zero reward is given. Every episode is started in the bottom right corner and a fixed discount factor $\gamma$ is used. Actions are selected using an $\epsilon$-greedy policy with respect to the current Q-value estimates: with probability $\epsilon$ actions are selected uniformly at random and with probability $1-\epsilon$ the action with the highest Q-value estimate is used.
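The $\epsilon$-greedy selection rule can be sketched as follows (hypothetical helper; `q_values` holds the current estimates for the four grid-world actions):

```python
# Sketch of epsilon-greedy action selection over current Q-estimates.
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a uniformly random action index,
    otherwise pick the action with the highest Q-value estimate."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```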
We compare our Fitted SF implementation against a Fitted Q-iteration implementation. To ensure a fair comparison, Fitted Q-iteration is identical to Fitted SF except that Fitted Q-iteration minimizes the loss objective
$$L_Q = \mathbb{E}_\mu\!\left[ \big( y - Q(s,a) \big)^2 \right],$$
where the target is set to
$$y = r + \gamma \max_{a'} Q^-(s',a').$$
The value estimate $Q^-$ is the Q-function estimate of the previous update iteration.
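The bootstrapped target can be sketched in a few lines (hypothetical helper; `q_prev` is the previous iteration's Q-estimate stored as a table):

```python
# Sketch of the Fitted Q-iteration target:
# y = r + gamma * max_a' Q_prev(s', a'), with r alone at terminal states.

def q_target(r, s2, q_prev, actions, gamma, terminal=False):
    """Compute the one-step bootstrapped target for a transition."""
    if terminal:
        return r
    return r + gamma * max(q_prev[(s2, a)] for a in actions)
```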
In all experiments, the Q-function in Fitted Q-iteration uses a basis function tabulating the state-action space and the weight vector $\theta$ is learned as described in (2). Further, the basis function used for estimating the reward model (6) also tabulates the state-action space; that is, the reward model can always exactly represent the true reward function. The SF representation is learned as a linear transform of the tabular basis function:
$$\psi(s,a) = \Psi^\top \phi(s,a),$$
where $\Psi$ is a learned weight matrix. Because all basis functions are chosen to be tabular, and SFs are linear in a tabular one-hot basis function, neither algorithm is constrained in its representation: both can always capture the true value function, reward model, and successor features.
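With a one-hot basis, the linear transform $\Psi^\top \phi(s,a)$ simply selects one row of the weight matrix. The sketch below (a hypothetical plain-Python stand-in for the linear layer) makes this explicit:

```python
# Sketch: psi(s, a) = Psi^T phi(s, a). With a one-hot phi, the
# matrix-vector product reduces to picking out a single row of Psi.

def one_hot(index, dim):
    v = [0.0] * dim
    v[index] = 1.0
    return v

def linear_sf(phi_vec, Psi):
    """psi[j] = sum_i Psi[i][j] * phi[i] (Psi stored as a list of rows)."""
    d = len(Psi[0])
    return [sum(Psi[i][j] * phi_vec[i] for i in range(len(phi_vec)))
            for j in range(d)]
```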
4.1 Single Task Learning
Figure 1 compares the performance of the Fitted SF algorithm against Fitted Q-iteration. Both algorithms converge to a good solution and can perform the navigation task in few steps at the end of training. (Note that the control policy was constrained to be $\epsilon$-greedy with a fixed exploration probability.) The Fitted SF algorithm converges more slowly, which can be explained by the fact that it has to learn a full reward model before it can form good Q-value estimates. Figure 4 shows that the Fitted SF algorithm robustly minimizes both of its loss objectives.
Evolution of the loss objectives: Fitted SF minimizes both losses using the Adagrad gradient descent optimizer implemented in TensorFlow (Abadi et al., 2015). The best-performing learning rate differed between the reward loss and the SF loss, and the Fitted Q-iteration implementation performed best with yet another learning rate. Otherwise, TensorFlow's default parameters were used.
4.2 Multi Task Learning
The Fitted SF algorithm was also tested in two transfer settings where the start and goal locations are changed periodically between a fixed set of different locations. Changing the goal location is equivalent to changing the reward function while holding the transition dynamics fixed.
Transfer with Slight Reward Changes
Figure (a) compares the episode length of the Fitted Q-iteration and Fitted SF implementations when start and goal locations are moved by one grid cell. Once the reward function is changed, the reward weight vector $w$ of the Fitted SF algorithm is re-initialized to zero. For Fitted Q-iteration, the trained weights are kept after every reward function change. While initial training is slower for the Fitted SF algorithm, a change in reward function degrades its performance significantly less than Fitted Q-iteration's, demonstrating the robustness of the Fitted SF algorithm. Figure (b) compares two different resetting strategies for the Fitted SF learning algorithm: in one run all weights are re-initialized after a reward function change, while in the other the learned SF is kept between training rounds. One can see that keeping the SF weight matrix boosts performance significantly. This verifies the assumption presented by Barreto et al.
Transfer with Significant Reward Changes
To further test if SFs can be used for transfer between different domains, both algorithms are evaluated again on the same grid world, but the goal location is rotated through all four corners of the grid. The start location is always the corner diagonally across the grid from the goal. Changing start and goal locations in this way causes the reward function and the optimal policy to change more significantly.
To further stabilize learning and ensure sufficient exploration, both algorithms select actions using an $\epsilon$-greedy policy. The exploration probability $\epsilon$ is decayed as a function of the episode index $k$, and this index is reset to zero after every reward function change. Ensuring sufficient exploration allows the Fitted SF algorithm to efficiently re-estimate its reward model.
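A decay schedule of this kind can be sketched as follows. The linear rule and the decay horizon are assumptions; the endpoints (annealing from 1.0 down to 0.1) match the setting described for the positive transfer experiments:

```python
# Sketch of an epsilon-annealing schedule: linear decay from eps_start
# to eps_end over a fixed horizon of episodes (rule and horizon assumed).

def epsilon(episode, eps_start=1.0, eps_end=0.1, horizon=100):
    """Linearly anneal epsilon in the episode index; resetting the index
    to zero after a reward function change restarts exploration."""
    frac = min(episode / horizon, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```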
Figure 4.2 compares the episode length of both algorithms over several repeats of the four goal locations. The ordering of the different goal locations is not changed during the experiment. One can see that the change in reward function has an impact on both algorithms, but the Fitted SF algorithm significantly outperforms Fitted Q-iteration. Table 1 compares the average episode length across all episodes and confirms this performance gap. Figure 12 shows how the loss functions of the Fitted SF algorithm evolve during the experiment. Updates were performed only every 100 steps (each gradient update used a batch of 100 transitions). As expected, the reward loss does not decrease in a steady way but oscillates instead. However, the estimates seem to be good enough to achieve a significant performance difference over Fitted Q-iteration. Interestingly, the SF loss oscillates during training between very low and very high values.
Figure 9 shows a failure setting of the Fitted SF algorithm: if $\epsilon$ is held at a small fixed value and not annealed, only the first optimal policy and the first reward function are learned and then preserved across all subsequent changes. As a result, one can see a learning curve for the first 400 episodes, after which Fitted SF hits the episode time-out of 200 steps for the next reward configuration. If a reward function similar to the first is presented to the agent again, Fitted SF solves the problem easily because it reuses the weights learned at the beginning of the experiment. In other words, Fitted SF is not able to transfer the solution learned in the first 400 episodes to the other tested reward functions.
The goal of using SFs is to capture a feature set common to a set of MDPs, and this idea seems to perform well for transfer between these MDPs. Interestingly, Figure (a) shows that the SF loss objective oscillates even though the algorithm recovers a near-optimal policy quickly.
To get a better understanding of why the loss objective oscillates, consider the transfer example shown in Figure 13. In this example, the two MDPs have two actions and deterministic transitions indicated by arrows. Rewards are indicated by the arrow labels, and the two MDPs differ only in the reward of two specific transitions. This difference in reward causes the optimal policy of each MDP to be different: the policy that always selects the first action is optimal in the first MDP; the policy that selects the second action in one particular state and the first action elsewhere is optimal in the second MDP. The left side of Figure 13 shows the successor features for both optimal policies, which differ between the two MDPs. This difference arises because SFs are constrained to be similar to the features the agent sees in the future, and which features are seen is governed by the (optimal) policy. This highlights a key limitation of using Successor Features for transfer: the learned representation is not transferable between optimal policies. When solving a previously unseen MDP, a learned SF representation can only be used to initialize the search for an optimal policy, and the agent still has to adjust the SF representation to the policy that is optimal in the current MDP.
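The policy-dependence of SFs can be demonstrated on a toy example. The sketch below uses an assumed two-state MDP (not the one from the paper's figure): action 0 keeps the current state, action 1 toggles it, and the feature map is a one-hot over state-action pairs. The successor features of the same state-action pair are computed under two different policies by truncated rollout:

```python
# Toy illustration of the policy-dependence of SFs (assumed toy MDP):
# deterministic 2-state MDP, action 0 stays, action 1 toggles the state.

def phi(s, a):
    f = [0.0] * 4                    # one-hot over (state, action)
    f[s * 2 + a] = 1.0
    return f

def step(s, a):
    return s if a == 0 else 1 - s

def successor_features(s, a, policy, gamma=0.5, horizon=50):
    """Truncated rollout estimate of psi^pi(s, a) = sum_t gamma^t phi_t."""
    psi = [0.0] * 4
    for t in range(horizon):
        f = phi(s, a)
        psi = [p + (gamma ** t) * fi for p, fi in zip(psi, f)]
        s = step(s, a)
        a = policy(s)
    return psi

def pi1(s):
    return 0                         # always "stay"

def pi2(s):
    return 1 if s == 0 else 0        # toggle out of state 0, then stay

psi1 = successor_features(0, 0, pi1)
psi2 = successor_features(0, 0, pi2)
# psi1 and psi2 differ for the same (s, a): a representation learned
# under pi1 does not transfer verbatim to pi2.
```

Because the two rollouts visit different future state-action pairs, the SFs of the same starting pair diverge, which is exactly the effect that forces re-learning when the optimal policy changes.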
The fact that the SF representation has to be re-learned for each individual MDP can be seen in our experiments. In Figure (a), it contributes to the oscillations of the SF loss objective. In the failure case shown in Figure 9, the SF representation does not transfer at all and instead provides an initialization from which the gradient optimizer cannot adjust to the new reward function. This behaviour is not surprising: in this experiment the goal location was moved to a different corner of the grid, causing the optimal policy to change significantly. In the positive test case shown in Figure 4.2, this is mitigated by first resetting the policy to uniformly random exploration (by annealing $\epsilon$ from 1.0 to 0.1), which can be thought of as smoothing the transitions between different reward functions.
This result also agrees with the first transfer experiment shown in Figure 7. Because the reward function and optimal policy change only slightly, the SF representations corresponding to each optimal policy and reward function are likely very similar. As a result, the algorithm can adjust to the new reward function very quickly. Barreto et al. also presented empirical results using a variation of Generalized Value Iteration (Sutton & Barto, 1998) on a version of Puddle World (Sutton, 1996) where the location of the puddle changed slightly. Their experiment, which shows a significant performance boost from transferring a SF representation, is similar to the slight-reward-change test case because the changes in the reward function did not cause a drastic change in the optimal policy.
The presented empirical results demonstrate an interesting advantage and disadvantage of transferring SFs between MDPs that differ only in the reward function. While we were able to show a significant performance boost using this approach, we also highlighted that the learned feature representation depends on the policy it is learned for. Hence, SF representations are an unsuitable choice in this context, because one is typically interested in transferring knowledge between tasks with different optimal policies.
The fact that transferring a SF representation between tasks gives a significant boost in learning speed suggests that learning a transferable feature representation is an interesting direction to pursue. However, such a feature representation needs to be independent of the task's optimal policy.
- Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Antos et al. (2006) Antos, András, Szepesvári, Csaba, and Munos, Rémi. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. International Conference on Computational Learning Theory, pp. 574–588. Springer, 2006.
- Barreto et al. (2016) Barreto, André, Munos, Rémi, Schaul, Tom, and Silver, David. Successor features for transfer in reinforcement learning. CoRR, abs/1606.05312, 2016. URL http://arxiv.org/abs/1606.05312.
- Dayan (1993) Dayan, Peter. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
- Kaelbling et al. (1996) Kaelbling, Leslie Pack, Littman, Michael L, and Moore, Andrew W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
- Konidaris et al. (2011) Konidaris, George, Osentoski, Sarah, and Thomas, Philip. Value function approximation in reinforcement learning using the Fourier basis. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 380–385, August 2011.
- Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Sutton (1996) Sutton, Richard S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, pp. 1038–1044, 1996.
- Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. A Bradford Book. MIT Press, Cambridge, MA, 1 edition, 1998.
- Taylor & Stone (2009) Taylor, Matthew E. and Stone, Peter. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
- Zhang et al. (2016) Zhang, Jingwei, Springenberg, Jost Tobias, Boedecker, Joschka, and Burgard, Wolfram. Deep reinforcement learning with successor features for navigation across similar environments. arXiv preprint arXiv:1612.05533, 2016.