Advantages and Limitations of using Successor Features for Transfer in Reinforcement Learning

Lucas Lehnert et al., July 31, 2017

One question central to Reinforcement Learning is how to learn a feature representation that supports algorithm scaling and re-use of learned information from different tasks. Successor Features approach this problem by learning a feature representation that satisfies a temporal constraint. We present an implementation of an approach that decouples the feature representation from the reward function, making it suitable for transferring knowledge between domains. We then assess the advantages and limitations of using Successor Features for transfer.


1 Introduction

Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton & Barto, 1998) studies the problem of computing an optimal control strategy using one-step interactions sampled from an environment. For each selected action, the environment also provides a reward, a single scalar number. The goal is to compute a control strategy, also called a policy, that maximizes the cumulative reward received while interacting with the environment. One challenge in this setting is transferring knowledge about one environment to another when only the reward specification changes, but the remaining specification of the environment stays fixed. In this paper, we consider the approach presented by Barreto et al. (2016), which uses Successor Features (SF) to compute a representation of the environment that can be transferred across different reward functions. We present an implementation of this method and show that while learning a SF representation has significant benefits for transfer, it also has some fundamental limitations.

2 Background

We consider a Markov Decision Process (MDP) with a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. The transition function specifies with $p(s'|s,a)$ the probability of transitioning from a state $s$ to a state $s'$ when selecting an action $a$. For every such transition, the reward is specified by the reward function $r(s,a,s')$. Further, we assume a discount factor $\gamma \in [0,1)$ that weighs the tradeoff between immediate and long-term rewards.

Let $\pi$ be a policy that specifies the distribution with which actions are selected, conditioned on the state. The Q-function of this policy is defined as

$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\, \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\, a_0 = a \right] \quad (1)$$

where the expectation is over all possible infinite-length trajectories in the MDP and $r_t$ is the reward at time step $t$.

Several algorithms have been developed to estimate a Q-function; however, one important question is how to represent the current Q-function estimate. For example, suppose the state space of an MDP consists of $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions; then an estimate of the Q-function can be stored in a vector $\boldsymbol{\theta}_Q$ of dimension $|\mathcal{S}||\mathcal{A}|$:

$$\boldsymbol{\theta}_Q = \big[\,\hat{Q}(s_1,a_1),\, \ldots,\, \hat{Q}(s_{|\mathcal{S}|},a_{|\mathcal{A}|})\,\big]^\top \quad (2)$$

To compute the Q-value for a state-action pair $(s,a)$, a basis function $\boldsymbol{\phi}$ with

$$\hat{Q}(s,a) = \boldsymbol{\phi}(s,a)^\top \boldsymbol{\theta}_Q \quad (3)$$

can be used, where $\boldsymbol{\phi}(s,a)$ is a one-hot bit vector of dimension $|\mathcal{S}||\mathcal{A}|$. Basis functions can also be generalized to have different forms to further improve the scalability of different learning algorithms (Sutton, 1996; Konidaris et al., 2011).
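As a concrete illustration of the tabular case, the following minimal Python sketch (our own; helper names such as `one_hot_phi` are not from the paper) builds the one-hot basis vector and computes a linear Q-value estimate as in (2)-(3):

```python
import numpy as np

def one_hot_phi(s, a, num_states, num_actions):
    """Tabular basis function: a one-hot vector of dimension |S||A|."""
    phi = np.zeros(num_states * num_actions)
    phi[s * num_actions + a] = 1.0
    return phi

# Q-value estimates stored in a weight vector theta of dimension |S||A|, cf. (2).
num_states, num_actions = 100, 4
theta_q = np.zeros(num_states * num_actions)

def q_value(s, a):
    """Linear Q-value lookup, cf. (3)."""
    return one_hot_phi(s, a, num_states, num_actions) @ theta_q
```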

3 Learning Successor Features for Transfer

Dayan (1993) presented Successor Features (SFs), a particular type of basis function that represents a state as a feature vector such that, under a given policy, the feature representation of a state is similar to the feature representation of its successor states. The idea originates from the Bellman fixed-point equation,

$$Q^\pi(s,a) = \mathbb{E}\big[\, r(s,a,s') + \gamma\, Q^\pi(s',a') \,\big] \quad (4)$$

where $s'$ is the sampled next state and $a'$ is the sampled next action at state $s'$. If the Q-function is approximated linearly, then

$$\boldsymbol{\phi}(s,a)^\top \boldsymbol{\theta}_Q \approx \mathbb{E}\big[\, r(s,a,s') + \gamma\, \boldsymbol{\phi}(s',a')^\top \boldsymbol{\theta}_Q \,\big] \quad (5)$$

Note that, depending on the choice of basis function, (5) may not hold exactly because we only estimate a linear approximation of the true Q-function. The objective of finding a good SF representation is to find a basis function such that (5) holds as exactly as possible.

Barreto et al. (2016) revisited this approach in the context of transferring a feature representation within a set of MDPs where only the reward function varies. While various approaches to this problem have been presented (see Taylor & Stone (2009) for a survey), Barreto et al. address it by learning a feature representation that is descriptive of the entire set of MDPs and can be used for transfer across different reward functions.

Intuitively, the Q-function combines information about the reward function itself, as well as the temporal ordering of the received rewards. This temporal ordering is induced by the current policy and the transition dynamics of the MDP that determine which trajectories are generated.

For transfer, Barreto et al. present an approach that isolates the reward function from the Q-function. They define a basis function $\boldsymbol{\phi}$ to parametrize the reward function with

$$r(s,a,s') = \boldsymbol{\phi}(s,a,s')^\top \mathbf{w} \quad (6)$$

Since (6) is stated as a strict equality, the assumption is made that $\boldsymbol{\phi}$ is not too restrictive and the reward function can be represented exactly. Using this assumption, Barreto et al. rewrite the Q-function as

$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\, \sum_{t=0}^{\infty} \gamma^t \boldsymbol{\phi}_t \;\middle|\; s_0 = s,\, a_0 = a \right]^{\!\top} \mathbf{w} = \boldsymbol{\psi}^\pi(s,a)^\top \mathbf{w} \quad (7)$$

where $\boldsymbol{\phi}_t$ is the reward feature at time step $t$ for a trajectory started at $(s,a)$. Suppose $\boldsymbol{\phi}$ is a basis function that tabulates the state-action space, i.e. $\boldsymbol{\phi}(s,a)$ is a one-hot bit vector of dimension $|\mathcal{S}||\mathcal{A}|$. In this case, the weight vector $\mathbf{w}$ can be thought of as the full reward model written out as a vector. This means (7) can be interpreted as a separation of the Q-function into a (linear) factor describing rewards only and a (linear) factor describing the ordering with which rewards are observed. Hence, Barreto et al. propose to learn a Successor Feature $\boldsymbol{\psi}^\pi$ satisfying

$$\boldsymbol{\psi}^\pi(s,a) = \mathbb{E}\big[\, \boldsymbol{\phi}(s,a,s') + \gamma\, \boldsymbol{\psi}^\pi(s',a') \,\big] \quad (8)$$
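To make the decomposition in (7)-(8) concrete, here is a small numerical sketch (our own construction, not code from the paper) that computes the exact tabular SF of a fixed policy on a random MDP with expected rewards and checks that $\boldsymbol{\psi}^\pi(s,a)^\top \mathbf{w}$ reproduces $Q^\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.normal(size=(nS, nA))                   # expected reward for (s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # a fixed stochastic policy

# Tabular reward features: phi(s, a) is one-hot, so w is just R flattened, cf. (6).
w = R.reshape(nS * nA)

# State-action transition matrix under pi: (s, a) -> (s', a').
P_pi = np.einsum('san,nb->sanb', P, pi).reshape(nS * nA, nS * nA)

# Successor features solve psi = Phi + gamma * P_pi @ psi, cf. (8); Phi is the identity here.
Phi = np.eye(nS * nA)
psi = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, Phi)

# Q^pi from the SF decomposition (7) matches the classic Bellman solution.
Q_sf = psi @ w
Q_bellman = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, R.reshape(-1))
assert np.allclose(Q_sf, Q_bellman)
```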

In addition, they also present policy improvement theorems similar to the usual dynamic programming improvement theorems (Sutton & Barto, 1998).

3.1 Algorithm Derivation

Similar to Fitted Q-iteration (Antos et al., 2006), DQN (Mnih et al., 2015), and the method outlined by Zhang et al. (2016), we derive a learning algorithm that fits a reward model and a SF model by simultaneously minimizing two loss functions. The reward model is fitted by minimizing the reward loss

$$L_r = \mathbb{E}\Big[ \big( r - \boldsymbol{\phi}(s,a)^\top \mathbf{w} \big)^2 \Big] \quad (9)$$

where the expectation is with respect to some visitation distribution over the state-action space $\mathcal{S} \times \mathcal{A}$, and where the scalar $r$ is the reward received for a particular transition.
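For concreteness, a minimal sketch of fitting $\mathbf{w}$ by batch gradient descent on (9); the helper names and the learning rate are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def reward_loss_grad(w, phi_batch, r_batch):
    """Gradient of the squared reward loss (9) averaged over a batch of transitions.

    phi_batch: array of shape (batch, d) of reward features phi(s, a).
    r_batch:   array of shape (batch,) of observed rewards.
    """
    err = r_batch - phi_batch @ w               # prediction error per transition
    return -2.0 * phi_batch.T @ err / len(r_batch)

def fit_reward_model(phi_batch, r_batch, d, lr=0.1, steps=500):
    """Plain gradient descent on (9); learning rate and step count are illustrative."""
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * reward_loss_grad(w, phi_batch, r_batch)
    return w
```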

The SF is learned by first estimating a target

$$\mathbf{y} = \boldsymbol{\phi}(s,a) + \gamma\, \boldsymbol{\psi}_{\text{old}}(s',a') \quad (10)$$

for every collected transition $(s,a,r,s')$. For computing this target, the SF estimate $\boldsymbol{\psi}_{\text{old}}$ of the previous update iteration is used. Unlike Mnih et al.'s Deep Q-learning, the target is a vector and not a single scalar variable. For learning a SF representation, the loss objective

$$L_\psi = \mathbb{E}\Big[ \big\| \mathbf{y} - \boldsymbol{\psi}(s,a) \big\|_2^2 \Big] \quad (11)$$

is used. The gradient of (11) with respect to the SF parameters is

$$\nabla L_\psi = -2\, \mathbb{E}\Big[ \big( \mathbf{y} - \boldsymbol{\psi}(s,a) \big)\, \nabla \boldsymbol{\psi}(s,a) \Big] \quad (12)$$

which is similar to the gradient used by Deep Q-learning, with two distinctions: the error term in (12) is a vector rather than a scalar (so, under the linear parametrization used in our experiments, the gradient is a matrix rather than a vector), and (11) is defined on the SF $\boldsymbol{\psi}$ rather than on Q-values.

Algorithm 1 outlines the implemented SF learning method. Learning is stabilized by sampling a batch of transitions and using the entire batch to make a gradient descent update.

  Initialize the reward weights $\mathbf{w}$, the SF parameters, and the set of collected transitions.
  loop
     Collect a batch of transitions using the ($\epsilon$-greedy) Q-function estimate $\hat{Q}(s,a) = \boldsymbol{\psi}(s,a)^\top \mathbf{w}$
     Using this batch, perform a gradient update on $\mathbf{w}$ (loss (9)) and on the SF parameters (loss (11))
  end loop
Algorithm 1 Fitted SF Learning
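The following Python sketch illustrates one iteration of the loop in Algorithm 1, under the linear parametrization $\boldsymbol{\psi}(s,a) = F\boldsymbol{\phi}(s,a)$ introduced in (15). It is our own reconstruction: the greedy choice of the next action and the plain gradient step (the experiments use Adagrad) are assumptions made for brevity.

```python
import numpy as np

def fitted_sf_update(F, w, batch, phi, gamma, lr_sf, lr_r, num_actions):
    """One batch gradient update for the SF matrix F and the reward weights w.

    batch is a list of transitions (s, a, r, s_next, done); phi maps (s, a) to a
    one-hot feature vector. The next action is chosen greedily with respect to
    the Q estimate (F_old @ phi(s', a')) @ w.
    """
    F_old = F.copy()                                   # SF estimate of the previous iteration, cf. (10)
    grad_F = np.zeros_like(F)
    grad_w = np.zeros_like(w)
    for s, a, r, s_next, done in batch:
        f = phi(s, a)
        # SF target (10): current reward feature plus discounted SF of the next state-action.
        if done:
            y = f
        else:
            q_next = [(F_old @ phi(s_next, b)) @ w for b in range(num_actions)]
            b_star = int(np.argmax(q_next))
            y = f + gamma * (F_old @ phi(s_next, b_star))
        grad_F += -2.0 * np.outer(y - F @ f, f)        # gradient of the SF loss (11), cf. (12)
        grad_w += -2.0 * (r - f @ w) * f               # gradient of the reward loss (9)
    F = F - lr_sf * grad_F / len(batch)
    w = w - lr_r * grad_w / len(batch)
    return F, w
```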

4 Experiments: Grid World

Algorithm 1 is first evaluated on a grid world navigation task with four actions: up, down, left, or right. Transitions are stochastic: with 5% probability the agent moves sideways. Rewards are set to +1 for entering the goal cell (a terminal state) in the top right corner, and a zero reward is given otherwise. Every episode starts in the bottom right corner, and a fixed discount factor $\gamma$ is used. Actions are selected using an $\epsilon$-greedy policy with respect to the current Q-value estimates: with probability $\epsilon$ an action is selected uniformly at random, and with probability $1-\epsilon$ the action with the highest Q-value estimate is used.
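A minimal sketch of such a grid world follows; the grid size and the exact sideways-slip mechanics are our assumptions, since they are not specified in this excerpt:

```python
import numpy as np

class GridWorld:
    """Minimal grid world sketch: 4 actions, 5% chance of slipping sideways,
    +1 reward for entering the goal cell, 0 otherwise (our own reconstruction)."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def __init__(self, size=10, goal=None, start=None, slip=0.05, rng=None):
        self.size = size
        self.goal = goal if goal is not None else (0, size - 1)              # top right
        self.start = start if start is not None else (size - 1, size - 1)    # bottom right
        self.slip = slip
        self.rng = rng or np.random.default_rng()

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, a):
        dr, dc = self.ACTIONS[a]
        if self.rng.random() < self.slip:           # slip sideways with 5% probability
            dr, dc = (dc, dr) if self.rng.random() < 0.5 else (-dc, -dr)
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done
```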

We compare our Fitted SF implementation against a Fitted Q-iteration implementation. To ensure a fair comparison, Fitted Q-iteration is identical to Fitted SF except that it minimizes the loss objective

$$L_Q = \mathbb{E}\Big[ \big( y_Q - \hat{Q}(s,a) \big)^2 \Big] \quad (13)$$

where the target is set to

$$y_Q = r + \gamma \max_{a'} \hat{Q}_{\text{old}}(s',a') \quad (14)$$

The value estimate $\hat{Q}_{\text{old}}$ is the Q-function estimate of the previous iteration.
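For comparison with Algorithm 1, here is a corresponding sketch of one batch update of the Fitted Q-iteration baseline using the target (14); as before, the helper names and the plain gradient step are our assumptions:

```python
import numpy as np

def fitted_q_update(theta, batch, phi, gamma, lr, num_actions):
    """One batch gradient update of Fitted Q-iteration, minimizing (13) with targets (14)."""
    theta_old = theta.copy()                  # Q-function estimate of the previous iteration
    grad = np.zeros_like(theta)
    for s, a, r, s_next, done in batch:
        f = phi(s, a)
        if done:
            y = r
        else:
            y = r + gamma * max(phi(s_next, b) @ theta_old for b in range(num_actions))
        grad += -2.0 * (y - f @ theta) * f    # gradient of the squared TD error (13)
    return theta - lr * grad / len(batch)
```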

In all experiments, the Q-function in Fitted Q-iteration uses a basis function tabulating the state-action space, and the weight vector $\boldsymbol{\theta}_Q$ is learned as described in (2). Further, the basis function $\boldsymbol{\phi}$ used for estimating the reward model (6) also tabulates the state-action space; that is, the reward model can always represent the true reward function exactly. The SF representation is learned as a linear transform $F$ on the tabular basis function $\boldsymbol{\phi}$:

$$\boldsymbol{\psi}(s,a) = F \boldsymbol{\phi}(s,a) \quad (15)$$

Because all basis functions are chosen to be tabular, and the SFs are linear in a tabular one-hot basis function, neither algorithm is constrained in its representation: both can always capture the true value function, reward model, and successor features.
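Under this parametrization, the Q-values used for action selection are simply $(F\boldsymbol{\phi}(s,a))^\top \mathbf{w}$; the following short sketch (helper names are our own) makes this explicit:

```python
import numpy as np

def q_values(F, w, phi, s, num_actions):
    """Q estimates under the linear SF parametrization (15): Q(s, a) = (F phi(s, a))^T w."""
    return np.array([(F @ phi(s, a)) @ w for a in range(num_actions)])

def greedy_action(F, w, phi, s, num_actions):
    """Greedy action with respect to the SF-based Q estimate."""
    return int(np.argmax(q_values(F, w, phi, s, num_actions)))
```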

4.1 Single Task Learning

Figure 1 compares the performance of the Fitted SF algorithm against Fitted Q-iteration. Both algorithms converge to a good solution and can perform the navigation task in few steps at the end of training. (Note that the control policy was constrained to remain $\epsilon$-greedy throughout training.) The Fitted SF algorithm converges more slowly, which can be explained by the fact that it has to learn a full reward model before it can form good Q-value estimates. Figure 4 shows that the Fitted SF algorithm robustly minimizes both of its loss objectives.

Figure 1:

Episode length for the best Fitted Q-iteration run and Fitted SF run. All experiments were repeated 20 times and the average episode length plus standard deviation is plotted. The shorter the episode, the sooner the agent can reach the +1 reward state—a shorter episode is better.

(a) Loss Objective $L_\psi$ (11)
(b) Loss Objective $L_r$ (9)
Figure 4:

Evolution of the loss objectives. Fitted SF minimizes both objectives using the Adagrad gradient descent optimizer implemented in TensorFlow (Abadi et al., 2015). Separate learning rates were tuned for the SF loss objective $L_\psi$, the reward loss objective $L_r$, and the Fitted Q-iteration implementation; otherwise TensorFlow's default parameters were used.
(a) Comparison of Fitted SF learning with Fitted Q-iteration. Fitted Q-iteration and Fitted SF learning each used separately tuned learning rates (for the Q-function, and for the SF and reward model, respectively).
(b) Comparison of different weight resetting strategies for the SF algorithm. The green curve is the same as in panel (a). The blue curve shows the episode length when all weights are reinitialized between training rounds, while the green curve keeps the SF matrix $F$ between reward function changes. The blue curve used a learning rate of 0.001 for the SF and 0.01 for the reward model.
Figure 7: Performance results for repeatedly moving start and goal position by one cell every 400 episodes. A total of three different start and goal positions were used and then repeated. The episode length was capped at 200 steps.

4.2 Multi Task Learning

The Fitted SF algorithm was also tested in two transfer settings where the start and goal locations are changed periodically between a fixed set of different locations. Changing the goal location is equivalent to changing the reward function while holding the transition dynamics fixed.

Transfer with Slight Reward Changes

Figure 7(a) compares the episode length of the Fitted Q-iteration implementation and the Fitted SF implementation when start and goal locations are moved by one grid cell. Once the reward function is changed, the reward weight vector $\mathbf{w}$ of the Fitted SF algorithm is re-initialized to zero. For Fitted Q-iteration, the trained weights are kept after every reward function change. While initial training is slower for the Fitted SF algorithm, a change in reward function degrades its performance significantly less in comparison to Fitted Q-iteration, demonstrating the robustness of the Fitted SF algorithm. Figure 7(b) compares two different resetting strategies for the Fitted SF learning algorithm: in one run all weights are re-initialized after a reward function change, while in the other the learned SF is kept between training rounds. One can see that keeping the SF weight matrix boosts performance significantly. This verifies the assumption presented by Barreto et al.

Transfer with Significant Reward Changes

To further test if SFs can be used for transfer between different domains, both algorithms are evaluated again on the same grid world, but the goal location is rotated through all four corners of the grid. The start location is always the corner diagonally across the grid from the goal. Changing start and goal locations in this way causes the reward function and the optimal policy to change more significantly.

To further stabilize learning and ensure sufficient exploration, both algorithms select actions using an $\epsilon$-greedy policy. The exploration probability $\epsilon$ is decayed as a function of the episode index, annealing from 1.0 down to 0.1, and the episode index is reset to zero after every reward function change. Ensuring sufficient exploration allows the Fitted SF algorithm to efficiently re-estimate its reward model.
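The exact decay rule is not reproduced in this excerpt; the sketch below shows one plausible linear annealing schedule from 1.0 down to 0.1, stated purely as an assumption:

```python
def epsilon_schedule(episode_index, eps_start=1.0, eps_final=0.1, decay_episodes=50):
    """Hypothetical linear annealing schedule from eps_start down to eps_final.

    The paper decays epsilon as a function of the episode index (which is reset
    after each reward change); the exact rule is not given here, so this schedule
    and its decay_episodes parameter are illustrative assumptions.
    """
    frac = min(episode_index / decay_episodes, 1.0)
    return eps_start + frac * (eps_final - eps_start)
```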

Figure 8: Comparison of the Fitted Q-iteration and Fitted SF algorithms when rotating the goal location through all four corners of the grid every 100 episodes. Fitted Q-iteration and Fitted SF learning use separately tuned learning rates (for the Q-function, and for the SF and reward model, respectively). The episodes were capped at 4000 steps.
Table 1: Average episode length of Fitted Q-iteration and Fitted SF for the experiment shown in Figure 8. The p-value of Welch's t-test tests whether the episode lengths are significantly different.

Figure 9: Episode length of Fitted SF when reward functions change every 400 episodes. The episodes were clipped at 200 steps. All other parameters are the same as in Figure 8.

Figure 8 compares the episode length of both algorithms over several repeats of the four goal locations. The ordering of the different goal locations is not changed during the experiment. One can see that the change in reward function has an impact on both algorithms, but the Fitted SF algorithm outperforms Fitted Q-iteration significantly. Table 1 compares the average episode length across all episodes and shows that our Fitted SF algorithm significantly outperforms Fitted Q-iteration. Figure 12 shows how the loss functions of the Fitted SF algorithm evolve during the experiment. Updates were done only every 100 steps (each gradient update used a batch of 100 transitions). As expected, the reward loss does not decrease steadily but oscillates instead. However, the estimates seem to be good enough to achieve a significant performance difference over Fitted Q-iteration. Interestingly, the SF loss oscillates during training between very low and high values.

Figure 9 shows a failure setting of the Fitted SF algorithm: if $\epsilon$ is kept fixed and not annealed, only the first optimal policy and the first reward function are learned and then preserved across all subsequent changes. As a result, one can see a learning curve for the first 400 episodes, after which Fitted SF hits the episode time-out of 200 steps for the next reward configuration. If a reward function similar to the first one is presented to the agent again, Fitted SF solves this problem easily because it reuses the weights it learned at the beginning of the experiment. In other words, Fitted SF is not able to transfer the solution learned in the first 400 episodes to the other tested reward functions.

(a) Loss Objective $L_\psi$ (11)
(b) Loss Objective $L_r$ (9)
Figure 12: Evolution of the Loss function for the Fitted SF algorithm. A gradient update was applied every 100 steps.

5 Discussion

Figure 13: Successor Feature Transfer Counter Example. A change in the optimal action at one state causes the SF at another state to change.

The goal of using SFs is to capture a feature set common to a set of MDPs, and this idea seems to perform well for transfer between these MDPs. Interestingly, Figure 12(a) shows that the SF loss objective oscillates despite the fact that the algorithm recovers a near-optimal policy quickly.

To get a better understanding of why the loss objective oscillates, consider the transfer example shown in Figure 13. In this example, the two MDPs have two actions and deterministic transitions indicated by arrows. Rewards are indicated by the arrow labels, and the two MDPs differ only in the reward of two specific transitions. This difference in reward causes the optimal policy of each MDP to be different: a policy that always selects the same action is optimal in the first MDP, while a policy that deviates from it at a single state is optimal in the second MDP. The left side of Figure 13 shows the successor features of both optimal policies, which differ between the two MDPs. This difference arises because SFs are constrained to be similar to the features the agent sees in the future; however, which features are seen is governed by the (optimal) policy. This highlights a key limitation of using Successor Features for transfer: the learned representation is not transferable between optimal policies. When solving a previously unseen MDP, a learned SF representation can only be used to initialize the search for an optimal policy, and the agent still has to adjust the SF representation to the policy that is only optimal in the current MDP.
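The policy dependence of SFs is easy to reproduce numerically. The sketch below is not the exact two-MDP example of Figure 13, but a minimal hypothetical deterministic chain in which changing the action chosen at one state changes the SF of another state:

```python
import numpy as np

# Hypothetical 3-state, 2-action deterministic chain (NOT the exact MDP of Figure 13):
# action 0 moves right along the chain, action 1 stays put. We compare the SF of
# state 0 under two policies that differ only in the action chosen at state 2.
nS, nA, gamma = 3, 2, 0.9
next_state = {(s, 0): min(s + 1, nS - 1) for s in range(nS)}
next_state.update({(s, 1): s for s in range(nS)})

def successor_features(policy):
    """Exact tabular SF for a deterministic policy: psi = (I - gamma * P_pi)^{-1}."""
    P_pi = np.zeros((nS * nA, nS * nA))
    for s in range(nS):
        for a in range(nA):
            s2 = next_state[(s, a)]
            P_pi[s * nA + a, s2 * nA + policy[s2]] = 1.0
    return np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, np.eye(nS * nA))

pi_1 = [0, 0, 0]                        # always take action 0
pi_2 = [0, 0, 1]                        # deviate only at state 2
psi_1 = successor_features(pi_1)[0]     # SF of (state 0, action 0)
psi_2 = successor_features(pi_2)[0]
print(np.allclose(psi_1, psi_2))        # False: the SF of state 0 depends on the policy
```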

The fact that the SF representation has to be re-learned for each individual MDP can be seen in our experiments. In Figure 12(a) it contributes to the oscillations of the SF loss objective. In the failure case shown in Figure 9, the SF representation does not transfer at all and instead provides an initialization from which the gradient optimizer cannot adjust to the new reward function. This behaviour is not surprising, because in this experiment the goal location was moved to a different corner of the grid, causing the optimal policy to change significantly. In the positive test case shown in Figure 8, this is mitigated by first resetting the policy to uniformly random exploration (by annealing $\epsilon$ from 1.0 down to 0.1), which can be thought of as smoothing the transitions between different reward functions.

This result also agrees with the first transfer experiment shown in Figure 7. Because the reward function and optimal policy change only slightly, the SF representations corresponding to each optimal policy and reward function are likely to be very similar. As a result, the algorithm can adjust to the new reward function very quickly. Barreto et al. also presented empirical results using a variation of Generalized Value Iteration (Sutton & Barto, 1998) on a version of Puddle World (Sutton, 1996) in which the location of the puddle changed slightly. Their experiment, which shows a significant performance boost from transferring a SF representation, is similar to our slight-reward-change test case because the changes in the reward function did not cause a drastic change in the optimal policy.

6 Conclusion

The presented empirical results demonstrate an interesting advantage and disadvantage of transferring SFs between MDPs that differ only in their reward function. While we were able to show a significant performance boost by using this approach, we also highlighted that the learned feature representation depends on the policy it is learned for. Hence, SF representations are an unsuitable choice in this context, because one is typically interested in transferring knowledge between tasks with different optimal policies.

The fact that transferring a SF representation between tasks gives a significant boost in learning speed also suggests that learning a transferable feature representation might be an interesting direction to pursue. However, such a feature representation needs to be independent of the task's optimal policy.

References

  • Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
  • Antos et al. (2006) Antos, András, Szepesvári, Csaba, and Munos, Rémi. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In International Conference on Computational Learning Theory, pp. 574–588. Springer, 2006.
  • Barreto et al. (2016) Barreto, André, Munos, Rémi, Schaul, Tom, and Silver, David. Successor features for transfer in reinforcement learning. CoRR, abs/1606.05312, 2016. URL http://arxiv.org/abs/1606.05312.
  • Dayan (1993) Dayan, Peter. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
  • Kaelbling et al. (1996) Kaelbling, Leslie Pack, Littman, Michael L, and Moore, Andrew W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
  • Konidaris et al. (2011) Konidaris, George, Osentoski, Sarah, and Thomas, Philip. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 380–385, August 2011.
  • Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Sutton (1996) Sutton, Richard S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in neural information processing systems, pp. 1038–1044, 1996.
  • Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. A Bradford Book. MIT Press, Cambridge, MA, 1 edition, 1998.
  • Taylor & Stone (2009) Taylor, Matthew E. and Stone, Peter. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
  • Zhang et al. (2016) Zhang, Jingwei, Springenberg, Jost Tobias, Boedecker, Joschka, and Burgard, Wolfram. Deep reinforcement learning with successor features for navigation across similar environments. arXiv preprint arXiv:1612.05533, 2016.