Deep Successor Reinforcement Learning

06/08/2016 ∙ Tejas D. Kulkarni, et al. ∙ MIT, Harvard University

Learning robust value functions given raw observations and rewards is now possible with model-free and model-based deep reinforcement learning algorithms. There is a third alternative, called Successor Representations (SR), which decomposes the value function into two components -- a reward predictor and a successor map. The successor map represents the expected future state occupancy from any given state and the reward predictor maps states to scalar rewards. The value function of a state can be computed as the inner product between the successor map and the reward weights. In this paper, we present DSR, which generalizes SR within an end-to-end deep reinforcement learning framework. DSR has several appealing properties including: increased sensitivity to distal reward changes due to factorization of reward and world dynamics, and the ability to extract bottleneck states (subgoals) given successor maps trained under a random policy. We show the efficacy of our approach on two diverse environments given raw pixel observations -- simple grid-world domains (MazeBase) and the Doom game engine.


1 Introduction

Many learning problems involve inferring properties of temporally extended sequences given an objective function. For instance, in reinforcement learning (RL), the task is to find a policy that maximizes expected future discounted reward (value). RL algorithms fall into two main classes: (1) model-free algorithms that learn cached value functions directly from sample trajectories, and (2) model-based algorithms that estimate transition and reward functions, from which values can be computed using tree search or dynamic programming. However, there is a third class, based on the successor representation (SR), that factors the value function into a predictive representation and a reward function. Specifically, the value function at a state can be expressed as the dot product between the vector of expected discounted future state occupancies and the immediate reward in each of those successor states.
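To make the decomposition concrete, the sketch below (not from the paper) computes the tabular SR in closed form for a small, hypothetical MDP under a fixed policy and recovers the value function as an inner product with the reward vector; the transition matrix and reward values are purely illustrative.

```python
import numpy as np

# Hypothetical 4-state MDP under a fixed policy: the tabular SR is
# M = (I - gamma * P_pi)^{-1}, the expected discounted future state occupancy.
gamma = 0.95
P_pi = np.array([[0.1, 0.9, 0.0, 0.0],
                 [0.1, 0.0, 0.9, 0.0],
                 [0.0, 0.1, 0.0, 0.9],
                 [0.0, 0.0, 0.0, 1.0]])   # policy-conditioned transition matrix
R = np.array([0.0, 0.0, 0.0, 1.0])        # reward lives only in the last state

M = np.linalg.inv(np.eye(4) - gamma * P_pi)  # successor map: M[s, s']
V = M @ R                                    # value = SR dotted with immediate rewards

# Changing a distal reward only re-weights the (unchanged) successor map:
V_rescaled = M @ (3.0 * R)
print(V, V_rescaled)
```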

Representing the value function using the SR has several appealing properties. It combines computational efficiency comparable to model-free algorithms with some of the flexibility of model-based algorithms. In particular, the SR can adapt quickly to changes in distal reward, unlike model-free algorithms. In this paper, we also highlight a feature of the SR that has been less well investigated: the ability to extract bottleneck states (candidate subgoals) from the successor representation under a random policy stachenfeld2014design. These subgoals can then be used within a hierarchical RL framework. In this paper we develop a powerful function approximation algorithm and architecture for the SR using a deep neural network, which we call Deep Successor Reinforcement Learning (DSR). This enables learning the SR and the reward function from raw sensory observations with end-to-end training.

Figure 1: Model Architecture: DSR consists of: (1) a feature branch $f_{\theta}$ (CNN) which takes in raw images and computes the features $\phi_{\theta}(s)$, (2) a successor branch $u_{\alpha}$ which computes the SR $m_{sa}$ for each possible action $a \in \mathcal{A}$, (3) a deep convolutional decoder which produces the input reconstruction $\hat{s}_t$, and (4) a linear regressor to predict the instantaneous reward at $s_t$. The Q-value function can be estimated by taking the inner product of the SR with the reward weights: $Q(s, a) \approx m_{sa} \cdot \mathbf{w}$.

The DSR consists of two sub-components: (1) a reward feature learning component, constructed as a deep neural network, which predicts intrinsic and extrinsic rewards in order to learn useful features from raw observations; and (2) an SR component, constructed as a separate deep neural network, which estimates the expected future “feature occupancy” conditioned on the current state and averaged over all actions. The value function can then be estimated as the dot product between these two factored representations. We train DSR by sampling experience trajectories (state, next state, action, and reward) from an experience replay memory and applying stochastic gradient descent to optimize the model parameters. To avoid instability in the learning algorithm, we interleave training of the successor and reward components.
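As a rough illustration of this two-branch design, the PyTorch sketch below wires up a feature CNN, per-action SR heads, a deconvolutional decoder, and a linear reward regressor; the module name `DSR`, the layer sizes, and the assumption of 84x84 RGB inputs are our own illustrative choices, not the exact architecture reported in the paper.

```python
import torch
import torch.nn as nn

class DSR(nn.Module):
    """Sketch of the DSR factorization: Q(s, a) ~= m_sa . w (assumes 84x84 RGB inputs)."""
    def __init__(self, num_actions, feat_dim=256):
        super().__init__()
        # (1) feature branch: CNN mapping raw pixels to phi(s)
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(feat_dim))
        # (2) successor branch: one SR head m_sa per action
        self.successor = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                          nn.Linear(feat_dim, feat_dim))
            for _ in range(num_actions)])
        # (3) deconvolutional decoder reconstructing the input (dense feedback signal)
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64 * 9 * 9), nn.ReLU(),
            nn.Unflatten(1, (64, 9, 9)),
            nn.ConvTranspose2d(64, 32, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 8, stride=4))
        # (4) linear reward regressor: R(s) ~= phi(s) . w
        self.w = nn.Linear(feat_dim, 1, bias=False)

    def forward(self, obs):
        phi = self.features(obs)                                    # phi(s)
        m = torch.stack([head(phi) for head in self.successor], 1)  # (batch, A, feat_dim)
        q = (m * self.w.weight.view(1, 1, -1)).sum(-1)              # Q(s, a) = m_sa . w
        return phi, m, q, self.decoder(phi), self.w(phi)
```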

We show the efficacy of our approach on two different domains: (1) learning to solve goals in grid-world domains using the MazeBase game engine and (2) learning to navigate a 3D maze to gather a resource using the Doom game engine. We show the empirical convergence results on several policy learning problems as well as sensitivity of the value estimator given distal reward changes. We also demonstrate the possibility of extracting plausible subgoals for hierarchical RL by performing normalized-cuts on the SR shi2000normalized .

2 Related work

The SR has been used in neuroscience as a model for describing different cognitive phenomena. gershman2012successor showed that the temporal context model howard2002distributed, a model of episodic memory, is in fact estimating the SR using the temporal difference algorithm. corneil2015attractor introduced a model based on the SR for preplay and rapid path planning in the CA3 region of the hippocampus. They interpret the SR as an attractor network in a low-dimensional space and show that if the network is stimulated with a goal location it can generate a path to the goal. stachenfeld2014design suggested a model tying together the problems of navigation and reward maximization in the brain. They claimed that the brain’s spatial representations are designed to support the reward maximization problem (RL), and showed that the behavior of place cells and grid cells can be explained by finding the optimal spatial representation for supporting RL. Based on their model, they proposed a way to identify reasonable subgoals from the spectral features of the SR. Other work (see, for instance, botvinick2014model ; daw2014algorithmic ) has also discussed utilizing the SR for subgoal and option discovery.

There are also models similar to the SR that have been applied to other RL-related domains. schraudolph1994temporal introduced a model for evaluating positions in the game of Go; the model is reminiscent of the SR as it predicts the fate of every position on the board instead of the overall game score. Another reward-independent model, the universal option model (UOM), proposed in szepesvari2014universal, uses a state occupancy function to build a general model of options. They proved that the UOM of an option, given a reward function, can construct a traditional option model. There has also been a lot of work on option discovery in the tabular setting mcgovern2001automatic ; csimcsek2005identifying ; mannor2004dynamic ; menache2002q ; konidaris2009skill . In more recent work, Machado et al. machado2016learning presented an option discovery algorithm where the agent is encouraged to explore regions that were previously out of reach. However, option discovery in settings that require non-linear state approximations is still an open problem.

Our model is also related to the literature on value function approximation using deep neural networks. The deep-Q learning model mnih2015human and its variants (e.g., silver2016mastering ; schaul2015prioritized ; nair2015massively ; mnih2016asynchronous ) have been successful in learning Q-value functions from high-dimensional complex input states.

3 Model

3.1 Background

Consider an MDP with a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a reward function $R: \mathcal{S} \rightarrow \mathbb{R}$, a discount factor $\gamma \in [0, 1]$, and a transition distribution $T(s' \mid s, a)$. Given a policy $\pi$, the Q-value function for selecting action $a$ in state $s$ is defined as the expected future discounted return:

$$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, s_0 = s, \; a_0 = a, \; \pi \right] \tag{1}$$

where $s_t$ is the state visited at time $t$ and the expectation is with respect to the policy and the transition distribution. The agent's goal is to find the optimal policy $\pi^{*}$, whose Q-value function satisfies the Bellman equation:

$$Q^{*}(s, a) = R(s) + \gamma \, \mathbb{E}_{s'}\left[ \max_{a'} Q^{*}(s', a') \right] \tag{2}$$

3.2 The successor representation

The SR can be used to compute the Q-value function as follows. Given a state $s$, an action $a$, and future states $s'$, the SR is defined as the expected discounted future state occupancy:

$$M(s, s', a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, \mathbb{1}[s_t = s'] \,\middle|\, s_0 = s, \; a_0 = a \right]$$

where $\mathbb{1}[\cdot]$ equals 1 when its argument is true and zero otherwise. This implicitly captures the (discounted) state visitation count. Similar to the Bellman equation for the Q-value function (Eq. 2), we can express the SR in a recursive form:

$$M(s, s', a) = \mathbb{1}[s = s'] + \gamma \, \mathbb{E}\left[ M(s_{t+1}, s', a_{t+1}) \right] \tag{3}$$

Given the SR, the Q-value for selecting action $a$ in state $s$ can be expressed as the inner product of the immediate reward and the SR dayan1993improving :

$$Q^{\pi}(s, a) = \sum_{s' \in \mathcal{S}} M(s, s', a) \, R(s') \tag{4}$$
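For intuition, Eq. 3 can be turned into a simple temporal-difference update in the tabular case. The sketch below is an illustrative tabular variant (not the deep model introduced next) that maintains a table `M[s, a, s']` and moves it toward the bootstrapped occupancy target.

```python
import numpy as np

def sr_td_update(M, s, a, s_next, a_next, gamma=0.99, lr=0.1):
    """One TD step on a tabular SR M[s, a, s'] following Eq. 3:
    target = 1[s_t = s'] + gamma * M[s_{t+1}, a_{t+1}, s'] for every s'."""
    num_states = M.shape[2]
    indicator = np.eye(num_states)[s]                # one-hot: 1[s_t = s']
    target = indicator + gamma * M[s_next, a_next]   # bootstrapped occupancy target
    M[s, a] += lr * (target - M[s, a])               # move the estimate toward the target
    return M
```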

3.3 Deep successor representation

For large state spaces, representing and learning the SR can become intractable; hence, we appeal to non-linear function approximation. We represent each state $s$ by a $D$-dimensional feature vector $\phi_{\theta}(s)$, which is the output of a deep neural network $f_{\theta}$ parameterized by $\theta$.

For a feature vector $\phi_{\theta}(s)$, we define a feature-based SR as the expected future occupancy of the features and denote it by $m_{sa}$. We approximate $m_{sa}$ by another deep neural network $u_{\alpha}$ parameterized by $\alpha$: $m_{sa} \approx u_{\alpha}(\phi_{\theta}(s), a)$. We also approximate the immediate reward for state $s$ as a linear function of the feature vector: $R(s) \approx \phi_{\theta}(s) \cdot \mathbf{w}$, where $\mathbf{w} \in \mathbb{R}^{D}$ is a weight vector. Since reward values can be sparse, we can also train an intrinsic reward predictor. A good intrinsic reward channel should give a dense feedback signal and provide features that preserve the latent factors of variation in the data (e.g., deep generative models that perform reconstruction). Putting these two pieces together, the Q-value function can be approximated as (see Eq. 4 for the closed form):

$$Q^{\pi}(s, a) \approx m_{sa} \cdot \mathbf{w} \tag{5}$$

The SR for the optimal policy in the non-linear function approximation case can then be obtained from the following Bellman equation:

$$m_{sa} = \phi_{\theta}(s) + \gamma \, \mathbb{E}\left[ m_{s_{t+1} a'} \right] \tag{6}$$

where $a' = \arg\max_{a} m_{s_{t+1} a} \cdot \mathbf{w}$.
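A small sketch of how the bootstrap target in Eq. 6 might be formed is shown below; the tensor shapes, the greedy action choice through the reward weights, and the use of a periodically cached target copy of the successor network follow the description above but are our own assumptions about the implementation.

```python
import torch

def sr_target(phi_s, m_next_target, w, gamma=0.99):
    """Bootstrap target for the SR (Eq. 6): phi(s_t) + gamma * m_{s_{t+1} a'},
    with a' = argmax_a m_{s_{t+1} a} . w.
    phi_s:          (batch, feat_dim)    features of s_t
    m_next_target:  (batch, A, feat_dim) SR of s_{t+1} from the cached target network
    w:              (feat_dim,)          reward weights
    """
    a_prime = (m_next_target @ w).argmax(dim=1)                  # greedy action at s_{t+1}
    m_next = m_next_target[torch.arange(len(a_prime)), a_prime]  # (batch, feat_dim)
    return phi_s + gamma * m_next
```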

3.4 Learning

The parameters $(\theta, \alpha, \mathbf{w}, \tilde{\theta})$ can be learned online through stochastic gradient descent.

The loss function for the SR branch is given by:

$$L_{m}(\alpha, \theta) = \mathbb{E}\left[ \left( \phi_{\theta}(s_t) + \gamma \, u_{\alpha_{\text{prev}}}(\phi_{\theta}(s_{t+1}), a') - u_{\alpha}(\phi_{\theta}(s_t), a_t) \right)^{2} \right]$$

where $a' = \arg\max_{a} u_{\alpha}(\phi_{\theta}(s_{t+1}), a) \cdot \mathbf{w}$ and $\alpha_{\text{prev}}$ denotes a previously cached parameter value, set periodically to $\alpha$. This is essential for stable Q-learning with function approximation (see mnih2015human ).

For learning $\mathbf{w}$, the weights of the reward approximation function, we use the following squared loss:

$$L_{r}(\mathbf{w}, \theta) = \left( R(s_t) - \phi_{\theta}(s_t) \cdot \mathbf{w} \right)^{2} \tag{7}$$

The parameter $\theta$ is used for obtaining $\phi_{\theta}(s)$, the feature representation shared by reward prediction and SR approximation. An ideal $\phi_{\theta}(s)$ should be: (1) a good predictor of the immediate reward for that state and (2) a good discriminator between states. The first condition is handled by minimizing the loss $L_{r}$; however, we also need a loss function that encourages the second condition. To this end, we use a deep convolutional auto-encoder $g_{\tilde{\theta}}$ to reconstruct the input images under an L2 loss. This dense feedback signal can be interpreted as an intrinsic reward function. The loss can be stated as:

$$L_{a}(\tilde{\theta}, \theta) = \left( g_{\tilde{\theta}}(\phi_{\theta}(s_t)) - s_t \right)^{2} \tag{8}$$

The composite loss function is the sum of the three losses given above:

$$L(\theta, \alpha, \mathbf{w}, \tilde{\theta}) = L_{m}(\alpha, \theta) + L_{r}(\mathbf{w}, \theta) + L_{a}(\tilde{\theta}, \theta) \tag{9}$$

For optimizing Eq. 9 with respect to the parameters $(\theta, \alpha, \mathbf{w}, \tilde{\theta})$, we iteratively update $(\theta, \mathbf{w}, \tilde{\theta})$ and $\alpha$. That is, we learn a feature representation $\theta$ by minimizing $L_{r} + L_{a}$; then, given $\theta$, we find the optimal $\alpha$ by minimizing $L_{m}$. This alternation is important to ensure that the successor branch does not back-propagate gradients that affect $\phi_{\theta}$. We store transitions in an experience replay memory and apply stochastic gradient descent with momentum, annealing the exploration parameter $\epsilon$ from 1 to 0.1 as training progresses. Algorithm 1 describes the learning procedure in greater detail.

1: Initialize experience replay memory $\mathcal{D}$, parameters $(\theta, \alpha, \mathbf{w}, \tilde{\theta})$, and exploration probability $\epsilon = 1$.
2: for each episode do
3:     Initialize game and get start state description $s$
4:     while $s$ is not terminal do
5:         $\phi_s \leftarrow f_{\theta}(s)$
6:         With probability $\epsilon$, sample a random action $a$; otherwise choose $a = \arg\max_{a'} u_{\alpha}(\phi_s, a') \cdot \mathbf{w}$
7:         Execute $a$ and obtain next state $s'$ and reward $R(s)$ from the environment
8:         Store transition $(s, a, R(s), s')$ in $\mathcal{D}$
9:         Randomly sample mini-batches from $\mathcal{D}$
10:        Perform gradient descent on the loss $L_{r}(\mathbf{w}, \theta) + L_{a}(\tilde{\theta}, \theta)$ with respect to $\mathbf{w}$, $\tilde{\theta}$, and $\theta$
11:        Fix $\theta$ and perform gradient descent on $L_{m}(\alpha, \theta)$ with respect to $\alpha$
12:        $s \leftarrow s'$
13:     end while
14:     Anneal exploration variable $\epsilon$
15: end for
Algorithm 1: Learning algorithm for DSR
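Steps 10 and 11 of Algorithm 1 interleave two gradient updates. The PyTorch sketch below shows one way such an interleaved step could look; it reuses the illustrative `DSR` module and `sr_target` helper from the earlier sketches, and the optimizer split (`opt_feat` over the feature, decoder and reward-weight parameters; `opt_sr` over the successor branch only) is an assumption about how the alternation might be implemented.

```python
import torch
import torch.nn.functional as F

def train_step(model, target_model, batch, opt_feat, opt_sr, gamma=0.99):
    """One interleaved update in the spirit of Algorithm 1, steps 10-11 (sketch)."""
    obs, actions, rewards, next_obs = batch

    # Step 10: gradient step on L_r + L_a w.r.t. features, decoder and reward weights.
    phi = model.features(obs)
    recon, r_hat = model.decoder(phi), model.w(phi)
    loss_feat = F.mse_loss(r_hat.squeeze(-1), rewards) + F.mse_loss(recon, obs)
    opt_feat.zero_grad(); loss_feat.backward(); opt_feat.step()

    # Step 11: gradient step on L_m w.r.t. the successor branch, features held fixed.
    with torch.no_grad():
        phi = model.features(obs)                    # theta is frozen for this step
        _, m_next, _, _, _ = target_model(next_obs)  # SR from the cached target network
        target = sr_target(phi, m_next, model.w.weight.squeeze(0), gamma)
    m = torch.stack([head(phi) for head in model.successor], dim=1)
    m_sa = m[torch.arange(len(actions)), actions]    # SR of the action actually taken
    loss_sr = F.mse_loss(m_sa, target)
    opt_sr.zero_grad(); loss_sr.backward(); opt_sr.step()
    return loss_feat.item(), loss_sr.item()
```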

4 Automatic Subgoal Extraction

Learning policies given sparse or delayed rewards is a significant challenge for current reinforcement learning algorithms, mainly due to inefficient exploration schemes such as $\epsilon$-greedy. Existing methods like Boltzmann exploration and Thompson sampling stadie2015incentivizing ; osband2016deep offer significant improvements over $\epsilon$-greedy, but are limited because the underlying models operate at the level of primitive actions. Hierarchical reinforcement learning algorithms barto2003recent such as the options framework szepesvari2014universal ; sutton1999between provide a flexible way to create temporal abstractions, which enable exploration at different time-scales. The agent can learn options to reach the subgoals, which can serve as intrinsic motivation. In the context of hierarchical RL, goel2003subgoal discuss a framework for subgoal extraction using the structural aspects of a learned policy model. Inspired by previous work on subgoal discovery from state trajectories csimcsek2005identifying and the tabular SR stachenfeld2014design , we use the learned SR to generate plausible subgoal candidates.

Given a random policy $\pi_r$, we train the DSR until convergence and collect the SR vectors for a large set of sampled states $\mathcal{T}$. Following csimcsek2005identifying ; shi2000normalized , we generate an affinity matrix $W$ from $\mathcal{T}$ by applying a radial basis function (with a Euclidean distance metric) to each pair of SR vectors, yielding $W_{ij}$. Let $D$ be a diagonal matrix with $D(i, i) = \sum_{j} W(i, j)$. Then, as per shi2000normalized , the eigenvector associated with the second smallest eigenvalue of the generalized eigensystem $(D - W) x = \lambda D x$ gives an approximation of the minimum normalized cut of the partition of $\mathcal{T}$. The states that lie on the end-points of the cut are plausible subgoal candidates, as they provide a path between communities of state groups. Given samples randomly drawn from $\mathcal{T}$, we collect statistics of how many times a particular state lies along the cut and pick the top-$k$ states as subgoals. Our experiments indicate that it is possible to extract useful subgoals from the DSR.
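A minimal sketch of the partitioning step, using the standard spectral relaxation of the normalized cut (Shi and Malik), is shown below; the RBF bandwidth `sigma`, the median-based bipartition, and the heuristic for ranking states near the cut boundary are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def subgoal_partition(sr_samples, sigma=1.0, top_k=5):
    """Spectral bipartition of SR samples (n, feat_dim) via the normalized-cut relaxation."""
    # Affinity matrix from pairwise Euclidean distances between SR vectors (RBF kernel).
    W = np.exp(-cdist(sr_samples, sr_samples, "sqeuclidean") / (2 * sigma ** 2))
    D = np.diag(W.sum(axis=1))
    # Solve the generalized eigenproblem (D - W) x = lambda * D x (Shi & Malik).
    _, vecs = eigh(D - W, D)
    fiedler = vecs[:, 1]                                  # second smallest eigenvector
    labels = (fiedler > np.median(fiedler)).astype(int)   # two-way partition of the states
    # States whose SR sits closest to the cut boundary are candidate subgoals.
    boundary_rank = np.argsort(np.abs(fiedler - np.median(fiedler)))
    return labels, boundary_rank[:top_k]
```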

5 Experiments

In this section, we demonstrate the properties of our approach on MazeBase sukhbaatar2015mazebase , a grid-world environment, and the Doom game engine kempka2016vizdoom . In both environments, observations are presented as raw pixels to the agent. In the first experiment we show that our approach is comparable to DQN in two goal-reaching tasks. Next, we investigate the effect of modifying the distal reward on the initial Q-value. Finally, using normalized-cuts, we identify subgoals given the successor representations in the two environments.

5.1 Goal-directed Behavior

Solving a maze in MazeBase

We learn the optimal policy in the maze shown in Figure 2 using the DSR and compare its performance to the DQN mnih2015human . The cost of living or moving over water blocks is -0.5 and the reward value is 1. For this experiment, we set the discount rate to 0.99 and use a fixed learning rate. We anneal $\epsilon$ from 1 to 0.1 over 20k steps; furthermore, for training the reward branch, we anneal the number of samples we use from 4000 to 1 by a factor of 0.5 after each training episode. For all experiments, we prioritize reward training by keeping a database of non-zero-reward transitions and sampling from the replay buffer with probability 0.8 and from this database with probability 0.2. Figure 3 shows the average trajectory (over 5 runs) of the rewards obtained over 100k episodes. As the plot suggests, DSR performs on par with DQN.

Figure 2: Environments: (left) MazeBase sukhbaatar2015mazebase map where the agent starts at an arbitrary location and needs to get to the goal state. The agent gets a penalty of -0.5 per-step, -1 to step on the water-block (blue) and +1 for reaching the goal state. The model observes raw pixel images during learning. (center) A Doom map using the VizDoom engine kempka2016vizdoom where the agent starts in a room and has to get to another room to collect ammo (per-step penalty = -0.01, reward for reaching goal = +1). (right) Sample screen-shots of the agent exploring the 3D maze.

Finding a goal in a 3D environment

We created a map with 4 rooms using the ViZDoom platform kempka2016vizdoom ; the map is shown in Figure 2. We use the same network architecture as in the MazeBase case. The agent is spawned inside a room and can explore any of the other three rooms. The agent gets a per-step penalty of -0.01 and a positive reward of 1.0 for collecting an item from one of the rooms (highlighted in red in Figure 2). As shown in Figure 3, the agent successfully learns to navigate the environment to obtain the reward and is competitive with DQN.

Figure 3: Average reward trajectory over multiple runs: (left) over 100k steps for the grid-world maze; (right) over 180k steps for the Doom map.
Figure 4: Changing the value of the distal reward: We train the model to learn the optimal policy on the maze shown in Figure 2. After convergence, we change the value of the distal reward and update the Q-value for the optimal action at the origin (bottom-left corner of the maze). In order for the value function to converge again, the model only needs to update the linear weights given the new external rewards.

5.2 Value function sensitivity to distal reward changes

The decomposition of the value function into the SR and an immediate reward prediction allows DSR to adapt rapidly to changes in the reward function. To probe this, we performed experiments measuring the adaptability of the value function to distal reward changes. Given the grid-world map in Figure 2, we first train the agent to solve the goal specified in the map, as described in section 5.1. Without changing the goal location, we then change the scalar reward obtained upon reaching the goal from 1.0 to 3.0. Our hypothesis is that, due to the SR-based value decomposition, the value estimate will converge to this change by updating only the reward weights (the SR remains the same). As shown in Figure 4, we confirm that the DSR quickly adapts to the new value function by updating only $\mathbf{w}$.
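This adaptation amounts to refitting only the linear reward weights while the features and the SR stay fixed. The sketch below illustrates the idea with a ridge-regularized least-squares refit over hypothetical arrays `phi_batch` and `new_rewards`; in the paper the weights are updated by gradient descent on Eq. 7, so this closed-form version is a simplification.

```python
import numpy as np

def refit_reward_weights(phi_batch, new_rewards, reg=1e-3):
    """Refit w so that phi(s) . w ~= R_new(s), leaving phi and the SR untouched.
    phi_batch:   (n, feat_dim) features of visited states
    new_rewards: (n,)          rewards observed under the changed reward function
    """
    d = phi_batch.shape[1]
    # Ridge-regularized least squares; the Q-values m_sa . w then update immediately.
    w = np.linalg.solve(phi_batch.T @ phi_batch + reg * np.eye(d),
                        phi_batch.T @ new_rewards)
    return w
```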

5.3 Extracting subgoals from the DSR

Following section 4, we can also extract subgoals from the SR. We collect SR samples by running a random policy on both MazeBase and VizDoom. During learning, we only update the SR branch ($u_{\alpha}$) and the reconstruction branch ($g_{\tilde{\theta}}$), as the immediate reward at any state is zero under the random policy.

As shown in Figures 5 and 6, our subgoal extraction scheme captures useful subgoals and clusters the environment into reasonable segments. Such a scheme can be run periodically within a hierarchical reinforcement learning framework to aid exploration. One inherent limitation of this approach is that, due to the random policy, the subgoal candidates are often quite noisy. Future work should address this limitation and provide statistically robust ways to extract plausible candidates. Additionally, the subgoal extraction algorithm should be non-parametric in order to handle a flexible number of subgoals.


Figure 5: Subgoal extraction on grid-world: Given a random policy, we train DSR until convergence and collect a large number of sample transitions and their corresponding successor representations, as described in section 4. We apply a normalized cut-based algorithm on the SRs to obtain a partition of the environment as well as the bottleneck states (which correspond to subgoals). (a) Subgoals are states that separate different partitions of the environment under the normalized-cut algorithm; our approach finds reasonable subgoal candidates. (b) The partitions of the environment reflect its latent structure.
Figure 6: Subgoal extraction on the Doom map: The subgoals are extracted by applying the normalized cut-based algorithm to SR samples collected under a random policy. The subgoals mostly correspond to the rooms' entrances in the common area between the rooms. Due to the random policy, we sometimes observe high variance in subgoal quality. Future work should address robust statistical techniques for obtaining subgoals, as well as non-parametric approaches for handling a flexible number of subgoals.

6 Conclusion

We presented the DSR, a novel deep reinforcement learning framework to learn goal-directed behavior given raw sensory observations. The DSR estimates the value function by taking the inner product between the SR and immediate reward predictions. This factorization of the value function gives rise to several appealing properties over existing deep reinforcement learning methods—namely increased sensitivity of the value function to distal reward changes and the possibility of extracting subgoals from the SR under a random policy.

For future work, we plan to combine the DSR with hierarchical reinforcement learning. Learning goal-directed behavior with sparse rewards is a fundamental challenge for existing reinforcement learning algorithms. The DSR can enable efficient exploration by periodically extracting subgoals, learning policies to satisfy these intrinsic goals (skills), and subsequently learning a hierarchical policy over these subgoals in an options framework szepesvari2014universal ; kulkarni2016hierarchical ; schaul2015universal . One of the major issues with the DSR is learning discriminative features. In order to scale up our approach to more expressive environments, it will be crucial to combine various deep generative and self-supervised models eslami2016attend ; greff2015binding ; rezende2016one ; kulkarni2015deep ; whitney2016understanding ; gregor2015draw ; huang2015efficient ; noroozi2016unsupervised with our approach. In addition to subgoals, using the DSR to extract other intrinsic motivation measures, such as improvements to the predictive world model schmidhuber2010formal or mutual information mohamed2015variational , is also worth pursuing.

References