Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning
The problem of sparse rewards is one of the hardest challenges in contemporary reinforcement learning. Hierarchical reinforcement learning (HRL) tackles this problem by using a set of temporally-extended actions, or options, each of which has its own subgoal. These subgoals are normally handcrafted for specific tasks. Here, though, we introduce a generic class of subgoals with broad applicability in the visual domain. Underlying our approach (in common with work using "auxiliary tasks") is the hypothesis that the ability to control aspects of the environment is an inherently useful skill to have. We incorporate such subgoals in an end-to-end hierarchical reinforcement learning system and test two variants of our algorithm on a number of games from the Atari suite. We highlight the advantage of our approach in one of the hardest games, Montezuma's Revenge, for which the ability to handle sparse rewards is key. Our agent learns several times faster than the current state-of-the-art HRL agent in this game, reaching a similar level of performance. UPDATE 22/11/17: We found that a standard A3C agent with a simple shaped reward, i.e. extrinsic reward + feature control intrinsic reward, has comparable performance to our agent in Montezuma's Revenge. In light of the new experiments performed, the advantage of our HRL approach can be attributed more to its ability to learn useful features from intrinsic rewards than to its ability to explore and reuse abstracted skills with hierarchical components. This has led us to a new conclusion about the result.
Reinforcement learning methods Sutton:1998 ; mnih2015human often struggle in environments where the rewards are sparsely encountered, and when their acquisition requires the coordination of temporally extended sequences of actions. In these types of environments, archetypally exemplified by the Atari game Montezuma’s Revenge, the dearth of feedback the agent receives from the environment makes it very difficult to learn long sequences of actions, particularly when the timescale of the exploration strategy is short.
Hierarchical reinforcement learning (HRL) barto2003recent is an approach that aims to deal with the reward sparsity problem by equipping the agent with temporally extended macro-actions, also known as options sutton1999between or skills konidaris2009skill , which abstract over sequences of primitive actions. If useful options are established, then long sequences of primitive actions can be expressed by much shorter sequences of options, which are easier to learn as the agent can now employ temporally extended exploration in the option space. However, learning useful options is a difficult task in itself; one possibility is to incorporate prior knowledge about the task into their construction barto2003recent ; parr1998reinforcement ; dietterich2000hierarchical but this can limit the generalisability of the algorithm to other tasks.
In this paper, we constrain this prior knowledge to the hypothesis that the ability to control features of its environment is an inherently useful skill for an agent to have for succeeding at a wide variety of tasks. By applying this concept in a deep HRL setting, we design an agent that is intrinsically motivated to control aspects of its environment via a set of options.
The architecture of our agent is inspired by feudal reinforcement learning dayan1992feudal ; vezhnevets2017feudal and the hierarchical deep reinforcement learning framework kulkarni2016hierarchical , whereby a meta-controller provides embedded subgoals to a sub-controller that interacts directly with the environment (see Figure 1(a)). In our agent, the meta-controller, which learns to maximise extrinsic reward from the environment, tells the sub-controller what feature of the environment it should control; the sub-controller receives intrinsic rewards for successfully changing the chosen feature as well as extrinsic rewards from the environment. We show that, when guided by this form of intrinsic motivation, the agent learns to perform better in tasks featuring sparse rewards.
Our main contribution is in the design of discrete sets of subgoals available for the meta-controller to choose from and their corresponding intrinsic reward. By taking an existing idea of feature control jaderberg2016reinforcement ; bengio2017independently and incorporating it into the subgoal design, we introduce a hierarchical agent with generically useful learnable options which we empirically evaluate in the Atari domain.
The idea of embodying an agent with a form of intrinsic motivation, which in our case is the desire to be able to control aspects of its environment, is one that has been explored in several other works. Klyubin et al. introduced empowerment as an information-theoretic measure of the degrees of freedom that an agent has over an environment klyubin2005empowerment . The concept of empowerment has recently gained interest in the context of intrinsically motivated reinforcement learning mohamed2015variational ; gregor2016variational . In other lines of work, intrinsic motivation is defined in the form of curiosity, which can be measured with model-learning progress schmidhuber1991curious ; houthooft2016vime or information gain bellemare2016unifying .
Jaderberg et al. introduced the idea of off-policy training with auxiliary control tasks, such as pixel control or feature control, which can significantly speed up learning of the main task jaderberg2016reinforcement . The rationale is that learning the auxiliary tasks gives the agent features that are useful for manipulating the environment. Drawing on this idea, we apply the idea of pixel and feature control to the HRL framework so that options are constructed with the explicit motive of altering given features or patches of pixels. By doing so, our agent is equipped with temporally-extended options, which can be used on-policy to explore the environment in a temporally-extended manner, thereby helping to address the problem of sparse reward.
Our architecture also takes inspiration from recent work by Kulkarni et al. kulkarni2016hierarchical and Vezhnevets et al. vezhnevets2017feudal . These works outline hierarchical architectures that comprise a subgoal-selecting meta-controller and a sub-controller that tries to achieve the subgoal. The main feature that sets our model apart is the design of the subgoals. Kulkarni et al. pre-define a set of discrete subgoals specific to the tasks at hand, while Vezhnevets et al. construct subgoals as a large continuous set of embedded states. We construct two discrete sets of subgoals, which are discussed in more detail in Section 2.1. One of them is fixed but is designed to be generically applicable in visual domains; the other can be automatically learned such that the subgoals are useful for solving the task at hand.
There are a large number of works on subgoal discovery csimcsek2004using ; menache2002q ; mcgovern2001automatic , most of which are based on finding bottleneck states. Since finding bottleneck states requires global statistics of the environment, finding them can be difficult and hard to scale. Our work is in line with contemporary HRL, e.g. Option-Critic bacon2016option , which has moved towards end-to-end training where options and subgoals can automatically emerge from the optimisation of the system, with carefully designed architectures and objective functions.
Figure 1: ... The bold horizontal line depicts a concatenation of all the incoming vectors. (c) Diagram of our proposed model. The most notable difference is an additional LSTM component which parameterises the meta-controller’s policy and value function. The dotted line represents the slower time scale at which the meta-controller operates.
We consider the standard reinforcement learning setting where an agent interacts with an environment by observing the state $s_t$ of the environment and taking an action $a_t$ at every discrete time step $t$. The environment provides an extrinsic reward $r^{ext}_t$ to the agent and then transitions to the next state $s_{t+1}$. The goal of the agent is to maximise the accumulated sum of extrinsic rewards over the finite horizon length of an episode.
Specifically, we consider a hierarchical agent with two components: a meta-controller and a sub-controller. The sub-controller is responsible for choosing the agent’s actions and directly interacts with the environment. The meta-controller operates on a longer time scale of $T$ time steps and influences the behaviour of the sub-controller through a subgoal argument, $g$. This influence is imposed by giving $g$ as an input to the sub-controller in addition to $s_t$. Importantly, the meta-controller also gives intrinsic reward to the sub-controller, $r^{int}_t$, for successfully completing the subgoal and thus, by learning to associate $g$ and $r^{int}_t$, the sub-controller’s behaviour is biased to complete the subgoals. At the same time, the meta-controller learns to select sequences of $g$ such that the sub-controller trajectory maximises accumulated extrinsic reward.
Here we detail two variants of our algorithm, corresponding to two ways to deliver the subgoal $g$: (a) the pixel-control agent and (b) the feature-control agent. Both agents have the same architecture, which will be discussed in Section 2.2. The main difference between the two is the calculation of intrinsic reward, which is crucial for manipulation of the behaviour of the sub-controller.
Following Jaderberg et al., we study the most basic form of controlling ability in the visual domain, which is the ability to control a given subset of pixels in the visual input jaderberg2016reinforcement . We divide the pre-processed 84x84 input image into 36 pixel patches of size 14x14. We define the intrinsic reward as the squared difference between two consecutive frames of pixels in the patch, normalized by the squared difference of the whole image. The sub-controller is thus encouraged to maximise the change in values of pixels in the given ($k$-th) patch relative to the entire screen. This can be written formally as
$$ r^{int}_t(k) = \eta \, \frac{\lVert h_k \odot (s_t - s_{t-1}) \rVert^2}{\lVert s_t - s_{t-1} \rVert^2}, \tag{1} $$
where $h_k$ is an 84x84 binary filter matrix with entries all equal to 0 apart from the $k$-th pixel patch, which has entries all equal to 1. By applying this filter with element-wise multiplication $\odot$, only the changes in the relevant part of the screen are taken into account. $\eta$ is a scaling factor which controls the magnitude of the intrinsic reward per time step. (Footnote: We choose a value of $\eta$ which gives a reasonable value of accumulated intrinsic reward over an episode at the start of the training. We leave the tuning of this parameter for future work. While we believe tuning this parameter is important, the more important parameter is the relative weight between extrinsic and intrinsic reward, see eq. 3.)
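The pixel-control reward described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the row-by-row indexing of patches, the treatment of identical frames as zero reward, and the placeholder default `eta=1.0` are all assumptions made for the example.

```python
import numpy as np

PATCH = 14   # patch side length in pixels
GRID = 6     # 84 / 14 patches per side, giving 36 patches


def patch_mask(k, size=84):
    """Binary size x size filter: 1 inside the k-th patch, 0 elsewhere."""
    row, col = divmod(k, GRID)  # assumption: patches are indexed row by row
    mask = np.zeros((size, size), dtype=np.float64)
    mask[row * PATCH:(row + 1) * PATCH, col * PATCH:(col + 1) * PATCH] = 1.0
    return mask


def pixel_control_reward(s_prev, s_curr, k, eta=1.0, eps=1e-12):
    """Squared change inside patch k, normalised by the squared change
    over the whole frame and scaled by eta (placeholder value)."""
    diff = np.asarray(s_curr, dtype=np.float64) - np.asarray(s_prev, dtype=np.float64)
    total = np.sum(diff ** 2)
    if total < eps:
        # Assumption: identical frames yield no intrinsic reward.
        return 0.0
    mask = patch_mask(k)
    if diff.ndim == 3:
        # Broadcast the 2-D filter over the colour channels.
        mask = mask[:, :, None]
    return float(eta * np.sum((mask * diff) ** 2) / total)
```

Because the patches tile the frame, the rewards summed over all 36 patches equal `eta` whenever the frame changes at all, so the reward measures the share of the total change that falls inside the selected patch.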
Jaderberg et al. introduced a notion of feature control which is defined as the ability to control the activations of specific neurons. Similarly, Bengio et al. introduced the notion of feature selectivity that measures how much a feature can be controlled, independently from other features bengio2017independently . We define intrinsic reward as Bengio et al.’s feature selectivity measure on the second convolutional layer of our network. To measure the selectivity of a feature, we take the difference between the mean activation of a selected feature map at consecutive time steps and normalize with all feature maps. This can be written as
$$ r^{int}_t(k) = \frac{\left| \bar{f}_k(s_t) - \bar{f}_k(s_{t-1}) \right|}{\sum_{k'} \left| \bar{f}_{k'}(s_t) - \bar{f}_{k'}(s_{t-1}) \right|}, \tag{2} $$
where $\bar{f}_k$ denotes the mean over activation values in the $k$-th feature map and $\sum_{k'}$ denotes summation over all feature maps.
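As a rough illustration of the selectivity-based reward described above, the sketch below computes the change in the mean activation of one feature map and normalises it by the summed change over all maps. The function name, the `(height, width, channels)` layout, the use of absolute differences and the zero-reward convention when nothing changes are assumptions for this example, not details confirmed by the text.

```python
import numpy as np


def feature_control_reward(f_prev, f_curr, k, eps=1e-12):
    """Change in mean activation of feature map k between consecutive
    steps, normalised by the total change over all feature maps.

    f_prev, f_curr: activations of shape (height, width, channels).
    """
    mean_prev = np.asarray(f_prev, dtype=np.float64).mean(axis=(0, 1))
    mean_curr = np.asarray(f_curr, dtype=np.float64).mean(axis=(0, 1))
    change = np.abs(mean_curr - mean_prev)   # one value per feature map
    total = change.sum()
    if total < eps:
        # Assumption: no change in any map yields no intrinsic reward.
        return 0.0
    return float(change[k] / total)
```

For instance, if only two maps change, by 2 and 6 units of mean activation, the corresponding rewards are 0.25 and 0.75, and every other map receives 0.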
In contrast with the pixel-control agent, allowing the meta-controller to select a convolutional feature endows the agent with more flexible and abstract control of its environment. The instruction from the meta-controller is more abstract since a feature map can represent a complex function of the sensory inputs, and it is more flexible because the feature maps can be shaped during learning to encode aspects of the environment that are useful to control for the completion of the main task.
In addition to intrinsic reward, we also give extrinsic reward to the sub-controller, enabling it to learn fine-grained behaviour. We adjust the ratio between intrinsic and extrinsic reward with a parameter $\beta \in [0, 1]$, which results in the shaped reward,
$$ r_t = \beta \, r^{int}_t + (1 - \beta) \, r^{ext}_t. \tag{3} $$
The sub-controller strictly follows the order of the meta-controller if $\beta = 1$. On the other hand, if $\beta = 0$, the meta-controller has little direct influence on the sub-controller. In this case, the sub-controller still receives the subgoal argument as an input but does not receive rewards for attaining the subgoal.
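The mixing of the two reward signals can be sketched as below. The convex-combination form and the name `beta` are assumptions. They are chosen so that one extreme of the weight removes the extrinsic term and the other removes the intrinsic term, as the text describes.

```python
def shaped_reward(r_int, r_ext, beta):
    """Weighted mix of intrinsic and extrinsic reward.

    beta = 1 gives purely intrinsic reward (the sub-controller only
    follows the meta-controller); beta = 0 gives purely extrinsic reward.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * r_int + (1.0 - beta) * r_ext
```

With `r_int = 0.5` and `r_ext = 1.0`, for example, the shaped reward moves linearly from 1.0 at `beta = 0` to 0.5 at `beta = 1`.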
Our baseline model is a variant of the asynchronous advantage actor-critic algorithm (A3C) mnih2016asynchronous . We adapted an open-source implementation of A3C, OpenAI’s “Universe-Starter-Agent” (https://github.com/openai/universe-starter-agent ), which is written with TensorFlow tensorflow2015-whitepaper , to follow the architecture specified by Wang et al. wang2016learning . (The source code of our implementation will be made publicly available.) The model consists of two parts: an encoding module and a Long Short-Term Memory (LSTM) layer hochreiter1997long . The encoding module consists of two convolution layers and a fully connected layer. The first convolution has 16 8x8 filters with a stride length of 4 and the second layer has 32 4x4 filters with a stride length of 2. The second convolution is followed by a fully connected layer with 256 units. The output of the fully connected layer is then concatenated with the previous action and the previous reward, and then fed into an LSTM layer. The LSTM has 256 cells whose output linearly projects into the policy and value networks.
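To make the layer sizes concrete, the following sketch works through the spatial arithmetic of the encoder. The text does not state the padding scheme, so unpadded ("valid") convolutions are assumed here. The one-hot encoding of the previous action and the default of 18 actions (the size of the full ALE action set) are likewise assumptions for illustration.

```python
def conv_out(size, kernel, stride):
    """Output side length of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def encoder_sizes(input_side=84, n_actions=18):
    """Layer sizes of the encoder under the assumptions stated above."""
    side1 = conv_out(input_side, 8, 4)   # 16 filters of 8x8, stride 4
    side2 = conv_out(side1, 4, 2)        # 32 filters of 4x4, stride 2
    flattened = side2 * side2 * 32       # input to the 256-unit dense layer
    # Dense output + one-hot previous action + scalar previous reward.
    lstm_input = 256 + n_actions + 1
    return {
        "conv1": (side1, side1, 16),
        "conv2": (side2, side2, 32),
        "flattened": flattened,
        "lstm_input": lstm_input,
    }
```

Under these assumptions an 84x84 input gives a 20x20x16 map after the first convolution and a 9x9x32 map after the second, so the dense layer receives 2592 values. With "same" padding the maps would instead be 21x21 and 11x11.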
Our hierarchical model extends the baseline model as follows (Figure 1). An additional LSTM layer is added to parameterise the meta-controller’s value and policy function. The input to the meta-controller’s LSTM includes the previous subgoal argument and extrinsic rewards accumulated from the previous meta-step. The sub-controller’s LSTM also has an additional input consisting of the current subgoal argument. The meta-controller operates every $T$ time steps.
The subgoal argument is a one-hot vector, which specifies the index of the subgoal selected by the meta-controller. For the pixel-control agent, there are 37 subgoals, corresponding to the 36 pixel patches plus a no-op which gives no intrinsic reward. The feature-control agent has 32 discrete subgoals, each corresponding to a feature map in the second convolutional layer.
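The interaction between the two time scales can be sketched as follows: the meta-controller picks a subgoal index at the start of each meta-step, and that index is held fixed, as a one-hot vector, for the sub-controller until the next meta-step. The helper names, the callback `choose_subgoal` and the placement of the no-op at the last index are hypothetical choices for this example.

```python
import numpy as np

N_PIXEL_SUBGOALS = 37     # 36 patches plus a no-op
NO_OP = 36                # assumption: the no-op takes the last index


def one_hot(index, n):
    """One-hot subgoal argument fed to the sub-controller."""
    v = np.zeros(n, dtype=np.float32)
    v[index] = 1.0
    return v


def subgoal_schedule(n_steps, period, choose_subgoal):
    """Subgoal index active at each primitive time step.

    choose_subgoal(t) stands in for the meta-controller's policy and is
    only queried once every `period` steps.
    """
    active = []
    g = None
    for t in range(n_steps):
        if t % period == 0:
            g = choose_subgoal(t)   # new meta-step: pick a new subgoal
        active.append(g)
    return active
```

With a period of 100 steps, for example, a 250-step episode contains three meta-steps, the last one cut short by the end of the episode.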
for all of our experiments. Our experiments use a backpropagation through time (BPTT) trajectory length of either 20 or 100 time steps for the sub-controller and the baseline agent, and 20 meta-steps (2000 time steps) for the meta-controller. Finally, the gradients are scaled down when their norms exceed 40.
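One common way to scale gradients down when their norm exceeds a threshold is clipping by the global L2 norm, sketched below in NumPy. Which norm was used, and whether it was taken over all parameters jointly, are assumptions here. The function name is also hypothetical.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm=40.0):
    """Rescale a list of gradient arrays so that their joint L2 norm
    does not exceed max_norm. Returns (clipped_grads, original_norm)."""
    global_norm = float(np.sqrt(sum(np.sum(np.square(g, dtype=np.float64)) for g in grads)))
    if global_norm <= max_norm:
        return [np.array(g, copy=True) for g in grads], global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm
```

Scaling all arrays by the same factor preserves the direction of the update and only limits its size.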
The objectives of our experiments are:
To verify that the influence from the meta-controller is beneficial in sparse reward environments.
To evaluate the performance of our agent in several environments in comparison to current state-of-the-art HRL systems.
To study the behaviour of the agents under the influence of pixel-control and feature-control intrinsic motivation.
We evaluated the model on Atari games in the OpenAI gym environment brockman2016openai , a toolkit for comparing reinforcement learning algorithms that wraps the Arcade Learning Environment (ALE) bellemare2013arcade with a number of modifications. In our experiments, we make comparisons with the Feudal Network (FuN) vezhnevets2017feudal and Option-Critic bacon2016option architectures, both evaluated on the ALE. It is important to note, however, that these results are not directly comparable: OpenAI’s gym adds stochasticity through the use of random frame-skips, while ALE is a deterministic environment, for which the standard evaluation protocol is to add a random number of no-op actions at the start of the episode to achieve some stochasticity. The environment provides the state as 210x160x3 RGB pixels. We pre-process the state by resizing it into an 84x84x3 array, retaining the RGB channels. We also clip the extrinsic reward to the range of [-1, 1]. We used the v0 setting for all the games, e.g., MontezumaRevenge-v0 for Montezuma’s Revenge.
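The pre-processing steps above can be sketched without any image library. The interpolation method is not stated in the text, so nearest-neighbour sampling is assumed here purely for simplicity. Both function names are hypothetical.

```python
import numpy as np


def resize_nearest(frame, out_h=84, out_w=84):
    """Nearest-neighbour resize of an (H, W, C) frame (assumed method)."""
    in_h, in_w = frame.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return frame[rows][:, cols]


def clip_reward(r):
    """Clip the extrinsic reward to the range [-1, 1]."""
    return float(np.clip(r, -1.0, 1.0))
```

A raw 210x160x3 frame passed through `resize_nearest` comes out as 84x84x3 with the colour channels retained, and a game reward of 100 is clipped to 1.0.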
To evaluate the effectiveness of the meta-controller, we ran our agent with different relative weights $\beta$ between extrinsic and intrinsic reward, and compared the performance with the baseline agent. We used BPTT = 20 for the sub-controller. First, we found that with $\beta = 0$ (no intrinsic reward) the agent’s performance was similar to the baseline. This result demonstrates that any significant gain or decline in performance using other values of $\beta$ can be attributed to the intrinsic reward.
As shown in Figure 2, the feature-control agent with an intermediate value of $\beta$ outperforms other agents in Montezuma’s Revenge and Frostbite and is competitive with other agents in Q*bert and Private Eye. This result suggests that introducing a certain proportion of intrinsic reward in the sub-controller has a positive effect in sparse reward environments (such as Montezuma’s Revenge) without degrading the performance on dense reward environments (such as Q*bert).
Our agents with $\beta = 1$ (no extrinsic reward) perform very poorly, as expected. Since the sub-controller can only follow a limited number of subgoals from the meta-controller, its behaviours are also limited in this case. Giving extrinsic reward to the sub-controller is a way to allow fine-grained behaviours that are important for maximising extrinsic reward. Agents with a large value of $\beta$ also perform worse than the baseline in Q*bert. This result shows that too much influence from the meta-controller can have a negative effect in dense reward environments. Interestingly, the best value of $\beta$ is consistent across all four games.
We observe that the pixel-control agent learns more quickly than the feature-control agent. However, the feature-control agent generally achieves better scores after 100 million frames of training. This is likely due to the fact that the feature-control agent needs time to learn useful features before the influence of the meta-controller becomes meaningful. Once the features have been learned, the subgoals obtained are of higher quality than the hard-coded ones in the pixel-control agent.
In our initial experiments, we observed instability in the training curve of the feature-control agent, which comes in the form of catastrophic drops in performance. To alleviate this problem, we tried increasing the BPTT roll-out from 20 to 100 steps for the sub-controller. We reasoned that a longer unrolled sequence of BPTT could contribute to training stability in the following ways: (i) the updates are less frequent and give the agent more stable features, which are a crucial component in the calculation of the intrinsic reward, and (ii) it allows the gradient to be backpropagated further into the past, which potentially reduces bias in the update.
In Figure 3 we see that in Montezuma’s Revenge the agent attains a much higher score with BPTT = 100 than with BPTT = 20. In Frostbite, however, we observe the opposite effect. This could be because the gradient is already stable at BPTT = 20 and so increasing the BPTT length does not yield any positive effect; on the contrary, as a result of lowering the frequency of updates it can result in slower learning (see Figure 3).
In this experiment, we evaluated our feature-control agent on Ms. Pac-Man, Asterix, Zaxxon and Montezuma’s Revenge. The aim was to show that the method is applicable to a broad range of games, and to compare our system to two state-of-the-art end-to-end HRL systems, namely the Option-Critic bacon2016option and FuN vezhnevets2017feudal architectures.
Our results are shown in Figure 4 and we note the following: (i) On Ms. Pac-Man, Asterix and Zaxxon we achieve better maximum scores than the Option-Critic network but worse maximum scores than the FuN network; (ii) on Montezuma’s Revenge, our agent reaches approximately the same maximum score as the FuN network but it learns much more quickly, reaching this level of performance after fewer than a fifth of the number of observations. We anticipate being able to improve our agent’s performance with a broader parameter search. For example, the discount parameter, $\gamma$, has been shown to have a significant impact on the performance of both A3C and FuN on different Atari games vezhnevets2017feudal . We did not compare with other state-of-the-art results in Montezuma’s Revenge, such as those obtained with the UNREAL agent jaderberg2016reinforcement , DQN-CTS bellemare2016unifying and DQN-PixelCNN OstrovskiBOM17 , since these are not competing HRL methods and their advantageous features could easily be integrated into our agent.
The influence of the intrinsic motivation provided by the meta-controller on the agent’s behaviour can be most easily visualised with the pixel-control agent. In Figure 5a, a sequence of screenshots from Montezuma’s Revenge is shown where we see the sub-controller moving the character to the patch selected by the meta-controller and causing it to jump around in the patch in order to generate intrinsic reward. Figure 5b shows another sequence where the meta-controller selects a patch over the ladder that must be climbed to collect the key. The character moves to the patch, but when the meta-controller then changes the location of the patch, the sub-controller ignores it and instead proceeds to collect the key, which results in extrinsic reward. This example highlights the importance of motivating the sub-controller with extrinsic as well as intrinsic reward, allowing the agent to be flexible and not completely at the mercy of the meta-controller.
Interpreting the intrinsic motivation of the feature-control agent is much more difficult, since it involves understanding what is encoded in the selected convolutional feature map. In an attempt to visualise this, we upsampled the selected feature map and overlaid it on the raw state input. (This approach to visualising attention features is used by Xu et al. (2015) xu2015show .) The feature-control agent has to learn strategies to maximally change the activations of this feature map to gain intrinsic reward.
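A visualisation of this kind can be approximated by upsampling the feature map to the frame size and blending it with the frame. The sketch below is only one possible way to do this. Nearest-neighbour upsampling, min-max normalisation of the map, the blending weight `alpha` and the function names are all assumptions, since the text does not describe the exact procedure.

```python
import numpy as np


def upsample_nearest(fmap, out_h=84, out_w=84):
    """Nearest-neighbour upsampling of a 2-D feature map."""
    h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[np.ix_(rows, cols)]


def overlay(frame, fmap, alpha=0.5):
    """Blend a normalised, upsampled feature map onto an RGB frame.

    frame: float array in [0, 1] of shape (H, W, 3).
    """
    up = upsample_nearest(fmap, frame.shape[0], frame.shape[1])
    span = up.max() - up.min()
    heat = (up - up.min()) / span if span > 0 else np.zeros_like(up)
    # Brighten all three channels where the feature map is active.
    return (1.0 - alpha) * frame + alpha * heat[:, :, None]
```

Regions where the selected feature map is strongly activated then appear brighter in the blended image, which makes it easier to see which parts of the screen the subgoal refers to.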
We present two scenarios with the feature-control agent in Figure 5c and d, which indicate how different types of features can evolve to form useful options for the agent. Figure 5c shows the agent collecting the key in the first room of Montezuma’s Revenge. In this scenario, the feature map is activated in front of the agent on the path towards the key. This implicitly encourages the agent to move towards the key, as it attempts to maximally alter the activations of the feature map. Figure 5d shows the agent collecting the sword in another room. In this scenario, the feature map is only activated when the agent completes the apparent sub-task (collecting the sword), as opposed to the first scenario where the entire path to completing the sub-task (collecting the key) is highlighted.
In this paper, we presented an approach to tackling the reward sparsity problem in the form of a two-module deep hierarchical agent. In Montezuma’s Revenge, an Atari game with particularly sparse rewards, our agent learns several times faster than the current state-of-the-art HRL agents, reaching a similar final level of performance. We also show that our subgoal designs are generically applicable across visual tasks by evaluating the agent on several different games. Our agent almost always performs better than the baseline agent, which suggests that the ability to control aspects of the environment can be a generically useful subgoal.
We argue that part of the performance gain comes from the ability to perform temporally-abstracted exploration. By visualising the trajectories of the pixel-control agent, we observe that it successfully learns to move towards the patch selected by the meta-controller in order to maximise its intrinsic reward; the acquisition of this skill allows the meta-controller to motivate the agent to explore its environment in a broad and temporally extended manner. In the feature-control agent, the options are learned via the shaping of the convolutional features and, while the features are harder to interpret than the pixel patches, there is evidence from our visualisations that they are activated by the completion of intuitive subgoals, such as collection of the sword in Montezuma’s Revenge.
An important result from our experiments is that the best performances are achieved when the sub-controller is motivated by a combination of intrinsic and extrinsic reward. Leaking some extrinsic reward to the sub-controller frees us from the restriction that subgoals need to be complete or carefully designed, which can lead to brittle or sub-optimal solutions dietterich2000hierarchical . As long as the subgoals are useful for exploration, an agent equipped with such skills can learn faster, while still maintaining the ability to fine-tune its behaviour to maximise extrinsic reward.
In order to give our agent more flexibility, it would be interesting, in future work, to incorporate a termination condition to the options, which would allow the instruction from the meta-controller to be variable in length and thus more temporally precise. Additionally it would be interesting to quantify the extent to which our agents have learned to control their environment, perhaps by using the measure of empowerment klyubin2005empowerment .
UPDATE 22/11/17: We later found that a flat A3C agent trained with shaped reward according to equation 3 can perform as well as our feature-control agent in Montezuma’s Revenge. This result, in line with prior work, supports the claim that additional auxiliary rewards or loss signals can be beneficial when dealing with sparse reward environments, even though the reward can possibly skew the definition of the task.
Importantly, this raises a question about the benefit of having the hierarchical elements proposed in this paper. It appears that decisions made by the meta-controller do not significantly contribute to the success of the feature-control agent.
We would like to thank Marc Deisenroth for providing us with Azure credits from the Microsoft Azure Sponsorship for Teaching and Research. We would also like to thank Kyriacos Nikiforou, Hugh Salimbeni and Kai Arulkumaran for fruitful discussions. N.D. is supported by the DPST scholarship from the Thai government.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
AAAI Conference on Artificial Intelligence, 2017.