Scheduled Intrinsic Drive: A Hierarchical Take on Intrinsically Motivated Exploration

03/18/2019 ∙ by Jingwei Zhang, et al.

Exploration in sparse reward reinforcement learning remains a difficult open challenge. Many state-of-the-art methods use intrinsic motivation to complement the sparse extrinsic reward signal, giving the agent more opportunities to receive feedback during exploration. Most commonly, these signals are added as bonus rewards, which results in the mixture policy faithfully conducting neither exploration nor task fulfillment for an extended amount of time. In this paper, we instead learn separate intrinsic and extrinsic task policies and schedule between these different drives to accelerate exploration and stabilize learning. Moreover, we introduce a new type of intrinsic reward denoted as successor feature control (SFC), which is general and not task-specific. It takes into account statistics over complete trajectories and thus differs from previous methods that only use local information to evaluate intrinsic motivation. We evaluate our proposed scheduled intrinsic drive (SID) agent using three different environments with pure visual inputs: VizDoom, DeepMind Lab and OpenAI Gym classic control from pixels. The results show a greatly improved exploration efficiency with SFC and the hierarchical usage of the intrinsic drives. A video of our experimental results can be found at https://youtu.be/4ZHcBo7006Y.


1 Introduction

Reinforcement learning (RL) agents learn from evaluative feedback, requiring only reward signals rather than the ground truth labels needed for instructive feedback (Sutton & Barto, 2018). This takes a further step toward automating the development of intelligent problem-solving agents. With deep networks as powerful function approximators bringing traditional RL into high-dimensional domains, deep reinforcement learning (DRL) has shown great potential (Mnih et al., 2015, 2016; Schulman et al., 2017; Horgan et al., 2018). However, the success of DRL often relies on carefully shaped, dense extrinsic reward signals. Although shaping extrinsic rewards can greatly support the agent in finding solutions and shorten the interaction time, designing such dense signals often requires substantial domain knowledge, and computing them typically requires ground truth state information, both of which are hard to obtain for robots acting in the real world. When not carefully designed, the reward shape can introduce bias or even act as a distraction, potentially hindering the discovery of optimal solutions. More importantly, learning on dense extrinsic rewards runs counter to the goal of reducing supervision and can prevent the agent from taking full advantage of the RL framework.

In this paper, we consider terminal reward RL settings, where a signal is given only when the final goal is achieved. When learning with only an extrinsic terminal reward indicating the task at hand, intelligent agents have the opportunity to discover optimal solutions that may lie outside the scope of established domain knowledge.

However, in many real-world problems, defining a task only by a terminal reward means that the learning signal can be extremely sparse. The RL agent would have no clue about what task to accomplish until it receives the terminal reward for the first time by chance. In those scenarios, guided and structured exploration is therefore crucial, which is where intrinsically motivated exploration (Oudeyer & Kaplan, 2008; Schmidhuber, 2010) has recently seen great success (Pathak et al., 2017; Burda et al., 2018b).

Most commonly in current state-of-the-art approaches, an intrinsic reward is added as a bonus to the extrinsic reward. Maximizing this combined reward signal, however, results in a mixture policy that acts greedily neither with regard to extrinsic reward maximization nor with regard to exploration. Furthermore, the non-stationary nature of the intrinsic signals can lead to unstable learning on the combined reward. In addition, current state-of-the-art methods mostly estimate the intrinsic reward from local information computed with a 1-step lookahead, e.g. the one-step prediction error (Pathak et al., 2017) or the network distillation error of the next state (Burda et al., 2018b). Although those intrinsic signals can be propagated back to earlier states with temporal difference (TD) learning, it is not clear that this results in optimal long-term exploration. We seek to address the aforementioned issues as follows:

  1. We propose a hierarchical agent, Scheduled Intrinsic Drive (SID), that focuses on one motivation at a time: it learns two separate policies which maximize the extrinsic and intrinsic rewards respectively. A high-level scheduler periodically selects either policy to gather experience. Disentangling the two policies allows the agent to faithfully conduct either pure exploration or pure extrinsic task fulfillment. Moreover, scheduling (even within an episode) implicitly increases the behavior policy space exponentially, which drastically differs from previous methods where the behavior policy can only change slowly due to the incremental nature of TD learning.

  2. We introduce successor feature control (SFC), a novel intrinsic reward that is based on the concept of successor features. This state representation characterizes a state through the features of all its successor states, instead of looking at local information only. This implicitly makes our method temporally extended, which enables the more structured and far-sighted exploration that is crucial in exploration-challenging environments.

We note that both the proposed intrinsic reward SFC and the hierarchical exploration framework SID are free of task-specific components, and can be incorporated into existing DRL methods with minimal computational overhead.

We present experimental results in three sets of environments, evaluating our proposed agent in the domains of visual navigation and control from pixels, as well as its capabilities of finding optimal solutions under distraction.

2 Related Work

Our work is connected to a range of DRL topics including intrinsic motivation, auxiliary tasks, successor representation and hierarchical RL. Below we discuss the most relevant.

2.1 Intrinsic Motivation and Auxiliary Tasks

Intrinsic motivation can be defined as agents conducting actions purely out of the satisfaction of their internal reward system rather than for extrinsic rewards (Oudeyer & Kaplan, 2008; Schmidhuber, 2010). Various forms of intrinsic motivation exist and have brought substantial improvements in guiding exploration for DRL in tasks where extrinsic signals are sparse or missing altogether.

Curiosity, one of the most widely used kinds of intrinsic motivation, is quantified by Pathak et al. (2017) as the 1-step prediction error of the features of the next state made by a forward dynamics model. Their ICM module has been shown to work well in visual domains including first-person view navigation. Since ICM is potentially susceptible to stochastic transitions (Burda et al., 2018a), Burda et al. (2018b) propose as a reward bonus the error of predicting, for the current state, the output of a randomly initialized, fixed embedding network. Another form of curiosity, learning progress or the change in prediction error, has been connected to count-based exploration via a pseudo-count (Bellemare et al., 2016; Ostrovski et al., 2017) and has also been used as a reward bonus. Savinov et al. (2018) propose to train a reachability network, which produces a reward based on whether the current state is reachable within a certain number of steps from any state in the current episode. Similar to our proposed SFC, their intrinsic motivation is related to choosing states that can lead to novel trajectories. However, we note that the reachability reward bonus captures the novelty of states with regard to the current episode, while our proposed SFC reward implicitly captures statistics over the full distribution of policies that have been followed, since the successor features are learned using states sampled from all past experiences.

Auxiliary tasks have been proposed for learning more representative and distinguishable features. Mirowski et al. (2016) add depth prediction and loop closure prediction as auxiliary tasks for learning the features. Jaderberg et al. (2016) learn separate policies for maximizing pixel changes (pixel control) and activating units of a specific hidden layer (feature control). However, their proposed UNREAL agent never follows those auxiliary policies as they are only used to learn more suitable features for the main extrinsic task.

2.2 Hierarchical RL

Various HRL approaches have been proposed (Kulkarni et al., 2016a; Bacon et al., 2017; Vezhnevets et al., 2017; Krishnan et al., 2017). In the context of intrinsic motivation, feature control (Jaderberg et al., 2016) has been adopted into a hierarchical setting (Dilokthanakul et al., 2017), in which options are constructed for altering given features. However, they report that a flat policy trained on the intrinsic bonus achieves similar performance to the hierarchical agent.

Our hierarchical design is perhaps most closely inspired by the work of Riedmiller et al. (2018). Unlike other HRL approaches that try to learn a set of options to construct the optimal policy, their proposed SAC-X agent aims to learn one flat policy that maximizes the extrinsic reward. While SAC-X schedules between following the extrinsic task and a set of pre-defined auxiliary tasks such as maximizing touch sensor readings or translation velocity, in this paper we investigate scheduling between the extrinsic task and an intrinsic motivation that is general and not task-specific.

2.3 Successor Representation

The successor representation (SR) was first introduced to improve generalization in TD learning (Dayan, 1993). While previous works extended SR to the deep setting to better generalize navigation and control across similar environments and changing goals (Kulkarni et al., 2016b; Barreto et al., 2017; Zhang et al., 2017), we focus on its temporally extended property to accelerate exploration.

SR has also been investigated under the options framework (Sutton et al., 1999). Machado et al. (2017) and Tomar et al. (2019) evaluate successor features under random policies to discover bottlenecks or landmarks based on the clustering of such features. Options are then learned to navigate to those sub-goals. However, it remains unclear whether the options framework helps in sparse exploration setups.

When using SR to measure intrinsic motivation, the work most relevant to ours is that of Machado et al. (2018). They also design a task-independent intrinsic reward based on SR; however, they rely on the concept of count-based exploration and propose a reward bonus that differs substantially from ours. We present our proposed method in the next section.

3 Methods

3.1 Preliminaries

We use the RL framework for learning and decision-making under uncertainty. It is formalized by Markov decision processes (MDPs) defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$. At time step $t$ the agent selects an action $a_t$ depending on its current state $s_t$, receives a reward $r_{t+1}$ and transits to the next state $s_{t+1}$. The state, action and reward at time $t$ are random variables denoted by $S_t$, $A_t$ and $R_t$. Their probability distributions are defined by the probability kernel $p(s', r \mid s, a)$ and the agent's policy $\pi(a \mid s)$. For a given initial state distribution and fixed policy $\pi$, the joint probability distribution of $(S_t, A_t, R_t)$ for all $t$ is uniquely determined and is denoted by $\mathbb{P}_\pi$. An RL agent's goal is to learn a policy that maximizes the expectation of the discounted future return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ with the discount rate $\gamma \in [0, 1)$.

The state-value is defined by $V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ and the action-value by $Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$. When representing the optimal action-value function by a deep network parameterized by $\theta$ (Mnih et al., 2015), optimal policies can be found by minimizing the loss

$L(\theta) = \mathbb{E}\big[\big(G_t - Q(S_t, A_t; \theta)\big)^2\big]$   (1)

where $G_t$ denotes the TD target. Since our algorithm requires an off-policy learning strategy, and in consideration of faster learning and less computational overhead, we use the same $n$-step target as in Ape-X DQN (Horgan et al., 2018) for bootstrapping without off-policy correction ($\theta^-$ denotes the target network)

$G_t = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n \max_{a} Q(S_{t+n}, a; \theta^-).$   (2)
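For concreteness, a minimal PyTorch-style sketch of this loss with the $n$-step bootstrapped target; the network interfaces and batch layout here are illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn.functional as F

def n_step_td_loss(q_net, target_net, batch, gamma, n):
    # batch: states s_t, actions a_t, discounted n-step extrinsic returns,
    # bootstrap states s_{t+n}, and terminal flags (as float tensors)
    s, a, n_step_return, s_n, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(S_t, A_t; theta)
    with torch.no_grad():
        # bootstrap from the target network, no off-policy correction (Eq. 2)
        bootstrap = target_net(s_n).max(dim=1).values
        target = n_step_return + (1.0 - done) * gamma ** n * bootstrap
    return F.mse_loss(q, target)                             # squared TD error (Eq. 1)
```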

Next, we will first introduce our proposed intrinsic reward, successor feature control (SFC) (Sec. 3.2, 3.3, 3.4), and then present our proposed hierarchical framework for accelerating intrinsically motivated exploration, which we denote scheduled intrinsic drive (SID) (Sec. 3.5, 3.6).

3.2 Successor Representation and Successor Features

In order to encode long-term statistics into the intrinsic reward design for far-sighted exploration, we build on the formulation of the successor representation (SR) (Dayan, 1993), which introduces a temporally extended view of the states.

In SR, the occupancy of a state counts the total number of time steps the state process spends in it. Dayan (1993) introduced the idea of representing a state $s$ by the expected discounted occupancies of its future states in a process starting in $s$ and following a fixed policy $\pi$. The SR is defined by

$\psi_\pi(s, s') = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k \, \mathbb{1}[S_{t+k} = s'] \,\Big|\, S_t = s\Big]$   (3)

which is guaranteed to be finite for $\gamma < 1$.

The successor features (SF) are then introduced by extending the one-hot state encoding in Eq. 3 to an arbitrary feature embedding $\phi(s)$ (Kulkarni et al., 2016b; Barreto et al., 2017). They are defined by the $d$-dimensional vector

$\psi_\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k \, \phi(S_{t+k}) \,\Big|\, S_t = s\Big].$   (4)

Analogously, the SF represents the expected discounted feature activations when starting in $s$ and following a fixed policy $\pi$. TD learning can be used to compute the SR (Dayan, 1993), and when transferred to the function approximator case the SF can be learned via the update rule

$\psi(s_t) \leftarrow \psi(s_t) + \alpha \big[\phi(s_t) + \gamma \, \psi(s_{t+1}) - \psi(s_t)\big]$   (5)

where $\alpha$ is the learning rate.

Previous works on learning deep SFs have included an auxiliary reconstruction task on the features $\phi$ (Kulkarni et al., 2016a; Zhang et al., 2017), while in this work we investigate learning without this extra reconstruction stream. Instead of adapting the features $\phi$ while learning the successor features $\psi$, we keep the randomly initialized $\phi$ fixed. This design follows the intuition that since the SFs $\psi$ estimate the expectation of the features $\phi$ under the transition dynamics and the policy being followed, more stable learning of the SFs can be achieved if the features are kept fixed.
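A sketch of the corresponding semi-gradient SF update with the frozen random embedding; `phi_net` and `sf_net` are illustrative names, and the target is bootstrapped from the current SF estimate for simplicity:

```python
import torch
import torch.nn.functional as F

def sf_td_loss(phi_net, sf_net, s, s_next, gamma):
    with torch.no_grad():
        phi = phi_net(s)                         # fixed, randomly initialized features phi(s_t)
        target = phi + gamma * sf_net(s_next)    # semi-gradient TD target (Eq. 5)
    return F.mse_loss(sf_net(s), target)
```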

3.3 Successor Distance Metric

SR and SFs have several interesting properties which make them appealing as a basis for an intrinsic reward signal in the tabular and deep settings respectively:

  • They can be learned even in the absence of any extrinsic reward signal and without learning a transition model. They combine advantages of model-based and model-free RL (Stachenfeld et al., 2014).

  • They can be learned via computationally efficient TD.

  • SR and SF estimate the expected future state visitation counts and feature activations respectively, both with statistics over complete state trajectories. Therefore they contain information even about spatially and temporally distant states, which can help with effective far-sighted exploration.

Given the above discussion, we introduce a distance metric, the successor distance (SD), which measures the distance between states by the similarity of their SFs:

$d_{SD}(s_1, s_2) = \big\| \psi_\pi(s_1) - \psi_\pi(s_2) \big\|_2.$   (6)

This definition is based on a well-known approach in distance metric learning that defines distances by $d_M(x, y) = \sqrt{(x - y)^\top M (x - y)}$ for a symmetric positive semi-definite matrix $M$. In the tabular case, this can be seen by identifying the feature embedding with a matrix $\Phi$ (one row of features per state) and the SR with a matrix $\Psi$: the SF matrix is then $\Psi\Phi$, and Eq. 6 coincides with $d_M$ for $M = \Psi\Phi\Phi^\top\Psi^\top$, which is symmetric and positive semi-definite, so Eq. 6 defines a pseudometric. This distance metric is illustrated in Fig. 1 (left).
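In code, the SD is just a norm in successor-feature space; a minimal sketch over batched SF vectors (assuming they are produced by an SF network as above):

```python
import torch

def successor_distance(psi_s1, psi_s2):
    # distance between two states measured by the similarity of their SFs (Eq. 6)
    return torch.norm(psi_s1 - psi_s2, p=2, dim=-1)
```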

Figure 1: Illustrations of SD and SFC in a grid world with three rooms (with a one-hot state encoding as features). Left: SD (Eq. 6) of each state to a fixed anchor state (marked in the figure), computed with a fixed discount factor. It can be observed that the SD roughly correlates with the length of the shortest path from each state to the anchor. Most notably, the SD increases substantially when crossing rooms. When starting from the anchor state with a random policy, it is relatively unlikely for the agent to enter the other two rooms; thus for a pair of states at a fixed spatial distance, their SD is higher when they are located in different rooms than in the same room. The SD therefore also captures the connectivity of the state space. Right: The maximal SFC (Eq. 7) obtainable from each state via a 1-step transition.

3.4 Successor Feature Control

Using this metric to evaluate intrinsic motivation, one choice would be to use the SD to a fixed anchor state as the intrinsic reward, which would depend heavily on the position of the anchor. Even when a sensible choice for the anchor can be found, e.g. the initial state of an episode, the SDs of states far away from the anchor become increasingly similar. To circumvent this, we define the intrinsic reward successor feature control (SFC) as the squared SD of a one-step transition

$r^{SFC}(s_t, a_t, s_{t+1}) = \big\| \psi_\pi(s_{t+1}) - \psi_\pi(s_t) \big\|_2^2.$   (7)

A high SFC reward indicates a big change in the expected future state occupancies caused by a single transition. We argue this big change is a strong indicator of bottleneck states, since in bottlenecks a minor change in the action selection can lead to a vastly different trajectory being taken. This is especially true for highly stochastic policies. Fig. 1 (right) shows that the highly rewarding states under SFC and the true bottlenecks agree, which can be very valuable for exploration (Kulkarni et al., 2016b). Even when the environment does not contain distinguishable bottlenecks, SFC can still efficiently guide exploration since it takes distant future states into account.

Since our base algorithm is $n$-step Ape-X, we only have access to $n$-step experience tuples for learning the SFs. We therefore generalize SFC to $n$-step transitions

$r^{SFC}(s_t, a_t, s_{t+n}) = \big\| \psi_\pi(s_{t+n}) - \psi_\pi(s_t) \big\|_2^2.$   (8)
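A sketch of computing this $n$-step SFC reward from stored experience tuples; `sf_net` and the tuple layout are assumptions carried over from the sketches above:

```python
import torch

def sfc_reward(sf_net, s_t, s_t_plus_n):
    # squared successor distance of an n-step transition (Eq. 8)
    with torch.no_grad():
        diff = sf_net(s_t_plus_n) - sf_net(s_t)
    return (diff ** 2).sum(dim=-1)
```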

3.5 Scheduled Intrinsic Drive

Having proposed the intrinsic reward SFC, we now present a hierarchical take on intrinsically motivated exploration.

When learning optimal value functions or optimal policies via TD or policy gradients with deep function approximators, optimizing with algorithms such as gradient descent means that the policy can only evolve incrementally: the TD-target values must not change drastically over a short period of time in order for the gradient updates to be meaningful. The common practice of utilizing a target network in off-policy DRL (Mnih et al., 2015) stabilizes the update, but at the same time makes the policy adapt even more incrementally at each step.

Intrinsically motivated exploration, or exploration in general, might however benefit from the opposite treatment of the policy update. This is because the intrinsic reward is non-stationary by nature, and because the exploration policy should reflect the optimal strategy for the current stage of learning, which makes it non-stationary as well.

With the commonly adopted approach of adding the intrinsic reward as a bonus to the extrinsic reward and training a mixture policy on top, exploration becomes a balancing act between the incrementally updated target values needed for stable learning and the dynamically adapting intrinsic signals needed for efficient exploration. Moreover, neither the extrinsic nor the intrinsic signal is followed for an extended amount of time.

Therefore, we propose to address this issue with a hierarchical approach that by design has slowly changing target values while still allowing drastic behavior changes. The idea is to learn not a single policy but multiple policies, each optimizing a different reward function. More specifically, we assume $N$ tasks (e.g. $N = 2$ with $\mathcal{T}^{ext}$ denoting the extrinsic task and $\mathcal{T}^{int}$ the intrinsic task) defined by reward functions (e.g. $r^{ext}$ and $r^{int}$) that share the state and action space. The optimal policy for each of these MDPs can be learned with an arbitrary off-policy DRL algorithm. During each episode, a high-level scheduler periodically selects a policy for the agent to follow to gather experience, and each policy is trained on all experience collected by following those different policies. The overall learning objective is to maximize the main extrinsic task reward

$\max \; \mathbb{E}_{\pi_{\mathcal{S}}}\Big[\sum_{t=0}^{\infty} \gamma^t R^{ext}_{t+1}\Big]$   (9)

where $\pi_{\mathcal{S}}$ denotes the macro-policy of the scheduler.

By allowing the agent to follow one motivation at a time, it is possible to have a pool of different behavior policies without creating unstable targets for off-policy learning. By scheduling $M$ times within an episode, we implicitly enlarge the behavior policy space to $N^M$ for a single episode (e.g. for $N = 2$ tasks the space grows to $2^M$). We investigated several types of high-level schedulers; however, none of them consistently outperforms a random one. We suspect that a random scheduler already performs very well under the SID framework because highly stochastic scheduling helps to make full use of the large behavior policy space. We present the different scheduler choices we tested in Supplementary F and leave more sophisticated scheduler designs to future work.
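A minimal sketch of this scheduling mechanism with a random high-level scheduler and two low-level drives; the environment interface, policy objects and segment length are illustrative assumptions:

```python
import random

TASKS = ["extrinsic", "intrinsic"]

def run_episode(env, policies, max_episode_steps, num_schedules):
    # the scheduler re-samples the active task every (max_episode_steps // num_schedules) steps
    segment_len = max_episode_steps // num_schedules
    state, done, t = env.reset(), False, 0
    while not done and t < max_episode_steps:
        task = random.choice(TASKS)                 # random macro-action of the scheduler
        for _ in range(segment_len):
            action = policies[task].act(state)      # follow only the selected drive
            state, r_ext, done, info = env.step(action)
            # all transitions go into a shared replay buffer;
            # both policies later learn off-policy from them
            t += 1
            if done or t >= max_episode_steps:
                break
```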

Moreover, disentangling the extrinsic and intrinsic policies strictly separates stationary and non-stationary behaviors, and each sub-objective is allocated its own interaction time, such that extrinsic reward maximization and exploration do not distract each other.

We emphasize again that our proposed framework can be applied with any off-policy algorithm, and is directly applicable to settings with multiple extrinsic or intrinsic tasks.

3.6 Algorithm

We now present in detail an instantiation of our proposed SID framework using Ape-X DQN (Horgan et al., 2018) as the base off-policy DRL algorithm and SFC as the intrinsic reward. The algorithm is composed of the following components (a learner-step sketch follows the list):

  • Q-Net: contains an embedding and two Q-value output heads, one extrinsic and one intrinsic.

  • SF-Net: contains a feature embedding $\phi$ and a successor feature head $\psi$. The embedding $\phi$ is initialized randomly and kept fixed during training. The output of SF-Net is used to calculate the SFC intrinsic reward (Eq. 8).

  • A high-level scheduler: instantiated in each actor, it selects which policy to follow (extrinsic or intrinsic) after a fixed number of environment steps (a fixed fraction of the maximum episode length). The scheduler randomly picks one of the tasks with equal probability.

  • 8 parallel actors: each actor instantiates its own copy of the environment and periodically copies the latest model from the learner. We learn from $n$-step experiences, so at each environment step each actor stores an $n$-step experience tuple into a shared replay buffer. Each actor acts according to either the extrinsic or the intrinsic head, based on the task currently selected by its scheduler.

  • 1 learner: learns the Q-Net and SF-Net parameters from samples drawn from the shared replay buffer, which contains all experience collected by following the different policies. The SFs are learned via Eq. 5, and the extrinsic and intrinsic value heads are learned with the extrinsic reward and the SFC intrinsic reward respectively (Eq. 1).
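The pieces above fit together roughly as in the following learner-step sketch, reusing the loss helpers sketched in Sec. 3; the two-headed Q-Net interface, the batch layout, and the use of the n-step SFC reward as the intrinsic n-step return are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def learner_step(q_net, target_net, sf_net, phi_net, optimizer, batch,
                 gamma_ext, gamma_int, n):
    # batch of n-step tuples: states, actions, n-step extrinsic returns,
    # bootstrap states s_{t+n}, terminal flags (float tensors)
    s, a, ret_ext, s_n, done = batch
    r_int = sfc_reward(sf_net, s, s_n)          # SFC intrinsic reward (Eq. 8)

    losses = []
    for head, ret, gamma in (("ext", ret_ext, gamma_ext), ("int", r_int, gamma_int)):
        q = q_net(s, head=head).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            bootstrap = target_net(s_n, head=head).max(dim=1).values
            target = ret + (1.0 - done) * gamma ** n * bootstrap
        losses.append(F.mse_loss(q, target))    # one TD loss per value head (Eq. 1)

    losses.append(sf_td_loss(phi_net, sf_net, s, s_n, gamma_int))  # SF update (Eq. 5)

    optimizer.zero_grad()
    sum(losses).backward()
    optimizer.step()
```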

4 Experiments

4.1 Experimental Setup

We evaluate our proposed intrinsic reward SFC and the hierarchical framework of intrinsic motivation SID in three sets of simulated environments: VizDoom (Kempka et al., 2016), DeepMind Lab (Beattie et al., 2016) and OpenAI Gym classic control (Brockman et al., 2016). The extrinsic rewards are designed to be very sparse in all environments, such that agents with random exploration strategies are very unlikely to accomplish the task even once. This is done to reveal the exploration capabilities of each agent. Throughout all experiments, agents receive as input only raw pixels with no additional domain knowledge or task-specific information. We mainly compare the following agent configurations: M: train with only the extrinsic main task reward; M+ICM: add the ICM reward (Pathak et al., 2017) as a bonus to the extrinsic main task reward and train a mixture policy on this combined reward signal; M+SFC: same as M+ICM except for using our proposed SFC as the intrinsic reward bonus; SID(M,ICM): schedule between following the extrinsic main task policy and the intrinsic policy trained with the ICM reward; SID(M,SFC): same as SID(M,ICM) except for using SFC to train the intrinsic policy. We note that all agents other than M and M+ICM include at least one of our proposed components (SFC, SID, or both).

(a) Top-down view.
(b) Entry.
(c) Wing.
(d) Goal.
Figure 2: The FlytrapEscape environment. 2(a) shows the top-down view of the FlytrapEscape map, with the red dot marking the starting location and the green dot indicating the goal location; 2(b), 2(c) and 2(d) show exemplary first-person views captured from the marked poses (blue dots with arrows) in FlytrapEscape.

For the intrinsic reward normalization and the scaling for the extrinsic and intrinsic rewards we do a parameter sweep for each environment and choose the best setting for each agent. We note that the agents with the SID component are much less sensitive to different scalings. Since our proposed SID setup requires an off-policy algorithm in order to learn from experiences generated by following different policies, we implement all the agents above under the Ape-X DQN framework (Horgan et al., 2018) for a fair comparison. Additional experimental setup and model architecture details are reported in Supplementary B,C,D and E.

After a parameter sweep we fix the number of scheduled tasks per episode for the SID agents in all experiments, meaning each episode is divided into up to that many sub-episodes, for each of which either the extrinsic or the intrinsic policy is sampled as the behavior policy.

4.2 VizDoom: FlytrapEscape

Figure 3: Extrinsic rewards per episode obtained by different agents in FlytrapEscape. Each plot shows the average performance with standard deviation over 3 random, non-tuned seeds.

We conduct a first set of experiments in the VizDoom research platform (Kempka et al., 2016). The VizDoom game "DoomMyWayHome-v0" was previously used in several intrinsic motivation papers (Pathak et al., 2017; Savinov et al., 2018). It contains 8 rooms connected by corridors, each with a distinct texture, and the agent needs to navigate to a preset goal location based only on first-person visual inputs. We verified our implementation of the baseline algorithms through a set of experiments in this game and report the results in Supplementary A, Fig. 9. However, we found that the environment is not challenging enough, since our base algorithm Ape-X, which already has decent random exploration capabilities, can solve the task reliably without intrinsic motivation.

We then designed a new game in VizDoom that is extremely challenging for exploration. Inspired by how flytraps catch insects, we lay out the rooms in a geometrically challenging way such that escaping from one room to the next with random actions is extremely unlikely. The maze consists of 4 rooms separated by V-shaped walls pointing inward into the rooms. The small entry to each room is located at the junction of the V-shape, which is extremely difficult to maneuver into without a sequence of precise movements. The same holds when the agent tries to reach the next entry (Fig. 2(b)) once stuck in the wing areas (Fig. 2(c)). The layout of FlytrapEscape is shown in Fig. 2(a). The rooms are roughly 10 times the size of the rooms in "DoomMyWayHome-v0".

In each episode, the agent starts from the red dot shown in Fig. 2(a) with a random orientation. An episode terminates when the final goal is reached, in which case the agent receives a terminal reward, or when the maximum number of episode steps is exceeded. The task is to escape the fourth room (marked by the green dot in Fig. 2(a)).

Figure 4: Top-down projection of the SFs of an agent after it has learned to navigate to the goal on FlytrapEscape (Fig. 2(a)). For the purpose of visualization, we discretized the map into a grid and positioned the trained SID(M,SFC) agent at each grid cell, then computed the successor features for that location for each possible orientation. We then calculated the norm of the difference between these successor features and those of the starting position with the corresponding orientations, and plot the resulting values.

The results of our experiments are shown in Fig. 3 and demonstrate that our agent SID(M,SFC) efficiently explores the flytrap map and reliably learns how to navigate to the goal. The M agent without any intrinsic reward did not navigate to the goal even once. The agents with the ICM component perform poorly: only one run of M+ICM learned to navigate to the goal, while the scheduling agent SID(M,ICM) did not solve the task even once. For the two SFC agents, however, scheduling greatly improves performance. While the reward bonus agent M+SFC was not successful in every run, the SID(M,SFC) agent solved FlytrapEscape much more reliably. We hypothesize the following reason for the superior performance of SID(M,SFC) compared to M+SFC: before seeing the final goal for the first time, the M+SFC agent essentially learns purely on the SFC reward, which is equivalent to the SFC policy head of the scheduling agent SID(M,SFC). SFC preferably drives the agent towards bottleneck states, since for those states the difference between the SFs of two neighboring states tends to be relatively large (as can be seen in Fig. 1 and Fig. 4). From those bottleneck states onwards, the extrinsic policy is a good candidate for exploring the next new room, as it performs random exploration before receiving any reward signal. While the new room is being explored, its SFs are simultaneously learned and can then guide the agent to the next bottleneck region. Thus the scheduling helps the agent efficiently explore the map from bottleneck to bottleneck, while a single-policy agent cannot benefit from the two different behaviors under the extrinsic and intrinsic rewards and would likely oscillate around bottleneck states. On the other hand, scheduling did not help ICM. A reason could be that ICM is not especially attracted to bottleneck states, so it does not help exploration if the agent spends half of the time acting randomly, as the extrinsic policy has no reward yet to learn from.

As an additional evaluation, in Fig. 4 we visualize the SFs of an agent that successfully learned to navigate to the goal. We can see that the difference between the SFs of each location and those of the starting position tends to grow with distance, especially for states that lie on the pathways leading to the later rooms. This makes sense, since the SFs of a state are defined in terms of the features of the consecutive states the agent visits. As the agent has learned to navigate to the goal, the SFs of the intermediate states are highly and consistently influenced by the later states along the trajectory to the goal. Furthermore, we observe big intensity changes around the bottlenecks (the room entries) in the heatmap, which also supports the hypothesis that SFC leads the agent to bottleneck states. We believe this is the first time that SFs have been shown to behave in a first-person view environment as one would expect from their definition. The evolution of the SFs over time is shown in the attached video https://youtu.be/4ZHcBo7006Y.

4.3 DeepMind Lab: AppleDistractions

Figure 5: The AppleDistractions environment layout.

In the second experiment, we set out to evaluate the capability of an agent to reliably collect a faraway big reward in the presence of small nearby distracting rewards. For this experiment we use the 3D visual navigation simulator of DeepMind Lab (Beattie et al., 2016), which provides such a level, "Stairway to Melon". However, we found that it is easily solved without intrinsic rewards. For the same reason as in VizDoom, we constructed a much more challenging level, "AppleDistractions" (Fig. 5), with a fixed maximum episode length. In this level, the agent starts in the middle of the map and can follow either of two corridors. Each corridor has multiple sections, and each section consists of two dead-ends and an entry leading to the next section. Each section has different randomly generated floor and wall textures. One of the corridors gives a small reward for collecting an apple in each of its sections, while the other contains a single big reward at the end of its last section.

As in the level ”Stairway to Melon”, the optimal policy of an agent in ”AppleDistractions” would be to go for the single big reward at the end. But since the small apple rewards are much closer to the spawning location of the agent, the challenge here is to still explore other areas sufficiently often so that the optimal solution could be recovered.

All agents were evaluated on three different environment seeds, for each of which the texture of the map is different. We evaluated each agent with two random seeds for each of the three environment seeds, which makes a total of 6 runs. Neither the random nor the environment seed is tuned.

Figure 6: Average performance of all agents under comparison in AppleDistractions. Each agent is evaluated on the same sets of random floor and wall textures, and for each set of textures individual runs are conducted with different seeds. Each plot shows the average performance with standard deviation over the runs for each agent. The SID agents perform better than their reward bonus counterparts, and the SFC agents receive higher rewards than their ICM counterparts.

The results are visualized in Fig. 6. Our SID(M,SFC) agent received on average the highest rewards. Furthermore, we see that scheduling helped both SFC and ICM to find the big reward and not settle for the small rewards, whereas their respective reward bonus counterparts were more prone to settling for the small nearby rewards. This behavior is expected: with a separate policy for intrinsic motivation, the agent can completely "forget" about the extrinsic reward for some time interval, while the extrinsic policy simultaneously learns from the new experiences and might discover the large reward that it had not found before. Surprisingly, following only the extrinsic reward resulted in behavior that gained more reward than when the ICM bonus is added. On manual inspection we noticed that the agent with this mixture policy very often gets stuck without even collecting all the small apple rewards. This might be due to the very different textures in each section making future states difficult to predict, so that the high intrinsic signals generated by ICM distract the agent from collecting the extrinsic rewards.


4.4 Classic Control From Pixels: PendulumPixels

Figure 7: Average performance of all agents under comparison in PendulumPixels, each over several random seeds, with the shaded areas indicating standard deviation. The SFC agents achieve the highest reward. SID improves the performance of ICM over its reward bonus counterpart. The M agent could not solve the task.

To test whether our proposed methods can be used in domains other than first-person visual navigation, we conducted a third set of experiments on the classic control problem of swinging up an idealized pendulum. We based this experiment on the "Pendulum-v0" environment of OpenAI Gym (Brockman et al., 2016). The default setting provides dense reward signals with low-dimensional state inputs (angle and angular velocity). In order to learn this task in a sparse setup with pixel inputs, we made the following modifications: we added a custom render method to efficiently generate front-view images as inputs at each time step; we discretized the action space and limited the maximum torque (with this low maximum torque, several back-and-forth swings are required for a full swing-up); a single reward signal is provided only upon a successful swing-up; and an episode terminates after a success or when the maximum episode length is reached (Supplementary H).

We show the performance of each agent in Fig. 7. We can observe that SFC is not specific to visual navigation, as both SFC agents work very well in this 2D front-view environment, which is quite different from the first-person view 3D environments. Whether used as a reward bonus (M+SFC) or as a separate policy (SID(M,SFC)), SFC outperforms the baseline M+ICM by a wide margin, even though there are no clear bottlenecks in this environment.

5 Conclusion

In this paper, we investigate an alternative way of applying intrinsic motivation for exploration in DRL. We propose a hierarchical agent SID that schedules between following extrinsic and intrinsic drives. Moreover, we propose a new type of intrinsic reward, SFC, that is general and evaluates intrinsic motivation over longer time horizons. We conduct experiments in three sets of environments (VizDoom, DeepMind Lab and OpenAI Gym) and show that both of our contributions, the hierarchical framework SID for scheduling intrinsic drives and the intrinsic reward SFC, greatly improve exploration efficiency.

We see many possible future research directions stemming from this work, including designing more efficient scheduling strategies, incorporating several (possibly orthogonal and complementary) intrinsic drives instead of a single one into the hierarchical framework, testing our framework in other control domains such as manipulation, and extending our evaluation to real robotic systems.

References

Appendix A Supplementary: Comparison to Random Network Distillation

To further verify our method, we compare it with random network distillation (RND) (Burda et al., 2018b). We tested on the Doom map "DoomMyWayHome-v0" and on FlytrapEscape. The results can be seen in Fig. 8 and Fig. 9. While RND reaches a reasonable performance on the relatively easy map "DoomMyWayHome-v0", it fails to find any reward on FlytrapEscape.

Figure 8: Extrinsic rewards per episode for different agents in FlytrapEscape. Each plot shows the average performance with standard deviation over 3 random, non-tuned seeds.
Figure 9: Extrinsic rewards per episode for different agents on "DoomMyWayHome-v0". Each plot shows the average performance with standard deviation over 3 random, non-tuned seeds.

From the FlytrapEscape experiment presented in Sec. 4.2, one might suspect that the big performance margin of SFC over ICM is due to the fixed random features of the SF-Net. But from the results of the comparison with RND shown in Fig. 8, we can verify that the improvement from our method is not due to the fixed random features, but due to the temporally extended nature of the SFs.

Appendix B Supplementary: Implementation Details

For computational efficiency, we implement our own version of prioritized experience replay. We split the replay buffer into two buffers. Every transition is pushed to the first one, while only transitions with a very large TD-error are pushed to the second one. We store a running estimate of the mean and the standard deviation of the TD-errors, and if the error of a transition is larger than the mean plus two standard deviations, the transition is also pushed to the second buffer. In the learner, each batch consists of transitions drawn from the normal replay buffer together with transitions drawn from the buffer storing high-TD-error transitions, which as a result have a relatively higher chance of being picked.
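A sketch of this two-buffer scheme; the buffer sizes and the fraction of high-error samples per batch are left as parameters since the exact values are not reproduced here, and the interface is an illustrative assumption:

```python
import random
from collections import deque

class SplitReplay:
    """Plain FIFO buffer plus a second buffer reserved for high-TD-error transitions."""

    def __init__(self, normal_size, priority_size, priority_fraction):
        self.normal = deque(maxlen=normal_size)
        self.priority = deque(maxlen=priority_size)
        self.priority_fraction = priority_fraction
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4   # running TD-error statistics

    def push(self, transition, td_error):
        self.normal.append(transition)
        # running mean / variance of the TD-errors (incremental update)
        self.count += 1
        delta = td_error - self.mean
        self.mean += delta / self.count
        self.var += (delta * (td_error - self.mean) - self.var) / self.count
        if td_error > self.mean + 2.0 * self.var ** 0.5:
            self.priority.append(transition)               # unusually large TD-error

    def sample(self, batch_size):
        k = min(int(batch_size * self.priority_fraction), len(self.priority))
        batch = random.sample(self.priority, k) if k > 0 else []
        batch += random.sample(self.normal, min(batch_size - k, len(self.normal)))
        return batch
```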

To adapt the per-actor exploration rates of Ape-X DQN (Horgan et al., 2018), which uses a much larger number of actors, to our setting of 8 actors, we set a fixed $\epsilon_i$ for each actor $i \in \{0, \dots, N-1\}$ as

$\epsilon_i = \epsilon^{\,1 + \frac{i}{N-1}\alpha}$   (10)

where $\epsilon$ and $\alpha$ are set as in the original work.
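A one-line sketch of this schedule; the default values of $\epsilon$ and $\alpha$ below follow those reported for Ape-X (Horgan et al., 2018) and are assumptions here:

```python
def actor_epsilon(i, num_actors, base_eps=0.4, alpha=7.0):
    # Ape-X style per-actor exploration rate, adapted to a small number of actors (Eq. 10)
    return base_eps ** (1.0 + alpha * i / (num_actors - 1))

epsilons = [actor_epsilon(i, num_actors=8) for i in range(8)]
```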

For the first-person view navigation experiments in VizDoom and DeepMind Lab, we use an action repetition of 4, while the classic control experiment uses a different action repetition. In the text, we always refer to the actual environment steps (e.g. before dividing by 4).

Appendix C Supplementary: Reward Normalization

Most network parameters are shared between estimating the expected discounted return of the intrinsic and the extrinsic rewards. The scale of the rewards has a big influence on the scale of the gradients for the network parameters. Hence, it is important that the rewards are roughly on the same scale; otherwise effectively different learning rates are applied. The loss of the network comes from the regression on the Q-values, which approximate the expected return, so our normalization aims to bring the discounted returns of both tasks into the same range. To do so, we first normalize the intrinsic rewards by dividing them by a running estimate of their standard deviation. We also keep a running estimate of the mean of this normalized reward, denoted $\bar{r}_{int}$. Since an intrinsic reward is received at every time step, we estimate its discounted return via the geometric series as $\bar{r}_{int} / (1 - \gamma_{int})$, where $\gamma_{int}$ is the discount rate for the intrinsic reward. We then scale the sparse extrinsic task reward with $c \cdot \bar{r}_{int} / (1 - \gamma_{int})$. Here $c$ is a hyperparameter which takes into account that, for Q-values of states more distant to the goal, the terminal reward is discounted with the extrinsic discount rate depending on how far away that state is (see Appendix E for how we searched for this hyperparameter).
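A sketch of this normalization with scalar running statistics; the class interface and the symbol `c` are illustrative, matching the description above rather than the exact implementation:

```python
class RewardNormalizer:
    """Normalize intrinsic rewards and scale the sparse extrinsic reward to a comparable return."""

    def __init__(self, gamma_int, c):
        self.gamma_int = gamma_int
        self.c = c                                   # hyperparameter compensating for extrinsic discounting
        self.count, self.mean, self.var = 1e-4, 0.0, 1.0
        self.mean_norm = 0.0                         # running mean of the normalized intrinsic reward

    def intrinsic(self, r_int):
        # running std of the raw intrinsic reward (incremental update)
        self.count += 1
        delta = r_int - self.mean
        self.mean += delta / self.count
        self.var += (delta * (r_int - self.mean) - self.var) / self.count
        r_norm = r_int / (self.var ** 0.5 + 1e-8)
        self.mean_norm += (r_norm - self.mean_norm) / self.count
        return r_norm

    def extrinsic(self, r_ext):
        # match the geometric-series estimate of the intrinsic return (Appendix C)
        return r_ext * self.c * self.mean_norm / (1.0 - self.gamma_int)
```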

Appendix D Supplementary: Model Architecture

We use the same model architecture as depicted in Fig. 10 across all sets of experiments:

Figure 10: Model architecture for the SID(M,SFC) agent. Components colored yellow are randomly initialized and not trained during learning.

ReLU activations are added after every layer except for the last layer of each dashed block in the above figure. For the experiments with ICM (Pathak et al., 2017), we add BatchNorm (Ioffe & Szegedy, 2015) before the activation for the embedding of the ICM module, following the original code released by the authors. The code is implemented in PyTorch (Paszke et al., 2017).

Appendix E Supplementary: Training Details

We use the same batch size for all experiments and optimize with Adam (Kingma & Ba, 2014) with a fixed learning rate.

We performed the same search over hyperparameters and normalization techniques for all four algorithms that include an intrinsic reward and found the setting described in Section C to work best for all of them. The algorithms were evaluated on the flytrap map. For $c$ we tried several values. We also tried not normalizing the rewards and just scaling the intrinsic reward with a range of fixed factors; however, since the scale of the intrinsic rewards is not constant over the whole training process, this approach does not work well. We further tried normalizing the intrinsic rewards by dividing them by a running estimate of their standard deviation and then scaling this quantity with a fixed factor.

Appendix F Supplementary: Scheduler Designs

We investigated the following types of high-level schedulers:

  • Random scheduler: samples a task uniformly at random at every scheduling point.

  • Switching scheduler: switches sequentially between the extrinsic and the intrinsic task.

  • Macro-Q scheduler: a learned scheduler that works with macro-actions and learns from sub-sampled experience tuples.

  • Threshold-Q scheduler: selects the task according to the Q-value output of the extrinsic task head.

Macro-Q scheduler: in each actor, we keep an additional local buffer that stores sub-sampled experiences. At each environment step, besides the $n$-step experience tuple mentioned above, we also store an additional macro-transition, along with its sum of discounted rewards, in the shared replay buffer. This macro-transition is paired with the current task as its macro-action. The Macro-Q scheduler is then learned with an additional output head (we experimented with attaching it to different embeddings).

Threshold-Q scheduler: this scheduler requires no additional learning. It selects a task based on the current Q-value of the extrinsic head. We tried the following selection strategies (a minimal sketch follows the list):

  • Running mean: select the intrinsic task when the current Q-value of the extrinsic head is below its running mean, and the extrinsic task otherwise.

  • Heuristic median: observing that the running mean of the Q-values might not be a good statistic for selecting tasks due to the very unevenly distributed Q-values across the map, we choose a fixed threshold around the median of the Q-values, and select the intrinsic task when below it and the extrinsic task otherwise.
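A sketch of the running-mean variant; the heuristic-median variant only swaps the running mean for a fixed threshold (function and variable names are illustrative):

```python
def threshold_q_task(q_ext_value, running_mean_q):
    # follow the intrinsic drive whenever the extrinsic head currently looks unpromising
    return "intrinsic" if q_ext_value < running_mean_q else "extrinsic"
```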

As we report in the paper, none of the above scheduler choices consistently performs better across all environments than a random scheduler. We leave this part to future work.

Appendix G Supplementary: AppleDistractions

(a) Dead end.
(b) Entry.
(c) Goal.
Figure 11: The AppleDistractions environment. 11(a), 11(b) and 11(c) show exemplary first-person views captured in AppleDistractions.

Appendix H Supplementary: PendulumPixels

(a) Swings to right.
(b) Goal.
Figure 12: Exemplary front-view observations captured in PendulumPixels via custom rendering.

The pendulum dynamics are given through an ordinary differential equation with initial conditions on the angle $\theta$ of the rod and its angular velocity $\dot{\theta}$. The action corresponds to a torque that acts on the pendulum and is chosen from the discretized action space. The initial angle is sampled uniformly from a fixed interval defined relative to the downwards position. The ODE is solved by semi-implicit Euler integration with a fixed time interval.
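For illustration, a semi-implicit Euler step for a generic torque-driven pendulum; the constants, torque range, and sign convention below are placeholders rather than the exact values used in our environment:

```python
import math

def pendulum_step(theta, theta_dot, torque, dt=0.05, g=9.81, m=1.0, length=1.0):
    # semi-implicit Euler: update the velocity first, then advance the angle with the new velocity
    theta_ddot = -(g / length) * math.sin(theta) + torque / (m * length ** 2)
    theta_dot = theta_dot + dt * theta_ddot
    theta = theta + dt * theta_dot
    return theta, theta_dot
```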

In contrast to the original OpenAI Gym environment, we use images as observations: a front-view image of the pendulum, as seen in Fig. 12, with some standard preprocessing steps. The image is converted to greyscale, the rod is given the brightest value (1) and the background is black (0). The applied actions are not visualized explicitly and can only be inferred from the dynamics.

Appendix I Supplementary: Further Analysis of Pendulum experiments

The pendulum experiment also illustrates why a single mixture policy of intrinsic and extrinsic rewards can be harmful. Interestingly, the SID component helped the SID(M,ICM) agent to outperform the baseline M+ICM by a large margin, while for the SFC agents no big difference can be observed. An explanation for the behavior of the SFC agents could be that this environment has no clear bottleneck states like those in FlytrapEscape. Regarding ICM, we hypothesize that since it computes rewards depending on how well the model can predict the next state, the policy does not stabilize and starts choosing different actions in a given state as soon as the model has learned the old transitions, which can happen very quickly as the environment is visually very simple. When ICM is used as a reward bonus this problem can be critical, as the mixture policy is not given much time to converge to the optimal solution once the ICM has learned the transition in a given state, since the corresponding intrinsic reward becomes very low. This is not a serious issue when scheduling between two separate policies, as the extrinsic one is given its own time slots to concentrate on fulfilling the task while the intrinsic one explores the environment.