Mega-Reward: Achieving Human-Level Play without Extrinsic Rewards

05/12/2019, by Yuhang Song et al.

Intrinsic rewards were introduced to simulate how human intelligence works; they are usually evaluated by intrinsically-motivated play, i.e., playing games without extrinsic rewards but evaluating the result with extrinsic rewards. However, none of the existing intrinsic reward approaches can achieve human-level performance under this very challenging setting of intrinsically-motivated play. In this work, we propose a novel megalomania-driven intrinsic reward (mega-reward), which, to our knowledge, is the first approach that achieves comparable human-level performance in intrinsically-motivated play. The intuition behind mega-reward comes from the observation that infants' intelligence develops when they try to gain more control over entities in an environment; mega-reward therefore aims to maximize the control capabilities of agents over given entities in a given environment. To formalize mega-reward, a relational transition model is proposed to bridge the gap between direct and latent control. Experimental studies show that mega-reward (i) greatly outperforms all state-of-the-art intrinsic reward approaches, (ii) generally achieves the same level of performance as Ex-PPO and professional human-level scores, and (iii) also achieves a superior performance when combined with extrinsic rewards.

1 Introduction

Since humans can handle real-world problems without explicit extrinsic reward signals [12], intrinsic rewards [29] are introduced to simulate how human intelligence works. Notable recent advances on intrinsic rewards include empowerment-driven [17, 19, 24, 25], count-based novelty-driven [5, 22, 28, 34], prediction-error-based novelty-driven [1, 30, 6, 7], stochasticity-driven [9], and diversity-driven [33] approaches. Intrinsic reward approaches are usually evaluated by intrinsically-motivated play, where the proposed approaches are used to play games without extrinsic rewards but are evaluated with extrinsic rewards. However, though proven able to learn some useful knowledge [9, 33] or to improve exploration [6, 7], none of the state-of-the-art intrinsic reward approaches achieves a performance comparable to that of professional human players under this very challenging setting of intrinsically-motivated play.

In this work, we propose a novel megalomania-driven intrinsic reward (called mega-reward), which, to our knowledge, is the first approach that achieves human-level performance in intrinsically-motivated play. The idea of mega-reward originates from early psychology studies on contingency awareness [35, 2, 4], where infants were found to be aware that entities in their observation are potentially under their control. We notice that contingency awareness helps infants develop their intelligence by motivating them to gain more control over the entities in their environment; therefore, we believe that gaining more control over the entities in the environment makes a very good intrinsic reward. Mega-reward follows this intuition: it aims to maximize the control capabilities of agents over given entities in a given environment.

Figure 1: Latent control in Breakout (left) and DemonAttack (right).

Specifically, taking the game Breakout (shown in Fig. 1 (left)) as an example, if an infant is learning to play this game, contingency awareness may first motivate the infant to realize that he/she can control the movement of an entity, the bar; then, with the help of contingency awareness, he/she may continue to realize that blocking another entity, the ball, with the bar results in the ball also being under his/her control. Consequently, the infant's skill at playing this game gradually develops by gaining more control over the entities in the game.

Furthermore, we also note that entities can be controlled in two different modes: direct control and latent control. Direct control means that an entity can be controlled directly (e.g., the bar in Breakout), while latent control means that an entity can only be controlled indirectly, by controlling another entity (e.g., the ball is controlled indirectly by controlling the bar). In addition, latent control usually forms a hierarchy in most games; the game DemonAttack, shown in Fig. 1 (right), is an example: there is a gun that can be fired (direct control); firing the gun controls the bullets (1st-level latent control); the bullets control the enemies when they eliminate them (2nd-level latent control); and finally, the enemies control the score when they are eliminated (3rd-level latent control).

Obviously, gradually discovering and utilizing the hierarchy of latent control helps infants develop their skills in such games. Consequently, mega-reward should be formalized by maximizing not only direct control but also latent control over entities. This requires formalizing both direct and latent control. However, although direct control can be modeled with inverse models [8], there is no existing solution that can be used to formalize latent control. Therefore, we further propose a relational transition model (RTM) to bridge the gap between direct and latent control by learning how the transition of each entity is related to itself and to other entities. For example, the agent's direct control over one entity can be passed on to a second entity as latent control if the first entity contributes to the transition of the second. With the help of RTM, mega-reward can be formalized in a computationally tractable way.

Extensive experimental studies have been conducted on 18 Atari games and the “noisy TV” domain [6]; the experimental results show that: (i) mega-reward significantly outperforms all six state-of-the-art intrinsic reward approaches; (ii) even under the very challenging setting of intrinsically-motivated play, mega-reward (without extrinsic rewards) still generally achieves the same level of performance as two benchmarks (with extrinsic rewards), Ex-PPO and professional human-level scores; and (iii) mega-reward also achieves a superior performance when combined with extrinsic rewards, outperforming state-of-the-art approaches in two different settings.

The contributions can be summarized as follows: (1) We propose a novel intrinsic reward, called mega-reward, which aims to maximize the control capabilities of agents over given entities in a given environment. (2) To realize mega-reward, a relational transition model (RTM) is proposed to bridge the gap between direct and latent control. (3) Experimental studies on 18 Atari games and the “noisy TV” domain show that mega-reward (i) greatly outperforms all state-of-the-art intrinsic reward approaches, (ii) generally achieves the same level of performance as two benchmarks, Ex-PPO and professional human-level scores, and (iii) also achieves a superior performance when combined with extrinsic rewards. Easy-to-run code is included in the supplementary material.

2 Between Direct and Latent Control

We start by defining the notions of direct and latent control of an action on a state: (1) direct control denotes the control effect produced by the action taken at step t on the state at step t+1; (2) latent control denotes the control effect produced by the action taken at step t on a state at a later step t' > t+1. For direct control, we define a direct control map at every step t over an H x W grid of locations, with H and W being the state height and width, respectively.

Each entry of the direct control map denotes the probability that the entity at location (i, j) is directly controlled by the agent's action at step t. In practice, an entity is approximated by a sub-image subdivided from the raw frame (described in Section 3 and following [8]), so each entry is, in practice, the probability that the sub-image at location (i, j) is directly controlled by the agent's action. Note that one way to obtain the direct control map is described in [8]. We further consider, at each step t, a family of latent control maps, one for each earlier step t-k (k >= 1), whose entries represent the probability that the entity at location (i, j) is controlled (both directly and latently) by the action taken at step t-k. The entity here is also, in practice, approximated by a sub-image. Thus, this family of maps contains all the information about what is being controlled by the agent in the current state, considering all historical actions with both direct and latent control. Obviously, for the most recent action (k = 1), the latent control map coincides with the direct control map, corresponding to the definition that latent control cannot happen immediately after an action is taken.

We define a binary random variable to encode the fact that "event e happens at position (i, j) at step t". Taking the game Breakout (shown in Fig. 1 (left)) as an example, if we consider the event e to be "the ball appears", then this random variable represents "the ball appears at location (i, j) at step t". Thus, the variable carries: (1) the description of event e; (2) the location (i, j); and (3) the step t. The description of event e is also, in practice, a sub-image subdivided from the raw frame (e.g., the sub-image that contains the ball), which can be handled by the convolutional network described in Section 3. An important assumption about event e is that it occurs, and only occurs, at exactly one location at each step t. This assumption is made because (1) if e does not occur at step t, then it is unknown where e will occur at step t+1, and (2) if e occurs at multiple locations at step t, then we cannot tell whether these occurrences have swapped places at step t+1. Under this assumption, the events "e happens at position (i, j) at step t", ranging over all locations (i, j), are pairwise disjoint and their union is the entire sample space, which, by the law of total probability [40], produces:

(1)

Here, the action taken at step t is not considered to be a random variable, since it is known to the agent at step t. In practice, the description of the event stacks multiple historical frames, as adopted in [23], so that the transition from one step to the next is inferred from a longer memory. As defined above, the probability in question is that of a specific event occurring at position (i, j) at a given step; the general event can therefore be specialized to the event that the entity at that position is under the agent's control, which turns (1) into:

(2)

where the additional term is the probability that control at one location at step t propagates to another location at step t+1. This means that the latent control maps can be derived iteratively: to obtain the map at step t+1, we need both (1) the map at step t, which for the most recent action equals the direct control map, and (2) the propagation probabilities between all pairs of locations. In other words, (2) builds the relationship between direct and latent control via these location-to-location propagation probabilities. As can be seen, this reveals the need for a new form of transition model, one that learns how a part of the current state implies a part of the future state. Since this kind of transition model contains information about the relationships between different parts of the state underlying the transition of the full state, we call it a relational transition model (RTM). In the next section, we introduce our method to learn RTMs efficiently.
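
To make the recursion concrete, the following sketch (Python/NumPy, with generic notation assumed by us rather than the paper's exact symbols) propagates control probabilities across grid locations: the probability that a location is under control at the next step is a weighted sum, over all source locations, of each source's control probability times the estimated probability that control propagates from that source to the target.

    import numpy as np

    # Toy sizes for the meshed state; the transition matrix below stands in for the
    # quantities that the RTM of Section 3 estimates.
    H, W = 4, 4
    N = H * W
    rng = np.random.default_rng(0)

    # trans[dst, src]: probability that control at cell `src` at step t propagates
    # to cell `dst` at step t + 1; normalized over sources, as in a softmax.
    trans = rng.random((N, N))
    trans /= trans.sum(axis=1, keepdims=True)

    # direct[src]: probability that cell `src` is directly controlled by the action
    # taken at step t (obtained from an inverse model, cf. [8]).
    direct = rng.random(N)

    def propagate(control_prev, trans):
        """One step of the recursion: push control probabilities one step forward."""
        return trans @ control_prev

    # Latent control exerted by an action taken k steps ago: seed with that step's
    # direct control map and propagate k - 1 times (here k = 4 as an illustration).
    latent = direct.copy()
    for _ in range(3):
        latent = propagate(latent, trans)
    print(latent.reshape(H, W))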

3 Relational Transition Model

Figure 2: Relational transition model.

Building on the conclusions from the last section, we consider the event to be "a specific entity is under the agent's control". The description of this event can thus be approximated by a sub-image subdivided from the raw frame. For example, if the entity that we consider is the ball in the game Breakout (shown in Fig. 1 (left)), then we can use a sub-image that is just big enough to contain the ball as the description of the event. A similar sub-image approximation of an entity is also used in [16, 8]. With this approximation of entities in the state, we mesh the state into a grid of sub-images, as shown in Fig. 2, so that an entity is, in practice, the sub-image at coordinates (i, j), and the number of possible coordinates depends on the granularity of the meshing.
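
As an illustration of the meshing step, the sketch below splits a preprocessed frame into a grid of sub-images; the 84x84 frame size and the 4x4 grid are illustrative assumptions rather than the paper's exact configuration.

    import numpy as np

    def mesh(frame, grid_h, grid_w):
        """Split a (height, width) frame into a (grid_h, grid_w, cell_h, cell_w) grid of sub-images."""
        height, width = frame.shape
        cell_h, cell_w = height // grid_h, width // grid_w
        cells = frame[: grid_h * cell_h, : grid_w * cell_w]      # drop any remainder pixels
        cells = cells.reshape(grid_h, cell_h, grid_w, cell_w)
        return cells.transpose(0, 2, 1, 3)                       # index sub-images by (i, j)

    frame = np.zeros((84, 84), dtype=np.float32)                 # e.g., a preprocessed Atari frame
    cells = mesh(frame, grid_h=4, grid_w=4)
    print(cells.shape)   # (4, 4, 21, 21): sub-image (i, j) approximates the entity at that location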

We are now ready to propose relational transition models (RTMs), which produce an approximation of the location-to-location propagation probabilities mentioned in the last section, as the essential step towards modeling the latent control maps from the direct control map. Fig. 2 shows the structure of RTMs, which consist of two parameterized models: a transition network for relational transition modeling and a weight network for combination-weight estimation. We first define the forward function of the transition network, which predicts the sub-image at a target location at step t+1 from the sub-image at a source location at step t:

(3)

Here, the output represents the prediction of the target sub-image at step t+1. Note that, apart from taking in the source sub-image, the transition network also takes in the relative coordinates of source and target and the action, both encoded as one-hot vectors, so that the model knows the relative position of the part to predict as well as the action taken. Furthermore, each source location is assigned an estimated weight for predicting the target sub-image, which models how informative each source sub-image is for the prediction of the target. These weights are estimated by the weight network, which first produces an unnormalized score:

(4)

which is then softmaxed over all source locations to compute the combination weights:

(5)

We train our RTM end-to-end with a prediction loss between the predicted and the observed sub-images. As an intuitive explanation of RTM, taking the game Breakout (shown in Fig. 1 (left)) as an example, the transition network makes three predictions of the current ball based on the previous ball, bar, and brick. Since the final prediction of the current ball is a weighted combination of these three predictions, the weight network estimates the weights of this combination, measuring the different control effects that the previous ball, bar, and brick have on the current ball. We thus refer to the transition network and the weight network together as a relational transition model.

As a result, the combination weight produced by the weight network is an approximation of the location-to-location propagation probability in Eq. (2). Thus, Eq. (2) is modeled by:

(6)

RTM introduces separate forward passes over every pair of source and target locations at every step; however, by folding these separate forward passes into the batch axis, the computation is well parallelized. We report the running times and include code in the supplementary material.
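
The following is a minimal PyTorch sketch of an RTM-style module under shapes and names assumed by us (not the authors' exact architecture): a transition network predicts the target sub-image from each source sub-image, conditioned on one-hot cell indices and the action (cf. (3)); a weight network scores how informative each source is, and the scores are softmaxed over sources (cf. (4)-(5)); the final prediction is the weighted combination of the per-source predictions (cf. (6)). Folding the per-source forward passes into an extra tensor dimension is what makes the computation parallel, as noted above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyRTM(nn.Module):
        def __init__(self, cell_dim, n_cells, n_actions, hidden=128):
            super().__init__()
            self.n_cells, self.n_actions = n_cells, n_actions
            cond_dim = cell_dim + 2 * n_cells + n_actions        # source sub-image + two coordinates + action
            self.transition = nn.Sequential(                     # predicts the target sub-image
                nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, cell_dim))
            self.weight = nn.Sequential(                         # scores how informative each source is
                nn.Linear(cond_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, cells_t, target_idx, action):
            # cells_t: (B, N, D) flattened sub-images at step t, with N == self.n_cells;
            # target_idx: (B,) index of the cell to predict at step t + 1; action: (B,) discrete action.
            B, N, D = cells_t.shape
            src = F.one_hot(torch.arange(N, device=cells_t.device), N).float()
            src = src.unsqueeze(0).expand(B, N, N)                                   # source cell index
            tgt = F.one_hot(target_idx, self.n_cells).float().unsqueeze(1).expand(B, N, N)
            act = F.one_hot(action, self.n_actions).float().unsqueeze(1).expand(B, N, self.n_actions)
            cond = torch.cat([cells_t, src, tgt, act], dim=-1)    # one conditioning row per source cell
            preds = self.transition(cond)                         # (B, N, D): one prediction per source
            w = torch.softmax(self.weight(cond).squeeze(-1), dim=1)   # (B, N): combination weights
            return (w.unsqueeze(-1) * preds).sum(dim=1), w        # weighted prediction and the weights

    # Usage sketch: the weights `w` approximate how much each source cell controls the target,
    # and training is end-to-end with a prediction loss, e.g.:
    #   model = TinyRTM(cell_dim=21 * 21, n_cells=16, n_actions=6)
    #   pred, w = model(cells_t, target_idx, action)
    #   loss = F.mse_loss(pred, true_target_cell_tp1)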

4 Formalizing Intrinsic Rewards

As stated, the set of latent control maps contains all the information about what is being controlled by the agent in the current state, considering all historical actions with both direct and latent control, and each map in the set can be computed via Eq. (6) from the direct control map. Clearly, computing all maps in this set becomes intractable as the number of elapsed steps increases. Thus, we define the accumulated latent control map, a discounted sum of the latent control maps over all historical actions:

(7)

where the discount factor makes actions taken further in the past contribute less to the accumulated map. We then show that the accumulated map at step t+1 can be computed incrementally from the accumulated map at step t, the new direct control map, and the RTM's combination weights, without enumerating over all historical actions (see the proof of Lemma 1 in the supplementary material):

(8)

This reveals that we can simply maintain a memory for the accumulated latent control map and update it at each step, using the current direct control map and the RTM's combination weights, according to (8). The intuition behind this accumulation is that it gives an overall estimation of what is currently being controlled, both directly and latently, considering the effect of all historical actions. This also coincides with the intuition that a human does not explicitly track what is under his/her latent control for each individual historical action; instead, we maintain an overall estimation of what is under the control of our past actions, both directly and latently. Finally, to maximize the total accumulated control at the terminal step, the intrinsic reward (our mega-reward) at each step should be:

(9)
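
The sketch below gives one plausible concrete form, under notation assumed by us, of the incremental update in (8) and of the mega-reward in (9): propagate the previous accumulated map through the RTM's combination weights, discount it, add the newest direct control map, and reward the per-step gain in total accumulated control.

    import numpy as np

    def update_accumulated(acc, trans, direct, gamma=0.99):
        """Incremental update of the accumulated latent control map (cf. (8), assumed form)."""
        # acc: (N,) accumulated control; trans: (N, N) RTM combination weights, trans[dst, src];
        # direct: (N,) direct control map of the newest action; gamma: the discount factor of (7).
        return gamma * (trans @ acc) + direct

    def mega_reward(acc_new, acc_old):
        """Per-step intrinsic reward: the gain in total accumulated control (cf. (9), assumed form)."""
        return float(acc_new.sum() - acc_old.sum())

    # Toy rollout with random stand-ins for the RTM outputs.
    rng = np.random.default_rng(0)
    N = 16
    acc = np.zeros(N)
    for _ in range(5):
        trans = rng.random((N, N))
        trans /= trans.sum(axis=1, keepdims=True)
        direct = rng.random(N)
        acc_new = update_accumulated(acc, trans, direct)
        print(round(mega_reward(acc_new, acc), 3))
        acc = acc_new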

5 Experiments

Extensive experimental studies are conducted to evaluate the performance of mega-reward. We first evaluate mega-reward on 18 Atari games under the very challenging setting of intrinsically-motivated play in Section 5.1, where a case study visualizes how each part of mega-reward works, and mega-reward is compared with six state-of-the-art intrinsic rewards, with the benchmark of a PPO agent that has access to extrinsic rewards (Ex-PPO), and with the benchmark of professional human-level scores, to show its superior performance. We then investigate two possible ways to integrate mega-reward with extrinsic rewards in Sections 5.2 and 5.3. Finally, a few failure cases of mega-reward are investigated in Section 5.4, pointing to possible topics for future research.

Mega-reward is implemented on top of PPO [32] with the same set of hyper-parameters. The network structures of the transition network and the weight network are provided in the supplementary material. The hyper-parameters of the other baseline methods are set as in the corresponding original papers. The environments are wrapped as in [6, 23].

5.1 Intrinsically-Motivated Play of Mega-Reward

Intrinsically-motivated play is an evaluation setting where the agents are trained with intrinsic rewards only, and their performance is evaluated using extrinsic rewards. Here, all agents are run for the same number of steps, with the scores of the last several episodes averaged as the final scores reported in Table 1. The evaluation is conducted over 18 Atari games; due to the page limit, learning curves over training and running times are provided in the supplementary material.

Case Study.

Game          Emp     Cur     RND     Sto     Div     Dir     Meg
Seaquest      523.1   334.9   227.6   23.13   24.36   323.1   646.1
Venture       72.56   0.0     45.45   50.55   72.12   82.95   118.1
Asterix       1420    980.9   309.0   229.43  209.1   1278    2520
BeamRider     892.1   714.4   432.1   123.2   232.0   1178    1381
KungFuMaster  124.1   3060    532.7   192.2   203.6   423.1   258.5
Pong          -10.23  -7.200  -19.80  -19.4   -18.30  -20.30  -2.000
DoubleDunk    -18.23  -21.13  -19.64  -20.44  -20.98  -19.1   -11.81
Berzerk       682.1   262.0   305.2   40.12   39.09   397.8   778.2
Jamesbond     423.1   529.0   177.2   0.0     0.0     1267    3250
Bowling       50.12   114.4   24.51   33.12   23.95   123.1   30.00
WizardOfWor   582.1   509.0   640.0   144.1   150.9   673.1   1047
Robotank      4.235   2.240   1.160   1.231   0.673   6.217   3.700
BattleZone    2213    3398    7400    0.0     0.0     6329    2360
Centipede     1523    2137    2280.5  1123    1216    1823.3  2091
AirRaid       1242    850.9   836.8   622.3   733.6   1209    2195
DemonAttack   8304    29.09   273.5   32.94   36.91   6859    10170
Breakout      213.1   55.20   32.80   12.32   0.345   10.24   231.8
UpNDown       7239    7703    1464    152.7   141.1   50923   124693
Table 1: Comparison of mega-reward against six baselines.

Figure 3: Case study: the example of Pong.

Fig. 3 visualizes how each component of our method works as expected. Specifically, the 1st row is a frame sequence. The 2nd row is the corresponding direct control map, indicating how likely each grid cell is to be directly controlled by the current action; as expected, the learned map highlights the grid cell containing the bar as being directly controlled. The 3rd row is the accumulated latent control map, indicating how likely each grid cell is to be controlled (both directly and latently) by historical actions. As expected, the learned map shows that: (1) only the bar is under control before the bar hits the ball (frames 1–5); (2) both the bar and the ball are under control after the bar has hit the ball (frames 6–10); and (3) the bar, the ball, and the displayed score are all under control once the opponent misses the ball (frame 11). The 4th row is the mega-reward, obtained via Eq. (9) from the map in the 3rd row. As expected, it is high whenever the agent gains control over a new grid cell in the 3rd row (i.e., achieves more control over the grid cells of the state).

Against Other Intrinsic Rewards.

To show the superior performance of mega-reward (denoted Meg), we first compare it with six state-of-the-art intrinsic rewards, i.e., empowerment-driven (denoted Emp) [24], curiosity-driven (denoted Cur) [6], RND [7], stochasticity-driven (denoted Sto) [9], diversity-driven (denoted Div) [33], and a mega-reward variant with only direct control (denoted Dir). As shown by the experimental results in Table 1, mega-reward outperforms all six baselines substantially. In addition, we also have the following findings: (i) Sto and Div are designed for games with explicit hierarchical structures, so applying them to Atari games with no obvious temporal hierarchy results in the worst performance among all baselines. (ii) Dir is also much worse than the other baselines, which confirms the necessity of latent control in the formalization of mega-reward. (iii) The failure of the empowerment-driven approach suggests that applying information-theoretic objectives to complex video games such as Atari remains an open problem.

Against Two Benchmarks.

Figure 4: Mega-reward against Ex-PPO.
Figure 5: Mega-reward against human player.

In general, the purpose of evaluating intrinsic rewards in intrinsically-motivated play is to investigate if the proposed intrinsic reward approaches can achieve the same level of performance as two benchmarks: PPO agents with access to extrinsic rewards (denoted Ex-PPO) and professional human players. Therefore, we evaluate mega-reward using a relative score against two such benchmarks, which can be formally defined as

(10)

where a positive relative score means that mega-reward achieves a better performance than the corresponding benchmark, a negative relative score means that it achieves a worse performance, and the random-play score serves as the reference point of the normalization.
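
A small helper illustrating one plausible form of the relative score in (10); since the exact formula is not reproduced above, the normalization below, which is zero when mega-reward matches the benchmark, positive when it is better, and takes the random-play score as the reference point, is an assumption consistent with the description.

    def relative_score(score_meg, score_benchmark, score_random):
        """> 0: mega-reward beats the benchmark; < 0: worse; -1 corresponds to random-play level."""
        return (score_meg - score_benchmark) / max(abs(score_benchmark - score_random), 1e-8)

    # Toy numbers only: relative_score(200.0, 150.0, 5.0) is about 0.34.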

Fig. 4 shows the comparative performance of mega-reward against Ex-PPO on 18 Atari games: mega-reward greatly outperforms the Ex-PPO benchmark in 8 games and is close to the benchmark in 2 games. These results show that mega-reward generally achieves the same level of performance as, or a performance comparable to, Ex-PPO (strong on some games and weaker on others); therefore, the proposed mega-reward is as informative as the human-engineered extrinsic rewards.

Similarly, Fig. 5 shows the comparative performance of mega-reward against professional human players. Since the performance of professional human players (i.e., professional human-player scores) on 16 out of the 18 Atari games has already been measured in [23], we measure the professional human-player scores on AirRaid and Berzerk using the same protocol. Overall, Fig. 5 shows that mega-reward greatly outperforms the professional human-player benchmark in 7 games and is close to the benchmark in 2 games. Since the professional players are equipped with strong prior knowledge about the games and the scores displayed in the state, their scores represent a relatively high level of human skill on the corresponding games. Therefore, these results show that mega-reward generally reaches a level of performance comparable to that of a professional human player.

5.2 Pretraining with Mega-Reward

In many real-world cases, the agent may have access to the dynamics of the environment before extrinsic rewards become available [14]. This means that an agent can play with the dynamics of the environment to pretrain itself before being assigned a specific task (i.e., before having access to extrinsic rewards). Therefore, we investigate a first way to integrate mega-reward with extrinsic rewards, namely, using mega-reward to pretrain the agent, and compare the pretrained agent with one pretrained using the state-of-the-art world model [14].

The evaluation is based on a relative improvement score, which is formally defined as

(11)

where the former score is obtained after 20M steps of training in which the first 10M steps are pretrained without access to extrinsic rewards, and the latter score is obtained after 10M steps of training from scratch. In 14 out of 18 games (see Fig. 6), pretraining with mega-reward achieves a larger relative improvement than pretraining with the state-of-the-art world model [14]. This shows that mega-reward is also very helpful for agents to achieve a superior performance when used in a domain with extrinsic rewards.
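
Analogously, a hedged helper for the relative improvement score in (11); the normalization is again an assumption, with the sign convention that a positive value means pretraining with mega-reward helped relative to training from scratch.

    def relative_improvement(score_pretrained, score_scratch):
        # score_pretrained: score after 20M steps with the first 10M pretrained by mega-reward;
        # score_scratch: score after 10M steps of training from scratch.
        # Positive means pretraining helped; the normalization is an assumed choice.
        return (score_pretrained - score_scratch) / max(abs(score_scratch), 1e-8)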

5.3 Attention with Mega-Reward

Figure 6: Relative scores pretrained with mega-reward and world model.
Figure 7: Comparing RND and Masked-RND.

Furthermore, “noisy TV" is a long-standing open problem in novelty-driven approaches [6, 7]; it means that if there is a TV in the state that displays randomly generated noise at every step, the novelty-driven agent will find that watching at the noisy TV produces great interest. A possible way to solve this problem is to have an attention mask to remove the state changes that are irrelevant to the agent, and we believe the accumulated latent control map can be used as such an attention mask. Specifically, we estimate a running mean for each grid in

, which is then used to binarize

. The binarized is used to mask the state used in the state-of-the-art novelty-driven work, RND [7], making RND generate novelty scores only related to the agent’s control (both direct or latent). The above variant of RND is called Masked-RND, which is another way to apply mega-reward on a domain with extrinsic rewards.
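
The sketch below illustrates this masking idea under details assumed by us: keep a per-cell running mean of the accumulated latent control map, binarize the map by thresholding against that running mean, upsample the binary mask to the frame resolution, and multiply it into the observation before it is fed to RND. The thresholding rule and the nearest-neighbour upsampling are our assumptions, not the paper's exact procedure.

    import numpy as np

    class ControlMask:
        """Binarize the accumulated latent control map against a per-cell running mean."""
        def __init__(self, grid_h, grid_w, momentum=0.99):
            self.mean = np.zeros((grid_h, grid_w))
            self.momentum = momentum

        def __call__(self, acc_map):
            # acc_map: (grid_h, grid_w) accumulated latent control map at the current step.
            self.mean = self.momentum * self.mean + (1.0 - self.momentum) * acc_map
            return (acc_map > self.mean).astype(np.float32)      # 1 = under the agent's control

    def mask_frame(frame, binary_mask):
        """Zero out frame regions whose grid cell is not under the agent's control."""
        gh, gw = binary_mask.shape
        ch, cw = frame.shape[0] // gh, frame.shape[1] // gw
        upsampled = np.kron(binary_mask, np.ones((ch, cw)))      # nearest-neighbour upsampling
        return frame[: gh * ch, : gw * cw] * upsampled

    # The masked frame would then replace the raw frame as the input to RND's networks.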

Experiments are conducted on MontezumaRevenge following the same settings as in [7]. Fig. 7 shows the performance of RND and Masked-RND under different degrees of noise (measured by the standard deviation (STD) of the added normal noise). The results show that, as the noise level increases, the score of RND decreases catastrophically, while the performance drop of Masked-RND is marginal until the noise becomes so strong (STD 0.6) that it ruins the state representation itself. This further supports our conclusion that mega-reward can also achieve a superior performance when used together with extrinsic rewards.

5.4 Failure Cases

Some failure cases of mega-reward have also been observed. We find that mega-reward works well on most games with a fixed meshing size; however, some games with extremely small or extremely large entities may fail with this size. This failure can be resolved by extracting the entities from the states using semantic segmentation [13] and then applying our method to the semantically segmented entities instead of to each grid cell. In addition, mega-reward also fails when the game terminates with a few seconds of flashing screen, because this makes the agent mistakenly believe that killing itself causes the screen to flash, which, from the agent's perspective, looks like gaining control over all entities. This failure can likewise be resolved by extracting entities using semantic segmentation.

6 Related Work

In this section, we discuss related work on intrinsic rewards, contingency awareness, empowerment, and relational networks.

Intrinsic rewards [29] are rewards generated by the agent itself, in contrast to extrinsic rewards, which are provided by the environment. Most previous work on intrinsic rewards is based on the general idea of “novelty-drivenness”, i.e., higher intrinsic rewards are given to states that occur relatively rarely in the agent's history; this general idea is also called “surprise” or “curiosity”. Based on how the novelty of a state is measured, there are two classes of methods: count-based methods [5, 22, 28, 34] and prediction-error-based methods [1, 30, 6, 7]. Another popular idea for generating intrinsic rewards is “difference-drivenness”, meaning that higher intrinsic rewards are given to states that differ from the resulting states of other subpolicies [9, 33]. To evaluate intrinsic rewards, intrinsically-motivated play has been adopted in several state-of-the-art works. It may, however, be an ill-defined problem: if we flip the extrinsic rewards, an agent trained only with intrinsic rewards is likely to perform worse than a random agent in terms of the flipped extrinsic rewards. Setting aside this possible flaw in the problem definition, intrinsically-motivated play is indeed helpful in many scenarios, such as pretraining, improving exploration, and understanding human intelligence.

The concept of contingency awareness originally comes from psychology [35, 2], where infants are shown to be aware that the entities in their observation are potentially related to their actions. The idea was first introduced into AI by [4]. More recently, the discovery of grid cells [26], a neuroscience finding that supports the psychological concept of contingency awareness, triggered interest in applying grid cells to AI agents [3, 37]. Another popular idea developed from contingency awareness is that of inverse models, which are used to learn representations that contain the necessary information about action-related changes in states [30], or to generate attention masks over the parts of the state that are action-related [8]. Other ideas following contingency awareness include controllable feature learning [11, 20], tool-use discovery [10], and sensing guidance [17, 18, 19]. In contrast, we formalize contingency awareness into a powerful intrinsic reward (mega-reward) for human-level intrinsically-motivated play. Moreover, existing works are only capable of identifying what is under the agent's direct control, while we build an awareness of latent control and show that this awareness is the key to a powerful intrinsic reward.

The idea of “having more control over the environment” is also present in empowerment [17, 19], which, however, is commonly based on the mutual information between the actions and the entire state [24, 25], and which later evolved into stochasticity-drivenness [9]. Our megalomania-drivenness, in contrast, is based on identifying how actions are latently related to each individual entity in the state, and evolves from contingency awareness [35]. Thus, “megalomania-drivenness” is different from “empowerment”.

A part of RTMs, the transition network (see Section 3), is similar to relational networks [31], which have recently been applied to predict temporal transitions [36] and to learn representations in RL [38]. However, relational networks do not explicitly model the location-to-location control probabilities required by mega-reward (see Section 2), while RTMs model them with the combination weights (see Section 3). Thus, RTMs are defined and trained in a different way.

7 Summary and Outlook

In this work, we proposed a novel and powerful intrinsic reward, called mega-reward, which aims to maximize the control over given entities in a given environment. To our knowledge, mega-reward is the first approach that achieves the same level of performance as professional human players in intrinsically-motivated play. To formalize mega-reward, a relational transition model is proposed to bridge the gap between direct and latent control. Extensive experimental studies show the superior performance of mega-reward both in intrinsically-motivated play and in scenarios where extrinsic rewards are also available. Since human players can be driven by multiple intrinsic rewards, a promising topic for future research is how to efficiently and effectively combine mega-reward with other intrinsic rewards to further improve the intelligence of the agent.

Supplementary Material

.1 Neural Network Details

The details of the network architectures used to model the transition network and the weight network are shown in Tables 2 and 3, respectively. A fully connected layer is denoted FC, and a flatten layer is denoted Flatten. We use leaky rectified linear units (denoted LeakyRelu) [21] as the nonlinearity applied to all hidden layers in our networks. Batch normalization [15] (denoted BatchNorm) is applied after the hidden convolutional layers (denoted Conv). For both networks, the integration of the three inputs is accomplished by approximated multiplicative interaction [27] (the dot-multiplication in Tables 2 and 3), so that any prediction made by either network is conditioned on the three inputs together. Deconvolutional layers (denoted DeConv) [39] in the transition network are applied for predicting relational transitions.
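
As an illustration of this dot-multiplication, the sketch below (PyTorch, with illustrative layer sizes rather than the paper's) fuses the three inputs by projecting each to a common hidden size and combining the projections with an element-wise product, so that the output is conditioned on all three inputs jointly.

    import torch
    import torch.nn as nn

    class MultiplicativeFusion(nn.Module):
        """Fuse sub-image features, one-hot coordinates, and a one-hot action by element-wise product."""
        def __init__(self, feat_dim, coord_dim, action_dim, hidden=1024):
            super().__init__()
            self.proj_feat = nn.Linear(feat_dim, hidden)
            self.proj_coord = nn.Linear(coord_dim, hidden)
            self.proj_action = nn.Linear(action_dim, hidden)

        def forward(self, feat, coord_onehot, action_onehot):
            # The product plays the role of the dot-multiplication in Tables 2 and 3,
            # conditioning the downstream prediction on all three inputs jointly.
            return self.proj_feat(feat) * self.proj_coord(coord_onehot) * self.proj_action(action_onehot)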

.2 Performance on Atari Games

Here, we include a comparison of mega-reward against three state-of-the-art intrinsic rewards (without extrinsic rewards), namely, curiosity-driven (Cur) [6], RND [7], and diversity-driven (Div) [33], as well as against PPO with extrinsic rewards (Ex-PPO) [32]. For a fair comparison, each approach is trained for 10M steps. Fig. 8 shows the final performance, and Table 4 reports the running time (in hours) of our approach against the baselines and benchmarks on the game Seaquest.

.3 Proof of Lemmas

Lemma 1
(12)

Proof of Lemma 1:

(13)
Figure 8: Extrinsic reward per episode of mega-reward against other baselines after training for 10M steps.
Input 1: Input 2: , as one-hot vector Input 3: , as one-hot vector
Conv: kernel size, number of features 16, stride 2
BatchNorm
LeakyRelu
Conv: kernel size , number of features 32, stride 1
BatchNorm FC: number of features 1024 FC: number of features 1024
LeakyRelu
Flatten: is flattened to
FC: number of features 1024
BatchNorm
LeakyRelu
Dot-multiply
FC: number of features 1152
BatchNorm
LeakyRelu
Reshape: is reshaped to
DeConv: kernel size , number of features 16, stride 1
BatchNorm
LeakyRelu
DeConv: kernel size , number of features 1, stride 2
Tanh
Output:
Table 2: Network architecture of the transition network.
Input 1: Input 2: , as one-hot vector Input 3: , as one-hot vector
Conv: kernel size , number of features 16, stride 2
BatchNorm
LeakyRelu
Conv: kernel size , number of features 32, stride 1
BatchNorm FC: number of features 1024 FC: number of features 1024
LeakyRelu
Flatten: is flattened to
FC: number of features 1024
BatchNorm
LeakyRelu
Dot-multiply
FC: number of features 512
BatchNorm
LeakyRelu
Tanh
FC: number of features 1
Output:
Table 3: Network architecture of the weight network.
Game Emp Cur RND Sto Div Dir Meg Ex-PPO
Seaquest 14.24 15.87 17.62 12.96 19.66 21.32 34.22 5.126
Table 4: Comparison of mega-reward against baselines and benchmarks on running time (hours); conducted on a server with an i7 CPU (16 cores) and one Nvidia GTX 1080Ti GPU. Each method is run for 10M frames.

References

  • Achiam and Sastry [2017] Joshua Achiam and Shankar Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732, 2017.
  • Baeyens et al. [1990] Frank Baeyens, Paul Eelen, and Omer van den Bergh. Contingency awareness in evaluative conditioning: A case for unaware affective-evaluative learning. Cognition and emotion, 1990.
  • Banino et al. [2018] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al. Vector-based navigation using grid-like representations in artificial agents. Nature, 2018.
  • Bellemare et al. [2012] Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using atari 2600 games. In AAAI, 2012.
  • Bellemare et al. [2016] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.
  • Burda et al. [2018] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. In NIPS, 2018.
  • Burda et al. [2019] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In ICLR, 2019.
  • Choi et al. [2019] Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingency-aware exploration in reinforcement learning. In ICLR, 2019.
  • Florensa et al. [2017] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. In ICLR, 2017.
  • Forestier and Oudeyer [2016] Sébastien Forestier and Pierre-Yves Oudeyer. Modular active curiosity-driven discovery of tool use. In IROS, 2016.
  • Forestier et al. [2017] Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
  • Friston [2010] Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 2010.
  • Goel et al. [2018] Vikash Goel, Jameson Weng, and Pascal Poupart. Unsupervised video object segmentation for deep reinforcement learning. In NIPS, 2018.
  • Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. World models. In NIPS, 2018.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Jaderberg et al. [2017] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
  • Klyubin et al. [2005a] Alexander S Klyubin, Daniel Polani, and Chrystopher L Nehaniv. All else being equal be empowered. In ECAL, 2005.
  • Klyubin et al. [2005b] Alexander S Klyubin, Daniel Polani, and Chrystopher L Nehaniv. Empowerment: A universal agent-centric measure of control. In IEEE Congress on Evolutionary Computation, 2005.
  • Klyubin et al. [2008] Alexander S Klyubin, Daniel Polani, and Chrystopher L Nehaniv. Keep your options open: an information-based driving principle for sensorimotor systems. PloS one, 2008.
  • Laversanne-Finot et al. [2018] Adrien Laversanne-Finot, Alexandre Péré, and Pierre-Yves Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. In CoRL, 2018.
  • Maas et al. [2013] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
  • Martin et al. [2017] Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. In IJCAI, 2017.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Mohamed and Rezende [2015] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, 2015.
  • Montúfar et al. [2016] Guido Montúfar, Keyan Ghazi-Zahedi, and Nihat Ay. Information theoretically aided reinforcement learning for embodied agents. arXiv preprint arXiv:1605.09735, 2016.
  • Moser et al. [2015] May-Britt Moser, David C Rowland, and Edvard I Moser. Place cells, grid cells, and memory. Cold Spring Harbor perspectives in Biology, 2015.
  • Oh et al. [2015] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015.
  • Ostrovski et al. [2017] Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In ICML, 2017.
  • Oudeyer and Kaplan [2009] Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in Neurorobotics, 2009.
  • Pathak et al. [2017] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
  • Santoro et al. [2017] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Song et al. [2019] Yuhang Song, Jianyi Wang, Thomas Lukasiewicz, Zhenghua Xu, and Mai Xu. Diversity-driven extensible hierarchical reinforcement learning. In AAAI, 2019.
  • Tang et al. [2017] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, 2017.
  • Watson [1966] John S Watson. The development and generalization of "contingency awareness" in early infancy: Some hypotheses. Merrill-Palmer Quarterly of Behavior and Development, 1966.
  • Watters et al. [2017] Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In NIPS, 2017.
  • Whittington et al. [2018] James CR Whittington, Timothy H Muller, Caswell Barry, and Timothy EJ Behrens. Generalisation of structural knowledge in the hippocampal-entorhinal system. In NIPS, 2018.
  • Zambaldi et al. [2019] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Deep reinforcement learning with relational inductive biases. In ICLR, 2019.
  • Zeiler et al. [2011] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
  • Ziegel [2001] Eric R Ziegel. Standard probability and statistics tables and formulae. Technometrics, 2001.