Attention-based Curiosity-driven Exploration in Deep Reinforcement Learning

10/23/2019 ∙ by Patrik Reizinger, et al.

Reinforcement Learning enables training an agent via interaction with the environment. However, in the majority of real-world scenarios, the extrinsic feedback is sparse or insufficient, so intrinsic reward formulations are needed to train the agent successfully. This work investigates and extends the paradigm of curiosity-driven exploration. First, a probabilistic approach is taken to exploit the advantages of the attention mechanism, which has been applied successfully in other domains of Deep Learning. Combining the two, we propose new methods, such as AttA2C, an extension of the Actor-Critic framework. Second, another curiosity-based approach - ICM - is extended. The proposed model utilizes attention to emphasize features for the dynamic models within ICM; moreover, we also modify the loss function, resulting in a new curiosity formulation, which we call rational curiosity. The corresponding implementation can be found at




1 Introduction

Learning through reinforcement is a powerful concept, as it - similarly to unsupervised learning - does not need labeled samples, which are both time- and money-consuming to collect. Moreover, Reinforcement Learning (RL) is rather similar to the concept of human learning, since it defines the learning process as an interaction between an agent and its environment. Interestingly, in some cases, it has been shown that an RL agent optimizes the same objective function as primates do, as noted in [11].


Speaking of optimization, an optimum can only be defined w.r.t. some given criteria, thus RL has to define its own criterion, as is the case for Artificial Intelligence (AI) in general. In optimization, we generally face the so-called bias-variance problem, which has a different name in the RL domain: the exploration-exploitation dilemma. Here, exploitation refers to preferring prior knowledge, i.e. bias. In some sense, exploitation can be thought of as a risk-minimizing way of acting. Nonetheless, this policy may not always be the best option. Clearly, some sort of compromise is needed between exploitation and exploration, where the latter stands for a variance-preferring policy.

RL agents are generally trained based on the feedback - rewards - collected from the environment. However, the environment is not guaranteed to have inner dynamics that reward the agent in a way that corresponds with its goals. For example, if the rewards are never negative, the agent may collect an infinite amount of reward given an infinite time horizon, while still failing its objective if that objective is to finish the task as quickly as possible. Thus, having only an extrinsic incentive may be impractical. Furthermore, rewards are sparse in most real-life scenarios, so it is more difficult for the agent to assign credit to useful actions and to avoid disadvantageous ones in the future.

To mitigate the above-mentioned problems, several methods have been developed that supplement the objective function of an RL agent to improve its performance. At a higher level, i.e. considering model-based RL, the World Models architecture [4] is one answer to these challenges. Curiosity-based methods, which are the main topic of this work, modify the loss function (or even the network architecture) by adding terms that incentivize exploration. Here, the Intrinsic Curiosity Module (ICM) of [8] is a major contribution, which this work builds upon extensively. ICM introduces inner dynamics models (forward and inverse) to quantify the prediction error of the next state and of the action; this error is utilized as an intrinsic reward - a more extensive study can be found in [2]. Another promising work in this field is the notion of disagreement, an ensemble-based curiosity formulation [9].
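To make the idea of a prediction error acting as an intrinsic reward concrete, the following is a minimal sketch in plain Python. It assumes a toy linear forward model and the common convention of scaling half the squared feature-prediction error by a factor `eta`; the function names, the weight matrix and `eta` are illustrative assumptions, not the implementation of [8].

```python
from typing import List, Sequence


def forward_model(phi_t: Sequence[float],
                  action_onehot: Sequence[float],
                  weights: List[List[float]]) -> List[float]:
    """Hypothetical linear forward model: predicts the next features
    from the concatenation [phi_t, a_t]."""
    x = list(phi_t) + list(action_onehot)
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def intrinsic_reward(phi_next_pred: Sequence[float],
                     phi_next: Sequence[float],
                     eta: float = 1.0) -> float:
    """Half the squared prediction error of the next features, scaled by eta
    (eta is an assumed scaling hyperparameter)."""
    return 0.5 * eta * sum((p - q) ** 2 for p, q in zip(phi_next_pred, phi_next))


# Toy usage: two feature dimensions, two discrete actions.
weights = [[1.0, 0.0, 0.0, 0.5],
           [0.0, 1.0, 0.0, 0.0]]
pred = forward_model([1.0, 0.0], [0.0, 1.0], weights)
reward = intrinsic_reward(pred, [1.0, 0.0])
```

The larger the mismatch between predicted and observed features, the larger the reward, so poorly predicted (novel) transitions are favored.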

In this paper, we explore the capabilities of applying attention [12] in order to incentivize exploration and improve generalization. We test our proposed methods using Atari games from OpenAI Gym [1]. Our main contributions are as follows:

  • AttA2C, an attention-aided A2C (Advantage Actor-Critic) variant,

  • feature- and action-selective extension of the ICM [8],

  • a rational curiosity formulation.

2 Attention-based Curiosity

After reviewing the approaches in the literature, we turn to the main contributions of this work, which combine advancements from different Deep Learning (DL) fields, mainly by introducing the attention mechanism to curiosity-driven RL - for its use in other domains, see e.g. [13]. Using attention, we modify A2C [6] (one of the standard architectures of RL) and the curiosity-driven exploration module of [8] to include prior assumptions in the model.

2.1 AttA2C

The main idea for AttA2C comes from the correspondence between GANs [3] and Actor-Critic (AC) methods [10], which is summarized in Table 1. The only difference (typeset in bold in the table) is that an AC network feeds the same input/features to both of its heads, whereas the GAN does quite the opposite.

Figure 1: The AttA2C architecture

Taking this similarity a step further, we conjecture that separating the feature space into two parts can be advantageous, as different features may be useful for the actor and for the critic. The proposed architecture aims to overcome the limitation of feeding identical features to both heads by utilizing attention. As shown in Figure 1, the input from the environment is fed through separate attention layers, so that both heads can put emphasis on the most important parts of the latent process, which is formulated as follows:

$$\pi(a_t \mid s_t) = \mathcal{A}\big(p_{att,\mathcal{A}}(\phi(s_t)) \odot \phi(s_t)\big), \qquad A(s_t, a_t) = \mathcal{C}\big(p_{att,\mathcal{C}}(\phi(s_t)) \odot \phi(s_t)\big)$$

where $\mathcal{A}$ (which is basically the policy $\pi$) and $\mathcal{C}$ denote the corresponding networks (actor and critic), $\phi$ is the feature transform of the input (if present), while $p_{att}$ is a shorthand for the probability distribution of the attention mechanism. Subscripts denote that two separate attention layers are used to predict the policy $\pi$ and the advantage function $A$.
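The separation of the feature space by two attention layers can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: attention is modeled as a softmax over linearly scored features that re-weights the features element-wise, and both heads are single linear layers. All names (`attend`, `atta2c_forward`, the parameter keys) are hypothetical and do not mirror the authors' PyTorch code.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def softmax(x: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x: Sequence[float], weights: Matrix) -> List[float]:
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def attend(features: Sequence[float], att_weights: Matrix) -> List[float]:
    """Score the features, turn the scores into a probability distribution,
    and re-weight the features element-wise with it."""
    p_att = softmax(linear(features, att_weights))
    return [p * f for p, f in zip(p_att, features)]


def atta2c_forward(features: Sequence[float],
                   params: Dict[str, Matrix]) -> Tuple[List[float], float]:
    """Each head sees the same features through its OWN attention layer."""
    actor_in = attend(features, params["att_actor"])
    critic_in = attend(features, params["att_critic"])
    policy = softmax(linear(actor_in, params["actor"]))   # action probabilities
    critic_out = linear(critic_in, params["critic"])[0]   # scalar critic estimate
    return policy, critic_out


# Toy usage: 3 features, 2 actions, all-zero attention scores (uniform attention).
params = {
    "att_actor": [[0.0] * 3 for _ in range(3)],
    "att_critic": [[0.0] * 3 for _ in range(3)],
    "actor": [[0.0] * 3, [0.0] * 3],
    "critic": [[1.0, 1.0, 1.0]],
}
policy, critic_out = atta2c_forward([3.0, 6.0, 9.0], params)
```

Because the two attention layers have independent parameters, training can push the actor and the critic toward different subsets of the shared features.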

GAN             AC
Generator       Actor
Discriminator   Critic
Label           Reward
Input           State
Latent space    State
Table 1: The connection between GAN and AC concepts [10]

2.2 Feature- and action-selective ICM

Being able to separate the most useful features can be advantageous not only for the AC network, but also for the curiosity formulation. Thus, we introduce the attention mechanism into ICM to make it action- and feature-selective. To do that, attention is applied to the concatenation of the inputs of the forward and inverse models (denoted by the $fwd$ and $inv$ subscripts, respectively):

$$\hat{\phi}(s_{t+1}) = f_{fwd}\big(p_{att,fwd}(x_{fwd}) \odot x_{fwd}\big), \qquad x_{fwd} = [\phi(s_t), a_t]$$

$$\hat{a}_t = f_{inv}\big(p_{att,inv}(x_{inv}) \odot x_{inv}\big), \qquad x_{inv} = [\phi(s_t), \phi(s_{t+1})]$$

where $\phi$ is the feature transform, $[\cdot,\cdot]$ denotes concatenation, $p_{att}$ is the probability distribution of the attention mechanism, and $\hat{\cdot}$ stands for predicted values. The equations above describe the single attention case, but we also experimented with double attention, in which we swapped the order of concatenation and attention. In this case, the attention-weighted features and actions are concatenated. The reason for this second formulation was to separate the weighting between the two domains. Its main advantage could be that double attention ensures that an emphasized subspace exists in both the feature space and the action space. This is not the case when using single attention, which can be problematic if the two distributions are not of the same order of magnitude, since one domain then appears more important merely because of its different value scale.
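The scale issue of single attention can be illustrated with a small sketch in plain Python. As a simplifying assumption, the raw input values stand in for the scores that a learned attention layer would produce; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(x: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def single_attention_weights(features: Sequence[float],
                             actions: Sequence[float]) -> List[float]:
    """Concatenate first, then apply ONE softmax over both domains."""
    return softmax(list(features) + list(actions))


def double_attention_weights(features: Sequence[float],
                             actions: Sequence[float]) -> List[float]:
    """Apply a separate softmax per domain, then concatenate."""
    return softmax(features) + softmax(actions)


# Features live on a much larger value scale than the one-hot action.
features = [10.0, 9.0]
actions = [0.0, 1.0]

single = single_attention_weights(features, actions)
double = double_attention_weights(features, actions)

# Total attention mass assigned to the action part of the input.
single_action_mass = sum(single[len(features):])
double_action_mass = sum(double[len(features):])
```

With these toy values the single softmax assigns almost no attention mass to the action entries, whereas the per-domain softmax always leaves a total mass of one for the actions.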

Figure 2: Forward loss formulation of the RCM agent

2.3 Rational curiosity

The curiosity-driven exploration strategies of [8] and [9] represent a rather powerful approach to training agents in sparse-reward settings. Nonetheless, these models are based on the assumption that enforcing curiosity-driven exploration is a good choice for every state and action, even though this is not true in every scenario. Consider, for example, the noisy TV experiment of [2], where the agent has control over generating new (and thus unexpected) instances of random noise, resulting in high values of the ICM/disagreement losses.

To classify the states based on the usefulness of curiosity, the state space $\mathcal{S}$ can be divided into two subsets: $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$, where the former denotes the subset of states with useful curiosity, while the latter denotes the settings with useless or harmful curiosity. Curiosity is termed harmful if being driven by it does not ensure the fulfillment of the original objective, not even in the long run. That is, sacrificing short-term rewards to develop general skills is useful, but overfitting random noise is not.

Thus, the rational objective of curiosity-driven exploration would be to minimize the probability of being in states in $\mathcal{S}^{-}$, which corresponds to:

$$\min_{a_t} P\big(s_{t+1} \in \mathcal{S}^{-} \mid s_t, a_t\big) \quad \forall s_t \in \mathcal{S}$$

i.e. it is intended to select actions which do not lead the agent into $\mathcal{S}^{-}$, independent of its current state.

Figure 3: The A2C/AttA2C network (gray indicates optional layer)

Figure 4: The ICM/RCM network (gray indicates optional layer)

To achieve that, an attention network [12] is used on the forward loss function of the ICM module [8]. The inverse loss is not weighted this way: for each action prediction there is only one true action, so weighting has no effect. The new forward loss term is illustrated in Figure 2, where the round-headed arrow denotes that $\mathcal{L}_{A2C}$ is used as the control term in the attention layer, i.e. it determines the weighting of the MSE (mean squared error, here decomposed into two steps, i.e. subtraction and squared mean) through a probability distribution. Thus, the loss function is modified in the following manner:

$$\mathcal{L}_{RCM} = \mathcal{L}_{A2C} + \beta \cdot att\big(\mathcal{L}_{fwd};\, \mathcal{L}_{A2C}\big) + (1-\beta) \cdot \mathcal{L}_{inv}$$

where RCM is the term for the new model, $\mathcal{L}_{A2C}$ is the objective function of the A2C network, $\mathcal{L}_{fwd}$ and $\mathcal{L}_{inv}$ are the losses of the forward and inverse dynamics, weighted with a scalar $\beta$, and $att(\cdot\,;\,\mathcal{L}_{A2C})$ denotes the attention weighting controlled by $\mathcal{L}_{A2C}$. This formula is motivated by the fact that $\mathcal{L}_{A2C}$ fully encodes whether the state of the agent is in the useful subset $\mathcal{S}^{+}$ or in the harmful subset $\mathcal{S}^{-}$ of the state space. We hypothesize that this formulation can help to utilize curiosity only in situations where the agent can benefit from it in the long run, and to omit curiosity otherwise.
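A rough sketch of an attention-weighted forward loss is given below in plain Python. Several points are assumptions made only for illustration: the control signal is represented by a hypothetical vector `control_scores` with one score per feature dimension (how such scores are derived from the control term is left open), the attention distribution replaces the uniform averaging of the MSE, and the three loss terms are combined with the `beta` / `1 - beta` weighting known from ICM. The default `beta=0.2` is likewise only a placeholder.

```python
import math
from typing import List, Sequence


def softmax(x: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_forward_loss(phi_pred: Sequence[float],
                                    phi_true: Sequence[float],
                                    control_scores: Sequence[float]) -> float:
    """Step 1: subtraction and squaring per feature dimension.
    Step 2: weighted mean, with weights from the attention distribution
    instead of the uniform 1/n of a plain MSE."""
    sq_err = [(p - q) ** 2 for p, q in zip(phi_pred, phi_true)]
    p_att = softmax(control_scores)
    return sum(w * e for w, e in zip(p_att, sq_err))


def combined_loss(l_a2c: float, l_fwd_att: float, l_inv: float,
                  beta: float = 0.2) -> float:
    """Assumed combination of the A2C, forward and inverse terms."""
    return l_a2c + beta * l_fwd_att + (1.0 - beta) * l_inv


# Uniform control scores recover the ordinary MSE.
uniform = attention_weighted_forward_loss([1.0, 2.0], [0.0, 2.0], [0.0, 0.0])
# Scores that suppress the first dimension shrink its contribution.
suppressed = attention_weighted_forward_loss([1.0, 2.0], [0.0, 2.0], [-10.0, 10.0])
total = combined_loss(1.0, uniform, 2.0, beta=0.2)
```

The second call shows the intended effect: a prediction error in a dimension that receives little attention contributes almost nothing to the loss, so it generates almost no curiosity.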

2.4 Implementation

The proposed methods are implemented in PyTorch [7]; the agents are based on the implementation of [8] (shown in Figure 3 and Figure 4). The extensions are indicated with gray boxes, as those layers are part of only some of the proposed methods. Five configurations were implemented: AttA2C, single-attention ICM, double-attention ICM, RCM, and the traditional ICM agent [8], which is used as a baseline.

For test purposes, the Atari environments of the OpenAI Gym [1] are used via the Stable-baselines [5] package. Three environments were chosen, i.e. Breakout, Pong and Seaquest (as in [9]), which provide single- (Breakout, Seaquest) and multiplayer (Pong) tasks with one (Breakout, Pong) and multiple (Seaquest) objectives. For each of the three, both the deterministic (v4) and the stochastic (v0, with nonzero action repeat probability) variants were evaluated. All agents were trained on 4 environments in parallel for 2,500,000 rollouts of 5 steps each, using 4 stacked frames.

Figure 5: Mean reward in the Breakout environment (v0)
Figure 6: Feature standard deviation in the Breakout environment (v0)

3 Results

We monitored two metrics to compare both performance and generalization: we used the mean reward for the former, and the mean standard deviation of the features for the latter. Due to space restrictions, only the results for the more difficult, stochastic v0 environments are depicted here, but the others are available in the GitHub repository.
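One possible way to compute such a feature statistic is sketched below in plain Python. It assumes that the standard deviation is taken per feature dimension over a batch of samples (population form) and then averaged over the dimensions; the exact definition used in the experiments may differ, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def mean_feature_std(feature_batch: Sequence[Sequence[float]]) -> float:
    """Average, over feature dimensions, of the per-dimension standard
    deviation computed across the samples of the batch."""
    per_dimension: List[tuple] = list(zip(*feature_batch))
    return mean(pstdev(values) for values in per_dimension)


# Toy batch: 2 samples with 2 feature dimensions each.
batch = [[0.0, 1.0],
         [2.0, 1.0]]
metric = mean_feature_std(batch)
```

In this toy batch the first dimension varies (standard deviation 1) and the second is constant (standard deviation 0), so the metric is their average.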

For Breakout (Figure 5, which also shows a confidence interval), the selective versions of ICM performed consistently well, but the AttA2C agent trained faster at the beginning and, after experiencing a jump in variance, managed to significantly outperform the other agents. Figure 6 shows the standard deviation of the feature space, which is representative of the general pattern: the values are significantly higher for AttA2C (this holds for every environment, although the "jumps" are smaller elsewhere). In our experiments, agents with higher standard deviation (AttA2C excluded from this statement) generally performed better. In the case of Pong (Figure 7), the single-attention ICM performed best, followed by the RCM agent. Here, AttA2C trained more slowly, but managed to achieve comparable rewards - the reason for this could be the smaller gradients caused by the attention between the feature space and the actor/critic. The RCM agent was the best for the most complex environment, Seaquest, as Figure 8 shows. In this case, the selective ICM agents were overtaken by the original ICM, while AttA2C trained much more slowly.

Agent Mean normalized reward Std. dev.
Table 2: Comparison of the agents’ normalized mean performance (higher is better)
Figure 7: Mean reward in the Pong environment (v0)
Figure 8: Mean reward in the Seaquest environment (v0)

To provide a concise but quantitative summary of the agents' performance in all environments (including both variants), we normalized the highest rewards of each agent, with 100 denoting the best performance. This way, we were able to compare the mean performances in a relative manner, as summarized in Table 2 (including a confidence interval). As it shows, the RCM agent performed the best, followed by the single-attention ICM. Note that AttA2C had both the lowest mean and the highest variance, mainly due to its good performance in Breakout but only moderate performance in the other scenarios.
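The normalization can be sketched as follows in plain Python. It assumes that, per environment, each agent's highest reward is divided by the highest reward of the best agent and multiplied by 100, and that the mean and the (population) standard deviation are then taken over the environments. The agent and environment names and all numbers are made-up placeholders, not results from the experiments; environments with non-positive best rewards would first need an offset, which is not handled here.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Rewards = Dict[str, Dict[str, float]]  # {environment: {agent: highest reward}}


def normalize_per_env(best_rewards: Rewards) -> Dict[str, List[float]]:
    """Scale the rewards so that the best agent of each environment gets 100."""
    scores: Dict[str, List[float]] = {}
    for env, per_agent in best_rewards.items():
        top = max(per_agent.values())
        if top <= 0:
            raise ValueError(f"non-positive best reward in {env}; shift rewards first")
        for agent, reward in per_agent.items():
            scores.setdefault(agent, []).append(100.0 * reward / top)
    return scores


def summarize(scores: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Mean and standard deviation of the normalized scores per agent."""
    return {agent: (mean(vals), pstdev(vals)) for agent, vals in scores.items()}


# Placeholder data: two hypothetical agents in two hypothetical environments.
placeholder = {
    "env_a": {"agent_x": 50.0, "agent_y": 25.0},
    "env_b": {"agent_x": 10.0, "agent_y": 20.0},
}
summary = summarize(normalize_per_env(placeholder))
```

An agent that wins one environment by a wide margin but trails elsewhere ends up with a high standard deviation under this scheme, which matches the kind of behavior described for a high-variance agent.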

4 Discussion

This work investigated the paradigm of curiosity-driven exploration in RL and extended it with the attention mechanism. We proposed three different methods that incorporate attention to utilize curiosity in a selective manner. The new models were tested in the Atari environments of OpenAI Gym and have shown consistent improvement over the baseline models used for comparison.


References

  • [1] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym.
  • [2] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros (2018) Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355.
  • [3] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [4] D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122.
  • [5] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu (2018) Stable baselines. GitHub.
  • [6] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
  • [7] A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and Z. Devito (2017) Automatic differentiation in PyTorch. Advances in Neural Information Processing Systems 30 (NIPS), pp. 1–4.
  • [8] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17.
  • [9] D. Pathak, D. Gandhi, and A. Gupta (2019) Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161.
  • [10] D. Pfau (2017) Connecting GANs, Actor-Critic methods and multilevel optimization.
  • [11] R. S. Sutton and A. G. Barto (2017) Reinforcement learning: an introduction.
  • [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need.
  • [13] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324.