Learning through reinforcement is a powerful concept: similarly to unsupervised learning, it does not need labeled samples, which are both time- and money-consuming to collect. Moreover, reinforcement learning (RL) resembles human learning, since it defines the learning process as an interaction between an agent and its environment. Interestingly, in some cases an RL agent was proven to optimize the same objective function as primates do, as noted in .
Speaking of optimization, an optimum can only be defined w.r.t. some given criterion, thus RL has to define its own, as is the case for AI in general. In optimization we generally face the so-called bias-variance problem, which has its own designation in the RL domain: the exploration-exploitation dilemma. Here, exploitation refers to preferring prior knowledge, i.e. bias; in some sense, exploitation can be thought of as a risk-minimizing course of action. Nonetheless, this policy may not always be the best option, so some compromise should be found between exploitation and exploration, which stands for a variance-preferring policy.
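The trade-off above can be made concrete with a minimal sketch: an epsilon-greedy rule on a toy multi-armed bandit, which exploits the best-known arm most of the time but keeps exploring with a small probability. All names and values here (`true_means`, `epsilon`, the step count) are illustrative assumptions, not part of the proposed method.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Exploit the best-known arm with prob. 1 - epsilon, else explore."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

rng = random.Random(0)
true_means = [0.2, 0.5, 0.8]      # hidden reward means of the three arms
q = [0.0] * 3                     # running value estimates
counts = [0] * 3
for _ in range(2000):
    a = epsilon_greedy(q, epsilon=0.1, rng=rng)
    r = rng.gauss(true_means[a], 0.1)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]  # incremental mean update
```

With pure exploitation (epsilon = 0) the agent could lock onto the first arm it happens to value; the exploration term is what lets it discover the better arm.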
RL agents are generally trained based on the feedback (rewards) collected from the environment. However, the environment's inner dynamics are not guaranteed to reward the agent in a way that corresponds with its goals: e.g. if rewards are never negative, the agent may collect an infinite amount of reward over an infinite time horizon, yet still fail an objective of finishing the task as quickly as possible. Thus, relying only on an extrinsic incentive may be impractical. Furthermore, rewards are sparse in most real-life scenarios, making it more difficult for the agent to assign credit to useful actions and to omit disadvantageous ones in the future.
To mitigate the above-mentioned conflict, several methods were developed that supplement the objective function of an RL agent to improve its performance. On a higher level, i.e. considering model-based RL, the World Models architecture  is one answer to these challenges. Curiosity-based methods, the main topic of this work, modify the loss function (or even the network architecture) by adding terms that incentivize exploration. Here, the Intrinsic Curiosity Module (ICM) of  is a major contribution, which this work builds upon extensively. ICM introduces inner (forward and inverse) dynamics models to quantify the prediction error of the next state and action, which is utilized as an intrinsic reward; a more extensive study can be found in . Another promising direction is the notion of disagreement, an ensemble-based curiosity formulation .
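The prediction-error idea behind such intrinsic rewards can be sketched numerically: a forward model predicts the next feature vector, and the squared error is paid out as a curiosity bonus. The linear model `W` and the scale `eta` below are toy stand-ins, not the actual ICM networks.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_t = rng.normal(size=4)        # features of state s_t
action = np.array([1.0, 0.0])     # one-hot action a_t
phi_next = rng.normal(size=4)     # features of the observed next state

# toy linear "forward model" predicting next-state features from (phi_t, a_t)
W = rng.normal(size=(4, 6))
phi_next_pred = W @ np.concatenate([phi_t, action])

# intrinsic reward: scaled mean squared prediction error
eta = 0.5
r_intrinsic = eta * np.mean((phi_next_pred - phi_next) ** 2)
```

States the model predicts poorly yield a large bonus, steering the agent toward the unexplored; this is also the failure mode behind the noisy-TV problem discussed later.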
In this paper, we explore applying attention  to incentivize exploration and improve generalization. We test the proposed methods on Atari games from OpenAI Gym . Our main contributions are as follows:
- AttA2C, an attention-aided Advantage Actor-Critic (A2C) variant,
- a feature- and action-selective extension of the ICM ,
- a rational curiosity formulation.
2 Attention-based Curiosity
After reviewing the approaches in the literature, we turn to the main contributions of this work, which combine advancements of different deep learning (DL) fields, mainly by introducing the attention mechanism into curiosity-driven RL (for its use in other domains, see e.g. ). Using attention, A2C  (one of the standard RL architectures) and the curiosity-driven exploration module of , modifications are carried out to include prior assumptions in the model.
2.1 AttA2C
The main idea of AttA2C comes from the GAN-Actor-Critic correspondence , summarized in Table 1. The only difference (typeset in bold in the table) is that while an Actor-Critic (AC) network feeds the same input/features to both of its heads, the GAN does quite the opposite.
Taking this similarity a step further, we conjecture that separating the feature space into two parts can be advantageous, as different features may be useful for the actor and for the critic. The proposed architecture aims to exploit this by utilizing attention. As shown in Figure 1, the input from the environment is fed through separate attention layers, so both heads can put emphasis on the most important parts of the latent representation, which is formulated as follows:
$$\pi(\cdot \mid s) = f_{\pi}\big(\mathrm{Att}_{\pi}(\varphi(s))\big), \qquad \hat{A}(s) = f_{A}\big(\mathrm{Att}_{A}(\varphi(s))\big),$$
where $f_{\pi}$ (which basically outputs the policy $\pi$) and $f_{A}$ denote the corresponding networks, $\varphi$ is the feature transform of the input (if present), while $\mathrm{Att}$ is a shorthand for the probability distribution of the attention mechanism. Subscripts denote that two separate attention layers are used to predict the policy $\pi$ and the advantage function $A$.
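A minimal numerical sketch of this separation, assuming the attention layers reduce to learned softmax weightings over the feature dimensions (the logits below are random placeholders for those learned networks):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
phi = rng.normal(size=8)           # phi(s): shared encoder features
logits_actor = rng.normal(size=8)  # stand-in for the actor's attention net
logits_critic = rng.normal(size=8) # stand-in for the critic's attention net

# each head receives its own attention-weighted view of the same features
feat_actor = softmax(logits_actor) * phi
feat_critic = softmax(logits_critic) * phi
```

The point of the design is visible even in this toy form: the two heads no longer see identical inputs, so features useful only for value estimation need not dominate the policy head, and vice versa.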
2.2 Feature- and action-selective ICM
Separating the most useful features can be advantageous not only for the AC network, but also for the curiosity formulation. Thus, we introduce the attention mechanism into ICM to make it action- and feature-selective. To do that, attention is applied onto the concatenation of the inputs of the forward and inverse models (denoted by the $fwd$ and $inv$ subscripts, respectively):
$$\hat{\varphi}(s_{t+1}) = f_{fwd}\big(\mathrm{Att}_{fwd}([\varphi(s_t);\, a_t])\big), \qquad \hat{a}_t = f_{inv}\big(\mathrm{Att}_{inv}([\varphi(s_t);\, \varphi(s_{t+1})])\big),$$
where $\hat{\cdot}$ stands for predicted values. The equations above describe the single-attention case, but we also experimented with double attention, in which the order of concatenation and attention is swapped: the attention-weighted features and actions are concatenated. The purpose of this second formulation is to separate the weighting between the two domains. Its main advantage is that double attention ensures that both the feature and the action space contain an emphasized subspace. This is not the case with single attention, which can be problematic if the two distributions are not in the same order of magnitude, as differing value scales would imply that one domain is more important.
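The contrast between the two placements can be sketched as follows, again assuming softmax weightings with random logits standing in for the learned attention networks:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
phi, act = rng.normal(size=5), rng.normal(size=3)   # features and action
logits_joint = rng.normal(size=8)                   # single-attention logits
logits_phi = rng.normal(size=5)                     # double: feature logits
logits_act = rng.normal(size=3)                     # double: action logits

# single attention: feature and action dims compete in ONE distribution,
# so one domain can absorb nearly all of the probability mass
single = softmax(logits_joint) * np.concatenate([phi, act])

# double attention: one distribution per domain, concatenated afterwards,
# guaranteeing each domain an emphasized subspace
double = np.concatenate([softmax(logits_phi) * phi,
                         softmax(logits_act) * act])
```

In the double case each softmax sums to one within its own domain, which is exactly the scale-independence argument made above.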
2.3 Rational curiosity
The curiosity-driven exploration strategies of  and  represent a rather powerful approach to training agents in sparse-reward settings. Nonetheless, these models assume that enforcing curiosity-driven exploration is a good choice for every state and action, which is not true in every scenario: consider e.g. the noisy-TV experiment of , where the agent has control over generating new (and thus unexpected) instances of random noise, resulting in high values of the ICM/disagreement losses.
To classify states based on the usefulness of curiosity, the state space $\mathcal{S}$ can be divided into two subsets, $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$, where the former denotes the subset of states with useful curiosity, while the latter contains the useless or harmful curiosity settings. Curiosity is termed harmful if being driven by it does not ensure the fulfillment of the original objective, not even in the long run; i.e. sacrificing short-term rewards to develop general skills is useful, but overfitting random noise is not.
Thus, the rational objective of curiosity-driven exploration would be to minimize the probability of being in states of $\mathcal{S}^{-}$, which corresponds to:
$$\min_{\pi} \; P\big(s_{t+1} \in \mathcal{S}^{-} \mid a_t \sim \pi\big),$$
i.e. it is intended to select actions which do not lead the agent into $\mathcal{S}^{-}$, independent of its actual state.
To achieve that, an attention network  is applied to the forward loss of the ICM module  (the inverse loss is left unchanged, since for each action prediction there is only one true action, so weighting has no effect). The new forward loss term is illustrated in Figure 2, where the round-headed arrow denotes the control input of the attention layer, i.e. it determines the weighting of the MSE (which is here decomposed into two steps, subtraction and squared mean) through a probability distribution. Thus, the loss function is modified in the following manner:
$$L_{RCM} = L_{A2C} + (1-\beta)\,L_{inv} + \beta\, L_{fwd}^{Att},$$
where RCM stands for the new model (Rational Curiosity Module), $L_{A2C}$ is the objective function of the A2C network, $L_{fwd}^{Att}$ is the attention-weighted forward and $L_{inv}$ the inverse dynamics loss, weighted with the scalar $\beta$. This formula is motivated by the fact that the attention weighting encodes whether the state of the agent is in $\mathcal{S}^{+}$ or in $\mathcal{S}^{-}$. We hypothesize that this formulation helps to utilize curiosity only in situations where the agent can benefit from it in the long run, and to omit curiosity otherwise.
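The attention-weighted forward loss can be sketched numerically; the loss values, the weighting scalar `beta`, and the random logits below are placeholders, and the combination of the three terms follows the modified objective described above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
sq_err = rng.normal(size=6) ** 2     # element-wise forward prediction errors
att = softmax(rng.normal(size=6))    # attention distribution over the errors

# forward "MSE" reweighted by attention instead of a uniform mean
loss_fwd_weighted = np.sum(att * sq_err)

loss_inv = 0.3                       # placeholder inverse-dynamics loss
loss_a2c = 1.2                       # placeholder A2C objective
beta = 0.2
loss_rcm = loss_a2c + (1 - beta) * loss_inv + beta * loss_fwd_weighted
```

If the attention collapses toward zero weight on noise-driven error dimensions, the curiosity bonus for those dimensions vanishes, which is the intended "rational" behavior.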
2.4 Experiments
The proposed methods are implemented in PyTorch , and the agents are based on the implementation of  (shown in Figure 3 and Figure 4). The extensions are indicated with gray boxes, as those layers are only part of some of the proposed methods discussed in the following. Five configurations were implemented: AttA2C, the single- and double-attention ICM, and RCM, with the traditional ICM agent  used as a baseline.
For testing, the Atari environments of OpenAI Gym  are used via the Stable-Baselines  package. Three environments were chosen, Breakout, Pong and Seaquest (as in ), which provide single-player (Breakout, Seaquest) and multiplayer (Pong) tasks with one (Breakout, Pong) and multiple (Seaquest) objectives. For each of the three, both the deterministic (v4) and the stochastic (v0, with nonzero action-repeat probability) variants were evaluated. All agents were trained on 4 parallel environments for 2,500,000 rollouts of 5 steps each, using 4 stacked frames.
We monitored two metrics to compare performance and generalization: the mean reward for the former, and the mean standard deviation of the features for the latter. Due to space restrictions, only the results for the more difficult, stochastic v0 environments are depicted here; the rest are available in the GitHub repository.
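The two monitored quantities can be computed as sketched below; the arrays are synthetic stand-ins for logged episode rewards and batches of feature vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
episode_rewards = rng.normal(loc=10.0, scale=2.0, size=100)  # logged returns
features = rng.normal(size=(256, 64))   # batch of 64-dim feature vectors

# performance metric: mean episodic reward
mean_reward = episode_rewards.mean()

# generalization proxy: std of each feature dim, averaged over dims
feature_std = features.std(axis=0).mean()
```

A higher mean feature standard deviation indicates a more spread-out latent representation, which is how the generalization comparison in the next paragraph should be read.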
For Breakout (Figure 5, shown with a confidence interval), the selective versions of ICM performed consistently well, but the AttA2C agent trained faster at the beginning and, after experiencing a jump in variance, managed to significantly outperform the other agents. The standard deviation of the feature space, shown in Figure 6, visualizes the general pattern, i.e. significantly higher values in the case of AttA2C (in every environment, though with smaller "jumps"). In our experiments, agents with a higher standard deviation (AttA2C aside) generally performed better. In the case of Pong (Figure 7), the single-attention ICM performed best, followed by the RCM agent. Here, AttA2C trained more slowly but achieved comparable rewards; the reason could be the smaller gradients due to the attention between the feature space and the actor/critic. The RCM agent was the best in the most complex environment, Seaquest, as Figure 8 shows; there, the selective ICM agents were overtaken by the original ICM, while AttA2C trained much more slowly.
Table 2: Mean normalized reward and standard deviation of each agent.
To provide a concise yet quantitative summary of the agents' performance in all environments (including both variants), we normalized the highest rewards of each agent, with 100 denoting the best performance. This way, the mean performances can be compared in a relative manner, as summarized in Table 2 (including a confidence interval). As it shows, the RCM agent performed best, followed by the single-attention ICM. Note that AttA2C had both the lowest mean and the highest variance, mainly due to its good performance in Breakout but moderate results in the other scenarios.
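The normalization step can be sketched as follows, with a synthetic peak-reward matrix (rows: environments, columns: agents) standing in for the measured values:

```python
import numpy as np

# peak reward of each agent in each environment (synthetic example values)
peak = np.array([[120.0,  80.0, 100.0],   # environment 1
                 [ 15.0,  21.0,  18.0]])  # environment 2

# within each environment, the best agent is mapped to 100 and the others
# are scaled relative to it
normalized = 100.0 * peak / peak.max(axis=1, keepdims=True)

# mean over environments yields one relative score per agent
mean_normalized = normalized.mean(axis=0)
```

This per-environment scaling removes the very different reward magnitudes of the games before averaging, which is what makes the cross-environment mean in Table 2 meaningful.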
3 Conclusion
This work investigated the paradigm of curiosity-driven exploration in RL, extending it with the attention mechanism. We proposed three different methods that incorporate attention to utilize curiosity in a selective manner. The new models were tested in OpenAI Gym environments and showed consistent improvement over the baseline models used for comparison.
References
- Brockman et al. (2016) OpenAI Gym. arXiv preprint. Cited by: §1, §2.4.
- Burda et al. (2018) Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355. Cited by: §1, §2.3.
- Goodfellow et al. (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §2.1.
- Ha and Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122. Cited by: §1.
- Hill et al. (2018) Stable Baselines. GitHub. https://github.com/hill-a/stable-baselines. Cited by: §2.4.
- Mnih et al. (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §2.
- Paszke et al. (2017) Automatic differentiation in PyTorch. Advances in Neural Information Processing Systems 30 (NIPS), pp. 1–4. Cited by: §2.4.
- Pathak et al. (2017) Curiosity-driven exploration by self-supervised prediction. pp. 16–17. Cited by: 2nd item, §1, §2.3, §2.4, §2.
- Pathak et al. (2019) Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161. Cited by: §1, §2.3, §2.4.
- Pfau and Vinyals (2017) Connecting GANs, Actor-Critic methods and multilevel optimization. Cited by: §2.1, Table 1.
- Sutton and Barto (2017) Reinforcement Learning: An Introduction. Cited by: §1.
- Vaswani et al. (2017) Attention is all you need. Cited by: §1, §2.3.
- Xu et al. (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324. Cited by: §2.