Reinforcement learning (RL) is a general formalism for modelling sequential decision making: it requires minimal assumptions about the task at hand and reduces the need for prior knowledge. By learning behaviour from scratch, RL agents have the potential to surpass human expertise or to tackle complex domains where human intuition is not applicable. In practice, however, generality is often traded for performance and efficiency, with RL practitioners tuning algorithms, architectures and hyper-parameters to the task at hand (Hessel et al., 2019). A side-effect is that the resulting methods can be brittle, or difficult to reproduce reliably (Nagarajan et al., 2018).
Exploration is one of the main aspects commonly designed or tuned specifically for the task being solved. Previous work has shown that large sample-efficiency gains are possible, for example, when the exploratory behaviour’s level of stochasticity is adjusted to the environment’s hazard rate (García & Fernández, 2015), or when an appropriate prior is used in large action spaces (Dulac-Arnold et al., 2015; Czarnecki et al., 2018; Vinyals et al., 2019). Exploration in the presence of function approximation should ideally be agent-centred: it ought to focus on generating data that supports the agent’s learning at its current parameters $\theta$, rather than on making progress on objective measures of information gathering. A useful notion here is learning progress (LP), defined as the improvement of the learned policy (Section 3).
The agent’s source of data is its behaviour policy. Beyond the conventional RL setting of a single stream of experience, distributed agents that interact with parallel copies of the environment can have multiple such data sources (Horgan et al., 2018). In this paper, we restrict ourselves to the setting where all behaviour policies are derived from a single set of learned parameters $\theta$, for example when $\theta$ parameterises an action-value function $Q_\theta(s, a)$. The behaviour policies are then given by $\pi_{\mathbf{z}}(a \mid s; \theta)$, where each modulation $\mathbf{z}$ leads to meaningfully different behaviour. This can be guaranteed if $\mathbf{z}$ is semantic (e.g. degree of stochasticity) and consistent across multiple time-steps. The latter is achieved by holding $\mathbf{z}$ fixed throughout each episode (Section 2).
We propose to estimate a proxy fitness $f(\mathbf{z})$ that is indicative of future learning progress (Section 3), separately for each modulation $\mathbf{z}$, and to adapt the distribution over modulations to maximize $f$, using a non-stationary multi-armed bandit that can exploit the factored structure of the modulations (Section 4). Figure 1 shows a diagram of all these components. The result is an autonomous adaptation of behaviour to the agent’s stage of learning (Section 5), varying across tasks and across time, and reducing the need for hyper-parameter tuning.
2 Modulated Behaviour
As usual in RL, the objective of the agent is to find a policy $\pi(a \mid s)$ that maximises the $\gamma$-discounted expected return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $R_{t+1}$ is the reward obtained during the transition from time $t$ to $t+1$. A common way to address this problem is to use methods that compute the action-value function $Q^\pi$ given by $Q^\pi(s, a) = \mathbb{E}\left[G_t \mid S_t = s, A_t = a\right]$, i.e. the expected return when starting from state $s$ with action $a$ and then following $\pi$ (Puterman, 1994).
A richer representation of $Q^\pi$ that aims to capture more information about the underlying distribution of $G_t$ has been proposed by Bellemare et al. (2017), and extended by Dabney et al. (2018). Instead of approximating only the mean of the return distribution, we approximate a discrete set of quantile values $q_k$ (where $1 \leq k \leq K$) such that $\mathbb{P}(G_t \leq q_k) = \frac{2k-1}{2K}$. Beyond the benefits in performance and representation learning (Such et al., 2018), these quantile estimates provide a way of inducing risk-sensitive behaviour. We approximate all $q_k(s, a)$ using a single deep neural network with parameters $\theta$, and define the evaluation policy as the greedy one with respect to the mean estimate:

$$\pi_\theta(s) = \arg\max_a \frac{1}{K} \sum_{k=1}^{K} q_k(s, a).$$
The behaviour policy is the central element of exploration: it generates the exploratory behaviour from which $\theta$ is learned, ideally in such a way as to reduce the total amount of experience required to achieve good performance. Instead of a single monolithic behaviour policy, we propose to use a modulated policy to support parameterized variation. Its modulations $\mathbf{z}$ should satisfy the following criteria: they need to (i) be impactful, having a direct and meaningful effect on generated behaviour; (ii) have small dimensionality, to quickly adapt to the needs of the learning algorithm; (iii) have interpretable semantics, to ease the choice of viable ranges and initialisation; and (iv) be frugal, in the sense that they are relatively simple and computationally inexpensive to apply. In this work, we consider five concrete types of such modulations:
Temperature: a Boltzmann softmax policy based on action-logits, modulated by temperature $T$.
Flat stochasticity: with probability $\epsilon$ the agent ignores the action distribution produced by the softmax, and samples an action uniformly at random ($\epsilon$-greedy).
Per-action biases: action-logit offsets $b_a$, to bias the agent to prefer some actions.
Action-repeat probability: with probability $\rho$, the previous action is repeated (Machado et al., 2017). This produces chains of repeated actions with expected length $\frac{1}{1-\rho}$.
Optimism: as the value function is represented by quantiles $q_k$, the aggregate estimate can be parameterised by an optimism exponent $\omega$, such that $\omega = 0$ recovers the default flat average, while positive values of $\omega$ imply optimism and negative ones pessimism. Near the risk-neutral point, our simple risk measure produces qualitatively similar transforms to those of Wang (2000).
We combine the above modulations to produce the overall $\mathbf{z}$-modulated policy

$$\pi_{\mathbf{z}}(a \mid s_t, a_{t-1}) = \rho\, \mathbb{I}_{\{a = a_{t-1}\}} + (1 - \rho) \left( \frac{\epsilon}{|\mathcal{A}|} + (1 - \epsilon)\, \frac{e^{\left(Q_\omega(s_t, a) + b_a\right)/T}}{\sum_{a'} e^{\left(Q_\omega(s_t, a') + b_{a'}\right)/T}} \right),$$

where $\mathbf{z} = (T, \epsilon, \rho, \omega, \mathbf{b})$, $\mathbb{I}$ is the indicator function, and the optimism-aggregated value $Q_\omega(s, a)$ is a weighted average of the quantile estimates $q_k(s, a)$, with uniform weights when $\omega = 0$ and weights shifted towards the upper (lower) quantiles for positive (negative) values of $\omega$.
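As a concrete sketch, the combined policy can be implemented in a few lines. The code below is an illustrative reconstruction, not the paper’s implementation; in particular, the exponential quantile weighting used for the optimism aggregation is an assumption, chosen only so that $\omega = 0$ recovers the flat average.

```python
import numpy as np

def modulated_policy(quantiles, prev_action, z, rng):
    """Sample an action from a z-modulated behaviour policy (sketch).

    quantiles: array of shape [num_actions, K] of quantile value estimates.
    z: dict with keys T (temperature), eps, rho (repeat probability),
       omega (optimism), bias (per-action offsets) -- names are illustrative.
    """
    T, eps, rho, omega, bias = z["T"], z["eps"], z["rho"], z["omega"], z["bias"]
    num_actions, K = quantiles.shape

    # Optimism-aggregated value: a weighted average of the quantile
    # estimates; uniform at omega = 0, shifted towards upper (lower)
    # quantiles for positive (negative) omega. This particular exponential
    # weighting is an illustrative assumption, not the paper's formula.
    taus = (np.arange(K) + 0.5) / K
    w = np.exp(omega * taus)
    w /= w.sum()
    q = quantiles @ w + bias

    # With probability rho, repeat the previous action.
    if prev_action is not None and rng.random() < rho:
        return prev_action
    # With probability eps, act uniformly at random.
    if rng.random() < eps:
        return int(rng.integers(num_actions))
    # Otherwise sample from the temperature-T Boltzmann distribution.
    logits = q / T
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(num_actions, p=probs))
```

At very low temperatures and with $\epsilon = \rho = 0$ this collapses to the greedy evaluation policy, while each modulation dimension independently perturbs it.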
Now that the behaviour policy can be modulated, the following two sections discuss the criteria and mechanisms for choosing modulations $\mathbf{z}$.
3 Exploration & the Effective Acquisition of Information
A key component of a successful reinforcement learning algorithm is the ability to acquire experience (information) that allows it to make expeditious progress towards its objective: learning to act in the environment so as to optimise returns over the relevant (potentially discounted) horizon. The types of experience that most benefit an agent’s ultimate performance may differ qualitatively throughout the course of learning — a behaviour modulation that is beneficial at the beginning of training often ceases to be beneficial by the end. An example analysis is shown in Figure 5. However, this analysis was conducted in hindsight, and in general how to generate such experience optimally — optimal exploration in any environment — remains an open problem.
One approach is to require exploration to be in service of the agent’s future learning progress (LP), and to optimise this quantity during learning. Although there are multiple ways of defining learning progress, in this work we opt for a task-related measure, namely the improvement of the policy in terms of expected return. This choice of measure corresponds to the local steepness of the learning curve of the evaluation policy $\pi_\theta$,

$$\text{LP}(\Delta\theta) := \mathbb{E}_{s_0}\left[V^{\pi_{\theta + \Delta\theta}}(s_0) - V^{\pi_\theta}(s_0)\right] \qquad (1)$$

where the expectation is over start states $s_0$, the value $V^{\pi}(s)$ is the $\gamma$-discounted return one would expect to obtain when starting in state $s$ and following policy $\pi$ afterwards, and $\Delta\theta$ is a change in the agent’s parameters. Note that this is still a limited criterion, as it is myopic and might be prone to local optima.
As prefaced in the last section, our goal here is to define a mechanism that can switch between different behaviour modulations depending on which of them seems most promising at this point in the training process. Thus, in order to adapt the distribution over modulations $\mathbf{z}$, we want to assess the expected LP when learning from data generated according to $\mathbf{z}$-modulated behaviour,

$$f_{\text{LP}}(\mathbf{z}) := \mathbb{E}_{\tau \sim \pi_{\mathbf{z}}}\left[\text{LP}(\Delta\theta_\tau)\right],$$

with $\Delta\theta_\tau$ the weight-change of learning from a trajectory $\tau$ generated at time $t$. This is a subjective utility measure, quantifying how useful $\tau$ is for a particular learning algorithm, at this stage in training.
Proxies for learning progress:
While $f_{\text{LP}}$ is a simple and clear progress metric, it is not readily available during training, so in practice a proxy fitness $f$ needs to be used. A key practical challenge is to construct $f$ from inexpensively measurable quantities, in a way that is sufficiently informative to effectively adapt the distribution over $\mathbf{z}$, while being robust to noise, approximation error, state distribution shift and mismatch between the proxies and learning progress. The ideal choice of $f$ is a matter of empirical study, and this paper only scratches the surface of this topic.
After some initial experimentation, we opted for the simple proxy of empirical (undiscounted) episodic return: $f(\tau) := \sum_{t \in \tau} R_t$. This is trivial to estimate, but it departs from $f_{\text{LP}}$ in a number of ways. First, it contains no learner-subjective information; this is partly mitigated by the joint use of prioritized replay (see Section 5.1), which over-samples high-error experience. Another mechanism by which the episodic return can be indicative of future learning is that an improved policy tends to be preceded by some higher-return episodes – in general, there is a lag between best-seen performance and reliably reproducing it. Second, the fitness is based on absolute returns, not on differences in returns as suggested by Equation 1. This makes no difference to the relative ordering of modulations (and the resulting probabilities induced by the bandit), but the non-stationarity takes a different form: a difference-based metric will appear stationary if policy performance keeps increasing at a steady rate, yet such a policy must be changing significantly to achieve that progress, so the selection mechanism should keep revisiting other modulations. In contrast, our absolute fitness naturally has this effect when paired with a non-stationary bandit, as described in the next section.
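The distinction between difference-based and absolute fitness can be made concrete with a toy example (the numbers are made up for illustration): a policy improving at a steady rate yields a constant difference-based fitness but a drifting absolute-return fitness, which is exactly what keeps a non-stationary bandit re-evaluating its arms.

```python
# Toy illustration: episodic returns improving at a steady rate.
returns = [10.0 + 2.0 * t for t in range(6)]

# Difference-based fitness: constant, i.e. looks stationary,
# even though the policy is changing significantly.
diff_fitness = [b - a for a, b in zip(returns, returns[1:])]

# Absolute-return fitness: keeps drifting upwards, so a
# non-stationary bandit keeps revisiting its arms.
abs_fitness = returns[1:]

print(diff_fitness)  # [2.0, 2.0, 2.0, 2.0, 2.0]
print(abs_fitness)   # [12.0, 14.0, 16.0, 18.0, 20.0]
```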
4 Non-stationary Bandit Tailored to Learning Progress
The most effective modulation scheme may differ throughout the course of learning. Instead of applying a single fixed modulation or a fixed blend, we propose an adaptive scheme, in which the choice of modulation is based on learning progress. The adaptation process is based on a non-stationary multi-armed bandit (Besbes et al., 2014; Raj & Kalyani, 2017), where each arm corresponds to a behaviour modulation $\mathbf{z}$. The non-stationarity reflects the nature of the learning progress, which depends on the time in training through the parameters $\theta_t$.
Because of the non-stationarity, the core challenge for such a bandit is to identify good modulation arms quickly, while only having access to a noisy, indirect proxy $f$ of the quantity of interest $f_{\text{LP}}$. However, our setting also presents an unusual advantage: the bandit does not need to identify the best arm, as in practice it suffices to spread probability among all arms that produce reasonably useful experience for learning.
Concretely, our bandit samples a modulation $\mathbf{z}$ according to the probability that it results in higher than usual fitness,

$$\mathbb{P}(\mathbf{z}) \propto \mathbb{P}\left(f(\mathbf{z}) \geq \bar{f}\right),$$

where the usual fitness $\bar{f}$ is measured as the mean fitness over a recent window of $h$ episodes. Note that $\bar{f}$ depends on the payoffs of the actually sampled modulations, allowing the bandit to become progressively more selective (if $\bar{f}$ keeps increasing).
For simplicity, $\mathbb{P}(f(\mathbf{z}) \geq \bar{f})$ is inferred from the empirical data within a recent time window of the same horizon $h$ that is used to compute $\bar{f}$. Concretely, the bandit samples arms in proportion to the preferences

$$p(\mathbf{z}) = \frac{\frac{1}{2} + \left|\left\{i : \mathbf{z}_i = \mathbf{z} \,\wedge\, f_i \geq \bar{f}\right\}\right|}{1 + n(\mathbf{z})},$$

where $n(\mathbf{z})$ is the number of times that $\mathbf{z}$ was chosen in the corresponding time window. The $\frac{1}{2}$ in the numerator encodes a prior preference in the absence of other evidence, as an additional (fictitious) sample.
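A minimal sketch of this windowed preference mechanism follows. It is illustrative only: the class and method names are ours, and the adaptive-horizon update (Appendix A) is omitted in favour of a fixed window.

```python
import random
from collections import deque

class WindowedBandit:
    """Non-stationary bandit sketch: sample arms in proportion to the
    empirical probability of exceeding the recent mean fitness, with a
    1/2 prior counted as one fictitious sample."""

    def __init__(self, arms, horizon=100, rng=None):
        self.arms = list(arms)
        self.history = deque(maxlen=horizon)  # recent (arm, fitness) pairs
        self.rng = rng or random.Random(0)

    def _preference(self, arm, mean_fitness):
        chosen = [f for a, f in self.history if a == arm]
        wins = sum(f >= mean_fitness for f in chosen)
        # 1/2 prior preference as an additional fictitious sample.
        return (0.5 + wins) / (1.0 + len(chosen))

    def sample(self):
        if not self.history:
            return self.rng.choice(self.arms)
        mean_fitness = sum(f for _, f in self.history) / len(self.history)
        prefs = [self._preference(a, mean_fitness) for a in self.arms]
        # Sample proportionally to preferences.
        r, acc = self.rng.random() * sum(prefs), 0.0
        for arm, p in zip(self.arms, prefs):
            acc += p
            if r <= acc:
                return arm
        return self.arms[-1]

    def update(self, arm, fitness):
        self.history.append((arm, fitness))
```

Because every arm retains at least the prior preference, probability mass is spread over all reasonably useful arms rather than concentrated on a single winner, matching the design goal described above.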
The choice of the horizon $h$ can be tuned as a hyper-parameter, but in order to remove all hyper-parameters from the bandit, we adapt it online instead. The update is based on a regression-accuracy criterion, weighted by how often the arm is pulled. For the full description, see Appendix A.
As we have seen in Section 2, our concrete modulations have additional factored structure that can be exploited. For this, we propose to use a separate sub-bandit (each defined as above) for each dimension of $\mathbf{z}$. The full modulation is assembled from the components $z^{(j)}$ sampled independently by the sub-bandits. Denoting by $K_j$ the number of arms for dimension $j$, the total number of arms to model is $\sum_j K_j$, a significant reduction from the $\prod_j K_j$ arms in the single flattened space. This allows for dramatically faster adaptation in the bandit (see Figure 2). On the other hand, from the perspective of each sub-bandit, there is now another source of non-stationarity, due to the other sub-bandits shifting their distributions.
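The arm-count reduction is easy to quantify. The per-dimension discretization sizes below are hypothetical (the paper's actual sets are in its Table 4); they only illustrate the sum-versus-product gap and the independent per-dimension sampling.

```python
import random

rng = random.Random(0)

# Hypothetical per-dimension arm counts for the five modulation dimensions.
arm_counts = {"temperature": 5, "epsilon": 5, "repeat": 4, "optimism": 4, "bias": 3}

# One flattened bandit must model the full product space...
flat_arms = 1
for k in arm_counts.values():
    flat_arms *= k

# ...while one sub-bandit per dimension models only the sum.
factored_arms = sum(arm_counts.values())

# The full modulation z is assembled from independent per-dimension samples.
z = {dim: rng.randrange(k) for dim, k in arm_counts.items()}

print(flat_arms, factored_arms)  # 1200 21
```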
5 Experiments

The central claim of this paper is that the best fixed hyper-parameters in hindsight for behaviour differ widely across tasks, and that an adaptive approach obtains similar performance to the best choice without costly per-task tuning. We report a broad collection of empirical results on Atari 2600 (Bellemare et al., 2013) that substantiate this claim, and validate the effectiveness of the proposed components. From our results, we distill qualitative descriptions of the adaptation dynamics. To isolate effects, independent experiments may use all or subsets of the dimensions of $\mathbf{z}$.
Two initial experiments in a toy grid-world setting are reported in Figure 2. They demonstrate that the proposed bandit works well in both stationary and non-stationary settings. Moreover, they highlight the benefits of using the exact learning progress $f_{\text{LP}}$, and the gap incurred when using less informative proxies $f$. They also indicate that the factored approach can deliver a substantial speed-up. Details of this setting are described in Appendix B.
5.1 Experimental Setup: Atari
Our Atari agent is a distributed system inspired by Impala (Espeholt et al., 2018) and Ape-X (Horgan et al., 2018), consisting of one learner (on GPU), multiple actors (on CPUs), and a bandit providing modulations to the actors. At the start of each episode, an actor queries the bandit for a modulation $\mathbf{z}$, and the learner for the latest network weights $\theta$. At episode end, it reports a fitness value $f$ to the bandit, and adds the collected experience to a replay table for the learner. For stability and reliability, we enforce a fixed ratio between experience generated and learning steps, making actors and learner run at the same pace. Our agents learn a policy from 200 million environment frames in 10–12h wall-clock time (compared to a GPU-week for the state-of-the-art Rainbow agent (Hessel et al., 2018)).
Besides distributed experience collection (i.e., improved experimental turnaround time), the algorithmic elements of the learner are similar to Rainbow: the updates use multi-step double Q-learning, with distributional quantile regression (Dabney et al., 2018) and prioritized experience replay (Schaul et al., 2015). All hyper-parameters (besides those determined by $\mathbf{z}$, chosen from the discretized sets in Table 4) are kept fixed across all games and all experiments; these are listed in Appendix C alongside the default values of $\mathbf{z}$. This allows us to generate competitive baseline results (in median human-normalised score) with a so-called reference setting (solid black in all learning curves), which fixes the exploration parameters to the values most commonly used in the literature.
If not mentioned otherwise, all aggregate results are across the 15 games listed in Appendix D, with multiple independent runs (seeds). Learning curves shown are evaluations of the greedy policy after the agent has experienced the corresponding number of environment frames. To aggregate scores across these fifteen games we use relative rank, an ordinal statistic that weighs each game equally (despite different score scales) and highlights relative differences between variants. Concretely, the performance outcome is defined as the average return of the greedy policy across the last 10% of the run (20 million frames). All outcomes are then jointly ranked, and the corresponding ranks are averaged across seeds. The averaged ranks are normalized to fall between 0 and 1, such that a normalized rank of 1 corresponds to all seeds of a variant being ranked at the top positions in the joint ranking. Finally, the relative ranks for each variant are averaged across all games. See also Appendix D.
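The relative-rank computation can be sketched as follows. This is an illustrative implementation with hypothetical names: it scales individual joint ranks to [0, 1] before averaging, a simplification of the normalization described above (where 1 corresponds to a variant's seeds occupying all top positions).

```python
import numpy as np

def relative_rank(outcomes_per_game):
    """Illustrative relative-rank statistic.

    outcomes_per_game: list of dicts, one per game, mapping
    variant name -> list of per-seed performance outcomes.
    Returns a dict mapping variant -> relative rank in [0, 1].
    """
    variant_ranks = {}
    for outcomes in outcomes_per_game:
        # Jointly rank all (variant, seed) outcomes within one game.
        flat = [(score, variant) for variant, scores in outcomes.items()
                for score in scores]
        flat.sort()  # rank 0 = worst outcome, len-1 = best
        n = len(flat)
        for rank, (_, variant) in enumerate(flat):
            # Scale ranks to [0, 1]; higher means better placement.
            variant_ranks.setdefault(variant, []).append(rank / (n - 1))
    # Average per-seed normalized ranks per variant, across all games.
    return {v: float(np.mean(r)) for v, r in variant_ranks.items()}
```

Because ranks (not raw scores) are averaged, each game contributes equally regardless of its score scale.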
5.2 Quantifying the Tuning Challenges
It is widely appreciated that the best hyper-parameters differ per Atari game. Figure 3 illustrates this point for multiple classes of modulations (different arms come out on top in different games), while Figure 4 quantifies this phenomenon across games and modulation classes and finds that this effect holds in general.
If early performance were indicative of final performance, the cost of tuning could be reduced. We quantify how much performance would be lost if the best fixed arm were chosen based on the first 10% of the run. Figure 5 shows that the mismatch is often substantial. This also indicates that the best choice is non-stationary: what is good in early learning may not be good later on — an issue sometimes addressed by hand-crafted schedules (e.g., DQN linearly decreases the value of $\epsilon$ over the course of training (Mnih et al., 2015)).
Another approach is to choose not to choose, that is, feed experience from the full set of choices to the learner, an approach taken, e.g., in (Horgan et al., 2018). However, this merely shifts the problem, as it in turn necessitates tuning this set of choices. Figure 6 shows that the difference between a naive and a carefully curated set can indeed be very large (Table 4 in Appendix C lists all these sets).
5.3 Adapting instead of Tuning
Our experiments suggest that adapting the distribution over $\mathbf{z}$ as learning progresses effectively addresses the three tuning challenges discussed above (per-task differences, early-late mismatch, handling sets). Figure 6 shows that the bandit can quickly suppress the choice of harmful elements in a non-curated set; in other words, the set does not need to be carefully tuned. At the same time, a game-specific schedule emerges from the non-stationary adaptation, for example recovering an $\epsilon$-schedule reminiscent of the hand-crafted one in DQN (Mnih et al., 2015) (see Figure 17 in Appendix E). Finally, the overall performance of the bandit is similar to that of the best fixed choice, and not far from an “oracle” that picks the best fixed $\mathbf{z}$ per game in hindsight (Figure 4).
A number of other interesting qualitative dynamics emerge in our setting (Appendix E): action biases are used initially and later suppressed (e.g., on Seaquest, Figure 19); the usefulness of action repeats varies across training (e.g., on H.E.R.O., Figure 18). Figure 16 looks at additional bandit baselines and finds that addressing the non-stationarity is critical (see Appendix E.3).
Finally, our approach generalizes beyond a single class of modulations: all proposed dimensions can adapt simultaneously within a single run, using a factored bandit to handle the combinatorial space. Figure 13 shows that this yields similar performance to adapting within one class. In a few games this even outperforms the best fixed choice in hindsight (since it is too expensive to investigate all individual combinations in the joint modulation space, we only vary $\mathbf{z}$ along a single dimension at a time); see Figure 6 (‘combo’) and Figure 7. Presumably this is due to the added dynamic adaptation to the learning process. On the entire set of 57 Atari games, the bandit achieves a similar median human-normalized score to our fixed, tuned reference setting, despite operating on 60 different combinations of modulations.
6 Related Work
Here we focus on two facets of our research: its relation to exploration, and hyper-parameter tuning.
First, our work can be seen as building on a rich literature on exploration through intrinsic motivation aimed at maximising learning progress. As the true learning progress is not readily available during training, much of this work targets one of a number of proxies: empirical return (Jaderberg et al., 2017); change in parameters, policy, or value function (Itti & Baldi, 2006); magnitude of training loss (Mirolli & Baldassarre, 2013; Schmidhuber, 1991); error reduction or derivative (Schmidhuber, 1991; Oudeyer et al., 2007); expected accuracy improvement (Misra et al., 2018); compression progress (Schmidhuber, 2008); reduction in uncertainty; improvement of value accuracy; or change in distribution of encountered states. Some of these have the desirable property that if the proxy is zero, so is LP. However, these proxies themselves may only be available in approximated form, and these approximations tend to be highly dependent on the state distribution under which they are evaluated, which is subject to continual shift due to the changes in policy. As a result, direct comparison between different learning algorithms under these proxies tends to be precarious.
Second, our adaptive behaviour modulation can be viewed as an alternative to per-task hyper-parameter tuning, or hyper-parameter tuning with cross-task transfer (Golovin et al., 2017), and can be compared to other works attempting to reduce the need for this common practice. (Note that the best-fixed-arm in our experiments is equivalent to explicitly tuning the modulations as hyper-parameters.) Though often performed manually, hyper-parameter tuning can be improved by random search (Bergstra et al., 2011), but in either case requires many full training cycles, whereas our work optimises the modulations on-the-fly during a single training run.
Like our method, Population Based Training (PBT; Jaderberg et al., 2017) and meta-gradient RL (Andrychowicz et al., 2016; Xu et al., 2018) dynamically adapt hyper-parameters throughout agent training. However, these methods operate in a distinctly different problem setting: PBT assumes the ability to run multiple independent learners in parallel with separate experience. Its cost grows linearly with the population size, but it can tune hyper-parameters that our approach cannot (such as learning rates). Meta-gradient RL, on the other hand, assumes that the fitness is a differentiable function of the hyper-parameters, which does not generally hold for exploration hyper-parameters.
While our method focuses on modulating behaviour in order to shape the experience stream for effective learning, a related but complementary approach is to filter or prioritize the generated experience when sampling from replay. Classically, replay prioritization has been based on TD error, a simple proxy for the learning progress conferred by an experience sample (Schaul et al., 2015). More recently, however, learned and thereby more adaptive prioritization schemes have been proposed (Zha et al., 2019), with (approximate) learning progress as the objective function.
7 Discussion & Future Work
Reiterating one of our key observations: the qualitative properties of the experience generated by an agent impact its learning, in a way that depends on characteristics of the task, the current learning parameters, and the design of the agent and its learning algorithm. We have demonstrated that by adaptively applying simple, direct modulations to the way an agent generates experience, we can improve the efficiency of learning by adapting to the dynamics of the learning process and the specific requirements of the task. Our proposed method (the code for our bandit implementations will be made available online when this paper is published, facilitating community extension and reproducibility) has the potential to accelerate RL research by reducing the burden of hyper-parameter tuning and the need for hand-designed exploration strategies, and does so without incurring the computational overhead of some of the alternatives.
The work presented in this paper represents a first step towards adapting modulations to the dynamics of learning, and there are many natural ways of extending it. For instance, such an approach need not be constrained to draw only from experience generated by the agent itself; the agent could also leverage demonstrations provided by humans or by other agents. Having an adaptive system control the use of data relieves system designers of the need to curate such data to be of high quality – an adaptive system can learn to simply ignore data sources that are not useful (or that have outlived their usefulness), as our bandit does when choosing which modulations to generate experience with (e.g., Figures 17, 18, 19).
A potential limitation of our proposal is the assumption that a modulation remains fixed for the duration of an episode. This restriction could be lifted, and one can imagine scenarios in which the modulation used might depend on time or the underlying state. For example, an agent might generate more useful exploratory experiences by having low stochasticity in the initial part of an episode, but switching to have higher entropy once it reaches an unexplored region of state space.
There is also considerable scope to expand the set of modulations used. A particularly promising avenue might be to consider adding noise in parameter space, and controlling the variance (Fortunato et al., 2018; Plappert et al., 2018). In addition, previous works have shown that agents can learn diverse behaviours conditioned on a latent policy embedding (Eysenbach et al., 2018; Haarnoja et al., 2018), goal (Ghosh et al., 2018; Nair et al., 2018) or task specification (Borsa et al., 2019). A bandit could potentially be exposed to modulating the choices in abstract task space, which could be a powerful driver for more directed exploration.
Acknowledgements

We thank Junyoung Chung, Nicolas Sonnerat, Katrina McKinney, Joseph Modayil, Thomas Degris, Michal Valko, Hado van Hasselt and Remi Munos for helpful discussions and feedback.
References

- Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. CoRR, abs/1606.04474, 2016. URL http://arxiv.org/abs/1606.04474.
- Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 449–458. JMLR. org, 2017.
- Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pp. 2546–2554, 2011.
- Besbes et al. (2014) Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pp. 199–207, 2014.
- Borsa et al. (2019) Diana Borsa, Andre Barreto, John Quan, Daniel J. Mankowitz, Hado van Hasselt, Remi Munos, David Silver, and Tom Schaul. Universal successor features approximators. In International Conference on Learning Representations, 2019.
- Chapelle & Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.
- Czarnecki et al. (2018) Wojciech Marian Czarnecki, Siddhant M Jayakumar, Max Jaderberg, Leonard Hasenclever, Yee Whye Teh, Simon Osindero, Nicolas Heess, and Razvan Pascanu. Mix&match-agent curricula for reinforcement learning. arXiv preprint arXiv:1806.01780, 2018.
- Dabney et al. (2018) Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Dulac-Arnold et al. (2015) Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
- Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- Eysenbach et al. (2018) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
- Fortunato et al. (2018) Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In International Conference on Learning Representations, 2018.
- García & Fernández (2015) Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42):1437–1480, 2015. URL http://jmlr.org/papers/v16/garcia15a.html.
- Ghosh et al. (2018) Dibya Ghosh, Abhishek Gupta, and Sergey Levine. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018.
- Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Elliot Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017. URL http://www.kdd.org/kdd2017/papers/view/google-vizier-a-service-for-black-box-optimization.
- Haarnoja et al. (2018) Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
- Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Hessel et al. (2019) Matteo Hessel, Hado van Hasselt, Joseph Modayil, and David Silver. On inductive biases in deep reinforcement learning. CoRR, abs/1907.02908, 2019.
- Horgan et al. (2018) Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
- Itti & Baldi (2006) Laurent Itti and Pierre F Baldi. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems, pp. 547–554, 2006.
- Jaderberg et al. (2017) Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population based training of neural networks. CoRR, abs/1711.09846, 2017.
- Kaufmann et al. (2012) Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, pp. 592–600, 2012.
- Linke et al. (2019) Cam Linke, Nadia M Ady, Martha White, Thomas Degris, and Adam White. Adapting behaviour via intrinsic reward: A survey and empirical study. arXiv preprint arXiv:1906.07865, 2019.
- Machado et al. (2017) Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. CoRR, abs/1709.06009, 2017.
- Mirolli & Baldassarre (2013) Marco Mirolli and Gianluca Baldassarre. Functions and mechanisms of intrinsic motivations. In Intrinsically Motivated Learning in Natural and Artificial Systems, pp. 49–72. Springer, 2013.
- Misra et al. (2018) Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, and Laurens Van Der Maaten. Learning by asking questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Nagarajan et al. (2018) Prabhat Nagarajan, Garrett Warnell, and Peter Stone. The impact of nondeterminism on reproducibility in deep reinforcement learning. In 2nd Reproducibility in Machine Learning Workshop at ICML 2018, July 2018.
- Nair et al. (2018) Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200, 2018.
- Oudeyer et al. (2007) Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
- Plappert et al. (2018) Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In International Conference on Learning Representations, 2018.
- Puterman (1994) Martin L. Puterman. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
- Raj & Kalyani (2017) Vishnu Raj and Sheetal Kalyani. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727, 2017.
- Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- Schmidhuber (1991) Jürgen Schmidhuber. Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, pp. 1458–1463, 1991.
- Schmidhuber (2008) Jürgen Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In Workshop on anticipatory behavior in adaptive learning systems, pp. 48–76. Springer, 2008.
- Such et al. (2018) Felipe Petroski Such, Vashisht Madhavan, Rosanne Liu, Rui Wang, Pablo Samuel Castro, Yulun Li, Ludwig Schubert, Marc Bellemare, Jeff Clune, and Joel Lehman. An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. arXiv preprint arXiv:1812.07069, 2018.
- Thompson (1933) William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019. doi: 10.1038/s41586-019-1724-z.
- Wang (2000) Shaun S Wang. A class of distortion operators for pricing financial and insurance risks. Journal of risk and insurance, 67(1):15–36, 2000.
- Xu et al. (2018) Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407, 2018.
- Zha et al. (2019) Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, and Xia Hu. Experience replay optimization. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4243–4249, 2019.
Appendix A Adaptive Bandit
In this section we briefly revisit the adaptive bandit proposed in Section 4 and provide more detail on how its horizon adapts. For clarity, we start by restating the setting and key quantities: the reference fitness $\bar f$, the active horizon $h$, and the observed fitness $f(z)$ of a modulation $z \in \mathcal{Z}$, where $\mathcal{Z}$ defines a modulation class.
The bandit samples a modulation $z$ according to the probability that it will result in higher-than-average fitness (within a recent length-$h$ window):

$$p(z) \propto \Pr\left(f(z) > \bar f\right).$$
Adapting $z$-probabilities. For simplicity, $\Pr(f(z) > \bar f)$ is inferred from the empirical data within a recent time window of the same horizon $h$ that is used to compute $\bar f$. Concretely, $p(z) \propto p_z$, with the preferences defined as

$$p_z = \frac{n_z^+ + \tfrac{1}{2}}{n_z + 1},$$

where $n_z$ is the number of times that $z$ was chosen in the corresponding time window and $n_z^+$ the number of those choices whose fitness exceeded $\bar f$. We encode a prior preference of $\tfrac{1}{2}$ in the absence of other evidence, as an additional (fictitious) sample.
Adapting the horizon. As motivated in Section 4, the discrete horizon size $h$ is adapted in order to improve the regression accuracy of the fitness estimates:

$$\ell_t(h) = \left(f_t - \hat f_h(z_t)\right)^2,$$

where $f_t$ is the fitness of the modulation $z_t$ chosen at time $t$, and $\hat f_h(z)$ is the empirical mean fitness of $z$ within the window of length $h$.
This objective is not differentiable w.r.t. $h$, so we perform a finite-difference step. As the horizon cannot grow beyond the amount of available data, the finite difference is not symmetric around $h$. Concretely, at every step we evaluate two candidates: one according to the current horizon $h$, namely $\hat f_h$, and one proposing a shrinkage of the effective horizon, $\hat f_{h'}$, where the new candidate horizon is given by

$$h' = \max\left(2|\mathcal{Z}|,\; \lfloor (1-\rho)\,h \rfloor\right),$$

with $\rho$ the maximal per-step shrinkage fraction.
Thus the new horizon proposes a shrinkage of up to a fixed fraction per step, but is never allowed to shrink below twice the number of arms $|\mathcal{Z}|$. Given a current sample of the fitness function $f_t$, we probe which of the two candidates, $h$ or $h'$, best explains it, by comparing $\ell_t(h)$ and $\ell_t(h')$. If the shorter horizon explains the new data point better, we interpret this as a sign of non-stationarity in the process and shrink the horizon proportionally to the relative reduction in prediction error, namely:

$$h \leftarrow h - \frac{\ell_t(h) - \ell_t(h')}{\ell_t(h)}\,(h - h').$$
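One way to sketch this horizon update is the following; it is a reconstruction under stated assumptions (in particular, the maximal per-step shrinkage fraction is a hypothetical parameter, and the proportional-shrinkage rule is inferred from the prose):

```python
def adapt_horizon(h, n_arms, err_h, err_short, max_shrink=0.1):
    """One finite-difference step on the horizon h.

    err_h      : squared prediction error of the estimate using horizon h
    err_short  : squared prediction error using the shorter candidate h'
    max_shrink : assumed maximal per-step shrinkage fraction (hypothetical)
    """
    # shorter candidate: shrink by up to max_shrink, floored at 2 * |Z|
    h_short = max(2 * n_arms, int((1.0 - max_shrink) * h))
    if err_short < err_h and err_h > 0:
        # shrink proportionally to the relative prediction-error reduction
        rel = (err_h - err_short) / err_h
        h = max(h_short, int(round(h - rel * (h - h_short))))
    return h
```

If the current horizon explains the new sample at least as well, the horizon is left unchanged, so under stationary conditions it simply grows with the available data.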
Factored bandits. In the case of factored sub-bandits, each maintains its own independent horizon $h$; we have not investigated whether sharing it would be beneficial.
Appendix B LavaWorld Experiments
In this section we describe in further detail the experiments behind Figure 2. These were conducted on LavaWorld, a small (96 states), deterministic four-rooms-style navigation domain with deadly lava instead of walls. We chose this domain to illustrate what our proposed adaptive mechanism (Section 4) does under somewhat idealised conditions, where learning is tabular and we can compute the ground truth and assess oracle performance. This investigation lets us first see how well the bandit copes with the kind of non-stationarity that arises from an RL learning process (entangled with exploration).
In this setting, we consider three modulation classes, one of which can boost the logits for any of the available actions. Together, the sets of modulations yield a collection of unique modulated (stochastic) policies for each Q-function (see Figure 8). Q-functions are look-up tables over the domain's 96 unique states. The single start state is in the top-left corner, and the single rewarding state, which is also absorbing, is in the top-left corner of the top-right room. We treat the discount as a probability of continuation, and terminate episodes stochastically based on it, or when the agent hits lava.
We study the behaviour of the system in two settings: one stationary, one non-stationary.
In the stationary setting (Figure 2, left), we considered modulated behaviours that do not change over time. They are computed from a 'dummy' action-value function, as described in Section 2, but this value does not change over time. The learning process is tabular and independent of this behaviour-generating value. In this case, we compute the cumulative probability that an executed policy encounters the single sparse reward ('expected reward') as a function of the number of episodes, where we assume that the policy will be perfect after the first reward event. The signal given to the oracle bandit is the true (stationary) expectation of this event for every modulation $z$. Given the extreme reward sparsity and the absence of learning, there is no obvious choice for a proxy measure of fitness, so the non-oracle bandit reverts to a uniform choice over modulations. The reference Q-values are the optimal ones for the task. The results presented are averaged over 10 runs.
Secondly, we considered a non-stationary setting (Figure 2, right), similar to the one above but where the Q-values behind the modulated behaviour are learned over time. This is akin to the actual regime of operation the system would encounter in practice, although the learning of these values is idealised. The Q-values are initialised at zero, and the only update (given the absence of rewards) is to suppress the Q-value of any encountered transition that hits lava; this learning update happens instantaneously. In other words, the agent does a random walk, but over time it dies in many different ways. The reported 'expected reward' is again the probability that the induced policy encounters the final reward. In this case, a reasonable proxy for fitness is the binary signal of whether something was learned from the last episode (by encountering a new into-lava transition), which obtains a large fraction of the oracle's performance.
Appendix C Atari Hyper-parameters
In this section we record the settings used in our Atari experiments: hyper-parameters of the agents (Tables 2 and 3), environment and preprocessing (Table 1), and modulation sets for our behaviours (Table 4).
Unless specified otherwise, we use the following modulations by default: $\epsilon$, temperature $T$ (for tie-breaking between equal-valued actions), biases $b$, optimism $\omega$, and repeat probability $\rho$, set to the most commonly used values in the literature. On learning-curve plots, this fixed setting is always shown in black.
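To illustrate how these modulation classes act on a single set of Q-values, the following is a minimal sketch of a modulated behaviour policy. The function name and the exact treatment of each modulation are our assumptions; optimism, which in our setting modulates the learning target rather than the action-sampling step, is omitted:

```python
import numpy as np

def modulated_policy(q, prev_action, epsilon, temperature, bias, repeat_prob, rng):
    """Sample an action from a modulated behaviour policy (illustrative sketch).

    q           : action values for the current state, shape [num_actions]
    epsilon     : probability of acting uniformly at random
    temperature : softmax temperature for tie-breaking between near-equal values
    bias        : per-action additive logit bias, shape [num_actions]
    repeat_prob : probability of repeating the previous action
    """
    num_actions = len(q)
    if prev_action is not None and rng.random() < repeat_prob:
        return prev_action
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    logits = q / max(temperature, 1e-8) + bias
    logits -= logits.max()  # numerical stability before exponentiation
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(num_actions, p=p))
```

Holding one tuple of modulation values fixed for a whole episode, as described in Section 2, is what makes the resulting behaviours meaningfully different and consistent across time-steps.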
The hyper-parameters used by our Atari agent are close to the defaults used in the literature, with a few modifications to improve learning stability in our highly distributed setting. For preprocessing and agent architecture, we use the DQN settings detailed in Tables 1 and 2. Table 3 summarises the remaining hyper-parameters.
Table 1: Environment and preprocessing settings.

| Setting | Value |
| --- | --- |
| Observation down-sampling | (84, 84) |
| Reward clipping | [-1, 1] |
| Action repeats | 4 |
| Episode termination on loss of life | False |
| Max frames per episode | 108K |
Table 2: Agent network architecture (convolutional layers).

| Layer parameter | Values |
| --- | --- |
| Channels | 32, 64, 64 |
| Filter sizes | 8×8, 4×4, 3×3 |
| Strides | 4, 2, 1 |
Table 3: Agent hyper-parameters.

| Hyper-parameter | Value |
| --- | --- |
| $n$ for $n$-step learning | 3 |
| Number of quantile values | 11 |
| Prioritization type | proportional to absolute TD error |
| Prioritization importance-sampling exponent | 0.3 |
| Target network update period | 1K learner updates |
| Replay memory size | 1M transitions |
| Min history to start learning | 80K frames |
| Samples-to-insertion ratio | 8 |
| Huber loss parameter | 1 |
| Max gradient norm | 10 |
| Number of actors | 40 |
Table 4: Modulation sets used for behaviour generation.

| Modulation | Curated set | Extended set |
| --- | --- | --- |
| Temperature | 0.0001, 0.001, 0.01 | 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10 |
| Epsilon | 0, 0.001, 0.01, 0.1 | 0, 0.001, 0.01, 0.1, 0.2, 0.5, 1 |
| Repeat probability | 0, 0.25, 0.5 | 0, 0.25, 0.5, 0.66, 0.75, 0.8, 0.9 |
| Optimism | -1, 0, 1, 2, 10 | -10, -2, -1, 0, 1, 2, 10 |
| Action biases | 0 | -1, 0, 0.01, 0.1 |
Appendix D Evaluation
In this section we detail the evaluation settings used for our Atari experiments, as well as the metrics used to aggregate results across games in Figure 5 in the main text.
We evaluate on a set of 15 games chosen for their different learning characteristics: Asterix, Breakout, Demon Attack, Frostbite, H.E.R.O., Ms. Pac-Man, Private Eye, Q*bert, Seaquest, Space Invaders, Star Gunner, Tennis, Venture, Yars’ Revenge and Zaxxon.
Human-normalised scores are computed following the procedure in Mnih et al. (2015), but differ in that we evaluate online (without interrupting training), average over policies with different weights (rather than freezing them), and aggregate over a larger window of frames per point.
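The normalisation itself follows the standard formula from that procedure; a minimal sketch, assuming per-game random-play and human reference scores are available:

```python
def human_normalised_score(agent, random_score, human_score):
    """Human-normalised score as in Mnih et al. (2015):
    100 * (agent - random) / (human - random).

    0 corresponds to random play, 100 to human-level performance.
    """
    return 100.0 * (agent - random_score) / (human_score - random_score)
```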
The relative rank statistics in Section 5.1 are normalised to fall between 0 and 1 for any set of outcomes and any number of seeds. For this we simply scale the raw average ranks by their minimal and maximal possible values, accounting for the number of other outcomes each variant is jointly ranked with (ties).
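A simplified sketch of this scaling follows; since the exact joint-ranking (tie) correction is not fully recoverable here, this hypothetical version simply assumes ranks run from 1 (best) to the number of outcomes:

```python
def normalised_relative_rank(avg_rank, n_outcomes):
    """Scale an average rank into [0, 1].

    avg_rank   : average rank of a variant across seeds (1 = best)
    n_outcomes : total number of outcomes being jointly ranked
    """
    # 0 corresponds to always ranked best, 1 to always ranked worst
    return (avg_rank - 1.0) / (n_outcomes - 1.0)
```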
In Figure 5, we use a different metric to quantify the effect of committing to a modulation early in the learning process. For this we compute $\bar R(z)$, the average episode return of modulation $z$ at the end of training across all seeds, and $z_{\text{early}}$, the modulation with the highest average episode return at the beginning of training (first 10% of the run). Based on this, we compute the normalised drop in performance resulting from committing prematurely to $z_{\text{early}}$:

$$\Delta(\mathcal{Z}) = \frac{\max_{z \in \mathcal{Z}} \bar R(z) - \bar R(z_{\text{early}})}{\max_{z \in \mathcal{Z}} \bar R(z) - \min_{z \in \mathcal{Z}} \bar R(z)},$$

where $\mathcal{Z}$ is the modulation class considered in the study ($\epsilon$'s, temperatures $T$, action-repeat probabilities $\rho$, and optimism $\omega$).
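This premature-commitment metric can be sketched as follows; the exact normalisation is a reconstruction from the prose (drop relative to the spread between the best and worst final arms), not a verbatim restatement of the authors' definition:

```python
import numpy as np

def premature_commitment_drop(final_returns, early_returns):
    """Normalised performance drop from committing to the arm that looked
    best early in training.

    final_returns : {modulation: average return at the end of training}
    early_returns : {modulation: average return over the first 10% of training}
    """
    # modulation that looked best early in training
    z_early = max(early_returns, key=early_returns.get)
    vals = np.array(list(final_returns.values()), dtype=float)
    best, worst = vals.max(), vals.min()
    # 0: early winner is also the final best arm; 1: it is the final worst arm
    return (best - final_returns[z_early]) / (best - worst)
```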
Appendix E Additional Atari Results
E.1 Per-Task Non-stationarity
In this section we report detailed results that were presented in aggregate in Figure 4. Specifically, these results show that (1) the most effective set of modulations varies by game, (2) different individual modulations are preferable in different games, and (3) the non-stationary bandit performs comparably to the best choice of modulation class on all games.
In the next few figures we give per-game performance comparing fixed-arm modulations with the adaptive bandit behaviour for the modulation classes epsilon (Figure 9), temperature (Figure 10), action repeats (Figure 11), and optimism (Figure 12). For reference, we include the performance of a uniform bandit over the same modulation set (dashed line in the figures), as well as the best parameter setting across games (solid black reference line).
E.2 Combinatorial Bandits
In this section we include additional results for the different combinatorial bandits run on the curated and extended sets (see Table 4). Most of these experiments were run on subsets of the curated/extended sets across modulations, rather than on the full Cartesian product. As a convention, whenever a modulation class is omitted from an experiment's name, the value for that class is set to the default reference value reported in Section C (reference modulations). Thus, for instance, if we refer to a per-class-modulation bandit, say optimism, the modulations for this class would be the ones reported in Table 4 (line 4), while all other modulation dimensions would be kept fixed at their reference values.
Figure 13 shows that the combined bandit performs competitively compared with per-factor bandits (the same adaptive bandit but restricted to one class of modulation). In particular, it is worth noting that the per-factor bandit that performs best is game dependent. Nevertheless, the combined bandit, considering modulations across many of these dimensions, manages to recover a competitive performance across most games.
In Figure 14, we include a comparison between the combinatorial bandit on the curated set of modulation classes, its uniform counterpart on the same set, and the reference fixed arm, across games. The first thing to notice is that on the curated set the uniform bandit is quite competitive, validating our initial observation that the problem of tuning can be shifted one level up by carefully curating a set of good candidates. The adaptive mechanism tends to fall in between two extremes, uninformed arm selection and tuned arm selection: it can recover behaviour close to uniform in some games (H.E.R.O., Yars' Revenge), while maintaining the ability to recover something akin to best-arm identification in others (see Asterix). Moreover, there are (rare) instances, see Zaxxon, where the bandit outperforms both of these extremes.
In Figure 15 we include a plot comparing the performance of a combinatorial bandit on the full curated and extended modulation sets. These are bandits acting across all the modulation classes outlined in Table 4. As a reference, we include the performance of the per-class modulation bandits, as in Figure 13. The bias modulation class was omitted, as modulating exclusively within this class leads to very poor performance: policies tend to lock into a particular preference. We can also see a negative impact on overall performance when adding a bias set to the set of modulations the bandit operates on, as shown in Figure 15 (magenta line). This is why we opted not to include this set in the extended bandit experiments reported in Figure 6 and restricted ourselves to the other extended modulation sets.
E.3 Other Bandit Baselines
Finally, in Figure 16 we provide a comparison to more established bandit algorithms, UCB (Auer, 2002; Kaufmann et al., 2012) and Thompson Sampling (Thompson, 1933; Chapelle & Li, 2011), that would need to learn which modulation to use. The results in this figure are averaged across seeds, and jointly modulate across three classes. We also include the resulting learning curves for our proposed adaptation mechanism, the bandit described in Section 4, as well as for the uniform bandit. The first thing to notice is that the stationary bandits, UCB and Thompson Sampling, are sometimes significantly worse than uniform, indicating that they prematurely lock into modulations that may be good initially but do not help long-term performance. We have already seen signs of this non-stationarity in the analysis in Figure 5, which shows that early commitment based on the evidence seen in the first part of training can be premature and might hinder overall performance. In contrast, our proposed bandit can adapt to the non-stationarity of the learning process, resulting in performance on par with or better than these baselines in most games (with only one exception, Yars' Revenge, where it matches the performance of the uniform bandit). In that context, it is worth highlighting that the best alternative baseline (UCB, Thompson Sampling, uniform) differs from game to game, so outperforming all of them is a significant result. Moreover, UCB and Thompson Sampling still require some hyper-parameter tuning (we report results for the best setting we found), and thus add extra tuning complexity, while our approach is hyper-parameter-free.
E.4 Behaviour of the Non-stationary Bandit
In the previous section we saw that the bandit is able to effectively modulate the behaviour policy to give robust performance across games. In this section we dive in deeper to analyse the precise modulations applied by the bandit over time and per-game. We see that the bandit does indeed adaptively change the modulation over time and in a game-specific way.
With these results we aim to examine the bandit's choices across the learning process, and in multiple games. Most of the adaptation occurs in the early stages of training.
Epsilon schedules: Figure 17 shows the evolution of the value of $\epsilon$ chosen by the bandit over time. We see that this gives rise to an adaptive epsilon schedule: early in training (usually the first few million frames) large values are preferred, and as training progresses smaller values are preferred. This leads to a gradually greedier behaviour policy.
Action repeats decay: Figure 18 shows the bandit-modulated values for action repeats. We observe that as the agent progresses it can benefit from more resolution in the policy. Thus, the agent adaptively comes to prefer low action-repeat probabilities over time, after a prolonged period of indifference early in training.
Seaquest: Figure 19 shows the evolution of the sampling distributions of a combined bandit with access to all modulation dimensions. Despite there being over 7.5 million combinations of modulation values, the bandit efficiently learns the quality of different arms. For instance, the agent quickly learns to prefer the down action over the up action, to avoid extreme left/right biases, and to avoid suppressing the fire action, which is consistent with our intuition (in Seaquest, the player must move below the sea level to receive points, avoid the left and right boundaries, and fire at incoming enemies). Moreover, as in the case of the per-class bandits discussed above, the combined bandit prefers more stochastic choices of $\epsilon$ and temperature at the beginning of training and more deterministic settings later on, and the preferred action-repeat probability decays over time.