1 Introduction
Reinforcement learning has been successfully used in numerous application domains, from robot control (DBLP:conf/cdc/GrondmanBB12, ), (DBLP:journals/tsmc/AdamBB12, ) to investment management and board games. Recent advances in the field such as (DBLP:journals/tsmc/GrondmanBLB12, ), (icml2014c1_silver14, ), (DBLP:journals/corr/ODonoghueMKM16, ), (DBLP:journals/corr/NachumNXS17, ) have enabled state of the art reinforcement learning systems to steer autonomous cars, compete with the human champions in games such as Chess and Go, as well as achieve superhuman performance in many of the old classic Atari games (DBLP:journals/corr/MnihBMGLHSK16, ), (DBLP:journals/corr/SchulmanLMJA15, ). Reinforcement learning in its most basic form includes an agent that interacts with its environment by taking actions. As a result of these actions there is a change of state and the agent might also receive a reward. The goal of the system is to learn an optimal policy, i.e. learning how to act so that it can maximize its cumulative reward.
In many practical reinforcement learning settings, we do not have access to a perfect simulator of the environment and interacting with the real environment is costly. In addition, we want to gain good rewards even early on, while the agent is still learning the environment dynamics. Modelbased reinforcement learning assumes a priori a certain model class for the environment dynamics which can be beneficial (e.g., reducing the interactions with the environment needed to learn) if the model is close to the true dynamics, but detrimental (biasing the agent’s actions) if it is not. While this restriction is removed in modelfree reinforcement learning, the architecture of the agent is also encoding prior beliefs about the environment and its transition dynamics (for example regarding the level of noise, assuming full observability or not, etc.). At the same time, if interactions with the environment are costly, learning multiple agents that represent different beliefs about the environment dynamics quickly becomes expensive. In this paper we are trying to systematically address two questions: “how can we maximize rewards while learning?” and “how to select the optimal agent configuration (and consequently policy) after a finite number of learning steps?”
In order to address these questions we consider a bandit framework on top of a set of reinforcement learning agents. Our goal then is to take actions (pull levers in bandit terminology, which corresponds to selecting one of several RL agents in this framework) so as to minimize the regret, i.e. the difference in reward between the sequence of selections that we have made and the sequence of selections performed by an oracle that always makes the best choice.
In addition to the bandit framework, we also consider augmenting the environment reward with an agent specific surrogate reward, that captures the inherent ability of the agent to understand the dynamics of the environment. This surrogate reward is inspired by the Variational Information Maximizing Exploration idea (DBLP:journals/corr/HouthooftCDSTA16, ), where a similar metric, which captures the surprise of the agent regarding the environment dynamics, is used to favor exploration of the state space (based on equation 4). However, in this work the surrogate reward is used to facilitate early selection between alternative agent architectures and gather more reward while learning.
The rest of the paper is structured as follows. Some background information on bandits and reinforcement learning, as well as an outline of the state of the art, is provided in Section 2. The proposed framework for optimizing the reward while learning and facilitating the selection of an optimal agent after a finite number of steps is elaborated in Section 3. This part introduces also the idea of an augmented surrogate reward that is both smoother and less sparse than the actual reward of the environment and discusses how such a reward can be derived from agents that employ Bayesian Neural Networks. Experimental results for several environments including some classic locomotion reinforcement learning problems as well as a few more modern Atari environments are given in Section 4. Finally section 5 provides some concluding remarks.
2 Background
In reinforcement learning we assume a Markov Decision Process (MDP), defined by its state space X, its action space U, its transition probability function
, and its reward function . At every discrete time step , given the state , the agent selects an action according to a policy . The probability that the next state falls in a region of the state space can then be expressed by . For any and , is assumed to define a valid probability density of the argument “”. After the transition to , a (possibly stochastic) reward is derived according to the reward function :. The reward depends on the current and next state, as well as the current action – but even conditioned on these it is generally drawn from a probability distribution (for non deterministic MPDs). For deterministic MDPs, the transition probability function
can be replaced by the transition function, , and the reward can be fully determined by the current state and action: , .The goal is to find an optimal policy that maximizes the expected return for every initial state . In practice, the accumulated return must be maximized using only feedback about the immediate, onestep reward. Every policy is characterized by its stateaction value function (Qfunction), , which gives the return when starting in a given state, applying a given action, and following from that point onward. For any , is unique and can be found by solving the Bellman expectation equation, while the optimal Qfunction is defined as , and satisfies the Bellman optimality equation.
Multiarm bandits can be considered a special case of reinforcement learning, in which the Markov decision process only has one state and the actions cannot change the environment, but the MDP is not deterministic in the sense that the rewards are drawn from a probability distribution. Formally, a bandit seeks to minimize the difference (regret) between the actions of an oracle that always selects the best action and the choices we make using the bandit strategy.
Different types of bandit strategies are possible, from simple greedy, to more sophisticated techniques like SoftMax (probability matching bandit), UCB1 (a bandit based on the optimism in the face of uncertainty principle) and EXP3 (adversarial bandit). For a more thorough discussion of bandit algorithms please refer to dedicated survey papers like (Vermorel2005multi, ).
3 Multiarm bandit framework
In many real world reinforcement learning settings, a good simulator of the environment is not available. Further, interactions with the environment are costly, either directly (e.g., having an immediate negative monetary impact in case of incorrect decisions) or indirectly (e.g., negatively affecting customer perception). In such cases, we are particularly motivated to get high rewards even early on while the agent is still learning the environment dynamics. In modelfree reinforcement learning, the agents architecture reflects our prior beliefs about the environment and its transition dynamics (for example regarding the level of noise, observability assumptions, etc.). Typically though, we are not certain in advance which of these prior beliefs are true and to what extent, therefore trying multiple agent architectures might be beneficial. However, if interactions with the environment are costly, learning multiple agents that represent different beliefs about the environment dynamics quickly becomes expensive. The goal of the multiarm bandit framework is to maximize rewards while learning, but to also provide a systematic method for selecting an optimal agent from a set of candidates, after a finite number of learning steps.
The bandit framework is supplemented by agent specific surrogate rewards that capture the inherent ability of the agents to model the environment. These surrogate rewards are inspired by the Variational Information Maximizing Exploration concept, where a metric capturing the surprise of an agent regarding the environment dynamics is used to promote exploration of the state space. The key idea behind introducing a surrogate reward metric is that the true environment rewards can be very sparse and noisy, especially early on while the agents are still learning (DBLP:journals/corr/HouthooftCDSTA16, ). Since we are interested in putting the bulk of our actions on agents that are at least promising in their ability to learn the dynamics of the environment (compared for the same amount of exploration), this less sparse reward is a good way to achieve that goal. One of the key contributions of our work over standard VIME is that we use the surrogate reward as a way to guide the selection of different reinforcement learning algorithms by the bandits. More specifically, the properties of this surrogate reward being smoother and capturing how “surprised” the agents are from the responses of the environment are also very beneficial, both with respect to focusing early onto the most promising agents, as well as on being able to select the best agent after a relatively small number of steps.
It has been shown in (DBLP:journals/corr/abs11035708, ) that it is beneficial for agents to take actions that maximize the reduction in uncertainty about the environment dynamics. This can be formalized as taking a sequence of actions that maximize the sum of reductions in entropy. With the history of the agents up until time step as , we can write the sum of entropy reductions as
(1) 
As indicated in (DBLP:journals/corr/HouthooftCDSTA16, ), according to information theory, the individual terms express the mutual information between the next state distribution and the model parameter distribution , namely . This implies that an agent is encouraged to take actions that are as informative as possible regarding the environment transition dynamics. This mutual information can be written as:
(2) 
where the KL divergence term is expressing the difference between the new and the old beliefs of the agent regarding the environment dynamics, and the expectation is with respect to all possible next states according to the true dynamics. Under these assumptions the above formulation can also be interpreted as information gain (DBLP:journals/corr/HouthooftCDSTA16, ).
Formally, since calculating the posterior for a dataset is not feasible, we follow VIME and approximate it through an alternative distribution , parametrized by . In this setting we seek to minimize through maximization of the variational lower bound . The latter is formulated as:
(3) 
The information gain term can then can be expressed as:
(4) 
where represents the updated and
the old parameters of the agent’s belief regarding the environment dynamics. Since Bayesian Neural Networks are expressive parameterized models that can maintain a distribution over their weights, they are excellent candidates for realizing reinforcement learning agents which can directly provide an estimate of the information gain via equation
4.In our setting we are interested in keeping a metric that captures how certain a given agent is about the environment dynamics after a fixed amount of exploration, as opposed to a metric that only favors exploration like in the original VIME paper (DBLP:journals/corr/HouthooftCDSTA16, ). The reason for this deviation is that we are on the one hand interested to explore the environment in order to learn a good policy early (similar to standard VIME), but on the other hand we want to be able to compare between the different reinforcement learning algorithms and architectures, using their “surprise” for a fixed amount of environment iterations as a metric. Therefore we consider a moving average over the normalized inverse formulation of the previous term for a specified number of environment interactions (since less “surprise” means better understanding of the environment, if the number of interactions is fixed).
4 Experiments
In order to validate the idea of using a bandit framework on top of a collection of reinforcement learning agents, we use a number of well known environments that include some traditional reinforcement learning tasks (e.g., Acrobot, Cartpole), some basic locomotion tasks (e.g., Mountain Car, Lunar Lander), as well as some classic games such as KungFu, Ms. Pacman, Space Invaders, Zaxxon, Seaquest, and Atlantis from the Atari collection (DBLP:journals/corr/BrockmanCPSSTZ16, ).
We analyze different types of bandit algorithms, each of which has to select among a set of different architectures of reinforcement learning agents: vanilla VIME for the simple locomotion tasks (with different sets of hyperparameters), and VIME combined with Deep QNetwork (DQN) or Asynchronous ActorCritic Agents (A3C) for the Atari environments (again with several hyperparameter options, one captured in each agent). In addition, we also compare the results of the best agents in our framework with literature results of DQN agents from (mnihdqn2015, ), as well as with the more recent massively parallelized form of the same framework, Gorila (DBLP:journals/corr/NairSBAFMPSBPLM15, ).
Before analyzing the behaviour of bandits on top of reinforcement learning agents, we investigate single agents. The first step is to show that the surrogate reward (consisting of the true reward and the information gain) is promising for early selection of the reinforcement learning agents. In this direction we select the part of the reward that captures how well the agent is learning the environment dynamics, the information gain, and we plot it (rescaled) together with the true environment rewards (Figure 1), for two different agents (one that is particularly good for the task, one that is able to solve it but is suboptimal). Both plots are the averages of 50 instantiations of the environment.
It is clear from Figure 1 that the true rewards of the environment (presented with green) and the information gain rewards (only the part that captures the certainty of the agent about the environment dynamics, presented with blue) are highly correlated, both for the good agent as well as for the suboptimal one. In addition, it is apparent that the information gain reward is both less sparse, taking informative values much earlier than the true reward (especially for the suboptimal agent) as well as smoother across the different instantiations of the environment. A good question then is why do we need the surrogate reward to be composed of two parts, one that captures the agent’s certainty about the environment and another that captures directly the true reward? This is necessary because later on the bandit will need to make increasingly more difficult decisions between agents that are all good and similar on their ability to model the environment and in this case capturing directly the true reward is necessary to differentiate between them.
The next step is to show that the surrogate reward allows focusing on the agents that are indeed the most promising ones in terms of their ability to collect reward early on, but also in terms of their ability to learn the environment dynamics so that collection of reward is also maximized in the long run. In order to show this property, we investigate the relation between the full surrogate reward and the true reward, especially for good agent architectures. Again we use the same setup as before, i.e. the graphs show results that are averages of 50 environment instantiations. In order to interpret the result one has to keep in mind that the surrogate reward is expressing a scaled version of the certainty of the agent about the environment dynamics (it is always increasing), while the raw environment rewards can be negative, either in the early stage of learning only or always. In all cases, the best agents are the ones that have the highest values on Figure 2. We can see that the three best agents in the figures with the doted lines (the surrogate rewards) are also the best ones in terms of the actual environment reward (solid lines) for both cases (note that for some environments highest is the largest positive value, and for some others it is the least negative one).


Having established that the surrogate reward is correlated with the environment reward and also has the desirable properties of being smoother and less sparse, we proceed to validate the core part of the bandit framework. Here we seek to answer the three questions: 1) how do the different bandits compare to each other, 2) whether a bandit framework is able to assign most of the actions on promising agents, as well as 3) if a bandit can select the best agent after a finite number of steps. For comparison we add an oracle that always selects the best agent, as well as a baseline that picks the worst agent.
The basic setup of the framework is that we define a window of iterations (typically 10 for locomotion environments and 20 for Atari) after which the bandit reviews the choice of reinforcement learning agent, and can select one out of several alternatives. The selected agent is then allowed to interact with the environment, collecting rewards in the process. The agents tested are vanilla VIME for the simple locomotion tasks (with different sets of hyperparameters taken from the literature as well as slight modifications with increased or decreased layer sizes, one more or one less layers, etc), and VIME combined with Deep QNetwork (DQN) or Asynchronous ActorCritic Agents (A3C) for the Atari environments again based on literature or slightly modified variations. The bandits we are using for selection between the agents are Epsilon Greedy, Softmax, UCB1 and Exp3, while uniform, worst and best (oracle) are added as reference points.
First, let’s look at the selection behaviour of the bandits after learning. From Figure 3
we can see that the UCB1 algorithm was able to pick the best agent architecture for all 200 instantiations of the environments , followed closely by Epsilon Greedy in the case of Mountain Car. However, for Lunar Lander the gap between the number of correct picks of the UCB1 algorithm and the next best one (SoftMax bandit in that case) is quite large. This performance gap can be reduced if we select better hyper parameters for the nonUCB algorithms, but that is not easy in real world settings, where a simulator of the environment is not available. A possible reason that algorithms such as SoftMax are not doing very well for Lunar Lander is the large number of different possible initial conditions of the game (different starting points of the space shuttle that has to be landed) which makes estimations of reliable expectations of the reward probability associated with each bandit action challenging. For Epsilon Greedy, although it does not calculate posterior probabilities, the large number of combinations makes random exploration less effective as well.
For Atari games (Space Invaders is shown as a representative example, Figure 5
) one observation is that all bandit algorithms have a somewhat tougher job of selecting reliably the best agent. To some extent this is because two or more agents are typically close in terms of performance, but also because the variance of of the agent rewards over different instantiations of the environment can be significant, even for a well trained agent in some of these games. In fact it is not uncommon for one agent to have the potential for high reward, while another one achieves more reliably a certain level of lower, yet relatively good, reward. In such scenarios we might either select algorithms that best capture our preference with respect to the variance of the rewards or allow for a diversification strategy by keeping all the agents that are not strictly inferior to some other agent.
The remaining question to be answered is whether the bandit framework is efficient in terms of collecting reward while learning. For this reason we will compare the cumulative rewards collected by the different bandit algorithms against 3 baselines: 1) a strategy of uniform alternation between the agents, 2) selecting the worst agent and stay with it, and 3) the oracle case of always selecting the best agent (Figures 4 and 6). These 3 baselines will show us different extreme cases that are valuable for comparisons. The uniform strategy shows us if there is any benefit from the bandit at all, compared to the case of naively alternating between agents. The baseline of picking the worst bandit and staying with it shows if the alternation between agents is justified compared to exploring deeply the environment with a suboptimal RL agent. In this approach we have to continue with an agent that does not map well to the problem but we do not waste any time in exploring other agents, so all the effort is focused. Unfortunately, in many interesting cases the worst agent fails completely to solve the problem at hand, therefore some exploration is necessary. Finally, the oracle also does not spend time exploring so it provides an estimation of the performance gap between the bandit framework and the ideal but unrealistic case.
Games  DQN Score  Gorila Score  Human Pro Score  Best Agent Score 

Atlantis  85641 17600  100069.16  29028  217810 7256.4 
KungFu Master  23270 5955  27543.33  22736  29860 6793.1 
Ms. Pacman  2311 525  3233.50  15693  5708.0 860.1 
Seaquest  5286 1310  13169.06  20182  17214 2411.5 
Space Invaders  1976 893  1883.41  1652  3697.5 2876.1 
Zaxxon  4977 1235  7129.33  9173  30610 8169.0 
In order to evaluate the reinforcement learning agent that was selected from the bandit framework in this setting we compare the results of that agent with DQN agents (mnihdqn2015, ), as well as with the more recent massively parallelized form of the same framework, Gorila (DBLP:journals/corr/NairSBAFMPSBPLM15, ). For this comparison we selected 6 Atari games spanning the full range, from ones where the algorithms are much better than humans, to some for which human players are much better. The goal here is not to beat the state of the art achieved by some of the recent algorithms e.g. (DBLP:journals/corr/NachumNXS17, )
but to show that the bandit framework is indeed competitive in this setting. In principle, combining the proposed approach with more recent algorithms than the standard A3C and running the training for a longer number of epochs can improve the results further. Nevertheless, the agent selected by the bandit (for Atari typically a version of A3C enhanced with VIME) does better than DQN and Gorila in all 6 Atari games and it either increases the difference when it already is better than the pro human player or it at least reduces the gap when it is not. However, the exact architecture is not always the same, as the optimal set of hyperparameters varies in the different games.
5 Conclusions
We have introduced a bandit framework that offers a principled way of selecting between different reinforcement learning agent architectures while maximizing rewards during the learning process, by focusing the exploration on the most promising agents even early on. Furthermore, the proposed framework was shown to reliably select the best agent (in terms of future expected rewards) after a finite number of steps. In order to achieve these goals a composite surrogate reward was used in combination to the bandit, comprising both the true environment reward as well as a term inspired by VIME (DBLP:journals/corr/HouthooftCDSTA16, ), that captures the certainty of the agents regarding the environment dynamics, for a given amount of environment interactions. Under this setting, UCB proved an effective and robust algorithm for selecting the best agent after a finite amount of training steps. Experimental results on several classic locomotion and computer game environments show that the bandit strategies outperform both a single nonoptimal agent, as well as uniform alternation between the agents.
References
 (1) Sander Adam, Lucian Busoniu, and Robert Babuska. Experience replay for realtime reinforcement learning control. IEEE Trans. Systems, Man, and Cybernetics, Part C, 42(2):201–212, 2012.
 (2) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, abs/1606.01540, 2016.
 (3) Ivo Grondman, Lucian Busoniu, and Robert Babuska. Model learning actorcritic algorithms: Performance evaluation in a motion control task. In Proceedings of the 51th IEEE Conference on Decision and Control, CDC 2012, December 1013, 2012, Maui, HI, USA, pages 5272–5277, 2012.
 (4) Ivo Grondman, Lucian Busoniu, Gabriel A. D. Lopes, and Robert Babuska. A survey of actorcritic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Systems, Man, and Cybernetics, Part C, 42(6):1291–1307, 2012.
 (5) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Curiositydriven exploration in deep reinforcement learning via bayesian neural networks. CoRR, abs/1605.09674, 2016.
 (6) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
 (7) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 (8) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. CoRR, abs/1702.08892, 2017.
 (9) Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015.
 (10) Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. PGQ: combining policy gradient and qlearning. CoRR, abs/1611.01626, 2016.
 (11) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.

(12)
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and
Martin Riedmiller.
Deterministic policy gradient algorithms.
In Tony Jebara and Eric P. Xing, editors,
Proceedings of the 31st International Conference on Machine Learning (ICML14)
, pages 387–395. JMLR Workshop and Conference Proceedings, 2014.  (13) Yi Sun, Faustino J. Gomez, and Jürgen Schmidhuber. Planning to be surprised: Optimal bayesian exploration in dynamic environments. CoRR, abs/1103.5708, 2011.
 (14) Joannes Vermorel and Mehryar Mohri. Multiarmed bandit algorithms and empirical evaluation. In European conference on machine learning, pages 437–448. Springer, 2005.