Multi-armed bandits environments for OpenAI Gym
In recent years deep reinforcement learning (RL) systems have attained superhuman performance in a number of challenging task domains. However, a major limitation of such applications is their demand for massive amounts of training data. A critical present objective is thus to develop deep RL methods that can adapt rapidly to new tasks. In the present work we introduce a novel approach to this challenge, which we refer to as deep meta-reinforcement learning. Previous work has shown that recurrent networks can support meta-learning in a fully supervised context. We extend this approach to the RL setting. What emerges is a system that is trained using one RL algorithm, but whose recurrent dynamics implement a second, quite separate RL procedure. This second, learned RL algorithm can differ from the original one in arbitrary ways. Importantly, because it is learned, it is configured to exploit structure in the training domain. We unpack these points in a series of seven proof-of-concept experiments, each of which examines a key aspect of deep meta-RL. We consider prospects for extending and scaling up the approach, and also point out some potentially important implications for neuroscience.
Recent advances have allowed long-standing methods for reinforcement learning (RL) to be newly extended to such complex and large-scale task environments as Atari (Mnih et al., 2015) and Go (Silver et al., 2016). The key enabling breakthrough has been the development of techniques allowing the stable integration of RL with non-linear function approximation through deep learning (Mnih et al., 2015; LeCun et al., 2015). The resulting deep RL methods are attaining human- and often superhuman-level performance in an expanding list of domains (Mnih et al., 2015; Silver et al., 2016; Jaderberg et al., 2016). However, there are at least two aspects of human performance that they starkly lack. First, deep RL typically requires a massive volume of training data, whereas human learners can attain reasonable performance on any of a wide range of tasks with comparatively little experience. Second, deep RL systems typically specialize on one restricted task domain, whereas human learners can flexibly adapt to changing task conditions. Recent critiques (e.g., Lake et al., 2016) have invoked these differences as posing a direct challenge to current deep RL research.
In the present work, we outline a framework for meeting these challenges, which we refer to as deep meta-reinforcement learning, a label that is intended to both link it with and distinguish it from previous work employing the term “meta-reinforcement learning” (e.g. Schmidhuber et al., 1996; Schweighofer and Doya, 2003, discussed later). The key concept is to use standard deep RL techniques to train a recurrent neural network in such a way that the recurrent network comes to implement its own, free-standing RL procedure. As we shall illustrate, under the right circumstances, the secondary learned RL procedure can display an adaptiveness and sample efficiency that the original RL procedure lacks.
The following sections review previous work employing recurrent neural networks in the context of meta-learning and describe the general approach for extending such methods to the RL setting. We then present seven proof-of-concept experiments, each of which highlights an important ramification of the deep meta-RL setup by characterizing agent performance in light of this framework. We close with a discussion of key challenges for next-step research, as well as some potential implications for neuroscience.
Flexible, data-efficient learning naturally requires the operation of prior biases. In general terms, such biases can derive from two sources: they can either be engineered into the learning system (as, for example, in convolutional networks), or they can themselves be acquired through learning. The second case has been explored in the machine learning literature under the rubric of meta-learning (Thrun and Pratt, 1998; Schmidhuber et al., 1996).
In one standard setup, the learning agent is confronted with a series of tasks that differ from one another but also share some underlying set of regularities. Meta-learning is then defined as an effect whereby the agent improves its performance in each new task more rapidly, on average, than in past tasks (Thrun and Pratt, 1998). At an architectural level, meta-learning has generally been conceptualized as involving two learning systems: one lower-level system that learns relatively quickly, and which is primarily responsible for adapting to each new task; and a slower higher-level system that works across tasks to tune and improve the lower-level system.
A variety of methods have been pursued to implement this basic meta-learning setup, both within the deep learning community and beyond (Thrun and Pratt, 1998). Of particular relevance here is an approach introduced by Hochreiter and colleagues (Hochreiter et al., 2001), in which a recurrent neural network is trained on a series of interrelated tasks using standard backpropagation. A critical aspect of their setup is that the network receives, on each step within a task, an auxiliary input indicating the target output for the preceding step. For example, in a regression task, on each step the network receives as input an x value for which it is desired to output the corresponding y, but the network also receives an input disclosing the target y value for the preceding step (see Hochreiter et al., 2001; Santoro et al., 2016). In this scenario, a different function is used to generate the data in each training episode, but if the functions are all drawn from a single parametric family, then the system gradually tunes into this consistent structure, converging on accurate outputs more and more rapidly across episodes.
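The input arrangement described above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the linear task family `y = a*x + b`, the function name `make_episode`, and the choice of zero as the auxiliary input on the first step are all assumptions made for the example.

```python
import random

def make_episode(rng, n_steps=10):
    """Sample one regression task and build its (inputs, targets) sequence."""
    # Hypothetical task family: y = a*x + b, with (a, b) redrawn every episode.
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    xs = [rng.uniform(-1, 1) for _ in range(n_steps)]
    ys = [a * x + b for x in xs]
    # Each step's input pairs the current x with the previous step's target.
    # On the first step there is no previous target, so 0.0 is used (assumed).
    prev_ys = [0.0] + ys[:-1]
    inputs = list(zip(xs, prev_ys))
    return inputs, ys

inputs, targets = make_episode(random.Random(0))
```

Because the target for step k only arrives as input on step k+1, a recurrent network has to carry information across steps in order to infer the episode's function.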
One interesting aspect of Hochreiter’s method is that the process that underlies learning within each new task inheres entirely in the dynamics of the recurrent network, rather than in the backpropagation procedure used to tune that network’s weights. Indeed, after an initial training period, the network can improve its performance on new tasks even if the weights are held constant (see also Prokhorov et al., 2002; Cotter and Conwell, 1990; Younger et al., 1999). A second important aspect of the approach is that the learning procedure implemented in the recurrent network is fit to the structure that spans the family of tasks on which the network is trained, embedding biases that allow it to learn efficiently when dealing with tasks from that family.
Importantly, the original work of Hochreiter and colleagues (Hochreiter et al., 2001) only addressed supervised learning (i.e. the auxiliary input provided on each step explicitly indicated the target output on the previous step, and the network was trained using explicit targets). In the present work we consider the implications of applying the same approach in the context of reinforcement learning. Here, the tasks that make up the training series are interrelated RL problems, for example, a series of bandit problems varying only in their parameterization. Rather than presenting target outputs as auxiliary inputs, the agent receives inputs indicating the action output on the previous step and, critically, the quantity of reward resulting from that action. The same reward information is fed in parallel to a deep RL procedure, which tunes the weights of the recurrent network.
It is this setup, as well as its result, that we refer to as deep meta-RL (although from here on, for brevity, we will often simply call it meta-RL, with apologies to authors who have used that term previously). As in the supervised case, when the approach is successful, the dynamics of the recurrent network come to implement a learning algorithm entirely separate from the one used to train the network weights. Once again, after sufficient training, learning can occur within each task even if the weights are held constant. However, here the procedure the recurrent network implements is itself a full-fledged reinforcement learning algorithm, which negotiates the exploration-exploitation tradeoff and improves the agent’s policy based on reward outcomes. A key point, which we will emphasize in what follows, is that this learned RL procedure can differ starkly from the algorithm used to train the network’s weights. In particular, its policy update procedure (including features such as the effective learning rate of that procedure), can differ dramatically from those involved in tuning the network weights, and the learned RL procedure can implement its own approach to exploration. Critically, as in the supervised case, the learned RL procedure will be fit to the statistics spanning the multi-task environment, allowing it to adapt rapidly to new task instances.
Let us write as $\mathcal{D}$ a distribution (the prior) over Markov Decision Processes (MDPs). We want to demonstrate that meta-RL is able to learn a prior-dependent RL algorithm, in the sense that it will perform well on average on MDPs drawn from $\mathcal{D}$ or slight modifications of $\mathcal{D}$. An appropriately structured agent, embedding a recurrent neural network, is trained by interacting with a sequence of MDP environments (also called tasks) through episodes. At the start of a new episode, a new MDP task and an initial state for this task are sampled, and the internal state of the agent (i.e., the pattern of activation over its recurrent units) is reset. The agent then executes its action-selection strategy in this environment for a certain number of discrete time-steps. At each step $t$ an action $a_t$ is executed as a function of the whole history of the agent interacting in the MDP during the current episode (the set of states $\{x_s\}_{0 \le s \le t}$, actions $\{a_s\}_{0 \le s < t}$, and rewards $\{r_s\}_{0 \le s < t}$ observed since the beginning of the episode, when the recurrent unit was reset). The network weights are trained to maximize the sum of observed rewards over all steps and episodes.
After training, the agent’s policy is fixed (i.e. the weights are frozen, but the activations are changing due to input from the environment and the hidden state of the recurrent layer), and it is evaluated on a set of MDPs that are drawn either from the same distribution or slight modifications of that distribution (to test the generalization capacity of the agent). The internal state is reset at the beginning of the evaluation of any new episode. Since the policy learned by the agent is history-dependent (as it makes use of a recurrent network), when exposed to any new MDP environment, it is able to adapt and deploy a strategy that optimizes rewards for that task.
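The evaluation protocol just described can be sketched as a simple loop. The code below is an illustration under stated assumptions rather than the paper's implementation: the `WinStayLoseShift` class is a hypothetical stand-in for a trained recurrent agent (it has no learnable weights at all), and the task is assumed to be a two-armed Bernoulli bandit. What it is meant to show is only the structure: a new task per episode, an internal state reset per episode, and all within-episode adaptation carried by that state.

```python
import random

class WinStayLoseShift:
    """Hypothetical stand-in agent: fixed rule, adaptation lives in its state."""
    def reset(self):
        self.last = None  # internal state, cleared at every episode start

    def act(self, rng):
        if self.last is None:
            return rng.randrange(2)
        action, reward = self.last
        return action if reward > 0 else 1 - action

    def observe(self, action, reward):
        self.last = (action, reward)

def evaluate(agent, sample_task, n_episodes, n_steps, rng):
    """Return mean reward per step across freshly sampled tasks."""
    total = 0.0
    for _ in range(n_episodes):
        probs = sample_task(rng)  # new task drawn from the task distribution
        agent.reset()             # only the internal state is reset
        for _ in range(n_steps):
            a = agent.act(rng)
            r = 1.0 if rng.random() < probs[a] else 0.0
            agent.observe(a, r)
            total += r
    return total / (n_episodes * n_steps)

rng = random.Random(0)
mean_reward = evaluate(WinStayLoseShift(), lambda r: (r.random(), r.random()),
                       n_episodes=50, n_steps=100, rng=rng)
```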
In order to evaluate the approach to learning that we have just described, we conducted a series of six proof-of-concept experiments, which we present here along with a seventh experiment originally reported in a related paper (Mirowski et al., 2016). One particular point of interest in these experiments was to see whether meta-RL could be used to learn an adaptive balance between exploration and exploitation, as demanded of any fully-fledged RL procedure. A second and still more important focus was on the question of whether meta-RL can give rise to learning that gains efficiency by capitalizing on task structure.
In order to examine these questions, we performed four experiments focusing on bandit tasks and two additional experiments focusing on Markov decision problems. All of our experiments (as well as the additional experiment we report) employ a common set of methods, with minor implementational variations. In all experiments, the agent architecture centers on a recurrent neural network (LSTM; Hochreiter and Schmidhuber, 1997) feeding into a soft-max output representing discrete actions. As detailed below, the parameters of this network core, as well as some other architectural details, varied across experiments (see Figure 1 and Table 1). However, it is important to emphasize that comparisons between specific architectures are outside the scope of this paper. Our main aim is to illustrate and validate the meta-RL framework in a more general way. To this end, all experiments used the high-level task setup previously described: Both training and testing were organized into fixed-length episodes, each involving a task randomly sampled from a predetermined task distribution, with the LSTM hidden state initialized at the beginning of each episode. Task-specific inputs and action outputs are described in conjunction with individual experiments. In all experiments except where specified, the input included a scalar indicating the reward received on the preceding time-step as well as a one-hot representation of the action sampled on that time-step.
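As a concrete illustration of the default input just described, the sketch below concatenates a one-hot code of the previous action with the previous scalar reward, and optionally the time step. The function name `build_input`, the ordering of the components, and the use of an all-zero action code on the first step are assumptions for the example, not details taken from the experiments.

```python
def build_input(prev_action, prev_reward, n_actions, t=None):
    """Concatenate one-hot previous action, previous reward, optional time step."""
    one_hot = [0.0] * n_actions
    if prev_action is not None:  # no previous action exists on the first step
        one_hot[prev_action] = 1.0
    vec = one_hot + [float(prev_reward)]
    if t is not None:
        vec.append(float(t))
    return vec

first = build_input(None, 0.0, n_actions=2, t=0)
later = build_input(1, 1.0, n_actions=2, t=7)
```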
Details of training, including the use of entropy regularization and a combined policy and value estimate loss, closely follow the methods detailed in Mirowski et al. (2016), with the exception that our experiments used a single thread unless otherwise noted. For a full listing of parameters refer to Table 1.
|Parameter|Exps. 1 & 2|Exp. 3|Exp. 4|Exp. 5|Exp. 6|
|Input|$a_{t-1}$, $r_{t-1}$, $t$|$a_{t-1}$, $r_{t-1}$, $t$|$a_{t-1}$, $r_{t-1}$, $t$|$a_{t-1}$, $r_{t-1}$, $t$, $x_t$|$a_{t-1}$, $r_{t-1}$, $x_t$|
Table 1: List of hyperparameters. $\beta_e$ = coefficient of entropy regularization loss; in Exps. 1-4, $\beta_e$ is annealed from 1.0 to 0.0 over the course of training. $\beta_v$ = coefficient of value function loss (Mirowski et al., 2016). $r_{t-1}$ = reward, $a_{t-1}$ = last action, $t$ = current time step, $x_t$ = current observation. Exp. 1: Bandits with independent arms (Section 3.1.1); Exp. 2: Bandits with dependent arms I (Section 3.1.2); Exp. 3: Bandits with dependent arms II (Section 3.1.3); Exp. 4: Restless bandits (Section 3.1.4); Exp. 5: The “Two-Step Task” (Section 3.2.1); Exp. 6: Learning abstract task structure (Section 3.2.2).
As an initial setting for evaluating meta-RL, we studied a series of bandit problems. Except for a very limited set of bandit environments, it is intractable to compute the (prior-dependent) Bayesian-optimal strategy. Here we demonstrate that a recurrent system trained on a set of bandit environments drawn i.i.d. from a given distribution of environments produces a bandit algorithm which performs well on problems drawn from that distribution, and to a certain extent generalizes to related distributions. Thus, meta-RL learns a prior-dependent bandit algorithm.
The specific bandit instantiation of the general meta-RL procedure described in Section 2.3 is defined as follows. Let $\mathcal{D}$ be a training distribution over bandit environments. The meta-RL system is trained on a sequence of bandit environments through episodes. At the start of a new episode, its LSTM state is reset and a bandit task $b \sim \mathcal{D}$ is sampled. A bandit task is defined as a set of distributions, one for each arm, from which rewards are sampled. The agent plays in this bandit environment for a certain number of trials and is trained to maximize observed rewards. After training, the agent’s policy is evaluated on a set of bandit tasks that are drawn from a test distribution $\mathcal{D}'$, which can either be the same as $\mathcal{D}$ or a slight modification of it.
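We evaluate the resulting performance of the learned bandit algorithm by the cumulative regret, a measure of the loss (in expected rewards) suffered when playing sub-optimal arms. Writing $\mu_a(b)$ for the expected reward of arm $a$ in bandit environment $b$, and $\mu^*(b) = \max_a \mu_a(b) = \mu_{a^*(b)}(b)$ (where $a^*(b)$ is one optimal arm) for the optimal expected reward, we define the cumulative regret (in environment $b$) as $R_T(b) = \sum_{t=1}^{T} \left( \mu^*(b) - \mu_{a_t}(b) \right)$, where $a_t$ is the arm (action) chosen at time $t$. In experiment 4 (Restless bandits; Section 3.1.4), $\mu^*(b)$ also depends on $t$. We report the performance (average over bandit environments $b$ drawn from the test distribution $\mathcal{D}'$) either in terms of the cumulative regret, $\mathbb{E}_{b \sim \mathcal{D}'}[R_T(b)]$, or in terms of the number of sub-optimal pulls, $\mathbb{E}_{b \sim \mathcal{D}'}\left[\sum_{t=1}^{T} \mathbb{I}\{a_t \neq a^*(b)\}\right]$.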
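For a single stationary environment, the two performance measures can be computed directly from the arm means and the sequence of chosen arms. The sketch below assumes the means are known to the evaluator (as they are in simulation); the function names are illustrative. When several arms tie for the optimum, this version counts none of them as sub-optimal, which is one reasonable reading of "one optimal arm".

```python
def cumulative_regret(mu, actions):
    """Sum over trials of (best arm mean) minus (chosen arm mean)."""
    mu_star = max(mu)
    return sum(mu_star - mu[a] for a in actions)

def suboptimal_pulls(mu, actions):
    """Count pulls of arms whose mean is strictly below the optimum."""
    mu_star = max(mu)
    return sum(1 for a in actions if mu[a] < mu_star)

mu = [0.3, 0.7]              # expected reward of each arm
actions = [0, 1, 0, 1, 1]    # arms chosen on trials 1..5
regret = cumulative_regret(mu, actions)
pulls = suboptimal_pulls(mu, actions)
```

Averaging these quantities over many environments sampled from the test distribution gives the reported expectations.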
We first consider a simple two-armed bandit task to examine the behavior of meta-RL under conditions where theoretical guarantees exist and general purpose algorithms apply. The arm distributions are independent Bernoulli distributions (rewards are 1 with probability $p$ and 0 with probability $1-p$), where the parameters of each arm ($p_1$ and $p_2$) are sampled independently and uniformly over $[0, 1]$. We denote by $\mathcal{D}_i$ the corresponding distribution over these independent bandit environments (where the subscript $i$ stands for independent arms).
At the beginning of each episode, a new bandit task is sampled and held constant for 100 trials. Training lasted for 20,000 episodes. The network is given as input the last reward, last action taken, and the trial number $t$, subsequently producing the action for the next trial $t+1$ (Figure 1). After training, we evaluated on 300 new episodes with the learning rate set to zero (the learned policy is fixed).
Across model instances, we randomly sampled learning rate and discount, following Mnih et al. (2016). For all figures, we plotted the average of the top 5 runs of 100 randomly sampled hyperparameter settings, where the top agents were selected from the first half of the 300 evaluation episodes and performance was plotted for the second half. We measured the cumulative expected regret across the episode, comparing with several algorithms tailored for this independent bandit setting: Gittins indices (Gittins, 1979) (which is Bayesian optimal in the finite-horizon case), UCB (Auer et al., 2002) (which comes with theoretical finite-time regret guarantees), and Thompson sampling (Thompson, 1933) (which is asymptotically optimal in this setting: see Kaufmann et al., 2012b). Model simulations were conducted with the PymaBandits toolbox from Kaufmann et al. (2012a) and custom Matlab scripts.
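For readers unfamiliar with the two simpler baselines, the following is a minimal, self-contained sketch of Thompson sampling with a uniform Beta(1, 1) prior and of the UCB1 index rule on a Bernoulli bandit. It is not the PymaBandits implementation used for the reported comparisons; the function names, the UCB1 bonus constant, and the example arm probabilities are assumptions for illustration. Gittins indices are omitted because computing them takes considerably more code.

```python
import math
import random

def thompson(successes, failures, rng):
    """Sample each arm's mean from its Beta posterior and pick the largest."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return draws.index(max(draws))

def ucb1(successes, failures, t):
    """Pick the arm with the largest empirical mean plus confidence bonus."""
    for arm, (s, f) in enumerate(zip(successes, failures)):
        if s + f == 0:
            return arm  # play every arm once before using the index
    scores = [s / (s + f) + math.sqrt(2 * math.log(t) / (s + f))
              for s, f in zip(successes, failures)]
    return scores.index(max(scores))

def run(policy, probs, n_trials, rng):
    """Play one episode and return the cumulative expected regret."""
    k = len(probs)
    succ, fail = [0] * k, [0] * k
    regret = 0.0
    for t in range(1, n_trials + 1):
        arm = policy(succ, fail, t, rng)
        if rng.random() < probs[arm]:
            succ[arm] += 1
        else:
            fail[arm] += 1
        regret += max(probs) - probs[arm]
    return regret

rng = random.Random(0)
ts_regret = run(lambda s, f, t, r: thompson(s, f, r), [0.2, 0.8], 100, rng)
ucb_regret = run(lambda s, f, t, r: ucb1(s, f, t), [0.2, 0.8], 100, rng)
```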
As shown in Figure 2a (green line; “Independent”), meta-RL outperforms both Thompson sampling (gray dashed line) and UCB (light gray dashed line), although it performs less well compared to Gittins (black dashed line). To verify the critical importance of providing reward information to the LSTM, we removed this input, leaving all other inputs as before. As expected, performance was at chance levels on all bandit tasks.
As we have emphasized, a key property of meta-RL is that it gives rise to a learned RL algorithm that exploits consistent structure in the training distribution. In order to garner empirical evidence for this point, we tested the agent from our first experiment in a more structured bandit task. Specifically, we trained the system on two-arm bandits in which arm reward distributions are correlated. In this setting, unlike the one studied in the previous section, experience with either arm provides information about the other. Standard bandit algorithms, including UCB and Thompson sampling, perform suboptimally in this setting, as they are not designed to exploit such correlations. In some cases it is possible to tailor algorithms for specific arm structures (see for example Lattimore and Munos, 2014), but extensive problem-specific analysis is typically required. Our approach aims to learn a structure-dependent bandit algorithm directly from experience with the target bandit domain.
We consider Bernoulli distributions where the parameters $(p_1, p_2)$ of the two arms are correlated in the sense that $p_1 = 1 - p_2$. We consider several training and test distributions. The uniform means that $p_1 \sim \mathcal{U}([0, 1])$ (uniform distribution over the unit interval). The easy means that $p_1 \sim \mathcal{U}(\{0.1, 0.9\})$ (uniform distribution over those two possible values), and similarly we call medium when $p_1 \sim \mathcal{U}(\{0.25, 0.75\})$ and hard when $p_1 \sim \mathcal{U}(\{0.4, 0.6\})$. We denote by $\mathcal{D}_u$, $\mathcal{D}_e$, $\mathcal{D}_m$, and $\mathcal{D}_h$ the corresponding induced distributions over bandit environments. In addition we also considered the independent uniform distribution (as in the previous section, $\mathcal{D}_i$) where $p_1, p_2 \sim \mathcal{U}([0, 1])$ independently. Agents were both trained and tested on those five distributions over bandit environments (among which four correspond to correlated distributions: $\mathcal{D}_u$, $\mathcal{D}_e$, $\mathcal{D}_m$, and $\mathcal{D}_h$; and one to the independent case: $\mathcal{D}_i$). As a validation of the names given to the task distributions ($\mathcal{D}_e$, $\mathcal{D}_m$, $\mathcal{D}_h$), results show that the easy task is easier to learn than the medium which itself is easier than the hard one (Figure 2f). This is compatible with the general notion that the hardness of a bandit problem is inversely proportional to the difference between the expected reward of the optimal and sub-optimal arms. We again note that withholding the reward input to the LSTM resulted in chance performance on even the easiest bandit task, as should be expected.
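The five task distributions can be sketched as a single sampling function. This is an illustrative reconstruction: it assumes the correlated arms satisfy p2 = 1 - p1, and the two-point supports used for the easy, medium, and hard settings ({0.1, 0.9}, {0.25, 0.75}, {0.4, 0.6}) are assumptions of this sketch, as are the names `SUPPORTS` and `sample_bandit`.

```python
import random

# Two-point supports for the correlated settings (assumed values).
SUPPORTS = {"easy": (0.1, 0.9), "medium": (0.25, 0.75), "hard": (0.4, 0.6)}

def sample_bandit(kind, rng):
    """Return (p1, p2), the Bernoulli parameters of a two-armed bandit."""
    if kind == "independent":
        return rng.random(), rng.random()
    if kind == "uniform":
        p1 = rng.random()
    else:
        p1 = rng.choice(SUPPORTS[kind])
    return p1, 1.0 - p1  # correlated arms: p2 = 1 - p1

rng = random.Random(0)
kinds = ["independent", "uniform", "easy", "medium", "hard"]
tasks = {k: sample_bandit(k, rng) for k in kinds}
gap = {k: abs(p1 - p2) for k, (p1, p2) in tasks.items()}
```

Under these assumed supports the gap between the arms is 0.8, 0.5, and 0.2 for easy, medium, and hard respectively, which matches the intuition that a smaller gap makes the better arm harder to identify.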
Figure 2f reports the results of all possible training-testing regimes. From observing the cumulative expected regrets, we make the following observations: i) agents trained in structured environments (uniform $\mathcal{D}_u$, easy $\mathcal{D}_e$, medium $\mathcal{D}_m$, and hard $\mathcal{D}_h$) develop prior knowledge that can be used effectively when tested on structured distributions, performing comparably to Gittins (Figure 2c-f), and superiorly compared to agents trained on independent arms ($\mathcal{D}_i$) in all structured tasks at test (Figure 2f). This is because an agent trained on independent rewards ($\mathcal{D}_i$) has not learned to exploit the reward correlations that are useful in those structured tasks. ii) Conversely, previous training on any structured distribution ($\mathcal{D}_u$, $\mathcal{D}_e$, $\mathcal{D}_m$, or $\mathcal{D}_h$) hurts performance when agents are tested on an independent distribution ($\mathcal{D}_i$; Figure 2f). This makes sense, as training on correlated arms may produce a policy that relies on specific reward structure, thereby impacting performance in problems where no such structure exists. iii) Whilst the previous results emphasize the point that meta-RL gives rise to a separate learnt RL algorithm that implements prior-dependent bandit strategies, results also provide evidence that there is some generalization beyond the exact training distribution encountered (Figure 2f). For example, agents trained on the distributions $\mathcal{D}_e$ and $\mathcal{D}_m$ perform well when tested over a much wider structured distribution (i.e. $\mathcal{D}_u$). Further, our evidence suggests that there is generalization from training on the easier tasks ($\mathcal{D}_e$, $\mathcal{D}_m$) to testing on the hardest task ($\mathcal{D}_h$; Figure 2e), with similar or even marginally superior performance as compared to training on the hard distribution itself (Figure 2f). In contrast, training on the hard distribution results in relatively poor generalization to other structured distributions ($\mathcal{D}_u$, $\mathcal{D}_e$, $\mathcal{D}_m$), suggesting that training purely on hard instances may result in a learned RL algorithm that is more constrained by prior knowledge, perhaps due to the difficulty of solving the original problem.
In the previous experiment, the agent could outperform standard bandit algorithms by making use of learned dependencies between arms. However, it could do this while always choosing what it believes to be the highest-paying arm. We next examine a problem where information can be gained by paying a short-term reward cost. Similar problems have been examined before as providing a challenge to standard bandit algorithms (see e.g. Russo and Van Roy, 2014). In contrast, humans and animals make decisions that sacrifice immediate reward for information gain (e.g. Bromberg-Martin and Hikosaka, 2009).
In this experiment, the agent was trained on 11-armed bandits with strong dependencies between arms. All arms had deterministic payouts. Nine “non-target” arms had reward 1, and one “target” arm had reward 5. Meanwhile, arm $a_{11}$ was always “informative”, in that the target arm was indexed by 10 times $a_{11}$’s reward (e.g. a reward of 0.2 on $a_{11}$ indicated that $a_2$ was the target arm). Thus, $a_{11}$’s payouts ranged from 0.1 to 1. In each episode, the index of the target arm was randomly assigned. On the first trial of each episode, the agent could not know which arm was the target, so the informative arm returned expected reward 0.55 and every candidate target arm returned expected reward 1.4. Choosing the informative arm thus meant foregoing immediate reward, but with the compensation of valuable information. Episodes were five steps long. Again, the reward on the previous trial was provided as an additional observation to the agent. To facilitate learning, this was encoded in 1-hot format.
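The first-trial expectations quoted above, and the value of paying for information, can be checked with a short calculation. The sketch assumes a non-target payout of 1 and a target payout of 5 (values consistent with the stated expectation of 1.4), a uniform prior over the ten candidate arms, and a five-step episode. The "sequential search" strategy it is compared against is a hypothetical baseline introduced here for illustration, not one of the algorithms evaluated in the experiment.

```python
from fractions import Fraction

N_TARGETS, HORIZON = 10, 5
R_NON, R_TARGET = Fraction(1), Fraction(5)  # assumed payouts

# Expected first-trial reward of the informative arm: mean of 0.1, ..., 1.0.
informative = sum(Fraction(i, 10) for i in range(1, N_TARGETS + 1)) / N_TARGETS
# Expected first-trial reward of any candidate arm under a uniform prior.
candidate = (R_TARGET + (N_TARGETS - 1) * R_NON) / N_TARGETS

# Strategy A: pull the informative arm once, then the revealed target.
informed = informative + (HORIZON - 1) * R_TARGET

# Strategy B: try untested candidates in turn, staying on the target if found.
search = Fraction(0)
for k in range(1, HORIZON + 1):  # target found on trial k, probability 1/10
    payoff = (k - 1) * R_NON + (HORIZON - k + 1) * R_TARGET
    search += Fraction(1, N_TARGETS) * payoff
# Target not found within the episode: every pull paid the non-target reward.
search += Fraction(N_TARGETS - HORIZON, N_TARGETS) * HORIZON * R_NON
```

Under these assumptions the informed strategy earns 20.55 in expectation over an episode, against 11 for sequential search, even though its first pull is worth only 0.55 rather than 1.4.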
Results are shown in Figure 3. The agent learned the optimal long-run strategy of sampling the informative arm once, despite the short-term cost, and then using the resulting information to exploit the high-value target arm. Thompson sampling, if supplied the true prior, searched potential target arms and exploited the target if found. UCB performed worse because it sampled every arm once even if the target arm was found early.
In previous experiments we considered stationary problems where the agent’s actions yielded information about task parameters that remained fixed throughout each episode. Next, we consider a bandit problem in which reward probabilities change over the course of an episode, with different rates of change (volatilities) in different episodes. To perform well, the agent must not only track the best arm, but also infer the volatility of the episode and adjust its own learning rate accordingly. In such an environment, learning rates should be higher when the environment is changing rapidly, because past information becomes irrelevant more quickly (Sutton and Barto, 1998; Behrens et al., 2007).
We tested whether meta-RL would learn such a flexible RL policy using a two-armed Bernoulli bandit task with reward probabilities $p_1$ and $1 - p_1$. The value of $p_1$ changed slowly in “low vol” episodes and quickly in “high vol” episodes. The agent had no way of knowing which type of episode it was in, except for its reward history within the episode. Figure 4a shows example “low vol” and “high vol” episodes. Reward magnitude was fixed at 1, and episodes were 100 steps long. UCB and Thompson sampling were again implemented for comparison. The confidence bound term in UCB had a free parameter, which was set to 1, selected empirically for good performance on our data set. Thompson sampling’s posterior update included knowledge of the Gaussian random walk, but with a fixed volatility for all episodes.
As in the previous experiment, meta-RL achieved lower regret in test than Thompson sampling, UCB, or the Rescorla-Wagner (R-W) learning rule (Figure 4b; Rescorla et al., 1972) with the best fixed learning rate ($\alpha = 0.5$). To test whether the agent adjusted its effective learning rate to match environments with different volatility levels, we fit R-W models to the agent’s behavior, concatenating episodes into blocks of 10, where each block consisted of only “low vol” or only “high vol” episodes. We considered four different models encompassing different combinations of three parameters: learning rate $\alpha$, softmax inverse temperature $\beta$, and a lapse rate $\epsilon$ to account for unexplained choice variance not related to estimated value (Economides et al., 2015). Model “b” included only $\beta$, “ab” included $\alpha$ and $\beta$, “be” included $\beta$ and $\epsilon$, and “abe” included all three. All parameters were estimated separately on each block of 10 episodes. In models where $\epsilon$ and $\alpha$ were not free, they were fixed to 0 and 0.5, respectively. Model comparison by Bayesian Information Criterion (BIC) indicated that meta-RL’s behavior was better described by a model with different learning rates for each block than a model with a fixed learning rate across blocks. As a control, we performed the same model comparison on the behavior produced by the best R-W agent, finding no benefit of allowing different learning rates across episodes (models “abe” and “ab” vs “be” and “b”; Figure 4c-d). In these models, the parameter estimates for meta-RL’s behavior were strongly related to the volatility of the episodes, indicating that meta-RL adjusted its learning rate to the volatility of the episode, whereas model fitting the R-W behavior simply recovered the fixed parameters (Figure 4e-f).
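To make the model family concrete, the sketch below evaluates the log-likelihood of a choice sequence under a Rescorla-Wagner learner with a softmax choice rule and a lapse rate, and computes the BIC from that likelihood. The exact parameterization used in the analysis is not spelled out in the text, so the initial values of 0.5, the form of the lapse mixture (a uniformly random choice with probability equal to the lapse rate), and the function names are assumptions. Fixing the learning rate at 0.5 and the lapse rate at 0 corresponds to the most restricted model, "b".

```python
import math

def rw_log_likelihood(choices, rewards, alpha, beta, eps):
    """Log-likelihood of two-armed choices under R-W + softmax + lapse."""
    q = [0.5, 0.5]  # assumed initial value estimates
    ll = 0.0
    for c, r in zip(choices, rewards):
        # Softmax over the two values with inverse temperature beta.
        p0 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        p_soft = p0 if c == 0 else 1.0 - p0
        # Lapse: with probability eps the choice is uniformly random.
        p = (1.0 - eps) * p_soft + eps * 0.5
        ll += math.log(p)
        # Rescorla-Wagner update of the chosen arm only.
        q[c] += alpha * (r - q[c])
    return ll

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

choices = [0, 0, 1, 0, 0, 0, 1, 0]
rewards = [1, 1, 0, 1, 0, 1, 0, 1]
ll = rw_log_likelihood(choices, rewards, alpha=0.5, beta=5.0, eps=0.0)
score = bic(ll, n_params=1, n_obs=len(choices))
```

In a full analysis the free parameters would be optimized per block (for example by grid search or a numerical optimizer) before the BIC values of the four models are compared.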
The foregoing experiments focused on bandit tasks in which actions do not affect the task’s underlying state. We turn now to MDPs where actions do influence state. We begin with a task derived from the neuroscience literature and then turn to a task, originally studied in the context of animal learning, which requires learning of abstract task structure. As in the previous experiments, our focus is on examining how meta-RL adapts to invariances in task structure. We wrap up by reviewing an experiment recently reported in a related paper (Mirowski et al., 2016), which demonstrates how meta-RL can scale to large-scale navigation tasks with rich visual inputs.
Here we examine meta-RL in a setting that has been widely used in the neuroscience literature to distinguish the contribution of different systems viewed to support decision making (Daw et al., 2005). Specifically, this paradigm – known as the “two-step task” (Daw et al., 2011) – was developed to dissociate a model-free system that caches values of actions in states (e.g. TD(1) Q-learning; see Sutton and Barto, 1998), from a model-based system which learns an internal model of the environment and evaluates the value of actions at the time of decision-making through look-ahead planning (Daw et al., 2005). Our interest was in whether meta-RL would give rise to behavior emulating a model-based strategy, despite the use of a model-free algorithm (in this case A2C) to train the system weights.
We used a modified version of the two-step task, designed to bolster the utility of model-based over model-free control (see Kool et al., 2016). The task’s structure is diagrammed in Figure 5a. From the first-stage state $S_1$, action $a_1$ leads to second-stage states $S_2$ and $S_3$ with probability 0.75 and 0.25, respectively, while action $a_2$ leads to $S_2$ and $S_3$ with probabilities 0.25 and 0.75. One second-stage state yielded a reward of 1.0 with probability 0.9 (and otherwise zero); the other yielded the same reward with probability 0.1. The identity of the higher-valued state was assigned randomly for each episode. Thus, the expected rewards of the two second-stage states were either 0.9 for $S_2$ and 0.1 for $S_3$, or 0.1 for $S_2$ and 0.9 for $S_3$. All three states were represented by one-hot vectors, with the transition model held constant across episodes: i.e. only the expected value of the second stage states changed from episode to episode.
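Combining the transition probabilities with the second-stage reward rates gives the expected reward of each first-stage action, which is the quantity a look-ahead planner would compare. The short calculation below is an illustration only; the labels `a1`, `a2`, `S2`, and `S3` for the two first-stage actions and the two second-stage states are naming assumptions of this sketch.

```python
# Transition probabilities P(second-stage state | first-stage action).
P = {"a1": {"S2": 0.75, "S3": 0.25},
     "a2": {"S2": 0.25, "S3": 0.75}}

def first_stage_values(reward_prob):
    """Expected reward of each first-stage action under one-step look-ahead."""
    return {a: sum(p * reward_prob[s] for s, p in trans.items())
            for a, trans in P.items()}

# Episode type in which S2 is the higher-valued second-stage state.
values = first_stage_values({"S2": 0.9, "S3": 0.1})
# Episode type in which S3 is the higher-valued second-stage state.
flipped = first_stage_values({"S2": 0.1, "S3": 0.9})
```

With S2 as the better state the two actions are worth 0.7 and 0.3, and the ordering reverses when S3 is the better state, so identifying the episode type is what determines the correct first-stage choice.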
We applied the conventional analysis used in the neuroscience literature to dissociate model-free from model-based control (Daw et al., 2011). This focuses on the “stay probability,” that is, the probability with which a first-stage action is selected at trial $t+1$ following a second-stage reward at trial $t$, as a function of whether trial $t$ involved a common transition (e.g. action $a_1$ at state $S_1$ led to $S_2$) or rare transition (action $a_2$ at state $S_1$ led to $S_2$). Under the standard interpretation (see Daw et al., 2011), model-free control (à la TD(1)) predicts that there should be a main effect of reward: First-stage actions will tend to be repeated if followed by reward, regardless of transition type, and such actions will tend not to be repeated (choice switch) if followed by non-reward (Figure 5b). In contrast, model-based control predicts an interaction between the reward and transition type, reflecting a more goal-directed strategy, which takes the transition structure into account. Intuitively, if you receive a second-stage reward (e.g. at $S_2$) following a rare transition (i.e. having taken action $a_2$ at state $S_1$), to maximize your chances of getting to this reward on the next trial based on your knowledge of the transition structure, the optimal first stage action is $a_1$ (i.e. switch).
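The stay-probability analysis reduces to a tabulation over consecutive trial pairs. The sketch below assumes each trial is recorded as a tuple of first-stage action, whether the transition was common, and whether reward was received; this record format, the function name, and the tiny example sequence are hypothetical. The example data were constructed by hand to show the interaction pattern associated with model-based control.

```python
from collections import defaultdict

def stay_probabilities(trials):
    """trials: list of (first_stage_action, was_common, was_rewarded)."""
    counts = defaultdict(lambda: [0, 0])  # (common, rewarded) -> [stays, total]
    for prev, nxt in zip(trials, trials[1:]):
        action, common, rewarded = prev
        key = (common, rewarded)
        counts[key][0] += int(nxt[0] == action)  # repeated the same action?
        counts[key][1] += 1
    return {k: stays / total for k, (stays, total) in counts.items()}

trials = [("a1", True, True), ("a1", False, True), ("a2", True, False),
          ("a1", True, True), ("a1", False, False), ("a1", True, True)]
probs = stay_probabilities(trials)
```

A purely reward-driven learner would show high stay probability after any rewarded trial, whereas in this constructed sequence the agent stays after common-rewarded and rare-unrewarded trials and switches after rare-rewarded and common-unrewarded ones.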
The results of the stay-probability analysis performed on the agent’s choices show a pattern conventionally interpreted as implying the operation of model-based control (Figure 5c). As in previous experiments, when reward information was withheld at the level of network input, performance was at chance levels.
If interpreted following standard practice in neuroscience, the behavior of the model in this experiment reflects a surprising effect: training with model-free RL gives rise to behavior reflecting model-based control. We hasten to note that different interpretations of the observed pattern of behavior are available (Akam et al., 2015), a point to which we will return below. However, notwithstanding this caveat, the results of the present experiment provide a further illustration of the point that the learning procedure that emerges from meta-RL can differ starkly from the original RL algorithm used to train the network weights, and takes a form that exploits consistent task structure.
In the final experiment we conducted, we took a step towards examining the scalability of meta-RL, by studying a task that involves rich visual inputs, longer time horizons and sparse rewards. Additionally, in this experiment we studied a meta-learning task that requires the system to tune into an abstract task structure, in which a series of objects play defined roles which the system must infer.
The task was adapted from a classic study of animal behavior, conducted by Harlow (1949). On each trial in the original task, Harlow presented a monkey with two visually contrasting objects. One of these covered a small well containing a morsel of food; the other covered an empty well. The animal chose freely between the two objects and could retrieve the food reward if present. The stage was then hidden and the left-right positions of the objects were randomly reset. A new trial then began, with the animal again choosing freely. This process continued for a set number of trials using the same two objects. At completion of this set of trials, two entirely new and unfamiliar objects were substituted for the original two, and the process began again. Importantly, within each block of trials, one object was chosen to be consistently rewarded (regardless of its left-right position), with the other being consistently unrewarded. What Harlow (1949) observed was that, after substantial practice, monkeys displayed behavior that reflected an understanding of the task’s rules. When two new objects were presented, the monkey’s first choice between them was necessarily arbitrary. But after observing the outcome of this first choice, the monkey was at ceiling thereafter, always choosing the rewarded object.
We anticipated that meta-RL should give rise to the same pattern of abstract one-shot learning. In order to test this, we adapted Harlow’s paradigm into a visual fixation task, as follows. An 84x84 pixel input represented a simulated computer screen (see Figure 6a-c). At the beginning of each trial, this display was blank except for a small central fixation cross (red crosshairs). The agent selected discrete left-right actions which shifted its view approximately 4.4 degrees in the corresponding direction, with a small momentum effect (alternatively, a no-op action could be selected). The completion of a trial required performing two tasks: saccading to the central fixation cross, followed by saccading to the correct image. If the agent held the fixation cross in the center of the field of view (within a tolerance of 3.5 degrees visual angle) for a minimum of four time steps, it received a reward of 0.2. The fixation cross then disappeared and two images – drawn randomly from the ImageNet dataset (Deng et al., 2009) and resized to 34x34 – appeared on the left and right side of the display (Figure 6b). The agent’s task was then to “select” one of the images by rotating until the center of the image aligned with the center of the visual field of view (within a tolerance of 7 degrees visual angle). Once one of the images was selected, both images disappeared and, after an intertrial interval of 10 time steps, the fixation cross reappeared, initiating the next trial. Each episode contained a maximum of 10 trials or 3600 steps. Following Mirowski et al. (2016), we implemented an action repeat of 4, meaning that selecting an image took a minimum of three independent decisions (twelve primitive actions) after having completed the fixation. It should be noted, however, that the rotational position of the agent was not limited; that is, 360 degree rotations could occur, while the simulated computer screen only subtended 65 degrees.
Although new ImageNet images were chosen at the beginning of each episode (sampled with replacement from a set of 1000 images), the same images were re-used across all trials within an episode, though in randomly varying left-right placement, similar to the objects in Harlow’s experiment. And as in that experiment, one image was arbitrarily chosen to be the “rewarded” image throughout the episode. Selection of this image yielded a reward of 1.0, while the other image yielded a reward of -1.0. During test, the A3C learning rate was set to zero and ImageNet images were drawn from a separate held-out set of 1000, never presented during training.
A grid search was conducted for optimal hyperparameters. At perfect performance, agents can complete one trial per 20-30 steps and achieve a maximum expected reward of 9 per 10 trials. Given the nature of the task – which requires one-shot image-reward memory together with maintenance of this information over a relatively long timescale (i.e. over fixation-cross selections and across trials) – we assessed the performance of not only a convolutional-LSTM architecture which receives reward and action as additional input (see Figure 1b and Table 1), but also a convolutional-stacked LSTM architecture used in a navigation task discussed below (see Figure 1c).
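To make the figure of 9 per 10 trials concrete, the following sketch simulates an abstract version of the task under an ideal one-shot policy. It is a simplification introduced here, not the environment used in the experiments: visual input, fixation, and left-right placement are omitted, fixation rewards are not counted (an assumption about how the figure of 9 is tallied), and the function name is hypothetical.

```python
import random


def run_ideal_episode(rng, n_trials=10):
    """Abstract Harlow episode: two objects, one rewarded for the whole episode.

    Fixation rewards and visual input are omitted (simplifying assumption).
    The policy guesses on the first trial, then always picks the rewarded object.
    """
    rewarded = rng.choice([0, 1])   # identity of the rewarded object
    known_good = None               # what the agent has inferred so far
    total = 0.0
    for _ in range(n_trials):
        choice = rng.choice([0, 1]) if known_good is None else known_good
        reward = 1.0 if choice == rewarded else -1.0
        total += reward
        # One outcome is enough: a reward confirms the choice, a penalty
        # identifies the other object as the rewarded one.
        known_good = choice if reward > 0 else 1 - choice
    return total


rng = random.Random(0)
returns = [run_ideal_episode(rng) for _ in range(10000)]
mean_return = sum(returns) / len(returns)
```

Under these assumptions every episode returns either 8 (wrong first guess) or 10 (correct first guess), so the mean approaches 9 as the number of simulated episodes grows.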
Agent performance is illustrated in Figure 6d-f. Whilst the single LSTM agent was relatively successful at solving the task, the stacked-LSTM variant exhibited much better robustness. That is, 43% of random seeds of the best hyperparameter set performed at ceiling (Figure 6e), compared to 26% of the single LSTM.
Like the monkeys in Harlow’s experiment (Harlow, 1949), the networks converge on an optimal policy: Not only does the agent successfully fixate to begin each trial, but starting on the second trial of each episode it invariably selects the rewarded image, regardless of which image it selected on the first trial (Figure 6f). This is an impressive form of one-shot learning, which reflects an implicit understanding of the task structure: After observing one trial outcome, the agent binds a complex, unfamiliar image to a specific task role.
Further experiments, reported elsewhere (Wang et al., 2017), confirmed that the same recurrent A3C system is also able to solve a substantially more difficult version of the task. In this version, only one image was held constant across the trials of an episode, and it was randomly designated as either the rewarded item to be selected or the unrewarded item to be avoided; the other image shown was novel on every trial.
The experiments using the Harlow task demonstrate the capacity of meta-RL to operate effectively within a visually rich environment, with relatively long time horizons. Here we consider related experiments recently reported within the navigation domain (Mirowski et al., 2016; see also Jaderberg et al., 2016), and discuss how these can be recast as examples of meta-RL – attesting to the scalability of this principle to more typical MDP settings that pose challenging RL problems due to dynamically changing sparse rewards.
Specifically, we consider a setting where the environment layout is fixed but the goal changes location randomly each episode (Figure 7; Mirowski et al., 2016). Although the layout is relatively simple, the Labyrinth environment (see Mirowski et al., 2016, for details) is richer and more finely discretized (cf. VizDoom), resulting in long time horizons; a trained agent takes approximately 100 steps (10 seconds) to reach the goal for the first time in a given episode. Results show that a stacked LSTM architecture (Figure 1c) that receives reward and action as additional inputs, equivalent to that used in our Harlow experiment, achieves near-optimal behavior – showing one-shot memory for the goal location after an initial exploratory period, followed by repeated exploitation (see Figure 7c). This is evidenced by a substantial decrease in the latency to reach the goal, from the first visit (~100 timesteps) to subsequent visits (~30 timesteps). Notably, a feedforward network (see Figure 7c) that receives only a single image as observation is unable to solve the task (i.e. it shows no decrease in latency between successive goal rewards). Whilst not interpreted as such in Mirowski et al. (2016), this provides a clear demonstration of the effectiveness of meta-RL: a separate RL algorithm with the capability of one-shot learning emerges through training with a fixed and more incremental RL algorithm (i.e. policy gradient). Meta-RL can be viewed as allowing the agent to infer the optimal value function following initial exploration (see Figure 7d) – with the additional LSTM providing information about the currently relevant goal location to the LSTM that outputs the policy over the extended timeframe of the episode. Taken together, these results show that meta-RL allows a base model-free RL algorithm to solve a challenging RL problem that might otherwise require fundamentally different approaches (e.g. based on successor representations or fully model-based RL).
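The latency measure used in this comparison can be sketched as follows. The function name and the example time steps are hypothetical; the numbers are chosen only to echo the approximate magnitudes quoted above and are not data from the experiments.

```python
def goal_latencies(goal_steps):
    """Steps taken to reach the goal on each visit within one episode.

    goal_steps: increasing time steps at which the goal reward was received.
    The first latency is measured from the start of the episode (step 0),
    later ones from the previous goal visit.
    """
    previous = 0
    latencies = []
    for step in goal_steps:
        latencies.append(step - previous)
        previous = step
    return latencies


# Made-up episode in which the goal is reached at these time steps.
latencies = goal_latencies([100, 130, 160, 190])
first_latency = latencies[0]
later = latencies[1:]
mean_later_latency = sum(later) / len(later)
```

An agent with one-shot memory for the goal shows a first latency well above the mean of the later ones; an agent without such memory, like the feedforward baseline, would show roughly equal latencies throughout.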
We have already touched on the relationship between deep meta-RL and pioneering work by Hochreiter et al. (2001) using recurrent networks to perform meta-learning in the setting of full supervision (see also Prokhorov et al., 2002; Cotter and Conwell, 1990; Younger et al., 1999). That approach was recently extended in Santoro et al. (2016), which demonstrated the utility of leveraging an external memory structure. The idea of crossing meta-learning with reinforcement learning has been previously discussed by Schmidhuber et al. (1996). That work, which appears to have introduced the term “meta-RL,” differs from ours in that it did not involve a neural network implementation. More recently, however, there has been a surge of interest in using neural networks to learn optimization procedures, using a range of innovative meta-learning techniques (Li and Malik, 2016; Zoph and Le, 2016; Andrychowicz et al., 2016; Chen et al., 2016). Recent work by Chen et al. (2016) is particularly close in spirit to the work we have presented here, and can be viewed as treating the case of “infinite bandits” using a meta-learning strategy broadly analogous to the one we have pursued.
The present research also bears a close relationship with a different body of recent work that has not been framed in terms of meta-learning. A number of studies have used deep RL to train recurrent neural networks on navigation tasks, where the structure of the task (e.g., goal location or maze configuration) varies across episodes (Mirowski et al., 2016; Jaderberg et al., 2016). The final experiment that we presented above, drawn from Mirowski et al. (2016), is one example. To the extent that such experiments involve the key ingredients of deep meta-RL – a neural network with memory, trained through RL on a series of interrelated tasks – they are almost certain to involve the kind of meta-learning we have described in the present work. This related work provides an indication that meta-RL can be fruitfully applied to larger scale problems than the ones we have studied in our own experiments. Importantly, it indicates that a key ingredient in scaling the approach may be to incorporate memory mechanisms beyond those inherent in unstructured recurrent neural networks (see Mirowski et al., 2016; Santoro et al., 2016; Graves et al., 2016; Weston et al., 2014). Our work, for its part, suggests that there is untapped potential in deep recurrent RL agents to meta-learn quite abstract aspects of task structure, and to discover strategies that exploit such structure toward rapid, flexible adaptation.
During completion of the present research, closely related work was reported by Duan et al. (2016). Like us, Duan and colleagues use deep RL to train a recurrent network on a series of interrelated tasks, with the result that the network dynamics learn a second RL procedure which operates on a faster time-scale than the original algorithm. They compare the performance of these learned procedures against conventional RL algorithms in a number of domains, including bandits and navigation. An important difference between this parallel work and our own is the former’s primary focus on relatively unstructured task distributions (e.g., uniformly distributed bandit problems and random MDPs); our main interest, in contrast, has been in structured task distributions (e.g., dependent bandits and the task introduced by Harlow, 1949), because it is in this setting where the system can learn a biased – and therefore efficient – RL procedure that exploits regular task structure. The two perspectives are, in this regard, nicely complementary.
A current challenge in artificial intelligence is to design agents that can adapt rapidly to new tasks by leveraging knowledge acquired through previous experience with related activities. In the present work we have reported initial explorations of what we believe is one promising avenue toward this goal. Deep meta-RL involves a combination of three ingredients: (1) Use of a deep RL algorithm to train a recurrent neural network, (2) a training set that includes a series of interrelated tasks, (3) network input that includes the action selected and reward received in the previous time interval. The key result, which emerges naturally from the setup rather than being specially engineered, is that the recurrent network dynamics learn to implement a second RL procedure, independent from and potentially very different from the algorithm used to train the network weights. Critically, this learned RL algorithm is tuned to the shared structure of the training tasks. In this sense, the learned algorithm builds in domain-appropriate biases, which can allow it to operate with greater efficiency than a general-purpose algorithm. This bias effect was particularly evident in the results of our experiments involving dependent bandits (sections 3.1.2 and 3.1.3), where the system learned to take advantage of the task’s covariance structure; and in our study of Harlow’s animal learning task (section 3.2.2), where the recurrent network learned to exploit the task’s structure in order to display one-shot learning with complex novel stimuli.
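The third ingredient, network input that includes the previous action and reward, can be sketched as below. The helper name, the one-hot action encoding, and the all-zero action vector on the first step are assumptions made for illustration; the text above specifies only that the previous action and reward are part of the input.

```python
def build_input(observation, prev_action, prev_reward, n_actions):
    """Concatenate observation features, previous action, and previous reward.

    observation: sequence of floats describing the current observation
    prev_action: index of the action taken on the previous step, or None
    prev_reward: reward received on the previous step
    """
    action_vec = [0.0] * n_actions
    if prev_action is not None:      # no previous action on the first step
        action_vec[prev_action] = 1.0
    return list(observation) + action_vec + [float(prev_reward)]


# First step of an episode: nothing has been done or received yet.
x0 = build_input([0.5, -0.2], None, 0.0, n_actions=3)
# Later step: the agent previously took action 2 and received reward 1.
x1 = build_input([0.1, 0.4], 2, 1.0, n_actions=3)
```

Because the recurrent network sees its own action and the resulting reward at every step, its hidden state can accumulate the action-outcome history on which the learned RL procedure operates.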
One of our experiments (section 3.2.1) illustrated the point that a system trained using a model-free RL algorithm can develop behavior that emulates model-based control. A few further comments on this result are warranted. As noted in our presentation of the simulation results, the pattern of choice behavior displayed by the network has been considered in the cognitive and neuroscience literatures as reflecting model-based control or tree search. However, as has been remarked in very recent work, the same pattern can arise from a model-free system with an appropriate state representation (Akam et al., 2015). Indeed, we suspect this may be how our network in fact operates. However, other findings suggest that a more explicitly model-based control mechanism can emerge when a similar system is trained on a more diverse set of tasks. In particular, Ilin et al. (2007) showed that recurrent networks trained on random mazes can approximate dynamic programming procedures (see also Silver et al., 2017; Tamar et al., 2016). At the same time, as we have stressed, we consider it an important aspect of deep meta-RL that it yields a learned RL algorithm that capitalizes on invariances in task structure. As a result, when faced with widely varying but still structured environments, deep meta-RL seems likely to generate RL procedures that occupy a grey area between model-free and model-based RL.
The two-step decision problem studied in Section 3.2.1 was derived from neuroscience, and we believe deep meta-RL may have important implications in that arena (Wang et al., 2017). The notion of meta-RL has been discussed previously in neuroscience but only in a narrow sense, according to which meta-learning adjusts scalar hyperparameters such as the learning rate or softmax inverse temperature (Lee and Wang, 2009; Schweighofer and Doya, 2003; Soltani et al., 2006; Khamassi et al., 2011; Kobayashi et al., 2009; Khamassi et al., 2013). In recent work (Wang et al., 2017) we have shown that deep meta-RL can account for a wider range of experimental observations, providing an integrative framework for understanding the respective roles of dopamine and the prefrontal cortex in biological reinforcement learning.
We would like to thank the following colleagues for useful discussion and feedback: Nando de Freitas, David Silver, Koray Kavukcuoglu, Daan Wierstra, Demis Hassabis, Matt Hoffman, Piotr Mirowski, Andrea Banino, Sam Ritter, Neil Rabinowitz, Peter Dayan, Peter Battaglia, Alex Lerchner, Tim Lillicrap and Greg Wayne.
Midbrain dopamine neurons signal preference for advance information about upcoming rewards. Neuron, 63(1):119–126, 2009.