Deep reinforcement learning methods attain super-human performance in a wide range of environments. However, such methods are grossly inefficient, often taking orders of magnitude more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general purpose deep reinforcement learning agents.
Deep reinforcement learning agents have achieved state-of-the-art results in a variety of complex environments (Mnih et al., 2015, 2016), often surpassing human performance (Silver et al., 2016). Although the final performance of these agents is impressive, these techniques usually require several orders of magnitude more interactions with their environment than a human in order to reach an equivalent level of expected performance. For example, in the Atari 2600 set of environments (Bellemare et al., 2013), deep Q-networks (Mnih et al., 2016) require more than 200 hours of gameplay in order to achieve scores similar to those a human player achieves after two hours (Lake et al., 2016).
The glacial learning speed of deep reinforcement learning has several plausible explanations and in this work we focus on addressing these:
2. Environments with a sparse reward signal can be difficult for a neural network to model as there may be very few instances where the reward is non-zero. This can be viewed as a form of class imbalance where low-reward samples outnumber high-reward samples by an unknown number. Consequently, the neural network disproportionately underperforms at predicting larger rewards, making it difficult for an agent to take the most rewarding actions.
3. Reward signal propagation by value-bootstrapping techniques, such as Q-learning, results in reward information being propagated one step at a time through the history of previous interactions with the environment. This can be fairly efficient if updates happen in the reverse of the order in which the transitions occur. However, in order to train on uncorrelated minibatches, DQN-style algorithms train on randomly selected transitions and, in order to further stabilise training, require the use of a slowly updating target network, which slows down reward propagation further.
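As a minimal illustration of the third point, the sketch below counts how many passes of one-step backups are needed before a reward at the end of a chain reaches the start state. The chain environment, the learning rate of 1 and the function name are assumptions made for illustration; they are not part of the experiments in this paper.

```python
def sweeps_until_propagated(n_states, order):
    """Count sweeps of one-step backups until the start state sees the reward.

    Toy chain (hypothetical): state i moves to state i + 1, and only the
    final transition yields reward 1. `order` is "reverse" or "forward".
    """
    gamma = 1.0  # no discounting, to keep the arithmetic exact
    values = [0.0] * (n_states + 1)  # values[n_states] is the terminal state
    transitions = list(range(n_states))  # transition i: state i -> state i + 1
    if order == "reverse":
        transitions.reverse()
    sweeps = 0
    while values[0] == 0.0:
        for i in transitions:
            reward = 1.0 if i == n_states - 1 else 0.0
            # One-step bootstrap backup with learning rate 1.
            values[i] = reward + gamma * values[i + 1]
        sweeps += 1
    return sweeps
```

Under these assumptions, updating in reverse order carries the reward to the start state in a single sweep, whereas updating in forward order needs one sweep per state in the chain. Randomly ordered updates fall between the two.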
In this work we shall focus on addressing the three concerns listed above; we must note, however, that other recent advances in exploration (Osband et al., 2016), hierarchical reinforcement learning (Vezhnevets et al., 2016) and transfer learning (Fernando et al., 2017) also make substantial contributions to improving data efficiency in deep reinforcement learning over baseline agents.
In this paper we propose Neural Episodic Control (NEC), a method which tackles the limitations of deep reinforcement learning listed above and demonstrates dramatic improvements on the speed of learning for a wide range of environments. Critically, our agent is able to rapidly latch onto highly successful strategies as soon as they are experienced, instead of waiting for many steps of optimisation (e.g., stochastic gradient descent) as is the case with DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016).
Our work is in part inspired by the hypothesised role of the Hippocampus in decision making (Lengyel & Dayan, 2007; Blundell et al., 2016) and also by recent work on one-shot learning (Vinyals et al., 2016) and learning to remember rare events with neural networks (Kaiser et al., 2016). Our agent uses a semi-tabular representation of its experience of the environment possessing several of the features of episodic memory such as long term memory, sequentiality, and context-based lookups. The semi-tabular representation is an append-only memory that binds slow-changing keys to fast updating values and uses a context-based lookup on the keys to retrieve useful values during action selection by the agent. Thus the agent’s memory operates in much the same way that traditional table-based RL methods map from state and action to value estimates. A unique aspect of the memory in contrast to other neural memory architectures for reinforcement learning (explained in more detail in Section 3) is that the values retrieved from the memory can be updated much faster than the rest of the deep neural network. This helps alleviate the typically slow weight updates of stochastic gradient descent applied to the whole network and is reminiscent of work on fast weights (Ba et al., 2016; Hinton & Plaut, 1987), although the architecture we present is quite different. Another unique aspect of the memory is that unlike other memory architectures such as LSTM and the differentiable neural computer (DNC; Graves et al., 2016), our architecture does not try to learn when to write to memory, as this can be slow to learn and take a significant amount of time. Instead, we elect to write all experiences to the memory, and allow it to grow very large compared to existing memory architectures (in contrast to Oh et al. (2015); Graves et al. (2016) where the memory is wiped at the end of each episode). 
Reading from this large memory is made efficient using kd-tree-based nearest-neighbour search (Bentley, 1975).
The remainder of the paper is organised as follows: in Section 2 we review deep reinforcement learning, in Section 3 the Neural Episodic Control algorithm is described, in Section 4 we report experimental results in the Arcade Learning Environment, in Section 5 we discuss other methods that use memory for reinforcement learning, and finally in Section 6 we outline future work and summarise the main advantages of the NEC algorithm.
The action-value function of a reinforcement learning agent (Sutton & Barto, 1998) is defined as $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_t \gamma^t r_t \mid s, a\right]$, where $a$ is the initial action taken by the agent in the initial state $s$ and the expectation denotes that the policy $\pi$ is followed thereafter. The discount factor $\gamma \in (0, 1)$ trades off favouring short vs. long term rewards.
The agent executes an $\epsilon$-greedy policy based upon this value function to trade off exploration and exploitation: with probability $\epsilon$ the agent picks an action uniformly at random, otherwise it picks the action $a_t = \arg\max_a Q(s_t, a)$.
In DQN, the action-value function $Q(s_t, a_t)$ is parameterised by a convolutional neural network that takes a 2D pixel representation of the state $s_t$ as input, and outputs a vector containing the value of each action at that state. When the agent observes a transition, DQN stores the $(s_t, a_t, r_t, s_{t+1})$ tuple in a replay buffer, the contents of which are used for training. This neural network is trained by minimising the squared error between the network’s output and the $Q$-learning target $y_t = r_t + \gamma \max_a \tilde{Q}(s_{t+1}, a)$, for a subset of transitions sampled at random from the replay buffer. The target network $\tilde{Q}(s_{t+1}, a)$ is an older version of the value network that is updated periodically. The use of a target network and uncorrelated samples from the replay buffer are critical for stable training.
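The action-selection rule and the training target described above can be sketched in a few lines. This is a minimal sketch rather than the DQN implementation: action values are passed in as plain lists instead of network outputs, and the function names and the `terminal` flag (which drops the bootstrap term at the end of an episode) are assumptions introduced for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_target, gamma, terminal=False):
    """One-step target r + gamma * max_a Q_target(s', a).

    `next_q_target` holds the target network's values for the next state.
    The `terminal` flag is an assumption of this sketch.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)


def squared_error(prediction, target):
    """Per-transition loss between the network output and the target."""
    return (prediction - target) ** 2
```

With `epsilon=0` the policy is purely greedy, which makes the behaviour easy to check by hand.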
A number of extensions have been proposed that improve DQN. Double DQN (Van Hasselt et al., 2016) reduces bias on the target calculation. Prioritised Replay (Schaul et al., 2015b) further improves Double DQN by optimising the replay strategy. Several authors have proposed methods of improving reward propagation and the back up mechanism of learning (Harutyunyan et al., 2016; Munos et al., 2016; He et al., 2016) by incorporating on-policy rewards or by adding constraints to the optimisation. $Q^*(\lambda)$ (Harutyunyan et al., 2016) and Retrace($\lambda$) (Munos et al., 2016) change the form of the Q-learning target to incorporate on-policy samples and fluidly switch between on-policy learning and off-policy learning. Munos et al. (2016) show that incorporating on-policy samples allows an agent to learn faster in Atari environments, indicating that reward propagation is indeed a bottleneck to efficiency in deep reinforcement learning.
A3C (Mnih et al., 2016) is another well known deep reinforcement learning algorithm that is very different from DQN. It is based upon a policy gradient, and learns both a policy and its associated value function, which is learned entirely on-policy (similar to the $\lambda = 1$ case of $Q(\lambda)$). Interestingly, Mnih et al. (2016) also added an LSTM memory to the otherwise convolutional neural network architecture to give the agent a notion of memory, although this did not have significant impact on the performance on Atari games.
Our agent consists of three components: a convolutional neural network that processes pixel images $s$, a set of memory modules (one per action), and a final network that converts read-outs from the action memories into $Q(s, a)$ values. For the convolutional neural network we use the same architecture as DQN (Mnih et al., 2015).
For each action $a \in \mathcal{A}$, NEC has a simple memory module $M_a = (K_a, V_a)$, where $K_a$ and $V_a$ are dynamically sized arrays of vectors, each containing the same number of vectors. The memory module acts as an arbitrary association from keys to corresponding values, much like the dictionary data type found in programs. Thus we refer to this kind of memory module as a differentiable neural dictionary (DND). There are two operations possible on a DND: lookup and write, as depicted in Figure 1.
Performing a lookup on a DND maps a key $h$ to an output value $o$:
$$o = \sum_i w_i v_i, \tag{1}$$
where $v_i$ is the $i$th element of the array $V_a$ and
$$w_i = \frac{k(h, h_i)}{\sum_j k(h, h_j)}, \tag{2}$$
where $h_i$ is the $i$th element of the array $K_a$ and $k(x, y)$ is a kernel between vectors $x$ and $y$, e.g., Gaussian or inverse kernels. Thus the output of a lookup in a DND is a weighted sum of the values in the memory, whose weights are given by normalised kernels between the lookup key and the corresponding key in memory. To make queries into very large memories scalable we shall make two approximations in practice: firstly, we shall limit (1) to the top $p$-nearest neighbours (typically $p = 50$). Secondly, we use an approximate nearest neighbours algorithm to perform the lookups, based upon kd-trees (Bentley, 1975).
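The lookup can be sketched as follows. This is a minimal sketch under several assumptions: values are scalars rather than vectors, the kernel is an inverse-distance kernel with an illustrative smoothing constant `delta`, `p` defaults to 50 as an illustrative choice, and the nearest neighbours are found by an exact sort rather than an approximate kd-tree search. The function names are hypothetical.

```python
def inverse_distance_kernel(x, y, delta=1e-3):
    """k(x, y) = 1 / (||x - y||^2 + delta); delta is an illustrative constant."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 / (squared_distance + delta)


def dnd_lookup(keys, values, query, p=50, kernel=inverse_distance_kernel):
    """Return the kernel-weighted sum of the values of the p most similar keys."""
    if not keys:
        raise ValueError("cannot look up in an empty memory")
    similarities = [kernel(query, key) for key in keys]
    # Exact top-p selection; at scale a kd-tree would approximate this step.
    nearest = sorted(
        range(len(keys)), key=lambda i: similarities[i], reverse=True
    )[:p]
    normaliser = sum(similarities[i] for i in nearest)
    return sum(similarities[i] / normaliser * values[i] for i in nearest)
```

A query that lies midway between two stored keys returns the mean of their values, and restricting the lookup to `p=1` returns the value of the single nearest key.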
After a DND is queried, a new key-value pair is written into the memory. The key written corresponds to the key that was looked up. The associated value is application-specific (below we specify the update for the NEC agent). Writes to a DND are append-only: keys and values are written to the memory by appending them onto the end of the arrays $K_a$ and $V_a$ respectively. If a key already exists in the memory, then its corresponding value is updated, rather than being duplicated.
Figure 2 shows a DND as part of the NEC agent for a single action, whilst Algorithm 1 describes the general outline of the NEC algorithm. The pixel state $s$ is processed by a convolutional neural network to produce a key $h$. The key $h$ is then used to look up a value from the DND, yielding weights $w_i$ in the process for each element of the memory arrays. Finally, the output is a weighted sum of the values in the DND. The values in the DND, in the case of an NEC agent, are the $Q$ values corresponding to the state that originally resulted in the corresponding key-value pair being written to the memory. Thus this architecture produces an estimate of $Q(s, a)$ for a single given action $a$. The architecture is replicated once for each action $a$ the agent can take, with the convolutional part of the network shared among all DNDs $M_a$. The NEC agent acts by taking the action with the highest $Q$-value estimate at each time step. In practice, we use an $\epsilon$-greedy policy during training with a low $\epsilon$.
As an NEC agent acts, it continually adds new key-value pairs to its memory. Keys are appended to the memory of the corresponding action, taking the value of the query key $h$ encoded by the convolutional neural network. We now turn to the question of an appropriate corresponding value. In Blundell et al. (2016), Monte Carlo returns were written to memory. We found that a mixture of Monte Carlo returns (on-policy) and off-policy backups worked better and so for NEC we elect to use $N$-step $Q$-learning as in Mnih et al. (2016) (see also Watkins, 1989; Peng & Williams, 1996). This adds the following $N$ on-policy rewards and bootstraps the sum of discounted rewards for the rest of the trajectory, off-policy. The $N$-step $Q$-value estimate is then
$$Q^{(N)}(s_t, a) = \sum_{j=0}^{N-1} \gamma^j r_{t+j} + \gamma^N \max_{a'} Q(s_{t+N}, a'). \tag{3}$$
The bootstrap term of (3), $\max_{a'} Q(s_{t+N}, a')$, is found by querying all memories $M_a$ for each action $a$ and taking the highest estimated $Q$-value returned. Note that the earliest such values can be added to memory is $N$ steps after a particular $(s, a)$ pair occurs.
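The $N$-step estimate can be computed directly from a window of rewards and the per-action value estimates at the state reached after the window. The sketch below assumes these are supplied as plain lists; the function name is hypothetical.

```python
def n_step_q_estimate(rewards, bootstrap_q_values, gamma):
    """Sum of N discounted on-policy rewards plus the discounted bootstrap term.

    rewards: the N rewards observed from step t onwards.
    bootstrap_q_values: one value estimate per action at the state reached
        after those N steps.
    """
    horizon = len(rewards)
    discounted_rewards = sum(gamma ** j * r for j, r in enumerate(rewards))
    return discounted_rewards + gamma ** horizon * max(bootstrap_q_values)
```

For example, with rewards `[1, 0, 2]`, a discount of 0.5 and bootstrap values `[4, 8]`, the reward part is 1 + 0 + 0.25 · 2 = 1.5 and the bootstrap part is 0.125 · 8 = 1, giving 2.5.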
When a state-action value is already present in a DND (i.e. the exact same key $h$ is already in $K_a$), the corresponding value present in $V_a$, $Q_i$, is updated in the same way as the classic tabular $Q$-learning algorithm:
$$Q_i \leftarrow Q_i + \alpha \left(Q^{(N)}(s, a) - Q_i\right), \tag{4}$$
where $\alpha$ is the learning rate of the $Q$ update. If the state is not already present, $Q^{(N)}(s_t, a)$ is appended to $V_a$ and $h$ is appended to $K_a$. Note that our agent learns the value function in much the same way that a classic tabular $Q$-learning agent does, except that the $Q$-table grows with time. We found that $\alpha$ could take on a high value, allowing repeatedly visited states with a stable representation to rapidly update their value function estimate. Additionally, batching up memory updates (e.g., at the end of the episode) helps with computational performance. We overwrite the item that has least recently shown up as a neighbour when we reach the memory’s maximum capacity.
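The write rule, the exact-match update and the overwrite policy can be sketched together. This is a minimal sketch, not the implementation used in our experiments: the class and method names are hypothetical, keys are stored as tuples so that exact matches can be found with a dictionary, and recency is refreshed on every write as well as on explicit `touch` calls, which a caller would make for each neighbour returned by a lookup.

```python
class DictionaryMemory:
    """Toy key-value memory with exact-match updates and LRU overwriting."""

    def __init__(self, capacity, alpha):
        self.capacity = capacity
        self.alpha = alpha    # fast learning rate of the tabular update
        self.keys = []        # stored keys, as tuples
        self.values = []      # one value estimate per key
        self.last_used = []   # clock value at which each slot was last used
        self.index = {}       # key -> slot, for exact-match detection
        self.clock = 0

    def touch(self, slot):
        """Mark a slot as recently used, e.g. when it is returned as a neighbour."""
        self.clock += 1
        self.last_used[slot] = self.clock

    def write(self, key, target):
        key = tuple(key)
        if key in self.index:
            slot = self.index[key]
            # Tabular update: move the stored value towards the new target.
            self.values[slot] += self.alpha * (target - self.values[slot])
        elif len(self.keys) < self.capacity:
            slot = len(self.keys)
            self.keys.append(key)
            self.values.append(target)
            self.last_used.append(0)
            self.index[key] = slot
        else:
            # Memory full: overwrite the least recently used slot.
            slot = min(range(self.capacity), key=lambda i: self.last_used[i])
            del self.index[self.keys[slot]]
            self.keys[slot] = key
            self.values[slot] = target
            self.index[key] = slot
        self.touch(slot)
        return slot
```

With a capacity of two and `alpha=0.5`, writing the same key twice with targets 1 and 3 leaves the stored value at 2, and a third distinct key then replaces whichever slot was used least recently.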
Agent parameters are updated by minimising the $L_2$ loss between the predicted $Q$ value for a given action and the $Q^{(N)}$ estimate on randomly sampled mini-batches from a replay buffer. In particular, we store tuples $(s_t, a_t, R_t)$ in the replay buffer, where $N$ is the horizon of the $N$-step $Q$ rule, and $R_t = Q^{(N)}(s_t, a)$ plays the role of the target network seen in DQN (our replay buffer is significantly smaller than DQN’s). These $(s_t, a_t, R_t)$-tuples are then sampled uniformly at random to form minibatches for training. Note that the architecture in Figure 2 is entirely differentiable and so we can minimise this loss by gradient descent. Backpropagation updates the weights and biases of the convolutional embedding network and the keys and values of each action-specific memory using gradients of this loss, using a lower learning rate than is used for updating pairs after queries ($\alpha$).
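A bounded replay buffer with uniform sampling can be sketched as follows. This is a minimal sketch that covers only storage, sampling and the loss value; the gradient step itself would be carried out by an automatic differentiation library and is omitted. The class name, the `seed` parameter and the helper `mean_squared_error` are assumptions introduced for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps the most recent (state, action, n_step_return) tuples."""

    def __init__(self, capacity, seed=None):
        self.items = deque(maxlen=capacity)  # oldest entries fall off the front
        self.rng = random.Random(seed)

    def add(self, state, action, n_step_return):
        self.items.append((state, action, n_step_return))

    def sample(self, batch_size):
        """Draw a minibatch uniformly at random, without replacement."""
        batch_size = min(batch_size, len(self.items))
        return self.rng.sample(list(self.items), batch_size)


def mean_squared_error(predictions, targets):
    """Average squared difference between predicted values and stored returns."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```

Because the stored return is computed when the tuple is written, no separate target network needs to be evaluated when a minibatch is drawn.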
We investigated whether neural episodic control allows for more data efficient learning in practice in complex domains. As a problem domain we chose the Arcade Learning Environment (ALE; Bellemare et al., 2013). We tested our method on the 57 Atari games used by Schaul et al. (2015a), which form an interesting set of tasks as they contain diverse challenges such as sparse rewards and vastly different magnitudes of scores across games. Most common algorithms applied in these domains, such as variants of DQN and A3C, require on the order of thousands of hours of in-game time, i.e. they are data inefficient.
We consider 5 variants of A3C and DQN as baselines as well as MFEC (Blundell et al., 2016). We compare to the basic implementations of A3C (Mnih et al., 2016) and DQN (Mnih et al., 2015). We also compare to two algorithms incorporating $\lambda$ returns (Sutton, 1988) aiming at more data efficiency by faster propagation of credit assignments, namely $Q^*(\lambda)$ (Harutyunyan et al., 2016) and Retrace($\lambda$) (Munos et al., 2016). We also compare to DQN with Prioritised Replay, which improves data efficiency by replaying more salient transitions more frequently. We did not directly compare to DRQN (Hausknecht & Stone, 2015) nor FRMQN (Oh et al., 2016) as results were not available for all Atari games. Note that in the case of DRQN, reported performance is lower than that of Prioritised Replay.
All algorithms were trained using discount rate $\gamma = 0.99$, except MFEC that uses $\gamma = 1$. In our implementation of MFEC we used random projections as an embedding function, since in the original publication they obtained better performance on the Atari games tested.
In terms of hyperparameters for NEC, we chose the same convolutional architecture as DQN, and store up to $5 \times 10^5$ memories per action. We used the RMSProp algorithm (Tieleman & Hinton, 2012) for gradient descent training. We apply the same preprocessing steps as Mnih et al. (2015), including repeating each action four times. For the $N$-step $Q$ estimates we picked a horizon of $N = 100$. Our replay buffer stores only the last $10^5$ states (as opposed to $10^6$ for DQN) observed and their $N$-step $Q$ estimates. We do one replay update for every 16 observed frames with a minibatch of size 32. We set the number of nearest neighbours $p = 50$ in all our experiments. For the kernel function we chose a function that interpolates between the mean for short distances and weighted inverse distance for large distances, more precisely:
$$k(h, h_i) = \frac{1}{\|h - h_i\|_2^2 + \delta}.$$
Intuitively, when all neighbours are far away we want to avoid putting all weight onto one data point. A Gaussian kernel, for example, would exponentially suppress all neighbours except for the closest one. The kernel we chose has the advantage of having heavy tails. This makes the algorithm more robust and we found it to be less sensitive to kernel hyperparameters. We set $\delta = 10^{-3}$.
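The effect of heavy tails on the normalised weights can be checked numerically. The sketch below assumes an inverse-distance kernel of the form $1 / (d^2 + \delta)$ and compares it with a Gaussian kernel; the Gaussian bandwidth, the value of `delta` and the example distances are illustrative assumptions, and the function names are hypothetical.

```python
import math


def normalised_weights(squared_distances, kernel):
    """Kernel scores divided by their sum, as in the lookup weights."""
    scores = [kernel(d2) for d2 in squared_distances]
    total = sum(scores)
    return [s / total for s in scores]


def inverse_kernel(d2, delta=1e-3):
    """Heavy-tailed kernel: decays polynomially with squared distance d2."""
    return 1.0 / (d2 + delta)


def gaussian_kernel(d2, bandwidth=1.0):
    """Light-tailed kernel: decays exponentially with squared distance d2."""
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


# Three neighbours that are all far from the query (squared distances).
far_neighbours = [100.0, 121.0, 144.0]
inverse_weights = normalised_weights(far_neighbours, inverse_kernel)
gaussian_weights = normalised_weights(far_neighbours, gaussian_kernel)

# Two neighbours much closer than sqrt(delta): weights approach the plain mean.
close_weights = normalised_weights([1e-6, 2e-6], inverse_kernel)
```

Under these assumptions the Gaussian kernel places almost all of the weight on the closest of the three far neighbours, while the inverse kernel spreads it across all three. For very close neighbours the inverse kernel gives nearly equal weights, which corresponds to taking the mean.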
In order to tune the remaining hyperparameters (SGD learning-rate, fast-update learning-rate $\alpha$ in Equation 4, dimensionality of the embeddings, $Q^{(N)}$ in Equation 3, and $\epsilon$-greedy exploration-rate) we ran a hyperparameter sweep on six games: Beam Rider, Breakout, Pong, Q*Bert, Seaquest and Space Invaders. We picked the hyperparameter values that performed best on the median for this subset of games (a common cross validation procedure described by Bellemare et al. (2013), and adhered to by Mnih et al. (2015)).
[Tables 1 and 2. Columns: Frames, Nature DQN, $Q^*(\lambda)$, Retrace($\lambda$), Prioritised Replay, A3C, NEC, MFEC.]
Data efficiency results are summarised in Table 1. In the small data regime (less than 20 million frames) NEC clearly outperforms all other algorithms. The difference is especially pronounced before 5 million frames have been observed. Only at 40 million frames does DQN with Prioritised Replay outperform NEC on average; note that this corresponds to 185 hours of gameplay.
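The conversion from frames to hours of gameplay is simple arithmetic. The sketch below assumes the emulator runs at 60 frames per second; the constant and function name are introduced for illustration.

```python
FRAMES_PER_SECOND = 60  # assumed Atari 2600 emulation rate


def frames_to_hours(frames, fps=FRAMES_PER_SECOND):
    """Convert a number of emulator frames to hours of real-time gameplay."""
    seconds = frames / fps
    return seconds / 3600.0
```

Under this assumption, 40 million frames come to roughly 185 hours.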
In order to provide a more detailed picture of NEC’s performance, Figures 3 to 7 show learning curves on 7 games (Alien, Bowling, Boxing, Frostbite, H.E.R.O., Ms. Pac-Man, Pong), where several stereotypical cases of NEC’s performance can be observed. All learning curves show the average performance over 5 different initial random seeds. We evaluate MFEC and NEC every 200,000 frames, and the other algorithms are evaluated every million steps.
Across most games, NEC is significantly faster at learning in the initial phase (see also Table 1), only comparable to MFEC, which also uses an episodic-like $Q$-function.
NEC also outperforms MFEC on average (see Table 2). In contrast with MFEC, NEC uses the reward signal to learn an embedding adequate for value interpolation. This difference is especially significant in games where a few pixels determine the value of each action. The simpler version of MFEC uses an approximation to $L_2$ distances in pixel-space by means of random projections, and cannot focus on the small but most relevant details. Another version of MFEC calculated distances on the latent representation of a variational autoencoder (Kingma & Welling, 2013) trained to model frames. This latent representation does not depend on rewards and will be subject to irrelevant details like, for example, the display of the current score.
A3C, DQN and related algorithms require rewards to be clipped to the range $[-1, 1]$ for training stability (Mnih et al., 2015). (See Pop-Art (van Hasselt et al., 2016) for a DQN-like algorithm that does not require reward clipping; NEC also outperforms Pop-Art.) NEC and MFEC do not require reward clipping, which results in qualitative changes in behaviour and better performance relative to other algorithms on games requiring clipping (Bowling, Frostbite, H.E.R.O., Ms. Pac-Man and Alien, out of the seven shown).
Alien and Ms. Pac-Man both involve controlling a character, where there is an easy way to collect small rewards by collecting items of which there are plenty, while avoiding enemies, which are invulnerable to the agent. On the other hand the agent can pick up a special item making enemies vulnerable, allowing the agent to attack them and get significantly larger rewards than from collecting the small rewards. Agents trained using existing parametric methods tend to show little interest in this as clipping implies there is no difference between large and small rewards. Therefore, as NEC does not need reward clipping, it can strongly outperform other algorithms, since NEC is maximising the non-clipped score (the true score). This can also be seen when observing the agents play: parametric methods will tend to collect small rewards, while NEC will try to actively make the enemies vulnerable and attack them to get large rewards.
NEC also outperforms the other algorithms on Pong and Boxing where reward clipping does not affect any of the algorithms as all original rewards are in the range $[-1, 1]$; as can be expected, NEC does not outperform others in terms of maximally achieved score, but it is vastly more data efficient.
In Figure 10 we show a chart of human-normalised scores across all 57 Atari games at 10 million frames comparing to Prioritised Replay and MFEC. We rank the games independently for each algorithm, and on the y-axis the deciles are shown. NEC reaches human-level performance in about 25% of the games within 10 million frames, and it outperforms both MFEC and Prioritised Replay.
There has been much recent work on memory architectures for neural networks: LSTM (Hochreiter & Schmidhuber, 1997), DNC (Graves et al., 2016) and memory networks (Sukhbaatar et al., 2015; Miller et al., 2016). Recurrent neural network representations of memory (LSTMs and DNCs) are trained by truncated backpropagation through time, and are subject to the same slow learning of non-recurrent neural networks.
Some of these models have been adapted for use in RL agents: LSTMs (Bakker et al., 2003; Hausknecht & Stone, 2015), DNCs (Graves et al., 2016) and memory networks (Oh et al., 2016). However, the contents of these memories are typically reset at the beginning of every episode. This is appropriate when the goal of the memory is tracking previous observations in order to maximise rewards in partially observable or non-Markovian environments. Therefore, these implementations can be thought of as a type of working memory, and solve a different problem than the one addressed in this work.
RNNs can learn to quickly write highly rewarding states into memory and may even be able to learn entire reinforcement learning algorithms (Wang et al., 2016; Duan et al., 2016). However, doing so can take an arbitrarily long time and the learning time likely scales strongly with the complexity of the task.
The work of Oh et al. (2016) is also reminiscent of the ideas presented here. They introduced (FR)MQN, an adaptation of memory networks used in the top layers of a $Q$-network.
Kaiser et al. (2016) introduced a differentiable layer of key-value pairs that can be plugged into a neural network. This layer uses cosine similarity to calculate a weighted average of the values associated with the $k$ most similar memories. Their use of a moving average update rule is reminiscent of the one presented in Section 3. The authors reported results on a set of supervised tasks; however, they did not consider applications to reinforcement learning.
Other deep RL methods keep a history of previous experience. Indeed, DQN itself has an elementary form of memory: the replay buffer central to its stable training can be viewed as a memory that is frequently replayed to distil the contents into DQN’s value network. Kumaran et al. (2016) suggest that training on replayed experiences from the replay buffer in DQN is similar to the replay of experiences from episodic memory during sleep in animals. DQN’s replay buffer differs from most other work on memory for deep reinforcement learning in its sheer scale: it is common for DQN’s replay buffer to hold millions of tuples.
The use of local regression techniques for $Q$-function approximation has been suggested before: Santamaría et al. (1997) proposed the use of k-nearest-neighbours regression with a heuristic for adding memories based on the distance to previous memories. Munos & Moore (1998) proposed barycentric interpolators to model the value function and proved their convergence to the optimal value function under mild conditions, but no empirical results were presented. Gabel & Riedmiller (2005) also suggested the use of local regression, under the paradigm of case-based reasoning that included heuristics for the deletion of stored cases. Blundell et al. (2016, MFEC) recently used local regression for $Q$-function estimation using the mean of the k-nearest neighbours, except in the case of an exact match of the query point, in which case the stored value was returned. They also proposed the use of the latent variable obtained from a variational autoencoder (Rezende et al., 2014) as an embedding space, but showed that random projections often obtained better results. In contrast with the ideas presented here, none of the aforementioned local-regression work uses the reward signal to learn an embedding space of covariates in which to perform the local regression. We learn this embedding space using temporal-difference learning; a crucial difference, as we showed in the experimental comparison to MFEC.
We have proposed Neural Episodic Control (NEC): a deep reinforcement learning agent that learns significantly faster than other baseline agents on a wide range of Atari 2600 games. At the core of NEC is a memory structure: a Differentiable Neural Dictionary (DND), one for each potential action. NEC inserts recent state representations paired with corresponding value functions into the appropriate DND.
Our experiments show that NEC requires an order of magnitude fewer interactions with the environment than agents previously proposed for data efficiency, such as Prioritised Replay (Schaul et al., 2015b) and Retrace($\lambda$) (Munos et al., 2016). We speculate that NEC learns faster through a combination of three features of the agent: the memory architecture (DND), the use of $N$-step $Q$ estimates, and a state representation provided by a convolutional neural network.
The memory architecture, DND, rapidly integrates recent experience (state representations and corresponding value estimates), allowing this information to be quickly incorporated into future behaviour. Such memories persist across many episodes, and we use a fast approximate nearest neighbour algorithm (kd-trees) to ensure that such memories can be efficiently accessed. Estimating $Q$-values by using the $N$-step $Q$ value function interpolates between Monte Carlo value estimates and backed up off-policy estimates. Monte Carlo value estimates reflect the rewards an agent is actually receiving, whilst backed up off-policy estimates should be more representative of the value function at the optimal policy, but evolve much more slowly. By using both estimates, NEC can trade off between these two estimation procedures and their relative strengths and weaknesses (speed of reward propagation vs optimality). Finally, by having a slow changing, stable representation provided by a convolutional neural network, keys stored in the DND remain relatively stable.
Our work suggests that non-parametric methods are a promising addition to the deep reinforcement learning toolbox, especially where data efficiency is paramount. In our experiments we saw that at the beginning of learning NEC outperforms other agents in terms of learning speed. We saw that later in learning Prioritised Replay has higher performance than NEC. We leave it to future work to further improve NEC so that its long term final performance is significantly superior to parametric agents. Another avenue of further research would be to apply the method discussed in this paper to a wider range of tasks such as visually more complex 3D worlds or real world tasks where data efficiency is of great importance due to the high cost of acquiring data.
The authors would like to thank Daniel Zoran, Dharshan Kumaran, Jane Wang, Dan Belov, Ruiqi Guo, Yori Zwols, Jack Rae, Andreas Kirsch, Peter Dayan, David Silver and many others at DeepMind for insightful discussions and feedback. We also thank Georg Ostrovski, Tom Schaul, and Hubert Soyer for providing baseline learning curves.
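Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 06 2013.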
International Conference on Machine Learning, 2016.
|Kung Fu Master||6634.6||17166.1||12906.5||30568.1||21456.2||13874.7||18065.8|
|Name This Game||2745.1||6380.2||4845.1||5532.0||7525.0||5378.5||5227.8|
|Wizard of Wor||876||401.4||12803.1||8480.7||1146.6||526.8||420.4|