Despite recent successes in reinforcement learning (RL) (Mnih2015, ; Silver2016, ), some important gulfs remain between many sophisticated reward-driven learning algorithms and the behavioural flexibility observed in biological agents. Humans and non-human animals seem especially adept at the efficient transfer of knowledge between different tasks, and at the adaptive reuse of successful past behaviours in new situations, an ability that has also sparked renewed interest in machine learning in recent years.
Several frameworks have been proposed to help move the two forms of learning closer together, by incorporating transfer capabilities into RL agents. Here we focus on two such ideas: factored representations, which give rise to behavioural flexibility by abstracting away general aspects of a task and combining them with novel elements later (Dayan2008, ), and nonparametric, or memory-based, approaches (Pritzel2017, ; Gershman2017, ; Botvinick2019, ; Lengyel2007, ) that may help transfer learning by reusing information in a principled way, based on the similarity between the agent’s current situation and situations observed in the past.
Our approach combines these ideas into a framework for flexible learning that explains some empirically observed neural signatures of transfer learning. In particular, we focus mostly (but see section 5) on a specific instance of the transfer learning problem, where the agent acts in an environment with fixed dynamics, but changing reward function or goal locations. This setting is especially useful for developing intuitions about how an algorithm balances sharing/retaining knowledge about the environment, while specializing its policy for the current task at hand, a central challenge in continual learning that has also been examined in terms of the stability-plasticity trade-off (Carpenter1987, ), or catastrophic forgetting (McCloskey1989, ; Ans1997, ).
Dayan’s SR (Dayan2008, ) is well-suited for transfer learning in settings with fixed dynamics, as the decomposition of value into representations of expected outcomes and corresponding rewards allows us to quickly recompute value functions under new reward settings. However, SR also suffers from important limitations when applied in transfer learning scenarios, as the representation of future states still implicitly encodes the current reward function through its dependence on the current behavioural policy, which was tuned to exploit these rewards. This can make it difficult to approximate the new optimal value function following a large change in the environment’s rewards, as states that are not en route to previous goals/rewards will be poorly represented. In such cases the agent will stick to visiting old reward locations that are no longer the most desirable, or take suboptimal routes to new rewards.
To overcome this limitation, we combine the SR representation with a nonparametric clustering of the space of tasks (in this case the space of possible reward functions), and compress the representation of policies for similar environments into common successor maps. We provide a simple approximation to the corresponding hierarchical inference problem by evaluating reward function similarity on a diffused reward representation, which allows us to link the policies of similar environments without imposing any limitations on the precision or entropy of the policy being executed on a specific task. This similarity-based policy recall, operating at the task level, allows us to outperform baselines and previous methods in simple navigation settings. Our approach also naturally handles unsignalled, smooth changes in reward functions, while also imposing reasonable limits on storage complexity. Further, the principles of our approach should readily extend to settings with different types of factorizations.
We also aim to build a learning system whose components are neuroscientifically grounded. Multiple parallel predictive representations in the brain have previously been proposed in the context of simple associative learning in amygdala and hippocampal circuits (Courville2004c, ; Madarasz2016, ), as well as in the framework of nonparametric clustering of experiences into latent contexts (Gershman2010, ). Simultaneous representation of dynamic and diverse reward averages has also been reported in the anterior cingulate cortex (Meder2017, ) and other cortical areas, and a representation of a probability distribution over latent contexts has been observed in the human orbitofrontal cortex (Chan2016, ). The hippocampus itself has long been regarded as serving both as a pattern separator and as an autoassociative network with attractor dynamics enabling pattern completion (Yassa2011, ; Rolls1996, ). This balance between generalizing over similar experiences and tasks (by compressing them into a shared representation) while also maintaining specialization is also a key feature of our proposed hippocampal maps. Finally, awake hippocampal replay has been shown to be important for spatial memory and performance in cases where animals have to adjust their choice of routes (Jadhav2012, ).
On the neurobiological level we thus aim to offer a framework that binds these ideas into a common representation, and links two putative, but disparate functions of the hippocampal formation: as a prospective map of space (Grieves2016, ; Brown2016, ; Stachenfeld2017, ), and as an efficient memory processing mechanism, in this case compressing experiences in a manner that helps optimal action choice. We simulate two different rodent spatial navigation tasks: in the first we show that our model gives insights into the fast, "flickering" remapping of hippocampal maps (Jezek2011, ; Kay2019, ), seen when rodents navigate to changing reward locations (Boccara2019, ; Dupret2013, ). In the second task, we provide a quantitative account of trajectory-dependent hippocampal representations (so-called splitter cells) (Grieves2016, ) during learning. Our model thus links together these phenomena as manifestations of a common underlying learning and control strategy.
2.1 Reinforcement Learning, Transfer and the Successor Representation
In RL problems an agent interacts with an environment by taking actions, and receiving observations and rewards. Formally, an MDP can be defined as the tuple $(\mathcal{S}, \mathcal{A}, p, R, \gamma)$, specifying a set of states $\mathcal{S}$, actions $\mathcal{A}$, the state transition dynamics $p(s' \mid s, a)$, a reward function $R(s, a, s')$, and the discount factor $\gamma \in [0, 1]$, that reduces the weight of rewards obtained further in the future. For the transfer learning setting, we will consider a family of MDPs with shared dynamics but different reward functions, where the reward functions themselves are determined by some stochastic process.
Instead of solving an MDP by applying dynamic programming to value functions, as in Q-learning (Watkins1989, ), it is possible for example to compute expected discounted sums over future state occupations as proposed by Dayan’s SR framework. Namely, the successor representation maintains an expectation over future state occupancies given a policy $\pi$:
$$M^{\pi}(s, a, s') = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}(s_{t+1} = s') \,\middle|\, s_0 = s,\ a_0 = a\right]$$
We will make the simplifying assumption that rewards are functions only of the arrival state $s'$. This allows us to represent value functions purely in terms of future state occupancies, rather than future state-action pairs, which is more in line with what is known about prospective representations in the hippocampus (Brown2016, ; Stachenfeld2017, ; ODoherty2004, ). Our proposed modifications to the representation, however, extend to successor features predicting state-action pairs as well.
In the tabular case, or if the rewards are linear in the state representation $\phi(s)$, the successor representation can be used to compute the action-value function $Q(s, a)$ exactly, given knowledge of the current reward weights $\mathbf{w}$:
$$Q^{\pi}(s, a) = \sum_{s'} M^{\pi}(s, a, s')\, w(s')$$
We can therefore apply the Bellman updates to this representation, as follows, and reap the benefits of policy iteration.
$$M(s_t, a_t, \cdot) \leftarrow M(s_t, a_t, \cdot) + \alpha\left[\phi(s_{t+1}) + \gamma\, M(s_{t+1}, a^{*}, \cdot) - M(s_t, a_t, \cdot)\right]$$
Here $a^{*} = \arg\max_{a} Q(s_{t+1}, a)$, $\alpha$ is the learning rate, and $\phi$ is a one-hot encoding of state.
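As an illustration of the tabular case, the following is a minimal sketch of a successor-map TD update and the resulting action values, written with plain Python lists. The array layout `M[s][a][s2]`, the function names, the greedy bootstrap action and the default `alpha` and `gamma` are assumptions made for this example; they are not taken from Algorithm 1.

```python
def make_sr(n_states, n_actions):
    """Successor map M[s][a][s2], initialised to zero."""
    return [[[0.0] * n_states for _ in range(n_actions)]
            for _ in range(n_states)]


def q_values(M, w, s):
    """Q(s, a) = sum over s2 of M[s][a][s2] * w[s2], for every action a."""
    return [sum(m * r for m, r in zip(row, w)) for row in M[s]]


def sr_td_update(M, w, s, a, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One TD step on the row M[s][a], bootstrapping from the greedy action."""
    n_states = len(w)
    if terminal:
        bootstrap = [0.0] * n_states
    else:
        q_next = q_values(M, w, s_next)
        a_next = max(range(len(q_next)), key=q_next.__getitem__)
        bootstrap = list(M[s_next][a_next])  # copy: the row may be M[s][a]
    for j in range(n_states):
        one_hot = 1.0 if j == s_next else 0.0  # phi(s_next)[j]
        target = one_hot + gamma * bootstrap[j]
        M[s][a][j] += alpha * (target - M[s][a][j])


# Three-state chain 0 -> 1 -> 2, a single action, reward only in state 2.
M = make_sr(n_states=3, n_actions=1)
w = [0.0, 0.0, 1.0]
sr_td_update(M, w, s=1, a=0, s_next=2, alpha=1.0, gamma=0.5, terminal=True)
sr_td_update(M, w, s=0, a=0, s_next=1, alpha=1.0, gamma=0.5)
print(M[0][0], q_values(M, w, 0))  # [0.0, 1.0, 0.5] [0.5]
```

Because value is the product of the map and the reward weights, changing `w` alone re-evaluates every state without touching `M`, which is the property that makes the representation attractive for transfer under fixed dynamics.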
3 Motivation and learning algorithm
Our algorithm, the Bayesian Successor Representation (BSR, Fig. 1a, Algorithm 1 in Supplementary Information) extends the successor temporal difference (TD) learning framework in the first instance by using multiple successor maps. The agent then maintains a belief distribution $\omega$ over these maps, and samples one at every step, according to these belief weights. This sampled map is used to generate an action, and receives TD updates, while the reward and observation signals are used to perform inference over $\omega$. Though the storage complexity of using several maps is clearly larger than that of using a single one, we limit the performing of TD updates (direct updates from the most recent step, or replay updates of context-specific past transitions) to the sampled map, such that the computational cost of the updates is unchanged relative to the use of a single map.
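The per-step sampling of a map can be sketched as follows. This is only an illustration under assumed names: `maps[k][s][a]` holds the successor row of map `k`, `belief` holds the belief weights over maps, and `epsilon` is an exploration rate; none of these identifiers come from the original algorithm listing.

```python
import random


def sample_map(belief, rng):
    """Draw a map index with probability proportional to its belief weight."""
    return rng.choices(range(len(belief)), weights=belief, k=1)[0]


def act(maps, belief, w, s, epsilon, rng):
    """Sample one successor map, then act epsilon-greedily on its Q-values."""
    k = sample_map(belief, rng)
    n_actions = len(maps[k][s])
    if rng.random() < epsilon:
        return k, rng.randrange(n_actions)
    q = [sum(m * r for m, r in zip(maps[k][s][a], w))
         for a in range(n_actions)]
    return k, max(range(n_actions), key=q.__getitem__)


# Two maps over one state with two actions and two successor states.
maps = [
    [[[1.0, 0.0], [0.0, 1.0]]],  # map 0: action 1 leads to the rewarded state
    [[[0.0, 1.0], [1.0, 0.0]]],  # map 1: action 0 leads to the rewarded state
]
w = [0.0, 1.0]
rng = random.Random(0)
print(act(maps, [1.0, 0.0], w, s=0, epsilon=0.0, rng=rng))  # (0, 1)
print(act(maps, [0.0, 1.0], w, s=0, epsilon=0.0, rng=rng))  # (1, 0)
```

Only the sampled map `k` would then receive the TD update for that step, which is why the update cost matches that of a single map.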
Our model approximates inference over a hierarchical, nonparametric mixture model of reward functions (Fig. 1b) with a simple amortization of the inference over the component distributions for average reward maps, which are gradually learned by gradient descent and without explicitly specifying a prior (or equivalently, specifying the base distribution). To motivate our use of reward space clustering, we illustrate the underlying intuition using simple tabular grid-world environments, which act in this case as adversarial embeddings when it comes to inferring task similarity from reward locations. Our aim is to transfer policies between environments where similar rewards/goals are near each other, without relying on model-based knowledge of the environment’s dynamics that can be hard to learn in general and can introduce errors.
Recently proposed approaches by Barreto et al. also adjudicate between multiple policies using successor features (Barreto2016, ; Barreto2019, ). They either directly evaluate the value functions of all stored policies, to find the overall largest action-value (an approach referred to as generalized policy iteration, or GPI), or also attempt to build a linear basis of the reward space by pre-learning a set of base tasks. This approach can, however, result in difficulties with selecting the right successor map, as it depends strongly on how sharply policies are tuned and which states they visit near important rewards. A sharply tuned policy for a previous reward setting with reward locations close to the reward locations of the current task could lose out to a broadly tuned, but mostly unhelpful competing policy. On the other hand, keeping all stored policies diffuse, or otherwise regularising them, can be costly, as it can hinder exploitation or the fast instantiation of optimal policies given new rewards. Similarly, constructing a set of base tasks (Barreto2019, ) can be difficult as it might require encountering a large number of tasks before successful transfer could be guaranteed, as demonstrated most simply in a tabular setting.
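The selection rule described for GPI, taking the largest action-value across all stored maps, can be sketched as below. The toy numbers are invented purely to illustrate how a sharply tuned map can lose to a diffuse one; the layout `maps[k][s][a]` and the function name are assumptions of this sketch, not code from Barreto et al.

```python
def gpi_action(maps, w, s):
    """Return (action, map index) with the largest Q-value across all maps."""
    best = None
    for k, M in enumerate(maps):
        for a, row in enumerate(M[s]):
            q = sum(m * r for m, r in zip(row, w))
            if best is None or q > best[0]:
                best = (q, a, k)
    return best[1], best[2]


# One state, two actions, three successor states; reward sits in state 2.
sharp = [[[0.0, 0.9, 0.0], [0.0, 0.0, 0.1]]]    # tuned to an old goal in state 1
diffuse = [[[0.3, 0.3, 0.3], [0.3, 0.3, 0.3]]]  # broad, uninformative policy
w = [0.0, 0.0, 1.0]
print(gpi_action([sharp, diffuse], w, s=0))  # (0, 1): the diffuse map wins
```

In this toy case the sharp map points at a state adjacent to the new reward, yet its small occupancy of the rewarded state means the diffuse map supplies the maximum.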
Focusing instead on the reward function itself, we regard the choice of successor map as inference over a latent context variable, where the agent accumulates evidence during, and across, episodes to infer the current reward context as it takes steps in the environment. This allows the agent to adjust its policy online also in cases where reward changes, and episode/task boundaries are not signalled, or the reward function is not provided, as well as enabling a principled use of resources.
The inference process is realized by storing, corresponding to each successor map, a convolved-reward map (CR map) that summarizes the average of local rewards along trajectories that traversed a particular state, while using that particular map (Fig. 1c). By locally ‘spreading’ rewards along experienced trajectories and clustering this extended reward space, it is possible to select successor maps close to the new optimal policy, such that they should require only a small amount of adjustment and exploration before yielding good performance. Interestingly, these maps can also be seen to serve as approximate priors for the rewards in each task context, and can thus be used to ‘guess’ reward locations in a form of directed exploration, as we show below. Fig. 1c illustrates the computation of CR values in a simple grid-world domain that we will use to simulate navigation.
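One simple way to picture the ‘spreading’ of rewards along a trajectory is a windowed average over the reward sequence. The sketch below assumes a flat (box) kernel and a hypothetical `half_width` parameter; the actual kernel used for the CR values is not specified in this section, so this should be read as one possible instantiation only.

```python
def convolved_rewards(rewards, half_width=3, bidirectional=True):
    """CR value per step: mean reward in a window around that step.

    With bidirectional=False only the current and past rewards are used.
    """
    n = len(rewards)
    out = []
    for t in range(n):
        lo = max(0, t - half_width)
        hi = min(n, t + half_width + 1) if bidirectional else t + 1
        window = rewards[lo:hi]
        out.append(sum(window) / len(window))
    return out


# A five-step trajectory that ends on a reward of 10.
rewards = [0.0, 0.0, 0.0, 0.0, 10.0]
cr = convolved_rewards(rewards, half_width=1)
print(cr[0], cr[4])  # 0.0 5.0
```

States visited shortly before the goal thereby acquire non-zero CR values, so two reward functions with nearby goals yield similar CR maps even when their raw reward vectors do not overlap.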
More formally, we frame the generative model that our agent inverts as a Dirichlet process mixture model (Antoniak1974, ). The generating distribution for the observed CR value of a particular state, given a context, is a Gaussian distribution around the context’s stored CR value (Fig. 1b).
We approximate the overall, intractable inference over the reward contexts using a particle filter (Algorithm 1, lines 31-40), with a filtering step performed following every actual step in the environment. Since the convolution on future rewards from a particular state can only be calculated retrospectively, the filtering process was delayed, in our case by three steps, though it is possible to only use past rewards (e.g. in cases where the environment does not have a commutative structure). CR maps themselves were learnt by gradient descent (delta rule), and we found that a successful clustering process relied on winner-take-all dynamics, updating only the CR map that had the highest posterior likelihood at the end of the episode. This hard-clustering nonlinearity allocates the environment to a particular context and helps the pattern separation of the different environments. While in our episodic implementation this CR replay update is triggered at the end of the episode, in continuing tasks it could be triggered, as also observed empirically, by novelty and rewards (Singer2009, ; Cheng2008, ). Details such as the filtering delay, or whether the convolutions are calculated bi-directionally, can be adjusted as desired. Finally, to choose an action, the algorithm samples a context given the current weight distribution, calculates the Q-values, and then acts greedily, or by using some exploration strategy.
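To make the inference step concrete, the following sketch runs a bootstrap particle filter over a fixed set of context labels with the Gaussian CR likelihood, followed by a winner-take-all delta-rule update. It is a simplified stand-in, not a reproduction of Algorithm 1: the number of contexts is fixed (no new contexts are created), and `sigma`, `switch_prob`, `lr` and all function names are hypothetical.

```python
import math
import random


def gaussian_likelihood(x, mean, sigma):
    """Density of x under a Gaussian centred on the stored CR value."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def filter_step(particles, cr_maps, state, observed_cr, sigma, switch_prob, rng):
    """Propagate, weight and resample particles holding context indices."""
    n_contexts = len(cr_maps)
    moved = [rng.randrange(n_contexts) if rng.random() < switch_prob else c
             for c in particles]
    weights = [gaussian_likelihood(observed_cr, cr_maps[c][state], sigma)
               for c in moved]
    if sum(weights) == 0.0:  # all likelihoods underflowed: keep particles
        weights = [1.0] * len(moved)
    return rng.choices(moved, weights=weights, k=len(moved))


def belief_from_particles(particles, n_contexts):
    """Belief weights over contexts as particle fractions."""
    return [particles.count(c) / len(particles) for c in range(n_contexts)]


def winner_take_all_update(cr_maps, belief, visited, lr=0.1):
    """Delta-rule update of only the most probable context's CR map."""
    winner = max(range(len(belief)), key=belief.__getitem__)
    for state, cr_value in visited:
        cr_maps[winner][state] += lr * (cr_value - cr_maps[winner][state])
    return winner


rng = random.Random(0)
cr_maps = [[0.0], [5.0]]          # two contexts, one state
particles = [0] * 50 + [1] * 50   # uniform initial belief
particles = filter_step(particles, cr_maps, state=0, observed_cr=5.0,
                        sigma=0.5, switch_prob=0.0, rng=rng)
belief = belief_from_particles(particles, n_contexts=2)
winner = winner_take_all_update(cr_maps, belief, visited=[(0, 7.0)], lr=0.5)
print(belief, winner, cr_maps)
```

A non-zero `switch_prob` would let the filter recover from a wrong commitment when rewards change without being signalled.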
4.1 Grid-world with signalled rewards and context-specific replay
We first tested the performance of the model in a tabular maze navigation task (Fig. 1e), where start and goal locations changed every 20 trials, and compared its performance to a number of controls and baselines. In the first experiment, the reward function was provided every time the environment changed, to directly test the algorithms’ ability to map out routes to new goals. Episodes were terminated after 75 steps if the agent had failed to reach the goal; otherwise they ended on reaching the goal, when the agent also received a reward of 10. We added walls to the interior of the maze to make the environment’s dynamics non-trivial, such that a single SR representing the uniform policy (diffusion-SR) would not be able to select optimal actions. Fig. 2a shows the cumulative total number of steps the agents required to complete the episodes as learning progressed. We compared BSR to a single SR representation (SSR), an adaptation of GPI (SFQL) from Barreto et al. (Barreto2016, ) to state-state SR, as well as an agent that was provided with a pre-designed clustering, and instructed to use a specific map whenever the goal was in a particular quadrant of the maze (Known Quadrant, KQ). Each algorithm, except SSR, was provided with four maps to use, which meant that we didn’t allow GPI to create new maps for every task. Instead, once all of its maps were in use, it randomly selected one to overwrite, while otherwise following its original specifications. Further, we added a replay buffer to each algorithm, and replayed randomly sampled transitions for all of our experiments in the paper. Replay buffers had specific compartments for each successor map, with each transition added to the compartment corresponding to the map used to select the corresponding action. The replay process thus formed part of our overall nonparametric approach for continual learning.
While we found that the top-ranking performance of BSR was not dependent on using replay, it ensured all algorithms performed in a better regime than without replay, making for more relevant comparisons. We ran each algorithm with different $\epsilon$-greedy exploration rates (after an initial, brief period of high exploration) of between 0 and 0.25, in increments of 0.05, and chose the best performing for each algorithm. Notably BSR was the most stable across all exploration parameters, but performed best with $\epsilon = 0$, whereas the other algorithms performed better with higher exploration rates. The best performing setting for each algorithm, averaged over 10 trials, is shown in Fig. 2a, with BSR comfortably outperforming the others by the end of the runs (one-way ANOVA). Increasing the number of maps in GPI to 10 led to worse performance by the agent, showing that it wasn’t the lack of capacity, but the inability to generalize well that impaired its performance.
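The map-specific replay compartments described above can be sketched as a small class. The class name, the `capacity` limit and the first-in-first-out eviction are assumptions of this example; the text only specifies that each transition is stored under the map that selected the action and that replayed transitions are sampled randomly.

```python
import random
from collections import deque


class CompartmentReplay:
    """Replay buffer with one bounded compartment per successor map."""

    def __init__(self, n_maps, capacity):
        self.compartments = [deque(maxlen=capacity) for _ in range(n_maps)]

    def add(self, map_index, transition):
        """Store a transition under the map that selected the action."""
        self.compartments[map_index].append(transition)

    def sample(self, map_index, batch_size, rng):
        """Uniformly sample transitions from one map's compartment."""
        pool = list(self.compartments[map_index])
        return rng.sample(pool, min(batch_size, len(pool)))


buffer = CompartmentReplay(n_maps=2, capacity=2)
buffer.add(0, ("s0", 0, "s1"))
for transition in [("s1", 1, "s2"), ("s2", 0, "s3"), ("s3", 1, "s4")]:
    buffer.add(1, transition)  # the oldest entry is evicted at capacity
batch = buffer.sample(1, batch_size=5, rng=random.Random(0))
print(sorted(batch))  # [('s2', 0, 's3'), ('s3', 1, 's4')]
```

Sampling from the compartment of the currently sampled map keeps replay updates context-specific, so experience gathered under one reward context does not overwrite a map tuned for another.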
4.2 Puddle-world with unsignalled reward changes and task boundaries
In the second experiment, we made the environment more challenging by introducing puddles, which carried a penalty of -1 every time the agent entered a state with a puddle, as well as by not signalling reward changes. GPI still received task change signals to reset its use of the map and reset its memory buffer. This relieved us from having to impose a particular way for the algorithm to try to detect reward function change, while still not providing it the new reward function. We fixed the exploration rate at a value that proved advantageous for the control algorithms in the previous experiment, but varied the number of available maps between four and six. Negative rewards are known to be problematic and can prevent agents from properly exploring and reaching goals. We therefore also evaluated an algorithm that randomly switched between its different successor maps at every step, corresponding to a belief distribution $\omega$ with equal weights (Equal Weights, EW). This provided a control showing that it was not increased entropy or dithering that drove the better performance of BSR. Despite this, both EW and SSR originally struggled in this setting, ending their runs with large negative cumulative returns, as there was nothing that would direct their exploration towards new goals, which might lie e.g. across puddles. However, when we allowed them to use CR map based exploration (see below), this considerably improved their performance, to the levels shown in Fig. 2b.
4.3 Directed exploration with CR reward priors
Indeed, since CR maps act as a prior for rewards in a particular context, it should be possible to use them to ‘infer’ likely reward locations when combined with reward observations, and then direct actions towards these possible rewards, even as their precise location remains uncertain. Because of the linear nature of SR, we can simply offset the reward weight vector using the CR values (Algorithm 1, line 8), which will direct the agent towards rewards likely under the currently sampled map. This is particularly powerful combined with CR based inference of map choice, and can help the agent to redirect its policy from previous reward locations having observed the lack of rewards there. By switching to a new map and reward weight vector $\mathbf{w}$, the agent is encouraged to move towards states where rewards generally occur in that particular context. Importantly, these offsets are not the traditional pseudo rewards often used in reward shaping, and only temporarily influence the agent’s beliefs about rewards, never the actual rewards themselves. The beliefs will also be updated using the true reward values whenever the agent enters a state. This also means that this reward-guidance isn’t expected to interfere with the agent’s behaviour in the same manner as pseudo rewards that don’t obey specific conditions do (Ng1999, ). However, like any prior, they can potentially have detrimental effects. We leave a fuller exploration of these for future work.
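A minimal sketch of such an offset is given below. It assumes an additive offset with a hypothetical `scale` factor, and it assumes that states whose true reward has been observed in the current context fall back to that observed value; the exact form used on line 8 of Algorithm 1 is not given in this section, so both choices are illustrative.

```python
def exploration_weights(w, cr_map, observed, scale=1.0):
    """Reward beliefs used for action selection under the sampled map.

    w        -- current reward weight per state
    cr_map   -- stored CR value per state for the sampled context
    observed -- dict mapping state -> true reward seen in this context
    """
    return [observed[s] if s in observed else w_s + scale * cr_s
            for s, (w_s, cr_s) in enumerate(zip(w, cr_map))]


w = [0.0, 0.0, 0.0]
cr_map = [0.0, 2.0, 4.0]
observed = {2: 0.0}  # the agent visited state 2 and found no reward there
print(exploration_weights(w, cr_map, observed))  # [0.0, 2.0, 0.0]
```

The returned vector can be used in place of the reward weights when computing Q-values from the successor map, so the prior steers exploration without altering the rewards the agent actually receives.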
4.4 Function approximation setting
The tabular setting already enables us to test many components of the algorithm and compare the emerging representations to their biological counterparts. However, it is important to validate that these can be scaled up and used with function approximators, potentially allowing more fine-grained probing of the emerging neural code, as well as the use of continuous state and action spaces and more complex tasks. As a proof of principle, we created a continuous version of the maze (with walls but no puddles) from Experiment 1. Agents still took steps in the four cardinal directions, but these steps were perturbed by 2D Gaussian noise, such that the agents ended up potentially covering the whole continuous state space of the maze (walls allowing). State embeddings were provided in the form of Gaussian radial basis functions with 100 equally placed centres, and agents used an artificial neural network equivalent of Algorithm 1, where the matrix components were replaced by a multi-layer perceptron (Fig. 1d). We tested BSR-4 vs. SSR and GPI-4 in this setting, with the former outperforming the other two again. Fig. 2c shows the performance of the algorithms in terms of total reward collected by timestep in the environment, to emphasize the continual learning setting (Fig. 2c, mean ± s.e.m.; one-way ANOVA).
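The state embedding can be pictured with the following sketch of Gaussian radial basis functions on a 10 x 10 grid of centres. The unit-square maze, the cell-centred placement and the `width` value are assumptions for illustration; the text specifies only that 100 equally placed centres were used.

```python
import math


def rbf_centres(n_per_side=10, size=1.0):
    """Equally spaced centres on a square maze of side length `size`."""
    step = size / n_per_side
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(n_per_side) for j in range(n_per_side)]


def rbf_features(pos, centres, width=0.1):
    """Gaussian RBF activation of every centre for a 2D position."""
    x, y = pos
    return [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * width ** 2))
            for cx, cy in centres]


centres = rbf_centres()
phi = rbf_features(centres[0], centres)
print(len(phi), phi[0])  # 100 1.0
```

A feature vector of this kind would take the place of the one-hot state encoding as input to the network that predicts successor features.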
5 Neural data analysis
5.1 Hippocampal flickering during navigation to multiple, changing rewards
Our algorithm draws inspiration from biology, where animals often face similar continual learning tasks e.g. while foraging or evading predators in familiar environments. In the following section we show some evidence that the rodent brain might indeed use similar mechanisms to those proposed above. We performed two sets of analyses, comparing our model to experimental data from rodent navigation tasks with changing reward settings. We used the successor representation as a proxy for neural firing rates as suggested in (Stachenfeld2017, ), with each neuron associated with a state and firing with a rate proportional to its current expected discounted visitation. Our framework predicts the continual, fast-paced remapping of hippocampal prospective representations, in particular in situations when a change in rewards increases the probability of switching over to a new map. Intriguingly, such ‘flickering’ of hippocampal place cells has indeed been reported, though a normative framework underpinning this phenomenon has been missing. Dupret et al. and Boccara et al. (Dupret2013, ; Boccara2019, ) recorded hippocampal neurons in a task where rodents had to collect three hidden food items in an open hexagonal maze. The locations of all the rewards changed together between sessions of around 30 episodes each. Both papers report what appears to be a fast, flickering alternation of distinct hippocampal representations, gradually moving from being dominated by the old to the new one, as the animal is repeatedly exposed to the new reward settings (Fig. 3a). In particular the data show relatively stable firing in the reward-free pre- and post-probes, and a strong, gradual, and monotonic trend across rewarded trials as the spatially tuned firing in the hippocampus changes. This drift from being similar to the pre-probe to the post-probe firing patterns is depicted for place cells in Fig. 3a (adapted with permission from (Boccara2019, )) and for our model in Fig. 3b. Fig. 3c shows the evolution of the average of this difference across the thirty rewarded trials. We ran simulations with 150 consecutive sessions of 30 episodes each (a total of 4500 episodes as before) with the 3 random reward locations changing every session, and without signalling this change (except for GPI). We again used the maze from experiment 1 but with the internal walls removed, and parameter settings were preserved, with the same exploration rate as before.
BSR naturally and consistently captured the observed flickering behaviour as shown on a representative session in Fig. 3b (with further sessions and correlation values shown in the Supplementary Information). Further, it was the only tested algorithm that captured the smooth, monotonic evolution of the z-scores across initial trials, reported as a correlation coefficient of 0.95 for the first 16 trials of each session in Boccara et al. (Boccara2019, ), with our model giving the closest value of 0.90.
5.2 Splitter cells
Another intriguing phenomenon of spatial tuning in the hippocampus is the presence of so-called splitter cells, cells with route-dependent spatial firing. These cells encode current location conditional both on previous and future states visited by the animal (Dudchenko2014, ). While the successor representation is purely prospective, in our model the inference over the reward context, and thus the belief distribution $\omega$ over maps, depends on the route taken, predicting exactly this type of past-and-future dependent representation. Further, rather than a hard-assignment to a particular map, our model predicts switching back and forth between representations in the face of uncertainty. We analysed data of rats performing a foraging navigation task in a double Y-maze with 3 alternating reward locations (Grieves2016, ) (Fig. 4a shows the maze and routes, adapted with permission). The central goal location at the top of the maze has two possible routes (Route 2 and 3), one of which is blocked every time this central goal location is the rewarded one. This setup gives 4 routes/trial types. These blockades momentarily change the dynamics of the environment, a challenge for SR (Russek2017, ; Lenhert2017, ). Our inference framework, however, overcomes this challenge by ‘recognizing’ a lack of convolved rewards along the blocked route when the animal can’t get access to the goal location, allowing the algorithm to switch to using the alternate route. Other algorithms, notably GPI, struggle with this problem (Fig. 4h), as GPI has to explicitly try to map the policy of going back from the barricade, through the maze and up the other arm to escape the barrier once it has started bumping against it. To further test our model’s correspondence with hippocampal spatial coding, we followed the approach adopted in Grieves et al. (Grieves2016, ) for trying to decode the trial type from the hippocampal activity as animals begin the trial in the start box. This decoding is performed using a nearest centroid classifier, based on cosine similarity. Namely, for every trial the representation is compared to the average of representations across all trials, grouped by the known, ground truth trial type. The trial is then assigned to a decoded trial type, based on the ground truth average it is most similar to. The analysis was only performed on successful trials, and thus a simple prospective representation would result in large values on the diagonal, as in the case of GPI and KQ (our known-quadrant algorithm which in this case gets the full trial type information to pick its map accordingly).
In contrast, the empirical data resemble the pattern predicted by BSR, where a sampling of maps results in a more balanced representation, while still providing route dependent encoding that differentiates all four possible routes already in the start box. EW and SSR both predict close-to-chance decodability, EW since it samples randomly at every step, across slowly changing maps, and SSR since the same map has to be continually updated, muddying the differences between session averages.
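The nearest centroid decoding described above can be sketched in a few lines. The toy two-dimensional ‘representations’ and all function names are invented for illustration; as in the description, each trial is compared against centroids computed over all trials of each ground truth type, including the trial itself.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def centroids(reps, labels):
    """Mean representation per ground truth trial type."""
    groups = {}
    for rep, label in zip(reps, labels):
        groups.setdefault(label, []).append(rep)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def decode(rep, cents):
    """Assign the trial type whose centroid is most similar to `rep`."""
    return max(cents, key=lambda label: cosine(rep, cents[label]))


reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 2, 2]
cents = centroids(reps, labels)
decoded = [decode(rep, cents) for rep in reps]
print(decoded)  # [1, 1, 2, 2]
```

Counting how often each true trial type is decoded as each possible type gives the confusion matrix whose diagonal is discussed above.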
6 Related work
A number of recent or concurrent papers have proposed algorithms for introducing transfer into RL/deep RL settings, by using multiple policies in some way, though none of them uses an inferential framework similar to ours, provides a principled way to deal with unsignalled task boundaries, or explains biological phenomena. We extensively discuss the work of Barreto et al. (Barreto2016, ; Barreto2019, ) in the paper. Our approach shares elements with earlier work on probabilistic policy reuse (Fernandez2006, ), which also samples from a distribution over policies, but does so only at the beginning of each episode, doesn’t follow a principled approach for inferring the weight distribution, and is limited in performance by its use of Q-values rather than factored representations. Wilson et al. (Wilson2007, ) and Lazaric et al. (Lazaric2010, ) employed hierarchical Bayesian inference for generalization in multi-task RL using Gibbs sampling; however, neither used the flexibility afforded by the successor representation or fully integrated online control and inference similarly to our method. (Wilson2007, ) uses value iteration to solve the currently most likely MDP, while (Lazaric2010, ) applies the inference directly on state-value functions.
Other approaches tackle the tension between generality and specialization by regularizing specialized policies in some way with a central general policy (Teh2017, ; Finn2017, ), which we instead expressly tried to avoid here. General value functions in the Horde architecture (Sutton2011, ) also calculate several value functions in parallel, corresponding to a set of different pseudo-reward functions, and in this sense are closely related to, but a more generalized version of, SR. Schaul et al. combined the Horde architecture with a factorization of value functions into state and goal embeddings to create universal value function approximators (UVFAs) (Schaul2015, ). Unlike our approach, UVFA-Horde first learns goal-specific value functions, before trying to learn flexible goal and state embeddings through matrix factorization, such that the successful transfer to unseen goals and states depends on the success of this implicit mapping. More recent work from Ma et al. (Ma2018, ) and Borsa et al. (Borsa2019, ) combines the idea of universal value functions and SR to try to learn an approximate universal SR. Similarly to UVFAs, however, (Ma2018, ) relies on a neural network architecture implicitly learning the (smooth) structure of the value function and SR for different goals, in a setting where this smoothness is supported by the structure in the input state representation (visual input of nearby goals). Further, this representation is then used only as a critic to train a goal-conditioned policy for a new signalled goal location. (Borsa2019, ) proposes to overcome some of these limitations by combining the UVFA approach with GPI. However, it doesn’t formulate a general framework for choosing base policies and their embeddings when learning a particular task space, or for sampling these policies, nor does it address the issue of task boundaries and online adjustment of policy use.
Other recent work for continual learning also mixed learning from current experience with selectively replaying old experiences that are relevant to the current task (Rolnick2018, ; Isele2018, ). Our approach naturally incorporates, though is not dependent on, such replay, where relevant memories are sampled from SR specific replay buffers, thus forming part of the overall clustering process. Finally, (Milan2016, ) also develops a nonparametric Bayesian approach to avoid relearning old tasks while identifying task boundaries for sequence prediction tasks, with possible applications for model-based RL, while (Franklin2018, ) explored the relative advantages of clustering transition and reward functions jointly or independently for generalization.
In this paper we proposed an extension to the SR framework by coupling it with the nonparametric clustering of the task space and amortized inference using diffuse, convolved reward values. We have shown that this can improve the representation’s transfer capabilities by overcoming a major limitation, the policy dependence of the representation, and turning it instead into a strength through policy reuse. Our algorithm is naturally well-suited for continual learning where rewards in the environment persistently change. While in the current setting we only inferred a weight distribution over the different maps and separate pairs of SR bases and CR maps, an important further advantage of a hierarchical approach (e.g. using hierarchical DP mixtures (Teh2006, ) and composition of submaps) would be to allow the agent to infer essentially new successor maps from very limited experience, even as dynamics change. We leave this for future work.
We further showed how our model provides a common framework to explain disparate findings that characterize the brain’s spatial and prospective coding. Although we committed to a particular interpretation of hippocampal cells representing SR in our simulations, the phenomena we investigate rely on features of the model (the online clustering and sampling of representations based on reward function similarity) that generalise across different implementational details of prospective maps. The model also makes several predictions that were not directly tested, e.g. about how the rate of flickering depends on uncertainty about the rewards and the nature of the reward change, or the animal’s trajectory. We also hypothesize that the dimensionality of the representation (the number of maps) would vary as a function of the diversity of the experienced reward space, and our model predicts specific suboptimal trajectories that can arise due to inference. Our model also predicts representations for , a likely candidate for which would be the orbitofrontal cortex, and for the CR maps, possibly in other cortical areas such as the cingulate. Taken together, we hope this study is a step in connecting two fast-moving areas in neuroscience and ML research that offer a trove of possibilities for mutually beneficial insight.
Thanks to Tim Behrens, Rex Liu, Shirley Mark, Evan Russek and James Whittington for comments and helpful discussions, as well as to Roddy Grieves and Paul Dudchenko for generously sharing data from their experiments.
- (1) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning.,” Nature, 2015.
- (2) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search.,” Nature, 2016.
- (3) P. Dayan, “Improving Generalization for Temporal Difference Learning: The Successor Representation,” Neural Comput., 2008.
- (4) A. Pritzel, B. Uria, S. Srinivasan, A. Puigdomènech, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell, “Neural Episodic Control,” arXiv preprint 1703.01988, 2017.
- (5) S. J. Gershman and N. D. Daw, “Reinforcement learning and episodic memory in humans and animals: An integrative framework,” Annual Review of Psychology, vol. 68, no. 1, pp. 101–128, 2017.
- (6) M. Botvinick, S. Ritter, J. X. Wang, Z. Kurth-Nelson, C. Blundell, and D. Hassabis, “Reinforcement Learning, Fast and Slow,” Trends Cogn. Sci., 2019.
- (7) M. Lengyel and P. Dayan, “Hippocampal Contributions to Control: The Third Way,” in Adv. Neural Inf. Process. Syst., 2007.
- (8) G. A. Carpenter and S. Grossberg, “A massively parallel architecture for a self-organizing neural pattern recognition machine,” Comput. Vision, Graph. Image Process., 1987.
- (9) M. McCloskey and N. J. Cohen, “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem,” Psychol. Learn. Motiv. - Adv. Res. Theory, 1989.
- (10) B. Ans and S. Rousset, “Avoiding catastrophic forgetting by coupling two reverberating neural networks,” Comptes Rendus l’Academie des Sci. - Ser. III, 1997.
- (11) A. Courville, N. D. Daw, G. Gordon, and D. S. Touretzky, “Model uncertainty in classical conditioning,” in Adv. Neural Inf. Process. Syst., vol. 16, 2004.
- (12) T. J. Madarasz, L. Diaz-Mataix, O. Akhand, E. A. Ycu, J. E. LeDoux, and J. P. Johansen, “Evaluation of ambiguous associations in the amygdala by learning the structure of the environment,” Nat. Neurosci., 2016.
- (13) S. J. Gershman, D. M. Blei, and Y. Niv, “Context, learning, and extinction.,” Psychol. Rev., vol. 117, 2010.
- (14) D. Meder, N. Kolling, L. Verhagen, M. K. Wittmann, J. Scholl, K. H. Madsen, O. J. Hulme, T. E. Behrens, and M. F. Rushworth, “Simultaneous representation of a spectrum of dynamically changing value estimates during decision making,” Nat. Commun., 2017.
- (15) S. C. Y. Chan, Y. Niv, and K. A. Norman, “A Probability Distribution over Latent Causes, in the Orbitofrontal Cortex,” J. Neurosci., 2016.
- (16) M. A. Yassa and C. E. Stark, “Pattern separation in the hippocampus,” Trends in Neurosciences, 2011.
- (17) E. T. Rolls, “A theory of hippocampal function in memory,” Hippocampus, 1996.
- (18) S. P. Jadhav, C. Kemere, P. W. German, and L. M. Frank, “Awake hippocampal sharp-wave ripples support spatial memory,” Science, 2012.
- (19) R. M. Grieves, E. R. Wood, and P. A. Dudchenko, “Place cells on a maze encode routes rather than destinations,” Elife, 2016.
- (20) T. I. Brown, V. A. Carr, K. F. LaRocque, S. E. Favila, A. M. Gordon, B. Bowles, J. N. Bailenson, and A. D. Wagner, “Prospective representation of navigational goals in the human hippocampus,” Science, 2016.
- (21) K. L. Stachenfeld, M. M. Botvinick, and S. J. Gershman, “The hippocampus as a predictive map,” Nat. Neurosci., 2017.
- (22) K. Jezek, E. J. Henriksen, A. Treves, E. I. Moser, and M. B. Moser, “Theta-paced flickering between place-cell maps in the hippocampus,” Nature, 2011.
- (23) K. Kay, J. E. Chung, M. Sosa, J. S. Schor, M. P. Karlsson, M. C. Larkin, D. F. Liu, and L. M. Frank, “Regular cycling between representations of alternatives in the hippocampus,” bioRxiv, 2019.
- (24) C. N. Boccara, M. Nardin, F. Stella, J. O’Neill, and J. Csicsvari, “The entorhinal cognitive map is attracted to goals,” Science, 2019.
- (25) D. Dupret, J. O’Neill, and J. Csicsvari, “Dynamic reconfiguration of hippocampal interneuron circuits during spatial learning.,” Neuron, 2013.
- (26) C. J. C. H. Watkins, Learning from delayed rewards. PhD thesis, 1989.
- (27) A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. van Hasselt, and D. Silver, “Successor Features for Transfer in Reinforcement Learning,” in Adv. Neural Inf. Process. Syst., 2017.
- (28) A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Žídek, and R. Munos, “Transfer in Deep Reinforcement Learning Using Successor Features and Generalised Policy Improvement,” 2019.
- (29) J. O’Doherty, P. Dayan, J. Schultz, R. Deichmann, K. Friston, and R. J. Dolan, “Dissociable Roles of Ventral and Dorsal Striatum in Instrumental Conditioning,” Science, 2004.
- (30) C. E. Antoniak, “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” Ann. Stat., 1974.
- (31) A. C. Singer and L. M. Frank, “Rewarded Outcomes Enhance Reactivation of Experience in the Hippocampus,” Neuron, 2009.
- (32) S. Cheng and L. M. Frank, “New Experiences Enhance Coordinated Neural Activity in the Hippocampus,” Neuron, 2008.
- (33) A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proceedings of the Sixteenth International Conference on Machine Learning, 1999.
- (34) P. A. Dudchenko and E. R. Wood, “Splitter cells: Hippocampal place cells whose firing is modulated by where the animal is going or where it has been,” in Space, Time Mem. Hippocampal Form., 2014.
- (35) E. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw, “Predictive representations can link model-based reinforcement learning to model-free mechanisms,” PLOS Computational Biology, 2017.
- (36) L. Lehnert, S. Tellex, and M. L. Littman, “Advantages and limitations of using successor features for transfer in reinforcement learning,” arXiv preprint 1708.00102, 2017.
- (37) F. Fernández and M. Veloso, “Probabilistic policy reuse in a reinforcement learning agent,” in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, 2006.
- (38) A. Wilson, A. Fern, S. Ray, and P. Tadepalli, “Multi-task reinforcement learning: a hierarchical bayesian approach.,” in Proceedings of the 24th International Conference on Machine Learning, 2007.
- (39) A. Lazaric and M. Ghavamzadeh, “Bayesian multi-task reinforcement learning,” in Proceedings of the 27th International Conference on Machine Learning, 2010.
- (40) Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: Robust multitask reinforcement learning,” in Adv. Neural Inf. Process. Syst., 2017.
- (41) C. Finn, P. Abbeel, and S. Levine, “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks,” arXiv 1703.03400, 2017.
- (42) R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup, “Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction,” Aamas, 2011.
- (43) T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal Value Function Approximators,” in 32nd Int. Conf. Mach. Learn., 2015.
- (44) C. Ma, J. Wen, and Y. Bengio, “Universal successor representations for transfer reinforcement learning,” in International Conference on Learning Representations, 2018.
- (45) D. Borsa, A. Barreto, J. Quan, D. J. Mankowitz, H. van Hasselt, R. Munos, D. Silver, and T. Schaul, “Universal successor features approximators,” in International Conference on Learning Representations, 2019.
- (46) D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne, “Experience Replay for Continual Learning,” arXiv 1811.11682, 2018.
- (47) D. Isele and A. Cosgun, “Selective Experience Replay for Lifelong Learning,” in AAAI Conference on Artificial Intelligence, 2018.
- (48) K. Milan, J. Veness, J. Kirkpatrick, M. Bowling, A. Koop, and D. Hassabis, “The Forget-me-not Process,” in Adv. Neural Inf. Process. Syst., 2016.
- (49) N. T. Franklin and M. J. Frank, “Compositional clustering in task structure learning,” PLOS Computational Biology, 2018.
- (50) Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical dirichlet processes,” Journal of the American Statistical Association, 2006.
- (51) X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artif. Intell. Stat., 2010.
- (52) T. Tieleman, G. E. Hinton, N. Srivastava, and K. Swersky, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA Neural Networks Mach. Learn., 2012.
S1 Algorithm details
S1.1 Tabular algorithm
Here we provide details for our agent not included in the main text. The concentration parameter for the Dirichlet process was , the standard deviation for the Gaussian generative distribution, and like most other parameter values, these parameters were not optimized or otherwise systematically evaluated. P was a 100 by 15 matrix, corresponding to 100 particles, each holding the 15 most recent contexts. In the tabular setting, the learning rate was 1 (0.5 in the Y-maze task), (0.1 in Y-maze), . was annealed starting from 0.15 to 0.01. Replay updates were performed on 5 transitions sampled at random from the replay buffer after every direct update, with each successor map assigned its own replay buffer. For GPI, the buffer for a new map was reset whenever a change in the environment was signalled, so as not to contaminate the learning of the current policy with transitions from a different task setting. The delay in filtering, f, was set to 3 steps.
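The interleaving of direct and replayed updates described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the experiments: it uses a state-by-state successor matrix `M` (the maps in the paper may be action-conditioned), a hypothetical discount `GAMMA` whose value is not recoverable from the text, and sampling without replacement from a plain Python list standing in for the per-map buffer.

```python
import random
import numpy as np

GAMMA = 0.95   # assumed discount factor; the actual value is not given here
ALPHA = 1.0    # tabular learning rate stated in the text
N_REPLAY = 5   # replayed transitions per direct update (from the text)


def sr_td_update(M, s, s_next, alpha=ALPHA, gamma=GAMMA):
    """One TD update of row s of the successor matrix M (state-based SR)."""
    target = gamma * M[s_next].copy()
    target[s] += 1.0                      # one-hot occupancy of the current state
    M[s] += alpha * (target - M[s])


def learn_step(M, buffer, s, s_next, rng=random):
    """Direct update on the newest transition, then replay from this map's buffer."""
    sr_td_update(M, s, s_next)
    buffer.append((s, s_next))
    for rs, rs_next in rng.sample(buffer, min(N_REPLAY, len(buffer))):
        sr_td_update(M, rs, rs_next)
```

Each successor map would hold its own `M` and its own `buffer`, so that replay for one map never draws transitions stored under another.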
S1.2 Neural network algorithm
For the neural network architecture, the input layer had 100 neurons, hidden layer sizes were 150, while the output layers by definition had the same size as the input, but with a separate output layer for each of the four possible actions (Fig. 1d), resulting in a total of 400 output neurons. We used the tanh function as nonlinearity, and no nonlinearity was used for the output layers. Parameters were initialised using Glorot initialisation (Glorot2010, ). We employed the successor equivalent of target networks (Mnih2015, ) for better performance (see Algorithm 2), and a designated replay buffer accompanying each successor network. This was equivalent to a single memory with transitions labelled by sampled context, as the limit on the memory size, determining when a memory was overwritten, was by episode (set to 100 episodes), and thus common to all memories.
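A minimal NumPy sketch of a network with these layer sizes is given below, to make the shape of the output concrete. Several details are assumptions not stated in the text: a single hidden layer by default (`n_hidden_layers` is a hypothetical parameter), the uniform variant of Glorot initialisation, zero-initialised biases, and the omission of dropout and of any training code.

```python
import numpy as np

N_IN, N_HIDDEN, N_ACTIONS = 100, 150, 4   # sizes from the text


def glorot_uniform(rng, fan_in, fan_out):
    """Glorot (Xavier) uniform initialisation for a fan_in x fan_out weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def init_params(rng, n_hidden_layers=1):
    """Hidden stack plus one linear output head per action (biases assumed zero)."""
    sizes = [N_IN] + [N_HIDDEN] * n_hidden_layers
    hidden = [(glorot_uniform(rng, a, b), np.zeros(b))
              for a, b in zip(sizes[:-1], sizes[1:])]
    heads = [(glorot_uniform(rng, N_HIDDEN, N_IN), np.zeros(N_IN))
             for _ in range(N_ACTIONS)]
    return {"hidden": hidden, "heads": heads}


def forward(params, x):
    """Return an array of shape (4, 100): one successor vector per action."""
    h = x
    for W, b in params["hidden"]:
        h = np.tanh(h @ W + b)            # tanh on hidden layers
    return np.stack([h @ W + b for W, b in params["heads"]])  # linear outputs
```

Calling `forward(init_params(np.random.default_rng(0)), x)` on a 100-dimensional input `x` yields the 4 x 100 = 400 outputs mentioned above.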
Parameters were , , and was annealed from 0.005 to 0.001. Replay updates were performed on minibatches of 15 transitions, one minibatch after every transition, following a direct update based on the most recent transition. The rmsprop (Tieleman2012, ) optimizer was used for updates, and a dropout rate of 0.1 was applied. Networks and target networks were synchronized every 80 steps, starting from the beginning of each episode.
S2 Environment and experiment details
S2.1 Grid-world maze
The maze for Experiment 1 was, as depicted in Fig. 1e, an 8 x 8 tabular maze. Available actions were up, down, left and right. If, on taking an action, the agent hit an internal or external wall, it stayed in place. Reward on landing on the goal was 10, , and episodes were restricted to at most 75 steps. Start and goal locations changed every 20 episodes.
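The transition rule of this maze can be sketched as below. The sketch makes assumptions that the text does not settle: internal walls are modelled as a set of blocked cells (`walls`, a hypothetical layout argument), and every non-goal step is given zero reward.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIZE = 8            # 8 x 8 maze (from the text)
GOAL_REWARD = 10.0  # reward on landing on the goal (from the text)
MAX_STEPS = 75      # episode length cap (from the text)


def step(pos, action, goal, walls=frozenset()):
    """Return (next_pos, reward, done) for one move on the grid."""
    dr, dc = ACTIONS[action]
    nxt = (pos[0] + dr, pos[1] + dc)
    inside = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    if not inside or nxt in walls:
        nxt = pos   # hitting an internal or external wall leaves the agent in place
    reached = nxt == goal
    return nxt, (GOAL_REWARD if reached else 0.0), reached
```

An episode loop would call `step` until `done` is true or `MAX_STEPS` moves have been taken.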
Numerical results for plot Fig. 2a, total steps taken by the end of 4500 episodes, mean ± s.e.m.:
For Experiment 2, puddles were added in the quadrant opposite the current reward, covering the entire quadrant except where walls were already in place. Landing in a puddle carried a penalty of -1, and puddles stayed in place during the entire episode (i.e. couldn’t be ‘picked up’). Otherwise things remained unchanged from the previous setting, except that the reward function now changed every 30 episodes, for a total of 150 sessions.
Numerical results for plot Fig. 2b, sum of cumulative returns per episode at the end of 4500 episodes, mean ± s.e.m.:
S2.3 Continuous maze
This was a continuous copy of the maze from Experiment 1. We set the length of the maze to be 3 but it was still partitioned by the walls into an equivalent 8 x 8 setting (Fig. S1). 100 input neurons represented the agent’s location, each with respect to one of 100 equally spaced locations as a Gaussian likelihood with diagonal covariance of 0.1. The firing rate of the i-th neuron, with center was
where the normalizing constant was that for the Gaussian distribution in question times 10.
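One possible reading of this input encoding is sketched below. It is an interpretation, not the formula from the paper: it assumes the 100 centres form a 10 x 10 lattice spanning the maze with its edges included, and it reads "times 10" as multiplying the two-dimensional Gaussian normalizing constant, so that each rate is the unnormalized Gaussian bump divided by 10 · 2π · 0.1.

```python
import numpy as np

MAZE_LEN = 3.0   # side length of the maze (from the text)
VAR = 0.1        # diagonal covariance entry (from the text)
SCALE = 10.0     # factor applied to the normalizing constant (from the text)

# Assumption: centres lie on a 10 x 10 lattice covering [0, 3] x [0, 3].
_axis = np.linspace(0.0, MAZE_LEN, 10)
CENTRES = np.array([(x, y) for x in _axis for y in _axis])   # shape (100, 2)


def firing_rates(pos):
    """Rates of the 100 input neurons for an agent at 2-D position pos."""
    sq_dist = np.sum((CENTRES - np.asarray(pos, dtype=float)) ** 2, axis=1)
    norm = SCALE * 2.0 * np.pi * VAR      # 2-D Gaussian normaliser times 10
    return np.exp(-0.5 * sq_dist / VAR) / norm
```

Under this reading the neuron centred exactly on the agent fires at 1 / (2π), and rates fall off with squared distance from each centre.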
Actions were again up, down, left, or right, but the arrival point of the step was offset by two-dimensional Gaussian noise, with a diagonal covariance of . The agent’s step-size was 0.3 (thus smaller than before, at a tenth of the maze’s length). Agents were point-like, but were not allowed to touch or traverse walls. Steps that would have resulted in such an outcome instead left the agent in place. This was also the case if the random offset would have resulted in contact between the agent and a wall. The agent collected the reward and ended the episode if it was at a distance of less than 0.25 from the goal’s location (which could be anywhere outside the walled-off areas). In all other aspects the task was identical to Experiment 1.
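The movement rule for the continuous maze can be sketched as follows. The noise variance `NOISE_VAR` is a placeholder, since its value is not recoverable from the text, and the wall geometry is abstracted into a hypothetical `blocked(start, end)` predicate that the caller supplies; only the outer boundary is checked directly.

```python
import numpy as np

STEP = 0.3          # step size (from the text)
GOAL_RADIUS = 0.25  # capture distance (from the text)
MAZE_LEN = 3.0      # side length of the maze (from the text)
NOISE_VAR = 0.01    # assumed placeholder for the noise variance
DIRS = {"up": (0.0, 1.0), "down": (0.0, -1.0),
        "left": (-1.0, 0.0), "right": (1.0, 0.0)}


def continuous_step(pos, action, goal, rng, blocked=lambda start, end: False):
    """Return (next_pos, done); the agent stays put if the move meets a wall."""
    pos = np.asarray(pos, dtype=float)
    noise = rng.normal(0.0, np.sqrt(NOISE_VAR), size=2)
    target = pos + STEP * np.asarray(DIRS[action]) + noise
    outside = np.any(target <= 0.0) or np.any(target >= MAZE_LEN)
    if outside or blocked(pos, target):
        target = pos   # touching or crossing a wall cancels the whole step
    done = np.linalg.norm(target - np.asarray(goal, dtype=float)) < GOAL_RADIUS
    return target, bool(done)
```

Because the wall check is applied to the noisy arrival point, a step is also cancelled when the random offset alone would carry the agent into a wall.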
Numerical results for plot Fig. 2c, sum of all rewards collected after steps ± s.e.m.:
S2.4 Three-reward open maze foraging task
Here the environment was the 8 by 8 maze with no internal walls. 3 reward locations were sampled at the beginning of each of the 150 sessions that lasted 30 trials (episodes) each. The pre- and post-probe parts of the sessions were 75 steps each (the same number as the upper limit of steps for an episode), and for simplicity we assumed no learning during these times. Instead the agent was randomly moving around while preserving its representation. We used the values of the successor maps to represent firing rates, with . For data analysis, the pre- and post-probe averages were calculated by averaging over the successor maps according to their weights at the beginning of the probe . Since rewards in this task are not Markovian, and implementing a working memory or learning the overall structure of the task was not our focus, we gave all agents the ability to block out and temporarily set to 0 in all rewards they already collected on that trial when evaluating the value function. This did not otherwise affect their beliefs about where the rewards were, i.e. they didn’t forget learnt reward locations by the next trial. We got the following values for the average Spearman correlation coefficients for trials 1 to 16 for the different algorithms:
The first 25 sessions were used as a warm-up for the agent to learn the environment, and sessions 25 to 150 were used in the data analysis. Plots equivalent to that in Fig. 3a for every 10th session from session 30 onwards are shown in figures S3 to S6 for the different algorithms.
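The temporary blocking of already-collected rewards in this foraging task can be illustrated with the standard SR value decomposition, in which values are the product of a successor matrix and a reward vector. The sketch below assumes a state-based successor matrix `M` and a learnt reward vector `w`; both names and the function are hypothetical and are not taken from the original code.

```python
import numpy as np


def masked_values(M, w, collected):
    """State values M @ w with rewards already collected this trial set to 0."""
    w_trial = w.copy()                        # the learnt estimate w is left untouched
    idx = np.array(sorted(collected), dtype=int)
    w_trial[idx] = 0.0                        # block out collected reward sites
    return M @ w_trial
```

Since only a copy of `w` is modified, the agent's beliefs about reward locations carry over unchanged to the next trial, matching the description of the task above.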
S2.5 Y-maze navigation task
We also implemented the Y-maze on a grid-world, as shown in Figure S2. On trial types 2 and 3 one of the two barriers shown was put in place, and the goal was in the top left hand corner state. The top right and top left hand corners were equivalent states (state 1, equivalent to the top goal state of the Y-maze in Fig. 4a), such that if the agent entered the top right hand corner, it was teleported to the top left, and that was the only destination state recorded. Similarly, if it took the ‘right’ action from the top left corner, it ended up in the state below the top right corner. This was permissible, since we don’t assume any type of action embedding, and there is no generalization based on action identity. This setup thus gave us an equivalent representation to the Y-maze used in rodent experiments.
We ran the simulations on the exact session structure followed during the experiments, and the agent had to complete the same number of successful episodes in each session as the experimental animals had. In addition, we gave the agents a 500-episode pre-training phase before we started following the experimental trial structure and ‘recording’ the representations. During these first 500 episodes, the trial type was reset every 20 episodes, as in Experiments 1 and 3. After the pre-training phase there were 24 blocks of sessions, where each block contained consecutive trials of all the trial types, such that there were exactly 3 changes to the trial type during every block.