The effect of synaptic weight initialization in feature-based successor representation learning

by Hyunsu Lee, et al.

Since the discovery of place cells, the idea that the hippocampus (HPC) represents geometric space has been extended to predictions, imagination, and conceptual cognitive maps. Recent research argues that the HPC represents a predictive map and has shown that the HPC predicts visits to specific locations. This predictive map theory is based on the successor representation (SR) from reinforcement learning. Feature-based SR (SF), which uses a neural network as a function approximator to learn the SR, seems a more plausible neurobiological model. However, it is not well known how different methods of weight (W) initialization affect SF learning. In this study, SF learners were exposed to simple maze environments to analyze SF learning efficiency and changes in W patterns. Three kinds of W initialization were used: identity matrix, zero matrix, and small random matrix. The SF learner initialized with a random weight matrix showed better performance than the other three RL agents. We also discuss the neurobiological meaning of the SF weight matrix. Through this approach, this paper aims to increase our understanding of intelligence from both neuroscientific and artificial intelligence perspectives.




1 Introduction

Animals have to wander and interact with the environment to survive. An essential ability for this interaction is memorizing information about the environment. Based on this information, an animal can anticipate future events or states that depend on its decisions. In other words, an animal can predict based on its experience. Prediction and memorization are essential properties of any general intelligence that has to interact with the environment. In animal intelligence, the hippocampal system is developed for prediction and memorization [Andersen et al., 2006]. Recent work has shown that place cell activity in the hippocampus is similar to the successor representation, which originated in the reinforcement learning (RL) field of artificial intelligence [de Cothi and Barry, 2020, Geerts et al., 2020, Stachenfeld et al., 2017].

The predictive map theory, which interprets place cell activity as learning a successor representation (SR), has shown explanatory power for in vivo place cell activity [Stachenfeld et al., 2017]. Place cell activity changes as the animal becomes accustomed to the environment. While an animal explores a new environment, place fields show a geodesic, symmetrical firing pattern; after the animal becomes familiar with the environment, this pattern changes to an asymmetrical one [Mehta et al., 2000]. This phenomenon can be explained by the predictive map theory. A place cell related to a specific location responds not to the visit to that location itself but to the expectation of visiting it. Thus, place cell activity is skewed along the direction of the animal's movement, because the expectation increases as the animal gets closer to the location. When the animal moves randomly while exploring a new environment, the expectation pattern is symmetrical in all directions. Although the SR explains the activity pattern of place cells well, it requires location information to be given as an index assigned to a specific place cell. In other words, the agent must already know the whole size of the environment and must be given location information in a fully observable form. To overcome this limitation and to apply the theory to partially observable environments, recent research proposed the feature-based SR as a model of the hippocampus [Geerts et al., 2020].

Feature-based SR, also called successor features (SF), uses a weight matrix for SR learning; the matrix works as a linear function approximator for successor states. However, the initialization method for the weight matrix varies across papers [Barreto et al., 2017, Geerts et al., 2020], and little is known about how different initializations of the weight matrix affect overall learning.

This paper examines the effect of weight initialization methods on spatial learning. To test it, an identity matrix, a zero matrix, and a random matrix were used in the experiments. Under an $\epsilon$-greedy policy, the SR agent and the SF agents show similar performance in a 1D maze, except for the SF agent initialized with a random weight matrix, which showed superior performance compared to the other three agents. In the Results section, we compare and analyze changes of the SR matrix during the learning process and discuss their neurobiological significance.

2 Backgrounds and Methods

2.1 Reinforcement learning

We assume that an RL agent interacts with the environment through Markov decision processes (MDPs, [Puterman, 2014]) in this paper. An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ consisting of the following elements. The sets $\mathcal{S}$ and $\mathcal{A}$ are the state (e.g., spatial locations) and action spaces. The function $P(s' \mid s, a)$ gives the probability distribution for the next state $s'$ according to the action $a$ taken in state $s$. The function $R$ specifies the immediate reward received in state $s$, which can be expressed as $R(s)$. The discount factor $\gamma$ is the weight that makes rewards smaller in the distant future.

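In RL, the agent finds a policy function $\pi$ to maximize the overall discounted reward, which is also called the return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$, where $R_t = R(S_t)$. To solve this problem, we usually use the method of dynamic programming (DP), which defines and calculates the value function of a policy as

$$V^\pi(s) = \mathbb{E}^\pi\left[G_t \mid S_t = s\right] \tag{1}$$

where $\mathbb{E}^\pi$ is the expected value obtained when the agent follows the policy $\pi$. After computing $V^\pi$, also known as policy evaluation, we can improve the policy in a greedy manner. The greedy policy improvement is as follows: $\pi'(s) = \arg\max_a Q^\pi(s, a)$, where $Q^\pi(s, a) = \mathbb{E}\left[R_t + \gamma V^\pi(S_{t+1}) \mid S_t = s, A_t = a\right]$.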

2.2 Successor representation (SR) and feature-based SR

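As previously described [Dayan, 1993], the key idea of SR learning is that the value function (1) can be decomposed into the expected visiting occupancy and the reward of the successor states, as in the following equation:

$$V^\pi(s) = \sum_{s'} M^\pi(s, s')\, R(s'), \qquad M^\pi(s, s') = \mathbb{E}^\pi\left[\sum_{k=0}^{\infty} \gamma^k\, \mathbb{1}(S_{t+k} = s') \,\middle|\, S_t = s\right] \tag{2}$$

where $\mathbb{1}(S_{t+k} = s')$ returns 1 when the agent visits the successor state $s'$ at time $t+k$, or 0 otherwise. Thus, $M^\pi(s, s')$ represents the discounted expectation of visitation to state $s'$ from state $s$; in other words, we can interpret $M^\pi(s, s')$ as a (discounted) transitional probability from state $s$ to $s'$.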

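The matrix $M$ can be learned incrementally with the temporal difference (TD) algorithm, in the same way that an RL agent incrementally learns the value of the environment. Therefore, we can derive the following TD learning equation for the $M$ matrix:

$$M(s_t, :) \leftarrow M(s_t, :) + \alpha_M \left[\mathbb{1}_{s_t} + \gamma M(s_{t+1}, :) - M(s_t, :)\right] \tag{3}$$

where $\mathbb{1}_{s_t}$ is the one-hot vector for state $s_t$ and $\alpha_M$ is the learning rate.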
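
The TD rule for the SR matrix can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `sr_td_update`, the values of `alpha` and `gamma`, and the handling of the terminal state (no bootstrapping from a successor) are all assumptions made for the example.

```python
import numpy as np

def sr_td_update(M, s, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """Apply one TD update to row s of the SR matrix M (in place).

    alpha and gamma are illustrative values, not taken from the paper.
    """
    onehot = np.zeros(M.shape[1])
    onehot[s] = 1.0
    # Target: occupancy of the current state plus the discounted successor row.
    # Assumption: a terminal transition has no successor to bootstrap from.
    target = onehot if terminal else onehot + gamma * M[s_next]
    M[s] += alpha * (target - M[s])
    return M

# Usage: a 3-state chain 0 -> 1 -> 2 where the agent always moves right.
M = np.zeros((3, 3))
for _ in range(2000):
    sr_td_update(M, 0, 1)
    sr_td_update(M, 1, 2)
    sr_td_update(M, 2, 2, terminal=True)
print(np.round(M, 4))
```

Under this deterministic policy, each row of `M` approaches the discounted occupancy of the states to its right, so the first row tends toward `[1, gamma, gamma**2]`, which is the skewed pattern discussed for place fields.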

The original form of SR learning can only learn a tabular environment [Lee, 2020], but it can be generalized by using a set of feature functions [Barreto et al., 2017]. We can generalize SR learning by assuming that the expected reward of state $s$ can be factorized into the product of the feature vector $\phi(s)$ and its weights $\mathbf{w}$ for the reward as $R(s) = \phi(s)^\top \mathbf{w}$. Therefore, we can rewrite value function (1) as follows:

$$V^\pi(s) = \mathbb{E}^\pi\left[\sum_{k=0}^{\infty} \gamma^k\, \phi(S_{t+k})^\top \mathbf{w} \,\middle|\, S_t = s\right] = \psi^\pi(s)^\top \mathbf{w} \tag{4}$$

where $\psi^\pi(s) = \mathbb{E}^\pi\left[\sum_{k=0}^{\infty} \gamma^k \phi(S_{t+k}) \mid S_t = s\right]$, and $\phi(s)$ can be simply expressed as a one-hot vector in $\mathbb{R}^{|\mathcal{S}|}$ for a tabular environment. In that case, $\psi^\pi(s)$ is equivalent to the vector $M^\pi(s, :)$ of SR learning, because $\psi^\pi(s)$ is the discounted sum of occurrences of $\phi$ when transitions occur under the policy $\pi$. We call $\psi^\pi(s)$ the successor features (SFs) for state $s$ under policy $\pi$. This idea generalizes SR learning to other types of MDP environments, including partially observable MDPs and continuous states [Vertes and Sahani, 2019].
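
The equivalence between SFs and the SR in the tabular case can be written out explicitly. The short derivation below is a sketch in notation assumed for this illustration: $\phi(s)$ is the feature vector, $\psi^\pi(s)$ the successor features, $M^\pi(s, s')$ the SR, $\mathbf{w}$ the reward weights, and $s_j$ the $j$-th state. With one-hot features, $\phi_j(s) = \mathbb{1}(s = s_j)$, so the $j$-th component of the SFs is

$$\psi^\pi_j(s) = \mathbb{E}^\pi\left[\sum_{k=0}^{\infty} \gamma^k\, \phi_j(S_{t+k}) \,\middle|\, S_t = s\right] = \mathbb{E}^\pi\left[\sum_{k=0}^{\infty} \gamma^k\, \mathbb{1}(S_{t+k} = s_j) \,\middle|\, S_t = s\right] = M^\pi(s, s_j).$$

Since $R(s_j) = \phi(s_j)^\top \mathbf{w} = w_j$ in this case, the SF form of the value function reduces to the SR form:

$$V^\pi(s) = \psi^\pi(s)^\top \mathbf{w} = \sum_j M^\pi(s, s_j)\, w_j = \sum_j M^\pi(s, s_j)\, R(s_j).$$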

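We can estimate SFs using a linear function as follows:

$$\hat{\psi}(s) = W^\top \phi(s) \tag{5}$$

Assuming that $\phi(s)$ is a population vector of neurons responding to the state $s$ observed by an agent, using a linear function is plausible for a neurobiological model of hippocampal place cells [Geerts et al., 2020, Lee, 2020, Stachenfeld et al., 2017]. We use TD learning to update $W$ to estimate $\hat{\psi}$, in the same way that the matrix $M$ is updated in SR learning:

$$W \leftarrow W + \alpha_W\, \phi(s_t) \left[\phi(s_t) + \gamma \hat{\psi}(s_{t+1}) - \hat{\psi}(s_t)\right]^\top \tag{6}$$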

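Note that this update is equivalent to equation (3) when $\phi(s)$ is a one-hot vector. The reward expectation weight vector $\mathbf{w}$ is updated using a simple delta rule:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha_r \left[R_t - \phi(s_t)^\top \mathbf{w}\right] \phi(s_t) \tag{7}$$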
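
The two update rules of the SF agent, a TD rule for the weight matrix and a delta rule for the reward weights, can be sketched together as follows. This is a hedged illustration rather than the paper's code: the names `sf_step` and `value`, the convention $\hat{\psi}(s) = W^\top \phi(s)$, the learning rates, the discount factor, and the terminal-state handling are assumptions for the example.

```python
import numpy as np

def sf_step(W, w, phi, phi_next, reward, alpha_w=0.1, alpha_r=0.1,
            gamma=0.95, terminal=False):
    """One SF update (in place): TD rule for W, delta rule for reward weights w.

    All hyperparameter values are illustrative, not taken from the paper.
    """
    psi = W.T @ phi
    # Assumption: no bootstrapping from a successor at a terminal transition.
    psi_next = np.zeros_like(psi) if terminal else W.T @ phi_next
    td_error = phi + gamma * psi_next - psi      # vector-valued TD error
    W += alpha_w * np.outer(phi, td_error)       # gradient step on the linear map
    w += alpha_r * (reward - phi @ w) * phi      # delta rule on the reward weights
    return W, w

def value(W, w, phi):
    """Estimated value: inner product of successor features and reward weights."""
    return (W.T @ phi) @ w

# Usage: 3-state chain with one-hot features and reward 1 in the last state.
phis = np.eye(3)
W, w = np.zeros((3, 3)), np.zeros(3)
for _ in range(2000):
    sf_step(W, w, phis[0], phis[1], reward=0.0)
    sf_step(W, w, phis[1], phis[2], reward=0.0)
    sf_step(W, w, phis[2], phis[2], reward=1.0, terminal=True)
print(value(W, w, phis[0]))
```

With one-hot features, the outer product touches only the row of `W` that belongs to the current state, which is why the rule coincides with the tabular SR update in that case.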

2.3 Experimental setting

Grid world

We test each agent on a simple one-dimensional grid world whose size $N$ ranges from 3 to 100 cells. In the grid world, agents navigate with left and right actions (Figure 1). In every episode, the agent starts at the left end of the world. The goal is to reach the right end (terminal state), where the agent receives a reward of 1; in all other states it receives zero reward.






Figure 1: Schematic drawing of the one-dimensional grid world following the MDP. $V^*$ represents the true expected value of each cell according to the discount factor ($\gamma$) when the reward of the terminal state is 1.

The discount factor $\gamma$ is set to . To maximize the overall discounted reward, the agent selects actions according to the current estimated Q-value using an $\epsilon$-greedy policy. Actions are selected uniformly at random with probability $\epsilon$; otherwise, the action with the highest Q-value estimate is used with probability $1 - \epsilon$. To stabilize learning by allowing sufficient exploration, the probability $\epsilon$ is decayed by the following rule: , where is the episode index. The learning rates for each learner ($\alpha_M$ of the SR agent and $\alpha_W$ of the SF agents) are set to . The learning rate $\alpha_r$ of the reward vector for the SR agent and the SF agents is set to . The observation of the maze environment is provided as the index of each state for the SR agent and as a one-hot vector for the SF agents.
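
The action selection and the grid world dynamics described above can be sketched with the Python standard library. This is an illustration under assumptions: the function names are hypothetical, and `decayed_epsilon` uses an exponential schedule with made-up constants purely as a placeholder, since it is not the decay rule of the paper.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so equal estimates do not bias the agent to one side.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

def decayed_epsilon(episode, eps0=0.9, decay=0.95):
    """Hypothetical exponential schedule; NOT the rule used in the paper."""
    return eps0 * decay ** episode

def step(state, action, n_cells):
    """1D grid world: action 0 moves left, action 1 moves right.

    Returns (next_state, reward, done); reward is 1 only at the right end.
    """
    if action == 0:
        nxt = max(0, state - 1)
    else:
        nxt = min(n_cells - 1, state + 1)
    done = nxt == n_cells - 1
    return nxt, (1.0 if done else 0.0), done

# Usage: one episode of a purely random walk (epsilon = 1) in a 5-cell world.
state, done, n_steps = 0, False, 0
while not done:
    action = epsilon_greedy([0.0, 0.0], epsilon=1.0)
    state, reward, done = step(state, action, n_cells=5)
    n_steps += 1
```

Early episodes with a high exploration probability behave like this random walk, which is why step lengths are long before the estimated values become informative.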

Variation of weight initialization

Previous papers used different initialization methods for the weight matrix of the SF agent. Geerts et al. [Geerts et al., 2020] used an identity matrix as the initial weight matrix of SF agents, whereas Barreto et al. [Barreto et al., 2017] used small random values for the initial weight matrix. In this paper, we compare three initialization methods: identity matrix, zero matrix, and small random matrix. The Xavier method [Glorot and Bengio, 2010] was used for initializing the small random matrix.
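
The three initialization schemes can be sketched as below. This is an assumption-laden illustration: the function name `init_weights` is hypothetical, the uniform variant of the Xavier (Glorot) method is assumed because the text does not say which variant was used, and the absolute value is taken because the Discussion states that the absolute value of the Xavier method was used for the random agent.

```python
import numpy as np

def init_weights(n, method, rng=None):
    """Return an n x n initial SF weight matrix for the given method."""
    rng = np.random.default_rng() if rng is None else rng
    if method == "identity":
        return np.eye(n)
    if method == "zero":
        return np.zeros((n, n))
    if method == "random":
        # Xavier/Glorot uniform bound with fan_in = fan_out = n (assumed variant).
        limit = np.sqrt(6.0 / (n + n))
        # Absolute value keeps all predicted occupancies non-negative.
        return np.abs(rng.uniform(-limit, limit, size=(n, n)))
    raise ValueError(f"unknown initialization method: {method}")

# Usage: compare the initial matrices for a 5-cell grid world.
rng = np.random.default_rng(0)
for name in ("identity", "zero", "random"):
    print(name, init_weights(5, name, rng).round(2), sep="\n")
```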

2.4 Evaluation of performance

Analysis of the update history of the SR place field matrix

Dimensional reduction of the history of the SR matrix was done with principal component analysis (PCA) [Jolliffe and Cadima, 2016]. The L1 distance between matrices was measured as follows:

$$d(M^A, M^B) = \sum_{i} \sum_{j} \left| m^A_{ij} - m^B_{ij} \right| \tag{8}$$

where $m^A_{ij}$ and $m^B_{ij}$ indicate one element from the different SR place field matrices $M^A$ and $M^B$, respectively.
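
Both analyses can be sketched with NumPy. The helper names `l1_distance` and `pca_project` are hypothetical, and computing PCA through a singular value decomposition of the flattened, mean-centered matrices is one common approach that is assumed here; the paper does not state which implementation was used.

```python
import numpy as np

def l1_distance(M_a, M_b):
    """Sum of absolute element-wise differences between two SR matrices."""
    return np.abs(M_a - M_b).sum()

def pca_project(history, n_components=2):
    """Project a learning history onto its leading principal components.

    history: array of shape (n_episodes, N, N), one SR matrix per episode.
    Returns an array of shape (n_episodes, n_components).
    """
    X = history.reshape(len(history), -1)   # flatten each matrix to a row
    X = X - X.mean(axis=0)                  # center every matrix element
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

# Usage with a synthetic history that drifts from zeros toward the identity.
history = np.stack([np.eye(4) * (1 - 0.8 ** k) for k in range(20)])
trajectory = pca_project(history)
print(trajectory.shape, l1_distance(history[0], history[-1]))
```

Each row of `trajectory` is one episode, so plotting the rows in order traces the route that the SR matrix takes during learning.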

Value error

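In a one-dimensional maze starting from the left end, the optimal policy is to always move to the right. The true value of the n-th grid cell is $V^*(s_n) = \gamma^{N-n}$, where $s_n$ is the n-th grid cell. To compare the learning efficiency of each agent, the mean square error (MSE) between $V^*$ and the estimated value $\hat{V}$ is calculated as follows:

$$\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left( V^*(s_n) - \hat{V}(s_n) \right)^2 \tag{9}$$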

Each agent simulation test was run 10 times, and the mean and standard deviation of these results are presented.
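
The value error can be sketched as follows. The true value used here is an assumption for illustration: if the reward of 1 is obtained in the terminal cell and the agent always moves right, a cell that is $N - n$ steps from the goal has value $\gamma^{N-n}$. The function names and the discount factor in the usage lines are hypothetical.

```python
import numpy as np

def true_values(n_cells, gamma):
    """True values under the always-right policy (assumed reward convention)."""
    steps_to_goal = np.arange(n_cells - 1, -1, -1)   # N-1, ..., 1, 0
    return gamma ** steps_to_goal

def value_mse(v_true, v_est):
    """Mean squared error between true and estimated state values."""
    diff = np.asarray(v_true, dtype=float) - np.asarray(v_est, dtype=float)
    return float(np.mean(diff ** 2))

# Usage: error of an untrained agent whose value estimates are all zero.
v_star = true_values(5, gamma=0.95)
print(value_mse(v_star, np.zeros(5)))
```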

3 Results

3.1 Random SF agent converges to asymmetrical SR place field faster

To investigate the effect of different initial matrix forms on SF agents, the learning history of the weight matrix of the SF agents was analyzed and compared with that of the SR agent.

SR place field learning history

Figure 2 shows representative results simulated in a grid world with N=100. Comparing the learned SR place field of the 50th cell between agents reveals that the SF agent with random weights (random agent) skews its SR place field earlier than the other three agents. The SR place field of the 50th cell of the other three agents shows a symmetrical pattern even at the 50th and 100th episodes. Considering the learned pattern of the whole SR matrix (Figure 2C) at the 50th episode, the SR place field of the random agent has an asymmetrical pattern even in the cells near the first cell, whereas the SR place fields of the other three agents show a symmetrical pattern except for the cells near the goal location.

Figure 2: The simulated learning histories of the SR place field show that the SF agent with randomly initialized weights rapidly converges to the asymmetric SR place field. (A) Line plots of the learned SR place field of the 50th cell after the end of the 10th, 50th, 100th, and 300th episodes. Each line and shaded area show the average and standard deviation from 10 simulations in a grid world with 100 cells. Each row displays the SR agent (first row) or the SF agents with different weight initialization methods (three rows below). (B) Line plots from A rearranged for comparison between the agents (SR, blue; SF weight initialization with identity matrix, orange; zero matrix, green; random matrix, red). Each row displays the simulated results after the end of the 10th, 50th, 100th, and 300th episodes. Note that the SF agent with random weights shows a skewed SR place field at the 50th episode, whereas the other agents show symmetrical SR place fields. (C) Learning histories of the whole SR place field matrix in a grid world (N=100), arranged by episode (columns) and learning agent (rows).


The whole SR matrix represents the population activity of all place cells that encode the entire grid world. Thus, we need to reduce the dimension of the SR matrix to analyze and track how the learned pattern changes across episodes. Similar to the method used in neuroscience research to analyze large-scale neuronal recordings [Cunningham and Yu, 2014], PCA was used in this study (Figure 3). As expected from the representative result of changes in the SR place field, the random agent takes a shorter route to the convergence point.

In a small grid world (N=5), the changing pattern of the SR place matrix of the SR agent resembles that of the SF agent with identity matrix (identity agent), and the pattern of the random agent resembles that of the SF agent with zero matrix (zero agent). In larger grid worlds, however, the changing patterns of the three agents other than the random agent are similar to each other. Note that the axis scales differ between panels (see Supple. Fig. 1 for the same plots on a common scale).

Figure 3: Principal component analysis (PCA) of the SR place field matrix learning history shows that the SF agent with random weights takes a shorter route to the convergence point. The simulated results from four different sizes of grid worlds (N=5, 25, 50, 100) are shown. Each dot shows the PCA result of the SR place fields after each episode (*, first episode). Each line shows the historical route of SR place field learning of the SR or SF agents (SR, blue; SF weight initialization with identity matrix, orange; zero matrix, green; random matrix, red). The average of the SR place field matrices from 10 simulations was used for PCA.

Distance of SR matrix between agents

As can be inferred from the PCA results, if the random agent converges to the optimal SR place field faster, the distance between the SR place matrices of the random agent and those of the other agents will first increase during learning and then shrink as all agents converge. The distances among the SR place matrices of the other three agents will not increase. To confirm this, the L1 distances between the agents' SR matrices are plotted against the learning episode (Figure 4). The changes in L1 distance between the four agents match this expectation. Since a larger grid world leads to a larger SR matrix and therefore a greater distance, the y-axes are plotted in log scale.

To normalize the effect of the SR matrix size on the L1 distance, we can divide the L1 distance by the size of the matrix ($N^2$). This normalizes the distance to a per-element distance of the SR matrix, which also shows that the random agent's distance from the other agents increases (Supple. Fig. 2).

Figure 4: The L1 distance between the SR place fields of the agents shows that the SF agent with random weights differs from the other three agents. Line plots show the change of the L1 distance across episodes in four different sizes of grid worlds (N=5, 25, 50, 100). Since the total number of episodes depends on the grid world size, relative episodes are shown on the x-axis.

3.2 Random agent decreases value error and step length faster

Considering equations (2) and (4), we can directly relate the difference in the learning of the SR matrix to the performance of the RL agent. This section compares the expected value and the step length to show the difference in learning performance.

Value error

Based on the true value ($V^*$), the mean square error (MSE) of the estimated value ($\hat{V}$) was calculated (see section 2.4 for equation (9)). As expected, Figure 5 shows that the MSE of $\hat{V}$ of the random agent decreases faster than that of the other agents. The rapid reduction of the MSE by the random agent is most noticeable in early episodes. In late episodes, the MSE did not differ between agents.

Figure 5: The mean square error of values shows that the estimated value of the SF agent with random weights approaches the true value faster than that of the other agents. (A) The upper panel shows that the mean squared error (MSE) of the estimated values ($\hat{V}$) decreases as the episodes progress. The lower panel shows the decrease in the MSE per single episode. The results are from 10 simulations in four grid worlds of different sizes (arranged by columns). The averages (lines) and standard deviations (shaded areas) of the SR or SF agents (SR, blue; SF weight initialization with identity matrix, orange; zero matrix, green; random matrix, red) are shown. (B) The upper panel shows that the averages of the MSE over the last 100 episodes are similar across the SR and SF agents. The lower panel shows that the average per-episode decrease in MSE over the first 10 episodes is larger for the SF agent with random weights than for the other three agents. Each circle marker indicates the size of the simulated grid world.

Step length decreasing

Since the MSE of the random agent decreases rapidly, we can expect that the step length to the goal cell of the grid world in each episode will also decrease faster. Because of their high initial $\epsilon$, all the RL agents explore the grid world in a random-walk manner, so the step length is long in earlier episodes. As the episodes progress, the step length is reduced to the size of the grid world (Figure 6).

In the small grid worlds (N<30), the rates of decrease in step length show no significant difference between the RL agents. In the large grid worlds (N≥50), however, the step length of the random agent decreases faster (upper panel in Figure 6B). While the rate of decrease in step length of the random agent is relatively constant, the other three agents show jittering in the rate of decrease (lower panel in Figure 6B).

Figure 6: The SF agent with random weights converges to the optimal step length more rapidly and stably than the other three agents. (A) The upper panel displays the total step length taken to reach the target state in each episode. In the simulation results from the large grid world (N=100), the step length of the SF agent with random weights decreases to the ideal step length within the first 10 episodes, while that of the other agents decreases to the ideal step length only after hundreds of episodes. The lower panel shows the decrease in total step length with each episode. In the simulation results of the large grid worlds (N≥50), the jittering of this per-episode decrease disappears after 10 episodes for the SF agent with random weights, whereas it persists for the other three agents. The results are from 10 simulations in four grid worlds of different sizes (arranged by columns). The averages (lines) and standard deviations (shaded areas) of the SR or SF agents (SR, blue; SF weight initialization with identity matrix, orange; zero matrix, green; random matrix, red) are shown. (B) The top panel shows the average step length of the first 100 episodes according to grid world size. The lower panel shows the standard deviation of the per-episode decrease in step length over the first 100 episodes according to grid world size.

4 Discussion

In this study, we compared the performance of the SF agent by varying the initialization value of the weight matrix. We confirmed that the random agent learns faster than the identity or the zero agent.

In the study of artificial neural networks (ANNs) trained with backpropagation, efficient initialization methods that depend on the activation function are well known. The normalized Xavier initialization [Glorot and Bengio, 2010] is mainly used for sigmoid or tanh functions. The He initialization [He et al., 2015] is often used with ReLU functions.

When initializing the random agent in this paper, the absolute value of the Xavier-initialized weights was used. If negative numbers were included, the SR values corresponding to future occupancy could become negative, because we used a single-layer function approximator without an activation function. Multi-layer ANNs could be used as the function approximator to mitigate this problem. When a deep neural network is used as the successor feature approximator, further research is needed to determine which hidden-layer activation function is suitable and which weight initialization method is efficient.

Since the feature vector of the input layer provides the current location to the RL agents, the SF weight matrix transforms the current location into a population vector that encodes the probability of future occupancy under the policy. According to recent studies [de Cothi and Barry, 2020, Geerts et al., 2020, Stachenfeld et al., 2017], CA1 place cells encode the successor representation in their population code. From the neurobiological perspective, the SF weight matrix is comparable to the synaptic weights between CA1 place cells and the preceding neural layers (for example, CA3 neurons and entorhinal cortical neurons).

Although it is still unclear how the synaptic update rule used in this paper could be implemented in the brain, it seems that TD learning corresponds to a response of dopaminergic neurons to reward prediction errors [Montague et al., 1996, Schultz, 1998]. It is presumed that TD learning is implemented with neural plasticity rules, such as spike timing-dependent plasticity and heterosynaptic plasticity [Lee, 2020, Rao and Sejnowski, 2001].

It is biologically plausible to consider that place coding and reward prediction coding are processed in parallel, and that the brain integrates them into expected values for a certain state. We can view the brain as a parallel distributed processing device [Rumelhart, 1986]. Based on this assumption, the backpropagation algorithm has shown excellent performance in image recognition [Krizhevsky et al., 2012, Rumelhart et al., 1986]. A convolutional neural network (CNN) trained with this algorithm showed activation patterns similar to those of the brain's visual cortex and inferior temporal (IT) cortex [Yamins et al., 2014], and manipulating images according to the activation pattern of a trained CNN predicts neuronal responses in the V4 visual cortex of macaque monkeys [Bashivan et al., 2019]. However, it is unclear and still controversial whether the backpropagation algorithm actually occurs in the brain [for review, see Lillicrap et al., 2020, Whittington and Bogacz, 2019].

Returning to the biological implementation of SR learning, it is unclear where and how in the brain the inner product of the feature vector and the reward vector (Equation 4) is processed. Experimental evidence shows that the reward signal is represented in the orbitofrontal cortex (OFC) [Gottfried et al., 2003, Sul et al., 2010], and the anterior cingulate gyrus is a candidate region for integrating the reward signal of the OFC and the SF signal of the HPC [Kolling et al., 2016; for review, see Shenhav et al., 2013]. Another paper [Gauthier and Tank, 2018], however, shows that the HPC directly encodes the position of reward. If we turn our attention from the where problem to the how problem, it becomes a problem of reading out a scalar value from the successor feature vector and the reward vector [Meyniel et al., 2015]. Although this paper did not address these issues directly, it offers the following neuroembryological insight: if a neural network in the developmental stage initializes its synaptic weights randomly, the weights may converge to the optimal state faster.

Comparing theoretical models and their results with biological evidence is an essential research process for improving our understanding of the brain and for designing explainable, reliable, and efficient artificial intelligence (AI). For this purpose, we used SF learning as a model in this paper and compared and analyzed the performance of agents according to weight initialization. We also discussed the neurobiological significance of the results. Through this convergent interaction between neuroscience and AI research, we can expect both communities to reach a deeper understanding of intelligence.

5 Credit author statement

Hyunsu Lee: Conceptualization, Methodology, Writing, Funding acquisition

6 Acknowledgements

This study was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT; Ministry of Science and ICT) (No. NRF-2017R1C1B507279).

Supple. Fig. 1: The same principal component analysis (PCA) of the SR place field matrix learning history as shown in Figure 3, but drawn to the same scale. Except for the scale, the details are the same as in Figure 3.
Supple. Fig. 2: L1 distances divided by the size of the matrix ($N^2$) are shown. This normalization gives the per-element distance of the SR matrix. The random agent shows a large distance from the other agents.

7 References


  • Andersen et al. [2006] P. Andersen, R. Morris, D. Amaral, T. Bliss, and J. O’Keefe. The Hippocampus Book (Oxford Neuroscience Series). Oxford University Press, 2006.
  • Barreto et al. [2017] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. Van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. 31st Conference on Neural Information Processing Systems, 2017.
  • Bashivan et al. [2019] P. Bashivan, K. Kar, and J. DiCarlo. Neural population control via deep image synthesis. Science, 364, 2019. doi: 10.1126/science.aav9436.
  • Cunningham and Yu [2014] J. P. Cunningham and B. M. Yu. Dimensionality reduction for large-scale neural recordings. Nat. Neurosci., 17:1500–1509, 2014. doi: 10.1038/nn.3776.
  • Dayan [1993] P. Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993. ISSN 0899-7667.
  • de Cothi and Barry [2020] W. de Cothi and C. Barry. Neurobiological successor features for spatial navigation. Hippocampus, 30:1347–1355, 2020. doi: 10.1002/hipo.23246.
  • Gauthier and Tank [2018] J. Gauthier and D. Tank. A dedicated population for reward coding in the hippocampus. Neuron, 99:179–193.e7, 2018. doi: 10.1016/j.neuron.2018.06.008.
  • Geerts et al. [2020] J. P. Geerts, F. Chersi, K. L. Stachenfeld, and N. Burgess. A general model of hippocampal and dorsal striatal learning and decision making. Proceedings of the National Academy of Sciences, 117(49):31427–31437, 2020. ISSN 0027-8424.
  • Glorot and Bengio [2010] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • Gottfried et al. [2003] J. Gottfried, J. O’Doherty, and R. Dolan. Encoding predictive reward value in human amygdala and orbitofrontal cortex. Science, 301:1104–1107, 2003. doi: 10.1126/science.1087919.
  • He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. cs.CV, 2015.
  • Jolliffe and Cadima [2016] I. Jolliffe and J. Cadima. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci, 374:20150202, 2016. doi: 10.1098/rsta.2015.0202.
  • Kolling et al. [2016] N. Kolling, M. K. Wittmann, T. E. J. Behrens, E. D. Boorman, R. B. Mars, and M. F. S. Rushworth. Value, search, persistence and model updating in anterior cingulate cortex. Nat. Neurosci., 19:1280–1285, 2016. doi: 10.1038/nn.4382.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • Lee [2020] H. Lee. Toward the biological model of the hippocampus as the successor representation agent. arXiv preprint arXiv:2006.11975, 2020.
  • Lillicrap et al. [2020] T. Lillicrap, A. Santoro, L. Marris, C. Akerman, and G. Hinton. Backpropagation and the brain. Nat. Rev. Neurosci., 2020. doi: 10.1038/s41583-020-0277-3.
  • Mehta et al. [2000] M. R. Mehta, M. C. Quirk, and M. A. Wilson. Experience-dependent asymmetric shape of hippocampal receptive fields. Neuron, 25:707–715, 2000. doi: 10.1016/S0896-6273(00)81072-7.
  • Meyniel et al. [2015] F. Meyniel, M. Sigman, and Z. Mainen. Confidence as bayesian probability: From neural origins to behavior. Neuron, 88:78–92, 2015. doi: 10.1016/j.neuron.2015.09.039.
  • Montague et al. [1996] P. R. Montague, P. Dayan, and T. J. Sejnowski. A framework for mesencephalic dopamine systems based on predictive hebbian learning. J. Neurosci., 16:1936–1947, 1996.
  • Puterman [2014] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, 2014.
  • Rao and Sejnowski [2001] R. P. Rao and T. J. Sejnowski. Spike-timing-dependent hebbian plasticity as temporal difference learning. Neural computation, 13:2221–2237, 2001. doi: 10.1162/089976601750541787.
  • Rumelhart [1986] D. E. Rumelhart. Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Vol. 1. 1986.
  • Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986. doi: 10.1038/323533a0.
  • Schultz [1998] W. Schultz. Predictive reward signal of dopamine neurons. J. Neurophysiol., 80:1–27, 1998.
  • Shenhav et al. [2013] A. Shenhav, M. Botvinick, and J. Cohen. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron, 79:217–240, 2013. doi: 10.1016/j.neuron.2013.07.007.
  • Stachenfeld et al. [2017] K. L. Stachenfeld, M. M. Botvinick, and S. J. Gershman. The hippocampus as a predictive map. Nature Neuroscience, 7:1951, Oct 2017. doi: 10.1038/nn.4650.
  • Sul et al. [2010] J. Sul, H. Kim, N. Huh, D. Lee, and M. Jung. Distinct roles of rodent orbitofrontal and medial prefrontal cortex in decision making. Neuron, 66:449–460, 2010. doi: 10.1016/j.neuron.2010.03.033.
  • Vertes and Sahani [2019] E. Vertes and M. Sahani. A neurally plausible model learns successor representations in partially observable environments. arXiv preprint arXiv:1906.09480v1, 2019.
  • Whittington and Bogacz [2019] J. C. Whittington and R. Bogacz. Theories of error back-propagation in the brain. Trends Cog. Sci., 2019. doi: 10.1016/j.tics.2018.12.005.
  • Yamins et al. [2014] D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. U S A, 111:8619–8624, 2014. doi: 10.1073/pnas.1403112111.