In the field of cognitive science, behaviour is widely considered to be separated into two classes. Habitual (reflexive) behaviour responds rapidly and instinctively to stimuli, while reflective behaviour involves slower top-down processing and a period of deliberation. These classes of behaviour were labelled System 1 and System 2 by Stanovich and West in 2000 [30], and the topic was popularised in Daniel Kahneman’s book “Thinking, Fast and Slow” in 2011 [21]. However, it remains unclear how the two processes are implemented in the brain.
This paper investigates how a single system operates when generating behaviours which require different amounts of deliberation. The system does not carry out planning or evaluation of prospective outcomes, but simply compares the triggering of actions which require more or less time to select the correct action for a presented situation. This distinction has some parallels with, but is separate from, the split between goal-based planning (often called model-based) and habitual (often called model-free) control [11, 31]. Although there is some evidence that separate brain systems underpin goal-based planning and habits [18, 11, 33], there are also indications from fMRI and lesion studies that these processes can co-exist within the same regions of the brain [9, 11, 33], challenging the notion of separate systems. Since it is possible that even these extremes of behaviour are computed together in the brain, a useful contribution to the topic would be to analyse how reflexive and reflective behaviour can arise from a unified system.
According to the free-energy principle, the brain maintains a generative model of its environment and uses variational inference to approximate Bayesian inference [29, 20]. One way of implementing this (under Gaussian assumptions) is to use a hierarchical predictive coding architecture [28, 12, 13], which has successive layers of descending predictions and ascending prediction errors [5, 3, 32, 24]. This theory is also often referred to as Predictive Processing (PP) [7, 25]. In this paper, we investigate how PP approaches can explain both reflexive and reflective behaviour simultaneously using a single hierarchical predictive coding network architecture and inference procedure.
In its basic form, PP uses a generative model to try to correctly infer hidden causes for incoming observations. In a hierarchical predictive coding network, all layers of the hierarchy are updated to minimize prediction errors until a fixed point is reached, with the resultant top layer being the best explanation of the hidden causes of the observations at the bottom layer [7, 5, 3].
PP can also be used to explain action, with the network modelling how actions and sensations interact [2, 6, 4, 8, 27, 26, 25]. Actions are triggered by descending predictions which cause low level prediction errors. These errors are rectified through reflex arcs [17, 1, 19]. In theory, this means that motor behaviour need not wait for full end-to-end inference to be completed but, rather, action takes place once a threshold has been crossed on the reflex muscle.
This paper investigates the extent to which action selection in a predictive coding network (PCN) relies on all the layers of the PCN. To do this, we train a network to associate actions and observations with each other. We then investigate whether inference across the full network is required in order to trigger the correct action for a given observation. We show that a decision making task with a higher degree of complexity will use more of the layers and may be strongly dependent on the top layer being correctly inferred. Conversely, a decision which is a simple function of sensory observations can operate without involvement of higher layers, despite the fact that learning included those higher layers. This demonstrates that learning allows a hierarchy of action/sensation linkages to be built up in the network, with agents able to use information from lower layers to infer the correct actions without necessarily needing to engage the whole network. These findings suggest that a single PCN architecture could explain both reflexive and reflective behaviour.
In the general case of state space models, the fixed point of a PCN is often in a moving frame of reference. However, the implementation described in this paper ignores state transitions or dynamics and restricts itself to static images. It should therefore not be confused with the notion of predictive coding sometimes seen in the engineering or active inference literature, which rests on a state space model for generating time series. Rather, our formulation follows the approach of Rao and Ballard’s seminal paper [28] and ignores any temporal prediction components, whilst retaining what Friston describes as “the essence of predictive coding, namely any scheme that finds the mode of the recognition density by dynamically minimising prediction error in an input-specific fashion”.
The remainder of this paper is set out as follows. Section 2 outlines the hierarchical predictive coding (HPC) model which is used to implement variational inference. Section 3 describes the experiments which we use to analyse inference of labels and actions in PCNs. Section 4 presents the experimental results, demonstrating that learning to act need not rely on high-level hidden states. Moreover, we show that the number of higher layers which can be ignored in decision making relates to the complexity of information needed to make that decision.
2 Hierarchical Predictive Coding (HPC)
At the core of the free-energy principle is the concept that, in order to survive, an agent must strive to make its model of the world a good fit for incoming observations, $o$. If the model of observations $p(o)$ can be explained by hidden states of the world $x$ then, in theory, a posterior estimate for $x$ could be obtained using Bayes' rule over a set of observations:
$$p(x \mid o) = \frac{p(o \mid x)\,p(x)}{\int p(o \mid x)\,p(x)\,dx} \tag{1}$$
but the denominator is likely to be intractable. Therefore $p(x \mid o)$ is approximated using variational inference. An auxiliary model (the variational distribution) is created, $q(x \mid o)$, and the divergence between $q(x \mid o)$ and the true posterior minimized. The KL divergence is used to measure this:
$$\mathrm{KL}\left[q(x \mid o)\,\|\,p(x \mid o)\right] = \int q(x \mid o)\,\ln\frac{q(x \mid o)}{p(x \mid o)}\,dx = F + \ln p(o) \tag{2}$$
where the variational free energy $F$ is defined as:
$$F = \int q(x \mid o)\,\ln\frac{q(x \mid o)}{p(o, x)}\,dx \tag{3}$$
The value $-\ln p(o)$ is an information-theoretic measure of the unexpectedness of an observation, variously called surprise, surprisal or the negative of log model evidence. By adjusting the model parameters to minimize surprisal, the model becomes a better fit of the environment. Noting that the KL divergence is never negative, it can be seen from equation (2) that $F$ is an upper bound on surprisal. Therefore, to make the model a good fit for the data, it suffices to minimize $F$.
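The link between the KL divergence, the variational free energy and surprisal can be written out in one step using the product rule. The sketch below assumes the notation $q(x \mid o)$ for the variational distribution, $p(x \mid o)$ for the true posterior over hidden states $x$ given observations $o$, and $F$ for the variational free energy.

```latex
\begin{aligned}
\mathrm{KL}\left[q(x \mid o)\,\|\,p(x \mid o)\right]
  &= \int q(x \mid o)\,\ln\frac{q(x \mid o)}{p(x \mid o)}\,dx \\
  &= \int q(x \mid o)\,\ln\frac{q(x \mid o)\,p(o)}{p(o, x)}\,dx
     \qquad \text{since } p(x \mid o) = \frac{p(o, x)}{p(o)} \\
  &= \underbrace{\int q(x \mid o)\,\ln\frac{q(x \mid o)}{p(o, x)}\,dx}_{F} \;+\; \ln p(o)
     \qquad \text{since } \int q(x \mid o)\,dx = 1 .
\end{aligned}
```

Rearranging gives $F = \mathrm{KL} - \ln p(o) \ge -\ln p(o)$, because the KL divergence is never negative; this is the sense in which the free energy bounds surprisal from above.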
The next step is to consider how this would be implemented in the brain via HPC. In HPC, the generative model is implemented in Markovian hierarchical layers, where the priors are simply the values of the layer above, mapped through a weight matrix and a nonlinear function. The prior at the top layer may either be a flat prior or set externally. With $N$ layers, the top layer is labelled as layer $1$, and the observation at the bottom as layer $N$. Thus:
$$p(o, x) = p(x_1)\prod_{n=2}^{N} p(x_n \mid x_{n-1}), \qquad x_N = o \tag{4}$$
The generative model is assumed to be Gaussian at each layer,
$$x_{n+1} = f(\theta_n x_n) + \eta_{n+1}, \qquad \eta_{n+1} \sim \mathcal{N}(0, \Sigma_{n+1}) \tag{5}$$
where $x_n$ is a vector representing node values on layer $n$, $\theta_n$ is a matrix giving the connection weights between layer $n$ and layer $n+1$, $f$ is a non-linear function and $\eta_n$ is Gaussian noise at each layer. Note that the network also has a bias at each layer which is updated in a similar manner to the weights. This has not been included here for brevity. [Here we have shown the form where the argument of $f$ is a weighted linear mixture of hidden states, in order to make clear how we have implemented the hierarchy. But this could equally be generalised to any non-linear function $f(x_n, \theta_n)$.]
Making the assumption that $q(x \mid o)$ is a multivariate Gaussian distribution, and further assuming that the distribution of $q$ is tightly packed around its mean $\mu$ (to enable use of the Laplace assumption), $F$ reduces to:
$$F \approx -\ln p(\mu, o) \tag{6}$$
where $o$ is a vector representing observations and $\mu$ is the mean of the brain’s probability distribution for $x$. It is important to note that in this paper the observations are not confined to incoming senses but also include actions, in the form of proprioceptive feedback. Exteroceptive observations cause updates to model beliefs which, in turn, result in updated beliefs on proprioceptive observations. These drive motoneurons to eliminate any prediction error through reflex arcs [17, 15, 29]. Action can therefore be thought of as just a particular type of observation.
Using the distribution for a multivariate Gaussian, the estimate of $F$ can be transformed into:
$$F = \frac{1}{2}\sum_{n=2}^{N}\left(\epsilon_n^{T}\,\Sigma_n^{-1}\,\epsilon_n + \ln\left|2\pi\Sigma_n\right|\right), \qquad \epsilon_n = \mu_n - f(\theta_{n-1}\mu_{n-1}) \tag{7}$$
where $\epsilon_n$ is the difference between the value of layer $n$ and the value predicted by layer $n-1$, and $\mu_N = o$ for clamped observations. $F$ is then minimized following the Expectation-Maximization approach [10, 23], by using gradient descent to alternately update node values ($\mu_n$) on a fast timescale and weight values ($\theta_n$) on a slower timescale.
The gradient for node updates in a hidden layer uses the values of $\epsilon_n$ and $\epsilon_{n+1}$, and is given by the partial derivative:
$$\frac{\partial F}{\partial \mu_n} = \Sigma_n^{-1}\epsilon_n - \theta_n^{T}\left(f'(\theta_n\mu_n)\odot\Sigma_{n+1}^{-1}\epsilon_{n+1}\right) \tag{8}$$
but if the node values of the top layer are being updated then this is truncated to only use the difference compared to the layer below:
$$\frac{\partial F}{\partial \mu_1} = -\theta_1^{T}\left(f'(\theta_1\mu_1)\odot\Sigma_{2}^{-1}\epsilon_{2}\right) \tag{9}$$
As pointed out earlier, downward predictions not only predict exteroceptive (sensory) signals, but also create a proprioceptive prediction error in the motor system (which is cancelled by movement via a reflex arc). In this paper we simply intend to monitor the signals being sent to the motor system and do not wish to include the error cancellation signal being fed back from the reflex arc. For this reason, the update of the “observation node” in the motor system is shown as only using the difference to the layer above:
$$\frac{\partial F}{\partial \mu_N} = \Sigma_N^{-1}\epsilon_N \tag{10}$$
After the node values have been changed, $F$ is then further minimized by updating the weights using:
$$\frac{\partial F}{\partial \theta_n} = -\left(f'(\theta_n\mu_n)\odot\Sigma_{n+1}^{-1}\epsilon_{n+1}\right)\mu_n^{T} \tag{11}$$
Since the impact of variance is not the primary focus here, our simulations assume that all $\Sigma_n$ are fixed to the identity matrix and therefore the gradient update for $\Sigma_n$ has not been included.
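The node and weight updates described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the experimental code: covariances are fixed to the identity, biases are omitted, layers are indexed from 0 at the top (rather than from 1), and all function names, the toy layer sizes and the clamping mask are hypothetical.

```python
import numpy as np

def f(a):
    """Non-linearity (tanh, as listed in the appendix)."""
    return np.tanh(a)

def f_prime(a):
    """Derivative of tanh."""
    return 1.0 - np.tanh(a) ** 2

def prediction_errors(mu, theta):
    """eps[n] = mu[n] - f(theta[n-1] @ mu[n-1]); the top layer (index 0) has none."""
    return [None] + [mu[n] - f(theta[n - 1] @ mu[n - 1]) for n in range(1, len(mu))]

def free_energy(mu, theta):
    """F up to an additive constant when every covariance is the identity."""
    return 0.5 * sum(float(e @ e) for e in prediction_errors(mu, theta)[1:])

def node_gradients(mu, theta):
    """dF/dmu[n] for every layer; the top and bottom layers each lose one term."""
    eps = prediction_errors(mu, theta)
    grads = []
    for n in range(len(mu)):
        g = np.zeros_like(mu[n])
        if n > 0:                # mismatch with the prediction from the layer above
            g += eps[n]
        if n < len(mu) - 1:      # mismatch this layer causes in the layer below
            g -= theta[n].T @ (f_prime(theta[n] @ mu[n]) * eps[n + 1])
        grads.append(g)
    return grads

def weight_gradients(mu, theta):
    """dF/dtheta[n] for every weight matrix (slow-timescale learning step)."""
    eps = prediction_errors(mu, theta)
    return [-np.outer(f_prime(theta[n] @ mu[n]) * eps[n + 1], mu[n])
            for n in range(len(theta))]

def infer(mu, theta, free_mask, lr=0.025, n_iters=200):
    """Fast-timescale step: descend F over node values; mask entries of 0 stay clamped."""
    mu = [m.copy() for m in mu]
    for _ in range(n_iters):
        grads = node_gradients(mu, theta)
        for n in range(len(mu)):
            mu[n] -= lr * free_mask[n] * grads[n]
    return mu

# Toy usage with hypothetical sizes: 3 top nodes, 5 hidden nodes,
# and a bottom layer of 4 clamped sensory nodes plus 1 free action node.
rng = np.random.default_rng(0)
sizes = [3, 5, 5]
mu = [rng.normal(0.0, 0.1, s) for s in sizes]
theta = [rng.normal(0.0, 0.5, (sizes[n + 1], sizes[n])) for n in range(len(sizes) - 1)]
free_mask = [np.ones(3), np.ones(5), np.array([0.0, 0.0, 0.0, 0.0, 1.0])]
mu_inferred = infer(mu, theta, free_mask)
```

In this sketch the clamped sensory nodes never move, while the action node at the bottom is driven only by the prediction arriving from the layer above, which mirrors the truncated update described for the motor observation node.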
Fig. 1 summarises the flow of information in the network during gradient descent update of node values.
Three sets of experiments were designed, to investigate how the process of inference is distributed through hierarchical layers. The first two experiments were each run on three different tasks. The third experiment was run on a single task. The experiments are described below.
In the first set of experiments, we trained three PCNs to carry out separate inference tasks, based on selecting the correct action for a given MNIST image. In all three networks, the observation layer at the bottom of the network contains 785 nodes, made up of 784 sensory nodes (representing the pixels of an MNIST image) and a single binary action node. The top layer uses a one-hot representation of each of the possible MNIST labels. There are two hidden layers of size 100 and 300. Thus, if there are 10 possible labels, there is a four-layer network of size [10,100,300,785], whose generative model produces an MNIST image and an action value from a given MNIST label. The role of each of the networks is, on presentation of an MNIST image, to infer the correct MNIST label at the top and the correct action associated with that image (Fig. 1).
[Figure panel: Network in test mode]
We investigated the relationship between the accuracies of action inference and label inference. Specifically, we asked: to what extent can the action be correctly triggered without correct label inference?
In the first task, MNIST-digit1, we trained the action node to output value 1 if the presented MNIST image has label 1, and value 0 for all other digits, i.e. the job of the action node is to fire when an image of the digit 1 is presented. The network is trained in a supervised manner to learn the generative model, by fixing the top and bottom layers with the training labels and observations respectively, and then minimizing the variational free energy $F$ in an expectation-maximization (EM) fashion [10, 23], as described in Section 2. Once trained, the network is tested for its ability to infer the correct label and action for a given image. This is done by presenting an MNIST image to the 784 sensory nodes and allowing both the labels at the top and the action at the bottom to update via the variational inference process, according to equations (8) - (10). Updates to the network are applied over a large number of iterations and, at any stage of this process, the current inferred label can be read out as the argmax of the top-layer nodes, while the selected action is read out according to a Heaviside function applied to the action node value, centred on 0.5.
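The readout step described above is simple enough to state directly. The following is an illustrative sketch; the function name `read_out` is hypothetical, and treating a value of exactly 0.5 as "no action" is an assumption, since the text does not say how the boundary case is handled.

```python
import numpy as np

def read_out(top_layer, action_node, threshold=0.5):
    """Read the inferred label and the selected action at any inference step.

    top_layer   : 1-D array of top-layer node values (one per label)
    action_node : scalar value of the action node at the bottom layer
    """
    label = int(np.argmax(top_layer))        # argmax over the label nodes
    action = int(action_node > threshold)    # Heaviside step centred on 0.5
    return label, action

# Example with made-up node values.
label, action = read_out(np.array([0.1, 0.7, 0.2]), 0.93)
```

Because both quantities can be read out at every iteration, label and action accuracy can be tracked separately as inference proceeds.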
In the second task, MNIST-group, we trained the action node to fire if the MNIST label is less than 5, and not fire otherwise. This network is trained and tested using the same process as above.
In the third task, MNIST-barred, half of the MNIST images had a white horizontal bar applied across the middle of the image. A new set of labels was created so that there were now 20 possible labels: 0 to 9 representing the digits without bars, and 10 to 19 representing the digits with bars. Action value 1 was associated with labels 10 to 19, and action value 0 with labels 0 to 9. The network for this task has size [20,100,300,785]. It is trained and tested as for the first two tasks. Appendix 0.A gives full details of the hyperparameters used in the three PCNs.
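The three task definitions can be summarised as a small piece of Python. This is a sketch for illustration only: the function names and the task keys are hypothetical, and the bar height and the pixel value used for "white" are assumptions, since the text does not specify them.

```python
import numpy as np

def add_bar(image, bar_height=2, white=1.0):
    """Return a copy of a 2-D image with a white horizontal bar across the middle.

    bar_height and white are assumed values; the text does not give them.
    """
    out = np.array(image, dtype=float, copy=True)
    top = out.shape[0] // 2 - bar_height // 2
    out[top:top + bar_height, :] = white
    return out

def task_label(task, digit, barred=False):
    """Top-layer label index: 0-9, or 10-19 for barred digits in MNIST-barred."""
    return digit + 10 if (task == "barred" and barred) else digit

def action_target(task, digit, barred=False):
    """Binary action associated with an image in each of the three tasks."""
    if task == "digit1":
        return int(digit == 1)
    if task == "group":
        return int(digit < 5)
    if task == "barred":
        return int(barred)
    raise ValueError(f"unknown task: {task}")

def observation(image, action):
    """Bottom layer: 784 pixel nodes followed by the single action node."""
    return np.concatenate([np.asarray(image, dtype=float).ravel(), [float(action)]])

# Example on a blank 28x28 image standing in for an MNIST digit.
blank = np.zeros((28, 28))
barred_image = add_bar(blank)
obs = observation(barred_image, action_target("barred", 7, barred=True))
```

Written this way, the difference between the tasks is easy to see: in MNIST-barred the action depends on a single local image feature, whereas in MNIST-group it depends on the digit identity.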
The second set of experiments used the same three tasks but, instead of fixing MNIST labels to the top of the network in training, the top layer was populated with random noise. The purpose of these experiments was to determine whether the provision of label information in training had any impact on the network’s ability to infer the correct action. We then ablated layers from the PCNs in order to investigate the contribution which each layer makes towards inferring the correct action.
The third experiment trained a network where both the MNIST image and the MNIST one-hot labels were placed at the bottom. Above this were 6 layers, all initialized with noisy values. The top layer was allowed to vary freely (see Fig. 3(a)). This was used to investigate how label inference performed in this scenario (rather than the traditional case of label at the top and image at the bottom), and how performance reacted to ablation of layers in test mode.
We first investigated the relationship between accuracy of action and label inference for MNIST-digit1 (where the action node should fire if the MNIST label is 1). When run on a test set of images, the network generates values on the action node which correctly split into two groups centred near to 0 and 1, with a small overlap (Fig. 1). As a result, the network is able to correctly infer the action for a presented image in over 97% of cases. On the other hand, the label is only correctly inferred in 81% of cases, demonstrating that action selection does not depend entirely on correct label inference. Fig. 1 presents the development of label and action accuracies as iterations progress, confirming that a) action accuracy is always better than label accuracy, b) further iterations will not change this and c) action inference reaches asymptotic performance quicker than label inference.
Fig. 2 compares label and action accuracy for all three tasks. In the MNIST-group task, action accuracy appears to be constrained by label accuracy. In the MNIST-barred task, the correct action is always inferred, even though the network has relatively poor label accuracy. It would therefore seem that the MNIST-group task is reliant on upper layer values in order to select the correct action, whereas the simpler tasks can reach, or approach, optimal action performance regardless of the upper layer values.
However, it is not clear from these results whether the MNIST-group task relies on the fact that the higher layers contain information about the image labels (recall that this is how the network was trained) or whether the existence of the higher layers simply provides more compute power. To investigate this, the second set of experiments was run, in which the three networks were trained with random noise at the top layer instead of the image label. In testing, label accuracy was now no better than random (as one would expect), but action accuracy was indistinguishable from the original results of Fig. 2. This demonstrates that it is the existence of the layers, rather than the provision of label information in training, which drives action inference.
To confirm that the three tasks make different use of the higher layers, action accuracy was measured when the top two layers were ablated in test mode (they were still present in training). Performance on the MNIST-group task (Fig. 2(a)) deteriorates significantly as the layers are ablated. Conversely, ablation of the top layer has no impact on the action accuracy of either the MNIST-barred (Fig. 2(c)) or MNIST-digit1 tasks (Fig. 2(b)). Both suffer slightly if the top 2 layers are ablated, although in the case of MNIST-barred the accuracy only moves from 100% to 99.9%. It can be concluded from these ablation experiments that reliance on higher layers varies with the nature of the task. Tasks which are more challenging may rely on the higher layers, while simple tasks may not suffer at all if the layers are ablated - presumably because all the information required for action selection is entirely available in the lower layers.
Effect of ablating layers on action accuracy. The three tasks cope differently with ablation of layers, as shown in (a), (b) and (c). Note that different y-scales are used on the figures for clarity. Each network was trained using 6 different seeds, and error bars show standard error. Results suggest that, if the lower layers are sufficient for action selection then the higher layers can be ignored.
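One way to picture the ablation used here is as a truncation of the trained network before inference is run. The sketch below is an assumption about the mechanics, not a description of the experimental code: it simply drops the topmost layers and the weights that project down from them, so that the highest remaining layer becomes the new top layer with no prediction from above. The name `ablate_top`, the 0-based indexing from the top, and the random stand-in values are all hypothetical.

```python
import numpy as np

def ablate_top(mu, theta, k):
    """Drop the k topmost layers (index 0 = top) together with their outgoing weights.

    mu    : list of node-value vectors, ordered top to bottom
    theta : list of weight matrices; theta[n] maps layer n down to layer n + 1
    """
    if not 0 <= k <= len(mu) - 2:
        raise ValueError("at least two layers must remain")
    return [m.copy() for m in mu[k:]], [w.copy() for w in theta[k:]]

# Stand-in for a trained four-layer network of size [10, 100, 300, 785].
sizes = [10, 100, 300, 785]
rng = np.random.default_rng(0)
mu = [rng.normal(0.0, 0.05, s) for s in sizes]
theta = [rng.normal(0.0, 0.05, (sizes[n + 1], sizes[n])) for n in range(len(sizes) - 1)]

mu_cut, theta_cut = ablate_top(mu, theta, k=2)
print([m.size for m in mu_cut])       # [300, 785]
print([w.shape for w in theta_cut])   # [(785, 300)]
```

Inference on the truncated lists then proceeds as usual, with the action node still read out from the bottom layer.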
In the third experiment, we constructed a network with both MNIST image and MNIST one-hot labels at the bottom, representing 10 different binary actions to select from (see Fig. 3(a)). Above this were 6 layers, all initialized with noisy values (details in Appendix 0.A). Training was carried out as before, presenting a set of images and labels at the bottom of the network and leaving the network to learn weights throughout the hierarchy. The effect of layer ablation on the ability of the network to select the correct action (which in this experiment is the one-hot label) was then tested. When using all the layers, this network produces comparable results to the more standard PCN setup with label at the top and image at the bottom. [Footnote 1: At approximately 78%, the accuracy we achieved is significantly lower than standard non-PCN deep learning methods. This is partly because the model has not been fine-tuned (e.g. hyper-parameters, using convolutional layers, etc). But it is also true that generative models tend to underperform discriminative models in classification tasks. This will be particularly true in our implementation which uses flat priors.]
Ablation results are shown in Fig. 3(b). These are consistent with the previous experiments, with accuracy reducing (while remaining much better than the chance value of 10%) as the layers are ablated. In this case it would appear that the top 2 layers are adding nothing to the network’s action selection ability. A key point to note is that the learning of the weights was not dependent on the provision of any information at the top of the network: all the learning comes about as a result of information presented at the bottom. Despite this, the network has distributed its ability through several layers, with the major part of successful inference relying on information towards the bottom of the network.
A 7 layer PCN where 10 binary actions are associated with MNIST images. (a) Image and one-hot labels both at the bottom. For ease of reading, the nodes shown on each layer represent both value and error nodes. Red lines show flow of information with no ablation. Black line shows flow if 5 layers are ablated. (b) Ablation of top two layers has no effect on accuracy of action selection. Ablation of the next 3 layers steadily reduces accuracy. Error bars are standard deviations across 10 differently seeded networks.
We have demonstrated that, when training a PCN with senses and actions at the bottom layer, it is not necessary to provide a high-level “hidden state” in training in order to learn the correct actions for an incoming sensation. Furthermore, the network appears to distribute its learning throughout the layers, with higher layers called into use only as required. In our experiments, this meant that higher layers could be ignored if the lower layers alone contained sufficient information to select the correct action. In effect, the network has learned a sensorimotor shortcut to select the correct actions. On the other hand, if the higher layers contain information which improves action selection, then ablation of those layers reduces, but does not destroy, performance: ablation leads to graceful degradation. This flexibility is inherent in the nature of PCNs, unlike feed-forward networks, which operate end to end.
Importantly, this suggests that a PCN framework can help explain the development of fast reaction to a stimulus, even though the learning process involves all layers. For example, driving a car on an empty road might only require involvement of lower layers, whereas heavy traffic or icy conditions would require higher layers to deal with the more complex task. The fact that simple short-cuts can arise automatically during training and that the agent can dynamically select actions without involvement of higher layers could possibly also help explain why well-learned tasks can be carried out without conscious perception.
While we have provided an illustrative ’proof of principle’ of this approach, much more can be done to investigate how this leads to a continuum of behaviour in active agents, which we list below in no particular order. Firstly, in our experiments inference took place with no influence from above and we have not considered the impact which exogenous priors would have. Secondly, we included no concept of a causal link between action and the subsequent sensory state. Action in real-life situations is a rolling process, with actions impacting subsequent decisions. Because our generative model did not consider time or state transitions, we cannot generalise to active inference in the sense of planning. One might argue that policy selection in active inference is a better metaphor for reflective behaviour, leading to a distinction between reflexive ‘homeostatic’ responses and more deliberative ‘allostatic’ plans. Having said this, it seems likely that the same conclusions will emerge. In other words, the same hierarchical generative model can explain reflective and reflexive behaviour at different hierarchical levels. Thirdly, the role of precisions has not been examined. Updating precisions should allow investigation of the role of attention. Finally, we have assumed the existence of a well trained network, and only touched on the performance of a partially trained network. It would be instructive to investigate how reliance on higher layers changes during the learning process.
These results support the view that a predictive coding network in the brain does not need to work from end to end, and can restrict itself to the number of lower layers required for the task at hand, possibly only in the sensorimotor system. There is the possibility of some tentative links here with more enactivist theories of the brain which posit that “representations” encode predicted action opportunities, rather than specify an abstract state of the world, but much further analysis is needed to investigate possible overlaps.
PK would like to thank Alec Tschantz for sharing the “Predictive Coding in Python” codebase https://github.com/alec-tschantz/pypc on which the experimental code was based. Thanks also to three anonymous reviewers whose comments helped improve the clarity of this paper, particularly in relation to temporal aspects of predictive coding. PK is funded by the Sussex Neuroscience 4-year PhD Programme. CLB is supported by BBSRC grant number BB/P022197/1.
- [1] Adams, R.A., Shipp, S., Friston, K.J.: Predictions not commands: active inference in the motor system. Brain Structure and Function 218(3), 611–643 (2013)
- [2] Baltieri, M., Buckley, C.L.: Generative models as parsimonious descriptions of sensorimotor loops. The Behavioral and brain sciences 42, e218–e218 (2019)
- [3] Bogacz, R.: A tutorial on the free-energy framework for modelling perception and learning. Journal of mathematical psychology 76, 198–211 (2017)
- [4] Bruineberg, J., Kiverstein, J., Rietveld, E.: The anticipating brain is not a scientist: the free-energy principle from an ecological-enactive perspective. Synthese (Dordrecht) 195(6), 2417–2444 (2018)
- [5] Buckley, C.L., Chang, S.K., McGregor, S., Seth, A.K.: The free energy principle for action and perception: A mathematical review (2017)
- [6] Burr, C.: Embodied decisions and the predictive brain. In: Metzinger, T., Wiese, W. (eds.) Philosophy and predictive processing. MIND Group, Frankfurt am Main (2016)
- [7] Clark, A.: Whatever next? predictive brains, situated agents, and the future of cognitive science. The Behavioral and brain sciences 36(3), 181–204 (2013)
- [8] Clark, A.: Predicting peace: The end of the representation wars. Open MIND. Frankfurt am Main: MIND Group (2015)
- [9] Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P., Dolan, R.J.: Model-based influences on humans’ choices and striatal prediction errors. Neuron 69(6), 1204–1215 (2011)
- [10] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22 (1977)
- [11] Dolan, R., Dayan, P.: Goals and habits in the brain. Neuron (Cambridge, Mass.) 80(2), 312–325 (2013)
- [12] Friston, K.: Learning and inference in the brain. Neural Networks 16(9), 1325–1352 (2003)
- [13] Friston, K.: A theory of cortical responses. Philosophical transactions of the Royal Society B: Biological sciences 360(1456), 815–836 (2005)
- [14] Friston, K.: The free-energy principle: a unified brain theory? Nature reviews. Neuroscience 11(2), 127–138 (2010)
- [15] Friston, K.: What is optimal about motor control? Neuron 72(3), 488–498 (2011)
- [16] Friston, K., Kilner, J., Harrison, L.: A free energy principle for the brain. Journal of physiology-Paris 100(1-3), 70–87 (2006)
- [17] Friston, K.J., Daunizeau, J., Kilner, J., Kiebel, S.J.: Action and behavior: a free-energy formulation. Biological cybernetics 102(3), 227–260 (2010)
- [18] Gläscher, J., Daw, N., Dayan, P., O’Doherty, J.P.: States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66(4), 585–595 (2010)
- [19] Hipólito, I., Baltieri, M., Friston, K., Ramstead, M.J.: Embodied skillful performance: Where the action is. Synthese pp. 1–25 (2021)
- [20] Hohwy, J.: The predictive mind. Oxford University Press (2013)
- [21] Kahneman, D.: Thinking, fast and slow. Macmillan (2011)
- [22] LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010), http://yann.lecun.com/exdb/mnist/
- [23] MacKay, D.J.: Information theory, inference and learning algorithms. Cambridge university press (2003)
- [24] Millidge, B.: Combining active inference and hierarchical predictive coding: A tutorial introduction and case study. PsyArXiv (2019)
- [25] Pezzulo, G., Donnarumma, F., Iodice, P., Maisto, D., Stoianov, I.: Model-based approaches to active perception and control. Entropy (Basel, Switzerland) 19(6), 266 (2017)
- [26] Pezzulo, G., Rigoli, F., Friston, K.: Active inference, homeostatic regulation and adaptive behavioural control. Progress in neurobiology 134, 17–35 (2015)
- [27] Ramstead, M.J., Kirchhoff, M.D., Friston, K.J.: A tale of two densities: active inference is enactive inference. Adaptive behavior 28(4), 225–239 (2020)
- [28] Rao, R.P., Ballard, D.H.: Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience 2(1), 79–87 (1999)
- [29] Seth, A.K.: The cybernetic bayesian brain: from interoceptive inference to sensorimotor contingencies (2015)
- [30] Stanovich, K.E., West, R.F.: Individual differences in reasoning: Implications for the rationality debate? Behavioral and brain sciences 23(5), 645–665 (2000)
- [31] Sutton, R.S.: Reinforcement learning: an introduction (2018)
- [32] Whittington, J.C., Bogacz, R.: An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation 29(5), 1229–1262 (2017)
- [33] Wunderlich, K., Dayan, P., Dolan, R.J.: Mapping value based planning and extensively trained choice in the human brain. Nature neuroscience 15(5), 786–791 (2012)
Appendix 0.A Network parameters
Network size: 4 layers
Number of nodes on each layer: 10, 100, 300, 785 for MNIST-group and MNIST-digit1; 20, 100, 300, 785 for MNIST-barred. In the bottom layer, 784 nodes were fixed to the MNIST image and the 785th node was an action node which updates in testing. In the initial set of experiments, the top layer was fixed to a one-hot representation of the MNIST label in training. In the second set of experiments it was set to random values and allowed to update.
Non-linear function: tanh
Bias used: yes
Training set size: full MNIST training set of 60,000 images, in batches of 640
Number of training epochs: 10
Testing set size: 1280 images selected randomly from MNIST test set
Learning parameters used in weight update of EM process: Learning Rate= 1e-4, Adam
Learning parameters used in node update of EM process: Learning Rate= 0.025, SGD
Number of SGD iterations in training: 200
Number of SGD iterations in test mode: 200 * epoch number. The size is increased as epochs progress to allow for the decreasing size of the error between layers (as discussed in the text, this would normally be counteracted by increase in precision values).
Random initialisation: Except where fixed, all nodes were initialized with random values selected from
In the experiment using a 7-layer network, the numbers of nodes on the layers were: 10, 25, 50, 100, 200, 300, 794. All other parameters were the same as above.
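For convenience, the settings listed in this appendix can be gathered into a single configuration object. The sketch below is illustrative only: the class and field names are hypothetical, the values are those stated above, and counting epochs from 1 when scaling the number of test iterations is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PCNConfig:
    """Hyperparameters as listed in the appendix (field names are hypothetical)."""
    layer_sizes: Tuple[int, ...] = (10, 100, 300, 785)   # top layer first
    activation: str = "tanh"
    use_bias: bool = True
    batch_size: int = 640
    n_epochs: int = 10
    n_test_images: int = 1280
    weight_lr: float = 1e-4         # weight update, Adam
    node_lr: float = 0.025          # node update, SGD
    n_train_iters: int = 200        # node-update iterations per batch in training
    test_iters_per_epoch: int = 200

    def n_test_iters(self, epoch: int) -> int:
        """Test-mode iterations grow with the epoch number (assumed to start at 1)."""
        return self.test_iters_per_epoch * epoch

digit1_or_group = PCNConfig()
barred = PCNConfig(layer_sizes=(20, 100, 300, 785))
seven_layer = PCNConfig(layer_sizes=(10, 25, 50, 100, 200, 300, 794))
```

The bottom layer of the 7-layer network has 794 nodes because it holds the 784 image pixels together with the 10 one-hot label nodes.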