Predictive Coding: a Theoretical and Experimental Review

07/27/2021
by   Beren Millidge, et al.
University of Sussex

Predictive coding offers a potentially unifying account of cortical function – postulating that the core function of the brain is to minimize prediction errors with respect to a generative model of the world. The theory is closely related to the Bayesian brain framework and, over the last two decades, has gained substantial influence in the fields of theoretical and cognitive neuroscience. A large body of research has arisen based both on empirically testing improved and extended theoretical and mathematical models of predictive coding, and on evaluating their potential biological plausibility for implementation in the brain and the concrete neurophysiological and psychological predictions made by the theory. Despite this enduring popularity, however, no comprehensive review of predictive coding theory, and especially of recent developments in this field, exists. Here, we provide a comprehensive review of the core mathematical structure and logic of predictive coding, thus complementing recent tutorials in the literature. We also review a wide range of classic and recent work within the framework, ranging from the neurobiologically realistic microcircuits that could implement predictive coding, to the close relationship between predictive coding and the widely-used backpropagation of error algorithm, as well as surveying the close relationships between predictive coding and modern machine learning techniques.

1 Introduction

Predictive coding theory is an influential theory in computational and cognitive neuroscience, which proposes a potential unifying theory of cortical function (friston2003learning; friston2005theory; rao1999predictive; friston2010free; clark2013whatever; seth2014cybernetic) – namely that the core function of the brain is simply to minimize prediction error, where the prediction errors signal mismatches between predicted input and the input actually received. [Footnote 1: For a contrary view and philosophical critique see cao2020new.] This minimization can be achieved in multiple ways: through immediate inference about the hidden states of the world, which can explain perception (beal2003variational); through updating a global world-model to make better predictions, which could explain learning (friston2003learning; neal1998view); and finally through action to sample sensory data from the world that conforms to the predictions (friston2009reinforcement), which potentially provides an account of adaptive behaviour and control (friston2015active). Prediction error minimization can also be influenced by modulating the precision of sensory signals, which corresponds to modulating the ‘signal to noise ratio’ in how prediction errors can be used to update predictions, and which may shed light on the neural implementation of attention mechanisms (feldman2010attention; kanai2015cerebral). Predictive coding boasts an extremely developed and principled mathematical framework in terms of a variational inference algorithm (blei2017variational; ghahramani2000graphical; jordan1998introduction), alongside many empirically tested computational models with close links to machine learning (beal2003variational; dayan1995helmholtz; hinton1994autoencoders), which address how predictive coding can be used to solve challenging perceptual inference and learning tasks similar to those solved by the brain. Moreover, predictive coding has also been translated into neurobiologically plausible microcircuit process theories (bastos2012canonical; shipp2016neural; shipp2013reflections) which are increasingly supported by neurobiological evidence. Predictive coding as a theory also offers a single mechanism that accounts for diverse perceptual and neurobiological phenomena such as end-stopping (rao1999predictive), bistable perception (hohwy2008predictive; weilnhammer2017predictive), repetition suppression (auksztulewicz2016repetition), illusory motions (lotter2016deep; watanabe2018illusory), and attentional modulation of neural activity (feldman2010attention; kanai2015cerebral). As such, and perhaps uniquely among neuroscientific theories, predictive coding encompasses all three layers of Marr’s hierarchy by providing a well-characterised and empirically supported view of ‘what the brain is doing’ at all of the computational, algorithmic, and implementational levels (marr1982vision).

The core intuition behind predictive coding is that the brain is composed of a hierarchy of layers, which each make predictions about the activity of the layer immediately below them in the hierarchy (clark2015surfing). [Footnote 2: For much of this work we consider a simple hierarchy with only a single layer above and below. Of course, connectivity in the brain is more heterarchical with many ‘skip connections’. Predictive coding can straightforwardly handle these more complex architectures in theory, although few works have investigated the performance characteristics of such heterarchical architectures in practice.] These descending predictions at each level are compared with the activity and inputs of each layer to form prediction errors – that is, the information in each layer which could not be successfully predicted. These prediction errors are then fed upwards to serve as inputs to higher levels, which can then use them to reduce their own prediction error. The idea is that, over time, the hierarchy of layers instantiates a range of predictions at multiple scales, from the fine details in local variations of sensory data at low levels, to global invariant properties of the causes of sensory data (e.g., objects, scenes) at higher or deeper levels. [Footnote 3: This pattern is widely seen in the brain (hubel1962receptive; grill2004human) and also in deep (convolutional) neural networks (olah2017feature), but it is unclear whether this pattern also holds for deep predictive coding networks, primarily due to the relatively few instances of deep convolutional predictive coding networks in the literature so far.] Predictive coding theory claims that the goal of the brain as a whole, in some sense, is to minimize these prediction errors, and that in the process of doing so it performs both perceptual inference and learning. Both of these processes can be operationalized via the minimization of prediction error, first through the optimization of neuronal firing rates on a fast timescale, and then the optimization of synaptic weights on a slow timescale (friston2008hierarchical). Predictive coding proposes that using a simple unsupervised loss function, such as simply attempting to predict incoming sensory data, is sufficient to develop complex, general, and hierarchically rich representations of the world in the brain, an argument which has found recent support in the impressive successes of modern machine learning models trained on unsupervised predictive or autoregressive objectives (radford2019language; kaplan2020scaling; brown2020language). Moreover, in contrast to modern machine learning algorithms which are trained end-to-end with a global loss at the output, in predictive coding prediction errors are computed at every layer, which means that each layer only has to focus on minimizing local errors rather than a global loss. This property potentially enables predictive coding to learn in a biologically plausible way using only local and Hebbian learning rules (whittington2017approximation; millidge2020predictive; friston2003learning).

While predictive coding as a neuroscientific theory originated in the 1980s and 1990s (srinivasan1982predictive; mumford1992computational; rao1999predictive), and was first developed into its modern mathematical form of a comprehensive theory of cortical responses in the mid 2000s (friston2003learning; friston2005theory; friston2008hierarchical), it has deep intellectual antecedents. These precursors include Helmholtz’s notion of perception as unconscious inference and Kant’s notion that a priori structure is needed to make sense of sensory data (hohwy2008predictive; seth2020preface), as well as early ideas of compression and feedback control in cybernetics and information theory (wiener2019cybernetics; shannon1948mathematical; conant1970every). One of the core notions in predictive coding is the idea that the brain encodes a model of the world (or more precisely, of the causes of sensory signals), which is used to make constant predictions about the world, which are then compared against sensory data. On this view, perception is not the result of an unbiased feedforward, or bottom-up, processing of sensory data, but is instead a process of using sensory data to update predictions generated internally by the brain. Perception, thus, becomes a ‘controlled hallucination’ (clark2013whatever; seth2020preface) in which top-down perceptual predictions are reined in by sensory prediction error signals. This view of ‘perception as unconscious inference’ originated with the German physicist and physiologist Hermann von Helmholtz (helmholtz1866concerning), who studied the way the brain “cancels out” visual distortions and flow resulting from its own (predictable) movement, such as during voluntary eye movements, but does not do so for external perturbations, such as when external pressure is applied to the eyeball, in which case we consciously experience visual movement arising from this (unpredicted) ocular motion. Helmholtz thus argued that the brain must maintain both a record of its own actions, in the form of a ‘corollary discharge’, and a model of the world sufficient to predict the visual effects of these actions (i.e. a forward model) in order to so perfectly cancel self-caused visual motion (huang2011predictive).

Another deep intellectual influence in predictive coding comes from information theory (shannon1948mathematical), and especially the minimum redundancy principle of Barlow (barlow1961coding; barlow1961possible; barlow1989unsupervised). Information theory tells us that information is inseparable from a lack of predictability. If something is predictable before observing it, it cannot give us much information. Conversely, to maximize the rate of information transfer, the message must be minimally predictable and hence minimally redundant. Predictive coding as a means to remove redundancy in a signal was first applied in signal processing, where it was used to reduce the bandwidth required for video transmission (for a review see spratling2017review). Initial schemes used a simple approach of subtracting the old frame from the new (to-be-transmitted) frame (in effect using a trivial prediction that the new frame is always the same as the old frame), which works well in reducing bandwidth in many settings where there are only a few objects moving in the video against a static background. More advanced methods often predict each new frame using a number of past frames weighted by some coefficients, an approach known as linear predictive coding. Then, as long as the coefficients are transmitted at the beginning of the message, the receiving system can reconstruct signals compressed by this system. Barlow applied this principle to signalling in neural circuits, arguing that the brain faces considerable evolutionary pressures for information-theoretic efficiency, since neurons are energetically costly, and thus redundant firing would be potentially wasteful and damaging to an organism’s evolutionary fitness. Because of this, we should expect the brain to utilize a highly optimized code which is minimally redundant. Predictive coding, as we shall see, precisely minimizes this redundancy, by only transmitting the errors or residuals of sensory input which cannot be explained by top-down predictions, thus removing the most redundancy possible at each layer (huang2011predictive). Finally, predictive coding also inherits intellectually from ideas in cybernetics, control and filtering theory (kalman1960new; wiener2019cybernetics; conant1970every; seth2014cybernetic). Cybernetics as a science is focused on understanding the dynamics of interacting feedback loops for perception and control, based especially around the concept of error minimization. Control and filtering theory have, in a related but distinct way, been based for decades around methods to minimize residual errors in either perception or action according to some objective. As we shall see, standard methods such as Kalman Filtering (kalman1960new) or PID control (johnson2005pid) can be shown to be special cases of predictive coding under certain restrictive assumptions.
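The frame-differencing scheme described above can be illustrated with a minimal sketch. It assumes integer-valued frames stored as a NumPy array; the function names and the toy video are hypothetical and chosen only for illustration. The sender transmits the first frame plus residuals under the trivial prediction "the new frame equals the old frame", and the receiver rebuilds the video by accumulating those residuals.

```python
import numpy as np

def encode_residuals(frames):
    """Keep the first frame in full, then only frame-to-frame residuals."""
    frames = np.asarray(frames, dtype=np.int64)
    residuals = np.diff(frames, axis=0)  # new frame minus previous frame
    return frames[0], residuals

def decode_residuals(first_frame, residuals):
    """Rebuild every frame by accumulating residuals onto the first frame."""
    steps = np.concatenate([first_frame[None], residuals], axis=0)
    return np.cumsum(steps, axis=0)

# Toy video (an assumption for illustration): a static background of
# intensity 10 with a single bright pixel moving along one row.
frames = np.full((5, 4, 4), 10, dtype=np.int64)
for t in range(5):
    frames[t, 1, t % 4] = 255

first, res = encode_residuals(frames)
rebuilt = decode_residuals(first, res)

# Fraction of residual entries that actually carry information.
nonzero_fraction = np.count_nonzero(res) / res.size
```

Because only the moving pixel changes between frames, most residual entries are zero, which is what makes the residual stream cheaper to transmit than the raw frames in this kind of setting.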

The first concrete discussion of predictive coding in the neural system arose as a model of neural properties of the retina (srinivasan1982predictive), specifically as a model of centre-surround cells, which fire when presented with either a light spot against a dark background (on-centre, off-surround), or alternatively a dark spot against a light background (off-centre, on-surround). It was argued that this coding scheme helps to minimize redundancy in the visual scene, specifically by removing the spatial redundancy in natural visual scenes – that is, the fact that the intensity of one ‘pixel’ predicts quite well the intensities of neighbouring pixels. If the intensity of a pixel is predicted from the intensity of the surround, and this prediction is subtracted from the actual intensity, then the centre-surround firing pattern emerges (huang2011predictive). Mathematically, this idea of retinal cells removing the spatial redundancy of the visual input is derived from the fact that the optimal spatial linear filter which minimizes the redundancy in the representation of the visual information closely resembles the centre-surround receptive fields which are well established in retinal ganglion cells (huang2011predictive). This predictive coding approach was also applied to coding in the lateral geniculate nucleus (LGN), the thalamic structure that retinal signals pass through en route to cortex, which was hypothesised to help remove temporal correlations in the input by subtracting out the retinal signal at previous timesteps using recurrent lateral inhibitory connectivity (huang2011predictive; marino2020predictive).
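As a rough one-dimensional illustration of this idea, the sketch below predicts each ‘pixel’ as the mean of its two neighbours and keeps only the residual. The equal-weight, two-neighbour predictor is an assumption made for simplicity; it is not the optimal filter derived from natural image statistics in the retinal modelling literature. The resulting kernel has the on-centre, off-surround shape.

```python
import numpy as np

def surround_residual(signal):
    """Centre intensity minus the mean of its two neighbours (the 'surround')."""
    # Symmetric kernel, so convolution and correlation coincide here.
    kernel = np.array([-0.5, 1.0, -0.5])
    return np.convolve(signal, kernel, mode="valid")

# Illustrative input: a uniform dark region followed by a uniform bright region.
flat_then_step = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0])
residual = surround_residual(flat_then_step)
```

In uniform regions the surround predicts the centre exactly and the residual is zero; only the two positions flanking the edge produce a response, so the code spends its output on the part of the input that could not be predicted.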

mumford1992computational was perhaps the first to extend this theory of the retina and the LGN to a fully-fledged general theory of cortical function. His theory was motivated by simple observations about the neurophysiology of cortico-cortical connections: specifically, the existence of separate feedforward and feedback paths, where the feedforward paths originate in the superficial layers of the cortex, and the feedback pathways originate primarily in the deep layers. He also noted the reciprocal connectivity observed almost uniformly between cortical regions – if a region projects feedforward to another region, it almost always also receives feedback inputs from that region. He proposed that the deep feedback projections convey abstract ‘templates’ which each cortical region then matches to its incoming sensory data. Then, inspired by the minimum redundancy principle of Barlow (barlow1961possible), he proposed that instead of faithfully transmitting the sensory input upwards, each layer transmits only the ‘residual’ that remains after attempting to find the best-fitting match to the ‘template’.

While Mumford’s theory contained most aspects of classical predictive coding theory in the cortex, it was not accompanied by any simulations or empirical work and so its potential as a framework for understanding the cortex was not fully appreciated. The seminal work of Rao and Ballard in 1999 (rao1999predictive) had its impact precisely by doing this. They created a small predictive coding network according to the principles identified by Mumford, and empirically investigated its behaviour, demonstrating that the complex and dynamic interplay of predictions and prediction errors could explain several otherwise perplexing neurophysiological phenomena, specifically ‘extra-classical’ receptive field effects such as endstopping neurons. The term ‘extra-classical’ is defined in contrast to the classical view in visual neuroscience of the visual system being composed of a hierarchy of feature detectors, which originated in the pioneering work of hubel1962receptive. According to this classical view, the visual cortex forms a hierarchy which ultimately bottoms out at the retina. At each layer, there are neurons sensitive to different features in the visual input, with neurons at the bottom of the hierarchy responding to simple features such as patches of light and dark, while neurons at the top respond to complex features such as faces. The feature detectors at higher levels of the hierarchy are computed by combining several lower-level simpler feature detectors. For instance, as a crude illustration, a face detector might be created by combining several spot detectors (eyes) with some bar detectors (mouths and noses). However, it was quickly noticed that some receptive fields displayed properties which could not be explained simply as compositions of lower-level feature detectors. Most significantly, many receptive field properties, especially in the cortex, showed context sensitivities, with their activity depending on the context outside of their receptive field. For instance, the ‘end-stopping’ neurons fired if a bar was presented which ended just outside the receptive field of the cell, but not if it continued for a long distance beyond it. Within the classical feedforward view, such a feature detector should be impossible, since it would have no access to information outside of its receptive field. Rao and Ballard showed that a predictive coding network, constructed with both bottom-up prediction error neurons and neurons providing top-down predictions, enables the replication of several extra-classical receptive field properties, such as endstopping, within the network. This capability is made possible by the top-down predictions conveyed by the hierarchical predictive coding network. In effect, the predictive coding network conveys a downward prediction of the continuation of the bar, in line with ideas in gestalt perception. When this prediction is violated, a prediction error is generated and the neuron fires, thus reproducing the extra-classical prediction error effect. Moreover, in the Rao and Ballard model prediction error, value estimation, and weight updates all follow from gradient descent on a single energy function. This model was later extended by Karl Friston in a series of papers (friston2003learning; friston2005theory; friston2008hierarchical), which placed the model on a firm theoretical grounding as a variational inference algorithm, as well as integrating predictive coding with the broader free energy principle (friston2006free; friston2010free) by identifying the energy function of Rao and Ballard with the variational free energy of variational inference. This identification enables us to understand the Rao and Ballard learning rules as performing well-specified approximate Bayesian inference.

Following the impetus of these landmark developments, as well as much subsequent work, predictive coding has become increasingly influential over the last two decades in cognitive and theoretical neuroscience, especially for its ability to offer a supposedly unifying, albeit abstract, perspective on the daunting multi-level complexity of the cortex. In this review, we aim to provide a coherent overview and introduction to the mathematical framework of predictive coding, as defined using the probabilistic modelling framework of friston2005theory, as well as a comprehensive review of the many directions predictive coding theory has evolved in since then. For readers wishing to gain a deeper appreciation and understanding of the underlying mathematics, we also recommend two didactic tutorials on the framework (buckley2017free; bogacz2017tutorial). We also recommend spratling2017review for a quick review of major predictive coding algorithms, and marino2020predictive for another overview of predictive coding and a close investigation of its relationship to variational autoencoders (kingma2013auto) and normalizing flows (rezende2015variational). In this review, we survey the development and performance of computational models designed to probe the performance of predictive coding on a wide variety of tasks, including those that try to combine predictive coding with ideas from machine learning to allow it to scale up to complex tasks. We also review the work that has been done on translating the relatively abstract mathematical formalism of predictive coding into hypothesized biologically plausible neural microcircuits that could, in principle, be implemented by the brain, as well as the empirical neuroscientific work explicitly seeking experimental confirmation or refutation of the many predictions made by the theory. We also look deeply at more theoretical matters, such as the extensions of predictive coding using dynamical models which utilize generative models over multiple dynamical orders of motion, the relationship of learning in predictive coding to the backpropagation of error algorithm widely used in machine learning, and the development of the theory of precision which enables predictive coding to encode not just direct predictions of sensory stimuli but also predictions as to their intrinsic uncertainty. Finally, we review extensions of the predictive coding framework that generalize beyond perception to also include action, drawing on the close relationship between predictive coding and classical methods in filtering and control theory.

2 Predictive Coding

2.1 Predictive Coding as Variational Inference

A crucial advance in predictive coding theory occurred when it was recognized that the predictive coding algorithm could be cast as an approximate Bayesian inference process based upon Gaussian generative models (friston2003learning; friston2005theory; friston2008hierarchical). This perspective illuminates the close connection between predictive coding as motivated through the information-theoretic minimum-redundancy approach, and the Helmholtzian idea of perception as unconscious inference. Indeed, the two are fundamentally inseparable owing to the close mathematical connections between information theory and probability theory. Intuitively, information can only be defined according to some ‘expected’ distribution, just as predictability or redundancy can only be defined against some kind of prediction. Prediction, moreover, presupposes some kind of model to do the predicting. The explicit characterisation of this model in probabilistic terms as a generative model completes the link to probability theory and, ultimately, Bayesian inference. Friston’s approach, crucially, reformulates the mostly heuristic Rao and Ballard model in the language of variational Bayesian inference, thus allowing for a detailed theoretical understanding of the algorithm, as well as tying it to the broader project of the Bayesian Brain (deneve2005bayesian; knill2004bayesian). Crucially, Friston showed that the energy function in Rao and Ballard can be understood as a variational free energy of the kind that is minimized through variational inference. This connection demonstrates that predictive coding can be directly interpreted as performing approximate Bayesian inference to infer the causes of sensory signals, thus providing a mathematically precise characterisation of the Helmholtzian idea of perception as inference.

Variational inference describes a broad family of methods which have been under extensive development in machine learning and statistics since the 1990s (ghahramani2000graphical; jordan1998introduction; beal2003variational; blei2017variational). They originally evolved out of methods for approximately solving intractable optimization problems in statistical physics (feynman1998statistical). In general, variational inference approximates an intractable inference problem with a tractable optimization problem. Intuitively, we postulate an approximate ‘variational’ density and optimize its statistics so that it matches the desired inference distribution as closely as possible. [Footnote 4: This contrasts with the other principal method for approximating intractable inference procedures – Markov Chain Monte-Carlo (MCMC) (brooks2011handbook; metropolis1953equation; hastings1970monte). This class of methods samples stochastically from a Markov Chain with a stationary distribution equal to the true posterior. MCMC methods asymptotically converge to the true posterior, while variational methods typically do not (unless the class of variational distributions includes the true posterior). However, variational methods typically converge faster and are computationally cheaper, leading to a much wider use in contemporary machine learning and statistics.]

To formalize this, let us assume we have some observations (or data) $o$, and we wish to infer the latent state $x$. We also assume we have a generative model of the data generating process $p(o, x) = p(o|x)p(x)$. By Bayes rule, we can compute the true posterior directly as $p(x|o) = \frac{p(o,x)}{p(o)}$; however, the normalizing factor $p(o) = \int p(o,x)\,dx$ is often intractable because it requires an integration over all latent variable states. The marginal $p(o)$ is often referred to as the evidence, since it effectively scores the likelihood of the data under a given model, averaged over all possible values of the model parameters. Computing the marginal (model evidence) is intrinsically valuable since it is a core quantity in Bayesian model comparison methods, where it is used to compare the ability of two different generative models to fit the data.

Since directly computing the true posterior through Bayes rule is generally intractable, variational inference aims to approximate this posterior using an auxiliary posterior $q(x|o;\phi)$, with parameters $\phi$. This variational distribution is arbitrary and under the control of the modeller. For instance, suppose we define $q(x|o;\phi)$ to be a Gaussian distribution. Then the parameters $\phi = \{\mu, \sigma^2\}$ become the mean $\mu$ and the variance $\sigma^2$ of the Gaussian. The goal then is to fit this approximate posterior to the true posterior by minimizing the divergence between the true and approximate posterior with respect to the parameters. Mathematically, this problem can be written as,

$$q^*(x|o;\phi) = \underset{\phi}{\mathrm{argmin}} \; D\big[q(x|o;\phi) \,\|\, p(x|o)\big] \qquad (1)$$

Where $D[q\,\|\,p]$ is a function that measures the divergence between two distributions $q$ and $p$. Throughout, we take $D$ to be the KL divergence $KL[q\,\|\,p] = \int q(x) \ln \frac{q(x)}{p(x)}\,dx$, although other divergences are possible (cichocki2010families; banerjee2005clustering). [Footnote 5: Interestingly, the KL divergence is asymmetric ($KL[q\,\|\,p] \neq KL[p\,\|\,q]$) and is thus not a valid distance metric. Throughout we use the reverse-KL divergence $KL[q\,\|\,p]$, as is standard in variational inference. Variational inference with the forward-KL $KL[p\,\|\,q]$ has close relationships to expectation propagation (minka2001family).] When $KL[q(x|o;\phi)\,\|\,p(x|o)] = 0$ then $q(x|o;\phi) = p(x|o)$ and the variational distribution exactly equals the true posterior, and thus we have solved the inference problem. [Footnote 6: An exact solution is only possible when the family of variational distributions considered includes the true posterior as a member – for example, if both the true posterior and the variational posterior are Gaussian.] By doing this, we have replaced the inference problem of computing the posterior with an optimization problem of minimizing this divergence. However, merely writing the problem this way does not solve it because the divergence we need to optimize still contains the intractable true posterior. The beauty of variational inference is that it instead optimizes a tractable upper bound on this divergence, called the variational free energy $\mathcal{F}$. [Footnote 7: In machine learning, the negative free energy is instead called the evidence lower bound (ELBO), and is maximized rather than minimized.] To generate this bound, we simply apply Bayes rule to the true posterior to rewrite it in the form of a generative model and the evidence.

$$
\begin{aligned}
KL\big[q(x|o;\phi)\,\|\,p(x|o)\big] &= KL\Big[q(x|o;\phi)\,\Big\|\,\frac{p(o,x)}{p(o)}\Big] \\
&= KL\big[q(x|o;\phi)\,\|\,p(o,x)\big] + \mathbb{E}_{q(x|o;\phi)}\big[\ln p(o)\big] \\
&= KL\big[q(x|o;\phi)\,\|\,p(o,x)\big] + \ln p(o) \\
&\leq KL\big[q(x|o;\phi)\,\|\,p(o,x)\big] = \mathcal{F}
\end{aligned}
\qquad (2)
$$

Where in the third line the expectation around $\ln p(o)$ vanishes since the expectation is over the variable $x$ which is not in $p(o)$. The variational free energy $\mathcal{F}$ is an upper bound because $\ln p(o)$ is necessarily $\leq 0$ since, as a probability, $0 \leq p(o) \leq 1$. Importantly, $\mathcal{F}$ is a tractable quantity, since it is a divergence between two quantities we assume we (as the modeller) know – the variational approximate posterior $q(x|o;\phi)$ and the generative model $p(o,x)$. Since $\mathcal{F}$ is an upper bound, by minimizing $\mathcal{F}$, we drive $q(x|o;\phi)$ closer to the true posterior. As an additional bonus, if $q(x|o;\phi) = p(x|o)$ then $\mathcal{F} = -\ln p(o)$, the negative log of the marginal, or model, evidence, which means that in such cases $\mathcal{F}$ can be used for model selection (wasserman2000bayesian). We can also gain an important intuition about $\mathcal{F}$ by showing that it can be decomposed into a likelihood maximization term and a KL divergence term which penalizes deviation from the Bayesian prior, $\mathcal{F} = -\mathbb{E}_{q(x|o;\phi)}[\ln p(o|x)] + KL[q(x|o;\phi)\,\|\,p(x)]$. These two terms are often called the ‘accuracy’ and the ‘complexity’ terms. This decomposition of $\mathcal{F}$ is often utilized and optimized explicitly in many machine learning algorithms (kingma2013auto).
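The relationships above can be checked numerically on a small discrete example. The sketch below assumes a latent variable with three states and a single fixed observation; the prior, likelihood, and variational distribution are arbitrary illustrative values, not taken from any model in the literature. It verifies that the free energy equals the divergence to the true posterior minus the log evidence, that it is bounded below by the negative log evidence, and that it splits into complexity minus accuracy.

```python
import numpy as np

def kl(q, p):
    """Sum of q * ln(q / p); p need not be normalised (as for the joint)."""
    return float(np.sum(q * np.log(q / p)))

# Illustrative (assumed) generative model over three latent states.
prior = np.array([0.5, 0.3, 0.2])        # p(x)
likelihood = np.array([0.1, 0.6, 0.3])   # p(o | x) for the one observed o
joint = prior * likelihood               # p(o, x)
evidence = joint.sum()                   # p(o)
posterior = joint / evidence             # p(x | o)

q = np.array([0.2, 0.5, 0.3])            # an arbitrary variational distribution

free_energy = kl(q, joint)                          # F = KL[q || p(o, x)]
kl_to_posterior = kl(q, posterior)                  # KL[q || p(x | o)]
complexity = kl(q, prior)                           # KL[q || p(x)]
accuracy = float(np.sum(q * np.log(likelihood)))    # E_q[ln p(o | x)]
free_energy_at_posterior = kl(posterior, joint)     # F when q is exact
```

Here `free_energy` exceeds `-np.log(evidence)` by exactly `kl_to_posterior`, so lowering the free energy with the model held fixed can only move `q` towards the true posterior.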

In many practical cases, we must relax the assumption that we know the generative model $p(o,x)$. Luckily this is not fatal. Instead, it is possible to learn the generative model alongside the variational posterior on the fly and in parallel using the Expectation Maximization (EM) algorithm (dempster1977maximum). The EM algorithm is extremely intuitive. First, assume that we parametrize our unknown generative model with some parameters $\theta$ which are initialized at some arbitrary $\theta_0$. Similarly, we initialize our variational posterior at some arbitrary $\phi_0$. Then, we take turns optimizing $\mathcal{F}$ with respect to the variational posterior parameters $\phi$ with the generative model parameters $\theta$ held fixed and then, conversely, optimize $\mathcal{F}$ with respect to the generative model parameters $\theta$ with the variational posterior parameters $\phi$ held fixed. Mathematically, we can write this alternating optimization as

$$
\begin{aligned}
\phi_{t+1} &= \underset{\phi}{\mathrm{argmin}} \; \mathcal{F}(o, \phi, \theta)\big|_{\theta = \theta_t} \\
\theta_{t+1} &= \underset{\theta}{\mathrm{argmin}} \; \mathcal{F}(o, \phi, \theta)\big|_{\phi = \phi_{t+1}}
\end{aligned}
\qquad (3)
$$

Where we use the notation $|_{\theta = \theta_t}$ to mean that the variable $\theta$ is fixed at value $\theta_t$ throughout the optimization. It has been shown that this iterative sequence of optimization problems often converges to good solutions and often does so robustly and efficiently in practice (dempster1977maximum; dellaert2002expectation; boyles1983convergence; gupta2011theory).

Having reviewed the general principles of variational inference, we can see how they relate to predictive coding. First, to make any variational inference algorithm concrete, we must specify the forms of the variational posterior and the generative model. To obtain predictive coding, we assume a Gaussian form for the generative model $p(o,x;\theta) = p(o|x;\theta)\,p(x;\theta) = \mathcal{N}(o; f(x;\theta_1), \Sigma_1)\,\mathcal{N}(x; g(\bar{\mu};\theta_2), \Sigma_2)$, where we first partition the generative model into likelihood and prior terms. The mean of the likelihood Gaussian distribution is assumed to be some function $f$ of the hidden states $x$, which can be parameterized with parameters $\theta_1$, while the mean of the prior Gaussian distribution is set to some arbitrary function $g$, with parameters $\theta_2$, of the prior mean $\bar{\mu}$. We also assume that the variational posterior is a Dirac-delta (or point mass) distribution $q(x|o;\phi) = \delta(x - \mu)$ with a center $\phi = \mu$. [Footnote 8: In previous works, predictive coding has typically been derived by assuming a Gaussian variational posterior under the Laplace approximation. This approximation effectively allows you to ignore the variance of the Gaussian and concentrate only on the mean. This procedure is essentially identical to the Dirac-delta definition made here, and results in the same update scheme. However, the derivation using the Laplace approximation is much more involved so, for simplicity, here we use the Dirac-delta definition. The original Laplace derivation can be found in Appendix A of this review – see also buckley2017free for a detailed walkthrough.]

Given these definitions of the variational posterior and the generative model, we can write down the concrete form of the variational free energy to be optimized. We first decompose the variational free energy into an ‘Energy’ and an ‘Entropy’ term

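$$\mathcal{F} = \underbrace{\mathbb{E}_{q(x|o;\phi)}\big[\ln q(x|o;\phi)\big]}_{\text{Entropy}} \; \underbrace{-\; \mathbb{E}_{q(x|o;\phi)}\big[\ln p(o, x;\theta)\big]}_{\text{Energy}} \qquad (4)$$

where, since the entropy of the Dirac delta distribution is 0 (it is a point mass distribution), we can ignore the entropy term and focus solely on writing out the energy.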

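$$-\mathbb{E}_{q(x|o;\phi)}\big[\ln p(o, x;\theta)\big] = -\ln p(o|\mu;\theta_1) - \ln p(\mu;\theta_2) = \frac{1}{2}\Big[\Sigma_o^{-1}\epsilon_o^2 + \Sigma_x^{-1}\epsilon_x^2 + \ln 2\pi\Sigma_o + \ln 2\pi\Sigma_x\Big] \qquad (5)$$

where we define the ‘prediction errors’ $\epsilon_o = o - f(\mu;\theta_1)$ and $\epsilon_x = \mu - g(\bar{\mu};\theta_2)$, and write $\Sigma_o$ and $\Sigma_x$ for the variances of the likelihood and the prior. We thus see that the energy term, and thus the variational free energy, is simply the sum of two squared prediction error terms, weighted by their inverse variances, plus some additional log variance terms.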

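Finally, to derive the predictive coding update rules, we must make one additional assumption – that the variational free energy is optimized using the method of gradient descent such that,

$$\frac{d\mu}{dt} = -\frac{\partial \mathcal{F}}{\partial \mu}, \qquad \frac{d\theta}{dt} = -\frac{\partial \mathcal{F}}{\partial \theta} \qquad (6)$$

Given this, we can derive dynamics for all variables of interest ($\mu$, $\theta_1$, $\theta_2$) by taking derivatives of the variational free energy $\mathcal{F}$. With the prediction errors and variances of Equation 5, the update rules are as follows,

$$\frac{d\mu}{dt} = \Sigma_o^{-1}\epsilon_o \frac{\partial f}{\partial \mu} - \Sigma_x^{-1}\epsilon_x \qquad (7)$$
$$\frac{d\theta_1}{dt} = \Sigma_o^{-1}\epsilon_o \frac{\partial f}{\partial \theta_1} \qquad (8)$$
$$\frac{d\theta_2}{dt} = \Sigma_x^{-1}\epsilon_x \frac{\partial g}{\partial \theta_2} \qquad (9)$$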

Importantly, these update rules are very similar to the ones derived in rao1999predictive, and therefore can be interpreted as recapitulating the core predictive coding update rules. Furthermore, while it is possible to run the dynamics for the $\mu$ and the $\theta$ simultaneously, it is often better to treat predictive coding as an EM algorithm and alternate the updates. Empirically, it is typically best to run the optimization of the $\mu$s with $\theta$ fixed until close to convergence, and then run the dynamics on the $\theta$ with $\mu$ fixed for a short while. This implicitly enforces a separation of timescales upon the model, where the $\mu$ are seen as dynamical variables which change quickly while the $\theta$ are slowly-changing parameters. For instance, the $\mu$s are typically interpreted as rapidly changing neural firing rates, while the $\theta$s are the slowly changing synaptic weight values (rao1999predictive; friston2005theory).
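The alternation between fast inference and slow learning can be sketched in a few lines of NumPy. This is only a minimal illustration under assumptions that are ours rather than part of the derivation: a linear likelihood mapping (prediction `theta @ mu`), an identity prior mapping with a fixed prior mean `mu_bar`, unit variances, and hand-picked step sizes and iteration counts. The function names `infer` and `learn` are hypothetical.

```python
import numpy as np

def free_energy(o, mu, theta, mu_bar):
    """Sum of squared prediction errors (unit variances, constants dropped)."""
    eps_o = o - theta @ mu      # sensory prediction error
    eps_x = mu - mu_bar         # prior prediction error
    return 0.5 * (eps_o @ eps_o + eps_x @ eps_x)

def infer(o, mu, theta, mu_bar, lr=0.05, steps=500):
    """'E-step': gradient descent on the free energy w.r.t. mu, theta fixed."""
    for _ in range(steps):
        eps_o = o - theta @ mu
        eps_x = mu - mu_bar
        mu = mu + lr * (theta.T @ eps_o - eps_x)
    return mu

def learn(o, mu, theta, lr=0.01, steps=5):
    """'M-step': a few gradient steps w.r.t. theta, mu fixed."""
    for _ in range(steps):
        eps_o = o - theta @ mu
        theta = theta + lr * np.outer(eps_o, mu)
    return theta

rng = np.random.default_rng(0)
o = rng.normal(size=4)                     # a single observation
theta_init = 0.5 * rng.normal(size=(4, 2))
mu_bar = np.zeros(2)

mu, theta = mu_bar.copy(), theta_init.copy()
history = [free_energy(o, mu, theta, mu_bar)]
for _ in range(20):
    mu = infer(o, mu, theta, mu_bar)       # fast variable: run close to convergence
    theta = learn(o, mu, theta)            # slow variable: run for a short while
    history.append(free_energy(o, mu, theta, mu_bar))
```

In this linear-Gaussian special case the inference step has a closed-form fixed point, $(\theta^T\theta + I)\mu = \theta^T o + \bar{\mu}$, which gives a simple check that the gradient dynamics settle where they should.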

Finally, we can think about how this derivation of predictive coding maps onto putative psychological processes of perception and learning. The updates of the $\mu$ can be interpreted as a process of perception, since $\mu$ is meant to correspond to the estimate of the latent state of the environment generating the observations. By contrast, the dynamics of the $\theta$ can be thought of as corresponding to learning, since these effectively define the mapping between the latent state $\mu$ and the observations $o$. Importantly, as will be discussed in depth later, these predictive coding update equations can be mapped relatively straightforwardly onto a potential network architecture which only utilizes local computation and plasticity – thus potentially making it a good fit for implementation in the cortex.

2.2 Multi-layer Predictive Coding

The previous examples have only focused on predictive coding with a single level of latent variables $\mu$. However, the expressiveness of such a scheme is limited. The success of deep neural networks in machine learning has demonstrated that having hierarchical sets of latent variables is key to enabling methods to learn powerful abstractions and to handle intrinsically hierarchical dynamics of the sort humans intuitively perceive. The predictive coding schemes previously introduced can be straightforwardly extended to handle hierarchical dynamics of arbitrary depth, equivalently to deep neural networks in machine learning. This is done through postulating multiple layers of latent variables $x_1, \dots, x_L$ and then defining the generative model as follows,

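$$p(x_0, \dots, x_L) = p(x_L)\prod_{l=0}^{L-1} p(x_l | x_{l+1}) \qquad (10)$$

where $p(x_l | x_{l+1}) = \mathcal{N}\big(x_l; f_l(\theta_{l+1} x_{l+1}), \Sigma_l\big)$, the final layer $p(x_L)$ has an arbitrary prior, and the latent variable at the bottom of the hierarchy is set to the observation actually received, $x_0 = o$. Similarly, we define a separate variational posterior for each layer, $q(x_{1:L}|o) = \prod_{l=1}^{L} \delta(x_l - \mu_l)$. The variational free energy can then be written as a sum of the prediction errors at each layer,

$$\mathcal{F} = \frac{1}{2}\sum_{l}\Big[\epsilon_l^T \Sigma_l^{-1}\epsilon_l + \ln |2\pi\Sigma_l|\Big] \qquad (11)$$

where $\epsilon_l = \mu_l - f_l(\theta_{l+1}\mu_{l+1})$ and $\mu_0 = o$. Given that the free energy divides nicely into the sum of layer-wise prediction errors, it comes as no surprise that the dynamics of the $\mu$ and the $\theta$ are similarly separable across layers.

$$\frac{d\mu_l}{dt} = -\Sigma_l^{-1}\epsilon_l + \theta_l^T\Big(\Sigma_{l-1}^{-1}\epsilon_{l-1} \odot f_{l-1}'(\theta_l \mu_l)\Big) \qquad (12)$$
$$\frac{d\theta_l}{dt} = \Big(\Sigma_{l-1}^{-1}\epsilon_{l-1} \odot f_{l-1}'(\theta_l \mu_l)\Big)\mu_l^T \qquad (13)$$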

We see that the dynamics for the variational means depend only on the prediction errors at their layer and the prediction errors on the level below. Intuitively, we can think of the $\mu$s as trying to find a compromise between causing error by deviating from the prediction from the layer above, and adjusting their own prediction to resolve error at the layer below. In a neurally-implemented hierarchical predictive coding network, prediction errors would be the only information transmitted ‘upwards’ from sensory data towards latent representations, while predictions would be transmitted downwards. Crucially for conceptual readings of predictive coding, this means that sensory data is not directly transmitted up through the hierarchy, as is assumed in much of perceptual neuroscience. The dynamics for the $\mu$s are also fairly biologically plausible, as they are effectively just the sum of the precision-weighted prediction errors from the $\mu$s' own layer and the layer below, the prediction errors from below being transmitted back upwards through the synaptic weights $\theta^T$ and weighted with the gradient of the activation function $f'$. This means that there is no direct feedforward pass in predictive coding, as is often assumed in models of vision. It is possible, however, to augment predictive coding models with a feedforward pass, as is discussed in the section on hybrid inference.

Importantly, the dynamics for the synaptic weights are entirely local, needing only the prediction error from the layer below and the current $\mu$ at the given layer. The dynamics thus become a Hebbian rule between the presynaptic $\epsilon_{l-1}$ and postsynaptic $\mu_l$, weighted by the gradient of the activation function.
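To make the locality of these updates concrete, the following sketch implements a small hierarchical network in NumPy. It is an illustration under our own assumptions, not a reference implementation: identity precisions, a `tanh` activation, a zero-mean unit-variance prior on the top layer, arbitrary layer sizes, and hand-picked step sizes. Here `Ws[l]` (a hypothetical name) holds the weights that predict layer `l` from layer `l + 1`.

```python
import numpy as np

def f(x):
    return np.tanh(x)

def df(x):
    return 1.0 - np.tanh(x) ** 2

def errors(mus, Ws):
    """eps[l] = mus[l] - f(Ws[l] @ mus[l+1]); the top layer has a zero-mean prior."""
    eps = [mus[l] - f(Ws[l] @ mus[l + 1]) for l in range(len(Ws))]
    eps.append(mus[-1].copy())
    return eps

def free_energy(mus, Ws):
    return 0.5 * sum(e @ e for e in errors(mus, Ws))

def infer(mus, Ws, lr=0.05, steps=300):
    """Relax the latent layers; mus[0] stays clamped to the observation."""
    mus = [m.copy() for m in mus]
    for _ in range(steps):
        eps = errors(mus, Ws)
        for l in range(1, len(mus)):
            pre = Ws[l - 1] @ mus[l]
            # error from the layer below, sent back up through the transposed weights
            bottom_up = Ws[l - 1].T @ (eps[l - 1] * df(pre))
            mus[l] += lr * (-eps[l] + bottom_up)
    return mus

def learn(mus, Ws, lr=0.05):
    """One local weight step: outer product of gated error below and activity above."""
    eps = errors(mus, Ws)
    return [W + lr * np.outer(eps[l] * df(W @ mus[l + 1]), mus[l + 1])
            for l, W in enumerate(Ws)]

rng = np.random.default_rng(0)
sizes = [6, 4, 3]                                   # observation, then two latent layers
Ws = [0.3 * rng.normal(size=(sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
o = rng.uniform(-0.8, 0.8, size=sizes[0])           # kept inside the range of tanh
mus0 = [o] + [np.zeros(n) for n in sizes[1:]]

F_before = free_energy(mus0, Ws)
mus = infer(mus0, Ws)
F_inferred = free_energy(mus, Ws)
Ws_new = learn(mus, Ws)
F_learned = free_energy(mus, Ws_new)
```

Each update in `infer` and `learn` reads only the errors and activities of adjacent layers, which is the property the text emphasises; no global backward pass is needed.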

Figure 1: Architecture of a multi-layer predictive coding network (here shown with two value and error neurons in each layer). The value neurons project to both the error neurons of the layer below (representing the top-down connections) and the error neurons at the current layer to represent the current activity. The error neurons receive inhibitory top-down inputs from the value neurons of the layer above and excitatory inputs from the value neurons at the same layer. Conversely, the value neurons receive excitatory projections from the error neurons of the layer below and inhibitory projections from the error neurons at the current layer. Crucially, for this model with its explicit error neurons, all synaptic plasticity rules are purely Hebbian.

2.3 Dynamical Predictive Coding and Generalized Coordinates

So far, we have considered the modelling of just a single static stimulus $o$. However, most interesting data the brain receives comes in temporal sequences $\bar{o} = [o_1, o_2, \dots]$. To model such temporal sequences, it is often useful to split the latent variables into states, which can vary with time, and parameters, which cannot. In the case of sequences, instead of minimizing the variational free energy, we must minimize the free action $\bar{\mathcal{F}}$, which is simply the path integral of the variational free energy through time (friston2008DEM; friston2008hierarchical). [Footnote 9: This quantity is called the free action due to the analogy between it and the action central to the variational principles of classical mechanics.]

$$\bar{\mathcal{F}} = \int dt\, \mathcal{F}_t \qquad (14)$$

While there are numerous methods and parameterisations to handle sequence data, one influential and elegant approach, which has been developed by Friston in a number of key papers (friston2008DEM; friston2008hierarchical; friston2010generalised) is to represent temporal data in terms of generalized coordinates of motion. In effect, these represent not just the immediate observation state, but all time derivatives of the observation state. For instance, suppose that the brain represents beliefs about the position of an object. Under a generalized coordinate model, it would also represent beliefs about the velocity (first time derivative), acceleration (second time derivative), jerk (third time derivative) and so on. All these time derivative beliefs are concatenated to form a generalized state. The key insight into this dynamical formulation is, that when written in such a way, many of the mathematical difficulties in handling sequences disappear, leaving relatively straightforward and simple variational filtering algorithms which can natively handle smoothly changing sequences.

Because generalised coordinates can become notationally awkward, we will be very explicit in the following. We denote the time derivatives of the generalized coordinate using a prime, so $\mu'$ is the belief about the velocity of $\mu$, just as $\mu$ is the belief about the ‘position’ of $\mu$. A key point of confusion is that there is also a ‘real’ velocity of $\mu$, which we denote $\dot{\mu}$, which represents how the belief in $\mu$ actually changes over time – i.e. over the course of inference. Importantly, this is not necessarily the same as the belief in the velocity: $\dot{\mu} \neq \mu'$, except at the equilibrium state. Intuitively, this makes sense, as at equilibrium (minimum of the free action, and thus perfect inference) our belief about the velocity of $\mu$ and the ‘real’ velocity perfectly match. Away from equilibrium, our inference is not perfect so they do not necessarily match. We denote the generalized coordinate representation of a state $\tilde{\mu}$ as simply a vector of each of the beliefs about the time derivatives, $\tilde{\mu} = [\mu, \mu', \mu'', \dots]$. We also define the operator $\mathcal{D}$ which maps each element of the generalised coordinate to its time derivative, i.e. $\mathcal{D}\mu = \mu'$ and $\mathcal{D}\tilde{\mu} = [\mu', \mu'', \mu''', \dots]$. With this notation, we can define a dynamical generative model using generalized coordinates. Crucially, we assume that the noise $\omega$ in the generative model is not white noise, but is colored, so it has non-zero autocorrelation and can be differentiated. Effectively, colored noise allows one to model relatively slowly varying (not infinitely fast) exogenous forces on the system. For more information on colored noise versus white noise, see stengel1986stochastic; jazwinski2007stochastic; friston2008DEM. With this assumption we can obtain a generative model in generalized coordinates of motion by simply differentiating the original model.

(15)

Where we have applied a local linearisation assumption (friston2008DEM) which drops the cross terms in the derivatives. We can write these generative models more compactly in generalized coordinates.

(16)

which, written probabilistically is . It has been shown (friston2008DEM) that the optimal (equilibrium) solution to this free action is the following stochastic differential equation,

(17)

Where $\tilde{\omega}$ is the generalized noise at all orders of motion. Intuitively, this is because when $\partial \mathcal{F} / \partial \tilde{\mu} = 0$ then $\dot{\tilde{\mu}} = \mathcal{D}\tilde{\mu}$, that is, the ‘real’ change in the variable is precisely equal to the expected change. This equilibrium is a dynamical equilibrium which moves over time, but precisely in line with the beliefs $\mathcal{D}\tilde{\mu}$. This allows the system to track a dynamically moving solution precisely, and the generalized coordinates let us capture this motion while retaining the static analytical approach of an equilibrium solution, which would otherwise necessarily preclude motion. There are multiple options to turn this result into a variational inference algorithm. Note that the above equation makes no assumptions about the form of the variational density or the generative model, and thus allows multimodal or nonparametric distributions to be represented. For instance, Equation 17 could be integrated numerically by a number of particles in parallel, thus leading to a generalization of particle filtering (friston2008variational). Alternatively, a fixed Gaussian form for the variational density can be assumed, using the Laplace approximation. In this case, we obtain a very similar algorithm to predictive coding as before, but using generalized coordinates of motion. In the latter case, we can write out the free energy as,

(18)

Where and . Moreover, the generalized precisions not only encode the covariance between individual elements of the data or latent space at each order, but also the correlations between generalized orders themselves. Since we are using a unimodal (Gaussian) approximation, instead of integrating the stochastic differential equations of multiple particles, we instead only need to integrate the deterministic differential equation of the mode of the free energy,

(19)

which results in a scheme very similar to standard predictive coding (compare to Equation 7), but in generalized coordinates of motion. The only difference is the $\mathcal{D}\tilde{\mu}$ term, which links the orders of motion together. This term can be intuitively understood as providing the ‘prior motion’, while the prediction errors provide the ‘force’ terms. To make this clearer, let’s take a concrete physical analogy where $\mu$ is the position of some object and $\mu'$ is the expected velocity. Moreover, the object is subject to forces which instantaneously affect its position. Now, the total change in position can be thought of as first taking the change in position due to the intrinsic velocity of the object and then adding the extrinsic changes due to the various exogenous forces.
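The role of the ‘prior motion’ term can be illustrated numerically. The sketch below is a toy example under our own assumptions and is not the DEM scheme itself: a scalar signal $\sin t$ observed together with its first derivative, two orders of motion, an identity observation mapping, a scalar precision-like gain `k`, and plain Euler integration. It compares tracking with and without the shift-operator term.

```python
import numpy as np

n_orders = 2
# Shift operator: maps [mu, mu'] to [mu', 0] (the highest order is truncated).
D = np.eye(n_orders, k=1)

def generalised_obs(t):
    """Observation and its first time derivative for the signal sin(t)."""
    return np.array([np.sin(t), np.cos(t)])

def track(use_prior_motion, k=10.0, dt=1e-3, T=10.0):
    """Euler-integrate d(mu)/dt = [D @ mu] + k * (obs - mu); return worst position error."""
    mu = generalised_obs(0.0)
    worst = 0.0
    for i in range(int(T / dt)):
        t = i * dt
        eps = generalised_obs(t) - mu        # generalised prediction error
        dmu = k * eps                        # 'force' term from the prediction errors
        if use_prior_motion:
            dmu = dmu + D @ mu               # 'prior motion' from the believed velocity
        mu = mu + dt * dmu
        worst = max(worst, abs(np.sin(t + dt) - mu[0]))
    return worst

err_with = track(use_prior_motion=True)
err_without = track(use_prior_motion=False)
```

With the shift term the position estimate is carried along by the believed velocity, so the prediction errors only have to correct a small residual; without it the estimate lags the moving signal.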

2.4 Predictive Coding and Precision

One core aspect of the predictive coding framework, which is absent in the original Rao and Ballard formulation but which arises directly from the variational formulation of predictive coding and the Gaussian generative model, is the notion of precision, or inverse variance, which we have throughout denoted as $\Sigma^{-1}$ (sometimes $\Pi$ is used in the literature as well). Precisions serve to multiplicatively modulate the importance of the prediction errors, and thus have a significant influence on the overall dynamics of the model. They have been put to a wide range of theoretical purposes in the literature, all centered around their modulatory function. Early work (friston2005theory) ties the precision parameters to lateral inhibition and biased competition models, proposing that they serve to mediate competition between prediction error neurons, and are implemented through lateral synaptic weights. Later work (kanai2015cerebral; feldman2010attention) has argued instead that precisions can be interpreted as implementing top-down attentional modulation of predictions, which are thus sensitive to global context variables such as task relevance that have been shown empirically to have a large effect on attentional salience. This work has shown that equipping predictive coding schemes with precision allows them to recapitulate key phenomena observed in standard attentional psychophysics tasks such as the Posner paradigm (feldman2010attention).

Further theoretical and philosophical work has further developed the interpretation of precision matrices into a general purpose modulatory function (clark2015surfing). This function could be implemented neurally in a number of ways. First, certain precision weights could be effectively hardcoded by evolution. One promising candidate for this would be the precisions of interoceptive signals transmitting vital physiological information such as hunger or pain. These interoceptive signals would be hardwired to have extremely high precision, to prevent the organism from simply learning to ignore or down-weight them in comparison to other objectives. Conceptually, assuming high precision for interoceptive signals can shed light on how such signals drive adaptive action through active inference (seth2016active; seth2013extending). It is also possible that certain psychiatric disorders such as autism (van2013predictive; lawson2014aberrant) and schizophrenia (sterzer2018predictive) could be interpreted as disorders of precision (either intrinsically hardcoded, or else resulting from an aberrant or biased learning of precisions). If on track, these theories would provide us with a helpful mechanistic understanding of the algorithmic deviations underlying these disorders, which could potentially enable improved differential diagnosis, and could even guide clinical intervention. This approach, under the name of computational psychiatry, is an active area of research and perhaps one of the most promising avenues for translating highly theoretical models such as predictive coding into concrete medical advances and treatments (huys2016computational).

Mathematically, a dynamical update rule for the precisions can be derived as a gradient descent on the free-energy. This update rule becomes an additional M-step in the EM algorithm, since the precisions are technically parameters of the generative model. Recall that we can write the free-energy as,

(20)

We can then derive the dynamics of the precision matrix as a gradient descent on the free-energy with respect to the variance,

(21)

Where we have used the fact that the covariance (and precision) matrices are necessarily symmetric ($\Sigma_l = \Sigma_l^T$). Secondly, we have defined the precision-weighted prediction errors as $\xi_l = \Sigma_l^{-1}\epsilon_l$. From these dynamics, we can see that the average fixed-point of the precision matrix is simply the variance of the precision-weighted prediction errors at each layer

(22)

In effect, the fixed point of the precision dynamics will lead to these matrices simply representing the average variance of the prediction errors at each level. At the lowest level of the hierarchy, the variance of the prediction errors will be strongly related to the intrinsic variance of the data, and thus algorithmically, learnable precision matrices allow the representation and inference on data with state-dependent additive noise. This kind of state-dependent noise is omnipresent in the natural world, in part due to its own natural statistics, and in part due to intrinsically noisy properties of biological perceptual systems (stein2005neuronal). Similarly, the human visual system can function over many orders of magnitude of objective brightness which dramatically alters the variance of the visual input, while in auditory perception the variance of specific sound inputs is crucially dependent on ambient audio conditions. In all cases, being able to represent the variance of the incoming sensory data is likely crucial to being able to successfully model and perform accurate inference on such sensory streams.
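This fixed point is easy to check numerically in the scalar case. The sketch below assumes a single layer with free energy $\frac{1}{2}[\epsilon^2/\Sigma + \ln \Sigma]$ (our simplification, with constants dropped), so that gradient descent on the variance gives $d\Sigma/dt = \frac{1}{2}[\xi^2 - \Sigma^{-1}]$ with $\xi = \epsilon/\Sigma$. The prediction errors are synthetic samples, and the step size and starting value are hand-picked; a practical scheme would also need to keep $\Sigma$ positive.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(0.0, 1.5, size=10_000)    # synthetic prediction errors, true variance 2.25

sigma = 1.0                                # initial variance estimate
lr = 1.0
for _ in range(500):
    xi = eps / sigma                       # precision-weighted prediction errors
    # averaged descent direction: -dF/dSigma = 0.5 * (xi^2 - 1/sigma)
    grad = 0.5 * (np.mean(xi ** 2) - 1.0 / sigma)
    sigma = sigma + lr * grad

empirical_variance = np.mean(eps ** 2)
precision = 1.0 / sigma
```

At convergence the learned variance matches the empirical variance of the errors, and the precision equals the mean square of the precision-weighted errors, as the text describes.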

Precision also has deep relevance to machine learning. As noted in the backpropagation section later in the paper, predictive coding with fixed predictions and identity precisions forms a scheme which can converge to the exact gradients computed by the backpropagation of error algorithm. Importantly, however, when precisions are included in the scheme, predictive coding forms a superset of backpropagation which allows it to weight gradients by their intrinsic variance. This more nuanced approach may prove more adaptable and robust than standard backpropagation, which implicitly assumes that all data-points in a dataset are of equal value, an assumption which is likely not met for many datasets with heteroscedastic noise. Exploring the use of precision to explicitly model the intrinsic variance of data is an exciting area for future work, especially applied to large-scale modern machine learning systems. Indeed, it can be shown (see Appendix B) that in the linear case using learnt precisions is equivalent to a natural gradients algorithm (amari1995information). Natural gradients precondition the gradient vector with the inverse Fisher information matrix, which captures the curvature of the loss function, and therefore effectively derive an optimal adaptive learning rate for the descent. This has been found to improve optimization performance, albeit at the sometimes substantial computational cost of explicitly computing and materializing the Fisher information matrix.

There remains an intrinsic tension, however, between these two perspectives on precision in the literature. The first interprets precision as a bottom-up ‘objective’ measure of the intrinsic variance in the sensory data and then, deeper in the hierarchy, the intrinsic variance of activities at later processing stages. This contrasts strongly with views of precision as serving a general purpose adaptive modulatory function, as in attention. While attention is indeed deeply affected by bottom-up factors, which are generally termed attentional salience (parkhurst2002modeling), these factors are typically modelled as Bayesian surprise (itti2009bayesian). ‘Bayesian surprise’ is often modelled mathematically as the information gain upon observing a stimulus, which is not necessarily the same as high or low variance. For instance, both high variance stimuli (such as strobe-lights or white noise) and low variance stimuli (such as constant bright incongruous blocks of colour) may be extremely attentionally salient in visual input while having opposite effects on precision. Precisions, if updated using the precision dynamics of Equation 21, explicitly represent the objective variance of the stimulus or data, and therefore cannot easily account for the well-documented top-down or contextually guided attention (torralba2003statistics; kanan2009sun; henderson2017gaze). It therefore seems likely that theories of top-down modulatory precision cannot simply rely on a direct derivation of precision updating in terms of a gradient descent on the free energy, but must instead postulate additional mechanisms which implement this top-down precision modulation explicitly. One possible way to do this is to assume a system of direct inference over precisions with modulatory ‘precision expectations’ which form hyperpriors over the precisions, which can then be updated in a Bayesian fashion by using the objective variance of the data as the likelihood. However, much remains to be worked out as to the precise mathematics of this scheme.
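As a purely illustrative sketch of what such a hyperprior scheme might look like, one textbook option (our choice, not a proposal from the literature reviewed here) is a conjugate Gamma prior over the precision of zero-mean Gaussian prediction errors. The prior parameters `a0` and `b0` play the role of a hypothetical top-down ‘precision expectation’, and the observed errors supply the likelihood.

```python
import numpy as np

def update_precision_belief(a, b, errors):
    """Conjugate Gamma(a, b) update for the precision of zero-mean Gaussian errors."""
    errors = np.asarray(errors, dtype=float)
    a_post = a + 0.5 * errors.size
    b_post = b + 0.5 * np.sum(errors ** 2)
    return a_post, b_post

rng = np.random.default_rng(0)
a0, b0 = 2.0, 1.0                           # hypothetical expectation: prior mean precision a0 / b0
errors = rng.normal(0.0, 2.0, size=50)      # synthetic prediction errors, true precision 0.25
a1, b1 = update_precision_belief(a0, b0, errors)

prior_mean = a0 / b0
empirical_precision = errors.size / np.sum(errors ** 2)
posterior_mean = a1 / b1
```

The posterior mean precision always lies between the top-down expectation and the bottom-up empirical precision, with the balance set by how much data has been observed. Whether anything like this is implemented neurally is an open question.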

Finally, there remains an issue of timescale. Precisions are often conceptualised as being optimized over a slow timescale, comparable with the optimisation of synaptic weights – i.e., in the M-step of the EM algorithm – which fits their mathematical expression as a parameter of the generative model. However, attentional modulation can be very rapid, and likely cannot be encoded through synaptic weights. These concerns make any direct identification of precision with attention difficult, while the idea of precision as instead encoding some base level of variance-normalization or variance weighting finds more support from the mathematics. However even here problems remain due to timescales. The objective variance of different regions of the sensory stream can also vary rapidly, and it is not clear that this variation can be encoded into synaptic weights either, although it is definitely possible to maintain a moving average of the variance through the lateral synaptic weights. Overall, the precise neurophysiological and mathematical meaning and function of precision remains quite uncertain, and is thus an exciting area of future development for the predictive coding framework. Finally, there has been relatively little empirical work on studying the effects of learnable precision in large-scale predictive coding networks.

2.5 Predictive Coding in the Brain?

While technically predictive coding is simply a variational inference and filtering algorithm under Gaussian assumptions, from the beginning it has been claimed to be a biologically plausible theory of cortical computation, and the literature has consistently drawn close connections between the theory and potential computations that may be performed in brains. For instance, the Rao and Ballard model explicitly claims to model the early visual cortex, while friston2005theory explicitly proposed predictive coding as a general theory of cortical computation. In this section, we review work which has begun translating the mathematical formalism into neurophysiological detail, and focus especially on the seminal cortical microcircuit model by bastos2012canonical. We also briefly review empirical work that has attempted to verify or falsify key tenets of predictive coding in the cortex, and discuss the methodological or algorithmic difficulties with this approach.

The hierarchical generative models generally treated in predictive coding are composed of multiple layers in a stacked structure. Each layer consists of a single vector of value, or activity, neurons and a single vector of error neurons. However, the cortex is not organised into such a simple structure. Instead, each cortical ‘area’ such as V1, V2, or V4 comprises six internal layers: L1-L6. These layers are reciprocally connected with each other in a complex way which has not yet been fully elucidated, and which may vary subtly between cortical regions and across species (felleman1991distributed). Nevertheless, there is convergence around a relatively simple scheme where the six cortical layers can be decomposed into an ‘input layer’ L4, which primarily receives driving excitatory inputs from the area below as well as from the thalamus, and two relatively distinct processing streams – a feedforward superficial, or supragranular, stream consisting of layers L1/2/3 and a feedback deep, or infragranular, stream consisting of layers L5 and L6 (layer 4 is typically called the ‘granular’ layer). These streams have been shown to have different preferred oscillatory frequencies, with the superficial layers possessing the strongest theta and gamma power (bastos2015visual), and the deep layers possessing the strongest alpha and beta power, which are negatively correlated across layers. The superficial layers then send excitatory connectivity forward to L4 of the area above, while the deep layers possess feedback connectivity, which can be either inhibitory or excitatory, back to both deep and superficial layers of the areas below. Within each cortical area, there is a well-established three-step relay: from the input layer L4 to the superficial layers L2/3, which project forwards to the next area in the hierarchy, and from L2/3 to the deep layer L5, which can then project to L6 or else provide feedback to regions lower in the hierarchy (rockland2019we). Interestingly, deep L5 and L6 are the only cortical layers which contain neurons that project to subcortical regions or the brainstem, and L6 especially appears to maintain precise reciprocal connectivity with the thalamus (thomson2010neocortical). While this feedforward ‘relay’ from input to superficial to deep layers is well studied, there are also other pathways, including from deep layers to L4 (amorim2010whose), and superficial feedback connections, which are not well explored. Moreover, alongside the cortico-cortical connectivity studied here, there are also many cortico-subcortical, and especially cortico-thalamic, connections which are less well understood and less well integrated into specific process theories of predictive coding (markov2014anatomy).

While this intrinsic connectivity of the cortical region may seem dauntingly complex, much progress has been made within the last decade in fitting predictive coding models to this neurophysiology. Of special importance is the work of bastos2012canonical, who provided the central microcircuit model of predictive coding. [Footnote 10: Perhaps the first worked out canonical microcircuit for predictive coding, although not using that name, was in the early work of mumford1992computational. He argued that the descending deep pathway transmits ‘templates’ backwards which are then fitted to the data present in the layers below before computing ‘residuals’ which are transmitted upwards on the superficial to L4 ascending pathway.] The fundamental operations of predictive coding require predictions to be sent down the hierarchy, while prediction errors are sent upwards. The dynamics of the value neurons $\mu$ require the prediction errors at the current layer (from the top-down predictions of the layer above) to be combined with the prediction errors from the layer below, mapped through the backwards weights $\theta^T$ and the derivative of the activation function $f'$. The Bastos microcircuit model associates a ‘layer’ of predictive coding with a six-layer cortical ‘region’. The inputs to L4 of the region are taken to be the prediction errors of the region below, which are then immediately passed upwards to the superficial layers L2/L3, where the prediction error and the value neurons are taken to be located. The predictions are taken to reside in the deep layers L5/6. The superficial layers receive top-down prediction inputs from the deep layers of the region above in the hierarchy, which are combined to compute the prediction errors of the region. These are then combined with the bottom-up prediction errors coming from L4 to update the value neurons $\mu$, also located in the superficial layers. The value neurons then transmit to the deep layers L5/6, where the predictions to be transmitted to the region below in the hierarchy are computed, while the superficial prediction error units transmit to the L4 input layer of the region above in the hierarchy. The full schematic of the Bastos model is presented in Figure 2.

Figure 2: The canonical microcircuit proposed by Bastos et al. mapped onto the laminar connectivity of a cortical region (which comprises 6 layers). Here, for simplicity, we group layers L2 and L3 together into a broad ‘superficial’ layer and L5 and L6 together into a ‘deep’ layer. We ignore L1 entirely since there are few neurons there and they are not involved in the Bastos microcircuit. Bold lines are included in the canonical microcircuit of Bastos et al. Dashed lines are connections which are known to exist in the cortex but which are not explained by the model. Red text denotes the values which are computed in each part of the canonical microcircuit.

This model fits the predictive coding architecture to the laminar structure within cortical regions. It explains several core features of the cortical microcircuit – that superficial cells (interpreted as encoding prediction errors), project forward to the L4 input layer of the region above. It also provides an interpretation of the function of the well-established ‘relay’ from L4 to superficial (transmitting inputted prediction errors to superficial), the computation of prediction errors and value neurons in the superficial layers, and then the superficial to deep connectivity as encoding the prediction of the value neurons. It also can explain the deep to superficial backwards pass as the transmission of predictions from one region to the next. Similarly, the L5 and L6 deep cortico-subcortical transmission can be straightforwardly interpreted as the transmission of predictions to subcortical areas, or to motor and brainstem regions to perform active inference. However, there are several aspects of the model and the neurophysiology which require further elucidation. One primary issue is the well-established deep-deep feedback pathway from the deep layers of one region to the deep layers of the region below it in the hierarchy (amorim2010whose; harris2018organization). In strength, this feedback pathway is often considered more important than the deep-superficial pathway that is thought to convey the predictions to the prediction error units in the superficial layers of the region below. This pathway is entirely unexplained within the Bastos model, yet it appears to be important for cortical function. It is possible that this may be thought of as a prediction pathway, so that predictions at one level can directly modulate predictions at the layer below without having to have their influence modulated by going through the prediction error units. 
This would provide the brain with a powerful downward or generative path, enabling it to compute low-level predictions effectively directly from high-level abstractions in a single downward sweep. However, such a path is not used in the mathematical predictive coding model, and it is unclear what probabilistic interpretation it could have. A secondary concern is the fact that there also exists a superficial-superficial feedback connection with unclear function within this model (markov2014anatomy). It has been suggested that this feedback connection may carry precision information (kanai2015cerebral), although it is unclear why this is necessary since the actual dynamics of the precision only require access to the precision-weighted prediction errors of the current layer (Equation 2.4).

A more general concern with this model is that the deep layers are in general relatively poorly utilised by the model. All the ‘action’ so to speak occurs in the superficial layers – which is where both the prediction errors and the value neurons are located, and where the top-down predictions interface with the prediction errors. The only task the deep layers provide in this model is to compute the predictions and then relay them to the layers below. It is possible, perhaps likely, that the actual function could be considerably more complex than the ANN approach of synaptic weights passed through an elementwise nonlinear activation function. However, extensions of predictive coding to more complex backwards functions have not yet been substantially explored in the literature, and would require more complex update rules for the parameters which may lead to less biologically plausible learning rules. It is also known that the deep layers contain the connections to the thalamus, striatum, and other subcortical regions which are likely important in action selection as well as large-scale coordination across the brain. However, such effects and connections are not included in standard predictive coding models which are primarily concerned with only cortico-cortical processing.

One interesting potential issue is that there are several null operations in the model. For instance, the prediction errors are computed in the superficial layers at the level below, and then transmitted first to the input layer L4 before being transmitted again to the superficial layers. This provides two sets of operations on the prediction errors while only one is necessary, thus necessitating that one of these steps is effectively a pure relay step without modification of the prediction errors – an interesting and testable neurophysiological prediction. A similar situation arises with the predictions: although the predictions are considered to be computed only as a nonlinear function of the value units, mapped through synaptic connections, they actually undergo two steps of computation, first in the superficial-deep transmission within a region, and second in the deep-superficial feedback transmission to the region below. According to the standard model, one of these steps must be a null step and not change the predictions, which could in theory be tested by current methods. Importantly, it is possible that the actual function computed for the predictions (although not for the prediction errors) is more complex, and thus takes multiple steps. However, to achieve learning in such a system would require more complicated update rules, which would likely exacerbate the issues of weight transport and derivative computation already inherent in the algorithm. An additional interesting consideration is the extension of predictive coding from a simple linear stack of hierarchical regions to a heterarchical scheme, where multiple regions may all project to the same region, and similarly one region may send prediction errors to many others. Predictive coding makes a very strong hypothesis in this situation, which is that heterarchical connectivity must be symmetrical. If a region sends feedforward prediction errors to another region, it must receive feedback predictions from it and vice versa. This feature of connectivity in the brain has long been confirmed through neuroanatomical studies (mumford1992computational; felleman1991distributed).

The prediction errors must be transmitted upwards by being modulated through the transpose of the forward weights, which would be implemented as the synaptic strengths in either the deep-superficial backwards paths or the superficial-deep forward relay step. Both of these are far from the superficial-input pathway, which raises a considerable issue of weight transport. Additionally, these prediction errors must transmit with them the derivative of the activation function, which is theoretically available at the superficial layers of the level below where the prediction errors are transmitted, but would then need to be computed separately and transmitted back up with the prediction errors. The weight transport problem poses the greater difficulty; however, as we discuss below, it may be solvable with learnable or random backwards weights.

A further potential issue arises from the interplay of excitation and inhibition within the microcircuit (kogo2015predictive; heilbron2018great). Specifically, predictive coding requires the feedback connectivity containing predictions to be inhibitory, or else it requires the connection between the error and value neurons to be inhibitory, depending on the direction of the update. However, both cortico-cortical feedback projections and pyramidal-pyramidal interactions within laminar layers tend to be excitatory. To address this, additional inhibitory interneuron circuitry, which remains to be explored, may be required to negate one of these terms. For instance, shipp2016neural suggests that the Martinotti cells could perform this function. An additional consideration is that while the mathematical form of predictive coding allows for negative prediction errors, these cannot be implemented directly in the brain – a neuron cannot have a negative firing rate. Negative firing rates could be mimicked by using a high baseline firing rate and then interpreting firing below the baseline as negative, although a baseline high enough to enable a roughly equal amount of positive and negative ‘space’ for encoding would be extremely energetically inefficient (keller2018predictive). Another potential solution, as suggested in keller2018predictive, would be to use separate populations of positive and negative prediction error neurons. However, precise circuitry would then be needed to ensure that information is routed to the correct positive or negative neuron, or alternatively each value neuron would have to stand in a one-to-one relationship with both a positive and a negative error neuron, with one connection being inhibitory and the other being excitatory, and it is unclear whether such precise connectivity can exist in the brain. 
Finally, prediction error neurons could encode their errors in an antagonistic fashion, as do color-sensitive opponent cells in the retina – for instance a red-green opponent cell could simply encode negative green prediction error as positive red prediction error. However, it is unclear to what extent opponent cells exist in deeper, more abstract regions of the cortex, nor what the opponency would signify (huang2011predictive).
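The separate-populations proposal can be made concrete with a minimal sketch. It assumes, purely for illustration, that the signed error is defined as observation minus prediction and that each population is a simple rectifier; the function names are hypothetical and do not come from any cited model.

```python
import numpy as np

def split_error(prediction, observation):
    """Encode a signed prediction error with two non-negative populations."""
    error = observation - prediction      # assumed sign convention
    positive = np.maximum(error, 0.0)     # active when input exceeds prediction
    negative = np.maximum(-error, 0.0)    # active when prediction exceeds input
    return positive, negative

def decode_error(positive, negative):
    """Recover the signed error that the mathematical model works with."""
    return positive - negative

prediction = np.array([0.2, 0.8, 0.5])
observation = np.array([0.6, 0.3, 0.5])
pos, neg = split_error(prediction, observation)
signed = decode_error(pos, neg)
```

Both populations have non-negative activity, and at most one of each pair is active for a given unit. Downstream neurons therefore need oppositely signed connections from the two populations to recover the signed error, which is the precise connectivity requirement raised above.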

The implementation of precision in such a predictive coding microcircuit is another interesting question which has yet to be fully fleshed out. kanai2015cerebral suggests that precision may be encoded in subcortical regions such as the pulvinar, which is known to be engaged in attentional regulation. They suggest that such precision modulation could be implemented either as direct modulation of the superficial error units by neurons projecting from the pulvinar, or else alternatively via indirect effects of the pulvinar being instrumental in establishing or dissolving synchrony within or between regions, which is known to affect the amount of information transmitted between layers (kanai2015cerebral; buzsaki2006rhythms; uhlhaas2006neural). However, mathematically, precision is a matrix which quantifies the covariance between each error unit and each other error unit within the layer. It is unlikely that the pulvinar could usefully process or precisely modulate this amount of information. The pulvinar could, however, be instrumental in computing a diagonal approximation to the full precision matrix, by essentially modulating each superficial error neuron independently while lateral connectivity within the layer, perhaps mediated by SST interneurons, which are known to project relatively uniformly to a local region, could be involved in the implementation of the full precision matrix. The pulvinar could potentially focus on global exogenous precision modulation, such as due to attention, while the lateral inhibition would focus primarily on modelling the bottom-up aspects of precision such as the prediction error variance which arises naturally from Equation 2.4. 
If such a scheme were implemented in the brain, with diagonal global precision modulation, and full-matrix lateral precision computation, then this immediately suggests the intriguing hypothesis that top-down attention can only modulate independent variations in stimulus aspects, while bottom-up attention or salience can and does explicitly model their covariances.
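The difference between a diagonal and a full precision matrix can be illustrated numerically. The sketch below assumes a hypothetical layer of three error units with a made-up covariance, and says nothing about which neural structure would compute either quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance of the prediction errors in one layer:
# units 0 and 1 are strongly correlated, unit 2 is independent.
true_cov = np.array([[1.0, 0.8, 0.0],
                     [0.8, 1.0, 0.0],
                     [0.0, 0.0, 0.5]])
errors = rng.multivariate_normal(np.zeros(3), true_cov, size=5000)

sample_cov = np.cov(errors, rowvar=False)
full_precision = np.linalg.inv(sample_cov)            # captures covariances
diag_precision = np.diag(1.0 / np.diag(sample_cov))   # one independent gain per unit

e = np.array([1.0, -1.0, 0.5])                        # an example error vector
weighted_full = full_precision @ e
weighted_diag = diag_precision @ e
```

The diagonal approximation rescales each error unit separately. The full matrix additionally couples units 0 and 1, so an error pattern that violates their usual correlation is weighted much more heavily.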

One final interesting consideration arises from the consistent and well-established finding that the superficial and deep layers operate at different principal frequencies, with the superficial layers operating at the fast gamma frequencies, while the deep layers primarily utilize the slower alpha and beta frequencies (bastos2015visual). This finding does not necessarily follow from the cortical microcircuit model above, which if anything suggests that predictions and prediction errors, and thus superficial and deep layers should operate at roughly the same frequency. It has been argued (bastos2012canonical), that the predictions could operate at a slower frequency, since they integrate information from the prediction errors over time, however this would be an additional assumption, not a direct consequence of the standard mathematical model, in which the predictions are an instantaneous function of the value neurons. Now, the value neurons themselves, since they are updated using iterative dynamics, do integrate information from the instantaneous prediction errors, and thus would potentially have a slower frequency, however the value neurons are intermingled with the prediction error neurons in the superficial layers, and would thus also be expected to affect the dominant frequency of the superficial layers. It is possible, however, that the higher frequency prediction error neurons operating at the gamma frequency disguise the lower frequency value neurons in their midst, while the alpha/beta signal of the deep layers, which only contain prediction neurons, is not so disguised, giving rise to the observed frequency dynamics.

Overall, while much progress has been made in translating the abstract mathematical specification of predictive coding into neurophysiologically realistic neural circuitry, there are still many open questions and important problems. The fit to the cortical microcircuitry is not perfect, and there are several known cortical pathways which are hard to explain under current models. Nevertheless, predictive coding process theories provide perhaps one of the clearest and most general neuronal process theories relating cortical microcircuitry to an abstract computational model which is known to be able to solve challenging cognitive tasks. Finally, all the process models considered have assumed that neurons primarily communicate through real-valued rate codes instead of spiking codes. If the brain does use spiking codes heavily, then the algorithmic theory and process theories of predictive coding would need to be reformulated, if possible, to natively handle spiking neural networks. In general, learning and inference in spiking neural network models remain relatively poorly understood, although there has been much recent progress in this area (zenke2018superspike; bellec2020solution; kaiser2020synaptic; neftci2019surrogate; zenke2021remarkable). The extension of predictive coding-like algorithms to the spiking paradigm is also an exciting open area of research (ororbia2019biologically; boerlin2013predictive; brendel2020learning). An additional complication, which may lead to novel algorithms and implementations, is the fact that neurons have several distinct sites of dendritic integration (sacramento2018dendritic; takahashi2020active), as well as complicated internal, synaptic, and axonal physiology which may substantially affect processing or offer considerably more expressive power than the current understanding of ‘biological plausibility’ admits.

There has also been a substantial amount of research empirically investigating many of the predictions made by the process theories of predictive coding (bastos2012canonical; keller2018predictive; kanai2015cerebral). A recent thorough review of this work is provided by walsh2020evaluating. A large amount of research has focused on the crucial prediction that expected, or predicted, stimuli should elicit less error response than unexpected ones. While the neural phenomena of repetition-suppression and expectation-suppression are well-established at the individual unit level, these phenomena are also well explained by other competing theories such as neural adaptation (desimone1995neural). Evidence from large-scale fMRI studies of the activity of whole brain regions is mixed, with some studies finding increases and others decreases in activity. Additionally, predictive coding does not actually make clear predictions of the level of activity in whole brain regions. While predictive coding predicts that the activity of error neurons should drop, the error neurons are generally thought of as being situated in the superficial layers alongside the value neurons, whose activity may rise. Moreover, well-predicted stimuli might be expected to have high precision, which would then boost the error unit activity, thus counteracting to some extent the drop due to better prediction. Precisely how these effects would interact in aggregate measures like the fMRI BOLD signal is unclear.

Another approach to empirically testing the theory is to look for specific value and error units in the brain. However, this task is complicated by the fact that often the predictions against which the errors are computed are unknown. For instance, it is well known that there are neurons in V1 which are sensitive to the illusory contours of stimuli such as the Kanizsa triangle (kanizsa1955margini; kok2015predictive; kok2015role), however, there remains a problem of interpretation. It is not clear whether such neurons are prediction errors, since there was a prediction of the contour which was not in fact there, or whether they represent value neurons faithfully representing the prediction itself. Nevertheless, there has been some evidence of functionally distinct neuronal subpopulations potentially corresponding to prediction and prediction error neurons (bell2016encoding; fiser2016experience).

It is also important to note that some implementations of predictive coding do not necessarily require separate populations of error units, but instead assume multicompartmental neurons with a distinct apical dendrite compartment which can store prediction errors separately from the value encoded by the main neuron body (sacramento2018dendritic; takahashi2020active). (Footnote 11: While the original architecture of sacramento2018dendritic was not explicitly derived from predictive coding, it was later shown that the two are equivalent (whittington2019theories).) If predictive coding were to be implemented in the brain in such a fashion, then not finding explicit error units would not conclusively refute predictive coding. Due to this flexibility in both the theory and the process theory translating it to neural circuitry, as well as the difficulty in extracting predictions of aggregate measures (such as fMRI BOLD signal) from the mathematical model, it has been challenging to either experimentally confirm or refute predictive coding as a theory up to now. However, with emerging advances in experimental techniques and methodologies, theoretical progress in exploring the landscape and computational efficacy of different predictive coding variants, and more precise process theories, predictive coding, or at least specific process theories, may well be amenable to a definitive experimental verification or falsification in the future.

3 Paradigms of Predictive Coding

While predictive coding has quite a straightforward mathematical form, there are numerous ways to set up predictive coding networks to achieve particular tasks, and numerous subtleties which can hinder a naive implementation. In this section, we survey the different paradigms of training predictive coding networks and review the empirical studies which have been carried out using these types of network. In brief, we argue that predictive coding can be instantiated in a supervised or unsupervised fashion. In the unsupervised case, there is a hierarchy of nodes, but the top level of the hierarchy is allowed to vary freely. New data enters only at the bottom level of the hierarchy. In this case, the predictive coding network functions much like an autoencoder network (hinton1994autoencoders; kingma2013auto), in which its prediction is ultimately of the same type as its input. In the supervised case, the activity nodes at the highest latent level of the network are fixed to the values given by the supervisory signal – i.e. the labels. The supervised network can then be run in two directions, depending on whether the ‘data’ or the ‘label’ is provided to the top or the bottom of the network respectively.

In the following, we review empirical work demonstrating the performance characteristics of predictive coding networks in both unsupervised and supervised scenarios. We then summarize work in making the standard predictive coding architecture more biologically plausible by relaxing certain assumptions implicit in the canonical model of predictive coding as described so far, as well as how predictive coding networks can be extended to perform action, through active inference.

Figure 3: Summary of the input output relationships for each paradigm of predictive coding. Specifically a.) What the input to the network is and b.) what the network is trained to predict.

3.1 Unsupervised predictive coding

Unsupervised training is perhaps the most intuitive way to think about predictive coding, and is the most obvious candidate for how predictive coding may be implemented in neural circuitry. On this view, the predictive coding network functions essentially as an autoencoder (hinton1994autoencoders; kingma2013auto; hinton2006reducing), attempting to predict either the current sensory input, or the next ‘frame’ of sensory inputs (temporal predictive coding). Under this model the latent activations of the highest level are not fixed, but can vary freely to best model the data. In this unsupervised case, the question becomes what to predict, to which there are many potential answers. We review here some possibilities which have been investigated in the literature.

To train an unsupervised predictive coding network, the activities of the lowest layer are fixed to those of the input data. The activities of the latent variables at all other levels of the hierarchy are initialized randomly. Then, an iterative process begins in which each layer makes predictions at the layer below, which generates prediction errors. The latent variables then follow the dynamics of equation (PC) to minimize prediction errors. Once the network has settled into an equilibrium, or else the dynamics have been run for some prespecified number of steps, the parameters are updated for one step using the dynamics of Equation 7. If the precisions are learnt as well, then the precision dynamics (Equation 2.4) will be run for one step here as well. To help reduce the variance of the update, predictive coding networks are often trained with a minibatch of data for each update, as in machine learning (friston2005theory; whittington2017approximation; millidge2019implementing; millidge2019combining; orchard2019making; millidge2018predictive). In general, predictive coding networks possess many of the same hyperparameters, such as the batch size, the learning rate, and layer width, as artificial neural network models from machine learning. Predictive coding networks can even be adapted to use convolutional or recurrent architectures (millidge2020predictive; salvatori2021predictive). The main difference is the training algorithm of predictive coding (Equation 7) rather than stochastic gradient descent with backpropagated gradients, although under certain conditions it has been shown that the predictive coding algorithm approximates the backpropagation of error algorithm (millidge2020predictive; whittington2017approximation; song2020can).
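The training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions that are not taken from the text: a single latent layer, a tanh nonlinearity, a squared-error energy with identity precisions, no prior on the freely varying top layer, and hypothetical names (`train_step`, `lr_mu`, `lr_w`) and toy data. It is not the implementation used in any of the cited works.

```python
import numpy as np

def train_step(W, X, rng, n_infer=50, lr_mu=0.1, lr_w=0.05):
    """One minibatch update: iterative inference, then a single weight step."""
    # Latent activities are initialised randomly for every minibatch.
    mu = 0.1 * rng.standard_normal((X.shape[0], W.shape[1]))
    for _ in range(n_infer):
        act = np.tanh(mu)
        err = X - act @ W.T                         # error at the clamped data layer
        mu += lr_mu * (err @ W) * (1.0 - act ** 2)  # gradient descent on squared error
    act = np.tanh(mu)
    err = X - act @ W.T
    W = W + lr_w * (err.T @ act) / X.shape[0]       # one parameter step after inference
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    return W, loss

rng = np.random.default_rng(0)
codes = rng.standard_normal((256, 2))
X = np.tanh(codes) @ (0.5 * rng.standard_normal((2, 8)))  # toy data, 2 hidden factors
W = 0.1 * rng.standard_normal((8, 2))

losses = []
for epoch in range(100):
    for start in range(0, len(X), 32):                    # minibatches of 32
        W, loss = train_step(W, X[start:start + 32], rng)
    losses.append(loss)                                   # loss on the last minibatch
```

The inner loop updates only the latent activities and the outer step updates only the weights, which mirrors the separation between inference and learning in the text. The latent update passes the error through the same weight matrix used for prediction and multiplies it by the derivative of the nonlinearity.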

3.1.1 Autoencoding and predictive coding

A simple implementation of predictive coding is as an autoencoder (hinton2006reducing). A data item is presented to the network, and the network’s goal is to predict that same data item back from the network. The goal of this kind of unsupervised autoencoding objective is typically to learn powerful, structured, compressed, or disentangled representations of the input. In a completely unconstrained network, the solution is trivial since the predictive coding network can just fashion itself into the identity function. However, by adding constraints into the network this solution can be ruled out, ideally requiring the network to learn some other, more useful representation of the input to allow for correct reconstruction. These constraints usually arise from a physical or informational bottleneck in the network (tishby2000information), typically created by constricting the dimensionality of one or multiple layers to be smaller (often significantly smaller) than the input dimensionality, thus effectively forcing the input to be compressed in a way amenable to later decompression and reconstruction. The activations in the smallest bottleneck layer are then taken to reflect important representations learned by the network. Autoencoders of this kind have been widely used in machine learning, and variational autoencoders (kingma2013auto), a probabilistic variant which learns a Gaussian latent state, are often state of the art for various image generation tasks (child2020very). Such networks can be thought of as having an ‘hourglass shape’, with wide encoders and decoders at each end and a bottleneck in the middle which encodes the latent compressed code. Predictive coding networks, by contrast, only have the bottom half of the hourglass (the decoder), since the latent bottleneck states are optimized by an iterative inference procedure (the E step of the E-M algorithm).

An early example of an autoencoding predictive coding network was provided by rao1999predictive, who showed that such a network could learn interpretable representations in its intermediate layers. The representation learning capabilities of predictive coding autoencoder networks have also been tested on the standard machine learning dataset of MNIST digits (millidge2019implementing), where it was shown that the latent code can be linearly separated by PCA into distinct groups for each of the 10 digits, even though the digits are learned in an entirely unsupervised way.

3.1.2 Temporal Predictive Coding

Another way to implement an unsupervised predictive coding network is in a temporally autoregressive paradigm (spratling2010predictive; friston2008hierarchical; millidge2019fixational). Here, the network is given a time-series of data-points to learn, and it must predict the next input given the current input and potentially some window of past inputs. This learning objective has additional neurobiological relevance (as compared to simple autoencoding), given that the brain is in practice exposed to continuous-time sensory streams, in which predictions must necessarily be made in this temporally structured way.

Furthermore, predicting temporal sequences is a fundamentally more challenging task than simply reconstructing the input, since the future is not necessarily known or reconstructable given the past, and a simple identity mapping will not suffice except for the most trivial sequences. Indeed, autoregressive objectives like this have been used in machine learning to successfully train extremely impressive and large transformer models to predict natural language text to an incredibly high degree of accuracy (vaswani2017attention; brown2020language). Autoregressive predictive coding networks have primarily been explored in the context of signal deconvolution in neuroimaging (friston2008DEM; friston2008hierarchical), as well as in predictive coding inspired machine learning architectures such as PredNet (lotter2016deep).

Such networks have also been used in the context of reinforcement learning for 1-step environment prediction to enable simple planning and action selection (millidge2019combining). Moreover, as shown previously, 1-step autoregressive linear predictive coding is mathematically similar to Kalman Filtering (millidge2021neural; friston2008hierarchical), thus demonstrating the close connection between predictive coding and standard filtering algorithms in engineering. In general, however, despite the empirical successes and importance of unsupervised autoregressive modelling in machine learning, and its neurobiological relevance, surprisingly little work has been done on empirically testing the abilities of large-scale predictive coding networks on autoregressive tasks.
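As a minimal sketch of the 1-step autoregressive objective, the example below learns a linear transition matrix from a stream of observations by reducing the squared temporal prediction error at each step. The rotating two-dimensional signal, the learning rate, and all variable names are illustrative assumptions. The sketch covers only the learning of a linear predictor from prediction errors and is not a Kalman filter: it has no noise model and no latent state estimation.

```python
import numpy as np

angle = 0.1
A_true = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle),  np.cos(angle)]])  # hypothetical generating dynamics

x = np.array([1.0, 0.0])
A = np.zeros((2, 2))            # learned transition weights
lr = 0.1
sq_errors = []

for t in range(2000):
    x_next = A_true @ x         # next observation in the sensory stream
    err = x_next - A @ x        # temporal prediction error
    A += lr * np.outer(err, x)  # gradient step on the squared prediction error
    sq_errors.append(float(err @ err))
    x = x_next
```

Because the input never repeats exactly, an identity mapping does not solve the task. The prediction error vanishes only once the learned weights match the dynamics that generate the stream.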

3.1.3 Spatial Predictive Coding

Another objective which could be used in predictive coding is to predict parts of the input from other parts. Specifically, it is possible to get a predictive coding network to learn to predict a pixel or spatial element from its surroundings. This objective was used in early work on predictive coding in the retina (srinivasan1982predictive), which models the receptive field properties of retinal ganglion cells. This paradigm of predictive coding has close relationships to normalization or whitening transforms. A closely related paradigm is cross-modal predictive coding, which uses information from one sensory modality to predict another. This has been explored in millidge2018predictive and has been shown to lead to good representation learning performance with cross-predicting autoencoders which cross-predict using the three colour channels (red, green, blue) of natural images. The brain may use a cross-modal predictive objective more widely, as suggested by the close integration of multimodal inputs and the ability to effortlessly make cross-modal predictions.
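The idea of predicting a spatial element from its surroundings can be sketched in one dimension. The example assumes a hypothetical fixed surround in which each sample is predicted as the mean of its two neighbours, and a random-walk signal standing in for a spatially correlated image row. It is not the model of srinivasan1982predictive, in which the surround weights depend on the image statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.cumsum(rng.standard_normal(5000))  # smooth, spatially correlated signal

# Hypothetical fixed surround: mean of the left and right neighbours.
prediction = 0.5 * (signal[:-2] + signal[2:])
residual = signal[1:-1] - prediction           # only the unpredicted part is kept

var_signal = signal[1:-1].var()
var_residual = residual.var()
```

On a correlated signal, the residual has far lower variance than the raw input, so transmitting the residual instead of the input uses much less of a limited dynamic range.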

A similar approach is taken in machine learning, where it is called contrastive predictive coding (oord2018representation), which aims to learn latent representations by forcing the network to associate two different crops of the same image, while dissociating crops of different images. This contrastive approach has demonstrated strong results on unsupervised machine learning benchmarks.

3.2 Supervised predictive coding: Forwards and Backwards

Supervised predictive coding is the second major paradigm of predictive coding. In the supervised mode, both data and supervisory labels are provided to the network. The training objective is to learn some function of the data that will allow the network to successfully predict the correct labels. In supervised predictive coding, the activities of the units at one end of the network are fixed to the data, and those at the other end are fixed to the values of the labels. Then, predictions are computed from the latent variables at the top of the network and fed down to the bottom, generating prediction errors at each step. The activations of all intermediate latent variables are then allowed to evolve according to the dynamics of Equation 7 while the activations at the top and bottom layers of the network are fixed to the label or the data values. Once the network has settled into an equilibrium, the parameters of the network are updated. This is repeated for each minibatch of data points and labels.

There are two separate modes for running the predictive coding network in the supervised case – the forward mode and the backwards mode. The forward mode is the more intuitive: the latent variables at the top of the hierarchy are fixed to the labels while the bottom of the network is fixed to the data values. Predictions thus flow down the hierarchy from the labels to the data. To test the network, the bottom nodes of the network are fixed to the test data item, and the label latent variables are allowed to freely evolve according to Equation 7. The label prediction the network makes is determined by the values the top layer of latent variables has taken on after convergence. Thus, in the forward mode, classification is an iterative process requiring multiple dynamical iterations. In the backwards mode, the bottom of the network is fixed to the labels, and the top of the network is fixed to the datapoint. Predictions thus flow directly from data to labels in a manner reminiscent of the feedforward pass of an artificial neural network. In this case, at test time, all the network needs to do is to make a single forward (really downward) pass from data to labels without requiring multiple dynamical iterations. We will show later that this backwards predictive coding network can become equivalent to backpropagation of error in artificial neural networks under certain conditions, which provides a revealing link between predictive coding and contemporary machine learning.
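The asymmetry between the two directions can be illustrated with a toy network. The sketch assumes a single linear layer with hypothetical, already-trained weights, a squared-error energy, and no prior on the inferred layer. Computing what sits at the bottom of the network takes one downward sweep. Inferring what sits at the top requires iterating on the prediction errors.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 3))   # hypothetical weights mapping top layer -> bottom layer
top = np.array([0.0, 1.0, 0.0])    # e.g. a one-hot label in the forward mode

# Downward direction: a single sweep of predictions.
bottom = W @ top

# Upward direction: infer the top layer from the bottom by iterative error minimization.
step = 1.0 / np.linalg.norm(W, 2) ** 2   # step size within the stability limit
inferred = np.zeros(3)
for _ in range(500):
    err = bottom - W @ inferred          # prediction error at the bottom layer
    inferred += step * (W.T @ err)       # error mapped upwards through the transposed weights
```

In the forward mode the top layer holds the labels, so the loop corresponds to classification. In the backwards mode the roles of data and labels are swapped, and the single sweep performs classification while generation needs the loop.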

[‘Standard’ Generative PC]   [‘Reverse’ Discriminative PC ]

Figure 4: Schematic architectures for the a.) Standard, or generative predictive coding setup, or b.) Reverse, or discriminative architecture trained for supervised classification on MNIST digits. In the generative model, the image input (in this case an MNIST digit) is presented to the bottom layer of the network, and the top layer is fixed to the label value (5). Predictions (in black) are passed down and prediction errors (in red) are passed upwards until the network equilibrates. In the discriminative mode, the input image is presented to the top of the network and the label is presented at the bottom. Thus the network aims to ‘generate’ the label from the image. The top-down flow of predictions becomes analogous to the forward pass in an artificial neural network, and the bottom-up prediction errors become equivalent to the backpropagated gradients.

The forwards/backwards distinction in supervised predictive coding also maps closely to a distinction between generative and discriminative classifiers in machine learning (bouchard2004tradeoff). While both forward and backwards predictive coding networks can perform both generation of ‘imagined’ input data given labels, as well as classification (prediction of the labels given data), in each mode one direction is ‘easy’ and the other is ‘hard’. The easy direction requires only a single sweep of predictions to generate the relevant quantity – labels or data – while the hard direction requires a full set of dynamical iterations to convergence to make a prediction of either the labels or data. In forward predictive coding networks, generation is easy, while classification is hard. Predictions flow directly from the labels to the data, whereas to generate predictions of the labels, the network must be run until it converges. Conversely, in a backwards predictive coding network, classification is easy, requiring only a downward sweep of predictions, whereas generation requires dynamical convergence. In effect, the ‘downwards’ sweep in predictive coding networks is always the ‘easy’ direction, so whatever quantity is represented at the lowest level of the hierarchy will be easiest to generate. Conversely, the ‘upwards’ direction in the predictive coding network is difficult, and thus whatever quantity is represented at the top of the hierarchy will require an iterative inference procedure to infer. In a forwards ‘generative mode’, we have the images at the bottom and the labels at the top while in the backwards ‘discriminative mode’, we have the labels at the bottom, and the images at the top.

In general, as might be expected, performance on easy tasks in predictive coding networks is better than performance on hard tasks. In the simple task of MNIST classification, forward predictive coding networks typically manage to generate example digits given labels with high fidelity (millidge2019implementing), while their classification accuracy, although good, is not comparable with that of artificial neural networks trained with backprop. Conversely, backwards predictive coding networks are often able to reach classification accuracies comparable to backprop-trained ANNs on MNIST, while their generated images are often blurry or otherwise poor. orchard2019making have argued that this poor generation ability in backwards predictive coding networks arises from the fact that the generative problem is underdetermined – for any given label there are many possible images which could have given rise to it – and so the network ‘infers’ some combination of all of them, which is generally blurry and does not correspond to a precise digit. They propose to solve this problem with a weight-decay regulariser, which encourages the network to find the minimum norm solution.

3.3 Relaxed Predictive Coding

Although there has been much work trying to determine the kinds of neural circuitry required to implement the predictive coding algorithm, and whether that circuitry can be mapped to that known to exist in the cortex, there are also additional problems of biological plausibility of the predictive coding algorithm which must be raised. Regardless of any details of the circuitry, there are three major implausibilities inherent in the algorithm which would trouble any circuit implementation. These are the weight transport problem, the nonlinear derivatives problem, and the error connectivity problem. The weight transport problem concerns the transposed, or backwards, weight term in the equation for the dynamics of the latent variables (Equation 7). Biologically, what this term says is that the prediction errors from the layer below affect the dynamics of the latent variables at each layer by being mapped backwards through the forward weights used for prediction. Taken literally, this would require the prediction errors to be transmitted ‘backwards’ through the same axons and synapses as the predictions were transmitted forwards. In the brain, axons are only unidirectional so this is prima-facie implausible. The other option is that a set of backwards weights is maintained which is a perfect copy of the forward weights, and that the prediction errors are mapped through these backwards weights instead. The first problem with this solution is that it requires perfectly symmetrical connectivity both forwards and backwards between layers. The second, and more serious, problem is that it requires the backwards synapses to have exactly the same weight values as the forward synapses, and it is not clear how this equivalence could be initialized or maintained during learning in the brain.

The second issue of biological implausibility is the nonlinear derivative problem, or the fact that the learning rules for both the latent variables and the weights contain the derivative of the nonlinear activation function. Although in certain cases, such as rectified linear units, this derivative is trivial, in other cases it may be challenging for neurons to compute. The third biological implausibility is that predictive coding, interpreted naively, requires a precise one-to-one connectivity between each ‘value neuron’ and its ‘error neuron’. Such a precise connectivity pattern would be hard to initialize and maintain in the brain and, if it existed, would almost certainly have been detected by current neurophysiological methods. While these three problems seem daunting for any neurobiologically plausible process theory of predictive coding in the brain, recent work (millidge2020relaxing) has begun to attack these problems, among others, and to show how the predictive coding algorithm can be relaxed to resolve them while still maintaining high performance.

Specifically, it has been shown (millidge2020relaxing) that the weight transport problem can be addressed with a learnable set of backwards weights which can be initialized randomly and independently from the forward weights. The backwards weights can then be updated with the following Hebbian learning rule,

(23)

which is just a multiplication of the (postsynaptic) latents at a layer and the (presynaptic) prediction errors of the layer below, weighted by the derivative of the activation function. millidge2020relaxing has shown that starting with randomly initialized backwards weights and learning them in parallel with the forward weights gives identical performance to using the correct backwards weights (the transpose of the forward weights) in supervised learning tasks on MNIST.

Similarly, the nonlinear derivative problem can, surprisingly, be handled by simply dropping the nonlinear derivative terms from the parameter and latent update rules, thus rendering the ‘backwards pass’ of the network effectively linear. The resulting rules become,

(24)
(25)

Dropping these terms also does not unduly affect the performance of predictive coding networks on classification tasks. Finally, the one-to-one connectivity of latent variable nodes and their corresponding error units can be relaxed by feeding the predictions through an additional learnable weight matrix, so that the computation rule for the prediction errors becomes

(26)

Neurally, this matrix would implement a fully-connected connectivity scheme between each error unit and each top-down prediction. While a fixed, randomly initialized matrix negatively impacts performance, the matrix can also be learned online throughout the supervised learning task according to the following update rule,

(27)

which was found to allow the predictive coding network to maintain performance while avoiding precise one-to-one connectivity between prediction error units and latent variables. Overall, these results show that there is in fact considerable flexibility in relaxing certain assumptions in the predictive coding update rules while maintaining performance in practice, and this relaxing of constraints opens up many more possibilities for neurobiologically accurate process theories of predictive coding to be developed and matched to neural circuitry. Moreover, it also suggests that experimental work that tries to prove or disprove predictive coding by looking for naive implementational details, such as distinct prediction error neurons in one-to-one correspondence with value neurons, may not prove conclusive, since there is considerable flexibility in how the predictive coding model could be implemented in actual cortical microcircuitry.

3.4 Deep Predictive Coding

While so far in this review we have considered only direct variations on the Friston, and Rao and Ballard, models of predictive coding, which are relatively pure and only use local, biologically plausible learning rules, there also exists a small literature experimenting with predictive-coding inspired deep neural networks. These are typically trained with backprop and, while they are not ‘pure’ in that they do not faithfully implement a Rao and Ballard-esque scheme, and are not biologically plausible, by utilizing recent advances in machine learning they often achieve substantially better performance on more challenging tasks than the purer models can achieve. As such, they provide a vital thread of evidence about the scaling properties and performance of deep and complex predictive coding networks, as might be implemented in the brain. Intuitively, deep predictive coding networks can improve upon the standard feedforward artificial neural networks which are ubiquitous in machine learning through the use of feedback connections and recurrence, which in theory allow the network to handle temporally extended tasks more naturally. The feedback connections also allow for an unbounded amount of computation and progressive updating or sharpening of representations and predictions over multiple recurrent iterations (kietzmann2019recurrence; van2020going).

The first major work in this area is PredNet (lotter2016deep), which uses multiple layers of recurrent convolutional LSTMs to implement a deep predictive coding network. In accordance with the predictive coding framework, each convolutional LSTM cell received as input the error from the layer below. The error itself was calculated as the difference between the recurrent prediction of the error from the last timestep, as well as the top-down prediction from the layer above. An interesting quirk of this architecture was that instead of the network’s goal being to predict the activation at the layer below, it was instead to predict the prediction error at that layer. It is unclear to what extent this difference affects the behaviour or performance of PredNet against a more standard deep predictive coding implementation. The network was trained on a number of video object recognition tasks, such as a face-pose dataset and the KITTI dataset, which is composed of images taken from a car-mounted camera. The authors showed that their network outperformed baseline feedforward-only convolutional neural networks. The loss function optimized was the sum of errors at all layers, although interestingly they found that optimizing only the error at the bottom layer performed better. The parameters of the network were optimized using backpropagation through time on the loss function.

Although PredNet has become the seminal work in this field, it has been recently criticised for its lack of adherence to the pure predictive coding model, and related lack of biological plausibility, as well as questions over the degree to which it is learning useful representations rather than simply predicting the low-level optical flow in the images (rane2020prednet), which is always a danger in video prediction where straightforward extrapolations of optical flow can perform surprisingly well. This work has been further developed in wen2018deep, who develop a recurrent-convolutional scheme with top-down deconvolutional predictions, and demonstrate that the recurrence the network enables allows for greater performance than strictly feedforward networks, and additionally that the degree of performance increase scales with the number of recurrent iterations allowed. For a further review of predictive-coding inspired deep learning architectures, see hosseini2020hierarchical.

Moving further into the field of machine learning, there are also a number of papers which utilize a predictive-coding-inspired objective, called contrastive predictive coding, to learn useful abstract latent representations from unsupervised inputs (oord2018representation). The intuition behind contrastive predictive coding is that direct prediction in the data-space is often highly redundant, since it is often sufficient to model only short range temporal correlations in order to do well, and thus leads to models which overfit to minor details or flows rather than learning the true latent structure of the data. Moreover, if the objective is in the data-space, the model is penalized for mis-predicting small, irrelevant details, which can often lead it to devote modelling capacity to local noise instead of the global structure. Contrastive predictive coding instead uses predictive losses in the latent space of the model, so that the important prediction is of the future latent-state of the model itself, and not the actual observations. The original work showed that this approach worked well for learning informative representations of audio and visual datasets, while later work has demonstrated its benefits in video (wen2018deep; elsayed2019reduced). Interestingly, the standard Rao and Ballard predictive coding model implicitly implements this scheme as well, since apart from the lowest layer, all prediction errors are in the latent-space of the model and not the observation space. Effectively, predictive coding optimizes the sum of contrastive losses at every level of the hierarchy. It remains an open question whether such an objective would be beneficial for deep neural networks.
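As a rough illustration of a latent-space predictive objective, the sketch below implements an InfoNCE-style contrastive loss of the kind used in contrastive predictive coding (oord2018representation): a prediction made from the current context is scored against the true future latent and a set of negative latents, and the loss is the negative log-probability of selecting the true one. The function name `info_nce`, the plain dot-product score, and the toy vectors are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def info_nce(prediction, candidates, positive_index):
    """Negative log-softmax score of the true future latent among candidates.

    prediction:     latent-space prediction made from the context, shape (d,)
    candidates:     true future latent plus negatives, shape (n, d)
    positive_index: row of `candidates` holding the true future latent
    """
    scores = candidates @ prediction               # dot-product similarity
    scores = scores - scores.max()                 # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[positive_index]

# Toy usage: one positive and three negatives in a 3-d latent space.
prediction = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [2.0, 0.0, 0.0],    # true future latent, aligned with the prediction
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
])
loss_good = info_nce(prediction, candidates, positive_index=0)

# With uninformative (all-zero) latents the loss equals log(n): chance level.
loss_chance = info_nce(prediction, np.zeros((4, 3)), positive_index=0)
```

Note that the loss never touches the raw observations: only latent vectors are compared, which is the property the paragraph above emphasises.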

Finally, a small number of works have experimented with deep Rao and Ballard-style networks with additional sparsity regularisers (chalasani2013deep), which demonstrate a close link between predictive coding and sparse linear equation solvers and non-negative matrix factorization algorithms (lee2001algorithms). Overall, this is still a young area of research, with many open areas for further exploration. In general, the task of accurately translating the predictive coding paradigm into a modern deep learning framework remains open to new contributions. Additionally, it is still largely unknown to what degree implementing recurrent and top-down connections (which comprise the majority of connections in the cortex (kietzmann2019recurrence)) in artificial feedforward neural networks can improve performance and potentially lead to more biologically plausible or robust machine learning solutions.

4 Relationship to Other Algorithms

4.1 Predictive Coding and Backpropagation of error

One of the major streams of technological advancement in the past decade has been in machine learning, with extremely large and deep neural networks, often containing millions or billions of parameters, reaching extremely high levels of performance on a wide range of extremely challenging tasks such as machine translation and text generation (brown2020language; radford2019language), Go (silver2017mastering), Atari (schrittwieser2019mastering), as well as image (child2020very) and audio generation (oord2016wavenet; dhariwal2020jukebox) and object detection (krizhevsky2012imagenet). Core to all these successes is the backpropagation of error (backprop) algorithm (werbos1982applications; linnainmaa1970representation), which is used to train all such networks. Backprop is fundamentally a credit assignment algorithm which correctly computes the derivative of a global, often distant, loss with respect to each parameter (often interpreted as a synaptic weight from a neuroscientific perspective). Given these derivatives, or credit assignments, each parameter can then be adjusted independently in a way which will minimize the global loss. In such a way the network learns a set of weights which allows it to successfully solve the task. Crucially, backprop can compute gradients for effectively any computation, so long as it is differentiable. All that is needed, then, is for the operator to define a loss function and the forward computation of the model, and then backprop can compute the gradients of the loss with respect to each parameter in the model.

Due to the immense successes of backprop in training deep artificial neural networks, a natural question is whether the brain – which faces an extremely similar credit assignment problem – might potentially be using an algorithm like backprop to update its own synaptic weights. Unfortunately, backprop is generally not considered biologically plausible (crick1989recent), rendering a direct implementation of the algorithm in neural circuitry unlikely. However, in recent years a large amount of work has explored various biologically plausible approximations or alternatives to backprop, which could in theory be implemented in the brain (bengio2015early; bengio2017stdp; sacramento2018dendritic; akrout2019deep; lillicrap2016random; whittington2017approximation; whittington2019theories; millidge2020relaxing; millidge2020activation; millidge2020predictive; millidge2020investigating; scellier2017equilibrium; scellier2018generalization; lee2015difference; amit2019deep). This work reignites the idea that, in fact, biological brains could be implementing backpropagation for learning, or something very close to it. If this were the case, then this would provide an extremely important insight, allowing us to mechanistically understand many aspects of brain function and dysfunction, as well as allowing the immediate portability of results from machine learning, where experience with extremely large and deep backprop-trained neural networks exists, directly to neuroscience. It would also take a large step towards answering deep questions about the nature of (human/biological) learning, and would make a substantial contribution towards our understanding of the prospects for the development of artificial general intelligence within the current machine learning paradigm.

From the perspective of this review, we focus on recent work investigating the links between predictive coding and the backpropagation of error algorithm. Specifically, it has been shown that, under certain conditions, the error neurons in predictive coding networks converge to the gradients computed by backprop, and that if this convergence has occurred, then the weight updates of predictive coding are identical to those of backprop. This was shown first in multi-layer perceptron models by whittington2017approximation and then for arbitrary computational graphs, including large-scale machine learning models, by millidge2020predictive under the fixed prediction assumption. Similarly, song2020can has shown that if the network is initialized to its feedforward pass values, the first iteration of predictive coding is precisely equal to backprop. Since predictive coding is largely biologically plausible, and has many potentially plausible process theories, this close link between the theories provides a potential route to the development of a biologically plausible alternative to backprop, which may be implemented in the brain. Additionally, since predictive coding can be derived as a variational inference algorithm, it also provides a close and fascinating link between backpropagation of error and variational inference.

Here we demonstrate the relationship between backprop and predictive coding on arbitrary computation graphs. A computation graph is the fundamental object which is operated on by backprop, and it is simply a graph of the computational operations which are undertaken during the computation of a function. For instance, when the function is a standard multi-layer perceptron (MLP), the computation graph is a series of elementwise nonlinearities and multiplications by the synaptic weights.

Formally, a computation graph $\mathcal{G} = \{\mathbb{V}, \mathbb{E}\}$ is a graph of vertices $\mathbb{V}$ and edges $\mathbb{E}$ where the vertices represent intermediate computation products – for instance the activations at each layer in an MLP model – and the edges represent differentiable functions. A vertex $v_i$ may have many children, denoted $\mathcal{C}(v_i)$, and many parents, denoted $\mathcal{P}(v_i)$. A vertex is a child of another if the value of the parent vertex is used to compute the value of the child vertex, using the function denoted by the edge between them. For backpropagation, we only consider computation graphs that are directed acyclic graphs (DAGs), which ensures that it is impossible to be stuck in a cycle forever by simply going from children to parents. DAG computation graphs are highly general and can represent essentially any function that can be computed in finite time. Given an output vertex $v_{out}$ and a loss function $L = f(v_{out})$, backpropagation can be performed upon a computation graph. Backpropagation is an extremely straightforward algorithm which simply uses the chain rule of multivariable calculus to recursively compute the derivative of the loss with respect to each vertex from the derivatives with respect to its children.

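$$\frac{\partial L}{\partial v_i} = \sum_{v_j \in \mathcal{C}(v_i)} \frac{\partial L}{\partial v_j} \frac{\partial v_j}{\partial v_i} \tag{28}$$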

By starting with the output gradient $\frac{\partial L}{\partial v_{out}}$, and if all the gradients $\frac{\partial v_j}{\partial v_i}$, which are the gradients of the edge functions, are known, then the derivative of the loss with respect to every vertex can be recursively computed. Then, once the gradients with respect to the vertices are known, the gradients with respect to the weights $\theta_i$ can be straightforwardly computed as,

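$$\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial v_i} \frac{\partial v_i}{\partial \theta_i} \tag{29}$$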

Predictive coding can also be straightforwardly extended to arbitrary computation graphs. To do so, we simply augment the standard computation graph with an additional error unit $\epsilon_i$ for each vertex $v_i$. Formally, the augmented graph becomes $\tilde{\mathcal{G}} = \{\mathbb{V}, \mathbb{E}, \mathcal{E}\}$ where $\mathcal{E}$ is the set of all error neurons. We then adapt the core predictive coding dynamics equations from a hierarchy of layers to arbitrary graphs,

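$$\frac{dv_i}{dt} = -\epsilon_i + \sum_{v_j \in \mathcal{C}(v_i)} \epsilon_j \frac{\partial \hat{v}_j}{\partial v_i} \tag{30}$$
$$\frac{d\theta_i}{dt} = \epsilon_i \frac{\partial \hat{v}_i}{\partial \theta_i} \tag{31}$$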

where we have assumed that all precisions are equal to the identity, $\Sigma_i^{-1} = I$, and thus can be ignored. The dynamics of the vertices $v_i$ and of the parameters $\theta_i$ of the edge functions, with predictions $\hat{v}_i = f(\mathcal{P}(v_i); \theta_i)$ and prediction errors $\epsilon_i = v_i - \hat{v}_i$, can be derived as a gradient descent on $\mathcal{F} = \frac{1}{2}\sum_i \epsilon_i^T \epsilon_i$, which is half the sum of squared prediction errors of every node in the graph. Importantly, these dynamics still require only information (the current vertex value, prediction error, and prediction errors of child vertices) locally available at the vertex. Here we must apply the fixed prediction assumption and postulate that the predictions $\hat{v}_i$ remain fixed throughout the optimization of the $v_i$. Given this assumption, we can see that, at the equilibrium of the dynamics of the $v_i$, the prediction error $\epsilon_i^*$ equals the sum of the prediction errors of the child vertices multiplied by the gradient of the prediction with respect to the activation,

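$$\epsilon_i^* = \sum_{v_j \in \mathcal{C}(v_i)} \epsilon_j^* \frac{\partial \hat{v}_j}{\partial v_i} \tag{32}$$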

Importantly, this recursion is identical to that of the backpropagation of error algorithm (Equation 28), which thus implies that if the output prediction error is the same as the gradient of the loss, $\epsilon_{out} = \frac{\partial L}{\partial v_{out}}$, then throughout the entire computational graph the fixed point of the prediction errors is the gradients computed by backprop, and thus by running the dynamics to the fixed point of Equation 32, the backprop gradient can be computed. Finally, we see by inspection that the update rules for the weights in predictive coding and backpropagation are the same if the prediction error is equal to the gradient with respect to the activation, such that if the predictive coding network is allowed to converge, and then the weights are updated, then the procedure is equivalent to a single update of backpropagation. This means that, under the fixed prediction assumption, the dynamical iterative inference procedure of predictive coding can be interpreted as a convergence to the backpropagated gradients in an artificial neural network. While this equivalence is easy to see for an artificial neural network, there remain several issues of biological plausibility when applying this approach to neural circuitry. These are discussed in Appendix C.
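The convergence argument can be checked numerically on a small example. The sketch below is a minimal illustration under stated assumptions, not the implementation used in the cited papers: it assumes a three-layer MLP (a chain-shaped computation graph) with tanh hidden units, a squared-error loss, identity precisions, a step size of 0.1 and 500 inference iterations. All names (`layer`, `jacobian`, `v_hat`, `eps`) are hypothetical. The predictions and their Jacobians are frozen at their feedforward values (the fixed prediction assumption), and the output error is clamped to the loss gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                       # input, two hidden layers, output
Ws = [rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i])) for i in range(3)]
x = rng.normal(size=sizes[0])
target = rng.normal(size=sizes[-1])

def layer(i, v_in):
    """Prediction of vertex i+1 from its parent: tanh hidden layers, linear output."""
    pre = Ws[i] @ v_in
    return np.tanh(pre) if i < 2 else pre

def jacobian(i, v_in):
    """Jacobian of layer(i, .) with respect to its input."""
    pre = Ws[i] @ v_in
    d = 1.0 - np.tanh(pre) ** 2 if i < 2 else np.ones_like(pre)
    return d[:, None] * Ws[i]

# Feedforward pass. Under the fixed prediction assumption these predictions,
# and the Jacobians evaluated at them, stay constant during inference.
v_hat = [x]
for i in range(3):
    v_hat.append(layer(i, v_hat[-1]))
jacs = [jacobian(i, v_hat[i]) for i in range(3)]

# Reference: backprop gradients of L = 0.5 * ||output - target||^2 w.r.t. vertices.
grads = [None, None, None, v_hat[3] - target]
for i in (2, 1):
    grads[i] = jacs[i].T @ grads[i + 1]

# Predictive coding: hidden vertices start at their predictions (zero error),
# and the output vertex is set so that its error equals the loss gradient.
v = [vh.copy() for vh in v_hat]
v[3] = v_hat[3] + grads[3]
for _ in range(500):
    eps = [v[i] - v_hat[i] for i in range(4)]
    for i in (1, 2):                       # each vertex uses only local quantities
        v[i] = v[i] + 0.1 * (-eps[i] + jacs[i].T @ eps[i + 1])
eps = [v[i] - v_hat[i] for i in range(4)]

# Weight gradient for the middle layer computed from errors vs. from backprop.
d_tanh = 1.0 - v_hat[2] ** 2
dW_pc = np.outer(eps[2] * d_tanh, v_hat[1])
dW_bp = np.outer(grads[2] * d_tanh, v_hat[1])
```

After the loop, `eps[1]` and `eps[2]` agree with `grads[1]` and `grads[2]` to numerical precision, and so do the two weight gradients, which is the fixed-point statement made above for this toy graph.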

Another set of results, from song2020can and salvatori2021predictive, demonstrates that the update in predictive coding is equal to that of backpropagation for the first update step even without this fixed prediction assumption. This is because, at the first step, all the prediction errors are 0 except at the output, where the prediction error is exactly the gradient of the loss function, and thus the update of the last layer will be identical to that of backpropagation in any case. Then, at each new step, up to the number of layers N, the same process repeats with a new layer whose prediction error was initialized to 0, and thus its first update is also identical to the backpropagation one.

Intuitively, we can think of this as a predictive coding network in which all the error is initially focused at the output of the network, where the loss is, and then through the dynamical minimization of prediction errors at multiple layers, this error is slowly spread out through the network in parallel until, at the optimum distribution, the error at each vertex is precisely the credit that it should be assigned to causing the error in the first place. Importantly, unlike backprop, the dynamics of predictive coding are local, only requiring the prediction errors from the current vertex and the child vertices.

Finally, this correspondence implies a link between variational inference and backprop. Specifically, backprop arises as a variational inference algorithm which attempts to infer (under Gaussian assumptions) the value of each vertex in the computational graph, given a known start and end node. Backprop can then be seen as the optimal solution to the ‘backwards inference’ problem of inferring the values of the nodes of the graph given a Gaussian prior centered at the activations provided by the feedforward pass. A further note is that the Gaussian assumptions of predictive coding give rise to precision parameters, which are ignored in this analysis showing convergence to backpropagation. If these precisions are added back in, we see that it is possible to derive an ‘uncertainty-aware’ extension to standard backprop, which can adaptively regulate the importance of gradients throughout the computational graph depending on their intrinsic variance. This allows us to directly understand the assumption, implicit within backprop, that all nodes of the computational graph, and the data, are equally certain or uncertain, and that they are i.i.d. The use of precision parameters, thus, may allow for the mathematically principled extension of backprop into situations where this implicit assumption does not hold. Exploring the close connections between backprop and inference is an exciting avenue for future work, which has recently been unlocked by the discovery of these important correspondences.

4.2 Linear Predictive Coding and Kalman Filtering

Here, we demonstrate how predictive coding in the linear regime corresponds to the celebrated Kalman Filter, a ubiquitous algorithm for optimal perception and filtering with linear dynamics (kalman1960new). The Kalman filter, due to its simplicity, accuracy, and robustness to small violations in its assumptions, is widely used to perform perceptual inference and filtering across a wide range of disciplines (welch1995introduction) and is especially prevalent in engineering and control theory. Kalman filtering assumes a linear state-space model of the world defined as

$$x_{t+1} = A x_t + B u_t + \omega, \qquad y_{t+1} = C x_{t+1} + z \tag{33}$$

where $x_t$ is the latent state, $u_t$ is some control or action input (for pure perception without any action on the world these terms can be ignored – i.e. $B = 0$ or $u_t = 0$), $\omega$ and $z$ are vectors of white Gaussian noise with covariances $\Sigma_\omega$ and $\Sigma_z$, and $y_t$ is a vector of observations. These quantities are related through linear maps, parametrised as matrices $A$, $B$, and $C$.

This state-space model can be written as a Gaussian generative model.

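$$p(y_{t+1}, x_{t+1} \mid x_t) = p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid x_t) = \mathcal{N}(y_{t+1}; C x_{t+1}, \Sigma_z)\, \mathcal{N}(x_{t+1}; A x_t + B u_t, \Sigma_\omega) \tag{34}$$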

Kalman filtering proceeds in two steps. First the state is ‘projected forwards’ using the dynamics model, or prior $p(x_{t+1} \mid x_t)$. Then these estimates are ‘corrected’ by new sensory data by inverting the likelihood mapping $p(y_{t+1} \mid x_{t+1})$. The Kalman filtering equations are as follows:

Projection:

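$$\hat{\mu}_{t+1} = A \mu_t + B u_t, \qquad \hat{\Sigma}_{t+1} = A \Sigma_t A^T + \Sigma_\omega \tag{35}$$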

Correction:

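$$\mu_{t+1} = \hat{\mu}_{t+1} + K (y_{t+1} - C \hat{\mu}_{t+1}) \tag{36}$$
$$\Sigma_{t+1} = (I - K C)\, \hat{\Sigma}_{t+1} \tag{37}$$
$$K = \hat{\Sigma}_{t+1} C^T \left[ C \hat{\Sigma}_{t+1} C^T + \Sigma_z \right]^{-1} \tag{38}$$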

where $\hat{\mu}_{t+1}$ and $\hat{\Sigma}_{t+1}$ are the predicted values of $\mu_{t+1}$ and $\Sigma_{t+1}$ (the estimated mean and covariance of the latent state $x_{t+1}$) before new data are observed, and $K$ is the Kalman Gain matrix, which plays a crucial role in the Kalman Filter update rules. The derivation of these update rules is relatively involved. For a concise derivation, see Appendix A of millidge2021neural. The estimated $\mu_{t+1}$ and $\Sigma_{t+1}$ found by the Kalman filter are optimal in the sense that they minimize the squared residuals $(\mu_{t+1} - \mu^*_{t+1})^2$ and $(\Sigma_{t+1} - \Sigma^*_{t+1})^2$, where $\mu^*_{t+1}$ and $\Sigma^*_{t+1}$ are the ‘true’ values, given that the assumptions of the Kalman filter (linear dynamics and Gaussian noise) are met. Kalman filtering can also be interpreted as finding the optimum of the maximum-a-posteriori estimation problem

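$$\mu_{t+1} = \underset{\mu_{t+1}}{\operatorname{argmax}}\; p(\mu_{t+1} \mid y_{1:t+1}) = \underset{\mu_{t+1}}{\operatorname{argmax}}\; p(y_{t+1} \mid \mu_{t+1})\, p(\mu_{t+1} \mid y_{1:t}) \tag{39}$$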

Since the generative model is assumed to be linear, this optimization problem becomes convex and can be solved analytically, giving the Kalman Filter solution.
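For concreteness, one projection-and-correction cycle of the standard Kalman filter can be sketched as follows. This is a minimal NumPy illustration of the textbook update; the function name `kalman_step` and its argument names are assumptions made here, and the scalar example values are arbitrary.

```python
import numpy as np

def kalman_step(mu, Sigma, u, y, A, B, C, Sigma_w, Sigma_z):
    """One Kalman filter cycle: project the estimate forwards, then correct it."""
    # Projection: push the mean and covariance through the linear dynamics.
    mu_pred = A @ mu + B @ u
    Sigma_pred = A @ Sigma @ A.T + Sigma_w
    # Correction: weight the observation residual by the Kalman gain.
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_z)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new, K

# Scalar example: static state, unit prior variance, unit observation noise.
one = np.array([[1.0]])
mu_new, Sigma_new, K = kalman_step(
    mu=np.array([0.0]), Sigma=one, u=np.array([0.0]), y=np.array([2.0]),
    A=one, B=np.array([[0.0]]), C=one,
    Sigma_w=np.array([[0.0]]), Sigma_z=one,
)
# With equal prior and observation variances the gain is 0.5, so the
# estimate moves halfway to the observation and the variance halves.
```

In this example the prior and the observation are equally reliable, so the corrected mean is 1.0 and the corrected variance is 0.5.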

Importantly, the optimization problem can also be solved iteratively using gradient descent. First, we write out the maximization problem explicitly,

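$$\mu_{t+1} = \underset{\mu_{t+1}}{\operatorname{argmax}}\; \mathcal{L}, \qquad \mathcal{L} = -\tfrac{1}{2} (y_{t+1} - C \mu_{t+1})^T \Sigma_z^{-1} (y_{t+1} - C \mu_{t+1}) - \tfrac{1}{2} (\mu_{t+1} - A \mu_t - B u_t)^T \hat{\Sigma}_{t+1}^{-1} (\mu_{t+1} - A \mu_t - B u_t) \tag{40}$$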

Given this loss function, to derive the dynamics with respect to $\mu_{t+1}$ we can take derivatives of the loss, which results in a familiar update rule.

$$\frac{d\mu_{t+1}}{dt} = C^T \Sigma_z^{-1} \epsilon_y - \hat{\Sigma}_{t+1}^{-1} \epsilon_x \tag{41}$$

where $\epsilon_y = y_{t+1} - C \mu_{t+1}$ and $\epsilon_x = \mu_{t+1} - A \mu_t - B u_t$.
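The claim that iterating on prediction errors recovers the analytic Kalman solution can be checked numerically. The sketch below is illustrative only: the matrices, noise covariances and observation are arbitrary assumed values, the prior covariance is taken to be the projected covariance of the Kalman filter, and the step size is chosen from the curvature of the objective to guarantee convergence. It performs gradient ascent on the log-posterior using precision-weighted observation and state prediction errors, and compares the result with the closed-form correction.

```python
import numpy as np

# Assumed toy model: position/velocity state, position-only observation.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
Sigma_w = 0.01 * np.eye(2)                 # process noise covariance
Sigma_z = np.array([[0.1]])                # observation noise covariance
mu_t, Sigma_t = np.array([0.0, 1.0]), np.eye(2)
u, y = np.array([0.5]), np.array([0.3])

# Analytic solution: projection followed by the Kalman correction.
mu_pred = A @ mu_t + B @ u
Sigma_pred = A @ Sigma_t @ A.T + Sigma_w
K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_z)
mu_kalman = mu_pred + K @ (y - C @ mu_pred)

# Iterative solution: follow the gradient of the log-posterior.
P_z = np.linalg.inv(Sigma_z)               # observation precision
P_x = np.linalg.inv(Sigma_pred)            # prior (state) precision
curvature = C.T @ P_z @ C + P_x            # negative Hessian of the objective
step = 1.0 / np.linalg.eigvalsh(curvature).max()

mu = mu_pred.copy()                        # start from the projected estimate
for _ in range(2000):
    eps_y = y - C @ mu                     # observation prediction error
    eps_x = mu - mu_pred                   # state prediction error
    mu = mu + step * (C.T @ P_z @ eps_y - P_x @ eps_x)
```

Because the objective is a concave quadratic, the iteration converges to a unique optimum, and `mu` matches `mu_kalman` to numerical precision.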

These derivations show that predictive coding in the linear regime is an iterative form of the same optimization problem that the Kalman filter solves analytically. In effect, we see that predictive coding reduces to Kalman Filtering in the linear case, as the optimization procedure, due to the convexity of the underlying loss function, will converge rapidly and robustly. This insight not only connects predictive coding to well established optimal filtering schemes, but also showcases the more general nature of predictive coding. Unlike Kalman filtering, predictive coding makes no assumptions of linearity and is able to perform effective perception and filtering even when the likelihood and dynamics models are potentially highly nonlinear, which is essential for systems (like the brain) operating in highly complex nonlinear worlds.

Interestingly, viewing the Kalman filtering problem through this probabilistic lens also allows us to straightforwardly derive an EM algorithm to learn the matrices $A$, $B$, and $C$ specifying the likelihood and dynamics models in the Kalman Filter. By taking gradients of the loss function with respect to these matrices, we obtain (see Appendix D for detailed derivations both of these update equations and of the Kalman Filter update rules):

(42)
(43)
(44)

We see that these dynamics take the form of simple Hebbian updates between the relevant prediction error and the estimated state, which could in theory be relatively straightforwardly implemented in neural circuitry.

4.3 Predictive Coding, Normalization, and Normalizing Flows

Computational accounts of brain function have stressed the importance of normalization, especially at the lower levels in perceptual hierarchies (carandini2012normalization). Sensory stimuli are almost always highly redundant in both space (close sensory regions are generally similar) and in time (the world typically changes smoothly, so that large parts of the sensory input are relatively constant over short timescales). A substantial portion of the brain’s low-level circuitry (i.e. at the sensory epithelia and first few stages of processing) can be understood as accounting for and removing these redundancies. For instance, amacrine and ganglion cells in the retina are instrumental in creating the well-known centre-surround structure of receptive fields in the earliest levels of visual processing. Centre-surround receptive fields can be interpreted as the outcome of spatial normalization – obtained by subtracting an expected uniform spatial background from the input. The brain also performs significant temporal normalization by subtracting away input that remains constant over time. A noteworthy example of this comes from the phenomenon of retinal stabilization (riggs1953disappearance; gerrits1970artificial; ditchburn1955eye) whereby if you stare at a pattern for sufficiently long it will fade from conscious perception. Moreover, if a visual stimulus is experimentally held at a fixed location on the retina, fading is extremely rapid, often in less than a second (coppola1996extraordinarily), and can be explained straightforwardly by predictive coding of temporal sequences (millidge2019fixational).

The ubiquity of normalization in early sensory processing speaks to the utility and applicability of predictive coding models, since prediction and normalization are inseparable. Any normalization requires an implicit prediction, albeit potentially a crude one. Indeed, the earliest neurobiological predictive coding models (srinivasan1982predictive) were deployed to model the normalization abilities of retinal ganglion cells. Spatial or temporal normalization is straightforward to implement in a predictive coding scheme. Consider the standard mean normalization whitening filter (Equation 45) or the single step autoregressive whitener (Equation 46)

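$$\hat{x} = \frac{x - \mathbb{E}[x]}{\sqrt{\mathbb{V}[x]}} \tag{45}$$
$$\hat{x}_{t+1} = \frac{x_{t+1} - x_t}{\sqrt{\mathbb{V}[x_{t+1} - x_t]}} \tag{46}$$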

Where $\mathbb{E}$ and $\mathbb{V}$ denote the expectation values and variances respectively. In these cases we see that the whitened signal $\hat{x}$ is equivalent to the prediction errors in the predictive coding framework, with straightforward predictions of $\mathbb{E}[x]$ for the standard whitening filter and $x_t$ – the value at the last timestep – for the temporal whitener. The fact that precision-weighted prediction errors are the primary quantity transmitted ‘up’ the hierarchy now becomes intuitive – the prediction errors are effectively the inputs after normalization while the precision weighting instantiates that normalization.
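The correspondence between whitening and prediction error can be made concrete with a short sketch. The example below assumes a hypothetical, slowly varying one-dimensional signal (a random walk with an offset) and normalizes each prediction error by its standard deviation; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical slowly varying signal: a random walk around an offset of 5.
x = 5.0 + np.cumsum(rng.normal(scale=0.1, size=1000))

# Mean whitening: the 'prediction' is the mean of the signal.
err_mean = x - x.mean()
white_mean = err_mean / x.std()

# Single-step autoregressive whitening: the 'prediction' is the previous value.
err_prev = x[1:] - x[:-1]
white_prev = err_prev / err_prev.std()

# Variance left unexplained by each prediction.
var_mean, var_prev = err_mean.var(), err_prev.var()
```

For a temporally smooth signal like this one, the previous value is a much better prediction than the mean, so `var_prev` is far smaller than `var_mean`: most of the redundancy has been removed before anything is passed on.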

Recently, this deep link between predictive coding and normalization has been extended by marino2020predictive, who situates predictive coding within the broader class of normalizing flows. Normalizing flows provide a general recipe for building or representing a complex distribution from a simple and tractable one, and have recently been applied with much success to challenging machine learning tasks (rezende2015variational; papamakarios2019normalizing). The central observation behind normalizing flows is the ‘change of variables’ formula. Suppose we have two values $x$ and $y$ related by an invertible and differentiable transformation $f$ such that $y = f(x)$. If we instead maintain distributions $p(x)$ and $p(y)$ then, by the change of variables formula, we know that

$$p(y) = p(x) \left| \det \frac{\partial f^{-1}}{\partial y} \right| \qquad (47)$$

where $x = f^{-1}(y)$, $\det$ denotes the determinant (taken in absolute value) and $\frac{\partial f^{-1}}{\partial y}$ denotes the Jacobian matrix of the inverse of the transform $f$. Crucially, if we can compute the determinant of this inverse Jacobian, then we can sample from or compute probabilities of $y$ given probabilities of $x$, and vice versa. This allows us to take a simple distribution such as a Gaussian, which we know how to sample from and compute probabilities with, and then map it through a complex transformation to obtain a complex distribution while retaining the mathematical tractability of the simple distribution. A normalizing flow model can be constructed from any base distribution as long as the transformations are invertible and differentiable. For instance, the transformation in Equation 45 is clearly invertible and differentiable (indeed it is affine, provided the variance is non-zero) and thus constitutes a normalizing flow. The key insight of marino2020predictive is that most such normalization schemes can be considered to be simple normalizing flow models, which allows a rich theoretical analysis as well as their unification under a simple framework. Indeed, even the complex hierarchical predictive coding models developed later constitute normalizing flows as long as the functions are invertible and differentiable. The invertibility condition is hard to maintain, however, since it ultimately requires that the synaptic weight matrix is full-rank and square, which implies that the dimensionality of an invertible predictive coding network remains the same at all levels. This rules out many architectures which have different widths at each layer, such as standard autoencoder models which possess an information bottleneck (tishby2000information).
Nevertheless, this close link between predictive coding architectures and normalizing flows allows us to draw immediate parallels between the two, and to understand predictive coding networks as progressively normalizing sensory stimuli and mapping them into spaces where they can be more easily classified or form useful representations.
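As a worked instance of the change of variables formula, the following sketch treats a one-dimensional affine map as a normalizing flow with a standard Gaussian base distribution. It is an illustration under stated assumptions: the parameters `mu` and `sigma` and all function names are hypothetical, and the inverse map is exactly the mean-normalization whitener, so its Jacobian reduces to the scalar `1 / sigma`.

```python
import math


def standard_normal_pdf(x):
    """Base density p(x): a standard Gaussian, which is easy to evaluate."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def affine_forward(x, mu, sigma):
    """Forward transform y = f(x) = mu + sigma * x (requires sigma != 0)."""
    return mu + sigma * x


def affine_inverse(y, mu, sigma):
    """Inverse transform x = f^{-1}(y) = (y - mu) / sigma, i.e. whitening."""
    return (y - mu) / sigma


def flow_density(y, mu, sigma):
    """p(y) = p(f^{-1}(y)) * |d f^{-1} / dy|, the change of variables formula."""
    x = affine_inverse(y, mu, sigma)
    inverse_jacobian = 1.0 / sigma      # derivative of f^{-1} with respect to y
    return standard_normal_pdf(x) * abs(inverse_jacobian)


def gaussian_pdf(y, mu, sigma):
    """Closed-form N(mu, sigma^2) density, used only as a reference."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (abs(sigma) * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    mu, sigma = 2.0, 0.5
    for y in (1.0, 2.0, 3.5):
        print(y, flow_density(y, mu, sigma), gaussian_pdf(y, mu, sigma))
```

The density obtained through the flow agrees with the closed-form Gaussian density, which is the sense in which whitening retains the tractability of the base distribution. In higher dimensions the scalar `1 / sigma` becomes the determinant of the inverse Jacobian, which is why the weight matrix must be square and full-rank.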

4.4 Predictive Coding as Biased Competition

The biased competition theory of cortical information processing is another highly influential theory in visual neuroscience. It proposes that cortical representations are honed through lateral inhibitory competition between neighbouring neurons in a layer. This competition is biased by top-down excitatory feedback, which preferentially excites the neurons that best match the top-down attentional preferences. The enhanced activity of these neurons helps them to inhibit their neighbours and thus ultimately win the competition (desimone1995neural; desimone1998visual; reynolds1999competitive). Biased competition theory relies heavily on lateral feedback in the brain, and is supported by much empirical neurophysiological evidence for the importance of this kind of feedback, which is largely ignored in standard predictive coding theory.

It initially appears that predictive coding and biased competition are incompatible with one another, and that they make rival predictions. Perhaps the most obvious discrepancy is that predictive coding predicts inhibitory top-down feedback while biased competition requires top-down feedback to be excitatory, which is more in line with neurophysiological evidence throughout the cortex (Footnote 12: although this analysis discounts potential intralaminar circuitry, especially interneurons, which could flip the sign of the effective connectivity). However, spratling2008reconciling showed that in the linear regime these two theories are actually mathematically identical. The superficial contrast between biased competition and predictive coding theories can be resolved by noting that they imply the same mathematical structure, which can be realized in multiple ways, by different neural circuits with different patterns of excitation and inhibition. This unification shows that neurophysiological process theories need not be literal translations of the mathematics into neural groupings. Moreover, disproving one element of a process theory (e.g. that predictive coding requires top-down inhibitory feedback) does not necessarily show that the theory is wrong, only that the process theory may need to be rewritten under a different arrangement of the mathematical structure. It is important, therefore, not to treat either the process theory or the standard form of the mathematics too literally.
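The general point, that one linear computation can be wired up with different patterns of excitation and inhibition, can be illustrated with a toy example. The sketch below is not the model of harpur1994experiments nor the derivation of spratling2008reconciling. It assumes a hypothetical weight matrix `W`, input `x` and activity vector `y`, and compares a circuit with explicit error units that receive inhibitory top-down feedback against a circuit with feedforward excitation and lateral inhibition.

```python
def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def error_unit_drive(W, x, y):
    """Circuit A: error units e = x - W^T y (top-down inhibition), then W e."""
    prediction = matvec(transpose(W), y)
    errors = [xi - pi for xi, pi in zip(x, prediction)]
    return matvec(W, errors)


def lateral_inhibition_drive(W, x, y):
    """Circuit B: feedforward excitation W x minus lateral inhibition (W W^T) y."""
    feedforward = matvec(W, x)
    lateral = matvec(matmul(W, transpose(W)), y)
    return [f - l for f, l in zip(feedforward, lateral)]


if __name__ == "__main__":
    W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]   # 2 neurons, 3 inputs (assumed)
    x = [1.0, 2.0, 3.0]
    y = [0.3, -0.6]
    print(error_unit_drive(W, x, y))
    print(lateral_inhibition_drive(W, x, y))
```

Both functions return the same drive to the layer, since $W(x - W^{T}y) = Wx - WW^{T}y$, even though the first circuit has no lateral connections and the second has no explicit error units. In the linear regime, then, the sign and routing of individual connections do not by themselves distinguish the underlying computation.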

The unification between predictive coding and biased competition is remarkably simple, requiring only a few straightforward mathematical manipulations. To show this, we first introduce a standard biased competition model (harpur1994experiments), rewritten in our standard notation to make the equivalence clear. Consider a layer of neurons with inputs . This layer also receives top-down excitatory input which is mapped through the top-down weights . The neurons then inhibit their own inputs by mapping downwards through an inhibitory weight matrix . Writing down these equations, we obtain

(48)
(49)

Where