## 1. Introduction

Predictive coding (PC) is a general theory of the function of top-down and bottom-up processing in the neocortex (Rao_Ballard99_predcodi; spratling2008predictive; friston2009predictive). According to the theory, a central function of top-down projections in the cortex, which are connections that lead from higher to lower level areas of the cortical hierarchy, is to predict neural activity. Differences between the predictions and the actual activity are computed and encoded in error neurons, which feed the errors back up the cortical hierarchy from lower to higher level areas. Learning dynamics and neural activity updates both seek to minimize the prediction errors. While further empirical testing is needed to validate PC theory, it has seen significant empirical support and is consistent with much of what we know about neuroanatomy (huang2011predictive; kok2015predictive; walsh2020evaluating), making it a potential basis for a general theory of how the neocortex learns and performs inference.

Recent computational work on PC theory has shown that, under certain conditions, a seminal model of PC, originally developed by Rao and Ballard (rao1999predictive), learns in a way that closely approximates, and in some cases exactly matches, the gradient backpropagation of errors algorithm (whittington2017approximation; millidge2020predictive; song2020can). This finding is significant from the point of view of neuroscience because PC networks learn using local, Hebbian-like learning rules, and therefore PC theory could provide a basis for an explanation of how the cortex solves the credit assignment problem. The theory may also advance neuromorphic computing by providing a path toward local learning algorithms that are compatible with neuromorphic hardware (davies2018loihi; qiao2015reconfigurable; friedmann2016demonstrating).

In this work, we modify Rao and Ballard’s (rao1999predictive) PC model with the goal of making the model more tightly constrained by biology. Doing so yields a model that better fits neurobiology and is a step toward implementing gradient-based PC algorithms in spiking neural networks. We make and test several modifications to the model. First, the weights that propagate errors in the original PC model are transposes of the forward weights. To avoid this biologically implausible transport of weights, we replace the weight transpose with a separate weight matrix. We test this matrix both untrained (random feedback) and trained using a rule from (kolen1994backpropagation). Second, the neurons in the Rao and Ballard model encode continuous-valued firing rates, and the values of activity neurons (green nodes in figure 1) are allowed to be negative. Firing rates cannot be negative in artificial spiking neurons and biological neurons, so we prevent values from going negative using a ReLU activation function and develop a method for preventing the loss of gradient information due to the ReLU function. Third, it is necessary to report negative errors, but neural activities cannot be negative. We discuss several ways errors can be encoded in spiking neurons with only non-negative firing rate values. We also compute the gradients for these error encoding schemes for firing rate neurons and use them to update the model.

We test these modifications on a supervised learning task using the MNIST dataset. We find that certain versions of the modified PC model work as well as the original PC model and backpropagation. This suggests the PC model may be a favorable basis on which to develop more neuromorphic and biologically constrained spiking neural network models that use gradient-based learning.

## 2. Gradient-Based Predictive Coding

Term | Value |
---|---|
$a_l$ | activity at level $l$ |
$e_l$ | prediction error at level $l$ |
$p_l$ | prediction at level $l$ |
$W_{l+1}$ | synaptic weights from level $l+1$ to level $l$ |

In this section, we describe the standard gradient-based PC model based on Rao and Ballard’s model (rao1999predictive). We focus on this model because, under certain conditions, it was demonstrated to closely approximate gradient backpropagation (whittington2017approximation; millidge2020predictive; song2020can). Song et al. (song2020can), for example, showed that the first non-zero weight update made to each weight matrix during supervised training is identical to the one made by backpropagation. Whittington et al. (whittington2017approximation) showed that after the network converges, the equations that define how errors are propagated through the network match those of the backpropagation algorithm. This relation is important because it provides a formal link between PC and the minimization of a global objective function through gradient descent, a method that works well in artificial neural networks (ANNs) at scale. Other predictive coding models that lack this connection have no guarantees on their ability to minimize global objectives.

There are two kinds of neurons in the PC model: activity neurons and prediction error neurons. Activity neurons represent features of the input, while error neurons encode differences between the predictions of activity neurons and their actual values. Predictions of activities at level $l$ are passed from level $l+1$ according to the following equation:

$$p_l = W_{l+1}\, a_{l+1} \tag{1}$$

A local prediction error is then computed and encoded in error neurons. The error is the difference between the activity and the prediction, which is first passed through an element-wise non-linearity $f$:

$$e_l = a_l - f(p_l) \tag{2}$$

The weight matrices are trained using this local error according to:

$$\Delta W_{l+1} = \left(f'(p_l) \odot e_l\right) a_{l+1}^T \tag{3}$$

Here, $f'$ is the derivative of the non-linearity and $\odot$ denotes element-wise multiplication. $\Delta W_{l+1}$ is multiplied by a small learning rate $\alpha$ before being added to the weights, where $0 < \alpha < 1$. This update is equivalent to following the negative gradient of the squared prediction error with respect to the weights that produced the prediction (rao1999predictive).

In addition to updating their weights, PC networks also optimize their activities. Activities minimize the same cost, i.e. the squared prediction errors at each level. At each time-step, the activities are updated according to:

$$\Delta a_l = W_l^T\left(f'(p_{l-1}) \odot e_{l-1}\right) - e_l \tag{4}$$

The first term is the bottom-up error arriving from level $l-1$, and the second term is the top-down error at level $l$. Updating activities at level $l$ is equivalent to performing gradient descent over the activities using the gradient of the squared prediction error at level $l$ and the next level $l-1$. Because one is performing gradient descent, the computed gradient must be multiplied by a small inference rate $\beta$, where $0 < \beta < 1$, before being added to the activities. One typically wants the activities to update more quickly than the weights, so $\beta$ should be larger than $\alpha$. This is because one wants activities to converge quickly to an optimal representation of the input, while learning (i.e. weight change) should occur slowly to ensure generalization ability.
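The inference and learning rules of this section can be illustrated with a minimal NumPy sketch. It is a sketch under stated assumptions rather than the authors' implementation: it assumes predictions of the form $p_l = W_{l+1} a_{l+1}$, errors $e_l = a_l - f(p_l)$ with a sigmoid $f$, and an energy equal to half the summed squared errors. The layer sizes, rates, random initialization, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def f(x):
    """Element-wise non-linearity (sigmoid, assumed for this sketch)."""
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):
    """Derivative of the sigmoid."""
    s = f(x)
    return s * (1.0 - s)

def predictions_and_errors(acts, weights):
    """weights[l] maps the activities of level l+1 to the prediction for level l."""
    preds = [W @ a_above for W, a_above in zip(weights, acts[1:])]
    errs = [a - f(p) for a, p in zip(acts[:-1], preds)]
    return preds, errs

def energy(acts, weights):
    """Half the squared prediction error, summed over levels."""
    _, errs = predictions_and_errors(acts, weights)
    return 0.5 * sum(float(e @ e) for e in errs)

def activity_step(acts, weights, beta, clamped):
    """One gradient-descent step on the activities of every unclamped level."""
    preds, errs = predictions_and_errors(acts, weights)
    new_acts = [a.copy() for a in acts]
    top = len(acts) - 1
    for l in range(len(acts)):
        if l in clamped:
            continue
        delta = np.zeros_like(acts[l])
        if l > 0:    # bottom-up error from the level below
            delta += weights[l - 1].T @ (f_prime(preds[l - 1]) * errs[l - 1])
        if l < top:  # top-down error at the same level
            delta -= errs[l]
        new_acts[l] = acts[l] + beta * delta
    return new_acts

def weight_step(acts, weights, alpha):
    """Local weight update: outer product of the scaled error and the activity above."""
    preds, errs = predictions_and_errors(acts, weights)
    return [W + alpha * np.outer(f_prime(p) * e, a_above)
            for W, p, e, a_above in zip(weights, preds, errs, acts[1:])]

# Hypothetical toy problem: level 0 holds the target, level 2 holds the input.
rng = np.random.default_rng(0)
sizes = [3, 5, 4]
weights = [rng.normal(0.0, 0.5, size=(sizes[l], sizes[l + 1])) for l in range(2)]
target = np.array([1.0, 0.0, 0.0])
x = rng.uniform(0.0, 1.0, size=sizes[2])

# Initialise the hidden level to its top-down prediction and clamp both ends.
acts = [target, f(weights[1] @ x), x]
energy_init = energy(acts, weights)
for _ in range(20):
    acts = activity_step(acts, weights, beta=0.1, clamped={0, 2})
energy_after_inference = energy(acts, weights)
weights = weight_step(acts, weights, alpha=0.01)
energy_after_learning = energy(acts, weights)
```

In this toy setting the inference rate (0.1) is larger than the learning rate (0.01), mirroring the recommendation that activities should change faster than weights.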

## 3. Predictive Coding Constrained

We now describe our modified PC network. Each modification to the network is made to make the model fit better with neurobiology. However, we wish to make these changes without losing the ability of the network to propagate gradients and perform similarly to backpropagation.

### 3.1. Predictive Coding without Weight Transport

Equation 4 shows that errors propagated back through the network require using the transpose of the forward weights. Using the transpose of forward weights to propagate errors is generally considered biologically implausible as it implies a bidirectional synapse. Transposing a weight matrix is also problematic from a neuromorphic hardware implementation perspective because it requires the duplication and synchronization of the weights on the pre-synaptic and the post-synaptic side.

The weight transport problem has been tackled in the context of gradient backpropagation for conventional deep networks using approximations such as random feedback weights (lillicrap2016random) and local loss functions (mostafa2018deep; akrout2019deep; kunin2020two). These approximations underperform slightly compared to exact gradient backpropagation, but do not require a symmetric transpose of the network weights.

We propose two modifications to PC networks that avoid this problem, both of which build on previous work on the weight transport problem for backpropagation.

In the first network model, which we call Rand-PC, we replace the transposed forward weight matrix in equation 4 with a random matrix of the same size. Replacing weight transposes with random feedback matrices has been shown to work reasonably well for approximating backpropagation in conventional deep networks (lillicrap2016random; xiao2018biologically). In the results section, we show it works nearly as well as backpropagation in our predictive coding networks.

In the second network model, which we call Kolen-Pollack PC (KP-PC), we use a method developed by Kolen and Pollack (kolen1994backpropagation), and extended by Akrout et al. (akrout2019deep), to train backward matrices so that they converge to the transposes of the forward matrices. As in Rand-PC, in KP-PC we replace the weight transpose in equation 4 with a separate backward weight matrix. This matrix is then trained with the transpose of the update to the forward weights. Kolen and Pollack build on the simple idea that, if one makes the same weight update to two matrices, the two matrices will grow more similar. However, due to numerical precision errors, the two matrices in practice eventually diverge. To solve this problem in conventional ANNs, Kolen and Pollack add a small decay term to both updates. They show that the two matrices will eventually converge to (nearly) the same matrix.

$$W \leftarrow W + \Delta W - \lambda W, \qquad B \leftarrow B + \Delta W^T - \lambda B \tag{5}$$

where $W$ is the forward matrix, $B$ is the backward matrix, $\lambda$ is a small decay rate and $\Delta W$ is the weight adjustment. Akrout et al. (akrout2019deep) point out that these weight updates are local for the forward and backward matrices if each weight matrix is connected to the same two populations of neurons. Interestingly, as can be seen from the equations in section 2 and figure 1, PC networks require just this structure, where forward weights take input from activities at level $l+1$ and output to error neurons at level $l$, while backward weights do the opposite.
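The convergence argument can be checked numerically with a short NumPy sketch. It assumes the update form described in the text (the same adjustment applied to a forward matrix and, transposed, to a backward matrix, each with a small decay). The matrix sizes, the decay value, and the random stand-in for the learning update are hypothetical; in a PC network the adjustment would come from the local weight update rule instead.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # forward matrix
B = rng.normal(size=(6, 4))  # backward matrix, initialised independently of W
decay = 0.05                 # hypothetical decay rate

gap_start = np.linalg.norm(W - B.T)

for _ in range(200):
    # Stand-in for a learning update; any adjustment shared by both matrices works.
    dW = 0.1 * rng.normal(size=W.shape)
    W = W + dW - decay * W
    B = B + dW.T - decay * B

gap_end = np.linalg.norm(W - B.T)
print(f"initial gap {gap_start:.3f}, final gap {gap_end:.2e}")
```

Because the shared adjustment cancels in the difference, the gap between the forward matrix and the transposed backward matrix shrinks by a factor of (1 - decay) at every step, regardless of what the learning update is.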

### 3.2. Biologically Constrained Activity Neurons

Equation 4 shows that PC networks allow for negative neural activity values because updates are linear. In biological neurons and artificial spiking neurons, an activation typically represents the firing rate of a neuron, which cannot be negative. Thus, in a neural modelling and neuromorphic computing setting it is desirable to have neurons with non-negative activities. This raises the question of how well PC models work under the constraint that activities are non-negative. To test this, we enforce non-negative activation values by passing the activities through a rectified linear unit (ReLU) function after each activity update.

Additionally, to help prevent a loss of gradient information due to the ReLU function, we add a bias term to the model. In equation 2, for example, the bias is added to the prediction term of the error. For the subtraction and division threshold schemes, one simply replaces the prediction with the biased prediction wherever it appears in the equations. No other changes are needed. This simple addition helps prevent activities from going negative and therefore helps prevent gradient information from being lost when negative activities are zeroed out. The bias does this by forcing predictions to be greater than zero. This ensures the network activities are initialized to non-negative values. It also helps activities remain non-negative during activity updates, through the top-down error (second term in equation 4). The top-down error pulls the activities toward the predictions at the same level. If predictions are always positive, they will always pull activities toward positive values. If activities are non-negative, the gradients they accumulate will not be lost when passed through a ReLU function.

### 3.3. Biologically Constrained Error Neurons

Like the activity neurons, error neurons, according to equation 2, can take on negative values when activity values are over-predicted. For the same reasons listed in the previous section, it is desirable to have error neurons with only positive firing rate values. This raises the question of how prediction errors could be encoded in spiking patterns with non-negative firing rates.

Rao and Ballard and several other scientists (e.g. (keller2018predictive)) hypothesize there could be two kinds of prediction error neurons in the brain. One kind spikes in response to over-predicted values (i.e. negative errors according to equation 2). The other kind spikes in response to under-predicted values (i.e. positive errors according to equation 2). This encoding scheme implies that, generally, when prediction errors are large the error neuron activities will be large, and when errors are small error neuron activity is small. These are the sort of error neurons that are assumed to exist in the standard PC model. Implementing this encoding scheme could, for example, use one set of error neurons that encodes only the positive values of the error (underpredictions) and another set of error neurons that encodes only the positive values of the negated error (overpredictions). We will call this sort of error encoding scheme subtractive separated encoding, since the subtractive error is encoded in two separate kinds of error neurons.
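A minimal NumPy sketch of subtractive separated encoding follows. It assumes the signed error is the activity minus the prediction and that each population rectifies one sign of that error. The function names and example values are hypothetical and serve only to show that two non-negative populations jointly carry the signed error.

```python
import numpy as np

def encode_separated(activity, prediction):
    """Split a signed error into two non-negative populations."""
    error = activity - prediction
    under = np.maximum(error, 0.0)   # fires when the activity was under-predicted
    over = np.maximum(-error, 0.0)   # fires when the activity was over-predicted
    return under, over

def decode_separated(under, over):
    """Recover the signed error from the two populations."""
    return under - over

activity = np.array([0.2, 0.9, 0.5])
prediction = np.array([0.6, 0.4, 0.5])
under, over = encode_separated(activity, prediction)
recovered = decode_separated(under, over)
```

At most one of the two populations is active for any given unit, and both are silent when the prediction is exact.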

An alternative way to encode errors in spike trains involves setting error neurons to have a baseline activity rate, which is dampened when activities are over-predicted and excited when they are under-predicted. When subtraction is used to dampen and excite error neurons, we call this encoding scheme the subtractive threshold encoding scheme. There may be several ways to implement this in spiking neurons. For example, one could use a constant input of current that causes the error neurons to fire at a certain rate even when errors are absent.

Here we show how a subtractive threshold scheme can be developed in firing rate neurons so weight and activity updates are equivalent to the original Rao and Ballard equations. Consider the following error neuron encoding scheme

(6) |

Here, is the minimum possible value of , and is the maximum possible value of . There will be a minimum value if activities are clamped to have a minimum value and is a squashing non-linearity (e.g. sigmoid). If there is no maximum one can replace with an approximate value to obtain a similar result. For example, we apply the sigmoid function to , and the ReLU function to the activities. If activities never go above 1.1, then and . This equation forces all error values to be between 0 and 2, with a baseline firing rate of 1. One can replace the 2 with another value which determines the maximum firing rate.

In spiking neurons, the can be thought of as a constant excitatory current source that, absent any other inputs, causes the neurons to spontaneously fire at a constant rate. The term can also be seen as a constant input current that performs some form of normalization, which is well known to be a pervasive computation throughout the brain (carandini2012normalization).

Weight and activity updates are computed by replacing the term in equation 2 and 4 with , since . With these replacements, updates remain local (since they only depend on the local errors, local activities, and constants) and are equivalent to the original equations.

We note that these two error encoding schemes with biological or spiking neurons can be obtained as special cases of population decoding (dayan2001theoretical), which can be conveniently realized with e.g. the neural engineering framework (eliasmith2003neural). Our focus on these special cases is due to their efficiency, as they require one or two neurons per encoded error value, compared to a population of neurons per error value in the general case.

A third way to encode mismatches between predictions and activities is an encoding scheme that involves dividing the activities by the predictions (or vice versa). Spratling (spratling2008predictive), for example, developed a firing rate model of predictive coding which uses a division error term that divides activities (element-wise) by the predictions. His model was able to replicate some neurophysiological data, including fine-grained calcium imaging data that seemed to show the existence of neurons in the mouse primary visual cortex that were sensitive to mismatches between actual and predicted visual flow (spratling2019fitting). We call encoding schemes of this form division mismatch encoding.

Spratling’s particular model is not formulated to minimize a global loss using its gradients w.r.t. weights and activities. We are interested in working within the framework of gradient descent as it has proven effective in large neural networks. We thus present an alternative firing rate model with division mismatch encoding that maintains gradient updates on activities and errors. We compute values of the mismatch neurons as follows:

(7) |

The small constant in equation 7 prevents division by zero and helps prevent exploding gradients. The mismatch term ranges between 0 and $\infty$ (assuming activities and predictions are positive). Over-predictions range between 0 and 1, while under-predictions range between 1 and $\infty$. We find that, although not necessary, adding the square root function improves learning slightly and makes activity updates more stable. When activities equal the predictions, the mismatch equals one, so we develop a cost function that measures the difference between the mismatch and one (see appendix). Weights and activities are then updated in proportion to the gradient of this new cost function (see appendix).
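The behaviour of a division mismatch term can be illustrated with a short NumPy sketch. Several details here are assumptions, not the paper's exact formulation: the small constant is added to both numerator and denominator (so that a perfect prediction gives exactly one), the square root is applied to the ratio, and the cost is taken to be half the squared logarithm of the mismatch. The function names, the constant's value, and the example inputs are hypothetical.

```python
import numpy as np

def division_mismatch(activity, prediction, eps=1e-3):
    """Assumed form: square root of the ratio of activity to prediction.

    eps is added to both terms, which is an assumption of this sketch.
    """
    return np.sqrt((activity + eps) / (prediction + eps))

def log_cost(mismatch):
    """Assumed cost: zero when the mismatch is 1, growing as it moves away from 1."""
    return 0.5 * np.sum(np.log(mismatch) ** 2)

prediction = np.array([0.5, 0.5, 0.5])
activity = np.array([0.1, 0.5, 0.9])  # over-predicted, exact, under-predicted

mismatch = division_mismatch(activity, prediction)
cost = log_cost(mismatch)
cost_when_exact = log_cost(division_mismatch(prediction, prediction))
```

With these inputs the first mismatch value falls below one (over-prediction), the second equals one, and the third exceeds one (under-prediction), matching the ranges described above.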

## 4. Results

Predictive coding networks can be trained for both supervised and self-supervised learning tasks. In self-supervised learning tasks, the bottom-level activities (level 0 in figure 1) are set to the data (e.g. image), while the activities of the top level (level 4 in figure 1) are set to some constant (e.g. 1). Alternatively, the top level can be removed entirely. The network weights can then be trained online after each update to the activities, or a single weight update can occur after the network activities converge (bogacz2017tutorial).

In what follows, we show the results of training our models using supervised methods, which were developed in (whittington2017approximation). In supervised learning, the bottom-level activities of the PC network are clamped to the target, while the top-level activities are clamped to the input data. Activities are initialized by propagating predictions down level by level to level 0. Activities of hidden layers are set to the predictions at each level, passed through the non-linearity. After initialization, activities are optimized. Weight updates can occur online or after the activities reach convergence (see Algorithm 1). We found activity convergence is slow for MNIST, so instead of waiting for convergence we update weights after activity updates. For Fashion-MNIST, the activities converge more quickly, so we update weights after activity updates. When tested, the activities at the top level are clamped to the test image, then predictions propagate down to the output layer (level 0), where the output is compared to the target.

Inference rates are set to 0.1 for MNIST and 0.025 for Fashion-MNIST. Learning rates are set to 0.001. Adam optimizers are used to train all weight matrices. All models use a network with fully connected layers of sizes 784-300-300-10. We train the networks to classify images in the MNIST and Fashion-MNIST datasets. Each dataset consists of gray-scale images of size 28x28. There are 60,000 training images and 10,000 test images in each dataset. We show the classification accuracies on the test sets below.

Code for the models can be found here: https://github.com/nalonso2/Tightening-the-Biological-Constraints-on-Gradient-Based-Predictive-Coding

### 4.1. PC with Separate Feedback Weights

We begin by testing how well PC networks train with the separate feedback matrices discussed in section 3.1. Here all networks use sigmoid activation functions. Rand-PC uses a fixed, random feedback matrix, while KP-PC uses a weight matrix trained using the rules discussed in section 3.1. The test accuracy, averaged over three runs, is shown after each epoch of training, starting after the first training epoch.

Data | Backprop | PC | KP-PC | Rand-PC | PC w/ Div | Rand-PC w/ Div |
---|---|---|---|---|---|---|
MNIST | | | | | | |
Fashion-MNIST | | | | | | |

Table 1: Mean (and standard deviation) of validation errors (%) for PC models with sigmoid activations. Three different seeds of each model were trained. Values shown are the means and standard deviations of the validation errors of the last three epochs across training runs.

We find the standard PC network, which uses the weight transpose for feedback, performs as well as backpropagation, which replicates the findings of (whittington2017approximation). The KP-PC network also performs as well as backpropagation, while the Rand-PC network only does slightly worse (see table 1). All models achieve a mean accuracy within a standard deviation of . Similar results are found with fashion-MNIST, where these models’ mean test errors were within two standard deviations of backpropagation.

### 4.2. PC with Constrained Activity Rates

Next, we test how well PC networks work under the constraint that activity neurons can only take positive firing rate values. As mentioned above, because firing rates are updated linearly, activity values can become negative even when initialized to be positive. It is possible then, that preventing activities from being negative (by passing them through a ReLU function after each update) may erase gradient information useful for credit assignment, and this may consequently hurt performance. Here we test how severe this potential loss of gradient information is when using different activation functions. All models have sigmoid activations at the output layer, but some models use sigmoid functions at hidden layers while others use tanh functions at hidden layers.

Figure 3 shows that constraining activities to be non-negative does not negatively affect the performance of the network when sigmoid activation functions are used. However, there is a small but significant drop in performance when Tanh non-linearities are used. When a small bias is added, as specified in section 3.2, the network using Tanh non-linearities sees no drop in performance.

There are a couple of reasons why we see this pattern. When the network is initialized, the activities are set to the predictions, which are passed through the non-linearity (see Algorithm 1). The same process is used to generate predictions at test time. Because sigmoid maps all values to positive numbers, the network, when tested, will have activities with only positive values. Thus, when ReLUs are applied, no activities are affected. This is not true when Tanh is applied without a bias, which will map some numbers to negative values. However, when a bias is added, the Tanh network will not produce negative predictions.
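The difference between the two non-linearities at initialization can be seen in a small NumPy sketch. The random pre-activations below are hypothetical stand-ins for the weighted inputs a layer would receive; the sketch only counts how many initial activities a ReLU clamp would alter under each non-linearity, without any bias term.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
pre_activations = rng.normal(0.0, 2.0, size=1000)  # hypothetical weighted inputs

sigmoid_preds = sigmoid(pre_activations)
tanh_preds = np.tanh(pre_activations)

# Number of initial activities that a ReLU clamp would change.
sigmoid_changed = np.count_nonzero(relu(sigmoid_preds) != sigmoid_preds)
tanh_changed = np.count_nonzero(relu(tanh_preds) != tanh_preds)
print(sigmoid_changed, tanh_changed)
```

Sigmoid outputs are always positive, so the clamp leaves them untouched, whereas every negative tanh output is set to zero.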

However, this does not explain why learning in the sigmoid network without bias is not affected by constrained activity values, since updates to the activities that occur after the initialization can still push the activities to negative values. If an update pushed activity values to be negative, then the negative values will be set to zero and the gradient information they encode will be lost. This loss of gradient information will presumably negatively affect learning.

We find through experimentation, however, that the top-down error term (the second term in equation 4) prevents activities from going too far outside the range of the predictions. When predictions are passed through a sigmoid, for example, the top-down error will pull the activities toward the prediction, which is some positive number between zero and one. This generally prevents activities from going below zero. The same is not true of predictions passed through Tanh without a bias, which will not always pull activities toward positive values. In this case, useful gradient information will be lost when negative activities are clamped to zero. A bias term is thus necessary in cases where non-linearities allow for negative predictions, but may not be necessary for non-linearities that map all input to non-negative numbers.

### 4.3. Subtraction versus Division Errors

In section 3.3, we outlined three error neuron encoding schemes: subtractive separated encoding, subtractive threshold encoding, and division mismatch encoding. We also showed how to compute the gradients for the new encoding schemes. As explained above, the two subtractive error encoding schemes will look quite different in spiking neurons. However, we showed that both of these encoding schemes are mathematically equivalent in the firing rate model, and that the subtractive separated encoding scheme works as well as backpropagation. Here we test how well the division mismatch encoding scheme compares to backpropagation.

Positive activity values are required for the division encoding scheme (due to the square-root and logarithm in equations 7 and 8), so we set all models to have positive activity values. We apply sigmoid activation at hidden layers. We still find that adding a small bias is necessary to prevent loss of gradient information, so we add a small bias.

We can see in figure 4 that the PC models that use division-based errors produce results comparable to backpropagation. The division mismatch model that uses true gradients achieves a mean test error of while the division model with random feedback achieves a mean test error of (see table 1), which is only slightly worse than the performance of backpropagation, which had a mean test error of . The results are similar for Fashion-MNIST, where both division mismatch models produce mean test errors that differ by less than one percent from the mean test error of backpropagation (see table 1).

## 5. Discussion

In this paper, we showed that a more biologically constrained version of Rao and Ballard’s (rao1999predictive) seminal model of predictive coding performed similarly to backpropagation on supervised learning tasks using MNIST data. We found this to be true under constraints where 1) separate feedback weights were used to propagate errors, 2) activity values were prevented from going negative, and 3) error neuron activities were prevented from going negative using either division or subtraction based encoding schemes. We also showed how the gradients for the new encoding schemes could be computed and incorporated into the model.

These results suggest that it is likely possible for more biologically constrained models of gradient-based predictive coding to be built using spiking neural networks. We computed the gradients for the division and subtraction threshold error encoding schemes, which prevent error neurons from having negative activity rates. These equations can potentially be used as a basis for forming new equations for spiking neuron models. We also discussed how separated subtraction error encoding could be implemented in spiking neurons. Although spiking models of PC have been previously demonstrated (e.g. (wacongne2012neuronal)), as far as we can find, spiking neural models of PC that utilize gradient-based inference and learning have yet to be developed.

Of course, in spiking neurons the true gradients cannot be computed because the step functions used as non-linearities are non-differentiable. However, surrogate gradient methods (neftci2019surrogate) used to approximate gradients in spiking neural networks can naturally be incorporated into predictive coding networks. Such surrogate-gradient spiking PC models could help further develop the empirical hypothesis that predictive coding is the general algorithm the brain uses to solve the credit assignment problem. Additionally, such networks could lead to useful local-learning algorithms that are compatible with neuromorphic hardware.

It is still an open question whether the cortex is performing some form of predictive coding. There is good evidence that top-down connections in the cortex do propagate predictions (huang2011predictive; kok2015predictive; walsh2020evaluating). However, the hypothesis that the cortex encodes prediction errors widely in specialized error neurons (which is a key implication of PC) is not yet widely accepted within neuroscience. There is much data that is consistent with the hypothesis that such error neurons exist in the cortex, but none of it is particularly conclusive (for recent review see (walsh2020evaluating)).

One reason why it has been difficult to locate error neurons is that different error encoding schemes yield different empirical predictions (walsh2020evaluating). However, some progress is being made. Spratling (spratling2019fitting), for example, recently found that division-based error encoding schemes better fit certain neurophysiological data than the subtractive encoding of the Rao-Ballard model. Spratling did not compare the physiological data to a subtractive threshold encoding scheme, like the one we proposed here (equation 6), so it is unclear whether division error encoding fits with the data better than subtractive schemes generally or only with a particular type of subtractive encoding present in the Rao-Ballard model. Nonetheless, this study shows how we can begin to build evidence in favor of one hypothesis over another.

Our model also illustrates that gradient-based PC is compatible with multiple different kinds of error encoding schemes. So the general hypothesis that some form of gradient-based PC is utilized by the cortex does not depend on there existing a subtractive rather than divisive error encoding. It will, instead, depend on the way the errors are propagated through the cortex and used to affect neural activity and learning. It may be that errors and weights are updated using some other optimization method (e.g. (spratling2009unsupervised)).

More empirical and computational work will be needed to settle these debates. We hope, however, that the work presented here provides new avenues to further develop and test neural models of gradient-based predictive coding and to further develop useful brain-inspired learning algorithms.

## 6. Conclusion

PC is a theory that is of interest to cognitive science, neuroscience, and engineering. For one, it is able to explain a wide array of neurophysiological and anatomical data. PC also has the potential to provide a path toward understanding how the brain solves the credit assignment problem. We also suggested that the local learning rules used within PC models could lead to useful learning algorithms that are compatible with the constraints of neuromorphic hardware. In this paper, we showed that the standard gradient-based PC model can learn and perform inference well under tighter biological constraints. This further supports the position that PC can be developed in a way that is useful for both neuroscience and neuromorphic computing, and marks a path along which PC can be further developed.

###### Acknowledgements.

This work was supported by the National Science Foundation under grants 1652159 and 1823366 (EN).

## Appendix A Research Methods

### A.1. Division Encoding Cost Function

The division encoding error is equal to . Under this encoding scheme, when . The goal then is to update weights and activities to reduce the difference between and 1. To do so, we use the following cost function:

(8) |

When the division encoding error equals 1, its logarithm equals 0, so the cost at that level equals 0. Deviations from 1 cause the cost to increase. We use the log here, instead of simply subtracting the error from 1, because it simplifies the computation of the gradients.

#### A.1.1. Weight Updates for Division Encoding

Here we derive the gradients of C at level with respect to the forward weights at level :

Let's call the term inside the parentheses , such that . Now we need to compute

, which can be decomposed using the chain rule as follows.

Each term can then be computed individually.

Now if we combine terms we get the following weight update

(9) |

We can see this update is the outer product of pre-synaptic and post-synaptic information. In particular, the pre-synaptic activity is multiplied by the error neuron activities, which are first passed through a non-linearity (logarithm) and multiplied by information about the predictions at the same level.

#### A.1.2. Activity Updates for Division Encoding

Here we derive the gradients of the cost at level w.r.t. the activities. Like original equation, the gradients w.r.t. are derived from the cost at the same level and the next level (i.e. level and ). The gradient of w.r.t can be decomposed using the chain rule as follows.

We already computed all of these terms in the last section except for . The gradient is just equal to . This gives us the bottom-up error (the first term in equation 4) for the activity update :

Now we need to compute . This term can be decomposed using the chain rule as follows.

We saw in the last section that is equal to . is . With these terms we can now compute the top-down error for the activity update (second term in equation 4):

Combining the top-down and bottom-up errors gives the activity updates under the division encoding scheme:

(10) |
