Cortico-cerebellar networks as decoupling neural interfaces

by Joseph Pemberton et al.
University of Bristol

The brain solves the credit assignment problem remarkably well. For credit to be assigned across neural networks they must, in principle, wait for specific neural computations to finish. How the brain deals with this inherent locking problem has remained unclear. Deep learning methods suffer from similar locking constraints both in the forward and feedback phases. Recently, decoupled neural interfaces (DNIs) were introduced as a solution to the forward and feedback locking problems in deep networks. Here we propose that a specialised brain region, the cerebellum, helps the cerebral cortex solve similar locking problems akin to DNIs. To demonstrate the potential of this framework we introduce a systems-level model in which a recurrent cortical network receives online temporal feedback predictions from a cerebellar module. We test this cortico-cerebellar recurrent neural network (ccRNN) model on a number of sensorimotor (line and digit drawing) and cognitive tasks (pattern recognition and caption generation) that have been shown to be cerebellar-dependent. In all tasks, we observe that the ccRNN facilitates learning while reducing ataxia-like behaviours, consistent with classical experimental observations. Moreover, our model also explains recent behavioural and neuronal observations while making several testable predictions across multiple levels. Overall, our work offers a novel perspective on the cerebellum as a brain-wide decoupling machine for efficient credit assignment and opens a new avenue between deep learning and neuroscience.




1 Introduction

Efficient credit assignment in the brain is a critical part of learning. However, how the brain solves the credit assignment problem remains a mystery. One of the central issues of credit assignment across multiple stages of processing is the need to wait for later stages to finish their computation before learning can take place. In deep artificial neural networks, where processing is divided into a forward (i.e. prediction) and a feedback (i.e. error backpropagation) phase, these constraints are explicit (rumelhart1986learning; schmidhuber1990networks; lee2015difference; marblestone2016toward; jaderberg2017decoupled). Given hidden activity at a certain layer or timestep, network computations at subsequent layers/timesteps must first be completed before feedback gradients are available and network parameters can be updated. Not only does this introduce computational inefficiency, in particular for the temporal case where backpropagation through several timesteps is required, but it is also considered biologically unrealistic (lillicrap2019backpropagation). Recently, a framework was introduced to decouple forward and feedback processing in artificial neural networks: decoupled neural interfaces (DNI; jaderberg2017decoupled), which are related to earlier work on using network critics to train neural networks (schmidhuber1990networks). In DNI, a connected but distinct neural network provides the main network with predicted forward activity or feedback gradients, thereby alleviating the locking problems.

Here, we propose that a specialised brain area, the cerebellum, performs a similar role in the brain. In the classical view the cerebellum is key for fine motor control and learning, constructing internal models of behaviour (Marr1969; ALBUS197125; raymond2018computational; wolpert1998internal; Miall1993). More recently, however, the idea that the cerebellum is also involved in cognition has gained significant traction (Schmahmann2019; Wagner2020; Brissenden2019). An increasing body of behavioural, anatomical and imaging studies points to a role of the cerebellum in cognition in both human and non-human primates (Guell2015; Brissenden2019; Schmahmann2019; Guell2018). In particular, cerebellar impairments have been observed in language (Guell2015), working memory (deverett2019cerebellar), planning (baker1996neural), and other modalities (fiez1992impaired). Coupled with its notoriously uniform structure, these observations suggest that the cerebellum implements a universal function across the brain (Marr1969; ALBUS197125; raymond2018computational; Diedrichsen2019). Moreover, experimental studies looking at cortico-cerebellar interactions have directly demonstrated that cerebellar output is crucial for the development and maintenance of neocortical states (gao2018cortico; deverett2019cerebellar; chabrol2019cerebellar). However, to the best of our knowledge, no computational formulation exists which explicitly describes the function of such interactions between the cerebellum and cortical areas.

With this in mind, and inspired by deep learning DNIs, we introduce a systems-level model of cortico-cerebellar loops. In this model, and consistent with the universal role of the cerebellum, the cerebellum serves to break the spatio-temporal locks inherent to both feedforward and feedback information processing in the brain, akin to DNI. In particular, we posit that the forward internal model hypothesis of the cerebellum is equivalent to DNI-mediated unlocking of feedback gradients. Following this view the cerebellum not only provides motor or sensory feedback estimates, but also those of any other modality encoded by a particular brain region. In this regard we introduce a cortico-cerebellar RNN (ccRNN) model which we test on sensorimotor tasks: (i) a simple line drawing task (sanes1990motor; Butcher2017; Nashef2019) and (ii) drawing tasks with temporally complex input based on the MNIST dataset, but also on (iii) a cognitive task, caption generation (Guell2015). Our results support the decoupling view of cortico-cerebellar networks by showing that they improve learning while reducing ataxia-like behaviours in a range of tasks, qualitatively consistent with a wide range of experimental observations (sanes1990motor; Guell2015; Butcher2017; Nashef2019).

2 Cerebellum as a cortical feedback prediction machine

We first describe DNIs following jaderberg2017decoupled and then establish the link to cortico-cerebellar networks. We focus on “backward” (feedback) DNI which directly facilitates learning and is the model considered in this paper, but we also consider “forward” DNI to be strongly relevant to cerebellar computation (see SM A).

Although in this paper we focus on the DNI in a temporal setting, to better explain DNI we first briefly describe the spatial case. This case not only serves as a general basis for the temporal case (i.e. one can think of a recurrent neural network as a specific instance of a feedforward network), but also highlights the possibility of placing cerebellar modules between different pathways. Assume that a feedforward neural network consists of N layers, with the ith layer (1 ≤ i ≤ N) performing a "computational step" f_i with parameters θ_i. Given input x at layer 1, the output of the network at its final layer is therefore given by f_N(f_{N-1}(… f_1(x))). Let F_i^j denote the composition of steps from layer i to layer j (inclusively). Finally, let h_i denote the (hidden) activity at layer i, so that h_i = F_1^i(x) with h_0 = x.
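The layer composition above can be made concrete in a few lines. The following is a toy NumPy sketch (layer count, sizes, weights and the tanh nonlinearity are our illustrative choices, not the paper's):

```python
import numpy as np

# Each layer is a "computational step" f_i; F(i, j, h) composes steps i..j
# (1-indexed, inclusive), so the hidden activity at layer i is h_i = F(1, i, x).
def make_layer(W):
    return lambda h: np.tanh(W @ h)

rng = np.random.default_rng(3)
layers = [make_layer(0.5 * rng.normal(size=(4, 4))) for _ in range(3)]

def F(i, j, h):
    # composition of steps from layer i to layer j
    for f in layers[i - 1:j]:
        h = f(h)
    return h

x = rng.normal(size=4)
h2 = F(1, 2, x)        # hidden activity at layer 2
out = F(1, 3, x)       # full forward pass, F_1^N(x)
```

By construction, applying the remaining steps to an intermediate activity reproduces the full forward pass: F(3, 3, h2) equals F(1, 3, x).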

We now illustrate the learning constraints of standard artificial neural networks used in deep learning. Suppose that a network is in its feedback (or backpropagation) phase, with current input-target pair (x, y). To update the layer parameters θ_i the gradient ∂L/∂θ_i is required, where L is the loss which compares the target value y against the model output ŷ = F_1^N(x) under some loss function; we then apply gradient descent on the parameters, θ_i ← θ_i − α ∂L/∂θ_i, with learning rate α. Suppose however that the network has only recently received the input and is currently only at layer i of the forward computation. In order to update the corresponding parameters of that layer, θ_i, the layer must first wait for all remaining layers to finish (f_{i+1}, …, f_N) for the loss to be computed. Only then are the various gradients of the loss backpropagated and ∂L/∂θ_i finally available. The enforced wait makes layer i "feedback locked" (we use feedback locking to cover both the "update" and "backward" locks as described in jaderberg2017decoupled) and imposes a complete dependence between learning of the layer in question and the computations in the remaining network.

The basis of DNI is to break the feedback lock by sending a copy of the hidden layer activity h_i to a separate neural network C, termed the "synthesiser" in jaderberg2017decoupled and which we propose is implemented by the cerebellum in the brain. The synthesiser then returns a synthetic gradient, an estimate of the real loss gradient with respect to the layer, i.e. C(h_i) = ĝ_i ≈ g_i, where g_i = ∂L/∂h_i. We can then update our original layer immediately according to gradient descent, effectively breaking the feedback lock with

θ_i ← θ_i − α ĝ_i (∂h_i/∂θ_i).
How can we trust the synthetic gradient to improve performance? The parameters of C are themselves learnt so as to best approximate the observed feedback gradient. That is, once the standard forward and feedback computations are performed in the rest of the network (which may take some time) and g_i is available, we minimise the objective function L_C = ||ĝ_i − g_i||². Assuming C learns a decent approximation, the original layer parameters should then move (roughly) along the direction of the true loss gradient. Consistent with the need for this approximation, a model with a fixed cerebellum (i.e. not learnt) gives poor or often no learning, indicating that cerebellar output is not simply facilitating learning via a stabilising effect (Fig. S7).
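To make the unlocking scheme concrete, here is a minimal NumPy sketch of a backward DNI for a single layer (a toy construction of ours, not the paper's implementation): a linear synthesiser is trained toward the observed gradient, while the layer it serves updates immediately from the synthetic one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear network: h = W1 @ x, y_hat = W2 @ h, L = 0.5*||y_hat - y||^2.
# A linear "synthesiser" C(h) = Wc @ h (our stand-in for the cerebellum) estimates
# g = dL/dh so that layer 1 can update without waiting for layer 2 to finish.
# All sizes and learning rates here are illustrative.
W1 = 0.1 * rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
Wc = np.zeros((4, 4))        # synthesiser starts by predicting zero feedback
alpha = 0.005                # learning rate for the "cortical" layer

def true_feedback(h, y):
    # g = dL/dh: in a locked network this only arrives after layer 2 has run
    return W2.T @ (W2 @ h - y)

x, y = rng.normal(size=3), rng.normal(size=2)
init_loss = 0.5 * np.sum((W2 @ (W1 @ x) - y) ** 2)

for _ in range(600):
    h = W1 @ x
    g_hat = Wc @ h                        # synthetic gradient: available immediately
    g = true_feedback(h, y)               # true gradient: in reality arrives later
    W1 -= alpha * np.outer(g_hat, x)      # unlocked update uses the prediction
    # train C toward the observed gradient, minimising ||C(h) - g||^2
    Wc -= np.outer(Wc @ h - g, h) / (h @ h + 1e-8)

final_loss = 0.5 * np.sum((W2 @ (W1 @ x) - y) ** 2)
```

Because the synthesiser tracks the true gradient with only a short lag, the first layer's "unlocked" updates still descend the loss.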

Temporal feedback in RNNs

Using the same terminology but replacing layer indices with time indices t, we can apply the above to recurrent neural networks (RNNs; Fig. 1a). Now the synthesiser predicts the effect of the current hidden activity h_t on future losses L_{>t} = Σ_{τ>t} L_τ, effectively providing the RNN with (approximate) temporal feedback before the sequence may even be finished. As an optimisation algorithm to obtain the true temporal feedback we use backpropagation through time (BPTT). In principle, this feedback is only available once all the future feedback is available and itself would be feedback locked. To circumvent this issue C learns using a mixture of nearby true feedback signals and a bootstrapped (self-estimation) term (Fig. 1b), where for each t we now minimise ||C(h_t) − g_t||² with target g_t = Σ_{τ=t+1}^{t+T} (∂L_τ/∂h_t) + C(h_{t+T}) (∂h_{t+T}/∂h_t) (jaderberg2017decoupled). Here the first term includes the nearby feedback signals backpropagated within some limited time horizon T, and the second term includes the bootstrapped cerebellar prediction which estimates gradient information beyond this horizon (analogous to the temporal difference algorithms used to estimate value functions in reinforcement learning). In this case the cerebellar feedback facilitates learning by enabling the RNN to learn using (predicted) future feedback signals that would otherwise be inaccessible, enabling a future-aware online learning setup. In particular, this model has already been demonstrated to thrive in the harsh but more biologically reasonable setting of reduced temporal windows for BPTT (small T; cf. jaderberg2017decoupled).
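The bootstrapped target can be checked on a toy example. The sketch below (our own construction: a scalar linear RNN, chosen so that all gradients have closed forms) verifies that a truncated feedback sum topped up with an idealised bootstrap recovers the full, feedback-locked gradient, exactly as in n-step temporal difference learning.

```python
import numpy as np

# Scalar linear RNN h_{tau+1} = a * h_tau with per-step loss L_tau = 0.5*(h_tau - y_tau)^2.
# Then dL_tau/dh_t = (h_tau - y_tau) * a**(tau - t) and dh_{t+T}/dh_t = a**T.
a = 0.9
y = np.array([0.2, -0.1, 0.4, 0.0, 0.3, -0.2])

def rollout(h0):
    hs = [h0]
    for _ in range(len(y) - 1):
        hs.append(a * hs[-1])
    return np.array(hs)

def full_gradient(hs, t):
    # exact dL_{>t}/dh_t: requires waiting for every future timestep (feedback locked)
    return sum((hs[tau] - y[tau]) * a ** (tau - t) for tau in range(t + 1, len(y)))

def bootstrapped_target(hs, t, T, bootstrap):
    # truncated feedback within horizon T plus a prediction for everything beyond:
    # g_t = sum_{tau=t+1}^{t+T} dL_tau/dh_t + C(h_{t+T}) * dh_{t+T}/dh_t
    horizon = min(t + T, len(y) - 1)
    near = sum((hs[tau] - y[tau]) * a ** (tau - t) for tau in range(t + 1, horizon + 1))
    return near + bootstrap(hs[horizon], horizon) * a ** (horizon - t)

hs = rollout(1.0)
# an idealised "cerebellum" that bootstraps with the exact future gradient
exact = lambda h, t: full_gradient(hs, t)
```

With the exact bootstrap the truncated target equals the full gradient for any horizon; a learnt synthesiser only approximates this, but inherits the same structure.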

Figure 1: Cerebellum as a cortical feedback prediction machine. (a) Learning in a given recurrent cortical area is modelled as feedback-encoding gradients (red arrow) originating from a loss or error region at the end of a task (time T). Meanwhile the cerebellum receives the current cortical activity (black arrow), which projects onto granule cells (GC; orange) and Purkinje cells (PC; orange), and sends back the predicted cortical feedback ĝ (blue arrow). Cerebellar learning is mediated by the inferior olive, which compares the estimated feedback ĝ with the observed feedback g, computing g − ĝ (see text for more details). (b) Example of the cortico-cerebellar model unfolded across three time steps. At the end of the task the cortical network receives feedback (red), which is then transmitted to the cerebellar network as cortical feedback (light blue). The cerebellum generates cortical feedback predictions (blue) given cortical activity (black), and learns using inferior olive (diamond) error signals (red arrow). Before the end of the task, cortical feedback is not readily available, thus the cerebellum learns through bootstrapping. In this case the inferior olive (diamond) compares the old cerebellar prediction with the new one to generate cerebellar learning signals (red arrow; see main text for details).

2.1 Mapping between cortico-cerebellar networks and decoupled neural interfaces

Building on the temporal feedback DNI we introduce a systems model for cortico-cerebellar computation, which we denote ccRNN (cortico-cerebellar recurrent neural network; Fig. 1). The model includes two principal components: an RNN which models a recurrent cortical network (or area, e.g. motor cortex or prefrontal cortex), and a feedforward neural network (the synthesiser) which receives RNN activity, returns predicted feedback, and which we interpret as the cerebellum. The cerebellum receives the current cortical activity through its mossy fibres; this activity then projects onto granule cells and Purkinje cells, and the predicted cortical feedback ĝ is sent back.

At the architectural level our model is in keeping with the observed anatomy of the cerebral-cerebellar circuitry. Recurrent connections are widespread in the neocortex and are considered functionally important for temporal tasks (mante2013context), whereas the cerebellum exhibits a relatively simple feedforward architecture (Marr1969; ALBUS197125; raymond2018computational). Moreover, an important defining feature of the cerebellum is its expansion at the granular layer: with roughly 50 billion granule cells (more neurons than the rest of the brain combined) there is a considerable divergence from the far less numerous input mossy fibres. To replicate this, in our experiments we include a high divergence ratio between the RNN network size and the hidden layer of the cerebellar network (see SM for network sizes). In our experiments this ratio helped the cerebellar module learn more quickly (not shown).
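A sketch of the granular-layer expansion described above (sizes, weight scales and the ReLU nonlinearity are illustrative choices of ours, not those used in the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

# A "cortical" activity vector of n_cortex units feeds a feedforward "cerebellar"
# module whose granule-cell layer expands the input by a divergence ratio,
# mirroring the mossy fibre -> granule cell expansion.
n_cortex = 30
divergence = 4                        # granule layer is 4x wider than its input
n_granule = divergence * n_cortex

W_mf_gc = rng.normal(size=(n_granule, n_cortex)) / np.sqrt(n_cortex)   # mossy fibres -> granule cells
W_gc_pc = rng.normal(size=(n_cortex, n_granule)) / np.sqrt(n_granule)  # granule cells -> Purkinje output

def cerebellum(h):
    gc = np.maximum(0.0, W_mf_gc @ h)  # expanded granule-cell representation
    return W_gc_pc @ gc                # predicted cortical feedback, same size as h

h = rng.normal(size=n_cortex)
g_hat = cerebellum(h)
```

The output is deliberately the same size as the cortical activity vector, since it stands for the predicted feedback gradient with respect to that activity.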

At the functional level, our model suggests that the cerebellum should encode a variable dependent on error signals (in particular, the error gradient with respect to neocortical activity), consistent with neural recordings showing cerebellar encoding of kinematic errors during motor learning (popa2012predictive; lang2017roles; 10.3389/fncel.2018.00524), but also of reward prediction signals in cognitive tasks (wagner2017cerebellar).

Finally, we conform to the classical view that the inferior olive is the site at which cerebellar prediction and feedback are compared to compute the cerebellar cost which drives cerebellar learning (Marr1969; ALBUS197125; ohmae2015climbing; raymond2018computational). Crucially, we also highlight the cerebellum's ability to learn in the case of delayed feedback, which is necessarily the case under this model, via its specially adapted timing rules for plasticity at the parallel fibre synapse.
2.1.1 Relationship to existing cerebellar computational models

For over half a century computational neuroscientists have posited that the feedforward circuitry and the available error signals (stemming from the inferior olive) make the cerebellum well suited for pattern recognition (Marr1969; ALBUS197125; houk1996models; hausknecht2016machine; raymond2018computational). This computational perspective is in line with our model, but which patterns should be learnt, and how cerebellar output is used by the rest of the brain, has remained unclear. The most prevalent theory points towards the cerebellum as an "internal model" of the nervous system (wolpert1998internal), which may arise in two forms: an inverse model or a forward model.

We argue that ccRNN is in fact an instance of the forward model hypothesis (see SM A for a link between DNIs and the inverse model). In the classical forward model of sensorimotor control, the cerebellum receives an efferent copy of the motor command from the motor cortex (PMID:5499516; Miall1993), together with the respective sensory feedback. With these two inputs the forward model learns to predict the sensory consequences of motor commands. Generalising to the non-motor domain, a similar predictive model can be applied to the brain activity between virtually any two brain areas (Ito2008). The prefrontal cortex and the temporo-parietal cortex, for instance, both involved in planning of cognitive behaviour and decision making, are possible cognition-associated areas where a forward "mental" model may be applied (Ito2008; Schmahmann2019; Wagner2020; Brissenden2019). Our model likewise takes neural activity from any brain area together with its associated feedback signals, and thereby provides, to the best of our knowledge, the first explicit computational implementation of the postulated brain-wide forward models. In addition, existing computational models have so far only been applied to simplistic sensorimotor tasks, whereas our model can be readily applied to a wide range of tasks, as we demonstrate below.

2.1.2 Encoding feedback gradients in the brain

Our systems level model uses BPTT to generate temporal feedback signals. We use weak forms of truncated-BPTT (i.e. with a reduced BPTT window), using only one BPTT step in some cases, which mimics a more biologically realistic setting. It is in principle possible to use models capable of generating biologically plausible feedback (lillicrap2019backpropagation; guerguiev2017towards; sacramento2018dendritic; richards2019dendritic; Payeur2020.03.30.015511; ahmad2020gait). Computational models of spatial backpropagation of gradients (i.e. across layers) suggest that gradient-encoding feedback signals must be encoded in distal dendrites (guerguiev2017towards; sacramento2018dendritic). In addition, these models demonstrate that explicit gradient feedback information is not needed, but rather that this can be reconstructed locally in dendritic microcircuits using cortical feedback (sacramento2018dendritic). The cerebellum provides feedback connections via higher order thalamocortical loops (gornati2018differentiating) that target apical dendrites of pyramidal cells in the neocortex (guo2018anterolateral; fujita2020modular; anastasiades2021mediodorsal). These cerebellar feedback loops thus provide a good substrate for driving cortical learning through cortical feedback prediction, which is consistent with our model. On the other hand bellec2019biologically have shown that temporal gradients as used in BPTT can be approximated by biologically plausible eligibility traces that transmit gradient information forward in time. Both of these solutions can, in principle, be incorporated into our framework, in which ccRNN would predict feedback activity originating from upstream brain areas (as in sacramento2018dendritic) and/or make forward predictions which are then integrated with locally computed eligibility traces (bellec2019biologically).

3 Results

We assess the effect of cerebellar feedback by comparing the ccRNN to a purely cortical RNN (cRNN), that is, one without cerebellar predicted feedback, on a variety of tasks. Tasks are chosen so as to broadly approximate conditions in which cerebellar patients have shown deficits, covering simple and then more complex sensorimotor control/planning paradigms before ending with a challenging language-based task.

Cortical recurrent neural networks are modelled as long short-term memory networks (LSTMs; hochreiter1997long), in line with previous work (costa2017cortical; wang2018prefrontal). A linear readout is added on top of the RNN to compute the final model output. The window of cortical temporal feedback is modelled using BPTT with a specific truncation window. We refer to feedback originating from the cerebellum as predicted feedback.

For the sensorimotor tasks we also test the models with different external feedback intervals. This defines how often the model has access to an external teaching signal (e.g. only every 2 timesteps), and mirrors a biologically relevant setting where visual feedback is not continually available but can only be sampled at some rate. The model is therefore trained with external feedback only at some timesteps, and should ideally generalise to the remaining timesteps. We also evaluated model performance under these testing conditions (i.e. in which external feedback was not available) to yield an ataxia score, a measure designed to quantify the irregularity of movement associated with cerebellar ataxia (trouillas1997international). Concretely, we compute the ataxia score as the total deviation of the model output from the targets provided during training as well as from those for which external feedback was unavailable:


ataxia score = Σ_{t∈S} E(y_t, ŷ_t) + Σ_{t∉S} E(y_t, ŷ_t),

where y_t and ŷ_t denote the target and model output at time t of the sequence, respectively, E is the task-associated error function (e.g. mean squared error), and S denotes the set of times at which error feedback is given. Note that the first term is simply the training error (which defines the model optimisation problem), whereas the second term quantifies the model's ability to generalise to unseen parts of the data (for example, intermediate points on a line).
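The ataxia score can be transcribed directly (here assuming mean squared error as the task-associated error function):

```python
import numpy as np

# Ataxia score: task error summed over timesteps that received external feedback
# (the training error) plus timesteps that did not (the generalisation term).
def ataxia_score(targets, outputs, feedback_times):
    err = lambda y, y_hat: np.mean((y - y_hat) ** 2)   # assumed error function
    fb = set(feedback_times)
    train = sum(err(targets[t], outputs[t]) for t in range(len(targets)) if t in fb)
    gen = sum(err(targets[t], outputs[t]) for t in range(len(targets)) if t not in fb)
    return train + gen

# A perfectly straight 10-step line scored against a slightly oscillatory output.
T = 10
targets = np.linspace(0, 1, T)[:, None] * np.array([1.0, 1.0])
wobble = targets + 0.1 * np.sin(np.arange(T))[:, None]
```

A model that fits the feedback timesteps exactly but oscillates in between (as the cRNN does) thus picks up a nonzero score through the second term.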

3.1 Simple sensorimotor task: line drawing

Figure 2: Line drawing task. (a) Schematic of the task. One of six cues is given at the first timestep and the network must learn to draw a straight line to the respective target. Sparse feedback is used so that an explicit target is only provided at every other timestep during training. (b) Learning curves for the cortical RNN (cRNN; grey) and cortico-cerebellar RNN (ccRNN; orange). These networks use a short window of cortical temporal feedback (cf. d), increasing the difficulty of learning in the RNN. Trajectories produced by the models towards the 6 targets are given as insets. (c) Ataxia score for both models at the end of training. (d) Mean squared error of the ccRNN normalised to that of the cRNN over different cortical temporal feedback windows (reported as a percentage of total task length); the arrow indicates the level of feedback used in (b,c). All experiments used 10 seeds.

Inspired by classical sensorimotor studies of the cerebellum we first test a simple line drawing task (sanes1990motor; Butcher2017; Nashef2019), in which, given a cue at the first timestep, the network needs to produce a straight line towards one of 6 targets over 10 timesteps (Fig. 2a; see SM for more details). In order to model a more biologically realistic setting, the external feedback (i.e. how far the model output is from a straight line) is only provided at every other timestep (sparse external feedback). Although the cRNN model learns to perform the task, it does so more slowly and with higher variability (Fig. 2b), in line with cerebellar experiments (sanes1990motor; Butcher2017; Nashef2019). Importantly, we also observe differences in the model output (Fig. 2b). Even though the models were only trained with sparse external feedback, the ccRNN generalises to a near-perfect straight line whereas the cRNN produces a much more irregular, oscillatory-like output (Fig. 2b), leading to an increased ataxia score (Fig. 2c). Moreover, our model predicts that the cerebellar module is most beneficial when the main cortical recurrent network struggles to learn the task on its own (i.e. when using a short temporal feedback window; Fig. 2d). This is consistent with the observation that the cerebellum is more important for correcting larger errors during a motor task (criscimagna2010size). In addition to this prediction, our model makes several other predictions from the systems level down to the cellular level. Below we highlight some of these predictions using the line drawing task and relate them to the experimental literature where possible.
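Our reading of the task setup can be sketched as follows (the target geometry on the unit circle and the feedback schedule are assumptions based on the description above, not the paper's exact construction):

```python
import numpy as np

# Six angular cues, each mapping to a straight 10-step trajectory towards a
# target, with external feedback available only at every other timestep.
n_cues, T = 6, 10
angles = 2 * np.pi * np.arange(n_cues) / n_cues
targets_xy = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def trajectory(cue):
    # straight line from the origin to the cued target, one point per timestep
    steps = np.linspace(0, 1, T)[:, None]
    return steps * targets_xy[cue]

feedback_mask = np.arange(T) % 2 == 1   # sparse feedback: every other timestep

traj = trajectory(0)
```

The untrained-upon timesteps (where `feedback_mask` is False) are exactly those probed by the generalisation term of the ataxia score.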

Facilitation of learning depends on feedback interval

The benefits provided by the ccRNN model are stronger at intermediate feedback intervals (Fig. 3a). In particular, we find that when the external feedback is given continuously there are only minor benefits of having the cerebellar module, but if the external feedback interval is increased then having a predictive cerebellar module becomes more important until the feedback interval is just too large for the models to be able to learn the task (Fig. 3a). We observe these differences because the feedback predicted by the cerebellum is only useful once the external feedback is relatively infrequent. Although this is broadly in line with the important role of feedback in cerebellar-mediated learning (sanes1990motor; diener1992pathophysiology; honda2012adaptation) this exact prediction remains to be tested.

Figure 3: Model predictions using the simple line drawing task. (a) Cerebellar feedback facilitates learning across a range of external feedback intervals. (b) Cortico-cerebellar coupling measured as pairwise correlations over the first 100 epochs of learning. Insets show the change in the distribution of correlations at the respective epochs (left: early epochs; right: later epochs). (c) The importance of cerebellar feedback reduces over learning, as revealed by an ablation study. Top: learning curves for a few example points of ablation; vertical lines represent points of ablation. Bottom: summary plot of the training error (ccRNN normalised to cRNN) across the epoch of ablation. Error bars represent standard error of the mean using 10 seeds.

Cortico-cerebellar coupling is training-dependent

As a measure of cortico-cerebellar coupling we calculated the pairwise correlations between the activity of each neuron in the main cortical RNN and the activity of each hidden neuron (granule cell) in the cerebellar module (see SM). Our model reveals two distinct phases of cortico-cerebellar coupling during learning (Fig. 3b). During the fast phase of learning we observe a noticeable rise in the average cortico-cerebellar coupling (in line with Wagner2020). In addition, our model predicts that as training continues the population correlation should gradually decrease. In the model these changes in coupling are explained by the fact that initially the cerebellar module needs to output relatively large feedback predictions due to the high errors, but as learning progresses the errors (and feedback) become gradually smaller, making the cerebellar module less correlated with the activity of the main cortical network.
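One reasonable implementation of this coupling measure is sketched below (our own construction; the paper's exact procedure is described in its SM):

```python
import numpy as np

rng = np.random.default_rng(2)

# Coupling = mean absolute pairwise correlation between each cortical unit's
# activity trace and each cerebellar (granule) unit's trace across time.
def coupling(cortex_acts, cereb_acts):
    # np.corrcoef treats rows as variables; the off-diagonal block holds the
    # cortex-x-cerebellum pairwise correlations
    n = cortex_acts.shape[0]
    c = np.corrcoef(np.vstack([cortex_acts, cereb_acts]))
    return np.mean(np.abs(c[:n, n:]))

T = 200
cortex = rng.normal(size=(5, T))
driven = 0.8 * cortex[:3] + 0.2 * rng.normal(size=(3, T))   # cerebellar units tracking cortex
independent = rng.normal(size=(3, T))                        # unrelated activity
```

Cerebellar units driven by cortical activity (as when large feedback must be predicted early in learning) score high; once their output decouples from cortical fluctuations, the measure falls.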

Importance of cerebellar feedback reduces over learning

The decrease in coupling over learning predicts that the importance of cerebellar feedback should become weaker over learning. To test this idea, and to determine which phases of learning are most crucial for the cerebellar-mediated facilitation of learning, we completely ablated the cerebellar module at different points during learning (Fig. 3c). Although these results are consistent with classical observations made in cerebellar patients (sanes1990motor; Butcher2017; Nashef2019), this ablation study predicts a nonlinear relationship between ablation time and performance impairment, which has not been tested experimentally. Our model predicts that ablating the cerebellum after learning has started actually has a more detrimental impact than having no cerebellum to start with (Fig. 3c). This happens because when a cerebellar module is present the main cortical network starts learning by relying partially on the feedback predicted by the cerebellum, so that if this component is suddenly removed the learning trajectory of the main network is perturbed and needs to be readjusted. After this first critical phase of learning the cerebellum becomes gradually less important. These results echo the existing literature (doyon2003distinct; galliano2013silencing) in that, though shared cortico-cerebellar dynamics might emerge during the acquisition of a novel motor sequence, these dynamics can become less interdependent once knowledge becomes consolidated.

3.2 Advanced sensorimotor tasks

Figure 4: Advanced sensorimotor tasks based on the MNIST dataset. (a) Example model output given an MNIST digit as input (colour-coded by digit) from a trained ccRNN model for the line drawing sequential MNIST task (ld-seqMNIST, top) and the digit drawing sequential MNIST task (dd-seqMNIST, bottom); underlying grey dots represent the target output. Sparse feedback is used so that an explicit target is only provided every 4 timesteps during training. (b) Training mean squared error. (c) Ataxia score for both models at the end of training. (d) ccRNN error (normalised to cRNN error) over different cortical temporal feedback windows (reported as a percentage of total task length); the arrow indicates the level of feedback used in (a-c). Error bars represent standard error of the mean using 3 seeds.

To test how the previous line drawing task generalises to a more realistic setting with continuous input, we introduce two sensorimotor tasks which build on the classical MNIST dataset (lecun2010mnist). Here we present the model with a given MNIST image sequentially, row by row (28 rows, each with 28 pixels), whilst training the model to simultaneously draw either (i) a digit-specific straight line (ld-seqMNIST) or (ii) a digit template (dd-seqMNIST) (Fig. 4a). The latter case is inspired directly by known cases of cerebellar agraphia (de2011cerebellar). As before we employ sparse external feedback at a certain interval (every 4 timesteps).

As in the line drawing task, the ccRNN learns faster than the cRNN while also showing a reduced ataxia score (Fig. 4b, c). The positive effect of cerebellar feedback is less apparent when the model is trained to write digits, possibly due to the less predictable, non-linear nature of the target output and consequently of the external feedback. As with the line drawing task, we find that predicted feedback is most valuable where strong constraints on BPTT are enforced (Figs. 4d, S3). We observed a similar dependency to the simple line drawing task on the level of external feedback (Fig. S4), cortico-cerebellar coupling (Fig. S5), and ablations (Fig. S6). However, in contrast to the simple line drawing task, cerebellar feedback remains relatively important even in later stages of learning. This is due to the non-zero error that persists even after convergence in these tasks.

Finally, we also considered how the ccRNN performs on the more standard sequential MNIST task, in which the digit must be classified at the end of the sequence (le2015simple). In line with the regression tasks, we find significantly faster learning (higher accuracy after about 10 epochs) when using the ccRNN model (Fig. S2). This is consistent with a role of the cerebellum in decision making and sensory discrimination tasks (gao2018cortico; deverett2018cerebellar; king2019functional).

3.3 Cognitive task: Caption generation

Figure 5: Caption generation task. (a) Model schematic with CNN (blue), cortical RNN (LSTM; grey) and cerebellar module (orange). The CNN is pretrained on an image dataset, while the RNN is trained to predict the next word in a caption. (b) Learning curves in bits per word (BPW). (c) Sample test image with associated model-produced captions (colour-coded as in (b); black denotes the gold standard caption). More examples are given in the SM. Error bars represent standard error of the mean using 5 seeds.

We emphasise that our framework does not only apply to sensorimotor tasks, but should generalise to virtually any task and learning paradigm (supervised, unsupervised and reinforcement learning). To demonstrate this, and inspired by cognitive tasks in which cerebellar patients have shown deficits (gebhart2002role), we test our models on a caption generation task. In this task the network learns directly from the data (i.e. unsupervised learning) to generate a textual description of a given image. All models have two components: a pretrained convolutional neural network (CNN) to extract a lower dimensional representation of the image, and an RNN to learn a simple model of language given the visual representation from the CNN (Fig. 5a).


We use a standard dataset (ILSVRC-2012-CLS; ILSVRC15) and the networks are trained to maximise the likelihood of each word given an image (see SM for more details). We find that, in this more challenging task as well, the cerebellar module learns feedback predictions good enough to improve learning compared with cRNN (Fig. 5b). Interestingly, the captions generated on test images are qualitatively more accurate for the ccRNN model (Figs. 5c, S8), suggesting that ccRNN better captures the contextual nature of this task, in line with experimental observations (gebhart2002role). Moreover, ccRNN also learns a better model of language as assessed by standard language metrics (Table S2).

4 Conclusions and discussion

We have introduced a systems-level model (jaderberg2017decoupled) of cortico-cerebellar function, in which the cerebellum's role is to provide the neocortex with predicted temporal feedback. We propose that an existing deep learning framework can mimic the function of one of the largest projection pathways in the brain, the cortico-cerebellar networks. This systems-level framework abides by the classical notion of the cerebellum as a pattern-learning multi-layer perceptron in the context of the forward model hypothesis. In particular, our model suggests that the cerebellum encodes future errors (or, analogously, rewards) which are then transmitted to the neocortex for efficient learning. It thus proposes how the brain might learn efficiently from future teaching signals: a key requirement of temporal credit assignment.

One of the advantages of our model (ccRNN) is that it can be readily applied to a wide range of tasks as a systems-level model of cortico-cerebellar interactions. Indeed, our model makes explicit the concept of a “cognitive error”, which connects directly to growing empirical evidence for a cerebellar role in working memory, planning, language and beyond (alexander2012cognitive; king2019functional). However, to make more specific predictions at the cellular, sub-cellular and cell-type level, an important next step is to model cortical feedback errors in a more biologically plausible fashion (sacramento2018dendritic; richards2019deep).

Our model makes a number of predictions (see above). For example, it predicts that cerebellar involvement is most beneficial when neocortical temporal learning mechanisms do not extend long enough for the task at hand. This offers an explanation for the often detrimental impact of cerebellar impairments on motor learning tasks, which are often temporally challenging. It is important to highlight that our results do not predict that cerebellar impairments will necessarily lead to an inability to learn, but merely a slower learning curve and perhaps a lower performance threshold. There are, however, specific conditions under which the cerebellum is optimally placed to facilitate learning according to our model, namely the regularity of the external feedback, the learning capacity of cortical networks, the difficulty of the task and the exact phase of learning. In addition, our model makes predictions about cortico-cerebellar coupling, predicting a steady decrease of the population correlation as learning progresses.

Our model puts forward a solution for how the brain deals with delayed feedback signals, by relying on the cerebellum to predict future feedback signals. A key experiment which could be used to test our model would be to demonstrate that brain areas responsible for computing prediction errors are not by themselves a requisite for plasticity. The cerebellum by itself should be sufficient for cortical learning, provided that it has learnt a good predictive model of future feedback.

There are a number of interesting directions to take forward with this model. For example, in the present study we do not explicitly model the long-range projections via the thalamus and pons that mediate cortico-cerebellar interactions (guo2018anterolateral; fujita2020modular; anastasiades2021mediodorsal); for simplicity we have assumed direct connectivity. We predict that introducing these intermediate, low-dimensional structures serves two potentially important functions: (i) forcing the cerebellum to focus on learning only the most important feedback signals and (ii) gating cerebellar feedback in and out so that it only modulates learning when beneficial. Moreover, here we have focused on a variant of the model that predicts feedback signals needed for learning, but it is likely that the cerebellum is also involved in speeding up feedforward processing, as in forward DNIs (see SM).

On the deep learning side there are also a number of promising options to explore. Predicting feedback is a hard task, due to its continuously changing nature. There are a number of architectural and cellular features of the cerebellum that may make learning feedback predictions more efficient in these models, namely sparse connectivity (as observed between mossy fibres and granule cells (schweighofer2001unsupervised; cayco2017sparse)) and modularity (as observed throughout the cerebellum (apps2018cerebellar)). At the same time, it may also be fruitful to consider the problems and advantages of decoupling neural processing more generally. Various other, potentially non-cerebellar, methods for unlocking have been proposed in machine learning that may provide interesting avenues for neuroscience (lee2019local; belilovsky2020decoupled; zhuang2021fully), though transferring such mechanisms to the temporal domain and ensuring biological plausibility remain open problems.

Overall, ccRNN provides a natural link between deep learning, existing cortico-cerebellar ideas and more classical cerebellar models. Moving forward, we hope this work offers a novel yet concrete framework with which neuroscientists can study the cerebro-cerebellar loop, and a source of inspiration for machine learners.

We would like to thank the Neural & Machine Learning group, Paul Anastasiades, Paul Chadderton and Paul Dodson for useful discussions. JP was funded by an EPSRC Doctoral Training Partnership award and EB by the Wellcome Trust (Neural Dynamics PhD Program). This work made use of the HPC system Blue Pebble at the University of Bristol, UK.



Appendix A Forward DNI

In this paper we have focused on the backward (or feedback) DNI, but there is another interesting paradigm linking two neural networks, dubbed “forward” DNI. Here we describe this variant of the model and, below, its link to the cerebellum.

The architecture remains the same; that is, we have a feedforward or recurrent main network and a separate feedforward network (the synthesiser) to which it forms a loop. The difference to backward DNI is that the synthesiser now predicts forward activity, not backward. Specifically, we have $\hat{h}_{i+k} = g(h_i)$, where $h_i$ is the activity of the main network at layer (or time) $i$ and $h_{i+k}$ with $k > 0$ is the activity at some later layer (or time). As with backward DNI, we constantly update the synthesiser parameters based on the difference between its prediction and its target, in this case $h_{i+k}$.

Though more nuanced, the goal of forward DNI as presented in jaderberg2017decoupled is also to hasten learning. As an example, suppose we have a feedforward network as the main model and equip each layer with a backward synthesiser (one which predicts same-layer error gradients) as well as a forward synthesiser which projects from the original network input onto each layer. The result is that, given the input, there are only two stages of processing before the parameters of any layer $i$ can be updated: approximating the layer's activity with the forward synthesiser, then applying the backward synthesiser to obtain a synthetic gradient. In this case, we have what jaderberg2017decoupled dub a “full unlock” of the forward and backward pass.
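The two-stage “full unlock” described above can be sketched as follows. This is a minimal numpy illustration with made-up dimensions; the tanh nonlinearity, linear synthesisers and weight scales are our assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only)
d_in, d_hidden = 8, 16

# Forward synthesiser: predicts layer-i activity directly from the input x
W_fwd = rng.normal(scale=0.1, size=(d_hidden, d_in))
# Backward synthesiser: predicts the error gradient at layer i from its activity.
# Zero-initialised, so initial synthetic gradients are zero (as for backward DNI).
W_bwd = np.zeros((d_hidden, d_hidden))

def full_unlock_update(x):
    """Two stages of processing suffice to update layer i:
    (1) approximate the layer's activity with the forward synthesiser,
    (2) map that approximation to a synthetic gradient with the backward one."""
    h_hat = np.tanh(W_fwd @ x)   # stage 1: predicted forward activity
    g_hat = W_bwd @ h_hat        # stage 2: synthetic error gradient
    return h_hat, g_hat

x = rng.normal(size=d_in)
h_hat, g_hat = full_unlock_update(x)
```

With both synthesisers in place, no true forward or backward pass through the rest of the network is needed before layer $i$ can take a parameter update.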

A.1 Relationship to cerebellum

As with backward DNI we can frame forward DNI as an instance of the internal model hypothesis (wolpert1998internal). Now, however, we seek an analogy to the inverse model hypothesis. Under the classical formulation of the inverse model, the cerebellum receives the current and desired sensory state before issuing a motor command; i.e. the cerebellum now acts as the controller. An analogous model can be derived in the cognitive domain, where the cerebellum might learn to manipulate some given “instruction” (desired/current state) in a manner which approximates the prefrontal cortex in its expression of a “mental model” (controlled object) (Ito2008).

If we consider the spatial case of forward DNI and interpret the layers whose computations the synthesiser approximates as, for example, the motor or prefrontal cortex, then the link to the inverse model becomes clear. In both schemes, the cerebellum receives initial states from upstream (instructions) and learns to mimic the forward computations which then take place in the neocortex. We also point out that, though the temporal case of forward DNI was not originally considered in jaderberg2017decoupled, there remain clear analogies to proposed cerebellar function. In fact, it was recently suggested that the cerebellum mimics motor processing over several timesteps (Fig. 7 in 10.3389/fnsys.2020.00019), exactly analogous to temporal forward DNI where the main model is a motor-associated RNN.

In general, the likeness in formulation between DNI and the cerebellar internal model hypothesis is summarised in Table S1.

                     Forward Model    Backward DNI    Inverse Model       Forward DNI
Controller           Neocortex        Main model      Cerebellum          Synthesiser
Input
Output destination   Neocortex        Main model      Controlled object   Main model
Table S1: Relationship of the internal models of the cerebellum with DNIs. The properties of the forward model of the cerebellum can be set against those of backward DNI (blue); similarly, the properties of the inverse model of the cerebellum can be set against those of forward DNI (red). Notation is largely consistent with section 2 of the main text, denoting the hidden activity of a motor and a sensory area, the computations of backward and forward DNI, and the loss function, respectively. In addition, the inverse model of the cerebellum traditionally also has access to a desired state (one can consider this a special case of “context” provided to the synthesiser; cf. jaderberg2017decoupled). There are no explicit equations for the computational processes of the forward and inverse models, and both are thus represented by an unknown function.

Appendix B Experimental details

In each of our tasks we use a long short-term memory network (LSTM; hochreiter1997long) as the main “cortical” network (costa2017cortical), and a simple feedforward network with one hidden layer (sensorimotor tasks) or two hidden layers (caption generation) as the “cerebellar” synthesiser network. As in jaderberg2017decoupled, the cerebellar network predicts gradients for both the memory cell and the output state of the LSTM, so that the cerebellar input and output size is twice the number of LSTM units; furthermore, the synthetic gradient is scaled by a factor of 0.1 before being used by the main model, for stability purposes. The final readout of the model is a (trained) linear sum of the LSTM output states. All networks are optimised using ADAM (kingma2014adam) (see learning rates below).
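A minimal sketch of this cerebellar synthesiser, assuming a plain ReLU hidden layer; the layer sizes are example values, and only the 2x input/output sizing, the zero-initialised final layer and the 0.1 scaling come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 50       # LSTM units (example value)
n_granule = 400    # cerebellar hidden ("granule") layer (example value)

# The cerebellar network sees, and predicts gradients for, both the
# memory cell and the output state: input/output size = 2 * n_units.
W1 = rng.normal(scale=0.05, size=(n_granule, 2 * n_units))
b1 = np.zeros(n_granule)
W2 = np.zeros((2 * n_units, n_granule))   # zero-initialised final layer
b2 = np.zeros(2 * n_units)

def synthetic_gradient(h, c, scale=0.1):
    """Predict gradients for [h; c], scaled by 0.1 for stability."""
    z = np.maximum(0.0, W1 @ np.concatenate([h, c]) + b1)  # granule layer
    return scale * (W2 @ z + b2)

g = synthetic_gradient(rng.normal(size=n_units), rng.normal(size=n_units))
```

Because the final layer starts at zero, the synthetic gradients are exactly zero until the synthesiser has begun to learn, so early training matches a plain truncated-BPTT model.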

In each experiment all initial LSTM parameters are drawn from the uniform distribution $\mathcal{U}(-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}})$, where $n$ is the number of LSTM units. The weights of the readout network and the feedforward weights of the cerebellar network (other than the final layer) are initialised uniformly within the “kaiming bound” as computed in he2015delving, and the biases are drawn from a uniform distribution whose bound depends on the input size of the layer. As in jaderberg2017decoupled, the last layer (both weights and bias) of the cerebellar network is zero-initialised, so that the produced synthetic gradients are zero at the start.
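The LSTM initialisation can be sketched as follows. The bound $1/\sqrt{n}$ matches the uniform distribution described above (and the standard LSTM default); the stacked four-gate parameter layout is a common convention and an assumption on our part:

```python
import numpy as np

def init_lstm_params(n_units, n_inputs, rng):
    """Draw all LSTM parameters from U(-1/sqrt(n_units), 1/sqrt(n_units))."""
    bound = 1.0 / np.sqrt(n_units)
    shapes = {
        "W_ih": (4 * n_units, n_inputs),  # input-to-hidden, 4 gates stacked
        "W_hh": (4 * n_units, n_units),   # hidden-to-hidden, 4 gates stacked
        "b_ih": (4 * n_units,),
        "b_hh": (4 * n_units,),
    }
    return {k: rng.uniform(-bound, bound, size=s) for k, s in shapes.items()}

params = init_lstm_params(50, 1, np.random.default_rng(0))
```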

During learning, backpropagation through time (BPTT) takes place strictly within distinct time windows (truncations), and computed gradients are not propagated between them: truncated BPTT. We split the original input into truncations as follows. Given an input sequence of $T$ timesteps and a truncation size $K$, we divide the sequence into truncations of size $K$, with any remainder going to the last truncation. Note that, along with the value of $K$, how evenly the sequence divides into truncations is an important factor for learning and can cause noticeable non-linearities (Figs. 2d, 4d). For the line drawing and seqMNIST-based tasks, where there may be truncation windows in which the error gradient is purely synthetic, we accumulate error gradients and apply gradient descent strictly at the end of the sequence. This ensures that each model performs the same overall number of weight updates, a potentially important detail for fairness when using optimisation tools such as ADAM. For the image captioning task, where there is sure to be an error signal at every truncation, we update the model weights as soon as the error gradients become available.
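The truncation scheme can be sketched as follows. We read “any remainder going to the last truncation” as the final truncation absorbing the leftover timesteps; that reading is an interpretation on our part:

```python
def truncation_sizes(T, K):
    """Split a length-T sequence into floor(T/K) truncations of size K,
    with any remainder absorbed by the last truncation."""
    if T < K:
        return [T]
    sizes = [K] * (T // K)
    sizes[-1] += T % K   # remainder goes to the final truncation
    return sizes
```

For example, a 28-timestep seqMNIST sequence with a truncation size of 10 would yield windows of 10 and 18 timesteps, which illustrates how unevenly dividing sequences can introduce the non-linearities mentioned above.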

In the line drawing and seqMNIST-based tasks, to test the effect of predicted feedback against the availability of “organic” error signals, which occur at any timestep where a target is provided, we vary the external feedback interval. Given a feedback interval $f$, the target is only available every $f$ timesteps. An arguable but helpful analogy might be the rate at which one receives visual information whilst performing a task (e.g. drawing freehand).
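The external feedback schedule can be sketched as a simple mask; the 1-indexing of timesteps is an assumption on our part:

```python
def feedback_mask(T, f):
    """Boolean mask over a length-T sequence: the target (and hence an
    "organic" error signal) is available only every f timesteps."""
    return [(t % f) == 0 for t in range(1, T + 1)]
```

With $T = 28$ and the interval $f = 4$ used in the seqMNIST tasks, the target is available at 7 of the 28 timesteps; at all other timesteps the cortical network can only rely on cerebellar (synthetic) feedback.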

In general, (sensible) hyperparameters were selected by hand after a few trial runs. We used the PyTorch library for all neural network models. In particular, our DNI implementation is based on publicly available code. The code used for our experiments is available at [url will be provided after the review process].

Normalised error

To calculate the normalised error with respect to a given model (Figs. 2d, 3a, c, 4d, S4) we take the ratio of total errors during learning (all epochs). For example, the normalised error of ccRNN with respect to cRNN is the total ccRNN training error divided by the total cRNN training error. Note that in the ablation case we compare against an unaffected ccRNN and only consider the respective errors post-ablation; e.g. the normalised error for a model with cerebellar ablation at epoch 50 is the ratio of the two models' total errors from epoch 50 onwards.
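A minimal sketch of this metric; the `start_epoch` argument (our naming) handles the post-ablation comparison:

```python
def normalised_error(errors_model, errors_reference, start_epoch=0):
    """Ratio of total training errors over epochs. For ablation comparisons,
    set start_epoch to the ablation epoch so only post-ablation errors count."""
    return sum(errors_model[start_epoch:]) / sum(errors_reference[start_epoch:])
```

A value below 1 means the model accumulated less error than the reference over the epochs considered.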

Cortico-cerebellar coupling

To analyse how the coupling between the two model components changes over learning (inspired by Wagner2019), we compute the (absolute) Pearson correlation between a given LSTM unit (both cell and output states) and a given unit in the cerebellar hidden (granular) layer, over different bins during training. Values presented are the average over all RNN/cerebellar hidden unit pairs.
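A sketch of this coupling measure, assuming activity matrices with one row per unit and one column per time sample within a training bin (the data layout is our assumption):

```python
import numpy as np

def coupling(lstm_act, cereb_act):
    """Mean absolute Pearson correlation over all LSTM/cerebellar unit pairs.

    lstm_act:  (n_lstm, n_samples) activity of LSTM units within a bin
    cereb_act: (n_cereb, n_samples) activity of cerebellar hidden units
    """
    n_lstm = lstm_act.shape[0]
    corr = np.corrcoef(np.vstack([lstm_act, cereb_act]))
    cross = corr[:n_lstm, n_lstm:]   # LSTM x cerebellar block only
    return np.abs(cross).mean()
```

Tracking this value across training bins gives the population-level coupling curves whose decrease over learning the model predicts.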

Computing details

All experiments were conducted on the X (for Anonymity) supercomputer at X (for Anonymity); mostly on GPUs (GeForce RTX 2080 Ti) and some on CPUs (Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz). We did not record the total computing time for the experimental results presented in this paper, but it can be estimated as follows. For the line drawing task a given run (with one seed and hyperparameter setting) took roughly 0.2 hours. For each of the cRNN and ccRNN models we ran across 10 different seeds, 4 different truncation values and 5 different levels of feedback; for the ccRNN we also considered 5 different ablation times. The seqMNIST tasks (ld-seqMNIST, dd-seqMNIST and standard seqMNIST) were also run across the conditions named above, but with 3 seeds, 6 truncation windows, 8 feedback intervals and 7 ablation times, each run taking approximately 0.5 hours. For image captioning we only present one case of ccRNN against cRNN across 5 seeds, with each run taking roughly 5 hours. Of course, other testing took place, as well as rerunning due to bugs, so we estimate the total compute time including these factors to be closer to double the time implied by the runs above.

B.1 Line drawing task

In the line drawing task, an LSTM network receives a discrete input cue which signals the network to either (i) stay at zero or (ii) move along an associated line (of equally spaced points) in 2D space over a period of 10 timesteps. We use 6 distinct non-zero input-target pairs, where each input is a (one-dimensional) integer, and the corresponding targets are lines whose end points lie equidistantly on a circle centred on the origin with radius 10. Once the input cue has been received, the model receives no new information (i.e. all future input is zero). The model is trained to minimise the mean squared error (MSE) between its output and the cue-based target.
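The cue-conditioned targets can be sketched as follows. The assumption that each line starts at the origin is ours, chosen because the no-cue condition is to stay at zero:

```python
import numpy as np

def line_targets(n_cues=6, radius=10.0, n_steps=10):
    """Targets for the line drawing task: end points lie equidistantly on a
    circle of the given radius centred on the origin; each target is n_steps
    equally spaced points along the line (assumed to start at the origin)."""
    targets = {}
    for cue in range(1, n_cues + 1):
        angle = 2 * np.pi * cue / n_cues
        end = radius * np.array([np.cos(angle), np.sin(angle)])
        fractions = np.linspace(0.0, 1.0, n_steps)[:, None]
        targets[cue] = fractions * end   # (n_steps, 2) points along the line
    return targets

targets = line_targets()
```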

The cortical network has one hidden layer of 50 LSTM units and the cerebellar network contains one hidden layer of 400 neurons. The initial learning rate is set to 0.001. Each epoch comprises 20 batches of 50 randomised examples. Unless explicitly stated, we use a fixed truncation size covering a fraction of the total task duration. Model results are averaged over 10 random seeds (with error bars), where each seed determines the initial weights of the network.

Figure S1: (a) Comparison of the cRNN and ccRNN with reduced cortical feedback information, as presented in the main text, to a model trained with full BPTT (i.e. cortical temporal feedback = 100%). (b) Learning under different cortical temporal feedback windows for the simple sensorimotor task (cf. Fig. 2d). The size of the window is presented as a percentage of the task duration (10 timesteps).

B.2 Sequential MNIST tasks

For each seqMNIST-based task the model receives the same temporal MNIST input, and the tasks are differentiated only by the model output. Given an MNIST image, at timestep $t$ the model receives the pixels from row $t$ of the image, so that there are 28 timesteps in total and an input size of 28.

In each case we have one hidden layer of 30 LSTM units in the main model and one hidden layer of 300 hidden units in the feedforward cerebellar network. Data was presented in batches of 50 with an initial learning rate of 0.0001.

Training and validation data were assigned an 80/20 split, containing 48000 and 12000 distinct image/number pairs respectively. Unless explicitly stated, we use a fixed truncation value covering a fraction of the task duration. Model results are averaged (with error bars) over 3 random seeds for weight initialisation.

B.2.1 ld-seqMNIST

In this variant each MNIST image of a digit 0-9 is allocated an associated position on the edge of a circle centred at 0 with radius 10, and the model must follow a line (of equally spaced points) towards that position (Fig. 4a, top). With the model output a vector of size 2, the training loss is defined by the mean squared error (MSE) between the output of the model and the points forming the target line.

B.2.2 dd-seqMNIST

Like ld-seqMNIST, in this variant the model outputs a sequence of 2D coordinates corresponding to the given image. The target sequence, however, is now of a highly non-linear form, and in this case actually resembles the shape of the digit itself (Fig. 4a, bottom; the digit 9 is not shown). The model is trained at each timestep to minimise the MSE between the model output and the target shape.

For each digit, the corresponding target drawing lies in a fixed region of 2D space, with the gap between each pair of successive points roughly the same. To prevent the model from being too harshly judged at timestep 1, all drawings begin in the top left corner (apart from the drawing of 1, which begins slightly beneath/to the right). MSE scores are reported as 100 times their raw values to ease comparison with ld-seqMNIST.

B.2.3 SeqMNIST

This is the standard form of seqMNIST (see Fig. S2), used as a case of a discrimination (or decision making) task and one in which the target is only sparsely available to the model (at the end only). In this case, at the end of the presentation of the image the model must classify the image as a digit between 0 and 9. The output of the model is a vector of probabilities of size 10 (one entry for each digit), and the model is trained to maximise the likelihood of the correct digit.

Figure S2: Learning curve using the validation set across epochs for the sequential MNIST task. A truncation window (a fraction of the sequence length) is applied.
Figure S3: Learning under different cortical temporal feedback windows for the ld-seqMNIST (left) and dd-seqMNIST (right) tasks (cf. Fig. 4d). The size of the window is presented as a percentage of the task duration (28 timesteps). Results presented in the main text (Fig. 4b) are shown on the top row along with an RNN trained with full backpropagation through time (i.e. cortical temporal feedback = 100%).
Figure S4: Learning under different external feedback intervals for the ld-seqMNIST (left) and dd-seqMNIST (right) tasks. Top row: training curves with different feedback intervals. Bottom row: ccRNN error normalised to cRNN across different feedback intervals. Opaque arrows colour coded to the text above. Transparent arrow designates the feedback interval used for experiments presented in main text.
Figure S5: Evolution of cortico-cerebellar correlations (average absolute Pearson correlation between an LSTM unit and a unit in the cerebellar hidden layer). Same as Fig. 3b but for the ld-seqMNIST and dd-seqMNIST tasks.
Figure S6: Effect of ablation for the ld-seqMNIST (left) and dd-seqMNIST (right) tasks. Top row: learning curves for different ablation times. cRNN (i.e. ablation at epoch 0) in grey, ccRNN (no ablation) in orange, and other ablation times as given by the dotted vertical lines. Bottom row: model error normalised against the (unablated) ccRNN for different ablation times.
Figure S7: Learning with a fixed cerebellum. In this case cerebellar parameters are learnt for the first 10% of epochs (to obtain non-zero weights) before being fixed for the remainder of training, forcing the RNN to use stale, out-of-date predicted gradients.

B.3 Caption Generation

The architecture for the caption generation task consists of a pretrained convolutional neural network (CNN) coupled with an RNN (LSTM). The synthesiser (cerebellar network) only communicates with the LSTM. The LSTM network has one layer of 256 LSTM units and the cerebellar network has two hidden layers of 1024 neurons.

The pipeline from image to caption is as follows. As part of image preprocessing and data augmentation, a given image is randomly cropped to a fixed size, flipped horizontally with even chance, and appropriately normalised before being given to a pretrained ResNet model (he2016deep). A feature vector of size 256 is thus obtained and is passed to the LSTM at timestep 0. The LSTM is subsequently presented with the “gold standard” caption, one word per timestep, each time learning to predict the next word; i.e., at timestep $t$ the model learns to predict word $t+1$. The network simultaneously learns a word embedding, so that each word is first transformed to a feature vector of size 256 before being served as input. With a preset vocabulary of 9956 distinct words, the final output of the model is a probability vector of size 9956.

We found the models to be generally prone to overfitting the training data. For this reason, we apply dropout during training on the input to the LSTM, where a given input element is set to zero with fixed probability.

Figure S8: Example images from the validation set with corresponding model captions (cRNN in grey and ccRNN in orange) and gold standard captions (black). Here we show a range of examples of how the models describe the presented image. In some cases, some or all of the models fail to give an accurate description of the image; in other cases, all models are able to produce an accurate caption describing the image, with each model displaying subtle differences in the generated captions.

Once training is complete the models can generate their own unique captions for previously unseen images (Figs. 5, S8). Given an image at timestep 0, the model applies a ‘greedy search’, where the output of the model at a given timestep is the word with the highest probability, and that word is then provided as input to the model at the next timestep. In this way the model can autonomously output an entire sequence of words which forms a predicted caption. In the (highly) rare case where the model generates an overly long sequence of words, we consider only the initial words as its caption.
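The greedy search can be sketched with a stand-in next-word model; the `next_word_probs` interface, the toy bigram table and the `max_len` cap are hypothetical, used here only to make the loop concrete:

```python
def greedy_caption(next_word_probs, start_token, end_token, max_len=20):
    """Greedy decoding: at each step pick the highest-probability word and
    feed it back as the next input; stop at the end token or max_len words."""
    caption, word = [], start_token
    for _ in range(max_len):
        probs = next_word_probs(word)          # {word: probability}
        word = max(probs, key=probs.get)       # greedy choice
        if word == end_token:
            break
        caption.append(word)                   # fed back as the next input
    return caption

# Toy bigram "model" (hypothetical) standing in for the trained RNN's softmax
table = {"<s>": {"a": 0.6, "the": 0.4},
         "a": {"dog": 0.7, "cat": 0.3},
         "dog": {"</s>": 0.9, "runs": 0.1}}
caption = greedy_caption(lambda w: table[w], "<s>", "</s>")
```

Beam search would be a natural alternative; greedy decoding was evidently sufficient for the qualitative comparisons in Figs. 5c and S8.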

The COCO training set (ILSVRC-2012-CLS; ILSVRC15) holds 414113 total image-caption pairs with 82783 unique images, while the held-out validation set (used for Fig. 5b, c) holds 202654 pairs with 40504 unique images; note that each image therefore has around 5 distinct gold standard captions. Training takes place in batches of 100 image/caption pairs, with an initial learning rate of 0.001. Model performance is averaged (with error bars) over 5 random seeds for weight initialisation.

To judge the models beyond their learning curves in bits per word (BPW), we quantify their ability to generate captions using a variety of metrics popular in language evaluation (image captioning, machine translation, etc.). In particular, we compare model-generated captions against the gold standard captions using standard language modelling metrics (Table S2).

Code for this task was based on (but altered from) publicly available code.

        BLEU_4   ROUGE-L   METEOR   CIDEr    SPICE
cRNN    0.2288   0.4790    0.2114   0.6928   0.1418
ccRNN   0.2294   0.4800    0.2112   0.6940   0.1422
Table S2: Mean metric scores identifying similarity between model-generated and gold standard captions for test data. Metrics considered are BLEU (BLEU_4, i.e. for n-grams with n = 4), ROUGE-L, METEOR, CIDEr and SPICE (papineni2002bleu; lin-2004-rouge; denkowski2014meteor; vedantam2015cider; anderson2016spice).