Ghost Units Yield Biologically Plausible Backprop in Deep Neural Networks

11/15/2019 ∙ by Thomas Mesnard, et al. ∙ Universität Bern ∙ ETH Zurich ∙ Montréal Institute of Learning Algorithms

In the past few years, deep learning has transformed artificial intelligence research and led to impressive performance on various difficult tasks. However, it is still unclear how the brain can perform credit assignment across many areas as efficiently as backpropagation does in deep neural networks. In this paper, we introduce a model that relies on a new role for a neuronal inhibitory machinery, referred to as ghost units. By cancelling the feedback coming from the upper layer when no target signal is provided to the top layer, the ghost units enable the network to backpropagate errors and perform efficient credit assignment in deep structures. While considering one-compartment neurons and requiring very few biological assumptions, the model is able to approximate the error gradient and achieve good performance on classification tasks. Error backpropagation occurs through the recurrent dynamics of the network and through biologically plausible local learning rules. In particular, the model does not require separate feedforward and feedback circuits. We studied different mechanisms for cancelling the feedback, ranging from a complete duplication of the connectivity by long-term processes to an online replication of the feedback activity. This reduced system combines the essential elements of a working, biologically abstracted analogue of backpropagation, with a simple formulation and proofs of the associated results. This model is therefore a step towards understanding how learning and memory are implemented in cortical multilayer structures, and it also raises interesting perspectives for neuromorphic hardware.


1 Introduction

Recently, deep learning [1] has revolutionized artificial intelligence and led to impressive performance in various tasks such as computer vision [2], speech recognition [3] and machine translation [4]. Deep learning, thanks to backpropagation [5, 6], is able to take advantage of the multilayer structure of a neural network to learn high-level features relevant for the task it is trained to perform. These features become more and more abstract in deeper layers of the network, and give rise to a high-level representation of the data. In the case of visual tasks, when using convolutional neural networks [7], the high-level features learned through the backpropagation algorithm are similar to the ones experimentally observed in the visual cortex [8].

However, it is still an open question how the brain is able to perform credit assignment in deep neural structures spanning multiple areas. The core algorithm used to train deep neural networks (i.e., backpropagation) has been seen by the neuroscience community as biologically implausible, because the implementation used in deep learning relies on assumptions that cannot be met in the brain [9, 10]: the need for symmetric weights and for a separate circuit for feedforward and gradient computations, precise timing between the forward and the backward paths with fixed neuronal activity, knowledge of the derivative of the forward activation to correctly update the weights, and high-precision numbers to characterize forward activities and backpropagated errors, compared to the binary values occurring in the brain. Thanks to recent work [11, 12] and the feedback-alignment mechanism, the assumption that the feedback weights must be the exact transpose of the forward ones is no longer required for efficient credit assignment. [13] showed that it is possible to train deep neural networks using backpropagation with binarized activation functions and binary weights. [14] showed how the same neurons can be used for both feedforward and gradient computations thanks to the recurrence induced by feedback connections, and how nudging output units towards a lower-error configuration propagates error gradients to the inner layers of the circuit via feedback connections.

Recent studies have shown that local contrastive Hebbian plasticity in an energy-based model can implement backpropagation in deep neural structures thanks to the recurrent dynamics [14, 15], with promising results when using leaky integrate-and-fire neurons [16]. Others [17] were able to train deep networks by approximating backpropagation in systems with multicompartment neurons. Finally, a recent study [18, 19] used recurrent networks with inhibitory neurons to approximate backpropagation. Those inhibitory units aim to predict and cancel the feedback signal coming from the upper layers. When they are not able to correctly predict the incoming feedback, weights are updated proportionally to this prediction error, which is closely related to the gradient that would have been obtained with classical backpropagation. This idea of predicting the incoming feedback activity from above in order to estimate the correct gradient can be related to [20], where side networks inserted between each layer of the main network learn, in a supervised manner, to predict the correct gradients based only on the feedforward inputs. A more thorough comparison with previous work on how backpropagation could be implemented in the brain is given in Section 4.

In this paper, we consider a recurrent network composed of pyramidal units (PU) that can be identified with the feedforward units of a multilayer perceptron, with a learning dynamic inspired by [14] and organized in two different phases. These cells integrate feedforward activity coming from the lower layers but also feedback activity coming from the upper layers. Moreover, in order to enable backpropagation of errors, we introduce a new type of interneuron, referred to as ghost units (GU). Their goal is to predict and cancel the feedback from pyramidal units in the upper layer by integrating the same feedforward input, without having access to any feedback from the following layers. This property enables the network to converge quickly during the feedforward computation, in spite of the presence of recurrent connections, by cancelling the feedback coming from the upper layers. This cancellation effect of the ghost units allows top-down corrective feedback to be correctly backpropagated when targets are provided in the weakly-clamped phase. This gives the network the capacity to perform credit assignment in a multilayer structure by simply following its dynamics and updating the weights according to local plasticity rules.

2 Backpropagation thanks to ghost units in a recurrent and dynamical neural network

2.1 Architecture

We consider a biologically plausible implementation of backpropagation in a directed acyclic graph of feedforward connections with network input $x$. We consider that the network has $N+1$ layers. We will use $l = 0$ for the first layer, which represents the inputs, and $l = N$ for the last layer, which is the output of the network; see Figure 1 for a schematic representation of the architecture.

At each layer $l$, a node of this graph is associated with one or multiple pyramidal units (PU)***Note that a single feedforward unit (called pyramidal unit) could in reality be implemented by multiple pyramidal neurons with similar input and output connectivity, allowing the network to reduce the spiking noise, if integrate-and-fire neurons were used for example.*** whose activity is denoted by $s^l_i$ (the state of unit $i$ in layer $l$). Pyramidal units have an output non-linearity $\rho$ which maps their activity to their firing rate $\rho(s^l_i)$. Technically, this transfer function must be $k$-Lipschitz continuous. $\mathcal{P}_l$ is the set of pyramidal units in layer $l$.

For a connectivity matrix $M$, we write $M_{ij}$ for the synaptic weight from unit $j$ in the input layer of the connection to unit $i$ in its output layer.

Both feedforward and feedback connections are considered in this model. The main feedforward synaptic weights $W^f_{l,ij}$ correspond to the influence of presynaptic unit $j$ (in layer $l-1$) on the postsynaptic unit $i$ (in layer $l$). The feedback weights $W^b_{l,ij}$ encapsulate the effect of pyramidal unit $j$ (of layer $l+1$) on pyramidal unit $i$ (of layer $l$). The network output is defined by the firing rates $\rho(s^N_i)$ of the output pyramidal units. This output is compared to target values $y$, and the performance is measured through a scalar cost function $C$ which compares the network output and the target. In the simulations, the mean-squared error was used as cost function, $C = \frac{1}{2}\,\|y - \rho(s^N)\|^2$, but other losses could be implemented in the same way.

We also define the cost function $\hat C$, which corresponds to the cost function of the associated (purely feedforward) multilayer perceptron (MLP), where the associated activity $\hat s^l_i$ is defined by:

$$\hat s^0 = x, \qquad \hat s^l_i = \sum_{j \in \mathcal{P}_{l-1}} W^f_{l,ij}\,\rho(\hat s^{l-1}_j). \tag{1}$$
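For concreteness, eq. 1 and the mean-squared cost can be written in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' released code; it assumes the sigmoid transfer function used in the simulations of Section 5, and the layer-indexed list `Wf` (with `Wf[0]` unused) is our own convention.

```python
import numpy as np

def rho(s):
    # Sigmoid transfer function (the activation used in the experiments).
    return 1.0 / (1.0 + np.exp(-s))

def mlp_forward(x, Wf):
    # Eq. 1: s_hat^0 = x, s_hat^l = Wf[l] @ rho(s_hat^{l-1}).
    # Wf is indexed by layer, with Wf[0] unused (None).
    s_hat = [x]
    for W in Wf[1:]:
        s_hat.append(W @ rho(s_hat[-1]))
    return s_hat

def cost(s_hat_top, y):
    # Mean-squared error between output firing rates and targets.
    return 0.5 * np.sum((y - rho(s_hat_top)) ** 2)
```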

Training is decomposed into a free phase and a weakly-clamped phase, following [14]. During the free phase, the network evolves under its recurrent dynamics with only the inputs provided. During the weakly-clamped phase, both inputs and targets are presented to the network: a top-down error signal of strength $\beta$ pushes the output units towards values corresponding to a smaller loss $C$. $\beta = 0$ corresponds to the free phase and $\beta > 0$ to the weakly-clamped one.

In addition, we consider a lateral network of ghost units (GU), which could be implemented by inhibitory interneurons. These units are represented by a scalar variable $g^l_k$ for each ghost unit $k$ in layer $l$. A ghost unit in a layer is only connected to the pyramidal units of the same layer, through two matrices: $V^f_{l,kj}$ (for lateral connections from pyramidal unit $j$ to ghost unit $k$) and $V^b_{l,ik}$ (for lateral connections from ghost unit $k$ to pyramidal unit $i$). These units aim to reproduce the feedback activity from the pyramidal units of the next layer during the forward phase, and therefore enable the network to directly compute the gradient during the weakly-clamped phase. These units are considered as inhibitory when projecting to the pyramidal neurons (expressed here as a minus sign in $-V^b_{l,ik}$, although the synaptic weights can themselves be negative). Ghost units are only present at the hidden layers ($1 \le l \le N-1$).

We will show in the following section that the combination of lateral recurrent and feedback connections propagates the error through the network in a way that closely approximates backpropagation, as long as some assumptions are satisfied regarding the ability of feedback connections to mimic feedforward connections (approximate symmetry) and of lateral connections to learn to cancel the feedback connections when there is no nudging.

2.2 Notations

$x$: network input — $\tau$: time constant
PU: pyramidal unit — GU: ghost unit
$s^l_i$: activity of PU $i$ in layer $l$ — $C$: cost function
$\hat s^l_i$: activity of PU $i$ in layer $l$ in the MLP — $\hat C$: cost function of the MLP
$\rho$: neuronal transfer function — $g^l_k$: activity of GU $k$ in layer $l$
$\mathcal{P}_l$: set of pyramidal units in layer $l$
$W^f_{l,ij}$: feedforward connection from PU $j$ (layer $l-1$) to PU $i$ (layer $l$)
$W^b_{l,ij}$: feedback connection from PU $j$ (layer $l+1$) to PU $i$ (layer $l$)
$V^f_{l,kj}$: lateral (recurrent) connection from PU $j$ to GU $k$ (layer $l$)
$V^b_{l,ik}$: lateral (recurrent) connection from GU $k$ back to PU $i$ (layer $l$)

2.3 Dynamics of the neurons

Three different inputs are integrated by pyramidal unit $i$ in layer $l$:

  • $\sum_j W^f_{l,ij}\,\rho(s^{l-1}_j)$ is the bottom-up input coming from the pyramidal units of layer $l-1$.

  • $\sum_j W^b_{l,ij}\,\rho(s^{l+1}_j)$ is the top-down feedback coming from the pyramidal units of layer $l+1$.

  • $-\sum_k V^b_{l,ik}\,\rho(g^l_k)$ is the lateral feedback coming from the ghost units of the same layer $l$.

The pyramidal units evolve through:

$$\tau\,\frac{ds^l_i}{dt} = -s^l_i + \sum_{j \in \mathcal{P}_{l-1}} W^f_{l,ij}\,\rho(s^{l-1}_j) + e^l_i, \tag{2}$$

where $e^l_i$ represents an error term whose expression depends on the layer.

For the hidden layers, $e^l_i$ is the difference between the top-down feedback $\sum_j W^b_{l,ij}\,\rho(s^{l+1}_j)$ (the local target) and the cancelling contribution from the inhibitory ghost units ($-\sum_k V^b_{l,ik}\,\rho(g^l_k)$, counted negatively because of the inhibitory nature of the ghost units). For the output layer, $e^N_i = \beta\,\big(y_i - \rho(s^N_i)\big)$ is the nudging term that indicates in which direction $s^N_i$ should move to reduce the output cost function $C$, with $y$ the target output values ($\beta = 0$ in the free phase and $\beta > 0$ in the weakly-clamped phase).

In particular, at the end of the forward phase, when $e^l_i = 0$, we have $s^l_i = \sum_j W^f_{l,ij}\,\rho(s^{l-1}_j) = \hat s^l_i$. Because of this perfect cancellation of the top-down feedback by the ghost units, the network behaves like a feedforward multilayer perceptron.

The ghost units of layer $l$ follow:

$$\tau\,\frac{dg^l_k}{dt} = -g^l_k + \sum_{j \in \mathcal{P}_l} V^f_{l,kj}\,\rho(s^l_j). \tag{3}$$
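The following sketch shows how eqs. 2-3 can be simulated with an Euler discretization, reusing `rho` from the earlier sketch; the step size `dt_over_tau` and the list-based layout of the weight matrices (`Wf`, `Wb`, `Vf`, `Vb`, indexed by layer with unused slots set to `None`) are our assumptions, not prescriptions from the paper.

```python
def step_dynamics(s, g, x, y, Wf, Wb, Vf, Vb, beta, dt_over_tau=0.1):
    """One Euler step of eqs. 2-3; s and g are lists indexed by layer,
    with s[0] clamped to the input x and g defined for hidden layers only."""
    N = len(s) - 1
    e = [None] * (N + 1)
    for l in range(1, N):  # hidden layers: feedback minus ghost cancellation
        e[l] = Wb[l] @ rho(s[l + 1]) - Vb[l] @ rho(g[l])
    e[N] = beta * (y - rho(s[N]))  # output layer: nudging term
    new_s, new_g = list(s), list(g)
    new_s[0] = x
    for l in range(1, N + 1):  # eq. 2: leaky integration of pyramidal units
        new_s[l] = s[l] + dt_over_tau * (-s[l] + Wf[l] @ rho(s[l - 1]) + e[l])
    for l in range(1, N):      # eq. 3: ghost units integrate lateral input
        new_g[l] = g[l] + dt_over_tau * (-g[l] + Vf[l] @ rho(s[l]))
    return new_s, new_g, e
```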

3 Different architectures and learning procedures

3.1 Network with 1-1 correspondence between the pyramidal units and the ghost units (MA)

Model description

In this section, we consider that each pyramidal unit $k$ in layer $l+1$ has a corresponding ghost unit***In biology there are more pyramidal neurons than inhibitory neurons. Yet, GU may also include sub-classes of pyramidal neurons, so that the number of GU need not be smaller than the number of PU. Moreover, the code formed by pyramidal neurons may show some redundancy, so that it could be compressed to a smaller number of effective PU.*** in the previous layer $l$, and that the ghost units aim to replicate the activity of their associated pyramidal units by integrating the same inputs. In order to make this part easier to read, we use the same index for a ghost unit and its associated pyramidal unit: ghost unit $k$ of layer $l$ (with activity $g^l_k$) is associated with pyramidal unit $k$ of layer $l+1$ (with activity $s^{l+1}_k$). This architecture is shown in Figure 1.

Figure 1: Architecture of the Model A (MA) network, with a 1-1 correspondence between ghost units and their associated pyramidal units in the following layer. Pyramidal units (in grey) and ghost units (in orange) are connected through different weight matrices ($W^f$, $W^b$ in grey for PU-PU connectivity and $V^f$, $V^b$ in orange for PU-GU lateral connections). Ghost units of layer $l$ tend to copy the pyramidal-unit activity of the following layer $l+1$, thanks to the updates of eq. 4 (this interaction is represented by the green dotted line), and at the same time to cancel the feedback coming from the pyramidal units of the next layer, because of the updates of eq. 5 (blue dotted line). The nudging $\beta\,(y - \rho(s^N))$ is presented at the output layer ($\beta = 0$ in the free phase and $\beta > 0$ in the weakly-clamped phase).

During the free phase, only the lateral connections between the pyramidal and ghost units are updated. The local learning rules for the synaptic weights $V^f_{l,kj}$ and $V^b_{l,ik}$ are defined as follows (see the sketch after this list):

  • $\rho(s^{l+1}_k)$ acts like a target for the ghost units to learn $V^f$:

    $$\Delta V^f_{l,kj} = \eta\,\big(\rho(s^{l+1}_k) - \rho(g^l_k)\big)\,\rho(s^l_j). \tag{4}$$

    This minimizes $\sum_k \big(\rho(s^{l+1}_k) - \rho(g^l_k)\big)^2$, i.e., the inhibitory ghost unit learns to imitate its associated pyramidal unit. $\eta$ is the learning rate.

  • The top-down feedback onto layer $l$ acts as a target for the weights $V^b$ forming the cancelling feedback:

    $$\Delta V^b_{l,ik} = \eta\,\Big(\sum_j W^b_{l,ij}\,\rho(s^{l+1}_j) - \sum_{k'} V^b_{l,ik'}\,\rho(g^l_{k'})\Big)\,\rho(g^l_k). \tag{5}$$

    This minimizes $\sum_i \big(e^l_i\big)^2$ for each layer $l$, with the same learning rate $\eta$.
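A sketch of these two free-phase updates under the (MA) assumptions (1-1 pairing, so `g[l]` and `s[l+1]` have the same dimension); the outer-product form and the single learning rate `eta` follow our reading of eqs. 4-5, reusing the helpers above.

```python
def free_phase_updates_MA(s, g, Wb, Vf, Vb, eta=0.01):
    """Free-phase plasticity for Model A: eq. 4 (Vf) and eq. 5 (Vb)."""
    N = len(s) - 1
    for l in range(1, N):
        # Eq. 4: ghost unit k learns to imitate pyramidal unit k of layer l+1.
        Vf[l] += eta * np.outer(rho(s[l + 1]) - rho(g[l]), rho(s[l]))
        # Eq. 5: Vb learns to cancel the top-down feedback (drive e^l to 0).
        e_l = Wb[l] @ rho(s[l + 1]) - Vb[l] @ rho(g[l])
        Vb[l] += eta * np.outer(e_l, rho(g[l]))
```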

During the weakly-clamped phase, only the $W^f_{l,ij}$ and $W^b_{l,ij}$ are updated, through the following learning rules:

  • The main weights $W^f_{l,ij}$ (feedforward, from pyramidal units of layer $l-1$ to pyramidal units of layer $l$) are updated using a local learning rule:

    $$\Delta W^f_{l,ij} = \eta_l\; e^l_i\;\rho'(s^l_i)\;\rho(s^{l-1}_j). \tag{6}$$

    This approximates gradient descent on $\hat C$, see Theorem 1.

  • The feedback weights are set equal to the transpose of the feedforward ones: $W^b_l = (W^f_{l+1})^T$.

This was implemented using an Euler discretization, see Algorithm 1 for a more precise description of the algorithm; a condensed sketch follows.
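Here is a condensed, illustrative version of one (MA) training iteration in the spirit of Algorithm 1, reusing the helpers defined above; the number of relaxation steps, the nudging strength and `drho` (the derivative of the sigmoid) are our own choices for the sketch.

```python
def drho(s):
    r = rho(s)
    return r * (1.0 - r)  # derivative of the sigmoid transfer function

def train_step_MA(x, y, s, g, Wf, Wb, Vf, Vb, beta=1.0, eta_w=0.1, steps=20):
    # Free phase: relax the dynamics with beta = 0, then update Vf and Vb.
    for _ in range(steps):
        s, g, e = step_dynamics(s, g, x, y, Wf, Wb, Vf, Vb, beta=0.0)
    free_phase_updates_MA(s, g, Wb, Vf, Vb)
    # Weakly-clamped phase: relax with beta > 0, then update Wf (eq. 6).
    for _ in range(steps):
        s, g, e = step_dynamics(s, g, x, y, Wf, Wb, Vf, Vb, beta=beta)
    N = len(s) - 1
    for l in range(1, N + 1):
        Wf[l] += eta_w * np.outer(e[l] * drho(s[l]), rho(s[l - 1]))
    for l in range(1, N):
        Wb[l] = Wf[l + 1].T  # transpose-feedback (TF) variant
    return s, g
```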

We also consider a variant (MA') where all the updates are performed during both phases. This version, with continuous weight updates, is closer to what can be expected to happen in the brain.

A good approximation of backpropagation

The combination of these update rules leads to the following theorem, where eq. 6 can be seen as a close estimate of backpropagation.

Theorem 1.

We set the backward weights equal to the (adapting) forward weights, $W^b_l = (W^f_{l+1})^T$, and assume that the ghost unit circuit ($V^f$, $V^b$) converged during the free phases, i.e. $\rho(g^l_k) = \rho(s^{l+1}_k)$ and $\sum_k V^b_{l,ik}\,\rho(g^l_k) = \sum_j W^b_{l,ij}\,\rho(s^{l+1}_j)$, for all hidden layers $l$ (MA). Then for weak output nudging ($\beta$ small, $\beta \to 0$) the errors converge to $\rho'(\hat s^l_i)\,e^l_i = -\beta\,\frac{\partial \hat C}{\partial \hat s^l_i} + o(\beta)$ for each hidden layer $l$. So, the forward weights $W^f_{l,ij}$ become updated according to the classical backpropagated error gradient.

Proof.

Let us introduce $\|\cdot\|$, which denotes the L2 norm for vectors and the maximum-singular-value (spectral) norm for matrices.

We suppose that all the matrices are bounded during the procedure. In other words, we suppose that there exists $M > 0$ such that $\|W^f_l\| \le M$, $\|W^b_l\| \le M$, $\|V^f_l\| \le M$ and $\|V^b_l\| \le M$. This could be ensured by clipping the weights between extremal values.

We recall the definition of $\hat C$, which corresponds to the cost function of the multilayer perceptron built from the feedforward graph (with no recurrent dynamics) and nodes $\hat s^l_i$.

Free phase:

Firstly, we study the dynamics of the weights during the free phase, where both loss functions $L^f_l = \sum_k \big(\rho(s^{l+1}_k) - \rho(g^l_k)\big)^2$ and $L^b_l = \sum_i \big(e^l_i\big)^2$ are minimized thanks to the updates eq. 4 and eq. 5.

In the stationary limit (where $\dot s^l_i = 0$ and $\dot g^l_k = 0$), we have $g^l_k = \sum_j V^f_{l,kj}\,\rho(s^l_j)$ and so, for all hidden layers $l$:

$$\rho(g^l_k) = \rho\Big(\sum_j V^f_{l,kj}\,\rho(s^l_j)\Big).$$

So if $L^f_l$ is minimized for all inputs $x$ in the stationary limit, then $\rho(g^l_k)$ tends to $\rho(s^{l+1}_k)$. Considering that the space spanned by $\rho(s^l)$ during learning is large enough (this means having a large set of training data, which was true in the cases we tested), the mean-squared error of the associated linear regression converges also to $0$, and therefore $\rho(g^l_k) = \rho(s^{l+1}_k)$ during the free phase.

Secondly, still in the stationary limit,

$$\Big\|W^b_l\,\rho(s^{l+1}) - V^b_l\,\rho(g^l)\Big\| \le \big\|W^b_l - V^b_l\big\|\,\big\|\rho(s^{l+1})\big\| + k\,M\,\big\|s^{l+1} - g^l\big\|$$

because $\rho$ is $k$-Lipschitz. So if both $L^f_l$ and $L^b_l$ are minimized for all $x$, then $e^l_i$ tends to $0$. As before, if we consider that the space spanned by $\rho(g^l)$ is large enough, we also minimize the mean-squared error of the linear regression, and thus $\sum_k V^b_{l,ik}\,\rho(g^l_k) = \sum_j W^b_{l,ij}\,\rho(s^{l+1}_j)$ during the free phase.

Weakly-clamped phase:

For the weakly-clamped phase, we prove the theorem by induction over the layers. We suppose that the learning of the free phase is done, so that $\rho(g^l_k) = \rho(s^{l+1}_k)$ and $\sum_k V^b_{l,ik}\,\rho(g^l_k) = \sum_j W^b_{l,ij}\,\rho(s^{l+1}_j)$ at the free-phase equilibrium.

Firstly, it is easy to see that for the output layer (index $N$), in the stationary limit and with small nudging, we have (at zero order in $\beta$) $s^N_i = \hat s^N_i$ and so:

$$\rho'(\hat s^N_i)\,e^N_i = \beta\,\rho'(\hat s^N_i)\,\big(y_i - \rho(\hat s^N_i)\big) + O(\beta^2) = -\beta\,\frac{\partial \hat C}{\partial \hat s^N_i} + O(\beta^2). \tag{7}$$

Then we just have to prove that this property is true for the last hidden layer, and the rest of the proof will follow by induction.

Considering that layer $N-1$ is still at equilibrium, and that the nudging is small, we have $s^{N-1}_i = \hat s^{N-1}_i + O(\beta)$, and so in the stationary limit:

$$g^{N-1}_k = \sum_j V^f_{N-1,kj}\,\rho(s^{N-1}_j) = \hat s^N_k + O(\beta).$$

Contrary to just above, we need the first-order approximation of $s^N$ here (otherwise we would get $e^{N-1}_i = 0$ in the following).

As we have, at equilibrium, $s^N_j = \sum_m W^f_{N,jm}\,\rho(s^{N-1}_m) + e^N_j$ and $g^{N-1}_j = \sum_m W^f_{N,jm}\,\rho(s^{N-1}_m)$ (since, from the free phase, the ghost units reproduce the bottom-up drive of their associated pyramidal units), we get $s^N_j = g^{N-1}_j + e^N_j$.

Starting from the definition of $e^{N-1}_i$, we substitute the definitions of $g^{N-1}_j$ and $s^N_j$. Using that $V^b_{N-1} = W^b_{N-1}$ on the span of the ghost activities (from the free phase) and $s^N_j = g^{N-1}_j + e^N_j$, we have:

$$e^{N-1}_i = \sum_j W^b_{N-1,ij}\,\Big(\rho\big(g^{N-1}_j + e^N_j\big) - \rho\big(g^{N-1}_j\big)\Big).$$

Then, by assuming $e^N_j$ is small (because $\beta$ is small):

$$e^{N-1}_i = \sum_j W^b_{N-1,ij}\,\rho'(\hat s^N_j)\,e^N_j + O(\beta^2) \tag{8}$$
$$\phantom{e^{N-1}_i} = \sum_j W^f_{N,ji}\,\rho'(\hat s^N_j)\,e^N_j + O(\beta^2). \tag{9}$$

By the chain rule on the feedforward graph, we have:

$$\frac{\partial \hat C}{\partial \hat s^{N-1}_i} = \rho'(\hat s^{N-1}_i)\,\sum_j W^f_{N,ji}\,\frac{\partial \hat C}{\partial \hat s^N_j}. \tag{10}$$

From eq. 8, eq. 10 and eq. 7:

$$\rho'(\hat s^{N-1}_i)\,e^{N-1}_i = -\beta\,\frac{\partial \hat C}{\partial \hat s^{N-1}_i} + O(\beta^2). \tag{11}$$

Using induction across layers, we have for every hidden layer $l$:

$$\rho'(\hat s^l_i)\,e^l_i = -\beta\,\frac{\partial \hat C}{\partial \hat s^l_i} + O(\beta^2). \tag{12}$$

Due to the stacked nonlinearities, the approximation may get worse and worse as we go deeper; this can be compensated by choosing a smaller $\beta$. ∎

initialization of $W^f$, $W^b = (W^f)^T$, $V^f$, $V^b$
while not done do
       Sample batch $(x, y)$ from the training set
       for _ in range(free_steps) do                               ($\beta = 0$)
             $e^N \leftarrow 0$   for output units
             $e^l \leftarrow W^b_l\,\rho(s^{l+1}) - V^b_l\,\rho(g^l)$   for hidden units
             for layer $l = 1, \dots, N$:
                   $s^l \leftarrow s^l + \frac{dt}{\tau}\big({-s^l} + W^f_l\,\rho(s^{l-1}) + e^l\big)$
                   $g^l \leftarrow g^l + \frac{dt}{\tau}\big({-g^l} + V^f_l\,\rho(s^l)\big)$   (hidden layers only)
                   $V^f_l \leftarrow V^f_l + \eta\,\big(\rho(s^{l+1}) - \rho(g^l)\big)\,\rho(s^l)^T$   (eq. 4)
                   $V^b_l \leftarrow V^b_l + \eta\,e^l\,\rho(g^l)^T$   (eq. 5)
       end for
       for _ in range(weakly_clamped_steps) do                     ($\beta > 0$)
             $e^N \leftarrow \beta\,(y - \rho(s^N))$   for output units
             $e^l \leftarrow W^b_l\,\rho(s^{l+1}) - V^b_l\,\rho(g^l)$   for hidden units
             for layer $l = 1, \dots, N$:
                   $s^l \leftarrow s^l + \frac{dt}{\tau}\big({-s^l} + W^f_l\,\rho(s^{l-1}) + e^l\big)$
                   $g^l \leftarrow g^l + \frac{dt}{\tau}\big({-g^l} + V^f_l\,\rho(s^l)\big)$   (hidden layers only)
                   $W^f_l \leftarrow W^f_l + \eta_l\,\big(e^l \odot \rho'(s^l)\big)\,\rho(s^{l-1})^T$   (eq. 6)
                   $W^b_l \leftarrow (W^f_{l+1})^T$
       end for
end while
Algorithm 1: Learning procedure for the Model A (MA) network.
initialization of $W^f$, $W^b$, $V^f$ (fixed random), $V^b$
while not done do
       Sample example $(x, y)$ from the training set
       for _ in range(free_steps) do                               ($\beta = 0$)
             $e^N \leftarrow 0$   for output units
             $e^l \leftarrow W^b_l\,\rho(s^{l+1}) - V^b_l\,\rho(g^l)$   for hidden units
             for layer $l = 1, \dots, N$:
                   $s^l \leftarrow s^l + \frac{dt}{\tau}\big({-s^l} + W^f_l\,\rho(s^{l-1}) + e^l\big)$
                   $g^l \leftarrow g^l + \frac{dt}{\tau}\big({-g^l} + V^f_l\,\rho(s^l)\big)$   (hidden layers only)
                   $V^b_l \leftarrow V^b_l + \eta_V\,e^l\,\rho(g^l)^T$   (eq. 15)
       end for
       for _ in range(weakly_clamped_steps) do                     ($\beta > 0$)
             $e^N \leftarrow \beta\,(y - \rho(s^N))$   for output units
             $e^l \leftarrow W^b_l\,\rho(s^{l+1}) - V^b_l\,\rho(g^l)$   for hidden units
             for layer $l = 1, \dots, N$:
                   $s^l \leftarrow s^l + \frac{dt}{\tau}\big({-s^l} + W^f_l\,\rho(s^{l-1}) + e^l\big)$
                   $g^l \leftarrow g^l + \frac{dt}{\tau}\big({-g^l} + V^f_l\,\rho(s^l)\big)$   (hidden layers only)
       end for
       $W^f_l \leftarrow W^f_l + \eta_l\,\big(e^l \odot \rho'(s^l)\big)\,\rho(s^{l-1})^T$   for all layers $l$ (eq. 16)
       $W^b_l \leftarrow (W^f_{l+1})^T$   (TF variant; fixed random for FA)
end while
Algorithm 2: Learning procedure for the Model B (MB) network.
Corollary 1.

Under the assumptions of Theorem 1, the weight change proposed in eq. 6 corresponds to approximate stochastic gradient descent, i.e.,

$$\Delta W^f_{l,ij} = -\eta_l\,\beta\,\frac{\partial \hat C}{\partial W^f_{l,ij}} + o(\beta). \tag{13}$$

Proof.

$$\Delta W^f_{l,ij} = \eta_l\,e^l_i\,\rho'(s^l_i)\,\rho(s^{l-1}_j) = -\eta_l\,\beta\,\frac{\partial \hat C}{\partial \hat s^l_i}\,\rho(\hat s^{l-1}_j) + o(\beta). \tag{14}$$

Hence, since $\frac{\partial \hat C}{\partial W^f_{l,ij}} = \frac{\partial \hat C}{\partial \hat s^l_i}\,\rho(\hat s^{l-1}_j)$, we obtain eq. 13 and the corollary. ∎

3.2 Deep neural network with ghost units replicating online the feedback from the pyramidal units (MB)

Model description

We also developed a different class of models, which we introduce in this section. In this model (MB), we do not make any hypothesis on the relative numbers of pyramidal units and ghost units. We also consider that the lateral connections from the pyramidal units of layer $l$ to the ghost units of the same layer are fixed to a randomly initialized value, i.e. $\Delta V^f_{l,kj} = 0$. $V^b_{l,ik}$ evolves so as to have, example by example, the feedback coming from the ghost units of layer $l$ replicate the feedback coming from the pyramidal units of layer $l+1$; see Figure 2. Thanks to this property, we have $\sum_k V^b_{l,ik}\,\rho(g^l_k) = \sum_j W^b_{l,ij}\,\rho(s^{l+1}_j)$ for a given example at the end of the free phase, after the efficient and rapid learning of $V^b$. This enables the network to correctly learn in the weakly-clamped phase.
This highly modular and fast-changing plasticity could be implemented in real neural circuits by post-tetanic potentiation, a type of plasticity that evolves rapidly and only lasts on a time scale of seconds [21, 22].

As just described, the top-down feedback onto layer $l$ acts as a target for the weights $V^b$ forming the cancelling lateral feedback. Therefore, the weights are updated during the free phase as follows:

$$\Delta V^b_{l,ik} = \eta_V\,\Big(\sum_j W^b_{l,ij}\,\rho(s^{l+1}_j) - \sum_{k'} V^b_{l,ik'}\,\rho(g^l_{k'})\Big)\,\rho(g^l_k), \tag{15}$$

which minimizes $\sum_i \big(e^l_i\big)^2$.

The main weights are updated at the end of the weakly-clamped phase through the same local rule as in (MA):

$$\Delta W^f_{l,ij} = \eta_l\; e^l_i\;\rho'(s^l_i)\;\rho(s^{l-1}_j). \tag{16}$$

We used different learning rates $\eta_l$ for each layer in the case of (MB).

For a more detailed description of the algorithm, see Algorithm 2; a minimal sketch of the online update is given below.
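A sketch of the online eq. 15 update during the free phase, reusing the helpers above; the fast per-example learning rate `eta_V` is an illustrative assumption.

```python
def free_phase_updates_MB(s, g, Wb, Vb, eta_V=0.5):
    """Online free-phase plasticity for Model B (eq. 15): for the current
    example, Vb rapidly learns to make the ghost-unit feedback Vb @ rho(g)
    replicate the pyramidal feedback Wb @ rho(s[l+1]); Vf stays fixed."""
    N = len(s) - 1
    for l in range(1, N):
        e_l = Wb[l] @ rho(s[l + 1]) - Vb[l] @ rho(g[l])
        Vb[l] += eta_V * np.outer(e_l, rho(g[l]))
```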

A good approximation of backpropagation

These update rules lead to the following theorem, where eq. 16 can again be seen as a close estimate of backpropagation.

Figure 2: Architecture of the Model B (MB) network. Pyramidal units (in grey) and ghost units (in orange) are connected through different weight matrices ($W^f$, $W^b$ in grey for PU-PU connectivity and $V^f$, $V^b$ in orange for PU-GU lateral connections). $V^f$ is fixed. Thanks to the plasticity rule of eq. 15, $V^b$ evolves so that the lateral feedback from the ghost units of layer $l$ replicates the pyramidal-unit feedback from the following layer $l+1$ (this interaction is represented by the blue dotted line). The nudging $\beta\,(y - \rho(s^N))$ is presented at the output layer ($\beta = 0$ in the free phase and $\beta > 0$ in the weakly-clamped phase).
Theorem 2.

We set the backward weights equal to the (adapting) forward weights, $W^b_l = (W^f_{l+1})^T$, and assume that the ghost unit circuit ($V^b$) converged during the free phase (at each presentation of an input), i.e. $\sum_k V^b_{l,ik}\,\rho(g^l_k) = \sum_j W^b_{l,ij}\,\rho(s^{l+1}_j)$, for each hidden layer $l$ (MB). Then for weak output nudging ($\beta$ small, $\beta \to 0$) the errors converge to $\rho'(\hat s^l_i)\,e^l_i = -\beta\,\frac{\partial \hat C}{\partial \hat s^l_i} + o(\beta)$ for each hidden layer $l$. So, the forward weights $W^f_{l,ij}$ become updated according to the classical backpropagated error gradient.

Proof.

Free phase:

After settling in the free phase ($\dot s^l_i = 0$, $\dot g^l_k = 0$), we have, because the $V^b$ learning drives $e^l_i$ to $0$:

$$s^l_i = \sum_j W^f_{l,ij}\,\rho(s^{l-1}_j).$$

In particular, we have $s^l_i = \hat s^l_i$ for all units and, as $g^l_k = \sum_j V^f_{l,kj}\,\rho(s^l_j)$:

$$\sum_k V^b_{l,ik}\,\rho(g^l_k) = \sum_j W^b_{l,ij}\,\rho(\hat s^{l+1}_j). \tag{17}$$

Weakly-clamped phase:

As in the previous proof, we use induction.

We clearly have, in the output layer:

$$\rho'(\hat s^N_i)\,e^N_i = -\beta\,\frac{\partial \hat C}{\partial \hat s^N_i} + O(\beta^2). \tag{18}$$

We will prove the property for the last hidden layer $N-1$, and induction will follow. Starting from the definition of $e^{N-1}_i$, we substitute the definitions of the top-down and lateral feedback:

$$e^{N-1}_i = \sum_j W^b_{N-1,ij}\,\rho(s^N_j) - \sum_k V^b_{N-1,ik}\,\rho(g^{N-1}_k) \tag{19}$$
$$\phantom{e^{N-1}_i} = \sum_j W^b_{N-1,ij}\,\big(\rho(s^N_j) - \rho(\hat s^N_j)\big) + \Big(\sum_j W^b_{N-1,ij}\,\rho(\hat s^N_j) - \sum_k V^b_{N-1,ik}\,\rho(g^{N-1}_k)\Big). \tag{20}$$

We consider that layer $N-1$ is still at equilibrium, and that the nudging is small, so $s^{N-1}_i = \hat s^{N-1}_i + O(\beta)$. And by the same arguments as for (MA):

$$s^N_j = \hat s^N_j + e^N_j + O(\beta^2). \tag{21}$$

From eq. 17 (still valid because $V^b$ has not been impacted by the feedback from above), eq. 20, eq. 21 and $\beta$ small, we have:

$$e^{N-1}_i = \sum_j W^b_{N-1,ij}\,\big(\rho(\hat s^N_j + e^N_j) - \rho(\hat s^N_j)\big) + O(\beta^2) \tag{22}$$
$$\phantom{e^{N-1}_i} = \sum_j W^b_{N-1,ij}\,\rho'(\hat s^N_j)\,e^N_j + O(\beta^2) \tag{23}$$
$$\phantom{e^{N-1}_i} = \sum_j W^f_{N,ji}\,\rho'(\hat s^N_j)\,e^N_j + O(\beta^2). \tag{24}$$

Starting from this point, the proof is the same as for Theorem 1. ∎

3.3 Transpose feedback (TF) versus Feedback-alignment (FA)

The feedback weights are assumed to be equal to the transpose of the feedforward ones, and are updated as such during training: $W^b_l = (W^f_{l+1})^T$, as in classical backpropagation. We refer to this hypothesis as transpose-feedback (TF). In practice, this characteristic could be implemented using an additional reconstruction cost (between consecutive pyramidal layers), which has been shown to encourage symmetry of the weights [23]. This assumption can also be relaxed thanks to [11] and the feedback-alignment effect (FA). In this case, feedback weights are fixed and randomly initialized; during learning, the feedforward matrix tends to align with the transpose of the feedback matrix. Both hypotheses (TF) and (FA) were tested here, as sketched below.
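The two hypotheses differ only in how the feedback matrices are maintained between weight updates; a minimal sketch with our own helper name, reusing the layer-indexed weight lists from the earlier sketches:

```python
def update_feedback_weights(Wf, Wb, mode="TF"):
    """TF: keep Wb[l] tied to the transpose of Wf[l+1] after each update.
    FA: leave the randomly initialized Wb fixed; the forward weights then
    align with the transpose of Wb during learning."""
    if mode == "TF":
        for l in range(1, len(Wf) - 1):
            Wb[l] = Wf[l + 1].T
    # mode == "FA": nothing to do, Wb stays at its random initialization.
```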

4 Related Work

Backpropagation in the brain has been a very active topic of research for the last few years and various models have been proposed.

Contrastive Hebbian learning [24, 25] introduced the idea of learning in two different phases: a free phase where the inputs are presented to the network, followed by a weakly-clamped phase with a target signal that nudges the output layer towards the right solution. [14] drew the parallel between contrastive Hebbian learning and backpropagation by defining a framework for energy-based models, equilibrium propagation. The idea of using two different phases during the training procedure was kept in this work; however, contrary to the previous studies, we were also able to train the network while allowing synaptic updates during both phases.

Segregated dendrites and multicompartment neurons were recently used [17] to implement backpropagation in a biologically plausible manner. This study gave a very interesting explanation of how neurons can store feedforward activity and how feedback connections can carry the backpropagated error without interfering with the feedforward activity. Training can then be performed without dealing with the recurrent activity caused by feedback connections, which makes the theory simpler and closer to deep learning methodology. This study achieved good results even though it used a spiking neural network with update rules computed from averages of the neural potential.

A recent study [18, 19] introduced the idea of cancelling the feedback from the next layer with inhibitory lateral feedback, so that only the backpropagated error remains in the feedback signal. They used recurrent networks of two-compartment neurons. Links can also be drawn with [26, 20, 27], where local credit assignment is also performed.

As in [17, 18, 19], we effectively consider the dynamics of a single quantity per neuron, the somatic activity. However, we do not describe the input currents as coming from different compartments and consider instead a more abstract single-compartment neuron (that, to implement plasticity, is able to represent two quantities: its target rate and its actual rate). This has the advantage of simplifying the terminology of the model, as we do not need to introduce dendritic quantities that enter in the representation of the errors. Moreover, this does not induce any scaling of the approximated error by dendritic attenuation factors as in [18, 19] (the same scaling can be recovered by multiplying the learning rate by the inverse of the dendritic attenuation factor). As such, it is possible to approximate the backpropagated gradient without an exponential decay of its magnitude when it is propagated through several layers. We used a reduced system with only the required ingredients in order to obtain a working, biologically abstracted analogue of backpropagation. Model A implements in a simple and condensed way the principles from [18, 19], where the ghost-unit network copies the pyramidal one. Diverging from the ideas of Model A, we also postulate a short-term plasticity through which the local circuit adapts to a single pattern, as in post-tetanic potentiation [21, 22] or in the FORCE algorithm [28]. We accordingly develop Model B, where the ghost units dynamically adapt their feedback to replicate, in an online manner, the feedback coming from the pyramidal units. This single-compartment model is sufficient to obtain the required credit-assignment mechanism, and it also simplifies the mathematics to the bare necessities required to obtain the desired results.

Figure 3: Top: Evolution of the train accuracy (dotted) and test accuracy (plain line) on the MNIST classification task for a 500-unit (MA) neural network. Bottom: Evolution across learning of the Frobenius norm of the difference between $V^b_l$ and $W^b_l$ (Diff Vb) and between $V^f_l$ and $W^f_{l+1}$ (Diff Vf). Curves are averaged over 5 experiments.

5 Results

5.1 Credit assignment with replicating units

We consider a (784-500-10) network with one hidden layer and the MSE (mean-squared error) loss. No preconditioning of the inputs is used. The batch size is 100 for (MA) and 1 for (MB). The activation is the sigmoid. We train on the 55000-example MNIST training set and test on the 10000 examples of the test set. The weights were initialized randomly from a uniform distribution (see Tables 2 and 4 for the hyperparameters).

(MA) dynamics

In (MA), learning is composed of two phases. During the free phase, only inputs are provided to the network. The weights $V^f$ push the ghost units to mimic their corresponding pyramidal units (eq. 4), while the learning of $V^b$ minimizes the mismatch between the feedback coming from the ghost units and that from the pyramidal units of the next layer (eq. 5). This pushes the matrix $V^f_l$ to reproduce $W^f_{l+1}$ and $V^b_l$ to copy $W^b_l$. This can be seen at the bottom of Figure 3, where the Frobenius norms of the differences between these matrices during training are shown.

This leads to the correct computation of the feedforward path, because the feedback terms cancel each other, even though this happens in a dynamical way.

During the weakly-clamped phase, the output units are nudged towards the correct values. This shift is backpropagated through the dynamics of the network. It gives rise to an error term at each hidden layer thanks to the mismatch between the feedback coming from the pyramidal units and from the corresponding ghost units. $W^f$ evolves in order to minimize this mismatch (eq. 6).

As can be seen at the top of Figure 3, the network learns to classify MNIST digits well (100% accuracy on the train set and 98.27% on the test set). It generalizes well without any regularization or tricks.

(MB) dynamics

We also studied learning in a neural network following the (MB) hypothesis. In the one-hidden-layer network, 5 ghost units were used, which aim to replicate the feedback signal from the 10 output pyramidal units. For testing the feedforward network on the MNIST train and test sets, we ran the forward graph without the dynamical part to speed up the simulations.

Different inputs are presented to the network sequentially (no batching in this setting). At each input presentation, the network goes through two different phases (see Figure 4). First comes the free phase (in blue), where there is no nudging of the output layer. In particular, the lateral weights $V^b$ adapt so that the feedback signal from the ghost units cancels the feedback from the pyramidal cells ($e^l_i = 0$), as can be seen in Figure 4 (bottom). This cancellation of the local error leads to a correct computation of the feedforward graph of the neural network. The output probabilities of the classification task can then be read from the output pyramidal cells, as seen in Figure 4 (top). Then, during the weakly-clamped phase (in green), the output neurons are nudged towards the right solution. This error is backpropagated through the network by its own dynamics. When equilibrium is reached, the feedforward weights are updated. Another input is then presented to the network, and this process enables learning of the classification task.

Figure 4: Dynamics of an (MB) neural network for 3 different inputs at 3 different times during learning. Top: level of activity of the output neurons (red for the correct class, black for the others). Bottom: norm of the difference between the feedback from the ghost units and from the pyramidal units, as a function of the time step. The free phase is represented in blue and the weakly-clamped phase in green.

Learning can be studied through the responses of the output neurons at different epochs. At epoch 0 (Figure 4, left), the output neurons are mostly wrong. The backpropagated gradients have a large amplitude (as can be seen from the jumps in activity after the beginning of the weakly-clamped phase). In particular, we clearly see that the neuron representing the right class (in red) is nudged towards 1, whereas the others are nudged down to 0. After one epoch (Figure 4, middle), the output states are moving towards the right solution. However, it becomes harder to cancel the error in the free phase, because the weights grow larger, making the ghost units work harder after a switch between two different inputs. Finally, after two epochs (Figure 4, right), the output neurons already start to saturate to 0 or 1. In conclusion, using the (MB) dynamics, the neural network is able to quickly learn a classification task.

Figure 5: Learning in a 500-unit (MB) neural network. Test and training accuracy (top), test and training mean-squared error (middle), relative error to the gradient from classical backpropagation (bottom) as a function of the number of epochs.

On a longer scale of learning (50 epochs), the accuracy and mean-squared error are plotted in Figure 5 for both the train and test sets. Learning proceeds well, as the training accuracy reaches 99.76% (top). This is reflected in the mean-squared error, which decreases towards 0 (middle). The gradients computed through the ghost units are almost the same as the ones computed using the usual backpropagation and the chain rule (up to 7% relative error to the classical backpropagated gradient, bottom; a sketch of this comparison follows). Generalization is also quite good, with an accuracy on the test set of 98.1% after only 50 epochs of training, which is quick and efficient considering that none of the usual tricks (Adam, RMSProp, ...) were used in this setting.
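The relative-error curve at the bottom of Figure 5 can be understood as comparing the ghost-unit updates against the exact chain-rule gradient of the associated MLP. A sketch of the reference gradient computation, reusing `mlp_forward`, `rho` and `drho` from the earlier sketches (our own illustrative code):

```python
def backprop_grads(x, y, Wf):
    """Exact gradients dC_hat/dWf[l] of the associated MLP (eq. 1), used as
    the reference against which the ghost-unit updates are compared."""
    s_hat = mlp_forward(x, Wf)
    N = len(Wf) - 1
    delta = -(y - rho(s_hat[N])) * drho(s_hat[N])  # dC_hat/ds_hat^N
    grads = [None] * (N + 1)
    for l in range(N, 0, -1):
        grads[l] = np.outer(delta, rho(s_hat[l - 1]))  # dC_hat/dWf[l]
        if l > 1:
            delta = (Wf[l].T @ delta) * drho(s_hat[l - 1])  # chain rule, eq. 10
    return grads
```

The relative error then compares $\Delta W^f_l / (\eta_l\,\beta)$ from eq. 16 with `-grads[l]`.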

In conclusion, this biologically inspired neural network following the (MB) hypothesis is able to learn the MNIST classification benchmark quickly and robustly, with locally computed gradients that closely approximate backpropagation.

5.2 Classification on MNIST

(a) Model A (MA)
(b) Model B (MB)
Figure 6:

Classification on MNIST using networks with one hidden layer. Test error on MNIST as a function of the number of epochs for one-hidden-layer networks with 100 (red), 300 (black) and 500 (blue) neurons. Both TF (plain lines) and FA (dotted lines) are represented. On the right, the mean values of the test and train accuracies (between parentheses) are detailed. For each network, the standard deviation over 5 experiments is represented by the width of the area below the ticked mean values.

We have tested several types of networks on the MNIST dataset [29]; see Table 1 for the results. We looked at both the (MA) and (MB) models while using either transpose-feedback (TF) or feedback-alignment (FA) (see Figure 6a for (MA) and Figure 6b for (MB)). We tested different numbers of units per hidden layer and different numbers of hidden layers. Simulations were run on the GPU cluster Cedar, Compute Canada (www.computecanada.ca).

We ran 5 experiments for each model and report the results in Table 1. The results approach state-of-the-art accuracies for multilayer perceptrons, with (MA) performing slightly better than (MB). For both models, raising the number of neurons improved performance.

(MA) performance on the MNIST task

1-layer and 2-layer (MA) networks compete with the state of the art for multilayer perceptrons trained with backpropagation. Training of 1-layer (MA) networks is stable and works well with both transpose-feedback and feedback-alignment. Increasing the number of units per layer helps improve performance, as shown in Figure 6a. Note that we were able to use a relatively large $\beta$ during the weakly-clamped phase, which speeds up and stabilizes learning.

Training of 2-layer (MA) networks is stable and also works well across a large range of hyperparameters when using feedback-alignment. However, when using transpose-feedback, training is a bit less stable and requires a more careful hyperparameter search to achieve good performance. Accuracies and hyperparameters are shown in Table 1 and Table 2.

We also performed 1-layer experiments with all synaptic updates occurring at all times during both the free and the weakly-clamped phases (MA') (Table 3). These networks were harder to train but are still able to perform quite well considering all the assumptions that are made (around 3.5% test error). These results are comparable to the results from [17, 18]. To get stable behavior, the weight updates were clipped; otherwise training sometimes diverged at the beginning of learning. This hypothesis is biologically plausible, through some saturation mechanisms.

(MB) performance on the MNIST task

For (MB) networks, the 1-layer networks were associated with one layer of ghost units (to replicate the feedback from the pyramidal output units). For the 2-layer networks, we used two layers of ghost units. The other parameters are gathered in Table 4.

1-layer (MB) networks perform very well compared to state-of-the-art results, both in the transpose-feedback and feedback-alignment cases. Using networks with more neurons (100 - 300 - 500) also helps reach higher performance.

For 2-layer (MB) networks with transpose-feedback weights, the networks were harder to train and sometimes unstable. This can be explained by the fact that the error made when backpropagating the gradient through this biological backpropagation is proportional to the amplitude of the feedback weights. In 2-layer networks with transpose feedback, these grow at the same rate as the feedforward weights; if they grow too large, they can induce unwanted errors and make the network unstable. These problems were completely solved when using the feedback-alignment version of (MB), where the amplitude of the feedback weights is fixed. In particular, the highest accuracies for (MB) are reached with the 2-layer version with feedback-alignment. Adding layers helps the network perform better, and the biological backpropagation is efficient through several layers.

Architecture          TF              FA
Model     #units    Train   Test    Train   Test
(MA)      100       99.97   97.66   99.90   97.47
(MA)      300       100     98.21   99.99   97.97
(MA)      500       100     98.27   100     98.12
(MA)      500/500   99.67   97.86   99.93   98.05
(MA')     500       97.77   96.57   -       -
(MB)      100       99.31   97.22   98.93   97.39
(MB)      300       99.70   98.05   99.48   97.98
(MB)      500       99.76   98.13   99.56   98.01
(MB)      300/300   99.84   97.95   99.78   98.05
(MB)      500/500   99.91   98.13   99.85   98.21
[30]      1000      -       97.6    -       97.9
[12]      800/800   -       98.33   -       98.18
[17]      500       -       96.4    -       95.9
[17]      500/100   -       -       -       96.8
[18, 19]  500/500   -       98.04   -       -
Table 1: Accuracy results (in %) on MNIST with (MA) and (MB) in networks with one and two hidden layers.

Comparison to other works

We saw that both models (MA) and (MB) compete with state-of-the-art multilayer perceptrons on MNIST, as in [30, 12] where classical backpropagation is used. As shown in Table 1, we obtain higher accuracies than previous biologically plausible models [17, 18, 19] in our setting with different update rules during the two phases. Even when updating all weights during both phases (MA'), we get results similar to a model with segregated dendritic compartments [17], where updates are done in two different phases but with a spiking neural network. We however cannot reach an accuracy as high as in [18, 19] when considering updates in both phases as they do; this may be due to the fact that they use different compartments. In conclusion, both models presented in this work reach accuracies comparable to backpropagation in a simple, biologically plausible setting.

6 Conclusion

Deep learning has been the focus of intense study in the past decade and has become more and more efficient at solving diverse and complex tasks, ranging from pattern recognition to image generation, NLP (natural language processing) and many others. Many ideas that had success in deep learning originally came from neuroscience, and making links between both worlds has been the focus of many recent studies [31]. In particular, different mechanisms have been developed to update the parameters of an artificial neural network. Backpropagation [5] has been the canonical and most used way of training a network, and many frameworks have been developed to enable efficient gradient computations thanks to the chain rule [32, 33]. However, backpropagation as commonly used has many properties that are not biologically plausible [9, 10].

In this work we developed a class of neural network models that learn as backpropagation would, but with local learning rules. To the feedforward neural network of pyramidal units (which represents a classical multilayer perceptron) we add a second network of inhibitory interneurons, which we call ghost units. Connections between the layers of both networks make the whole system a complex recurrent network. The dynamics of the models are separated into two phases, as in [14]. During the free phase, the ghost-unit network learns to replicate the feedback from the pyramidal units, and therefore to cancel any feedback coming from the upper layers. At equilibrium, the network computes the correct feedforward graph and can perform a classification task. During the weakly-clamped phase, the output pyramidal units are nudged in order to reduce the output cost function, and the error is backpropagated through the layers by the recurrent dynamics themselves. We considered two different models: in the first one (MA), each pyramidal unit has an associated ghost unit in the previous layer, and the ghost-unit network learns throughout training to replicate the pyramidal one; in the second model (MB), we consider fewer ghost units that dynamically learn to replicate the feedback from the pyramidal units at each pattern presentation. We prove for both models that, under some hypotheses, the locally defined learning rules approximate classical backpropagation, with condensed notations and proofs. Moreover, we tested both models on the MNIST classification task with different architectures (numbers of neurons, numbers of layers) and showed that these networks accomplish such tasks as well as backpropagation. We also made links with feedback-alignment [30]. Finally, we were able to loosen some of the assumptions made in our models, such as updating all the synaptic weights at all times, while still obtaining a trained network that performed correctly on the pattern-recognition task. This single-compartment model only expresses the bare properties needed for a credit-assignment mechanism and, as such, with simple notations and proofs, straightforwardly highlights a possible implementation of credit assignment in the brain.

The class of models presented in this work was built upon some properties of the mammalian cortex. In particular, the different layers of the pyramidal network represent the integration of a sensory stimulus (from other brain areas, the thalamus for example) through several layers of cortical neurons (up to higher-order regions of the brain). As in [18, 19], we also consider a population of inhibitory interneurons, which would be coherent with the neurophysiological properties and role of SST interneurons. In particular, their role would be to cancel top-down feedback from the pyramidal neurons. Recent experiments [34] showed that pyramidal neurons project back through top-down projections to the interneurons of the previous layer, which would be coherent with these interneurons replicating the activity of upper layers. Other work [35] showed that synaptic plasticity can generate a negative image of the input in the electric fish, and illustrated the importance of this kind of signal for improvements in neural coding and the detection of perturbations. The models presented in this work are able to backpropagate the output-layer error through the different layers without needing explicit gradient computations. The network is able to learn thanks to local learning rules, which have been shown to have some biological relevance and links to STDP (spike-timing-dependent plasticity) [15, 36]. We use single-compartment leaky-integrator neurons, which can be seen as a very simple approximation of biological neurons.

However, some properties of the presented network can hardly be seen as biologically plausible. Firstly, in (MA), we considered a 1-1 correspondence between the pyramidal cells and the ghost units. This is hardly true in biological neuronal networks; however, we could consider that each pyramidal cell models the activity of several pyramidal neurons and thus drop this hypothesis. We also developed (MB), where the number of ghost units can be set to an arbitrary (smaller) number, abandoning the 1-1 correspondence. For this, we suppose that the network of ghost units is able to dynamically replicate the feedback from the pyramidal cells through some rapid plastic mechanism (as seen with post-tetanic potentiation [21, 22]), which may not be directly implementable biologically without any neurotransmitter modulation. However, as we present two models that rely on opposite hypotheses, it would be possible to compose a large class of models sharing properties of both extreme cases. We could then have, at the same time, an adaptive process that learns on a long time scale how to cancel feedback but that can also adapt to variations in the input, making it closer to biological observations. One possible way to implement such a neural network would be to make the ghost units of (MA) replicate a linear combination of the pyramidal-unit activities and thereby use fewer ghost units. Training 2-layer networks with transposed feedback weights was in some cases unstable, because the transmission of the backpropagated error is proportional to the amplitude of the feedback weights, and therefore of the forward ones. Using feedback-alignment solves this issue, as the scale of the feedback weights is fixed. However, we can imagine that this problem could also be overcome in the brain through normalization mechanisms such as weight decay or weight regularization, which could be implemented by biologically plausible mechanisms.

We could also implement integrate-and-fire neurons and make the link with spiking deep networks [37] in order to go further towards biological networks. We focused on pattern recognition in this work, but ghost units could also help build biologically plausible networks for other deep learning tasks such as generative adversarial networks [38] or deep reinforcement learning, by adding other features such as the influence of neurotransmitters on local learning rules.

This work is a step towards apprehending the mechanics of learning and memory in the brain; at the same time, it raises interesting perspectives for the implementation of deep networks in neuromorphic hardware.

References

  • [1] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [3] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • [4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [5] Luis B Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the First International Conference on Neural Networks, volume 2, pages 609–618. IEEE, 1987.
  • [6] Fernando J Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987.
  • [7] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
  • [8] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain it cortical representation. PLoS computational biology, 10(11):e1003915, 2014.
  • [9] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.
  • [10] Emre O Neftci, Charles Augustine, Somnath Paul, and Georgios Detorakis. Event-driven random back-propagation: Enabling neuromorphic deep learning machines. Frontiers in neuroscience, 11:324, 2017.
  • [11] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7:13276, 2016.
  • [12] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045, 2016.
  • [13] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
  • [14] Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience, 11:24, 2017.
  • [15] Yoshua Bengio, Thomas Mesnard, Asja Fischer, Saizheng Zhang, and Yuhuai Wu. Stdp-compatible approximation of backpropagation in an energy-based model. Neural computation, 29(3):555–577, 2017.
  • [16] Thomas Mesnard, Wulfram Gerstner, and Johanni Brea. Towards deep learning with spiking neurons in energy based models with contrastive hebbian plasticity. arXiv preprint arXiv:1612.03214, 2016.
  • [17] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. eLife, 6, 2017.
  • [18] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic error backpropagation in deep cortical microcircuits. arXiv preprint arXiv:1801.00062, 2018.
  • [19] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8735–8746. Curran Associates, Inc., 2018.
  • [20] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
  • [21] Maksim V Storozhuk, Svetlana Y Ivanova, Tatyana A Pivneva, Igor V Melnick, Galina G Skibo, Pavel V Belan, and Platon G Kostyuk. Post-tetanic depression of gabaergic synaptic transmission in rat hippocampal cell cultures. Neuroscience letters, 323(1):5–8, 2002.
  • [22] Lei Xue and Ling-Gang Wu. Post-tetanic potentiation is caused by two signalling mechanisms affecting quantal size and quantal content. The Journal of Physiology, 588(24):4987–4994, 2010.
  • [23] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 2010.
  • [24] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.
  • [25] Geoffrey E. Hinton and James L. McClelland. Learning representations by recirculation. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 358–366. American Institute of Physics, 1988.
  • [26] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. In Proceedings of the 2015th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECMLPKDD’15, pages 498–515, Switzerland, 2015. Springer.
  • [27] Alexander G. Ororbia II, Ankur Mali, Daniel Kifer, and C. Lee Giles. Conducting credit assignment by aligning local representations. CoRR, abs/1803.01834, 2018.
  • [28] David Sussillo and L F Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63:544–57, 09 2009.
  • [29] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
  • [30] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7, 2016.
  • [31] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience, 10, 2016.
  • [32] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
  • [33] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [34] Marcus Leinweber, Daniel R. Ward, Jan M. Sobczak, Alexander Attinger, and Georg B. Keller. A sensorimotor circuit in mouse cortex for visual flow predictions. Neuron, 96(5):1204, 2017.
  • [35] Armen G. Enikolopov, L.F. Abbott, and Nathaniel B. Sawtell. Internally generated predictions enhance neural and behavioral detection of sensory stimuli in an electric fish. Neuron, 99(1):135 – 146.e3, 2018.
  • [36] Daniel E. Feldman. The spike timing dependence of plasticity. Neuron, 75(4):556–571, 2012.
  • [37] Johanni Brea, Walter Senn, and Jean-Pascal Pfister. Sequence learning with hidden units in spiking neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1422–1430. Curran Associates, Inc., 2011.
  • [38] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NIPS’2014, 2014.