1 Introduction
State-of-the-art artificial neural networks (ANNs) achieve impressive results in a variety of machine-intelligence tasks [sejnowski2020unreasonable]. However, they largely rely on mechanisms that diverge from their original inspiration in biological neural networks [bengio2015towards; illing2019biologically]. As a result, only a small part of this prolific field also contributes to computational neuroscience. In fact, this biological implausibility is also an important issue for machine intelligence itself. For their impressive performance, ANNs trade off other desirable properties that are present in biological systems. For example, ANN training often demands very large and labelled datasets. When labels are unavailable, self-supervised learning schemes exist, in which supervisory error signals generated by the network itself are backpropagated from the output towards the input to update the network's parameters [goodfellow2014generative; devlin2018bert; chen2020simple]. However, this global propagation of signals in deep networks introduces a further limitation: it prevents the implementation of efficient distributed computing hardware that would rely on only local signals from neighbouring physical nodes in the network, and it contrasts with the local synaptic plasticity rules that are believed to govern biological learning. Several pieces of work have addressed parts of the biological implausibility and the drawbacks of backpropagation in ANNs [bengio2015towards; lillicrap2016random; guerguiev2017towards; pfeiffer2018deep; illing2019biologically; pogodin2020kernelized; millidge2020predictive]. Recently, an approximation to backpropagation that is mostly Hebbian, i.e. relies mostly on the pre- and postsynaptic activity of each synapse, has been achieved by reducing the global error requirements to 1-bit information [pogodin2020kernelized]. Two schemes that further localize the signal required for a weight update are Equilibrium Propagation [scellier2017equilibrium] and Predictive Coding [millidge2020predictive]. Both methods approximate backpropagation through Hebbian-like learning by delegating the global aspect of the computation from a global error signal to a global convergence of the network state to an equilibrium. This equilibrium is reached through several iterative steps of feed-forward and feedback communication throughout the network before the ultimate weight update by one training example. The biological plausibility and hardware efficiency of this added iterative and heavily feedback-dependent process are open questions that are beginning to be addressed [ernoult2020equilibrium].

Moreover, learning through backpropagation, and presumably also its approximations, has another indication of biological implausibility, which also significantly limits ANN applicability: it produces networks that are confused by small adversarial perturbations of the input that are imperceptible to humans. It has recently been proposed that a defence strategy of "deflection" of adversarial attacks may be the ultimate solution to this problem [qin2020deflecting]. Under this strategy, to change the network's inferred class, the adversary is forced to generate an input that truly belongs to the distribution of a different input class. Intuitively, but also strictly by definition, deflection is achieved if a human assigns to the perturbed input the same label that the network does. Deflection of adversarial attacks in ANNs has been demonstrated by an elaborate scheme based on detecting the attacks [qin2020deflecting]. However, the human ability to deflect adversarial perturbations likely does not rely on detecting them, but rather on effectively ignoring them, making the deflecting type of robustness an emergent property of biological computation rather than a defence mechanism. The biological principles that underlie this robustness are unclear, but it might emerge from the distinct algorithms that govern learning in the brain.
Therefore, what is missing is a biologically plausible model that can learn from fewer data points, without labels, through local plasticity, and without feedback from distant layers. Such a model could then be tested for emergent adversarial robustness. A good candidate category of biological networks and learning algorithms is competitive learning. Neurons that compete for their activation through lateral inhibition are a common connectivity pattern in the superficial layers of the cerebral cortex [douglas2004neuronal; binzegger2004quantitative]. This pattern is described as winner-take-all (WTA), because competition suppresses the activity of weakly activated neurons and emphasizes strongly activated ones. Combined with Hebbian-like plasticity rules, WTA connectivity gives rise to competitive-learning algorithms. These networks and learning schemes have long been studied [von1973self], and a large literature based on simulations and analyses describes their functional properties. A WTA neuronal layer, depending on its specifics, can restore missing input signals [rutishauser2011collective; diehl2016learning], perform decision making, i.e. winner selection [hahnloser1999feedback; maass2000computational; rutishauser2011collective], and generate oscillations such as those that underlie brain rhythms [cannon2014neurosystems]. Perhaps more importantly, its neurons can learn to become selective to different input patterns, such as the orientation of visual bars in models of the primary visual cortex [von1973self], MNIST handwritten digits [nessler2013PLoS; diehl2015unsupervised; krotov2019unsupervised], CIFAR-10 objects [krotov2019unsupervised], and spatio-temporal spiking patterns [nessler2013PLoS], and can adapt dynamically to model changing objects [moraitis2020shortterm]. The WTA model is indeed biologically plausible, Hebbian plasticity is local, and learning is input-driven, relying on only feed-forward communication between neurons – properties that address several of the limitations of ANNs.
However, the model's applicability has been limited to simple tasks, because a theory that describes Hebbian WTA learning algorithms in formal terms of objective optimization has been lacking, except under the assumption of stochastic spiking neurons combined with population coding of the inputs [nessler2009stdp; nessler2013PLoS]. These assumptions limit the compatibility of Hebbian WTA theory with the powerful theoretical and practical tools of ANNs. Recently, when WTA networks were studied in a theoretical framework compatible with conventional machine learning (ML), but in the context of short-term as opposed to long-term Hebbian plasticity, they showed surprising practical advantages over supervised ANNs [moraitis2020shortterm]. A similar theoretical approach could also reveal unknown advantages of long-term Hebbian plasticity in WTA networks. In addition, it could provide insights into how a WTA microcircuit could participate in larger-scale computation by deeper cortical or artificial networks.

Here we construct "SoftHebb", a biologically plausible WTA model that is based on standard rate-based neurons as in ANNs, can accommodate various activation functions, and learns without labels, using local plasticity and only feed-forward communication, i.e. the properties we seek in an ANN. Importantly, it is equipped with a simple normalization of the layer's activations and an optional temperature-scaling mechanism [hinton2015distilling], producing a soft WTA instead of selecting a single "hard" winner neuron. This allows us to prove formally that a SoftHebb layer is a generative mixture model that objectively minimizes its Kullback-Leibler (KL) divergence from the input distribution through Bayesian inference, thus providing a formal ML-theoretic understanding of these networks. We complement our main results, which are theoretical, with experiments that are small-scale but produce intriguing results. As a generative model, SoftHebb has a broader scope than classification, but we test it in simulations on the tasks of recognizing MNIST handwritten digits and Fashion-MNIST fashion products. First, we confirm that SoftHebb is more accurate than a hard-WTA model. Second, we validate that it minimizes a loss function (cross-entropy) even though it has no access to it or to labels during learning. In addition, likely owing to its Bayesian and generative properties, the unsupervised WTA model outperforms a supervised two-layer perceptron in several respects: learning speed and accuracy in the first presentation of the training dataset, robustness to noisy data, and increased robustness to one of the strongest white-box adversarial attacks, i.e. projected gradient descent (PGD) [madry2017towards], without any explicit defence. Interestingly, the SoftHebb model also exhibits inherent properties of deflection [qin2020deflecting] of adversarial attacks, and generates object interpolations.

2 Theory
Definition 2.1 (The input assumptions).
Each observation $\mathbf{x}$ is generated by a hidden "cause" or "class" $C_k$ from a finite set of $K$ possible such causes. Therefore, the data is generated by a mixture of the probability distributions attributed to each of the classes:

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}|C_k)\, P(C_k). \quad (1)$$

In addition, the dimensions $x_i$ of $\mathbf{x}$ are conditionally independent from each other, i.e. $p(\mathbf{x}|C_k) = \prod_i p(x_i|C_k)$. The number $K$ of the true causes or classes of the data is assumed to be known.
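For concreteness, the generative process of Definition 2.1 can be simulated in a few lines. The following Python sketch is only illustrative: the Gaussian noise model, the toy priors, and all numeric values are our assumptions, not specified by the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 5                                  # number of hidden causes, input dimensions
priors = np.array([0.5, 0.3, 0.2])           # P(C_k), toy values
means = rng.uniform(0.1, 1.0, size=(K, D))   # per-cause mean of each dimension

def sample(n):
    """Draw n observations: pick a hidden cause, then draw every dimension
    independently given that cause (conditional independence of Def. 2.1)."""
    ks = rng.choice(K, size=n, p=priors)
    # independent noise per dimension, conditioned on the chosen cause
    x = means[ks] + 0.05 * rng.standard_normal((n, D))
    return x, ks

X, causes = sample(10_000)
# empirical cause frequencies approximate the mixture priors P(C_k)
freq = np.bincount(causes, minlength=K) / len(causes)
```

In a labelled dataset, `causes` would be hidden and only `X` would be presented to the model.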
The term "cause" is used here in the sense of causal inference. It is important to emphasize that the true cause of each input is hidden, i.e. not known. In the case of a labelled dataset, the labels are deleted before presenting the training data to the model. We choose a mixture model that corresponds to the data assumptions but is also interpretable in neural terms (Paragraph 2.4):
Definition 2.2 (The generative probabilistic mixture model).
We consider a mixture model distribution $q(\mathbf{x})$ approximating the data distribution $p(\mathbf{x})$. We choose specifically a mixture of exponentials, and we parametrize the prior also as an exponential, specifically:

$$q(\mathbf{x}|C_k; \mathbf{w}_k) = \frac{1}{Z}\, e^{u_k}, \quad (2)$$
$$Q(C_k; b_k) = e^{b_k}. \quad (3)$$

The model we have chosen is a reasonable choice because it factorizes similarly to the data of Definition 2.1:

$$q(\mathbf{x}|C_k; \mathbf{w}_k) = \frac{1}{Z} \prod_i e^{w_{ik} x_i / \left(\|\mathbf{w}_k\| \|\mathbf{x}\|\right)}, \quad (4)$$

where $u_k = \frac{\mathbf{w}_k \cdot \mathbf{x}}{\|\mathbf{w}_k\| \|\mathbf{x}\|}$, i.e. the cosine similarity of the two vectors. A similar probabilistic model was used in related previous theoretical work [nessler2009stdp; nessler2013PLoS; moraitis2020shortterm], but for different data assumptions and with certain further constraints on the model. Namely, [nessler2009stdp; nessler2013PLoS] considered data that was binary and created by a population code, while the model was stochastic. These works provide the foundation of our derivation, but here we consider the more generic scenario where data are continuous-valued and input directly into the model, which is deterministic and, as we will show, more compatible with standard ANNs. In [moraitis2020shortterm], data had particular short-term temporal dependencies, whereas here we consider the distinct case of independent and identically distributed (i.i.d.) input samples. The Bayes-optimal parameters of a model mixture of exponentials can be found analytically as functions of the input distribution's parameters, and the model is equivalent to a soft winner-take-all neural network [moraitis2020shortterm]. After describing this, we will prove that Hebbian plasticity of synapses, combined with local plasticity of the neuronal biases, sets the parameters to their optimal values.

Theorem 2.3 (The optimal parameters of the model).
The parameters that minimize the KL divergence of such a mixture model $q(\mathbf{x})$ from the data distribution $p(\mathbf{x})$ are, for every $k$,

$$b_k^* = \ln P(C_k), \quad (5)$$
$$\mathbf{w}_k^* = \frac{\boldsymbol{\mu}_k}{\|\boldsymbol{\mu}_k\|}, \quad (6)$$

where $\boldsymbol{\mu}_k = E\left[\mathbf{x}|C_k\right]$ is the mean of the distribution $p(\mathbf{x}|C_k)$, and $P(C_k)$ is the prior probability of the $k$-th component.

In other words, the optimal parameter vector of each component in this mixture is proportional to the mean of the corresponding component of the input distribution, i.e. it is a centroid of the component. In addition, the optimal parameter $b_k^*$ of the model's prior is the logarithm of the corresponding component's prior probability. The Theorem's proof was provided in the supplementary material of [moraitis2020shortterm], but for completeness we also provide it in our Appendix. These centroids and priors of the input's component distributions, as well as the method of their estimation, however, differ for different input assumptions, and we will derive a learning rule that provably sets the parameters to their maximum-likelihood estimate for the inputs addressed here. The learning rule is a Hebbian type of synaptic plasticity combined with a plasticity rule for the neuronal biases. Before providing the rule and the related proof, we describe how our mixture model is equivalent to a WTA neural network.
2.4 Equivalence of the probabilistic model to a WTA neural network
The cosine similarity $u_k$ between the input vector $\mathbf{x}$ and each centroid's parameters $\mathbf{w}_k$ underpins the model (Eq. 4). This similarity is precisely computed by a linear neuron that receives normalized inputs $\mathbf{x}/\|\mathbf{x}\|$ and that normalizes its vector of synaptic weights: $\|\mathbf{w}_k\| = 1$. Specifically, the neuron's summed weighted input then determines the cosine similarity of an input sample to the weight vector, thus computing the likelihood function of each component of the input mixture (Eq. 2). The bias term $b_k$ of each neuron can store the parameter of the prior $Q(C_k)$. Based on these, it can also be shown that a set of such neurons can compute the Bayesian posterior, if the neurons are connected in a configuration that implements softmax. Softmax has a biologically plausible implementation through lateral inhibition between neurons [nessler2009stdp; nessler2013PLoS; moraitis2020shortterm]. Specifically, based on the model of Definition 2.2, the posterior probability is

$$Q(C_k|\mathbf{x}) = \frac{e^{u_k + b_k}}{\sum_{l=1}^{K} e^{u_l + b_l}} = \mathrm{softmax}_k\left(\mathbf{u} + \mathbf{b}\right). \quad (7)$$

But in the neural description, $u_k + b_k$ is the activation of the $k$-th linear neuron. That is, Eq. 7 shows that the result of Bayesian inference of the hidden cause from the input is found by a softmax operation on the linear neural activations. In this equivalence, we will use $y_k$ to symbolize the softmax output of the $k$-th neuron, i.e. the output after the WTA operation, interchangeably with $Q(C_k|\mathbf{x})$. It can be seen in Eq. 7 that the probabilistic model has one more, alternative, but equivalent neural interpretation. Specifically, $y_k$ can be described as the output of a neuron with an exponential activation function (the numerator in Eq. 7) that is normalized by its layer's total output (the denominator). This is equally accurate, and more directly analogous to the biological description [nessler2009stdp; nessler2013PLoS; moraitis2020shortterm]. It shows that the exponential activation of each individual neuron directly equals the $k$-th exponential component distribution of the generative mixture model (Eq. 4). Therefore, the softmax-configured linear neurons, or the equivalent normalized exponential neurons, fully implement the generative model of Definition 2.2, and also infer the Bayesian posterior probability given an input and the model parameters. However, calculating the model's parameters from data samples is a difficult problem if the input distribution's parameters are unknown. In the next sections we show that this neural network can find these optimal parameters through Bayesian inference, in an unsupervised and online manner, based on only local Hebbian plasticity.
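The inference step of such a layer (cosine similarities, biases, and a softmax) can be sketched as follows. This is an illustrative sketch with toy sizes; the function name and dimensions are ours.

```python
import numpy as np

def softhebb_posterior(x, W, b):
    """Posterior over causes computed by a soft WTA layer (cf. Eq. 7).
    W: (K, D) synaptic weight matrix, b: (K,) biases, x: (D,) input."""
    # cosine similarity of the input with each neuron's weight vector
    u = (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x))
    logits = u + b                 # log-likelihood term plus log-prior term
    logits = logits - logits.max() # subtract max for numerical stability
    y = np.exp(logits)
    return y / y.sum()             # softmax = Bayesian posterior over the K causes

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))    # 4 neurons, 6 inputs (toy sizes)
b = np.log(np.full(4, 0.25))       # biases storing a uniform prior
x = rng.standard_normal(6)
y = softhebb_posterior(x, W, b)    # sums to 1 over the 4 neurons
```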
2.5 A Hebbian rule that optimizes the weights
Several Hebbian-like rules exist and have been combined with WTA networks. For example, in the case of stochastic binary neurons and binary population-coded inputs, it has been shown that weight updates with an exponential weight-dependence find the optimal weights [nessler2009stdp; nessler2013PLoS]. Oja's rule is another candidate [oja1982simplified]. An individual linear neuron equipped with this learning rule finds the first principal component of the input data [oja1982simplified]. A variation of Oja's rule combined with hard-WTA networks and additional mechanisms has achieved good experimental performance on classification tasks [krotov2019unsupervised], but lacks the theoretical underpinning that we aim for. Here we propose a Hebbian-like rule for which we will show that it optimizes the soft WTA's generative model. The rule is similar to Oja's rule, but considers, for each neuron $k$, both its linear weighted summation of the inputs $u_k$ and its nonlinear output $y_k$ of the WTA:

$$\Delta w_{ik} = \eta \, y_k \left(x_i - u_k w_{ik}\right), \quad (8)$$

where $w_{ik}$ is the synaptic weight from the $i$-th input to the $k$-th neuron, and $\eta$ is the learning-rate hyperparameter. By solving the equation $E\left[\Delta w_{ik}\right] = 0$, where $E[\cdot]$ is the expected value over the input distribution, we can show that, with this rule, there exists a stable equilibrium value of the weights, and this equilibrium value is optimal according to Theorem 2.3:

Theorem 2.5.
The equilibrium weights of the SoftHebb synaptic plasticity rule are

$$\mathbf{w}_k^* = \frac{E\left[\mathbf{x}|C_k\right]}{\left\|E\left[\mathbf{x}|C_k\right]\right\|} = \frac{\boldsymbol{\mu}_k}{\|\boldsymbol{\mu}_k\|}. \quad (9)$$
The proof is provided in the Appendix. Therefore, our update rule (Eq. 8) optimizes the weights of the neural network.
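This equilibrium can be checked numerically with a toy simulation (an illustrative sketch, not the paper's experimental code): a single neuron whose output is clamped to $y_k = 1$ is driven by samples of one input component, and its weights drift towards the normalized component mean.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8
mu = rng.uniform(0.5, 1.5, D)        # mean of one input component (toy data)
w = rng.uniform(0.1, 1.0, D)         # initial synaptic weights
eta = 0.01                           # learning rate

for _ in range(5000):
    x = mu + 0.1 * rng.standard_normal(D)  # sample from the component
    u = w @ x                               # linear weighted summation u_k
    y = 1.0                                 # single neuron: it always "wins"
    w += eta * y * (x - u * w)              # plasticity rule of Eq. 8

w_star = mu / np.linalg.norm(mu)     # equilibrium predicted by Theorem 2.5
```

After training, `w` lies close to `w_star`, and its norm is close to 1, i.e. the rule self-normalizes the weight vector as Oja's rule does.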
2.6 Local learning of neuronal biases as Bayesian priors
For the complete optimization of the model, the neuronal biases must also be optimized to satisfy Eq. 5, i.e. to optimize the Bayesian prior belief in the probability distribution over the input causes. We define the following rate-based rule, inspired by the spike-based bias rule of [nessler2013PLoS]:

$$\Delta b_k = \eta_b \left( y_k \, e^{-b_k} - 1 \right). \quad (10)$$
With the same technique we used for Theorem 2.5, we also provide proof in the Appendix that the equilibrium of the bias with this rule matches the optimal value of Theorem 2.3:
Theorem 2.6.
The equilibrium biases of the SoftHebb bias learning rule are

$$b_k^* = \ln P(C_k). \quad (11)$$
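A similar toy check applies to the bias rule. In this illustrative sketch (our own, with one-hot "hard" outputs standing in for the posterior $y_k$), the biases converge to the logarithms of the hidden priors:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
p_true = np.array([0.4, 0.3, 0.2, 0.1])   # hidden priors P(C_k), toy values
b = np.zeros(K)                            # initial biases
eta_b = 0.001                              # bias learning rate

for k in rng.choice(K, size=100_000, p=p_true):
    y = np.zeros(K)
    y[k] = 1.0                             # one-hot stand-in for the WTA output
    b += eta_b * (y * np.exp(-b) - 1.0)    # bias plasticity rule of Eq. 10

b_star = np.log(p_true)                    # equilibrium predicted by Theorem 2.6
```

Setting $E[\Delta b_k] = 0$ gives $E[y_k]\,e^{-b_k} = 1$, i.e. $b_k = \ln E[y_k]$, which matches $\ln P(C_k)$ when the outputs track the true causes.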
2.7 Alternate activation functions
The model of Definition 2.2 uses for each component an exponential probability distribution with the base of Euler's number $e$, equivalent to a model using similarly exponential neurons (Subsection 2.4). Depending on the task, different probability-distribution shapes, i.e. different neuronal activation functions, may be better models. This is compatible with our theory. Firstly, the base $B$ of the exponential activation function can be chosen differently, resulting in a softmax function with a different base, such that Eq. 7 becomes, more generally, $Q(C_k|\mathbf{x}) = B^{u_k + b_k} / \sum_{l=1}^{K} B^{u_l + b_l}$. This is reminiscent of temperature scaling [hinton2015distilling], a different mechanism that could also be used, as it also maintains the probabilistic interpretation of the output, and changes the exponent of the function rather than the base. Both types of change to the model can be implemented by a normalized layer of exponential neurons, and both are compatible with our theoretical derivations and with the optimization by the plasticity rule of Eq. 8. Moreover, we show in the Appendix that soft WTA models can be constructed from rectified linear units (ReLU), or in general from neurons with any non-negative monotonically increasing activation function, and that their weights are optimized by the same plasticity rule.
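A base-$B$ softmax is equivalent to a standard softmax whose logits are scaled by $\ln B$, i.e. an inverse temperature of $\ln B$, which a short sketch makes explicit (the helper name is ours; the base of 1000 mirrors the value used in our experiments of Section 3.1):

```python
import numpy as np

def softmax_base(z, base=np.e):
    """Softmax with an arbitrary exponential base B.
    Since B**z == exp(z * ln B), this is a standard softmax on
    logits scaled by ln B."""
    z = z * np.log(base)
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([0.2, 0.5, 0.1])           # toy activations u_k + b_k
y_soft = softmax_base(z, base=np.e)     # ordinary softmax
y_sharp = softmax_base(z, base=1000.0)  # sharper, closer to a hard WTA
```

A larger base concentrates the output on the winner while preserving the probabilistic interpretation.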
2.8 Cross-entropy and true causes, as opposed to labels
It is important to note that, in labelled datasets, the labels assigned by a human supervisor may not correspond exactly to the true causes that generate the data, which SoftHebb infers. For example, consider MNIST. The 10 labels indicating the 10 decimal digits do not correspond exactly to the true cause of each example image. In reality, the cause of each MNIST example, in the sense implied by causal inference, is not the digit cause itself, but a combination of a single digit cause, which is the MNIST label, with one of many handwriting styles. That is, the probabilistic model is such that, in Eq. 1 of Definition 2.1, the cause of each sample is dual, i.e. for each sample there exist a digit cause $C_l^{digit}$ and a style cause $C_m^{style}$ such that

$$C_k = \left(C_l^{digit}, C_m^{style}\right), \quad (12)$$
$$p(\mathbf{x}|C_k) = p\left(\mathbf{x}\,\middle|\,C_l^{digit}, C_m^{style}\right). \quad (13)$$

This is important for our unsupervised model because, first, it can make the assumption from Definition 2.1 that the number $K$ of input causes is known problematic. Practically speaking, $K$ can be chosen using common heuristics from cluster analysis. Second, it makes the evaluation of the loss of a trained SoftHebb model based on test labels not straightforward. We now provide the theoretical tools for achieving this. Even though SoftHebb is a generative model, it can be used for discrimination of the input classes, using Bayes' theorem. More formally, the proof of Theorem 2.3 involved showing that SoftHebb minimizes the KL divergence $D_{KL}(p\,\|\,q)$ of the model from the data. Based on this, it can be shown that the algorithm also minimizes the cross-entropy $H(C, Y)$ of the causes $Y$ that it infers from the true causes $C$ of the data:

$$\min_{\mathbf{w}, \mathbf{b}} D_{KL}(p\,\|\,q) \implies \min H(C, Y). \quad (14)$$

An additional consequence is that, by minimizing $H(C, Y)$, SoftHebb also minimizes the label-based cross-entropy $H(L, \hat{L})$ between the true labels $L$ and the implicitly inferred labels $\hat{L}$:

$$\min H(C, Y) \implies \min H(L, \hat{L}). \quad (15)$$

This is because, in Eqs. 13 and 14, the dependence of the labels on the true causes is fixed by the data-generation process. To obtain and measure the cross-entropy $H(L, \hat{L})$, the causal structure linking causes to labels is missing, but it can be represented by a supervised classifier on top of SoftHebb's outputs, trained using the labels $L$. Therefore, by (a) unsupervised training of SoftHebb, then (b) training a supervised classifier on top, and finally (c1) repeating the training of SoftHebb with the same initial weights and ordering of the training inputs, while (c2) measuring the trained classifier's loss, we can observe the cross-entropy loss of SoftHebb while it is being minimized, and infer that $H(C, Y)$ is also minimized (Eq. 15). We call this the post-hoc cross-entropy method, and we have used it in our experiments (Section 3.2 and Fig. 1C and D).

3 Experiments
3.1 MNIST accuracy vs hard WTA
We implemented the theoretical SoftHebb model in simulations and tested it on the task of learning to classify MNIST handwritten digits. The network received the MNIST frames normalized by their Euclidean norm, and the plasticity rule we derived updated its weights and biases in an unsupervised manner. We used 2000 neurons. First we trained the network for 100 epochs, i.e. randomly ordered presentations of the 60000 training digits. In our validation testing we found that softmax with a base of 1000 (see Section 2.7) performed best. The learning rate of Eq. 8 decreased linearly from 0.03 to 0 throughout training. Each training experiment we describe was repeated five times with varying random initializations and input order, and we report the mean and standard deviation of the accuracies. Inference of the input labels by the WTA network of 2000 neurons was performed in two different ways. The first approach is single-layer: after training the network, we assigned a label to each of the 2000 neurons, in a standard approach used in unsupervised clustering. Namely, for each neuron, we found the label of the training set that makes it win the WTA competition most often. In this single-layer approach, this is the only time when labels were used, and at no point were weights updated using labels. The second approach was two-layer, based on supervised training of a perceptron classifier on top of the WTA layer. The classifier layer was trained with the Adam optimizer and cross-entropy loss for 60 epochs, while the previously trained WTA parameters were frozen.
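The single-layer labelling procedure can be sketched as follows (the helper name and the toy data are ours):

```python
import numpy as np

def assign_neuron_labels(winners, labels, n_neurons, n_classes):
    """Give each neuron the training-set label under which it wins
    the WTA competition most often."""
    counts = np.zeros((n_neurons, n_classes), dtype=int)
    for k, lab in zip(winners, labels):
        counts[k, lab] += 1          # tally wins of neuron k per label
    return counts.argmax(axis=1)     # majority label per neuron

# toy run: 3 neurons, 2 classes
winners = np.array([0, 0, 1, 2, 2, 2])   # winning neuron per training sample
labels  = np.array([1, 1, 0, 0, 0, 1])   # training label per sample
neuron_labels = assign_neuron_labels(winners, labels, 3, 2)
# neuron 0 wins mostly on class 1; neurons 1 and 2 mostly on class 0
```

At test time, an input is classified with the label assigned to its winning neuron; no weights are ever updated with labels.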
SoftHebb achieved accuracies of and in its 1-layer and 2-layer forms, respectively. To confirm the strength of the soft WTA approach combined with training the priors through the biases, which makes the network Bayesian, we also trained the weights of a network with a hard-WTA setup, i.e. one where the strongest-activated neuron's output is 1 and the other neurons are suppressed to 0 for each input. We found that an initial learning rate of 0.05 was best for the hard-WTA network. The SoftHebb model outperformed the hard WTA (Fig. 1A). However, SoftHebb's accuracy is significantly lower than that of a multilayer perceptron (MLP) with one hidden layer of also 2000 neurons, trained end-to-end exhaustively. The MLP achieves a accuracy (not shown in the figure). This is expected, due to end-to-end training, supervision, and the MLP being a discriminative model, as opposed to a generative model merely applied to a classification task, as SoftHebb is. If the Bayesian and generative aspects that follow from our theory were not required, several additional mechanisms exist to enhance the discriminative power of WTA networks [krotov2019unsupervised], and even a random projection layer instead of a trained WTA performs well [illing2019biologically]. The generative approach, however, has its own advantages even for a discriminative task, and we show some of these here.
3.2 Cross-entropy minimization and single-epoch advantage over backpropagation
First, as a validation of the theory, we show that the SoftHebb model minimizes the cross-entropy of its activations from its input's causes, even though no explicit loss is provided. According to our post-hoc cross-entropy method (Section 2.8), as a proxy we observed the minimization of the label-based cross-entropy during the first epoch of online Hebbian learning. The loss on the training inputs as they appear (running loss), as well as on the whole testing dataset, can be seen in Fig. 1C and D respectively (blue curves). The method allows us to observe the discriminative aspect of the generative model as it is optimized. After this one epoch, the accuracy of the 1-layer form of the SoftHebb model is . The 2-layer form is again obtained by training a supervised classifier output layer for 60 epochs, and its accuracy is (Fig. 1B, blue bars). We then also trained for a single epoch a 2-layer MLP with a hidden layer of 2000 neurons, with backpropagation by stochastic gradient descent (SGD) and a cross-entropy loss. We found, through grid search, the optimal minibatch size and learning rate of the MLP (4 and 0.2 respectively). The MLP achieves an accuracy of (Fig. 1B, orange bar), if we exclude one run of the experiment which only achieved an accuracy of 86.92%. Surprisingly, it does not surpass SoftHebb, not even SoftHebb's 1-layer form. In addition, the cross-entropy of the SoftHebb model is visibly minimized faster than through SGD (orange curves of Fig. 1C & D). It is possible that SoftHebb's advantage in terms of loss and accuracy is a side effect of pre-training the second layer when obtaining SoftHebb's post-hoc cross-entropy, or of that layer's 60-epoch training. To test this possibility, we similarly obtained a trained version of the MLP's output layer alone, and then trained its first layer with backpropagation while the second layer was frozen. Meanwhile, we recorded its loss, thus obtaining its own version of the post-hoc cross-entropy (Fig. 1C & D, yellow curve). SoftHebb still showed an advantage in terms of loss-minimization speed, and its 2-layer form's accuracy is still not surpassed (Fig. 1B, blue & yellow bars), despite the fully unsupervised and local learning in the core of the network. Moreover, the figure shows that the minimization of the loss on the general test set by SoftHebb is smoother than the running loss, while SGD's test-set loss is influenced by the specifics of the individual training examples. This may indicate stronger generalization by the SoftHebb model, emerging from its Bayesian and generative nature. If this is true, SoftHebb may be more robust to input perturbations.

3.3 Robustness to noise and adversarial attacks – Generative adversarial properties
Indeed, we tested the trained SoftHebb and MLP models for robustness, and found that SoftHebb is significantly more robust than the backprop-trained MLP, both to added Gaussian noise and to PGD adversarial attacks (see Fig. 2). PGD [madry2017towards] produces perturbations in a direction that maximizes the loss of each targeted network, with a size controlled by a parameter ε. Strikingly, SoftHebb has a visible tendency to deflect the attacks, i.e. its confusing examples actually belong to a perceptually different class (Fig. 2B and 3). This effectively nullifies the attack, and was previously shown in elaborate state-of-the-art adversarial-defence models [qin2020deflecting]. The pair of the adversarial attacker with the generative SoftHebb model essentially composes a generative adversarial network (GAN), even though the term is usually reserved for pairs trained in tandem [goodfellow2014generative; creswell2018generative]. As a result, the model could inherit certain properties of GANs. Indeed, it is able to generate interpolations between input classes (Fig. 3). The parameter ε of the adversarial attack can control the balance between the interpolated objects. Similar functionality has existed in the realm of GANs [radford2015unsupervised; berthelot2018understanding] and other deep neural networks [bojanowski2017optimizing], but was not known for simple biologically plausible models.

3.4 Generalizability of the algorithm to other datasets: Fashion-MNIST
Finally, we trained the SoftHebb model on a more difficult dataset, namely Fashion-MNIST [xiao2017/online], which contains greyscale images of clothing products. A supervised MLP of the same size, which we trained as a reference, achieved a test accuracy of on this dataset. We used the exact same SoftHebb model and hyperparameters that we used on MNIST to learn F-MNIST, without any adjustment for the changed dataset. Despite this, the model achieved an accuracy of . In addition, with very small adversarial perturbations, the MLP drops to an accuracy lower than the SoftHebb model's despite our generic training, while SoftHebb's adversarial and noise robustness is reconfirmed (dashed lines in Fig. 2), as is its ability to generate interpolations (Fig. 3B).
4 Discussion
In summary, we have described SoftHebb, a biologically plausible model that is completely unsupervised, local, and requires no error or other feedback from upper layers. The model consists of elements fully compatible with conventional ANNs. We have shown the importance of soft competition in rate-based WTA networks, and formally derived the type of plasticity that optimizes the network through Bayesian computation. SoftHebb learns a generative model of the input distribution. We also formalized its unsupervised discriminative properties, and we developed a method for quantifying its discriminative loss in a theoretically sound manner. Our experiments are small, but they confirm our optimization theory and demonstrate that SoftHebb has significant strengths that emerge from its unsupervised, generative, and Bayesian nature. It is intriguing that properties commonly associated with biological intelligence, such as speed of learning, robustness to noise and adversarial attacks, and deflection of the attacks, emerge from the model's biological plausibility. In particular, its ability to learn better than even supervised networks when training time is limited is interesting for neuromorphic applications targeting resource-limited learning tasks. Its robustness to noise and adversarial attacks is impressive, considering that it is intrinsic and was not instilled by specialized defences. SoftHebb has the inherent tendency not merely to be robust to attacks, but actually to deflect them. We also showed that these networks can generate interpolations between points in the latent space. However, the model is quite limited compared to the state of the art in ML, if classification accuracy, exhaustive training, and unperturbed inputs are the benchmark.
To address this, its potential integration into multilayer networks should be explored, using theoretical tools such as those we developed here and others from the literature [nessler2009stdp; nessler2013PLoS; krotov2019unsupervised]. This could also provide insights into the role of WTA microcircuits in the context of larger-scale computations in the cortex.
References

(1)
Sejnowski, T. J.
The unreasonable effectiveness of deep learning in artificial intelligence.
Proceedings of the National Academy of Sciences 117, 30033–30038 (2020).  (2) Bengio, Y., Lee, D.H., Bornschein, J., Mesnard, T. & Lin, Z. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156 (2015).
 (3) Illing, B., Gerstner, W. & Brea, J. Biologically plausible deep learning—but how far can we go with shallow networks? Neural Networks 118, 90–101 (2019).
 (4) Goodfellow, I. J. et al. Generative adversarial networks. arXiv preprint arXiv:1406.2661 (2014).
 (5) Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
 (6) Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607 (PMLR, 2020).
 (7) Lillicrap, T. P., Cownden, D., Tweed, D. B. & Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications 7, 1–10 (2016).
 (8) Guerguiev, J., Lillicrap, T. P. & Richards, B. A. Towards deep learning with segregated dendrites. ELife 6, e22901 (2017).
 (9) Pfeiffer, M. & Pfeil, T. Deep learning with spiking neurons: opportunities and challenges. Frontiers in neuroscience 12, 774 (2018).
 (10) Pogodin, R. & Latham, P. E. Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks. arXiv preprint arXiv:2006.07123 (2020).
 (11) Millidge, B., Tschantz, A. & Buckley, C. L. Predictive coding approximates backprop along arbitrary computation graphs. arXiv preprint arXiv:2006.04182 (2020).

(12) Scellier, B. & Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience 11, 24 (2017).
 (13) Ernoult, M., Grollier, J., Querlioz, D., Bengio, Y. & Scellier, B. Equilibrium propagation with continual weight updates. arXiv preprint arXiv:2005.04168 (2020).
 (14) Qin, Y., Frosst, N., Raffel, C., Cottrell, G. & Hinton, G. Deflecting adversarial attacks. arXiv preprint arXiv:2002.07405 (2020).
 (15) Douglas, R. J. & Martin, K. A. Neuronal circuits of the neocortex. Annu. Rev. Neurosci. 27, 419–451 (2004).
 (16) Binzegger, T., Douglas, R. J. & Martin, K. A. A quantitative map of the circuit of cat primary visual cortex. Journal of Neuroscience 24, 8441–8453 (2004).
 (17) Von der Malsburg, C. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85–100 (1973).
 (18) Rutishauser, U., Douglas, R. J. & Slotine, J.-J. Collective stability of networks of winner-take-all circuits. Neural computation 23, 735–773 (2011).
 (19) Diehl, P. U. & Cook, M. Learning and inferring relations in cortical networks. arXiv preprint arXiv:1608.08267 (2016).
 (20) Hahnloser, R., Douglas, R. J., Mahowald, M. & Hepp, K. Feedback interactions between neuronal pointers and maps for attentional processing. Nature neuroscience 2, 746–752 (1999).
 (21) Maass, W. On the computational power of winner-take-all. Neural computation 12, 2519–2535 (2000).
 (22) Cannon, J. et al. Neurosystems: brain rhythms and cognitive processing. European Journal of Neuroscience 39, 705–719 (2014).
 (23) Nessler, B., Pfeiffer, M., Buesing, L. & Maass, W. Bayesian computation emerges in generic cortical microcircuits through spike-timing-dependent plasticity. PLoS computational biology 9, e1003037 (2013).
 (24) Diehl, P. U. & Cook, M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience 9, 99 (2015).
 (25) Krotov, D. & Hopfield, J. J. Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences 116, 7723–7731 (2019).
 (26) Moraitis, T., Sebastian, A. & Eleftheriou, E. Short-term synaptic plasticity optimally models continuous environments. arXiv preprint arXiv:2009.06808 (2020).
 (27) Nessler, B., Pfeiffer, M. & Maass, W. STDP enables spiking neurons to detect hidden causes of their inputs. Advances in neural information processing systems 22, 1357–1365 (2009).
 (28) Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
 (29) Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).
 (30) Oja, E. Simplified neuron model as a principal component analyzer. Journal of mathematical biology 15, 267–273 (1982).
 (31) Creswell, A. et al. Generative adversarial networks: An overview. IEEE Signal Processing Magazine 35, 53–65 (2018).
 (32) Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
 (33) Berthelot, D., Raffel, C., Roy, A. & Goodfellow, I. Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543 (2018).
 (34) Bojanowski, P., Joulin, A., Lopez-Paz, D. & Szlam, A. Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776 (2017).
 (35) Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
Appendix A Theorem Proofs
Proof of Theorem 2.3.
The parameters of the model are optimal if they minimize the model’s Kullback-Leibler divergence from the data distribution. Because is independent from , and is independent from for every , we can find the set of parameters that minimize the KL divergence of the mixtures by minimizing the KL divergence of each component , and simultaneously setting (16)
From Eq. 3 and this last condition, Eq. 5 of the Theorem is proven:
Further,
(17)  
(18)  
(19) 
where, for Eq. 17, we used the fact that is a constant, because it is determined by the environment’s data and not by the model’s parametrization on . Eq. 18 follows from the definition of . The result in Eq. 19 is the mean value of the cosine similarity .
Due to the symmetry of the cosine similarity, it follows that
(20)  
(21) 
Enforcing the requirement that the vector be normalized leads to the unique solution .
∎
Proof of Theorem 2.5.
We will find the equilibrium point of the SoftHebb plasticity rule, i.e. the weight that implies .
To find this, we will first find the equilibrium point of a similar plasticity rule with a simpler equation:
(22) 
by setting . We will expand this expected value based on the plasticity rule itself, and on the joint probability distribution of the input and the neuronal activation .
To arrive at Eq. 24, we assume that the probability densities that correspond to the components of the input mixture are distributed without a bias throughout the input dimension , and therefore
(26) 
The result is equivalent if a more relaxed assumption is made. Specifically, if we assume that the input distribution is in fact biased, such that
(27) 
then the result in Eq. 25 is modified by an added constant :
(28) 
But is the same for all neurons , therefore this effect is canceled by the softmax normalization, such that the output of each neuron remains unaffected by :
(29) 
As a result, the assumption
that we used in our derivation is of little importance.
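The cancellation of the shared constant in Eq. 29 is simply the shift-invariance of the softmax: adding the same constant to every neuron's pre-activation leaves the normalized outputs unchanged. A quick numeric check (the logits and constant here are illustrative):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())  # max-shift for numerical stability
    return e / e.sum()

u = np.array([0.3, -1.2, 2.0, 0.5])   # illustrative pre-activations
c0 = 7.0                               # constant shared by all neurons
# The shared additive constant cancels in the softmax normalization,
# so each neuron's output is unaffected by it.
assert np.allclose(softmax(u), softmax(u + c0))
```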
The difference of the SoftHebb plasticity rule
(30) 
from the simplified rule of Eq. 22 is the multiplicative factor . This factor is common to our rule and Oja’s rule oja1982simplified . The effect of this factor is known to normalize the weight vector of each neuron to a length of one oja1982simplified , as also shown in similar rules with this multiplicative factor krotov2019unsupervised . We prove that the factor has this effect in the SoftHebb rule as well, separately in Theorem A.1 and its Proof, provided at the end of the present Appendix A.
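The combination of soft WTA competition with an Oja-like multiplicative factor can be sketched in code. This is a minimal illustrative implementation, not the paper's exact Eq. 30: we assume the update takes the Oja-like form Δw_k = η·y_k·(x − u_k·w_k), with y the softmax output and u_k = w_k·x; the function name, learning rate, and dimensions are all illustrative. Consistent with the normalization discussed above, the weight norms of this sketch drift toward 1 during training.

```python
import numpy as np

def softhebb_update(W, x, lr=0.01):
    """One soft-WTA Hebbian step (illustrative sketch, not the paper's exact rule).

    W : (K, D) weight matrix, one row per neuron.
    x : (D,) input vector.
    """
    u = W @ x                       # pre-activations
    y = np.exp(u - u.max())         # softmax: soft competition between neurons
    y /= y.sum()
    # Hebbian term y_k * x_i, with an Oja-like factor -y_k * u_k * w_ik
    # that implicitly normalizes each weight vector toward length 1.
    dW = lr * (np.outer(y, x) - (y * u)[:, None] * W)
    return W + dW
```

Repeated application on non-negative inputs drives each row's norm toward 1 while the rows rotate toward the input distribution.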
Proof of Theorem 2.6.
Similarly to the Proof of Theorem 2.5, we find the equilibrium parameter of the SoftHebb plasticity rule.
(32)  
(33)  
(34)  
(35)  
(36)  
(37) 
Theorem A.1.
The equilibrium weights of the SoftHebb synaptic plasticity rule of Eq. 8 are implicitly normalized by the rule to a vector of length 1.
Proof of Theorem A.1.
Using a technique similar to krotov2019unsupervised , we write the SoftHebb plasticity rule as a differential equation
(39) 
The derivative of the norm of the weight vector is
(40) 
Substituting the SoftHebb rule of Eq. 39 into this equation, we obtain
(41) 
This differential equation shows that the norm of the weight vector increases if it is smaller than 1 and decreases if it is larger than 1, such that the weight vector tends to a sphere of radius 1, which proves the Theorem. ∎
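Assuming the rule of Eq. 39 takes the Oja-like form Δw_k = η·y_k·(x − u_k·w_k) (an assumption based on the multiplicative factor discussed after Eq. 30; the symbols are illustrative), the computation in Eqs. 40–41 can be sketched as:

```latex
\frac{d\|\mathbf{w}_k\|^2}{dt}
  = 2\,\mathbf{w}_k \cdot \frac{d\mathbf{w}_k}{dt}
  = 2\eta\, y_k \left( \mathbf{w}_k \cdot \mathbf{x} - u_k \|\mathbf{w}_k\|^2 \right)
  = 2\eta\, y_k\, u_k \left( 1 - \|\mathbf{w}_k\|^2 \right),
\qquad u_k = \mathbf{w}_k \cdot \mathbf{x}.
```

For y_k·u_k > 0, the right-hand side is positive when the norm is below 1 and negative when it is above 1, which gives the attraction to the unit sphere.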
Appendix B Details to Alternate activation functions (Section 2.7)
Theorem 2.3, which concerns the synaptic plasticity rule in Eq. 8, was proven for the model of Definition 2.2, which uses a mixture of natural exponential component distributions, i.e. with base e (Eq. 4):
(42) 
This implied an equivalence to a WTA neural network with natural exponential activation functions (Section 2.4). However, it is simple to show that these results can be extended to other model probability distributions, and thus other neuronal activations.
Firstly, in the simplest of the alternatives, the base of the exponential function can be chosen differently. In that case, the posterior probabilities that are produced by the model’s Bayesian inference, i.e. the network outputs, are given by a softmax with a different base. If the base of the exponential is , then
(43) 
It is obvious in the Proof of Theorem 2.3 in Appendix A that the same proof also applies to the changed base, if we use the appropriate logarithm for describing KL divergence. Therefore, the optimal parameter vector does not change, and the SoftHebb plasticity rule also applies to the SoftHebb model with a different exponential base. This change of the base in the softmax bears similarities to the change of its exponent, in a technique that is called Temperature Scaling and has been proven useful in classification hinton2015distilling .
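The relation between changing the softmax base and Temperature Scaling is easy to verify numerically: a base-b softmax equals the natural softmax applied to logits scaled by ln(b), i.e. a temperature of 1/ln(b). A small check (the logits and base here are illustrative):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def softmax_base(u, b):
    """Softmax with an arbitrary base b > 0 (illustrative)."""
    p = b ** (u - u.max())  # b^u, max-shifted for stability
    return p / p.sum()

u = np.array([1.0, -0.5, 2.3, 0.2])  # illustrative pre-activations
b = 3.0
# b^u = e^{u ln b}, so a base-b softmax is the natural softmax at
# temperature 1 / ln(b).
assert np.allclose(softmax_base(u, b), softmax(u * np.log(b)))
```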
Secondly, the more conventional type of Temperature Scaling, i.e. that which scales the exponent, is also possible in our model, while maintaining a Bayesian probabilistic interpretation of the outputs, a neural interpretation of the model, and the optimality of the plasticity rule. In this case, the model becomes
(44) 
The Proof of Theorem 2.3 in Appendix A also applies in this case, with a change in Eq. 18, but resulting in the same solution. Therefore, the SoftHebb synaptic plasticity rule is applicable in this case too. The solution for the neuronal biases, i.e. the parameters of the prior in the Theorem (Eq. 5), also remains the same, but with a factor of : .
Finally, and most generally, the model can be generalized to use any nonnegative and monotonically increasing function for the component distributions, i.e. for the activation function of the neurons, assuming
is appropriately normalized to be interpretable as a probability density function. In this case the model becomes
(45) 
Note that there is a change in the parametrization of the priors into a multiplicative bias , compared to the additive bias in the previous versions above. This change is necessary in this general case, because not all functions have the property that we used in the exponential case. We can show that the optimal weight parameters remain the same as in the previous case of an exponential activation function, also for this more general case of activation . It can be seen in the Proof of Theorem 2.3 that, for a more general function than the exponential, Eq. 18 would instead become:
(46) 
where . We have assumed that is an increasing function, therefore is also increasing. The cosine similarity is symmetrically decreasing as a function of around . Therefore, the function also decreases symmetrically around . Thus, the mean of that function under the probability distribution is maximum when . As a result, Eq. 46 implies that in this more general model too, the optimal weight vector is , and, consequently, it is also optimized by the same SoftHebb plasticity rule.
The implication of this is that the SoftHebb WTA neural network can use activation functions such as rectified linear units (ReLU), or other non-negative and increasing activations, such as rectified polynomials krotov2019unsupervised , and maintain its generative properties, its Bayesian computation, and the theoretical optimality of the plasticity rule. A more complex derivation of the optimal weight vector for alternative activation functions, which was specific to ReLU only and did not also derive the associated long-term plasticity rule for our problem category (Definition 2.1), was provided by moraitis2020shortterm .
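A WTA layer with such a generalized activation can be sketched as follows. This is an illustrative sketch of the generalized model of Eq. 45, not its exact specification: the function name, the rectified-polynomial default for f, and the bias values are assumptions, and the multiplicative biases play the role of the mixture priors.

```python
import numpy as np

def wta_posteriors(W, biases, x, f=lambda u: np.maximum(u, 0.0) ** 2):
    """Posterior-like outputs of a WTA layer with a generic non-negative,
    increasing activation f (here a rectified polynomial; ReLU also works).

    W      : (K, D) weight matrix, one row per neuron.
    biases : (K,) multiplicative biases, playing the role of mixture priors.
    x      : (D,) input vector.
    """
    a = biases * f(W @ x)          # prior-weighted component "likelihoods"
    s = a.sum()
    # Normalize to a probability vector; fall back to uniform if all
    # activations are rectified to zero.
    return a / s if s > 0 else np.full(len(a), 1.0 / len(a))
```

The outputs are non-negative and sum to 1, so they retain the probabilistic interpretation regardless of the choice of f, as long as f is non-negative and increasing.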