SoftHebb: Bayesian inference in unsupervised Hebbian soft winner-take-all networks

07/12/2021 ∙ by Timoleon Moraitis, et al. ∙ HUAWEI Technologies Co., Ltd.

State-of-the-art artificial neural networks (ANNs) require labelled data or feedback between layers, are often biologically implausible, and are vulnerable to adversarial attacks that humans are not susceptible to. On the other hand, Hebbian learning in winner-take-all (WTA) networks is unsupervised, feed-forward, and biologically plausible. However, an objective optimization theory for WTA networks has been missing, except under very limiting assumptions. Here we formally derive such a theory, based on biologically plausible but generic ANN elements. Through Hebbian learning, network parameters maintain a Bayesian generative model of the data. There is no supervisory loss function, but the network does minimize cross-entropy between its activations and the input distribution. The key is a "soft" WTA, where there is no absolute "hard" winner neuron, and a specific type of Hebbian-like plasticity of weights and biases. We confirm our theory in practice, where, in handwritten digit (MNIST) recognition, our Hebbian algorithm, SoftHebb, minimizes cross-entropy without having access to it, and outperforms the more frequently used, hard-WTA-based method. Strikingly, it even outperforms supervised end-to-end backpropagation, under certain conditions. Specifically, in a two-layered network, SoftHebb outperforms backpropagation when the training dataset is only presented once, when the testing data is noisy, and under gradient-based adversarial attacks. Adversarial attacks that confuse SoftHebb are also confusing to the human eye. Finally, the model can generate interpolations of objects from its input distribution.


1 Introduction

State-of-the-art artificial neural networks (ANNs) achieve impressive results in a variety of machine intelligence tasks sejnowski2020unreasonable. However, they largely rely on mechanisms that diverge from the original inspiration of biological neural networks bengio2015towards; illing2019biologically. As a result, only a small part of this prolific field also contributes to computational neuroscience. In fact, this biological implausibility is also an important issue for machine intelligence. For their impressive performance, ANNs trade off other desirable properties that are present in biological systems. For example, ANN training often demands very large and labelled datasets. When labels are unavailable, self-supervised learning schemes exist, where supervisory error signals generated by the network itself are backpropagated from the output towards the input to update the network's parameters goodfellow2014generative; devlin2018bert; chen2020simple. However, this global propagation of signals in deep networks introduces another limitation: it prevents the implementation of efficient distributed computing hardware that would rely only on local signals from neighbouring physical nodes in the network, and it contrasts with the local synaptic plasticity rules that are believed to govern biological learning. Several pieces of work have addressed parts of the biological implausibility and drawbacks of backpropagation in ANNs bengio2015towards; lillicrap2016random; guerguiev2017towards; pfeiffer2018deep; illing2019biologically; pogodin2020kernelized; millidge2020predictive. Recently, an approximation to backpropagation that is mostly Hebbian, i.e. relies mostly on the pre- and post-synaptic activity of each synapse, has been achieved by reducing the global error requirements to 1-bit information pogodin2020kernelized. Two schemes that further localize the signal that is required for a weight update are Equilibrium Propagation scellier2017equilibrium and Predictive Coding millidge2020predictive. Both methods approximate backpropagation through Hebbian-like learning, by delegating the global aspect of the computation from a global error signal to a global convergence of the network state to an equilibrium. This equilibrium is reached through several iterative steps of feed-forward and feed-back communication throughout the network, before the ultimate weight update by one training example. The biological plausibility and hardware-efficiency of this added iterative and heavily feedback-dependent process are open questions that are beginning to be addressed ernoult2020equilibrium.

Moreover, learning through backpropagation, and presumably also through its approximations, shows another sign of biological implausibility, one that also significantly limits ANN applicability: it produces networks that are confused by small adversarial perturbations of the input, which are imperceptible to humans. It has recently been proposed that a defence strategy of "deflection" of adversarial attacks may be the ultimate solution to that problem qin2020deflecting. Through this strategy, to cause confusion in the network's inferred class, the adversary is forced to generate a changed input that really belongs to the distribution of a different input class. Intuitively, but also strictly by definition, this deflection is achieved if a human assigns to the perturbed input the same label that the network does. Deflection of adversarial attacks in ANNs has been demonstrated by an elaborate scheme that is based on detecting the attacks qin2020deflecting. However, the human ability to deflect adversarial perturbations likely does not rely on detecting them, but rather on effectively ignoring them, making the deflecting type of robustness an emergent property of biological computation rather than a defence mechanism. The biological principles that underlie this property of robustness are unclear, but it might emerge from the distinct algorithms that govern learning in the brain.

Therefore, what is missing is a biologically plausible model that can learn from fewer data points, without labels, through local plasticity, and without feedback from distant layers. This model could then be tested for emergent adversarial robustness. A good candidate category of biological networks and learning algorithms is that of competitive learning. Neurons that compete for their activation through lateral inhibition are a common connectivity pattern in the superficial layers of the cerebral cortex douglas2004neuronal; binzegger2004quantitative. This pattern is described as winner-take-all (WTA), because competition suppresses the activity of weakly activated neurons and emphasizes strong ones. Combined with Hebbian-like plasticity rules, WTA connectivity gives rise to competitive-learning algorithms. These networks and learning schemes have long been studied von1973self, and a large literature based on simulations and analyses describes their functional properties. A WTA neuronal layer, depending on its specifics, can restore missing input signals rutishauser2011collective; diehl2016learning, perform decision making, i.e. winner selection hahnloser1999feedback; maass2000computational; rutishauser2011collective, and generate oscillations such as those that underlie brain rhythms cannon2014neurosystems. Perhaps more importantly, its neurons can learn to become selective to different input patterns, such as the orientation of visual bars in models of the primary visual cortex von1973self, MNIST handwritten digits nessler2013PLoS; diehl2015unsupervised; krotov2019unsupervised, CIFAR10 objects krotov2019unsupervised, and spatiotemporal spiking patterns nessler2013PLoS, and can adapt dynamically to model changing objects moraitis2020shortterm. The WTA model is indeed biologically plausible, Hebbian plasticity is local, and learning is input-driven, relying on only feed-forward communication between neurons – properties that seem to address several of the limitations of ANNs. However, the model's applicability has been limited to simple tasks, because a theory that describes Hebbian WTA learning algorithms in formal terms of objective optimization has been lacking, except under the assumption of stochastic spiking neurons combined with population coding of the inputs nessler2009stdp; nessler2013PLoS. These assumptions limit the compatibility of Hebbian WTA theory with the powerful theoretical and practical tools of ANNs. Recently, when WTA networks were studied in a theoretical framework compatible with conventional machine learning (ML), but in the context of short-term as opposed to long-term Hebbian plasticity, this resulted in surprising practical advantages over supervised ANNs moraitis2020shortterm. A similar theoretical approach could also reveal unknown advantages of long-term Hebbian plasticity in WTA networks. In addition, it could provide insights into how a WTA microcircuit could participate in larger-scale computation by deeper cortical or artificial networks.

Here we construct "SoftHebb", a biologically plausible WTA model that is based on standard rate-based neurons as in ANNs, can accommodate various activation functions, and learns without labels, using local plasticity and only feed-forward communication, i.e. the properties we seek in an ANN. Importantly, it is equipped with a simple normalization of the layer's activations, and an optional temperature-scaling mechanism hinton2015distilling, producing a soft WTA instead of selecting a single "hard" winner neuron. This allows us to prove formally that a SoftHebb layer is a generative mixture model that objectively minimizes its Kullback-Leibler (KL) divergence from the input distribution through Bayesian inference, thus providing a formal ML-theoretic understanding of these networks. We complement our main results, which are theoretical, with experiments that are small-scale but produce intriguing results. As a generative model, SoftHebb has a broader scope than classification, but we test it in simulations on the tasks of recognizing MNIST handwritten digits and Fashion-MNIST fashion products. First, we confirm that SoftHebb is more accurate than a hard-WTA model. Second, we validate that it minimizes a loss function (cross-entropy) even though it has no access to it or to labels during learning. In addition, likely owing to its Bayesian and generative properties, the unsupervised WTA model outperforms a supervised two-layer perceptron in several aspects: learning speed and accuracy in the first presentation of the training dataset, robustness to noisy data, and increased robustness to one of the strongest white-box adversarial attacks, i.e. projected gradient descent (PGD) madry2017towards, without any explicit defence. Interestingly, the SoftHebb model also exhibits inherent properties of deflection qin2020deflecting of the adversarial attacks, and generates object interpolations.

2 Theory

Definition 2.1 (The input assumptions).

Each observation $x = (x_1, \ldots, x_n)$ is generated by a hidden "cause" or "class" $C_k$ from a finite set of $K$ possible such causes: $C_k \in \{C_1, \ldots, C_K\}$. Therefore, the data is generated by a mixture of the probability distributions attributed to each of the $K$ classes $C_k$:

$$p(x) = \sum_{k=1}^{K} p(x | C_k)\, P(C_k). \qquad (1)$$

In addition, the dimensions $x_i$ of $x$ are conditionally independent from each other, i.e. $p(x | C_k) = \prod_{i=1}^{n} p(x_i | C_k)$. The number $K$ of the true causes or classes of the data is assumed to be known.
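To make these assumptions concrete, the following is a minimal sketch (our illustration, not code from the paper) of data generated according to Definition 2.1. The Gaussian form of the components and all constants are assumptions chosen only for the example; the theory requires only the mixture structure and conditional independence of the dimensions.

    import numpy as np

    rng = np.random.default_rng(0)
    K, n = 3, 5                          # number of hidden causes, input dimensions
    priors = np.array([0.5, 0.3, 0.2])   # P(C_k)
    means = rng.uniform(size=(K, n))     # per-cause mean of each dimension

    def sample_mixture(num_samples):
        # draw the hidden cause of each sample: C_k ~ P(C_k)
        causes = rng.choice(K, size=num_samples, p=priors)
        # conditional independence: each dimension x_i is sampled separately
        # given the cause, here as independent Gaussian noise around the mean
        x = means[causes] + 0.1 * rng.normal(size=(num_samples, n))
        return x, causes

    X, true_causes = sample_mixture(1000)   # the causes stay hidden during learning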

The term "cause" is used here in the sense of causal inference. It is important to emphasize that the true cause of each input is hidden, i.e. not known. In the case of a labelled dataset, the labels are deleted before presenting the training data to the model. We choose a mixture model that corresponds to the data assumptions but is also interpretable in neural terms (Paragraph 2.4):

Definition 2.2 (The generative probabilistic mixture model).

We consider a mixture model distribution $Q(x; \mathbf{w})$ approximating the data distribution $p(x)$. We choose specifically a mixture of exponentials, and we parametrize the prior $Q(C_k)$ also as an exponential, specifically:

$$q(x | C_k; w_k) \propto e^{\frac{w_k \cdot x}{\|w_k\|\, \|x\|}} \qquad (2)$$
$$Q(C_k; w_{0k}) = e^{w_{0k}}. \qquad (3)$$

In addition, the parameter vectors are subject to the normalization constraints $\|w_k\| = 1$ for every $k$, and $\sum_{k=1}^{K} e^{w_{0k}} = 1$.

The model we have chosen is a reasonable choice because it factorizes similarly to the data of Definition 2.1:

$$Q(x; \mathbf{w}) = \sum_{k=1}^{K} e^{u_k}\, e^{w_{0k}}, \qquad (4)$$

where $u_k = \frac{w_k \cdot x}{\|w_k\|\, \|x\|}$, i.e. the cosine similarity of the two vectors. A similar probabilistic model was used in related previous theoretical work nessler2009stdp; nessler2013PLoS; moraitis2020shortterm, but for different data assumptions, and with certain further constraints on the model. Namely, nessler2009stdp; nessler2013PLoS considered data that was binary and created by a population code, while the model was stochastic. These works provide the foundation of our derivation, but here we consider the more generic scenario where data are continuous-valued and input directly into the model, which is deterministic and, as we will show, more compatible with standard ANNs. In moraitis2020shortterm, data had particular short-term temporal dependencies, whereas here we consider the distinct case of independent and identically distributed (i.i.d.) input samples. The Bayes-optimal parameters of a model mixture of exponentials can be found analytically as functions of the input distribution's parameters, and the model is equivalent to a soft winner-take-all neural network moraitis2020shortterm. After describing this, we will prove that Hebbian plasticity of synapses, combined with local plasticity of the neuronal biases, sets the parameters to their optimal values.

Theorem 2.3 (The optimal parameters of the model).

The parameters that minimize the KL divergence of such a mixture model from the data are, for every $k$,

$$w_{0k}^* = \ln P(C_k) \qquad (5)$$
$$w_k^* = \frac{\mu_k}{\|\mu_k\|}, \qquad (6)$$

where $\mu_k = E[x | C_k]$ is the mean of the distribution $p(x | C_k)$, and $\|\cdot\|$ is the Euclidean norm.

In other words, the optimal parameter vector of each component in this mixture is proportional to the mean of the corresponding component of the input distribution, i.e. it is a centroid of the component. In addition, the optimal parameter $w_{0k}^*$ of the model's prior is the logarithm of the corresponding component's prior probability. The Theorem's proof was provided in the supplementary material of moraitis2020shortterm, but for completeness we also provide it in our Appendix. These centroids and priors of the input's component distributions, as well as the method of their estimation, however, differ for different input assumptions, and we will derive a learning rule that provably sets the parameters to their maximum-likelihood estimates for the inputs addressed here. The learning rule is a Hebbian type of synaptic plasticity combined with a plasticity rule for the neuronal biases. Before providing the rule and the related proof, we describe how our mixture model is equivalent to a WTA neural network.

2.4 Equivalence of the probabilistic model to a WTA neural network

The cosine similarity $u_k$ between the input vector $x$ and each centroid's parameters $w_k$ underpins the model (Eq. 4). This similarity is precisely computed by a linear neuron that receives the normalized input $x / \|x\|$, and that normalizes its vector of synaptic weights: $\|w_k\| = 1$. Specifically, the neuron's summed weighted input then equals the cosine similarity of an input sample to the weight vector, thus computing the likelihood function of each component of the input mixture (Eq. 2). The bias term $w_{0k}$ of each neuron can store the parameter of the prior $Q(C_k)$. Based on these, it can also be shown that a set of such neurons can actually compute the Bayesian posterior, if the neurons are connected in a configuration that implements softmax. Softmax has a biologically plausible implementation through lateral inhibition between neurons nessler2009stdp; nessler2013PLoS; moraitis2020shortterm. Specifically, based on the model of Definition 2.2, the posterior probability is

$$Q(C_k | x; \mathbf{w}) = \frac{e^{u_k + w_{0k}}}{\sum_{j=1}^{K} e^{u_j + w_{0j}}}. \qquad (7)$$

But in the neural description, $u_k + w_{0k}$ is precisely the total activation of the $k$-th linear neuron. That is, Eq. 7 shows that the result of Bayesian inference of the hidden cause from the input is found by a softmax operation on the linear neural activations. In this equivalence, we will be using $y_k$ to symbolize the softmax output of the $k$-th neuron, i.e. the output after the WTA operation, interchangeably with $Q(C_k | x)$. It can be seen in Eq. 7 that the probabilistic model has one more, alternative, but equivalent neural interpretation. Specifically, $y_k$ can be described as the output of a neuron with an exponential activation function (numerator in Eq. 7) that is normalized by its layer's total output (denominator). This is equally accurate, and more directly analogous to the biological description nessler2009stdp; nessler2013PLoS; moraitis2020shortterm. This shows that the exponential activation of each individual neuron directly equals the $k$-th exponential component distribution of the generative mixture model (Eq. 4). Therefore the softmax-configured linear neurons, or the equivalent normalized exponential neurons, fully implement the generative model of Definition 2.2, and also infer the Bayesian posterior probability given an input and the model parameters. However, the problem of calculating the model's parameters from data samples is a difficult one, if the input distribution's parameters are unknown. In the next sections we will show that this neural network can find these optimal parameters through Bayesian inference, in an unsupervised and on-line manner, based on only local Hebbian plasticity.
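As a concrete illustration of this equivalence, the following is a minimal sketch (not the authors' implementation) of the inference of Eq. 7: linear neurons with unit-norm weights and a normalized input compute the cosine similarities $u_k$, the biases add the log-priors, and the softmax, standing in for lateral inhibition, returns the Bayesian posterior.

    import numpy as np

    def softhebb_infer(x, W, w0):
        """Soft WTA inference of Eq. 7.
        x: (n,) input; W: (K, n) weights with unit-norm rows; w0: (K,) biases.
        Returns y, the posterior Q(C_k | x), i.e. the soft WTA output."""
        xn = x / np.linalg.norm(x)   # normalized input, so u_k is a cosine similarity
        u = W @ xn                   # linear activations u_k = w_k . x / ||x||
        a = u + w0                   # total activation: log-likelihood + log-prior
        e = np.exp(a - a.max())      # numerically stabilized softmax
        return e / e.sum()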

2.5 A Hebbian rule that optimizes the weights

Several Hebbian-like rules exist and have been combined with WTA networks. For example, in the case of stochastic binary neurons and binary population-coded inputs, it has been shown that weight updates with an exponential weight-dependence find the optimal weights nessler2009stdp; nessler2013PLoS. Oja's rule is another candidate oja1982simplified. An individual linear neuron equipped with this learning rule finds the first principal component of the input data oja1982simplified. A variation of Oja's rule combined with hard-WTA networks and additional mechanisms has achieved good experimental performance on classification tasks krotov2019unsupervised, but lacks the theoretical underpinning that we aim for. Here we propose a Hebbian-like rule which, as we will show, optimizes the soft WTA's generative model. The rule is similar to Oja's rule, but considers, for each neuron $k$, both its linear weighted summation of the inputs, $u_k$, and its nonlinear output of the WTA, $y_k$:

$$\Delta w_{ik} = \eta\, y_k\, (x_i - u_k\, w_{ik}), \qquad (8)$$

where $w_{ik}$ is the synaptic weight from the $i$-th input to the $k$-th neuron, and $\eta$ is the learning rate hyperparameter. By solving the equation $E[\Delta w_{ik}] = 0$, where $E[\cdot]$ is the expected value over the input distribution, we can show that, with this rule, there exists a stable equilibrium value of the weights, and this equilibrium value is an optimal value according to Theorem 2.3:

Theorem 2.5.

The equilibrium weights of the SoftHebb synaptic plasticity rule are

$$w_k^{eq} = \frac{\mu_k}{\|\mu_k\|} = w_k^*. \qquad (9)$$

The proof is provided in the Appendix. Therefore, our update rule (Eq. 8) optimizes the weights of the neural network.
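In code, the rule of Eq. 8 is a one-line local update. The sketch below (ours, not the authors' code) uses only signals local to neuron $k$ and its synapses: the presynaptic input $x_i$, the neuron's summed weighted input $u_k$, and its soft WTA output $y_k$.

    import numpy as np

    def softhebb_weight_update(W, xn, u, y, eta=0.03):
        """SoftHebb rule of Eq. 8: dW_ik = eta * y_k * (x_i - u_k * W_ik).
        W: (K, n) weights; xn: (n,) normalized input; u: (K,) linear
        activations; y: (K,) soft WTA outputs; eta: learning rate."""
        return W + eta * y[:, None] * (xn[None, :] - u[:, None] * W)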

2.6 Local learning of neuronal biases as Bayesian priors

For the complete optimization of the model, the neuronal biases must also be optimized to satisfy Eq. 5, i.e. to optimize the Bayesian prior belief for the probability distribution over the input causes. We define the following rate-based rule, inspired by the spike-based bias rule of nessler2013PLoS:

$$\Delta w_{0k} = \eta\, \left(y_k\, e^{-w_{0k}} - 1\right). \qquad (10)$$

With the same technique we used for Theorem 2.5, we also provide proof in the Appendix that the equilibrium of the bias with this rule matches the optimal value of Theorem 2.3:

Theorem 2.6.

The equilibrium biases of the SoftHebb bias learning rule are

$$w_{0k}^{eq} = \ln P(C_k) = w_{0k}^*. \qquad (11)$$
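A corresponding sketch of the bias update follows; the exact coefficient form here is an assumption on our part, consistent with the stated rule and equilibrium: any local rule whose fixed point satisfies $E[y_k] = e^{w_{0k}}$ stores the log-prior in the bias.

    import numpy as np

    def softhebb_bias_update(w0, y, eta=0.03):
        """Local bias rule (Eq. 10, as written above): at equilibrium
        E[y_k] = exp(w0_k), i.e. w0_k converges to ln P(C_k) (Eq. 11)."""
        return w0 + eta * (y * np.exp(-w0) - 1.0)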

2.7 Alternate activation functions

The model of Definition 2.2 uses for each component an exponential probability distribution with a base of Euler's $e$, equivalent to a model using similarly exponential neurons (Section 2.4). Depending on the task, different probability distribution shapes, i.e. different neuronal activation functions, may be better models. This is compatible with our theory. Firstly, the base of the exponential activation function can be chosen differently, resulting in a softmax function with a different base $b$, such that Eq. 7 becomes more generally

$$Q(C_k | x; \mathbf{w}) = \frac{b^{u_k + w_{0k}}}{\sum_{j=1}^{K} b^{u_j + w_{0j}}}.$$

This is reminiscent of temperature scaling hinton2015distilling, a different mechanism that could also be used, as it too maintains the probabilistic interpretation of the output; it changes the exponent of the function rather than the base. Both types of change to the model can be implemented by a normalized layer of exponential neurons, and are compatible with our theoretical derivations and the optimization by the plasticity rule of Eq. 8. Moreover, we show in the Appendix that soft WTA models can be constructed from rectified linear units (ReLU), or in general from neurons with any non-negative monotonically increasing activation function, and their weights are optimized by the same plasticity rule.
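Both output variants reduce to a one-line change of the softmax. A sketch (ours), assuming the activations $a_k = u_k + w_{0k}$ from before; the base value 1000 used in Section 3.1 appears here only as an example:

    import numpy as np

    def soft_wta(a, base=np.e, temperature=1.0):
        """Soft WTA output with configurable base and temperature.
        A base-b softmax is an ordinary softmax with activations scaled
        by ln(b); a temperature T divides the exponent instead."""
        z = a * np.log(base) / temperature
        e = np.exp(z - z.max())
        return e / e.sum()

    # y = soft_wta(u + w0, base=1000.0)   # the variant used in Section 3.1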

2.8 Cross-entropy and true causes, as opposed to labels

It is important to note that, in labelled datasets, the labels that have been assigned by a human supervisor may not correspond exactly to the true causes that generate the data, which SoftHebb infers. For example, consider MNIST. The 10 labels indicating the 10 decimal digits do not correspond exactly to the true cause of each example image. In reality, the cause of each MNIST example, in the sense implied by causal inference, is not the digit cause alone, but a combination of a single digit cause $L_l$, which is the MNIST label, with one of many handwriting styles $S_m$. That is, the probabilistic model is such that, in Eq. 1 of Definition 2.1, the cause of each sample is dual, i.e. there exist a digit $L_l$ and a style $S_m$ such that

$$C_k = (L_l, S_m) \qquad (12)$$
$$p(x) = \sum_{l, m} p(x | L_l, S_m)\, P(L_l, S_m). \qquad (13)$$

This is important for our unsupervised model because, first, it can make the assumption from Definition 2.1 that the number $K$ of input causes is known problematic. Practically speaking, $K$ can be chosen using common heuristics from cluster analysis. Second, it makes the evaluation of the loss of a trained SoftHebb model based on test labels not straightforward. We will now provide the theoretical tools for achieving this. Even though SoftHebb is a generative model, it can be used for discrimination of the input classes, using Bayes' theorem. More formally, the proof of Theorem 2.3 involved showing that SoftHebb minimizes the KL divergence of the model $Q(x; \mathbf{w})$ from the data $p(x)$. Based on this, it can be shown that the algorithm also minimizes the cross-entropy $H(C, \hat{C})$ between the causes $\hat{C}$ that it infers and the true causes $C$ of the data. An additional consequence is that, by minimizing $H(C, \hat{C})$, SoftHebb also minimizes its label-based cross-entropy $H(L, \hat{L})$ between the true labels $L$ and the implicitly inferred labels $\hat{L}$:

$$\min H(C, \hat{C}) \qquad (14)$$
$$\implies \min H(L, \hat{L}). \qquad (15)$$

This is because the dependence of the labels on the true causes (Eqs. 12 and 13) is fixed by the data generation process. To obtain and measure the cross-entropy $H(L, \hat{L})$, the causal structure relating causes to labels is missing, but it can be represented by a supervised classifier of SoftHebb's outputs, trained using the labels $L$. Therefore, by (a) unsupervised training of SoftHebb, then (b) training a supervised classifier on top, and finally (c1) repeating the training of SoftHebb with the same initial weights and ordering of the training inputs, while (c2) measuring the trained classifier's loss, we can observe the cross-entropy loss of SoftHebb while it is being minimized, and infer that $H(C, \hat{C})$ is also minimized (Eq. 15). We call this the post-hoc cross-entropy method, and we have used it in our experiments (Section 3.2 and Fig. 1C and D).

3 Experiments


Figure 1: Performance of SoftHebb on MNIST compared to hard-WTA and backpropagation.

3.1 MNIST accuracy vs hard WTA

We implemented the theoretical SoftHebb model in simulations and tested it in the task of learning to classify MNIST handwritten digits. The network received the MNIST frames normalized by their Euclidean norm, and the plasticity rule we derived updated its weights and biases in an unsupervised manner. We used 2000 neurons. First we trained the network for 100 epochs, i.e. randomly ordered presentations of the 60000 training digits. In our validation testing we found that softmax with a base of 1000 (see Section 2.7) performed best. The learning rate $\eta$ of Eq. 8 decreased linearly from 0.03 to 0 throughout training. Each training experiment we describe was repeated five times with varying random initializations and input order, and we report the mean and standard deviation of accuracies. Inference of the input labels by the WTA network of 2000 neurons was performed in two different ways. The first approach is single-layer: after training the network, we assigned a label to each of the 2000 neurons, in a standard approach that is used in unsupervised clustering. Namely, for each neuron, we found the label of the training set that makes it win the WTA competition most often. In this single-layer approach, this is the only time when labels were used, and at no point were weights updated using labels. The second approach was two-layer, based on supervised training of a perceptron classifier on top of the WTA layer. The classifier layer was trained with the Adam optimizer and cross-entropy loss for 60 epochs, while the previously trained WTA parameters were frozen.
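A sketch of the single-layer readout described above (our illustration of the standard clustering heuristic, not the paper's code); labels are used once, to name the trained neurons, and never to update weights:

    import numpy as np

    def assign_neuron_labels(X_train, labels, W, w0, num_classes=10):
        """Assign to each neuron the label for which it wins most often."""
        counts = np.zeros((W.shape[0], num_classes))
        for x, label in zip(X_train, labels):
            winner = np.argmax(W @ (x / np.linalg.norm(x)) + w0)
            counts[winner, label] += 1
        return counts.argmax(axis=1)          # (K,) one label per neuron

    def classify(x, W, w0, neuron_labels):
        """1-layer inference: the winning neuron's assigned label."""
        return neuron_labels[np.argmax(W @ (x / np.linalg.norm(x)) + w0)]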

SoftHebb achieved an accuracy of and in its 1- and 2-layer form respectively. To confirm the strength of the soft WTA approach combined with training the priors through biases, which makes the network Bayesian, we also trained the weights of a network with a hard-WTA setup, i.e. where, for each input, the strongest-activated neuron's output is 1 and the other neurons are suppressed to 0. We found that an initial learning rate of 0.05 was best for the hard-WTA network. The SoftHebb model outperformed the hard WTA (Fig. 1A). However, SoftHebb's accuracy is significantly lower than that of a multi-layer perceptron (MLP) with one hidden layer, also of 2000 neurons, that is trained end-to-end exhaustively. The MLP achieves a higher accuracy (not shown in the figure). This is expected, due to end-to-end training, supervision, and the MLP being a discriminative model, as opposed to a generative model merely applied to a classification task, as SoftHebb is. If the Bayesian and generative aspects that follow from our theory were not required, several additional mechanisms exist to enhance the discriminative power of WTA networks krotov2019unsupervised, and even a random projection layer instead of a trained WTA performs well illing2019biologically. The generative approach, however, has its own advantages even for a discriminative task, and we will show some of these here.

3.2 Cross-entropy minimization and single-epoch advantage over backpropagation

First, we show as a validation of the theory that the SoftHebb model minimizes the cross-entropy of its activations from its input's causes, even though no explicit loss is provided. According to our post-hoc cross-entropy method (Section 2.8), as a proxy we observed the minimization of the label-based loss during the first epoch of on-line Hebbian learning. The loss on the training inputs as they appear (running loss), as well as on the whole testing dataset, can be seen in Fig. 1C and D respectively (blue curves). The method allows us to observe the discriminative aspect of the generative model as it is optimized. After this one epoch, the accuracy of the 1-layer form of the SoftHebb model is . The 2-layer form is again obtained by training a supervised classifier output layer for 60 epochs, and its accuracy is (Fig. 1B, blue bars). We then also train for a single epoch a 2-layer MLP with a hidden layer of 2000 neurons, with backpropagation by stochastic gradient descent (SGD) and cross-entropy loss. We found, through grid search, the optimal minibatch size and learning rate of the MLP (4 and 0.2 respectively). The MLP achieves an accuracy of (Fig. 1B, orange bar), if we exclude one run of the experiment which only achieved an accuracy of 86.92%. Surprisingly, it does not surpass SoftHebb, not even SoftHebb's 1-layer form. In addition, the cross-entropy of the SoftHebb model is visibly minimized faster than through SGD (orange curves of Fig. 1C & D). It is possible that SoftHebb's advantage in terms of loss and accuracy is a side-effect of pre-training the second layer when obtaining SoftHebb's post-hoc cross-entropy, or of that layer's 60-epoch training. To test this possibility, we similarly obtained a trained version of the MLP's output layer alone, and then trained its first layer with backpropagation while the second layer was frozen. Meanwhile, we recorded its loss, thus obtaining its own version of the post-hoc cross-entropy (Fig. 1C & D, yellow curve). SoftHebb still showed an advantage in terms of loss-minimization speed, and its 2-layer form's accuracy is still not surpassed (Fig. 1B, blue & yellow bars), despite the fully unsupervised and local learning in the core of the network. Moreover, the figure shows that the minimization of the loss on the general test set by SoftHebb is smoother than the running loss, while SGD's test-set loss is influenced by the specifics of the individual training examples. This may indicate stronger generalization by the SoftHebb model, emerging from its Bayesian and generative nature. If this is true, SoftHebb may be more robust to input perturbations.


Figure 2: Noise and adversarial-attack robustness of SoftHebb and of a backpropagation-trained MLP on MNIST and Fashion-MNIST. The insets show one example from the testing set and its perturbed versions, for increasing perturbations. (A) SoftHebb is highly robust to noise. (B) The MLP's MNIST accuracy drops to ~60% under hardly perceptible perturbations (small ε), while SoftHebb requires visually noticeable perturbations (larger ε) for a similar drop in performance. At that degree of perturbation, the MLP's accuracy has already dropped to zero. SoftHebb deflects the attack: it forces the attacker to produce examples of truly different classes – the original digit "4" is perturbed to look like a "0" (see also Fig. 3).

3.3 Robustness to noise and adversarial attacks - Generative adversarial properties

Indeed, we tested the trained SoftHebb and MLP models for robustness, and found that SoftHebb is significantly more robust than the backprop-trained MLP, both to added Gaussian noise and to PGD adversarial attacks (see Fig. 2). PGD madry2017towards produces perturbations in a direction that maximizes the loss of each targeted network, and of a size controlled by a parameter ε. Strikingly, SoftHebb has a visible tendency to deflect the attacks, i.e. its confusing examples actually belong to a perceptually different class (Figs. 2B and 3). This effectively nullifies the attack, and was previously shown in elaborate state-of-the-art adversarial-defence models qin2020deflecting. The pair of the adversarial attacker with the generative SoftHebb model essentially composes a generative adversarial network (GAN), even though the term is usually reserved for pairs trained in tandem goodfellow2014generative; creswell2018generative. As a result, the model could inherit certain properties of GANs. Indeed, it is able to generate interpolations between input classes (Fig. 3). The parameter ε of the adversarial attack can control the balance between the interpolated objects. Similar functionality has existed in the realm of GANs radford2015unsupervised, autoencoders berthelot2018understanding, and other deep neural networks bojanowski2017optimizing, but was not known for simple biologically plausible models.
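For reference, a minimal L∞ PGD sketch in the spirit of madry2017towards, written with PyTorch autograd for any differentiable model; the step size, step count, and pixel range here are illustrative assumptions, not the settings used in the experiments above.

    import torch
    import torch.nn.functional as F

    def pgd_attack(model, x, target, eps=0.1, alpha=0.01, steps=40):
        """Maximize the model's loss within an eps-ball around the input x."""
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), target)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()   # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)       # project to the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                  # stay in valid pixel range
        return x_adv.detach()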


Figure 3: Examples generated by the adversarial pair PGD attacker/SoftHebb model. SoftHebb’s inherent tendency to deflect the attack towards truly different classes is visible. This tendency can be repurposed to generate interpolations between different classes of the data distribution, a generative property previously unknown for such simple networks.

3.4 Generalizability of the algorithm to other datasets: Fashion-MNIST

Finally, we trained the SoftHebb model on a more difficult dataset, namely Fashion-MNIST xiao2017/online, which contains grey-scale images of clothing products. A supervised MLP of the same size that we trained as a reference achieved a test accuracy of on this dataset. We used the exact same SoftHebb model and hyperparameters that we used on MNIST to learn Fashion-MNIST, without any adjustment for the changed dataset. Despite this, the model achieved an accuracy of . In addition, under very small adversarial perturbations, the MLP drops to an accuracy lower than the SoftHebb model's, despite our generic, untuned training of SoftHebb; SoftHebb's adversarial and noise robustness is reconfirmed (dashed lines in Fig. 2), as are its generative interpolations (Fig. 3B).

4 Discussion

In summary, we have described SoftHebb, a biologically plausible model that is completely unsupervised, local, and requires no error or other feedback from upper layers. The model consists of elements fully compatible with conventional ANNs. We have shown the importance of soft competition in rate-based WTA networks, and have derived formally the type of plasticity that optimizes the network through Bayesian computation. SoftHebb learns a generative model of the input distribution. We also formalized its unsupervised discriminative properties, and we developed a method for quantifying its discriminative loss in a theoretically sound manner. Our experiments are small, but they confirm our optimization theory and demonstrate that SoftHebb has significant strengths that emerge from its unsupervised, generative, and Bayesian nature. It is intriguing that, from the model's biological plausibility, properties commonly associated with biological intelligence emerge, such as speed of learning, robustness to noise and adversarial attacks, and deflection of the attacks. In particular, its ability to learn better than even supervised networks when training time is limited is interesting for neuromorphic applications targeting resource-limited learning tasks. Its robustness to noise and adversarial attacks is impressive, considering that it is intrinsic and was not instilled by specialized defences. SoftHebb has the inherent tendency not merely to be robust to attacks, but actually to deflect them. We also showed that these networks can generate interpolations between points in the latent space. However, the model is quite limited compared to the state of the art in ML, if classification accuracy, exhaustive training, and unperturbed inputs are the benchmark. To address this, its potential integration into multilayer networks should be explored using theoretical tools such as those we developed here, and others from the literature nessler2009stdp; nessler2013PLoS; krotov2019unsupervised. This could also provide insights into the role of WTA microcircuits in the context of larger-scale computations in the cortex.

References

Appendix A Theorem Proofs

Proof of Theorem 2.3.

The parameters of the model $Q(x; \mathbf{w})$ are optimal if they minimize the model's Kullback-Leibler divergence from the data distribution $p(x)$:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} D_{KL}\left(p(x)\, \|\, Q(x; \mathbf{w})\right).$$

Because $Q(C_k; w_{0k})$ is independent from $w_k$, and $q(x | C_k; w_k)$ is independent from $w_{0k}$, for every $k$, we can find the set of parameters that minimize the KL divergence of the mixtures by minimizing the KL divergence of each component $k$:

$$w_k^* = \arg\min_{w_k} D_{KL}\left(p(x | C_k)\, \|\, q(x | C_k; w_k)\right),$$

and simultaneously setting

$$Q(C_k; w_{0k}^*) = P(C_k). \qquad (16)$$

From Eq. 3 and this last condition, Eq. 5 of the Theorem is proven: $w_{0k}^* = \ln P(C_k)$.

Further,

$$w_k^* = \arg\max_{w_k} E_{p(x | C_k)}\left[\ln q(x | C_k; w_k)\right] \qquad (17)$$
$$= \arg\max_{w_k} E_{p(x | C_k)}\left[u_k\right] \qquad (18)$$
$$= \arg\max_{w_k} E_{p(x | C_k)}\left[\frac{w_k \cdot x}{\|w_k\|\, \|x\|}\right], \qquad (19)$$

where we used for Eq. 17 the fact that the entropy of $p(x | C_k)$ is a constant, because it is determined by the environment's data and not by the model's parametrization $\mathbf{w}$. Eq. 18 follows from the definition of $q(x | C_k; w_k)$ (Eq. 2). The result in Eq. 19 is the mean value of the cosine similarity of $w_k$ and $x$ under $p(x | C_k)$.

Due to the symmetry of the cosine similarity, it follows that this mean is maximal when the weight vector points along the mean input of the component:

$$E_{p(x | C_k)}\left[\frac{w_k \cdot x}{\|w_k\|\, \|x\|}\right] \text{ is maximal} \iff w_k \parallel \mu_k, \qquad (20)$$
$$\text{i.e. } w_k^* = c\, \mu_k, \quad c > 0. \qquad (21)$$

Enforcement of the requirement for normalization of the vector $w_k$ leads to the unique solution $w_k^* = \mu_k / \|\mu_k\|$. ∎

Proof of Theorem 2.5.

We will find the equilibrium point of the SoftHebb plasticity rule, i.e. the weight $w_{ik}^{eq}$ that implies $E[\Delta w_{ik}] = 0$.

To find this, we will first find the equilibrium point of a similar plasticity rule with a simpler equation:

$$\Delta w_{ik} = \eta\, y_k\, (x_i - w_{ik}), \qquad (22)$$

by setting $E[\Delta w_{ik}] = 0$. We will expand this expected value based on the plasticity rule itself, and on the joint probability distribution of the input $x$ and the neuronal output $y_k$:

$$E[\Delta w_{ik}] = \eta \int p(x)\, Q(C_k | x; \mathbf{w})\, (x_i - w_{ik})\, dx = 0 \qquad (23)$$
$$\implies w_{ik}^{eq}\, E[y_k] = E[y_k\, x_i]. \qquad (24)$$

Thus, Eq. 24 shows that

$$w_{ik}^{eq} = \frac{E[y_k\, x_i]}{E[y_k]} = E[x_i | C_k] = \mu_{ik}. \qquad (25)$$

To arrive at Eq. 25, we assume that the probability densities that correspond to the components of the input mixture are distributed without a bias throughout the input dimensions $x_i$, and therefore

$$\frac{E[y_k\, x_i]}{E[y_k]} = E[x_i | C_k]. \qquad (26)$$

The result is equivalent if a more relaxed assumption is made. Specifically, if we assume that the input distribution is in fact biased, such that

$$\frac{E[y_k\, x_i]}{E[y_k]} = E[x_i | C_k] + \epsilon, \qquad (27)$$

then the result in Eq. 25 is modified by an added constant $\epsilon$:

$$w_{ik}^{eq} = E[x_i | C_k] + \epsilon. \qquad (28)$$

But $\epsilon$ is the same for all neurons $k$, so its contribution to the total activation, $c = \epsilon \sum_i x_i$, is also the same for all neurons, and this effect is canceled by the softmax normalization, such that the output of each neuron remains unaffected by $\epsilon$:

$$y_k = \mathrm{softmax}_k(u_1 + c, \ldots, u_K + c) = \mathrm{softmax}_k(u_1, \ldots, u_K). \qquad (29)$$

As a result, the assumption of Eq. 26 that we used in our derivation is of little importance.

The difference of the SoftHebb plasticity rule

$$\Delta w_{ik} = \eta\, y_k\, (x_i - u_k\, w_{ik}) \qquad (30)$$

from the simplified rule of Eq. 22 is the multiplicative factor $u_k$. This factor is common between our rule and Oja's rule oja1982simplified. The effect of this factor is known to normalize the weight vector of each neuron to a length of one oja1982simplified, as also shown in similar rules with this multiplicative factor krotov2019unsupervised. We prove that this is the effect of the factor also in the SoftHebb rule, separately in Theorem A.1 and its Proof, provided at the end of the present Appendix A.

Therefore, the equilibrium weights of the SoftHebb synaptic plasticity rule are proportional to those of the simplified rule, normalized to unit length:

$$w_k^{eq} = \frac{\mu_k}{\|\mu_k\|}, \qquad (31)$$

which proves Theorem 2.5, and satisfies the optimality condition derived in Theorem 2.3. ∎

Proof of Theorem 2.6.

Similarly to the Proof of Theorem 2.5, we find the equilibrium parameter $w_{0k}^{eq}$ of the SoftHebb bias plasticity rule, by setting the expected update to zero:

$$E[\Delta w_{0k}] = \eta\, E\left[y_k\, e^{-w_{0k}} - 1\right] \qquad (32)$$
$$= \eta\, \left(e^{-w_{0k}}\, E[y_k] - 1\right) \qquad (33)$$
$$= \eta\, \left(e^{-w_{0k}} \int p(x)\, Q(C_k | x; \mathbf{w})\, dx - 1\right) \qquad (34)$$
$$= \eta\, \left(e^{-w_{0k}}\, Q(C_k) - 1\right) \qquad (35)$$
$$= \eta\, \left(e^{-w_{0k}}\, P(C_k) - 1\right) \qquad (36)$$
$$= 0. \qquad (37)$$

Therefore,

$$w_{0k}^{eq} = \ln P(C_k), \qquad (38)$$

which proves Theorem 2.6 and shows that the SoftHebb plasticity rule of the neuronal bias finds the optimal parameter of the Bayesian generative model, as defined by Eq. 5 of Theorem 2.3. ∎

Theorem A.1.

The equilibrium weights of the SoftHebb synaptic plasticity rule of Eq. 8 are implicitly normalized by the rule to a vector of length 1.

Proof of Theorem A.1.

Using a technique similar to krotov2019unsupervised, we write the SoftHebb plasticity rule as a differential equation:

$$\tau\, \frac{dw_{ik}}{dt} = y_k\, (x_i - u_k\, w_{ik}). \qquad (39)$$

The derivative of the squared norm of the weight vector is

$$\frac{d\|w_k\|^2}{dt} = 2 \sum_i w_{ik}\, \frac{dw_{ik}}{dt}. \qquad (40)$$

Replacing $dw_{ik}/dt$ in this equation with the SoftHebb rule of Eq. 39, and using the fact that, with normalized inputs, $\sum_i w_{ik}\, x_i = u_k$, it is

$$\tau\, \frac{d\|w_k\|^2}{dt} = 2\, y_k \left( \sum_i w_{ik}\, x_i - u_k \sum_i w_{ik}^2 \right) = 2\, y_k\, u_k\, \left(1 - \|w_k\|^2\right). \qquad (41)$$

This differential equation shows that, for $y_k\, u_k > 0$, the norm of the weight vector increases if $\|w_k\| < 1$ and decreases if $\|w_k\| > 1$, such that the weight vector tends to a sphere of radius 1, which proves the Theorem. ∎
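As a numerical illustration (ours, with arbitrary constants, not an experiment from the paper), the following simulates the rule on non-negative inputs and non-negative initial weights, so that $y_k\, u_k > 0$, and checks that every weight vector's norm approaches 1, as Eq. 41 predicts:

    import numpy as np

    rng = np.random.default_rng(1)
    K, n, eta = 4, 20, 0.05
    W = np.abs(rng.normal(size=(K, n)))      # start away from unit norm
    w0 = np.zeros(K)

    for _ in range(5000):
        x = np.abs(rng.normal(size=n))       # non-negative, pixel-like input
        xn = x / np.linalg.norm(x)
        u = W @ xn                           # linear activations
        e = np.exp(u + w0 - (u + w0).max())
        y = e / e.sum()                      # soft WTA output
        W += eta * y[:, None] * (xn[None, :] - u[:, None] * W)   # Eq. 8

    print(np.linalg.norm(W, axis=1))         # all entries close to 1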

Appendix B Details on alternate activation functions (Section 2.7)

Theorem 2.3, which concerns the synaptic plasticity rule in Eq. 8, was proven for the model of Definition 2.2, which uses a mixture of natural exponential component distributions, i.e. with base $e$ (Eq. 4):

$$Q(x; \mathbf{w}) = \sum_{k=1}^{K} e^{u_k}\, e^{w_{0k}}. \qquad (42)$$

This implied an equivalence to a WTA neural network with natural exponential activation functions (Section 2.4). However, it is simple to show that these results can be extended to other model probability distributions, and thus other neuronal activations.

Firstly, in the simplest of the alternatives, the base of the exponential function can be chosen differently. In that case, the posterior probabilities that are produced by the model's Bayesian inference, i.e. the network outputs, are given by a softmax with a different base. If the base of the exponential is $b$, then

$$Q(C_k | x; \mathbf{w}) = \frac{b^{u_k + w_{0k}}}{\sum_{j=1}^{K} b^{u_j + w_{0j}}}. \qquad (43)$$

It is obvious in the Proof of Theorem 2.3 in Appendix A that the same proof also applies to the changed base, if we use the appropriate logarithm in the KL divergence. Therefore, the optimal parameter vector does not change, and the SoftHebb plasticity rule also applies to the SoftHebb model with a different exponential base. This change of the base in the softmax bears similarities to the change of its exponent, in a technique that is called temperature scaling and has been proven useful in classification hinton2015distilling.

Secondly, the more conventional type of temperature scaling, i.e. that which scales the exponent, is also possible in our model, while maintaining a Bayesian probabilistic interpretation of the outputs, a neural interpretation of the model, and the optimality of the plasticity rule. In this case, with a temperature $T$, the model becomes

$$q(x | C_k; w_k) \propto e^{u_k / T}, \qquad Q(C_k; w_{0k}) = e^{w_{0k} / T}. \qquad (44)$$

The Proof of Theorem 2.3 in Appendix A also applies in this case, with a change in Eq. 18, but resulting in the same solution. Therefore, the SoftHebb synaptic plasticity rule is applicable in this case too. The solution for the neuronal biases, i.e. the parameters of the prior in the Theorem (Eq. 5), also remains the same, but with a factor of $T$: $w_{0k}^* = T\, \ln P(C_k)$.

Finally, and most generally, the model can be generalized to use any non-negative and monotonically increasing function $f$ for the component distributions, i.e. for the activation function of the neurons, assuming $f$ is appropriately normalized to be interpretable as a probability density function. In this case the model becomes

$$Q(x; \mathbf{w}) = \sum_{k=1}^{K} w_{0k}\, f(u_k). \qquad (45)$$

Note that there is a change in the parametrization of the priors into a multiplicative bias $w_{0k}$, compared to the additive bias in the previous versions above. This change is necessary in this general case, because not all functions have the property $f(u_k + w_{0k}) = f(u_k)\, g(w_{0k})$ that we used in the exponential case. We can show that the optimal weight parameters remain the same as in the previous case of an exponential activation function, also for this more general case of activation $f$. It can be seen in the Proof of Theorem 2.3 that, for a more general function than the exponential, Eq. 18 would instead become:

$$w_k^* = \arg\max_{w_k} E_{p(x | C_k)}\left[g(u_k)\right], \qquad (46)$$

where $g = \ln f$. We have assumed that $f$ is an increasing function, therefore $g$ is also increasing. The cosine similarity $u_k$ is symmetrically decreasing as a function of $w_k$ around $w_k \parallel \mu_k$. Therefore, the function $g(u_k)$ also decreases symmetrically around $w_k \parallel \mu_k$. Thus, the mean of that function under the probability distribution $p(x | C_k)$ is maximal when $w_k \parallel \mu_k$. As a result, Eq. 46 implies that in this more general model too, the optimal weight vector is $w_k^* = \mu_k / \|\mu_k\|$, and, consequently, it is also optimized by the same SoftHebb plasticity rule.

The implication of this is that the SoftHebb WTA neural network can use activation functions such as rectified linear units (ReLU), or other non-negative and increasing activations, such as rectified polynomials krotov2019unsupervised etc., and maintain its generative properties, its Bayesian computation, and the theoretical optimality of the plasticity rule. A more complex derivation of the optimal weight vector for alternative activation functions, which was specific to ReLU only, and did not also derive the associated long-term plasticity rule for our problem category (Definition 2.1), was provided by moraitis2020shortterm .
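A sketch of this generalized layer (our illustration of Eq. 45, not the authors' code): any non-negative, increasing activation $f$, here ReLU, with multiplicative prior biases and divisive normalization of the layer's total output.

    import numpy as np

    def general_soft_wta(x, W, prior, f=lambda u: np.maximum(u, 0.0)):
        """Generalized soft WTA of Eq. 45.
        W: (K, n) unit-norm rows; prior: (K,) multiplicative biases Q(C_k);
        f: non-negative, monotonically increasing activation (default ReLU)."""
        u = W @ (x / np.linalg.norm(x))    # cosine similarities u_k
        z = prior * f(u)                   # Q(C_k) * f(u_k); assumes some z_k > 0
        return z / z.sum()                 # divisive normalization -> posterior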