Insect cyborgs: Biological feature generators improve machine learning accuracy on limited data

08/23/2018
by   Charles B. Delahunt, et al.
University of Washington

Despite many successes, machine learning (ML) methods such as neural nets often struggle to learn given small training sets. In contrast, biological neural nets (BNNs) excel at fast learning. We can thus look to BNNs for tools to improve performance of ML methods in this low-data regime. The insect olfactory network, though simple, can learn new odors very rapidly. Its two key structures are a layer with competitive inhibition (the Antennal Lobe, AL), followed by a high dimensional sparse plastic layer (the Mushroom Body, MB). This AL-MB network can rapidly learn not only odors but also handwritten digits, better in fact than standard ML methods in the few-shot regime. In this work, we deploy the AL-MB network as an automatic feature generator, using its Readout Neurons as additional features for standard ML classifiers. We hypothesize that the AL-MB structure has a strong intrinsic clustering ability, and that its Readout Neurons, used as input features, will boost the performance of ML methods. We find that these "insect cyborgs", i.e. classifiers that are part-moth and part-ML method, deliver significantly better performance than baseline ML methods alone on a generic (non-spatial) 85-feature, 10-class task derived from the MNIST dataset. Accuracy improves by an average of 10% to 33% given few training samples per class, and by 6% to 10% given larger training sets. The moth-generated features increase ML accuracy even when the ML method's baseline accuracy already exceeds the AL-MB's own limited capacity. The two structures in the AL-MB, a competitive inhibition layer and a high-dimensional sparse layer with Hebbian plasticity, are novel in the context of artificial NNs but endemic in BNNs. We believe they can be deployed either prepended as feature generators or inserted as layers into deep NNs, to potentially improve ML performance.


1 Introduction

Machine learning (ML) methods in general, and neural nets (NNs) with backprop in particular, have posted tremendous successes in recent years [1, 2]. However, these methods, and NNs in particular, typically require large amounts of training data to attain high performance. This creates bottlenecks to deployment, and constrains the types of problems that can be addressed [3]. Thus it is desirable to improve ML methods' ability to learn from small training sets. This limited-data constraint is typical of a large and important group of ML targets, including tasks that use medical, scientific, or field-collected data, and also artificial intelligence efforts focused on rapid learning.

In this work, we seek to improve the input feature space which an arbitrary ML method will use for training. In particular, we propose an architecture that can be bolted onto the front end of an ML method, and which automatically generates, from the existing feature set, a new set of strongly class-separating features to supplement (or even replace) the existing feature set.

Biological neural nets (BNNs) are able to learn rapidly, even from just one or two samples. On the assumption that rapid learning requires effective ways to separate classes given limited data, we may look to BNNs for effective feature-generators [4]. One of the simplest BNNs that can learn is the insect olfactory network [5], containing the Antennal Lobe (AL) [6] and Mushroom Body (MB) [7], which can learn a new odor given only five exposures. This simple but effective feedforward network is built around three key elements that are ubiquitous in BNN designs: Competitive inhibition, high-dimensional sparse layers, and Hebbian update mechanisms. Specifically, the AL-MB network contains: (i) A pre-processing layer (the AL) built of units that competitively inhibit each other [8]; (ii) Projection, with sparse connectivity, up into and then down out of a sparsely-firing high-dimensional layer (the MB) [9, 10], where the dimension shift is typically 10x to 100x [11]; and (iii) Hebbian updates of plastic synaptic connections to train the system. Roughly speaking, the Hebbian rule is "fire together, wire together", i.e. updates are proportional to the product of firing rates of the sending and receiving neurons [12, 13]. Synaptic connections are largely random [14]. A schematic is given in Fig 1.

MothNet is a computational model of the Manduca sexta moth AL-MB [15] that demonstrated very rapid learning of vectorized MNIST digits, with performance superior to standard ML methods in the 1 to 10 training sample regime [16]. That is, it was able to encode substantial class-relevant information from very few samples. But MothNet also appears to have limited capacity: Accuracy leveled off at about 80%, consistent with related results in [17] and the biological fact that a moth can only learn about 8 odors.

In this work we examine whether the MothNet architecture can usefully serve, not as a classifier itself, but rather as the first stage of a multi-stage network. Our goal is to harness its class-information encoding abilities to generate strong features that can improve performance of a main downstream classifier. In particular, we test the following hypotheses (see Acknowledgements):
1. The AL-MB architecture has an intrinsic clustering ability, due specifically to the competitive inhibition layer and/or the sparse high-dimensional layer. That is, these structures have an inductive bias towards separating classes (just as convolutional neural nets have an inductive bias towards distinguishing visual data).
2. Despite its limitations, the trained AL-MB is an effective feature generator: Its Readout neurons contain class-separating information that will boost an arbitrary ML algorithm’s ability to classify test samples.

We test these hypotheses by combining MothNet with a downstream ML module, so that the Readouts of the trained AL-MB model feed into the ML module as additional features (from the ML perspective, the AL-MB acts as an automatic feature generator; from the biological perspective, the ML module stands in for the downstream processing in more complex BNNs). Our Test Case is a non-spatially-correlated, 85-feature, 10-class task derived from the downsampled, vectorized MNIST dataset (hereafter simply "MNIST", with the caveat that our version is vectorized and non-spatial). We restrict training set size to N ≤ 100 samples per class, so that the ML modules do not attain full accuracy on the task using the 85 features (pixels) alone.

We find evidence that these hypotheses are correct: The high-dimensional sparse layer and (to a lesser extent) the competitive inhibition layer, in combination with a Hebbian update rule, significantly improved the abilities of ML methods (NN, SVM, and Nearest Neighbors) to classify the test set in all cases, and especially at low numbers of training samples per class. That is, the input pixel features contain class-separating information that is not being extracted by the ML methods alone. The MothNet module encodes this information in a form that is accessible to the ML methods. If the learning performance of BNNs is any guide, these layers are simple, general-purpose feature generators that can potentially improve performance of ML methods in tasks where training data is limited.

In addition, the cyborgs significantly out-performed models that used features generated by PCA (Principal Components Analysis), PLS (Projection to Latent Structures), and NNs. They also out-performed NNs that were pre-trained on the Omniglot dataset [18] to initialize weights. These results indicate that the insect-derived network generated significantly stronger features than these other feature generator methods.


Figure 1: Schematic of the Moth Olfactory Network. Input features feed 1-to-1 into an 85-unit layer with competitive all-to-all inhibition (the AL). The AL projects with sparse, random connectivity (about 15%) into a 2500-unit sparsely-firing layer (the MB, with 5% to 10% activity). The MB projects densely to the Readout Neurons. The AL is not plastic. The only plastic synaptic weights are those that enter or leave the MB (in our experiments, the bulk of updates occurred between the MB and Readout Neurons). Training updates are done by the Hebbian rule given in Section 2, and unused connections decay towards 0, as in [16]. MothNet instances were generated by randomly assigning connectivity maps and synaptic weights according to template distributions.

2 Experimental setup

To generate our Test Case, we downsampled and preprocessed the MNIST dataset [19, 20] to give samples with 85 pixels-as-features, stripped of spatial information, as in [16]. We note that this is not the MNIST dataset in its usual context, i.e. a task with spatial structure and large pools of training data. Rather, here the MNIST data served as raw material for a generic, non-spatial Test Case. It had the advantage that our baseline ML methods (Nearest Neighbors, SVM, and Neural Net) did not attain full accuracy at low N, so it acted as a good test of whether the AL-MB can improve classification by ML methods.
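
The exact preprocessing (cropping, downsampling, and pixel selection) is specified in [16] and in the released Matlab code; the Python sketch below is only a rough, hypothetical stand-in showing one way to obtain 85 vectorized, non-spatial pixel features from MNIST (the pooling and pixel-selection choices here are our assumptions, not the authors' pipeline).

    import numpy as np

    def make_vectorized_mnist(images, n_features=85):
        # Crude stand-in for the paper's preprocessing: 2x2 average-pool each
        # 28x28 image to 14x14, keep the n_features pixels with highest variance
        # across the set, and return flat, non-spatial feature vectors.
        imgs = np.asarray(images, dtype=float) / 255.0             # (n, 28, 28)
        pooled = imgs.reshape(-1, 14, 2, 14, 2).mean(axis=(2, 4))  # (n, 14, 14)
        flat = pooled.reshape(len(pooled), -1)                     # (n, 196)
        keep = np.argsort(flat.var(axis=0))[-n_features:]          # most variable pixels
        return flat[:, keep]                                       # (n, 85)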

Full wiring details of the AL-MB model are given in [15]. Full Matlab code for MothNet simulations and these cyborg experiments, including comparison methods, can be found at https://github.com/charlesDelahunt/PuttingABugInML

Competitive inhibition in the moth AL works roughly as follows. Each neural unit in the AL receives input from one feature, and has two outputs: An inhibitory signal to other neural units in the AL, and an excitatory signal to the MB. Thus, each feature tries to dampen other features’ presence in the sample’s output signature from the AL.
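
As a concrete illustration (not the authors' implementation, which time-evolves stochastic differential equations), a single rate-based step of this all-to-all competitive inhibition might look as follows; the inhibition strength is an assumed, illustrative parameter.

    import numpy as np

    def al_competitive_inhibition(x, inhib_strength=0.5):
        # Each AL unit outputs its own (excitatory) input minus a fraction of the
        # mean activity of all *other* units, rectified at zero, so strong features
        # dampen weaker ones in the AL's output signature.
        others = x.sum() - x                  # summed activity of the other units
        out = x - inhib_strength * others / (len(x) - 1)
        return np.maximum(out, 0.0)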

Sparsity in the MB is of two types: First, the projections from the AL to the MB are non-dense (15% non-zero). Second, MB neurons fire sparsely, in the sense that only the strongest 5% to 10% of the total population are allowed to fire (through a mechanism of global inhibition).
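
A minimal numerical sketch of these two kinds of sparsity follows; the connectivity and firing fractions match the figures quoted above, while the weight distributions are assumed for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_al_to_mb_weights(n_al=85, n_mb=2500, connect_frac=0.15):
        # Sparse, random, non-negative AL->MB connectivity: ~15% of entries non-zero.
        mask = rng.random((n_mb, n_al)) < connect_frac
        return mask * rng.random((n_mb, n_al))

    def mb_response(al_output, w_al_mb, active_frac=0.07):
        # Project into the MB, then emulate global inhibition by letting only the
        # top ~5-10% most strongly driven MB neurons fire; the rest are silenced.
        drive = w_al_mb @ al_output
        k = max(1, int(active_frac * drive.size))
        thresh = np.partition(drive, -k)[-k]    # k-th largest drive value
        return np.where(drive >= thresh, drive, 0.0)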

All weights are non-negative, and are initialized randomly. Weight updates affect only MB→Readout connections (the AL is not plastic, and AL→MB learning rates are slow). Hebbian updates occur according to $\Delta w_{ij} = \gamma \, f_i f_j$ (if $f_i f_j > 0$), and $\Delta w_{ij} = -\delta \, w_{ij}$ (if $f_i f_j = 0$), where $f_i, f_j$ are the firing rates of the post- and pre-synaptic neurons and $\gamma, \delta > 0$ are growth and decay rates.
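
A sketch of this update in code (the learning and decay rates here are illustrative placeholders, not MothNet's calibrated values):

    import numpy as np

    def hebbian_update(w, f_pre, f_post, gamma=0.05, delta=0.01):
        # "Fire together, wire together": grow w_ij in proportion to the product
        # of pre- and post-synaptic firing rates where both fired, and decay
        # unused connections toward zero. Weights remain non-negative.
        coactivity = np.outer(f_post, f_pre)   # f_i * f_j for every synapse
        w = np.where(coactivity > 0, w + gamma * coactivity, (1.0 - delta) * w)
        return np.maximum(w, 0.0)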

Nearest-Neighbors and SVM used Matlab built-in functions as in [16]. The Neural Nets used Matlab's NN toolbox, with one layer (more layers did not help) and as many hidden units as features (i.e. 85 or 95; more units did not help). MothNet instances were generated randomly from templates. All hyperparameter details can be found in the online codebase. We note that our goal was to see if the MothNet-generated features improved on the baseline accuracy of the ML methods, whatever that baseline was, and that we deliberately varied the baseline by restricting training data. So the exact ML method hyperparameters were not central, as long as they were reasonable.
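
For readers outside Matlab, a rough scikit-learn analogue of the three baselines would be as below; the hyperparameters are illustrative and are not the exact settings in the released codebase.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    def baseline_models(n_features=85):
        # Nearest Neighbors, SVM, and a one-hidden-layer neural net with as many
        # hidden units as input features, mirroring the paper's three baselines.
        return {
            "NearNeigh": KNeighborsClassifier(n_neighbors=1),
            "SVM": SVC(kernel="rbf"),
            "NeuralNet": MLPClassifier(hidden_layer_sizes=(n_features,), max_iter=2000),
        }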


We ran four sets of experiments:

1. Cyborg vs baseline ML methods experiments

The main experiments were structured as follows:
1. A random set of N training samples per class was drawn from MNIST.
2. The ML methods trained on these samples, to provide a baseline (see Fig 2).
3. MothNet was trained on these same samples, using time-evolved stochastic differential equation simulations and Hebbian updates as in [16] (see Fig 2).
4. The ML methods were then retrained from scratch, with the Readout Neuron outputs from the trained MothNet instance fed in as additional features (see Fig 2). These were the "insect cyborgs".
5. Trained ML accuracies of the baselines and cyborgs were compared to assess the value of the AL-MB as a feature generator (a minimal sketch of this pipeline follows the list). These experiments were repeated 13 times per N, for each ML method.
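
The sketch below captures steps 2 to 5 in scikit-learn style; `mothnet_readouts` is a hypothetical stand-in for a MothNet instance already trained on the same N samples (the real model is the authors' Matlab time-evolution code).

    import numpy as np
    from sklearn.base import clone

    def run_cyborg_comparison(ml_model, X_train, y_train, X_test, y_test, mothnet_readouts):
        # Baseline: the ML method trained on the 85 pixel features alone.
        baseline = clone(ml_model).fit(X_train, y_train)
        base_acc = baseline.score(X_test, y_test)

        # Cyborg: retrain from scratch with the 10 Readout Neuron outputs of the
        # trained MothNet appended as additional features.
        Xtr = np.hstack([X_train, mothnet_readouts(X_train)])
        Xte = np.hstack([X_test, mothnet_readouts(X_test)])
        cyborg = clone(ml_model).fit(Xtr, y_train)
        return base_acc, cyborg.score(Xte, y_test)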

2. Other feature generators vs baseline ML methods

To compare the effectiveness of MothNet features to those generated by conventional ML methods, we ran experiments structured as the MothNet experiments above, but with the MothNet feature module replaced by one of the following options (a sketch of the first two options follows the list):
1. PCA (Principal Components Analysis) applied to the MNIST training samples. The new features were the projections onto each of the top 10 modes.
2. PLS (Projection to Latent Structures) applied to the MNIST training samples. The new features were the projections onto each of the top 10 modes. We expected PLS to do better than PCA because PLS takes class information into account.
3. NN pre-trained on the MNIST training samples. The new features were the (logs of the) 10 output units. This feature generator was used as a front end to SVM and Nearest Neighbors only.
4. NN with weights first modulated by training on the vectorized Omniglot dataset, then trained on the MNIST training samples. (This last was a transfer learning method, not a feature generator.)
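
For the first two options, an illustrative scikit-learn equivalent of "projections onto the top 10 modes as new features" is sketched below (the paper itself used Matlab built-ins, so these calls are our stand-ins).

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cross_decomposition import PLSRegression

    def pca_features(X_train, X_new, n_modes=10):
        # New features: projections of each sample onto the top 10 PCA modes,
        # fit on the training samples only.
        return PCA(n_components=n_modes).fit(X_train).transform(X_new)

    def pls_features(X_train, y_train, X_new, n_modes=10, n_classes=10):
        # New features: projections onto the top 10 PLS modes. PLS regresses onto
        # one-hot class labels, so unlike PCA it uses class information.
        Y = np.eye(n_classes)[y_train]
        return PLSRegression(n_components=n_modes).fit(X_train, Y).transform(X_new)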

3. Relative importance of AL vs MB experiments

There are two key structural components in the AL-MB, the competitive inhibition layer (the AL) and projection into a high-dimensional sparse layer (the MB) with Hebbian synaptic updates. These two structures can be deployed separately or together. In particular, the (trainable) high-dimensional sparse layer can be deployed with or without the competitive inhibition layer. In order to assess the relative value of the competitive inhibition layer, mutant MothNets were generated from templates that had a "pass-through" AL, i.e. with uniform weights and no lateral inhibition (see Fig 2). Steps 1 to 4 above were followed using these mutant MothNets (so Step 4 used the mutant MothNet's Readouts as the additional features; see Fig 2). The results from Step 4 were then compared to those of full cyborgs.

4. Cyborg experiments on Omniglot

These experiments were set up as in (1), but used the Omniglot dataset, a collection of hand-drawn characters with 136 classes of 20 samples each. For each run, 10 Omniglot classes were randomly chosen. Thumbnails were subsampled down to 200 pixels and vectorized. N was capped at 15, to ensure at least 5 test samples per class.
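
A sketch of the per-run setup, assuming the thumbnails have already been subsampled to 200-pixel vectors and grouped by class (the helper name and data layout are hypothetical):

    import numpy as np

    def sample_omniglot_run(X_by_class, n_train, rng, n_classes=10):
        # One run: draw 10 random Omniglot classes (20 samples each), use up to
        # n_train <= 15 samples per class for training and the rest for testing.
        chosen = rng.choice(sorted(X_by_class), size=n_classes, replace=False)
        Xtr, ytr, Xte, yte = [], [], [], []
        for label, c in enumerate(chosen):
            X = rng.permutation(X_by_class[c])         # shuffle the 20 samples
            Xtr.append(X[:n_train]); ytr.extend([label] * n_train)
            Xte.append(X[n_train:]); yte.extend([label] * (len(X) - n_train))
        return np.vstack(Xtr), np.array(ytr), np.vstack(Xte), np.array(yte)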


Figure 2: Schematic of the various Learner configurations. Two switches created the various models. In the ordinary MothNet, input pixels passed through the AL, then the MB, and prediction was based on a log-likelihood over the Readout Neurons as in [16]. The ordinary (baseline) ML module accepted only input pixels as features. Two cyborg variants were tested: In the full cyborg, the Readouts of an ordinary trained MothNet were fed into the ML module as additional features. In a mutant cyborg, used to test the role of the AL, Readouts from a trained MothNet with a disabled (pass-through, no lateral inhibition) AL were fed into the ML module as additional features.

3 Results

The ML baseline methods (no added features) started at 10% to 30% accuracy for N = 1 sample per class, and rose to 80% to 88% accuracy (depending on method) at N = 100, where we stopped our sweep. This baseline accuracy is marked by the lower colored circles in Fig 3.

3.1 Gains due to MothNet features (i.e. cyborgs)

MothNet-ML cyborgs, i.e. networks in which the 10 Readouts of the trained MothNet were fed into the ML module as 10 additional features, showed consistently improved Test set performance versus their ML baselines, for all ML methods at all N, except for SVM at the very lowest N. Cyborg accuracy is marked by the upper colored circles in Fig 3, and the raw gains in accuracy are marked by thick vertical bars.

Raw increases in accuracy due to cyborgs were fairly stable across all N, for all ML models. This led to two trends in terms of relative changes. Relative gains, i.e. as a percentage of baseline, were highest at low numbers of training samples per class: Average relative gains were 10% to 33% at low N, and 6% to 10% at higher N (see Fig 4 A). Conversely, the relative reduction in Test set error, as a percentage of baseline error, increased substantially as baseline accuracy grew (see Fig 4 B). Thus, MothNet cyborgs reduced Test set error by over 50% on the most accurate models, such as NNs with 80% baseline accuracy. Of the ML methods, the Neural Net cyborgs had the best performance and also showed the highest percentage gains.


Figure 3: Trained accuracy of baseline ML and full cyborg classifiers, vs number of training samples per class. Baseline ML accuracies are shown as small circles, cyborg accuracies are shown as larger circles, and thick vertical bars mark the increase in accuracy. In almost every case the cyborgs had significantly improved accuracy (5% to 33% relative increase), indicating that the MothNet Readouts are information-rich. Std devs (σ) for each baseline method are given as solid dots near the x-axis (cyborg σs were similar). The inset shows the magnitude of raw gain in accuracy (cyborg over ML baseline) in units of std dev, using the Fisher discriminant (see text). 13 runs per data point.

Gains were significant in almost all cases once N exceeded a few training samples per class. Table 1 gives the p-values of the gains due to MothNet features, for each N and ML method. p-values are calculated from the Fisher linear discriminant of the two accuracy distributions, i.e. the distance between the mean of the ML baseline accuracy distribution and the mean of the cyborg accuracy distribution, measured in std devs. These values are shown in the inset of Fig 3. Lower significance at low N was often due to large std dev (relative to mean) in ML baseline accuracies (these std devs are seen as dots at the bottom of Fig 3).

ML method N = 1 2 3 5 7 10 15 20 30 40 50 70 100
NearNeigh 58 42 20 2 4 4 2 1 9 16 7 0 3
SVM 100 96 31 39 18 4 6 16 4 13 8 0 4
Neural Net 89 76 48 7 3 1 1 1 0 0 1 8 0
Table 1: p-values, given as percentages, of gains in accuracy over baseline due to cyborg features at each N. p-values correspond to the distance in std devs between the means of the accuracy distributions (baseline and cyborg) as follows: 100 ↔ identical distributions; 32 ↔ 1 std dev; 4.5 ↔ 2 std devs; 0.27 ↔ 3 std devs.
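
As a sketch of this computation (the pooled-variance normalization is our assumption; it reproduces the correspondences in the caption above):

    import numpy as np
    from scipy.stats import norm

    def gain_p_value_percent(baseline_accs, cyborg_accs):
        # Fisher-style distance between the two accuracy distributions, in std
        # devs, mapped to a two-sided normal tail probability as a percentage:
        # 1 std dev -> ~32, 2 -> ~4.5, 3 -> ~0.27, as in the Table 1 caption.
        mu_b, mu_c = np.mean(baseline_accs), np.mean(cyborg_accs)
        sd_b, sd_c = np.std(baseline_accs), np.std(cyborg_accs)
        d = abs(mu_c - mu_b) / np.sqrt(0.5 * (sd_b**2 + sd_c**2))
        return 100.0 * 2.0 * norm.sf(d)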

Remarkably, adding a MothNet front-end improved ML accuracy even in cases where the ML module baseline already exceeded the accuracy ceiling of MothNet (about 75% [16]), at N = 15 to 100 samples per class. This implies that the Readouts of MothNet contain valuable clustering information which ML methods are able to leverage more effectively than MothNet itself does. Also remarkably, in some cases the highest gains by the NN-cyborg came from using only the MothNet's Readouts as features and ignoring the original pixel features, an indication of the strong clustering abilities of the AL-MB architecture.


Figure 4: Effects on test set accuracy of cyborg over baseline ML, vs baseline ML accuracy (given as %). A: Mean relative gains in accuracy over ML baseline, due to cyborgs (this is the size of the vertical red bars, relative to baselines, in Fig 3). B: Relative reduction in test set error due to cyborgs. Because raw gains were steady across all baseline accuracies, the reductions in test set error were very high for ML models with high baseline accuracy. Neural Net cyborgs saw the largest benefits. 13 runs per data point.

3.2 Comparison to other feature generators

To compare with other methods, we ran the feature generation framework using PCA (Principal Components Analysis, projections onto 10 modes), PLS (Projection to Latent Structures, projection onto top 10 modes), and NN (logs of the 10 output units). In each case, the method used the training samples to generate 10 new features. Each method was run using a Matlab built-in function. The Matlab code can be found at https://github.com/charlesDelahunt/PuttingABugInML. For NN as baseline, we did not use NN-generated features, but instead initialized the NN network weights by pre-training on the Omniglot dataset, then trained on the MNIST data as usual.

With few exceptions, MothNet features were much more effective than these other methods. Tables 2, 3, and 4 give, for each baseline ML classifier, the relative increase in mean accuracy due to the various feature generators (or to pre-training). "MothNet" refers to the cyborgs. "NA" appears in the tables for PLS and SVM at N = 1, because PLS and SVM required at least 2 training samples per class to run. 13 runs per data point.

FG method N = 1 2 3 5 7 10 15 20 30 40 50 70 100
PCA -67 0.7 0.6 1.4 1.2 1.2 1.5 1 1.3 1.4 0.0 0.9 1.5
PLS NA 1.4 0.6 1.6 2.1 1.5 1.1 1.9 1.2 1.1 0.4 0.9 -0.1
Neural Net -1.4 1.3 2.1 1.5 2.6 2.1 4.4 3.2 4.7 3.4 3.9 3.9 3.7
MothNet 13.6 13.9 14.2 16.9 11.5 10 9.6 10 5.6 5.1 6.6 6.1 4.7
Table 2: Nearest Neighbor baseline: Relative percentage increase in accuracy over Nearest Neighbor as baseline classifier, due to various feature generators (“FG”).
FG method N = 1 2 3 5 7 10 15 20 30 40 50 70 100
PCA NA 12.2 -0.4 -1.4 0.3 0.2 0.2 -0.9 0.3 -0.5 -0.8 -1.4 -0.5
PLS NA -14.1 4.2 3.5 1.5 -0.2 -2.6 -4 -5.4 -5.6 -5.3 -5.1 -5.5
Neural Net NA 6.8 -1.3 -3.7 -2 -0.9 1.7 0.5 4.3 4.9 4.1 4.9 4.9
MothNet NA 0.8 6.5 10.7 10.8 10 7.8 6.3 7.2 5.8 6.9 8.3 6.2
Table 3: SVM baseline: Relative percentage increase in accuracy over SVM as baseline classifier, due to various feature generators (“FG”).
FG method N = 1 2 3 5 7 10 15 20 30 40 50 70 100
PCA -57 0.2 -0.8 1.2 2.6 1.7 0.3 1.3 -0.3 0 0.2 0.3 0.2
PLS NA 0.2 5.9 1.0 1.5 2.8 -0.2 1.2 0.3 1.2 1.6 1.5 1.9
preTrainOmni 15 4.2 5.8 -3.1 -1.1 0.2 1.3 1.5 -3.4 -2.5 -0.4 -4.7 -1.1
MothNet 4 17 15 13.1 13 11.3 10.8 9.0 9.7 8.4 8.5 7.1 6.4
Table 4: Neural Net baseline: Relative percentage increase in accuracy over Neural Net as baseline classifier, due to various feature generators (“FG”). “preTrainOmni” means: Initialize NN weights by training on the Omniglot set, then train on the MNIST training samples.

3.3 Relative contribution of the AL and MB layers

MothNet has two key structures, a competitive inhibition layer (the AL) and a high-dimensional, sparse layer (the MB). Cyborgs built from MothNets with a “pass-through” (identity) AL still posted significant improvements in accuracy over baseline ML methods. The gains of cyborgs with pass-through ALs were generally between 60% and 100% of the gains posted by cyborgs with normal ALs (see Table 5), suggesting that the high-dimensional, trainable layer (the MB) was of primary importance. However, the competitive inhibition of the AL layer clearly added value in terms of generating strong features, contributing up to 40% of the total gain. NNs benefitted most from the competitive inhibition layer.

In terms of overall effect on downstream ML modules, a functioning AL enabled slightly better, more reliable gains: Averaged over all ML methods and all numbers of training samples, a functioning AL gave a mean raw increase in accuracy of 5.6% (standard error 0.38), while a pass-through AL gave a mean raw increase in accuracy of 5.0% (standard error 0.43).

ML method N = 1 2 3 5 7 10 15 20 30 40 50 70 100
NearNeigh 82 100 91 76 100 100 58 74 88 64 100 100 65
SVM NA NA 100 87 79 97 75 94 98 82 100 76 15
Neural Net 100 60 62 67 75 91 100 93 100 100 100 82 65
Table 5: The relative importance of the MB, vs number of training samples per class N. Entries give the gains posted by mutant cyborgs as a percentage of the gains of full cyborgs (shown in Fig 4), for the three ML methods. Entries of 100% indicate that average gains from the pass-through AL were greater than or equal to average gains from the normal AL. 13 runs per data point.

3.4 Cyborgs on the Omniglot dataset

We also ran the cyborg experiments on a downsampled, vectorized Omniglot dataset. Experimental set-up was the same as for MNIST, except that vectorized thumbnails had 200 pixels and 10 Omniglot classes were selected at random for each run. Table 6 below gives relative percentage increases in accuracy due to MothNet. MothNet-generated features resulted in high relative gains in accuracy (up to 27%). However, due to the low N (at most 15 training samples per class), the std dev of baseline ML accuracy was always high (roughly twice as large as for MNIST). Thus the gains were not significant (in a p-value sense), except for SVMs.

ML method N = 1 2 3 5 7 10 15
NearNeigh 7.5 12.3 9.5 14.8 7.3 1.6 -2.6
SVM NA 4.1 22.7 27.7 26.1 19.1 12
Neural Net 0.7 8.2 10.8 13.3 10.9 11.8 2.3
Table 6: Relative percentage increase in accuracy over baseline method due to cyborg features at each N, on Omniglot data. 13 runs per data point.

4 Discussion

Strong, automatically-generated feature sets enhance the power of ML algorithms to extract structure from data. They are always desirable tools, but especially so when training data is limited. Many ML targets, such as tasks for which data must be manually collected in medical, scientific, or field settings, do not have the luxury of vast amounts of (e.g. internet-generated) training data, so they must extract maximum value from the limited available data. This large class of ML targets also includes Artificial Intelligence systems that seek adaptive and rapid learning skills. In this context, biological structures and mechanisms are potentially useful tools, given that BNNs excel at rapid learning.

Our experiments deployed an architecture based on a simple BNN, the moth olfactory network, to generate features to support ML classifiers. The three key elements of this network are novel in the context of engineered NNs, but are endemic in BNNs of all complexity levels: (i) a competitive inhibition layer; (ii) a high-dimensional sparse layer; and (iii) a Hebbian plasticity mechanism for weight updates in training. Our experiments indicate that these structures, as combined in the MothNet model of the insect olfactory network, create a highly effective feature generator whose Readout Neurons contain strong class-specific information.

In particular, using MothNet as a feature generator upstream of standard ML methods significantly and consistently improved their learning abilities on MNIST. That is, some class-relevant information in the raw feature distributions was not extracted by the ML methods alone, but pre-processing by MothNet made that information accessible. Relative increase in accuracy averaged 10% to 33% at low N and 6% to 10% at higher N, while the relative reduction in Test set error exceeded 50% for NN models with higher (around 80%) baseline accuracy.

MothNet features were much more useful than features generated by standard methods such as PCA, PLS, or NNs, and also more useful than pre-training NNs on similar data. We hypothesize that the “orthogonality” (loosely used) of the BNN structures and mechanisms in MothNet, relative to the baseline ML methods and to methods such as PCA, allowed MothNet to extract otherwise inaccessible clustering information.

Not only can these structures be readily prepended as feature generators to arbitrary ML modules, as we did here, but they can perhaps also be inserted as layers into deep NNs. Indeed, this is what BNNs appear to do.

These gains can also be viewed as savings on training data: For example, with 30 training samples per class, a MothNet+NN cyborg attains the same Test accuracy (79%) as a NN baseline attains with 100 training samples per class, a savings of over 3x in training data. These savings in training data can be seen in Fig 3 by drawing horizontal lines between cyborg and baseline accuracies. Savings consistently ranged from 1.5x to 3x. If these accuracy gains and commensurate savings held for higher numbers of training samples in more difficult tasks, the savings in data requirements would be substantial, an important benefit for many ML use-cases.

Comparison of the Mushroom Body to sparse autoencoders

The insect MB is a biological means to project codes into a sparse, high-dimensional space. It naturally brings to mind sparse autoencoders (SAs) [21, 22]. However, there are several differences, beyond the fact that MBs are not trying to match the identity function.

First, in SAs the goal is typically to detect lower-dimensional structures that carry the input data. Thus the sparse layers of SAs have fewer active neurons than the nominal dimension of the input. In the MB, the number of neurons increases manyfold (e.g. 30x), so that even with enforced sparsity the number of active MB neurons is much greater than the input dimension: In MothNet there are approximately 150 to 200 active neurons in the MB vs 85 input features. The functional effects are also different: In MNIST experiments in [22], a sparse layer with 100 active neurons (vs 784 input pixels, i.e. a ratio of 1:8) captured only very local features and was not effective for feeding into shallow neural nets (though it was useful for deeper nets). In our experiments, a ratio of 2:1 (i.e. 16x that of the SA) generated features that were very effective as input to a shallow net.

Second, there is no off-line training or pre-tuning step, as used in some SAs, though of course Mother Nature has been tinkering with this system for a long time. Third, SAs typically (to our knowledge) require large amounts of training data (e.g. 5000 samples per class in [22]), while the MB needs as few as one training sample per class to bake in structure that improves classification. Fourth, the updates in SAs are by backprop, while those in MBs are Hebbian. While the ramifications of this difference are unclear, we suspect that the two methods yield distinct results, and that the dissimilarity of the optimizers (MothNet vs ML) was an asset in our experiments.

The MB shares with Reservoir Networks [23] a (non-linear) projection into a high-dimensional space and (linear) projection out to a Readout layer. A major difference is that in the MB neurons are not recurrently connected, while in a Reservoir Network they are. SVMs also use projection into high-dimensional spaces, and it is perhaps due to this commonality that cyborgs were less beneficial to SVMs than to NNs and Nearest Neighbors.

Role of the competitive inhibition layer

The competitive inhibition layer may enhance classification by creating several attractor basins for inputs, each focused according to which subsets of features are most strongly present, which in turn depends on the classes. This might serve to push otherwise similar samples (of different classes) away from each other, towards their respective class attractors, increasing the effective distance between the samples. Thus the outputs of the AL, after this competitive inhibition, may have better separation by class.

However, in our experiments on this particular dataset, while the competitive inhibition layer (AL) did benefit the downstream ML classifier, it was less important than the sparse layer (MB). We see two reasons why this might be so. First, the AL has other jobs to do in the insect olfactory network, such as gain control and corralling inputs from the noisy antennae [24, 25]. Perhaps these are the AL's primary tasks, and separating input signals is a secondary task. Second, the MothNet model was transferred to the MNIST task from a model developed to study odor learning that was calibrated to in vivo moth data [15]. Perhaps the AL has a larger role in the natural, odor-processing setting, and its transfer to the MNIST task modified the overall balance of the AL-MB system and reduced the importance of the AL relative to the MB. That said, the best results and also the most consistent improvements were posted by full cyborgs, i.e. those generating features using the full AL-MB network.

Role of Hebbian updates

We suspect that much of the success of BNNs (and MothNet) is due to the Hebbian update mechanism, which appears to be quite distinct from typical ML weight update methods. It has no objective function or output-based loss that is pushed back through the network as in backprop or agent-based reinforcement learning (there is no “agent” in the MothNet system). Hebbian weight updates, either growth or decay, occur on a local “use it or lose it” basis.

We also suspect that much of the success of the cyborgs was due to the stacking of two distinct update methods, e.g. Hebbian and backprop. In our experience, stacking dissimilar ML methods is more productive than stacking similar methods. This may be one reason MothNet cyborgs delivered improvement to ML accuracy even in cases where the baseline ML accuracy already exceeded the MothNet’s top performance, enabling up to 40% reductions in Test set error: Each system brings unique structure-extracting skills to the data. It may also explain why projecting into the high-dimensional MB is not redundant when paired with an SVM, which also projects into a high-dimensional space: The two methods of learning the projections are different.

Limitations

A practical limitation of this method, in its current form, is that MothNet trains on and evaluates samples by the time-evolution of systems of coupled differential equations. This is time-consuming (roughly 4 seconds per sample on a laptop), and the cost increases non-linearly for more complex datasets with high-dimensional feature spaces, since these would likely require larger networks with more neurons per layer and thus more equations to evolve. In addition, the time-evolution system does not conveniently mesh with other ML platforms such as Tensorflow. Thus, a key future project is to develop different methods of running MothNet-like architectures that bypass the computations of time evolution and mesh with other platforms, yet functionally preserve a Hebbian update mechanism.

Acknowledgments

Our thanks to Blake Richards, who articulated these hypotheses and suggested these experiments.
CBD gratefully acknowledges partial funding from the Swartz Foundation.

References