1 Introduction
Sparse expansive representations are ubiquitous in neurobiology. Expansion means that a high-dimensional input is mapped to an even higher-dimensional secondary representation. Such expansion is often accompanied by a sparsification of the activations: dense input data is mapped into a sparse code, where only a small number of secondary neurons respond to a given stimulus.
A classical example of the sparse expansive motif is the Drosophila fruit fly olfactory system. In this case, approximately projection neurons send their activities to about Kenyon cells (turnerOlfactoryRepresentationsDrosophila2008), thus accomplishing an approximately x expansion. An input stimulus typically activates approximately of projection neurons, and less than Kenyon cells (turnerOlfactoryRepresentationsDrosophila2008), providing an example of significant sparsification of the expanded codes. Another example is the rodent olfactory circuit. In this system, dense input from the olfactory bulb is projected into piriform cortex, which has x more neurons than the number of glomeruli in the olfactory bulb. Only about of those neurons respond to a given stimulus (mombaertsVisualizingOlfactorySensory1996). A similar motif is found in rat’s cerebellum and hippocampus (dasguptaNeuralAlgorithmFundamental2017a).
From the computational perspective, expansion is helpful for increasing the number of classification decision boundaries realizable by a simple perceptron (coverGeometricalStatisticalProperties1965) or for increasing memory storage capacity in models of associative memory (hopfieldNeuralNetworksPhysical1982a). Additionally, sparse expansive representations have been shown to reduce intra-stimulus variability and the overlaps between representations induced by distinct stimuli (sompolinskySparsenessExpansionSensory2014). Sparseness has also been shown to increase the capacity of models of associative memory (tsodyksEnhancedStorageCapacity1988).
The goal of the present work is to use this “biological” inspiration about sparse expansive motifs, as well as local Hebbian learning, to design a novel hashing algorithm, BioHash, that can be used in similarity search. Below we describe the task and the algorithm, and demonstrate that BioHash improves retrieval performance on common benchmark datasets.
Similarity search and LSH. In similarity search, given a query , a similarity measure , and a database containing items, the objective is to retrieve a ranked list of items from the database most similar to . When data is high-dimensional (e.g. images/documents) and the databases are large (millions or billions of items), this is a computationally challenging problem. However, approximate solutions are generally acceptable, with Locality Sensitive Hashing (LSH) being one such approach (wangHashingSimilaritySearch2014). Similarity search approaches may be unsupervised or supervised. Since labelled information for extremely large datasets is infeasible to obtain, our work focuses on the unsupervised setting. In LSH (indykApproximateNearestNeighbors1998; charikarSimilarityEstimationTechniques2002), the idea is to encode each database entry (and query ) with a binary representation ( respectively) and to retrieve entries with smallest Hamming distances . Intuitively (see charikarSimilarityEstimationTechniques2002 for a formal definition), a hash function is said to be locality sensitive if similar (dissimilar) items and are close by (far apart) in Hamming distance . LSH algorithms are of fundamental importance in computer science, with applications in similarity search, data compression and machine learning (andoniNearoptimalHashingAlgorithms2008).
Drosophila olfactory circuit and FlyHash. In classical LSH approaches, the data dimensionality is much larger than the embedding space dimension , resulting in low-dimensional hash codes (wangHashingSimilaritySearch2014; indykApproximateNearestNeighbors1998; charikarSimilarityEstimationTechniques2002). In contrast, a new family of hashing algorithms has been proposed (dasguptaNeuralAlgorithmFundamental2017a) where , but the secondary representation is highly sparse with only a small number of units being active, see Figure 1. We call this algorithm FlyHash in this paper, since it is motivated by the computation carried out by the fly’s olfactory circuit. The expansion from the dimensional input space into an dimensional secondary representation is carried out using a random set of weights (dasguptaNeuralAlgorithmFundamental2017a; caronRandomConvergenceOlfactory2013). The resulting high-dimensional representation is sparsified by Winner-Take-All (WTA) feedback inhibition in the hidden layer, resulting in top of units staying active (linSparseDecorrelatedOdor2014; stevensStatisticalPropertyFly2016).
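As a rough illustration of the FlyHash pipeline described above (random expansion followed by WTA sparsification), the following NumPy sketch may help. The function name `flyhash_codes`, the choice of sparse binary random weights, and the parameters `m`, `k` and `inputs_per_unit` are assumptions made for this illustration, not details given in the text above.

```python
import numpy as np

def flyhash_codes(X, m, k, inputs_per_unit, seed=0):
    """Hash rows of X (n, d) into m-dimensional binary codes with k active units."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumption: each hidden unit sums a random subset of the input dimensions.
    W = np.zeros((m, d))
    for mu in range(m):
        W[mu, rng.choice(d, size=inputs_per_unit, replace=False)] = 1.0
    activations = X @ W.T  # expansion from d to m dimensions
    # Indices of the k largest activations in every row (order within the k is arbitrary).
    top_k = np.argpartition(-activations, k - 1, axis=1)[:, :k]
    codes = np.zeros((n, m), dtype=np.uint8)
    np.put_along_axis(codes, top_k, 1, axis=1)  # winner-take-all: only the winners stay active
    return codes

X = np.random.default_rng(1).random((5, 20))  # toy data, 5 points in 20 dimensions
codes = flyhash_codes(X, m=200, k=8, inputs_per_unit=3)
```

Every row of `codes` has exactly `k` ones, so the code is sparse even though it lives in a space ten times larger than the input.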
While FlyHash uses random synaptic weights, sparse expansive representations are not necessarily random (sompolinskySparsenessExpansionSensory2014), perhaps not even in the case of Drosophila (gruntmanIntegrationOlfactoryCode2013; zhengCompleteElectronMicroscopy2018). Moreover, using synaptic weights that are learned from data might help to further improve the locality sensitivity property of FlyHash . Thus, it is important to investigate the role of learned synapses on the hashing performance. A recent work, SOLHash (liFastSimilaritySearch2018a), takes inspiration from FlyHash and attempts to adapt the synapses to data, demonstrating improved performance over FlyHash. However, every learning update step in SOLHash invokes a constrained linear program and also requires computing pairwise inner products between all training points, making it very time consuming and limiting its scalability to datasets of even modest size. These limitations restrict SOLHash to training only on a small fraction of the data (liFastSimilaritySearch2018a). Additionally, SOLHash is biologically implausible (for an extended discussion, see Sec. 5). BioHash also takes inspiration from FlyHash and demonstrates improved performance compared to the random weights used in FlyHash, but it is fast, online, scalable and, importantly, neurobiologically plausible.
Not only can "biological" inspiration lead to improved hashing techniques, but the opposite might also be true. One of the statements of the present paper is that BioHash satisfies the locality sensitive property and, at the same time, utilizes a biologically plausible learning rule for synaptic weights (krotovUnsupervisedLearningCompeting2019). This provides evidence toward the proposal that sparse expansive representations are so common in biological organisms because they perform locality sensitive hashing. In other words, they cluster similar stimuli together and push distinct stimuli far apart. Thus, our work provides evidence toward the proposal that LSH might be a fundamental computational principle utilized by the sparse expansive circuits, see Fig. 1 (right). Importantly, learning of synapses must be neurobiologically plausible (the synaptic plasticity rule should be local).
Contributions. Building on inspiration from FlyHash and, more broadly, the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm, BioHash, that in contrast with previous work (dasguptaNeuralAlgorithmFundamental2017a; liFastSimilaritySearch2018a) produces sparse high-dimensional hash codes in a data-driven manner, with synapses learned in a neurobiologically plausible way. We provide an existence proof for the proposal that LSH may be a fundamental computational principle in neural circuits (dasguptaNeuralAlgorithmFundamental2017a) in the context of learned synapses. We incorporate convolutional structure into BioHash, resulting in a hashing method with improved performance compared to previously published benchmarks. From the perspective of computer science, we show that BioHash is simple, scalable to large datasets, and demonstrates good performance for similarity search. Interestingly, BioHash outperforms a number of recent SOTA deep hashing methods trained via backpropagation.
2 Approximate Similarity Search via BioHashing
Formally, if we denote a data point as , we seek a binary hash code . We define the hash length of a binary code as , if the exact Hamming distance computation is . Below we present our bioinspired hashing algorithm.
2.1 Bioinspired Hashing (BioHash)
We adopt a biologically plausible unsupervised algorithm for representation learning from krotovUnsupervisedLearningCompeting2019. Denote the synapses from the input layer to the hash layer as . The learning dynamics for the synapses of an individual neuron , denoted by , is given by
(1) 
where , and
(2) 
and , with , where is Kronecker delta and is the time scale of the learning dynamics. The Rank operation in equation (1) sorts the inner products from the largest () to the smallest (). It can be shown that the synapses converge to a unit (norm) sphere (krotovUnsupervisedLearningCompeting2019). The training dynamics can be shown to minimize the following energy function
(3) 
where indexes the training example. Note that the training dynamics do not perform gradient descent, i.e. . However, the time derivative of the energy function under dynamics (1) is always negative (we show this for the case below),
(4) 
where the Cauchy-Schwarz inequality is used. For every training example the index of the activated hidden unit is defined as
(5) 
Thus, the energy function decreases during learning. A similar result can be shown for .
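To make the learning dynamics more concrete, here is a minimal sketch of one online update of a competitive Hebbian rule of the kind introduced in krotovUnsupervisedLearningCompeting2019. It is written under several assumptions that are not stated in the text above: the ordinary Euclidean inner product is used, the winning unit receives a Hebbian update of strength 1, the unit ranked `r` receives an anti-Hebbian update of strength `delta`, and the names `competitive_hebbian_step`, `lr`, `r` and `delta` are hypothetical.

```python
import numpy as np

def competitive_hebbian_step(W, x, lr=0.02, r=2, delta=0.0):
    """One online update of the (m, d) weight matrix W for a single input x.

    Assumes r >= 2, so the anti-Hebbian unit differs from the winner.
    """
    overlaps = W @ x                # inner product of every hidden unit with x
    order = np.argsort(-overlaps)   # Rank: unit indices from largest to smallest overlap
    g = np.zeros(W.shape[0])
    g[order[0]] = 1.0               # Hebbian update for the winner
    if delta != 0.0:
        g[order[r - 1]] = -delta    # anti-Hebbian update for the unit ranked r
    # The term -overlap * W acts as an Oja-like decay that counteracts unbounded growth.
    return W + lr * g[:, None] * (x[None, :] - overlaps[:, None] * W)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 4))
x = rng.normal(size=4)
W1 = competitive_hebbian_step(W0, x, lr=0.1, r=2, delta=0.4)
changed = np.flatnonzero(np.any(W1 != W0, axis=1))  # rows that were modified
```

Only the two units selected by the ranking change in a given step, which is what makes the update local: each synapse needs only its own pre- and postsynaptic activity.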
After the learning phase is complete, the hash code is generated, as in FlyHash, via WTA sparsification: for a given query we generate a hash code as
(6) 
Thus, the hyperparameters of the method are and. Note that the synapses are updated based only on pre- and postsynaptic activations, resulting in Hebbian or anti-Hebbian updates. Many "unsupervised" learning-to-hash approaches provide a sort of "weak supervision" in the form of similarities evaluated in the feature space of deep CNNs trained on ImageNet (jinUnsupervisedSemanticDeep2019) to achieve good performance. BioHash does not assume such information is provided and is completely unsupervised.
2.2 Intuition behind the learning algorithm
An intuitive way to think about the learning algorithm is to view the hidden units as particles that are attracted to local peaks of the density of the data, and that simultaneously repel each other. To demonstrate this, it is convenient to think about input data as randomly sampled points from a continuous distribution. Consider the case when and . In this case, the energy function can be written as (since for the inner product does not depend on the weights, we drop the subscript of the inner product)
(7) 
where we introduced a continuous density of data . Furthermore, consider the case of , and imagine that the data lies on a unit circle. In this case the density of data can be parametrized by a single angle . Thus, the energy function can be written as
(8) 
It is instructive to solve a simple case when the data follows an exponential distribution concentrated around zero angle with the decay length ( is a normalization constant),
(9) 
In this case, the energy (8) can be calculated exactly for any number of hidden units . However, minimizing over the position of hidden units cannot be done analytically for general . To further simplify the problem consider the case when the number of hidden units is small. For the energy is equal to
(10) 
Thus, in this simple case the energy is minimized when
(11) 
In the limit when the density of data is concentrated around zero angle () the hidden units are attracted to the origin and . In the opposite limit () the data points are uniformly distributed on the circle. The resulting hidden units are then organized to be on the opposite sides of the circle , due to mutual repulsion.
Another limit in which the problem can be solved analytically is the uniform density of the data for an arbitrary number of hidden units. In this case the hidden units span the entire circle homogeneously: the angle between two consecutive hidden units is .
These results are summarized in an intuitive cartoon in Figure 2, panel A. After learning is complete, the hidden units, denoted by circles, are localized in the vicinity of local maxima of the probability density of the data. At the same time, the repulsive force between the hidden units prevents them from collapsing onto the exact position of the local maximum. Thus the concentration of the hidden units near the local maxima becomes high, but, at the same time, they span the entire support (the area where there is non-zero density) of the data distribution.
For hashing purposes, trying to find a data point “closest” to some new query requires a definition of “distance”. Since this measure is wanted only for nearby locations and , it need not be accurate for long distances. If we pick a set of reference points in the space, then the location of point can be specified by noting the few reference points it is closest to, producing a sparse and useful local representation. Uniformly tiling a high-dimensional space is not a computationally useful approach. Reference points are needed only where there is data, and high resolution is needed only where there is high data density. The learning dynamics in equation (1) distributes reference vectors by an iterative procedure such that their density is high where the data density is high, and low where the data density is low. This is exactly what is needed for a good hash code.
The case of uniform density on a circle is illustrated in Figure 2, panel B. After learning is complete the hidden units homogeneously span the entire circle. For hash length , any given data point activates the two closest hidden units. If two data points are located between two neighboring hidden units (like and ) they produce exactly identical hash codes with Hamming distance zero between them (black and red active units). If two data points are slightly farther apart, like and , they produce hash codes that are slightly different (black and green circles, Hamming distance is equal to in this case). If the two data points are even farther, like and , their hash codes do not overlap at all (black and magenta circles, Hamming distance is equal to ). Thus, intuitively, similar data activate similar hidden units, resulting in similar representations, while dissimilar data result in very different hash codes.
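The circle example can be checked numerically with a small sketch. The setup below is hypothetical and chosen only for illustration: eight hidden units evenly spaced on the unit circle, two active units per code, and four query angles (10, 30, 60 and 200 degrees).

```python
import numpy as np

def circle_hash(theta, unit_angles, k=2):
    """Binary code with ones at the k hidden units closest in angle to theta."""
    overlaps = np.cos(theta - unit_angles)  # inner products of unit vectors on the circle
    code = np.zeros(len(unit_angles), dtype=np.uint8)
    code[np.argsort(-overlaps)[:k]] = 1
    return code

def hamming(a, b):
    """Number of positions in which two binary codes differ."""
    return int(np.count_nonzero(a != b))

unit_angles = np.deg2rad(np.arange(0, 360, 45))  # hidden units at 0, 45, ..., 315 degrees
codes = {deg: circle_hash(np.deg2rad(deg), unit_angles) for deg in (10, 30, 60, 200)}

d_same = hamming(codes[10], codes[30])   # both points lie between the units at 0 and 45
d_near = hamming(codes[10], codes[60])   # the codes share only the unit at 45
d_far = hamming(codes[10], codes[200])   # the codes share no active unit
print(d_same, d_near, d_far)  # 0 2 4
```

Points that fall between the same pair of neighboring units collide, nearby points differ in one active unit (Hamming distance 2), and distant points have disjoint codes (Hamming distance 4).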
2.3 Computational Complexity and Metabolic Cost.
In classical LSH algorithms (charikarSimilarityEstimationTechniques2002; indykApproximateNearestNeighbors1998), typically, and , entailing a storage cost of bits per database entry and computational cost to compute Hamming distance. In BioHash (and in FlyHash), typically and entailing storage cost of bits per database entry and computational cost to compute Hamming distance (see footnote 1). Note that while there is additional storage/lookup overhead over classical LSH in maintaining pointers, this is not unlike the storage/lookup overhead incurred by quantization methods like Product Quantization (PQ) (jegouProductQuantizationNearest2011), which stores a lookup table of distances between every pair of codewords for each product space. From a neurobiological perspective, a highly sparse representation such as the one produced by BioHash keeps the same metabolic cost (levyEnergyEfficientNeural1996) as a dense low-dimensional () representation, such as in classical LSH methods. At the same time, as we empirically show below, it better preserves similarity information.
Footnote 1: If we maintain sorted pointers to the locations of s, we have to compute the intersection between 2 ordered lists of length , which is .
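The pointer-based Hamming distance computation mentioned in the footnote can be sketched as a merge over two sorted index lists. The function name `sparse_hamming` and the representation of a code as a sorted Python list of active positions are assumptions made for this illustration.

```python
def sparse_hamming(idx_a, idx_b):
    """Hamming distance between two binary codes given as sorted lists of active indices."""
    i = j = shared = 0
    # Walk both sorted lists once, counting positions that are active in both codes.
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            shared += 1
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    # Every active position that is not shared contributes one differing bit.
    return len(idx_a) + len(idx_b) - 2 * shared

print(sparse_hamming([1, 5, 9], [5, 9, 20]))  # 2
```

The loop visits each index at most once, so the cost grows with the number of active units rather than with the size of the hash layer.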
2.4 Convolutional BioHash
In order to take advantage of the spatial statistical structure present in images, we use the dynamics in equation (1) to learn convolutional filters by training on image patches as in grinbergLocalUnsupervisedLearning2019. Convolutions in this case are unusual since the patches of the images are normalized to be unit vectors before calculating the inner product with the filters. Unlike grinbergLocalUnsupervisedLearning2019, we use cross-channel inhibition to suppress the activities of the hidden units that are weakly activated. Specifically, if there are convolutional filters, then only the top of activations are kept active per spatial location.
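The two unusual ingredients of this convolutional layer, patch normalization and cross-channel inhibition, can be sketched as follows. This is an illustrative single-channel implementation with explicit loops; the function names, the `eps` stabilizer and the toy sizes are assumptions, and no claim is made that this matches the implementation used in the experiments.

```python
import numpy as np

def patch_normalized_conv(image, filters, eps=1e-8):
    """Inner products between unit-normalized image patches and filters.

    image: (H, W) array; filters: (C, s, s) array; returns (C, H-s+1, W-s+1).
    """
    C, s, _ = filters.shape
    H, W = image.shape
    flat_filters = filters.reshape(C, -1)
    out = np.empty((C, H - s + 1, W - s + 1))
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            patch = image[i:i + s, j:j + s].ravel()
            patch = patch / (np.linalg.norm(patch) + eps)  # normalize the patch to a unit vector
            out[:, i, j] = flat_filters @ patch
    return out

def cross_channel_inhibition(acts, k):
    """Keep the k largest activations across channels at each location, zero the rest."""
    threshold = np.partition(acts, -k, axis=0)[-k]  # k-th largest value per spatial location
    return np.where(acts >= threshold, acts, 0.0)   # ties at the threshold are all kept

rng = np.random.default_rng(0)
image = rng.random((6, 6)) + 0.1          # toy image with strictly positive intensities
filters = rng.normal(size=(5, 3, 3))      # toy filters standing in for learned ones
acts = patch_normalized_conv(image, filters)
sparse_acts = cross_channel_inhibition(acts, k=2)
```

Because every patch is rescaled to unit length, multiplying the image by a positive constant leaves `acts` essentially unchanged, which is the local intensity normalization discussed in this section.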
Patch normalization is reminiscent of the canonical computation of divisive normalization (carandiniNormalizationCanonicalNeural2011) and performs local intensity normalization. This is not unlike divisive normalization in the fruit fly’s projection neurons. Divisive normalization has also been found to be beneficial (renNormalizingNormalizersComparing2016) in CNNs trained end-to-end by the backpropagation algorithm on a supervised task. We found that the cross-channel inhibition is important for good hashing performance. After cross-channel inhibition, we use a max-pooling layer, followed by a BioHash layer as in Sec. 2.1.
Table 1: mAP@All (%) on MNIST. Columns give the hash length; "-" marks values that are not reported.
| Method | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| LSH | 12.45 | 13.77 | 18.07 | 20.30 | 26.20 | 32.30 |
| PCAHash | 19.59 | 23.02 | 29.62 | 26.88 | 24.35 | 21.04 |
| FlyHash | 18.94 | 20.02 | 24.24 | 26.29 | 32.30 | 38.41 |
| SH | 20.17 | 23.40 | 29.76 | 28.98 | 27.37 | 24.06 |
| ITQ | 21.94 | 28.45 | 38.44 | 41.23 | 43.55 | 44.92 |
| DH | - | - | - | 43.14 | 44.97 | 46.74 |
| UHBNN | - | - | - | 45.38 | 47.21 | - |
| NaiveBioHash | 25.85 | 29.83 | 28.18 | 31.69 | 36.48 | 38.50 |
| BioHash | 44.38 | 49.32 | 53.42 | 54.92 | 55.48 | - |
| BioConvHash | 64.49 | 70.54 | 77.25 | 80.34 | 81.23 | - |
3 Similarity Search
In this section, we empirically evaluate BioHash, investigate the role of sparsity in the latent space, and compare our results with previously published benchmarks. We consider two settings for evaluation: a) the training set contains unlabeled data, and the labels are only used for the evaluation of the performance of the hashing algorithm, and b) supervised pretraining on a different dataset is permissible, and features extracted from this pretraining are then used for hashing. In both settings BioHash outperforms previously published benchmarks for various hashing methods.
3.1 Evaluation Metric
Following previous work (dasguptaNeuralAlgorithmFundamental2017a; liFastSimilaritySearch2018a; suGreedyHashFast2018), we use Mean Average Precision (mAP) as the evaluation metric, a measure that averages precision over different recall levels. Specifically, given a query set , we evaluate mAP as
(12) 
where is number of retrievals, and is the precision of the top retrievals. Notation: when is equal to size of the entire database, i.e. a ranking of the entire database is desired, we use the notation mAP@All or simply mAP, dropping the reference to .
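For readers who want to compute the metric, the following sketch implements a common form of average precision truncated at a fixed number of retrievals. The normalization by the number of relevant items among the retrieved ones is an assumption made here, and the function names and the argument `R` are hypothetical.

```python
import numpy as np

def average_precision_at_r(relevance, R=None):
    """Average precision over the top R retrievals for one query.

    relevance: 0/1 sequence in ranked order (1 = relevant); R=None uses the full ranking.
    """
    rel = np.asarray(relevance, dtype=float)[:R]
    if rel.sum() == 0:
        return 0.0
    precision_at_rank = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    # Average the precision values at the ranks where a relevant item was retrieved.
    return float((precision_at_rank * rel).sum() / rel.sum())

def mean_average_precision(relevance_lists, R=None):
    """Mean of the per-query average precision over a query set."""
    return float(np.mean([average_precision_at_r(rel, R) for rel in relevance_lists]))

print(mean_average_precision([[1, 0], [0, 1]]))  # 0.75
```

In the retrieval setting of this section, the relevance list for a query would be obtained by sorting the database by Hamming distance to the query code and marking the entries that share the query's class label.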
3.2 Datasets and Protocol
To make our work comparable with recent related work, we used common benchmark datasets: a) MNIST (lecunGradientbasedLearningApplied1998), a dataset of 70k greyscale images (size 28 x 28) of handwritten digits with 10 classes of digits ranging from "0" to "9", and b) CIFAR10 (krizhevskyLearningMultipleLayers2009), a dataset containing 60k images (size 32 x 32 x 3) from 10 classes (e.g. car, bird).
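Table 2: mAP@1000 (%) on CIFAR10. Columns give the hash length; "-" marks values that are not reported.
| Method | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| LSH | 11.73 | 12.49 | 13.44 | 16.12 | 18.07 | 19.41 |
| PCAHash | 12.73 | 14.31 | 16.20 | 16.81 | 17.19 | 16.67 |
| FlyHash | 14.62 | 16.48 | 18.01 | 19.32 | 21.40 | 23.35 |
| SH | 12.77 | 14.29 | 16.12 | 16.79 | 16.88 | 16.44 |
| ITQ | 12.64 | 14.51 | 17.05 | 18.67 | 20.55 | 21.60 |
| NaiveBioHash | 11.79 | 12.43 | 14.54 | 16.62 | 17.75 | 18.65 |
| BioHash | 20.47 | 21.61 | 22.61 | 23.35 | 24.02 | - |
| BioConvHash | 26.94 | 27.82 | 29.34 | 29.74 | 30.10 | - |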
Following the protocol in luDeepHashingScalable2017; chenLearningDeepUnsupervised2018, on MNIST we randomly sample 100 images from each class to form a query set of 1000 images. We use the remaining 69k images as the training set for BioHash as well as the database for retrieval after training. Similarly, on CIFAR10, following previous work (suGreedyHashFast2018; chenLearningDeepUnsupervised2018; jinUnsupervisedSemanticDeep2018), we randomly sampled 1000 images per class to create a query set containing 10k images. The remaining 50k images were used both for training and as the database for retrieval, as in the case of MNIST. Ground truth relevance for both datasets is based on class labels. Following previous work (chenLearningDeepUnsupervised2018; linDeepLearningBinary2015; jinUnsupervisedSemanticDeep2019), we use mAP@1000 for CIFAR10 and mAP@All for MNIST. It is common to benchmark the performance of hashing methods at hash lengths . However, it was observed in dasguptaNeuralAlgorithmFundamental2017a that FlyHash outperformed LSH in the regime of low hash lengths . Accordingly, we evaluate performance for .
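The per-class query/database split described above can be sketched as follows. The function name `split_query_database`, the fixed seed and the toy label array are assumptions for illustration; with real data, `labels` would hold the class label of every image.

```python
import numpy as np

def split_query_database(labels, per_class, seed=0):
    """Sample `per_class` query indices from each class; the rest form the database."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    query_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in np.unique(labels)
    ])
    # Remaining items serve both as training set and as retrieval database.
    database_idx = np.setdiff1d(np.arange(len(labels)), query_idx)
    return query_idx, database_idx

labels = np.repeat(np.arange(10), 50)  # toy stand-in: 10 classes with 50 items each
query_idx, database_idx = split_query_database(labels, per_class=5)
```

For the MNIST protocol above one would call the function with `per_class=100`, and for CIFAR10 with `per_class=1000`.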
3.3 Baselines
As baselines we include random hashing methods FlyHash (dasguptaNeuralAlgorithmFundamental2017a), classical LSH (LSH (charikarSimilarityEstimationTechniques2002)), and data-driven hashing methods PCAHash (gongIterativeQuantizationProcrustean2011), Spectral Hashing (SH (weissSpectralHashing2009)), Iterative Quantization (ITQ (gongIterativeQuantizationProcrustean2011)). As in dasguptaNeuralAlgorithmFundamental2017a, for FlyHash we set the sampling rate from PNs to KCs to be and . Additionally, where appropriate, we also compare performance of BioHash to deep hashing methods: DeepBit (linDeepLearningBinary2015), DH (luDeepHashingScalable2017), USDH (jinUnsupervisedSemanticDeep2019), UHBNN (doLearningHashBinary2016), SAH (doSimultaneousFeatureAggregating2017) and GreedyHash (suGreedyHashFast2018). As previously discussed, in nearly all similarity search methods, a hash length of entails a dense representation using units. In order to clearly demonstrate the utility of sparse expansion in BioHash, we include a baseline (termed "NaiveBioHash"), which uses the learning dynamics in equation (1) but without sparse expansion, i.e. the input data is projected into a dense latent representation with hidden units. The activations of those hidden units are then binarized to generate a hash code of length .
3.4 Results and Discussion
The performance of BioHash on MNIST is shown in Table 1. BioHash demonstrates the best retrieval performance, substantially outperforming other methods, including deep hashing methods DH and UHBNN, especially at small . Indeed, even at a very short hash length of , the performance of BioHash is comparable to or better than DH for , while at , the performance of BioHash is better than DH and UHBNN for . The performance of BioHash saturates around , showing only a small improvement from to and an even smaller improvement from to ; accordingly, we do not evaluate performance at . We note that while SOLHash also evaluated retrieval performance on MNIST and is a data-driven hashing method inspired by Drosophila’s olfactory circuit, the ground truth in their experiment was the top 100 nearest neighbors of a query in the database, based on Euclidean distance between pairs of images in pixel space, and thus cannot be directly compared (see footnote 2). Nevertheless, we adopt that protocol (liFastSimilaritySearch2018a) and show that BioHash substantially outperforms SOLHash in Sec. 5.
Footnote 2: Due to missing values of the hyperparameters we are unable to reproduce the performance of SOLHash to enable a direct comparison.
The performance of BioHash on CIFAR10 is shown in Table 2. Similar to the case of MNIST, BioHash demonstrates the best retrieval performance, substantially outperforming other methods, especially at small . Even at , the performance of BioHash is comparable to other methods with . This suggests that BioHash is a particularly good choice when short hash lengths are required.
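Table 3: Effect of channel inhibition on BioConvHash. Rows give the channel inhibition setting, columns the hash length.
MNIST
| Channel inhibition setting | 2 | 4 | 8 | 16 |
|---|---|---|---|---|
| 1 | 56.16 | 66.23 | 71.20 | 73.41 |
| 5 | 58.13 | 70.88 | 75.92 | 79.33 |
| 10 | 64.49 | 70.54 | 77.25 | 80.34 |
| 25 | 56.52 | 64.65 | 68.95 | 74.52 |
| 100 | 23.83 | 32.28 | 39.14 | 46.12 |
CIFAR10
| Channel inhibition setting | 2 | 4 | 8 | 16 |
|---|---|---|---|---|
| 1 | 26.94 | 27.82 | 29.34 | 29.74 |
| 5 | 24.92 | 25.94 | 27.76 | 28.90 |
| 10 | 23.06 | 25.25 | 27.18 | 27.69 |
| 25 | 20.30 | 22.73 | 24.73 | 26.20 |
| 100 | 17.84 | 18.82 | 20.51 | 23.57 |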
Effect of sparsity. For a given hash length , we parametrize the total number of neurons in the hash layer as , where is the activity, i.e. the fraction of active neurons. For each hash length , we varied % of active neurons and evaluated the performance on a validation set (see appendix for details), see Figure 4. There is an optimal level of activity for each dataset. For MNIST and CIFAR10, was set to and respectively for all experiments. We visualize t-SNE embeddings for different settings of activity levels in Figure 5. Interestingly, at lower sparsity levels, dissimilar images may become nearest neighbors, though highly dissimilar images stay apart. This is reminiscent of an experimental finding (linSparseDecorrelatedOdor2014) in Drosophila. Sparsification of Kenyon cells in Drosophila is controlled by feedback inhibition from the anterior paired lateral neuron. Disrupting this feedback inhibition leads to denser representations, resulting in fruit flies being able to discriminate between dissimilar odors but not similar odors.
Convolutional BioHash. In the case of MNIST, we trained 500 convolutional filters (as described in Sec. 2.4) of kernel sizes . In the case of CIFAR10, we trained 400 convolutional filters of kernel sizes and . The convolutional variant of BioHash, which we call BioConvHash, shows further improvement over BioHash on MNIST as well as CIFAR10, with even small hash lengths substantially outperforming other methods at larger hash lengths. Channel inhibition is critical for the performance of BioConvHash across both datasets, see Table 3. A high amount of sparsity is essential for good performance. As discussed previously, convolutions in our network are atypical in yet another way, due to patch normalization. We show in Sec. 6 that patch normalization results in robustness of BioConvHash to "shadows", a robustness also characteristic of biological vision.
Hashing using deep CNN features. State-of-the-art hashing methods generally adapt deep CNNs trained on ImageNet (suGreedyHashFast2018; jinUnsupervisedSemanticDeep2019; chenLearningDeepUnsupervised2018; linDiscriminativeDeepHashing2017). These approaches derive large performance benefits from the semantic information learned in pursuit of the classification goal on ImageNet (dengImagenetLargescaleHierarchical2009a). To make a fair comparison with our work, we trained BioHash on features extracted from the fc7 layer of VGG16 (simonyanVeryDeepConvolutional2014a), since previous work (suGreedyHashFast2018; linDeepLearningBinary2015; chenLearningDeepUnsupervised2018) has often adapted this pretrained network. BioHash demonstrates substantially improved performance over recent deep unsupervised hashing methods with mAP@1000 of 63.47 for ; example retrievals are shown in Figure 3. Even at very small hash lengths of , BioHash outperforms other methods at . For performance of other methods and performance at varying hash lengths see Table 4.
It is worth remembering that while exact Hamming distance computation is for all the methods under consideration, unlike classical hashing methods, BioHash (and also FlyHash) incurs a storage cost of instead of per database entry. In the case of MNIST (CIFAR10), BioHash at corresponds to () entailing a storage cost of () bits respectively. Even in scenarios where storage is a limiting factor, BioHash at compares favorably to other methods at , yet Hamming distance computation remains cheaper for BioHash.
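The relation between hash length, activity and storage can be made concrete with a small calculation. Since the activity is the fraction of active neurons, the hash layer size is the hash length divided by the activity. The storage estimate below assumes that each active unit is stored as a pointer of ceil(log2(m)) bits, which is one possible encoding; the function names and the activity value of 0.05 are illustrative assumptions, not the settings used in the experiments.

```python
import math

def hash_layer_size(k, activity):
    """Number of hash-layer units m such that k active units give the stated activity."""
    return int(round(k / activity))

def pointer_storage_bits(k, m):
    """Bits needed to store k active-unit indices among m units, one pointer per unit."""
    return k * math.ceil(math.log2(m))

m = hash_layer_size(k=16, activity=0.05)   # illustrative activity level
bits = pointer_storage_bits(16, m)
print(m, bits)  # 320 144
```

Under these assumptions a sparse code with 16 active units costs more bits than a dense 16-bit code, while the number of active units, and hence the cost of the Hamming distance computation, stays the same.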
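Table 4: mAP@1000 (%) on CIFAR10 using VGG16 fc7 features. Columns give the hash length; "-" marks values that are not reported.
| Method | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| LSH | 13.25 | 17.52 | 25.00 | 30.78 | 35.95 | 44.49 |
| PCAHash | 21.89 | 31.03 | 36.23 | 37.91 | 36.19 | 35.06 |
| FlyHash | 25.67 | 32.97 | 39.46 | 44.42 | 50.92 | 54.68 |
| SH | 22.27 | 31.33 | 36.96 | 38.78 | 39.66 | 37.55 |
| ITQ | 23.28 | 32.28 | 41.52 | 47.81 | 51.90 | 55.84 |
| DeepBit | - | - | - | 19.4 | 24.9 | 27.7 |
| USDH | - | - | - | 26.13 | 36.56 | 39.27 |
| SAH | - | - | - | 41.75 | 45.56 | 47.36 |
| GreedyHash | 10.56 | 23.94 | 34.32 | 44.8 | 47.2 | 50.1 |
| NaiveBioHash | 18.24 | 26.60 | 31.72 | 35.40 | 40.88 | 44.12 |
| BioHash | 57.33 | 59.66 | 61.87 | 63.47 | 64.61 | - |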
4 Conclusions, Discussion, and Future Work
Inspired by the recurring motif of sparse expansive representations in neural circuits, we introduced a new hashing algorithm, BioHash. In contrast with previous work (dasguptaNeuralAlgorithmFundamental2017a; liFastSimilaritySearch2018a), BioHash is both data-driven and has a reasonable degree of biological plausibility. BioHash demonstrates strong empirical results, outperforming recent unsupervised deep hashing methods. The biological plausibility of our work provides support toward the proposal that LSH might be a general computational function (valiantWhatMustGlobal2014) of the neural circuits featuring sparse expansive representations. From the perspective of computer science, BioHash is easy to train and scalable to large datasets.
Compressed sensing/sparse coding have also been suggested as computational roles of sparse expansive representations in biology (ganguliCompressedSensingSparsity2012). These ideas, however, require that the input be reconstructable from the sparse latent code. This is a much stronger assumption than LSH: downstream tasks, e.g. novelty detection (dasguptaNeuralDataStructure2018), might not require such detailed information about the inputs. Another idea, modelling the fruit fly’s olfactory circuit as a form of k-means clustering algorithm, has been recently discussed in (pehlevan2017clustering).
In this work, we limited ourselves to linear scan using fast Hamming distance computation for image retrieval, like much of the relevant literature (dasguptaNeuralAlgorithmFundamental2017a; suGreedyHashFast2018; linDeepLearningBinary2015; jinUnsupervisedSemanticDeep2018). Yet, there is potential for improvement. One line of future inquiry would be to speed up retrieval using multi-probe methods, perhaps via pseudo-hashes (sharmaImprovingSimilaritySearch2018). Another line of inquiry would be to adapt BioHash for Maximum Inner Product Search (shrivastavaAsymmetricLSHALSH2014; neyshaburSymmetricAsymmetricLSHs).
References
Supplementary Material
We expand on the discussion of related work in section 5. We also include here some additional results. In section 6, we show that BioConvHash enjoys robustness to intensity variations. In section 7, we show that the strong empirical performance of BioHash is not specific to the choice of VGG16. Finally, we include technical details about implementation and architecture in section 8.
5 Appendix A: Related Work
Sparse High-Dimensional Representations in Neuroscience. Previous work has explored the nature of sparse high-dimensional representations through the lens of sparse coding and compressed sensing (ganguliCompressedSensingSparsity2012). Additionally, sompolinskySparsenessExpansionSensory2014 examined the computational role of sparse expansive representations in the context of categorization of stimuli as appetitive or aversive. They studied the case of random projections as well as learned/"structured" projections. However, structured synapses were formed by a Hebbian-like association between each cluster center and a corresponding fixed, randomly selected pattern from the cortical layer; knowledge of cluster centers provides a strong form of "supervision"/additional information, while BioHash does not assume access to such information. To the best of our knowledge, no previous work has systematically examined the proposal that LSH may be a computational principle in the brain in the context of structured synapses learned in a biologically plausible manner.
Classical LSH. A classic LSH algorithm for angular similarity is SimHash (charikarSimilarityEstimationTechniques2002), which produces hash codes by , where entries of are i.i.d. from a standard normal distribution and is elementwise. While LSH is a property and consequently is sometimes used to refer to hashing methods in general, when the context is clear we refer to SimHash as LSH following previous literature.
Fruit Fly inspired LSH. The fruit fly’s olfactory circuit has inspired research into new families (dasguptaNeuralDataStructure2018; sharmaImprovingSimilaritySearch2018; liFastSimilaritySearch2018a) of Locality Sensitive Hashing (LSH) algorithms. Of these, FlyHash (dasguptaNeuralAlgorithmFundamental2017a) and DenseFly (sharmaImprovingSimilaritySearch2018) are based on random projections and cannot learn from data. Sparse Optimal Lifting (SOLHash) (liFastSimilaritySearch2018a) is based on learned projections and results in improvements in hashing performance. SOLHash attempts to learn a sparse binary representation , by optimizing
(13) 
is an all ’s vector of size . Note the relaxation from a binary to continuous . After obtaining a , queries are hashed by learning a linear map from to by minimizing
(14) 
Here, is the number of synapses with weight ; the rest are . To optimize this objective, liFastSimilaritySearch2018a resorts to Frank-Wolfe optimization, wherein every learning update involves solving a constrained linear program involving all of the training data, which is biologically unrealistic. In contrast, BioHash is neurobiologically plausible, involving only Hebbian/anti-Hebbian updates and inhibition.
From a computer science perspective, the scalability of SOLHash is highly limited: not only does every update step invoke a constrained linear program, but the program also involves pairwise similarity matrices, which can become intractably large for datasets of even modest size. This issue is further exacerbated by the fact that and is recomputed at every step (liFastSimilaritySearch2018a). Indeed, though liFastSimilaritySearch2018a uses the SIFT1M dataset, these limitations restrict training to only of the training data. Nevertheless, we make a comparison to SOLHash in Table 5 and see that BioHash results in substantially improved performance.
Hash Length
Method   2      4      8      16     32     64
BioHash  39.57  54.40  65.53  73.07  77.70  80.75
SOLHash  11.59  20.03  30.44  41.50  51.30
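To make the scalability concern about pairwise similarity matrices concrete, the following back-of-the-envelope sketch estimates the memory needed to store a single dense n x n matrix. The helper name and the assumption of dense 4-byte floats are ours for illustration; they are not taken from liFastSimilaritySearch2018a.

```python
def similarity_matrix_bytes(n_points: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed for one dense n x n similarity matrix.

    Assumes dense storage with 4-byte (float32) entries by default;
    this is an illustrative assumption, not a detail of SOLHash.
    """
    return n_points * n_points * bytes_per_entry


# Memory grows quadratically with the number of training points.
for n in (10_000, 100_000, 1_000_000):
    gib = similarity_matrix_bytes(n) / 2**30
    print(f"n = {n:>9,d}: {gib:,.1f} GiB")
```

Under these assumptions, one such matrix for the one million base vectors of SIFT1M would occupy about 4 TB, which illustrates why training on the full dataset is impractical when the matrix must also be recomputed at every step.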
Deep LSH. A number of state-of-the-art approaches (suGreedyHashFast2018; jinUnsupervisedSemanticDeep2018; doSimultaneousFeatureAggregating2017; linDeepLearningBinary2015) to unsupervised hashing for image retrieval are, perhaps unsurprisingly, based on deep CNNs trained on ImageNet (dengImagenetLargescaleHierarchical2009a). A common approach (suGreedyHashFast2018) is to adopt a pretrained DCNN as a backbone, replace the last layer with a custom hash layer and objective function, and train the network by backpropagation. Other approaches (yangSemanticStructurebasedUnsupervised2018) use DCNNs as feature extractors or to compute a measure of similarity in their feature space, which is then used as a training signal. While deep hashing methods are not the focus of our work, we include them here for completeness.
6 Appendix B: Robustness of BioConvHash to variations in intensity
Patch normalization is reminiscent of the canonical neural computation of divisive normalization (carandiniNormalizationCanonicalNeural2011) and performs local intensity normalization. This makes BioConvHash robust to variations in light intensity. To test this idea, we modified the intensities in the query set of CIFAR10 by multiplying 80% of each image by a factor of 0.3; such images largely remain discriminable to human perception, see Figure 6. We evaluated the retrieval performance of this query set with "shadows", while the database (and synapses) remained unmodified. We find that BioConvHash performs best at small hash lengths, while the performance of all other methods except GreedyHash is almost at chance. These results suggest that it may be beneficial to incorporate divisive normalization into DCNN architectures to increase robustness to intensity variations.
Hash Length
Method       2      4      8      16     32     64
LSH          10.62  11.82  11.71  11.25  11.32  11.90
PCAHash      10.61  10.60  10.88  11.33  11.79  11.83
FlyHash      11.44  11.09  11.86  11.89  11.45  11.44
SH           10.64  10.45  10.45  11.70  11.26  11.30
ITQ          10.54  10.68  11.65  11.00  10.95  10.94
BioHash      11.05  11.50  11.57  11.33  11.59
BioConvHash  26.84  27.60  29.31  29.57  29.95
GreedyHash   10.56  21.47  25.21  30.74  30.16  37.63
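The "shadow" perturbation and the reason local normalization helps can be illustrated with a minimal NumPy sketch. Two points are assumptions made only for this example: the text does not say which 80% of each image is darkened (here, the top rows), and patch normalization is modeled as scaling each patch to unit L2 norm. The function names are hypothetical.

```python
import numpy as np


def add_shadow(images: np.ndarray, fraction: float = 0.8, factor: float = 0.3) -> np.ndarray:
    """Darken the top `fraction` of rows of each image by `factor`.

    images: array of shape (N, H, W, C). Which region is darkened is an
    assumption; the text only specifies the fraction and the factor.
    """
    shadowed = images.astype(np.float64, copy=True)
    n_rows = int(round(fraction * images.shape[1]))
    shadowed[:, :n_rows] *= factor
    return shadowed


def normalize_patch(patch: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Flatten a patch and scale it to unit L2 norm (assumed normalization)."""
    flat = patch.astype(np.float64).ravel()
    return flat / (np.linalg.norm(flat) + eps)


rng = np.random.default_rng(0)
images = rng.uniform(0.0, 1.0, size=(4, 32, 32, 3))  # stand-in for CIFAR10 queries
shadowed = add_shadow(images)

# A patch lying entirely inside the shadow is only rescaled by a constant,
# so its normalized version matches that of the original patch.
patch_before = normalize_patch(images[0, :5, :5])
patch_after = normalize_patch(shadowed[0, :5, :5])
print(np.allclose(patch_before, patch_after))
```

Patches that straddle the shadow boundary are not exactly invariant under this normalization, so the sketch explains robustness only for patches inside a uniformly scaled region.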
7 Appendix C: Evaluation using VGG16BN and AlexNet
The strong empirical performance of BioHash using features extracted from VGG16 fc7 is not specific to the choice of VGG16. To demonstrate this, we evaluated the performance of BioHash using VGG16 with batch normalization (BN) (ioffeBatchNormalizationAccelerating2015) as well as AlexNet (krizhevskyImageNetClassificationDeep2012). In line with the evaluation using VGG16 reported in the main paper, BioHash consistently demonstrates the best retrieval performance, especially at small hash lengths.
Hash Length
Method   2      4      8      16     32     64
LSH      13.16  15.86  20.85  27.59  38.26  47.97
PCAHash  21.72  34.05  38.64  40.81  38.75  36.87
FlyHash  27.07  34.68  39.94  46.17  52.65  57.26
SH       21.76  34.19  38.85  41.80  42.44  39.69
ITQ      23.02  34.04  44.57  51.23  55.51  58.74
BioHash  60.56  62.76  65.08  66.75  67.53
Hash Length
Method   2      4      8      16     32     64
LSH      13.25  12.94  18.06  23.28  25.79  32.99
PCAHash  17.19  22.89  27.76  29.21  28.22  26.73
FlyHash  18.52  23.48  27.70  30.58  35.54  38.41
SH       16.66  22.28  27.72  28.60  29.27  27.50
ITQ      17.56  23.94  31.30  36.25  39.34  42.56
BioHash  44.17  45.98  47.66  49.32  50.13
8 Appendix D: Implementation details

BioHash: The training/retrieval database was centered. Queries were also centered using the mean computed on the training set. Weights were initialized by sampling from the standard normal distribution. For simplicity we used . We set the initial learning rate , which was decayed as , where is the epoch number and is the maximum number of epochs. We used and a minibatch size of 100. The criterion for convergence was that the average norm of the synapses was . Convergence usually took epochs.
In order to set the activity level, we performed cross-validation. In the case of MNIST, we separated 1k random samples (100 from each class) from the training set to create a training set of size 68k and a validation set of 1k images. The activity level with the highest mAP@All on the validation set was 5%, see Figure 4. We then retrained BioHash on the whole training data of size 69k and reported the performance on the query set. Similarly, for CIFAR10 we separated 1k samples (100 images per class) to create a training set of size 49k and a validation set of 1k. We set the activity level to 0.5%, see Figure 4. We then retrained BioHash on the whole training data of size 50k and reported the performance on the query set.
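The data preparation steps above (centering with the training mean and holding out a class-balanced validation set) can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data sizes, not the authors' code; the selection of the activity level by validation mAP@All is not shown.

```python
import numpy as np


def center_with_train_mean(train: np.ndarray, queries: np.ndarray):
    """Subtract the per-feature mean of the training set from both arrays."""
    mean = train.mean(axis=0, keepdims=True)
    return train - mean, queries - mean


def stratified_holdout(labels: np.ndarray, per_class: int, seed: int = 0):
    """Split indices into (train_idx, val_idx) with `per_class` validation
    samples drawn at random from every class."""
    rng = np.random.default_rng(seed)
    val_parts = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        val_parts.append(rng.choice(members, size=per_class, replace=False))
    val_idx = np.sort(np.concatenate(val_parts))
    train_mask = np.ones(labels.shape[0], dtype=bool)
    train_mask[val_idx] = False
    return np.flatnonzero(train_mask), val_idx


# Toy stand-in for a labeled dataset: 10 classes, 50 samples each.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(10), 50)
X = rng.normal(size=(labels.size, 8)) + 3.0

train_idx, val_idx = stratified_holdout(labels, per_class=5)
X_train, X_val = center_with_train_mean(X[train_idx], X[val_idx])
print(len(train_idx), len(val_idx))
```

With 100 samples per class on MNIST or CIFAR10, the same split yields the 1k validation sets described in the text.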

BioConvHash: A convolutional filter of kernel size is learned by dividing the training set into patches of sizes and applying the learning dynamics (1). In the case of MNIST, we trained 500 filters of kernel sizes . In the case of CIFAR10, we used . For both datasets, we used a stride of 1 in the convolutional layers. We set for MNIST and for CIFAR10 by cross-validation, in a procedure similar to how sparsity was set. The effect of channel inhibition is shown in Table 3. means that only the largest activation across channels per spatial location was kept, while the rest were set to 0. This was followed by 2D max-pooling with a stride of 2 and a kernel size of 7, and then by a fully connected layer (the "hash" layer).
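The channel inhibition and pooling steps described for BioConvHash can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the number of channels and the spatial size are toy values, and pooling without padding is an assumption since the text does not specify padding.

```python
import numpy as np


def channel_inhibition(acts: np.ndarray, k: int = 1) -> np.ndarray:
    """Keep the k largest activations across channels at each spatial
    location and set all other entries to 0.

    acts: array of shape (C, H, W).
    """
    # Indices of the k largest channels for every (h, w) location.
    top = np.argpartition(acts, -k, axis=0)[-k:]
    mask = np.zeros(acts.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=0)
    return np.where(mask, acts, 0.0)


def max_pool2d(acts: np.ndarray, kernel: int = 7, stride: int = 2) -> np.ndarray:
    """2D max-pooling over the last two axes, without padding (assumption)."""
    windows = np.lib.stride_tricks.sliding_window_view(
        acts, (kernel, kernel), axis=(1, 2)
    )  # shape: (C, H - kernel + 1, W - kernel + 1, kernel, kernel)
    return windows[:, ::stride, ::stride].max(axis=(-2, -1))


rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 20, 20))  # toy feature maps: 8 channels, 20 x 20

inhibited = channel_inhibition(acts, k=1)
pooled = max_pool2d(inhibited, kernel=7, stride=2)
print(pooled.shape)
```

The pooled feature maps would then be flattened and passed to the fully connected hash layer.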
FlyHash: Following dasguptaNeuralAlgorithmFundamental2017a, we set for all hash lengths, and each neuron in the hashing layer ("Kenyon" cell) sampled from dimensions of the input data (projection neurons).
ITQ: Following gongIterativeQuantizationProcrustean2011, ITQ employed 50 iterations.
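For readers unfamiliar with FlyHash, its two ingredients (a sparse binary random projection in which each "Kenyon" cell samples a fixed number of input dimensions, followed by winner-take-all) can be sketched as below. All sizes here (`d`, `m`, `n_samples`, `k`) are hypothetical values chosen for illustration and are not the settings used in our experiments; the function names are likewise ours.

```python
import numpy as np


def make_flyhash_projection(d: int, m: int, n_samples: int, rng) -> np.ndarray:
    """Sparse binary matrix of shape (m, d): each row (one "Kenyon" cell)
    connects to `n_samples` randomly chosen input dimensions."""
    W = np.zeros((m, d), dtype=np.float32)
    for row in range(m):
        W[row, rng.choice(d, size=n_samples, replace=False)] = 1.0
    return W


def flyhash(X: np.ndarray, W: np.ndarray, k: int) -> np.ndarray:
    """Binary codes with exactly k active units per input (winner-take-all)."""
    acts = X @ W.T                                    # shape (N, m)
    top = np.argpartition(acts, -k, axis=1)[:, -k:]   # k largest per row
    codes = np.zeros(acts.shape, dtype=np.uint8)
    np.put_along_axis(codes, top, 1, axis=1)
    return codes


rng = np.random.default_rng(0)
d, m, n_samples, k = 64, 640, 6, 16    # illustrative sizes only
X = rng.standard_normal((5, d))
X = X - X.mean(axis=0, keepdims=True)  # center the data before hashing

W = make_flyhash_projection(d, m, n_samples, rng)
codes = flyhash(X, W, k)
print(codes.shape, codes.sum(axis=1))
```

Because the projection is sampled once and never updated, this scheme cannot adapt to the data distribution, which is the limitation that learned projections address.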

To extract representations from VGG16 fc7, CIFAR10 images were resized to and normalized using default values: . To make a fair comparison, we used the pretrained VGG16 model (without BN), since this model is frequently employed by deep hashing methods. We also evaluated the performance using VGG16 with BN and using AlexNet (krizhevskyImageNetClassificationDeep2012), see Tables 7 and 8.

GreedyHash: GreedyHash replaces the softmax layer of VGG16 with a hash layer and is trained end-to-end via backpropagation using a custom objective function; see suGreedyHashFast2018 for more details. We use the code provided by the authors (https://github.com/ssppp/GreedyHash) to measure performance at , since these numbers were not reported in suGreedyHashFast2018. We used the default parameters: a minibatch size of 32, a learning rate of , and training for 60 epochs.
9 Appendix E: Training Time
Here we report the training times for the best performing variant of our algorithm (the one with the highest corresponding mAP@R): BioHash, BioConvHash, or BioHash on top of VGG16 representations. For MNIST, the best performing variant is BioConvHash, and for CIFAR10 it is BioHash on top of VGG16 representations. We also report the training time of the next best method for each dataset: GreedyHash in the case of CIFAR10, and BioHash in the case of MNIST. For MNIST, the best method that is not a variant of BioHash is UHBNN; its training time is unavailable, since it is not reported in the literature. All experiments were run on a single V100 GPU to make a fair comparison.
Hash Length
Method       2      4      8      16     32
BioHash      1.7 s  1.7 s  1.7 s  3.4 s  5 s
BioConvHash  3.5 m  3.5 m  3.5 m  5 m    5 m
Hash Length
CIFAR10     2        4        8        16       32
BioHash     4.2 s    7.6 s    11.5 s   22 s     35 s
GreedyHash  1.2 hrs  1.2 hrs  1.3 hrs  1.4 hrs  1.45 hrs