Deep learning has made tremendous advances in computer vision, natural language processing and many other areas. While taking high-level inspiration from biology, the current generation of deep learning methods are not necessarily biologically realistic. This raises the question whether biological systems can further inform the development of new network architectures and learning algorithms that can lead to competitive performance on machine learning tasks or offer additional insights into intelligent behavior. Our work is inspired by this motivation. We study a well-established neurobiological network motif from the fruit fly brain and investigate the possibility of reusing it for solving common machine learning tasks in NLP. We consider this exercise as a toy model example illustrating the possibility of “reprogramming” of naturally occurring algorithms and behaviors (clustering combinations of input stimuli from olfaction, vision, and thermo-hydro sensory system) into a target algorithm of interest (learning word embeddings from raw text) that the original biological organism does not naturally engage in.
The mushroom body (MB) is a major area of the brain responsible for processing of sensory information in fruit flies. It receives inputs from a set of projection neurons (PN) conveying information from several sensory modalities. The major modality is olfaction (Bates et al., 2020), but there are also inputs from the PN responsible for sensing temperature and humidity (Marin et al., 2020), as well as visual inputs (Vogt et al., 2016; Caron et al., 2020). These sensory inputs are forwarded to a population of approximately 2000 Kenyon cells (KCs) through a set of synaptic weights (Li et al., 2020). KCs are reciprocally connected through an anterior paired lateral (APL) neuron, which sends a strong inhibitory signal back to KCs. This recurrent network effectively implements winner-takes-all competition between KCs, and silences all but a small fraction of top activated neurons (Dasgupta et al., 2017). This is the network motif that we study in this paper; its schematic is shown in Fig. 1. KCs also send their outputs to mushroom body output neurons (MBONs), but this part of the MB network is not included in our mathematical model.
Behaviorally, it is important for a fruit fly to distinguish sensory stimuli, e.g., different odors. If a fruit fly senses a smell associated with danger, it’s best to avoid it; if it smells food, the fruit fly might want to approach it. The network motif shown in Fig. 1 is believed to be responsible for clustering sensory stimuli so that similar stimuli elicit similar patterns of neural responses at the level of KCs to allow generalization, while distinct stimuli result in different neural responses, to allow discrimination. Importantly, this biological network has evolved to accomplish this task in a very efficient way.
In computational linguistics there is a long tradition (Harris, 1954) of using distributional properties of linguistic units for quantifying semantic similarities between them, as summarized in the famous quote by JR Firth: “a word is characterized by the company it keeps” (Firth, 1957). This idea has led to powerful tools such as Latent Semantic Analysis (Deerwester et al., 1990), topic modelling (Blei et al., 2003), and language models like word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and, more recently, BERT (Devlin et al., 2018), which relies on the Transformer model (Vaswani et al., 2017). Specifically, word2vec models are trained to maximize the likelihood of a word given its context, GloVe models utilize global word-word co-occurrence statistics, and BERT uses a deep neural network with attention to predict masked words (and the next sentence). As such, all these methods utilize the correlations between individual words and their context in order to learn useful word embeddings.
In our work we ask the following question: can the correlations between words and their contexts be extracted from raw text by the biological network of KCs, shown in Fig. 1? Further, how do the word representations learned by KCs differ from those obtained by existing NLP methods? Although this network has evolved to process sensory stimuli from olfaction and other modalities, and not to “understand” language, it uses a general purpose algorithm to embed inputs (from different modalities) into a high dimensional space with several desirable properties, which we discuss below.
Our approach relies on a recent proposal that the recurrent network of mutually inhibited KCs can be used as a “biological” model for generating sparse binary hash codes for the input data presented at the projection neuron layer (Dasgupta et al., 2017). It was argued that a matrix of random weights projecting from PN layer into the KCs layer leads to the highly desirable property of making the generated hash codes locality sensitive, i.e., placing similar inputs close to each other in the embedding space and pushing distinct stimuli far apart. A subsequent study (Ryali et al., 2020) has demonstrated that the locality sensitivity of the hash codes can be significantly increased, compared to the random case, if the matrix of weights from PN to KCs is learned from data. The idea of using the network of KCs with random projections for NLP tasks has also been previously explored in (Preissner and Herbelot, 2019), see discussion in section 7.
Biologically, there is an ongoing debate in the neuroscience community regarding whether these projections are random. For instance, (Caron et al., 2013) argues for the random model, while (Zheng et al., 2020) presents evidence of the non-random structure of this network, which is related to the frequency of presented odors. Since the goal of our work is to build a useful AI system and not to mimic every detail of the biological system, we adopt the data-driven synaptic weight strategy even if fruit flies may use random projections. As is clearly demonstrated in (Ryali et al., 2020), learned synapses lead to better performance.
Our main contributions in this work are the following:
Inspired by the fruit fly network, we propose an algorithm that makes it possible to generate binary (as opposed to continuous) word embeddings for words and their context. We systematically evaluate the performance of this algorithm on the word similarity task, word-sense disambiguation, and document classification.
We demonstrate that our binary embeddings result in tighter and better separated clusters of concepts compared to continuous GloVe embeddings, and are on par with the clustering properties of binarized versions of GloVe.
We show that training the fruit fly network requires an order of magnitude less compute time than training classical NLP architectures, like BERT, at the expense of a relatively small decrease in classification accuracy.
2 Learning Algorithm
Consider a training corpus. Each sentence can be decomposed into a collection of w-grams of consecutive words. If the word tokens come from a predefined vocabulary of size N_voc, the input to the algorithm is a vector of size 2·N_voc. This vector consists of two blocks: the context (the first N_voc elements), and the target (the remaining N_voc elements); see Fig. 2. In this work w is assumed to be an odd integer, and the target word is assumed to be the center of the w-gram.
The target word is one-hot encoded in the target block, and the context words are binary encoded as a bag of words in the context block (no positional information is used). The window slides along the text corpus, and for each position generates a training vector v^A = {v^A_i}, where the index A enumerates different w-grams, and the index i enumerates positions in the context-target vector. These training vectors are passed to the learning algorithm. The goal of the algorithm is to learn correlations between the context and the target blocks.
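For concreteness, the encoding step described above can be sketched as follows. This is a minimal illustration with a toy vocabulary; the function and variable names are ours, not part of the paper.

```python
import numpy as np

def encode_wgram(wgram, vocab):
    """Encode a w-gram as a context-target vector of size 2 * N_voc.

    Context block (first N_voc entries): bag of words over all words
    except the center; target block (last N_voc entries): one-hot
    encoding of the center word.
    """
    n_voc = len(vocab)
    v = np.zeros(2 * n_voc, dtype=np.int8)
    center = len(wgram) // 2          # w is odd, so the target is the middle word
    for pos, word in enumerate(wgram):
        idx = vocab[word]
        if pos == center:
            v[n_voc + idx] = 1        # target block: one-hot
        else:
            v[idx] = 1                # context block: bag of words, no position info
    return v

# Toy vocabulary and a 3-gram whose target (center) word is "cat".
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
v = encode_wgram(["the", "cat", "sat"], vocab)
```

The resulting vector has w ones in it: w − 1 in the context block and one in the target block.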
2.1 Mathematical Formulation
Mathematically, the objective of the training algorithm is to distribute a set of context-target pairs among K buckets, so that similar pairs end up in similar buckets. In order to achieve this, the learning algorithm takes two inputs: a set of training vectors {v^A}, and a vector of occurrence probabilities p = {p_i}, where p_i is the probability of observing word i in the training corpus. (In our notation, vector v^A has 2·N_voc elements, while vector p has N_voc elements. Given that the index i runs from 1 to 2·N_voc, division by p should be understood with respect to the concatenation of two copies of p into a twice longer vector.) The learning can be formalized as a minimization of the energy function, see (Ryali et al., 2020) for additional details, defined by

E = − Σ_A ⟨v^A / p, W_{μ̂(A)}⟩ / ⟨W_{μ̂(A)}, W_{μ̂(A)}⟩^{1/2},  where μ̂(A) = argmax_μ ⟨v^A, W_μ⟩    (1)
In this equation W = {W_{μi}} is a matrix of synaptic connections projecting from the PN layer (individual neurons in the layer are denoted by the index i) to the KC layer (individual neurons in the KC layer are denoted by the index μ). There are 2·N_voc neurons in the PN layer and K neurons in the KC layer. The inner product ⟨x, y⟩ = Σ_i x_i y_i is defined as a contraction over the index i of PN cells. In the numerator of the energy function the binary encoded w-gram v^A is divided by the probabilities of occurrences of individual words element-wise, so that the numerator can be written as ⟨v^A / p, W_{μ̂(A)}⟩ with (v^A / p)_i = v^A_i / p_i.
Probabilities p_i are calculated based on the frequencies of words in the training corpus. The vocabulary contains the most frequent words in the corpus, thus all the elements of p are non-zero and the element-wise division is well defined.
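Computing the occurrence probabilities from corpus frequencies can be sketched as follows (a toy example; the names are ours):

```python
from collections import Counter

def word_probabilities(corpus_tokens, vocab_size):
    """Build the vocabulary of the vocab_size most frequent words and
    estimate the occurrence probability p_i of each word.  Every p_i is
    non-zero by construction, so element-wise division by p is safe."""
    most_common = Counter(corpus_tokens).most_common(vocab_size)
    vocab = {word: i for i, (word, _) in enumerate(most_common)}
    total = sum(count for _, count in most_common)
    p = [count / total for _, count in most_common]
    return vocab, p

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
vocab, p = word_probabilities(tokens, vocab_size=5)
```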
Intuitively, the goal of the training algorithm is to adjust the weights of the neural network so that they are aligned with w-grams that are frequently present in the training corpus. We rely on the assumption that semantically related w-grams share several “core” words, while a few individual words might be substituted by synonyms/antonyms. The minimization of the energy function (1) is accomplished by the iterative update of the weights satisfying the following learning rule (Krotov and Hopfield, 2019; Ryali et al., 2020; Grinberg et al., 2019):

ΔW_{μi} = ε g(⟨W_μ, v^A⟩) [ v^A_i / p_i − ⟨v^A / p, W_μ⟩ W_{μi} ]    (2)

In this equation the activation function g is equal to one for a maximally driven hidden unit (Kenyon cell), and is equal to zero otherwise:

g(⟨W_μ, v^A⟩) = δ_{μ, μ̂(A)},  where μ̂(A) = argmax_μ ⟨W_μ, v^A⟩    (3)

The learning rate is denoted by ε, and δ_{μν} is a Kronecker delta symbol.
After learning is complete, the hash codes for the inputs can be generated in the following way. Given the binary encoded w-gram v^A, the hash code H = {H_μ} is defined as

H_μ = 1 if ⟨W_μ, v^A⟩ is among the k largest activations over all K Kenyon cells, and H_μ = 0 otherwise.    (4)
This is a crude mathematical approximation of the biological computation performed by the PN–KC–APL neural network (Dasgupta et al., 2017; Ryali et al., 2020). An input v^A generates an input current ⟨W_μ, v^A⟩ into the KC neurons through the feedforward weights W. The recurrent network of KCs and the APL neuron silences all but a small fraction of KCs. Those cells that remain active are assigned state H_μ = 1, while the rest of the KCs are assigned the inactive state H_μ = 0.
Notice that equation (4) makes it possible to generate the hash codes for both individual words (static word embeddings like word2vec and GloVe) and phrases (similar to Transformer models). In the static case, the input has all zeros in the context block and a one-hot encoded word in the target block. In the context-dependent case, both blocks have binary encoded input words. Importantly, both context-dependent and static embeddings are mapped into the same space of sparse binary hash codes (a vector of K elements, with k ones in it). We show below that these hash codes capture the semantic meaning of the target word and the context in which it is used. For the rest of the paper we refer to the parameter k in equation (4) as the hash length.
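A sketch of the hashing step is given below. The random weight matrix here is only a stand-in for the learned weights W, and the helper name is ours.

```python
import numpy as np

def hash_code(W, v, k):
    """Sparse binary hash (eq. 4): activate the k Kenyon cells receiving
    the largest input currents <W_mu, v>, silence the remaining K - k."""
    activations = W @ v
    h = np.zeros(W.shape[0], dtype=np.int8)
    h[np.argsort(activations)[-k:]] = 1   # top-k winner-take-all
    return h

rng = np.random.default_rng(0)
K, n_voc, k = 8, 5, 2
W = rng.random((K, 2 * n_voc))            # stand-in for learned weights

# Static embedding: zeros in the context block, one-hot target word.
v_static = np.zeros(2 * n_voc)
v_static[n_voc + 1] = 1
h = hash_code(W, v_static, k)
```

The same function produces context-dependent codes when the context block of the input is filled in.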
In order to provide an intuition behind the learning algorithm defined by the energy function (1) and weight update rule (2), and to connect it to some of the existing methods in machine learning, consider the limit when all the words have equal probabilities in the training corpus, p_i = 1/N_voc. In this limit the energy function (1) reduces to the familiar spherical k-means clustering algorithm (Dhillon and Modha, 2001). In this limit the weights of each KC correspond to the centroids of the clusters of context-target vectors. The hashing rule (4) assigns the active state 1 to the k closest centroids (and the inactive state 0 to the remaining ones), defined with respect to cosine similarity distance. In this simple limit the learning algorithm that we use can be viewed as a biologically plausible implementation of this classical algorithm. For real datasets the probabilities of words are different, thus this correspondence does not hold. Notice that division by the probability p appears only in the expression for the energy, but not in the definition of μ̂ in equation (1). Equivalently, division by p appears in the second bracket of equation (2), but not in the argument of the activation function g. Thus, in the general case (for different word probabilities p_i) our algorithm is not equivalent to spherical k-means on context-target vectors rescaled by the probabilities. Rather, in the general case, the closest centroid is found for a given context-target vector (via the definition of μ̂ in equation (1); no p involved), but the “updates of the position” of that centroid are computed by enhancing the contributions of rare words (small p_i) and suppressing the contributions of frequent words (large p_i). Empirically, we have found that division by the probabilities improves the performance of our method compared to the case of spherical k-means (when the 1/p factor is removed from the algorithm).
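One update of the learning rule for a single training vector can be sketched as follows. This is a simplified single-example version of the update described above (in practice updates are aggregated over minibatches), and all names are ours.

```python
import numpy as np

def update_step(W, v, p, lr):
    """One single-example step of learning rule (2): the winner KC is
    chosen WITHOUT the 1/p factor, but its weights are moved toward the
    w-gram with rare words (small p_i) up-weighted."""
    winner = int(np.argmax(W @ v))        # argmax of <W_mu, v>; no 1/p here
    v_scaled = v / p                      # element-wise division by probabilities
    W[winner] += lr * (v_scaled - np.dot(v_scaled, W[winner]) * W[winner])
    return W, winner

rng = np.random.default_rng(1)
n_voc, K = 5, 4
W = rng.random((K, 2 * n_voc))
p_word = np.full(n_voc, 1.0 / n_voc)
p = np.concatenate([p_word, p_word])      # p repeated over context and target blocks
v = np.zeros(2 * n_voc)
v[0] = 1                                  # one context word
v[n_voc + 2] = 1                          # one-hot target word
W, winner = update_step(W, v, p, lr=0.01)
```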
3 Empirical Evaluation
The fruit fly network was trained on the OpenWebText Corpus (Gokaslan and Cohen, 2019), a corpus of unstructured text containing approximately 6B tokens. The details of the training protocols and the hyperparameters are reported in section 8 in the supplement.
3.1 Static Word Embeddings Evaluation
Our aim here is to demonstrate that the sparse embeddings obtained by the fruit fly network motif are competitive with existing state-of-the-art word embeddings such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013a), and with commonly used binarization tools for these continuous embeddings. We show this by evaluating the semantic similarity of static word embeddings. Several common benchmark datasets are used: WS353 (Finkelstein et al., 2002), MEN (Bruni et al., 2014), RW (Luong et al., 2013), SimLex (Hill et al., 2015), RG-65 (Rubenstein and Goodenough, 1965), and Mturk (Halawi et al., 2012). These datasets contain pairs of words with human-annotated similarity scores between them. Following previous work (Tissier et al., 2019; Sokal, 1958), the model similarity score for binary representations x and y is evaluated as (n11 + n00)/n, where n11 (n00) is the number of bits in x and y that are both 1 (0), and n is the length of x. Cosine similarity is used for real-valued representations. Spearman’s correlation coefficient is calculated between this similarity and the human-annotated score. The results are reported in Table 1.
| Dataset | Fruit fly | GloVe | word2vec | SOTA |
|---|---|---|---|---|
| MEN | 56.6 | 69.5 | 75.5 | 81.3 (Dobó, 2019) |
| WS353 | 63.7 | 64.0 | 66.5 | 81.0 (Halawi et al., 2012) |
| SIMLEX | 21.0 | 31.5 | 41.7 | 56.0 (Schwartz et al., 2015) |
| RW | 39.4 | 46.8 | 61.3 | 61.7 (Pilehvar et al., 2018) |
| RG | 69.0 | 74.2 | 75.4 | 83.3 (Hassan and Mihalcea, 2011) |
| Mturk | 56.1 | 57.5 | 69.8 | 72.7 (Halawi et al., 2012) |
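The binary similarity score used for these evaluations, (n11 + n00)/n, can be sketched directly (the function name is ours):

```python
def binary_similarity(x, y):
    """Similarity for binary codes: (n11 + n00) / n, where n11 (n00)
    counts positions where both codes are 1 (0) and n is the length."""
    assert len(x) == len(y)
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return (n11 + n00) / len(x)

s = binary_similarity([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])  # 2 shared 1s, 1 shared 0
```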
We observe that our word embeddings demonstrate competitive performance compared to GloVe, but worse performance than word2vec. At the same time, our embeddings are binary, as opposed to GloVe and word2vec, which are represented by continuous vectors. Thus, it is more appropriate to compare them with commonly used binarized versions of the continuous embeddings. Specifically, we compare the performance of fruit fly embeddings with a number of state-of-the-art binarization methods: LSH/SimHash (Charikar, 2002) (random contractive projections followed by binarization based on sign), RandExp (Dasgupta et al., 2017) (random expansive projections followed by k-winner-take-all binarization), ITQ (Gong and Lazebnik, 2011) (iterative quantization), SH (spectral hashing) (Weiss et al., 2008), and PCAH (Gong and Lazebnik, 2011) (PCA followed by binarization based on sign). The complete evaluation of all these methods for varying hash length is presented in Section 9; please see Tables 7, 8, 9 for binarization of pretrained GloVe, pretrained word2vec, and GloVe trained on OpenWebText. In Table 7 we also include evaluation from NLB, “Near-Lossless Binarization” (Tissier et al., 2019) (autoencoder-based binarization) for the hash lengths where those results are available. Here we only present a short summary of those results for a specific (small) hash length in Table 2.
It is clear from Table 2 that fruit fly embeddings outperform existing methods for word embedding discretization on WS353 and Mturk, and demonstrate the second best result (after LSH binarization of word2vec) on MEN. In general (see Tables 7, 8, 9), we find that fruit fly embeddings are particularly powerful compared to existing methods at small hash lengths. These results indicate that the fruit fly network can learn meaningful binary semantic representations directly from raw text. We also note that an added advantage of binary embeddings is that they require only a fraction (approx. 3%) of the memory footprint required for continuous word embeddings (assuming they have the same length), since a real value requires 32 bits per vector element, whereas a boolean value requires only 1 bit.
3.2 Word Clustering
A nice aspect of binary embeddings is that they result in tighter and better separated clusters than continuous embeddings. To evaluate this property for our method we started with hash codes for individual words and performed agglomerative clustering via complete link, using the cosine distance as the metric. The clustering algorithm was terminated at a fixed number of clusters (we experimented with several choices of this parameter and arrived at similar conclusions). We repeated the same analysis for continuous GloVe, binarization of GloVe embeddings via an autoencoder-like method (Tissier et al., 2019), and a simple discretization method of GloVe in which one declares the largest k elements of each word vector to be 1 and assigns 0 to the remaining elements (for several values of k). The results for the inter-cluster similarity vs. intra-cluster similarity are shown in Fig. 3 (panel A). It is clear from this scatter plot that the average distance between the points within a cluster is smaller (higher similarity) for all considered binary embeddings compared to GloVe embeddings. At the same time, the distance between the closest clusters is larger or equal (smaller similarity) for the fruit fly embeddings and for naive discretizations with small k. We also observe that the clusters lose detail (i.e., both intra- and inter-cluster similarity increases) as the binarization threshold gets higher (shown for GloVe). However, our embeddings maintain a balance between intra- and inter-cluster similarity, and thus still capture fine-grained cluster information. Inspecting the semantic structure of the clusters obtained this way, an example of the hierarchical clustering diagram (the lower part of the tree) is shown in Fig. 3 (panel B). We clearly observe semantically coherent clusters resulting from the fruit fly word embeddings.
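The naive top-k discretization of GloVe used as a baseline above can be sketched as follows (the function name is ours):

```python
import numpy as np

def top_k_binarize(embedding, k):
    """Naive discretization: set the k largest elements of a continuous
    word vector to 1 and all remaining elements to 0."""
    binary = np.zeros_like(embedding, dtype=np.int8)
    binary[np.argsort(embedding)[-k:]] = 1
    return binary

vec = np.array([0.3, -1.2, 2.5, 0.0, 1.1])
b = top_k_binarize(vec, k=2)
```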
3.3 Context-Dependent Word Embeddings
Here, we evaluate the effectiveness of our fruit fly inspired approach for contextual word embeddings, as opposed to the static (or context-independent) embeddings from above. We use the WiC (Pilehvar and Camacho-Collados, 2018) and SCWS (Huang et al., 2012) benchmarks for the evaluation of context-sensitive word embeddings for word sense disambiguation. Both datasets comprise pairs of sentences that contain a target word, and the task is to determine whether the two target words share a similar semantic meaning in the corresponding contexts. The WiC dataset is modeled as a binary prediction task, with 1 denoting that the target words have the same sense, and 0 indicating that they mean different things. The SCWS dataset is modeled as a rank prediction task, since for each pair of sentences and target words, it reports the average human similarity scores (from 10 Amazon Mechanical Turkers per pair).
Before presenting quantitative results, we qualitatively examine how the fruit fly network performs on context sentence pairs for target words “apple” and “bank” in Fig. 4. We show the top nearest neighbor words for the context dependent target word. These examples clearly indicate that the “correct” sense of the word has been found (“apple” the device manufacturer has different nearest neighbors from the fruit, and “bank” the financial institution from the natural feature).
For the quantitative comparison, we contrast our method against contextual embeddings from BERT (Devlin et al., 2018), GloVe (Pennington et al., 2014), word2vec (Mikolov et al., 2013a) and Word2Sense (Panigrahi et al., 2019). For BERT we use the 768-dimensional embeddings from the uncased-large model, for GloVe and word2vec we use the 300-dimensional embeddings, and for Word2Sense we use the sparse 2250-dimensional pretrained embeddings. Since BERT outputs contextual embeddings for each word in a sentence, we simply compute the cosine similarity between the embedding vectors for the target words for each pair of instances. For GloVe/word2vec, we use a context window of size w centered at each of the target words, compute the average embedding for each window, and compute the cosine similarity between the two window vectors. A similar approach is used for Word2Sense, but the similarity between two embeddings is based on the Jensen-Shannon divergence (Panigrahi et al., 2019). For the fruit fly network, given the effectiveness of the top nearest neighbor words (as seen in Fig. 4), we devise a two-component scoring function. The first component, denoted s1, is the dot-product between the context-dependent hash codes for the two target words plus their context blocks. The second, denoted s2, is the number of common contextual nearest neighbors of the two target words among the top neighbors of each (scaled to be between 0 and 1). The final score is given as α·s1 + (1 − α)·s2, where α is a hyperparameter. For all the methods, we predict a WiC pair to be positive if the score is above a threshold value θ. For SCWS, the ranking is proportional to the scores above θ, with the rest scored as zero. The hyperparameter θ is tuned for all the methods independently. Finally, for a fair comparison, all methods use the same 20k vocabulary.
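The two-component scoring function can be sketched as follows. Scaling the hash-code dot product by the number of active bits to keep both components in [0, 1] is our assumption, and all names are ours.

```python
def wic_score(h1, h2, nn1, nn2, alpha):
    """Two-component score: hash-code agreement (s1) plus overlap of the
    top contextual nearest-neighbor sets (s2), mixed by hyperparameter alpha."""
    s1 = sum(a * b for a, b in zip(h1, h2)) / max(sum(h1), 1)   # assumed [0, 1] scaling
    s2 = len(set(nn1) & set(nn2)) / max(len(nn1), 1)            # neighbor overlap
    return alpha * s1 + (1 - alpha) * s2

score = wic_score(
    h1=[1, 0, 1, 0], h2=[1, 0, 0, 1],                   # toy hash codes
    nn1=["bank", "money"], nn2=["money", "river"],      # toy neighbor lists
    alpha=0.5,
)
```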
We report the performance of our context-dependent word embeddings for SCWS and WiC in Table 3 and Table 4, respectively. For both benchmarks we report the results from a 5-fold cross-validation study, where each fold (in turn) is used as a development set, and the remaining four folds as the test set. We select the optimal hyperparameters (including α and θ) for all the methods using only the first fold; no training is done since we evaluate only the pretrained embeddings. The tables report the Spearman rank correlation on SCWS, and the accuracy on WiC.
On SCWS (Table 3), we see that the context-independent embeddings are better for GloVe, Word2Sense and our method, with word2vec yielding the best results. The reason is that about 86.5% of the word pairs in SCWS are different words, and can be distinguished without looking at the context. Unlike SCWS, the WiC benchmark uses the same target word (with only minor variations in some cases) in both contexts, and therefore a context-independent approach is not expected to perform well. Indeed, on WiC (Table 4), we clearly observe that context-independent vectors are not very good, and our method, which uses the joint scoring function combining both the hash code and nearest neighbor scores, is better than context-dependent GloVe, word2vec and Word2Sense.
3.4 Document Classification
We also compare our binary embeddings with GloVe (Pennington et al., 2014), word2vec (Mikolov et al., 2013b), BERT (Devlin et al., 2018) and Near-Lossless Binarization (Tissier et al., 2019) on document classification tasks. The benchmarks we use are 20 Newsgroups (20NewsGroups, 1995), Stanford Sentiment Treebank (Socher et al., 2013), WOS-11967 (Kowsari et al., 2017) and TREC-6 (Li and Roth, 2002). The 20 Newsgroups dataset contains around 18,000 documents, partitioned evenly into 20 different groups; the Stanford Sentiment Treebank dataset contains movie reviews labeled as either positive or negative in sentiment; the WOS-11967 dataset contains 11,967 documents with 35 categories, which include 7 parent categories; and the TREC-6 dataset consists of open-domain, fact-based questions divided into broad semantic categories. We use the TextCNN (Kim, 2014) classifier with each of the different embeddings mentioned above. For a fair comparison, we use the same model parameters (e.g., kernel size, filter dimension) while testing different embeddings. The results in Table 5 show that our sparse binary encodings are competitive with other methods.
4 Computational Complexity
The computational complexity of our method can be evaluated by analyzing equations (2,3) for the weight updates. In these equations the input v^A is a sparse vector, which has only w non-zero elements in it. Thus, for a minibatch of size |BS|, the computational complexity of evaluating the dot product with the weights is O(w·K·|BS|). Additionally, the argmax operation requires O(K·|BS|) operations. We will assume that the largest parameters in our model are the size of the corpus L and the size of the vocabulary N_voc; additionally, we use large minibatches, |BS| ~ N_voc. Calculation of the second term in (2) requires O(N_voc·K) operations per minibatch in addition to the O(w·K) operations for calculating the dot-product for each data point. Since the algorithm has to go over the entire corpus, this computation needs to be repeated L/|BS| times per epoch. Thus, the overall computational complexity of our method is O(K·L·(w + N_voc/|BS|)) per epoch. Since |BS| ~ N_voc, in the leading order it does not grow with the size of the vocabulary, which is a nice computational feature.
| device | K | batch size | GPU mem | time |
|---|---|---|---|---|
| V100 ×3 | 400 | 2000×3 | 122MB | 17m |
| V100 ×3 | 400 | 10000×3 | 150MB | 8m |
| V100 ×3 | 600 | 2000×3 | 232MB | 24m |
| V100 ×3 | 600 | 10000×3 | 267MB | 11.5m |
| CPU 44 cores | 400 | 2000 | - | 76m |
| CPU 44 cores | 400 | 10000 | - | 25m |
From the practical perspective, typical wall-clock training time and memory requirements per epoch are shown in Table 6. As is shown in Fig. 5, accurate solutions are obtained after a few epochs; improvements beyond that are relatively small. Thus, our algorithm is capable of producing competitive models in a couple of hours. Contrast this with the substantially longer training time for GloVe (Pennington et al., 2014); 4 days of training on 16 TPUs for BERT_BASE; and 4 days on 64 TPUs for BERT_LARGE (Devlin et al., 2018) (the last two numbers assume a training corpus of size 250B tokens vs. the 6B tokens considered in this paper). The record-breaking training time of under an hour for BERT requires a large cluster of NVIDIA V100 GPUs and a specialized DGX server architecture (Narasimhan, 2019). In our own experiments, we trained GloVe embeddings on the OWT corpus using the same vocabulary of 20k words that we used for the fruit fly embeddings. The wall-clock training time was approximately 10 hours on 16 threads, see details in Section 12. These are substantially larger computational resources than those required for training the fruit fly network.
5 Discussion and Conclusions
In this work we asked the intriguing question of whether the core computational algorithm of one of the best studied networks in neuroscience, the network of KCs in the fruit fly brain, can be repurposed for solving a well-defined machine learning task, namely, learning word embeddings from text. We have shown that, surprisingly, this network can indeed learn the correlations between the words and their context, and produce high quality word embeddings. On the semantic similarity task the fruit fly word embeddings outperform common methods for binarizing continuous SOTA word embeddings (applied to GloVe, word2vec, and GloVe trained on OWT) at small hash lengths. On the word-in-context task the fruit fly network outperforms GloVe and word2vec, but loses to BERT, see Table 4. The small gap in classification accuracy compared with BERT, however, is outweighed by the benefit of requiring significantly smaller computational resources to obtain these fruit fly embeddings, as we have explained in Section 4, see Table 6. We view this result as an example of a general statement that biologically inspired algorithms might be more compute efficient compared with their classical (non-biological) counterparts, even if they slightly lose in terms of accuracy.
7 Appendix A. Related Work.
Our work builds on several ideas previously discussed in the literature. The first idea is that the fruit fly olfactory network can generate high quality hash codes for the input data in both the random (Dasgupta et al., 2017) and data-driven (Ryali et al., 2020) cases. There are two algorithmic differences in our approach compared to these previous studies. First, our network uses representational contraction, rather than expansion, when going from the PN layer to the KC layer. Second, (Dasgupta et al., 2017; Ryali et al., 2020) construct hash codes for data coming from a single modality (e.g., images, or word vectors), while the goal of the present paper is to learn correlations between two different “modalities”: the target word and its context. The second idea pertains to the training algorithm for learning the PN-to-KC synapses. We use a biologically plausible algorithm of (Krotov and Hopfield, 2019) to do this, with modifications that take into account the wide range of frequencies of different words in the training corpus (we discuss these differences in section 2.1). Also, similarly to (Dasgupta et al., 2017; Ryali et al., 2020), the algorithm of (Krotov and Hopfield, 2019) is used for learning the representations of the data, and not the correlations between two types of data (context and target) as we do in this paper.
Another closely related work (Preissner and Herbelot, 2019) uses the network of KCs with random weights for generating binary hash codes for individual words. There are several differences compared to our approach. First, in our system the synaptic weights from PNs to KCs are learned and not random. We have found that learning these weights improves the performance compared to the random case. Second, unlike (Preissner and Herbelot, 2019) (and unlike fruit flies), in our system the number of KCs is smaller than the number of PNs, so there is no representational expansion as we move into the “mushroom body”. This expansion is essential for the system of (Preissner and Herbelot, 2019), which uses random weights. Finally, our algorithm uses a different encoding scheme at the level of PNs, see Fig. 2.
8 Appendix B. Training Protocols and Hyperparameter Choices.
The fruit fly network was trained on the OpenWebText Corpus (Gokaslan and Cohen, 2019), a corpus of unstructured text containing approximately 6B tokens. Individual documents were concatenated and split into sentences. A collection of w-grams was extracted from each sentence by sliding a window of size w along the sentence from beginning to end. Sentences shorter than w were removed. The vocabulary was composed of the 20k most frequent tokens in the corpus.
Training was done for a fixed number of epochs. At each epoch all the w-grams were shuffled, organized in minibatches, and presented to the learning algorithm. The learning rate was linearly annealed, starting from its maximal value at the first epoch down to nearly zero at the last epoch.
The training algorithm has the following hyperparameters: the size of the KC layer K, the window size w, the overall number of training epochs, the initial learning rate, the minibatch size, and the hash length k. All models presented in this paper were trained with the same protocol. The optimal ranges of the learning rate, window size, KC layer size, and minibatch size were determined empirically; the hash length is reported for each individual experiment.
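The linear annealing schedule described above can be sketched as follows (the function name and the epoch convention are ours):

```python
def learning_rate(epoch, n_epochs, lr_max):
    """Linearly anneal the learning rate from lr_max at the first epoch
    (epoch = 0) down to nearly zero at the last epoch."""
    return lr_max * (1.0 - epoch / n_epochs)

rates = [learning_rate(e, n_epochs=10, lr_max=0.1) for e in range(10)]
```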
9 Appendix C. Comparison with Binarized GloVe and word2vec.
For reference, the continuous-baseline scores per dataset (shown in parentheses in the tables below) are: MEN (69.5/68.1), WS353 (64.0/47.7), SIMLEX (31.5/29.8), RW (46.8/31.4), RG (74.2/67.6), Mturk (57.5/61.9).
Our aim here is to demonstrate that the fruit fly word embeddings are competitive with existing state-of-the-art binarization methods applied to GloVe and word2vec embeddings. We show this by evaluating the semantic similarity of static word embeddings on several common benchmark datasets: WS353 (Finkelstein et al., 2002), MEN (Bruni et al., 2014), RW (Luong et al., 2013), SimLex (Hill et al., 2015), RG-65 (Rubenstein and Goodenough, 1965), and Mturk (Halawi et al., 2012). These datasets contain pairs of words with human-annotated similarity scores between them. Specifically, we compare with GloVe (Pennington et al., 2014) word embeddings (pretrained embeddings: https://nlp.stanford.edu/projects/glove) trained on Wiki2014 and Gigaword 5, GloVe embeddings trained on the OpenWebText Corpus (Gokaslan and Cohen, 2019), and word2vec embeddings (pretrained embeddings: https://code.google.com/archive/p/word2vec).
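Evaluation on these benchmarks follows the standard recipe: score each word pair with the model, then rank-correlate the model scores with the human annotations. A minimal sketch (hypothetical function names; Spearman's rank correlation shown without tie correction):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (no tie correction, for simplicity)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def evaluate_pairs(emb, pairs):
    """emb: dict word -> vector; pairs: (word_a, word_b, human_score).
    Pairs with out-of-vocabulary words are skipped, as is standard."""
    model, human = [], []
    for a, b, s in pairs:
        if a in emb and b in emb:
            va, vb = emb[a], emb[b]
            model.append(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))
            human.append(s)
    return spearman(np.array(model), np.array(human))
```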
Since our representations are binary (in contrast to GloVe and word2vec), we binarize the GloVe and word2vec embeddings and report their performance under a number of common hashing methods: LSH/SimHash (Charikar, 2002) (random contractive projections followed by binarization based on sign), RandExp (Dasgupta et al., 2017) (random expansive projections followed by k-winner-take-all binarization), ITQ (Gong and Lazebnik, 2011) (iterative quantization), SH (spectral hashing) (Weiss et al., 2008), and PCAH (Gong and Lazebnik, 2011) (PCA followed by binarization based on sign). Where available, we also include the evaluation from NLB, “Near-Lossless Binarization” (Tissier et al., 2019) (autoencoder-based binarization).
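For concreteness, the simplest of these baselines, LSH/SimHash, can be sketched as follows (a generic reimplementation, not the exact code used in the experiments):

```python
import numpy as np

def simhash(embeddings, k, seed=0):
    """LSH/SimHash: project each embedding onto k random hyperplanes
    and binarize based on the sign of each projection."""
    rng = np.random.default_rng(seed)
    d = embeddings.shape[1]
    R = rng.standard_normal((d, k))      # random projection directions
    return (embeddings @ R > 0).astype(np.uint8)
```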
Following previous work (Tissier et al., 2019; Sokal, 1958), the model similarity score for binary representations is evaluated as (n_11 + n_00)/n, where n_11 (n_00) is the number of bits in x and y that are both 1 (0), and n is the length of x. Cosine similarity is used for real-valued representations. The results are reported in Tables 7, 8 and 9. For each dataset, we report performance across a range of hash lengths k. For methods that incorporate randomness (LSH, RandExp, ITQ), we report the average across 5 runs. ITQ, SH and PCAH in Tables 7 and 8 were trained using the top 400k most frequent words. Table 9 compares our method to GloVe trained on OpenWebText (the same dataset our method is trained on), using the same vocabulary as our method.
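The binary similarity score can be computed directly from the bit counts; a minimal sketch:

```python
def sokal_michener(x, y):
    """(n11 + n00) / n: the fraction of positions where the two
    equal-length binary vectors x and y agree (both 1 or both 0)."""
    assert len(x) == len(y)
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return (n11 + n00) / len(x)
```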
Our binary word embeddings demonstrate competitive performance compared to published methods for GloVe and word2vec binarization, and our algorithm can learn meaningful binary semantic representations directly from raw text. Importantly, our algorithm does not require training GloVe or word2vec embeddings first before binarizing them.
[Table: benchmark scores for each method at two hash lengths k; reference scores per dataset: MEN (75.5), WS353 (66.5), SIMLEX (41.7), RW (61.3), RG (75.4), Mturk (69.8).]
[Table: benchmark scores for each method at two hash lengths k; reference scores per dataset: MEN (76.4), WS353 (72.2), SIMLEX (34.0), RW (54.5), RG (78.7), Mturk (71.1).]
10 Appendix D. Details of Technical Implementation.
From a practical perspective, efficient implementation of the learning algorithm for the fruit fly network requires the use of sparse algebra, atomic updates, and block-sparse data access. Our algorithm is implemented with a CUDA back-end, while Python is used as an interface to the main functions.
The typical memory footprint of our approach is very small. A small amount of GPU memory is allocated for the operators and temporary fields, while CPU memory is used to store the input data, the array of random numbers for the shuffle operations, and the shuffled indices. In the GPU implementation, the model parameters are stored in the GPU’s memory, while the input data reside in CPU memory. The parallelization strategy in our implementation is based on two aspects. First, each minibatch of data is divided into smaller sub-minibatches, which are processed on different GPUs. Second, all the operations (dense–sparse matrix multiplications, the winner-take-all operation, and weight updates) are executed in parallel using multiple threads.
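The first aspect, splitting each minibatch across devices, amounts to the following data layout (a schematic NumPy sketch, not the CUDA code itself):

```python
import numpy as np

def shard_minibatch(minibatch, n_devices):
    """Divide a minibatch into near-equal sub-minibatches,
    one per GPU; together they cover the full minibatch."""
    return np.array_split(minibatch, n_devices)
```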
11 Appendix E. Qualitative Evaluation of Contextual Embeddings.
In order to evaluate the quality of contextualized embeddings we have created an online tool, which we are planning to release with the paper, that allows users to explore the representations learned by our model for various inputs (context-target pairs). For a given query the tool returns word cloud visualizations for each of the four top activated Kenyon cells. We show some examples of the outputs produced by this tool in Fig. 6. Each query is used to generate a bag-of-words input vector v. This vector is then used to compute the activations of the KCs as a_i = ⟨W_i, v⟩. The four KCs with the highest activations are selected. The corresponding four weight vectors are used to generate four probability distributions over individual words learned by those KCs by passing the weights through a softmax function. For example, for the weight vector with index i, the probability distribution is computed as p_j = exp(W_ij) / Σ_k exp(W_ik). These probability distributions for the top four activated KCs are visualized as word clouds. In computing the softmax only the target block of the weight vector was used (we have checked that using only the context block gives qualitatively similar word clouds).
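The steps above can be sketched as follows (hypothetical variable names; W holds the PN-to-KC weights and v is the bag-of-words query vector):

```python
import numpy as np

def top_kc_distributions(W, v, top_n=4):
    """Compute KC activations for query v, select the top_n most
    activated KCs, and convert each winner's weight vector into a
    probability distribution over words via a softmax."""
    a = W @ v                              # KC activations
    winners = np.argsort(a)[::-1][:top_n]  # indices of top activated KCs
    dists = []
    for i in winners:
        w = W[i]
        e = np.exp(w - w.max())            # numerically stable softmax
        dists.append(e / e.sum())
    return winners, dists
```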
The results indicate that the fruit fly network has indeed learned meaningful representations. Consider, for example, the first query, the sentence “Entertainment industry shares rise following the premiere of the mass destruction weapon documentary”, which results in the four top activated KCs shown in Fig. 6. The top activated KC has the largest weights for the words “weapon”, “mass”, etc. The second activated KC is sensitive to the words “market”, “stock”, etc. This illustrates how the fruit fly network processes queries. In this example the query refers to several distinct combinations of concepts: “weapon of mass destruction”, “stock market”, and “movie industry”. Each of these concepts has a dedicated KC responsible for it. As one can see, the responses are not perfect. In this case one would expect the 4th most activated KC, which is responsible for the “movie industry” concept, to have a higher activation than the 3rd, which is responsible for types of “weapons of mass destruction”. But overall, all the concepts picked up by the KCs are meaningful and related to the query.
12 Appendix F. Details of GloVe Retraining.
To directly compare our method to GloVe, we trained a GloVe model from scratch on the same OpenWebText corpus using the code provided by the original GloVe authors (Pennington et al., 2014) (https://nlp.stanford.edu/projects/glove/). This model used the same vocabulary size as our model (the 20k most frequent tokens), an embedding size of 300, and a window size of 15. The model was trained for 180 iterations at about 3 minutes 20 seconds per iteration on 16 threads, resulting in a total training time of approximately 10 hours.
- 20 newsgroups dataset. Cited by: §3.4.
- Complete connectomic reconstruction of olfactory projection neurons in the fly brain. bioRxiv. Cited by: §1.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022. Cited by: §1.
- Multimodal distributional semantics. Journal of Artificial Intelligence Research 49 (1), pp. 1–47. Cited by: §3.1, §9.
- Random convergence of olfactory inputs in the Drosophila mushroom body. Nature 497 (7447), pp. 113–117. Cited by: §1.
- Two parallel pathways convey distinct visual information to the Drosophila mushroom body. bioRxiv. Cited by: §1.
- Similarity estimation techniques from rounding algorithms. In Annual ACM Symposium on Theory of Computing, pp. 380–388. Cited by: §3.1, §9.
- A neural algorithm for a fundamental computing problem. Science 358 (6364), pp. 793–796. Cited by: §1, §1, §2.2, §3.1, §7, §9.
- Indexing by latent semantic analysis. Journal of the American society for information science 41 (6), pp. 391–407. Cited by: §1.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.3, §3.4, §4.
- Concept decompositions for large sparse text data using clustering. Machine learning 42 (1-2), pp. 143–175. Cited by: §2.2.
- A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages. Ph.D. thesis, SZTE. Cited by: Table 1.
- Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20 (1). Cited by: §3.1, §9.
- A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis. Cited by: §1.
- OpenWebText Corpus. Note: http://Skylion007.github.io/OpenWebTextCorpus Cited by: §3, §8, §9.
- Iterative quantization: a procrustean approach to learning binary codes. In CVPR 2011, pp. 817–824. Cited by: §3.1, §9.
- Local unsupervised learning for image analysis. arXiv preprint arXiv:1908.08993. Cited by: §2.1.
- Large-scale learning of word relatedness with constraints. KDD, pp. 1406–1414. Cited by: §3.1, Table 1, §9.
- Distributional structure. Word 10 (2-3), pp. 146–162. Cited by: §1.
- Semantic relatedness using salient semantic analysis. In Twenty-Fifth AAAI Conference on Artificial Intelligence, Cited by: Table 1.
- SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. Cited by: §3.1, §9.
- Improving Word Representations via Global Context and Multiple Word Prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §3.3.
- Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §3.4.
- HDLTex: hierarchical deep learning for text classification. In Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, Cited by: §3.4.
- Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences 116 (16), pp. 7723–7731. Cited by: §2.1, §7.
- The connectome of the adult Drosophila mushroom body: implications for function. bioRxiv. Cited by: §1.
- Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, Cited by: §3.4.
- Better word representations with recursive neural networks for morphology. In Conference on Computational Natural Language Learning, pp. 104–113. Cited by: §3.1, §9.
- Connectomics analysis reveals first, second, and third order thermosensory and hygrosensory neurons in the adult Drosophila brain. bioRxiv. Cited by: §1.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, §3.1, §3.3.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. 3111–3119. Cited by: §3.4.
- NVIDIA clocks world’s fastest bert training time and largest transformer based model, paving path for advanced conversational ai. Note: https://devblogs.nvidia.com/training-bert-with-gpus/ Cited by: §4.
- Word2Sense: sparse interpretable word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5692–5705. Cited by: §3.3.
- Glove: global vectors for word representation. In Empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §12, §3.1, §3.3, §3.4, §4, §9.
- Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. arXiv preprint arXiv:1808.09121. Cited by: §3.3.
- Card-660: cambridge rare word dataset-a reliable benchmark for infrequent word representation models. arXiv preprint arXiv:1808.09308. Cited by: Table 1.
- To be fair: a case for cognitively-inspired models of meaning. In CLiC-it, Cited by: §1, §7.
- Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: §3.1, §9.
- Bio-Inspired Hashing for Unsupervised Similarity Search. arXiv preprint arXiv:2001.04907. Cited by: §1, §1, §2.1, §2.1, §2.2, §7.
- Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of the nineteenth conference on computational natural language learning, pp. 258–267. Cited by: Table 1.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §3.4.
- A statistical method for evaluating systematic relationships. Univ. Kansas, Sci. Bull. 38, pp. 1409–1438. Cited by: §3.1, §9.
- Near-lossless binarization of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7104–7111. Cited by: §3.1, §3.1, §3.2, §3.4, §9, §9.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
- Direct neural pathways convey distinct visual information to Drosophila mushroom bodies. Elife 5, pp. e14009. Cited by: §1.
- Spectral hashing. Advances in neural information processing systems 21, pp. 1753–1760. Cited by: §3.1, §9.
- Structured sampling of olfactory input by the fly mushroom body. bioRxiv. Cited by: §1.