Can a Fruit Fly Learn Word Embeddings?

01/18/2021 ∙ by Yuchen Liang, et al. ∙ IBM, Cold Spring Harbor Laboratory, Rensselaer Polytechnic Institute, University of California San Diego

The mushroom body of the fruit fly brain is one of the best studied systems in neuroscience. At its core it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse high dimensional representation of the inputs. In this work we study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common natural language processing (NLP) task. We show that this network can learn semantic representations of words and can generate both static and context-dependent word embeddings. Unlike conventional methods (e.g., BERT, GloVe) that use dense representations for word embedding, our algorithm encodes semantic meaning of words and their context in the form of sparse binary hash codes. The quality of the learned representations is evaluated on word similarity analysis, word-sense disambiguation, and document classification. It is shown that not only can the fruit fly network motif achieve performance comparable to existing methods in NLP, but, additionally, it uses only a fraction of the computational resources (shorter training time and smaller memory footprint).


1 Introduction

Deep learning has made tremendous advances in computer vision, natural language processing and many other areas. While taking high-level inspiration from biology, the current generation of deep learning methods is not necessarily biologically realistic. This raises the question of whether biological systems can further inform the development of new network architectures and learning algorithms that can lead to competitive performance on machine learning tasks or offer additional insights into intelligent behavior. Our work is inspired by this motivation. We study a well-established neurobiological network motif from the fruit fly brain and investigate the possibility of reusing it for solving common machine learning tasks in NLP. We consider this exercise a toy model example illustrating the possibility of “reprogramming” naturally occurring algorithms and behaviors (clustering combinations of input stimuli from olfaction, vision, and the thermo-hydro sensory system) into a target algorithm of interest (learning word embeddings from raw text) that the original biological organism does not naturally engage in.

The mushroom body (MB) is a major area of the brain responsible for processing of sensory information in fruit flies. It receives inputs from a set of projection neurons (PN) conveying information from several sensory modalities. The major modality is olfaction (Bates et al., 2020), but there are also inputs from the PNs responsible for sensing temperature and humidity (Marin et al., 2020), as well as visual inputs (Vogt et al., 2016; Caron et al., 2020). These sensory inputs are forwarded to a population of approximately 2,000 Kenyon cells (KCs) through a set of synaptic weights (Li et al., 2020). KCs are reciprocally connected through an anterior paired lateral (APL) neuron, which sends a strong inhibitory signal back to KCs. This recurrent network effectively implements winner-takes-all competition between KCs, and silences all but a small fraction of the top activated neurons (Dasgupta et al., 2017). This is the network motif that we study in this paper; its schematic is shown in Fig. 1. KCs also send their outputs to mushroom body output neurons (MBONs), but this part of the MB network is not included in our mathematical model.

Figure 1: Network architecture. Several groups of PNs corresponding to different modalities send their activities to the layer of KCs, which are inhibited through the reciprocal connections to the APL neuron.

Behaviorally, it is important for a fruit fly to distinguish sensory stimuli, e.g., different odors. If a fruit fly senses a smell associated with danger, it’s best to avoid it; if it smells food, the fruit fly might want to approach it. The network motif shown in Fig. 1 is believed to be responsible for clustering sensory stimuli so that similar stimuli elicit similar patterns of neural responses at the level of KCs to allow generalization, while distinct stimuli result in different neural responses, to allow discrimination. Importantly, this biological network has evolved to accomplish this task in a very efficient way.

In computational linguistics there is a long tradition (Harris, 1954) of using distributional properties of linguistic units for quantifying semantic similarities between them, as summarized in the famous quote by JR Firth: “a word is characterized by the company it keeps” (Firth, 1957). This idea has led to powerful tools such as Latent Semantic Analysis (Deerwester et al., 1990), topic modelling (Blei et al., 2003), and language models like word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and, more recently, BERT (Devlin et al., 2018), which relies on the Transformer model (Vaswani et al., 2017). Specifically, word2vec models are trained to maximize the likelihood of a word given its context, GloVe models utilize global word-word co-occurrence statistics, and BERT uses a deep neural network with attention to predict masked words (and the next sentence). As such, all these methods utilize the correlations between individual words and their context in order to learn useful word embeddings.

In our work we ask the following question: can the correlations between words and their contexts be extracted from raw text by the biological network of KCs shown in Fig. 1? Further, how do the word representations learned by KCs differ from those obtained by existing NLP methods? Although this network has evolved to process sensory stimuli from olfaction and other modalities, and not to “understand” language, it uses a general purpose algorithm to embed inputs (from different modalities) into a high dimensional space with several desirable properties, which we discuss below.

Our approach relies on a recent proposal that the recurrent network of mutually inhibited KCs can be used as a “biological” model for generating sparse binary hash codes for the input data presented at the projection neuron layer (Dasgupta et al., 2017). It was argued that a matrix of random weights projecting from PN layer into the KCs layer leads to the highly desirable property of making the generated hash codes locality sensitive, i.e., placing similar inputs close to each other in the embedding space and pushing distinct stimuli far apart. A subsequent study (Ryali et al., 2020) has demonstrated that the locality sensitivity of the hash codes can be significantly increased, compared to the random case, if the matrix of weights from PN to KCs is learned from data. The idea of using the network of KCs with random projections for NLP tasks has also been previously explored in (Preissner and Herbelot, 2019), see discussion in section 7.

Biologically, there is an ongoing debate in the neuroscience community regarding whether these projections are random. For instance, (Caron et al., 2013) argues for the random model, while (Zheng et al., 2020) presents evidence of the non-random structure of this network, which is related to the frequency of presented odors. Since the goal of our work is to build a useful AI system and not mimic every detail of the biological system, we adopt the data-driven synaptic weight strategy even if fruit flies may use random projections. As is clearly demonstrated in (Ryali et al., 2020), learned synapses lead to better performance.

Our main contributions in this work are the following:

  1. Inspired by the fruit fly network, we propose an algorithm that makes it possible to generate binary (as opposed to continuous) embeddings for words and their context. We systematically evaluate the performance of this algorithm on the word similarity task, word-sense disambiguation, and document classification.

  2. We demonstrate that our binary embeddings result in tighter and better separated clusters of concepts compared to continuous GloVe embeddings, and are on par with the clustering properties of binarized versions of GloVe.

  3. We show that training the fruit fly network requires an order of magnitude less compute time than training classical NLP architectures, like BERT, at the expense of a relatively small decrease in classification accuracy.

2 Learning Algorithm

Consider a training corpus. Each sentence can be decomposed into a collection of w-grams of consecutive words. If the word tokens come from a predefined vocabulary of size N_voc, the input to the algorithm is a vector of size 2N_voc. This vector consists of two blocks: the context (the first N_voc elements), and the target (the remaining N_voc elements); see Fig. 2. In this work w is assumed to be an odd integer, and the target word is assumed to be the center of the w-gram.

Figure 2: The encoding method. The input vector consists of two blocks separated by the (thick) blue line. Assuming w = 3, the center word “stock” is the target word and the two flanking words form a context. The w-gram is highlighted in light blue.

The target word is one-hot encoded in the target block, and the context words are binary encoded as a bag of words in the context block (no positional information is used). The window of size w slides along the text corpus, and for each position generates a training vector v^A = {v^A_i}, where the index A enumerates different w-grams, and the index i = 1, …, 2N_voc enumerates positions in the context-target vector. These training vectors are passed to the learning algorithm. The goal of the algorithm is to learn correlations between the context and the target blocks.
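As a concrete illustration of this encoding, the following is a minimal NumPy sketch; the function name and the assumption that tokens are given as vocabulary indices are ours, not taken from the authors' released code.

    import numpy as np

    def encode_wgram(wgram_ids, n_voc):
        # Encode a w-gram (odd-length list of vocabulary indices) into a
        # context-target vector of size 2 * n_voc: the first n_voc entries are a
        # binary bag of words over the context, the last n_voc entries are a
        # one-hot encoding of the center (target) word.
        w = len(wgram_ids)
        assert w % 2 == 1, "window size w is assumed to be odd"
        center = w // 2
        v = np.zeros(2 * n_voc, dtype=np.float32)
        for pos, token_id in enumerate(wgram_ids):
            if pos == center:
                v[n_voc + token_id] = 1.0   # target block: one-hot
            else:
                v[token_id] = 1.0           # context block: bag of words
        return v

    # Example: w = 3, a 10-word vocabulary, center word id 4
    v = encode_wgram([2, 4, 7], n_voc=10)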

2.1 Mathematical Formulation

Mathematically, the objective of the training algorithm is to distribute a set of context-target pairs among K buckets, so that similar pairs end up in similar buckets. In order to achieve this, the learning algorithm takes two inputs: a set of training vectors {v^A}, and a vector of occurrence probabilities p = {p_i}, where p_i is the probability of observing word i in the training corpus (in our notation the vector v^A has 2N_voc elements, while the vector p has N_voc elements; given that the index i runs from 1 to 2N_voc, the notation v^A_i / p_i implicitly assumes that two copies of p are concatenated into a twice longer vector). The learning can be formalized as a minimization of the energy function, see (Ryali et al., 2020) for additional details, defined by

E = - \sum_{A} \frac{\langle W_{\hat\mu}, v^A / p \rangle}{\langle W_{\hat\mu}, W_{\hat\mu} \rangle^{1/2}}, \qquad \hat\mu = \arg\max_{\mu} \langle W_{\mu}, v^A \rangle     (1)

In this equation W = {W_{\mu i}} is the matrix of synaptic connections projecting from the PN layer (individual neurons in that layer are denoted by the index i) to the KC layer (individual neurons in the KC layer are denoted by the index \mu). There are 2N_voc neurons in the PN layer and K neurons in the KC layer. The inner product \langle \cdot, \cdot \rangle is defined as a contraction over the index i of the PN cells. In the numerator of the energy function the binary encoded w-gram v^A is divided element-wise by the probabilities of occurrence of the individual words, so that the numerator can be written as \langle W_{\hat\mu}, v^A/p \rangle = \sum_i W_{\hat\mu i} v^A_i / p_i. Probabilities p_i are calculated based on the frequencies of words in the training corpus. The vocabulary contains the N_voc most frequent words in the corpus, thus all the elements of p are non-zero and the element-wise division is well defined.

Intuitively, the goal of the training algorithm is to adjust the weights of the neural network so that they are aligned with w-grams that are frequently present in the training corpus. We rely on the assumption that semantically related w-grams share several “core” words, while a few individual words might be substituted by synonyms/antonyms. The minimization of the energy function (1) is accomplished by an iterative update of the weights satisfying the following learning rule (Krotov and Hopfield, 2019; Ryali et al., 2020; Grinberg et al., 2019)

W_{\mu i} \longleftarrow W_{\mu i} + \varepsilon \, g\big[ \langle W_{\mu}, v^A \rangle \big] \left( \frac{v^A_i}{p_i} - \frac{\langle W_{\mu}, v^A / p \rangle}{\langle W_{\mu}, W_{\mu} \rangle} \, W_{\mu i} \right)     (2)

In this equation the activation function g is equal to one for the maximally driven hidden unit (Kenyon cell), and is equal to zero otherwise

g\big[ x_{\mu} \big] = \delta_{\mu \hat\mu}, \qquad \hat\mu = \arg\max_{\nu} x_{\nu}     (3)

The learning rate is denoted by \varepsilon, and \delta_{\mu \hat\mu} is a Kronecker delta symbol.
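A minimal NumPy sketch of one weight update implementing the learning rule (2)-(3), as reconstructed above, is given below. The single-example, CPU-only form and the variable names are our simplifications; this is not the authors' CUDA implementation.

    import numpy as np

    def update_weights(W, v, p2, lr):
        # W  : (K, 2 * n_voc) synaptic weights from PNs to KCs
        # v  : (2 * n_voc,) binary context-target vector
        # p2 : (2 * n_voc,) word probabilities, i.e. p concatenated with itself
        # lr : learning rate epsilon
        activations = W @ v                  # <W_mu, v> for every KC
        mu = int(np.argmax(activations))     # g[.] selects the maximally driven KC
        v_scaled = v / p2                    # element-wise division by probabilities
        numer = W[mu] @ v_scaled             # <W_mu, v / p>
        denom = W[mu] @ W[mu]                # <W_mu, W_mu>
        W[mu] += lr * (v_scaled - (numer / denom) * W[mu])
        return W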

2.2 Bio-Hashing

After learning is complete, the hash codes for the inputs can be generated in the following way. Given a binary encoded w-gram v, the hash code H = {H_\mu} is computed as

H_{\mu} = \begin{cases} 1, & \text{if } \langle W_{\mu}, v \rangle \text{ is among the top } k \text{ activations of all KCs} \\ 0, & \text{otherwise} \end{cases}     (4)

This is a crude mathematical approximation of the biological computation performed by the PN–KC–APL neural network (Dasgupta et al., 2017; Ryali et al., 2020). An input v generates an input current \langle W_\mu, v \rangle into the KC neurons through the feedforward weights W. The recurrent network of KCs and the APL neuron silences all but a small fraction of KCs. Those cells that remain active are assigned state 1, while the rest of the KCs are assigned the inactive state 0.

Notice that equation (4) makes it possible to generate hash codes both for individual words (static word embeddings like word2vec and GloVe) and for phrases (similar to Transformer models). In the static case, the input has all zeros in the context block and a one-hot encoded word in the target block. In the context-dependent case, both blocks have binary encoded input words. Importantly, both context-dependent and static embeddings are mapped into the same space of sparse binary hash codes (a vector of K elements, with k ones in it). We show below that these hash codes capture the semantic meaning of the target word and the context in which it is used. For the rest of the paper we refer to the parameter k in equation (4) as the hash length.
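The hashing step of equation (4) can be sketched as a simple top-k binarization of the KC activations; the following is our own NumPy illustration rather than the paper's implementation.

    import numpy as np

    def fly_hash(W, v, k):
        # The k KCs with the largest input current <W_mu, v> are set to 1,
        # all others to 0, mimicking the APL-mediated winner-take-all circuit.
        activations = W @ v
        top_k = np.argsort(activations)[-k:]
        h = np.zeros(W.shape[0], dtype=np.uint8)
        h[top_k] = 1
        return h

    # Static embedding of a single word: context block all zeros, target one-hot.
    # Contextual embedding: both blocks filled, as produced by encode_wgram above.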

In order to provide an intuition behind the learning algorithm defined by the energy function (1) and the weight update rule (2), and to connect it to some of the existing methods in machine learning, consider the limit when all the words have equal probabilities in the training corpus, p_i = 1/N_voc. In this limit the energy function (1) reduces to the familiar spherical k-means clustering algorithm (Dhillon and Modha, 2001), and the weights of each KC correspond to the centroids of the clusters of context-target vectors. The hashing rule (4) assigns the active state 1 to the k closest centroids (and the inactive state 0 to the remaining ones), defined with respect to cosine similarity. In this simple limit the learning algorithm that we use can be viewed as a biologically plausible implementation of this classical algorithm. For real datasets the probabilities of words are different, thus this correspondence does not hold. Notice that division by the probability p_i appears only in the expression for the energy, but not in the definition of \hat\mu in equation (1). Equivalently, division by p_i appears in the second bracket of equation (2), but not in the argument of the activation function g. Thus, in the general case (for different word probabilities p_i) our algorithm is not equivalent to spherical k-means on context-target vectors rescaled by the probabilities. Rather, in the general case, the closest centroid is found for a given context-target vector (via the definition of \hat\mu in equation (1), with no p_i involved), but the “updates of the position” of that centroid are computed by enhancing the contributions of rare words (small p_i) and suppressing the contributions of frequent words (large p_i). Empirically, we have found that division by the probabilities improves the performance of our method compared to the case of spherical k-means (when the 1/p_i factor is removed from the algorithm).

3 Empirical Evaluation

The KC network shown in Fig. 1 was trained on the OpenWebText Corpus (Gokaslan and Cohen, 2019), a corpus of unstructured text containing approximately 6B tokens. The details of the training protocols and the hyperparameters are reported in section 8 in the supplement.

3.1 Static Word Embeddings Evaluation

Our aim here is to demonstrate that the sparse embeddings obtained by the fruit fly network motif are competitive with existing state-of-the-art word embeddings such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013a), as well as with commonly used binarization tools for these continuous embeddings. We show this by evaluating the semantic similarity of static word embeddings. Several common benchmark datasets are used: WS353 (Finkelstein et al., 2002), MEN (Bruni et al., 2014), RW (Luong et al., 2013), SimLex (Hill et al., 2015), RG-65 (Rubenstein and Goodenough, 1965), Mturk (Halawi et al., 2012). These datasets contain pairs of words with human-annotated similarity scores between them. Following previous work (Tissier et al., 2019; Sokal, 1958), the model similarity score between two binary codes a and b is evaluated as (n11 + n00)/n, where n11 (n00) is the number of bit positions where a and b are both 1 (0), and n is the code length. Cosine similarity is used for real-valued representations. Spearman's correlation coefficient is calculated between this similarity and the human annotated score. The results are reported in Table 1.

Dataset Ours GloVe word2vec SOTA
MEN 56.6 69.5 75.5 81.3     (Dobó, 2019)
WS353 63.7 64.0 66.5 81.0     (Halawi et al., 2012)
SIMLEX 21.0 31.5 41.7 56.0     (Schwartz et al., 2015)
RW 39.4 46.8 61.3 61.7     (Pilehvar et al., 2018)
RG 69.0 74.2 75.4 83.3     (Hassan and Mihalcea, 2011)
Mturk 56.1 57.5 69.8 72.7     (Halawi et al., 2012)
Table 1: Evaluation on word similarity datasets via Spearman's rank correlation coefficient. Both GloVe and word2vec use 300d pretrained embeddings. Hyperparameter settings for our model are reported in Section 8. Results for our algorithm are reported only for a fixed hash length. See Table 7 for results as a function of hash length.
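For reference, the following small sketch shows the two similarity measures used in this evaluation (the (n11 + n00)/n score for binary codes and cosine similarity for continuous vectors); scipy.stats.spearmanr can then correlate model scores with the human annotations. The helper names are ours.

    import numpy as np
    from scipy.stats import spearmanr

    def binary_similarity(a, b):
        # Sokal-Michener style score: fraction of positions where the codes agree.
        a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
        return (np.sum(a & b) + np.sum(~a & ~b)) / a.size

    def cosine_similarity(x, y):
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    # model_scores and human_scores are lists over the word pairs of a benchmark:
    # rho, _ = spearmanr(model_scores, human_scores)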

We observe that our word embeddings demonstrate competitive performance compared to GloVe, but worse performance than word2vec. At the same time, our embeddings are binary, as opposed to GloVe and word2vec, which are represented by continuous vectors. Thus, it is more appropriate to compare them with commonly used binarized versions of the continuous embeddings. Specifically, we compare the performance of fruit fly embeddings with a number of state-of-the-art binarization methods: LSH/SimHash (Charikar, 2002) (random contractive projections followed by binarization based on sign), RandExp (Dasgupta et al., 2017) (random expansive projections followed by k-winner-take-all binarization), ITQ (Gong and Lazebnik, 2011) (iterative quantization), SH (spectral hashing) (Weiss et al., 2008), and PCAH (Gong and Lazebnik, 2011) (PCA followed by binarization based on sign). The complete evaluation of all these methods for varying hash length is presented in Section 9; please see Tables 7, 8, 9 for binarization of pretrained GloVe, pretrained word2vec, and GloVe trained on OpenWebText. In Table 7 we also include the evaluation of NLB, “Near-Lossless Binarization” (Tissier et al., 2019) (autoencoder-based binarization), for the hash lengths where those results are available. Here we only present a short summary of those results for a specific small hash length (k = 4) in Table 2.

Dataset Ours LSH RandExp ITQ SH PCAH
MEN 34.0 16.9/35.5/23.6 27.5/24.2/28.4 0.1/9.2/26.9 9.4/7.2/23.8 12.5/5.3/26.0
WS353 43.2 8.2/26.0/20.2 20.9/23.5/30.5 -6.6/16.0/25.9 15.4/3.3/18.1 6.4/17.3/21.2
SIMLEX 13.4 6.8/17.0/8.0 10.4/17.6/10.1 7.0/3.3/7.3 9.3/-3.6/12.1 4.4/-2.9/11.5
RW 11.0 10.8/21.8/16.2 19.9/24.7/22.0 13.7/17.4/24.5 22.6/14.6/19.7 12.4/15.0/19.7
RG 24.0 21.2/44.6/25.5 36.6/30.4/28.7 -17.5/32.8/21.4 4.5/18.0/39.8 1.9/20.8/45.0
Mturk 44.0 16.0/33.1/18.3 29.3/22.7/28.3 9.9/22.5/26.3 18.9/21.9/20.3 15.5/23.6/24.9
Table 2: Comparison to common binarization methods. This table is a simplified version (for hash length k = 4) of the complete evaluation for a range of hash lengths reported in Tables 7, 8, 9. Each binarization technique was evaluated on three continuous embeddings: pretrained GloVe, pretrained word2vec, and GloVe trained on OpenWebText (the same dataset that was used for training our fruit fly embeddings); format: pretrained GloVe / pretrained word2vec / GloVe on OWT. Hyperparameter settings for our model are reported in Section 8. Best result in bold; second best underlined.

It is clear from Table 2 that fruit fly embeddings outperform existing methods for word embedding discretization on WS353 and Mturk, and demonstrate the second best result (after LSH binarization of word2vec) on MEN. In general (see Tables 7, 8, 9), we find that fruit fly embeddings are particularly powerful compared to existing methods at small hash lengths (the smallest hash-length columns in the aforementioned tables). These results indicate that the fruit fly network can learn meaningful binary semantic representations directly from raw text. We also note that an added advantage of binary embeddings is that they require only a fraction (approx. 3%) of the memory footprint of continuous word embeddings of the same length, since a real value requires 32 bits per vector element, whereas a boolean value requires only 1 bit.

3.2 Word Clustering

A nice aspect of binary embeddings is that they result in tighter and better separated clusters than continuous embeddings. To evaluate this property for our method we started with hash codes for individual words and performed agglomerative clustering via complete link, using the cosine distance as the metric. The clustering algorithm was terminated at a fixed number of clusters (we experimented with several choices of this parameter and arrived at similar conclusions). We repeated the same analysis for continuous GloVe, a binarization of GloVe embeddings via an autoencoder-like method (Tissier et al., 2019), and a simple discretization of GloVe in which the largest elements of each word vector are set to 1 and the remaining elements to 0 (for several choices of the number of retained elements). The results for inter-cluster similarity vs. intra-cluster similarity are shown in Fig. 3 (panel A). It is clear from this scatter plot that the average distance between the points within a cluster is smaller (higher similarity) for all considered binary embeddings compared to GloVe embeddings. At the same time, the distance between the closest clusters is larger or equal (smaller similarity) for the fruit fly embeddings and the naive discretizations with small thresholds. We also observe that the clusters lose detail (i.e., both intra- and inter-cluster similarity increase) as the binarization threshold gets higher (shown for GloVe). However, our embeddings maintain a balance between intra- and inter-cluster similarity, and thus still capture fine-grained cluster information. Inspecting the semantic structure of the clusters obtained this way, an example of the hierarchical clustering diagram (the lower part of the tree, containing 42 leaves) is shown in Fig. 3 (panel B). We clearly observe semantically coherent clusters resulting from the fruit fly word embeddings.
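A minimal sketch of this clustering analysis with SciPy is shown below (complete-link agglomerative clustering under cosine distance, cut at a fixed number of clusters); the function name and the choice of SciPy routines are ours.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_hash_codes(H, n_clusters):
        # H: (n_words, K) matrix of binary hash codes, one row per word
        Z = linkage(H.astype(float), method="complete", metric="cosine")
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        return labels  # cluster index for every word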

Figure 3: Panel A: average cosine similarity between the points within a cluster vs. maximum cosine similarity (minimal distance) to a point from the closest cluster. Solid lines correspond to mean±std over the individual clusters. Numbers next to GloVe in the legend correspond to the number of largest elements in the word vector that are mapped to 1 under the naive discretization procedure. Panel B: an example of a cluster generated by the agglomerative clustering for our method; the integer number associated with each node corresponds to the number of daughter leaves in that cluster. The root node corresponds to “interchange (42)”.

3.3 Context-Dependent Word Embeddings

Here, we evaluate the effectiveness of our fruit fly inspired approach for contextual word embeddings, as opposed to static (or context-independent) embeddings from above. We use the WiC (Pilehvar and Camacho-Collados, 2018) and SCWS (Huang et al., 2012) benchmarks for the evaluation of context-sensitive word embeddings for word sense disambiguation. Both the datasets comprise pairs of sentences that contain a target word, and the task is to determine whether the two target words share a similar semantic meaning in the corresponding contexts. The WiC dataset is modeled as a binary prediction task, with 1 denoting that the target words have the same sense, and 0 indicating that they mean different things. The SCWS dataset is modeled as a rank prediction task, since for each pair of sentences and target words, it reports the average human similarity scores (from 10 Amazon Mechanical Turkers per pair).

Figure 4: For every word (highlighted in green) in context (left), 10 nearest neighbor words in the binary hashing space are shown (right). Context allows the algorithm to disambiguate the target word’s meaning.

Before presenting quantitative results, we qualitatively examine how the fruit fly network performs on context sentence pairs for the target words “apple” and “bank” in Fig. 4. We show the top 10 nearest neighbor words for each context-dependent target word. These examples clearly indicate that the “correct” sense of the word has been found (“apple” the device manufacturer has different nearest neighbors from the fruit, and “bank” the financial institution from the natural feature).

For the quantitative comparison, we contrast our method against contextual embeddings from BERT (Devlin et al., 2018), GloVe (Pennington et al., 2014), word2vec (Mikolov et al., 2013a) and Word2Sense (Panigrahi et al., 2019). For BERT we use the 768-dimensional embeddings from the uncased-large model, for GloVe and word2vec we use the 300-dimensional embeddings, and for Word2Sense we use the sparse 2250-dimensional pretrained embeddings. Since BERT outputs contextual embeddings for each word in a sentence, we simply compute the cosine similarity between the embedding vectors for the target words for each pair of instances. For GloVe/word2vec, we use a context window of size w centered at each of the target words, compute the average embedding for each window, and compute the cosine similarity between the two window vectors. A similar approach is used for Word2Sense, but the similarity between two embeddings is based on the Jensen-Shannon divergence (Panigrahi et al., 2019). For the fruit fly network, given the effectiveness of the top nearest neighbor words (as seen in Fig. 4), we devise a two component scoring function. The first component is the dot-product between the context-dependent hash codes for the two target words plus their length-w context blocks. The second is the number of common contextual nearest neighbors of the two target words among the top neighbors of each (scaled to be between 0 and 1). The final score is a weighted combination of these two components, with the mixing weight treated as a hyperparameter. For all the methods, we predict a WiC pair to be positive if the score is above a threshold value. For SCWS, the ranking is proportional to the scores above the threshold, with the rest scored as zero. The threshold is tuned for all the methods independently. Finally, for a fair comparison, all methods use the same 20k vocabulary.
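A sketch of this two-component score is given below. Because the exact normalization and combination rule are not spelled out above, the convex combination with weight alpha and the normalization of the hash overlap by the hash length are our assumptions.

    import numpy as np

    def context_score(h1, h2, nn1, nn2, alpha, hash_len, top_k=10):
        # h1, h2   : binary hash codes of target word + context for the two sentences
        # nn1, nn2 : lists of the top_k nearest-neighbor words for each target in context
        # alpha    : mixing hyperparameter between the two components
        s_hash = float(np.dot(h1, h2)) / hash_len   # hash-code overlap in [0, 1]
        s_nn = len(set(nn1) & set(nn2)) / top_k     # shared neighbors in [0, 1]
        return alpha * s_hash + (1.0 - alpha) * s_nn

    # A WiC pair is predicted positive when context_score(...) exceeds a tuned threshold.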

We report the performance of our context-dependent word embeddings for SCWS and WiC in Table 3 and Table 4, respectively. For both benchmarks we report the results from a 5-fold cross-validation study, where each fold (in turn) is used as a development set, and the remaining four folds as the test set. We select the optimal hyperparameters (including w) for all the methods using only the first fold; no training is done since we evaluate only the pretrained embeddings. The tables report the Spearman rank correlation on SCWS, and the accuracy on WiC.

Method mean std
BERT 56.8 0.54
word2vec () 56.7 0.005
GloVe () 40.9 1.3
GloVe () 54.4 0.10
Word2Sense () 41.4 0.01
Word2Sense () 54.2 0.008
Ours () 49.1 0.36
Table 3: SCWS dataset: mean and std for Spearman rank correlation. The best window value is also shown.
Method mean std
BERT 61.2 0.22
word2vec () 51.3 0.004
Word2vec () 50.0 0.003
GloVe () 54.9 0.26
GloVe () 50.1 0.25
Word2Sense () 56.5 0.004
Word2Sense () 50.0 0.003
Ours () 57.7 0.27
Table 4: WiC dataset: mean and std for accuracy. The best window value is also shown.

On SCWS (Table 3), we see that the context-independent embeddings (using w = 0) are better for GloVe, Word2Sense and our method, with word2vec yielding the best results. The reason is that about 86.5% of the word pairs in SCWS are different words, and can be distinguished without looking at the context. Unlike SCWS, the WiC benchmark uses the same target word (with only minor variations in some cases) in both contexts, and therefore a context-independent approach is not expected to perform well. Indeed, on WiC (Table 4), we clearly observe that context-independent vectors (w = 0) are not very good, and our method, which uses the joint scoring function combining both the hash code and nearest neighbor scores, is better than context-dependent GloVe, word2vec and Word2Sense (all with w > 0).

Dataset Ours Glove NLB(256bits) NLB(512bits) Word2vec BERT
20Newsgroup 78.2 77.9 61.6 64.1 77.3 78.6
SST-2 77.1 78.3 76.3 78.6 80.7 90.8
WOS-11967 83.8 84.2 70.6 72.8 84.8 86.7
TREC-6 90.4 89.0 85.2 88.8 90.9 94.0
Table 5: Accuracy for the document classification task. We use 300d pretrained models for GloVe and word2vec, and the pretrained bert-large-uncased model for BERT. For NLB, 300d GloVe embeddings were binarized into 256 and 512 bits. For our model, a fixed hash length is used. For fair comparison, all models use the same vocabulary of 20k words.

3.4 Document Classification

We also compare our binary embeddings with GloVe (Pennington et al., 2014), word2vec (Mikolov et al., 2013b), BERT (Devlin et al., 2018) and Near-Lossless Binarization (Tissier et al., 2019) on document classification tasks. The benchmarks we use are 20 Newsgroups (20NewsGroups, 1995), Stanford Sentiment Treebank (Socher et al., 2013), WOS-11967 (Kowsari et al., 2017) and TREC-6 (Li and Roth, 2002). The 20 Newsgroups dataset contains around 18,000 documents, partitioned evenly into 20 different groups; the Stanford Sentiment Treebank dataset contains movie reviews labeled with positive or negative sentiment; the WOS-11967 dataset contains 11,967 documents with 35 categories which include 7 parent categories; and the TREC-6 dataset consists of open-domain, fact-based questions divided into broad semantic categories. We use the TextCNN (Kim, 2014) classifier with each of the different embeddings mentioned above. For fair comparison, we use the same model parameters (e.g., kernel size, filter dimension) while testing different embeddings. The results in Table 5 show that our sparse binary encodings are competitive with other methods.

4 Computational Complexity

The computational complexity of our method can be evaluated by analyzing equations (2,3) for the weight updates. In these equations v^A is a sparse vector with only w non-zero elements. Thus, for a minibatch of size |BS|, the computational complexity of evaluating the dot products with the weights is of order K·w·|BS|, and the argmax operation requires of order K·|BS| additional operations. We will assume that the largest parameters in our model are the size of the corpus and the size of the vocabulary N_voc, and that we use large minibatches, with |BS| of order N_voc. Calculation of the second term in (2) requires of order K·N_voc operations per minibatch, in addition to the operations for calculating the dot-products for each data point. Since the algorithm has to go over the entire corpus, this computation needs to be repeated (corpus size)/|BS| times per epoch. Thus, the overall computational complexity of our method per epoch scales linearly with the size of the corpus and, to leading order, does not grow with the size of the vocabulary, which is a nice computational feature.

device K batch-size GPU mem time
V100 ×3 400 2000 ×3 122MB 17m
V100 ×3 400 10000 ×3 150MB 8m
V100 ×3 600 2000 ×3 232MB 24m
V100 ×3 600 10000 ×3 267MB 11.5m
CPU 44 cores 400 2000 - 76m
CPU 44 cores 400 10000 - 25m
Table 6: Training time (per epoch) and memory footprint of our method on GPUs and CPUs. For the GPU implementation, three V100 GPUs interconnected with bi-directional NVLink were used. For the CPU implementation, the computation was done on two 22-core CPUs with 137GB of CPU memory. The results are reported for a fixed window size w.

Figure 5: Spearman's correlation on word similarity datasets (see Section 3.1) vs. training time. Each point is one epoch.

From the practical perspective, typical wall-clock training time and memory requirements per epoch are shown in Table 6. As shown in Fig. 5, accurate solutions are obtained after the first few epochs; improvements beyond that are relatively small. Thus, our algorithm is capable of producing competitive models in a couple of hours. Contrast this with the much longer training time for GloVe (Pennington et al., 2014); 4 days of training on 16 TPUs for BERT-Base; and 4 days on 64 TPUs for BERT-Large (Devlin et al., 2018) (the last two numbers assume a training corpus of size 250B tokens vs. the 6B tokens considered in this paper). The record-breaking training time of under an hour for BERT requires over a thousand NVIDIA V100 GPUs, each with 32GB of memory, and a specialized DGX server architecture (Narasimhan, 2019). In our own experiments, we trained GloVe embeddings on the OWT corpus using the same vocabulary of 20k words that we used for the fruit fly embeddings. The wall-clock training time was approximately 10 hours on 16 threads, see details in Section 12. These are substantially larger computational resources than those required for training the fruit fly network.

5 Discussion and Conclusions

In this work we asked the intriguing question of whether the core computational algorithm of one of the best studied networks in neuroscience – the network of KCs in the fruit fly brain – can be repurposed for solving a well defined machine learning task, namely learning word embeddings from text. We have shown that, surprisingly, this network can indeed learn the correlations between words and their context, and produce high quality word embeddings. On the semantic similarity task the fruit fly word embeddings outperform common methods for binarizing continuous SOTA word embeddings (applied to GloVe, word2vec, and GloVe trained on OWT) at small hash lengths. On the word-in-context task the fruit fly network outperforms GloVe by almost 3%, word2vec by more than 6%, but loses to BERT by about 3.5%, see Table 4. The small gap in classification accuracy compared with BERT, however, is outweighed by the benefit of requiring significantly smaller computational resources to obtain these fruit fly embeddings, as we have explained in Section 4, see Table 6. We view this result as an example of a general statement that biologically inspired algorithms might be more compute efficient compared with their classical (non-biological) counterparts, even if they slightly lose in terms of accuracy.

6 Acknowledgements

We are thankful to L.Amini, S.Chang, D.Cox, J.Hopfield, Y.Kim, and H.Strobelt for helpful discussions. This work was supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).

7 Appendix A. Related Work.

Our work builds on several ideas previously discussed in the literature. The first idea is that the fruit fly olfactory network can generate high quality hash codes for the input data in both the random (Dasgupta et al., 2017) and the data-driven (Ryali et al., 2020) cases. There are two algorithmic differences of our approach compared to these previous studies. First, our network uses representational contraction, rather than expansion, when we go from the PN layer to the KC layer. Second, (Dasgupta et al., 2017; Ryali et al., 2020) construct hash codes for data coming from a single modality (e.g., images, or word vectors), while the goal of the present paper is to learn correlations between two different “modalities”: a target word and its context. The second idea pertains to the training algorithm for learning the PN–KC synapses. We use the biologically plausible algorithm of (Krotov and Hopfield, 2019) to do this, with modifications that take into account the wide range of frequencies of different words in the training corpus (we discuss these differences in section 2.1). Also, similarly to (Dasgupta et al., 2017; Ryali et al., 2020), the algorithm of (Krotov and Hopfield, 2019) was previously used for learning representations of the data, and not correlations between two types of data (context and target) as we do in this paper.

Another closely related work (Preissner and Herbelot, 2019) uses the network of KCs with random weights for generating binary hash codes for individual words. There are several differences compared to our approach. First, in our system the synaptic weights from PNs to KCs are learned and not random. We have found that learning these weights improves the performance compared to the random case. Second, unlike (Preissner and Herbelot, 2019) (and unlike fruit flies), in our system the number of KCs is smaller than the number of PNs, so there is no representational expansion as we move into the “mushroom body”. This expansion is essential for the system of (Preissner and Herbelot, 2019), which uses random weights. Finally, our algorithm uses a different encoding scheme at the level of PNs, see Fig. 2.

8 Appendix B. Training Protocols and Hyperparameter Choices.

The fruit fly network was trained on the OpenWebText Corpus (Gokaslan and Cohen, 2019), a corpus of unstructured text containing approximately 6B tokens. Individual documents were concatenated and split into sentences. A collection of w-grams was extracted from each sentence by sliding a window of size w along the sentence from the beginning to the end. Sentences shorter than w were removed. The vocabulary was composed of the 20,000 most frequent tokens in the corpus.

Training was done for a fixed number of epochs. At each epoch all the w-grams were shuffled, organized in minibatches, and presented to the learning algorithm. The learning rate was linearly annealed starting from the maximal value at the first epoch to nearly zero at the last epoch.

The training algorithm has the following hyperparameters: the size of the KC layer K, the window size w, the overall number of training epochs, the initial learning rate, the minibatch size, and the hash length k. All models presented in this paper were trained with the same protocol. The optimal ranges of the learning rate, K, w, and minibatch size were determined empirically; the hash length is reported for each individual experiment.

9 Appendix C. Comparison with Binarized GloVe and word2vec.

Method Hash Length (k) Hash Length (k)
4 8 16 32 64 128 4 8 16 32 64 128
MEN (69.5/68.1) WS353 (64.0/47.7)
Ours 34.0 49.9 55.9 56.7 55.3 51.3 43.2 52.1 55.3 57.4 60.3 51.7
LSH 16.9 23.7 35.6 42.6 53.6 63.4 8.2 20.7 30.0 34.7 43.9 50.3
RandExp 27.5 37.7 46.6 57.6 67.3 71.6 20.9 32.9 41.9 48.4 57.6 61.7
ITQ 0.1 7.7 10.5 16.5 30.4 50.5 -6.6 -6.1 -2.4 -4.4 6.1 24.8
SH 9.4 17.0 22.9 37.6 52.9 65.4 15.4 14.1 19.5 32.3 43.1 58.4
PCAH 12.5 21.8 27.6 39.6 53.4 68.1 6.4 6.3 20.6 33.9 49.8 62.6
NLB - - - - 46.1 63.3 - - - - 30.1 44.9
SIMLEX (31.5/29.8) RW (46.8/31.4)
Ours 13.4 16.5 22.8 22.1 21.1 17.0 11.0 22.6 25.8 36.9 38.6 35.2
LSH 6.8 11.9 17.0 21.2 26.8 30.9 10.8 16.3 21.8 27.8 36.3 45.0
RandExp 10.4 17.2 22.8 28.5 32.4 35.2 19.9 21.3 30.9 40.5 47.6 53.3
ITQ 7.0 1.6 4.3 5.5 11.8 18.2 13.7 5.3 6.6 6.9 12.5 26.5
SH 9.3 15.6 15.9 17.0 23.1 31.2 22.6 21.5 24.3 28.8 36.1 45.8
PCAH 4.4 10.3 11.0 17.3 24.1 31.6 12.4 16.7 21.5 30.3 36.9 44.4
NLB - - - - 20.5 31.4 - - - - 25.1 34.3
RG (74.2/67.6) Mturk (57.5/61.9)
Ours 24.0 40.4 51.3 62.3 63.2 55.8 44.0 49.0 52.2 60.1 57.7 55.2
LSH 21.2 35.4 44.6 55.1 63.1 70.1 16.0 23.1 33.2 35.6 42.7 55.5
RandExp 36.6 49.0 49.5 66.1 69.6 70.9 29.3 35.8 41.4 50.4 59.0 61.6
ITQ -17.5 -8.9 26.3 41.7 50.5 66.2 9.9 7.8 10.1 17.7 32.8 47.3
SH 4.5 5.8 20.3 42.9 61.3 72.6 18.9 17.6 27.5 35.45 48.1 57.9
PCAH 1.9 9.6 19.8 40.9 53.3 68.2 15.5 15.1 27.1 41.7 46.5 56.2
Table 7: Evaluation on word similarity datasets. For each dataset and hash length, the best (second best) score is in bold (underlined). The performance of the continuous GloVe embeddings (at two dimensionalities) is reported next to the name of each dataset. Spearman's rank correlation coefficient is reported for common baselines that binarize GloVe (300d) embeddings, together with our results. Hyperparameter settings for our algorithm are reported in Section 8.

Our aim here is to demonstrate that the fruit fly word embeddings are competitive with existing state-of-the-art binarization methods applied to GloVe and word2vec embeddings. We show this by evaluating the semantic similarity of static word embeddings, using several common benchmark datasets: WS353 (Finkelstein et al., 2002), MEN (Bruni et al., 2014), RW (Luong et al., 2013), SimLex (Hill et al., 2015), RG-65 (Rubenstein and Goodenough, 1965), and Mturk (Halawi et al., 2012). These datasets contain pairs of words with human-annotated similarity scores between them. Specifically, we compare with GloVe (Pennington et al., 2014) word embeddings (pretrained embeddings: https://nlp.stanford.edu/projects/glove) trained on Wiki2014 and Gigaword 5, GloVe embeddings trained on the OpenWebText Corpus (Gokaslan and Cohen, 2019), and word2vec embeddings (pretrained embeddings: https://code.google.com/archive/p/word2vec).

Since our representations are binary (in contrast to GloVe and word2vec), we binarize the GloVe and word2vec embeddings and report their performance using a number of common hashing methods: LSH/SimHash (Charikar, 2002) (random contractive projections followed by binarization based on sign), RandExp (Dasgupta et al., 2017) (random expansive projections followed by k-winner-take-all binarization), ITQ (Gong and Lazebnik, 2011) (iterative quantization), SH (spectral hashing) (Weiss et al., 2008), and PCAH (Gong and Lazebnik, 2011) (PCA followed by binarization based on sign). Where available, we include the evaluation of NLB, “Near-Lossless Binarization” (Tissier et al., 2019) (autoencoder-based binarization).

Following previous work (Tissier et al., 2019; Sokal, 1958), the model similarity score between two binary representations a and b is evaluated as (n11 + n00)/n, where n11 (n00) is the number of bit positions where a and b are both 1 (0), and n is the code length. Cosine similarity is used for real-valued representations. The results are reported in Tables 7, 8 and 9. For each dataset, we report performance across a range of hash lengths k. For methods that incorporate randomness (LSH, RandExp, ITQ), we report the average across 5 runs. ITQ, SH and PCAH in Tables 7 and 8 were trained using the top 400k most frequent words. Table 9 compares our method to GloVe trained on OpenWebText (the same dataset that our method is trained on) using the same vocabulary as our method.

Our binary word embeddings demonstrate competitive performance compared to published methods for GloVe and word2vec binarization, and our algorithm can learn meaningful binary semantic representations directly from raw text. Importantly, our algorithm does not require training GloVe or word2vec embeddings first before binarizing them.

Method Hash Length (k) Hash Length (k)
4 8 16 32 64 128 4 8 16 32 64 128
MEN (75.5) WS353 (66.5)
Ours 34.0 49.9 55.9 56.7 55.3 51.3 43.2 52.1 55.3 57.4 60.3 51.7
LSH 35.5 42.5 53.6 63.4 68.4 72.2 26.0 34.7 43.9 50.3 56.0 58.6
RandExp 24.2 34.6 45.8 57.5 66.1 71.7 23.5 34.3 37.3 48.0 57.6 63.7
ITQ 9.2 13.3 25.1 41.5 57.6 68.5 16.0 18.1 22.5 30.2 43.9 54.8
SH 7.2 15.8 31.3 46.9 62.3 69.4 3.3 9.6 22.7 34.1 50.0 54.7
PCAH 5.3 18.6 37.7 52.0 63.9 71.6 17.3 24.9 38.5 42.0 52.1 59.3
SIMLEX (41.7) RW (61.3)
Ours 13.4 16.5 22.8 22.1 21.1 17.0 11.0 22.6 25.8 36.9 38.6 35.2
LSH 17.0 21.2 26.8 30.9 34.4 35.1 21.8 27.8 36.3 45.0 49.6 52.1
RandExp 17.6 24.4 29.2 32.6 38.0 39.8 24.7 27.7 39.8 46.8 52.3 55.6
ITQ 3.25 5.7 6.2 14.9 23.1 31.5 17.4 15.7 19.1 33.5 45.6 53.4
SH -3.6 3.6 10.4 17.0 23.7 32.4 14.6 22.8 28.7 37.9 43.5 52.4
PCAH -2.9 2.5 11.8 17.0 24.0 36.0 15.0 21.5 28.8 35.4 46.4 50.6
RG (75.4) Mturk (69.8)
Ours 24.0 40.4 51.3 62.3 63.2 55.8 44.0 49.0 52.2 60.1 57.7 55.2
LSH 44.6 55.1 63.1 70.1 76.4 75.8 33.1 35.6 42.7 55.5 58.6 62.4
RandExp 30.4 42.0 48.6 59.1 70.2 74.6 22.7 34.8 42.0 45.9 57.9 61.2
ITQ 32.8 49.7 31.5 55.9 62.2 71.6 22.5 21.3 42.3 46.9 59.3 60.7
SH 18.0 30.6 36.0 48.8 56.9 75.8 21.9 27.4 41.8 51.2 58.8 58.0
PCAH 20.8 22.9 40.6 36.5 59.0 71.2 23.6 34.4 45.5 55.7 64.2 60.5
Table 8: Evaluation on word similarity datasets, analogous to Table 7, for 300d word2vec embeddings.
Method Hash Length (k) Hash Length (k)
4 8 16 32 64 128 4 8 16 32 64 128
MEN (76.4) WS353 (72.2)
Ours 34.0 49.9 55.9 56.7 55.3 51.3 43.2 52.1 55.3 57.4 60.3 51.7
LSH 23.6 29.1 37.4 49.6 60.6 67.0 20.2 29.0 35.5 47.5 53.3 61.4
RandExp 28.4 40.3 52.3 62.5 67.7 71.0 30.5 40.0 48.1 57.9 63.3 67.5
ITQ 26.9 33.9 46.3 56.1 64.1 70.3 25.9 33.7 44.5 56.1 63.9 67.6
SH 23.8 28.7 44.1 54.7 62.1 69.7 18.1 25.7 40.1 51.8 60.9 62.9
PCAH 26.0 30.1 46.3 57.9 67.5 72.4 21.2 30.5 43.8 50.7 61.1 59.9
SIMLEX (34.0) RW (54.5)
Ours 13.4 16.5 22.8 22.1 21.1 17.0 11.0 22.6 25.8 36.9 38.6 35.2
LSH 8.0 16.8 19.0 24.8 26.7 32.9 16.2 21.0 26.1 33.6 40.8 47.0
RandExp 10.1 17.3 23.4 26.6 29.7 31.3 22.0 28.8 34.1 43.9 46.3 51.5
ITQ 7.3 13.8 14.4 20.9 25.3 30.3 24.5 26.8 34.8 43.2 49.1 51.5
SH 12.1 14.2 17.5 20.0 26.4 36.0 19.7 24.8 32.9 38.7 45.4 46.7
PCAH 11.5 13.8 16.4 22.6 31.1 38.6 19.7 24.8 32.9 38.7 45.4 46.7
RG (78.7) Mturk (71.1)
Ours 24.0 40.4 51.3 62.3 63.2 55.8 44.0 49.0 52.2 60.1 57.7 55.2
LSH 25.5 24.9 34.6 62.1 61.8 73.5 18.3 31.3 31.4 42.9 56.5 60.7
RandExp 28.7 45.6 47.3 63.7 67.8 70.8 28.3 41.3 50.1 56.5 65.4 67.1
ITQ 21.4 32.7 50.4 57.7 67.6 70.3 26.3 41.4 53.2 61.2 67.1 68.9
SH 39.8 45.6 50.0 50.2 62.3 68.6 20.3 35.9 51.9 61.9 59.1 61.3
PCAH 45.0 50.0 49.2 46.8 66.6 69.8 24.9 40.7 55.7 64.3 64.4 60.5
Table 9: Evaluation on word similarity datasets, analogous to Table 7. The 300d GloVe embeddings were trained from scratch on the same OpenWebText dataset as our algorithm.

10 Appendix D. Details of Technical Implementation.

From the practical perspective, efficient implementation of the learning algorithm for the fruit fly network requires the use of sparse algebra, atomic updates, and block-sparse data access. Our algorithm is implemented with a CUDA back-end, while Python is used as an interface to the main functions.

The typical memory footprint of our approach is very small. A few hundred MB of GPU memory is allocated for the operators and temporary fields, while a larger amount of CPU memory is needed to store the input data, the array of random numbers for shuffle operations, and the shuffled indices. For the GPU implementation, the model data is stored in the GPU's memory, while the input data is stored in the CPU memory. The parallelization strategy in our implementation is based on two aspects. First, each minibatch of data is divided into smaller sub-minibatches which are processed on different GPUs. Second, all the operations (dense-sparse matrix multiplications, the argmax operation, and weight updates) are executed in parallel using multiple threads.

11 Appendix E. Qualitative Evaluation of Contextual Embeddings.

In order to evaluate the quality of the contextualized embeddings we have created an online tool, which we are planning to release with the paper, that allows users to explore the representations learned by our model for various inputs (context-target pairs). For a given query the tool returns word cloud visualizations for each of the four most activated Kenyon cells. We show some examples of the outputs produced by this tool in Fig. 6. Each query is used to generate a bag of words input vector v. This vector is then used to compute the activations of the KCs via \langle W_\mu, v \rangle. The four KCs with the highest activations are selected. The corresponding four weight vectors are used to generate four probability distributions over individual words learned by those KCs, by passing the weights through a softmax function; for example, for the vector with index \mu, the probability distribution is computed as prob_i = exp(W_{\mu i}) / \sum_j exp(W_{\mu j}). These probability distributions for the top four activated KCs are visualized as word clouds. In computing the softmax only the target block of the weight vector was used (we have checked that using only the context block gives qualitatively similar word clouds).
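A small sketch of this word-cloud computation (a softmax over the target block of a KC's weight vector) is given below; the plain softmax without a temperature parameter and the function name are our simplifications.

    import numpy as np

    def kc_word_distribution(W_mu, n_voc):
        # W_mu: (2 * n_voc,) weight vector of one Kenyon cell; only the target
        # block is used to build the word-cloud probability distribution.
        target_weights = W_mu[n_voc:]
        z = target_weights - target_weights.max()   # numerical stability
        probs = np.exp(z) / np.sum(np.exp(z))
        return probs                                # one probability per vocabulary word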

Figure 6: Examples of three queries and corresponding word cloud visualization for top four activated KCs (by each query).

The results indicate that the fruit fly network indeed has learned meaningful representations. Consider, for example, the first query. The sentence “Entertainment industry shares rise following the premiere of the mass destruction weapon documentary” results in the four top activated KCs shown in Fig. 6. The top activated KC has the largest weights for the words “weapon”, “mass”, etc. The second activated KC is sensitive to the words “market”, “stock”, etc. This illustrates how the fruit fly network processes queries. In this example the query refers to several distinct combinations of concepts: “weapon of mass destruction”, “stock market”, “movie industry”. Each of those concepts has a dedicated KC responsible for it. As one can see, the responses are not perfect. For example, in this case one would expect the 4th most activated KC, which is responsible for the “movie industry” concept, to have a higher activation than the 3rd, which is responsible for the types of “weapons of mass destruction”. But overall all the concepts picked up by the KCs are meaningful and related to the query.

12 Appendix F. Details of GloVe Retraining

To directly compare our method to GloVe, we trained a GloVe model from scratch on the same OpenWebText corpus using the code provided by the original GloVe authors (Pennington et al., 2014) (https://nlp.stanford.edu/projects/glove/). This model was optimized to have the same vocabulary size as our model (the 20k most frequent tokens), used an embedding size of 300, and a window size of 15. The model was trained for 180 iterations at about 3 minutes 20 seconds per iteration on 16 threads, resulting in a total training time of approximately 10 hours.

References

  • 20NewsGroups (1995) 20 newsgroups dataset. External Links: Link Cited by: §3.4.
  • A. S. Bates, P. Schlegel, R. J. Roberts, N. Drummond, I. F. Tamimi, R. G. Turnbull, X. Zhao, E. C. Marin, P. D. Popovici, S. Dhawan, et al. (2020) Complete connectomic reconstruction of olfactory projection neurons in the fly brain. BioRxiv. Cited by: §1.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. Journal of machine Learning research 3 (Jan), pp. 993–1022. Cited by: §1.
  • E. Bruni, N. K. Tran, and M. Baroni (2014) Multimodal distributional semantics. Journal of Artificial Intelligence Research 49 (1), pp. 1–47. External Links: ISSN 1076-9757 Cited by: §3.1, §9.
  • S. J. Caron, V. Ruta, L. Abbott, and R. Axel (2013) Random convergence of olfactory inputs in the drosophila mushroom body. Nature 497 (7447), pp. 113–117. Cited by: §1.
  • S. J. C. Caron, J. Li, B. D. Mahoney, and M. S. Jacob (2020) Two parallel pathways convey distinct visual information to the drosophila mushroom body. bioRxiv. Cited by: §1.
  • M. S. Charikar (2002) Similarity estimation techniques from rounding algorithms. In Annual ACM Symposium on Theory of Computing, pp. 380–388. External Links: ISBN 978-1-58113-495-7, Document Cited by: §3.1, §9.
  • S. Dasgupta, C. F. Stevens, and S. Navlakha (2017) A neural algorithm for a fundamental computing problem. Science 358 (6364), pp. 793–796. Cited by: §1, §1, §2.2, §3.1, §7, §9.
  • S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990) Indexing by latent semantic analysis. Journal of the American society for information science 41 (6), pp. 391–407. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.3, §3.4, §4.
  • I. S. Dhillon and D. S. Modha (2001) Concept decompositions for large sparse text data using clustering. Machine learning 42 (1-2), pp. 143–175. Cited by: §2.2.
  • A. Dobó (2019) A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages. Ph.D. Thesis, szte. Cited by: Table 1.
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2002) Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20 (1). External Links: ISSN 1046-8188 Cited by: §3.1, §9.
  • J. R. Firth (1957) A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis. Cited by: §1.
  • A. Gokaslan and V. Cohen (2019) OpenWebText Corpus. Note: http://Skylion007.github.io/OpenWebTextCorpus Cited by: §3, §8, §9.
  • Y. Gong and S. Lazebnik (2011) Iterative quantization: A procrustean approach to learning binary codes. In CVPR 2011, pp. 817–824 (en). External Links: ISBN 978-1-4577-0394-2, Document Cited by: §3.1, §9.
  • L. Grinberg, J. Hopfield, and D. Krotov (2019) Local unsupervised learning for image analysis. arXiv preprint arXiv:1908.08993. Cited by: §2.1.
  • G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren (2012) Large-scale learning of word relatedness with constraints. KDD, pp. 1406–1414. Cited by: §3.1, Table 1, §9.
  • Z. S. Harris (1954) Distributional structure. Word 10 (2-3), pp. 146–162. Cited by: §1.
  • S. H. Hassan and R. Mihalcea (2011) Semantic relatedness using salient semantic analysis. In Twenty-Fifth AAAI Conference on Artificial Intelligence, Cited by: Table 1.
  • F. Hill, R. Reichart, and A. Korhonen (2015) SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. Cited by: §3.1, §9.
  • E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng (2012) Improving Word Representations via Global Context and Multiple Word Prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §3.3.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §3.4.
  • K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes (2017) HDLTex: hierarchical deep learning for text classification. In Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, Cited by: §3.4.
  • D. Krotov and J. J. Hopfield (2019) Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences 116 (16), pp. 7723–7731. Cited by: §2.1, §7.
  • F. Li, J. Lindsey, E. C. Marin, N. Otto, M. Dreher, G. Dempsey, I. Stark, A. S. Bates, M. W. Pleijzier, P. Schlegel, et al. (2020) The connectome of the adult drosophila mushroom body: implications for function. bioRxiv. Cited by: §1.
  • X. Li and D. Roth (2002) Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, Cited by: §3.4.
  • T. Luong, R. Socher, and C. Manning (2013) Better word representations with recursive neural networks for morphology. In Conference on Computational Natural Language Learning, pp. 104–113. Cited by: §3.1, §9.
  • E. C. Marin, R. J. Roberts, L. Büld, M. Theiss, M. W. Pleijzier, T. Sarkissian, W. J. Laursen, R. G. Turnbull, P. Schlegel, A. S. Bates, et al. (2020) Connectomics analysis reveals first, second, and third order thermosensory and hygrosensory neurons in the adult drosophila brain. BioRxiv. Cited by: §1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, §3.1, §3.3.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. 3111–3119. External Links: Link Cited by: §3.4.
  • S. Narasimhan (2019) NVIDIA clocks world’s fastest bert training time and largest transformer based model, paving path for advanced conversational ai. Note: https://devblogs.nvidia.com/training-bert-with-gpus/ Cited by: §4.
  • A. Panigrahi, H. V. Simhadri, and C. Bhattacharyya (2019) Word2Sense: sparse interpretable word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5692–5705. Cited by: §3.3.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §12, §3.1, §3.3, §3.4, §4, §9.
  • M. T. Pilehvar and J. Camacho-Collados (2018) Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. arXiv preprint arXiv:1808.09121. Cited by: §3.3.
  • M. T. Pilehvar, D. Kartsaklis, V. Prokhorov, and N. Collier (2018) Card-660: cambridge rare word dataset-a reliable benchmark for infrequent word representation models. arXiv preprint arXiv:1808.09308. Cited by: Table 1.
  • S. Preissner and A. Herbelot (2019) To be fair: a case for cognitively-inspired models of meaning.. In CLiC-it, Cited by: §1, §7.
  • H. Rubenstein and J. B. Goodenough (1965) Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: §3.1, §9.
  • C. K. Ryali, J. J. Hopfield, L. Grinberg, and D. Krotov (2020) Bio-Inspired Hashing for Unsupervised Similarity Search. arXiv preprint arXiv:2001.04907. Cited by: §1, §1, §2.1, §2.1, §2.2, §7.
  • R. Schwartz, R. Reichart, and A. Rappoport (2015) Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of the nineteenth conference on computational natural language learning, pp. 258–267. Cited by: Table 1.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §3.4.
  • R. Sokal (1958) A statistical method for evaluating systematic relationships.. Univ. Kansas, Sci. Bull. 38, pp. 1409–1438. Cited by: §3.1, §9.
  • J. Tissier, C. Gravier, and A. Habrard (2019) Near-lossless binarization of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7104–7111. Cited by: §3.1, §3.1, §3.2, §3.4, §9, §9.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • K. Vogt, Y. Aso, T. Hige, S. Knapek, T. Ichinose, A. B. Friedrich, G. C. Turner, G. M. Rubin, and H. Tanimoto (2016) Direct neural pathways convey distinct visual information to drosophila mushroom bodies. Elife 5, pp. e14009. Cited by: §1.
  • Y. Weiss, A. Torralba, and R. Fergus (2008) Spectral hashing. Advances in neural information processing systems 21, pp. 1753–1760. Cited by: §3.1, §9.
  • Z. Zheng, F. Li, C. Fisher, I. J. Ali, N. Sharifi, S. Calle-Schuler, J. Hsu, N. Masoodpanah, L. Kmecova, T. Kazimiers, et al. (2020) Structured sampling of olfactory input by the fly mushroom body. bioRxiv. Cited by: §1.