HyperEmbed: Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing enabled Embedding of n-gram Statistics

by Pedro Alonso, et al.

Recent advances in Deep Learning have led to a significant performance increase on several NLP tasks; however, the models have become increasingly computationally demanding. Therefore, this paper tackles the domain of computationally efficient algorithms for NLP tasks. In particular, it investigates distributed representations of n-gram statistics of texts. The representations are formed using hyperdimensional computing enabled embedding. These representations then serve as features, which are used as input to standard classifiers. We investigate the applicability of the embedding on one large and three small standard datasets for classification tasks using nine classifiers. The embedding achieved on-par F1 scores while decreasing the time and memory requirements by several times compared to the conventional n-gram statistics; e.g., for one of the classifiers on a small dataset, the memory reduction was 6.18 times, while train and test speed-ups were 4.62 and 3.84 times, respectively. For many classifiers on the large dataset, the memory reduction was about 100 times and train and test speed-ups were over 100 times. More importantly, the usage of distributed representations formed via hyperdimensional computing allows dissecting the strict dependency between the dimensionality of the representation and the parameters of n-gram statistics, thus opening room for tradeoffs.



1 Introduction

Recent work (Strubell et al., 2019) has brought significant attention by demonstrating the potential cost and environmental impact of developing and training state-of-the-art models for Natural Language Processing (NLP) tasks. The work suggested several countermeasures for changing the situation. One of them recommends a concerted effort by industry and academia to promote research of more computationally efficient algorithms (Strubell et al., 2019). The main focus of this paper falls precisely in this domain.

In particular, we consider NLP systems using a well-known technique called n-gram statistics. The key idea is that hyperdimensional computing (Kanerva, 2009) allows forming distributed representations of the conventional n-gram statistics (Joshi et al., 2016). The use of these distributed representations, in turn, allows trading off the performance of an NLP system (e.g., F1 score) against its computational resources (i.e., time and memory). The main contribution of this paper is the systematic study of these tradeoffs on nine machine learning algorithms using several benchmark classification datasets. This is the first study where the computational tradeoffs of the distributed representations of n-gram statistics are studied extensively on numerous datasets. We demonstrate the usefulness of hyperdimensional computing-based embedding, which is highly time and memory efficient. Our experiments on a well-known dataset (Braun et al., 2017) for intent classification show that it is possible to reduce memory usage by x and speed-up training by x without compromising the F1 score. Several important use-cases motivate the efforts towards trading off the performance of a system against the computational resources required to achieve that performance: high-throughput systems with an extremely large number of requests/transactions (the power of one per cent); resource-constrained systems where computational resources and energy are scarce (edge computing); and green computing systems that take into account environmental sustainability when considering the efficiency of algorithms (AI HLEG, 2019).

The paper is structured as follows. Section 2 covers the related work. Section 3 outlines the evaluation and describes the datasets. The methods being used are presented in Section 4. Section 5 reports the experimental results. Discussion and concluding remarks are presented in Section 6.

2 Related Work

Commonly, data for NLP tasks are represented in the form of vectors, which are then used as an input to machine learning algorithms. These representations range from dense learnable vectors to extremely sparse non-learnable vectors. Well-known examples of such representations include one-hot encodings, count-based vectors, and Term Frequency Inverse Document Frequency (TF-IDF), among others. Despite being very useful, non-learnable representations have their disadvantages, such as resource inefficiency due to their sparsity and absence of contextual information (except for TF-IDF). Learnable vector representations such as word embeddings (e.g., Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014)) partially address these issues by obtaining dense vectors in an unsupervised learning fashion. These representations are based on the distributional hypothesis: words located nearby in a vector space should have similar contextual meaning. The idea has been further improved in Joulin et al. (2016) by representing words with character n-grams. Another efficient way of representing a word is the concept of Byte Pair Encoding, which was introduced in Gage (1994). The disadvantage of the learnable representations, however, is that they require pretraining on a large training corpus and have a large memory footprint (on the order of GB). As an alternative to word/character embedding, Shridhar et al. (2019) introduced the idea of Subword Semantic Hashing, which uses a hashing method to represent subword tokens, thus reducing the memory footprint (to the order of MB) and removing the necessity of pretraining over a large corpus. The approach has demonstrated state-of-the-art results on three datasets for intent classification.

The Subword Semantic Hashing, however, relies on n-gram statistics for extracting the representation vector used as an input to classification algorithms. It is worth noting that the conventional n-gram statistics uses a positional representation where each position in the vector can be attributed to a particular n-gram. The disadvantage of the conventional n-gram statistics is that the size of the vector grows exponentially with n. Nevertheless, it is possible to untie the size of the representation from n by using distributed representations (Hinton et al., 1986), where the information is distributed across the vector’s positions. In particular, Joshi et al. (2016) suggest how to embed conventional n-gram statistics into a high-dimensional vector (HD vector) using the principles of hyperdimensional computing. Hyperdimensional computing, also known as Vector Symbolic Architectures (Plate, 2003; Kanerva, 2009; Eliasmith, 2013), is a family of bio-inspired methods of manipulating and representing information. The method of embedding n-gram statistics into the distributed representation in the form of an HD vector has demonstrated promising results on the task of language identification while being hardware-friendly (Rahimi et al., 2016). In Najafabadi et al. (2016) it was further applied to the classification of news articles into one of eight predefined categories. The method has also shown promising results (Kleyko et al., 2019) when using HD vectors for training Self-Organizing Maps (Kohonen, 2001). However, there are no previous studies comprehensively exploring tradeoffs achievable with the method on benchmark NLP datasets when using supervised classifiers.

3 Evaluation outline

3.1 Classifiers and performance metrics

To obtain results applicable to a broad range of existing machine learning algorithms, we have performed experiments with several conventional classifiers. In particular, the following classifiers were studied: Ridge Classifier, k-Nearest Neighbors (kNN), Multilayer Perceptron (MLP), Passive Aggressive, Random Forest, Linear Support Vector Classifier (SVC), Stochastic Gradient Descent (SGD), Nearest Centroid, and Bernoulli Naive Bayes (NB). All the classifiers are available in the scikit-learn library (Pedregosa et al., 2011), which was used in the experiments.

Since the main focus of this paper is the tradeoff between classification performance and computational resources, we have to define metrics for both aspects. The quality of the classification performance of a model will be measured by a simple and well-known metric, the F1 score (please see Fawcett (2006)). The computational resources will be characterized by three metrics: the time it takes to train a model, the time it takes to test the trained model, and the memory, where the memory is defined as the sum of the size of input feature vectors for train and test splits as well as the size of the trained model. To avoid dependencies such as the particular specifications of a computer and the dataset size, the train/test times and memory are reported as relative values (i.e., train/test speed-up and memory reduction), where the reference is the value obtained for the case of the conventional n-gram statistics. (It is worth noting that the speed-ups reported in Section 5 do not include the time it takes to obtain the corresponding HD vectors. Please see the discussion of this issue in Section 6.)
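As a minimal sketch of how such relative values can be computed, the snippet below divides the reference cost by the cost of the alternative representation, so that values above 1 mean the alternative is cheaper. The function name `relative_gain` and all measurements are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def relative_gain(reference_cost, candidate_cost):
    """Return how many times cheaper the candidate is than the reference.

    A value of 1.0 means both consume the same resources; values above 1.0
    mean the candidate (e.g., HD vectors) is cheaper than the reference
    (e.g., conventional n-gram statistics).
    """
    if candidate_cost <= 0:
        raise ValueError("candidate cost must be positive")
    return reference_cost / candidate_cost


# Hypothetical measurements (seconds and bytes), for illustration only.
ngram_cost = {"train_s": 2.0, "test_s": 0.5, "memory_bytes": 8_000_000}
hd_cost = {"train_s": 0.5, "test_s": 0.25, "memory_bytes": 1_000_000}

gains = {key: relative_gain(ngram_cost[key], hd_cost[key]) for key in ngram_cost}
print(gains)
```

With these made-up numbers the train speed-up is 4.0, the test speed-up is 2.0, and the memory reduction is 8.0.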

3.2 Datasets

Four different datasets were used to obtain the empirical results reported in this paper: the Chatbot Corpus (Chatbot), the Ask Ubuntu Corpus (AskUbuntu), the Web Applications Corpus (WebApplication), and the 20 News Groups Corpus (20NewsGroups). The first three are referred to as small datasets. The Chatbot dataset comprises questions posed to a Telegram chatbot, which in turn answered questions about the public transport of Munich. The AskUbuntu and WebApplication datasets are questions and answers from StackExchange. The 20NewsGroups dataset comprises news posts labelled into several categories. All datasets have predetermined train and test splits. The first three datasets (Braun et al., 2017) are available on GitHub under the Creative Commons CC BY-SA 3.0 license: https://github.com/sebischair/NLU-Evaluation-Corpora

Intent Train original Train Augmented Test
Departure Time 43 57 35
Find Connection 57 57 71
Table 1: Data sample distribution for the Chatbot dataset

The Chatbot dataset consists of two intents (Departure Time and Find Connection) with 206 questions. The corpus has a total of five different entity types (StationStart, StationDest, Criterion, Vehicle, Line), which were not used in our benchmarks, as the results were only for intent classification. The samples are in English. The train station names, however, are in German, which is evident from the text where the German letters (ä, ö, ü, ß) appear. Table 1 presents the data sample distribution for the Chatbot dataset.

Intent Train original Train Augmented Test
Make Update 10 17 37
Setup Printer 10 17 13
Shutdown Computer 13 17 14
Software Recommendation 17 17 40
None 3 17 5
Table 2: Data sample distribution for the AskUbuntu dataset

The AskUbuntu dataset comprises five intents: Make Update; Setup Printer; Shutdown Computer; Software Recommendation; None. It includes 162 samples in total (53 original train and 109 test samples, as listed in Table 2). Please refer to Table 2 for its data sample distribution.

The samples were gathered directly from the AskUbuntu platform. Only questions with the highest scores and upvotes were considered. For the task of mapping the correct intent to the question, Amazon Mechanical Turk was employed. Beyond the questions labelled with their intent, this dataset also contains some extra information such as the author, the URL of the page with the question, entities, the answer, and the answer’s author. It is worth noting that none of these data were used in the experiments.

Intent Train original Train Augmented Test
Change Password 2 7 6
Delete Account 7 7 10
Download Video 1 7 0
Export Data 2 7 3
Filter Spam 6 7 14
Find Alternative 7 7 16
Sync Accounts 3 7 6
None 2 7 4
Table 3: Data sample distribution for the WebApplication dataset

The WebApplication dataset comprises text samples of eight different intents: Change Password; Delete Account; Download Video; Export Data; Filter Spam; Find Alternative; Sync Accounts; None. Table 3 presents an overview of data distribution in this corpus.

Categories Train Test
All 20 categories (combined) 11314 7532
Table 4: Data sample distribution for the 20NewsGroups dataset

The 20NewsGroups dataset was originally collected by Ken Lang. It comprises 20 categories (for details please see Table 4). In total, the dataset contains 18846 text samples, which are split into the train (11314 samples) and test (7532 samples) sets. The dataset is available through the scikit-learn library for Python.

4 Methods

4.1 Conventional n-gram statistics

An empty vector $\mathbf{s}$ stores n-gram statistics for an input text $D$. $D$ consists of symbols from the alphabet of size $a$; the $i$th position in $\mathbf{s}$ keeps the counter of the corresponding n-gram $A_i = \langle S_1, S_2, \ldots, S_n \rangle$ from the set $A$ of all unique n-grams; $S_j$ corresponds to a symbol in the $j$th position of $A_i$. The dimensionality of $\mathbf{s}$ equals the total number of n-grams in $A$ and is calculated as $a^n$. Usually, $\mathbf{s}$ is obtained via a single pass through $D$ using an overlapping sliding window of size $n$. The value of a position in $\mathbf{s}$ (i.e., counter) corresponding to an n-gram observed in the current window is incremented by one. In other words, $\mathbf{s}$ summarizes how many times each n-gram in $A$ was observed in $D$.
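The sliding-window counting described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names `ngram_counts` and `dense_dimensionality` are hypothetical, and the counter keeps only the n-grams that actually occur rather than allocating one position per possible n-gram.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count character n-grams with an overlapping sliding window of size n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dense_dimensionality(alphabet_size, n):
    """Size of the full positional vector: one counter per possible n-gram."""
    return alphabet_size ** n


counts = ngram_counts("abcabc", 3)
print(counts)                        # observed trigrams and their counters
print(dense_dimensionality(26, 3))   # positions needed for a 26-letter alphabet
```

For the toy string the window visits four trigrams ("abc" twice, "bca" and "cab" once each), while the full positional vector for a 26-letter alphabet would already need 17576 positions at n = 3.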

4.2 Word Embeddings with Subword Information

Work by Bojanowski et al. (2017) demonstrated that words’ representations can be formed via learning character n-grams, which are then summed up to represent words. This method (FastText) has an advantage over the conventional word embeddings since unseen words can be better approximated, as it is highly likely that some of their n-gram subwords have already appeared in other words. Each word is therefore represented as a bag of its character n-grams. Special boundary symbols “<” and “>” are added at the beginning and the end of each word. The word itself is added to the set of its n-grams, to learn a representation for each word along with character n-grams. Taking the word have and $n = 3$ as an example, the resulting set is {<ha, hav, ave, ve>, <have>}. Formally, for a given word $w$, $G_w$ denotes the set of n-grams appearing in $w$. Each n-gram $g$ has an associated vector representation $\mathbf{z}_g$. Word $w$ is represented as the sum of the vector representations of its n-grams. A scoring function $s$ is defined for each word that is represented as a set of respective n-grams and the context word (denoted as $c$), as:

$$s(w, c) = \sum_{g \in G_w} \mathbf{z}_g^{\top} \mathbf{v}_c$$

where $\mathbf{v}_c$ is the vector representation of the context word $c$. Practically, a word is represented by its index in the word dictionary and a set of n-grams it contains.
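To make the subword construction and the scoring function concrete, here is a small sketch in plain Python. It is not the FastText implementation: the function names `char_ngrams` and `score`, the window size of 3, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>', plus the whole word."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    if marked not in grams:  # avoid a duplicate when the marked word has length n
        grams.append(marked)
    return grams


def score(word, context_vector, ngram_vectors, n=3):
    """Sum of dot products between each n-gram vector and the context vector."""
    return sum(
        sum(z * v for z, v in zip(ngram_vectors[gram], context_vector))
        for gram in char_ngrams(word, n)
    )


grams = char_ngrams("have")
# Toy vectors: every n-gram gets the same hypothetical 2-dimensional vector.
toy_ngram_vectors = {gram: [1.0, 0.0] for gram in grams}
toy_context = [2.0, 3.0]
print(grams)
print(score("have", toy_context, toy_ngram_vectors))
```

Each of the five subwords contributes a dot product of 2.0 here, so the toy score is 10.0; in a trained model the vectors would of course be learned rather than fixed.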

4.3 Byte Pair Encoding

The idea of Byte Pair Encoding (BPE) was introduced in Gage (1994). BPE iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. It can be similarly used to merge characters or character sequences for word representations. A symbol vocabulary is initialized with the character vocabulary, with every word represented as a sequence of characters followed by a special end-of-word symbol. Symbol pairs are counted iteratively, and the most frequent pair is replaced with a new symbol. Each merge operation results in a new symbol, which represents an n-gram. In this way, frequently occurring n-grams are eventually merged into a single symbol. This makes the final vocabulary size equal to the sum of the initial vocabulary size and the number of merge operations.
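The merge loop can be sketched as follows. This is an illustrative toy version, not the BPE tooling used in the experiments: the function names, the word frequencies, and the use of "_" as the end-of-word marker are assumptions. Ties between equally frequent pairs are broken by insertion order here, which a real implementation may handle differently.

```python
from collections import Counter


def initial_vocab(word_freqs, end="_"):
    """Represent each word as a tuple of characters plus an end-of-word marker."""
    return {tuple(word) + (end,): freq for word, freq in word_freqs.items()}


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(vocab, num_merges):
    """Apply up to num_merges merges of the currently most frequent pair."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical word frequencies, for illustration only.
start = initial_vocab({"low": 5, "lower": 2, "newest": 6, "widest": 3})
merges, final = learn_bpe(start, num_merges=3)
initial_symbols = {symbol for word in start for symbol in word}
vocab_size = len(initial_symbols) + len(merges)
print(merges, vocab_size)
```

The last line reflects the relation stated above: the final symbol vocabulary is the initial character vocabulary (11 symbols in this toy corpus, including the marker) plus one new symbol per merge.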

4.4 SubWord Semantic Hashing

Subword Semantic Hashing (SemHash) is described in detail in Shridhar et al. (2019) and Huang et al. (2013). SemHash represents the input sentence in the form of subword tokens using a hashing method that reduces the collision rate. These subword tokens act as features to the model and can be used as an alternative to word/n-gram embeddings. For a given input sample text $T$, e.g., “I have a flying disk”, we split it into a list of words $t_i$. The output of the split would look as follows: [“I”, “have”, “a”, “flying”, “disk”]. Each word is then passed into a prehashing function $\mathcal{H}(t_i)$. $\mathcal{H}(t_i)$ first adds a # at the beginning and at the end of $t_i$. Then it generates subwords via extracting n-grams ($n$ = 3) from #$t_i$#, e.g., $\mathcal{H}(\text{have})$ = [#ha, hav, ave, ve#]. These tri-grams are the subwords denoted as $t_i^j$, where $j$ is the index of a subword. $\mathcal{H}(t_i)$ is then applied to the entire text corpus to generate subwords via n-gram statistics. These subwords are used to extract features for a given text.
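A minimal sketch of the prehashing step is given below. The function names `prehash` and `semhash_tokens` are hypothetical, and the use of "#" as the boundary symbol is an assumption that follows the letter-trigram convention of Huang et al. (2013); whitespace splitting stands in for whatever tokenizer is actually used.

```python
def prehash(word, n=3, boundary="#"):
    """Wrap a word in boundary symbols and extract its overlapping n-grams."""
    marked = f"{boundary}{word}{boundary}"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def semhash_tokens(text, n=3):
    """Split a text on whitespace and concatenate the subwords of every word."""
    return [subword for word in text.split() for subword in prehash(word, n)]


print(prehash("have"))
tokens = semhash_tokens("I have a flying disk")
print(tokens)
```

The resulting subword tokens can then be counted like any other n-grams to build the feature vector for a text.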

4.5 Embedding n-gram statistics into an HD vector

Alphabet’s symbols are the most basic elements of a system. We assign each symbol a random $d$-dimensional bipolar HD vector. These vectors are stored in a matrix (denoted as $\mathbf{H}$, where $\mathbf{H} \in \{-1, +1\}^{d \times a}$), which is referred to as the item memory. For a given symbol $S$, its HD vector is denoted as $\mathbf{H}_S$. To manipulate HD vectors, hyperdimensional computing defines three key operations on them (please see Kanerva (2009) for proper definitions and properties of hyperdimensional computing operations): bundling (denoted with $+$ and implemented via position-wise addition), binding (denoted with $\odot$ and implemented via position-wise multiplication), and permutation (denoted with $\rho$; it is convenient to use $\rho$ to bind a symbol’s HD vector with its position in a sequence). The bundling operation allows storing information in HD vectors (Kleyko et al., 2016); if several copies of any HD vector are included (e.g., $\mathbf{H}_{S_1} + \mathbf{H}_{S_1} + \mathbf{H}_{S_2}$), the resultant HD vector is more similar to the dominating HD vector than to other components. Since the main focus of this paper is on the empirical demonstration of the usefulness of embedding n-gram statistics into HD vectors, it does not go into deep analytical details of why HD vectors allow embedding the conventional n-gram statistics; diligent readers are referred to Frady et al. (2018) for the relevant analysis. It is worth mentioning, however, that intuitively the whole approach works because the embedding is done in such a way that, in the projected high-dimensional space, two similar n-gram statistics (in the original space) still remain very similar.
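The three operations can be illustrated with plain Python lists of +1/-1 values. This is a sketch under stated assumptions: the helper names are hypothetical, the dimensionality of 10000 is arbitrary, and a cyclic shift is used as the permutation, which is one common choice but is not specified in the text above.

```python
import math
import random


def random_hd(d, rng):
    """Random bipolar HD vector with components drawn from {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(d)]


def bundle(*vectors):
    """Bundling: position-wise addition."""
    return [sum(column) for column in zip(*vectors)]


def bind(u, v):
    """Binding: position-wise multiplication."""
    return [x * y for x, y in zip(u, v)]


def permute(v, times=1):
    """Permutation, realised here as a cyclic shift (an assumption)."""
    shift = times % len(v)
    return v[-shift:] + v[:-shift] if shift else list(v)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)
a, b = random_hd(10_000, rng), random_hd(10_000, rng)
dominated = bundle(a, a, b)  # two copies of a, one copy of b

print(cosine(dominated, a), cosine(dominated, b))  # a dominates the bundle
print(cosine(a, b))                                # random vectors: near zero
print(cosine(a, permute(a)))                       # permuted copy: near zero
```

The bundle stays closer to the vector that was added twice, while two random vectors, or a vector and its permuted copy, are nearly orthogonal. Binding with a bipolar vector is its own inverse, so `bind(bind(a, b), b)` recovers `a`.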

The three operations above allow embedding n-gram statistics into a distributed representation (HD vector) (Joshi et al., 2016). First, the item memory $\mathbf{H}$ is generated for the alphabet. The position of symbol $S_j$ in the n-gram $A_i$ is represented by applying $\rho$ to the corresponding HD vector $\mathbf{H}_{S_j}$ $j$ times, which is denoted as $\rho^j(\mathbf{H}_{S_j})$. Next, a single HD vector for $A_i$ (denoted as $\mathbf{m}_{A_i}$) is formed via the consecutive binding of the permuted HD vectors $\rho^j(\mathbf{H}_{S_j})$ representing symbols in each position $j$ of $A_i$. For example, the trigram ‘cba’ will be mapped to its HD vector as follows: $\rho^1(\mathbf{H}_c) \odot \rho^2(\mathbf{H}_b) \odot \rho^3(\mathbf{H}_a)$. In general, the process of forming the HD vector of an n-gram can be formalized as follows:

$$\mathbf{m}_{A_i} = \prod_{j=1}^{n} \rho^j(\mathbf{H}_{S_j}),$$

where $\prod$ denotes the binding operation ($\odot$) when applied to $n$ HD vectors. Once it is known how to get $\mathbf{m}_{A_i}$, embedding the conventional n-gram statistics stored in $\mathbf{s}$ (see Section 4.1) is straightforward. The HD vector $\mathbf{h}$ corresponding to $\mathbf{s}$ is created by bundling together all n-grams observed in the data:

$$\mathbf{h} = \sum_{i=1}^{a^n} s_i \mathbf{m}_{A_i} = \sum_{i=1}^{a^n} s_i \prod_{j=1}^{n} \rho^j(\mathbf{H}_{S_j}),$$

where $\sum$ denotes the bundling operation when applied to several HD vectors and $s_i$ is the $i$th counter in $\mathbf{s}$. Note that $\mathbf{h}$ is not bipolar due to the usage of the bundling operation. In fact, the components in $\mathbf{h}$ will be integers in the range $[-k, k]$, where $k$ is the total number of n-grams observed in the text, but these extreme values are highly unlikely since HD vectors for different n-grams are quasi-orthogonal, which means that in the simplest (but not practical) case when all n-grams have the same probability the expected value of a component in $\mathbf{h}$ is $0$. Also, the use of $\sum$ means that two HD vectors mapping two different n-gram statistics might have very different amplitudes if the numbers of observations in these statistics are very different; therefore, it is convenient to use the cosine similarity between HD vectors as it neglects the amplitude. Since there is no simple way to set a particular metric for a given machine learning algorithm (usually the dot product is used), in the experiments below we have imposed the use of the cosine similarity implicitly by normalizing each $\mathbf{h}$ by its $\ell_2$ norm; thus, all $\mathbf{h}$ had the same norm and their dot product was equivalent to their cosine similarity.
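Putting the pieces together, the whole embedding can be sketched in plain Python as below. This is an illustration under several assumptions, not the code used in the experiments: the function names are hypothetical, a cyclic shift stands in for the permutation, the dimensionality of 4096 and the toy alphabet are arbitrary, and positions are counted from 1 so that the symbol in position j is permuted j times.

```python
import math
import random
from collections import Counter


def item_memory(alphabet, d, seed=0):
    """One random bipolar HD vector per symbol of the alphabet."""
    rng = random.Random(seed)
    return {symbol: [rng.choice((-1, 1)) for _ in range(d)] for symbol in alphabet}


def permute(v, times):
    """Permutation realised as a cyclic shift (an assumption)."""
    shift = times % len(v)
    return v[-shift:] + v[:-shift] if shift else list(v)


def ngram_hd(ngram, memory, d):
    """Bind the permuted symbol vectors; position j is permuted j times."""
    result = [1] * d
    for position, symbol in enumerate(ngram, start=1):
        rotated = permute(memory[symbol], position)
        result = [x * y for x, y in zip(result, rotated)]
    return result


def embed(text, n, memory, d):
    """Bundle the HD vectors of all observed n-grams, then L2-normalise."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    h = [0] * d
    for gram, count in counts.items():
        m = ngram_hd(gram, memory, d)
        h = [x + count * y for x, y in zip(h, m)]
    norm = math.sqrt(sum(x * x for x in h))
    return [x / norm for x in h] if norm else h


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


D = 4096  # arbitrary dimensionality, independent of n
memory = item_memory("abcdefghijklmnopqrstuvwxyz ", D)
h_fox = embed("the quick brown fox", 3, memory, D)
h_fix = embed("the quick brown fix", 3, memory, D)
h_other = embed("zzzz yyyy xxxx wwww", 3, memory, D)

print(dot(h_fox, h_fix))    # texts sharing most trigrams: high similarity
print(dot(h_fox, h_other))  # texts sharing no trigrams: near zero
```

Because the vectors are normalised, the dot product equals the cosine similarity. The length of the embedding is fixed by `D` whatever the value of n, which is the decoupling that makes the tradeoffs studied in this paper possible.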

4.6 Motivation for the chosen baselines

Since the primary claim in this paper is that with HD vectors it is possible to approximate (even accurately) the results obtained with the conventional n-gram statistics, the most proper baseline for classification performance comparison is the conventional n-gram statistics itself. (We do not make any definite statements such as that the n-gram statistics is a superior technique for solving all NLP problems. The only claim is that it is a well-known technique, which is still useful for numerous problems.) It is also worth mentioning that there are methods (see, e.g., Pibiri and Venturini (2019)) for making efficient data structures for storing n-gram statistics. However, such approaches rely on the fact that there are clear regularities when words are used as the basic elements for n-gram statistics. This is not the case when character n-grams are used, as in this study.

In addition to the methods presented above, while designing the evaluation experiments it was considered whether word embeddings such as Word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) should be used as baselines. It was concluded that, from a computational point of view, it would be unfair to neglect the computational resources spent while training these embeddings. On top of this, they require a considerable amount of memory even to keep the learned embedding for each word in the dictionary. Therefore, trainable word embeddings are not part of the baseline, as the resources needed to train them are significantly higher. One exception, however, was made for FastText, which provides trainable subword embeddings. Please see the discussion on this matter at the end of Section 5.2.

When it comes to other well-known methods such as bag of words and TF-IDF, it was decided that since the dimensionality of the input feature equals the number of words in the dictionary, the computational efficiency of both approaches would not be much better than that of the conventional n-gram statistics. This assumption is correct, at least for the small datasets, where the number of unique n-grams is on the order of several thousand. At the same time, it was relevant to observe whether HD vectors could be used to embed bag of words and TF-IDF features; therefore, the experiments on the small datasets were also performed with these methods.

5 Empirical evaluation

5.1 Setup

All datasets were preprocessed using the spaCy library, which was used to remove stop words from the data. We used the spaCy model called “en_core_web_lg” to parse the datasets. Also, all text samples were preprocessed by removing control characters; in particular, the ones in the Unicode category [Cc], which covers the code points U+0000 to U+001F and U+007F to U+009F. It is also worth noting that the realization of the conventional n-gram statistics used in the experiments formed a model that stored only the n-grams present in the train split.
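Removing the characters of the Cc category can be sketched with Python's standard `unicodedata` module. The function name `strip_control_chars` is hypothetical, and this is only one plausible way to realise the step described above.

```python
import unicodedata


def strip_control_chars(text):
    """Drop every character whose Unicode general category is Cc (control)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


print(strip_control_chars("a\x00b\tc\x9fd"))
print(strip_control_chars("Münchner Straße"))
```

Note that tabs and newlines also belong to the Cc category, so with this sketch they are removed outright rather than replaced by spaces, whereas letters such as ä, ö, ü, and ß are left untouched.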

Since the 20NewsGroups dataset is already large, it does not seem necessary to apply SemHash to it; therefore, it was omitted in the experiments (i.e., SH in Table 8 refers to pure n-grams). Last, the small datasets were augmented so that all smaller classes have the same number of samples as the largest class in the train split of that dataset. Using WordNet as a dictionary, nouns and verbs were swapped with their synonyms, creating new sentences until all the classes of that dataset had the same number of samples. The final distributions are shown in Tables 1-3.
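The balancing idea can be sketched as follows. This is a simplified illustration, not the augmentation code used for the experiments: the function name `augment_to_balance` is hypothetical, a small hand-written synonym dictionary stands in for WordNet, every listed word is eligible for swapping regardless of its part of speech, and the sketch does not check that a generated sentence differs from existing ones.

```python
import random
from collections import Counter


def augment_to_balance(samples, synonyms, seed=0):
    """Add synonym-swapped copies until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    target = max(len(texts) for texts in by_label.values())

    augmented = list(samples)
    for label, texts in by_label.items():
        for _ in range(target - len(texts)):
            words = rng.choice(texts).split()
            swapped = [rng.choice(synonyms[w]) if w in synonyms else w for w in words]
            augmented.append((" ".join(swapped), label))
    return augmented


# Hypothetical samples and synonyms, for illustration only.
samples = [
    ("find a bus to town", "FindConnection"),
    ("find a train to town", "FindConnection"),
    ("find a tram home", "FindConnection"),
    ("when does the bus leave", "DepartureTime"),
]
synonyms = {"bus": ["coach"], "leave": ["depart", "go"]}

balanced = augment_to_balance(samples, synonyms)
print(Counter(label for _, label in balanced))
```

After the call both classes contain three samples, mirroring the equal "Train Augmented" counts reported for each dataset.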

For BPE, a vocabulary size of was used for the WebApplication and AskUbuntu datasets, whereas a vocabulary size of was used for the Chatbot dataset due to its smaller size. An n-gram range of () was used with the analyzer set to char. Cross-validation was set to .

For FastText, autotune validation was used to find the optimal hyperparameters for all the datasets. No quantization of the model was performed, to avoid compromising model accuracy.

To find optimal hyperparameters, a grid-based search was applied to the three small datasets for the following classifiers: MLP, Random Forest, and KNN. The configuration performing best across all small datasets was used to obtain the results reported in the paper. Moreover, the same configuration was used for the 20NewsGroups dataset.

In the case of MLP, four different configurations of hidden layers were considered: (100, 50), (300, 100), (300, 200, 100), and (300, 100, 50); the (300, 100, 50) configuration was chosen. The maximal number of MLP iterations was set to 500. In the case of Random Forest, two hyperparameters were optimized: the number of estimators ([50, 60, 70]) and the minimum samples per leaf ([1, 11]); we used 50 estimators and a minimum of 1 sample per leaf. In the case of KNN, the number of neighbors between 3 and 7 was considered; 3 neighbors were used in the experiments. For all the other classifiers, the default hyperparameter settings provided by the scikit-learn library were used.

The range of n in the experiments with small datasets was while for the 20NewsGroups dataset it was since the number of possible n-grams was overwhelming. All results reported for small datasets were obtained by averaging across independent simulations. In the case of the 20NewsGroups dataset, the number of simulations was decreased to due to high computational costs. To have a fair comparison of computational resources, all results for small datasets were obtained on a dedicated laptop without involving GPUs, while the results for the 20NewsGroups dataset were obtained with a computing cluster (CPU only) without the intervention of other users.

Figure 1: MLP results vs. the dimensionality of HD vectors on: (a) the AskUbuntu dataset. (b) the Chatbot dataset. (c) the WebApplication dataset. (d) the 20NewsGroups dataset.

5.2 Results

First, we report the results of the MLP classifier on all datasets, as it represents a widely used class of algorithms: neural networks. The goal of the experiments was to observe how the dimensionality of HD vectors embedding n-gram statistics affects the F1 scores and the computational resources. Figures 1(a)-1(d) present the results for the AskUbuntu, Chatbot, WebApplication, and 20NewsGroups datasets, respectively. The dimensionality of HD vectors varied as , . All figures have an identical structure. Shaded areas depict % confidence intervals. Left panels depict the F1 score while right panels depict the train and test speed-ups as well as memory reduction. Note that there are different scales (y-axes) in the right panels. A solid horizontal line indicates 1 for the corresponding y-axis, i.e., the moment when both models consume the same resources.

The results in all figures are consistent in that, up to a certain point, the F1 score increased with increasing dimensionality. For the small datasets even small dimensionalities of HD vectors (e.g., ) led to F1 scores that are far beyond random. For example, for the AskUbuntu dataset, it was  % of the conventional n-gram statistics F1 score. For the values above the performance saturation begins. Moreover, the improvements beyond are marginal. The situation is more complicated for the 20NewsGroups dataset, where for -dimensional HD vectors the F1 score is fairly low though still better than a random guess (). However, it increases steeply until and achieves its maximum at being  % of the conventional n-gram statistics F1 score. The dimensionalities above showed worse results.

F1 score Resources: SH vs. HD Resources: SH vs. BPE
Classifier SH BPE HD Tr. Ts. Mem. Tr. Ts. Mem.
MLP 0.92 0.91 0.91 4.62 3.84 6.18 1.67 1.61 1.72
Passive Aggr. 0.92 0.93 0.90 4.86 3.07 6.31 2.19 2.14 1.76
SGD Classifier 0.89 0.89 0.88 4.66 3.50 6.31 1.94 2.16 1.76
Ridge Classifier 0.90 0.91 0.90 3.91 4.74 6.31 1.63 1.62 1.76
KNN Classifier 0.79 0.72 0.82 2.11 4.53 8.48 1.56 1.79 1.76
Nearest Centroid 0.90 0.89 0.90 1.66 3.41 6.32 1.35 1.87 1.76
Linear SVC 0.90 0.92 0.90 1.18 2.39 6.29 0.91 1.91 1.76
Random Forest 0.88 0.90 0.86 0.91 1.09 6.11 1.15 0.96 1.75
Bernoulli NB 0.91 0.92 0.85 2.30 3.72 6.34 1.96 2.42 1.76
Table 5: Performance of all classifiers for the AskUbuntu dataset.

F1 score Resources: SH vs. HD Resources: SH vs. BPE
Classifier SH BPE HD Tr. Ts. Mem. Tr. Ts. Mem.
MLP 0.96 0.94 0.96 3.42 2.62 4.58 1.86 1.52 1.86
Passive Aggr. 0.95 0.91 0.94 4.40 2.38 4.72 2.29 2.22 1.92
SGD Classifier 0.93 0.93 0.92 3.16 2.06 4.72 1.88 1.84 1.92
Ridge Classifier 0.94 0.94 0.92 2.88 2.22 4.72 1.67 1.38 1.92
KNN Classifier 0.75 0.71 0.83 1.66 3.59 6.51 1.43 1.79 1.92
Nearest Centroid 0.89 0.94 0.84 1.41 2.13 4.73 1.17 1.61 1.92
Linear SVC 0.94 0.93 0.94 0.52 1.57 4.72 1.28 1.66 1.92
Random Forest 0.95 0.95 0.91 0.95 1.10 4.61 1.16 0.98 1.91
Bernoulli NB 0.93 0.93 0.82 1.92 2.60 4.73 1.53 1.72 1.92
Table 6: Performance of all classifiers for the Chatbot dataset.

When it comes to computational resources, there is a similar pattern for all the datasets. The train/test speed-ups and memory reduction diminish with the increased dimensionality of HD vectors. At the point when the dimensionality of HD vectors equals the size of the conventional n-gram statistics, both approaches consume approximately the same resources. These points in the figures are different because the datasets have different sizes of n-gram statistics: , , , and , for the AskUbuntu, Chatbot, WebApplication, and 20NewsGroups datasets, respectively. Also, for all datasets, the memory reduction is higher than the speed-ups. The most impressive speed-ups and reductions were observed for the 20NewsGroups dataset (e.g., times less memory for -dimensional HD vectors). This is because, owing to its large size, it contains a huge number of n-grams, resulting in a large size of the n-gram statistics. Nevertheless, even for small datasets, the gains were noticeable. For instance, for the WebApplication dataset at the F1 score was  % of the conventional n-gram statistics while the train/test speed-ups and the memory reduction were , , and , respectively.

Thus, these empirical results suggest that the quality of embedding w.r.t. the achievable F1 score improves with increased dimensionality; however, after a certain saturation or peak point, increasing the dimensionality further either does not affect or worsens the classification performance and arguably becomes impractical when considering the computational resources.

F1 score Resources: SH vs. HD Resources: SH vs. BPE
Classifier SH BPE HD Tr. Ts. Mem. Tr. Ts. Mem.
MLP 0.77 0.77 0.79 3.10 2.00 4.43 1.74 1.44 1.73
Passive Aggr. 0.82 0.80 0.80 3.73 1.45 4.33 1.86 1.32 1.75
SGD Classifier 0.75 0.74 0.73 3.01 1.87 4.33 1.62 1.32 1.75
Ridge Classifier 0.79 0.80 0.80 1.66 2.40 4.34 0.71 1.09 1.75
KNN Classifier 0.72 0.75 0.76 1.16 2.76 5.96 1.14 1.51 1.76
Nearest Centroid 0.74 0.73 0.77 1.42 1.79 4.34 1.13 1.21 1.75
Linear SVC 0.82 0.80 0.80 1.04 1.48 4.29 0.47 1.18 1.75
Random Forest 0.87 0.85 0.72 0.95 1.26 4.11 1.05 1.12 1.73
Bernoulli NB 0.74 0.75 0.64 1.51 2.08 4.38 1.19 1.49 1.75
Table 7: Performance of all classifiers for the WebApplication dataset.
score Resources: SH vs. HD
Classifier SH HD Train speed-up Test speed-up Memory reduction
MLP 0.72 0.64 53.23 79.50 93.19
Passive Aggr. 0.74 0.69 103.64 202.95 93.42
SGD Classifier 0.70 0.66 105.43 186.31 93.42
Ridge Classifier 0.16 0.71 45.46 338.01 93.42
KNN Classifier 0.31 0.31 184.47 65.87 127.54
Nearest Centroid 0.08 0.15 212.75 254.74 93.42
Linear SVC 0.75 0.69 5.11 176.62 93.42
Random Forest 0.58 0.26 4.27 21.43 93.41
Bernoulli NB 0.60 0.15 57.72 56.54 93.42
Table 8: Performance of all classifiers for the 20NewsGroups dataset.

Tables 5-8 report the results for all datasets when applying all the considered classifiers. (Footnote 6: The notations Tr., Ts., and Mem. in the tables stand for the train speed-up, test speed-up, and the memory reduction for the given classifier, respectively. SH stands for SemHash.) For the sake of brevity, only a fixed dimensionality of HD vectors is reported:  for small datasets in Tables 5-7 and  for the 20NewsGroups dataset in Table 8. These dimensionalities were chosen based on the results in Figures 0(a)-0(d) as the ones that achieve a good approximation of the F1 score while providing a substantial speed-up/reduction. We also performed experiments using the BPE instead of the SemHash before extracting n-gram statistics. (Footnote 7: Note that Table 8 does not report the results for the BPE. This is purely due to the high computational costs required to obtain the BPE model and vocabulary for this dataset.) Throughout the tables, the BPE demonstrated F1 scores comparable to those of the SemHash while showing train/test speed-ups and memory reduction of about  times. This is because the usage of the BPE resulted in smaller sizes of the n-gram statistics, which were , , and  for the AskUbuntu, Chatbot, and WebApplication datasets, respectively.

In the case of HD vectors, the picture is less coherent. For example, there is a group of classifiers (e.g., MLP, SGD, KNN) for which the F1 scores are well approximated (or even improved) while achieving noticeable computational reductions. In the case of Linear SVC, the F1 scores are well preserved and there is  times memory reduction, but the train/test speed-ups are marginal (training is even slower for the Chatbot dataset). This is because the Linear SVC implementation benefits from sparse representations (conventional n-gram statistics), while the HD vectors in this study are dense. Last, for Bernoulli NB and Random Forest the F1 scores were not approximated well (cf.  vs.  for Bernoulli NB in the case of the Chatbot dataset). This is likely because both classifiers rely on local information contained in individual features, which is not available in HD vectors, where information is distributed across the whole vector. The slow training of Random Forest is likely because, in the absence of well-separable features, it tries to construct large trees.

Due to the difference in the implementation (the official implementation of FastText only uses a linear classifier), we were not able to properly compare the computational resources with FastText. (Footnote 8: We could have implemented the algorithm ourselves, but it could be claimed unfair to compare the required memory and time if we do not use the best practices, which are unknown to us.) However, we obtained the following F1 scores with automatic hyperparameter search: , ,  for the AskUbuntu, Chatbot, and WebApplication datasets, respectively. These results indicate that for the considered datasets there is no drastic improvement in classification performance (it is even worse for the WebApplication dataset) when using the learned representations of n-grams.

| Classifier | F1 (TF) | F1 (TF-IDF) | F1 (HD) | Tr. (TF vs. HD) | Ts. (TF vs. HD) | Mem. (TF vs. HD) | Tr. (TF-IDF vs. HD) | Ts. (TF-IDF vs. HD) | Mem. (TF-IDF vs. HD) |
|---|---|---|---|---|---|---|---|---|---|
| MLP | 0.91 | 0.90 | 0.90 | 1.97 | 1.41 | 3.45 | 2.31 | 1.76 | 3.40 |
| Passive Aggr. | 0.93 | 0.93 | 0.90 | 3.58 | 2.15 | 3.48 | 3.57 | 2.51 | 3.50 |
| SGD Classifier | 0.90 | 0.89 | 0.86 | 3.81 | 4.25 | 3.48 | 3.32 | 3.98 | 3.50 |
| Ridge Classifier | 0.92 | 0.92 | 0.91 | 2.35 | 4.09 | 3.48 | 2.70 | 4.86 | 3.50 |
| KNN Classifier | 0.68 | 0.68 | 0.81 | 2.37 | 2.78 | 4.67 | 2.63 | 2.88 | 4.70 |
| Nearest Centroid | 0.88 | 0.86 | 0.89 | 2.77 | 3.50 | 3.48 | 2.63 | 3.56 | 3.50 |
| Linear SVC | 0.94 | 0.93 | 0.91 | 2.07 | 2.54 | 3.47 | 1.93 | 2.81 | 3.49 |
| Random Forest | 0.89 | 0.88 | 0.84 | 0.87 | 1.08 | 3.38 | 0.93 | 1.11 | 3.40 |
| Bernoulli NB | 0.92 | 0.92 | 0.84 | 2.44 | 2.88 | 3.71 | 2.81 | 2.80 | 3.72 |

Table 9: Performance of all classifiers for the AskUbuntu dataset with TF-IDF.

| Classifier | F1 (TF) | F1 (TF-IDF) | F1 (HD) | Tr. (TF vs. HD) | Ts. (TF vs. HD) | Mem. (TF vs. HD) | Tr. (TF-IDF vs. HD) | Ts. (TF-IDF vs. HD) | Mem. (TF-IDF vs. HD) |
|---|---|---|---|---|---|---|---|---|---|
| MLP | 0.95 | 0.95 | 0.96 | 2.02 | 1.36 | 2.63 | 2.15 | 1.64 | 2.64 |
| Passive Aggr. | 0.92 | 0.91 | 0.93 | 2.61 | 1.38 | 2.63 | 2.93 | 2.75 | 2.58 |
| SGD Classifier | 0.92 | 0.92 | 0.91 | 2.84 | 1.79 | 2.63 | 3.84 | 4.56 | 2.58 |
| Ridge Classifier | 0.94 | 0.96 | 0.90 | 1.70 | 1.97 | 2.63 | 2.45 | 3.25 | 2.58 |
| KNN Classifier | 0.64 | 0.71 | 0.79 | 2.25 | 2.26 | 3.63 | 2.07 | 2.56 | 3.56 |
| Nearest Centroid | 0.95 | 0.94 | 0.84 | 2.05 | 2.29 | 2.63 | 2.49 | 5.60 | 2.58 |
| Linear SVC | 0.93 | 0.93 | 0.93 | 1.23 | 4.63 | 2.63 | 1.00 | 2.42 | 2.58 |
| Random Forest | 0.92 | 0.93 | 0.89 | 0.90 | 1.07 | 2.58 | 0.92 | 1.05 | 2.53 |
| Bernoulli NB | 0.89 | 0.89 | 0.84 | 2.27 | 1.99 | 2.65 | 2.07 | 1.84 | 2.62 |

Table 10: Performance of all classifiers for the Chatbot dataset with TF-IDF.

| Classifier | F1 (TF) | F1 (TF-IDF) | F1 (HD) | Tr. (TF vs. HD) | Ts. (TF vs. HD) | Mem. (TF vs. HD) | Tr. (TF-IDF vs. HD) | Ts. (TF-IDF vs. HD) | Mem. (TF-IDF vs. HD) |
|---|---|---|---|---|---|---|---|---|---|
| MLP | 0.76 | 0.76 | 0.79 | 1.94 | 1.49 | 2.50 | 1.82 | 1.61 | 2.50 |
| Passive Aggr. | 0.79 | 0.78 | 0.80 | 2.48 | 1.98 | 2.47 | 2.50 | 3.37 | 2.35 |
| SGD Classifier | 0.77 | 0.77 | 0.75 | 2.61 | 2.84 | 2.47 | 1.32 | 1.47 | 2.35 |
| Ridge Classifier | 0.79 | 0.79 | 0.80 | 2.28 | 2.60 | 2.47 | 1.91 | 2.10 | 2.35 |
| KNN Classifier | 0.76 | 0.75 | 0.76 | 1.38 | 1.82 | 3.36 | 1.18 | 1.78 | 3.19 |
| Nearest Centroid | 0.75 | 0.75 | 0.76 | 1.39 | 1.58 | 2.47 | 1.50 | 2.21 | 2.35 |
| Linear SVC | 0.81 | 0.79 | 0.80 | 2.19 | 1.55 | 2.45 | 2.35 | 1.09 | 2.33 |
| Random Forest | 0.85 | 0.85 | 0.72 | 0.89 | 1.03 | 2.37 | 0.91 | 1.08 | 2.25 |
| Bernoulli NB | 0.79 | 0.79 | 0.64 | 2.24 | 1.81 | 0 | 2.41 | 2.14 | 2.35 |

Table 11: Performance of all classifiers for the WebApplication dataset with TF-IDF.

Tables 9-11 report the results for the small datasets when applying all the considered classifiers to the features extracted with bag of words (denoted as TF) and TF-IDF. In these experiments, the input to the classifiers was either the features extracted by these methods or HD vectors ( ) embedding these features. With respect to the compromise in terms of resources, the classifiers performed similarly to the previous experiment, with the difference that the typical speed-up and memory reduction for HD vectors were about three times. When it comes to F1 scores, the results are consistent with the original motivation for the SemHash method, which argued that subword representations yield better performance than word-based representations, at least for small datasets, due to the limited amount of training data.

| Platform | Chatbot | AskUbuntu | WebApp | Average |
|---|---|---|---|---|
| Botfuel | 0.98 | 0.90 | 0.80 | 0.89 |
| Luis | 0.98 | 0.90 | 0.81 | 0.90 |
| Dialogflow | 0.93 | 0.85 | 0.80 | 0.86 |
| Watson | 0.97 | 0.92 | 0.83 | 0.91 |
| Rasa | 0.98 | 0.86 | 0.74 | 0.86 |
| Snips | 0.96 | 0.83 | 0.78 | 0.86 |
| Recast | 0.99 | 0.86 | 0.75 | 0.87 |
| TildeCNN | 0.99 | 0.92 | 0.81 | 0.91 |
| FastText | 0.97 | 0.91 | 0.76 | 0.88 |
| SemHash | 0.96 | 0.92 | 0.87 | 0.92 |
| BPE | 0.95 | 0.93 | 0.85 | 0.91 |
| HD vectors | 0.97 | 0.92 | 0.82 | 0.90 |

Table 12: F1 score comparison of various platforms on the three smaller datasets with the methods mentioned in the paper. Some results are taken from Shridhar et al. (2019).

Finally, for the small datasets, Table 12 places the results reported here in the context of the results obtained in Shridhar et al. (2019). One thing to note in Table 12 is that the F1 scores of the SemHash approach differ from the ones reported in Shridhar et al. (2019) for all three small datasets. That paper used several data augmentation techniques, most prominently a QWERTY-based word augmentation accounting for spelling mistakes. This technique was not used in this work, which resulted in a slight difference in the obtained F1 scores.

6 Discussion and conclusions

The first observation is that the results on the 20NewsGroups dataset are not state-of-the-art, which is currently an F1 score of  achieved with the BERT model as reported in Mahabal et al. (2019). Nevertheless, it is important to keep in mind that the main goal of the experiments with the 20NewsGroups dataset was to demonstrate that n-gram statistics embedded into HD vectors allow obtaining the tradeoff even for a large text corpus. We even observed that for large datasets the usage of HD vectors is likely to provide the best gains in terms of resource efficiency. Moreover, the gains on the small datasets were also noteworthy (several times). Thus, based on these observations, we conclude that HyperEmbed would be a very useful feature in the standard ML libraries. A more general conclusion is that it is worth revisiting results in the area of random projection (Rachkovskij, 2016), as they are likely to allow achieving a performance/resources tradeoff in a range of NLP scenarios (see, e.g., Nunes and Antunes (2018) for one such example).

As stated in Section 3.1, the speed-ups reported above did not include the time for forming HD vectors. The main reason is that our Python-based implementation of the method was quite inefficient, especially the cyclic shifts implemented with numpy.roll. At the same time, as can be seen from the formulation of the embedding method in Section 4.5, its complexity is linear and depends on  as well as on the length of the sample text; thus, a fast implementation is doable. We made a proof-of-concept implementation in Matlab, which is much faster. For example, for the AskUbuntu dataset, forming -dimensional HD vectors of the train split (on the same machine) took about  % of the MLP training time, which is a positive result.
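To make the role of the cyclic shifts concrete, the following is a minimal Python sketch of one common way to embed n-gram statistics into an HD vector. It is not the implementation used in the experiments. It assumes random bipolar (+1/-1) HD vectors per symbol, position encoding by a cyclic shift of `pos + 1` places, binding by elementwise multiplication, and bundling by summation. The function names `make_item_memory` and `embed_ngram_statistics` are hypothetical. The shifted copies are precomputed once per symbol and position, so `numpy.roll` is not called inside the loop over the text.

```python
import numpy as np


def make_item_memory(alphabet, dim, seed=0):
    """Assign one random bipolar (+1/-1) HD vector to every symbol (assumed scheme)."""
    rng = np.random.default_rng(seed)
    return {sym: rng.integers(0, 2, size=dim) * 2 - 1 for sym in alphabet}


def embed_ngram_statistics(tokens, item_memory, n):
    """Embed the n-gram statistics of `tokens` into a single HD vector.

    Each n-gram is formed by multiplying the position-shifted HD vectors of
    its symbols; the HD vectors of all n-grams in the text are then summed.
    """
    dim = len(next(iter(item_memory.values())))
    # Precompute the shifted copy of every symbol for every position in an
    # n-gram, so np.roll runs n * |alphabet| times instead of once per token.
    shifted = [
        {sym: np.roll(vec, pos + 1) for sym, vec in item_memory.items()}
        for pos in range(n)
    ]
    hd = np.zeros(dim, dtype=np.int64)
    for start in range(len(tokens) - n + 1):
        ngram_hd = np.ones(dim, dtype=np.int64)
        for pos in range(n):
            ngram_hd *= shifted[pos][tokens[start + pos]]  # binding
        hd += ngram_hd  # bundling
    return hd


# Usage: the output size is `dim`, regardless of n or the alphabet size.
item_memory = make_item_memory("abc", dim=64, seed=1)
text_hd = embed_ngram_statistics(list("abcab"), item_memory, n=2)
```

In this sketch the cost of the main loop grows linearly with the length of the text, and the size of the result is fixed by `dim` alone, which is the decoupling between the dimensionality of the representation and the parameters of the n-gram statistics that the paper exploits.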

Despite the demonstrated tradeoffs between the F1 score and the computational resources, it is extremely hard to have an objective function that would tell us when the compromise is acceptable and when it is not. In our opinion, a general solution would be to define a utility function that assigns a certain cost to both a unit of performance (e.g., increase in F1 score) and a unit of computation (e.g., % decrease in the inference time). The use of the utility function would allow deciding whether an alternative solution, which is, e.g., faster but less accurate, is better than the existing one or not. However, the main challenge here would be to define such a utility function, since it would have to be defined for each particular application. Moreover, defining such functions even for the considered classification problems is out of the scope of this study. Nevertheless, we believe that it is the way forward to get an objective comparison criterion.
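As a purely illustrative sketch of the idea, the Python snippet below shows what one such utility function could look like under a strong assumption: that the value of a solution is linear in the F1 points gained and in the percentage of inference time saved. The names `Solution` and `utility`, the two weights, and the numbers in the usage example are hypothetical and are not taken from the experiments; a real application would have to choose its own functional form and weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """Measured quality and cost of one classifier/representation pair."""
    f1: float
    inference_seconds: float


def utility(candidate, baseline, value_per_f1_point, value_per_percent_faster):
    """Hypothetical linear utility of replacing `baseline` with `candidate`.

    A positive result means the candidate is preferred under the given
    weights; a negative result means the baseline should be kept.
    """
    f1_points_gained = (candidate.f1 - baseline.f1) * 100.0
    percent_faster = (
        1.0 - candidate.inference_seconds / baseline.inference_seconds
    ) * 100.0
    return (
        value_per_f1_point * f1_points_gained
        + value_per_percent_faster * percent_faster
    )


# Usage with made-up numbers: a candidate that loses 2 F1 points but
# halves the inference time.
baseline = Solution(f1=0.80, inference_seconds=2.0)
candidate = Solution(f1=0.78, inference_seconds=1.0)
speed_oriented = utility(candidate, baseline, 1.0, 0.1)     # positive: accept
accuracy_oriented = utility(candidate, baseline, 5.0, 0.1)  # negative: reject
```

The same candidate is accepted under one set of weights and rejected under another, which illustrates why such a function has to be defined per application.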


References

  • AI HLEG (2019) High-Level Expert Group on Artificial Intelligence. Ethics Guidelines for Trustworthy AI. Cited by: §1.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §4.2.
  • D. Braun, A. Hernandez-Mendez, F. Matthes, and M. Langen (2017) Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. In Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 174–185. Cited by: §1, §3.2.
  • C. Eliasmith (2013) How to Build a Brain. Oxford University Press. Cited by: §2.
  • T. Fawcett (2006) An Introduction to ROC Analysis. Pattern Recognition Letters 27, pp. 861–874. Cited by: §3.1.
  • E.P. Frady, D. Kleyko, and F.T. Sommer (2018) A Theory of Sequence Indexing and Working Memory in Recurrent Neural Networks. Neural Computation 30, pp. 1449–1513. Cited by: §4.5.
  • P. Gage (1994) A New Algorithm for Data Compression. The C Users Journal 12 (2), pp. 23–38. Cited by: §2, §4.3.
  • G.E. Hinton, J.L. McClelland, and D.E. Rumelhart (1986) Distributed Representations. In Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 1. Foundations, D.E. Rumelhart and J.L. McClelland (Eds.), pp. 77–109. Cited by: §2.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In ACM International Conference on Information and Knowledge Management (CIKM), pp. 2333–2338. Cited by: §4.4.
  • A. Joshi, J.T. Halseth, and P. Kanerva (2016) Language Geometry Using Random Indexing. In Quantum Interaction (QI), pp. 265–274. Cited by: §1, §2, §4.5.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of Tricks for Efficient Text Classification. arXiv:1607.01759. External Links: 1607.01759 Cited by: §2.
  • P. Kanerva (2009) Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors. Cognitive Computation 1 (2), pp. 139–159. Cited by: §1, §2, footnote 3.
  • D. Kleyko, E. Osipov, A. Senior, A. I. Khan, and Y. A. Şekercioğlu (2016) Holographic Graph Neuron: A Bioinspired Architecture for Pattern Processing. IEEE Transactions on Neural Networks and Learning Systems 28 (6), pp. 1250–1262. Cited by: §4.5.
  • D. Kleyko, E. Osipov, D. D. Silva, U. Wiklund, V. Vyatkin, and D. Alahakoon (2019) Distributed Representation of n-gram Statistics for Boosting Self-Organizing Maps with Hyperdimensional Computing. In International Andrei Ershov Memorial Conference on Perspectives of System Informatics (PSI), Lecture Notes in Computer Science, Vol. 11964, pp. 64–79. Cited by: §2.
  • T. Kohonen (2001) Self-Organizing Maps. Springer Series in Information Sciences. Cited by: §2.
  • A. Mahabal, J. Baldridge, B. K. Ayan, V. Perot, and D. Roth (2019) Text Classification with Few Examples using Controlled Generalization. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 3158–3167. Cited by: §6.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. Cited by: §2, §4.6.
  • F.R. Najafabadi, A. Rahimi, P. Kanerva, and J.M. Rabaey (2016) Hyperdimensional Computing for Text Classification. In Design, Automation and Test in Europe Conference (DATE), pp. 1–1. Cited by: §2.
  • D. Nunes and L. Antunes (2018) Neural Random Projections for Language Modelling. arXiv:1807.00930, pp. 1–15. Cited by: §6.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: Global Vectors for Word Representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Vol. 14, pp. 1532–1543. Cited by: §2, §4.6.
  • G. E. Pibiri and R. Venturini (2019) Handling Massive N-Gram Datasets Efficiently. ACM Transactions on Information Systems 37 (2), pp. 25:1–25:41. Cited by: §4.6.
  • T. A. Plate (2003) Holographic Reduced Representations: Distributed Representation for Cognitive Structures. Stanford: Center for the Study of Language and Information (CSLI). Cited by: §2.
  • D.A. Rachkovskij (2016) Real-Valued Embeddings and Sketches for Fast Distance and Similarity Estimation. Cybernetics and Systems Analysis 52 (6), pp. 967–988. Cited by: §6.
  • A. Rahimi, P. Kanerva, and J.M. Rabaey (2016) A Robust and Energy Efficient Classifier Using Brain-Inspired Hyperdimensional Computing. In IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 64–69. Cited by: §2.
  • K. Shridhar, A. Dash, A. Sahu, G. G. Pihlgren, P. Alonso, V. Pondenkandath, G. Kovacs, F. Simistira, and M. Liwicki (2019) Subword Semantic Hashing for Intent Classification on Small Datasets. In International Joint Conference on Neural Networks (IJCNN), pp. 1–6. Cited by: §2, §4.4, §5.2, Table 12.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and Policy Considerations for Deep Learning in NLP. In 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3645–3650. Cited by: §1.