Words are basic units in many natural language processing (NLP) applications, e.g., translation and text classification. Understanding words is crucial but can be very challenging. One difficulty lies in the large vocabulary commonly seen in applications. Moreover, their semantic permutations can be numerous, constituting rich expressions at the sentence and paragraph levels.
In statistical language models, word distributions are learned for unigrams, bigrams, and generally n-grams. A unigram distribution presents the probability for each word. The histogram is already sufficiently complex given a large vocabulary. Then, the complexity of bigram distributions is quadratic in the vocabulary size and that of n-gram ones is exponential in n. This combinatorial explosion motivates researchers to develop alternative representations.
Instead of word distributions, continuous representations with floating-point vectors are much more convenient to handle: they are differentiable, and their differences can be used to draw semantic analogies. A variety of algorithms have been proposed over the years for learning these word vectors. Two representative ones are Word2Vec and GloVe. Word2Vec is a classical algorithm based on either skip-grams or a bag of words, both of which are unsupervised and can directly learn word embeddings from a given corpus. GloVe is another embedding learning algorithm, which combines the advantage of a global factorization of the word co-occurrence matrix with that of the local context. Both approaches are effective in many NLP applications, including word analogy and named entity recognition.
Neural networks with word embeddings are frequently used in solving NLP problems, such as sentiment analysis and named entity recognition. An advantage of word embeddings is that interactions between words may be modeled by using neural network layers (e.g., attention architectures).
Despite the success of these word embeddings, they often constitute a substantial portion of the overall model. For example, the pre-trained Word2Vec contains 3M word vectors and the storage is approximately 3 GB. This cost becomes a bottleneck in deployment on resource-constrained platforms.
Thus, much work studies the compression of word embeddings. One approach represents word vectors by using multiple codebooks trained with Gumbel-softmax. Another learns binary document embeddings via a bag-of-words-like process. The learned vectors are demonstrated to be effective for document retrieval.
In information retrieval, iterative quantization (ITQ) transforms vectors into binary ones, which are found to be successful in image retrieval. The method maximizes the bit variance while minimizing the quantization loss. It is theoretically sound and also computationally efficient. However, prior work finds that directly applying ITQ in NLP tasks may not be effective.
Prior work also proposes an alternative approach that improves the quality of word embeddings without incurring extra training. The main idea lies in the concept of isotropy, which is used to explain the success of pointwise mutual information (PMI) based embeddings. The authors demonstrate that the isotropy could be improved through projecting embedding vectors toward weak directions.
Therefore, in this work we propose isotropic iterative quantization (IIQ), which leverages iterative quantization while satisfying the isotropic property. The main idea is to optimize a new objective function regarding the isotropy of word embeddings, rather than maximizing the bit variance.
Maximizing the bit variance and maximizing isotropy are two opposite ideas, because the former projects toward the directions with the largest eigenvalues (dominant directions) while the latter projects toward those with the smallest ones (weak directions). Given prior success, we argue that maximizing isotropy is more beneficial in NLP applications.
2 Related Work
In information retrieval (which inspired the proposed method), locality-sensitive hashing (LSH) is well studied and explored. The aim of LSH is to preserve the similarity between inputs after hashing. This aim is well aligned with that of embedding compression. For example, word similarity can be measured by the cosine distance between embeddings. If LSH is applied, the hashed embeddings should approximately preserve the original cosine distances while having much lower storage and computation complexity.
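To make the similarity-preserving aim concrete, the following is a minimal sketch of one classical LSH scheme, random-hyperplane sign hashing. It is an illustration only and is not the method proposed in this paper; the function names, the toy vectors, and the number of bits are assumptions. The fraction of differing bits estimates the angle between two vectors, from which the cosine can be recovered.

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_hash(vectors, planes):
    """Hash real vectors to bits: 1 where the projection is non-negative."""
    return (vectors @ planes >= 0).astype(np.uint8)

def hamming_cosine(bits_a, bits_b):
    """Estimate cosine similarity from the fraction of differing bits."""
    frac_diff = np.mean(bits_a != bits_b)
    return np.cos(np.pi * frac_diff)

d, n_bits = 50, 4096                       # toy sizes (assumption)
planes = rng.standard_normal((d, n_bits))  # one random hyperplane per bit

a = rng.standard_normal(d)
b = a + 0.5 * rng.standard_normal(d)       # a noisy neighbour of a

true_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
est_cos = hamming_cosine(sign_hash(a, planes), sign_hash(b, planes))
print(f"cosine={true_cos:.3f}  estimate from bits={est_cos:.3f}")
```

Random hyperplanes need many bits for a tight estimate, which is one motivation for learned projections such as ITQ.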
A well-known LSH method in image retrieval is ITQ. However, its application in NLP tasks such as document retrieval is not as successful. Instead, binary paragraph embeddings can be learned via a bag-of-words-like model, which essentially computes a binary hash function for the real-valued embedding vectors.
On the other hand, a compact structure for embeddings can be learned by using the Gumbel-softmax. In this approach, each word vector is represented as the summation of a set of real-valued embeddings. This idea amounts to learning a low-rank representation of the embedding matrix.
Pre-trained embeddings may be directly used in deep neural networks (DNN) or serve as initialization. There exist several compression techniques for DNNs, including pruning and low-rank compression. Most of these techniques require retraining for specific tasks; thus, challenges exist when applying them to unsupervised word embeddings (e.g., GloVe).
DNN compression techniques have been successfully applied to unsupervised embeddings. One approach uses pruning to sparsify embedding vectors, which however requires retraining after each pruning iteration. Although retraining is common when compressing DNNs, it often takes a long time to recover the model performance. Similarly, low-rank approximation has been used to compress word embeddings, but it faces the same need to fine-tune a supervised model.
3.1 Iterative Quantization
In this section, we briefly revisit the iterative quantization method by breaking it down into two steps. The first step is to maximize bit variance when transforming given vectors into binary representation. The second step is about minimizing the quantization loss while maintaining the maximum bit variance.
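Maximize Bit Variance.
Let $X \in \mathbb{R}^{n \times d}$ be the embedding dictionary, where each row $x_i \in \mathbb{R}^{d}$ denotes the embedding vector for the $i$-th word in the dictionary. Assuming that vectors are zero centered ($\sum_{i=1}^{n} x_i = 0$), ITQ encodes vectors with a binary representation through maximizing the bit variance, which is achieved by solving the following optimization problem:
$$\max_{W} \; \frac{1}{n} \operatorname{tr}(W^\top X^\top X W) \quad \text{s.t.} \quad W^\top W = I, \quad B = \operatorname{sgn}(XW), \tag{1}$$
where $W \in \mathbb{R}^{d \times c}$ and $c$ is the dimension of the encoded vectors. Here, $B$ is the final binary representation of $X$, and $\operatorname{tr}(\cdot)$ and $\operatorname{sgn}(\cdot)$ are the trace and the sign function, respectively. The problem is the same as that of Principal Component Analysis (PCA) and could be solved by selecting the top $c$ right singular vectors of $X$ as $W$.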
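As a minimal sketch of this PCA step, assuming an embedding matrix `X` with one row per word and a target code length `c`, the projection can be read off the SVD of the zero-centered matrix. The toy data and the helper name `pca_projection` are illustrative assumptions.

```python
import numpy as np

def pca_projection(X, c):
    """Return W (d x c): the top-c right singular vectors of zero-centered X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:c].T

rng = np.random.default_rng(0)
# Toy embedding with unequal variance per dimension (assumption).
X = rng.standard_normal((200, 16)) * np.linspace(3.0, 0.5, 16)
X -= X.mean(axis=0)                    # zero-center, as the text assumes

W = pca_projection(X, c=8)
B = np.where(X @ W >= 0, 1, -1)        # binary codes in {-1, +1}

# Value of the relaxed bit-variance objective at this W.
objective = np.trace(W.T @ X.T @ X @ W) / X.shape[0]
print(B.shape, round(float(objective), 3))
```

The objective value equals the sum of the top `c` squared singular values of `X` divided by the number of rows, which is why this step is equivalent to PCA.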
Minimize Quantization Loss.
Given a solution $W$ to Equation (1), $WR$ is also a solution for any orthogonal matrix $R \in \mathbb{R}^{c \times c}$. Thus, we could minimize the quantization loss via adjusting the matrix $R$ while maintaining the solution to (1). The quantization loss is defined as the difference between the vectors before and after the quantization:
$$Q(B, R) = \| B - XWR \|_F^2, \tag{2}$$
where $\|\cdot\|_F$ is the Frobenius norm. Note that $B$ must be binary. The proposed solution in ITQ is an iterative procedure that updates $B$ and $R$ in an alternating fashion until convergence. In practice, ITQ is able to achieve good performance with early stopping.
3.2 Isotropy of Word Embedding
Isotropy has been used to explain the success of PMI-based word embedding algorithms, for example GloVe. However, subsequent work finds that existing word embeddings are far from isotropic but could be improved. The proposed solution is to project word embeddings toward the weak directions rather than the dominant directions, which seems counter-intuitive but works well in practice. The isotropy of a word embedding $X$ with rows $x_i$ is defined as:
$$I(X) = \frac{\min_{\|c\| = 1} Z(c)}{\max_{\|c\| = 1} Z(c)}, \tag{3}$$
where $Z(c)$ is the partition function
$$Z(c) = \sum_{i=1}^{n} \exp(c^\top x_i). \tag{4}$$
The value of $I(X)$ is a measure of isotropy of the given embedding $X$. A higher $I(X)$ means a more isotropic embedding of better quality. It is found that making the singular values close to each other can effectively improve embedding isotropy.
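The exact minimum and maximum of the partition function over all unit vectors are not available in closed form. A common approximation, used here as an assumption, evaluates the partition function only at the eigenvectors of the Gram matrix of the embedding (taken with both signs in this sketch). The function name and toy data below are hypothetical.

```python
import numpy as np

def isotropy(X):
    """Approximate the isotropy ratio min Z(c) / max Z(c).

    Z(c) = sum_i exp(c^T x_i) is evaluated only at +/- the unit
    eigenvectors of X^T X, which is an approximation, not the exact value.
    """
    _, eigvecs = np.linalg.eigh(X.T @ X)        # columns are unit eigenvectors
    directions = np.hstack([eigvecs, -eigvecs])
    Z = np.exp(X @ directions).sum(axis=0)      # one Z value per direction
    return Z.min() / Z.max()

rng = np.random.default_rng(0)
iso = 0.1 * rng.standard_normal((2000, 10))     # roughly isotropic toy data
aniso = iso + 0.5                               # shared offset: one dominant direction

print(f"isotropic toy data:   {isotropy(iso):.3f}")
print(f"anisotropic toy data: {isotropy(aniso):.3f}")
```

A value near 1 indicates that no direction dominates, while a shared offset or a few dominant directions pushes the ratio toward 0.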
4 Proposed Method
The preceding section hints that maximizing the isotropy and maximizing the bit variance are opposite in action: The former intends to make the singular values close by removing the largest singular values, whereas the latter removes the smallest singular values and maintains the largest. Given the success of isotropy in NLP applications, we propose to minimize the quantization loss while improving the isotropy, rather than maximizing the bit variance. We call the proposed method isotropic iterative quantization, IIQ.
The key idea of ITQ is based on the observation that a solution to (1) multiplied by an orthogonal matrix is still a solution to (1). In our approach, IIQ, we show that an orthogonal transformation maintains the isotropy of the input embedding, so that we could apply an alternating procedure similar to that of ITQ to minimize the quantization loss. As a result, our method is composed of three steps: maximizing isotropy, reducing dimension, and minimizing quantization loss.
The isotropy measure $I(X)$ can be approximated as follows:
$$I(X) \approx \frac{n - \|\mathbf{1}^\top X\| + \frac{1}{2}\sigma_{\min}^2}{n + \|\mathbf{1}^\top X\| + \frac{1}{2}\sigma_{\max}^2}, \tag{5}$$
where $n$ is the number of words, and $\sigma_{\min}$ and $\sigma_{\max}$ are the smallest and largest singular values of $X$, respectively. For $I(X)$ to be $1$, the middle term on both the numerator and the denominator must be zero and additionally $\sigma_{\min} = \sigma_{\max}$. The former requirement can be easily satisfied by zero-centering the given embeddings:
$$\tilde{x}_i = x_i - \mu, \tag{6}$$
where $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$. The latter may be approximately achieved by removing the large singular values such that the rest of the singular values are close to each other. Removing the large singular values makes the rest close because the large singular values often have substantial gaps while the rest are clustered. However, removing singular components does not change the embedding dimension. We denote the maximized result as $\tilde{X}$.
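The effect of removing the largest singular values can be seen on a toy matrix. The sketch below is an illustration under stated assumptions: the data are synthetic with two planted dominant directions, and `remove_top_components` and `spread` are hypothetical helper names. It zero-centers the matrix, zeroes out its largest singular values, and compares the ratio of the largest to the smallest remaining nonzero singular value.

```python
import numpy as np

def remove_top_components(X, k):
    """Zero-center X, then zero out its k largest singular values."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    s[:k] = 0.0                        # drop the dominant directions
    return (U * s) @ Vt                # same shape as X

def spread(X):
    """Ratio of the largest to the smallest nonzero singular value."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    s = s[s > 1e-8 * s.max()]
    return s.max() / s.min()

rng = np.random.default_rng(0)
n, d = 500, 20
noise = rng.standard_normal((n, d))
# Two planted dominant directions plus a shared offset (assumption).
dominant = 5.0 * rng.standard_normal((n, 2)) @ rng.standard_normal((2, d))
X = noise + dominant + 2.0

X_iso = remove_top_components(X, k=2)
print(f"spread before: {spread(X):.1f}, after: {spread(X_iso):.1f}")
```

The output matrix keeps the original shape, which matches the remark that removing singular components does not reduce the dimension.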
To make our method more flexible, we perform a dimension reduction afterward by using PCA. This step essentially removes the smallest singular values so that the clustering of the singular values may be further tightened. Note that PCA does not affect the maximized isotropy of the given embeddings, since it only works on the singular values that are already close to each other after the previous step. One can treat the dimension as a hyperparameter, tailored for each data set.
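Minimize Quantization Loss.
Given a solution $\tilde{X}$ to the maximization of (5), we prove that multiplying $\tilde{X}$ with an orthogonal matrix $R$ results in the same isotropy. In other words, we could minimize the quantization loss (2) while maintaining the isotropy.
Theorem. If $\tilde{X}$ is isotropic and $R$ is orthogonal, then $\tilde{X}R$ admits $I(\tilde{X}R) = 1$.
Proof. Given that $R$ is orthogonal, we first prove that $\tilde{X}R$ has the same singular values as does $\tilde{X}$. Let $\tilde{X}$ have the singular value decomposition (SVD)
$$\tilde{X} = U S V^\top, \tag{7}$$
where $S$ is the diagonal matrix of singular values, $U$ has orthonormal columns, and $V$ is an orthogonal matrix. Let $\hat{V} = R^\top V$. Then, we have
$$\tilde{X} R = U S V^\top R = U S \hat{V}^\top. \tag{8}$$
Since $\hat{V}$ is also orthogonal, Equation (8) gives the SVD of $\tilde{X}R$. Therefore, $\tilde{X}R$ has the same singular values as does $\tilde{X}$.
Moreover, $\mathbf{1}^\top \tilde{X} R = 0$, thus $\tilde{X}R$ is also zero-centered. By Equation (5), we conclude $I(\tilde{X}R) = I(\tilde{X}) = 1$. ∎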
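The two facts used in the proof can be checked numerically. The following sketch, on a random toy matrix (an assumption), verifies that multiplying a zero-centered matrix by a random orthogonal matrix leaves the singular values unchanged and keeps the column sums at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
X -= X.mean(axis=0)                    # zero-centered toy embedding

# A random orthogonal matrix from the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((8, 8)))

sv_before = np.linalg.svd(X, compute_uv=False)
sv_after = np.linalg.svd(X @ R, compute_uv=False)
col_sums = np.ones(100) @ (X @ R)      # 1^T (X R)

print(np.allclose(sv_before, sv_after), np.allclose(col_sums, 0.0))
```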
With the given proof, we can always use an orthogonal matrix to reduce the quantization loss. The iterative optimization strategy as in ITQ is adopted to minimize the quantization loss. Two alternating steps lead to a local minimum. First, compute $B$ given $R$:
$$B = \operatorname{sgn}(\tilde{X} R). \tag{9}$$
Second, update $R$ given $B$. The update minimizes the quantization loss, which essentially solves the orthogonal Procrustes problem. The solution is given by
$$R = V U^\top, \quad \text{where } U S V^\top = \operatorname{SVD}(B^\top \tilde{X}), \tag{10}$$
where SVD() is the singular value decomposition function and $S$ is the diagonal matrix of singular values.
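A minimal sketch of the two alternating updates follows, assuming a zero-centered matrix `X` that has already been processed for isotropy and dimension. The helper names, the identity initialization, and the toy data are assumptions of this sketch. Each update is optimal for one variable with the other held fixed, so the recorded loss cannot increase.

```python
import numpy as np

def update_B(X, R):
    """Best binary codes for a fixed rotation: the sign of X R."""
    return np.where(X @ R >= 0, 1.0, -1.0)

def update_R(X, B):
    """Orthogonal Procrustes solution for fixed codes B."""
    U, _, Vt = np.linalg.svd(B.T @ X)
    return Vt.T @ U.T

def quantization_loss(X, B, R):
    return np.linalg.norm(B - X @ R, "fro") ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 16))
X -= X.mean(axis=0)

R = np.eye(16)                         # identity start (assumption)
losses = []
for _ in range(20):
    B = update_B(X, R)
    R = update_R(X, B)
    losses.append(quantization_loss(X, B, R))

print(f"first loss: {losses[0]:.1f}, last loss: {losses[-1]:.1f}")
```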
This iterative updating strategy runs until a local optimal solution is found. Fig. 1 shows an example of the quantization loss curve. This result is similar to the behavior of ITQ, the authors of which proposed using early stopping to terminate iteration in practice. We follow the guidance and run only 50 iterations in our experiments.
Our method is an unsupervised approach, which does not require any label supervision. Therefore, it can be applied independently of downstream tasks and no fine tuning is needed. This advantage benefits many problems where embeddings often slow down the learning process because of the high space and computation complexity.
We present the pseudocode of the proposed IIQ method in Algorithm 1. The inputs are the number of top singular values to be removed, the number of iterations for minimizing the quantization loss, and the dimension of the output binary vectors.
The first two lines zero-center the embedding. Lines 3 to 5 maximize the isotropy. Lines 6 to 8 reduce the embedding dimension, if necessary. Lines 9 to 15 minimize the quantization loss. Within the iteration loop, lines 11 to 12 update the binary codes $B$ based on the most recent rotation $R$, whereas lines 13 to 14 update $R$ given the updated $B$. The last line uses the final transformation to return the binary embeddings as output.
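The steps described above can be put together in a short NumPy sketch. This is a reconstruction from the prose description and not the authors' Algorithm 1: the parameter names `n_remove`, `n_bits`, and `n_iter`, the random orthogonal initialization, and the toy input are all assumptions. It also assumes that `n_bits` is at most the input dimension minus `n_remove`.

```python
import numpy as np

def iiq(X, n_remove, n_bits, n_iter=50, seed=0):
    """Sketch of IIQ: returns an (n, n_bits) array of 0/1 codes."""
    rng = np.random.default_rng(seed)

    # Step 1: zero-center the embedding.
    X = X - X.mean(axis=0)

    # Step 2: maximize isotropy by zeroing the n_remove largest singular values.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[:n_remove] = 0.0
    X = (U * s) @ Vt

    # Step 3: reduce the dimension with PCA if fewer bits are requested.
    if n_bits < X.shape[1]:
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        X = X @ Vt[:n_bits].T

    # Step 4: alternate between B and R to minimize ||B - X R||_F^2.
    # The random orthogonal start is an assumption of this sketch.
    dim = X.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    for _ in range(n_iter):
        B = np.where(X @ R >= 0, 1.0, -1.0)      # codes for the current R
        U, _, Vt = np.linalg.svd(B.T @ X)        # Procrustes update of R
        R = Vt.T @ U.T

    # Final codes from the last rotation, stored as bits.
    return (X @ R >= 0).astype(np.uint8)

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 32))               # toy embedding (assumption)
codes = iiq(X, n_remove=2, n_bits=16)
print(codes.shape, codes.dtype)
```

No labels enter the procedure, which is consistent with the method being unsupervised and independent of downstream tasks.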
5 Experimental Results
We run the proposed method on pre-trained embedding vectors and evaluate the compressed embedding in various NLP tasks. For some tasks, the evaluation is directly conducted over the embedding (e.g., measuring the cosine similarity between word vectors); whereas for others, a classifier is trained with the embedding. We conduct all experiments in Python by using NumPy and Keras. The environment is Ubuntu 16.04 with Intel(R) Xeon(R) CPU E5-2698.
We perform experiments with the GloVe embedding and the HDC embedding. The GloVe embedding is trained from 42B tokens of Common Crawl data. The HDC embedding is trained from public Wikipedia. It has a better quality than GloVe because the training process considers both syntagmatic and paradigmatic relations. All embedding vectors are used in the experiment without vocabulary truncation or post-processing.
In addition, we evaluate embedding compression on a CNN model pre-trained with the IMDB data set. Different from the prior case, the embedding from CNN is trained with supervised class labels. We compress the embedding and retrain the model to evaluate performance. This setup enables us to compare with other compression methods fairly.
We compare IIQ with ITQ, the pruning method, deep compositional code learning (DCCL), and the near-lossless binarization method, which we name as NLB. The pruning method is set to prune 95% of the words for a similar compression ratio. The DCCL method is similarly configured. We run NLB with its default setting. We train the DCCL method for 200 epochs and set the batch size to be 1024 for GloVe and 64 for HDC. For our method, we set the iteration number to be 50 since early stopping works sufficiently well. We set the same iteration number for ITQ. We also set the number of removed top singular values to be 2 for the HDC embedding and 14 for the GloVe embedding. Note that we perform all vector operations in the real domain on the word-embeddings-benchmarks and SentEval platforms.
Table 1 lists the experiment configurations with method name, dimension, embedding value type, and compression ratio. The baseline means the original embedding. Our method starts with “IIQ,” followed by the compression ratio. The “dimension” column gives the number of vectors and the vector dimension. For DCCL, we list the two parameters that determine the compression ratio. Note that we use single precision for real values. The last column shows the compression ratio, which is the size of the original embedding over that of the compressed one. Thus, converting real values to binary values yields a compression ratio of 32. Moreover, we also apply dimension reduction in IIQ so that a higher compression ratio is possible.
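The compression ratios follow from simple arithmetic: a single-precision value takes 32 bits and a binary value takes 1 bit. The sketch below uses a 300-dimensional embedding as an illustrative assumption; the function name is hypothetical.

```python
def compression_ratio(orig_dim, code_bits, bits_per_real=32):
    """Size of the real-valued vector over the size of the binary code."""
    return orig_dim * bits_per_real / code_bits

# Binarization alone: one bit per original dimension.
print(compression_ratio(orig_dim=300, code_bits=300))   # 32.0
# Binarization plus halving the dimension.
print(compression_ratio(orig_dim=300, code_bits=150))   # 64.0
```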
5.1 Word Similarity
The task measures Spearman’s rank correlation between word vector similarity and human rated similarity. A higher correlation means a better quality of the word embedding. The similarity between two words is computed as the cosine of the corresponding vectors, i.e., $\cos(x, y) = \frac{x^\top y}{\|x\| \|y\|}$, where $x$ and $y$ are two word vectors. Out-of-vocabulary (OOV) words are replaced by the mean vector.
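The evaluation protocol can be sketched as follows. The vocabulary, vectors, and human ratings below are made up for illustration, and the rank correlation is computed without tie correction, which is a simplification (library routines such as `scipy.stats.spearmanr` handle ties).

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of ranks (no tie correction)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical toy vocabulary and human ratings, for illustration only.
emb = {
    "cat": np.array([1.0, 0.9, 0.0]),
    "dog": np.array([0.9, 1.0, 0.1]),
    "car": np.array([0.0, 0.1, 1.0]),
    "bus": np.array([0.1, 0.0, 0.9]),
}
mean_vec = np.mean(list(emb.values()), axis=0)   # stands in for OOV words

pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5),
         ("cat", "car", 1.5), ("dog", "bus", 1.0)]

model = [cosine(emb.get(a, mean_vec), emb.get(b, mean_vec)) for a, b, _ in pairs]
human = [score for _, _, score in pairs]
rho = spearman(model, human)
print(f"Spearman correlation: {rho:.2f}")
```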
In this experiment, seven data sets are used, including MEN with 3000 pairs of words obtained from Amazon crowdsourcing; MTurk with 287 pairs, focusing on word semantic relatedness; RG65 with 65 pairs, an early published data set; RW with 2034 pairs of rare words selected based on frequencies; SimLex999 with 999 pairs, aimed at genuine similarity estimation; TR9856 with 9856 pairs, containing many acronyms and named entities; and WS353 with 353 pairs of mostly verbs and nouns. The experiment is conducted on the word-embeddings-benchmarks platform.
Table 2 summarizes the results. The performance of IIQ degrades as the compression ratio increases. This is expected, since a higher compression ratio leads to more loss of information. In addition, our IIQ method consistently achieves better results than ITQ, DCCL, NLB, and the pruning method. Particularly, one sees that on the MEN data set, IIQ even outperforms the baseline GloVe embedding. Another observation is that on TR9856, a higher compression ratio surprisingly yields better results for IIQ. We speculate that the cause is the multi-word term relations unique to TR9856. Interestingly, the pruning method results in a negative correlation on SimLex999 for the GloVe embedding. This means that pruning too many small values inside the word embedding can drastically destroy the embedding quality.
5.2 Categorization
The task is to cluster words into different categories. The performance is measured by purity, which is defined as the fraction of correctly classified words. We run the experiment using agglomerative clustering and k-means clustering, and select the highest purity as the final result for each embedding. This experiment is conducted on the word-embeddings-benchmarks platform, where OOV words are replaced by the mean vector.
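Purity can be computed as in the sketch below, which assumes the standard convention that each cluster is labeled with its most frequent true category. The labels and cluster assignments are made-up toy values.

```python
import numpy as np

def purity(true_labels, cluster_ids):
    """Fraction of items that belong to the majority category of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        _, counts = np.unique(true_labels[cluster_ids == c], return_counts=True)
        correct += counts.max()        # items matching the cluster's majority
    return correct / len(true_labels)

labels = ["animal", "animal", "animal", "fruit", "fruit", "tool"]
clusters = [0, 0, 1, 1, 1, 2]
score = purity(labels, clusters)
print(f"purity: {score:.3f}")
```

In this toy case five of the six items agree with the majority category of their cluster.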
Four data sets are used in this experiment: Almuhareb-Poesio (AP) with 402 words in 21 categories; BLESS with 200 nouns (animate or inanimate) in 17 categories; Battig with 5231 words in 56 taxonomic categories; and ESSLLI 2008 Workshop with 45 verbs in 9 semantic categories.
Table 3 lists evaluation results for the GloVe and HDC embeddings. One sees that the proposed IIQ method works better than ITQ, DCCL, and the pruning method on all data sets. However, NLB sometimes achieves the best result, for example on Battig. For ESSLLI, IIQ even outperforms the original GloVe and HDC embeddings.
5.3 Topic Classification
In this experiment, we perform topic classification by using sentence embeddings. A sentence embedding is computed as the average of the corresponding word vectors. The average of the binary embeddings is fed to the classifier in single precision. Missing words are treated as zero and so are OOV words. In this task, we train a Multi-Layer Perceptron (MLP) as the classifier for each method. Due to the different sizes of the embeddings, we train for 10 epochs for all GloVe embeddings and 4 epochs for all HDC embeddings. Five-fold cross validation is used to report classification accuracy.
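The sentence embedding described above can be sketched as follows. The vocabulary, the binary codes, and the function name are hypothetical, and dividing by the full token count (so that unknown words contribute zeros) is an assumption about how missing words enter the average.

```python
import numpy as np

def sentence_embedding(tokens, vocab, codes):
    """Average binary word codes in float32; unknown words count as zeros."""
    total = np.zeros(codes.shape[1], dtype=np.float32)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:            # OOV words add nothing to the sum
            total += codes[idx]
    return total / max(len(tokens), 1)

# Hypothetical two-word vocabulary with 4-bit codes.
vocab = {"good": 0, "movie": 1}
codes = np.array([[1, 0, 1, 1],
                  [0, 0, 1, 0]], dtype=np.uint8)

vec = sentence_embedding(["good", "movie", "zzz"], vocab, codes)
print(vec, vec.dtype)
```

The resulting real-valued vector is what would be passed to the classifier.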
Four data sets are selected from prior work on sentiment and topic classification baselines, including movie review (MR), customer review (CR), opinion-polarity (MPQA), and subjectivity (SUBJ). Similar performance is achieved by using the original embedding. The experiment is conducted on the SentEval platform.
Table 4 shows the results for each method. Similar to the previous tasks, the proposed IIQ method consistently performs better than ITQ, pruning, and DCCL. The only exception is that for MPQA and SUBJ, DCCL and NLB achieve the best result for the GloVe embedding, respectively. As the compression ratio increases, the performance of IIQ degrades.
5.4 Sentiment Analysis
In this experiment, we evaluate over the embedding input to a pre-trained Convolutional Neural Network (CNN) model on the IMDB data set. The CNN model follows the Keras tutorial. We train 50,000 embedding vectors in 300 dimensions. The model is composed of an embedding layer, followed by a dropout layer with probability 0.2, a 1D convolution layer with 250 filters and kernel size 3, a 1D max pooling layer, a fully connected layer with hidden dimension 250, a dropout layer with probability 0.2, a ReLU activation layer, and a single-output fully connected layer with sigmoid activation. Moreover, we use the Adam optimizer with learning rate 0.0001, sentence length 400, and batch size 128, and train for 20 epochs. The input embedding fed into the CNN is kept fixed (not trainable).
The data set contains 25,000 movie reviews for training and another 25,000 for testing. We randomly separate 5,000 reviews from the training set as validation data. The model with the best performance on the validation set is kept as the final model for measuring test accuracy. Moreover, all results are averaged from 10 runs for each embedding. The baseline model is the pre-trained CNN model with 87.89% accuracy. Table 5 summarizes the configurations for this experiment. All configurations are similar to the previous experiments. The DCCL method is reconfigured to achieve a similar compression ratio.
We present in Fig. 2 the result of each embedding. The histogram shows the average accuracy of 10 runs for each method and the error bar shows the standard deviation. One sees that among all compression methods, IIQ achieves the least performance degradation. IIQ with compression ratio 64 is the best.
We visualize the binary IIQ embedding in Fig. 3. The nearest and furthest 100 word vectors are shown. The distance is calculated by the dot product. Fig. 3(a) shows the IIQ-compressed GloVe embedding and Fig. 3(b) shows the IIQ-compressed HDC embedding. The y axis lists every 10 words and the x axis is the dimension of the embedding. One sees that similar word vectors have similar patterns in many dimensions. A white column means that the dimension is zero for all words. A black column means one. Moreover, there is an obvious difference between the nearest and furthest words.
6 Conclusion
This paper presents an isotropic iterative quantization (IIQ) method for compressing word embeddings. While it is based on the ITQ method in image retrieval, it also maintains the embedding isotropy. We evaluate the proposed method on GloVe and HDC embeddings and show that it is effective for word similarity, categorization, and several other downstream tasks. For pre-trained embeddings that are less isotropic (e.g., GloVe), IIQ performs better than ITQ owing to the improvement on isotropy. These findings are based on a 32-fold (and higher) compression ratio. The results point to promising deployment of trained neural network models with word embeddings on resource-constrained platforms in real life.
References
-  (2019) Online embedding compression for text classification using low rank matrix factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6196–6203. Cited by: §2.
-  (2009) A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 19–27. Cited by: §5.1.
-  (2005) Concept learning and categorization from the web. In proceedings of the annual meeting of the Cognitive Science society, Cited by: §5.2.
-  (2016) A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399. Cited by: §3.2.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
-  (2011) How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pp. 1–10. Cited by: §5.2.
-  (1969) Category norms of verbal items in 56 categories a replication and extension of the connecticut category norms.. Journal of experimental Psychology 80 (3p2), pp. 1. Cited by: §5.2.
-  (2014) Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, pp. 1–47. Cited by: §5.1.
-  Keras documentation, convolution1d for text classification. Note: https://keras.io/examples/imdb_cnn/ Accessed: 2019-08. Cited by: §5.4.
-  (2018) SentEval: an evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449. Cited by: §5, §5.3.
-  (2014) Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 69–78. Cited by: §1.
-  (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929. Cited by: §1, §2, §3.1, §4, §5.
-  (2017) Binary paragraph vectors. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 121–130. Cited by: §1, §1, §2.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
-  (2015) Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. Cited by: §5.1.
-  (2015) Word-embeddings-benchmarks. GitHub. Note: https://github.com/kudkudak/word-embeddings-benchmarks Cited by: §5, §5.1, §5.2.
-  (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: §1.
-  (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §2.
-  (2016) Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360. Cited by: §1.
-  (2008) Bridging the gap between semantic theory and computational simulations: proceedings of the esslli workshop on distributional lexical semantics. In Proceedings of the esslli workshop on distributional lexical semantics, Cited by: §5.2.
-  (2015) Tr9856: a multi-word term relatedness benchmark. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2, pp. 419–424. Cited by: §5.1.
-  (2013) Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113. Cited by: §5.1.
-  (2011-06) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. Cited by: §5.4.
-  (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1.
-  (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1.
-  (2017) All-but-the-top: simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417. Cited by: §1, §1, §3.2, §4.
-  (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §5.
-  (2011) A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pp. 337–346. Cited by: §5.1.
-  (1965) Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: §5.1.
-  (2013) Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6655–6659. Cited by: §2.
-  (2016) Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274. Cited by: §2, §5.
-  (2017) Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068. Cited by: §1, §2.
-  (2018) Compressing word embeddings via deep compositional code learning. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Cited by: §5.
-  (2015) Learning word representations by jointly modeling syntagmatic and paradigmatic relations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 136–145. Cited by: §5.
-  (2019) Near-lossless binarization of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7104–7111. Cited by: §5.
-  (2012) Baselines and bigrams: simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the association for computational linguistics: Short papers-volume 2, pp. 90–94. Cited by: §5.3.